-
cdreimanu
i was browsing the comments of an archival related hackernews post recently, and apparently there’s this huge collection of bulgarian music (digitized LPs if I understood correctly) kept by some russian bloke. i have no idea how to go about this.
gramofonche.chitanka.info
-
cdreimanu
it’s the comment by svilen_dobrev here, for context:
news.ycombinator.com/item?id=38288020
-
cdreimanu
what do y’all think?
-
JAA
44000 hours of security camera footage from the Capitol during the insurrection will get released over the next few months. That'll be interesting to archive.
-
JAA
So for archive.mozilla.org, there is one directory I couldn't list due to timeouts (on my side, but the server's timeout is only marginally higher, and it still fails), namely /pub/firefox/tinderbox-builds/autoland-macosx64-debug/. Apart from that, there are four 404s (/pub/seamonkey/oldnightly/testing/testing/, /pub/seamonkey/nightly/nightly/, /pub/labs/devtools/master/master/, and
-
JAA
/pub/seamonkey/oldnightly/2021-09-01-21-00-03-comm-253/2021-09-01-21-00-03-comm-253/). Everything else seems to have been retrieved properly.
-
JAA
There are 96867265 files in the dirs I managed to list.
-
JAA
2564939 of them exceed 100 MiB.
-
JAA
29015882 are over 10 MiB.
-
JAA
36821201 over 1 MiB
-
JAA
The >= 100 MiB files are 438.52 TiB in total.
-
JAA
My size summing tool is rather slow, so I can't get a total size right now. Need to find a way to make that faster first.
-
JAA
Ok, I have a total size: 1.66 PiB
-
JAA
arkiver: ^
-
c3manu
you reckon I can just try running gramofonche.chitanka.info through the bot with one connection?
-
arkiver
JAA: ouch
-
arkiver
perhaps we could archive part of it...
-
arkiver
is most of this size in a certain part of the repository?
-
arkiver
we could get a sample of that, while mirroring the stuff outside of that
-
Pedrosso
I was using DiscordChatExporter to make a total archive, however, I ran out of disk space whilst it was already a long way through. If I restart it with the same output file, does anyone know if it will resume the save, or overwrite what it's already done? I'd really rather not manually write in the IDs of the many guilds it missed
-
Pedrosso
(assuming I have more disk space now, which I do)
-
c3manu
!ig 1ticp48603jvfw5izv5xfs8vc ^https?://wheelmap\.org/
-
thuban
c3manu: psst, wrong channel
-
c3manu
lol, thanks m)
-
Pedrosso
^ from that above, I think I'll just write a code to load in the rest of the IDs
-
Pedrosso
JAA: I've been downloading and I do want to follow this advice, however, I'm not quite sure how to contact them about something such as this. You mentioned I could through ar-kiver?
-
JAA
arkiver: I'm afraid I don't have stats on parts of the server. It's large enough to be a pain to work with in general.
-
JAA
Pedrosso: #discard for Discord stuff
-
Pedrosso
thx
-
JAA
Pedrosso: Yes, arkiver is the person to speak to here about getting permission for uploading large amounts of data (and also getting a collection and so on).
-
Pedrosso
Ah. Would that be done here (-bs) or, to not clog this chat up, DM:s?
-
JAA
Either is fine. Just keep us informed if you do it through PM and the project ends up happening. :-)
-
Pedrosso
That I will do.
-
Pedrosso
(informing, that is)
-
JAA
arkiver: Someone on SWH's IRC channel shared a breakdown by top dir: firefox 1053, thunderbird 156, devedition 117, seamonkey 26, mobile 17, xulrunner 3 are the top 6. (Numbers are 'size_tb', whichever unit that is exactly.)
-
JAA
I'll try to extract a full index with path, size, and mtime.
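
A minimal sketch of how a per-top-directory breakdown like the one quoted above could be computed from such an index. It assumes the index has already been reduced to (path, size in bytes) pairs and that everything of interest lives under /pub/; the function name `size_by_top_dir` is hypothetical.

```python
from collections import defaultdict


def size_by_top_dir(entries, root="/pub/"):
    """Sum sizes per first directory component below `root`.

    `entries` is an iterable of (path, size_in_bytes) pairs (an assumed
    input shape). Paths outside `root` are ignored.
    """
    totals = defaultdict(int)
    for path, size in entries:
        if not path.startswith(root):
            continue
        # First path component after the root, e.g. "firefox".
        top = path[len(root):].split("/", 1)[0]
        totals[top] += size
    # Largest directories first.
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
```

Sorting by size makes the few dominant directories stand out, which is what matters when deciding which parts to sample and which to mirror in full.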
-
nulldata
OnMSFT.com hasn't published any new articles since 10/31. Before that was multiple articles daily. No announcements or mentions of taking a break. Did some digging and found they were acquired on 10/10 by Reflector Media. Maybe something happened with the staff under the new ownership?
-
nulldata
They also have a podcast that consistently published every Sunday. Last episode was on 10/29
soundcloud.com/onmsft
-
arkiver
Pedrosso: hi, i do not exactly know what this is about - what is this about?
-
arkiver
svtplay.se ?
-
Pedrosso
But to clarify in here, It's a very large official service for Swedish programs, and it uses a stream system. They tend to delete content en-masse and I've yet to find any other copies on the internet of many of them
-
joepie91|m
fwiw, you could check newsgroups if you haven't yet, that's where a lot of the NPO stuff (Dutch equivalent) ends up that's not available elsewhere
-
Pedrosso
I posted about the service once I noticed a documentary, svtplay.se/din-hjarna had lost both its first seasons due to this deletion
-
nulldata
One of the higher volume writers for OnMSFT seems to imply it's about to be killed.
twitter.com/Dav3Shanahan/status/1720441624029769796
-
Pedrosso
svtplay.se/sitemap-details-episodes.xml may not be extensive to all of its content as it doesn't just host episodes but it may give a perspective to the sheer scale. The service is fairly region-locked to Sweden so things outside don't have access to much
-
Pedrosso
I've found a -dl script for it and made my own code to automate it so I now have a way to be able to efficiently download them (finding items is no problem due to its great sitemap, but I've no clue how to verify it's extensive. Sure does look extensive though, covering items already deleted) I'm still working on ensuring that what's downloaded is
-
Pedrosso
of highest quality and such
-
Pedrosso
(other sitemaps are shown in the main sitemap.xml)
-
nulldata
Looks like at least 2 of the authors from OnMSFT have started their own site -
msftunboxed.com
-
Pedrosso
Possibly especially of interest to IA is the news which afaik is deleted a couple of months after being put online
-
Pedrosso
Is that enough context?
-
JAA
nulldata: Thanks, I've launched an AB job for it.
-
Pedrosso
(lmk if AT has any preferred image sharing tool)
i.imgur.com/RuQRl77.png the yellow/orange bar here says "4 hours left"
-
Pedrosso
Though to be clear they usually delete/remove content due to copyright
-
JAA
Images can be uploaded to transfer.archivete.am and if you then insert /inline/ after the domain, it gets displayed in a browser as well rather than forcing a download on the regular URL (though that'll be fixed soon™).
-
Pedrosso
Awesome, (also, got a great png right there)
-
arkiver
Pedrosso: that is indeed useful
-
Pedrosso
So, I was recommended to speak directly to IA &or you about uploading something so big & making a collection and such. I would like a reassurance that this is legally in the clear and everything
-
arkiver
Pedrosso: so is this about you trying to mirror everything? or a part of it?
-
arkiver
any idea what numbers we may be looking at here?
-
arkiver
Pedrosso: let's continue over PM
-
arkiver
or DM, however you want to call it
-
JAA
Here's the archive.mozilla.org file listing in all its g(l)ory:
transfer.archivete.am/a0mjU/archive.mozilla.org-files.jsonl.zst
-
JAA
15.3 GiB after decompression, so have fun.
-
nicolas17
the other day I tried downloading a few Windows binaries and doing deltas/deduplication
-
nicolas17
it helped but not as much as I'd have hoped
-
pokechu22
JAA: does that include e.g. archive.mozilla.org/pub/firefox/tin…/autoland-macosx64-debug/1477331902 which can be guessed from autoland-macosx64 despite autoland-macosx64-debug not working?
-
nicolas17
if only the GCS bucket was open...
-
JAA
pokechu22: No, it does not.
-
JAA
It's everything except /pub/firefox/tinderbox-builds/autoland-macosx64-debug/.
-
JAA
It might be possible to guess everything* in that directory based on the other autoland-* dirs, but I haven't attempted that.
-
nicolas17
what zstd settings did you use for the listing?
-
JAA
Just -10. I played with higher ones, but that would've taken hours.
-
nicolas17
I'm trying higher ones out of curiosity and I'm not seeing particularly useful savings
-
JAA
Yeah, I've previously noticed that somewhere around -8 to -10 is where the big savings stop.
-
JAA
For text-ish files, at least.
-
JAA
There's often another significant drop with --ultra and -20 through -22, but those are so slow that they're rarely worth it for larger files.
-
JAA
Also, the CPU on this server is a potato.
-
nicolas17
-19 -T4 on my laptop is producing output at 8KiB/s (dunno how fast it's consuming input)
-
JAA
Multi-threaded compression is also going to produce larger output than single-threaded.
-
nicolas17
hm you made this by parsing the html right? and file sizes are like "504M"?
-
nicolas17
my highly stupid script to calculate total file size is gonna take 10 minutes, lol
-
JAA
Correct
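
A minimal sketch of summing such a listing, assuming one JSON object per line with a "size" field like "504M", as in the sample lines quoted in this channel. The suffixes are assumed to be binary multiples (K = 1024 bytes and so on), and "-" or an empty size is assumed to mean zero; the names `parse_size` and `total_size` are hypothetical. Because the listing sizes are already rounded, the result is only approximate.

```python
import json

# Assumption: suffixes are powers of 1024, as in typical directory indexes.
_UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_size(text):
    """Convert a listing size such as '504M' or '1234' to a byte count."""
    text = text.strip()
    if text in ("", "-"):
        return 0  # assumed placeholder for "no size"
    suffix = text[-1].upper()
    if suffix in _UNITS:
        return int(float(text[:-1]) * _UNITS[suffix])
    return int(text)


def total_size(lines):
    """Sum the 'size' field over an iterable of JSON lines."""
    total = 0
    for line in lines:
        line = line.strip()
        if line:
            total += parse_size(json.loads(line)["size"])
    return total
```

Since `total_size` takes any iterable of lines, it can be fed `sys.stdin` with the decompressed listing piped in (for example from `zstd -dc`), so the 15.3 GiB file never has to be held in memory.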
-
JAA
I already have the total size.
-
JAA
1.66 PiB
-
nicolas17
💀
-
Barto
even a your momma joke is not that big, damn
-
nicolas17
JAA: 26.64 TiB for seamonkey?
-
nicolas17
JAA: something seems wrong with your listing...
-
nicolas17
{"name":"/pub/android/focus/8.0.8","size":"43M","mtime":"13-Feb-2023 04:22"}
-
nicolas17
{"name":"/pub/android/focus/8.0.8/Focus-arm.apk","size":"43M","mtime":"13-Feb-2023 04:22"}
-
nicolas17
{"name":"/pub/android/focus/8.0.8/Focus-x86.apk","size":"51M","mtime":"13-Feb-2023 04:22"}
-
nicolas17
oh they actually exist like that 💀
-
nicolas17
right, cloud-y object storage doesn't care if a/b is a file and a/b/c is also a file
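
A minimal sketch for finding such cases in a listing before trying to write it to a regular filesystem, where a/b cannot be both a file and a directory. It assumes the input is a collection of object names only (no separate directory entries); the function name `find_file_prefix_collisions` is hypothetical.

```python
def find_file_prefix_collisions(paths):
    """Return names that are objects themselves and also a path prefix
    ("directory") of at least one other object."""
    names = set(paths)
    collisions = set()
    for path in names:
        # Walk up through every ancestor of this path.
        parent = path.rpartition("/")[0]
        while parent:
            if parent in names:
                collisions.add(parent)
            parent = parent.rpartition("/")[0]
    return sorted(collisions)
```

On the three lines quoted above this reports /pub/android/focus/8.0.8, which would need to be renamed or stored separately on disk.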
-
JAA
Huh yeah, interesting.
-
nicolas17
JAA: transfer.archivete.am/inline/wYgWD/Screenshot_20231118_192300.png I tried to do a thing before realizing I won't have enough RAM for the entire directory :D
-
nicolas17
"[Errno 28] No space left on device" how is this possible, I was creating sparse files
-
nicolas17
turns out, I ran out of inodes, lol
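
A small sketch for checking this up front on a Unix-like system: `os.statvfs` reports inode counts separately from free space, so ENOSPC from inode exhaustion can be told apart from a full disk. The function name `inode_usage` is hypothetical, and some filesystems report 0 total inodes because they allocate them dynamically.

```python
import os


def inode_usage(path="."):
    """Return (total_inodes, free_inodes, free_bytes) for the filesystem
    containing `path`. Unix only."""
    st = os.statvfs(path)
    # f_bavail counts blocks available to unprivileged users.
    free_bytes = st.f_bavail * st.f_frsize
    return st.f_files, st.f_ffree, free_bytes
```

Millions of sparse files use almost no blocks but still need one inode each, so free_inodes is the number to compare against the file count.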
-
JAA
nicolas17: lol. Yeah, it is a chonker.
-
nicolas17
now it's gonna take me forever to delete these
-
JAA
Life hack for next time: make a tmpfs (or even a loop-mounted ext4 or whatever), then simply nuke that when done. :-)
-
nicolas17
wouldn't that eat >15GB of RAM? :P
-
JAA
With a tmpfs, yeah. But you can create an ext4 fs in a file on an existing disk partition, then loop-mount that. When you're done, you unmount and delete the single file.
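
A minimal sketch of that loop-mount approach, written as a helper that only builds the command lists so they can be inspected before running. The image path, mount point, size, and inode count are all hypothetical parameters; actually running the commands needs root, an existing mount point, and the `truncate`, `mkfs.ext4`, `mount`, and `umount` tools.

```python
import subprocess


def loop_fs_commands(image, mountpoint, size="64G", inodes=5_000_000):
    """Build setup and teardown commands for a throwaway ext4 filesystem
    stored in a single (sparse) image file."""
    setup = [
        # Sparse image file: blocks are only allocated as they are written.
        ["truncate", "-s", size, image],
        # -N raises the inode count, -F allows a regular file, -q is quiet.
        ["mkfs.ext4", "-q", "-F", "-N", str(inodes), image],
        ["mount", "-o", "loop", image, mountpoint],
    ]
    teardown = [
        ["umount", mountpoint],
        # Deleting one file replaces deleting millions of small ones.
        ["rm", "--", image],
    ]
    return setup, teardown


def run_all(commands):
    """Run each command in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

Typical use would be `setup, teardown = loop_fs_commands("/data/scratch.img", "/mnt/scratch")`, then `run_all(setup)` before the work and `run_all(teardown)` afterwards.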
-
JAA
In hindsight, maybe I should've done that as well for my pad backup thing rather than creating 4.8 million symlinks. lol