00:21:37 i was browsing the comments of an archival related hackernews post recently, and apparently there’s this huge collection of bulgarian music (digitized LPs if I understood correctly) kept by some russian bloke. i have no idea how to go about this. https://gramofonche.chitanka.info/
00:22:04 it’s the comment by svilen_dobrev here, for context: https://news.ycombinator.com/item?id=38288020
00:22:25 what do y’all think?
02:03:55 44000 hours of security camera footage from the Capitol during the insurrection will get released over the next few months. That'll be interesting to archive.
02:04:22 https://www.theguardian.com/us-news/2023/nov/17/mike-johnson-january-6-video-footage
04:38:55 So for archive.mozilla.org, there is one directory I couldn't list due to timeouts (on my side, but the server's timeout is only marginally higher, and it still fails), namely /pub/firefox/tinderbox-builds/autoland-macosx64-debug/. Apart from that, there are four 404s (/pub/seamonkey/oldnightly/testing/testing/, /pub/seamonkey/nightly/nightly/, /pub/labs/devtools/master/master/, and
04:39:01 /pub/seamonkey/oldnightly/2021-09-01-21-00-03-comm-253/2021-09-01-21-00-03-comm-253/). Everything else seems to have been retrieved properly.
04:56:45 JAA: IA gives https://archive.mozilla.org/pub/firefox/tinderbox-builds/autoland-macosx64-debug/1477331902/ (from https://web.archive.org/web/*/archive.mozilla.org/pub/firefox/tinderbox-builds/autoland-macosx64-debug*)
04:57:52 based on https://archive.mozilla.org/pub/firefox/tinderbox-builds/autoland-macosx64/1477331902/ existing too you probably can guess a list
05:00:42 There are 96867265 files in the dirs I managed to list.
05:01:28 2564939 of them exceed 100 MiB.
05:01:45 29015882 are over 10 MiB.
05:02:02 36821201 over 1 MiB
05:03:07 The >= 100 MiB files are 438.52 TiB in total.
05:03:33 My size summing tool is rather slow, so I can't get a total size right now. Need to find a way to make that faster first.
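The "guess a list" idea from 04:57:52 could be sketched roughly as below: take the entries that were listed under the sibling directory autoland-macosx64 and rewrite them onto autoland-macosx64-debug. The function name `guess_candidates` and its parameters are hypothetical, and the result is only a list of candidate URLs; nothing here checks that they actually exist on the server.

```python
BASE = "https://archive.mozilla.org"


def guess_candidates(sibling_entries, sibling_dir, target_dir, base=BASE):
    """Rewrite paths listed under sibling_dir into candidate URLs under target_dir.

    Both directory arguments should end with "/" so that, for example,
    ".../autoland-macosx64/" does not also match ".../autoland-macosx64-debug/".
    """
    candidates = []
    for path in sibling_entries:
        if not path.startswith(sibling_dir):
            continue  # entry belongs to some other directory
        suffix = path[len(sibling_dir):]
        candidates.append(base + target_dir + suffix)
    return candidates


if __name__ == "__main__":
    listed = ["/pub/firefox/tinderbox-builds/autoland-macosx64/1477331902/"]
    for url in guess_candidates(
        listed,
        "/pub/firefox/tinderbox-builds/autoland-macosx64/",
        "/pub/firefox/tinderbox-builds/autoland-macosx64-debug/",
    ):
        print(url)
```

Each candidate would still have to be requested to find out whether it is really there, since the two directories need not contain the same builds.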
07:49:01 Ok, I have a total size: 1.66 PiB
07:49:05 arkiver: ^
12:35:13 you reckon I can just try running https://gramofonche.chitanka.info/ through the bot with one connection?
15:26:19 JAA: ouch
15:28:26 perhaps we could archive part of it...
15:28:38 is most of this size in a certain part of the repository?
15:28:49 we could get a sample of that, while mirroring the stuff outside of that
15:52:41 I was using DiscordChatExporter to make a total archive, however, I ran out of disk space whilst it was already a long way through. If I restart it with the same output file, does anyone know if it will resume the save, or overwrite what it's already done? I'd really rather not manually write in the IDs of the many guilds it missed
15:53:18 (assuming I have more disk space now, which I do)
16:00:24 JAABot edited CurrentWarriorProject (-4): https://wiki.archiveteam.org/?diff=51164&oldid=51159
16:00:38 !ig 1ticp48603jvfw5izv5xfs8vc ^https?://wheelmap\.org/
16:01:43 c3manu: psst, wrong channel
16:01:54 lol, thanks m)
16:52:02 ^ from that above, I think I'll just write a code to load in the rest of the IDs
16:52:25 https://hackint.logs.kiska.pw/archiveteam-bs/20231112#:~:text=talking%20to%20ia%20about%20it https://hackint.logs.kiska.pw/archiveteam-bs/20231112#:~:text=Either%20directly%20or%20through%20arkiver
16:52:25 JAA: I've been downloading and I do want to follow this advice, however, I'm not quite sure how to contact them about something such as this. You mentioned I could through ar-kiver?
17:17:04 arkiver: I'm afraid I don't have stats on parts of the server. It's large enough to be a pain to work with in general.
17:18:32 Pedrosso: #discard for Discord stuff
17:18:39 thx
17:19:16 Pedrosso: Yes, arkiver is the person to speak to here about getting permission for uploading large amounts of data (and also getting a collection and so on).
17:21:26 Ah. Would that be done here (-bs) or, to not clog this chat up, DM:s?
17:27:42 Either is fine.
Just keep us informed if you do it through PM and the project ends up happening. :-)
17:38:56 That I will do.
17:39:04 (informing, that is)
18:11:00 arkiver: Someone on SWH's IRC channel shared a breakdown by top dir: firefox 1053, thunderbird 156, devedition 117, seamonkey 26, mobile 17, xulrunner 3 are the top 6. (Numbers are 'size_tb', whichever unit that is exactly.)
18:11:57 I'll try to extract a full index with path, size, and mtime.
18:31:03 OnMSFT.com hasn't published any new articles since 10/31. Before that was multiple articles daily. No announcements or mentions of taking a break. Did some digging and found they were acquired on 10/10 by Reflector Media. Maybe something happened with the staff under the new ownership?
18:31:03 https://www.einpresswire.com/article/660812143/windowsreport-expands-its-microsoft-coverage-with-strategic-onmsft-acquisition
18:33:06 They also have a podcast that consistently published every Sunday. Last episode was on 10/29 https://soundcloud.com/onmsft
18:33:38 Pedrosso: hi, i do not exactly know what this is about - what is this about?
18:33:46 svtplay.se ?
18:36:41 Here's the context https://hackint.logs.kiska.pw/archiveteam-bs/20231112#:~:text=know%20of%20a-,website,-svtplay.se%20(videos
18:36:41 But to clarify in here, It's a very large official service for Swedish programs, and it uses a stream system. They tend to delete content en-masse and I've yet to find any other copies on the internet of many of them
18:38:21 fwiw, you could check newsgroups if you haven't yet, that's where a lot of the NPO stuff (Dutch equivalent) ends up that's not available elsewhere
18:39:59 I posted about the service once I noticed a documentary, https://www.svtplay.se/din-hjarna had lost both its first seasons due to this deletion
18:40:58 One of the higher volume writers for OnMSFT seems to imply it's about to be killed.
https://twitter.com/Dav3Shanahan/status/1720441624029769796
18:40:59 nitter: https://nitter.net/Dav3Shanahan/status/1720441624029769796
18:41:58 https://www.svtplay.se/sitemap-details-episodes.xml may not be extensive to all of all of its content as it doesn't just host episodes but it may give a perspective to the sheer scale. The service is fairly region-locked to Sweden so things outside don't have access to much
18:43:19 I've found a -dl script for it and made my own code to automate it so I now have a way to be able to efficiently download them (finding items is no problem due to its great sitemap, but I've no clue how to verify it's extensive. Sure does look extensive though, covering items already deleted) I'm still working on ensuring that what's downloaded is
18:43:19 of highest quality and such
18:43:40 (other sitemaps are shown in the main sitemap.xml)
18:43:40 Looks like at least 2 of the authors from OnMSFT have started their own site - https://msftunboxed.com/
18:45:32 Possibly especially of interest to IA is the news which afaik is deleted a couple of months after being put online
18:46:43 Is that enough context?
18:48:08 nulldata: Thanks, I've launched an AB job for it.
18:49:51 (lmk if AT has any preferred image sharing tool) https://i.imgur.com/RuQRl77.png the yellow/orange bar here says "4 hours left"
18:51:38 Though to be clear they usually delete/remove content due to copyright
18:52:03 Images can be uploaded to https://transfer.archivete.am/ and if you then insert /inline/ after the domain, it gets displayed in a browser as well rather than forcing a download on the regular URL (though that'll be fixed soon™).
18:52:33 E.g. https://transfer.archivete.am/inline/bG4mu/aatt.png
18:53:38 Awesome, (also, got a great png right there)
18:58:24 Pedrosso: that is indeed useful
19:00:49 So, I was recommended to speak directly to IA &or you about uploading something so big & making a collection and such.
I would like a reassurance that this is legally in the clear and everything
19:01:19 Pedrosso: so is this about you trying to mirror everything? or a part of it?
19:01:26 any idea what numbers we may be looking at here?
19:07:41 Pedrosso: let's continue over PM
19:07:49 or DM, however you want to call it
21:06:46 Here's the archive.mozilla.org file listing in all its g(l)ory: https://transfer.archivete.am/a0mjU/archive.mozilla.org-files.jsonl.zst
21:10:19 15.3 GiB after decompression, so have fun.
21:11:00 the other day I tried downloading a few Windows binaries and doing deltas/deduplication
21:14:29 it helped but not as much as I'd have hoped
21:19:16 JAA: does that include e.g. https://archive.mozilla.org/pub/firefox/tinderbox-builds/autoland-macosx64-debug/1477331902/ which can be guessed from autoland-macosx64 despite autoland-macosx64-debug not working?
21:19:45 if only the GCS bucket was open...
21:20:40 pokechu22: No, it does not.
21:20:52 It's everything except /pub/firefox/tinderbox-builds/autoland-macosx64-debug/.
21:21:16 It might be possible to guess everything* in that directory based on the other autoland-* dirs, but I haven't attempted that.
21:21:36 what zstd settings did you use for the listing?
21:24:09 Just -10. I played with higher ones, but that would've taken hours.
21:25:03 I'm trying higher ones out of curiosity and I'm not seeing particularly useful savings
21:25:23 Yeah, I've previously noticed that somewhere around -8 to -10 is where the big savings stop.
21:25:33 For text-ish files, at least.
21:26:27 There's often another significant drop with --ultra and -20 through -22, but those are so slow that they're rarely worth it for larger files.
21:26:58 Also, the CPU on this server is a potato.
21:27:27 -19 -T4 on my laptop is producing output at 8KiB/s (dunno how fast it's consuming input)
21:31:52 Multi-threaded compression is also going to produce larger output than single-threaded.
21:32:54 hm you made this by parsing the html right?
and file sizes are like "504M"?
21:36:57 my highly stupid script to calculate total file size is gonna take 10 minutes, lol
21:41:16 Correct
21:41:20 I already have the total size.
21:41:23 1.66 PiB
21:41:46 💀
21:47:12 even a your momma joke is not that big, damn
22:05:51 JAA: 26.64 TiB for seamonkey?
22:15:13 JAA: something seems wrong with your listing...
22:15:14 {"name":"/pub/android/focus/8.0.8","size":"43M","mtime":"13-Feb-2023 04:22"}
22:15:16 {"name":"/pub/android/focus/8.0.8/Focus-arm.apk","size":"43M","mtime":"13-Feb-2023 04:22"}
22:15:17 {"name":"/pub/android/focus/8.0.8/Focus-x86.apk","size":"51M","mtime":"13-Feb-2023 04:22"}
22:16:07 oh they actually exist like that 💀
22:16:29 right, cloud-y object storage doesn't care if a/b is a file and a/b/c is also a file
22:18:31 Huh yeah, interesting.
22:24:18 JAA: https://transfer.archivete.am/inline/wYgWD/Screenshot_20231118_192300.png I tried to do a thing before realizing I won't have enough RAM for the entire directory :D
23:00:48 "[Errno 28] No space left on device" how is this possible, I was creating sparse files
23:00:54 turns out, I ran out of inodes, lol
23:05:19 nicolas17: lol. Yeah, it is a chonker.
23:22:58 now it's gonna take me forever to delete these
23:56:26 Life hack for next time: make a tmpfs (or even a loop-mounted ext4 or whatever), then simply nuke that when done. :-)
23:56:54 wouldn't that eat >15GB of RAM? :P
23:57:52 With a tmpfs, yeah. But you can create an ext4 fs in a file on an existing disk partition, then loop-mount that. When you're done, you unmount and delete the single file.
23:58:30 In hindsight, maybe I should've done that as well for my pad backup thing rather than creating 4.8 million symlinks. lol
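For the size-summing discussed above, a minimal sketch of a per-directory total over the JSONL listing might look like the following. It assumes each line carries the `name` and `size` fields seen in the entries quoted at 22:15, that suffixes such as "43M" are binary multiples (K = 1024 bytes and so on), and that a size of "-" or an empty string means "no size"; none of these are confirmed by the log. The names `parse_size` and `totals_by_top_dir` are made up for illustration. Because the listed sizes are already rounded, any total computed this way is approximate.

```python
import json
import sys
from collections import defaultdict

# Assumption: suffixes are binary multiples (K = 1024 bytes, M = 1024**2, ...).
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_size(text):
    """Convert a listing size such as "504M" or "1234" to a byte count."""
    text = text.strip()
    if text in ("", "-"):
        return 0  # assumption: entries without a usable size count as zero
    suffix = text[-1].upper()
    if suffix in UNITS:
        return int(float(text[:-1]) * UNITS[suffix])
    return int(text)


def totals_by_top_dir(lines):
    """Sum sizes per first directory below /pub, reading one JSON object per line."""
    totals = defaultdict(int)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        # "/pub/android/focus/..." splits into ["", "pub", "android", ...]
        parts = entry["name"].split("/")
        top = parts[2] if len(parts) > 2 else ""
        totals[top] += parse_size(entry["size"])
    return dict(totals)


if __name__ == "__main__":
    # Stream from stdin so the decompressed listing never has to fit in RAM.
    for top, total in sorted(totals_by_top_dir(sys.stdin).items()):
        print(f"{top}\t{total / 1024**4:.2f} TiB")
```

Reading line by line from standard input means the 15.3 GiB listing can be piped in from something like `zstd -dc` rather than being decompressed to disk or loaded whole.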