01:08:47 boet: Sounds fun, and comprehensive DNS datasets are hard to come by! So please release the raw data in bulk, too. I'd assume it won't be very large, and being textual, it'd compress very well, so an upload to IA would likely be a good idea.
05:11:04 https://www.gamingonlinux.com/2024/07/humble-games-confirmed-a-restructuring-of-operations-with-reports-of-all-staff-gone/
05:24:55 off
05:25:34 s/off/oof
06:20:24 !tell kerim What do you want to tell us about https://www.animekalesi.com ?
06:20:25 -eggdrop- [tell] ok, I'll tell kerim when they join next
08:13:18 Exorcism edited Mailman/2 (+0): https://wiki.archiveteam.org/?diff=53142&oldid=53141
08:36:22 Exorcism edited MoinMoin (+0): https://wiki.archiveteam.org/?diff=53143&oldid=53140
11:03:07 I wonder if I could get some guidance. I'm trying to retrieve 150 x 50gb warc.gz files from archive.org and it's going very slowly. Also the extraction from the warcs is super slow (lots of small files). We're trying to make all of Google Poly available again and this is one of our road blocks.
11:03:08 I'm not entirely sure what I'm asking - but is there anything I could be doing differently?
11:04:55 (for clarification there are two related projects: https://polygone.art and https://poly.pizza who were involved in the initial scrape but they've chosen only to make a subset of the files available and we specifically want to make them available more comprehensively)
13:13:11 andybak: are you downloading this concurrently?
14:32:18 arkiver - no. I'm using aria2c which I think is using 16 separate connections
14:32:24 arkiver - I'm using aria2c which I think is using 16 separate connections
14:32:59 aria_cmd = "aria2c -c -s 16 -x 16 {0}"
14:38:15 hmmmm. i just switched to an SSD on USB 3.1 and it seems a lot better. Might just be a crappy USB port or spinning disk
14:41:49 as far as I know, multiple connections is concurrently
14:44:38 You may get better throughput if you download multiple files in parallel rather than a single file with multiple connections. I'm not sure if aria2c supports the former at all, but with the options above, it definitely does the latter.
14:45:30 I can always launch multiple instances of aria. i'll play around.
14:45:33 Right, --max-concurrent-downloads aka -j.
14:46:08 Is grabbing the warcs themselves the right thing to do here? Instead of - I dunno - grabbing the file contents directly from wayback urls?
14:46:30 IA has two copies of each item, and each copy is on a single HDD. So by going highly parallel, those two HDDs get very sad with seeking.
14:47:04 At that magnitude, downloading the WARCs is the right approach. Whether unpacking them makes sense depends on how you'll use them.
14:47:34 what are the alternatives to unpacking them? treating them like a virtual file system and grabbing files as needed?
14:47:49 I hadn't even thought of that. I guess I need to test the overhead of that approach.
14:48:13 Yeah, a custom self-hosted Wayback Machine, if you will.
14:48:36 I'm usually iterating through 1000s of small files rapidly collating metadata.
14:48:55 There's pywb and openwayback, but not sure they're appropriate for this use case.
15:03:47 yeah. I think i'm ok now i've realised that the bottleneck isn't actually the download! i've never had broadband fast enough before that it wasn't the limiting factor
19:21:11 AppleVis feels like a particularly important closure: 1) underserved community that has relied heavily on this resource, 2) lack of any clear alternative, 3) site was influential to Apple employees and even management, and 4) since the community is blind and visually impaired users, special care may need to be taken to ensure the archive of the site works with screen readers or other accessibility tools.
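Note on the aria2c exchange above (14:44:38 to 14:46:30): a minimal sketch of the "several files in parallel, few connections each" approach, assuming the WARC URLs are listed one per line in a text file. The function name, the file name `warc_urls.txt`, and the default counts are illustrative assumptions, not values anyone in the channel recommended. The flags used (`-c`, `-j`, `-x`, `-s`, `-i`) are standard aria2c options. The code only builds the command; it does not run it.

```python
import shlex


def build_aria2c_command(url_list_path, parallel_files=2, connections_per_file=4):
    """Return an aria2c argv that downloads several files at once.

    url_list_path is a hypothetical text file with one URL per line.
    The default counts are placeholders, not tuned recommendations.
    """
    if parallel_files < 1 or connections_per_file < 1:
        raise ValueError("counts must be at least 1")
    return [
        "aria2c",
        "-c",                             # resume partially downloaded files
        "-j", str(parallel_files),        # --max-concurrent-downloads
        "-x", str(connections_per_file),  # --max-connection-per-server
        "-s", str(connections_per_file),  # --split: pieces per file
        "-i", url_list_path,              # --input-file: read URLs from a file
    ]


if __name__ == "__main__":
    # Print the command so it can be inspected before being run by hand.
    print(shlex.join(build_aria2c_command("warc_urls.txt")))
```

Since each copy of an IA item sits on a single HDD, raising `-j` presumably helps most when the files are spread across different items; many connections against one item mainly add seeking.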
20:34:53 Relating to that, I assume downloading a WARC from an item where other WARCs are currently being uploaded would run into that same seeking problem?
20:35:40 or derived, I guess
20:36:54 Derives run on a separate machine. With archive.php tasks, it could happen, yeah.
20:41:37 https://x.com/bokieiey/status/1818506690826059827 hope someone archives that lol
20:41:37 nitter: https://nitter.lucabased.xyz/bokieiey/status/1818506690826059827
20:57:26 Only one left: https://www.ebay.com/itm/266337902355
20:58:08 Well, one lot of 10, I guess.
21:05:31 ah damn, gone.
22:32:36 JAA++
22:32:36 -eggdrop- [karma] 'JAA' now has 87 karma!
22:32:40 i now have more channel space
22:45:32 yarrow_alt: We're already archiving it
22:45:46 What do you mean that "special care may need to be taken to ensure the archive of the site works..."?
22:45:51 What specifically?
22:49:35 oh right, i should compile a list and remove them from eggdrop's channelfile and firebot's database..
22:49:52 (eggdrop will keep trying to join forever)
22:50:37 JustAnotherArchivist edited Gfycat (+23): https://wiki.archiveteam.org/?diff=53144&oldid=50940
22:51:37 JustAnotherArchivist edited Operation London Bridge (-1): https://wiki.archiveteam.org/?diff=53145&oldid=48983
22:51:38 JustAnotherArchivist edited V Live (+23): https://wiki.archiveteam.org/?diff=53146&oldid=50784
22:51:39 JustAnotherArchivist edited BuzzVideo (+23): https://wiki.archiveteam.org/?diff=53147&oldid=49421
22:51:40 JustAnotherArchivist edited Pandora.tv (+23): https://wiki.archiveteam.org/?diff=53148&oldid=49429
22:52:37 JustAnotherArchivist edited Revue (+23): https://wiki.archiveteam.org/?diff=53149&oldid=49496
22:52:38 JustAnotherArchivist edited Egloos (+23): https://wiki.archiveteam.org/?diff=53150&oldid=50983
22:52:39 JustAnotherArchivist edited Skyblog (+23): https://wiki.archiveteam.org/?diff=53151&oldid=50550
22:52:40 JustAnotherArchivist edited Tiki (+23): https://wiki.archiveteam.org/?diff=53152&oldid=50217
22:53:38 JustAnotherArchivist edited ЯRUS (+23): https://wiki.archiveteam.org/?diff=53153&oldid=50113
22:53:39 JustAnotherArchivist edited Wysp (+23): https://wiki.archiveteam.org/?diff=53154&oldid=50982
22:53:40 JustAnotherArchivist edited Xuite (+23): https://wiki.archiveteam.org/?diff=53155&oldid=50631
22:53:41 JustAnotherArchivist edited ZOWA (+13): https://wiki.archiveteam.org/?diff=53156&oldid=50923