00:01:19 We (the spore community) haven't gotten a complete list of IDs yet. Archiving them one by one works but is inefficient. Known ID ranges exist, but we don't know where their "holes" are: 300000299348 - 300001258136; 500001000011 - 500999999998; 501000000016 - 501039850984
00:02:45 how many items are there in your list?
00:05:07 (order-of-magnitude estimate is ok)
00:05:36 Oh, alright, approx 10^7
00:07:00 2 * 10^7
00:08:31 My question is whether there are any tools that ArchiveTeam could/would use for this large list of small files (approx 20 kB each)
00:09:08 yes, there are--both archivebot and the urls project (https://wiki.archiveteam.org/index.php/URLs#How_to_help_if_you_have_lists_of_URLs) could do this. i'm just thinking about which would be more appropriate
00:11:54 I see, I see
00:11:58 we usually prefer archivebot when retrieving large numbers of urls from a single site, because it offers better feedback and control of request speed (preventing accidental ddosing)
00:13:31 Hm, then what of speed?
00:14:26 The only remaining factors would be AT's willingness to take the URLs on and the speed at which they're saved
00:16:58 _but_ them all being small images is otherwise best-case, so it might be all right. or we could break them up into multiple lists. JAA, what do you think?
00:20:47 2e7 * 32K is about 600G, which afaik shouldn't be a problem, and speed is probably a question of what ea can/will tolerate and how long we're willing to take
00:21:35 ...whoops, first response should have been:
00:21:39 well no, the other factor is that archivebot starts having issues if given very large lists, and 2e7 is on the high side.
00:28:42 But you are saying that speed from ArchiveTeam's side is not a problem?
00:34:40 from what i recall of previous discussion, i suspect the spore servers would be the limiting factor, yeah.
00:35:12 I see, alright. We're still busy getting all the viable links though
00:38:38 sounds good. by the way, did you get around to extracting the imgur links from the sporepedia2.foroactivo.com crawl like you mentioned?
00:40:31 I'll look for some spore telegram groups too. I've just done a bunch more coupon ones to complement the crypto stuff. And yes, I do throw in good stuff too sometimes
00:52:10 Pedrosso, thuban: 20M is feasible with AB, especially with images. They take little processing time, so they can run fast. It'd probably take on the order of 2 weeks.
00:53:30 As was mentioned, #// doesn't work well for lists of single/few hosts due to the DDoS risk, and we get no real feedback on what happened to the URLs once they go in. It's very much a best-effort shotgun approach at the internet, not useful for targeted crawls.
01:12:58 thuban: I got stumped at trying to get to the .warc's. Someone offered to do it, and if they won't/can't, I'll take it back up
01:13:23 okay, that was really unclear. I got stumped at trying to use them once downloaded
01:15:36 It also appears we are way closer to finishing the list than I thought we were; it looks like we're about halfway, but I'll have to confirm that with them
01:15:52 Pedrosso: no worries, i've just done it
01:16:03 oh, sweet
01:16:06 thank you
01:16:59 (for future reference, you can handle .warc.gz with anything that handles .gz--zless, zgrep, etc)
01:18:17 Nice to say, but honestly, it's just really hard for me to wrap my head around anything that doesn't have an obvious "Click here to download exactly what you want" button and a GUI, haha.
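A minimal sketch of what "handle .warc.gz with anything that handles .gz" can look like in practice: plain Python and the standard gzip module, listing the URL each WARC record captured. The filename is a placeholder, and `zgrep 'WARC-Target-URI' file.warc.gz` would do roughly the same job from the shell.

    # Sketch: list the captured URLs in a .warc.gz using only the stdlib.
    # WARC files are plain-text records that happen to be gzip-compressed,
    # so reading them as gzipped text is enough for a quick look.
    import gzip
    import sys

    path = sys.argv[1]  # e.g. some-job.warc.gz (placeholder filename)
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as warc:
        for line in warc:
            # Each record names the URL it captured in this header.
            if line.startswith("WARC-Target-URI:"):
                print(line.split(":", 1)[1].strip())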
01:18:40 I still probably will want to look through .warc.gz files in the future, so thanks
01:20:45 Thanks again :)
01:34:21 Pedrosso: you're welcome!
01:36:39 JAA: thanks!
01:38:47 also, idk if you saw the earlier discussion, but is it correct that we don't move third-party uploads to the archiveteam collection anymore and https://wiki.archiveteam.org/index.php/Frequently_Asked_Questions#halp_pls_halp should be updated?
01:41:24 Yeah, that was quite confusing
01:48:47 I did an update to at least remove the part where it claims that
01:49:11 It probably deserves some rewording, though
01:49:21 TheTechRobo edited Frequently Asked Questions (-51, Temporary update to reduce confusion: AT…): https://wiki.archiveteam.org/?diff=51149&oldid=50785
01:50:14 yeah. i also noted when i went to link it just now that the entries aren't proper headings, just bold... there's a bunch of stuff to fix there
01:53:36 Yeah, the vast majority of that page is from almost a decade ago.
02:45:41 Found a site with a lot of Spanish forums: https://www.google.com/search?q=site%3Aforoactivo.com
05:14:11 Scarlett03 edited Deathwatch (+199, wilko aqquired by CDS Superstores): https://wiki.archiveteam.org/?diff=51150&oldid=51147
05:59:27 Yo, regarding https://wiki.archiveteam.org/index.php/Deathwatch - if there are a bunch of entries in https://wiki.archiveteam.org/index.php/Deathwatch#Pining_for_the_Fjords_(Dying), shouldn't the ones that have already died move to https://wiki.archiveteam.org/index.php/Deathwatch#Dead_as_a_Doornail ?
06:02:06 yes, it just sometimes takes a while for someone to get around to it (esp when sites don't actually shut down on the announced schedule)
06:03:08 Yeah, I moved a bunch the other day, several of which had been dead for months.
06:03:24 Ryz edited Deathwatch (+243, /* 2024 */ Add GameBattles): https://wiki.archiveteam.org/?diff=51151&oldid=51150
06:11:00 Hmm... I'm not sure if ArchiveBot can handle archiving stuff like https://gamebattles.majorleaguegaming.com/tournaments :/
07:10:33 israel just stormed Gaza hospital
07:29:23 Ah yes, killing the website version of Comixology, and then finally killing off the app version so it can merge into Kindle, wow Amazon :/ - https://www.comicsbeat.com/comixology-app-merges-with-the-kindle-app-at-amazon/
08:38:01 JustAnotherArchivist edited Deathwatch (+52, /* 2023 */ Add LARM.fm): https://wiki.archiveteam.org/?diff=51152&oldid=51151
17:22:36 I would appreciate it if someone would grab "haughey.com". This user posted that their blog would be shut down in 60 days: https://xoxo.zone/@mathowie/111415557908672738
17:22:56 (Can't find any AB jobs: https://archive.fart.website/archivebot/viewer/?q=haughey.com)
17:24:35 Unsure if they are going to save the blog or not, but better safe than sorry ig?
18:22:42 Weird, that isn't even a Blogger blog as far as I can see. Maybe it was in the past.
18:26:46 Running
18:30:48 Appreciated
18:31:02 JAA++
18:31:02 -eggdrop- [karma] 'JAA' now has 3 karma!
18:31:09 Also #frogger for the upcoming Blogger project
19:29:16 arkiver and others, a reminder that it's not only Blogger stuff, it's also Google Docs and other goodies like Google Photos; rather curious it's not YouTube, though probably for a specific reason >;o
19:37:54 Frogger might want to keep track of urls to those other services in the blogs and potentially send them to another project as well. Is google drive in the burnpile?
might be a good idea to get #googlecrash back online if so
19:40:42 As I understand it, everything associated with the inactive accounts is getting shredded.
19:46:26 * vokunal|m uploaded an image: (903KiB) < https://matrix.hackint.org/_matrix/media/v3/download/matrix.org/aNKbhbTkcFxWHGUKjbcuwLIT/image.png >
19:46:27 It's so fun watching these things work. Probably a bit inefficient having them display every single line, but it's nice to watch
21:15:54 yeah Ryz
22:00:59 I've gathered a fairly extensive but not complete list of old and dying (or thriving but niche) Spore-related forums: https://transfer.archivete.am/J2GVQ/sporeforums1.txt
22:51:23 Copied from #archiveteam-ot: I want to archive a few hundred historical documents from the local library (books, newspapers...). The problem is that they can't be archived using the Wayback Machine, because each image is loaded using javascript and the links to those images aren't loaded in a way that IA can capture. The names of the images are available in the source code of each book (for example: https://www.old.dlib.me/sken_prikaz_1_f.php?id_jedinice=1034), which shows that the images are loaded from the lista_skenova section and that they exist in the skenovi/nj-gorski-vijenac-engleski folder under the base url. The folder name is different for each document. Did anyone else encounter this type of library preview? I think I've seen it before. I would also like to convert all of those books to pdf and upload them to IA separately. I've begun downloading this manually using some basic scripts and wget, but there are about 1500 pages of this and it would be too labor-intensive to continue like that.
23:05:59 Peroniko: interesting, i will take a look at this and get back to you in a bit. are the available documents just the ones under the https://www.old.dlib.me/petarpetrovic2njegos/ collection, or is there more?
23:06:30 There are others here: https://www.old.dlib.me/
23:06:51 books, manuscripts, photos, maps..
23:09:31 oh, my mistake! i saw those but didn't see that they were browsable
23:09:31 (the link isn't clearly indicated and the "english" site mostly isn't...)
23:13:24 i think it should be possible to get the documents to work in the wayback machine
23:25:48 I've made this script to download them. Seems to work but not yet fully tested. https://gist.github.com/Fooftilly/52793337319782576ad57fc01cbbb312
23:31:09 bad ids don't result in 404s, unfortunately
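A rough sketch of the kind of per-document downloader being described, under stated assumptions: the function names, output paths, and the regex for the skenovi/... image paths are guesses (the real markup of the lista_skenova section isn't quoted in the chat), and since bad ids don't return 404s, the sketch treats "no scan filenames found" as the failure signal rather than the HTTP status.

    # Sketch (untested against the real site): fetch a document page by id,
    # pull image paths like skenovi/<folder>/<file>.jpg out of the source
    # (assumed pattern, not verified), and download them under the base url.
    import re
    import urllib.request
    from pathlib import Path

    BASE = "https://www.old.dlib.me/"

    def fetch(url):
        with urllib.request.urlopen(url) as resp:
            return resp.read()

    def download_document(id_jedinice):
        page = fetch(f"{BASE}sken_prikaz_1_f.php?id_jedinice={id_jedinice}")
        page = page.decode("utf-8", "replace")
        # Assumed: relative paths to the scans appear verbatim in the source.
        scans = sorted(set(re.findall(
            r"skenovi/[\w\-]+/[\w\-.]+\.(?:jpe?g|png)", page, re.IGNORECASE)))
        if not scans:
            # Bad ids don't 404, so "no scans in the page" is the real signal.
            print(f"id {id_jedinice}: no scans found (probably a bad id)")
            return
        outdir = Path(f"dlib-{id_jedinice}")
        outdir.mkdir(exist_ok=True)
        for rel in scans:
            target = outdir / Path(rel).name
            if not target.exists():  # makes re-runs cheap
                target.write_bytes(fetch(BASE + rel))
        print(f"id {id_jedinice}: saved {len(scans)} scans to {outdir}")

    if __name__ == "__main__":
        download_document(1034)  # the example id from the chat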