-
Pedrosso
We (the Spore community) don't have a complete list of IDs yet. Archiving them one by one works but is inefficient. We know the ID ranges, but not where their "holes" are: 300000299348 - 300001258136; 500001000011 - 500999999998; 501000000016 - 501039850984.
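
For a sense of scale, the spans of the three ranges quoted above can be computed directly. This is only a sketch: it counts how many IDs each inclusive range could hold, not how many actually exist, and the names `ID_RANGES` and `span` are illustrative.

```python
# Inclusive ID ranges as quoted in the message above.
ID_RANGES = [
    (300000299348, 300001258136),
    (500001000011, 500999999998),
    (501000000016, 501039850984),
]


def span(first: int, last: int) -> int:
    """Number of candidate IDs in an inclusive range."""
    return last - first + 1


sizes = [span(first, last) for first, last in ID_RANGES]
total = sum(sizes)

for (first, last), size in zip(ID_RANGES, sizes):
    print(f"{first} - {last}: {size:,} candidate IDs")
print(f"total: {total:,} candidate IDs")
```

The total comes to roughly 10^9 candidates, far more than the approx 2 * 10^7 items mentioned later in the conversation, which would be consistent with the ranges containing large holes.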
-
thuban
how many items are there in your list?
-
thuban
(order-of-magnitude estimate is ok)
-
Pedrosso
Oh, alright approx 10^7
-
Pedrosso
2 * 10^7
-
Pedrosso
My question is if there are any tools that ArchiveTeam could/would use for this large list of small files (approx 20kB)
-
thuban
yes, there are--both archivebot and the urls project (
wiki.archiveteam.org/index.php/URLs…w_to_help_if_you_have_lists_of_URLs) could do this. i'm just thinking about which would be more appropriate
-
Pedrosso
I see, I see
-
thuban
we usually prefer archivebot when retrieving large numbers of urls from a single site, because it offers better feedback and control of request speed (preventing accidental ddosing)
-
Pedrosso
Hm, then what of speed?
-
Pedrosso
The only remaining factors would be AT's willingness to take the URLs on and the speed at which they're saved
-
thuban
_but_ them all being small images is otherwise best-case, so it might be all right. or we could break them up into multiple lists. JAA, what do you think?
-
thuban
2e7 * 32K is about 600G, which afaik shouldn't be a problem, and speed is probably a question of what ea can/will tolerate and how long we're willing to take
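
The size estimate above can be reproduced as a quick calculation. This sketch assumes "32K" means 32 KiB and "G" means GiB, which the message does not state; the 20 kB figure is the approximate file size quoted earlier in the conversation.

```python
URL_COUNT = 20_000_000  # 2e7 items, per the estimate above
KIB = 1024
GIB = 1024 ** 3


def total_gib(count: int, bytes_each: int) -> float:
    """Total size in GiB for `count` files of `bytes_each` bytes."""
    return count * bytes_each / GIB


upper = total_gib(URL_COUNT, 32 * KIB)     # 32K per file (assumed KiB)
typical = total_gib(URL_COUNT, 20 * 1000)  # approx 20 kB per file

print(f"at 32 KiB each: {upper:.0f} GiB")
print(f"at 20 kB each:  {typical:.0f} GiB")
```

Under those assumptions the 32K figure gives about 610 GiB, in line with "about 600G", and the 20 kB figure gives noticeably less.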
-
thuban
...whoops, first response should have been:
-
thuban
well no, the other factor is that archivebot starts having issues if given very large lists, and 2e7 is on the high side.
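
Breaking one very large list into several smaller ones, as suggested above, could look something like the following. It is a minimal sketch: the chunk size and the `.partNNNN.txt` naming are hypothetical, and nothing in the discussion says what list size ArchiveBot handles comfortably.

```python
from itertools import islice
from pathlib import Path
from typing import Iterable, Iterator, List


def chunked(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def split_url_list(src: Path, out_dir: Path, size: int) -> List[Path]:
    """Write the non-empty lines of `src` into numbered part files."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with src.open(encoding="utf-8") as handle:
        urls = (line.strip() for line in handle if line.strip())
        for index, batch in enumerate(chunked(urls, size)):
            dest = out_dir / f"{src.stem}.part{index:04d}.txt"
            dest.write_text("\n".join(batch) + "\n", encoding="utf-8")
            written.append(dest)
    return written
```

The input is read lazily, so a list of 2e7 lines is never held in memory all at once; only one batch is.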
-
Pedrosso
But you are saying that speed from ArchiveTeam's side is not a problem?
-
thuban
from what i recall of previous discussion, i suspect the spore servers would be the limiting factor, yeah.
-
Pedrosso
I see, alright. We're still busy getting all the viable links though
-
thuban
sounds good. by the way, did you get around to extracting the imgur links from the sporepedia2.foroactivo.com crawl like you mentioned?
-
flashfire42|m
I’ll look for some spore telegram groups too. I’ve just done a bunch more coupon ones to complement the crypto stuff. And yes I do throw in good stuff too sometimes
-
JAA
Pedrosso, thuban: 20M is feasible with AB, especially with images. They take little processing time, so they can run fast. It'd probably take on the order of 2 weeks.
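
To put the two-week figure in terms of request rate, here is a back-of-the-envelope sketch. It assumes exactly 14 days of continuous running and the approx 20 kB file size mentioned earlier; both are simplifications, not measurements.

```python
SECONDS_PER_DAY = 86_400


def urls_per_second(url_count: int, days: float) -> float:
    """Average request rate needed to finish `url_count` URLs in `days`."""
    return url_count / (days * SECONDS_PER_DAY)


rate = urls_per_second(20_000_000, 14)
bandwidth_kb = rate * 20  # assuming approx 20 kB per file

print(f"{rate:.1f} URLs/s, roughly {bandwidth_kb:.0f} kB/s")
```

That works out to an average somewhere around 16 to 17 requests per second sustained against the target servers.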
-
JAA
As was mentioned, #// doesn't work well for lists of single/few hosts due to the DDoS risk, and we get no real feedback over what happened to the URLs once they go in. It's very much a best-effort shotgun approach at the internet, not useful for targeted crawls.
-
Pedrosso
thuban: I got stumped at trying to get to the .warc's. Someone offered to do it and if they won't/can't I'll take it back up
-
Pedrosso
okay that was really unclear. I got stumped at trying to use them once downloaded
-
Pedrosso
It also looks like we're way closer to finishing the list than I thought; it appears we're about halfway, but I'll have to confirm that with them
-
thuban
Pedrosso: no worries, i've just done it
-
Pedrosso
oh, sweet
-
Pedrosso
thank you
-
thuban
(for future reference, you can handle .warc.gz with anything that handles .gz--zless, zgrep, etc)
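
The same idea works from Python, since the standard `gzip` module reads concatenated gzip members as one stream. Below is a rough zgrep-style sketch for pulling imgur links out of a .warc.gz; the regular expression and the function name `find_links` are assumptions made for illustration, not something from the discussion.

```python
import gzip
import re
from typing import Iterator

# Loose pattern for imgur URLs; an assumption, adjust to taste.
IMGUR_RE = re.compile(rb"https?://(?:[a-z0-9-]+\.)?imgur\.com/[A-Za-z0-9/._-]+")


def find_links(path: str, pattern: re.Pattern = IMGUR_RE) -> Iterator[str]:
    """Yield each distinct match of `pattern` in a gzip file, in first-seen order."""
    seen = set()
    with gzip.open(path, "rb") as handle:
        for line in handle:
            for match in pattern.finditer(line):
                url = match.group(0).decode("ascii", "replace")
                if url not in seen:
                    seen.add(url)
                    yield url


# Example usage (the filename is hypothetical):
# for url in find_links("crawl.warc.gz"):
#     print(url)
```

Like zgrep, this only sees text that is stored uncompressed inside the records; if the server sent response bodies with their own content encoding, links inside those bodies will not match without a WARC-aware tool.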
-
Pedrosso
Nice to say, but honestly, it's just really hard for me to wrap my head around anything that doesn't have an obvious "Click here to download exactly what you want" button and a GUI, haha.
-
Pedrosso
I still probably will want to look through .warc.gz:s in the future so, thanks
-
Pedrosso
Thanks again :)
-
thuban
Pedrosso: you're welcome!
-
thuban
JAA: thanks!
-
thuban
also, idk if you saw the earlier discussion, but is it correct that we don't move third-party uploads to the archiveteam collection anymore and
wiki.archiveteam.org/index.php/Freq…ently_Asked_Questions#halp_pls_halp should be updated?
-
Pedrosso
Yeah, that was quite confusing
-
TheTechRobo
I did an update to at least remove the part where it claims that
-
TheTechRobo
It probably deserves some rewording, though
-
h2ibot
TheTechRobo edited Frequently Asked Questions (-51, Temporary update to reduce confusion: AT…):
wiki.archiveteam.org/?diff=51149&oldid=50785
-
thuban
yeah. i also noted when i went to link it just now that the entries aren't proper headings, just bold... there's a bunch of stuff to fix there
-
JAA
Yeah, the vast majority of that page is from almost a decade ago.
-
Pedrosso
Found a site with a lot of Spanish forums
google.com/search?q=site%3Aforoactivo.com
-
h2ibot
Scarlett03 edited Deathwatch (+199, wilko aqquired by CDS Superstores):
wiki.archiveteam.org/?diff=51150&oldid=51147
-
thuban
yes, it just sometimes takes a while for someone to get around to it (esp when sites don't actually shut down on the announced schedule)
-
JAA
Yeah, I moved a bunch the other day, several of which had been dead for months.
-
h2ibot
Ryz edited Deathwatch (+243, /* 2024 */ Add GameBattles):
wiki.archiveteam.org/?diff=51151&oldid=51150
-
Ryz
Hmm...I'm not sure if ArchiveBot can handle archiving stuff like
gamebattles.majorleaguegaming.com/tournaments :/
-
Flashfire42
israel just stormed Gaza hospital
-
Ryz
Ah yes, killing the website version of Comixology, and then finally killing off the app version so it can merge into Kindle, wow Amazon :/ -
comicsbeat.com/comixology-app-merges-with-the-kindle-app-at-amazon
-
h2ibot
JustAnotherArchivist edited Deathwatch (+52, /* 2023 */ Add LARM.fm):
wiki.archiveteam.org/?diff=51152&oldid=51151
-
tomodachi94
I would appreciate it if someone would grab "haughey.com". This user posted that their blog would be shut down in 60 days:
xoxo.zone/@mathowie/111415557908672738
-
tomodachi94
Unsure if they are going to save the blog or not, but better safe than sorry ig?
-
JAA
Weird, that isn't even a Blogger blog as far as I can see. Maybe it was in the past.
-
JAA
Running
-
tomodachi94
Appreciated
-
fireonlive
JAA++
-
eggdrop
[karma] 'JAA' now has 3 karma!
-
JAA
Also #frogger for the upcoming Blogger project
-
Ryz
arkiver and others, a reminder that it's not only Blogger stuff; it's also Google Docs and other goodies like Google Photos. Rather curious that it's not YouTube, though there's probably a specific reason >;o
-
vokunal|m
Frogger might want to keep track of urls to those other services in the blogs and potentially send them to another project as well. Is google drive in the burnpile? might be a good idea to get #googlecrash back online if so
-
JAA
As I understand it, everything associated with the inactive accounts is getting shredded.
-
vokunal|m
It's so fun watching these things work. Probably a bit inefficient having them display every single line, but it's nice to watch
-
arkiver
yeah Ryz
-
Pedrosso
I've gathered a fairly extensive but not complete list of old and dying or thriving but niche Spore-related forums
transfer.archivete.am/J2GVQ/sporeforums1.txt
-
Peroniko
Copied from #archiveteam-ot: I want to archive a few hundred historical documents from the local library (books, newspapers...). The problem is that they can't be archived using the Wayback Machine because each image is loaded using JavaScript and the links to those images aren't loaded in a way that IA can capture them. The names of the images are available in the source code of each book (for example:
old.dlib.me/sken_prikaz_1_f.php
-
Peroniko
?id_jedinice=1034), which shows that the images are loaded from the lista_skenova section and that they exist in the skenovi/nj-gorski-vijenac-engleski folder under the base URL. The folder name is different for each document. Has anyone else encountered this type of library preview? I think I've seen it before. I would also like to convert all of those books to PDF and upload them to IA separately. I've begun downloading this manually using some
-
Peroniko
basic scripts and wget, but there are about 1500 pages of this and it would be too labor-intensive to continue like that.
-
thuban
Peroniko: interesting, i will take a look at this and get back to you in a bit. are the available documents just the ones under the
old.dlib.me/petarpetrovic2njegos collection, or is there more?
-
Peroniko
There are others here:
old.dlib.me
-
Peroniko
books, manuscripts, photos, maps...
-
thuban
oh, my mistake! i saw those but didn't see that they were browsable
-
thuban
(the link isn't clearly indicated and the "english" site mostly isn't...)
-
thuban
i think it should be possible to get the documents to work in the wayback machine
-
Peroniko
I've made this script to download. Seems to work but not yet fully tested.
gist.github.com/Fooftilly/52793337319782576ad57fc01cbbb312
-
thuban
bad ids don't result in 404s, unfortunately