#archiveteam-bs

00:01

Pedrosso

We (the spore community) haven't gotten a complete list of IDs yet. Archiving them one by one works but is inefficient. We know the ID ranges, but not where their "holes" are: 300000299348 - 300001258136; 500001000011 - 500999999998; 501000000016 - 501039850984
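To see why the "holes" matter, here is a rough sketch that counts how many candidate IDs the three quoted ranges contain. It assumes the ranges are inclusive at both ends, and `known` uses the roughly 2 * 10^7 figure given later in this conversation; neither assumption is confirmed by the log.

```python
# ID ranges as quoted in the message above; treated as inclusive (assumption).
RANGES = [
    (300000299348, 300001258136),
    (500001000011, 500999999998),
    (501000000016, 501039850984),
]

def range_size(lo, hi):
    """Number of integer IDs in the inclusive range [lo, hi]."""
    return hi - lo + 1

candidates = sum(range_size(lo, hi) for lo, hi in RANGES)
known = 20_000_000  # approximate list size mentioned later in the log

print(candidates)
print(f"{known / candidates:.1%} of candidate IDs are in the list")
```

Under those assumptions the ranges hold about 1.04 billion candidate IDs, so only around 2% of them would correspond to listed items, which is why probing every ID is so wasteful.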
00:02

thuban

how many items are there in your list?
00:05

thuban

(order-of-magnitude estimate is ok)
00:05

Pedrosso

Oh, alright approx 10^7
00:07

Pedrosso

2 * 10^7
00:08

Pedrosso

My question is if there are any tools that ArchiveTeam could/would use for this large list of small files (approx 20kB)
00:09

thuban

yes, there are--both archivebot and the urls project (wiki.archiveteam.org/index.php/URLs…w_to_help_if_you_have_lists_of_URLs) could do this. i'm just thinking about which would be more appropriate
00:11

Pedrosso

I see, I see
00:11

thuban

we usually prefer archivebot when retrieving large numbers of urls from a single site, because it offers better feedback and control of request speed (preventing accidental ddosing)
00:13

Pedrosso

Hm, then what of speed?
00:14

Pedrosso

The only remaining factors would be AT's willingness to take the URLs on and the speed at which they're saved
00:16

thuban

_but_ them all being small images is otherwise best-case, so it might be all right. or we could break them up into multiple lists. JAA, what do you think?
00:20

thuban

2e7 * 32K is about 600G, which afaik shouldn't be a problem, and speed is probably a question of what ea can/will tolerate and how long we're willing to take
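The estimate above can be reproduced with a few lines of arithmetic. This sketch reads "32K" as 32 KiB and "approx 20kB" as 20,000 bytes; both readings are assumptions, and the helper name is made up for illustration.

```python
ITEMS = 20_000_000  # "2 * 10^7" from earlier in the conversation

def total_bytes(items, bytes_per_item):
    """Total payload size, ignoring WARC and HTTP header overhead."""
    return items * bytes_per_item

high = total_bytes(ITEMS, 32 * 1024)  # thuban's 32K figure, read as 32 KiB (assumption)
low = total_bytes(ITEMS, 20_000)      # Pedrosso's "approx 20kB" (assumption: 20,000 bytes)

print(f"{high / 1e9:.0f} GB with 32 KiB items, {low / 1e9:.0f} GB with 20 kB items")
```

Either way the total lands in the hundreds of gigabytes, consistent with "about 600G".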
00:21

thuban

...whoops, first response should have been:
00:21

thuban

well no, the other factor is that archivebot starts having issues if given very large lists, and 2e7 is on the high side.
00:28

Pedrosso

But you are saying that speed from ArchiveTeam's side is not a problem?
00:34

thuban

from what i recall of previous discussion, i suspect the spore servers would be the limiting factor, yeah.
00:35

Pedrosso

I see, alright. We're still busy getting all the viable links though
00:38

thuban

sounds good. by the way, did you get around to extracting the imgur links from the sporepedia2.foroactivo.com crawl like you mentioned?
00:40

flashfire42|m

I’ll look for some spore telegram groups too. I’ve just done a bunch more coupon ones to complement the crypto stuff. And yes I do throw in good stuff too sometimes
00:52

JAA

Pedrosso, thuban: 20M is feasible with AB, especially with images. They take little processing time, so they can run fast. It'd probably take on the order of 2 weeks.
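For a sense of what "on the order of 2 weeks" implies, this sketch computes the average request rate needed to fetch 20M URLs in 14 days. It assumes a constant rate with no retries or pauses, which a real job would not have.

```python
SECONDS_PER_DAY = 86_400

def average_rate(urls, days):
    """Average requests per second needed to fetch `urls` in `days`."""
    return urls / (days * SECONDS_PER_DAY)

rate = average_rate(20_000_000, 14)
print(f"{rate:.1f} requests/second")
```

That works out to roughly 16 to 17 requests per second sustained, which is the load the Spore servers would have to tolerate.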
00:53

JAA

As was mentioned, #// doesn't work well for lists of single/few hosts due to the DDoS risk, and we get no real feedback over what happened to the URLs once they go in. It's very much a best-effort shotgun approach at the internet, not useful for targeted crawls.
01:12

Pedrosso

thuban: I got stumped at trying to get to the .warc's. Someone offered to do it and if they won't/can't I'll take it back up
01:13

Pedrosso

okay that was really unclear. I got stumped at trying to use them once downloaded
01:15

Pedrosso

It also appears we are way closer to finishing the list than I thought we were; it appears as though we're halfway but I'll have to confirm that with them
01:15

thuban

Pedrosso: no worries, i've just done it
01:16

Pedrosso

oh, sweet
01:16

Pedrosso

thank you
01:16

thuban

(for future reference, you can handle .warc.gz with anything that handles .gz--zless, zgrep, etc)
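The same idea works from a script. Below is a minimal sketch that scans a .warc.gz for imgur links using only Python's standard `gzip` and `re` modules. The regular expression is an illustrative guess at what imgur URLs look like, and `find_urls` is a hypothetical helper, not an ArchiveTeam tool.

```python
import gzip
import re

# Assumed pattern for illustration; adjust it to the links you actually need.
IMGUR_URL = re.compile(rb"https?://(?:[a-z0-9-]+\.)?imgur\.com/[A-Za-z0-9_./-]+")

def find_urls(path, pattern=IMGUR_URL):
    """Return the sorted set of URLs matching `pattern` in a gzip file.

    gzip.open reads concatenated gzip members as one stream, which is how
    .warc.gz files are usually laid out.
    """
    found = set()
    with gzip.open(path, "rb") as stream:
        for line in stream:
            for match in pattern.finditer(line):
                found.add(match.group(0).decode("ascii", "replace"))
    return sorted(found)
```

Like zgrep, this only sees the decompressed WARC stream: if a response body was itself stored with a compressed Content-Encoding, links inside it will not match, and a link at the end of a sentence may pick up a trailing period.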
01:18

Pedrosso

Nice to say, but honestly, it's just really hard for me to wrap my head around anything that doesn't have an obvious "Click here to download exactly what you want" button and a GUI, haha.
01:18

Pedrosso

I still probably will want to look through .warc.gz:s in the future so, thanks
01:20

Pedrosso

Thanks again :)
01:34

thuban

Pedrosso: you're welcome!
01:36

thuban

JAA: thanks!
01:38

thuban

also, idk if you saw the earlier discussion, but is it correct that we don't move third-party uploads to the archiveteam collection anymore and wiki.archiveteam.org/index.php/Freq…ently_Asked_Questions#halp_pls_halp should be updated?
01:41

Pedrosso

Yeah, that was quite confusing
01:48

TheTechRobo

I did an update to at least remove the part where it claims that
01:49

TheTechRobo

It probably deserves some rewording, though
01:49

h2ibot

TheTechRobo edited Frequently Asked Questions (-51, Temporary update to reduce confusion: AT…): wiki.archiveteam.org/?diff=51149&oldid=50785
01:50

thuban

yeah. i also noted when i went to link it just now that the entries aren't proper headings, just bold... there's a bunch of stuff to fix there
01:53

JAA

Yeah, the vast majority of that page is from almost a decade ago.
02:45

Pedrosso

Found a site with a lot of Spanish forums google.com/search?q=site%3Aforoactivo.com
05:14

h2ibot

Scarlett03 edited Deathwatch (+199, wilko aqquired by CDS Superstores): wiki.archiveteam.org/?diff=51150&oldid=51147
05:59

Ryz

Yo, regarding wiki.archiveteam.org/index.php/Deathwatch - if there's a bunch of entries in wiki.archiveteam.org/index.php/Deat…watch#Pining_for_the_Fjords_(Dying) - shouldn't the ones that died already move to wiki.archiveteam.org/index.php/Deathwatch#Dead_as_a_Doornail ?
06:02

thuban

yes, it just sometimes takes a while for someone to get around to it (esp when sites don't actually shut down on the announced schedule)
06:03

JAA

Yeah, I moved a bunch the other day, several of which had been dead for months.
06:03

h2ibot

Ryz edited Deathwatch (+243, /* 2024 */ Add GameBattles): wiki.archiveteam.org/?diff=51151&oldid=51150
06:11

Ryz

Hmm...I'm not sure if ArchiveBot can handle archiving stuff like gamebattles.majorleaguegaming.com/tournaments :/
07:10

Flashfire42

israel just stormed Gaza hospital
07:29

Ryz

Ah yes, killing the website version of Comixology, and then finally killing off the app version so it can merge into Kindle, wow Amazon :/ - comicsbeat.com/comixology-app-merges-with-the-kindle-app-at-amazon
08:38

h2ibot

JustAnotherArchivist edited Deathwatch (+52, /* 2023 */ Add LARM.fm): wiki.archiveteam.org/?diff=51152&oldid=51151
17:22

tomodachi94

I would appreciate it if someone would grab "haughey.com". This user posted that their blog would be shutdown in 60 days: xoxo.zone/@mathowie/111415557908672738
17:22

tomodachi94

(Can't find any AB jobs: archive.fart.website/archivebot/viewer/?q=haughey.com)
17:24

tomodachi94

Unsure if they are going to save the blog or not, but better safe than sorry ig?
18:22

JAA

Weird, that isn't even a Blogger blog as far as I can see. Maybe it was in the past.
18:26

JAA

Running
18:30

tomodachi94

Appreciated
18:31

fireonlive

JAA++
18:31

eggdrop

[karma] 'JAA' now has 3 karma!
18:31

JAA

Also #frogger for the upcoming Blogger project
19:29

Ryz

arkiver and others, a reminder on not only Blogger stuff, it's also Google Docs and other goodies like Google Photos; rather curious it's not YouTube, though probably a specific reason >;o
19:37

vokunal|m

Frogger might want to keep track of urls to those other services in the blogs and potentially send them to another project as well. Is google drive in the burnpile? might be a good idea to get #googlecrash back online if so
19:40

JAA

As I understand it, everything associated with the inactive accounts is getting shredded.
19:46

» vokunal|m uploaded an image: (903KiB) <matrix.hackint.org/_matrix/media/v3…/aNKbhbTkcFxWHGUKjbcuwLIT/image.png>
19:46

vokunal|m

It's so fun watching these things work. Probably a bit inefficient having them display every single line, but it's nice to watch
21:15

arkiver

yeah Ryz
22:00

Pedrosso

I've gathered a fairly extensive but not complete list of old and dying or thriving but niche Spore-related forums transfer.archivete.am/J2GVQ/sporeforums1.txt
22:51

Peroniko

Copied from #archiveteam-ot: I want to archive a few hundred historical documents from the local library (books, newspapers...). The problem is that they can't be archived using the Wayback Machine because each image is loaded using JavaScript and the links to those images aren't loaded in a way that IA can capture them. The names of the images are available in the source code of each book (for example: old.dlib.me/sken_prikaz_1_f.php
22:51

Peroniko

?id_jedinice=1034), which shows that the images are loaded from the lista_skenova section and that they exist in the skenovi/nj-gorski-vijenac-engleski folder under the base url. The folder name is different for each document. Has anyone else encountered this type of library preview? I think I've seen it before. I would also like to convert all of those books to pdf and upload them to IA separately. I've begun downloading this manually using some
22:51

Peroniko

basic scripts and wget, but there are about 1500 pages of this and it would be too labor intensive to continue like that.
23:05

thuban

Peroniko: interesting, i will take a look at this and get back to you in a bit. are the available documents just the ones under the old.dlib.me/petarpetrovic2njegos collection, or is there more?
23:06

Peroniko

There are others here: old.dlib.me
23:06

Peroniko

books, manuscripts, photos, maps..
23:09

thuban

oh, my mistake! i saw those but didn't see that they were browsable
23:09

thuban

(the link isn't clearly indicated and the "english" site mostly isn't...)
23:13

thuban

i think it should be possible to get the documents to work in the wayback machine
23:25

Peroniko

I've made this script to download. Seems to work but not yet fully tested. gist.github.com/Fooftilly/52793337319782576ad57fc01cbbb312
23:31

thuban

bad ids don't result in 404s, unfortunately
