-
Pedrosso
Awesome. I do have about 1.5 million of the smaller files downloaded, although since it's only the creature file and not the entire page I'm unsure of the relevance
-
pokechu22
Looks like the stuff that was run before was
spore-cr.ucoz.com and some stuff on staging.spore.com (
transfer.archivete.am/inline/xEMox/staging.spore.com_seed_urls.txt specifically)
-
pokechu22
hmm,
spore.com/sporepedia#qry=pg-220 looks to be handled via POST which archivebot can't do
-
pokechu22
sporepedia itself says 191,397,848 creations to date, but the browse tab says 1,769 newest creations -
spore.com/sporepedia#qry=st-sc looks like it goes on through 503,904 things though
-
Pedrosso
What I believe is the case is that it starts indexing at 500,000
-
Pedrosso
or- wait. 500,000,000,000
-
pokechu22
-
pokechu22
It also says you can drag the thumbnail into the spore creator app but I'm not sure how that works (if there's an additional URL for extra data or they're hiding it in the image somehow)
-
Pedrosso
They're hiding it in the image somehow
-
Pedrosso
Just to verify that, I'm gonna go into spore, turn off my internet, and pull one in
-
pokechu22
-
Pedrosso
I'll post other possibly relevant URLs.
spore.com/comm/developer spore.com/comm/samples (the latter has a list of possibly relevant urls)
-
pokechu22
-
pokechu22
The API docs there are helpful
-
Pedrosso
That is unfortunate. However, as the API docs show, the files can be accessed directly, although that will miss out on users, comments, etc.
-
Pedrosso
actually- I disregard that last statement about missing out, as I don't know how to read the XML files
-
pokechu22
Theoretically we could generate WARCs containing the POST data if a whole custom crawl were done, they just wouldn't allow navigating the site directly on web.archive.org as it stands today (theoretically it could be implemented in the future, but I think there are technical complications)
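(Illustration only, not ArchiveBot's actual mechanism: a WARC can store a POST request and its response even though the Wayback Machine won't replay such a capture as a browsable page today. A minimal sketch with the warcio library; the endpoint, headers, and bodies are placeholders.)
```python
# Sketch: write a POST request/response pair into a WARC with warcio.
from io import BytesIO

from warcio.warcwriter import WARCWriter
from warcio.statusandheaders import StatusAndHeaders

url = "https://www.spore.com/example/endpoint"  # placeholder, not a real API path
post_body = b"qry=pg-220"                        # placeholder POST body
response_body = b"{}"                            # placeholder response body

with open("post_capture.warc.gz", "wb") as out:
    writer = WARCWriter(out, gzip=True)

    request_record = writer.create_warc_record(
        url, "request",
        payload=BytesIO(post_body),
        http_headers=StatusAndHeaders(
            "POST /example/endpoint HTTP/1.1",
            [("Host", "www.spore.com"),
             ("Content-Length", str(len(post_body)))],
            is_http_request=True))

    response_record = writer.create_warc_record(
        url, "response",
        payload=BytesIO(response_body),
        http_headers=StatusAndHeaders(
            "200 OK",
            [("Content-Type", "application/json"),
             ("Content-Length", str(len(response_body)))],
            protocol="HTTP/1.1"))

    writer.write_record(request_record)
    writer.write_record(response_record)
```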
-
nicolas17
Pedrosso: if I understand correctly, we *can* get the user and comment, we just can't display it as a functional website on web.archive.org
-
nicolas17
user and comment data*
-
Pedrosso
That is awesome, thank you
-
nicolas17
so it will be a pile of xml or whatever waiting for someone to make a tool to read it
-
pokechu22
We *can* if something custom was implemented - archivebot wouldn't work for it (though giving archivebot millions of images as a list also isn't easy since it needs the list ahead of time; you can't just tell it the pattern the images follow)
-
Pedrosso
I've no clue how I didn't find it before, but there's a page on the ArchiveTeam wiki with possibly relevant info as well
wiki.archiveteam.org/index.php/Spore
-
pokechu22
I started an archivebot job for
spore.com but that's not going to recurse into anything that's accessed via javascript only (so it's not going to find everything on the character creator)
-
h2ibot
Pokechu22 edited Spore (+211, mention that the thumbnails include data):
wiki.archiveteam.org/?diff=51106&oldid=51087
-
Pedrosso
pokechu22: You say archivebot needs the list ahead of time, could you elaborate on that? Because I mean, making a very long list full of urls following the pattern is possible, no?
-
pokechu22
Yeah, it's definitely possible, not too difficult even, but if
static.spore.com/static/thumb/501/110/210/501110210233.png implies there are 1,110,210,233 images, I think that exceeds some of the reasonable limits :)
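(A minimal sketch of "a very long list following the pattern", assuming the path layout matches that example, i.e. the first nine digits of the twelve-digit ID become three three-digit directories; the starting ID and range are only illustrative.)
```python
# Sketch: generate thumbnail URLs for a range of asset IDs, assuming the layout
# seen in static.spore.com/static/thumb/501/110/210/501110210233.png
# (first nine digits of the 12-digit ID split into three directories).

def thumb_url(asset_id: int) -> str:
    s = f"{asset_id:012d}"
    return f"http://static.spore.com/static/thumb/{s[0:3]}/{s[3:6]}/{s[6:9]}/{s}.png"

# Illustrative range only; real IDs appear to start around 500,000,000,000.
with open("spore_thumbs_part1.txt", "w") as f:
    for asset_id in range(500_000_000_000, 500_001_000_000):
        f.write(thumb_url(asset_id) + "\n")
```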
-
pokechu22
it's possible to upload zst-compressed text files to transfer.archivete.am and then remove the zst extension to download it decompressed, which helps a bit, but archivebot still downloads it decompressed (and ends up uploading that decompressed list to archive.org without any other compression)
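(Sketch of the compression step using the third-party zstandard package; the file name carries over from the earlier example and is only illustrative.)
```python
# Compress a URL list before uploading it to transfer.archivete.am.
import zstandard

cctx = zstandard.ZstdCompressor(level=19)
with open("spore_thumbs_part1.txt", "rb") as src, \
        open("spore_thumbs_part1.txt.zst", "wb") as dst:
    cctx.copy_stream(src, dst)
```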
-
Pedrosso
You'll have to forgive me as I've no real basis on what's reasonable
-
pokechu22
Yeah, I'm trying to dig up an example of when I last did this
-
Pedrosso
Thank you
-
pokechu22
(the info on the archivebot article on the wiki is fairly out of date - we can and regularly do jobs a lot larger than it recommends there)
-
JAA
A billion images? Oh dear...
-
JAA
Request rates of something like 25/s are possible in AB, but then we'd still be looking at something like 1.5 years...
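(Back-of-the-envelope check of that estimate; the URL count is the ~1.11 billion implied by the example thumbnail ID, not a measured number.)
```python
urls = 501_110_210_233 - 500_000_000_000   # ≈ 1.11 billion candidate IDs
seconds = urls / 25                        # at ~25 requests/second in AB
print(seconds / 86_400 / 365)              # ≈ 1.4 years
```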
-
Pedrosso
If I interpret the information correctly, many URLs in that pattern could be pointing to nothing
-
JAA
That is likely, given that the site itself says there are only 191 million creations.
-
JAA
So roughly every 6th URL will work.
-
JAA
But it doesn't matter for this purpose since we'd still have to try the full billion.
-
Pedrosso
That's true. Unless there's any way to check if it exists beforehand
-
Pedrosso
Well, any reasonable way
-
JAA
Yeah, maybe the API has some bulk lookup endpoint. Otherwise, probably not.
-
Pedrosso
which API? Sporepedia's?
-
JAA
Yeah
-
pokechu22
OK, right, the example I had was
wwii.germandocsinrussia.org of which there were 54533607 URLs related to map tiles (e.g.
wwii.germandocsinrussia.org/system/…36f78d93dbfe7fc063bf0d396/2/2_0.jpg - but at a bunch of zoom levels), which I generated by a script. I split the list into 5 lists of 11000000
-
pokechu22
URLs, which ended up being about 118.1 GiB of data per list. I ran those lists one at a time (starting the next one after the previous one finished); it took about 2 hours for archivebot to download each list of 11M URLs and queue it (as that process isn't very optimized), and it took about 5 days for it to actually download the URLs in that list (though I don't think that's
-
pokechu22
representative of actual speeds for downloading...)
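(A rough sketch of that splitting step; the chunk size and file names are just the ones from this example.)
```python
# Split one big URL list into files of 11 million lines each.
import itertools

CHUNK = 11_000_000

with open("all_urls.txt") as src:
    for part in itertools.count(1):
        lines = list(itertools.islice(src, CHUNK))
        if not lines:
            break
        with open(f"urls_part{part}.txt", "w") as dst:
            dst.writelines(lines)
```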
-
pokechu22
In other cases (which I can't find) I did parallelize the process between a few AB pipelines, and each pipeline downloads multiple files at once, but it's still not ideal
-
pokechu22
That job is still fairly comparable though because it's downloading a bunch of low-resolution images
-
Pedrosso
even the "large" images (same item, different image, same info for the char just higher res I belive) are approx 60 kB
-
pokechu22
The storage space for downloaded images probably isn't an issue overall (as that can be uploaded to web.archive.org in 5GB chunks), it's more the storage space used for the list of URLs and such
-
Pedrosso
(quite ironic)
-
pokechu22
Similarly, I'm not sure how useful it'd be to save the "large" images as it seems like they don't have the embedded data, unlike the "thumb" images, so presumably it'd be possible to regenerate the large images from the data in the thumb images ingame, which is the opposite of how thumbnails/high resolution images usually work
-
Pedrosso
That's fair
-
pokechu22
Assuming 20kB for thumbnails and the listed 191,397,848 creatures, that's about 4TB, which is a reasonable amount (on the large side, but still reasonable)
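(Checking that estimate:)
```python
print(20_000 * 191_397_848 / 1e12)   # 20 kB x 191,397,848 thumbnails ≈ 3.8 TB
```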
-
Pedrosso
Would it be relevant to save comments as well? I'd suggest users but that process is far less iterable
-
pokechu22
It looks like comments requires POST so archivebot can't do that, but those would be nice to save
-
Pedrosso
-
Pedrosso
5000 is just an arbitrarily big value I put there
-
pokechu22
for what it's worth
spore.com gives me an expired certificate error though
spore.com works - I'm guessing you dismissed that error beforehand?
-
Pedrosso
I don't recall, so assume that I have
-
pokechu22
-
Pedrosso
As long as it's the same information it's all good, right?
-
Pedrosso
As for users though, it does seem like there's a "userid"; however, I can't see anywhere you can put it to get the URL for the user page
-
pokechu22
Yeah, at least for having the information - it wouldn't make the first URL work on web.archive.org but that's not as important
-
JAA
Ah, nedbat wrote that thumbnail data article, nice.
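(That article describes the exact encoding; purely as an illustration, here's what dumping a simple least-significant-bit payload from a thumbnail could look like with Pillow. The bit layout below is an assumption, not Spore's documented scheme, and the file name is just the earlier example ID.)
```python
# Illustration only: the creature data is hidden steganographically in the
# thumbnail PNG. This sketch assumes a plain LSB scheme across all channels
# and dumps the first few recovered bytes; it is NOT guaranteed to match
# Spore's real encoding.
from PIL import Image

img = Image.open("501110210233.png").convert("RGBA")
bits = []
for pixel in img.getdata():
    for channel in pixel:
        bits.append(channel & 1)           # take the low bit of each channel

payload = bytearray()
for i in range(0, len(bits) - 7, 8):
    byte = 0
    for bit in bits[i:i + 8]:
        byte = (byte << 1) | bit
    payload.append(byte)

print(payload[:64])
```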
-
Pedrosso
So, what'd need to be done is to get that URL list and split it in reasonable chunks?
-
pokechu22
I should also note that archivebot isn't the only possible tool
-
Pedrosso
That is good to note, yes. Though I'm not really aware of many of the others
-
JAA
If there are really no rate limits, qwarc could get through this in no time.
-
JAA
I've done 2k requests per second before with qwarc.
-
JAA
That'd work out to a week for 1.1 billion.
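(Same back-of-the-envelope numbers at that rate:)
```python
print(1_110_210_233 / 2_000 / 86_400)   # ≈ 6.4 days at 2k requests/second
```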
-
Pedrosso
I wouldn't say there are none, but they may not be too limiting. I have nothing against running it on my own machine, but I'm not really aware of how to use it properly as of now
-
JAA
Well, 'not too limiting' and 'allowing 2k/s' are two very different things. :-)
-
pokechu22
archiveteam also has
wiki.archiveteam.org/index.php/DPoS where you have a bunch of tasks distributed to other users, and a lua script that handles it. So creature:500447019787 could be one task and that would fetch
spore.com/rest/comments/500226147573/0/5000 and
spore.com/rest/creature/500226147573 and
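(DPoS projects are actually driven by wget-at plus a Lua script; purely to illustrate how one work item could expand into the URLs it fetches, here's a Python sketch with a made-up item format and helper name.)
```python
# Sketch only: conceptually, each DPoS work item expands into a small set of URLs.

def expand_item(item: str) -> list[str]:
    kind, _, asset_id = item.partition(":")
    if kind == "creature":
        return [
            f"http://www.spore.com/rest/creature/{asset_id}",
            # 5000 is just an arbitrarily large page size, as discussed above
            f"http://www.spore.com/rest/comments/{asset_id}/0/5000",
        ]
    raise ValueError(f"unknown item type: {item}")

print(expand_item("creature:500447019787"))
```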
-
pokechu22
-
pokechu22
It's fully scriptable... but that means you need to write the full script :)
-
pokechu22
So it's a lot more difficult to actually do it
-
JAA
Yeah, same with qwarc.
-
pokechu22
(oh, and DPoS projects can also record POST requests, though they still won't play back properly)
-
JAA
Frameworks for archiving things at scale.
-
pokechu22
Yeah, qwarc being a single-machine system instead of distributed
-
Pedrosso
I wouldn't mind trying to run qwarc. Anything I should be aware of?
-
JAA
Beware of the dragons.
-
JAA
:-)
-
fireonlive
π³
-
JAA
There's no documentation, and there are some quirks to running it, especially memory-related. There's a memory 'leak' somewhere that I haven't been able to locate. With a large crawl like this, you're going to run into that.
-
Pedrosso57
What a convenient time for my internet to drop, hah
-
pokechu22
-
Pedrosso57
?
-
pokechu22
(message history, in case you missed anything)
-
Pedrosso57
Thank you
-
jwn
This is where I can notify people of a website shutting down, right? Just making sure I have the right channel.
-
pabs
yes
-
pabs
which website, when is the shutdown?
-
jwn
The website is brick-hill.com. I'm unsure of any exact shutdown date, but I do know of plans to re-launch the site due to ownership issues, but without accounts and, I assume, forum posts by extension.
-
JAA
Huh. Brickset shut down their forums the other day. Is that just a coincidence?
-
jwn
Never heard of it, so probably. I also assume it wasn't as messy.
-
JAA
That was
forum.brickset.com (a few pages are still in their server-side cache).
-
JAA
And yeah, not very messy.
-
JAA
Just funny that we go years without any LEGO-related shutdowns (that I remember), and then there's two in quick succession.
-
JAA
brick-hill.com does seem to work fairly well without JS, so that's nice.
-
jwn
Technically Brick-Hill is a Roblox clone but resemblances to Lego Island weren't accidental.
-
pabs
looks like the www/blog/merch subdomains have been captured previously
-
pabs
-
pabs
oh, some of them relatively recently
-
pabs
20230904
-
JAA
-
DogsRNice
could be one of those forums that won't let you view some boards if you aren't signed in
-
jwn
I'm pretty sure you don't need to be logged in to see the forums.
-
JAA
80% or so are in a single forum that is publicly viewable.
-
JAA
-
JAA
Ryz: ^ You started that job.
-
JAA
Ok yeah, it got 200s from 2352033 unique thread IDs. I guess that should be reasonably close to the total.
-
vokunal|m
I'm definitely completely wrong about this, but if we had #Y working, would we need dedicated projects for sites anymore or could they be run through that with modification? Would we need AB?
-
thuban
vokunal|m: we would still need dedicated projects for sites that couldn't be crawled by generic spidering logic (because eg they depend on javascript api interactions).
-
thuban
theoretically it could do anything that archivebot could, but between overhead and the increased complexity of 'live' configuration in a distributed environment, we'd want to keep ab around anyway
-
pabs
from #archivebot <mannie> the energy company ENSTROGA has been declared bankrupt. Here is the court announcement:
insolventies.rechtspraak.nl/#!/details/03.lim.23.189.F.1300.1.23 and here the official website:
enstroga.nl
-
pabs
-
masterX244
Jwn: Brickset is a LEGO fansite; the forum got too expensive and activity declined. Luckily we caught it just before the shredders were starting
-
Adrmcr
On the note for spore, I found that
staging.spore.com has its "static" and "www_static" subdirectories open; as far as I can tell, everything on there is also on the regular non-staging website, so it may be safe to extract everything, strip "staging.", and archive the main file links
-
Edel69
Hello. I have a quick question. Is it possible to find a deleted private Imgur album among the huge archive dump with an album URL link or a link to a single image that was within the album?
-
JAA
Edel69: If it was archived, it's in the Wayback Machine. Album or image page or direct image link should all work.
-
Edel69
Thanks for the response. I was under the impression that the team behind the archive job has actual access to files that were downloaded and backed up before the May 2023 TOS change went into effect.
-
JAA
Well, the raw data is all publicly accessible, but trust me, you don't want to work with that. :-)
-
Edel69
I wouldn't even know what to do with all of that. lol
-
JAA
The WBM index should contain all of them, and that's the far more convenient way of accessing it.
-
Ryz
Regarding
brick-hill.com/forum - JAA, hmm, I'm a bit iffy on how much of it is covered, because I recall the last couple of times it got errored out from overloading or something...?
-
Edel69
So I tried multiple album and separate image URLs in the Wayback Machine and I get no hits at all. I don't think any of my deleted account's uploads have been archived on there. None of my albums were public, so it wouldn't have been possible for there to be Web archives maybe? My decade old account was abruptly deleted with no warnings just a few
-
Edel69
days ago, so if there's nothing at all I guess that means my data was somehow not archived.
-
JAA
I think we should've grabbed virtually all 5-char image IDs. But beyond that, it would've been mostly things that were publicly shared in one of the sources we scraped.
-
Edel69
I finally got a hit from one of the limited URLs I have.
i.imgur.com/eClDaR3.jpg - An image from a Resident Evil album. I guess this wouldn't help in finding anything else that was in the same album though.
-
Edel69
Isn't the image ID in the URL link? If so, they're all 7 characters.
-
vokunal|m
Yeah
-
vokunal|m
really old urls can be 5 characters though
-
vokunal|m
they went through all the 5 character ids before upping their ids to 7 characters
-
Edel69
Ah, so the 7 character IDs were also backed up. I was thinking he was saying that they only grabbed the 5 character IDs.
-
imer
We didn't get all 5-char albums unfortunately; virtually all 5-char images should be saved, and then most 7-char ones we found
-
vokunal|m
We grabbed basically all 900M 5-char images, and around 1 billion 7-character images, I think
-
vokunal|m
we brute forced a lot, but there are 3.5 trillion possible IDs in the 7-character space
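(Quick check of those numbers, assuming the usual [A-Za-z0-9] alphabet for Imgur IDs:)
```python
print(62 ** 5)   # 916,132,832 -> "basically all 900M" 5-char images
print(62 ** 7)   # 3,521,614,606,208 -> ~3.5 trillion possible 7-char IDs
```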
-
vokunal|m
just guessing on the 5-char, because I think that's what that would pan out to with our total done
-
Edel69
That's a lot of downloading you all did. With that massive amount of data it would be looking for the needle in the haystack to find anything specific I guess, let alone a specific album collection. I'm just going to cut my losses and forget about it lol. Thanks for the help and information though.
-
imer
Edel69: for 5char albums there is metadata which might be easier to search through
archive.org/details/imgur_album5_api_dump
-
imer
still a lot of data though
-
imer
I don't have these local anymore unfortunately, could've done a quick search otherwise :(
-
JAA
→ #imgone for further discussion please
-
h2ibot
Manu edited Political parties/Germany (+63, /* CDU: more */):
wiki.archiveteam.org/?diff=51107&oldid=48436
-
Pedrosso
It appears that, according to others, ArchiveBot is putting pressure on spore.com hence I'm not planning to do that archive instantly using qwarc. I am going to keep looking into it though.
-
Pedrosso
On another note, what kind of motivations (if any) are needed for using the ArchiveBot? I've got a few small sites in mind, but I've mostly no good reason other than "I want em archived lol"
-
pokechu22
For small sites that's probably good enough of a motivation right now as there's nothing urgent that needs to be run
-
pokechu22
I'm not entirely sure about the amount of pressure - archivebot was slowed to one request per second and it seems like the site is giving a response basically instantly (and it didn't look like things were bad when it was running at con=3, d=250-375)
-
pokechu22
but I'm also not monitoring the site directly and am not an expert
-
Pedrosso
Alright, that's good. As for offsite links, if I'm understanding this correctly, it goes recursively within a website, but doesn't do so with outlinks?
-
JAA
The intent is to provide ~~players~~ archivists with a sense of pride and accomplishment for ~~unlocking different heroes~~ slowing down the servers.
-
pokechu22
It does do outlinks by default; the outlink and any of its resources (e.g. embedded images, scripts, audio, or video (if it's done in a way that can be parsed automatically)) will be saved
-
pokechu22
There is a --no-offsite option to disable that but it's generally fine to include them
-
JAA
I also wouldn't expect AB to make a difference for a website by a major game publisher, but you never know.
-
JAA
It was already slow last night before we started the job.
-
Pedrosso
Haha
-
Pedrosso
also, I was not informed of any restrictions of commands lol. Makes sense to not let people just randomly do it but I didn't find anything on the wiki like that
-
JAA
> Note that you will need channel operator (@) or voice (+) permissions in order to issue archiving jobs
-
JAA
It's not mentioned in the command docs though, only on the wiki page.
-
Pedrosso
Thank you
-
JAA
I've been trying to separate the docs of 'ArchiveBot the software' from 'ArchiveBot the AT instance'. But the permissions part should be in the former, too.
-
h2ibot
Manu created Political parties/Germany/Hamburg (+7072, Beginn collection political parties for…):
wiki.archiveteam.org/?title=Political%20parties/Germany/Hamburg
-
h2ibot