-
thuban
thank you!
-
JAA
1.26M dirs done, 2.83M remaining. It exploded a bit again ~15 minutes ago.
-
rafaeletc
hi folks, I need some help understanding the archives of the "fotolog" website, I am doing research on a young Brazilian movement that used the site a lot, is here the right place for me to ask?
-
JAA
Over 3M todo now, so I'll stop updating here until it's closer to completion.
-
JAA
Hi rafaeletc, yeah, here is fine to ask.
-
rafaeletc
Oh, hello JAA, nice to meet u
-
nicolas17
JAA: any idea how much data is there?
-
nicolas17
I guess you're not extracting file sizes yet
-
rafaeletc
oh, a lot of gigabytes
-
nicolas17
oh I meant the mozilla archive he's currently working on :P
-
rafaeletc
nicolas17: sorry, thought the question was for me
-
JAA
nicolas17: No idea yet, no, just fetching the dir listings so far.
-
arkiver
rafaeletc: it would be good to know the question :P
-
arkiver
but the archives are browsable in the Wayback Machine
-
rafaeletc
I'm trying to recover some content from "fotolog.com" via the Wayback Machine but I'm finding a lot of broken content, then I found this:
archive.org/details/archiveteam_fotolog - my question is how to extract them in a way that lets me identify where the files of the profiles I want to analyze are, if you search in this link you will see
-
rafaeletc
that they are very large files to download fully and search for a needle in a haystack.
-
arkiver
i do see the site is marked "partially saved"
-
arkiver
rafaeletc: do you have any URLs?
-
arkiver
you could look those up in the Wayback Machine
-
rafaeletc
I managed to understand the metadata you have in the JSON of each upload, is there a way to automate downloading only these JSON files?
-
arkiver
rafaeletc: the CDX files may be useful for you, they contain lists of URLs saved in the archives - though i don't know what you are looking for
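The CDX lookup arkiver describes can be sketched roughly like this. The field layout below is an assumption (the common 11-field CDX order; real CDX files vary, so check the header line of the actual file), and the sample line and `matching_records` helper are hypothetical:

```python
# Rough sketch of filtering a CDX index for one profile's URLs.
# Assumed layout (11 fields): urlkey, timestamp, original URL, MIME type,
# status, digest, redirect, meta, compressed size, offset, WARC filename.
# Check the real file's header line -- CDX layouts differ.

def matching_records(cdx_lines, profile):
    """Yield (url, timestamp, offset, size, warc_file) for lines mentioning `profile`."""
    for line in cdx_lines:
        fields = line.split()
        if len(fields) < 11:
            continue  # skip header / malformed lines
        url = fields[2]
        if profile in url:
            yield url, fields[1], int(fields[9]), int(fields[8]), fields[10]

# Hypothetical sample line, just to show the shape:
sample = ("com,fotolog)/someuser 20160304 http://fotolog.com/someuser "
          "text/html 200 ABC - - 1234 5678 fotolog_x.warc.gz")
print(list(matching_records([sample], "someuser")))
```

The offset/size/filename triple is what makes the partial downloads discussed later in this conversation possible.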
-
arkiver
as for downloading files, you can for example use the `internetarchive` (`ia`) library for this
-
arkiver
-
arkiver
i'll be off to bed now though, but you can leave a message and me or JAA or someone else might know what to do
-
rafaeletc
arkiver: I would like to search the comments and photos of what was archived, but not everything, only of some profiles
-
arkiver
rafaeletc: i'm afraid 'searching' is not exactly possible, as the content of the records is not really indexed, only the URLs and some other metadata
-
arkiver
so if you know some URLs for what you are looking for, you could look those up in the Wayback Machine
-
arkiver
but searching all comments/pages for some terms is not possible at the moment, unless someone makes this searchable (which could be expensive)
-
rafaeletc
arkiver: so it's mostly static content
-
rafaeletc
but this application that you linked is well documented, I believe it will help me a lot, via browser it is very difficult
-
rafaeletc
I am researching a cultural movement that between 2003 and 2006 used the website for dissemination and communication
-
rafaeletc
hence my interest in the comments, but as they were dynamic content, I believe I will not be able to recover them, but getting the publications should still help
-
rafaeletc
pardon my bad English, I am Brazilian and natively speak Portuguese. but thank you very much, the application you pointed to illuminated the path I need to follow to research more
-
JAA
rafaeletc: `ia download --search 'collection:archiveteam_fotolog' --glob '*.json.gz'` should work for downloading all of the megawarc JSON files.
-
rafaeletc
-
JAA
rafaeletc: Yes, that should work.
-
rafaeletc
JAA: I mean, the WARCs of profiles are compressed, right? is each page archived a WARC inside the compressed file?
-
nicolas17
each page archived is a compressed record inside the warc file
-
JAA
rafaeletc: It's a compressed file, and each record is compressed individually. replayweb.page should support reading the compressed .warc.gz file directly.
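A tiny sketch of why no full decompression is needed: a .warc.gz is a concatenation of independent gzip members, one per record, so each record can be decompressed on its own. The two dummy records here are made up just to demonstrate the mechanism:

```python
# A .warc.gz is a series of independent gzip members. This splits a blob of
# concatenated members and decompresses each one separately, which is why a
# single record can be read without touching the rest of a 36 GB file.
import gzip
import zlib

def gzip_members(data):
    """Yield each concatenated gzip member of `data`, decompressed."""
    while data:
        d = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)  # gzip format
        yield d.decompress(data)
        data = d.unused_data  # bytes after this member's trailer

# Two fake "records", each compressed as its own gzip member:
records = [b"WARC/1.0 record one", b"WARC/1.0 record two"]
blob = b"".join(gzip.compress(r) for r in records)
print(list(gzip_members(blob)))
```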
-
rafaeletc
JAA: but a warc.gz of 36 GB?
-
JAA
In other words, you do not need to decompress it.
-
JAA
Yeah, they are large and not exactly nice to work with.
-
JAA
Using the metadata in the JSON files, you can download just the part of the file you're interested in.
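A partial download like JAA describes boils down to an HTTP Range request using the offset and compressed size from the metadata. This is only a sketch: the JSON field names are not shown here and differ per collection, so check the actual files before wiring anything up.

```python
# Sketch: fetch one individually-gzipped WARC record via an HTTP Range
# request, given its byte offset and compressed size from the metadata.
# (Where exactly the offset/size live in the JSON is collection-specific.)
import gzip
import urllib.request

def byte_range_header(offset, size):
    """Range header value for `size` bytes starting at `offset` (end inclusive)."""
    return f"bytes={offset}-{offset + size - 1}"

def fetch_record(url, offset, size):
    """Download and decompress a single record from a remote .warc.gz."""
    req = urllib.request.Request(
        url, headers={"Range": byte_range_header(offset, size)}
    )
    with urllib.request.urlopen(req) as resp:
        return gzip.decompress(resp.read())

print(byte_range_header(5678, 1234))  # bytes=5678-6911
```

This works because each record is its own gzip member, so the fetched slice is a valid standalone stream.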
-
rafaeletc
but do I have to download it, or just paste the link at replayweb.page?
-
JAA
I'm not sure. I haven't used ReplayWeb.Page before. I just know that it exists.
-
rafaeletc
oh, I figured it out, downloaded a small file and opened it in replayweb and saw the content that was archived, luckily the comments are saved, now it's just finding where the WARCs of the profiles I want to search are
-
rafaeletc
thank you all, very very much
-
JAA
:-)
-
JAA
I'm trying to get a partial size estimate for what I've crawled so far of archive.mozilla.org, but it makes my CPU sad.
-
JAA
I'm already at 5 GiB of compressed WARCs for the listings so far...
-
nicolas17
x_x
-
nicolas17
I wonder where it's hosted
-
JAA
Somewhere that doesn't care a bit about me hammering it for hours at least, so that's good.
-
nicolas17
some massive storage server exposing all this data *as a single filesystem*?
-
JAA
28.35.117.34.bc.googleusercontent.com
-
nulldata
Why in the *cloud*, of course!
-
nicolas17
x-goog-storage-class: NEARLINE
-
nicolas17
they must have their own thing for file listings then huh
-
rafaeletc
JAA: I accidentally closed the browser tab and did not copy the ia parameters you suggested, can you paste them to me again?
-
JAA
rafaeletc: `ia download --search 'collection:archiveteam_fotolog' --glob '*.json.gz'`
-
rafaeletc
JAA: thank you, again
-
JAA
Happy to help. :-)
-
nicolas17
yep, files have the x-goog headers suggesting they're on Google Cloud Storage, but file listings don't, so they probably have a reverse proxy / load balancer / thing redirecting file listing requests to somewhere else
-
JAA
I just crossed 3M dirs fetched. There's another 3.15M in the queue.
-
JAA
And I thought my 4.8M symlinks were bad...
-
JAA
Always fun to optimise grep/awk/sed/... pipelines to get the best throughput.
-
JAA
In the first 3M-ish dirs listed (breadth-first recursion), I got 71 million files.
-
JAA
20.7M of those are over 1 MB.
-
JAA
19M are over 10 MB.
-
JAA
2.2M are over 100 MB.
-
JAA
Those 2.2M alone add up to 391.29 TiB.
-
JAA
arkiver: ^ First numbers, and the listing isn't even halfway through the discovered dirs yet.
-
JAA
Minor correction, those are over 1/10/100 MiB, not MB.
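Counts like the 1/10/100 MiB figures above come from a single thresholding pass over the listed sizes; a minimal sketch (the sample sizes are invented):

```python
# Minimal sketch of the 1/10/100 MiB bucketing: for each threshold, count
# files whose size in bytes exceeds it.
MiB = 1024 * 1024

def bucket_counts(sizes, thresholds=(1, 10, 100)):
    """Map each threshold (in MiB) to the number of sizes exceeding it."""
    counts = {t: 0 for t in thresholds}
    for s in sizes:
        for t in thresholds:
            if s > t * MiB:
                counts[t] += 1
    return counts

print(bucket_counts([512 * 1024, 5 * MiB, 50 * MiB, 200 * MiB]))
# {1: 3, 10: 2, 100: 1}
```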
-
JAA
Todo is finally less than done. 3.37M remaining though.
-
nicolas17
is todo going down? :D
-
JAA
No, up, but slower than done at least. :-P
-
nicolas17
great
-
nicolas17
similar to telegrab right now
-
nicolas17
completing 12400 items/min, todo going down at 900 items/min *but at least it's going down*
-
JAA
Yeah, here I'm grabbing 10k/min but todo grows by 8k...
-
nicolas17
ow
-
fireonlive
-
eggdrop
-
fireonlive
waste of money is being shut down
-
fireonlive
no action required other than a “lulz”
-
JAA
In other news, the Canucks forums are still online and active. It was supposed to shut down at the end of September. The announcement has since been edited to read:
-
JAA
> It will be closed on __________.
-
JAA
-
nicolas17
literally underscores?
-
JAA
Yep
-
nicolas17
they can't find the power button
-
JAA
The only forum admin has no idea what's going on either. :-)
-
h2ibot
Megame edited Deathwatch (+265, /* 2023 */ GCN+ - Dec 19):
wiki.archiveteam.org/?diff=51160&oldid=51153
-
vokunal|m
There's a post on reddit saying pannchoa.com is likely going to be taken down in 10 hours. I can't find any evidence for an actual shutdown online. There's lots of people saying it should though
-
vokunal|m
-
vokunal|m
Just based on the number of pages they have, it looks like they probably have ~10779 posts
-
vokunal|m
Not sure whether this is a Deathwatch or more of a Firedrill
-
AK
Already in AB by the looks of it, "!status wjfqc3qnj93820c7o97ea1vw"
-
JAA
My archive.mozilla.org listing finished after 9494060 dirs. A handful of errors I need to look at. At least one of those dirs probably just can't be listed.
-
JAA
14.3 GiB of listings in compressed WARC...
-
fireonlive
holy crap
-
h2ibot
Manu edited Political parties/Germany/Hamburg (+12464, SPD (not even finished yet)):
wiki.archiveteam.org/?diff=51161&oldid=51116
-
h2ibot
Manu edited Political parties/Germany/Hamburg (+4072, /* Sozialdemokratische Partei Deutschlands…):
wiki.archiveteam.org/?diff=51162&oldid=51161
-
h2ibot
Manu edited Political parties/Germany/Hamburg (+50, /* SPD Hamburg-Nord */):
wiki.archiveteam.org/?diff=51163&oldid=51162