00:00:59 <thuban> thank you!
00:20:50 <JAA> 1.26M dirs done, 2.83M remaining. It exploded a bit again ~15 minutes ago.
02:13:18 <rafaeletc> hi folks, I need a help to understand the archives of the "fotolog" wevsite, I am doing a research on a young brazilian movement that used the site a lot, here is the right place for me to ask?
02:15:43 <JAA> Over 3M todo now, so I'll stop updating here until it's closer to completion.
02:15:51 <JAA> Hi rafaeletc, yeah, here is fine to ask.
02:17:49 <rafaeletc> Oh, hello JAA, nice to meet u
02:18:52 <nicolas17> JAA: any idea how much data is there?
02:19:15 <nicolas17> I guess you're not extracting file sizes yet
02:20:21 <rafaeletc> oh, a lot of gigabytes
02:20:37 <nicolas17> oh I meant the mozilla archive he's currently working on :P
02:21:41 <rafaeletc> nicolas17: sorry, thought it was for me the question
02:23:58 <JAA> nicolas17: No idea yet, no, just fetching the dir listings so far.
02:24:19 <arkiver> rafaeletc: it would be good to know the question :P
02:24:30 <arkiver> but the archived are browsable in the Wayback Machine
02:26:17 <rafaeletc> I'm trying to recover some contents from "fotolog.com" via Wayback Machine but I'm finding a lot of broken content, than I found this: https://archive.org/details/archiveteam_fotolog - my question is how to extract them in a way that I can identify where are the files of the profiles I want to analyze, if you give a search in this link you will see
02:26:17 <rafaeletc> that they are very large files to download fully and look for needle in a hay tree.
02:27:18 <arkiver> i do see the site is marked "partially saved"
02:27:25 <arkiver> rafaeletc: do you have any URLs?
02:27:33 <arkiver> you could look those up in the Wayback Machine
02:27:39 <rafaeletc> I managed to understand the metadata you have in the json of each upload, do you have a way to automate the download only of these json?
02:28:13 <arkiver> rafaeletc: the CDX files may be useful for you, they contain lists of URLs saved in the archives - though i don't know what you are looking for
02:28:41 <arkiver> as for downloading files, you can for example use the `internetarchive` (`ia`) library for this
02:29:06 <arkiver> https://archive.org/developers/internetarchive/
02:29:26 <arkiver> i'll be off to bed now though, but you can leave message and me or JAA or someone else might know what to do
02:30:22 <rafaeletc> arkiver: I would like to search the comments and photos of what was archived, but not everything, only of some profiles
02:31:17 <arkiver> rafaeletc: i'm afraid 'searching' is not exactly possible, as the content of the records is not really indexed, only the URLs and some other metadata
02:31:32 <arkiver> so if you know some URLs for what you are looking for, you could look those up in the Wayback Machine
02:31:57 <arkiver> but searching all comments/pages for some terms is not possible at the moment, unless someone makes this searchable (which could be expensive)
02:33:08 <rafaeletc> arkiver: so major it's static content
02:33:17 <rafaeletc> but this application that you linked is well documented, I believe will help me a lot, via browser is very difficult
02:35:26 <rafaeletc> I am researching a cultural movement that between 2003 and 2006 used the website for dissemination and communication
02:37:28 <rafaeletc> therefore my interest in the comments, but as they were dynamic contents, I believe I will not be able to recover them, but getting the publications I believe that can help
02:41:00 <rafaeletc> pardon my bad english, I am Brazilian and natively speak portuguese. but thank you very much, the indicated application illuminated the path I need to follow to research more
02:52:31 <JAA> rafaeletc: `ia download --search 'collection:archiveteam_fotolog' --glob '*.json.gz'` should work for downloading all of the megawarc JSON files.
02:52:56 <rafaeletc> JAA: it's possible to use this files: https://archive.org/download/archiveteam_fotolog_20160211111954 with https://replayweb.page/
02:53:56 <JAA> rafaeletc: Yes, that should work.
02:55:23 <rafaeletc> JAA: I mean, the warc of profiles are compressed right? each page archived is a warc inside the compressed file?
02:57:02 <nicolas17> each page archived is a compressed record inside the warc file
02:57:22 <JAA> rafaeletc: It's a compressed file, and each record is compressed individually. replayweb.page should support reading the compressed .warc.gz file directly.
02:57:51 <rafaeletc> JAA: but a warc.gz with 36gb?
02:57:52 <JAA> In other words, you do not need to decompress it.
02:58:07 <JAA> Yeah, they are large and not exactly nice to work with.
02:59:00 <JAA> Using the metadata in the JSON files, you can download just the part of the file you're interested in.
02:59:15 <rafaeletc> but I have to download it or just paste the link at replayweb.page?
02:59:56 <JAA> I'm not sure. I haven't used ReplayWeb.Page before. I just know that it exists.
03:09:28 <rafaeletc> oh, i figured out, downloaded a small file and opened it on replayweb and saw the content that was archived, luckily the comments are saved, now is to find where are the warcs of the profiles I want to search
03:09:28 <rafaeletc> thank you all, very very much
03:10:32 <JAA> :-)
03:12:02 <JAA> I'm trying to get a partial size estimate for what I've crawld so far of archive.mozilla.org, but it makes my CPU sad.
03:12:21 <JAA> I'm already at 5 GiB of compressed WARCs for the listings so far...
03:12:26 <nicolas17> x_x
03:12:32 <nicolas17> }
03:12:36 <nicolas17> I wonder where it's hosted
03:13:10 <JAA> Somewhere that doesn't care a bit about me hammering it for hours at least, so that's good.
03:13:27 <nicolas17> some massive storage server exposing all this data *as a single filesystem*?
03:13:29 <JAA> 28.35.117.34.bc.googleusercontent.com
03:13:35 <nulldata> Why in the <i>cloud</i>, of course!
03:15:20 <nicolas17> x-goog-storage-class: NEARLINE
03:15:29 <nicolas17> they must have their own thing for file listings then huh
03:15:34 <rafaeletc> JAA: I accidentally closed the browser tab and did not copy the parameters of ia that you had suggested, can you paste to me again?
03:15:54 <JAA> rafaeletc: `ia download --search 'collection:archiveteam_fotolog' --glob '*.json.gz'`
03:16:35 <rafaeletc> JAA: thank you, again
03:16:58 <JAA> Happy to help. :-)
03:17:08 <nicolas17> yep, files have the x-goog headers suggesting they're on Google Cloud Storage, but file listings don't, so they probably have a reverse proxy / load balancer / thing redirecting file listing requests to somewhere else
03:17:12 <JAA> I just crossed 3M dirs fetched. There's another 3.15M in the queue.
03:17:34 <JAA> And I thought my 4.8M symlinks were bad...
03:22:20 <JAA> Always fun to optimise grep/awk/sed/... pipelines to get best throughput.
03:24:42 <JAA> In the first 3M-ish dirs listed (breadth-first recursion), I got 71 million files.
03:25:15 <JAA> 20.7M of those are over 1 MB.
03:26:29 <JAA> 19M are over 10 MB.
03:26:38 <JAA> 2.2M are over 100 MB.
03:27:26 <JAA> Those 2.2M alone add up to 391.29 TiB.
03:28:10 <JAA> arkiver: ^ First numbers, and the listing isn't even halfway through the discovered dirs yet.
03:29:46 <JAA> Minor correction, those are over 1/10/100 MiB, not MB.
03:50:32 <JAA> Todo is finally less than done. 3.37M remaining though.
03:55:12 <nicolas17> is todo going down? :D
04:01:46 <JAA> No, up, but slower than done at least. :-P
04:08:16 <nicolas17> great
04:08:20 <nicolas17> similar to telegrab right now
04:08:49 <nicolas17> completing 12400 items/min, todo going down at 900 items/min *but at least it's going down*
04:11:18 <JAA> Yeah, here I'm grabbing 10k/min but todo grows by 8k...
04:13:13 <nicolas17> ow
04:24:40 <fireonlive> https://x.com/discordpreviews/status/1725240412023959844?s=12
04:24:40 <eggdrop> nitter: https://nitter.net/discordpreviews/status/1725240412023959844
04:24:46 <fireonlive> waste of money is being shut down
04:25:01 <fireonlive> no action required other than a “lulz”
04:32:53 <JAA> In other news, the Canucks forums are still online and active. It was supposed to shut down at the end of September. The announcement has since been edited to read:
04:32:56 <JAA> > It will be closed on __________.
04:33:08 <JAA> https://forum.canucks.com/announcement/25-forum-closure/
04:34:44 <nicolas17> literally underscores?
04:34:47 <JAA> Yep
04:35:06 <nicolas17> they can't find the power button
04:35:40 <JAA> The only forum admin has no idea what's going on either. :-)
17:14:55 <h2ibot> Megame edited Deathwatch (+265, /* 2023 */ GCN+ - Dec 19): https://wiki.archiveteam.org/?diff=51160&oldid=51153
17:27:46 <vokunal|m> There's a post on reddit saying pannchoa.com is likely going to be taken down in 10 hours. I can't find any evidence for an actual shutdown online. There's lots of people saying it should though
17:27:47 <vokunal|m> https://www.reddit.com/r/Archiveteam/comments/17x5qdr/pannchoacom_website_likely_being_taken_down_in_24/
17:28:40 <vokunal|m> Just based on the number of pages they have, it looks like they probably have ~10779 posts
17:30:56 <vokunal|m> Not sure whether this is a Deathwatch or more of a Firedrill
17:37:02 <AK> Already in AB by the looks of it, "!status wjfqc3qnj93820c7o97ea1vw"
17:39:22 <JAA> My archive.mozilla.org listing finished after 9494060 dirs. A handful of errors I need to look at. At least one of those dirs probably just can't be listed.
17:39:50 <JAA> 14.3 GiB of listings in compressed WARC...
17:55:56 <fireonlive> holy crap
20:55:37 <h2ibot> Manu edited Political parties/Germany/Hamburg (+12464, SPD (not even finished yet)): https://wiki.archiveteam.org/?diff=51161&oldid=51116
22:05:52 <h2ibot> Manu edited Political parties/Germany/Hamburg (+4072, /* Sozialdemokratische Partei Deutschlands…): https://wiki.archiveteam.org/?diff=51162&oldid=51161
22:36:57 <h2ibot> Manu edited Political parties/Germany/Hamburg (+50, /* SPD Hamburg-Nord */): https://wiki.archiveteam.org/?diff=51163&oldid=51162