00:00:59 thank you!
00:20:50 1.26M dirs done, 2.83M remaining. It exploded a bit again ~15 minutes ago.
02:13:18 hi folks, I need some help understanding the archives of the "fotolog" website. I am doing research on a young Brazilian movement that used the site a lot, is here the right place for me to ask?
02:15:43 Over 3M todo now, so I'll stop updating here until it's closer to completion.
02:15:51 Hi rafaeletc, yeah, here is fine to ask.
02:17:49 Oh, hello JAA, nice to meet u
02:18:52 JAA: any idea how much data is there?
02:19:15 I guess you're not extracting file sizes yet
02:20:21 oh, a lot of gigabytes
02:20:37 oh I meant the mozilla archive he's currently working on :P
02:21:41 nicolas17: sorry, I thought the question was for me
02:23:58 nicolas17: No idea yet, no, just fetching the dir listings so far.
02:24:19 rafaeletc: it would be good to know the question :P
02:24:30 but the archives are browsable in the Wayback Machine
02:26:17 I'm trying to recover some content from "fotolog.com" via the Wayback Machine but I'm finding a lot of broken content, then I found this: https://archive.org/details/archiveteam_fotolog - my question is how to extract these in a way that lets me identify where the files of the profiles I want to analyze are. If you search that link you will see that they are very large files to download in full just to look for a needle in a haystack.
02:27:18 i do see the site is marked "partially saved"
02:27:25 rafaeletc: do you have any URLs?
02:27:33 you could look those up in the Wayback Machine
02:27:39 I managed to understand the metadata you have in the JSON of each upload, do you have a way to automate downloading only those JSON files?
02:28:13 rafaeletc: the CDX files may be useful for you, they contain lists of URLs saved in the archives - though i don't know what you are looking for
02:28:41 as for downloading files, you can for example use the `internetarchive` (`ia`) library for this
02:29:06 https://archive.org/developers/internetarchive/
02:29:26 i'll be off to bed now though, but you can leave a message and me or JAA or someone else might know what to do
02:30:22 arkiver: I would like to search the comments and photos of what was archived, but not everything, only some profiles
02:31:17 rafaeletc: i'm afraid 'searching' is not exactly possible, as the content of the records is not really indexed, only the URLs and some other metadata
02:31:32 so if you know some URLs for what you are looking for, you could look those up in the Wayback Machine
02:31:57 but searching all comments/pages for some terms is not possible at the moment, unless someone makes this searchable (which could be expensive)
02:33:08 arkiver: so it's mostly static content
02:33:17 but this tool that you linked is well documented, I believe it will help me a lot, doing this via the browser is very difficult
02:35:26 I am researching a cultural movement that used the website for dissemination and communication between 2003 and 2006
02:37:28 hence my interest in the comments, but as they were dynamic content, I believe I will not be able to recover them; still, getting the posts should help
02:41:00 pardon my bad English, I am Brazilian and natively speak Portuguese. but thank you very much, the tool you pointed to showed me the path I need to follow to research further
02:52:31 rafaeletc: `ia download --search 'collection:archiveteam_fotolog' --glob '*.json.gz'` should work for downloading all of the megawarc JSON files.
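For readers who prefer the `internetarchive` Python library arkiver linked above over the `ia` CLI, a minimal sketch of the same download (search the collection, fetch only the .json.gz indexes) might look like the following; it is an untested illustration, not a command anyone ran in the channel:

# Rough equivalent of:
#   ia download --search 'collection:archiveteam_fotolog' --glob '*.json.gz'
# using the `internetarchive` library (https://archive.org/developers/internetarchive/).
from internetarchive import search_items, download

for result in search_items('collection:archiveteam_fotolog'):
    identifier = result['identifier']
    # glob_pattern restricts the download to the megawarc JSON indexes,
    # skipping the multi-gigabyte .warc.gz payloads in each item.
    download(identifier, glob_pattern='*.json.gz', verbose=True)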
02:52:56 JAA: is it possible to use these files: https://archive.org/download/archiveteam_fotolog_20160211111954 with https://replayweb.page/ ?
02:53:56 rafaeletc: Yes, that should work.
02:55:23 JAA: I mean, the WARCs of the profiles are compressed, right? is each archived page a WARC inside the compressed file?
02:57:02 each archived page is a compressed record inside the WARC file
02:57:22 rafaeletc: It's a compressed file, and each record is compressed individually. replayweb.page should support reading the compressed .warc.gz file directly.
02:57:51 JAA: but a warc.gz of 36 GB?
02:57:52 In other words, you do not need to decompress it.
02:58:07 Yeah, they are large and not exactly nice to work with.
02:59:00 Using the metadata in the JSON files, you can download just the part of the file you're interested in.
02:59:15 but do I have to download it, or can I just paste the link into replayweb.page?
02:59:56 I'm not sure. I haven't used ReplayWeb.Page before. I just know that it exists.
03:09:28 oh, i figured it out, downloaded a small file and opened it in replayweb and saw the content that was archived. luckily the comments are saved, now I just need to find where the WARCs of the profiles I want to search are
03:09:28 thank you all, very very much
03:10:32 :-)
03:12:02 I'm trying to get a partial size estimate for what I've crawled so far of archive.mozilla.org, but it makes my CPU sad.
03:12:21 I'm already at 5 GiB of compressed WARCs for the listings so far...
03:12:26 x_x
03:12:32 }
03:12:36 I wonder where it's hosted
03:13:10 Somewhere that doesn't care a bit about me hammering it for hours at least, so that's good.
03:13:27 some massive storage server exposing all this data *as a single filesystem*?
03:13:29 28.35.117.34.bc.googleusercontent.com
03:13:35 Why in the cloud, of course!
03:15:20 x-goog-storage-class: NEARLINE
03:15:29 they must have their own thing for file listings then huh
03:15:34 JAA: I accidentally closed the browser tab and did not copy the ia parameters you had suggested, can you paste them again?
03:15:54 rafaeletc: `ia download --search 'collection:archiveteam_fotolog' --glob '*.json.gz'`
03:16:35 JAA: thank you, again
03:16:58 Happy to help. :-)
03:17:08 yep, files have the x-goog headers suggesting they're on Google Cloud Storage, but file listings don't, so they probably have a reverse proxy / load balancer / thing redirecting file listing requests to somewhere else
03:17:12 I just crossed 3M dirs fetched. There's another 3.15M in the queue.
03:17:34 And I thought my 4.8M symlinks were bad...
03:22:20 Always fun to optimise grep/awk/sed/... pipelines to get the best throughput.
03:24:42 In the first 3M-ish dirs listed (breadth-first recursion), I got 71 million files.
03:25:15 20.7M of those are over 1 MB.
03:26:29 19M are over 10 MB.
03:26:38 2.2M are over 100 MB.
03:27:26 Those 2.2M alone add up to 391.29 TiB.
03:28:10 arkiver: ^ First numbers, and the listing isn't even halfway through the discovered dirs yet.
03:29:46 Minor correction, those are over 1/10/100 MiB, not MB.
03:50:32 Todo is finally less than done. 3.37M remaining though.
03:55:12 is todo going down? :D
04:01:46 No, up, but slower than done at least. :-P
04:08:16 great
04:08:20 similar to telegrab right now
04:08:49 completing 12400 items/min, todo going down at 900 items/min *but at least it's going down*
04:11:18 Yeah, here I'm grabbing 10k/min but todo grows by 8k...
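The 02:59:00 remark above, about using the JSON metadata to fetch only the parts of a megawarc you care about, can be done with plain HTTP range requests, since each record in the .warc.gz is gzipped individually. The sketch below is untested; the WARC and index file names, the profile URL, and the index field names (`header_url`, `target.offset`, `target.size`) are illustrative guesses about the megawarc index layout, not values taken from this log:

import gzip
import json

import requests

ITEM = 'archiveteam_fotolog_20160211111954'       # item mentioned at 02:52:56
WARC = 'example.megawarc.warc.gz'                  # hypothetical file name inside the item
INDEX = 'example.megawarc.json.gz'                 # hypothetical local copy of the matching index
PROFILE = 'fotolog.com/someprofile'                # hypothetical profile to extract

with gzip.open(INDEX, 'rt') as idx, open('profile.warc.gz', 'wb') as out:
    for line in idx:
        entry = json.loads(line)
        url = entry.get('header_url') or ''        # assumed field name
        if PROFILE not in url:
            continue
        offset = entry['target']['offset']         # assumed field names
        size = entry['target']['size']
        # Fetch just this record's bytes; individually gzipped records can be
        # concatenated into a valid .warc.gz.
        resp = requests.get(
            f'https://archive.org/download/{ITEM}/{WARC}',
            headers={'Range': f'bytes={offset}-{offset + size - 1}'},
            timeout=60,
        )
        resp.raise_for_status()
        out.write(resp.content)

The resulting profile.warc.gz could then be opened in replayweb.page, much as rafaeletc did with a small fully-downloaded file.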
04:13:13 ow
04:24:40 https://x.com/discordpreviews/status/1725240412023959844?s=12
04:24:40 nitter: https://nitter.net/discordpreviews/status/1725240412023959844
04:24:46 waste of money is being shut down
04:25:01 no action required other than a “lulz”
04:32:53 In other news, the Canucks forums are still online and active. It was supposed to shut down at the end of September. The announcement has since been edited to read:
04:32:56 > It will be closed on __________.
04:33:08 https://forum.canucks.com/announcement/25-forum-closure/
04:34:44 literally underscores?
04:34:47 Yep
04:35:06 they can't find the power button
04:35:40 The only forum admin has no idea what's going on either. :-)
17:14:55 Megame edited Deathwatch (+265, /* 2023 */ GCN+ - Dec 19): https://wiki.archiveteam.org/?diff=51160&oldid=51153
17:27:46 There's a post on reddit saying pannchoa.com is likely going to be taken down in 10 hours. I can't find any evidence for an actual shutdown online. There's lots of people saying it should though
17:27:47 https://www.reddit.com/r/Archiveteam/comments/17x5qdr/pannchoacom_website_likely_being_taken_down_in_24/
17:28:40 Just based on the number of pages they have, it looks like they probably have ~10779 posts
17:30:56 Not sure whether this is a Deathwatch or more of a Firedrill
17:37:02 Already in AB by the looks of it, "!status wjfqc3qnj93820c7o97ea1vw"
17:39:22 My archive.mozilla.org listing finished after 9494060 dirs. A handful of errors I need to look at. At least one of those dirs probably just can't be listed.
17:39:50 14.3 GiB of listings in compressed WARC...
17:55:56 holy crap
20:55:37 Manu edited Political parties/Germany/Hamburg (+12464, SPD (not even finished yet)): https://wiki.archiveteam.org/?diff=51161&oldid=51116
22:05:52 Manu edited Political parties/Germany/Hamburg (+4072, /* Sozialdemokratische Partei Deutschlands…): https://wiki.archiveteam.org/?diff=51162&oldid=51161
22:36:57 Manu edited Political parties/Germany/Hamburg (+50, /* SPD Hamburg-Nord */): https://wiki.archiveteam.org/?diff=51163&oldid=51162