-
thuban
thank you!
-
JAA
1.26M dirs done, 2.83M remaining. It exploded a bit again ~15 minutes ago.
-
rafaeletc
hi folks, I need some help understanding the archives of the "fotolog" website, I am doing research on a young Brazilian movement that used the site a lot, is here the right place for me to ask?
-
JAA
Over 3M todo now, so I'll stop updating here until it's closer to completion.
-
JAA
Hi rafaeletc, yeah, here is fine to ask.
-
rafaeletc
Oh, hello JAA, nice to meet u
-
nicolas17
JAA: any idea how much data is there?
-
nicolas17
I guess you're not extracting file sizes yet
-
rafaeletc
oh, a lot of gigabytes
-
nicolas17
oh I meant the mozilla archive he's currently working on :P
-
rafaeletc
nicolas17: sorry, thought the question was for me
-
JAA
nicolas17: No idea yet, no, just fetching the dir listings so far.
-
arkiver
rafaeletc: it would be good to know the question :P
-
arkiver
but the archives are browsable in the Wayback Machine
-
rafaeletc
I'm trying to recover some content from "fotolog.com" via the Wayback Machine but I'm finding a lot of broken content, then I found this:
archive.org/details/archiveteam_fotolog - my question is how to extract them in a way that lets me identify where the files of the profiles I want to analyze are, if you search in this link you will see
-
rafaeletc
that they are very large files to download fully and search for a needle in a haystack.
-
arkiver
i do see the site is marked "partially saved"
-
arkiver
rafaeletc: do you have any URLs?
-
arkiver
you could look those up in the Wayback Machine
-
rafaeletc
I managed to understand the metadata you have in the JSON of each upload, is there a way to automate downloading only these JSON files?
-
arkiver
rafaeletc: the CDX files may be useful for you, they contain lists of URLs saved in the archives - though i don't know what you are looking for
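The CDX lookup arkiver describes can be sketched roughly like this. The field layout below is an assumption (the common 11-field CDX order; real CDX files vary, so check the header line of the actual file), and the sample line and `matching_records` helper are hypothetical:

```python
# Rough sketch of filtering a CDX index for one profile's URLs.
# Assumed layout (11 fields): urlkey, timestamp, original URL, MIME type,
# status, digest, redirect, meta, compressed size, offset, WARC filename.
# Check the real file's header line -- CDX layouts differ.

def matching_records(cdx_lines, profile):
    """Yield (url, timestamp, offset, size, warc_file) for lines mentioning `profile`."""
    for line in cdx_lines:
        fields = line.split()
        if len(fields) < 11:
            continue  # skip header / malformed lines
        url = fields[2]
        if profile in url:
            yield url, fields[1], int(fields[9]), int(fields[8]), fields[10]

# Hypothetical sample line, just to show the shape:
sample = ("com,fotolog)/someuser 20160304 http://fotolog.com/someuser "
          "text/html 200 ABC - - 1234 5678 fotolog_x.warc.gz")
print(list(matching_records([sample], "someuser")))
```

The offset/size/filename triple is what makes the partial downloads discussed later in this conversation possible.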
-
arkiver
as for downloading files, you can for example use the `internetarchive` (`ia`) library for this
-
arkiver
-
arkiver
i'll be off to bed now though, but you can leave a message and me or JAA or someone else might know what to do
-
rafaeletc
arkiver: I would like to search the comments and photos of what was archived, but not everything, only of some profiles
-
arkiver
rafaeletc: i'm afraid 'searching' is not exactly possible, as the content of the records is not really indexed, only the URLs and some other metadata
-
arkiver
so if you know some URLs for what you are looking for, you could look those up in the Wayback Machine
-
arkiver
but searching all comments/pages for some terms is not possible at the moment, unless someone makes this searchable (which could be expensive)
-
rafaeletc
arkiver: so it's mostly static content
-
rafaeletc
but this application that you linked is well documented, I believe it will help me a lot, via browser it is very difficult
-
rafaeletc
I am researching a cultural movement that between 2003 and 2006 used the website for dissemination and communication
-
rafaeletc
hence my interest in the comments, but as they were dynamic content, I believe I will not be able to recover them, but getting the publications should still help
-
rafaeletc
pardon my bad English, I am Brazilian and natively speak Portuguese. but thank you very much, the application you pointed to illuminated the path I need to follow to research more
-
JAA
rafaeletc: `ia download --search 'collection:archiveteam_fotolog' --glob '*.json.gz'` should work for downloading all of the megawarc JSON files.
-
rafaeletc
-
JAA
rafaeletc: Yes, that should work.
-
rafaeletc
JAA: I mean, the WARCs of profiles are compressed, right? is each page archived a WARC inside the compressed file?
-
nicolas17
each page archived is a compressed record inside the warc file
-
JAA
rafaeletc: It's a compressed file, and each record is compressed individually. replayweb.page should support reading the compressed .warc.gz file directly.
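A tiny sketch of why no full decompression is needed: a .warc.gz is a concatenation of independent gzip members, one per record, so each record can be decompressed on its own. The two dummy records here are made up just to demonstrate the mechanism:

```python
# A .warc.gz is a series of independent gzip members. This splits a blob of
# concatenated members and decompresses each one separately, which is why a
# single record can be read without touching the rest of a 36 GB file.
import gzip
import zlib

def gzip_members(data):
    """Yield each concatenated gzip member of `data`, decompressed."""
    while data:
        d = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)  # gzip format
        yield d.decompress(data)
        data = d.unused_data  # bytes after this member's trailer

# Two fake "records", each compressed as its own gzip member:
records = [b"WARC/1.0 record one", b"WARC/1.0 record two"]
blob = b"".join(gzip.compress(r) for r in records)
print(list(gzip_members(blob)))
```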
-
rafaeletc
JAA: but a warc.gz of 36 GB?
-
JAA
In other words, you do not need to decompress it.
-
JAA
Yeah, they are large and not exactly nice to work with.
-
JAA
Using the metadata in the JSON files, you can download just the part of the file you're interested in.
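A partial download like JAA describes boils down to an HTTP Range request using the offset and compressed size from the metadata. This is only a sketch: the JSON field names are not shown here and differ per collection, so check the actual files before wiring anything up.

```python
# Sketch: fetch one individually-gzipped WARC record via an HTTP Range
# request, given its byte offset and compressed size from the metadata.
# (Where exactly the offset/size live in the JSON is collection-specific.)
import gzip
import urllib.request

def byte_range_header(offset, size):
    """Range header value for `size` bytes starting at `offset` (end inclusive)."""
    return f"bytes={offset}-{offset + size - 1}"

def fetch_record(url, offset, size):
    """Download and decompress a single record from a remote .warc.gz."""
    req = urllib.request.Request(
        url, headers={"Range": byte_range_header(offset, size)}
    )
    with urllib.request.urlopen(req) as resp:
        return gzip.decompress(resp.read())

print(byte_range_header(5678, 1234))  # bytes=5678-6911
```

This works because each record is its own gzip member, so the fetched slice is a valid standalone stream.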
-
rafaeletc
but do I have to download it, or just paste the link at replayweb.page?
-
JAA
I'm not sure. I haven't used ReplayWeb.Page before. I just know that it exists.
-
rafaeletc
oh, I figured it out, downloaded a small file and opened it in replayweb and saw the content that was archived, luckily the comments are saved, now it's just finding where the WARCs of the profiles I want to search are
-
rafaeletc
thank you all, very very much
-
JAA
:-)
-
JAA
I'm trying to get a partial size estimate for what I've crawled so far of archive.mozilla.org, but it makes my CPU sad.
-
JAA
I'm already at 5 GiB of compressed WARCs for the listings so far...
-
nicolas17
x_x
-
nicolas17
I wonder where it's hosted
-
JAA
Somewhere that doesn't care a bit about me hammering it for hours at least, so that's good.
-
nicolas17
some massive storage server exposing all this data *as a single filesystem*?
-
JAA
28.35.117.34.bc.googleusercontent.com
-
nulldata
Why in the *cloud*, of course!
-
nicolas17
x-goog-storage-class: NEARLINE
-
nicolas17
they must have their own thing for file listings then huh
-
rafaeletc
JAA: I accidentally closed the browser tab and did not copy the ia parameters you suggested, can you paste them to me again?
-
JAA
rafaeletc: `ia download --search 'collection:archiveteam_fotolog' --glob '*.json.gz'`
-
rafaeletc
JAA: thank you, again
-
JAA
Happy to help. :-)
-
nicolas17
yep, files have the x-goog headers suggesting they're on Google Cloud Storage, but file listings don't, so they probably have a reverse proxy / load balancer / thing redirecting file listing requests to somewhere else
-
JAA
I just crossed 3M dirs fetched. There's another 3.15M in the queue.
-
JAA
And I thought my 4.8M symlinks were bad...
-
JAA
Always fun to optimise grep/awk/sed/... pipelines to get the best throughput.
-
JAA
In the first 3M-ish dirs listed (breadth-first recursion), I got 71 million files.
-
JAA
20.7M of those are over 1 MB.
-
JAA
19M are over 10 MB.
-
JAA
2.2M are over 100 MB.
-
JAA
Those 2.2M alone add up to 391.29 TiB.
-
JAA
arkiver: ^ First numbers, and the listing isn't even halfway through the discovered dirs yet.
-
JAA
Minor correction, those are over 1/10/100 MiB, not MB.
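Counts like the 1/10/100 MiB figures above come from a single thresholding pass over the listed sizes; a minimal sketch (the sample sizes are invented):

```python
# Minimal sketch of the 1/10/100 MiB bucketing: for each threshold, count
# files whose size in bytes exceeds it.
MiB = 1024 * 1024

def bucket_counts(sizes, thresholds=(1, 10, 100)):
    """Map each threshold (in MiB) to the number of sizes exceeding it."""
    counts = {t: 0 for t in thresholds}
    for s in sizes:
        for t in thresholds:
            if s > t * MiB:
                counts[t] += 1
    return counts

print(bucket_counts([512 * 1024, 5 * MiB, 50 * MiB, 200 * MiB]))
# {1: 3, 10: 2, 100: 1}
```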
-
JAA
Todo is finally less than done. 3.37M remaining though.
-
nicolas17
is todo going down? :D
-
JAA
No, up, but slower than done at least. :-P
-
nicolas17
great
-
nicolas17
similar to telegrab right now
-
nicolas17
completing 12400 items/min, todo going down at 900 items/min *but at least it's going down*
-
JAA
Yeah, here I'm grabbing 10k/min but todo grows by 8k...
-
nicolas17
ow
-
fireonlive
-
eggdrop
-
fireonlive
waste of money is being shut down
-
fireonlive
no action required other than a “lulz”
-
JAA
In other news, the Canucks forums are still online and active. It was supposed to shut down at the end of September. The announcement has since been edited to read:
-
JAA
> It will be closed on __________.
-
JAA
-
nicolas17
literally underscores?
-
JAA
Yep
-
nicolas17
they can't find the power button
-
JAA
The only forum admin has no idea what's going on either. :-)
-
h2ibot
Megame edited Deathwatch (+265, /* 2023 */ GCN+ - Dec 19):
wiki.archiveteam.org/?diff=51160&oldid=51153
-
vokunal|m
There's a post on reddit saying pannchoa.com is likely going to be taken down in 10 hours. I can't find any evidence for an actual shutdown online. There's lots of people saying it should though
-
vokunal|m
-
vokunal|m
Just based on the number of pages they have, it looks like they probably have ~10779 posts
-
vokunal|m
Not sure whether this is a Deathwatch or more of a Firedrill
-
AK
Already in AB by the looks of it, "!status wjfqc3qnj93820c7o97ea1vw"
-
JAA
My archive.mozilla.org listing finished after 9494060 dirs. A handful of errors I need to look at. At least one of those dirs probably just can't be listed.
-
JAA
14.3 GiB of listings in compressed WARC...
-
fireonlive
holy crap
-
h2ibot
Manu edited Political parties/Germany/Hamburg (+12464, SPD (not even finished yet)):
wiki.archiveteam.org/?diff=51161&oldid=51116
-
h2ibot
Manu edited Political parties/Germany/Hamburg (+4072, /* Sozialdemokratische Partei Deutschlands…):
wiki.archiveteam.org/?diff=51162&oldid=51161
-
h2ibot
Manu edited Political parties/Germany/Hamburg (+50, /* SPD Hamburg-Nord */):
wiki.archiveteam.org/?diff=51163&oldid=51162