-
billy549
Is it allowed to slightly modify ArchiveTeam's logo?
-
billy549
i cant really go into specifics without it being obvious so please DM me and i can explain
-
billy549
in fact, ill just
-
OrIdow6
?
-
billy549
it's fine, i dmed JAA instead - please ignore me :P
-
OrIdow6
Oh
-
purplebot
Frequently Asked Questions edited by JustAnotherArchivist (+8, /* halp pls halp */ Fix WBM inclusion …) just now --
archiveteam.org/?diff=46532&oldid=45717
-
purplebot
Template:Czech websites edited by Sanqui (+27, Add szm.com and call it Slovak …), Sanqui (-4, two lines) just now --
archiveteam.org/?diff=46534&oldid=46222
-
avoozl1
how big is the yahoo_answers collection getting? would I need all files within the archiveteam_yahooanswers collection for a complete copy?
-
avoozl1
There's 789 parts at the moment, but I'm not sure where to check status
-
arkiver
avoozl1: you want to get a copy?
-
arkiver
it'll be tens of TBs
-
arkiver
maybe 100+
-
avoozl1
I would. I'm building a forum indexer and this seems like a good testcase
-
avoozl1
tens of TBs I can handle, 100+ not so much
-
arkiver
the archiveteam_yahooanswers collection contains data from an old Yahoo Answers project as well
-
avoozl1
ahh yeah I see, there's something from 2016 in there
-
avoozl1
and 2017
-
arkiver
yeah, everything from 2021+ is from the current project
-
avoozl1
I can pick every id that starts with archiveteam_yahooanswers_2021
-
avoozl1
thanks
-
arkiver
back then Yahoo Answers had a different structure (no horrible PUT requests for pagination)
-
avoozl1
is the archiveteam_yahooanswers_dictionary_16* part of the new run?
-
arkiver
yes
-
arkiver
everything is compressed with ZSTD with a dictionary
-
arkiver
that dictionary is stored there for safekeeping
-
avoozl1
yeah I had some issues after missing these in earlier downloads :) good to have these dicts
-
avoozl1
otherwise the data is useless :)
-
arkiver
i mean the dictionary is in the ZST megaWARC as well
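For reference, a minimal Go sketch of pulling an embedded dictionary out of the front of such a .warc.zst and decoding the rest with it, using github.com/klauspost/compress/zstd. The leading-skippable-frame layout, the possibility of the dictionary itself being zstd-compressed, and the file name are all assumptions for the example rather than details confirmed above:

    package main

    import (
        "bytes"
        "encoding/binary"
        "fmt"
        "io"
        "os"

        "github.com/klauspost/compress/zstd"
    )

    // readEmbeddedDict assumes the dictionary sits in a zstd skippable frame
    // (magic 0x184D2A50..0x184D2A5F) at the very start of the file and may
    // itself be zstd-compressed.
    func readEmbeddedDict(f *os.File) ([]byte, error) {
        var hdr [8]byte
        if _, err := io.ReadFull(f, hdr[:]); err != nil {
            return nil, err
        }
        magic := binary.LittleEndian.Uint32(hdr[0:4])
        if magic < 0x184D2A50 || magic > 0x184D2A5F {
            return nil, fmt.Errorf("no skippable frame at start of file")
        }
        size := binary.LittleEndian.Uint32(hdr[4:8])
        dict := make([]byte, size)
        if _, err := io.ReadFull(f, dict); err != nil {
            return nil, err
        }
        // If the embedded dictionary is itself a zstd frame, unpack it first.
        if len(dict) >= 4 && binary.LittleEndian.Uint32(dict[0:4]) == 0xFD2FB528 {
            dec, err := zstd.NewReader(bytes.NewReader(dict))
            if err != nil {
                return nil, err
            }
            defer dec.Close()
            return io.ReadAll(dec)
        }
        return dict, nil
    }

    func main() {
        f, err := os.Open("example.warc.zst") // hypothetical file name
        if err != nil {
            panic(err)
        }
        defer f.Close()

        dict, err := readEmbeddedDict(f)
        if err != nil {
            panic(err)
        }

        // Decode the remaining frames using the extracted dictionary.
        dec, err := zstd.NewReader(f, zstd.WithDecoderDicts(dict))
        if err != nil {
            panic(err)
        }
        defer dec.Close()
        n, _ := io.Copy(io.Discard, dec) // replace with real WARC processing
        fmt.Println("decompressed bytes:", n)
    }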
-
avoozl1
ohh ok. good
-
arkiver
-
avoozl1
I'm mostly using my own go-based tools, so I have to take care of some of these things every once in a while by myself
-
arkiver
also contains details on deduplication
-
avoozl1
current size seems <10TB but I'm not sure how far it has gotten
-
avoozl1
I'll grab a few and start coding, it will take a very long time for this to trickle down into my local machine :)
-
avoozl1
Thanks!
-
rewby
avoozl1: If you're interested, I have a golang package to read the special zstd format
-
avoozl1
rewby: that'd be awesome. I'm currently using github.com/CorentinB/warc
-
rewby
avoozl1: here is my library. It's not the best library out there, but it does two things I wasn't able to find another golang library doing: it handles our zstd files, and it does streaming IO. That is, I don't need to load an entire record into memory to process it. My own code uses that to skip past big media, since I only care about HTML content
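Not rewby's actual library, but a rough standard-library-only sketch of the streaming idea being described: read each WARC record's headers, then either consume the payload as a stream or skip it with io.CopyN so large records never have to be held in memory (the content-type filter is a simplification):

    package main

    import (
        "bufio"
        "fmt"
        "io"
        "os"
        "strconv"
        "strings"
    )

    // readHeaders consumes one WARC version line plus its header lines and
    // returns the headers keyed by lower-cased name.
    func readHeaders(r *bufio.Reader) (map[string]string, error) {
        if _, err := r.ReadString('\n'); err != nil { // e.g. "WARC/1.0"
            return nil, err
        }
        h := map[string]string{}
        for {
            line, err := r.ReadString('\n')
            if err != nil {
                return nil, err
            }
            line = strings.TrimRight(line, "\r\n")
            if line == "" {
                return h, nil
            }
            name, value, _ := strings.Cut(line, ":")
            h[strings.ToLower(name)] = strings.TrimSpace(value)
        }
    }

    func main() {
        r := bufio.NewReader(os.Stdin) // pipe an uncompressed .warc in here
        for {
            headers, err := readHeaders(r)
            if err == io.EOF {
                return
            } else if err != nil {
                panic(err)
            }
            length, _ := strconv.ParseInt(headers["content-length"], 10, 64)

            if strings.Contains(headers["content-type"], "application/http") {
                // Process the payload as a stream (HTML parsing would go here).
                n, _ := io.Copy(io.Discard, io.LimitReader(r, length))
                fmt.Println("processed record of", n, "bytes")
            } else {
                // Skip the payload without ever buffering it in memory.
                io.CopyN(io.Discard, r, length)
            }
            // Each record block is terminated by two blank lines.
            r.ReadString('\n')
            r.ReadString('\n')
        }
    }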
-
rewby
-
avoozl
Thanks, I'll take a look
-
rewby
I use this to pull multiple gigabits out of the IA and analyze the warcs for urls
-
avoozl
Looks awesome. I could probably drop this into my codebase almost as is
-
avoozl
I'm working on a bit of a hobby project to consume forum grabs in warc format and turn them into a searchable forum (using bleve/bluge as on-disk index format)
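A toy sketch of that index-and-search shape using bleve (v2 import path); the document fields, index path, and IDs below are made up for the example and are not from avoozl's project:

    package main

    import (
        "fmt"

        "github.com/blevesearch/bleve/v2"
    )

    // Post mirrors the kind of per-post document such a pipeline might emit;
    // the fields are illustrative only.
    type Post struct {
        Forum  string
        Thread string
        Author string
        Body   string
    }

    func main() {
        mapping := bleve.NewIndexMapping()
        index, err := bleve.New("forum.bleve", mapping) // hypothetical index path
        if err != nil {
            panic(err)
        }
        defer index.Close()

        // Index one extracted post; a real run would loop over WARC records.
        err = index.Index("thread-1/post-1", Post{
            Forum:  "example-forum",
            Thread: "thread-1",
            Author: "someone",
            Body:   "text extracted from the archived page",
        })
        if err != nil {
            panic(err)
        }

        // Query it back, full-text style.
        req := bleve.NewSearchRequest(bleve.NewMatchQuery("archived"))
        res, err := index.Search(req)
        if err != nil {
            panic(err)
        }
        fmt.Println("hits:", res.Total)
    }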
-
rewby
Yeah, feel free to poke me if you need support for it
-
avoozl
I haven't given it the time it requires lately, but it is in a functioning state
-
rewby
Cool!
-
avoozl
basically it parses all the response bodies, runs them through goquery so you can use css selectors or xpath to extract the necessary parts at a thread/post level, and then builds a giant index with a fairly simple web interface on top
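As a hedged illustration of that extraction stage only: the selectors, struct, and file name below are invented for the example, not taken from avoozl's code:

    package main

    import (
        "fmt"
        "os"

        "github.com/PuerkitoBio/goquery"
    )

    // Post is a hypothetical per-post structure; the CSS selectors would be
    // customised per forum.
    type Post struct {
        Author string
        Body   string
    }

    func extractPosts(f *os.File) ([]Post, error) {
        doc, err := goquery.NewDocumentFromReader(f)
        if err != nil {
            return nil, err
        }
        var posts []Post
        doc.Find("div.post").Each(func(_ int, s *goquery.Selection) {
            posts = append(posts, Post{
                Author: s.Find(".author").Text(),
                Body:   s.Find(".content").Text(),
            })
        })
        return posts, nil
    }

    func main() {
        f, err := os.Open("thread.html") // hypothetical input page
        if err != nil {
            panic(err)
        }
        defer f.Close()
        posts, err := extractPosts(f)
        if err != nil {
            panic(err)
        }
        for _, p := range posts {
            fmt.Printf("%s: %.60s\n", p.Author, p.Body)
        }
    }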
-
rewby
That's neat!
-
avoozl
it requires custom work for each forum of course, but the customization is fairly minimal.. I started from the league of legends forums
-
rewby
If you have any licencing concerns with my library, feel free to ask and we'll work it out
-
avoozl
awesome, will do
-
rewby
That's really neat. I quite like doing things with previously archived data.
-
Sanqui
I'm a big fan too!
-
Sanqui
I've been thinking about how useful a "personalized forum search engine" would be to me
-
rewby
I've got this compute cluster building a huge urls database to help with discovery for new projects
-
Sanqui
oftentimes when searching for something, google is garbage and I'd rather know "what are people saying about this"
-
Sanqui
y'know?
-
Sanqui
so if I can help with this, perhaps contribute some forums, that'd be lovely
-
Sanqui
I'm planning on doing something similar with my Discord archives later
-
avoozl
I'll get the code cleaned up a bit so it can be pushed out somewhere. I still have to complete some parts of the bleve->bluge switch
-
avoozl
(the ingest is done but the search is still bleve-only)
-
rewby
Cool
-
rewby
I'd love to see the code
-
Sanqui
my Discord archiver is still in progress, but it's looking like it will scale to a few hundred servers
-
avoozl
rewby: just a peek right now, but this is what I implement at a forum level..
paste.ofcode.org/DKszjr3eNAy4HjEjzJSSp9
-
Sanqui
this is important because web forums are dying and "public discords" fill the same cultural niche today...
-
rewby
Neat
-
avoozl
rewby: so this is just 'parse the response into an array of Bodies' and then the rest of the pipeline takes care of the indexing
-
rewby
Yeah, if you want "free" zstd support, feel free to use my code. My gitlab does have like 3-4 mins of downtime every few days while auto updates run, just as a warning
-
avoozl
one of the big todos is to modify bluge so that the search index can be hosted as a remote file instead of a local one. Search performance would suffer, but that'd mean I could just put a large index somewhere http-accessible (or s3) and have people query it without any special server-side logic
-
masterX244
pushing up a 2019 dump of forum.brickset.com atm (did that back then due to forum being in danger of shutdown)
-
rewby
Yeah, that makes a lot of sense. A remote index would work well, especially if there's no need to do server-side compute
-
avoozl
yeah it'll just be a bunch of range-bytes style retrievals
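A minimal sketch of that primitive, assuming a plain HTTP server and a placeholder URL; a remote index reader would issue many of these small ranged reads:

    package main

    import (
        "fmt"
        "io"
        "net/http"
    )

    // readBlock fetches bytes [off, off+n) of a remote file with a Range
    // request and expects a 206 Partial Content response.
    func readBlock(url string, off, n int64) ([]byte, error) {
        req, err := http.NewRequest("GET", url, nil)
        if err != nil {
            return nil, err
        }
        req.Header.Set("Range", fmt.Sprintf("bytes=%d-%d", off, off+n-1))
        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            return nil, err
        }
        defer resp.Body.Close()
        if resp.StatusCode != http.StatusPartialContent {
            return nil, fmt.Errorf("server did not honour Range: %s", resp.Status)
        }
        return io.ReadAll(resp.Body)
    }

    func main() {
        block, err := readBlock("https://example.org/forum.index", 4096, 8192) // placeholder URL
        if err != nil {
            panic(err)
        }
        fmt.Println("fetched", len(block), "bytes")
    }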
-
rewby
Makes a lot of sense. I would personally avoid s3 or similar due to the bandwidth and per-request costs. If the files are small enough, a VPS or something could do the trick
-
avoozl
typically I'm finding the index file ends up around as large as, or slightly larger than, the gz-compressed input text, so they are fairly large
-
rewby
Ah hm.
-
rewby
Try zstd. It does wonders
-
avoozl
rewby: you can always opt to do requester-pays on s3, but I guess for some users that'd be a hurdle
-
rewby
Yeah, I can imagine
-
rewby
I'd almost wonder if the IA would be willing to host the index files
-
avoozl
rewby: yeah zstd compresses better, but the index itself can't really be compressed well, at least not without sacrificing a ton of performance.. I've been chatting with the author of bluge, but it wasn't really on their radar as a use case, so it takes some time to look at the options there
-
rewby
Ah that sucks. I'm personally lucky my warc work compresses well
-
rewby
20 billion URLs in less than 200G
-
avoozl
(if I compress it I lose proper seek performance, and it needs pretty fine-grained retrieval.. I tried 8k or 64k block-based compression but it wasn't that great for this use case)
-
rewby
Yeah no, makes sense
-
avoozl
my json stuff also compresses like crazy with zstd
-
rewby
Yeah, zstd is bloody magic
-
Sanqui
I've been compressing jsonl with gzip
-
Sanqui
should I look into zstd?
-
rewby
I'd say you should
-
rewby
I got like 3-4x better compression out of zstd
-
Sanqui
does zstd have better indexing support?
-
rewby
With no more overhead
-
rewby
I dunno about indexing. I mostly do streaming IO
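For anyone making the same gzip-to-zstd switch for JSONL in Go, a small sketch using github.com/klauspost/compress/zstd (the output path is a placeholder); the streaming writer drops in where gzip.NewWriter would normally go:

    package main

    import (
        "bufio"
        "encoding/json"
        "os"

        "github.com/klauspost/compress/zstd"
    )

    func main() {
        out, err := os.Create("records.jsonl.zst") // hypothetical output path
        if err != nil {
            panic(err)
        }
        defer out.Close()

        // Streaming zstd writer: records are compressed as they are written.
        zw, err := zstd.NewWriter(out)
        if err != nil {
            panic(err)
        }
        defer zw.Close()

        w := bufio.NewWriter(zw)
        defer w.Flush()
        enc := json.NewEncoder(w)

        // Each record becomes one compressed JSON line.
        for i := 0; i < 3; i++ {
            if err := enc.Encode(map[string]any{"id": i, "text": "example"}); err != nil {
                panic(err)
            }
        }
    }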
-
EggplantN
Zstd is magic we use it at work for VM backups/snapshots
-
avoozl
I've used zstd with a 64kb block size: train a dict on the blocks, then compress each block so we can effectively seek (with an offset table next to it).. That still gives pretty good compression for some of our data files (better than compressing each block individually).. but only when the file is pretty monotonous
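A sketch of that block scheme, assuming github.com/valyala/gozstd (the library avoozl mentions using later) and a hypothetical input file; the 16 KiB dictionary size is an arbitrary choice for the example:

    package main

    import (
        "fmt"
        "os"

        "github.com/valyala/gozstd"
    )

    const blockSize = 64 * 1024

    // compressBlocks splits the data into 64 KiB blocks, trains a dictionary
    // on them, compresses each block with that dictionary, and records byte
    // offsets so individual blocks can later be fetched and decompressed.
    func compressBlocks(data []byte) (dict []byte, blocks [][]byte, offsets []int, err error) {
        var samples [][]byte
        for off := 0; off < len(data); off += blockSize {
            end := off + blockSize
            if end > len(data) {
                end = len(data)
            }
            samples = append(samples, data[off:end])
        }

        dict = gozstd.BuildDict(samples, 16*1024) // 16 KiB dictionary, arbitrary
        cd, err := gozstd.NewCDict(dict)
        if err != nil {
            return nil, nil, nil, err
        }
        defer cd.Release()

        pos := 0
        for _, s := range samples {
            c := gozstd.CompressDict(nil, s, cd)
            blocks = append(blocks, c)
            offsets = append(offsets, pos)
            pos += len(c)
        }
        return dict, blocks, offsets, nil
    }

    func main() {
        data, err := os.ReadFile("posts.jsonl") // hypothetical input file
        if err != nil {
            panic(err)
        }
        _, blocks, offsets, err := compressBlocks(data)
        if err != nil {
            panic(err)
        }
        fmt.Println("blocks:", len(blocks), "first offsets:", offsets[:min(len(offsets), 4)])
    }

    func min(a, b int) int {
        if a < b {
            return a
        }
        return b
    }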
-
rewby
I generally leave it set to default and it works loads better than gzip
-
avoozl
well xz compresses my json best, but it is also slow to compress and decompress
-
avoozl
zstd is faster and compresses a bit less.. though the speed for me makes up for it
-
Sanqui
incidentally, the upcoming release of fedora comes with btrfs with zstd compression enabled by default
-
Sanqui
so i guess it's *really* good
-
avoozl
I do have btrfs, but have compression disabled as I use dedup a lot and they don't seem to mingle well
-
avoozl
as a quick compression comparison for my json (which contains quite a bit of plain text content):
paste.ofcode.org/q7PYhs6D8spbgc8KHm6San
-
Sanqui
303217943234215948.jsonl : 10.17% (146585063 => 14903289 bytes, 303217943234215948.jsonl.zst)
-
Sanqui
original is 146.6 MB, gzipped 17.4 MB, zst'd 14.9 MB
-
Sanqui
I guess that adds up
-
Sanqui
xz is waaaay slower but nets 11.2 MB
-
JAA
All of these comparisons are useless unless you mention the compression level. :-)
-
Sanqui
the defaults matter :)
-
JAA
zstd's defaults are more in favour of speed than of compression ratio compared to the other tools.
-
Sanqui
I understand that
-
Sanqui
I also tried zstd with -T0 --ultra -20 and it got to the same size that xz did
-
JAA
I've found that zstd -10 is about comparable in runtime to gzip -9. Depends a lot on the data of course.
-
JAA
And zstd -2 yielded similarly sized files as gzip -9 at a tenfold shorter runtime.
-
JAA
This was using log files and SQLite databases from ArchiveBot, FWIW.
-
lunik1
for absolute best compression, lrzip+zpaq gives me the best ratios, though it's too slow to be practical in most circumstances
-
JAA
Ultimately, one just needs to test all compression levels (and possibly extra settings like threading on zstd) and then analyse at what point the additional compression ratio is no longer worth the extra computational effort. The results will differ wildly depending on what you're compressing, what the bottleneck is, etc.
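One way to run such a sweep from Go, assuming github.com/klauspost/compress/zstd and a placeholder test file; note that this library exposes only four named encoder levels, which map to a subset of the C zstd levels:

    package main

    import (
        "fmt"
        "os"
        "time"

        "github.com/klauspost/compress/zstd"
    )

    func main() {
        src, err := os.ReadFile("sample.jsonl") // hypothetical test file
        if err != nil {
            panic(err)
        }
        levels := []zstd.EncoderLevel{
            zstd.SpeedFastest,
            zstd.SpeedDefault,
            zstd.SpeedBetterCompression,
            zstd.SpeedBestCompression,
        }
        // Compress the same input at each level and print ratio vs. wall time.
        for _, lvl := range levels {
            enc, err := zstd.NewWriter(nil, zstd.WithEncoderLevel(lvl))
            if err != nil {
                panic(err)
            }
            start := time.Now()
            dst := enc.EncodeAll(src, nil)
            enc.Close()
            fmt.Printf("%-24s %6.2f%% of original, %v\n",
                lvl.String(), 100*float64(len(dst))/float64(len(src)), time.Since(start))
        }
    }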
-
lunik1
and your definition of "no longer worth"
-
avoozl
Just to check: the yahoo answers warc does exclude images and other frills, right? Because that is a serious amount of text
-
avoozl
rewby: did you have any reasons for picking github.com/DataDog/zstd over github.com/valyala/gozstd ? I'm using the latter one
-
masterX244
avoozl: bricksetforum still uploading, 31/38 warcs up. total 190GB (since i captured embedded and referenced media, too)
-
HCross
avoozl: the warcs contain everything
-
HCross
The entire point of them is to give you enough data to recreate the experience of being on the site without being online
-
HCross
also: look into AWS S3 but using Requester Pays
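A hedged sketch of a requester-pays GET with aws-sdk-go-v2 (bucket and key are placeholders); whoever runs this is billed for the request and transfer, which is the hurdle avoozl mentioned earlier:

    package main

    import (
        "context"
        "fmt"
        "io"

        "github.com/aws/aws-sdk-go-v2/aws"
        "github.com/aws/aws-sdk-go-v2/config"
        "github.com/aws/aws-sdk-go-v2/service/s3"
        "github.com/aws/aws-sdk-go-v2/service/s3/types"
    )

    func main() {
        ctx := context.Background()
        cfg, err := config.LoadDefaultConfig(ctx)
        if err != nil {
            panic(err)
        }
        client := s3.NewFromConfig(cfg)

        // RequestPayer marks this as a requester-pays download.
        out, err := client.GetObject(ctx, &s3.GetObjectInput{
            Bucket:       aws.String("example-index-bucket"), // placeholder
            Key:          aws.String("forum.index"),          // placeholder
            RequestPayer: types.RequestPayerRequester,
        })
        if err != nil {
            panic(err)
        }
        defer out.Body.Close()

        n, _ := io.Copy(io.Discard, out.Body)
        fmt.Println("downloaded", n, "bytes")
    }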
-
HCross
-
tech234a
Side note: one of us should update the Docker Warrior tutorial to ensure that configuration is persisted across Watchtower updates