-
billy549
Is it allowed to slightly modify ArchiveTeam's logo?
-
billy549
i cant really go into specifics without it being obvious so please DM me and i can explain
-
billy549
in fact, ill just
-
OrIdow6
?
-
billy549
it's fine, i dmed JAA instead - please ignore me :P
-
OrIdow6
Oh
-
purplebot
Frequently Asked Questions edited by JustAnotherArchivist (+8, /* halp pls halp */ Fix WBM inclusion …) just now --
archiveteam.org/?diff=46532&oldid=45717
-
purplebot
Template:Czech websites edited by Sanqui (+27, Add szm.com and call it Slovak …), Sanqui (-4, two lines) just now --
archiveteam.org/?diff=46534&oldid=46222
-
avoozl1
how big is the yahoo_answers collection getting? would I need all files within the archiveteam_yahooanswers collection for a complete copy?
-
avoozl1
There's 789 parts at the moment, but I'm not sure where to check status
-
arkiver
avoozl1: you want to get a copy?
-
arkiver
it'll be tens of TBs
-
arkiver
maybe 100+
-
avoozl1
I would. I'm building a forum indexer and this seems like a good testcase
-
avoozl1
tens of TBs I can handle, 100+ not so much
-
arkiver
the archiveteam_yahooanswers collection contains data from an old Yahoo Answers project as well
-
avoozl1
ahh yeah I see, there's something from 2016 in there
-
avoozl1
and 2017
-
arkiver
yeah, everything from 2021+ is from the current project
-
avoozl1
I can pick every id that starts with archiveteam_yahooanswers_2021
-
avoozl1
thanks
-
arkiver
back then Yahoo Answers had a different structure (no horrible PUT requests for pagination)
-
avoozl1
is the archiveteam_yahooanswers_dictionary_16* part of the new run?
-
arkiver
yes
-
arkiver
everything is compressed with ZSTD with a dictionary
-
arkiver
that dictionary is stored there for safekeeping
-
avoozl1
yeah I had some issues after missing these in earlier downloads :) good to have these dicts
-
avoozl1
otherwise the data is useless :)
-
arkiver
i mean the dictionary is in the ZST megaWARC as well
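For reference, a minimal Go sketch of pulling an embedded dictionary out of the front of such a .warc.zst and decoding the rest with it, using github.com/klauspost/compress/zstd. The leading-skippable-frame layout, the possibility of the dictionary itself being zstd-compressed, and the file name are all assumptions for the example rather than details confirmed above:

    package main

    import (
        "bytes"
        "encoding/binary"
        "fmt"
        "io"
        "os"

        "github.com/klauspost/compress/zstd"
    )

    // readEmbeddedDict assumes the dictionary sits in a zstd skippable frame
    // (magic 0x184D2A50..0x184D2A5F) at the very start of the file and may
    // itself be zstd-compressed.
    func readEmbeddedDict(f *os.File) ([]byte, error) {
        var hdr [8]byte
        if _, err := io.ReadFull(f, hdr[:]); err != nil {
            return nil, err
        }
        magic := binary.LittleEndian.Uint32(hdr[0:4])
        if magic < 0x184D2A50 || magic > 0x184D2A5F {
            return nil, fmt.Errorf("no skippable frame at start of file")
        }
        size := binary.LittleEndian.Uint32(hdr[4:8])
        dict := make([]byte, size)
        if _, err := io.ReadFull(f, dict); err != nil {
            return nil, err
        }
        // If the embedded dictionary is itself a zstd frame, unpack it first.
        if len(dict) >= 4 && binary.LittleEndian.Uint32(dict[0:4]) == 0xFD2FB528 {
            dec, err := zstd.NewReader(bytes.NewReader(dict))
            if err != nil {
                return nil, err
            }
            defer dec.Close()
            return io.ReadAll(dec)
        }
        return dict, nil
    }

    func main() {
        f, err := os.Open("example.warc.zst") // hypothetical file name
        if err != nil {
            panic(err)
        }
        defer f.Close()

        dict, err := readEmbeddedDict(f)
        if err != nil {
            panic(err)
        }

        // Decode the remaining frames using the extracted dictionary.
        dec, err := zstd.NewReader(f, zstd.WithDecoderDicts(dict))
        if err != nil {
            panic(err)
        }
        defer dec.Close()
        n, _ := io.Copy(io.Discard, dec) // replace with real WARC processing
        fmt.Println("decompressed bytes:", n)
    }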
-
avoozl1
ohh ok. good
-
arkiver
-
avoozl1
I'm mostly using my own go-based tools, so I have to take care of some of these things every once in a while by myself
-
arkiver
also contains details on deduplication
-
avoozl1
current size seems <10TB but I'm not sure how far it has gotten
-
avoozl1
I'll grab a few and start coding, it will take a very long time for this to trickle down into my local machine :)
-
avoozl1
Thanks!
-
rewby
avoozl1: If you're interested, I have a golang package to read the special zstd format
-
avoozl1
rewby: that'd be awesome. I'm currently using github.com/CorentinB/warc
-
rewby
avoozl1: here is my library. It's not the best library out there, but it does two things I wasn't able to find another golang library doing: it handles our zstd files, and it does streaming IO. That is, I don't need to load an entire record into memory to process it. My own code uses that to skip past big media, since I only care about HTML content
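Not rewby's actual library, but a rough standard-library-only sketch of the streaming idea being described: read each WARC record's headers, then either consume the payload as a stream or skip it with io.CopyN so large records never have to be held in memory (the content-type filter is a simplification):

    package main

    import (
        "bufio"
        "fmt"
        "io"
        "os"
        "strconv"
        "strings"
    )

    // readHeaders consumes one WARC version line plus its header lines and
    // returns the headers keyed by lower-cased name.
    func readHeaders(r *bufio.Reader) (map[string]string, error) {
        if _, err := r.ReadString('\n'); err != nil { // e.g. "WARC/1.0"
            return nil, err
        }
        h := map[string]string{}
        for {
            line, err := r.ReadString('\n')
            if err != nil {
                return nil, err
            }
            line = strings.TrimRight(line, "\r\n")
            if line == "" {
                return h, nil
            }
            name, value, _ := strings.Cut(line, ":")
            h[strings.ToLower(name)] = strings.TrimSpace(value)
        }
    }

    func main() {
        r := bufio.NewReader(os.Stdin) // pipe an uncompressed .warc in here
        for {
            headers, err := readHeaders(r)
            if err == io.EOF {
                return
            } else if err != nil {
                panic(err)
            }
            length, _ := strconv.ParseInt(headers["content-length"], 10, 64)

            if strings.Contains(headers["content-type"], "application/http") {
                // Process the payload as a stream (HTML parsing would go here).
                n, _ := io.Copy(io.Discard, io.LimitReader(r, length))
                fmt.Println("processed record of", n, "bytes")
            } else {
                // Skip the payload without ever buffering it in memory.
                io.CopyN(io.Discard, r, length)
            }
            // Each record block is terminated by two blank lines.
            r.ReadString('\n')
            r.ReadString('\n')
        }
    }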
-
rewby
-
avoozl
Thanks, I'll take a look
-
rewby
I use this to pull multiple gigabits out of the IA and analyze the warcs for urls
-
avoozl
Looks awesome. I could probably drop this into my codebase almost as is
-
avoozl
I'm working on a bit of a hobby project to consume forum grabs in warc format and turn them into a searchable forum (using bleve/bluge as on-disk index format)
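A toy sketch of that index-and-search shape using bleve (v2 import path); the document fields, index path, and IDs below are made up for the example and are not from avoozl's project:

    package main

    import (
        "fmt"

        "github.com/blevesearch/bleve/v2"
    )

    // Post mirrors the kind of per-post document such a pipeline might emit;
    // the fields are illustrative only.
    type Post struct {
        Forum  string
        Thread string
        Author string
        Body   string
    }

    func main() {
        mapping := bleve.NewIndexMapping()
        index, err := bleve.New("forum.bleve", mapping) // hypothetical index path
        if err != nil {
            panic(err)
        }
        defer index.Close()

        // Index one extracted post; a real run would loop over WARC records.
        err = index.Index("thread-1/post-1", Post{
            Forum:  "example-forum",
            Thread: "thread-1",
            Author: "someone",
            Body:   "text extracted from the archived page",
        })
        if err != nil {
            panic(err)
        }

        // Query it back, full-text style.
        req := bleve.NewSearchRequest(bleve.NewMatchQuery("archived"))
        res, err := index.Search(req)
        if err != nil {
            panic(err)
        }
        fmt.Println("hits:", res.Total)
    }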
-
rewby
Yeah, feel free to poke me if you need support for it
-
avoozl
I haven't given it the time it requires lately, but it is in a functioning state
-
rewby
Cool!
-
avoozl
basically it parses all the response bodies, runs them through goquery so you can use css selectors or xpath to extract the necessary parts at a thread/post level, and then builds a giant index with a fairly simple web interface on top
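As a hedged illustration of that extraction stage only: the selectors, struct, and file name below are invented for the example, not taken from avoozl's code:

    package main

    import (
        "fmt"
        "os"

        "github.com/PuerkitoBio/goquery"
    )

    // Post is a hypothetical per-post structure; the CSS selectors would be
    // customised per forum.
    type Post struct {
        Author string
        Body   string
    }

    func extractPosts(f *os.File) ([]Post, error) {
        doc, err := goquery.NewDocumentFromReader(f)
        if err != nil {
            return nil, err
        }
        var posts []Post
        doc.Find("div.post").Each(func(_ int, s *goquery.Selection) {
            posts = append(posts, Post{
                Author: s.Find(".author").Text(),
                Body:   s.Find(".content").Text(),
            })
        })
        return posts, nil
    }

    func main() {
        f, err := os.Open("thread.html") // hypothetical input page
        if err != nil {
            panic(err)
        }
        defer f.Close()
        posts, err := extractPosts(f)
        if err != nil {
            panic(err)
        }
        for _, p := range posts {
            fmt.Printf("%s: %.60s\n", p.Author, p.Body)
        }
    }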
-
rewby
That's neat!
-
avoozl
it requires custom work for each forum of course, but the customization is fairly minimal.. I started from the league of legends forums
-
rewby
If you have any licencing concerns with my library, feel free to ask and we'll work it out
-
avoozl
awesome, will do
-
rewby
That's really neat. I quite like doing things with previously archived data.
-
Sanqui
I'm a big fan too!
-
Sanqui
I've been thinking about how useful a "personalized forum search engine" would be to me
-
rewby
I've got this compute cluster building a huge urls database to help with discovery for new projects
-
Sanqui
oftentimes when searching for something, google is garbage and I'd rather know "what are people saying about this"
-
Sanqui
y'know?
-
Sanqui
so if I can help with this, perhaps contribute some forums, that'd be lovely
-
Sanqui
I'm planning on doing something similar with my Discord archives later
-
avoozl
I'll get the code cleaned up a bit so it can be pushed out somewhere. I still have to complete some parts of the bleve->bluge switch
-
avoozl
(the ingest is done but the search is still bleve-only)
-
rewby
Cool
-
rewby
I'd love to see the code
-
Sanqui
my Discord archiver is still in progress, but it's looking like it will scale to a few hundred servers
-
avoozl
rewby: just a peek right now, but this is what I implement at a forum level..
paste.ofcode.org/DKszjr3eNAy4HjEjzJSSp9
-
Sanqui
this is important because web forums are dying and "public discords" fill the same cultural niche today...
-
rewby
Neat
-
avoozl
rewby: so this is just 'parse the response into an array of Bodies' and then the rest of the pipeline takes care of the indexing
-
rewby
Yeah, if you want "free" zstd support, feel free to use my code. My gitlab does have like 3-4 mins of downtime every few days while auto updates run, just as a warning
-
avoozl
one of the big todos is to modify bluge so that the search index can be hosted as a remote file instead of a local one. Search performance would suffer, but that'd mean I could just put a large index somewhere http-accessible (or s3) and have people query it without any special server-side logic
-
masterX244
pushing up a 2019 dump of forum.brickset.com atm (did that back then due to forum being in danger of shutdown)
-
rewby
Yeah, that makes a lot of sense. A remote index would work well, especially if there's no need to do server-side compute
-
avoozl
yeah it'll just be a bunch of range-bytes style retrievals
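A minimal sketch of that primitive, assuming a plain HTTP server and a placeholder URL; a remote index reader would issue many of these small ranged reads:

    package main

    import (
        "fmt"
        "io"
        "net/http"
    )

    // readBlock fetches bytes [off, off+n) of a remote file with a Range
    // request and expects a 206 Partial Content response.
    func readBlock(url string, off, n int64) ([]byte, error) {
        req, err := http.NewRequest("GET", url, nil)
        if err != nil {
            return nil, err
        }
        req.Header.Set("Range", fmt.Sprintf("bytes=%d-%d", off, off+n-1))
        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            return nil, err
        }
        defer resp.Body.Close()
        if resp.StatusCode != http.StatusPartialContent {
            return nil, fmt.Errorf("server did not honour Range: %s", resp.Status)
        }
        return io.ReadAll(resp.Body)
    }

    func main() {
        block, err := readBlock("https://example.org/forum.index", 4096, 8192) // placeholder URL
        if err != nil {
            panic(err)
        }
        fmt.Println("fetched", len(block), "bytes")
    }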
-
rewby
Makes a lot of sense. I would personally avoid s3 or similar due to the bandwidth and per-request costs. If the files are small enough, a VPS or something could do the trick
-
avoozl
typically I'm finding the index file ends up around as large as, or slightly larger than, the gz-compressed input text, so they are fairly large
-
rewby
Ah hm.
-
rewby
Try zstd. It does wonders
-
avoozl
rewby: you can always opt to do requester-pays on s3, but I guess for some users that'd be a hurdle
-
rewby
Yeah, I can imagine
-
rewby
I'd almost wonder if the IA would be willing to host the index files
-
avoozl
rewby: yeah zstd compresses better, but the index itself can't really be compressed well, at least not without sacrificing a ton of performance.. I've been chatting with the author of bluge, but it wasn't really on their radar as a use case, so it takes some time to look at the options there
-
rewby
Ah that sucks. I'm personally lucky my warc work compresses well
-
rewby
20 billion URLs in less than 200G
-
avoozl
(if I compress it I lose proper seek performance, and it needs pretty fine-grained retrieval.. I tried 8k or 64k block-based compression but it wasn't that great for this use case)
-
rewby
Yeah no, makes sense
-
avoozl
my json stuff also compresses like crazy with zstd
-
rewby
Yeah, zstd is bloody magic
-
Sanqui
I've been compressing jsonl with gzip
-
Sanqui
should I look into zstd?
-
rewby
I'd say you should
-
rewby
I got like 3-4x better compression out of zstd
-
Sanqui
does zstd have better indexing support?
-
rewby
With no more overhead
-
rewby
I dunno about indexing. I mostly do streaming IO
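For anyone making the same gzip-to-zstd switch for JSONL in Go, a small sketch using github.com/klauspost/compress/zstd (the output path is a placeholder); the streaming writer drops in where gzip.NewWriter would normally go:

    package main

    import (
        "bufio"
        "encoding/json"
        "os"

        "github.com/klauspost/compress/zstd"
    )

    func main() {
        out, err := os.Create("records.jsonl.zst") // hypothetical output path
        if err != nil {
            panic(err)
        }
        defer out.Close()

        // Streaming zstd writer: records are compressed as they are written.
        zw, err := zstd.NewWriter(out)
        if err != nil {
            panic(err)
        }
        defer zw.Close()

        w := bufio.NewWriter(zw)
        defer w.Flush()
        enc := json.NewEncoder(w)

        // Each record becomes one compressed JSON line.
        for i := 0; i < 3; i++ {
            if err := enc.Encode(map[string]any{"id": i, "text": "example"}); err != nil {
                panic(err)
            }
        }
    }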
-
EggplantN
Zstd is magic we use it at work for VM backups/snapshots
-
avoozl
I've used zstd with a 64kb block size: train a dict on the blocks, then compress each block so we can effectively seek (with an offset table next to it).. That still gives pretty good compression for some of our data files (better than compressing each block individually).. but only when the file is pretty monotonous
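A sketch of that block scheme, assuming github.com/valyala/gozstd (the library avoozl mentions using later) and a hypothetical input file; the 16 KiB dictionary size is an arbitrary choice for the example:

    package main

    import (
        "fmt"
        "os"

        "github.com/valyala/gozstd"
    )

    const blockSize = 64 * 1024

    // compressBlocks splits the data into 64 KiB blocks, trains a dictionary
    // on them, compresses each block with that dictionary, and records byte
    // offsets so individual blocks can later be fetched and decompressed.
    func compressBlocks(data []byte) (dict []byte, blocks [][]byte, offsets []int, err error) {
        var samples [][]byte
        for off := 0; off < len(data); off += blockSize {
            end := off + blockSize
            if end > len(data) {
                end = len(data)
            }
            samples = append(samples, data[off:end])
        }

        dict = gozstd.BuildDict(samples, 16*1024) // 16 KiB dictionary, arbitrary
        cd, err := gozstd.NewCDict(dict)
        if err != nil {
            return nil, nil, nil, err
        }
        defer cd.Release()

        pos := 0
        for _, s := range samples {
            c := gozstd.CompressDict(nil, s, cd)
            blocks = append(blocks, c)
            offsets = append(offsets, pos)
            pos += len(c)
        }
        return dict, blocks, offsets, nil
    }

    func main() {
        data, err := os.ReadFile("posts.jsonl") // hypothetical input file
        if err != nil {
            panic(err)
        }
        _, blocks, offsets, err := compressBlocks(data)
        if err != nil {
            panic(err)
        }
        fmt.Println("blocks:", len(blocks), "first offsets:", offsets[:min(len(offsets), 4)])
    }

    func min(a, b int) int {
        if a < b {
            return a
        }
        return b
    }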
-
rewby
I generally leave it set to default and it works loads better than gzip
-
avoozl
well xz compresses my json best, but it is also slow to compress and decompress
-
avoozl
zstd is faster and compresses a bit less.. though the speed for me makes up for it
-
Sanqui
incidentally, the upcoming release of fedora comes with btrfs with zstd compression enabled by default
-
Sanqui
so i guess it's *really* good
-
avoozl
I do have btrfs, but have compression disabled as I use dedup a lot and they don't seem to mingle well
-
avoozl
as a quick compression comparison for my json (which contains quite a bit of plain text content):
paste.ofcode.org/q7PYhs6D8spbgc8KHm6San
-
Sanqui
303217943234215948.jsonl : 10.17% (146585063 => 14903289 bytes, 303217943234215948.jsonl.zst)
-
Sanqui
original is 146.6 MB, gzipped 17.4 MB, zst'd 14.9 MB
-
Sanqui
I guess that adds up
-
Sanqui
xz is waaaay slower but nets 11.2 MB
-
JAA
All of these comparisons are useless unless you mention the compression level. :-)
-
Sanqui
the defaults matter :)
-
JAA
zstd's defaults are more in favour of speed than of compression ratio compared to the other tools.
-
Sanqui
I understand that
-
Sanqui
I also tried zstd with -T0 --ultra -20 and it got to the same size that xz did
-
JAA
I've found that zstd -10 is about comparable in runtime to gzip -9. Depends a lot on the data of course.
-
JAA
And zstd -2 yielded similarly sized files as gzip -9 at a tenfold shorter runtime.
-
JAA
This was using log files and SQLite databases from ArchiveBot, FWIW.
-
lunik1
for absolute best compression, lrzip+zpaq gives me the best ratios, though it's too slow to be practical in most circumstances
-
JAA
Ultimately, one just needs to test all compression levels (and possibly extra settings like threading on zstd) and then analyse at what point the additional compression ratio is no longer worth the extra computational effort. The results will differ wildly depending on what you're compressing, what the bottleneck is, etc.
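One way to run such a sweep from Go, assuming github.com/klauspost/compress/zstd and a placeholder test file; note that this library exposes only four named encoder levels, which map to a subset of the C zstd levels:

    package main

    import (
        "fmt"
        "os"
        "time"

        "github.com/klauspost/compress/zstd"
    )

    func main() {
        src, err := os.ReadFile("sample.jsonl") // hypothetical test file
        if err != nil {
            panic(err)
        }
        levels := []zstd.EncoderLevel{
            zstd.SpeedFastest,
            zstd.SpeedDefault,
            zstd.SpeedBetterCompression,
            zstd.SpeedBestCompression,
        }
        // Compress the same input at each level and print ratio vs. wall time.
        for _, lvl := range levels {
            enc, err := zstd.NewWriter(nil, zstd.WithEncoderLevel(lvl))
            if err != nil {
                panic(err)
            }
            start := time.Now()
            dst := enc.EncodeAll(src, nil)
            enc.Close()
            fmt.Printf("%-24s %6.2f%% of original, %v\n",
                lvl.String(), 100*float64(len(dst))/float64(len(src)), time.Since(start))
        }
    }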
-
lunik1
and your definition of "no longer worth"
-
avoozl
Just to check: the yahoo answers warc does exclude images and other frills, right? Because that is a serious amount of text
-
avoozl
rewby: did you have any reasons for picking github.com/DataDog/zstd over github.com/valyala/gozstd ? I'm using the latter one
-
masterX244
avoozl: bricksetforum still uploading, 31/38 warcs up. total 190GB (since i captured embedded and referenced media, too)
-
HCross
avoozl: the warcs contain everything
-
HCross
The entire point of them is to give you enough data to recreate the experience of being on the site without being online
-
HCross
also: look into AWS S3 but using Requester Pays
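A hedged sketch of a requester-pays GET with aws-sdk-go-v2 (bucket and key are placeholders); whoever runs this is billed for the request and transfer, which is the hurdle avoozl mentioned earlier:

    package main

    import (
        "context"
        "fmt"
        "io"

        "github.com/aws/aws-sdk-go-v2/aws"
        "github.com/aws/aws-sdk-go-v2/config"
        "github.com/aws/aws-sdk-go-v2/service/s3"
        "github.com/aws/aws-sdk-go-v2/service/s3/types"
    )

    func main() {
        ctx := context.Background()
        cfg, err := config.LoadDefaultConfig(ctx)
        if err != nil {
            panic(err)
        }
        client := s3.NewFromConfig(cfg)

        // RequestPayer marks this as a requester-pays download.
        out, err := client.GetObject(ctx, &s3.GetObjectInput{
            Bucket:       aws.String("example-index-bucket"), // placeholder
            Key:          aws.String("forum.index"),          // placeholder
            RequestPayer: types.RequestPayerRequester,
        })
        if err != nil {
            panic(err)
        }
        defer out.Body.Close()

        n, _ := io.Copy(io.Discard, out.Body)
        fmt.Println("downloaded", n, "bytes")
    }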
-
HCross
-
tech234a
Side note: one of us should update the Docker Warrior tutorial to ensure that configuration is persisted across Watchtower updates