00:24:53 Is it allowed to slightly modify ArchiveTeam's logo?
00:25:25 i cant really go into specifics without it being obvious so please DM me and i can explain
00:25:30 in fact, ill just
00:27:09 ?
00:28:55 it's fine, i dmed JAA instead - please ignore me :P
00:29:17 Oh
02:52:26 -purplebot- Frequently Asked Questions edited by JustAnotherArchivist (+8, /* halp pls halp */ Fix WBM inclusion …) just now -- https://www.archiveteam.org/?diff=46532&oldid=45717
12:52:26 -purplebot- Template:Czech websites edited by Sanqui (+27, Add szm.com and call it Slovak …), Sanqui (-4, two lines) just now -- https://www.archiveteam.org/?diff=46534&oldid=46222
13:23:21 how big is the yahoo_answers collection getting? would I need all files within the archiveteam_yahooanswers collection to be complete?
13:23:47 There's 789 parts at the moment, but I'm not sure where to check status
13:24:15 avoozl1: you want to get a copy?
13:24:19 it'll be tens of TBs
13:24:23 maybe 100+
13:24:28 I would. I'm building a forum indexer and this seems like a good test case
13:24:35 tens of TBs I can handle, 100+ not so much
13:24:52 the archiveteam_yahooanswers collection contains data from an old yahooanswers project as well
13:25:08 ahh yeah I see, there's something from 2016 in there
13:25:15 and 2017
13:25:22 yeah, everything from 2021+ is from the current project
13:25:29 I can pick every id that starts with archiveteam_yahooanswers_2021
13:25:32 thanks
13:25:43 back then yahoo answers had a different structure (no horrible PUT requests for pagination)
13:26:11 is the archiveteam_yahooanswers_dictionary_16* part of the new run?
13:26:22 yes
13:26:42 everything is compressed with ZSTD with a dictionary
13:26:50 that dictionary is stored there for safekeeping
13:26:56 yeah I had some issues missing these in earlier downloads :) good to have these dicts
13:27:03 otherwise the data is useless :)
13:27:22 i mean the dictionary is in the ZST megaWARC as well
13:27:35 ohh ok. good
13:28:04 details here on how it's stored in the ZST WARC: https://github.com/ArchiveTeam/wget-lua/releases/tag/v1.20.3-at.20200401.01
13:28:13 I'm mostly using my own go-based tools, so I have to take care of some of these things every once in a while by myself
13:28:28 also contains details on deduplication
13:31:51 current size seems <10TB but I'm not sure how far it has gotten
13:33:16 I'll grab a few and start coding, it will take a very long time for this to trickle down into my local machine :)
13:33:21 Thanks!
13:43:50 avoozl1: If you're interested, I have a golang package to read the special zstd format
13:56:17 rewby: that'd be awesome. I'm currently using github.com/CorentinB/warc
14:00:12 avoozl1: here is my library. It's not the best library out there, but it does one thing I wasn't able to find another golang library do. Well, two things: it handles our zstd files and it does streaming IO. That is, I don't need to load an entire record into memory to process it. My own code uses that to skip past big media since I only care about html content
14:00:24 https://gitlab.roelf.org/warcscan/warcreader
14:02:37 Thanks, I'll take a look
14:03:21 I use this to pull multiple gigabits out of the IA and analyze the warcs for urls
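
To illustrate the streaming approach described above (process WARC records without buffering whole bodies in memory, and skip large media), here is a minimal, self-contained Go sketch over an uncompressed WARC using only the standard library. It is not the warcreader API, just the general pattern; for .warc.zst input you would additionally wrap the reader with a zstd decoder that knows about the embedded dictionary (see the wget-lua release notes linked above).

    package main

    import (
        "bufio"
        "fmt"
        "io"
        "os"
        "strconv"
        "strings"
    )

    // readWARC streams records from an uncompressed WARC file without ever
    // buffering a whole record body. Response records are handed to handle;
    // whatever the handler leaves unread (e.g. large media payloads it chose
    // to skip after peeking at the HTTP headers) is drained, not stored.
    func readWARC(r io.Reader, handle func(headers map[string]string, body io.Reader) error) error {
        br := bufio.NewReaderSize(r, 1<<20)
        for {
            // Version line, e.g. "WARC/1.0".
            version, err := br.ReadString('\n')
            if err == io.EOF {
                return nil
            }
            if err != nil {
                return err
            }
            version = strings.TrimSpace(version)
            if version == "" {
                continue // tolerate the blank lines between records
            }
            if !strings.HasPrefix(version, "WARC/") {
                return fmt.Errorf("unexpected record start: %q", version)
            }

            // Header block up to the first empty line.
            headers := map[string]string{}
            for {
                line, err := br.ReadString('\n')
                if err != nil {
                    return err
                }
                line = strings.TrimRight(line, "\r\n")
                if line == "" {
                    break
                }
                if i := strings.Index(line, ":"); i > 0 {
                    headers[strings.ToLower(strings.TrimSpace(line[:i]))] = strings.TrimSpace(line[i+1:])
                }
            }

            length, err := strconv.ParseInt(headers["content-length"], 10, 64)
            if err != nil {
                return fmt.Errorf("bad Content-Length: %v", err)
            }
            body := io.LimitReader(br, length)

            if headers["warc-type"] == "response" {
                if err := handle(headers, body); err != nil {
                    return err
                }
            }
            // Skip past anything not consumed by the handler.
            if _, err := io.Copy(io.Discard, body); err != nil {
                return err
            }
        }
    }

    func main() {
        if len(os.Args) < 2 {
            fmt.Fprintln(os.Stderr, "usage: warcdump <file.warc>")
            os.Exit(1)
        }
        f, err := os.Open(os.Args[1])
        if err != nil {
            panic(err)
        }
        defer f.Close()
        err = readWARC(f, func(h map[string]string, body io.Reader) error {
            fmt.Println(h["warc-target-uri"])
            return nil
        })
        if err != nil {
            panic(err)
        }
    }
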
14:04:51 Looks awesome. I could probably drop this into my codebase almost as is
14:05:25 I'm working on a bit of a hobby project to consume forum grabs in warc format and turn them into a searchable forum (using bleve/bluge as the on-disk index format)
14:05:34 Yeah, feel free to poke me if you need support for it
14:05:41 I haven't given it the time it requires lately, but it is in a functioning state
14:05:54 Cool!
14:06:22 basically it parses all the response bodies, runs them through goquery so you can use css selectors or xpath to extract the necessary parts at a thread/post level, and then builds a giant index with a bit of a simple web interface on top
14:06:36 That's neat!
14:06:51 it requires custom work for each forum of course, but the customization is fairly minimal.. I started from the league of legends forums
14:07:04 If you have any licensing concerns with my library, feel free to ask and we'll work it out
14:07:15 awesome, will do
14:07:32 That's really neat. I quite like doing things with previously archived data.
14:07:42 I'm a big fan too!
14:08:08 I've been thinking about how useful a "personalized forum search engine" would be to me
14:08:18 I've got this compute cluster building a huge urls database to help with discovery for new projects
14:08:26 oftentimes when searching for something, google is garbage and I'd rather know "what are people saying about this"
14:08:28 y'know?
14:09:02 so if I can help with this, perhaps contribute some forums, that'd be lovely
14:09:15 I'm planning on doing something similar with my Discord archives later
14:09:32 I'll get the code cleaned up a bit so it can be pushed out somewhere. I still have to complete some parts of the bleve->bluge switch
14:09:43 (the ingest is done but the search is still bleve-only)
14:09:50 Cool
14:10:03 I'd love to see the code
14:10:38 my Discord archiver is still in progress, but it's looking like it will scale to a few 100s of servers
14:11:13 rewby: just a peek right now, but this is what I implement at a forum level.. https://paste.ofcode.org/DKszjr3eNAy4HjEjzJSSp9
14:11:27 this is important because web forums are dying and "public discords" fill the same cultural niche today...
14:11:40 Neat
14:11:50 rewby: so this is just 'parse the response into an array of Bodies' and then the rest of the pipeline takes care of the indexing
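
The paste itself isn't part of this log, so purely as an illustration of the shape being described ("parse the response into an array of Bodies", with goquery doing the per-forum CSS-selector work), here is a hypothetical Go sketch. The Body fields, the Parser interface, and the selector names are invented for this example; the real code behind the paste almost certainly differs.

    package forum

    import (
        "io"
        "strings"
        "time"

        "github.com/PuerkitoBio/goquery"
    )

    // Body is one indexable unit (a post within a thread). The field names
    // here are illustrative, not the actual ones from the paste.
    type Body struct {
        ThreadTitle string
        Author      string
        Posted      time.Time
        Text        string
    }

    // Parser is what each forum has to implement: turn one archived HTML
    // response into zero or more Bodies. The rest of the pipeline
    // (deduplication, bleve/bluge indexing, the web UI) stays forum-agnostic.
    type Parser interface {
        Parse(url string, html io.Reader) ([]Body, error)
    }

    // exampleParser shows the goquery pattern for a hypothetical forum whose
    // pages use .thread-title / .post / .post-author / .post-body markup.
    type exampleParser struct{}

    func (exampleParser) Parse(url string, html io.Reader) ([]Body, error) {
        doc, err := goquery.NewDocumentFromReader(html)
        if err != nil {
            return nil, err
        }
        title := strings.TrimSpace(doc.Find(".thread-title").First().Text())
        var out []Body
        doc.Find(".post").Each(func(_ int, s *goquery.Selection) {
            out = append(out, Body{
                ThreadTitle: title,
                Author:      strings.TrimSpace(s.Find(".post-author").Text()),
                Text:        strings.TrimSpace(s.Find(".post-body").Text()),
            })
        })
        return out, nil
    }
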
14:12:32 Yeah, if you want "free" zstd support, feel free to use my code. My gitlab does have like 3-4 mins of downtime every few days while auto updates run, just as a warning
14:12:56 one of the big todos is to modify bluge so that the search index can be hosted as a remote file instead of a local one. Search performance would suffer, but that'd mean I could just put a large index somewhere http-accessible (or s3) and have people query it without any special server-side logic
14:13:01 pushing up a 2019 dump of forum.brickset.com atm (did that back then due to the forum being in danger of shutdown)
14:13:50 Yeah, that makes lots of sense. A remote index would make sense. Especially if there's no need to do server-side compute
14:14:08 yeah it'll just be a bunch of range-bytes style retrievals
14:14:58 Makes a lot of sense. I would personally avoid s3 or similar due to the bandwidth/per-request costs. If the files are small enough a vps or something could do the trick
14:15:39 typically I'm finding the index file ends up about as large as, or slightly larger than, the gz-compressed input text, so they are fairly large
14:16:00 Ah hm.
14:16:10 Try zstd. It does wonders
14:16:15 rewby: you can always opt to do requester-pays on s3, but I guess for some users that'd be a hurdle
14:16:31 Yeah, I can imagine
14:16:51 I'd almost wonder if the IA would be willing to host the index files
14:16:55 rewby: yeah zstd compresses better, but the index itself can't really be compressed well, at least not without sacrificing a ton of performance.. I've been chatting with the author of bluge but it wasn't really on their radar as a use case so it takes some time to look at the options there
14:17:25 Ah that sucks. I'm personally lucky my warc work compresses well
14:17:38 20 billion urls in less than 200G
14:17:38 (if I compress it I loose proper seek performance, and it needs pretty fine-grained retrieval.. I tried 8k or 64k block-based compression but it wasn't that great for this usecase)
14:18:04 s/loose/lose/
14:18:10 Yeah no, makes sense
14:18:56 my json stuff also compresses like crazy with zstd
14:19:11 Yeah, zstd is bloody magic
14:19:18 I've been compressing jsonl with gzip
14:19:21 should I look into zstd?
14:19:30 I'd say you should
14:19:42 I got like 3-4x better compression out of zstd
14:19:45 does zstd have better indexing support?
14:19:47 With no more overhead
14:20:09 I dunno about indexing. I mostly do streaming IO
14:20:17 Zstd is magic, we use it at work for VM backups/snapshots
14:22:10 I've used zstd with a 64kb block size, trained a dict on that, and then compressed each block so we can effectively seek (with an offset table stored next to it).. That still gives pretty good compression for some of our data files (better than compressing each block individually without a dictionary).. but only when the file is pretty monotonous
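
As a rough sketch of the seekable scheme just described (fixed-size blocks, each compressed independently with a shared trained dictionary, plus an offset table for random access), here is what that could look like in Go. The library choice (github.com/klauspost/compress/zstd), the in-memory layout, and the function names are my own assumptions for illustration; the chat itself only mentions other zstd bindings, and the real on-disk format is not specified here.

    package seekzstd

    import "github.com/klauspost/compress/zstd"

    const blockSize = 64 << 10 // 64 KiB uncompressed blocks, as discussed above

    // Compressed holds independently compressed blocks plus the offset table
    // that makes random access possible. In practice the offsets and the
    // dictionary would be written out next to the data file; kept in memory
    // here for brevity.
    type Compressed struct {
        Blocks  []byte  // concatenated zstd frames
        Offsets []int64 // Offsets[i] = start of block i within Blocks
        Dict    []byte  // the shared trained dictionary
    }

    // Compress splits data into fixed-size blocks and compresses each block as
    // its own zstd frame using the shared dictionary, so any block can later
    // be decompressed without touching its neighbours.
    func Compress(data, dict []byte) (*Compressed, error) {
        enc, err := zstd.NewWriter(nil,
            zstd.WithEncoderLevel(zstd.SpeedBetterCompression),
            zstd.WithEncoderDict(dict))
        if err != nil {
            return nil, err
        }
        defer enc.Close()

        c := &Compressed{Dict: dict}
        for off := 0; off < len(data); off += blockSize {
            end := off + blockSize
            if end > len(data) {
                end = len(data)
            }
            c.Offsets = append(c.Offsets, int64(len(c.Blocks)))
            c.Blocks = enc.EncodeAll(data[off:end], c.Blocks)
        }
        return c, nil
    }

    // ReadAt decompresses only the blocks covering [pos, pos+n) and returns
    // the requested bytes; this is the "seek" part of the scheme.
    func (c *Compressed) ReadAt(pos, n int64) ([]byte, error) {
        dec, err := zstd.NewReader(nil, zstd.WithDecoderDicts(c.Dict))
        if err != nil {
            return nil, err
        }
        defer dec.Close()

        first := pos / blockSize
        last := (pos + n - 1) / blockSize
        var buf []byte
        for b := first; b <= last && int(b) < len(c.Offsets); b++ {
            start := c.Offsets[b]
            end := int64(len(c.Blocks))
            if int(b+1) < len(c.Offsets) {
                end = c.Offsets[b+1]
            }
            plain, err := dec.DecodeAll(c.Blocks[start:end], nil)
            if err != nil {
                return nil, err
            }
            buf = append(buf, plain...)
        }
        skip := pos - first*blockSize
        if skip >= int64(len(buf)) {
            return nil, nil
        }
        if skip+n > int64(len(buf)) {
            n = int64(len(buf)) - skip
        }
        return buf[skip : skip+n], nil
    }

The trade-off mirrors the chat: smaller blocks mean cheaper random reads but a worse ratio, which is exactly why the shared dictionary matters.
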
14:22:59 I generally leave it set to the default and it works loads better than gzip
14:23:45 well xz compresses my json best, but it is also slow to compress and decompress
14:24:02 zstd is faster and compresses a bit less.. though the speed for me makes up for it
14:25:12 incidentally, the upcoming release of fedora comes with btrfs with zstd compression enabled by default
14:25:21 so i guess it's *really* good
14:25:45 I do have btrfs, but have compression disabled as I use dedup a lot and they don't seem to mingle well
14:26:45 as a quick compression comparison for my json (which contains quite a bit of plain text content): https://paste.ofcode.org/q7PYhs6D8spbgc8KHm6San
14:28:23 303217943234215948.jsonl : 10.17% (146585063 => 14903289 bytes, 303217943234215948.jsonl.zst)
14:28:51 original is 146.6 MB, gzipped 17.4 MB, zst'd 14.9 MB
14:29:05 I guess that adds up
14:30:15 xz is waaaay slower but nets 11.2 MB
14:39:18 All of these comparisons are useless unless you mention the compression level. :-)
14:40:30 the defaults matter :)
14:40:32 zstd's defaults are more in favour of speed than of compression ratio compared to the other tools.
14:40:55 I understand that
14:41:12 I also tried zstd with -T0 --ultra -20 and it got to the same size that xz did
14:41:24 I've found that zstd -10 is about comparable in runtime to gzip -9. Depends a lot on the data, of course.
14:42:35 And zstd -2 yielded similarly sized files as gzip -9 at a tenfold shorter runtime.
14:43:05 This was using log files and SQLite databases from ArchiveBot, FWIW.
14:45:44 for absolute best compression lrzip+zpaq gives me the best ratios, but it's too slow to be practical in most circumstances
14:47:43 Ultimately, one just needs to test all compression levels (and possibly extra settings like threading on zstd) and then analyse at what point the additional compression ratio is no longer worth the extra computational effort. The results will differ wildly depending on what you're compressing, what the bottleneck is, etc.
14:48:10 and your definition of "no longer worth"
14:58:47 Just a check: the yahoo answers warcs do exclude images and other frills, right? Because that is some serious amount of text
17:25:29 rewby: did you have any reason for picking github.com/DataDog/zstd over github.com/valyala/gozstd? I'm using the latter
21:34:42 avoozl: bricksetforum is still uploading, 31/38 warcs up. 190GB total (since i captured embedded and referenced media, too)
21:42:09 avoozl: the warcs contain everything
21:42:31 The entire point of them is to give you enough data to recreate the experience of being on the site without being online
21:43:03 also: look into AWS S3 but using Requester Pays
21:44:44 https://docs.aws.amazon.com/AmazonS3/latest/userguide/RequesterPaysBuckets.html
21:46:00 Side note: one of us should update the Docker Warrior tutorial to ensure that configuration is persisted across Watchtower updates
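
Tying the Requester Pays suggestion above to the earlier idea of serving a search index via range requests, here is a hedged Go sketch using the AWS SDK for Go (aws-sdk-go v1). The bucket name, key, region, and byte range are placeholders; only the RequestPayer/Range fields are the point.

    package main

    import (
        "fmt"
        "io"
        "os"

        "github.com/aws/aws-sdk-go/aws"
        "github.com/aws/aws-sdk-go/aws/session"
        "github.com/aws/aws-sdk-go/service/s3"
    )

    func main() {
        // The requester's own credentials are used and billed for the
        // transfer; the bucket owner only pays for storage (see the
        // Requester Pays docs linked above).
        sess := session.Must(session.NewSession(&aws.Config{
            Region: aws.String("us-east-1"), // placeholder region
        }))
        svc := s3.New(sess)

        out, err := svc.GetObject(&s3.GetObjectInput{
            Bucket:       aws.String("example-index-bucket"),   // placeholder
            Key:          aws.String("forums/index.bluge"),     // placeholder
            Range:        aws.String("bytes=0-65535"),          // fetch just one chunk of the index
            RequestPayer: aws.String(s3.RequestPayerRequester), // opt in to paying for the request
        })
        if err != nil {
            fmt.Fprintln(os.Stderr, "get failed:", err)
            os.Exit(1)
        }
        defer out.Body.Close()

        n, _ := io.Copy(io.Discard, out.Body)
        fmt.Printf("fetched %d bytes from a requester-pays bucket\n", n)
    }
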