00:24:53 Is it allowed to slightly modify ArchiveTeam's logo?
00:25:25 i cant really go into specifics without it being obvious so please DM me and i can explain
00:25:30 in fact, ill just
00:27:09 ?
00:28:55 it's fine, i dmed JAA instead - please ignore me :P
00:29:17 Oh
02:52:26 -purplebot- Frequently Asked Questions edited by JustAnotherArchivist (+8, /* halp pls halp */ Fix WBM inclusion …) just now -- https://www.archiveteam.org/?diff=46532&oldid=45717
12:52:26 -purplebot- Template:Czech websites edited by Sanqui (+27, Add szm.com and call it Slovak …), Sanqui (-4, two lines) just now -- https://www.archiveteam.org/?diff=46534&oldid=46222
13:23:21 how big is the yahoo_answers collection getting? would I need all files within the archiveteam_yahooanswers collection to be complete?
13:23:47 There's 789 parts at the moment, but I'm not sure where to check status
13:24:15 avoozl1: you want to get a copy?
13:24:19 it'll be tens of TBs
13:24:23 maybe 100+
13:24:28 I would. I'm building a forum indexer and this seems like a good test case
13:24:35 tens of TBs I can handle, 100+ not so much
13:24:52 the archiveteam_yahooanswers collection contains data from an old yahooanswers project as well
13:25:08 ahh yeah I see, there's something from 2016 in there
13:25:15 and 2017
13:25:22 yeah, everything from 2021+ is from the current project
13:25:29 I can pick every id that starts with archiveteam_yahooanswers_2021
13:25:32 thanks
13:25:43 back then yahoo answers had a different structure (no horrible PUT requests for pagination)
13:26:11 is the archiveteam_yahooanswers_dictionary_16* part of the new run?
13:26:22 yes
13:26:42 everything is compressed with ZSTD with a dictionary
13:26:50 that dictionary is stored there for safekeeping
13:26:56 yeah I had some issues missing these in earlier downloads :) good to have these dicts
13:27:03 otherwise the data is useless :)
13:27:22 i mean the dictionary is in the ZST megaWARC as well
13:27:35 ohh ok. good
13:28:04 details here on how it's stored in the ZST WARC: https://github.com/ArchiveTeam/wget-lua/releases/tag/v1.20.3-at.20200401.01
13:28:13 I'm mostly using my own go-based tools, so I have to take care of some of these things every once in a while by myself
13:28:28 also contains details on deduplication
13:31:51 current size seems <10TB but I'm not sure how far it has gotten
13:33:16 I'll grab a few and start coding, it will take a very long time for this to trickle down into my local machine :)
13:33:21 Thanks!
13:43:50 avoozl1: If you're interested, I have a golang package to read the special zstd format
13:56:17 rewby: that'd be awesome. I'm currently using github.com/CorentinB/warc
14:00:12 avoozl1: here is my library. It's not the best library out there, but it does one thing I wasn't able to find another golang library do. Well, two things: it handles our zstd files and it does streaming IO. That is, I don't need to load an entire record into memory to process it. My own code uses that to skip past big media since I only care about html content
14:00:24 https://gitlab.roelf.org/warcscan/warcreader
14:02:37 Thanks, I'll take a look
14:03:21 I use this to pull multiple gigabits out of the IA and analyze the warcs for urls
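
To illustrate the streaming approach described above (process WARC records without buffering whole bodies in memory, and skip large media), here is a minimal, self-contained Go sketch over an uncompressed WARC using only the standard library. It is not the warcreader API, just the general pattern; for .warc.zst input you would additionally wrap the reader with a zstd decoder that knows about the embedded dictionary (see the wget-lua release notes linked above).

    package main

    import (
        "bufio"
        "fmt"
        "io"
        "os"
        "strconv"
        "strings"
    )

    // readWARC streams records from an uncompressed WARC file without ever
    // buffering a whole record body. Response records are handed to handle;
    // whatever the handler leaves unread (e.g. large media payloads it chose
    // to skip after peeking at the HTTP headers) is drained, not stored.
    func readWARC(r io.Reader, handle func(headers map[string]string, body io.Reader) error) error {
        br := bufio.NewReaderSize(r, 1<<20)
        for {
            // Version line, e.g. "WARC/1.0".
            version, err := br.ReadString('\n')
            if err == io.EOF {
                return nil
            }
            if err != nil {
                return err
            }
            version = strings.TrimSpace(version)
            if version == "" {
                continue // tolerate the blank lines between records
            }
            if !strings.HasPrefix(version, "WARC/") {
                return fmt.Errorf("unexpected record start: %q", version)
            }

            // Header block up to the first empty line.
            headers := map[string]string{}
            for {
                line, err := br.ReadString('\n')
                if err != nil {
                    return err
                }
                line = strings.TrimRight(line, "\r\n")
                if line == "" {
                    break
                }
                if i := strings.Index(line, ":"); i > 0 {
                    headers[strings.ToLower(strings.TrimSpace(line[:i]))] = strings.TrimSpace(line[i+1:])
                }
            }

            length, err := strconv.ParseInt(headers["content-length"], 10, 64)
            if err != nil {
                return fmt.Errorf("bad Content-Length: %v", err)
            }
            body := io.LimitReader(br, length)

            if headers["warc-type"] == "response" {
                if err := handle(headers, body); err != nil {
                    return err
                }
            }
            // Skip past anything not consumed by the handler.
            if _, err := io.Copy(io.Discard, body); err != nil {
                return err
            }
        }
    }

    func main() {
        if len(os.Args) < 2 {
            fmt.Fprintln(os.Stderr, "usage: warcdump <file.warc>")
            os.Exit(1)
        }
        f, err := os.Open(os.Args[1])
        if err != nil {
            panic(err)
        }
        defer f.Close()
        err = readWARC(f, func(h map[string]string, body io.Reader) error {
            fmt.Println(h["warc-target-uri"])
            return nil
        })
        if err != nil {
            panic(err)
        }
    }
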
14:04:51 Looks awesome. I could probably drop this into my codebase almost as is
14:05:25 I'm working on a bit of a hobby project to consume forum grabs in warc format and turn them into a searchable forum (using bleve/bluge as the on-disk index format)
14:05:34 Yeah, feel free to poke me if you need support for it
14:05:41 I haven't given it the time it requires lately, but it is in a functioning state
14:05:54 Cool!
14:06:22 basically it parses all the response bodies, runs them through goquery so you can use css selectors or xpath to extract the necessary parts at a thread/post level, and then builds a giant index with a bit of a simple web interface on top
14:06:36 That's neat!
14:06:51 it requires custom work for each forum of course, but the customization is fairly minimal.. I started from the league of legends forums
14:07:04 If you have any licensing concerns with my library, feel free to ask and we'll work it out
14:07:15 awesome, will do
14:07:32 That's really neat. I quite like doing things with previously archived data.
14:07:42 I'm a big fan too!
14:08:08 I've been thinking about how useful a "personalized forum search engine" would be to me
14:08:18 I've got this compute cluster building a huge urls database to help with discovery for new projects
14:08:26 oftentimes when searching for something, google is garbage and I'd rather know "what are people saying about this"
14:08:28 y'know?
14:09:02 so if I can help with this, perhaps contribute some forums, that'd be lovely
14:09:15 I'm planning on doing something similar with my Discord archives later
14:09:32 I'll get the code cleaned up a bit so it can be pushed out somewhere. I still have to complete some parts of the bleve->bluge switch
14:09:43 (the ingest is done but the search is still bleve-only)
14:09:50 Cool
14:10:03 I'd love to see the code
14:10:38 my Discord archiver is still in progress, but it's looking like it will scale to a few 100s of servers
14:11:13 rewby: just a peek right now, but this is what I implement at a forum level.. https://paste.ofcode.org/DKszjr3eNAy4HjEjzJSSp9
14:11:27 this is important because web forums are dying and "public discords" fill the same cultural niche today...
14:11:40 Neat
14:11:50 rewby: so this is just 'parse the response into an array of Bodies' and then the rest of the pipeline takes care of the indexing
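
The paste itself isn't part of this log, so purely as an illustration of the shape being described ("parse the response into an array of Bodies", with goquery doing the per-forum CSS-selector work), here is a hypothetical Go sketch. The Body fields, the Parser interface, and the selector names are invented for this example; the real code behind the paste almost certainly differs.

    package forum

    import (
        "io"
        "strings"
        "time"

        "github.com/PuerkitoBio/goquery"
    )

    // Body is one indexable unit (a post within a thread). The field names
    // here are illustrative, not the actual ones from the paste.
    type Body struct {
        ThreadTitle string
        Author      string
        Posted      time.Time
        Text        string
    }

    // Parser is what each forum has to implement: turn one archived HTML
    // response into zero or more Bodies. The rest of the pipeline
    // (deduplication, bleve/bluge indexing, the web UI) stays forum-agnostic.
    type Parser interface {
        Parse(url string, html io.Reader) ([]Body, error)
    }

    // exampleParser shows the goquery pattern for a hypothetical forum whose
    // pages use .thread-title / .post / .post-author / .post-body markup.
    type exampleParser struct{}

    func (exampleParser) Parse(url string, html io.Reader) ([]Body, error) {
        doc, err := goquery.NewDocumentFromReader(html)
        if err != nil {
            return nil, err
        }
        title := strings.TrimSpace(doc.Find(".thread-title").First().Text())
        var out []Body
        doc.Find(".post").Each(func(_ int, s *goquery.Selection) {
            out = append(out, Body{
                ThreadTitle: title,
                Author:      strings.TrimSpace(s.Find(".post-author").Text()),
                Text:        strings.TrimSpace(s.Find(".post-body").Text()),
            })
        })
        return out, nil
    }
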
14:12:32 Yeah, if you want "free" zstd support, feel free to use my code. My gitlab does have like 3-4 mins of downtime every few days while auto updates run, just as a warning
14:12:56 one of the big todos is to modify bluge so that the search index can be hosted as a remote file instead of a local one. Search performance would suffer, but that'd mean I could just put a large index somewhere http-accessible (or s3) and have people query it without any special server-side logic
14:13:01 pushing up a 2019 dump of forum.brickset.com atm (did that back then due to the forum being in danger of shutdown)
14:13:50 Yeah, that makes lots of sense. A remote index would make sense. Especially if there's no need to do server-side compute
14:14:08 yeah it'll just be a bunch of range-bytes style retrievals
14:14:58 Makes a lot of sense. I would personally avoid s3 or similar due to the bandwidth/per-request costs. If the files are small enough a vps or something could do the trick
14:15:39 typically I'm finding the index file ends up about as large as, or slightly larger than, the gz-compressed input text, so they are fairly large
14:16:00 Ah hm.
14:16:10 Try zstd. It does wonders
14:16:15 rewby: you can always opt to do requester-pays on s3, but I guess for some users that'd be a hurdle
14:16:31 Yeah, I can imagine
14:16:51 I'd almost wonder if the IA would be willing to host the index files
14:16:55 rewby: yeah zstd compresses better, but the index itself can't really be compressed well, at least not without sacrificing a ton of performance.. I've been chatting with the author of bluge but it wasn't really on their radar as a use case so it takes some time to look at the options there
14:17:25 Ah that sucks. I'm personally lucky my warc work compresses well
14:17:38 20 billion urls in less than 200G
14:17:38 (if I compress it I loose proper seek performance, and it needs pretty fine-grained retrieval.. I tried 8k or 64k block-based compression but it wasn't that great for this usecase)
14:18:04 s/loose/lose/
14:18:10 Yeah no, makes sense
14:18:56 my json stuff also compresses like crazy with zstd
14:19:11 Yeah, zstd is bloody magic
14:19:18 I've been compressing jsonl with gzip
14:19:21 should I look into zstd?
14:19:30 I'd say you should
14:19:42 I got like 3-4x better compression out of zstd
14:19:45 does zstd have better indexing support?
14:19:47 With no more overhead
14:20:09 I dunno about indexing. I mostly do streaming IO
14:20:17 Zstd is magic, we use it at work for VM backups/snapshots
14:22:10 I've used zstd with a 64kb block size, trained a dict on that, and then compressed each block so we can effectively seek (with an offset table stored next to it).. That still gives pretty good compression for some of our data files (better than compressing each block individually without a dictionary).. but only when the file is pretty monotonous
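
As a rough sketch of the seekable scheme just described (fixed-size blocks, each compressed independently with a shared trained dictionary, plus an offset table for random access), here is what that could look like in Go. The library choice (github.com/klauspost/compress/zstd), the in-memory layout, and the function names are my own assumptions for illustration; the chat itself only mentions other zstd bindings, and the real on-disk format is not specified here.

    package seekzstd

    import "github.com/klauspost/compress/zstd"

    const blockSize = 64 << 10 // 64 KiB uncompressed blocks, as discussed above

    // Compressed holds independently compressed blocks plus the offset table
    // that makes random access possible. In practice the offsets and the
    // dictionary would be written out next to the data file; kept in memory
    // here for brevity.
    type Compressed struct {
        Blocks  []byte  // concatenated zstd frames
        Offsets []int64 // Offsets[i] = start of block i within Blocks
        Dict    []byte  // the shared trained dictionary
    }

    // Compress splits data into fixed-size blocks and compresses each block as
    // its own zstd frame using the shared dictionary, so any block can later
    // be decompressed without touching its neighbours.
    func Compress(data, dict []byte) (*Compressed, error) {
        enc, err := zstd.NewWriter(nil,
            zstd.WithEncoderLevel(zstd.SpeedBetterCompression),
            zstd.WithEncoderDict(dict))
        if err != nil {
            return nil, err
        }
        defer enc.Close()

        c := &Compressed{Dict: dict}
        for off := 0; off < len(data); off += blockSize {
            end := off + blockSize
            if end > len(data) {
                end = len(data)
            }
            c.Offsets = append(c.Offsets, int64(len(c.Blocks)))
            c.Blocks = enc.EncodeAll(data[off:end], c.Blocks)
        }
        return c, nil
    }

    // ReadAt decompresses only the blocks covering [pos, pos+n) and returns
    // the requested bytes; this is the "seek" part of the scheme.
    func (c *Compressed) ReadAt(pos, n int64) ([]byte, error) {
        dec, err := zstd.NewReader(nil, zstd.WithDecoderDicts(c.Dict))
        if err != nil {
            return nil, err
        }
        defer dec.Close()

        first := pos / blockSize
        last := (pos + n - 1) / blockSize
        var buf []byte
        for b := first; b <= last && int(b) < len(c.Offsets); b++ {
            start := c.Offsets[b]
            end := int64(len(c.Blocks))
            if int(b+1) < len(c.Offsets) {
                end = c.Offsets[b+1]
            }
            plain, err := dec.DecodeAll(c.Blocks[start:end], nil)
            if err != nil {
                return nil, err
            }
            buf = append(buf, plain...)
        }
        skip := pos - first*blockSize
        if skip >= int64(len(buf)) {
            return nil, nil
        }
        if skip+n > int64(len(buf)) {
            n = int64(len(buf)) - skip
        }
        return buf[skip : skip+n], nil
    }

The trade-off mirrors the chat: smaller blocks mean cheaper random reads but a worse ratio, which is exactly why the shared dictionary matters.
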
14:22:59 I generally leave it set to the default and it works loads better than gzip
14:23:45 well xz compresses my json best, but it is also slow to compress and decompress
14:24:02 zstd is faster and compresses a bit less.. though the speed for me makes up for it
14:25:12 incidentally, the upcoming release of fedora comes with btrfs with zstd compression enabled by default
14:25:21 so i guess it's *really* good
14:25:45 I do have btrfs, but have compression disabled as I use dedup a lot and they don't seem to mingle well
14:26:45 as a quick compression comparison for my json (which contains quite a bit of plain text content): https://paste.ofcode.org/q7PYhs6D8spbgc8KHm6San
14:28:23 303217943234215948.jsonl : 10.17% (146585063 => 14903289 bytes, 303217943234215948.jsonl.zst)
14:28:51 original is 146.6 MB, gzipped 17.4 MB, zst'd 14.9 MB
14:29:05 I guess that adds up
14:30:15 xz is waaaay slower but nets 11.2 MB
14:39:18 All of these comparisons are useless unless you mention the compression level. :-)
14:40:30 the defaults matter :)
14:40:32 zstd's defaults are more in favour of speed than of compression ratio compared to the other tools.
14:40:55 I understand that
14:41:12 I also tried zstd with -T0 --ultra -20 and it got to the same size that xz did
14:41:24 I've found that zstd -10 is about comparable in runtime to gzip -9. Depends a lot on the data, of course.
14:42:35 And zstd -2 yielded similarly sized files as gzip -9 at a tenfold shorter runtime.
14:43:05 This was using log files and SQLite databases from ArchiveBot, FWIW.
14:45:44 for absolute best compression lrzip+zpaq gives me the best ratios, but it's too slow to be practical in most circumstances
14:47:43 Ultimately, one just needs to test all compression levels (and possibly extra settings like threading on zstd) and then analyse at what point the additional compression ratio is no longer worth the extra computational effort. The results will differ wildly depending on what you're compressing, what the bottleneck is, etc.
14:48:10 and your definition of "no longer worth"
14:58:47 Just a check: the yahoo answers warcs do exclude images and other frills, right? Because that is some serious amount of text
17:25:29 rewby: did you have any reason for picking github.com/DataDog/zstd over github.com/valyala/gozstd? I'm using the latter
21:34:42 avoozl: bricksetforum is still uploading, 31/38 warcs up. 190GB total (since i captured embedded and referenced media, too)
21:42:09 avoozl: the warcs contain everything
21:42:31 The entire point of them is to give you enough data to recreate the experience of being on the site without being online
21:43:03 also: look into AWS S3 but using Requester Pays
21:44:44 https://docs.aws.amazon.com/AmazonS3/latest/userguide/RequesterPaysBuckets.html
21:46:00 Side note: one of us should update the Docker Warrior tutorial to ensure that configuration is persisted across Watchtower updates
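
Tying the Requester Pays suggestion above to the earlier idea of serving a search index via range requests, here is a hedged Go sketch using the AWS SDK for Go (aws-sdk-go v1). The bucket name, key, region, and byte range are placeholders; only the RequestPayer/Range fields are the point.

    package main

    import (
        "fmt"
        "io"
        "os"

        "github.com/aws/aws-sdk-go/aws"
        "github.com/aws/aws-sdk-go/aws/session"
        "github.com/aws/aws-sdk-go/service/s3"
    )

    func main() {
        // The requester's own credentials are used and billed for the
        // transfer; the bucket owner only pays for storage (see the
        // Requester Pays docs linked above).
        sess := session.Must(session.NewSession(&aws.Config{
            Region: aws.String("us-east-1"), // placeholder region
        }))
        svc := s3.New(sess)

        out, err := svc.GetObject(&s3.GetObjectInput{
            Bucket:       aws.String("example-index-bucket"),   // placeholder
            Key:          aws.String("forums/index.bluge"),     // placeholder
            Range:        aws.String("bytes=0-65535"),          // fetch just one chunk of the index
            RequestPayer: aws.String(s3.RequestPayerRequester), // opt in to paying for the request
        })
        if err != nil {
            fmt.Fprintln(os.Stderr, "get failed:", err)
            os.Exit(1)
        }
        defer out.Body.Close()

        n, _ := io.Copy(io.Discard, out.Body)
        fmt.Printf("fetched %d bytes from a requester-pays bucket\n", n)
    }
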