03:38:16 EggplantN: I was corrected once I got over to #archiveteam-dev. This is actually the correct place for the discussion I wanted. That one's for devs only, to talk about the code itself.
03:54:20 Yes, this is the better place for that discussion, but no, #archiveteam-dev isn't really for "devs only". You're free to idle there and ask relevant questions.
03:54:38 If it were really for devs only, it'd be put in moderated mode or something.
05:50:31 I think JAA's answer in -dev is a good way to understand what each channel is intended for, when it comes to coding: (-dev) "channel is for software development. As in, development of the software we've been using for years, like changes to wget-at or the tracker. Project stuff should go into -bs (or the project-specific channel if there is one)."
06:06:17 https://www.msn.com/en-us/news/politics/pro-trump-discussion-board-faces-possible-shutdown/ar-BB1cOtaE
06:06:47 "Robert Davis, senior vice president of Epik Inc., told The Wall Street Journal his firm warned TheDonald.Win it might be dropped within days if it fails to better cull what he said are discussions glorifying violence, propagating white supremacy and fomenting extremism."
06:07:40 Note that Epik hosts Gab and some other content that has been kicked off other platforms.
06:08:15 https://en.wikipedia.org/wiki/Epik_(company)
06:18:35 -purplebot- Fast.io edited by Wickedplayer494 (+153, Not much of it done, but it's done) just now -- https://www.archiveteam.org/?diff=46194&oldid=45948
06:19:35 -purplebot- Current Projects edited by Wickedplayer494 (+0, Fast.io to MtM) just now -- https://www.archiveteam.org/?diff=46195&oldid=46177
06:48:35 -purplebot- Current Projects edited by Wickedplayer494 (+0, MediaFire to scripts only) 23 minutes ago -- https://www.archiveteam.org/?diff=46196&oldid=46195
08:10:05 Is there any specific trick that I can use to detect hosts that do not respect partial retrievals?
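One way to approach the partial-retrieval question (a sketch, not anything from the log; the helper names are made up for illustration): send a one-byte `Range` request and check whether the server answers 206 Partial Content with a matching `Content-Range`, or just 200 with the full body — the latter is what makes resumed downloads restart from scratch.

```python
# Sketch: probe whether a host honors HTTP range requests. A server that
# supports partial retrieval answers "Range: bytes=0-0" with 206 Partial
# Content and a Content-Range header; one that ignores the header sends
# 200 OK and the whole body.
import urllib.request

def honors_ranges(status: int, headers: dict) -> bool:
    """Decide from a response's status and headers whether the range was respected."""
    content_range = {k.lower(): v for k, v in headers.items()}.get("content-range", "")
    return status == 206 and content_range.startswith("bytes 0-0/")

def probe(url: str) -> bool:
    """Fetch a single byte and report whether the host supports partial retrieval."""
    req = urllib.request.Request(url, headers={"Range": "bytes=0-0"})
    with urllib.request.urlopen(req) as resp:
        return honors_ranges(resp.status, dict(resp.headers))
```

Some servers also advertise support via an `Accept-Ranges: bytes` header, but actually issuing a range request is the more reliable test.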
I keep getting some zip files that are just restarting from scratch in the middle of the file due to wget trying to continue them.
13:22:53 Sanqui, AAP: having some fun parsing one of the WARC files for the LoL forums scrape. Currently using some hardcoded logic to extract postid/title/username/userurl/body of each post, and pushing those into a local db. Seems to go reasonably well.
13:23:19 I'll have to have a closer look to figure out how to make this configurable in a way that doesn't involve changing code.
13:23:46 that's awesome!
13:24:44 I wouldn't attempt to go that way -- you can't make the configuration/parser general enough without eventually reinventing some kind of programming language. Best to keep it in code, just as encapsulated and straightforward as possible.
13:25:32 True, maybe that's overthinking it.
13:26:08 Make some sort of classes for phpBB2, phpBB3, Simple Machines, etc. forums, with easy overrides.
13:26:14 I'm currently pushing it all into blevesearch, which is a self-contained search engine, a bit like what SQLite is to SQL... Just to test the scale and performance. Would be great if we can use it, since it keeps things rather portable.
13:26:22 yeah, phpBB should be easy.
13:26:41 The only thing I have a minor headache over is the handling of 'quoted' text within message bodies...
that is rather tricky to get right
13:27:05 Right now I just ignore that and consume the entire text, but from parsing email boxes in the past I know that isn't the correct way to go.
13:27:17 I wrote a reasonably easy ZetaBoards scraper once (but it utilized the admin interface for user data): https://github.com/Sanqui/zetaboards-scrape/blob/master/scrape.py
13:28:11 I've written scrapers with go-colly (http://go-colly.org), which is quite neat and compact.
13:28:33 Historically I've used requests with BeautifulSoup, but recently I discovered Scrapy, which is nice for smaller projects.
13:28:42 I'm a Python person tho.
13:29:19 yeah, I've done a ton of Python in the past
13:29:33 In general it doesn't really matter what you use, imo; I can see myself using a Go program if the "forum" class is readable.
13:29:52 What I like about go-colly is that it is really, really performant and easy to compile to a single executable to run elsewhere. Also, coordinating multiple scrapers with a central Redis queue is trivial.
13:30:22 But I'll make sure to make this easy to extend. Right now I'm just looking to get my bearings on the performance and API.
13:30:47 Do you have any experience with building simple frontends on top of REST/GraphQL? I have used Vue in the past, but I'm not really a front-end person.
13:31:24 Sadly nah; I've attempted to learn Vue briefly, but it really didn't suit me. I'm a static HTML, maybe with Bootstrap and jQuery, kind of person.
13:31:35 Because I would love for this forum browser to be a bit decoupled from the underlying storage, so that I just provide a self-hosted API with 'getpost', 'getthread', 'search', and a few other things around it.
13:31:59 I'd probably write the thing in Flask lol
13:32:11 yeah, Flask is fine for a lot of things too
13:32:50 Right now I'm just busy ingesting forums.eune.leagueoflegends.com-00000.warc.gz as a test... that is around 25 GB uncompressed; just checking how large the final search index will be.
13:33:34 I'll be curious to hear!
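The "one class per forum engine, with easy overrides" idea suggested above might be sketched like this. Everything here is hypothetical: the class names and markup patterns are invented for illustration and don't correspond to real phpBB or SMF HTML, and a real scraper would use a proper HTML parser rather than regexes.

```python
# Hypothetical sketch of per-engine parser classes with easy overrides.
# Subclasses only swap out the pattern that matches their engine's markup.
import re
from dataclasses import dataclass

@dataclass
class Post:
    post_id: str
    body: str

class ForumParser:
    # Subclasses override this pattern for their engine's markup.
    POST_RE = re.compile(
        r'<div class="post" id="p(?P<id>\d+)">(?P<body>.*?)</div>', re.S)

    def parse_posts(self, html):
        """Yield a Post for every post found in an HTML page."""
        for m in self.POST_RE.finditer(html):
            yield Post(post_id=m.group("id"), body=m.group("body").strip())

class PhpBB3Parser(ForumParser):
    pass  # the default pattern stands in for phpBB3-ish markup here

class SMFParser(ForumParser):
    # A different engine only needs to swap the pattern.
    POST_RE = re.compile(
        r'<td class="msg" data-msg="(?P<id>\d+)">(?P<body>.*?)</td>', re.S)
```

The point of the structure is that per-forum quirks (custom tags, odd pagination) live in one small subclass instead of a general configuration language.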
13:34:30 I'm currently on my 2011 MacBook Air, so things take a rather extreme amount of time to complete, but once it runs OK on this system, it'll run like lightning everywhere else.
13:45:24 Sanqui: in your experience, are posts often identifiable with a unique identifier? With the LoL scrape there is a nice id= field in the HTML that contains a unique id for each post, but I can imagine that isn't the case everywhere.
13:46:21 very typically, but not always. Some caveats I can think of:
13:47:08 - some forums may reuse ids for posts that were deleted (meaning you can basically get collisions if a race condition happens while archiving)
13:47:18 oh, that is nasty, good to be aware of that
13:47:26 - some forums may change the post id when it gets edited
13:47:37 (by virtually deleting the old post and replacing it with a new one)
13:48:01 - most forums probably have some sort of post ID but may not expose it -- you'll only have the thread id and the post number within the thread
13:49:35 All sorts of weird situations can happen... posts can be moved between threads, for example.
13:51:47 Also, personally I wouldn't attempt to parse the post HTML (including the quote tags) while scraping. Just stick the raw post HTML snippet in the database and process it in another step (i.e. while rendering).
13:52:19 Reverse engineering the original BBCode is nontrivial, and many forums allow raw HTML (albeit with filters) anyway.
13:52:46 even phpBB allows for configuration of custom tags
13:52:56 I would love to have a means to exclude quoted portions of a post, but that I can do with some custom code.
13:53:46 Yeah, I understand that, but also consider that people sometimes... literally change the quoted text.
or, the post gets quoted, and the original post gets edited, and now the quote is the only instance of the original text
13:53:56 HTML is a bit messy to ship through APIs and to recompose in a browser-safe way, unless I go down the iframe sandbox rabbit hole.
13:54:08 so you definitely want to include quoted text while searching
13:54:24 But I'll see how far I can get without parsing these bodies.
13:54:24 You can probably try to de-prioritize it when showing search results, but yeah.
13:55:07 this is like a "falsehoods programmers believe about forums" session :D
13:56:31 Can't imagine including quotes would be much of a problem while searching.
13:56:38 Yeah, I've been living in dataset territory for a bit too long. Data is also full of lies, but they are of a different matter.
13:57:03 As long as your purpose in searching is to let a human look for discussions by topic, instead of training a neural network or whatever.
13:57:43 In general, one thing I would try to remember is that posts can get edited over time, and since you're scraping archives, you might as well use this to your advantage -- store every revision of a post you encounter, and add the ability to display diffs, etc.
13:57:45 OrIdow6: in forensic email analysis it was a bit of a pain; there, a lot of the results are consumed by algorithms, not by humans.
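The store-every-revision suggestion above could be sketched as follows. This is a hypothetical structure, not the actual ingester: one entry per distinct body per post, deduplicated by content hash, so re-crawling an unchanged post costs nothing while edits accumulate as diffable revisions.

```python
# Hypothetical sketch of "store every revision of a post you encounter":
# one entry per distinct body per post, keyed by content hash, so an
# unchanged post seen across many crawls is stored once while edits
# accumulate as extra revisions that can later be diffed.
import hashlib

class RevisionStore:
    def __init__(self):
        # post_id -> list of (body_hash, fetched_at, body) tuples
        self.revisions = {}

    def add(self, post_id, body, fetched_at):
        """Record one crawl of a post; return True if it was a new revision."""
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        revs = self.revisions.setdefault(post_id, [])
        if any(h == digest for h, _, _ in revs):
            return False  # body unchanged since an earlier crawl
        revs.append((digest, fetched_at, body))
        return True
```

In a real database the same idea is a unique index on (post_id, body_hash); the stored timestamps are what make "show diffs over time" possible later.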
When the human comes into play, most of the filtering has already been done.
13:58:00 By the way, apparently (from my logs, and from the faint memory I have of doing some preliminary work on this) the forums went well, but there were a million edge cases in the boards.
13:58:16 Sanqui: I'll check if I can get some kind of historic perspective at the db/index level.
13:59:02 (Edge cases, as well as things that flat-out went wrong for no apparent reason.)
13:59:15 sometimes people get mad and delete text from all their posts
13:59:24 leaving discussions hard to read
13:59:35 or delete the posts, if the forum enables that
14:05:54 Sanqui: by the way, the response content charset thing worked out; turns out the internal Go HTML parser was reusable, so I've just used that and haven't run into any issues yet.
14:06:48 Nice, yeah, shouldn't be much of a problem when dealing with individual websites, although watch out if you start parsing forums in languages you're not familiar with, because you might not be able to recognize that text is in the wrong encoding.
14:13:31 hey, anyone here working on the parler stuff?
14:16:48 what do you mean by that, eicos
14:24:52 apologies for the off-topic post, I just found & joined the right channel, EggplantN
15:58:06 Sanqui: getting around 80 MB per minute of compressed warc.gz handled now... that'll require some work.
16:39:11 https://www.wsj.com/articles/pro-trump-discussion-board-faces-possible-shutdown-over-violent-racist-posts-11610819176
16:39:32 (thedonald.win)
16:41:27 "Robert Davis, senior vice president of Epik Inc., told The Wall Street Journal his firm warned TheDonald.Win it might be dropped within days if it fails to better cull what he said are discussions glorifying violence, propagating white supremacy and fomenting extremism"
16:44:11 ha
16:44:14 HAHAHAHAHAHAHAHAHAHAHAHA
16:44:20 BAHAHAHAHAHAHAHAHAHAHAHAHAHAHA
16:44:33 EPIK?!
THINKING THEY CAN TELL TD.WIN THAT
16:44:34 omg
16:44:43 they're high
16:44:53 an upstream has contacted them, 100%
16:45:50 do we have a project trying to archive them?
16:47:41 I've heard people discussing it over the past few days, but I'm not sure if it got anywhere.
16:50:33 I don't believe they're at risk in reality, but I'm not sure.
16:50:49 I know they're more desirable and easier to host than Parler.
16:57:26 What I've read when doing a rather cursory search mentions that Cloudflare makes it hard to scrape that site.
17:03:09 Do I need some kind of zstd dictionary to decompress these types of downloads? https://archive.org/download/archiveteam_pastebin_20200606171522_85926a35/
17:03:13 (if so, where would I find that)
17:26:21 avoozl: The dictionary is in the first frame of the file, as a skippable frame. Here's a draft of the .warc.zst spec: https://github.com/iipc/warc-specifications/pull/69
17:27:21 Ahhh, makes sense.
17:27:29 Neat trick.
17:28:23 OrIdow6 wrote a script recently for the extraction, but I can't find the link right now.
17:28:48 https://transfer.notkiska.pw/TXlRo/xtract.py
17:29:13 There's also an effort to get this general concept into the contrib section of the Zstandard docs: https://github.com/facebook/zstd/pull/2349
17:31:08 One of the dependencies broke recently; you need zstandard==0.10.2
17:31:17 For the extraction script.
17:31:50 Or, at least, that's one of the working versions.
17:38:38 I'm using libzstd, so things are slightly different here, but I'll manage.
21:39:20 JAA: what do you use to test WARC replay? I wanna have another crack at TikTok and wanna know what the closest approximation to Wayback is gonna be.
21:48:49 god fuck tiktok
21:48:50 but yes
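For reference, the skippable-frame layout described above can be read with a few lines of struct parsing. This is only a sketch of the framing (a magic number in 0x184D2A50..0x184D2A5F followed by a little-endian uint32 payload length, per the Zstandard format), not a replacement for xtract.py; it extracts the payload only, and per the spec draft that payload may itself be zstd-compressed, which is left to libzstd or python-zstandard to handle.

```python
# Sketch: extract the payload of a leading zstd skippable frame, which is
# where .warc.zst files carry their dictionary. Skippable frame layout:
#   4 bytes: magic, 0x184D2A50..0x184D2A5F (little-endian)
#   4 bytes: payload length (little-endian uint32)
#   N bytes: payload
import struct
from typing import Optional

SKIPPABLE_LO, SKIPPABLE_HI = 0x184D2A50, 0x184D2A5F

def read_leading_dictionary(data: bytes) -> Optional[bytes]:
    """Return the first skippable frame's payload, or None if there isn't one."""
    if len(data) < 8:
        return None
    magic, size = struct.unpack_from("<II", data, 0)
    if not SKIPPABLE_LO <= magic <= SKIPPABLE_HI:
        return None  # the file starts with a normal zstd frame instead
    return data[8:8 + size]
```

Ordinary zstd decompressors skip such frames silently, which is why the trick is backwards-compatible: tooling that knows about the convention pulls the dictionary out, everything else ignores it.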