03:38:16 EggplantN: I was corrected once I got over to #archiveteam-dev. This is actually the correct place for the discussion I wanted. That one's for devs only, to talk about the code itself.
03:54:20 Yes, this is the better place for that discussion, but no, #archiveteam-dev isn't really for "devs only". You're free to idle there and ask relevant questions.
03:54:38 If it were really for devs only, it'd be put in moderated mode or something.
05:50:31 I think JAA's answer in -dev is a good way to understand what each channel is intended for, when it comes to coding: (-dev) "channel is for software development. As in, development of the software we've been using for years, like changes to wget-at or the tracker. Project stuff should go into -bs (or the project-specific channel if there is one)."
06:06:17 https://www.msn.com/en-us/news/politics/pro-trump-discussion-board-faces-possible-shutdown/ar-BB1cOtaE
06:06:47 "Robert Davis, senior vice president of Epik Inc., told The Wall Street Journal his firm warned TheDonald.Win it might be dropped within days if it fails to better cull what he said are discussions glorifying violence, propagating white supremacy and fomenting extremism."
06:07:40 Note that Epik hosts Gab and some other content that has been kicked off other platforms.
06:08:15 https://en.wikipedia.org/wiki/Epik_(company)
06:18:35 -purplebot- Fast.io edited by Wickedplayer494 (+153, Not much of it done, but it's done) just now -- https://www.archiveteam.org/?diff=46194&oldid=45948
06:19:35 -purplebot- Current Projects edited by Wickedplayer494 (+0, Fast.io to MtM) just now -- https://www.archiveteam.org/?diff=46195&oldid=46177
06:48:35 -purplebot- Current Projects edited by Wickedplayer494 (+0, MediaFire to scripts only) 23 minutes ago -- https://www.archiveteam.org/?diff=46196&oldid=46195
08:10:05 Is there any specific trick that I can use to detect hosts that do not respect partial retrievals?
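One way to approach the partial-retrieval question (a sketch, not anything from the log; the helper names are made up for illustration): send a one-byte `Range` request and check whether the server answers 206 Partial Content with a matching `Content-Range`, or just 200 with the full body — the latter is what makes resumed downloads restart from scratch.

```python
# Sketch: probe whether a host honors HTTP range requests. A server that
# supports partial retrieval answers "Range: bytes=0-0" with 206 Partial
# Content and a Content-Range header; one that ignores the header sends
# 200 OK and the whole body.
import urllib.request

def honors_ranges(status: int, headers: dict) -> bool:
    """Decide from a response's status and headers whether the range was respected."""
    content_range = {k.lower(): v for k, v in headers.items()}.get("content-range", "")
    return status == 206 and content_range.startswith("bytes 0-0/")

def probe(url: str) -> bool:
    """Fetch a single byte and report whether the host supports partial retrieval."""
    req = urllib.request.Request(url, headers={"Range": "bytes=0-0"})
    with urllib.request.urlopen(req) as resp:
        return honors_ranges(resp.status, dict(resp.headers))
```

Some servers also advertise support via an `Accept-Ranges: bytes` header, but actually issuing a range request is the more reliable test.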
I keep getting some zip files that are just restarting from scratch in the middle of the file due to wget trying to continue them.
13:22:53 Sanqui, AAP: having some fun parsing one of the WARC files for the LoL forums scrape. Currently using some hardcoded logic to extract postid/title/username/userurl/body of each post, and pushing those into a local db. Seems to go reasonably well.
13:23:19 I'll have to have a closer look to figure out how to make this configurable in a way that doesn't involve changing code.
13:23:46 that's awesome!
13:24:44 I wouldn't attempt to go that way -- you can't make the configuration/parser general enough without eventually reinventing some kind of programming language. Best to keep it in code, just as encapsulated and straightforward as possible.
13:25:32 True, maybe that's overthinking it.
13:26:08 Make some sort of classes for phpBB2, phpBB3, Simple Machines, etc. forums, with easy overrides.
13:26:14 I'm currently pushing it all into blevesearch, which is a self-contained search engine, a bit like what SQLite is to SQL... Just to test the scale and performance. Would be great if we can use it, since it keeps things rather portable.
13:26:22 yeah, phpBB should be easy.
13:26:41 The only thing I have a minor headache over is the handling of 'quoted' text within message bodies...
that is rather tricky to get right
13:27:05 Right now I just ignore that and consume the entire text, but from parsing email boxes in the past I know that isn't the correct way to go.
13:27:17 I wrote a reasonably easy ZetaBoards scraper once (but it utilized the admin interface for user data): https://github.com/Sanqui/zetaboards-scrape/blob/master/scrape.py
13:28:11 I've written scrapers with go-colly (http://go-colly.org), which is quite neat and compact.
13:28:33 Historically I've used requests with BeautifulSoup, but recently I discovered Scrapy, which is nice for smaller projects.
13:28:42 I'm a Python person tho.
13:29:19 yeah, I've done a ton of Python in the past
13:29:33 In general it doesn't really matter what you use, imo; I can see myself using a Go program if the "forum" class is readable.
13:29:52 What I like about go-colly is that it is really, really performant and easy to compile to a single executable to run elsewhere. Also, coordinating multiple scrapers with a central Redis queue is trivial.
13:30:22 But I'll make sure to make this easy to extend. Right now I'm just looking to get my bearings on the performance and API.
13:30:47 Do you have any experience with building simple frontends on top of REST/GraphQL? I have used Vue in the past, but I'm not really a front-end person.
13:31:24 Sadly nah; I've attempted to learn Vue briefly, but it really didn't suit me. I'm a static HTML, maybe with Bootstrap and jQuery, kind of person.
13:31:35 Because I would love for this forum browser to be a bit decoupled from the underlying storage, so that I just provide a self-hosted API with 'getpost', 'getthread', 'search', and a few other things around it.
13:31:59 I'd probably write the thing in Flask lol
13:32:11 yeah, Flask is fine for a lot of things too
13:32:50 Right now I'm just busy ingesting forums.eune.leagueoflegends.com-00000.warc.gz as a test... that is around 25 GB uncompressed; just checking how large the final search index will be.
13:33:34 I'll be curious to hear!
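The "one class per forum engine, with easy overrides" idea suggested above might be sketched like this. Everything here is hypothetical: the class names and markup patterns are invented for illustration and don't correspond to real phpBB or SMF HTML, and a real scraper would use a proper HTML parser rather than regexes.

```python
# Hypothetical sketch of per-engine parser classes with easy overrides.
# Subclasses only swap out the pattern that matches their engine's markup.
import re
from dataclasses import dataclass

@dataclass
class Post:
    post_id: str
    body: str

class ForumParser:
    # Subclasses override this pattern for their engine's markup.
    POST_RE = re.compile(
        r'<div class="post" id="p(?P<id>\d+)">(?P<body>.*?)</div>', re.S)

    def parse_posts(self, html):
        """Yield a Post for every post found in an HTML page."""
        for m in self.POST_RE.finditer(html):
            yield Post(post_id=m.group("id"), body=m.group("body").strip())

class PhpBB3Parser(ForumParser):
    pass  # the default pattern stands in for phpBB3-ish markup here

class SMFParser(ForumParser):
    # A different engine only needs to swap the pattern.
    POST_RE = re.compile(
        r'<td class="msg" data-msg="(?P<id>\d+)">(?P<body>.*?)</td>', re.S)
```

The point of the structure is that per-forum quirks (custom tags, odd pagination) live in one small subclass instead of a general configuration language.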
13:34:30 I'm currently on my 2011 MacBook Air, so things take a rather extreme amount of time to complete, but once it runs OK on this system, it'll run like lightning everywhere else.
13:45:24 Sanqui: in your experience, are posts often identifiable with a unique identifier? With the LoL scrape there is a nice id= field in the HTML that contains a unique id for each post, but I can imagine that isn't the case everywhere.
13:46:21 very typically, but not always. Some caveats I can think of:
13:47:08 - some forums may reuse ids for posts that were deleted (meaning you can basically get collisions if a race condition happens while archiving)
13:47:18 oh, that is nasty, good to be aware of that
13:47:26 - some forums may change the post id when it gets edited
13:47:37 (by virtually deleting the old post and replacing it with a new one)
13:48:01 - most forums probably have some sort of post ID but may not expose it -- you'll only have the thread id and the post number within the thread
13:49:35 All sorts of weird situations can happen... posts can be moved between threads, for example.
13:51:47 Also, personally I wouldn't attempt to parse the post HTML (including the quote tags) while scraping. Just stick the raw post HTML snippet in the database and process it in another step (i.e. while rendering).
13:52:19 Reverse engineering the original BBCode is nontrivial, and many forums allow raw HTML (albeit with filters) anyway.
13:52:46 even phpBB allows for configuration of custom tags
13:52:56 I would love to have a means to exclude quoted portions of a post, but that I can do with some custom code.
13:53:46 Yeah, I understand that, but also consider that people sometimes... literally change the quoted text.
or, the post gets quoted, and the original post gets edited, and now the quote is the only instance of the original text
13:53:56 HTML is a bit messy to ship through APIs and to recompose in a browser-safe way, unless I go down the iframe sandbox rabbit hole.
13:54:08 so you definitely want to include quoted text while searching
13:54:24 But I'll see how far I can get without parsing these bodies.
13:54:24 You can probably try to de-prioritize it when showing search results, but yeah.
13:55:07 this is like a "falsehoods programmers believe about forums" session :D
13:56:31 Can't imagine including quotes would be much of a problem while searching.
13:56:38 Yeah, I've been living in dataset territory for a bit too long. Data is also full of lies, but they are of a different matter.
13:57:03 As long as your purpose in searching is to let a human look for discussions by topic, instead of training a neural network or whatever.
13:57:43 In general, one thing I would try to remember is that posts can get edited over time, and since you're scraping archives, you might as well use this to your advantage -- store every revision of a post you encounter, and add the ability to display diffs, etc.
13:57:45 OrIdow6: in forensic email analysis it was a bit of a pain; there, a lot of the results are consumed by algorithms, not by humans.
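The store-every-revision suggestion above could be sketched as follows. This is a hypothetical structure, not the actual ingester: one entry per distinct body per post, deduplicated by content hash, so re-crawling an unchanged post costs nothing while edits accumulate as diffable revisions.

```python
# Hypothetical sketch of "store every revision of a post you encounter":
# one entry per distinct body per post, keyed by content hash, so an
# unchanged post seen across many crawls is stored once while edits
# accumulate as extra revisions that can later be diffed.
import hashlib

class RevisionStore:
    def __init__(self):
        # post_id -> list of (body_hash, fetched_at, body) tuples
        self.revisions = {}

    def add(self, post_id, body, fetched_at):
        """Record one crawl of a post; return True if it was a new revision."""
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        revs = self.revisions.setdefault(post_id, [])
        if any(h == digest for h, _, _ in revs):
            return False  # body unchanged since an earlier crawl
        revs.append((digest, fetched_at, body))
        return True
```

In a real database the same idea is a unique index on (post_id, body_hash); the stored timestamps are what make "show diffs over time" possible later.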
When the human comes into play, most of the filtering has already been done.
13:58:00 By the way, apparently (from my logs, and from the faint memory I have of doing some preliminary work on this) the forums went well, but there were a million edge cases in the boards.
13:58:16 Sanqui: I'll check if I can get some kind of historic perspective at the db/index level.
13:59:02 (Edge cases, as well as things that flat-out went wrong for no apparent reason.)
13:59:15 sometimes people get mad and delete text from all their posts
13:59:24 leaving discussions hard to read
13:59:35 or delete the posts, if the forum enables that
14:05:54 Sanqui: by the way, the response content charset thing worked out; turns out the internal Go HTML parser was reusable, so I've just used that and haven't run into any issues yet.
14:06:48 Nice, yeah, shouldn't be much of a problem when dealing with individual websites, although watch out if you start parsing forums in languages you're not familiar with, because you might not be able to recognize that text is in the wrong encoding.
14:13:31 hey, anyone here working on the parler stuff?
14:16:48 what do you mean by that, eicos
14:24:52 apologies for the off-topic post, I just found & joined the right channel, EggplantN
15:58:06 Sanqui: getting around 80 MB per minute of compressed warc.gz handled now... that'll require some work.
16:39:11 https://www.wsj.com/articles/pro-trump-discussion-board-faces-possible-shutdown-over-violent-racist-posts-11610819176
16:39:32 (thedonald.win)
16:41:27 "Robert Davis, senior vice president of Epik Inc., told The Wall Street Journal his firm warned TheDonald.Win it might be dropped within days if it fails to better cull what he said are discussions glorifying violence, propagating white supremacy and fomenting extremism"
16:44:11 ha
16:44:14 HAHAHAHAHAHAHAHAHAHAHAHA
16:44:20 BAHAHAHAHAHAHAHAHAHAHAHAHAHAHA
16:44:33 EPIK?!
THINKING THEY CAN TELL TD.WIN THAT
16:44:34 omg
16:44:43 they're high
16:44:53 an upstream has contacted them, 100%
16:45:50 do we have a project trying to archive them?
16:47:41 I've heard people discussing it over the past few days, but I'm not sure if it got anywhere.
16:50:33 I don't believe they're at risk in reality, but I'm not sure.
16:50:49 I know they're more desirable and easier to host than Parler.
16:57:26 What I've read when doing a rather cursory search mentions that Cloudflare makes it hard to scrape that site.
17:03:09 Do I need some kind of zstd dictionary to decompress these types of downloads? https://archive.org/download/archiveteam_pastebin_20200606171522_85926a35/
17:03:13 (if so, where would I find that)
17:26:21 avoozl: The dictionary is in the first frame of the file, as a skippable frame. Here's a draft of the .warc.zst spec: https://github.com/iipc/warc-specifications/pull/69
17:27:21 Ahhh, makes sense.
17:27:29 Neat trick.
17:28:23 OrIdow6 wrote a script recently for the extraction, but I can't find the link right now.
17:28:48 https://transfer.notkiska.pw/TXlRo/xtract.py
17:29:13 There's also an effort to get this general concept into the contrib section of the Zstandard docs: https://github.com/facebook/zstd/pull/2349
17:31:08 One of the dependencies broke recently; you need zstandard==0.10.2
17:31:17 For the extraction script.
17:31:50 Or, at least, that's one of the working versions.
17:38:38 I'm using libzstd, so things are slightly different here, but I'll manage.
21:39:20 JAA: what do you use to test WARC replay? I wanna have another crack at TikTok and wanna know what the closest approximation to Wayback is gonna be.
21:48:49 god fuck tiktok
21:48:50 but yes
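For reference, the skippable-frame layout described above can be read with a few lines of struct parsing. This is only a sketch of the framing (a magic number in 0x184D2A50..0x184D2A5F followed by a little-endian uint32 payload length, per the Zstandard format), not a replacement for xtract.py; it extracts the payload only, and per the spec draft that payload may itself be zstd-compressed, which is left to libzstd or python-zstandard to handle.

```python
# Sketch: extract the payload of a leading zstd skippable frame, which is
# where .warc.zst files carry their dictionary. Skippable frame layout:
#   4 bytes: magic, 0x184D2A50..0x184D2A5F (little-endian)
#   4 bytes: payload length (little-endian uint32)
#   N bytes: payload
import struct
from typing import Optional

SKIPPABLE_LO, SKIPPABLE_HI = 0x184D2A50, 0x184D2A5F

def read_leading_dictionary(data: bytes) -> Optional[bytes]:
    """Return the first skippable frame's payload, or None if there isn't one."""
    if len(data) < 8:
        return None
    magic, size = struct.unpack_from("<II", data, 0)
    if not SKIPPABLE_LO <= magic <= SKIPPABLE_HI:
        return None  # the file starts with a normal zstd frame instead
    return data[8:8 + size]
```

Ordinary zstd decompressors skip such frames silently, which is why the trick is backwards-compatible: tooling that knows about the convention pulls the dictionary out, everything else ignores it.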