#archiveteam-bs

03:38

brad

EggplantN: I was corrected once I got over to #archiveteam-dev. This is actually the correct place for the discussion I wanted. That’s for devs only to talk about the code itself.
03:54

jodizzle

Yes this is the better place for that discussion, but no, #archiveteam-dev isn't really for "devs only". You're free to idle there and ask relevant questions there.
03:54

jodizzle

If it was really for devs only, it'd be put in moderated mode or something.
05:50

atphoenix

I think JAA's answer in -dev is a good way to understand what each channel is intended for, when it comes to coding: (-dev) "channel is for software development. As in, development of the software we've been using for years, like changes to wget-at or the tracker. Project stuff should go into -bs (or the project-specific channel if there is one)."
06:06

atphoenix

msn.com/en-us/news/politics/pro-tru…faces-possible-shutdown/ar-BB1cOtaE
06:06

atphoenix

"Robert Davis, senior vice president of Epik Inc., told The Wall Street Journal his firm warned TheDonald.Win it might be dropped within days if it fails to better cull what he said are discussions glorifying violence, propagating white supremacy and fomenting extremism."
06:07

atphoenix

Note that Epik hosts Gab and some other content that has been kicked off other platforms
06:08

atphoenix

en.wikipedia.org/wiki/Epik_(company)
06:18

purplebot

Fast.io edited by Wickedplayer494 (+153, Not much of it done, but it's done) just now -- archiveteam.org/?diff=46194&oldid=45948
06:19

purplebot

Current Projects edited by Wickedplayer494 (+0, Fast.io to MtM) just now -- archiveteam.org/?diff=46195&oldid=46177
06:48

purplebot

Current Projects edited by Wickedplayer494 (+0, MediaFire to scripts only) 23 minutes ago -- archiveteam.org/?diff=46196&oldid=46195
08:10

avoozl

is there any specific trick that I can use to detect hosts that do not respect partial retrievals? I keep getting some zip files that are just restarting from scratch in the middle of the file due to wget trying to continue them.
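
One way to approach the question above is to look at how the server answers a ranged request. A server that honours it replies 206 Partial Content with a Content-Range header that starts at the requested offset; a server that ignores it typically replies 200 and sends the file from the beginning. Below is a minimal Python sketch of that check. The function name `resume_honoured` is made up for illustration, and this is not a description of how wget itself decides.

```python
import re

# Content-Range for a satisfied byte range looks like "bytes 100-199/200"
# (the total may be "*" when the server does not know it).
CONTENT_RANGE = re.compile(r"^bytes (\d+)-(\d+)/(\d+|\*)$")


def resume_honoured(status, headers, offset):
    """Return True only if the response really continues from `offset`.

    `status` is the HTTP status code, `headers` a dict of response headers,
    `offset` the first byte that was asked for with "Range: bytes=<offset>-".
    """
    if status != 206:
        # A 200 here means the Range header was ignored: the body is the
        # whole file again and must not be appended to the partial download.
        return False
    lowered = {name.lower(): value for name, value in headers.items()}
    match = CONTENT_RANGE.match(lowered.get("content-range", "").strip())
    # Even with a 206, make sure the range starts where we asked it to.
    return bool(match) and int(match.group(1)) == offset
```

A downloader could run this on the first response of every resumed transfer and restart the file cleanly (truncate, then write) whenever it returns False.
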
13:22

avoozl

Sanqui, AAP: having some fun parsing one of the WARC files for the LoL forums scrape.. Currently using some hardcoded logic to extract postid/title/username/userurl/body of each post, and pushing those into a local db.. seems to go reasonably well
13:23

avoozl

I'll have to have a closer look to figure out how to make this configurable in a way that doesn't involve changing code
13:23

Sanqui

that's awesome!
13:24

Sanqui

I wouldn't attempt to go that way -- you can't make the configuration/parser general enough without eventually reinventing some kind of programming language. best to keep it in code, just as encapsulated and straightforward as possible
13:25

avoozl

True, maybe that's overthinking it
13:26

Sanqui

make some sort of classes for phpbb2, phpbb3, simple machines etc. forums, with easy overrides
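
The "classes with easy overrides" idea could look roughly like the sketch below: a base class holds the extraction logic, and each forum engine (or each odd site) only overrides what differs. Everything here is hypothetical, including the class names and the markup in the pattern, which is not real phpBB output. Regex-based extraction is used only to keep the example dependency-free; it would break on nested markup.

```python
import re
from dataclasses import dataclass


@dataclass
class Post:
    post_id: str
    username: str
    body_html: str  # kept raw; rendering or cleaning happens in a later step


class Forum:
    """Base extractor; subclasses override the pattern or single methods."""

    post_pattern = None  # compiled regex with named groups: id, user, body

    def parse_posts(self, html):
        return [
            Post(m.group("id"), self.clean_username(m.group("user")), m.group("body"))
            for m in self.post_pattern.finditer(html)
        ]

    def clean_username(self, raw):
        return raw.strip()


class ExampleBoard(Forum):
    # Hypothetical markup for illustration only.
    post_pattern = re.compile(
        r'<div class="post" id="p(?P<id>\d+)">\s*'
        r'<b class="author">(?P<user>.*?)</b>\s*'
        r'<div class="content">(?P<body>.*?)</div>\s*</div>',
        re.S,
    )


class ExampleBoardVariant(ExampleBoard):
    # A site-specific quirk handled by overriding one small method.
    def clean_username(self, raw):
        return raw.strip().lstrip("@")
```

Usage would be along the lines of `ExampleBoardVariant().parse_posts(page_html)`, with one subclass per engine and further subclasses for individual sites that deviate.
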
13:26

avoozl

I'm currently pushing it all into blevesearch, which is a search engine that is self-contained, a bit like what sqlite is to sql... Just to test the scale and performance. Would be great if we can use it since it keeps things rather portable
13:26

avoozl

yeah phpbb should be easy..
13:26

avoozl

The only thing I have a minor headache over is the handling of 'quoted' text within message bodies.. that is rather tricky to get right
13:27

avoozl

right now I just ignore that and consume the entire text, but from parsing email boxes in the past I know that isn't the correct way to go
13:27

Sanqui

I wrote a reasonably easy zetaboards scraper once (but it utilized the admin interface for user data) github.com/Sanqui/zetaboards-scrape/blob/master/scrape.py
13:28

avoozl

I've written scrapers with go-colly ( go-colly.org ) which is quite neat and compact
13:28

Sanqui

historically I've used requests with beautifulsoup but recently I discovered scrapy which is nice for smaller projects
13:28

Sanqui

I'm a python person tho
13:29

avoozl

yeah I've done a ton of python in the past
13:29

Sanqui

in general it doesn't really matter what you use imo, i can see myself using a go program if the "forum" class is readable
13:29

avoozl

what I like about go-colly is that it is really really performant and easy to compile to a single executable to run elsewhere. also coordinating multiple scrapers with a central Redis queue is trivial
13:30

avoozl

But I'll make sure to make this easy to extend.. Right now I'm just looking to get my bearings on the performance and API
13:30

avoozl

do you have any experience with building simple frontends on top of rest/graphql? I have used vue in the past but I'm not really a front-end person
13:31

Sanqui

sadly nah, I've attempted to learn vue briefly but it really didn't suit me, I'm a static html maybe with bootstrap and jquery kind of person
13:31

avoozl

Because I would love for this forum browser to be a bit decoupled from the underlying storage.. so that I just provide a self-hosted API with 'getpost', 'getthread' and 'search', plus a few other things around it
13:31

Sanqui

I'd probably write the thing in flask lol
13:32

avoozl

yeah flask is fine for a lot of things too
13:32

avoozl

right now I'm just busy ingesting forums.eune.leagueoflegends.com-00000.warc.gz as a test... that is around 25GB uncompressed, just checking how large the final search index will be
13:33

Sanqui

i'll be curious to hear!
13:34

avoozl

I'm currently on my 2011 macbook air, so things take a rather extreme amount of time to complete, but once it runs ok on this system it'll run like lightning everywhere else
13:45

avoozl

Sanqui: in your experience, are posts often identifiable with a unique identifier? With the LOL scrape there is a nice id= field in the html that contains a unique id for each post, but I can imagine that isn't the case everywhere
13:46

Sanqui

very typically, but not always. some caveats I can think of:
13:47

Sanqui

- some forums may reuse ids for posts that were deleted (meaning you can get collisions basically if a race condition happens while archiving)
13:47

avoozl

oh that is nasty, good to be aware of that
13:47

Sanqui

- some forums may change the post id when it gets edited
13:47

Sanqui

(by virtually deleting the old post and replacing it with a new one)
13:48

Sanqui

- most forums probably have some sort of post ID, but may not expose it -- you'll only have the thread id, and post number within the thread
13:49

Sanqui

all sorts of weird situations can happen... posts can be moved between threads, for example
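
Given the caveats above, a storage key probably should not assume that a post ID always exists. Below is a small sketch of a fallback scheme: use the forum's own post ID when it is exposed, otherwise fall back to thread ID plus position within the thread. The function name `post_key` and the tuple layout are assumptions for illustration. The key alone does not protect against reused IDs or moved posts; those would still need a content check on top.

```python
def post_key(forum, post_id=None, thread_id=None, position=None):
    """Build a storage key for a post.

    Prefers the forum's own post ID. When the forum does not expose one,
    falls back to (thread ID, position in thread), which is less stable
    because posts can be deleted or moved between threads.
    """
    if post_id is not None:
        return (forum, "post", str(post_id))
    if thread_id is not None and position is not None:
        return (forum, "thread-pos", f"{thread_id}:{position}")
    raise ValueError("need either post_id or thread_id plus position")
```

Tagging the key with its kind ("post" or "thread-pos") keeps the two schemes from colliding when both occur in the same index.
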
13:51

Sanqui

also, personally I wouldn't be attempting to parse the post HTML (including the quote tags) while scraping. just stick the post HTML snippet in the database raw, and process it in another step (i.e. while rendering)
13:52

Sanqui

reverse engineering the original bbcode is nontrivial and many forums allow raw HTML (albeit with filters), anyway
13:52

Sanqui

even phpbb allows for configuration of custom tags
13:52

avoozl

I would love to have a means to exclude quoted portions of a post, but that I can do with some custom code
13:53

Sanqui

yeah I understand that, but also consider that people sometimes... literally change the quoted text. or, the post gets quoted, and the original post gets edited, and now the quote is the only instance of the original text
13:53

avoozl

Html is a bit messy to ship through APIs and to recompose in a browser-safe way, unless I go down the iframe sandbox rabbithole
13:54

Sanqui

so you definitely want to include quoted text while searching
13:54

avoozl

But I'll see how far i can get without parsing these bodies
13:54

Sanqui

you can probably try to de-prioritize it when showing search results, but yeah
13:55

Sanqui

this is like a "falsehoods programmers believe about forums" session :D
13:56

OrIdow6

Can't imagine including quotes would be much of a problem while searching
13:56

avoozl

Yeah I've been living in dataset territory for a bit too long. Data is also full of lies, but they are of a different matter
13:57

OrIdow6

As long as your purpose in searching is to let a human look for discussions by topic, instead of training a neural network or whatever
13:57

Sanqui

in general, one thing I would try to remember is that posts can get edited over time, and since you're scraping archives, you might as well use this to your advantage -- store every revision of a post you encounter, and add the ability to display diffs etc
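
The "store every revision and show diffs" suggestion can be sketched with the standard library alone. The version below keeps one entry per distinct body seen for a post key, ordered by capture time, and produces a unified diff between two revisions. The class name `RevisionStore`, the in-memory dict, and the use of WARC capture timestamps as sortable strings are all assumptions; a real index would persist this somewhere.

```python
import difflib
import hashlib


class RevisionStore:
    """Keeps every distinct body seen for a post, ordered by capture time."""

    def __init__(self):
        self._revisions = {}  # key -> list of (capture_time, sha1, body)

    def add(self, key, capture_time, body):
        """Record a capture; returns False if this exact body is already stored."""
        digest = hashlib.sha1(body.encode("utf-8")).hexdigest()
        history = self._revisions.setdefault(key, [])
        if any(seen == digest for _, seen, _ in history):
            return False
        history.append((capture_time, digest, body))
        history.sort(key=lambda rev: rev[0])  # captures may arrive out of order
        return True

    def history(self, key):
        return list(self._revisions.get(key, []))

    def diff(self, key, older=-2, newer=-1):
        """Unified diff between two revisions (by default the last two)."""
        revs = self._revisions[key]
        before, after = revs[older][2], revs[newer][2]
        return list(
            difflib.unified_diff(before.splitlines(), after.splitlines(), lineterm="")
        )
```

Because duplicates are detected by hash, re-ingesting the same WARC twice does not create extra revisions. One limitation of this sketch is that a post edited back to an earlier text is not recorded as a new revision.
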
13:57

avoozl

OrIdow6: in forensic email analysis it was a bit of a pain, there a lot of the results are consumed by algorithms, not by humans. When the human comes into play most of the filtering has already been done
13:58

OrIdow6

By the way, apparently (from my logs, and from the faint memory I have of doing some preliminary work on this) the forums went well, but there were a million edge cases in the boards
13:58

avoozl

Sanqui: I'll check if i can get some kind of historic perspective at the db/index level
13:59

OrIdow6

(Edge cases, as well as things that flat out went wrong for no apparent reason)
13:59

Sanqui

sometimes people get mad and delete text from all their posts
13:59

Sanqui

leaving discussions hard to read
13:59

Sanqui

or delete the posts, if the forum enables that
14:05

avoozl

Sanqui: by the way, the response content charset things worked out, turns out the internal go html parser was reusable so I've just used that and haven't run into any issues yet
14:06

Sanqui

nice yeah, shouldn't be much of a problem when dealing with individual websites, although watch out if you start parsing forums in languages you're not familiar with because you might not be able to recognize that text is in the wrong encoding
14:13

eicos

hey, anyone here working on the parler stuff?
14:16

EggplantN

what do you mean by that eicos
14:24

eicos

apologies for the off-topic post, I just found & joined the right channel EggplantN
15:58

avoozl

Sanqui: getting around 80MB per minute of compressed warc.gz handled now... that'll require some work
16:39

sliccricc_

wsj.com/articles/pro-trump-discussi…er-violent-racist-posts-11610819176
16:39

sliccricc_

(thedonald.win)
16:41

sliccricc_

"Robert Davis, senior vice president of Epik Inc., told The Wall Street Journal his firm warned TheDonald.Win it might be dropped within days if it fails to better cull what he said are discussions glorifying violence, propagating white supremacy and fomenting extremism"
16:44

EggplantN

ha
16:44

EggplantN

HAHAHAHAHAHAHAHAHAHAHAHA
16:44

EggplantN

BAHAHAHAHAHAHAHAHAHAHAHAHAHAHA
16:44

EggplantN

EPIK?! THINKING THEY CAN TELL TD.WIN THAT
16:44

EggplantN

omg
16:44

EggplantN

they're high
16:44

EggplantN

an upstream has contacted them 100%
16:45

LeighR

do we have a project trying to archive them?
16:47

sliccricc

i've heard people discussing it over the past few days but im not sure if it got anywhere
16:50

EggplantN

i dont believe they're at risk in reality but i'm not sure
16:50

EggplantN

i know they're more desirable and easier to host than parler
16:57

LeighR

what I've read when doing a rather cursory search mentions that CloudFlare makes it hard to scrape that site
17:03

avoozl

do I need some kind of zstd dictionary to decompress these types of downloads? archive.org/download/archiveteam_pastebin_20200606171522_85926a35
17:03

avoozl

(if so, where would I find that)
17:26

JAA

avoozl: The dictionary is in the first frame of the file as a skippable frame. Here's a draft of the .warc.zst spec: iipc/warc-specifications #69
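
For readers following along, the sketch below shows how the leading skippable frame could be pulled out of such a file using only the standard library. The frame layout (a 4-byte little-endian magic in the range 0x184D2A50 to 0x184D2A5F, a 4-byte little-endian size, then the payload) comes from the Zstandard format. The function names are hypothetical, and the assumption that the payload may itself be zstd-compressed should be checked against the draft spec linked above. This is not OrIdow6's script.

```python
import struct

SKIPPABLE_MIN = 0x184D2A50  # Zstandard skippable frame magic range
SKIPPABLE_MAX = 0x184D2A5F
ZSTD_FRAME_MAGIC = 0xFD2FB528  # magic of a regular Zstandard frame


def read_leading_skippable_frame(data):
    """Return (payload, offset_of_next_frame) for a leading skippable frame.

    Raises ValueError if `data` does not start with a complete skippable frame.
    """
    if len(data) < 8:
        raise ValueError("too short to contain a frame header")
    magic, size = struct.unpack_from("<II", data, 0)
    if not SKIPPABLE_MIN <= magic <= SKIPPABLE_MAX:
        raise ValueError("file does not start with a skippable frame")
    payload = data[8:8 + size]
    if len(payload) != size:
        raise ValueError("skippable frame is truncated")
    return payload, 8 + size


def payload_is_zstd_compressed(payload):
    """True if the payload starts with a regular zstd frame magic.

    Assumption: in that case the dictionary needs one plain zstd
    decompression pass before it can be used.
    """
    return len(payload) >= 4 and struct.unpack_from("<I", payload)[0] == ZSTD_FRAME_MAGIC
```

The extracted payload (decompressed first if needed) would then be handed to whichever zstd binding is in use as the dictionary for decoding the frames that start at the returned offset.
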
17:27

avoozl

Ahhh, makes sense
17:27

avoozl

Neat trick
17:28

JAA

OrIdow6 wrote a script recently for the extraction, but I can't find the link right now.
17:28

OrIdow6

transfer.notkiska.pw/TXlRo/xtract.py
17:29

JAA

There's also an effort to get this general concept into the contrib section of the Zstandard docs: facebook/zstd #2349
17:31

OrIdow6

One of the dependencies broke recently, you need zstandard==0.10.2
17:31

OrIdow6

For the extraction script
17:31

OrIdow6

Or, at least, that's one of the working versions
17:38

avoozl

I'm using libzstd so things are slightly different here, but I'll manage
21:39

Dallas

JAA: what do you use to test warc replay? I wanna have another crack at TikTok and wanna know what the closest approximation to wayback is gonna be?
21:48

EggplantN

god fuck tiktok
21:48

EggplantN

but yes
