-
brad
EggplantN: I was corrected once I got over to #archiveteam-dev. This is actually the correct place for the discussion I wanted. That’s for devs only to talk about the code itself.
-
jodizzle
Yes this is the better place for that discussion, but no, #archiveteam-dev isn't really for "devs only". You're free to idle there and ask relevant questions there.
-
jodizzle
If it was really for devs only, it'd be put in moderated mode or something.
-
atphoenix
I think JAA's answer in -dev is a good way to understand what each channel is intended for, when it comes to coding: (-dev) "channel is for software development. As in, development of the software we've been using for years, like changes to wget-at or the tracker. Project stuff should go into -bs (or the project-specific channel if there is one)."
-
atphoenix
"Robert Davis, senior vice president of Epik Inc., told The Wall Street Journal his firm warned TheDonald.Win it might be dropped within days if it fails to better cull what he said are discussions glorifying violence, propagating white supremacy and fomenting extremism."
-
atphoenix
Note that Epik hosts Gab and some other content that has been kicked off other platforms
-
purplebot
Fast.io edited by Wickedplayer494 (+153, Not much of it done, but it's done) just now --
archiveteam.org/?diff=46194&oldid=45948
-
purplebot
Current Projects edited by Wickedplayer494 (+0, Fast.io to MtM) just now --
archiveteam.org/?diff=46195&oldid=46177
-
purplebot
Current Projects edited by Wickedplayer494 (+0, MediaFire to scripts only) 23 minutes ago --
archiveteam.org/?diff=46196&oldid=46195
-
avoozl
is there any specific trick that I can use to detect hosts that do not respect partial retrievals? I keep getting some zip files that are just restarting from scratch in the middle of the file due to wget trying to continue them.
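One way to approach the question above is to look at how the server answers a ranged request before appending anything to the partial file. The sketch below is only an illustration: `resume_is_safe` and `probe` are hypothetical helper names, and it assumes a server that honours `Range` replies with status 206 and a `Content-Range` header starting at the requested offset, while a server that ignores it replies 200 with the whole body.

```python
import re
import urllib.request

# Matches e.g. "bytes 100-999/1000" or "bytes 100-999/*"
_CONTENT_RANGE = re.compile(r"bytes (\d+)-(\d+)/(\d+|\*)")


def resume_is_safe(status, content_range, offset):
    """Return True only if the response really starts at byte `offset`."""
    if status != 206:
        # Anything else (typically 200) means the Range header was not applied.
        return False
    match = _CONTENT_RANGE.fullmatch(content_range or "")
    return bool(match) and int(match.group(1)) == offset


def probe(url, offset):
    """Request the bytes from `offset` onwards and report whether resuming is safe."""
    request = urllib.request.Request(url, headers={"Range": f"bytes={offset}-"})
    with urllib.request.urlopen(request) as response:
        return resume_is_safe(
            response.status, response.headers.get("Content-Range"), offset
        )
```

If `probe` returns False for a host, the safer option is to discard the partial file and fetch it again from byte 0 rather than append to it.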
-
avoozl
Sanqui, AAP: having some fun parsing one of the WARC files for the LoL forums scrape.. Currently using some hardcoded logic to extract postid/title/username/userurl/body of each post, and pushing those into a local db.. seems to go reasonably well
-
avoozl
I'll have to take a closer look to figure out how to make this configurable in a way that doesn't involve changing code
-
Sanqui
that's awesome!
-
Sanqui
I wouldn't attempt to go that way -- you can't make the configuration/parser general enough without eventually reinventing some kind of programming language. best to keep it in code, just as encapsulated and straightforward as possible
-
avoozl
True, maybe that's overthinking it
-
Sanqui
make some sort of classes for phpbb2, phpbb3, simple machines etc. forums, with easy overrides
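A minimal sketch of what "classes with easy overrides" could look like, using only the standard library. Everything specific here is an assumption made for illustration: the class names, the `post_tag` / `post_class` / `id_prefix` settings, and the markup they match are invented and have not been checked against real phpBB or Simple Machines templates.

```python
from html.parser import HTMLParser


class ForumParser(HTMLParser):
    """Collects post ids from elements matching the class-level settings."""

    post_tag = "div"      # element that wraps one post (assumed)
    post_class = "post"   # CSS class marking that element (assumed)
    id_prefix = "p"       # prefix stripped from the id attribute (assumed)

    def __init__(self):
        super().__init__()
        self.post_ids = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if tag == self.post_tag and self.post_class in classes:
            self.post_ids.append(self.clean_id(attributes.get("id") or ""))

    def clean_id(self, raw_id):
        # Strip the engine-specific prefix; subclasses may override this method.
        if raw_id.startswith(self.id_prefix):
            return raw_id[len(self.id_prefix):]
        return raw_id


class PhpBB3Parser(ForumParser):
    pass  # the defaults above stand in for phpBB 3 markup in this sketch


class SimpleMachinesParser(ForumParser):
    # Hypothetical overrides; real templates may differ.
    post_class = "post_wrapper"
    id_prefix = "msg"


def extract_post_ids(parser_class, html):
    parser = parser_class()
    parser.feed(html)
    parser.close()
    return parser.post_ids
```

Supporting another engine then means adding a small subclass that changes a few attributes, or overriding `clean_id` when the id format is unusual.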
-
avoozl
I'm currently pushing it all into blevesearch, which is a search engine that is self-contained, a bit like what sqlite is to sql... Just to test the scale and performance. Would be great if we can use it since it keeps things rather portable
-
avoozl
yeah phpbb should be easy..
-
avoozl
The only thing I have a minor headache over is the handling of 'quoted' text within message bodies.. that is rather tricky to get right
-
avoozl
right now I just ignore that and consume the entire text, but from parsing email boxes in the past I know that isn't the correct way to go
-
Sanqui
I wrote a reasonably easy zetaboards scraper once (but it utilized the admin interface for user data)
github.com/Sanqui/zetaboards-scrape/blob/master/scrape.py
-
avoozl
I've written scrapers with go-colly (
go-colly.org ) which is quite neat and compact
-
Sanqui
historically I've used requests with beautifulsoup but recently I discovered scrapy which is nice for smaller projects
-
Sanqui
I'm a python person tho
-
avoozl
yeah I've done a ton of python in the past
-
Sanqui
in general it doesn't really matter what you use imo, i can see myself using a go program if the "forum" class is readable
-
avoozl
what I like about go-colly is that it is really really performant and easy to compile to a single executable to run elsewhere. also coordinating multiple scrapers with a central Redis queue is trivial
-
avoozl
But I'll make sure to make this easy to extend.. Right now I'm just looking to get my bearings on the performance and API
-
avoozl
do you have any experience with building simple frontends on top of rest/graphql? I have used vue in the past but I'm not really a front-end person
-
Sanqui
sadly nah, I've attempted to learn vue briefly but it really didn't suit me, I'm a static html maybe with bootstrap and jquery kind of person
-
avoozl
Because I would love for this forum browser to be a bit decoupled from the underlying storage.. so that I just provide a self-hosted API with 'getpost', 'getthread' and 'search', plus a few other things around it
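The decoupling described above could be sketched as a small storage interface that any backend implements. The names `Post`, `ForumStore` and `MemoryStore` and the field layout are hypothetical; the in-memory backend only exists to show that a front end can be written against the three calls without knowing what sits behind them.

```python
from dataclasses import dataclass
from typing import List, Optional, Protocol


@dataclass(frozen=True)
class Post:
    post_id: str
    thread_id: str
    author: str
    body_html: str


class ForumStore(Protocol):
    """The only surface the front end is allowed to depend on."""

    def get_post(self, post_id: str) -> Optional[Post]: ...
    def get_thread(self, thread_id: str) -> List[Post]: ...
    def search(self, query: str) -> List[Post]: ...


class MemoryStore:
    """Toy backend; a search-index or SQL backend would expose the same methods."""

    def __init__(self, posts):
        self._posts = {post.post_id: post for post in posts}

    def get_post(self, post_id):
        return self._posts.get(post_id)

    def get_thread(self, thread_id):
        return [p for p in self._posts.values() if p.thread_id == thread_id]

    def search(self, query):
        # Naive case-insensitive substring match, standing in for a real index.
        needle = query.lower()
        return [p for p in self._posts.values() if needle in p.body_html.lower()]
```

A REST or GraphQL layer would then map its routes onto these three methods, so swapping the index later does not touch the front end.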
-
Sanqui
I'd probably write the thing in flask lol
-
avoozl
yeah flask is fine for a lot of things too
-
avoozl
right now I'm just busy ingesting forums.eune.leagueoflegends.com-00000.warc.gz as a test... that is around 25GB uncompressed, just checking how large the final search index will be
-
Sanqui
i'll be curious to hear!
-
avoozl
I'm currently on my 2011 macbook air, so things take a rather extreme amount of time to complete, but once it runs ok on this system it'll run like lightning everywhere else
-
avoozl
Sanqui: in your experience, are posts often identifiable with a unique identifier? With the LOL scrape there is a nice id= field in the html that contains a unique id for each post, but I can imagine that isn't the case everywhere
-
Sanqui
very typically, but not always. some caveats I can think of:
-
Sanqui
- some forums may reuse ids for posts that were deleted (meaning you can get collisions basically if a race condition happens while archiving)
-
avoozl
oh that is nasty, good to be aware of that
-
Sanqui
- some forums may change the post id when it gets edited
-
Sanqui
(by virtually deleting the old post and replacing it with a new one)
-
Sanqui
- most forums probably have some sort of post ID, but may not expose it -- you'll only have the thread id, and post number within the thread
-
Sanqui
all sorts of weird situations can happen... posts can be moved between threads, for example
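Given the caveats listed above, one possible way to key posts is to prefer the forum's own post id and fall back to thread id plus position when no id is exposed. This is a sketch with made-up names (`PostKey`, `make_post_key`); it does not solve id reuse or posts moving between threads, so the key should be treated as a lookup hint rather than proof of identity.

```python
from typing import NamedTuple


class PostKey(NamedTuple):
    forum: str
    kind: str    # "id" when the forum exposes a post id, else "position"
    value: str


def make_post_key(forum, post_id=None, thread_id=None, position=None):
    """Build a key from the best identifier available for this post."""
    if post_id is not None:
        return PostKey(forum, "id", str(post_id))
    if thread_id is not None and position is not None:
        # Weaker fallback: breaks if posts are deleted or moved between threads.
        return PostKey(forum, "position", f"{thread_id}#{position}")
    raise ValueError("need either post_id or thread_id plus position")
```

Recording which kind of key was used makes it possible to treat the weaker position-based keys with more suspicion later on.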
-
Sanqui
also, personally I wouldn't be attempting to parse the post HTML (including the quote tags) while scraping. just stick the post HTML snippet in the database raw, and process it in another step (i.e. while rendering)
-
Sanqui
reverse engineering the original bbcode is nontrivial and many forums allow raw HTML (albeit with filters), anyway
-
Sanqui
even phpbb allows for configuration of custom tags
-
avoozl
I would love to have a means to exclude quoted portions of a post, but that I can do with some custom code
-
Sanqui
yeah I understand that, but also consider that people sometimes... literally change the quoted text. or, the post gets quoted, and the original post gets edited, and now the quote is the only instance of the original text
-
avoozl
Html is a bit messy to ship through APIs and to recompose in a browser-safe way, unless I go down the iframe sandbox rabbithole
-
Sanqui
so you definitely want to include quoted text while searching
-
avoozl
But I'll see how far i can get without parsing these bodies
-
Sanqui
you can probably try to de-prioritize it when showing search results, but yeah
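De-prioritizing quoted text could be as simple as weighting hits differently at ranking time. The sketch below assumes the post body has already been split into the author's own text and the quoted text (which, as discussed, is the hard part); `score_post` and the default `quote_weight` of 0.25 are arbitrary illustrative choices, not anything a particular search engine prescribes.

```python
def score_post(query, own_text, quoted_text, quote_weight=0.25):
    """Count query occurrences, weighting hits inside quotes lower than own text."""
    needle = query.lower()
    if not needle:
        return 0.0
    own_hits = own_text.lower().count(needle)
    quoted_hits = quoted_text.lower().count(needle)
    # Quoted hits still count, so a quote that preserves deleted text stays findable.
    return own_hits + quote_weight * quoted_hits
```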
-
Sanqui
this is like a "falsehoods programmers believe about forums" session :D
-
OrIdow6
Can't imagine including quotes would be much of a problem while searching
-
avoozl
Yeah I've been living in dataset territory for a bit too long. Data is also full of lies, but they are of a different nature
-
OrIdow6
As long as your purpose in searching is to let a human look for discussions by topic, instead of training a neural network or whatever
-
Sanqui
in general, one thing I would try to remember is that posts can get edited over time, and since you're scraping archives, you might as well use this to your advantage -- store every revision of a post you encounter, and add the ability to display diffs etc
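The "store every revision and show diffs" idea can be sketched with a content hash per capture and `difflib` for display. `RevisionStore` is a hypothetical name, the storage is an in-memory dict for brevity, and the sketch assumes captures are recorded in chronological order, since it only compares against the latest stored revision.

```python
import difflib
import hashlib


class RevisionStore:
    def __init__(self):
        self._revisions = {}  # post key -> list of (digest, seen_at, body)

    def record(self, key, body, seen_at):
        """Store `body` if it differs from the latest revision; return True if stored."""
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        history = self._revisions.setdefault(key, [])
        if history and history[-1][0] == digest:
            return False  # same content as the last capture, nothing new
        history.append((digest, seen_at, body))
        return True

    def bodies(self, key):
        return [body for _, _, body in self._revisions.get(key, [])]

    def diff(self, key, old, new):
        """Return a unified diff between two stored revisions as a list of lines."""
        bodies = self.bodies(key)
        return list(difflib.unified_diff(
            bodies[old].splitlines(),
            bodies[new].splitlines(),
            fromfile=f"revision {old}",
            tofile=f"revision {new}",
            lineterm="",
        ))
```

Re-scraping the same page repeatedly then costs nothing extra unless the post actually changed.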
-
avoozl
OrIdow6: in forensic email analysis it was a bit of a pain; there, a lot of the results are consumed by algorithms, not by humans. When the human comes into play most of the filtering has already been done
-
OrIdow6
By the way, apparently (from my logs, and from the faint memory I have of doing some preliminary work on this) the forums went well, but there were a million edge cases in the boards
-
avoozl
Sanqui: I'll check if i can get some kind of historic perspective at the db/index level
-
OrIdow6
(Edge cases, as well as things that flat out went wrong for no apparent reason)
-
Sanqui
sometimes people get mad and delete text from all their posts
-
Sanqui
leaving discussions hard to read
-
Sanqui
or delete the posts, if the forum enables that
-
avoozl
Sanqui: by the way, the response content charset things worked out, turns out the internal go html parser was reusable so I've just used that and haven't run into any issues yet
-
Sanqui
nice yeah, shouldn't be much of a problem when dealing with individual websites, although watch out if you start parsing forums in languages you're not familiar with because you might not be able to recognize that text is in the wrong encoding
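For text you cannot read yourself, a mechanical check can catch one common kind of damage: UTF-8 bytes that were decoded as Latin-1. The sketch below (with the hypothetical helper `repair_mojibake`) round-trips the text and keeps the result only if it is valid UTF-8. It assumes Latin-1 specifically; pages mis-decoded as Windows-1252 or another code page would need a different codec.

```python
def repair_mojibake(text):
    """Undo UTF-8 text that was wrongly decoded as Latin-1, if that is what happened."""
    try:
        # Recover the original bytes, then decode them the way they were meant to be.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable as Latin-1, or not valid UTF-8 afterwards: leave it alone.
        return text
```

Comparing `repair_mojibake(text) != text` over a sample of posts gives a rough signal that a forum's charset was guessed wrong.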
-
eicos
hey, anyone here working on the parler stuff?
-
EggplantN
what do you mean by that eicos
-
eicos
apologies for the off-topic post, I just found & joined the right channel EggplantN
-
avoozl
Sanqui: getting around 80MB per minute of compressed warc.gz handled now... that'll require some work
-
sliccricc_
(thedonald.win)
-
sliccricc_
"Robert Davis, senior vice president of Epik Inc., told The Wall Street Journal his firm warned TheDonald.Win it might be dropped within days if it fails to better cull what he said are discussions glorifying violence, propagating white supremacy and fomenting extremism"
-
EggplantN
ha
-
EggplantN
HAHAHAHAHAHAHAHAHAHAHAHA
-
EggplantN
BAHAHAHAHAHAHAHAHAHAHAHAHAHAHA
-
EggplantN
EPIK?! THINKING THEY CAN TELL TD.WIN THAT
-
EggplantN
omg
-
EggplantN
they're high
-
EggplantN
an upstream has contacted them 100%
-
LeighR
do we have a project trying to archive them?
-
sliccricc
i've heard people discussing it over the past few days but im not sure if it got anywhere
-
EggplantN
i dont believe they're at risk in reality but i'm not sure
-
EggplantN
i know they're more desirable and easier to host than parler
-
LeighR
what I've read when doing a rather cursory search mentions that CloudFlare makes it hard to scrape that site
-
avoozl
do I need some kind of zstd dictionary to decompress these types of downloads?
archive.org/download/archiveteam_pastebin_20200606171522_85926a35
-
avoozl
(if so, where would I find that)
-
JAA
avoozl: The dictionary is in the first frame of the file as a skippable frame. Here's a draft of the .warc.zst spec:
iipc/warc-specifications #69
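Pulling the dictionary out of the leading skippable frame needs no zstd library at all, only the frame layout from the Zstandard format: a 4-byte little-endian magic in the range 0x184D2A50 to 0x184D2A5F, a 4-byte little-endian size, then the payload. The sketch below uses hypothetical function names. Treating a payload that starts with the regular zstd frame magic as "the dictionary is itself compressed" is an assumption to verify against the draft spec.

```python
import struct

ZSTD_FRAME_MAGIC = 0xFD2FB528  # start of a regular zstd frame
ZSTD_DICT_MAGIC = 0xEC30A437   # start of a zstd dictionary


def read_leading_skippable_frame(data):
    """Return (payload, rest) if `data` starts with a skippable frame, else (None, data)."""
    if len(data) < 8:
        return None, data
    magic, size = struct.unpack_from("<II", data, 0)
    if not 0x184D2A50 <= magic <= 0x184D2A5F:
        return None, data
    if len(data) < 8 + size:
        raise ValueError("truncated skippable frame")
    return data[8:8 + size], data[8 + size:]


def describe_payload(payload):
    """Guess whether the payload is a raw dictionary or a zstd frame wrapping one."""
    if len(payload) < 4:
        return "unknown"
    (magic,) = struct.unpack_from("<I", payload, 0)
    if magic == ZSTD_DICT_MAGIC:
        return "raw dictionary"
    if magic == ZSTD_FRAME_MAGIC:
        return "zstd-compressed"  # assumption: would need decompressing first
    return "unknown"
```

The extracted bytes are what gets handed to the zstd library as the dictionary when decompressing the frames in `rest`.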
-
avoozl
Ahhh, makes sense
-
avoozl
Neat trick
-
JAA
OrIdow6 wrote a script recently for the extraction, but I can't find the link right now.
-
OrIdow6
-
JAA
There's also an effort to get this general concept into the contrib section of the Zstandard docs:
facebook/zstd #2349
-
OrIdow6
One of the dependencies broke recently, you need zstandard==0.10.2
-
OrIdow6
For the extraction script
-
OrIdow6
Or, at least, that's one of the working versions
-
avoozl
I'm using libzstd so things are slightly different here, but I'll manage
-
Dallas
JAA: what do you use to test warc replay? I wanna have another crack at TikTok and wanna know what the closest approximation to wayback is gonna be?
-
EggplantN
god fuck tiktok
-
EggplantN
but yes