-
arkiver
Ajay: hi! would you be able to set up a similar submission dashboard for so-net u-page+ as you did for mediafire?
-
arkiver
we can spread it through twitter to allow users to submit websites
-
arkiver
it's So-net U-Page+, we might want to put some japanese text on there
-
OrIdow6
arkiver: Some Japanese people on a message board have done that already
-
OrIdow6
-
arkiver
OrIdow6: where?
-
arkiver
do we have the output of that?
-
OrIdow6
The text file is downloadable at one of those links
-
arkiver
and was it tweeted around?
-
OrIdow6
Yes
-
OrIdow6
I don't know about Twitter
-
OrIdow6
It was spread through something
-
OrIdow6
Don't know if Google Translate told me
-
OrIdow6
1 month ago, or whenever I read the thread
-
arkiver
we should put it on our twitter
-
arkiver
i want us to get more active on our twitter
-
Arcorann
Actually that could be good in general for announcing project launches
-
thuban
who actually has access
-
Ajay
looks like it's already covered by that other site, but yea I can set that site up for any future projects if we need/want
-
Ajay
I agree with announcing project launches on twitter
-
fuzzy8021
EggplantN your colo in kansas city ks/mo?
-
EggplantN
yes
-
fuzzy8021
only about 6 hrs south of me
-
EggplantN
worrying
-
fuzzy8021
lol
-
fuzzy8021
on a separate note, not sure if i am considered a regular, but i'd be happy to get a permanent target at hetzner if i would be allowed
-
flashfire42
Do we wanna grab the NRA as they are restructuring to a not for profit organisation and filing for bankruptcy?
-
Craigle
Same actually. I have an AX-51-NVME that I would be more than happy to repurpose from workers to target if needed.
-
EggplantN
For now we're not too bad overall, we're coping and growing. Sadly one of our new targets isn't fully up to scratch and we're working to improve it
-
Craigle
flashfire42: That's probably not a bad idea. From the brief glance I made, it looks like they are already a non-profit, but they want to close that and re-start in Texas.
-
Craigle
Supposedly to avoid being sued in NYC
-
fuzzy8021
k, keep it in mind, open-ended offer
-
EggplantN
i will do thank you!
-
Craigle
All I hear is an excuse for you to put more new hardware in a colo :D
-
kiska
:D
-
EggplantN
>_> Craigle
-
EggplantN
dont shame me
-
Craigle
No shame, I'm all about it
-
EggplantN
>_>
-
atphoenix
WebBBS (Version 4.33; June 8, 2000)
-
atphoenix
-
atphoenix
oh you already found that
-
atphoenix
well or found something close to it
-
JAA
Yeah, there's a full version history.
-
JAA
-
OrIdow6
So in the interest of time, I am running a quick thing on CrowdMap to get report URLs
-
OrIdow6
Which are, as far as I can tell, the actual data being captured - the rest of the site more or less does nothing but display reports
-
OrIdow6
(With exceptions)
-
OrIdow6
The AB job didn't get them
-
OrIdow6
So I think these can be run via AB
-
OrIdow6
Actually looking to be about 400k urls here
-
JAA
Yeah, looked like simple non-scripty HTML I think.
-
JAA
I couldn't find an example with comments though.
-
OrIdow6
That part of the site, yeah
-
JAA
Well, nothing with many comments for potential pagination.
-
JAA
Anyway, since this might go down any second now, let's split it up into a couple jobs and run?
-
OrIdow6
Yeah, this is very barebones in the interest of time
-
JAA
'Best Effort' SLA on this one. :-)
-
OrIdow6
-
JAA
Actually,
archive.fart.website/archivebot/viewer/job/6hn3f did grab some reports. No idea how complete it is though.
-
JAA
Started
-
OrIdow6
Well, it must be incomplete somewhere, because it grabbed fewer URLs than there are so far in this list I made
-
OrIdow6
Thanks
-
JAA
Oof
-
JAA
Their server returned empty HTTP 200 responses for quite a few sites.
-
JAA
And it's doing that again.
-
OrIdow6
I noticed that it did that when I typoed endpoint names
-
OrIdow6
How can you tell when they're empty?
-
JAA
I'm looking at the WARCs.
-
JAA
And can reproduce it with curl.
-
OrIdow6
Oh, I thought you meant the running job
-
JAA
Yes, I'm looking at that job's WARC.
-
OrIdow6
Oh
-
OrIdow6
It should be returning non-empty responses for the reports
-
atphoenix
so
awsd.com/scripts/webbbs says "PLEASE GO TO TETRABB.COM FOR THE NEWEST VERSION OF THE WEBBBS FORUM".
tetrabb.com says "Domain for sale". ooops.
-
OrIdow6
Because it should have gotten the cookie beforehand
-
OrIdow6
At least, that's what I experienced
-
JAA
Well, not necessarily.
-
JAA
Concurrency etc.
-
JAA
But yeah, it's the cookie stuff.
-
JAA
The cookie seems to be valid for an hour.
-
JAA
So if we shuffle the list and add a fake request at the beginning for /?archiveteam or similar, it should be fine.
-
OrIdow6
Why shuffle the list?
-
JAA
Because it might not get to the end of the list within an hour.
-
JAA
Each request extends the cookie by an hour.
-
JAA
But by the time it gets to the bottom, those cookies might've expired already.
-
JAA
I've launched another recursive job with those requests.
-
JAA
By the way, excellent example of the -ot discussion earlier: the server sends a 'Refresh' header, which isn't standardised in HTTP headers but browsers behave as if it was a refresh meta tag.
-
JAA
Anyway, we can leave the job with your list as is if you want, but we'll probably miss a handful of URLs.
-
JAA
Nope, the new job is still getting empty 200s even though the cookie is being sent. WTF is this shit?
-
JAA
Well, occasionally at least.
-
JAA
Better than the previous attempt anyway.
-
OrIdow6
-
OrIdow6
Every domain name should be prefixed with
domain.crowdwhatever.com/?archiveteam now
-
JAA
That has the same issue.
-
JAA
Unless run at 1 concurrency, but that's not reasonable with the deadline already over.
-
OrIdow6
Oh
-
JAA
It would fetch multiple URLs at once, and every time it begins processing a new domain, it'll miss content on a few URLs.
-
JAA
In theory, you could put the /?archiveteam trick URL 'a bit further up', but it's hard to predict how many buffer URLs you need.
-
OrIdow6
If it had more than one dummy request, would that work?
-
OrIdow6
Oh, you got there before me
-
OrIdow6
Sort of
-
JAA
Here's what I'd do: shuffle the list, then take the unique domains in the order they first appear in the file, then insert the trick URL in that order at the beginning.
-
JAA
The shuffling also has the side effect that we'd get a random sample of the content if the site shuts down while we're still grabbing it.
-
JAA
As opposed to a strongly biased one.
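JAA's recipe could be sketched in Python roughly like this; `prepare_url_list` and the exact `/?archiveteam` trick path are illustrative, not the actual script that was used:

```python
import random
from urllib.parse import urlsplit

def prepare_url_list(urls, trick_path="/?archiveteam", seed=None):
    """Shuffle the URL list, then prepend one cookie-priming 'trick' URL
    per domain, in the order domains first appear after shuffling.

    The shuffle means a partial grab yields a random sample rather than
    a strongly biased one; the per-domain trick URLs set the cookie
    before any real content URL for that domain is fetched.
    """
    rng = random.Random(seed)
    shuffled = list(urls)
    rng.shuffle(shuffled)
    # Unique domains, in order of first appearance after shuffling
    domains = []
    for url in shuffled:
        host = urlsplit(url).netloc
        if host not in domains:
            domains.append(host)
    trick_urls = [f"https://{host}{trick_path}" for host in domains]
    return trick_urls + shuffled
```

Note this still assumes the whole list is processed within the cookie lifetime per domain, which is why the trick URLs go in first-appearance order rather than all at the very top of an unshuffled list.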
-
OrIdow6
Would it work to have a padding section in between the cookie-requests and the proper content?
-
JAA
That's a definite maybe.
-
JAA
:-P
-
OrIdow6
Well, it's only a few extra requests, I'll throw it in there in case it does help
-
JAA
Make it /?disable /?your /?stupid /?cookie /?bullshit :-)
-
OrIdow6
-
JAA
Looks good, thanks.
-
JAA
Want to do the same thing for the first list?
-
OrIdow6
Ok
-
OrIdow6
-
JAA
Queued that to a full pipeline. Oops
-
JAA
But the list 2 job seems to be running fine now. :-)
-
OrIdow6
Good
-
OrIdow6
-
OrIdow6
Though I'm just going in the order of the list scrape, which isn't random and doesn't seem to be completely alphabetical, either
-
Ryz
Heya OrIdow6, I'll be taking over from JAA, tossing in your work into AB
-
OrIdow6
Ryz: Ok; it seems to have slowed down recently (think it's hit a few big sites), so it may be some more time until I have another
-
Ryz
Check #archivebot
-
OrIdow6
Reading logs
-
hexa-
-
kiska
Known
-
hexa-
Thx
-
avoozl
Hey all. In short I'm looking for something that connects some 'web extraction/scraping' logic to WARC parsing. I can code in go (and python), but wanted to make sure I'm not overlooking anything. Basically I would like to convert a forum scrape from WARC to database records (post, user, etc.)
-
JAA
Hi avoozl. My go-to library for WARC parsing is warcio. You can use it to iterate over a WARC and then do whatever you need with the HTTP body.
-
JAA
That's Python. No idea if there's any decent Go libraries.
-
avoozl
There's some reasonable libraries, but most of them stop at the content level. So parsing the HTTP response and converting it into the right character set will take some additional effort
-
JAA
warcio does parse HTTP responses.
-
JAA
You'll want the content_stream() of each record.
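A minimal sketch of that warcio workflow (warcio is a third-party package, `pip install warcio`; the helper name and the HTML filter are just for illustration):

```python
def iter_html_records(warc_path):
    """Yield (url, content_type, body_bytes) for each HTML response
    record in a WARC file. Requires the 'warcio' package."""
    from warcio.archiveiterator import ArchiveIterator  # lazy import

    with open(warc_path, "rb") as fh:
        for record in ArchiveIterator(fh):
            if record.rec_type != "response":
                continue
            ctype = record.http_headers.get_header("Content-Type") or ""
            if "html" not in ctype:
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            # content_stream() gives the decoded HTTP payload as bytes;
            # note (per the discussion below) warcio does NOT apply the
            # charset from Content-Type, so decoding is up to the caller
            yield url, ctype, record.content_stream().read()
```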
-
avoozl
Thanks I'll take a look at how that is implemented. The go library I was using just gives me the raw content stream, but doesn't do any handling of content encodings
-
avoozl
Seems like I'll need to add quite a few parts to this go library, but that's fair. thanks
-
avoozl
JAA: browsing through the warcio source, I don't think I can see it actually parsing/using the response header such as 'Content-Type: text/html; charset=UTF-8' ... Not sure how it currently selects the encoding
-
JAA
Hmm, I thought it did.
-
JAA
But yeah, looks like you're right.
-
Sanqui
don't want to make you sound stupid, but if you're parsing a single forum it's probably enough to just hardcode the relevant encoding
-
Sanqui
(if you're parsing many forums, I wanna talk about your project over dinner)
-
JAA
:-)
-
JAA
Yeah, agreed.
-
avoozl
Sanqui: I'm parsing quite a few different forums, but usually everything using the same 'base' is ok. I'm currently in the process of expanding the scope a bit, and this bit me
-
Sanqui
right. well, remember that charset parsing is non-trivial anyway, and even browsers do quite a bit of guesswork
-
JAA
I wonder if Requests has a nice way to handle this.
-
avoozl
Sanqui: I feared so.. I'm currently browsing through some go/net/http/response code, and they don't really have a great way of handling this either.. I'll check some other sources
-
JAA
-
Sanqui
the standard workflow is probably 1. check the first 512 bytes for UTF-16 BOM or a <meta charset tag, 2. check the HTTP header, 3. run some heuristics on the text to guess.
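Sanqui's three-step workflow might look roughly like this in Python; the regex and the final fallback are deliberate simplifications, not a spec-complete sniffer:

```python
import codecs
import re

def sniff_charset(body, content_type=None):
    """Guess the charset of an HTML body, roughly in the order:
    1. BOM / <meta charset> in the first 512 bytes,
    2. HTTP Content-Type header parameter,
    3. crude heuristic (valid UTF-8, else latin-1)."""
    head = body[:512]
    # 1a. byte-order marks
    for bom, enc in ((codecs.BOM_UTF8, "utf-8-sig"),
                     (codecs.BOM_UTF16_LE, "utf-16-le"),
                     (codecs.BOM_UTF16_BE, "utf-16-be")):
        if head.startswith(bom):
            return enc
    # 1b. <meta charset="..."> or http-equiv declaration
    m = re.search(rb'<meta[^>]+charset=["\']?([\w.-]+)', head, re.I)
    if m:
        return m.group(1).decode("ascii", "replace").lower()
    # 2. HTTP header, e.g. 'text/html; charset=ISO-8859-2'
    if content_type:
        m = re.search(r'charset=([\w.-]+)', content_type, re.I)
        if m:
            return m.group(1).lower()
    # 3. heuristic: if it decodes as UTF-8 it probably is; latin-1
    # never fails, so it is the catch-all (real sniffers do better)
    try:
        body.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "latin-1"
```

As Sanqui says, real browsers do far more guesswork than step 3 (statistical detection per language, etc.), so treat this as a starting point.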
-
avoozl
-
avoozl
JAA: yes :)
-
Sanqui
I would definitely prioritize what the HTML document says over the HTTP header
-
avoozl
Sanqui: RFC 2616 disagrees, but I guess reality is harsh :)
-
Sanqui
indeed
-
JAA
Yeah, you can't really implement an HTTP client based on the specs if you want to be compatible with shitty servers.
-
JAA
We just had that discussion in -ot last night. :-)
-
avoozl
Haha, ok :)
-
Sanqui
browsers have gotten surprisingly good at this -- I've been browsing 2000s czech websites and I forgot half of them have byzantine encodings that firefox just autodetects
-
JAA
'good'
-
avoozl
I'll see how far I can get with just the basics. If I hit any encoding snags in reality I'll come back to bug you :)
-
avoozl
I'll stick around on the channel, sounds like some interesting discussions went on here :)
-
JAA
Browsers should simply refuse to display pages that don't specify the correct encoding per spec. Oh well, let's not have that discussion again. :-P
-
JAA
You may be interested in #archiveteam-ot and #archiveteam-dev as well.
-
avoozl
Basically, I built a prototype of a scraper a while ago that can take a config file that determines which parts to extract (xpath/css matching) for certain url types, and then pushes it all into neo4j (not ideal, but easy to set up)
-
Sanqui
JAA: remember XHTML?
-
Sanqui
we've tried the whole "refuse to display non-standard pages" thing
-
avoozl
Now I found the lovely trove of warc files on archive.org, and I'm rethinking part of my approach to just read an entire forum
-
JAA
Sanqui: Yes, I used to develop all my websites with XHTML. But the trainwreck had long left the station by that point.
-
Sanqui
avoozl: a database of web forums from archive.org is one of my dream projects
-
» avoozl has some flashbacks to structured web and OWL
-
JAA
See also Transfer-Encoding vs Content-Encoding, which *nobody* seems to use correctly.
-
avoozl
Sanqui: I'm trying to keep it self-contained, I have experimented with blevesearch and dgraph before, but it is hard to work at scale. neo4j seems like a nice middle ground, but it will require a fairly beefy setup
-
Sanqui
old forums are a goldmine, a treasure trove of information, and as they drop out they're no longer searchable by google
-
Sanqui
even if we work to archive them
-
avoozl
of course you could just dump everything into elastic and try the 'search' approach. But I like analytics so I want things a bit more organized and referenced
-
Sanqui
absolutely, as a first step it'd be great to even just have metadata -- there's these fora, they had this many posts and users, click here to browse them in wayback
-
Sanqui
a graph of posts over time so you can say "prime time was 2007"
-
Sanqui
etc.
-
JAA
I've also had an idea for a project in this direction before. A standardised format for any sort of online discussion, extensible with platform-specific information as needed.
-
JAA
And then parsers that extract things accordingly from forums, social media, mailing lists, and whatnot.
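A rough sketch of what such a standardised format might look like as Python dataclasses; every field name here is hypothetical, since no such format exists yet:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Post:
    """One message in any kind of online discussion."""
    id: str
    author: str
    timestamp: str                    # ISO 8601
    body: str
    parent_id: Optional[str] = None   # reply threading, if any
    extra: dict = field(default_factory=dict)  # platform-specific data

@dataclass
class Thread:
    """A container of posts: forum topic, mail thread, comment tree..."""
    id: str
    title: str
    source: str                       # e.g. 'phpbb', 'mailing-list'
    posts: list = field(default_factory=list)
```

The idea would be that a phpBB parser, a mailing-list parser, and a social-media parser all emit this common core, stashing anything platform-specific in `extra`.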
-
avoozl
Sanqui: do you have any specific 'small' forum from archiveteam that you can recommend to try first? I'm currently just picking something at random, but I'd rather start with something that is pretty textual and not too large (say <100GB)
-
avoozl
JAA: yeah I've been thinking along those lines as well, but it is always difficult to scale these things properly, especially once 'time' becomes part of the storage structure (this user existed at this time, but later it disappeared, or changed alias, etc.)
-
JAA
Yeah
-
JAA
-
avoozl
JAA: everyone seems to want something different out of it.. for me I would like two things: browsing a forum with search, and running analytical queries (spark/graphql/python) on subsets of the data (and on the topology)
-
Sanqui
invisionfree and zetaboards used to have tons and archive.org archived a lot of them
-
Sanqui
and they're pretty standard
-
Sanqui
I'm currently archiving this major estonian forum active since 2002
foorum.hinnavaatlus.ee
-
avoozl
the LOL one looks good, I'll have a quick look at that
-
Sanqui
78756 users, 4860928 posts
-
avoozl
Ages ago when geocities was archived I started playing with that, but I'm glad things have gotten a bit easier these days. that was a tricky set to get anything out of
-
JAA
avoozl: Big advantage is that those archives only contain the relevant HTML pages, no images, videos, outlinks, or other fluff.
-
JAA
So that should make for a nice small test bed.
-
avoozl
JAA: yeah that makes it perfect. I started at something from archiveteam eu domains and that was 99% non-html
-
JAA
Then once that works, try an ArchiveBot crawl of a small forum I suppose.
-
Sanqui
in 2007 I archived a czech forum about pet birds that seems to be offline now
-
Sanqui
-
Sanqui
sorry 2017*
-
avoozl
I'll have a go at LOL first, see when I find some time to work on this
-
Sanqui
also speaking of google, I've come to realize just how irrelevant it's gotten when it comes to finding quality information
-
Sanqui
increasingly I'm finding better information just by searching reddit, hacker news, a discord server relevant to the topic, or heck literally doing a fulltext search in my telegram chats
-
Sanqui
times are changing
-
avoozl
What i like most is that you can easily store the entire history of reddit on your desktop machine, unless you want media
-
Sanqui
but yeah accessibility of archived data is one thing we (archive team) are not that great at... there's always so much going on that by the time a project is done, we immediately move onto the next thing that's at danger
-
Sanqui
which is fine, the data is saved and we keep focus, BUT I would be delighted to see more projects making use of the archives, analyzing, enabling ease of access
-
Sanqui
so, thumbs up from me
-
JAA
++
-
avoozl
It is also difficult to find the right 'size' of software project. 'accessing data' could easily spiral out of control into some planet-sized IPFS key-value store with auto-indexing and distributed version control... which of course will never be finished and the user experience will be backlogged into the better half of this century
-
Sanqui
it's absolutely true
-
avoozl
I would just like something fairly simple, see if it sticks. Then if anyone wants to move it further, lovely.
-
Sanqui
a narrow, detail-oriented focus is often better than casting a net that's too wide
-
avoozl
Also, some shiny-object-syndrome exists... I tend also to think 'oh maybe I could run openface-id neural nets to pick up all the faces in the parler data'.. and then of course, you COULD do that, but it feels worse than just working at the core of things
-
Sanqui
#adhd
-
JAA
YES
-
JAA
This is why my AT indexer is still not running, basically. :-P
-
Sanqui
a bizarre forum I'm archiving right now :p
turfwarsapp.com/forum/43/topic/4824171
-
Sanqui
for a geolocation-based mobile game
-
avoozl
they used to play ingress here, that must be a ton of data too
-
Sanqui
motivated by a friend I archived a few websites pertaining to geolocation based games because they've been having a hard time with the pandemic
-
Sanqui
ingress is probably safe
-
avoozl
JAA: is there any trick to getting files from archive.org faster? I typically see curl/wget drop to like 250KB/sec after a while and it takes forever for most downloads
-
avoozl
or is that normal speed
-
avoozl
average speed on larger files seems to be around 400KB/sec.. 1,91G 397KB/s in 94m 3s
-
purplebot
A Million Ways to Die on the Web edited by KamafaDelgato (+14) just now --
archiveteam.org/?diff=46185&oldid=44414
-
purplebot
Template:IRC edited by Justcool393 (-12, Default to hackint for ArchiveTeam …) just now --
archiveteam.org/?diff=46186&oldid=41611
-
purplebot
This Is My Jam edited by Flashfire42 (+104) just now --
archiveteam.org/?diff=46188&oldid=31812
-
purplebot
FileTrip edited by Flashfire42 (+0) just now --
archiveteam.org/?diff=46189&oldid=35949
-
purplebot
Wildscreen Arkive edited by Flashfire42 (+0) just now --
archiveteam.org/?diff=46190&oldid=34427
-
JAA
avoozl: Nope, that's normal speed. :-/
-
avoozl
Ok, I'll just throw it all into the queue
-
atphoenix
Re: comments in #archiveteam .... I suppose some think that bugtraq is comparatively boring...
-
purplebot
Coronavirus edited by Wessel1512 (+26, /* Archives and dedicated sites …) just now --
archiveteam.org/?diff=46193&oldid=46100
-
purplebot
Template:IRC edited by JustAnotherArchivist (+12, Reverted edits by [[Special:Contributions/Justcool393|Justcool393]] …) 22 minutes ago --
archiveteam.org/?diff=46191&oldid=46186
-
purplebot
This Is My Jam edited by Sanqui (+8, use job template for job id) 20 minutes ago --
archiveteam.org/?diff=46192&oldid=46188
-
JAA
On the Template:IRC edit reversal: that breaks pretty much every single IRC channel mention on the wiki. We need a mass edit at the same time as changing the default network in the template. There's been a bit of discussion in here about how to best do that and possibly also tackle the issue of dead channels at the same time (e.g. '#archiveteam-bs (on hackint), formerly #foobar (on EFnet)'). Until
-
JAA
that happens, the IRC template should stay as it is, even though it's messy. (Cc justcool393)
-
DaxServer
-
JAA
We're not GNU, but it's up now.
-
JAA
(And yeah, I could reproduce it being down at first.)
-
DaxServer
I have created a pull request to update the Dockerfile and add Wget-AT as a common dependency into the container itself, so that all the projects can use it
ArchiveTeam/warrior-dockerfile #44
-
brad
DaxServer — I think the warrior hasn’t been used in a while, but I would be interested to see a how a dockerized version works, with all the C&C coordination that is required, etc....
-
brad
Speaking of Warrior-type activities, does anyone know what the status is of a warrior-style archive project for the community.fantasyflightgames.com site.
-
brad
I know there was the ArchiveBot pipeline 218c8179a369ceb37a999add83e36442 but that’s just a single source that is probably getting throttled or temporarily banned frequently, and I don’t know if they’ve even made a single full complete run yet.
-
EggplantN
brad: he left sadly but I do believe it would've been nice for pre-parler for people to have it available
-
brad
Yeah. ;(
-
EggplantN
we have lots of new helpers now and if it works, pending approval and review from the devs, it would be ideal till trackerv2 & warriorv4 are ready
-
EggplantN
either way this is probably best for #archiveteam-dev
-
brad
Thanks! I’ll head over there....
-
brad
Oh, I do have another question — are the WARCs created by ArchiveBot and other ArchiveTeam projects available anywhere for download? Some of the folks on the FFG SWRPG Discord are setting up a new community-owned forum site, and would also like to have their own searchable archive of the FFG community site, and I know the WARCs are key to doing that.
-
hook54321
-
hook54321
-
brad
Thanks!
-
jodizzle
brad: The main job for community.fantasyflightforums.com finished, so the bulk of the site should be archived. However, there were a large number of 403s, which are now running in a separate job (erulqjgzn97r2xiab2yqe1qqv).
-
jodizzle
This isn't a perfect solution, because the way it's set up, AB won't recurse on the URLs in the second job (it's an '!ao <', not an '!a').
-
brad
Yeah, I wasn’t able to find the original job on the trackers, so I assumed it had finished or shut down. I did find the one to sweep through and pick up the 403s for something like 180k links? Wow, that’s a lot of 403s....
-
jodizzle
It is 180k, but it's not all from community.fantasyflightforums.com. It's from other domains as well (including some that may 403 naturally)
-
brad
It makes total sense that you would do multiple runs. It hadn’t occurred to me that the best way to pick up the 403s was to do a non-recursive list of specific URLs to try, however. But that is kinda clever.
-
jodizzle
If I have time, I might look into other ways to retrieve any missing pages.
-
brad
Much appreciated!
-
jodizzle
Keep in mind that it's not really the best way, because again, there might be URLs on those 403-ing pages that the original job never got to.
-
jodizzle
Hopefully not too many, though.
-
brad
Right, so ideally you’d want to do multiple recursive runs, plus non-recursive runs with specific URLs.
-
brad
And I’m happy to help with that in any way I can.