-
arkiver
Ajay: hi! would you be able to set up a similar submission dashboard for so-net u-page+ as you did for mediafire?
-
arkiver
we can spread it through twitter to allow users to submit websites
-
arkiver
it's So-net U-Page+, we might want to put some japanese text on there
-
OrIdow6
arkiver: Some Japanese people on a message board have done that already
-
OrIdow6
-
arkiver
OrIdow6: where?
-
arkiver
do we have the output of that?
-
OrIdow6
The text file is downloadable at one of those links
-
arkiver
and was it tweeted around?
-
OrIdow6
Yes
-
OrIdow6
I don't know about Twitter
-
OrIdow6
It was spread through something
-
OrIdow6
Don't know if Google Translate told me
-
OrIdow6
1 month ago, or whenever I read the thread
-
arkiver
we should put it on our twitter
-
arkiver
i want us to get more active on our twitter
-
Arcorann
Actually that could be good in general for announcing project launches
-
thuban
who actually has access
-
Ajay
looks like it's already covered by that other site, but yea I can set that site up for any future projects if we need/want
-
Ajay
I agree with announcing project launches on twitter
-
fuzzy8021
EggplantN your colo in kansas city ks/mo?
-
EggplantN
yes
-
fuzzy8021
only about 6 hrs south of me
-
EggplantN
worrying
-
fuzzy8021
lol
-
fuzzy8021
on a separate note, not sure if i am considered a regular, but i'd be happy to get a permanent target at hetzner if i would be allowed
-
flashfire42
Do we wanna grab the NRA as they are restructuring to a not for profit organisation and filing for bankruptcy?
-
Craigle
Same actually. I have an AX-51-NVME that I would be more than happy to repurpose from workers to target if needed.
-
EggplantN
For now we're not too bad overall, we're coping and growing. Sadly one of our new targets isn't fully up to scratch and we're working to improve it
-
Craigle
flashfire42: That's probably not a bad idea. From the brief glance I made, it looks like they are already a non-profit, but they want to close that and re-start in Texas.
-
Craigle
Supposedly to avoid being sued in NYC
-
fuzzy8021
k, keep it in mind, open-ended offer
-
EggplantN
i will do thank you!
-
Craigle
All I hear is an excuse for you to put more new hardware in a colo :D
-
kiska
:D
-
EggplantN
>_> Craigle
-
EggplantN
dont shame me
-
Craigle
No shame, I'm all about it
-
EggplantN
>_>
-
atphoenix
WebBBS (Version 4.33; June 8, 2000)
-
atphoenix
-
atphoenix
oh you already found that
-
atphoenix
well or found something close to it
-
JAA
Yeah, there's a full version history.
-
JAA
-
OrIdow6
So in the interest of time, I am running a quick thing on CrowdMap to get report URLs
-
OrIdow6
Which are, as far as I can tell, the actual data being captured - the rest of the site more or less does nothing but display reports
-
OrIdow6
(With exceptions)
-
OrIdow6
The AB job didn't get them
-
OrIdow6
So I think these can be run via AB
-
OrIdow6
Actually looking to be about 400k urls here
-
JAA
Yeah, looked like simple non-scripty HTML I think.
-
JAA
I couldn't find an example with comments though.
-
OrIdow6
That part of the site, yeah
-
JAA
Well, nothing with many comments for potential pagination.
-
JAA
Anyway, since this might go down any second now, let's split it up into a couple jobs and run?
-
OrIdow6
Yeah, this is very barebones in the interest of time
-
JAA
'Best Effort' SLA on this one. :-)
-
OrIdow6
-
JAA
Actually,
archive.fart.website/archivebot/viewer/job/6hn3f did grab some reports. No idea how complete it is though.
-
JAA
Started
-
OrIdow6
Well, it must be incomplete somewhere, because it grabbed fewer URLs than there are so far in this list I made
-
OrIdow6
Thanks
-
JAA
Oof
-
JAA
Their server returned empty HTTP 200 responses for quite a few sites.
-
JAA
And it's doing that again.
-
OrIdow6
I noticed that it did that when I typoed endpoint names
-
OrIdow6
How can you tell when they're empty?
-
JAA
I'm looking at the WARCs.
-
JAA
And can reproduce it with curl.
-
OrIdow6
Oh, I thought you meant the running job
-
JAA
Yes, I'm looking at that job's WARC.
-
OrIdow6
Oh
-
OrIdow6
It should be returning non-empty responses for the reports
-
atphoenix
so
awsd.com/scripts/webbbs says "PLEASE GO TO TETRABB.COM FOR THE NEWEST VERSION OF THE WEBBBS FORUM".
tetrabb.com says "Domain for sale". ooops.
-
OrIdow6
Because it should have gotten the cookie beforehand
-
OrIdow6
At least, that's what I experienced
-
JAA
Well, not necessarily.
-
JAA
Concurrency etc.
-
JAA
But yeah, it's the cookie stuff.
-
JAA
The cookie seems to be valid for an hour.
-
JAA
So if we shuffle the list and add a fake request at the beginning for /?archiveteam or similar, it should be fine.
-
OrIdow6
Why shuffle the list?
-
JAA
Because it might not get to the end of the list within an hour.
-
JAA
Each request extends the cookie by an hour.
-
JAA
But by the time it gets to the bottom, those cookies might've expired already.
-
JAA
I've launched another recursive job with those requests.
-
JAA
By the way, excellent example of the -ot discussion earlier: the server sends a 'Refresh' header, which isn't standardised in HTTP headers but browsers behave as if it was a refresh meta tag.
-
JAA
Anyway, we can leave the job with your list as is if you want, but we'll probably miss a handful of URLs.
-
JAA
Nope, the new job is still getting empty 200s even though the cookie is being sent. WTF is this shit?
-
JAA
Well, occasionally at least.
-
JAA
Better than the previous attempt anyway.
-
OrIdow6
-
OrIdow6
Every domain name should be prefixed with
domain.crowdwhatever.com/?archiveteam now
-
JAA
That has the same issue.
-
JAA
Unless run at 1 concurrency, but that's not reasonable with the deadline already over.
-
OrIdow6
Oh
-
JAA
It would fetch multiple URLs at once, and every time it begins processing a new domain, it'll miss content on a few URLs.
-
JAA
In theory, you could put the /?archiveteam trick URL 'a bit further up', but it's hard to predict how many buffer URLs you need.
-
OrIdow6
If it had more than one dummy request, would that work?
-
OrIdow6
Oh, you got there before me
-
OrIdow6
Sort of
-
JAA
Here's what I'd do: shuffle the list, then take the unique domains in the order they first appear in the file, then insert the trick URL in that order at the beginning.
-
JAA
The shuffling also has the side effect that we'd get a random sample of the content if the site shuts down while we're still grabbing it.
-
JAA
As opposed to a strongly biased one.
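JAA's recipe could be sketched in Python roughly like this; `prepare_url_list` and the exact `/?archiveteam` trick path are illustrative, not the actual script that was used:

```python
import random
from urllib.parse import urlsplit

def prepare_url_list(urls, trick_path="/?archiveteam", seed=None):
    """Shuffle the URL list, then prepend one cookie-priming 'trick' URL
    per domain, in the order domains first appear after shuffling.

    The shuffle means a partial grab yields a random sample rather than
    a strongly biased one; the per-domain trick URLs set the cookie
    before any real content URL for that domain is fetched.
    """
    rng = random.Random(seed)
    shuffled = list(urls)
    rng.shuffle(shuffled)
    # Unique domains, in order of first appearance after shuffling
    domains = []
    for url in shuffled:
        host = urlsplit(url).netloc
        if host not in domains:
            domains.append(host)
    trick_urls = [f"https://{host}{trick_path}" for host in domains]
    return trick_urls + shuffled
```

Note this still assumes the whole list is processed within the cookie lifetime per domain, which is why the trick URLs go in first-appearance order rather than all at the very top of an unshuffled list.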
-
OrIdow6
Would it work to have a padding section in between the cookie-requests and the proper content?
-
JAA
That's a definite maybe.
-
JAA
:-P
-
OrIdow6
Well, it's only a few extra requests, I'll throw it in there in case it does help
-
JAA
Make it /?disable /?your /?stupid /?cookie /?bullshit :-)
-
OrIdow6
-
JAA
Looks good, thanks.
-
JAA
Want to do the same thing for the first list?
-
OrIdow6
Ok
-
OrIdow6
-
JAA
Queued that to a full pipeline. Oops
-
JAA
But the list 2 job seems to be running fine now. :-)
-
OrIdow6
Good
-
OrIdow6
-
OrIdow6
Though I'm just going in the order of the list scrape, which isn't random and doesn't seem to be completely alphabetical, either
-
Ryz
Heya OrIdow6, I'll be taking over from JAA, tossing in your work into AB
-
OrIdow6
Ryz: Ok; it seems to have slowed down recently (think it's hit a few big sites), so it may be some more time until I have another
-
Ryz
Check #archivebot
-
OrIdow6
Reading logs
-
hexa-
-
kiska
Known
-
hexa-
Thx
-
avoozl
Hey all. In short I'm looking for something that connects some 'web extraction/scraping' logic to WARC parsing. I can code in go (and python), but wanted to make sure I'm not overlooking anything. Basically I would like to convert a forum scrape from WARC to database records (post, user, etc.)
-
JAA
Hi avoozl. My go-to library for WARC parsing is warcio. You can use it to iterate over a WARC and then do whatever you need with the HTTP body.
-
JAA
That's Python. No idea if there's any decent Go libraries.
-
avoozl
There's some reasonable libraries, but most of them stop at the content level. So parsing the HTTP response and converting it into the right character set will take some additional effort
-
JAA
warcio does parse HTTP responses.
-
JAA
You'll want the content_stream() of each record.
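A minimal sketch of that warcio workflow (warcio is a third-party package, `pip install warcio`; the helper name and the HTML filter are just for illustration):

```python
def iter_html_records(warc_path):
    """Yield (url, content_type, body_bytes) for each HTML response
    record in a WARC file. Requires the 'warcio' package."""
    from warcio.archiveiterator import ArchiveIterator  # lazy import

    with open(warc_path, "rb") as fh:
        for record in ArchiveIterator(fh):
            if record.rec_type != "response":
                continue
            ctype = record.http_headers.get_header("Content-Type") or ""
            if "html" not in ctype:
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            # content_stream() gives the decoded HTTP payload as bytes;
            # note (per the discussion below) warcio does NOT apply the
            # charset from Content-Type, so decoding is up to the caller
            yield url, ctype, record.content_stream().read()
```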
-
avoozl
Thanks I'll take a look at how that is implemented. The go library I was using just gives me the raw content stream, but doesn't do any handling of content encodings
-
avoozl
Seems like I'll need to add quite a few parts to this go library, but that's fair. thanks
-
avoozl
JAA: browsing through the warcio source, I don't think I can see it actually parsing/using the response header such as 'Content-Type: text/html; charset=UTF-8' ... Not sure how it currently selects the encoding
-
JAA
Hmm, I thought it did.
-
JAA
But yeah, looks like you're right.
-
Sanqui
don't want to make you sound stupid, but if you're parsing a single forum it's probably enough to just hardcode the relevant encoding
-
Sanqui
(if you're parsing many forums, I wanna talk about your project over dinner)
-
JAA
:-)
-
JAA
Yeah, agreed.
-
avoozl
Sanqui: I'm parsing quite a few different forums, but usually everything using the same 'base' is ok. I'm currently in the process of expanding the scope a bit, and this bit me
-
Sanqui
right. well, remember that charset parsing is non-trivial anyway, and even browsers do quite a bit of guesswork
-
JAA
I wonder if Requests has a nice way to handle this.
-
avoozl
Sanqui: I feared so.. I'm currently browsing through some go/net/http/response code, and they don't really have a great way of handling this either.. I'll check some other sources
-
JAA
-
Sanqui
the standard workflow is probably 1. check the first 512 bytes for UTF-16 BOM or a <meta charset tag, 2. check the HTTP header, 3. run some heuristics on the text to guess.
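Sanqui's three-step workflow might look roughly like this in Python; the regex and the final fallback are deliberate simplifications, not a spec-complete sniffer:

```python
import codecs
import re

def sniff_charset(body, content_type=None):
    """Guess the charset of an HTML body, roughly in the order:
    1. BOM / <meta charset> in the first 512 bytes,
    2. HTTP Content-Type header parameter,
    3. crude heuristic (valid UTF-8, else latin-1)."""
    head = body[:512]
    # 1a. byte-order marks
    for bom, enc in ((codecs.BOM_UTF8, "utf-8-sig"),
                     (codecs.BOM_UTF16_LE, "utf-16-le"),
                     (codecs.BOM_UTF16_BE, "utf-16-be")):
        if head.startswith(bom):
            return enc
    # 1b. <meta charset="..."> or http-equiv declaration
    m = re.search(rb'<meta[^>]+charset=["\']?([\w.-]+)', head, re.I)
    if m:
        return m.group(1).decode("ascii", "replace").lower()
    # 2. HTTP header, e.g. 'text/html; charset=ISO-8859-2'
    if content_type:
        m = re.search(r'charset=([\w.-]+)', content_type, re.I)
        if m:
            return m.group(1).lower()
    # 3. heuristic: if it decodes as UTF-8 it probably is; latin-1
    # never fails, so it is the catch-all (real sniffers do better)
    try:
        body.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "latin-1"
```

As Sanqui says, real browsers do far more guesswork than step 3 (statistical detection per language, etc.), so treat this as a starting point.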
-
avoozl
-
avoozl
JAA: yes :)
-
Sanqui
I would definitely prioritize what the HTML document says over the HTTP header
-
avoozl
Sanqui: RFC 2616 disagrees, but I guess reality is harsh :)
-
Sanqui
indeed
-
JAA
Yeah, you can't really implement an HTTP client based on the specs if you want to be compatible with shitty servers.
-
JAA
We just had that discussion in -ot last night. :-)
-
avoozl
Haha, ok :)
-
Sanqui
browsers have gotten surprisingly good at this -- I've been browsing 2000s czech websites and I forgot half of them have byzantine encodings that firefox just autodetects
-
JAA
'good'
-
avoozl
I'll see how far I can get with just the basics. If I hit any encoding snags in reality I'll come back to bug you :)
-
avoozl
I'll stick around on the channel, sounds like some interesting discussions went on here :)
-
JAA
Browsers should simply refuse to display pages that don't specify the correct encoding per spec. Oh well, let's not have that discussion again. :-P
-
JAA
You may be interested in #archiveteam-ot and #archiveteam-dev as well.
-
avoozl
Basically, I built a prototype of a scraper a while ago that can take a config file that determines which parts to extract (xpath/css matching) for certain url types, and then pushes it all into neo4j (not ideal, but easy to set up)
-
Sanqui
JAA: remember XHTML?
-
Sanqui
we've tried the whole "refuse to display non-standard pages" thing
-
avoozl
Now I found the lovely trove of warc files on archive.org, and I'm rethinking part of my approach to just read an entire forum
-
JAA
Sanqui: Yes, I used to develop all my websites with XHTML. But the trainwreck had long left the station by that point.
-
Sanqui
avoozl: a database of web forums from archive.org is one of my dream projects
-
» avoozl has some flashbacks to structured web and OWL
-
JAA
See also Transfer-Encoding vs Content-Encoding, which *nobody* seems to use correctly.
-
avoozl
Sanqui: I'm trying to keep it self-contained, I have experimented with blevesearch and dgraph before, but it is hard to work at scale. neo4j seems like a nice middle ground, but it will require a fairly beefy setup
-
Sanqui
old forums are a goldmine, a treasure trove of information, and as they drop out they're no longer searchable by google
-
Sanqui
even if we work to archive them
-
avoozl
of course you could just dump everything into elastic and try the 'search' approach. But I like analytics so I want things a bit more organized and referenced
-
Sanqui
absolutely, as a first step it'd be great to even just have metadata -- there's these fora, they had this many posts and users, click here to browse them in wayback
-
Sanqui
a graph of posts over time so you can say "prime time was 2007"
-
Sanqui
etc.
-
JAA
I've also had an idea for a project in this direction before. A standardised format for any sort of online discussion, extensible with platform-specific information as needed.
-
JAA
And then parsers that extract things accordingly from forums, social media, mailing lists, and whatnot.
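A rough sketch of what such a standardised format might look like as Python dataclasses; every field name here is hypothetical, since no such format exists yet:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Post:
    """One message in any kind of online discussion."""
    id: str
    author: str
    timestamp: str                    # ISO 8601
    body: str
    parent_id: Optional[str] = None   # reply threading, if any
    extra: dict = field(default_factory=dict)  # platform-specific data

@dataclass
class Thread:
    """A container of posts: forum topic, mail thread, comment tree..."""
    id: str
    title: str
    source: str                       # e.g. 'phpbb', 'mailing-list'
    posts: list = field(default_factory=list)
```

The idea would be that a phpBB parser, a mailing-list parser, and a social-media parser all emit this common core, stashing anything platform-specific in `extra`.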
-
avoozl
Sanqui: do you have any specific 'small' forum from archiveteam that you can recommend to try first? I'm currently just picking something at random, but I'd rather start with something that is pretty textual and not too large (say <100GB)
-
avoozl
JAA: yeah I've been thinking along those lines as well, but it is always difficult to scale these things properly, especially once 'time' becomes part of the storage structure (this user existed at this time, but later it disappeared, or changed alias, etc.)
-
JAA
Yeah
-
JAA
-
avoozl
JAA: everyone seems to want something different out of it.. for me I would like two things: browsing a forum with search, and running analytical queries (spark/graphql/python) on subsets of the data (and on the topology)
-
Sanqui
invisionfree and zetaboards used to have tons and archive.org archived a lot of them
-
Sanqui
and they're pretty standard
-
Sanqui
I'm currently archiving this major estonian forum active since 2002
foorum.hinnavaatlus.ee
-
avoozl
the LOL one looks good, I'll have a quick look at that
-
Sanqui
78756 users, 4860928 posts
-
avoozl
Ages ago when geocities was archived I started playing with that, but I'm glad things have gotten a bit easier these days. that was a tricky set to get anything out of
-
JAA
avoozl: Big advantage is that those archives only contain the relevant HTML pages, no images, videos, outlinks, or other fluff.
-
JAA
So that should make for a nice small test bed.
-
avoozl
JAA: yeah that makes it perfect. I started at something from archiveteam eu domains and that was 99% non-html
-
JAA
Then once that works, try an ArchiveBot crawl of a small forum I suppose.
-
Sanqui
in 2007 I archived a czech forum about pet birds that seems to be offline now
-
Sanqui
-
Sanqui
sorry 2017*
-
avoozl
I'll have a go at LOL first, see when I find some time to work on this
-
Sanqui
also speaking of google, I've come to realize just how irrelevant it's gotten when it comes to finding quality information
-
Sanqui
increasingly I'm finding better information just by searching reddit, hacker news, a discord server relevant to the topic, or heck literally doing a fulltext search in my telegram chats
-
Sanqui
times are changing
-
avoozl
What i like most is that you can easily store the entire history of reddit on your desktop machine, unless you want media
-
Sanqui
but yeah accessibility of archived data is one thing we (archive team) are not that great at... there's always so much going on that by the time a project is done, we immediately move onto the next thing that's at danger
-
Sanqui
which is fine, the data is saved and we keep focus, BUT I would be delighted to see more projects making use of the archives, analyzing, enabling ease of access
-
Sanqui
so, thumbs up from me
-
JAA
++
-
avoozl
It is also difficult to find the right 'size' of software project. 'accessing data' could easily spiral out of control into some planet-sized IPFS key-value store with auto-indexing and distributed version control... which of course will never be finished and the user experience will be backlogged into the better half of this century
-
Sanqui
it's absolutely true
-
avoozl
I would just like something fairly simple, see if it sticks. Then if anyone wants to move it further, lovely.
-
Sanqui
a narrow, detail-oriented focus is often better than casting a net that's too wide
-
avoozl
Also, some shiny-object-syndrome exists... I tend also to think 'oh maybe I could run openface-id neural nets to pick up all the faces in the parler data'.. and then of course, you COULD do that, but it feels worse than just working at the core of things
-
Sanqui
#adhd
-
JAA
YES
-
JAA
This is why my AT indexer is still not running, basically. :-P
-
Sanqui
a bizarre forum I'm archiving right now :p
turfwarsapp.com/forum/43/topic/4824171
-
Sanqui
for a geolocation-based mobile game
-
avoozl
they used to play ingress here, that must be a ton of data too
-
Sanqui
motivated by a friend I archived a few websites pertaining to geolocation based games because they've been having a hard time with the pandemic
-
Sanqui
ingress is probably safe
-
avoozl
JAA: is there any trick to getting files from archive.org faster? I typically see curl/wget drop to like 250KB/sec after a while and it takes forever for most downloads
-
avoozl
or is that normal speed
-
avoozl
average speed on larger files seems to be around 400KB/sec.. 1,91G 397KB/s in 94m 3s
-
purplebot
A Million Ways to Die on the Web edited by KamafaDelgato (+14) just now --
archiveteam.org/?diff=46185&oldid=44414
-
purplebot
Template:IRC edited by Justcool393 (-12, Default to hackint for ArchiveTeam …) just now --
archiveteam.org/?diff=46186&oldid=41611
-
purplebot
This Is My Jam edited by Flashfire42 (+104) just now --
archiveteam.org/?diff=46188&oldid=31812
-
purplebot
FileTrip edited by Flashfire42 (+0) just now --
archiveteam.org/?diff=46189&oldid=35949
-
purplebot
Wildscreen Arkive edited by Flashfire42 (+0) just now --
archiveteam.org/?diff=46190&oldid=34427
-
JAA
avoozl: Nope, that's normal speed. :-/
-
avoozl
Ok, I'll just throw it all into the queue
-
atphoenix
Re: comments in #archiveteam .... I suppose some think that bugtraq is comparatively boring...
-
purplebot
Coronavirus edited by Wessel1512 (+26, /* Archives and dedicated sites …) just now --
archiveteam.org/?diff=46193&oldid=46100
-
purplebot
Template:IRC edited by JustAnotherArchivist (+12, Reverted edits by [[Special:Contributions/Justcool393|Justcool393]] …) 22 minutes ago --
archiveteam.org/?diff=46191&oldid=46186
-
purplebot
This Is My Jam edited by Sanqui (+8, use job template for job id) 20 minutes ago --
archiveteam.org/?diff=46192&oldid=46188
-
JAA
On the Template:IRC edit reversal: that breaks pretty much every single IRC channel mention on the wiki. We need a mass edit at the same time as changing the default network in the template. There's been a bit of discussion in here about how to best do that and possibly also tackle the issue of dead channels at the same time (e.g. '#archiveteam-bs (on hackint), formerly #foobar (on EFnet)'). Until
-
JAA
that happens, the IRC template should stay as it is, even though it's messy. (Cc justcool393)
-
DaxServer
-
JAA
We're not GNU, but it's up now.
-
JAA
(And yeah, I could reproduce it being down at first.)
-
DaxServer
I have created a pull request to update the Dockerfile and add Wget-AT as a common dependency into the container itself, so that all the projects can use it
ArchiveTeam/warrior-dockerfile #44
-
brad
DaxServer — I think the warrior hasn’t been used in a while, but I would be interested to see a how a dockerized version works, with all the C&C coordination that is required, etc....
-
brad
Speaking of Warrior-type activities, does anyone know what the status is of a warrior-style archive project for the community.fantasyflightgames.com site.
-
brad
I know there was the ArchiveBot pipeline 218c8179a369ceb37a999add83e36442 but that’s just a single source that is probably getting throttled or temporarily banned frequently, and I don’t know if they’ve even made a single full complete run yet.
-
EggplantN
brad: he left sadly but I do believe it would've been nice for pre-parler for people to have it available
-
brad
Yeah. ;(
-
EggplantN
we have lots of new helpers now and if it works, pending approval and review from the devs, it would be ideal till trackerv2 & warriorv4 are ready
-
EggplantN
either way this is probably best for #archiveteam-dev
-
brad
Thanks! I’ll head over there....
-
brad
Oh, I do have another question — are the WARCs created by ArchiveBot and other ArchiveTeam projects available anywhere for download? Some of the folks on the FFG SWRPG Discord are setting up a new community-owned forum site, and would also like to have their own searchable archive of the FFG community site, and I know the WARCs are key to doing that.
-
hook54321
-
hook54321
-
brad
Thanks!
-
jodizzle
brad: The main job for community.fantasyflightforums.com finished, so the bulk of the site should be archived. However, there were a large number of 403s, which are now running in a separate job (erulqjgzn97r2xiab2yqe1qqv).
-
jodizzle
This isn't a perfect solution, because the way it's set up, AB won't recurse on the URLs in the second job (it's an '!ao <', not an '!a').
-
brad
Yeah, I wasn’t able to find the original job on the trackers, so I assumed it had finished or shut down. I did find the one to sweep through and pick up the 403s for something like 180k links? Wow, that’s a lot of 403s....
-
jodizzle
It is 180k, but it's not all from community.fantasyflightforums.com. It's from other domains as well (including some that may 403 naturally)
-
brad
It makes total sense that you would do multiple runs. It hadn’t occurred to me that the best way to pick up the 403s was to do a non-recursive list of specific URLs to try, however. But that is kinda clever.
-
jodizzle
If I have time, I might look into other ways to retrieve any missing pages.
-
brad
Much appreciated!
-
jodizzle
Keep in mind that it's not really the best way, because again, there might be URLs on those 403-ing pages that the original job never got to.
-
jodizzle
Hopefully not too many, though.
-
brad
Right, so ideally you’d want to do multiple recursive runs, plus non-recursive runs with specific URLs.
-
brad
And I’m happy to help with that in any way I can.