-
JAA
Ah yes, I see it. Thanks!
-
JAA
> Upcoming Planned Outage for Site Updates
-
JAA
> On September 6th starting at 3PM Pacific, Kongregate.com will be unavailable for planned maintenance. Our engineers will be hard at work implementing back end updates in preparation for awesome site upgrades!
-
JAA
> The Kongregate forums will be inaccessible after the update. If you are still referencing or linking to information on the forums, please take steps to save the information elsewhere.
-
JAA
You'd think they'd post about it on the forums or have a permanent notice on the page rather than only displaying it on (some?) game pages. Seems on-brand for Kongregate though.
-
guest_5981
They already shut down the forums.
-
guest_5981
It's also on-brand for any company that thinks old forum software is obsolete compared to social media
-
JAA
I mean, the forums were allegedly made read-only in January 2022, but there are posts from a few days ago, so... :-)
-
JAA
I'll dig out my scripts from when I dumped the forums three years ago.
-
JAA
I doubt they changed anything about how it works, so I can probably just bump the maximum topic number and regrab everything.
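A regrab like that is just sequential ID enumeration. A minimal sketch of the idea (the bare `/forums/topics/<id>` pattern here is hypothetical and illustrative only; real Kongregate topic URLs include a forum ID and slug segment):

```python
# Hypothetical sketch of regrabbing by bumping the maximum topic ID.
# Real Kongregate topic URLs include a forum ID and slug; this bare
# pattern is illustrative only.
MAX_TOPIC_ID = 1956300  # overscan past the highest known topic


def topic_urls(max_id=MAX_TOPIC_ID):
    """Yield one candidate URL per sequential topic ID."""
    for topic_id in range(1, max_id + 1):
        yield f'https://www.kongregate.com/forums/topics/{topic_id}'
```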
-
JAA
lolwtf, the outage warning is done with a fucking cookie.
-
JAA
< set-cookie: kong_flash_messages=%7B%22messages%22%3A%5B%22%5Cu003cdiv+class%3D%5C%22media+pas%5C%22%5Cu003e+<snip>
-
JAA
A 2.2 kB cookie
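For reference, the cookie value is just URL-encoded JSON with HTML inside its strings. A quick decoding sketch; the value below is a shortened, hypothetical stand-in for the real ~2.2 kB one:

```python
import json
from urllib.parse import unquote_plus

# Shortened, hypothetical stand-in for the real ~2.2 kB
# kong_flash_messages cookie value (same encoding style).
raw = ("%7B%22messages%22%3A%5B%22%5Cu003cdiv+class%3D%5C%22media+pas"
       "%5C%22%5Cu003eUpcoming+outage%5Cu003c%2Fdiv%5Cu003e%22%5D%7D")

# unquote_plus undoes both %XX escapes and '+'-encoded spaces;
# the result is a JSON object with HTML inside its strings.
data = json.loads(unquote_plus(raw))
print(data["messages"][0])  # the HTML for the notice banner
```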
-

guest_5981
At least it wasn't printed, photographed then uploaded.
-
h2ibot
JustAnotherArchivist edited Kongregate (+875, Add forum removal, datetimeify):
wiki.archiveteam.org/?diff=50697&oldid=46612
-
fireonlive
wooooow lol
-
h2ibot
JustAnotherArchivist edited Deathwatch (+248, /* 2023 */ Add Kongregate forums):
wiki.archiveteam.org/?diff=50698&oldid=50623
-
JAA
In any case, the forums still work exactly the same as they did three years ago, as expected. Assuming they didn't introduce rate limiting, that should easily finish before the deadline. The 2020 run took just over 24 hours. This one might even be shorter since they purged a bunch of content at the time.
-
Rynav
Hello, i am interested in the status of the picosong comment archival process. is anyone still doing it or is it abandoned?
-
Rynav
yes
-
JAA
Ah right, I wanted to do that at the time, but Disqus is a minor pain, and I eventually forgot about it.
-
Rynav
sad, is there a chance you could try doing it again, maybe i can help you with that
-
JAA
It's still somewhere on my todo list, so I'd get to it eventually. Not sure when I'll have time for it though. And also not sure whether it'll work properly in the WBM on the picosong song pages. That part in particular is where the pain is involved IIRC.
-
JAA
It's funny how Disqus is now tracking comments for the WBM snapshots because people accessing those trigger requests to Disqus that don't contain the original picosong URL.
-
JAA
Yeah, I don't think the WBM playback can work.
-
JAA
Which doesn't mean we shouldn't still archive the comments, of course.
-
JAA
Kongregate ETA: 18 hours
-
fireonlive
-+rss- PSA: Don't base your business around Discord. 7yr account banned for posting ASNs: There's been a trend of startups basing their entire business on Discord, like Midjourney AI, and Discord themselves pushing for people to do so with their subscriptions system. Well, just a few days ago I found out that my account of 7 years was just banned
-
fireonlive
without a warning for a very obvious error on their part. Just hours [...]
news.ycombinator.com/item?id=37364605
-
fireonlive
on why AT doesn’t use discord
-
fireonlive
we pass around a lot of IP addresses :P
-
fireonlive
and scary things like, results of dns lookups!
-
fireonlive
on->yet another reason
-
JAA
Thanks, added to my list. :-)
-
fireonlive
:)
-
flashfire42|m
We are running out of items for telegram. If we didn’t have current space issues I’d say switch the project default to YouTube for an hour or 2
-
thuban
are we not unstashing the backlog because of said space issues?
-
flashfire42
I dunno, either way I think we need arkiver to do it
-
flashfire42
Ok, that's a little bit more telegram that won't even last like an hour
-
fireonlive
let the warriors rest :P
-
flashfire42|m
No
-
fireonlive
flashfire42|m: crack that whip then ;)
-
fireonlive
:D
-
flashfire42|m
THEY FEED. THE TELEGRAM WORKERS THEY FEED
-
JAA
Kongregate forums ETA: 5 hours
-
JAA
I've also started an AB job for it.
-
JAA
My qwarc crawl isn't grabbing images etc., only topic pages.
-
fireonlive
flashfire42|m: ”Please sir, I want some more”
-
JAA
Kongregate forums qwarc just finished a couple minutes ago. 4 topics couldn't be retrieved because of server-side bugs, everything else should be covered.
-
JAA
There are a bunch of funny topics that redirect to a slug URL but then return a 404:
kongregate.com/forums/503507-animat…topics/1956229-seige-voting-for-9-3
-
JAA
(That's the highest topic ID that exists in some form. I overscanned until 1956300.)
-
JAA
I got HTTP 200s from 479808 topic IDs.
-
qq44|m
anyone know of an archiving tool that can mirror websites and work with https proxies?
-
qq44|m
I need to redirect certain urls to other urls while the mirroring tool is running
-
qq44|m
trying to figure out the best way to do something like this
-
JAA
Proper archival would preserve the exact data sent by the origin server. That contradicts what you're trying to achieve here.
-
pokechu22
The way I'd approach that (not having actually done it) is to modify the tool's URL-following logic instead of using a proxy, so when the tool recurses over pages it replaces URLs as needed
-
TheTechRobo
If you do do that remember it can’t really go into WARC. If you do want to do WARC anyway you could do a dual-proxy setup with grabber -> MITM proxy -> warcprox (or something like it) -> target site
-
pokechu22
(but doesn't affect the saved page or similar)
-
TheTechRobo
Or yeah, modify the grabber itself.
-
JAA
Agreed on both ideas.
-
fireonlive
oh i meant to ask again because i lost my notes :( if not ArchiveBox (
archivebox.io) because webrecorder is there a better alternative for personal use?
-
qq44|m
JAA, TheTechRobo: yes i know, it would not be a proper archive in this scenario. I'm trying to retain a copy of the info on the site, not necessarily archive it
-
qq44|m
pokechu22: which tool do you recommend for something like this?
-
pokechu22
I imagine it would be possible with scripting (either lua with wget-at or python with wpull) but I haven't actually used either of them
-
qq44|m
I've tried wpull, but there doesn't seem to be a hook that I can use
-
qq44|m
that or I'm just not familiar enough with the hooks
-
JAA
It should be possible with wpull's hooks, but it depends a bit on the details of what you need to do.
-
JAA
I had to do weird trickery there before, collecting URLs in one hook and then queueing them from another, which always has the potential to break.
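A sketch of that collect-then-queue pattern, using callback names from wpull's 1.x hook scripting API; the versioned-URL rule and `foo.example` site are hypothetical:

```python
# Sketch of the pattern JAA describes for wpull hook scripts:
# collect URLs in one callback, hand them to wpull from another.
# Callback names follow wpull 1.x hook scripting; the "/3/" URL
# rewriting rule and site are hypothetical.
import re

pending = []   # URLs discovered in one hook, queued from the other
seen = set()


def accept_url(url_info, record_info, verdict, reasons):
    # Observe every candidate URL; for each plain page, note the
    # versioned variant we actually want (hypothetical rule).
    url = url_info['url']
    m = re.match(r'(https://foo\.example)/([^/]+\.html)$', url)
    if m:
        versioned = f'{m.group(1)}/3/{m.group(2)}'
        if versioned not in seen:
            seen.add(versioned)
            pending.append(versioned)
    return verdict  # don't override wpull's own decision


def get_urls(filename, url_info, document_info):
    # Queue everything collected so far. Shuttling state between
    # hooks like this is exactly the fragile part.
    urls = [{'url': u} for u in pending]
    pending.clear()
    return urls


try:  # register when running under wpull; harmless otherwise
    wpull_hook.callbacks.accept_url = accept_url  # noqa: F821
    wpull_hook.callbacks.get_urls = get_urls      # noqa: F821
except NameError:
    pass
```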
-
TheTechRobo
I suggest Wget-AT, it’s fairly usable and isn’t a nightmare to install on modern Python versions
-
JAA
Personally, I'd do it with qwarc, but only because I wrote it and know how it works. It's entirely undocumented and not necessarily entirely intuitive.
-
TheTechRobo
You kinda have to build it with docker though because of the dependency hell
-
JAA
You can just use the pre-built image though.
-
qq44|m
JAA: what i'm trying to do is a bit complicated. basically, before connecting to a page, I need to connect to a different page to find the URL i'm looking for, and then sort of spoof the whole warc entry to look like it's coming from the original url
-
qq44|m
so lets say I'm trying to grab a website, foo.com. I'm trying to get the page
foo.com/a.html. However, the page I actually need is
foo.com/3/a.html
-
qq44|m
however, to find 3/a.html, I need to fetch a different page first that lists all of the versions of a.html, and pick a specific one
-
JAA
qq44|m: Please don't spoof WARCs.
-
JAA
WARCs have a very specific purpose and meaning. They capture original HTTP traffic exactly as transmitted to/from the server.
-
qq44|m
JAA: i know i know, it's for personal use, not to actually upload to an archive
-
qq44|m
the warc is only temporary
-
qq44|m
i want to unpack all of the data at the end into files on disk
-
JAA
Why not write files to disk directly? :-)
-
qq44|m
im using warc2html to unpack the files
-
qq44|m
it rewrites all of the links for me
-
qq44|m
so that they can be viewed on disk
-
JAA
So similar to wget's --convert-links?
-
qq44|m
yes exactly
-
qq44|m
can wget-at include page requisites when mirroring a site?
-
qq44|m
wget doesn't seem to include 3rd party page requisites when mirroring
-
JAA
Should be possible via Lua hooks.
-
JAA
wpull has the neat direct option for it (and also has --convert-links).
-
qq44|m
in that case I wouldn't need the warc, but I still can't figure out how to fetch the correct urls im looking for
-
qq44|m
my thinking was have the proxy do all that logic and leave the grabber unaware of it so that I don't have to modify the grabber
-
qq44|m
wpull works with http proxies, but not https
-
qq44|m
if wget-at works with https proxies, and can save the 3rd party page requisites, then I think I have this problem mostly solved
-
JAA
We never use proxies here, so it's not well-tested, but I think wget should support HTTPS proxies, yeah.
-
qq44|m
but does it also support the 3rd party page requisites?
-
qq44|m
a while ago i tested wget with https proxy and believe it was working, so should be good there
-
qq44|m
for a single page wget saves the 3rd party page requisites
-
qq44|m
but with the mirror arg it doesn't for some reason
-
JAA
You'd probably have to use wget-at with a Lua script that does the requisite filtering.
-
JAA
Do you really need HTTPS proxy support though? Sounds like you can run your manipulating proxy locally anyway, and then TLS wouldn't matter.
-
qq44|m
maybe i don't need it, i don't use proxies frequently. how would the proxy decrypt https traffic?
-
qq44|m
my understanding was that I need an https proxy and a self-signed cert
-
JAA
The proxy needs to decrypt it anyway to be able to rewrite anything.
-
qq44|m
yes i know, im stuck on that part
-
qq44|m
can I do the decrypting with an http proxy, and if so how?
-
JAA
So there are two ways you can proxy stuff: CONNECT proxies simply establish a TCP connection and then tunnel the data between client and server; if the server uses TLS, the proxy can't intercept the data. The other method is absolute-URI requests like 'GET
https://example.org/ HTTP/1.1', where the proxy establishes the TLS connection itself and returns the response to the client.
-
JAA
If you want to do this with a proxy, you need the latter.
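A minimal sketch of that second style, assuming the grabber talks plain HTTP to a local proxy; the hostnames and rewrite table are hypothetical. The proxy makes the (possibly TLS) origin connection itself, so no MITM certificate is needed:

```python
# Sketch of an absolute-URI rewriting proxy: the client sends
# 'GET https://example.org/ HTTP/1.1' style requests, the proxy
# fetches the (rewritten) target itself. Rewrite table and
# hostnames are hypothetical.
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

REWRITES = {
    'https://foo.example/a.html': 'https://foo.example/3/a.html',
}


def rewrite(url):
    # Swap specific URLs; everything else passes through unchanged.
    return REWRITES.get(url, url)


class RewritingProxy(BaseHTTPRequestHandler):
    def do_GET(self):
        # In absolute-URI proxying, self.path is the full target URL.
        try:
            with urlopen(rewrite(self.path)) as resp:
                body = resp.read()
            self.send_response(resp.status)
            self.send_header('Content-Length', str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        except OSError:
            self.send_error(502)

# To run: HTTPServer(('127.0.0.1', 8080), RewritingProxy).serve_forever()
```

Since the grabber speaks plain HTTP to the proxy and only the proxy-to-origin leg uses TLS, this matches the point that a local rewriting proxy makes HTTPS proxy support in the grabber unnecessary.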
-
qq44|m
got it, how would I do that, any examples on the wiki?
-
JAA
No idea, and I don't think so, since, again, we essentially never use proxies here.
-
fireonlive
oh i could grep for archivebox too :)
-
fireonlive
gotta love gnu
-
JAA
fireonlive: I haven't verified it, but warcprox with an actual browser sounds like a decent option. It's from IA, so it should be fine WARC-spec-wise.
-
fireonlive
ahh ok
-
fireonlive
i have a less-technical friend who likes to shove things in the borg sometimes so i did like archivebox's simplicity but perhaps i can 'just do it for him' :D
-
fireonlive
but for sure IA seems to care about specs
-
fireonlive
*pins tab*
-
SketchCow
Hey JAA
-
fireonlive
:o
-
JAA
Hi SketchCow
-
fireonlive
Hi JAA and SketchCow :)
-
SketchCow
This one's a live one, isn't he.
-
SketchCow
Anyway, got some remnants sitting on FOS, wanted to run them by you.
-
SketchCow
I have a bundle of at-org wiki dumps. They go from 2020-01-31 to 2021-03-19 and then stop.
-
fireonlive
indeed :3
-
SketchCow
I see a script called UPLOAD_TO_INTERNETARCHIVE.sh, I'm going to run it, so they're going into
archive.org/details/archiveteam_wiki_backup
-
SketchCow
When I'm done with it, I'll pack away the scripts into a backup directory, but it's worth noting this, because I don't know why it stopped and hopefully it's being backed up elsewhere.
-
SketchCow
Next, I'm wrapping up the ARCHIVETEAM reception directory, where we used to rsync the warrior stuff through FOS. I see a couple jobs that never got a home, but the majority are just empty shells, meaning all the stuff got into the archive.
-
SketchCow
Only two seem to not have: BINTRAY and VAMPIREFREAKS.
-
arkiver
very nice
-
SketchCow
Oh look who's here. Another guy from the salt mines.
-
SketchCow
BINTRAY is 41gb and VAMPIREFREAKS is 931mb, so I get why they're stuck in the gullet. I never did quite get the hang of code where it had an end-game and then cleared out the pipes.
-
arkiver
bintray is WARCs right?
-
arkiver
what is vampirefreaks?
-
SketchCow
bintray is warcs. bintray-80d70a579661d712c6a3d26ae9d2f2cd6fa14097-20210501-094248.warc.gz and so on.
-
SketchCow
vampirefreaks is users. vampirefreaks-user_xXxRikaxXx-20200130-163129.warc.gz and so on.
-
arkiver
interesting
-
arkiver
what are the dates on vampirefreaks?
-
arkiver
ah 2020
-
arkiver
JAA: do you remember anything about that project?
-
JAA
Negative
-
SketchCow
Otherwise I'll just stick it in a big fatty fat WARC item in the archiveteam section.
-
arkiver
SketchCow: at least one item for each project please
-
SketchCow
Don't lose the plot on archiveteam_wiki_backup - that item will have archiveteam xml dumps but is going to stop at 2021-03-19
-
arkiver
instead of mixing
-
SketchCow
What do you think I am, you sprite
-
JAA
I assume those are the public dumps available through the web as well?
-
SketchCow
I assume so.
-
JAA
I've been grabbing those continuously for some time now.
-
SketchCow
We can always check.
-
SketchCow
Sounds to me like someone took it over and ended FOS.
-
JAA
(Reminds me that I should upload the current stash.)
-
SketchCow
Remember when we cut down FOS access because people were potentially going to cause trouble on an old and rusty pipeline
-
SketchCow
I'm going to make an item for vampirefreaks and one for Bintray and put them into archiveteam-fire.
-
SketchCow
That'll leave the most flexibility for later.
-
fireonlive
oh hey my own collection :p
-
JAA
Yup, those dumps seem to match archiveteam.org_wiki_dumps. :-)
-
SketchCow
So, with the addition of Room of Sorrow, there is an OUTSIDE chance I can get the archivebot pre-renderer working again.
-
SketchCow
Oh, I see, bintray was a slight fuckup.
-
SketchCow
OK, I can do the rest of this pretty well, I'll get bintray megawarc'd and up, and from all that, I'll let you know about the next this or that's when they show up.
-
SketchCow
Goal is for FOS to be 100% free of archiveteam un-uploaded data
-
SketchCow
And back up, in a clear place, the remnants of scripts, just to have for the records