00:00:33 <JAA> Ah yes, I see it. Thanks!
00:00:50 <JAA> > Upcoming Planned Outage for Site Updates
00:00:50 <JAA> > On September 6th starting at 3PM Pacific, Kongregate.com will be unavailable for planned maintenance. Our engineers will be hard at work implementing back end updates in preparation for awesome site upgrades!
00:00:54 <JAA> > The Kongregate forums will be inaccessible after the update. If you are still referencing or linking to information on the forums, please take steps to save the information elsewhere.
00:02:42 <JAA> You'd think they'd post about it on the forums or have a permanent notice on the page rather than only displaying it on (some?) game pages. Seems on-brand for Kongregate though.
00:03:21 <guest_5981> They already shut down the forums.
00:06:33 <guest_5981> It's also on-brand for any company that thinks old forum software is obsolete compared to social media
00:07:44 <JAA> I mean, the forums were allegedly made read-only in January 2022, but there are posts from a few days ago, so... :-)
00:11:24 <JAA> I'll dig out my scripts from when I dumped the forums three years ago.
00:12:03 <JAA> I doubt they changed anything about how it works, so I can probably just bump the maximum topic number and regrab everything.
00:34:15 <JAA> lolwtf, the outage warning is done with a fucking cookie.
00:35:02 <JAA> < set-cookie: kong_flash_messages=%7B%22messages%22%3A%5B%22%5Cu003cdiv+class%3D%5C%22media+pas%5C%22%5Cu003e+<snip>
00:35:24 <JAA> A 2.2 kB cookie
00:37:15 <JAA> Here it is in its full glory: https://transfer.archivete.am/inline/3tV0B/kongregate-cookie
00:39:26 <guest_5981> At least it wasn't printed, photographed then uploaded.
00:39:35 <h2ibot> JustAnotherArchivist edited Kongregate (+875, Add forum removal, datetimeify): https://wiki.archiveteam.org/?diff=50697&oldid=46612
00:40:30 <fireonlive> wooooow lol
00:41:35 <h2ibot> JustAnotherArchivist edited Deathwatch (+248, /* 2023 */ Add Kongregate forums): https://wiki.archiveteam.org/?diff=50698&oldid=50623
00:43:32 <JAA> In any case, the forums still work exactly the same as they did three years ago, as expected. Assuming they didn't introduce rate limiting, that should easily finish before the deadline. The 2020 run took just over 24 hours. This one might even be shorter since they purged a bunch of content at the time.
00:57:59 <Rynav> Hello i am interested about the comment status of the picosong archivisation process. is anyone still doing it or is it abandoned?
01:00:29 <fireonlive> for reference: https://wiki.archiveteam.org/index.php/Picosong
01:00:41 <Rynav> yes
01:01:04 <JAA> Ah right, I wanted to do that at the time, but Disqus is a minor pain, and I eventually forgot about it.
01:01:47 <Rynav> sad, is there a change you could try doing it again, maybe i can help you with that
01:02:03 <Rynav> chance*
01:05:03 <JAA> It's still somewhere on my todo list, so I'd get to it eventually. Not sure when I'll have time for it though. And also not sure whether it'll work properly in the WBM on the picosong song pages. That part in particular is where the pain is involved IIRC.
01:06:50 <JAA> It's funny how Disqus is now tracking comments for the WBM snapshots because people accessing those trigger requests to Disqus that don't contain the original picosong URL.
01:11:21 <JAA> Yeah, I don't think the WBM playback can work.
01:11:34 <JAA> Which doesn't mean we shouldn't still archive the comments, of course.
01:14:35 <JAA> Kongregate ETA: 18 hours
01:44:21 <fireonlive> -+rss- PSA: Don't base your business around Discord.7yr account banned for posting ASNs: There's been a trend of startups basing their entire business on Discord like Midjourney AI, and Discord themselves pushing for people to do so with their subscriptions system.Well just a few days ago I found out that my account of 7 years was just banned
01:44:21 <fireonlive> without a warning for a very obvious error on their part. Just hours [...] https://news.ycombinator.com/item?id=37364605
01:44:31 <fireonlive> on why AT doesn’t use discord
01:44:43 <fireonlive> we pass around a lot of IP addresses :P
01:44:51 <fireonlive> and scary things like, results of dns lookups!
01:47:32 <fireonlive> on->yet another reason
01:50:49 <JAA> Thanks, added to my list. :-)
01:52:55 <fireonlive> :)
02:49:52 <flashfire42|m> We are running out of items for telegram. If we didn’t have current space issues I’d say switch the project default to YouTube for an hour or 2
02:52:36 <thuban> are we not unstashing the backlog because of said space issues?
03:01:59 <flashfire42> I dunno either way I think we need arkiver to do it
03:14:08 <flashfire42> Ok thats a little bit more telegram that wont even last like an hour
03:16:46 <fireonlive> let the warriors rest :P
03:28:16 <flashfire42|m> No
06:56:30 <fireonlive> flashfire42|m: crack that whip then ;)
06:56:34 <fireonlive> :D
07:48:18 <flashfire42|m> THEY FEED. THE TELEGRAM WORKERS THEY FEED
14:39:01 <JAA> Kongregate forums ETA: 5 hours
14:39:06 <JAA> I've also started an AB job for it.
14:39:49 <JAA> My qwarc crawl isn't grabbing images etc., only topic pages.
18:59:31 <fireonlive> flashfire42|m: ”Please sir, I want some more”
19:10:28 <JAA> Kongregate forums qwarc just finished a couple minutes ago. 4 topics couldn't be retrieved because of server-side bugs, everything else should be covered.
19:11:17 <JAA> The 4 broken ones: https://www.kongregate.com/forums/-/topics/425775 https://www.kongregate.com/forums/-/topics/425777 https://www.kongregate.com/forums/-/topics/425779 https://www.kongregate.com/forums/-/topics/1820620
19:12:17 <JAA> There are a bunch of funny topics that redirect to a slug URL but then return a 404: https://www.kongregate.com/forums/503507-animation-throwdown-the-quest-for-cards-chaotic-harmony/topics/1956229-seige-voting-for-9-3
19:13:04 <JAA> (That's the highest topic ID that exists in some form. I overscanned until 1956300.)
19:14:12 <JAA> I got HTTP 200s from 479808 topic IDs.
20:37:26 <qq44|m> anyone know of an archiving tool that can mirror websites and work with https proxies?
20:39:22 <qq44|m> I need to redirect certain urls to other urls while the mirroring tool is running
20:39:34 <qq44|m> trying to figure out the best way to do something like this
20:42:48 <JAA> Proper archival would preserve the exact data sent by the origin server. That contradicts what you're trying to achieve here.
20:42:53 <pokechu22> The way I'd approach that (not having actually done it) is to modify the tools URL-following logic instead of using a proxy, so when the tool recurses over pages it replaces URLs as needed
20:42:55 <TheTechRobo> If you do do that remember it can’t really go into WARC. If you do want to do WARC anyway you could do a dual-proxy setup with grabber -> MITM proxy -> warcprox (or something like it) -> target site
20:43:10 <pokechu22> (but doesn't affect the saved page or similar)
20:43:17 <TheTechRobo> Or yeah, modify the grabber itself.
20:43:34 <JAA> Agreed on both ideas.
20:50:20 <fireonlive> oh i meant to ask again because i lost my notes :( if not ArchiveBox (https://archivebox.io) because webrecorder is there a better alternative for personal use?
20:54:13 <qq44|m> JAA, TheTechRobo: yes i know, it would not be a proper archive in this scenario. I'm trying to retain a copy of the info on the site, not necessarily archive it
20:54:48 <qq44|m> pokechu22: which tool do you recommend for something like this?
20:58:26 <pokechu22> I imagine it would be possible with scripting (either lua with wget-at or python with wpull) but I haven't actually used either of them
21:01:43 <qq44|m> I've tried wpull, but there doesn't seem to be a hook that I can use
21:01:55 <qq44|m> that or I'm just not familiar enough with the hooks
21:04:47 <JAA> It should be possible with wpull's hooks, but it depends a bit on the details of what you need to do.
21:05:29 <JAA> I had to do weird trickery there before, collecting URLs in one hook and then queueing them from another, which always has the potential to break.
21:05:57 <TheTechRobo> I suggest Wget-AT, it’s fairly usable and isn’t a nightmare to install on modern Python versions
21:05:59 <JAA> Personally, I'd do it with qwarc, but only because I wrote it and know how it works. It's entirely undocumented and not necessarily entirely intuitive.
21:06:11 <TheTechRobo> You kinda have to build it with docker though because of the dependency hell
21:06:24 <JAA> You can just use the pre-built image though.
21:11:10 <qq44|m> JAA: what im trying to do is a bit complicated. basically, before connecting to a page, I need to connect to a different page to find the URL i'm looking for, and then sort of spoof the whole warc entry to look like its coming from the original url
21:12:09 <qq44|m> so lets say I'm trying to grab a website, foo.com. I'm trying to get the page https://foo.com/a.html. However, the page I actually need is https://foo.com/3/a.html
21:12:59 <qq44|m> however, to find 3/a.html, I need to fetch a different page first that lists all of the versions of a.html, and pick a specific one
21:13:23 <JAA> qq44|m: Please don't spoof WARCs.
21:13:46 <JAA> WARCs have a very specific purpose and meaning. They capture original HTTP traffic exactly as transmitted to/from the server.
21:13:47 <qq44|m> JAA: i know i know, it's for personal use, not to actually upload to an archive
21:13:56 <qq44|m> the warc is only temporary
21:14:13 <qq44|m> i want to unpack all of the data at the end into files on disk
21:14:34 <JAA> Why not write files to disk directly? :-)
21:15:13 <qq44|m> im using warc2html to unpack the files
21:15:20 <qq44|m> it rewrites all of the links for me
21:15:42 <qq44|m> so that they can be viewed on disk
21:17:57 <JAA> So similar to wget's --convert-links?
21:21:24 <qq44|m> yes exactly
21:21:42 <qq44|m> can wget-at include page requisites when mirroring a site?
21:21:59 <qq44|m> wget doesn't seem to include 3rd party page requisties when mirroring
21:23:40 <JAA> Should be possible via Lua hooks.
21:24:17 <JAA> wpull has the neat direct option for it (and also has --convert-links).
21:26:11 <qq44|m> in that case I wouldn't need the warc, but I still can't figure out how to fetch the correct urls im looking for
21:26:44 <qq44|m> my thinking was have the proxy do all that logic and leave the grabber unaware of it so that I don't have to modify the grabber
21:27:05 <qq44|m> wpull works with http proxies, but not https
21:27:40 <qq44|m> if wget-at works with https proxies, and can save the 3rd party page requistes, then I think I have this problem mostly solved
21:29:21 <JAA> We never use proxies here, so it's not well-tested, but I think wget should support HTTPS proxies, yeah.
21:30:39 <qq44|m> but does it also support the 3rd party page requistes?
21:30:54 <qq44|m> a while ago i tested wget with https proxy and believe it was working, so should be good there
21:31:19 <qq44|m> for a single page wget saves the 3rd party page requisites
21:31:29 <qq44|m> but with the mirror arg it doesnt for some reason
21:31:57 <JAA> You'd probably have to use wget-at with a Lua script that does the requisite filtering.
21:32:44 <JAA> Do you really need HTTPS proxy support though? Sounds like you can run your manipulating proxy locally anyway, and then TLS wouldn't matter.
21:33:37 <qq44|m> maybe i dont need it, i dont use proxies frequently. how would the proxy decrypt https traffic?
21:34:15 <qq44|m> my understanding was that I need an https proxy and a self-signed cert
21:34:30 <JAA> The proxy needs to decrypt it anyway to be able to rewrite anything.
21:35:34 <qq44|m> yes i know, im stuck on that part
21:38:25 <qq44|m> can I do the decrypting with an http proxy, and if so how?
21:39:10 <JAA> So there are two ways you can proxy stuff: CONNECT proxies simply establish a TCP connection and then tunnel the data between client and server; if the server uses TLS, it can't intercept the data. The other method is using 'GET https://example.org/ HTTP/1.1' requests, where the proxy establishes the TLS connection itself and returns the response to the client.
21:39:28 <JAA> If you want to do this with a proxy, you need the latter.
21:40:43 <qq44|m> got it, how would I do that, any examples on the wiki?
21:43:16 <JAA> No idea, and don't think so, since again, we essentially never use proxies here.
21:49:04 <fireonlive> oh i could grep for archivebox too :)
21:49:11 <fireonlive> gotta love gnu
21:57:54 <JAA> fireonlive: I haven't verified it, but warcprox with an actual browser sounds like a decent option. It's from IA, so it should be fine WARC-spec-wise.
21:58:53 <fireonlive> ahh ok
21:59:18 <fireonlive> i have a less-technical friend who likes to shove things in the borg sometimes so i did like archivebox's simplicity but perhaps i can 'just do it for him'  :D
21:59:28 <fireonlive> but for sure IA seems to care about specs
22:00:20 <fireonlive> *pins tab*
22:00:31 <SketchCow> Hey JAA
22:00:40 <fireonlive> :o
22:03:55 <JAA> Hi SketchCow
22:04:09 <fireonlive> Hi JAA and SketchCow :)
22:18:20 <SketchCow> This one's a live one, isn't he.
22:18:31 <SketchCow> Anyway, got some remnants sitting on FOS, wanted to run them by you.
22:19:05 <SketchCow> I have a bundle of at-org wiki dumps. They go from 2020-01-31 to 2021-03-19 and then stop.
22:19:13 <fireonlive> indeed :3
22:19:57 <SketchCow> I see a script called UPLOAD_TO_INTERNETARCHIVE.sh, I'm going to run it, so they're going into https://archive.org/details/archiveteam_wiki_backup
22:21:21 <SketchCow> When I'm done with it, I'll pack away the scripts into a backup directory, but it's worth nothing this, because I don't know why it stopped and hopefully it's being backed up elsewhere.
22:23:30 <SketchCow> Next, I'm wrapping up the ARCHIVETEAM reception directory, where we used to rsync the warrior stuff through FOS. I see a couple jobs that never got a home, but the majority are just empty shells, meaning all the stuff got into the archive.
22:23:57 <SketchCow> Only two seem to have not have: BINTRAY and VAMPIREFREAKS.
22:24:21 <arkiver> very nice
22:25:03 <SketchCow> Oh look who's here. Another guy from the salt mines.
22:25:51 <SketchCow> BINTRAY is 41gb and VAMPIREFREAKS is 931mb, so I get why they're stuck in the gullet. I never did quite get the hang of code where it had an end-game and then cleared out the pipes.
22:26:08 <arkiver> bintray is WARCs right?
22:26:12 <arkiver> what is vampirefreaks/
22:26:13 <arkiver> ?
22:26:30 <SketchCow> bintray is warcs. bintray-80d70a579661d712c6a3d26ae9d2f2cd6fa14097-20210501-094248.warc.gz and so on.
22:27:53 <SketchCow> vampirefreaks is users. vampirefreaks-user_xXxRikaxXx-20200130-163129.warc.gz and so on.
22:28:02 <arkiver> interesting
22:28:08 <arkiver> what are the dates on vampirefreaks?
22:28:16 <arkiver> ah 2020
22:28:22 <arkiver> JAA: do you remember anything about that project?
22:28:35 <JAA> Negative
22:28:38 <SketchCow> Otherwise I'll just stick it in a big fatty fat WARC item in the archiveteam section.
22:28:57 <arkiver> guess i did that project https://github.com/ArchiveTeam/vampirefreaks-grab
22:29:14 <arkiver> SketchCow: at last one item for each project please
22:29:16 <SketchCow> Don't lose the plot on archiveteam_wiki_backup - that item will have archiveteam xml dumps but is going to stop at 2021-03-19
22:29:19 <arkiver> instead of mixing
22:29:29 <SketchCow> What do you think I am, you sprite
22:29:38 <JAA> I assume those are the public dumps available through the web as well?
22:29:46 <SketchCow> I assume so.
22:29:48 <JAA> I've been grabbing those continuously for some time now.
22:29:50 <SketchCow> We can always check.
22:30:09 <fireonlive> someone ran a maual backup at least: https://archive.org/details/wiki-wikiarchiveteamorg
22:30:29 <JAA> https://archive.org/details/archiveteam.org_wiki_dumps and https://archive.org/details/archiveteam.org_wiki_dumps_2021 through 2023
22:30:45 <SketchCow> Sounds to me like someone took it over and ended FOS.
22:30:50 <JAA> (Reminds me that I should upload the current stash.)
22:31:04 <SketchCow> Remember when we cut down FOS access because people were potentially going to cause trouble on an old and rusty pipeline
22:31:57 <SketchCow> I'm going to make an item for vampirefreaks and one for Bintray and put them into archiveteam-fire.
22:32:15 <SketchCow> That'll leave the most flexibility for later.
22:32:22 <fireonlive> oh hey my own collection :p
22:37:23 <JAA> Yup, those dumps seem to match archiveteam.org_wiki_dumps. :-)
22:37:50 <SketchCow> https://archive.org/details/archiveteam_vampirefreaks_users will get the vampirefreaks.
22:38:18 <SketchCow> So, with the addition of Room of Sorrow, there is an OUTSIDE chance I can get the archivebot pre-renderer working again.
22:45:25 <SketchCow> Oh, I see, bintray was a slight fuckup.
23:28:03 <SketchCow> OK, I can do the rest of this pretty well, I'll get bintray megawarc'd and up, and from all that, I'll let you know about the next this or that's when they show up.
23:28:19 <SketchCow> Goal is for FOS to be 100% free of archiveteam un-uploaded data
23:29:10 <SketchCow> And back up, in a clear place, the remnants of scripts, just to have for the records