-
JAA
Ah yes, I see it. Thanks!
-
JAA
> Upcoming Planned Outage for Site Updates
-
JAA
> On September 6th starting at 3PM Pacific, Kongregate.com will be unavailable for planned maintenance. Our engineers will be hard at work implementing back end updates in preparation for awesome site upgrades!
-
JAA
> The Kongregate forums will be inaccessible after the update. If you are still referencing or linking to information on the forums, please take steps to save the information elsewhere.
-
JAA
You'd think they'd post about it on the forums or have a permanent notice on the page rather than only displaying it on (some?) game pages. Seems on-brand for Kongregate though.
-
guest_5981
They already shut down the forums.
-
guest_5981
It's also on-brand for any company that thinks old forum software is obsolete compared to social media
-
JAA
I mean, the forums were allegedly made read-only in January 2022, but there are posts from a few days ago, so... :-)
-
JAA
I'll dig out my scripts from when I dumped the forums three years ago.
-
JAA
I doubt they changed anything about how it works, so I can probably just bump the maximum topic number and regrab everything.
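A regrab like that is just sequential ID enumeration. A minimal sketch of the idea (the bare `/forums/topics/<id>` pattern here is hypothetical and illustrative only; real Kongregate topic URLs include a forum ID and slug segment):

```python
# Hypothetical sketch of regrabbing by bumping the maximum topic ID.
# Real Kongregate topic URLs include a forum ID and slug; this bare
# pattern is illustrative only.
MAX_TOPIC_ID = 1956300  # overscan past the highest known topic


def topic_urls(max_id=MAX_TOPIC_ID):
    """Yield one candidate URL per sequential topic ID."""
    for topic_id in range(1, max_id + 1):
        yield f'https://www.kongregate.com/forums/topics/{topic_id}'
```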
-
JAA
lolwtf, the outage warning is done with a fucking cookie.
-
JAA
< set-cookie: kong_flash_messages=%7B%22messages%22%3A%5B%22%5Cu003cdiv+class%3D%5C%22media+pas%5C%22%5Cu003e+<snip>
-
JAA
A 2.2 kB cookie
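For reference, the cookie value is just URL-encoded JSON with HTML inside its strings. A quick decoding sketch; the value below is a shortened, hypothetical stand-in for the real ~2.2 kB one:

```python
import json
from urllib.parse import unquote_plus

# Shortened, hypothetical stand-in for the real ~2.2 kB
# kong_flash_messages cookie value (same encoding style).
raw = ("%7B%22messages%22%3A%5B%22%5Cu003cdiv+class%3D%5C%22media+pas"
       "%5C%22%5Cu003eUpcoming+outage%5Cu003c%2Fdiv%5Cu003e%22%5D%7D")

# unquote_plus undoes both %XX escapes and '+'-encoded spaces;
# the result is a JSON object with HTML inside its strings.
data = json.loads(unquote_plus(raw))
print(data["messages"][0])  # the HTML for the notice banner
```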
-

guest_5981
At least it wasn't printed, photographed then uploaded.
-
h2ibot
JustAnotherArchivist edited Kongregate (+875, Add forum removal, datetimeify):
wiki.archiveteam.org/?diff=50697&oldid=46612
-
fireonlive
wooooow lol
-
h2ibot
JustAnotherArchivist edited Deathwatch (+248, /* 2023 */ Add Kongregate forums):
wiki.archiveteam.org/?diff=50698&oldid=50623
-
JAA
In any case, the forums still work exactly the same as they did three years ago, as expected. Assuming they didn't introduce rate limiting, that should easily finish before the deadline. The 2020 run took just over 24 hours. This one might even be shorter since they purged a bunch of content at the time.
-
Rynav
Hello, i am interested in the status of the picosong comment archival process. is anyone still doing it or is it abandoned?
-
Rynav
yes
-
JAA
Ah right, I wanted to do that at the time, but Disqus is a minor pain, and I eventually forgot about it.
-
Rynav
sad, is there a chance you could try doing it again, maybe i can help you with that
-
JAA
It's still somewhere on my todo list, so I'd get to it eventually. Not sure when I'll have time for it though. And also not sure whether it'll work properly in the WBM on the picosong song pages. That part in particular is where the pain is involved IIRC.
-
JAA
It's funny how Disqus is now tracking comments for the WBM snapshots because people accessing those trigger requests to Disqus that don't contain the original picosong URL.
-
JAA
Yeah, I don't think the WBM playback can work.
-
JAA
Which doesn't mean we shouldn't still archive the comments, of course.
-
JAA
Kongregate ETA: 18 hours
-
fireonlive
-+rss- PSA: Don't base your business around Discord. 7yr account banned for posting ASNs: There's been a trend of startups basing their entire business on Discord, like Midjourney AI, and Discord themselves pushing for people to do so with their subscriptions system. Well, just a few days ago I found out that my account of 7 years was just banned
-
fireonlive
without a warning for a very obvious error on their part. Just hours [...]
news.ycombinator.com/item?id=37364605
-
fireonlive
on why AT doesn’t use discord
-
fireonlive
we pass around a lot of IP addresses :P
-
fireonlive
and scary things like, results of dns lookups!
-
fireonlive
on->yet another reason
-
JAA
Thanks, added to my list. :-)
-
fireonlive
:)
-
flashfire42|m
We are running out of items for telegram. If we didn’t have current space issues I’d say switch the project default to YouTube for an hour or 2
-
thuban
are we not unstashing the backlog because of said space issues?
-
flashfire42
I dunno, either way I think we need arkiver to do it
-
flashfire42
Ok, that's a little bit more telegram that won't even last like an hour
-
fireonlive
let the warriors rest :P
-
flashfire42|m
No
-
fireonlive
flashfire42|m: crack that whip then ;)
-
fireonlive
:D
-
flashfire42|m
THEY FEED. THE TELEGRAM WORKERS THEY FEED
-
JAA
Kongregate forums ETA: 5 hours
-
JAA
I've also started an AB job for it.
-
JAA
My qwarc crawl isn't grabbing images etc., only topic pages.
-
fireonlive
flashfire42|m: ”Please sir, I want some more”
-
JAA
Kongregate forums qwarc just finished a couple minutes ago. 4 topics couldn't be retrieved because of server-side bugs, everything else should be covered.
-
JAA
There are a bunch of funny topics that redirect to a slug URL but then return a 404:
kongregate.com/forums/503507-animat…topics/1956229-seige-voting-for-9-3
-
JAA
(That's the highest topic ID that exists in some form. I overscanned until 1956300.)
-
JAA
I got HTTP 200s from 479808 topic IDs.
-
qq44|m
anyone know of an archiving tool that can mirror websites and work with https proxies?
-
qq44|m
I need to redirect certain urls to other urls while the mirroring tool is running
-
qq44|m
trying to figure out the best way to do something like this
-
JAA
Proper archival would preserve the exact data sent by the origin server. That contradicts what you're trying to achieve here.
-
pokechu22
The way I'd approach that (not having actually done it) is to modify the tool's URL-following logic instead of using a proxy, so when the tool recurses over pages it replaces URLs as needed
-
TheTechRobo
If you do do that remember it can’t really go into WARC. If you do want to do WARC anyway you could do a dual-proxy setup with grabber -> MITM proxy -> warcprox (or something like it) -> target site
-
pokechu22
(but doesn't affect the saved page or similar)
-
TheTechRobo
Or yeah, modify the grabber itself.
-
JAA
Agreed on both ideas.
-
fireonlive
oh i meant to ask again because i lost my notes :( if not ArchiveBox (
archivebox.io) because webrecorder is there a better alternative for personal use?
-
qq44|m
JAA, TheTechRobo: yes i know, it would not be a proper archive in this scenario. I'm trying to retain a copy of the info on the site, not necessarily archive it
-
qq44|m
pokechu22: which tool do you recommend for something like this?
-
pokechu22
I imagine it would be possible with scripting (either lua with wget-at or python with wpull) but I haven't actually used either of them
-
qq44|m
I've tried wpull, but there doesn't seem to be a hook that I can use
-
qq44|m
that or I'm just not familiar enough with the hooks
-
JAA
It should be possible with wpull's hooks, but it depends a bit on the details of what you need to do.
-
JAA
I had to do weird trickery there before, collecting URLs in one hook and then queueing them from another, which always has the potential to break.
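A sketch of that collect-then-queue pattern, using callback names from wpull's 1.x hook scripting API; the versioned-URL rule and `foo.example` site are hypothetical:

```python
# Sketch of the pattern JAA describes for wpull hook scripts:
# collect URLs in one callback, hand them to wpull from another.
# Callback names follow wpull 1.x hook scripting; the "/3/" URL
# rewriting rule and site are hypothetical.
import re

pending = []   # URLs discovered in one hook, queued from the other
seen = set()


def accept_url(url_info, record_info, verdict, reasons):
    # Observe every candidate URL; for each plain page, note the
    # versioned variant we actually want (hypothetical rule).
    url = url_info['url']
    m = re.match(r'(https://foo\.example)/([^/]+\.html)$', url)
    if m:
        versioned = f'{m.group(1)}/3/{m.group(2)}'
        if versioned not in seen:
            seen.add(versioned)
            pending.append(versioned)
    return verdict  # don't override wpull's own decision


def get_urls(filename, url_info, document_info):
    # Queue everything collected so far. Shuttling state between
    # hooks like this is exactly the fragile part.
    urls = [{'url': u} for u in pending]
    pending.clear()
    return urls


try:  # register when running under wpull; harmless otherwise
    wpull_hook.callbacks.accept_url = accept_url  # noqa: F821
    wpull_hook.callbacks.get_urls = get_urls      # noqa: F821
except NameError:
    pass
```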
-
TheTechRobo
I suggest Wget-AT, it’s fairly usable and isn’t a nightmare to install on modern Python versions
-
JAA
Personally, I'd do it with qwarc, but only because I wrote it and know how it works. It's entirely undocumented and not necessarily entirely intuitive.
-
TheTechRobo
You kinda have to build it with docker though because of the dependency hell
-
JAA
You can just use the pre-built image though.
-
qq44|m
JAA: what i'm trying to do is a bit complicated. basically, before connecting to a page, I need to connect to a different page to find the URL i'm looking for, and then sort of spoof the whole warc entry to look like it's coming from the original url
-
qq44|m
so lets say I'm trying to grab a website, foo.com. I'm trying to get the page
foo.com/a.html. However, the page I actually need is
foo.com/3/a.html
-
qq44|m
however, to find 3/a.html, I need to fetch a different page first that lists all of the versions of a.html, and pick a specific one
-
JAA
qq44|m: Please don't spoof WARCs.
-
JAA
WARCs have a very specific purpose and meaning. They capture original HTTP traffic exactly as transmitted to/from the server.
-
qq44|m
JAA: i know i know, it's for personal use, not to actually upload to an archive
-
qq44|m
the warc is only temporary
-
qq44|m
i want to unpack all of the data at the end into files on disk
-
JAA
Why not write files to disk directly? :-)
-
qq44|m
im using warc2html to unpack the files
-
qq44|m
it rewrites all of the links for me
-
qq44|m
so that they can be viewed on disk
-
JAA
So similar to wget's --convert-links?
-
qq44|m
yes exactly
-
qq44|m
can wget-at include page requisites when mirroring a site?
-
qq44|m
wget doesn't seem to include 3rd party page requisites when mirroring
-
JAA
Should be possible via Lua hooks.
-
JAA
wpull has the neat direct option for it (and also has --convert-links).
-
qq44|m
in that case I wouldn't need the warc, but I still can't figure out how to fetch the correct urls im looking for
-
qq44|m
my thinking was have the proxy do all that logic and leave the grabber unaware of it so that I don't have to modify the grabber
-
qq44|m
wpull works with http proxies, but not https
-
qq44|m
if wget-at works with https proxies, and can save the 3rd party page requisites, then I think I have this problem mostly solved
-
JAA
We never use proxies here, so it's not well-tested, but I think wget should support HTTPS proxies, yeah.
-
qq44|m
but does it also support the 3rd party page requisites?
-
qq44|m
a while ago i tested wget with https proxy and believe it was working, so should be good there
-
qq44|m
for a single page wget saves the 3rd party page requisites
-
qq44|m
but with the mirror arg it doesn't for some reason
-
JAA
You'd probably have to use wget-at with a Lua script that does the requisite filtering.
-
JAA
Do you really need HTTPS proxy support though? Sounds like you can run your manipulating proxy locally anyway, and then TLS wouldn't matter.
-
qq44|m
maybe i don't need it, i don't use proxies frequently. how would the proxy decrypt https traffic?
-
qq44|m
my understanding was that I need an https proxy and a self-signed cert
-
JAA
The proxy needs to decrypt it anyway to be able to rewrite anything.
-
qq44|m
yes i know, im stuck on that part
-
qq44|m
can I do the decrypting with an http proxy, and if so how?
-
JAA
So there are two ways you can proxy stuff: CONNECT proxies simply establish a TCP connection and then tunnel the data between client and server; if the server uses TLS, the proxy can't intercept the data. The other method is absolute-URI requests like 'GET
https://example.org/ HTTP/1.1', where the proxy establishes the TLS connection itself and returns the response to the client.
-
JAA
If you want to do this with a proxy, you need the latter.
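A minimal sketch of that second style, assuming the grabber talks plain HTTP to a local proxy; the hostnames and rewrite table are hypothetical. The proxy makes the (possibly TLS) origin connection itself, so no MITM certificate is needed:

```python
# Sketch of an absolute-URI rewriting proxy: the client sends
# 'GET https://example.org/ HTTP/1.1' style requests, the proxy
# fetches the (rewritten) target itself. Rewrite table and
# hostnames are hypothetical.
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

REWRITES = {
    'https://foo.example/a.html': 'https://foo.example/3/a.html',
}


def rewrite(url):
    # Swap specific URLs; everything else passes through unchanged.
    return REWRITES.get(url, url)


class RewritingProxy(BaseHTTPRequestHandler):
    def do_GET(self):
        # In absolute-URI proxying, self.path is the full target URL.
        try:
            with urlopen(rewrite(self.path)) as resp:
                body = resp.read()
            self.send_response(resp.status)
            self.send_header('Content-Length', str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        except OSError:
            self.send_error(502)

# To run: HTTPServer(('127.0.0.1', 8080), RewritingProxy).serve_forever()
```

Since the grabber speaks plain HTTP to the proxy and only the proxy-to-origin leg uses TLS, this matches the point that a local rewriting proxy makes HTTPS proxy support in the grabber unnecessary.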
-
qq44|m
got it, how would I do that, any examples on the wiki?
-
JAA
No idea, and I don't think so, since, again, we essentially never use proxies here.
-
fireonlive
oh i could grep for archivebox too :)
-
fireonlive
gotta love gnu
-
JAA
fireonlive: I haven't verified it, but warcprox with an actual browser sounds like a decent option. It's from IA, so it should be fine WARC-spec-wise.
-
fireonlive
ahh ok
-
fireonlive
i have a less-technical friend who likes to shove things in the borg sometimes so i did like archivebox's simplicity but perhaps i can 'just do it for him' :D
-
fireonlive
but for sure IA seems to care about specs
-
fireonlive
*pins tab*
-
SketchCow
Hey JAA
-
fireonlive
:o
-
JAA
Hi SketchCow
-
fireonlive
Hi JAA and SketchCow :)
-
SketchCow
This one's a live one, isn't he.
-
SketchCow
Anyway, got some remnants sitting on FOS, wanted to run them by you.
-
SketchCow
I have a bundle of at-org wiki dumps. They go from 2020-01-31 to 2021-03-19 and then stop.
-
fireonlive
indeed :3
-
SketchCow
I see a script called UPLOAD_TO_INTERNETARCHIVE.sh, I'm going to run it, so they're going into
archive.org/details/archiveteam_wiki_backup
-
SketchCow
When I'm done with it, I'll pack away the scripts into a backup directory, but it's worth noting this, because I don't know why it stopped and hopefully it's being backed up elsewhere.
-
SketchCow
Next, I'm wrapping up the ARCHIVETEAM reception directory, where we used to rsync the warrior stuff through FOS. I see a couple jobs that never got a home, but the majority are just empty shells, meaning all the stuff got into the archive.
-
SketchCow
Only two seem to not have: BINTRAY and VAMPIREFREAKS.
-
arkiver
very nice
-
SketchCow
Oh look who's here. Another guy from the salt mines.
-
SketchCow
BINTRAY is 41gb and VAMPIREFREAKS is 931mb, so I get why they're stuck in the gullet. I never did quite get the hang of code where it had an end-game and then cleared out the pipes.
-
arkiver
bintray is WARCs right?
-
arkiver
what is vampirefreaks?
-
SketchCow
bintray is warcs. bintray-80d70a579661d712c6a3d26ae9d2f2cd6fa14097-20210501-094248.warc.gz and so on.
-
SketchCow
vampirefreaks is users. vampirefreaks-user_xXxRikaxXx-20200130-163129.warc.gz and so on.
-
arkiver
interesting
-
arkiver
what are the dates on vampirefreaks?
-
arkiver
ah 2020
-
arkiver
JAA: do you remember anything about that project?
-
JAA
Negative
-
SketchCow
Otherwise I'll just stick it in a big fatty fat WARC item in the archiveteam section.
-
arkiver
SketchCow: at least one item for each project please
-
SketchCow
Don't lose the plot on archiveteam_wiki_backup - that item will have archiveteam xml dumps but is going to stop at 2021-03-19
-
arkiver
instead of mixing
-
SketchCow
What do you think I am, you sprite
-
JAA
I assume those are the public dumps available through the web as well?
-
SketchCow
I assume so.
-
JAA
I've been grabbing those continuously for some time now.
-
SketchCow
We can always check.
-
SketchCow
Sounds to me like someone took it over and ended FOS.
-
JAA
(Reminds me that I should upload the current stash.)
-
SketchCow
Remember when we cut down FOS access because people were potentially going to cause trouble on an old and rusty pipeline
-
SketchCow
I'm going to make an item for vampirefreaks and one for Bintray and put them into archiveteam-fire.
-
SketchCow
That'll leave the most flexibility for later.
-
fireonlive
oh hey my own collection :p
-
JAA
Yup, those dumps seem to match archiveteam.org_wiki_dumps. :-)
-
SketchCow
So, with the addition of Room of Sorrow, there is an OUTSIDE chance I can get the archivebot pre-renderer working again.
-
SketchCow
Oh, I see, bintray was a slight fuckup.
-
SketchCow
OK, I can do the rest of this pretty well, I'll get bintray megawarc'd and up, and from all that, I'll let you know about the next this or that's when they show up.
-
SketchCow
Goal is for FOS to be 100% free of archiveteam un-uploaded data
-
SketchCow
And back up, in a clear place, the remnants of scripts, just to have for the records