-
hook54321
ah, if it was a private beta then that's fair
-
pabs
Barto JAA - re mastodon JS, a wrapper around zygolophodon can do small parts of it (but not a whole site I think)
github.com/jwilk/zygolophodon paste.debian.net/hidden/622ccf51
-
JAA
pabs: Embeds are beginning to require JS as well.
-
pabs
hmm, got an example?
-
JAA
mastodon.social
-
JAA
Yes, they're probably not running the bleeding edge.
-
JAA
The PR was only merged 6 days ago and isn't in a release yet. But mastodon.social runs it already, it seems.
-
JAA
It looks like there'll be a new release soon, and then it'll spread to most instances quickly.
-
pabs
crap
-
pabs
hmm, zygolophodon does still work with mastodon.social. maybe I can modify it to output API URLs instead
-
JAA
Rewriting the URLs should be trivial.
-
pabs
aha, it has --debug-http already
-
pabs
does 2 requests for individual posts: /api/v1/statuses/113082066860765988 /api/v1/statuses/113082066860765988/context
-
pabs
and 3 for users: /api/v1/accounts/lookup?acct=mozilla /api/v1/accounts/110306602663312748/statuses?pinned=true /api/v1/accounts/110306602663312748/statuses?exclude_replies=true&limit=40
-
pabs
(plus pagination I guess)
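-
A rough Python sketch of the URL rewriting discussed above. The API paths are the ones from the --debug-http output just quoted; the function name and regexes are illustrative only, and a profile URL still needs the lookup response to learn the numeric account id used by the follow-up /statuses requests.

    import re

    def api_urls(web_url):
        # Individual post, e.g. https://mastodon.social/@user/113082066860765988
        m = re.match(r"https://([^/]+)/@[^/]+/(\d+)$", web_url)
        if m:
            host, status_id = m.groups()
            return [
                f"https://{host}/api/v1/statuses/{status_id}",
                f"https://{host}/api/v1/statuses/{status_id}/context",
            ]
        # Profile, e.g. https://mastodon.social/@mozilla -- the lookup
        # response carries the numeric account id, which the two
        # /statuses requests (pinned=true, exclude_replies=true&limit=40,
        # plus pagination) then use.
        m = re.match(r"https://([^/]+)/@([^/@]+)$", web_url)
        if m:
            host, acct = m.groups()
            return [f"https://{host}/api/v1/accounts/lookup?acct={acct}"]
        return []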
-
monoxane
I wonder if it would be worth writing a thing that scrapes the raw ActivityPub APIs instead of trying to go through the JS UI
-
monoxane
it would violate the "preserve original content" rules though
-
magmaus3
monoxane: one potential problem is that some instances require authorized fetches, which would require the scraper to have an instance. (btw, that also means that it would be possible to prevent scraping, which is both a good and a bad thing)
-
HiccupJul
Would it be okay to make a "List of websites not captured correctly by the Wayback Machine" page on the wiki, like the exclusions page? Don't have many examples right now, but there are a few. Although I guess it can be somewhat worked around by using archivebot.
-
HiccupJul
The one I was thinking of was this:
electricsheep.co.jp/blog.php?id=431
-
arkiver
HiccupJul: that sounds nice, i'm guessing final call would be with JAA ^
-
HiccupJul
wiki.archiveteam.org/index.php/How_to_use_our_wiki this says to be bold but yeah I was wondering about his opinion
-
HiccupJul
webpage is the blog of the Gimmick! (famous NES game) developer, has behind the scenes info and such. blog pages only load if you navigate to the main page first.
-
HiccupJul
doesn't work through Save Page Now at the very least
-
HiccupJul
i'm asking on #archivebot if someone can try it through archivebot
-
h2ibot
MihaiArchive1 edited WikiTeam (+3, /* Wiki dumps */):
wiki.archiveteam.org/?diff=53486&oldid=53483
-
h2ibot
MihaiArchive1 edited Wikimedia Commons (+57):
wiki.archiveteam.org/?diff=53487&oldid=49964
-
h2ibot
Awauwa edited Deathwatch (+198, added mozilla.social):
wiki.archiveteam.org/?diff=53488&oldid=53463
-
JAA
HiccupJul_: How would you define 'correctly'?
-
HiccupJul_
good question
-
HiccupJul_
but ones that don't have any of the content, like in this case, should probably be recorded
-
JAA
The content not being displayed doesn't necessarily mean it wasn't captured though.
-
JAA
I know there are sites that can be captured, all the relevant data is captured, but then something breaks on playback. If you know the API URL, you can still get the content back.
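-
As a concrete illustration of getting captured API data back out despite broken playback: the WBM's "id_" URL form returns the raw capture without any rewriting, and a partial timestamp redirects to the closest capture. A minimal Python sketch, reusing the status id quoted earlier; the helper name is made up.

    import json
    import urllib.request

    def fetch_archived(api_url, timestamp="2024"):
        # "id_" = unmodified capture; the partial timestamp redirects
        # to whichever capture is closest to it.
        wbm = f"https://web.archive.org/web/{timestamp}id_/{api_url}"
        with urllib.request.urlopen(wbm) as resp:
            return json.load(resp)

    status = fetch_archived(
        "https://mastodon.social/api/v1/statuses/113082066860765988")
    print(status["content"])  # the post body, as HTML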
-
JAA
The SPN 'just' does a MITM proxy to capture the network traffic. The WBM dynamically rewrites things, which sometimes breaks due to how the target site's JS is written.
-
HiccupJul_
huh
-
HiccupJul_
how can i check that for myself?
-
JAA
There's no generic way. It depends on the individual site.
-
JAA
You might be able to see something in the SPN output when using the submission form (rather than /save/URL).
-
JAA
I see that electricsheep.co.jp/blog.php?id=431 returns a message about requiring cookies, so that's different, I guess.
-
HiccupJul_
yeah i think it's a server-side thing
-
HiccupJul_
ah i thought you meant the wayback machine api
-
JAA
POST requests frequently break, but the failure mode varies. For example, it might only generate one capture per hour, and the playback then doesn't load the correct data.
-
JAA
Ah, sorry, no, I mean the target site's.
-
HiccupJul_
yeah looking in chrome devtools network log, loading the page in incognito, i don't see the page content
-
HiccupJul_
so i think it is a server-side check of some kind
-
JAA
Yeah
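-
If the cookie message is the whole story, the failure could be reproduced (and worked around outside SPN) with a session that visits the front page first. This is purely a guess at the mechanism, sketched in Python:

    import http.cookiejar
    import urllib.request

    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))
    # Hit the front page first so the server can set its session cookie...
    opener.open("https://electricsheep.co.jp/")
    # ...then the blog page should come back with its content.
    page = opener.open("https://electricsheep.co.jp/blog.php?id=431").read()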
-
HiccupJul_
maybe the wiki page should just list things like that which save page now can't handle, e.g. navigating to home page first. bit of an obscure requirement though
-
JAA
Yeah, I feel like there are too many different failure modes here to document them in a sensible manner. Maybe a list of those failure modes could be useful though.
-
JAA
And then we can add a couple examples to each failure mode.
-
HiccupJul_
side question: is there a way to view the metadata of IA items (like archive.org/metadata/whatever) after the item is taken down?
-
arkiver
HiccupJul_: no
-
HiccupJul_
ah, didn't think so. do you know if there's any third party backup of that metadata being made?
-
arkiver
i dont think so
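-
So any backup of item metadata would have to be made while the item is still up. A minimal sketch against the /metadata endpoint mentioned above; the identifier is the placeholder from the question.

    import json
    import urllib.request

    identifier = "whatever"  # placeholder IA item identifier
    with urllib.request.urlopen(
            f"https://archive.org/metadata/{identifier}") as resp:
        meta = json.load(resp)
    with open(f"{identifier}.metadata.json", "w") as f:
        json.dump(meta, f, indent=2)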
-
nulldata
monoxane - Grabbing the API results via AB wouldn't violate a "preserve original content" rule. It's not ideal and wouldn't be easy to browse, but it's not making up or modifying content
-
JAA
(Faking HTML pages using the API data would however be bad.)
-
magmaus3
JAA: I'm assuming that adding additional JS to make the contents readable would still violate the rule, right?
-
JAA
Naturally
-
JAA
Any modification at all does.
-
JAA
However, you could have an external page that fetches the API response from the WBM and renders it however you like.
-
steering
^ and then capture that in the WBM! :P
-
JAA
Why yes, I've done that before (due to CORS). :-D
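-
The external-viewer idea can also be done entirely outside a browser, which sidesteps CORS. A hedged sketch: find the newest capture of the API URL via the CDX API, fetch it raw, and write a standalone HTML file (the file name and page layout are invented):

    import html
    import json
    import urllib.parse
    import urllib.request

    API_URL = "https://mastodon.social/api/v1/statuses/113082066860765988"

    # CDX rows: urlkey, timestamp, original, mimetype, statuscode, digest, length
    cdx = ("https://web.archive.org/cdx/search/cdx?output=json&url="
           + urllib.parse.quote(API_URL, safe=""))
    rows = json.load(urllib.request.urlopen(cdx))
    timestamp = rows[-1][1]  # rows[0] is the header; last row = newest capture

    raw = urllib.request.urlopen(
        f"https://web.archive.org/web/{timestamp}id_/{API_URL}")
    status = json.load(raw)

    # status["content"] is already HTML in the Mastodon API.
    doc = (f"<!doctype html><title>@{html.escape(status['account']['acct'])}"
           f"</title>\n<article>{status['content']}</article>\n")
    with open("viewer.html", "w", encoding="utf-8") as f:
        f.write(doc)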
-
TheTechRobo
magmaus3: Depends on whether it's in the WARC or not. If you're modifying the WARC record, not allowed. But the Wayback Machine adding special code to fix the page would be fine.
-
JAA
Right, yes, but we have no influence over that.
-
magmaus3
TheTechRobo: good to know :3