-
hook54321
ah, if it was a private beta then that's fair
-
pabs
Barto JAA - re mastodon JS, a wrapper around zygolophodon can do small parts of it (but not a whole site I think)
github.com/jwilk/zygolophodon paste.debian.net/hidden/622ccf51
-
JAA
pabs: Embeds are beginning to require JS as well.
-
pabs
hmm, got an example?
-
JAA
mastodon.social
-
JAA
Yes, they're probably not running the bleeding edge.
-
JAA
The PR was only merged 6 days ago and isn't in a release yet. But mastodon.social runs it already, it seems.
-
JAA
It looks like there'll be a new release soon, and then it'll spread to most instances quickly.
-
pabs
crap
-
pabs
hmm, zygolophodon does still work with mastodon.social. maybe I can modify it to output API URLs instead
-
JAA
Rewriting the URLs should be trivial.
-
pabs
aha, it has --debug-http already
-
pabs
does 2 requests for individual posts: /api/v1/statuses/113082066860765988 /api/v1/statuses/113082066860765988/context
-
pabs
and 3 for users: /api/v1/accounts/lookup?acct=mozilla /api/v1/accounts/110306602663312748/statuses?pinned=true /api/v1/accounts/110306602663312748/statuses?exclude_replies=true&limit=40
-
pabs
(plus pagination I guess)
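-
A rough Python sketch of the URL rewriting discussed above. The API paths are the ones from the --debug-http output just quoted; the function name and regexes are illustrative only, and a profile URL still needs the lookup response to learn the numeric account id used by the follow-up /statuses requests.

    import re

    def api_urls(web_url):
        # Individual post, e.g. https://mastodon.social/@user/113082066860765988
        m = re.match(r"https://([^/]+)/@[^/]+/(\d+)$", web_url)
        if m:
            host, status_id = m.groups()
            return [
                f"https://{host}/api/v1/statuses/{status_id}",
                f"https://{host}/api/v1/statuses/{status_id}/context",
            ]
        # Profile, e.g. https://mastodon.social/@mozilla -- the lookup
        # response carries the numeric account id, which the two
        # /statuses requests (pinned=true, exclude_replies=true&limit=40,
        # plus pagination) then use.
        m = re.match(r"https://([^/]+)/@([^/@]+)$", web_url)
        if m:
            host, acct = m.groups()
            return [f"https://{host}/api/v1/accounts/lookup?acct={acct}"]
        return []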
-
monoxane
I wonder if it would be worth writing a thing that scrapes the raw ActivityPub APIs instead of trying to go through the JS UI
-
monoxane
it would violate the "preserve original content" rules though
-
magmaus3
monoxane: one potential problem is that some instances require authorized fetches, which would require the scraper to have an instance. (btw, that also means that it would be possible to prevent scraping, which is both a good and a bad thing)
-
HiccupJul
Would it be okay to make a "List of websites not captured correctly by the Wayback Machine" page on the wiki, like the exclusions page? Don't have many examples right now, but there are a few. Although I guess it can be somewhat worked around by using archivebot.
-
HiccupJul
The one I was thinking of was this:
electricsheep.co.jp/blog.php?id=431
-
arkiver
HiccupJul: that sounds nice, i'm guessing final call would be with JAA ^
-
HiccupJul
wiki.archiveteam.org/index.php/How_to_use_our_wiki this says to be bold but yeah I was wondering about his opinion
-
HiccupJul
webpage is the blog of the Gimmick! (famous NES game) developer, has behind the scenes info and such. blog pages only load if you navigate to the main page first.
-
HiccupJul
doesn't work through Save Page Now at the very least
-
HiccupJul
i'm asking on #archivebot if someone can try it through archivebot
-
h2ibot
MihaiArchive1 edited WikiTeam (+3, /* Wiki dumps */):
wiki.archiveteam.org/?diff=53486&oldid=53483
-
h2ibot
MihaiArchive1 edited Wikimedia Commons (+57):
wiki.archiveteam.org/?diff=53487&oldid=49964
-
h2ibot
Awauwa edited Deathwatch (+198, added mozilla.social):
wiki.archiveteam.org/?diff=53488&oldid=53463
-
JAA
HiccupJul_: How would you define 'correctly'?
-
HiccupJul_
good question
-
HiccupJul_
but ones that don't have any of the content, like in this case, should probably be recorded
-
JAA
The content not being displayed doesn't necessarily mean it wasn't captured though.
-
JAA
I know there are sites that can be captured, all the relevant data is captured, but then something breaks on playback. If you know the API URL, you can still get the content back.
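-
As a concrete illustration of getting captured API data back out despite broken playback: the WBM's "id_" URL form returns the raw capture without any rewriting, and a partial timestamp redirects to the closest capture. A minimal Python sketch, reusing the status id quoted earlier; the helper name is made up.

    import json
    import urllib.request

    def fetch_archived(api_url, timestamp="2024"):
        # "id_" = unmodified capture; the partial timestamp redirects
        # to whichever capture is closest to it.
        wbm = f"https://web.archive.org/web/{timestamp}id_/{api_url}"
        with urllib.request.urlopen(wbm) as resp:
            return json.load(resp)

    status = fetch_archived(
        "https://mastodon.social/api/v1/statuses/113082066860765988")
    print(status["content"])  # the post body, as HTML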
-
JAA
The SPN 'just' does a MITM proxy to capture the network traffic. The WBM dynamically rewrites things, which sometimes breaks due to how the target site's JS is written.
-
HiccupJul_
huh
-
HiccupJul_
how can i check that for myself?
-
JAA
There's no generic way. It depends on the individual site.
-
JAA
You might be able to see something in the SPN output when using the submission form (rather than /save/URL).
-
JAA
I see that electricsheep.co.jp/blog.php?id=431 returns a message about requiring cookies, so that's different, I guess.
-
HiccupJul_
yeah i think it's a server-side thing
-
HiccupJul_
ah i thought you meant the wayback machine api
-
JAA
POST requests frequently break, but the failure mode varies. For example, it might only generate one capture per hour, and the playback then doesn't load the correct data.
-
JAA
Ah, sorry, no, I mean the target site's.
-
HiccupJul_
yeah looking in chrome devtools network log, loading the page in incognito, i don't see the page content
-
HiccupJul_
so i think it is a server-side check of some kind
-
JAA
Yeah
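-
If the cookie message is the whole story, the failure could be reproduced (and worked around outside SPN) with a session that visits the front page first. This is purely a guess at the mechanism, sketched in Python:

    import http.cookiejar
    import urllib.request

    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))
    # Hit the front page first so the server can set its session cookie...
    opener.open("https://electricsheep.co.jp/")
    # ...then the blog page should come back with its content.
    page = opener.open("https://electricsheep.co.jp/blog.php?id=431").read()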
-
HiccupJul_
maybe the wiki page should just list things like that which save page now can't handle, e.g. navigating to home page first. bit of an obscure requirement though
-
JAA
Yeah, I feel like there are too many different failure modes here to document them in a sensible manner. Maybe a list of those failure modes could be useful though.
-
JAA
And then we can add a couple examples to each failure mode.
-
HiccupJul_
side question: is there a way to view the metadata of IA items (like archive.org/metadata/whatever) after the item is taken down?
-
arkiver
HiccupJul_: no
-
HiccupJul_
ah, didn't think so. do you know if there's any third party backup of that metadata being made?
-
arkiver
i dont think so
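-
So any backup of item metadata would have to be made while the item is still up. A minimal sketch against the /metadata endpoint mentioned above; the identifier is the placeholder from the question.

    import json
    import urllib.request

    identifier = "whatever"  # placeholder IA item identifier
    with urllib.request.urlopen(
            f"https://archive.org/metadata/{identifier}") as resp:
        meta = json.load(resp)
    with open(f"{identifier}.metadata.json", "w") as f:
        json.dump(meta, f, indent=2)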
-
nulldata
monoxane - Grabbing the API results via AB wouldn't violate a "preserve original content" rule. It's not ideal and wouldn't be easy to browse, but it's not making up or modifying content
-
JAA
(Faking HTML pages using the API data would however be bad.)
-
magmaus3
JAA: I'm assuming that adding additional JS to make the contents readable would still violate the rule, right?
-
JAA
Naturally
-
JAA
Any modification at all does.
-
JAA
However, you could have an external page that fetches the API response from the WBM and renders it however you like.
-
steering
^ and then capture that in the WBM! :P
-
JAA
Why yes, I've done that before (due to CORS). :-D
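-
The external-viewer idea can also be done entirely outside a browser, which sidesteps CORS. A hedged sketch: find the newest capture of the API URL via the CDX API, fetch it raw, and write a standalone HTML file (the file name and page layout are invented):

    import html
    import json
    import urllib.parse
    import urllib.request

    API_URL = "https://mastodon.social/api/v1/statuses/113082066860765988"

    # CDX rows: urlkey, timestamp, original, mimetype, statuscode, digest, length
    cdx = ("https://web.archive.org/cdx/search/cdx?output=json&url="
           + urllib.parse.quote(API_URL, safe=""))
    rows = json.load(urllib.request.urlopen(cdx))
    timestamp = rows[-1][1]  # rows[0] is the header; last row = newest capture

    raw = urllib.request.urlopen(
        f"https://web.archive.org/web/{timestamp}id_/{API_URL}")
    status = json.load(raw)

    # status["content"] is already HTML in the Mastodon API.
    doc = (f"<!doctype html><title>@{html.escape(status['account']['acct'])}"
           f"</title>\n<article>{status['content']}</article>\n")
    with open("viewer.html", "w", encoding="utf-8") as f:
        f.write(doc)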
-
TheTechRobo
magmaus3: Depends on whether it's in the WARC or not. If you're modifying the WARC record, not allowed. But the Wayback Machine adding special code to fix the page would be fine.
-
JAA
Right, yes, but we have no influence over that.
-
magmaus3
TheTechRobo: good to know :3