#archiveteam-bs

00:02

wyatt8740

Well, here I am. :p
00:02

wyatt8740

jp.mercari.com/item/m40357548342 is the page i'm trying to grab
00:03

wyatt8740

It's a react.js page
00:03

JAA

wyatt8740: ArchiveBot doesn't know JS at all. wpull tries to find URLs in <script> blocks, but that's very unreliable. Anything beyond that simply won't be covered by it.
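
To illustrate why scanning <script> blocks is unreliable, here is a minimal sketch of the general idea, not wpull's actual implementation. It assumes a plain regex over the script text; the names `ScriptCollector`, `urls_in_scripts`, and `SAMPLE` are made up for this example. A literal URL is found, but a URL assembled at runtime is never seen.

```python
import re
from html.parser import HTMLParser

# Naive pattern for absolute URLs appearing as literals (an assumption,
# not the pattern any real crawler uses).
URL_RE = re.compile(r"""https?://[^\s"'<>\\]+""")


class ScriptCollector(HTMLParser):
    """Collects the raw text found inside <script> elements."""

    def __init__(self):
        super().__init__()
        self._in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if self._in_script:
            self.scripts.append(data)


def urls_in_scripts(html):
    """Return absolute URLs that appear literally inside script blocks."""
    collector = ScriptCollector()
    collector.feed(html)
    collector.close()
    found = []
    for body in collector.scripts:
        for url in URL_RE.findall(body):
            if url not in found:
                found.append(url)
    return found


SAMPLE = '''<html><body>
<script>
var img = "https://example.com/a.jpg";
var api = "/api/items/" + id;   // built at runtime, invisible to a text scan
</script>
<p>https://example.com/not-in-script</p>
</body></html>'''

print(urls_in_scripts(SAMPLE))
```

On a React page most requests resemble the second variable: they only exist once the script actually runs, so a text scan like this misses them.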
00:04

wyatt8740

alright. so since it's react, probably a non-starter. Had a bad feeling that'd be the case.
00:04

wyatt8740

(I love the modern web.)
00:05

JAA

WARC is capable of many things, but far from everything. HTTP/2 and HTTP/3 are right out. WebSockets, too. If you can achieve the page with HTTP/1.1 requests only, then WARC would work.
00:05

wyatt8740

any suggestions for how best to do this, then? Just have a screenshot in my article, links to self-hosted copies of images, and let wayback machine/archivebot crawl my site?
00:06

JAA

Playback is a whole different topic. POST requests in particular are hard, as is anything involving random variables (JSONP, timestamps as cache busters, etc.).
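
A small sketch of the cache-buster problem, under stated assumptions: the parameter names in `CACHE_BUSTER_PARAMS` and the example URLs are hypothetical, and this is not a description of how any particular playback system works. The page script generates a fresh timestamp on every load, so the URL requested at playback never exactly matches the one that was captured unless the volatile parameter is ignored.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical parameter names commonly used as cache busters.
CACHE_BUSTER_PARAMS = {"_", "cb", "t"}


def canonical(url):
    """Drop assumed cache-buster parameters so equivalent requests compare equal."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in CACHE_BUSTER_PARAMS
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


captured = "https://example.com/api/item?id=42&_=1699746333"  # at capture time
replayed = "https://example.com/api/item?id=42&_=1731368733"  # at playback time

print(captured == replayed)                        # exact lookup fails
print(canonical(captured) == canonical(replayed))  # fuzzy lookup matches
```

The catch is that a playback system has to guess which parameters are safe to ignore, which is part of why this is hard in the general case.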
00:06

wyatt8740

my blog should be fine for that :)
00:06

wyatt8740

wyatt8740.gitlab.io/site/blog/011_012.html#pc9801-1
00:06

thuban

and here's a list of 190 blogs extracted from other sources i had lying around (deduped from previous): transfer.archivete.am/2urPt/blogspot_blogs_2.txt
00:06

wyatt8740

yeah i'd looked into WARC format before and quickly got confused because I didn't grok that it was actually transcribing the full HTTP transaction
00:06

thuban

(these aren't really filtered for significance--a lot of them are from when we were trawling for zippyshare links--but if we end up doing horizontal discovery, more seeds don't hurt, right?)
00:07

wyatt8740

*transactions
00:08

JAA

Getting playback right in the general case is virtually impossible. There are too many things that can influence what exactly is displayed etc., e.g. screen size, browser version, datetime, time zone, you name it.
00:08

wyatt8740

i do like what i imagine the archive.is approach is as a supplement to WARC
00:08

JAA

The WBM does a lot of tricks and manages to work around some of those, but ... yeah.
00:08

pabs

I think SPN2 is the main thing that does JavaScript. either submit to the form on web.archive.org/save/ or send a mail with links to savepagenow⊙ao
00:08

JAA

What you can do is use SPN as a logged-in user and make it also create a screenshot.
00:09

wyatt8740

yeah i did the /save/ thing and got the near-empty file
00:09

JAA

Probably the data itself is captured, but the playback doesn't work. SPN uses a browser under the hood.
00:09

pabs

ah, you're talking about saving the DOM to HTML?
00:09

wyatt8740

that would be one method, sure.
00:09

pabs

archive.is is the only thing I know of that does the DOM2HTML thing
00:10

wyatt8740

i mostly want things i link in my blog externally to be archived/findable in future
00:10

pabs

for having a public archive at least
00:10

wyatt8740

since it drives me nuts when people don't
00:10

JAA

I don't see a screenshot on the WBM for jp.mercari.com/item/m40357548342
00:11

wyatt8740

nothing at web.archive.org/web/20231111234533i…://jp.mercari.com/item/m40357548342 ?
00:11

JAA

Screenshot, not snapshot
00:11

wyatt8740

ahh ok
00:11

JAA

There's an option for logged-in users on SPN to also capture a screenshot.
00:11

JAA

That happens from the browser doing the archival, so it should be reasonably usable.
00:12

wyatt8740

i guess i was using a diff. browser than normal
00:12

wyatt8740

let me go back to the one i'm logged in on :\
00:12

JAA

Can't be resaved currently due to the cooldown timer, but should work again in half an hour or so (not sure what the current limit is).
00:13

wyatt8740

yeah, discovered that.
00:13

JAA

It'll still only be a screenshot, so no Ctrl+F, no copying, etc.
00:13

wyatt8740

yeah
00:13

wyatt8740

better than nothing
00:14

JAA

DOM dump as a static page would be nice. Then again, that method also has its limitations, as can frequently be seen on archive.$tldoftheday.
00:14

JAA

Anything requiring scripting on the page, e.g. expanding sections or whatever, won't work.
00:14

wyatt8740

archive.is/HUTy2
00:14

wyatt8740

thankfully the side image thumbnails seem to be the full size images shrunk in CSS/JS
00:14

wyatt8740

so they're actually saved
00:15

JAA

They're in the WBM, too, e.g. web.archive.org/web/20231111234535/…hotos/m40357548342_5.jpg?1695524270
00:16

wyatt8740

hmm. well, that's good at least.
00:16

JAA

As I said, probably captured alright, just doesn't play back, which makes it fairly useless currently. :-/
00:17

wyatt8740

The state of modern web dev; I love it.
00:17

JAA

Aye
00:18

JAA

And it'll only get worse. Hooray.
00:18

wyatt8740

I love the future.
00:18

wyatt8740

And I especially love facebook
00:18

JAA

I've seen a site before that did all content loading with a WebSocket.
00:18

wyatt8740

That's... like doing rtmp grabs in a SWF or something, as far as archival is concerned
00:18

wyatt8740

actually what you describe reminds me a lot of flash-based sites
00:19

JAA

Yeah, pretty much.
00:22

h2ibot

Switchnode edited Deathwatch (+4, /* 2023 */ fix syntax): wiki.archiveteam.org/?diff=51131&oldid=51126
00:23

JAA

Whoops, thanks.
00:31

pabs

/cc arkiver re having SPN2 get an option to save the DOM to HTML, similar to how it has the screenshot thing
01:34

tomodachi94

@Pedrosso:hackint.org @JAA:hackint.org thank you for grabbing Fextralife's wikis, I appreciate it! ❤️
01:35

Pedrosso

<3
01:36

Pedrosso

You were right about it being a gold-mine. So satisfying.
02:26

Pedrosso

I know of a website svtplay.se (videos cannot be archived, most of it is locked behind a region specific wall too) that is often the only source to a specific media and they're often deleted on grounds of copyright or other rights. There are -dl scripts for it. I am concerned about archival though. It definitely needs archival since
02:26

Pedrosso

otherwise a lot of media is continually lost, however I cannot hold it and it's clear it cannot just be submitted publicly. What would be advised here?
02:27

Pedrosso

(videos cannot be archived via save-page or a web save afaik)*
02:28

Flashfire42

Maybe tubeup but use it VERY SPARINGLY because there is a lot of garbage people upload using it and it can cause a lot of space usage for IA
02:28

Pedrosso

tubeup?
02:29

Pedrosso

I'm not entirely sure if you understand what I'm asking about
02:30

Pedrosso

to reclarify, there are -dl scripts to get the videos. ( github.com/spaam/svtplay-dl ). My problem is more legal and ethical
02:33

Pedrosso

It's a general question but if specifics are required, it's about storage.
02:57

h2ibot

Tech234a edited YouTube (+302, /* Stories */ Discontinued): wiki.archiveteam.org/?diff=51132&oldid=50877
03:01

h2ibot

Tech234a edited YouTube (+15, /* Playlist notes (October 2020) */ Add…): wiki.archiveteam.org/?diff=51133&oldid=51132
03:43

Pedrosso

(I feel locked-out from asking any other questions by having this one here lol) Is there no like, go-to process in situations like this?
03:44

pokechu22

I would say in practice we usually lean towards archiving something if it's useful to have - but it also does depend on the total size
03:46

Pedrosso

The point of my dilemma is that, since items are removed because the rights run out, they're innately and obviously not using Creative Commons
03:51

Pedrosso

For context, videos are up for free and not all videos are deleted
03:53

Pedrosso

and with "videos" I mean movies/films, series, news, documentaries, tv channels, etc. Which I believe is in what's counted as useful to have
04:04

JAA

Pedrosso: Just so I understand what we're talking about: this is a legitimate site, right? Based on the name, I assume it's the TV broadcaster's digital platform, where they make their own and licensed content available for a limited time?
04:05

JAA

I'll assume that you missed that message.
04:05

JAA

Pedrosso47: Just so I understand what we're talking about: this is a legitimate site, right? Based on the name, I assume it's the TV broadcaster's digital platform, where they make their own and licensed content available for a limited time?
04:06

Pedrosso47

Oh yes, indeed.
04:07

JAA

Virtually everything we archive is copyrighted content. That's not really a factor at play here. It's how intellectual property works, for better or for worse. There are exceptions in many jurisdictions that free you from having to follow copyright restrictions when it's done for preservation purposes, which would probably apply here.
04:10

Pedrosso47

That's very nice to know, however whenever I try to look up information on the internet archive they seem adamant about not posting non-creative commons. Though I may have gotten the wrong impression
04:11

pokechu22

Where'd you see that?
04:12

JAA

They probably say something along those lines to discourage people from uploading stuff that's already widely shared and won't get lost anytime soon (e.g. latest Hollywood productions). There's likely also a 'we have to say that so we don't get in trouble' angle to it. Nevertheless, IA does have the legal right to store such content. They might not be able to make it publicly available until the
04:12

JAA

copyright expires in a few hundred years.
04:13

JAA

So for an individual uploader, that's the policy they probably want, more or less.
04:13

Pedrosso

Great, great.
04:14

thuban

yeah, in practice ia 'darks' items (makes them inaccessible) in response to dmca claims; while accumulating a lot of reports or flagrantly pirating popular content can get you b&, they're pretty relaxed about good-faith uploads. if it's niche or abandoned enough not to get reported in the first place, it's basically fine
04:14

JAA

That doesn't mean they might not be interested in something like this. It'd be all about size and logistics. How much data is it, and do they just need to provide storage for it or does it involve them doing work?
04:15

JAA

Talking to them is important for things like this. Either directly or through arkiver, for example. If they want to take the data, and they already know what this is about, future takedowns etc. won't be as problematic.
04:17

JAA

Archiving these official platforms by major broadcasters has been on my wishlist for a while. It's a lot of work though, especially at scale (i.e. many countries etc.).
04:18

Pedrosso

I see ~~But I'm shy~~ As for the major broadcasters though; svt.se is a "parent" website with loads of news articles all over the country. I'd believe it's quite large
04:19

Pedrosso

as in, svt.se
04:20

JAA

We probably archive a fair bit of that through #//. These audio and video platforms can virtually never be archived properly like that though and need special stuff.
04:21

Pedrosso

does #// get that through outlinks or are you saying a lot of it is manually added?
04:22

JAA

There are things we grab regularly. At least one of those lists is news outlets sourced from Wikidata. I'd expect svt.se to be there, though I didn't check.
04:22

JAA

For those, we regularly grab the homepage and links from it, or something along those lines.
04:23

pokechu22

Yeah, there's wikidata.org/wiki/Q215363 (and also wikidata.org/wiki/Q10686370 for some reason?)
04:24

Pedrosso

How do you search on IA for URLs in a domain archived by WikiTeam?
04:25

JAA

pokechu22: One is the company, the other is their website. But also, yes, naturally it's in Wikidata, but I'm not sure whether it made it into the list of news outlets since that was filtered by probably the 'instance of' value and I don't remember which possible values were accepted there.
04:25

Pedrosso

would that be an extensive list of outlinks or simply a selection?
04:26

Pedrosso

assuming it is in the list of news outlets
04:26

JAA

It's in 43200_wikidata_Q11033_mass-media.wikidata.txt
04:26

JAA

Which should mean it gets grabbed every 12 hours.
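
The arithmetic behind that reading, as a tiny sketch. It assumes the leading number in the filename is the recurrence interval in seconds, which is inferred from JAA's comment rather than from any documented naming scheme; `interval_hours` is a made-up helper.

```python
def interval_hours(filename):
    """Read the leading number of a list filename as seconds, return hours.

    Assumption: the prefix before the first underscore is an interval in seconds.
    """
    seconds = int(filename.split("_", 1)[0])
    return seconds / 3600


print(interval_hours("43200_wikidata_Q11033_mass-media.wikidata.txt"))  # 43200 / 3600 = 12.0
```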
04:27

JAA

But the GitHub repo is outdated, so...
04:27

JAA

github.com/ArchiveTeam/urls-sources if you want to poke around.
04:29

JAA

archiveteam_urls doesn't show up on web.archive.org/web/collections/20230000000000*/https://www.svt.se though, odd.
04:31

JAA

Pedrosso: The idea is that we fetch the homepage every N hours and then queue back any links found on it. If they were already captured, that gets filtered out. New links make it through and get archived.
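
The fetch-and-filter loop described above could look roughly like the following sketch. It is not the actual #// code: `LinkCollector`, `new_links`, the in-memory `seen` set, and the `news.example` URLs are all illustrative assumptions, and the HTML is passed in as a string instead of being fetched.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects absolute link targets from <a href=...> tags."""

    def __init__(self, base):
        super().__init__()
        self.base = base
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base, href))


def new_links(homepage_url, html, seen):
    """Return links on the homepage that are not yet in `seen`, and record them."""
    collector = LinkCollector(homepage_url)
    collector.feed(html)
    collector.close()
    fresh = []
    for link in collector.links:
        if link not in seen:
            seen.add(link)
            fresh.append(link)
    return fresh


seen = set()  # stands in for "already captured" bookkeeping

# First visit: everything on the homepage is new and gets queued.
first = new_links("https://news.example/", '<a href="/a">A</a><a href="/b">B</a>', seen)

# N hours later: /b was already captured, so only /c makes it through.
second = new_links("https://news.example/", '<a href="/b">B</a><a href="/c">C</a>', seen)

print(first)
print(second)
```

Because only the homepage is revisited, articles that never appear there between two visits would be missed under this scheme.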
04:32

Pedrosso

Ahh, I get the concept
04:32

Pedrosso

because of frontpage stuff
04:32

JAA

(Also, bringing up that missing stuff in #// directly.)
04:32

Pedrosso

(thx for the note)
04:54

Pedrosso

A list of big websites that I have been debating on sharing here. I suppose even if they're too big & not useful enough to archive there's no harm in sharing transfer.archivete.am/13l4Ga/list.txt
05:01

» pabs wonders if legit .tk domains need to get grabbed technologyreview.com/2023/11/02/108…ic-island-global-capital-cybercrime
05:02

pabs

tcl.tk for eg :)
05:15

pokechu22

There's an archivebot job for legislation.gov.uk but it turns out the UK has a lot of law (and also that site's banned us as of a bit under a month ago :|)
05:19

pabs

perhaps needs a distributed project?
13:34

h2ibot

Bzc6p edited Fextralife (+0, fix banner link): wiki.archiveteam.org/?diff=51134&oldid=51129
14:07

arkiver

hi
14:07

arkiver

so google is doing stuff
14:08

arkiver

pabs: yeah we perform discovery while archiving blogger
14:09

h2ibot

0KepOnline edited Spore (+40, Add OLDEST view type): wiki.archiveteam.org/?diff=51135&oldid=51112
14:22

mossssss

wait sorry - it disconnected (i think my internet is just bad lol), arkiver what is google doing?
17:25

Pedrosso

wiki.archiveteam.org/index.php/Freq…/archived,for%20hosting%20archives! as per this, if I use the given tools to create archives of svtplay.se would the process then be to have someone here review the files' integrity? Is there any nice naming scheme the IA items should have (and any other preferred
17:25

Pedrosso

fields & metadata for IA)?
17:51

fireonlive

mossssss: hackint.logs.kiska.pw/archiveteam-bs just in case you get disconnected :3
17:51

fireonlive

mossssss: also if you leave webirc in a background tab, browsers suspend the tab which drops the connection
17:52

mossssss

oohhh that would make sense. ill keep it in another window to leave it up. also thank you!!!!!
17:52

fireonlive

welcome =]
17:52

fireonlive

you can also use a desktop IRC client if you wish such as hexchat
17:52

fireonlive

or quassel, there's a few out there
17:53

mossssss

ill have to look into that! my partner is a lot more well versed in this stuff haha so ill ask them
17:53

mossssss

(im the archiving nerd, they are the computer stuff (inc. irc) nerd)
17:55

Pedrosso

fireonlive: I keep forgetting how to get to those logs lol. Still do
17:55

fireonlive

:)
17:55

» fireonlive hands Pedrosso a bookmark
17:56

Pedrosso

(How to do actions? "* text"?)
17:56

katia

/me pets a cat
17:57

» Pedrosso gladly receives said bookmark
17:57

fireonlive

:3
17:57

» Pedrosso tests /me
18:01

fireonlive

katia: taking a look under the hood eh
18:01

fireonlive

:p
18:02

katia

👀
18:28

Ryz

arkiver, any updates on Blogger/Google stuff?
18:53

h2ibot

JustAnotherArchivist edited List of websites excluded from the Wayback Machine/Partial exclusions (+907, Add Airbnb): wiki.archiveteam.org/?diff=51136&oldid=51122
20:26

h2ibot

Exorcism uploaded File:Fextralife-screenshot.png: wiki.archiveteam.org/?title=File%3AFextralife-screenshot.png
20:27

h2ibot

Exorcism edited Fextralife (+35): wiki.archiveteam.org/?diff=51138&oldid=51134
21:18

vokunal|m

erai-raws.info missed a payment on their ER-drive service, and lost their subscription.
21:19

vokunal|m

I have no idea what an ER-Drive is
21:20

vokunal|m

They've had issues with their paypal being banned before. I'm not sure if this is related
21:46

thuban

Pedrosso: your question is a little unclear to me. by "us[ing] the given tools to create archives of svtplay.se", do you mean using -dl scripts to get the video files, or using warc tools to create warcs?
21:46

thuban

(the latter would be difficult, because most warc tools won't work well with such a js-heavy site without substantial custom scripting. (also, even a perfect capture might or might not play back correctly in the wayback machine))
21:46

thuban

in either case, no, there isn't a process to "review the files' integrity"; there's no technical mechanism to do that (tls isn't designed that way), so the internet archive basically operates on trust. archiveteam, aiui, no longer adopts third-party data into the archiveteam collection--that faq entry is outdated and should be changed.
21:47

thuban

what JAA said earlier is right; if you want to do this at scale you should consider talking to ia about it first
21:47

thuban

that said, for general information about metadata consult archive.org/developers/metadata-schema/index.html and/or web.archive.org/web/20221001171424/…docs/api/metadata-schema/index.html (latter has file-level metadata documentation; i have no idea why it was removed)
21:49

Pedrosso

thuban: The answer to your first question is archiveteam's "grab-site" tool. As the -dl scripts would require some scripting to get working within a web format I'd imagine
21:50

thuban

yeah, i would be _very_ surprised if that worked
21:52

Pedrosso

I didn't mean a technical mechanism specifically, just any mechanism technical or otherwise. Sad to know there are none adopted anymore but I suppose it may be better to go straight to the top ~~still shy about that tho~~. It's a little annoying that the wiki is out of date with such things, but still nice to have the info. Thanks about the
21:52

Pedrosso

metadata-related links
21:53

thuban

sorry about that! i'll update the page if an op confirms the current policy.
22:27

h2ibot

JustAnotherArchivist edited List of websites excluded from the Wayback Machine/Partial exclusions (+875, More Airbnb): wiki.archiveteam.org/?diff=51139&oldid=51136
23:17

that_lurker

arstechnica.com/science/2023/11/fir…r-plant-in-the-us-has-been-canceled
23:17

that_lurker

Could maybe be a good idea to grab nuscalepower.com/en
23:22

vokunal|m

vokunal: Source for my above message earlier erai-raws.info/news/er-drive-and-hevc
