-
wyatt8740
Well, here I am. :p
-
wyatt8740
-
wyatt8740
It's a react.js page
-
JAA
wyatt8740: ArchiveBot doesn't know JS at all. wpull tries to find URLs in <script> blocks, but that's very unreliable. Anything beyond that simply won't be covered by it.
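
A rough illustration of why scanning `<script>` blocks for URLs is unreliable. This is a hypothetical regex heuristic written for this example, not wpull's actual implementation; the names `urls_in_scripts`, `SCRIPT_RE`, and `URL_RE` are made up. It only finds URLs that appear as complete literals, so anything assembled at runtime is missed:

```python
import re

# Hypothetical patterns for illustration only (not taken from wpull).
SCRIPT_RE = re.compile(r"<script\b[^>]*>(.*?)</script>", re.IGNORECASE | re.DOTALL)
URL_RE = re.compile(r"""https?://[^\s"'<>\\)]+""")


def urls_in_scripts(html):
    """Return absolute URLs that appear literally inside <script> blocks."""
    found = []
    for body in SCRIPT_RE.findall(html):
        found.extend(URL_RE.findall(body))
    return found


page = """
<script>
  var logo = "https://example.org/logo.png";
  var api = "https://" + host + "/items/" + id;  // built at runtime
</script>
"""
print(urls_in_scripts(page))  # ['https://example.org/logo.png']
```

The second URL never exists as a string in the page source, which is the usual situation on a React-style site.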
-
wyatt8740
alright. so since it's react, probably a non-starter. Had a bad feeling that'd be the case.
-
wyatt8740
(I love the modern web.)
-
JAA
WARC is capable of many things, but far from everything. HTTP/2 and HTTP/3 are right out. WebSockets, too. If you can achieve the page with HTTP/1.1 requests only, then WARC would work.
-
wyatt8740
any suggestions for how best to do this, then? Just have a screenshot in my article, links to self-hosted copies of images, and let wayback machine/archivebot crawl my site?
-
JAA
Playback is a whole different topic. POST requests in particular are hard, as is anything involving random variables (JSONP, timestamps as cache busters, etc.).
-
wyatt8740
my blog should be fine for that :)
-
wyatt8740
-
thuban
and here's a list of 190 blogs extracted from other sources i had lying around (deduped from previous):
transfer.archivete.am/2urPt/blogspot_blogs_2.txt
-
wyatt8740
yeah i'd looked into WARC format before and quickly got confused because I didn't grok that it was actually transcribing the full HTTP transaction
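
To make the "transcribing the full HTTP transaction" point concrete, here is a minimal sketch of a WARC/1.0 response record built by hand. The function name `warc_response_record` and the example URL are assumptions for illustration; real tools also write a matching request record and usually compress each record. The point is only that the record block is the raw HTTP message, status line and headers included:

```python
import uuid
from datetime import datetime, timezone


def warc_response_record(target_uri, raw_http_response):
    """Wrap the raw bytes of one HTTP response in a WARC/1.0 response record."""
    headers = [
        "WARC/1.0",
        "WARC-Type: response",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')}",
        f"WARC-Target-URI: {target_uri}",
        "Content-Type: application/http; msgtype=response",
        f"Content-Length: {len(raw_http_response)}",
    ]
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    # The block is the HTTP message exactly as received: status line,
    # headers, and body. Two CRLFs terminate the record.
    return head + raw_http_response + b"\r\n\r\n"


# A tiny made-up HTTP/1.1 response used as the record block.
raw = b"HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\nContent-Length: 2\r\n\r\nhi"
record = warc_response_record("https://example.org/", raw)
print(record.decode("utf-8"))
```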
-
thuban
(these aren't really filtered for significance--a lot of them are from when we were trawling for zippyshare links--but if we end up doing horizontal discovery, more seeds don't hurt, right?)
-
wyatt8740
*transactions
-
JAA
Getting playback right in the general case is virtually impossible. There are too many things that can influence what exactly is displayed etc., e.g. screen size, browser version, datetime, time zone, you name it.
-
wyatt8740
i do like what i imagine the archive.is approach is as a supplement to WARC
-
JAA
The WBM does a lot of tricks and manages to work around some of those, but ... yeah.
-
pabs
I think SPN2 is the main thing that does JavaScript. either submit to the form on web.archive.org/save/ or send a mail with links to savepagenow⊙ao
-
JAA
What you can do is use SPN as a logged-in user and make it also create a screenshot.
-
wyatt8740
yeah i did the /save/ thing and got the near-empty file
-
JAA
Probably the data itself is captured, but the playback doesn't work. SPN uses a browser under the hood.
-
pabs
ah, you're talking about saving the DOM to HTML?
-
wyatt8740
that would be one method, sure.
-
pabs
archive.is is the only thing I know of that does the DOM2HTML thing
-
wyatt8740
i mostly want things i link in my blog externally to be archived/findable in future
-
pabs
for having a public archive at least
-
wyatt8740
since it drives me nuts when people don't
-
JAA
I don't see a screenshot on the WBM for
jp.mercari.com/item/m40357548342
-
wyatt8740
-
JAA
Screenshot, not snapshot
-
wyatt8740
ahh ok
-
JAA
There's an option for logged-in users on SPN to also capture a screenshot.
-
JAA
That happens from the browser doing the archival, so it should be reasonably usable.
-
wyatt8740
i guess i was using a diff. browser than normal
-
wyatt8740
let me go back to the one i'm logged in on :\
-
JAA
Can't be resaved currently due to the cooldown timer, but should work again in half an hour or so (not sure what the current limit is).
-
wyatt8740
yeah, discovered that.
-
JAA
It'll still only be a screenshot, so no Ctrl+F, no copying, etc.
-
wyatt8740
yeah
-
wyatt8740
better than nothing
-
JAA
DOM dump as a static page would be nice. Then again, that method also has its limitations, as can frequently be seen on archive.$tldoftheday.
-
JAA
Anything requiring scripting on the page, e.g. expanding sections or whatever, won't work.
-
wyatt8740
-
wyatt8740
thankfully the side image thumbnails seem to be the full size images shrunk in CSS/JS
-
wyatt8740
so they're actually saved
-
JAA
-
wyatt8740
hmm. well, that's good at least.
-
JAA
As I said, probably captured alright, just doesn't play back, which makes it fairly useless currently. :-/
-
wyatt8740
The state of modern web dev; I love it.
-
JAA
Aye
-
JAA
And it'll only get worse. Hooray.
-
wyatt8740
I love the future.
-
wyatt8740
And I especially love facebook
-
JAA
I've seen a site before that did all content loading with a WebSocket.
-
wyatt8740
That's... like doing rtmp grabs in a SWF or something, as far as archival is concerned
-
wyatt8740
actually what you describe reminds me a lot of flash-based sites
-
JAA
Yeah, pretty much.
-
h2ibot
Switchnode edited Deathwatch (+4, /* 2023 */ fix syntax):
wiki.archiveteam.org/?diff=51131&oldid=51126
-
JAA
Whoops, thanks.
-
pabs
/cc arkiver re having SPN2 get an option to save the DOM to HTML, similar to how it has the screenshot thing
-
tomodachi94
@Pedrosso:hackint.org @JAA:hackint.org thank you for grabbing Fextralife's wikis, I appreciate it! ❤️
-
Pedrosso
<3
-
Pedrosso
You were right about it being a gold-mine. So satisfying.
-
Pedrosso
I know of a website
svtplay.se (videos cannot be archived, and most of it is locked behind a region-specific wall too) that is often the only source for a particular piece of media, and items are often deleted on grounds of copyright or other rights. There are -dl scripts for it. I am concerned about archival though. It definitely needs archival since
-
Pedrosso
otherwise a lot of media is continually lost, however I cannot hold it and it's clear it cannot just be submitted publicly. What would be advised here?
-
Pedrosso
(videos cannot be archived via save-page or a web save afaik)*
-
Flashfire42
Maybe tubeup but use it VERY SPARINGLY because there is a lot of garbage people upload using it and it can cause a lot of space usage for IA
-
Pedrosso
tubeup?
-
Pedrosso
I'm not entirely sure if you understand what I'm asking about
-
Pedrosso
to reclarify, there are -dl scripts to get the videos. (
github.com/spaam/svtplay-dl ). My problem is more legal and ethical
-
Pedrosso
It's a general question but if specifics are required, it's about storage.
-
h2ibot
Tech234a edited YouTube (+302, /* Stories */ Discontinued):
wiki.archiveteam.org/?diff=51132&oldid=50877
-
h2ibot
Tech234a edited YouTube (+15, /* Playlist notes (October 2020) */ Add…):
wiki.archiveteam.org/?diff=51133&oldid=51132
-
Pedrosso
(I feel locked-out from asking any other questions by having this one here lol) Is there no like, go-to process in situations like this?
-
pokechu22
I would say in practice we usually lean towards archiving something if it's useful to have - but it also does depend on the total size
-
Pedrosso
The point of my dilemma is that items are removed because the rights run out, so they're inherently and obviously not under Creative Commons
-
Pedrosso
For context, videos are up for free and not all videos are deleted
-
Pedrosso
and with "videos" I mean movies/films, series, news, documentaries, tv channels, etc., which I believe counts as useful to have
-
JAA
Pedrosso: Just so I understand what we're talking about: this is a legitimate site, right? Based on the name, I assume it's the TV broadcaster's digital platform, where they make their and licensed content available for a limited time?
-
JAA
I'll assume that you missed that message.
-
JAA
Pedrosso47: Just so I understand what we're talking about: this is a legitimate site, right? Based on the name, I assume it's the TV broadcaster's digital platform, where they make their and licensed content available for a limited time?
-
Pedrosso47
Oh yes, indeed.
-
JAA
Virtually everything we archive is copyrighted content. That's not really a factor at play here. It's how intellectual property works, for better or for worse. There are exceptions in many jurisdictions that free you from having to follow copyright restrictions when it's done for preservation purposes, which would probably apply here.
-
Pedrosso47
That's very nice to know, however whenever I try to look up information on the Internet Archive they seem adamant about not posting non-Creative Commons content. Though I may have gotten the wrong impression
-
pokechu22
Where'd you see that?
-
JAA
They probably say something along those lines to discourage people from uploading stuff that's already widely shared and won't get lost anytime soon (e.g. latest Hollywood productions). There's likely also a 'we have to say that so we don't get in trouble' angle to it. Nevertheless, IA does have the legal right to store such content. They might not be able to make it publicly available until the
-
JAA
copyright expires in a few hundred years.
-
JAA
So for an individual uploader, that's the policy they probably want, more or less.
-
Pedrosso
Great, great.
-
thuban
yeah, in practice ia 'darks' items (makes them inaccessible) in response to dmca claims; while accumulating a lot of reports or flagrantly pirating popular content can get you b&, they're pretty relaxed about good-faith uploads. if it's niche or abandoned enough not to get reported in the first place, it's basically fine
-
JAA
That doesn't mean they might not be interested in something like this. It'd be all about size and logistics. How much data is it, and do they just need to provide storage for it or does it involve them doing work?
-
JAA
Talking to them is important for things like this. Either directly or through arkiver, for example. If they want to take the data, and they already know what this is about, future takedowns etc. won't be as problematic.
-
JAA
Archiving these official platforms by major broadcasters has been on my wishlist for a while. It's a lot of work though, especially at scale (i.e. many countries etc.).
-
Pedrosso
I see ~~But I'm shy~~ As for the major broadcasters though; svt.se is a "parent" website with loads of news articles all over the country. I'd believe it's quite large
-
Pedrosso
-
JAA
We probably archive a fair bit of that through #//. These audio and video platforms can virtually never be archived properly like that though and need special stuff.
-
Pedrosso
does #// get that through outlinks or are you saying a lot of it is manually added?
-
JAA
There are things we grab regularly. At least one of those lists is news outlets sourced from Wikidata. I'd expect svt.se to be there, though I didn't check.
-
JAA
For those, we regularly grab the homepage and links from it, or something along those lines.
-
pokechu22
-
Pedrosso
How do you search on IA for URLs in a domain archived by WikiTeam?
-
JAA
pokechu22: One is the company, the other is their website. But also, yes, naturally it's in Wikidata, but I'm not sure whether it made it into the list of news outlets since that was filtered by probably the 'instance of' value and I don't remember which possible values were accepted there.
-
Pedrosso
would that be an extensive list of outlinks or simply a selection?
-
Pedrosso
assuming it is in the list of news outlets
-
JAA
It's in 43200_wikidata_Q11033_mass-media.wikidata.txt
-
JAA
Which should mean it gets grabbed every 12 hours.
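
The arithmetic behind that, as a small sketch. It assumes the leading number in the list filename is the re-grab interval in seconds, which is inferred from the message above rather than from any documentation; `interval_hours` is a made-up helper name:

```python
def interval_hours(list_filename):
    """Read the leading number of a list filename as seconds, return hours."""
    # Assumption: the prefix before the first underscore is an interval in seconds.
    seconds = int(list_filename.split("_", 1)[0])
    return seconds / 3600


print(interval_hours("43200_wikidata_Q11033_mass-media.wikidata.txt"))  # 12.0
```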
-
JAA
But the GitHub repo is outdated, so...
-
JAA
Pedrosso: The idea is that we fetch the homepage every N hours and then queue back any links found on it. If they were already captured, that gets filtered out. New links make it through and get archived.
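
A minimal sketch of that "fetch, extract links, drop what was already captured" loop. It is not the actual code behind the channel's tooling; `LinkCollector`, `new_links`, and the example.org URLs are hypothetical, and the homepage HTML is passed in as a string so the example runs without network access:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))


def new_links(homepage_html, base_url, already_captured):
    """Return links on the homepage that have not been captured yet.

    already_captured is a set that is updated in place, so the next
    run only lets through links that appeared since this one.
    """
    collector = LinkCollector(base_url)
    collector.feed(homepage_html)
    fresh = []
    for url in collector.links:
        if url not in already_captured:
            already_captured.add(url)
            fresh.append(url)
    return fresh


# Two simulated runs over a homepage that gains one article in between.
captured = set()
run1 = new_links('<a href="/a">A</a>', "https://example.org/", captured)
run2 = new_links('<a href="/a">A</a><a href="/b">B</a>', "https://example.org/", captured)
print(run1)  # ['https://example.org/a']
print(run2)  # ['https://example.org/b']
```

In a real setup the set of captured URLs would have to persist between runs, and the scheduler would call this every N hours per homepage.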
-
Pedrosso
Ahh, I get the concept
-
Pedrosso
because of frontpage stuff
-
JAA
(Also, bringing up that missing stuff in #// directly.)
-
Pedrosso
(thx for the note)
-
Pedrosso
A list of big websites that I have been debating on sharing here. I suppose even if they're too big & not useful enough to archive there's no harm in sharing
transfer.archivete.am/13l4Ga/list.txt
-
pabs
tcl.tk for eg :)
-
pokechu22
There's an archivebot job for
legislation.gov.uk but it turns out the UK has a lot of law (and also that site's banned us as of a bit under a month ago :|)
-
pabs
perhaps needs a distributed project?
-
h2ibot
Bzc6p edited Fextralife (+0, fix banner link):
wiki.archiveteam.org/?diff=51134&oldid=51129
-
arkiver
hi
-
arkiver
so google is doing stuff
-
arkiver
pabs: yeah we perform discovery while archiving Blogger
-
h2ibot
0KepOnline edited Spore (+40, Add OLDEST view type):
wiki.archiveteam.org/?diff=51135&oldid=51112
-
mossssss
wait sorry - it disconnected (i think my internet is just bad lol), arkiver what is google doing?
-
Pedrosso
wiki.archiveteam.org/index.php/Freq…/archived,for%20hosting%20archives! as per this, if I use the given tools to create archives of svtplay.se would the process then be to have someone here review the files' integrity? Is there any nice naming scheme the IA items should have (and any other preferred
-
Pedrosso
fields & metadata for IA)?
-
fireonlive
mossssss:
hackint.logs.kiska.pw/archiveteam-bs just in case you get disconnected :3
-
fireonlive
mossssss: also if you leave webirc in a background tab, browsers suspend the tab which drops the connection
-
mossssss
oohhh that would make sense. ill keep it in another window to leave it up. also thank you!!!!!
-
fireonlive
welcome =]
-
fireonlive
you can also use a desktop IRC client if you wish such as hexchat
-
fireonlive
or quassel, there's a few out there
-
mossssss
ill have to look into that! my partner is a lot more well versed in this stuff haha so ill ask them
-
mossssss
(im the archiving nerd, they are the computer stuff (inc. irc) nerd)
-
Pedrosso
fireonlive: I keep forgetting how to get to those logs lol. Still do
-
fireonlive
:)
-
» fireonlive hands Pedrosso a bookmark
-
Pedrosso
(How to do actions? "* text"?)
-
katia
/me pets a cat
-
» Pedrosso gladly receives said bookmark
-
fireonlive
:3
-
» Pedrosso tests /me
-
fireonlive
katia: taking a look under the hood eh
-
fireonlive
:p
-
katia
👀
-
Ryz
arkiver, any updates on Blogger/Google stuff?
-
h2ibot
JustAnotherArchivist edited List of websites excluded from the Wayback Machine/Partial exclusions (+907, Add Airbnb):
wiki.archiveteam.org/?diff=51136&oldid=51122
-
vokunal|m
erai-raws.info missed a payment on their ER-drive service, and lost their subscription.
-
vokunal|m
I have no idea what an ER-Drive is
-
vokunal|m
They've had issues with their paypal being banned before. I'm not sure if this is related
-
thuban
Pedrosso: your question is a little unclear to me. by "us[ing] the given tools to create archives of svtplay.se", do you mean using -dl scripts to get the video files, or using warc tools to create warcs?
-
thuban
(the latter would be difficult, because most warc tools won't work well with such a js-heavy site without substantial custom scripting. (also, even a perfect capture might or might not play back correctly in the wayback machine))
-
thuban
in either case, no, there isn't a process to "review the files' integrity"; there's no technical mechanism to do that (tls isn't designed that way), so the internet archive basically operates on trust. archiveteam, aiui, no longer adopts third-party data into the archiveteam collection--that faq entry is outdated and should be changed.
-
thuban
what JAA said earlier is right; if you want to do this at scale you should consider talking to ia about it first
-
thuban
that said, for general information about metadata consult
archive.org/developers/metadata-schema/index.html and/or
web.archive.org/web/20221001171424/…docs/api/metadata-schema/index.html (latter has file-level metadata documentation; i have no idea why it was removed)
-
Pedrosso
thuban: The answer to your first question is archiveteam's "grab-site" tool. As the -dl scripts would require some scripting to get working within a web format I'd imagine
-
thuban
yeah, i would be _very_ surprised if that worked
-
Pedrosso
I didn't mean a technical mechanism specifically, just any mechanism, technical or otherwise. Sad to know there are none adopted anymore but I suppose it may be better to go straight to the top ~~still shy about that tho~~. It's a little annoying that the wiki is out of date with such things, but still nice to have the info. Thanks for the
-
Pedrosso
metadata-related links
-
thuban
sorry about that! i'll update the page if an op confirms the current policy.
-
h2ibot
JustAnotherArchivist edited List of websites excluded from the Wayback Machine/Partial exclusions (+875, More Airbnb):
wiki.archiveteam.org/?diff=51139&oldid=51136
-
that_lurker
-
that_lurker
Could maybe be a good idea to grab
nuscalepower.com/en
-
vokunal|m
vokunal: Source for my above message earlier
erai-raws.info/news/er-drive-and-hevc