-
eightthree
JAA:
github.com/internetarchive/brozzler this brozzler? in python, is it quick enough? anything in rust or go or ideally a memory-safe and type-safe language, works yet? I know none of these are going to be perfect reproductions of a js heavy site, as mentioned in your comment below the one I replied to...
-
eightthree
-
JAA
eightthree: I'm not aware of any software written in Go or Rust that produces WARCs and has been verified to work correctly. And I have no experience with running brozzler.
-
fireonlive
-
JAA
I wouldn't trust it anyway until verified. Lots of software writes incorrect WARCs, and most HTTP libraries don't make it easy to write correct ones since they usually don't expose the low-level byte stream.
-
JAA
So unless you do the I/O yourself and use a sans-I/O parser, it's more likely than not going to be wrong.
-
pabs
are there tools for validating warc files are spec-conformant and not weird in other ways?
-
-
pabs
(added a link on the wikipedia WARC page to the AT WARC ecosystem page)
-
JAA
Not that I'm aware of. Someone was working on one in the context of warcio several years ago, but I don't think that ever landed. I've been working on my own, but not ready yet.
-
pabs
-
OrIdow6
JAA: Doesn't the IA have a crawler in Go?
-
JAA
OrIdow6: Hmm right, Zeno.
-
pabs
"After 16 years online, Feedbooks will soon close down."
feedbooks.com
-
JAA
Yeah, it's been running through AB since late March, but that won't get it done.
-
pabs
ah
-
JAA
Various filters etc. let the queue explode. One job had to be aborted already.
-
fireonlive
i read that as facebook and got a flash of excitement
-
fireonlive
:[
-
Vokun
The amount of family photos that would disappear from the planet when facebook shuts down. Woah
-
Vokun
It'd be interesting if they decided to sell all their hardware though. Imagine how cheap used servers would start going for
-
Vokun
We need like a solid few months without any emergencies so that there's time to actually get #Y up and running
-
Vokun
A few months without an emergency?
-
-
arkiver
fireonlive: hahahaha, now that would be something!
-
fireonlive
xD for sure!
-
arkiver
JAA: do we need a project for feedbooks?
-
h2ibot
Bear edited List of websites excluded from the Wayback Machine/Partial exclusions/Twitter accounts (+37, twitter.com/TheEuropeanMan1 - now he is…):
wiki.archiveteam.org/?diff=52215&oldid=52035
-
h2ibot
Bear edited List of websites excluded from the Wayback Machine (+363, More URLs that are part of the TRS.com empire.):
wiki.archiveteam.org/?diff=52216&oldid=52204
-
h2ibot
Bear uploaded File:Abload - upload form.png ([[Abload]] before they disabled uploading.):
wiki.archiveteam.org/?title=File%3AAbload%20-%20upload%20form.png
-
h2ibot
-
JAA
arkiver: Good question, not sure. It'd just be a catalogue of books they offer, I think. The actual interesting part is behind a login wall and would require automating loaning and stuff.
-
ScenarioPlanet
What are the main conditions for getting voiced in several AT channels that are used to operate archival bots (example: #archivebot / #wikibot)?
-
pokechu22
The main one is understanding how to operate the bot (including things like ignores and noticing when a site's gotten mad at us).
-
pokechu22
You're not currently in #wikibot but I'd say that one is easier to operate as there's only a few commands
-
ScenarioPlanet
That doesn't seem to be hard to understand (especially in #wikibot case), but some details like pipeline operations are kinda off-putting for me, maybe because they don't have any public documentation
-
pokechu22
Yeah :/
-
ScenarioPlanet
I mean things like pipeline notes (which is cloudflared or not, which is closer to the server that holds the website being archived, local censorships & more)
-
eightthree
JAA: what about in typescript using node.js ( or perhaps in vue or other typescript-written tools/langs/frameworks)
-
eightthree
-
JAA
eightthree: Stay away from anything webrecorder until proven it's not outputting rubbish.
-
eightthree
JAA: webrecorder/pywb #294 I noticed you linked to this, but don't know how relevant it is, given it's one of the non-ts projects of theirs...
-
JAA
At least two of their tools do not produce valid WARCs. They have known this for years and made no attempt to fix it that I'm aware of.
-
JAA
I have no reason to trust any of their tooling that produces WARCs.
-
eightthree
ArchiveTeam/ArchiveBot #70 otherwise, this is still the open issue on using webrecorder...
-
eightthree
with hardly anything said...
-
JAA
I never saw this issue before, and it predates my presence here.
-
JAA
I guess the problems weren't known at the time.
-
eightthree
damn, how do i get proof? by ...not staying away and trying it (or seeing others comment on reddit, github etc)?
-
JAA
You produce WARCs with them and verify that they are compliant. This requires intimate knowledge of the WARC and HTTP specs.
-
JAA
Or you look at the code and immediately see why they can't possibly be compliant.
-
JAA
E.g. their browser extension thing can't ever work because the browser doesn't make the necessary data available to extensions.
-
JAA
ArchiveWeb.Page
-
nicolas17
JAA: what would the browser need to expose, the raw network data without TLS?
-
JAA
nicolas17: Yes, headers and transfer encoding as sent by the server. You only get a parsed representation of the former (losing capitalisation, whitespace, order) and TE is stripped.
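-
Editor's note
A minimal sketch of the point above, not taken from any of the tools discussed: the response bytes and the helper names `parse_like_an_api` and `reserialise` are made up for illustration. It mimics what a parsed, API-style view of a response keeps (normalised header names and values, decoded body) and shows that the original bytes cannot be rebuilt from that view.

```python
# Hypothetical raw HTTP/1.1 response as a server might send it.
RAW = (
    b"HTTP/1.1 200 OK\r\n"
    b"content-TYPE:text/plain\r\n"          # odd capitalisation, no space
    b"X-Example:   padded value  \r\n"      # extra whitespace
    b"Transfer-Encoding: chunked\r\n"
    b"\r\n"
    b"5\r\nhello\r\n0\r\n\r\n"              # chunked body
)


def parse_like_an_api(raw: bytes):
    """Return (status line, header dict, decoded body), discarding wire details."""
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line, *header_lines = head.split(b"\r\n")
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(b":")
        # Capitalisation and surrounding whitespace are lost here;
        # a dict also does not preserve duplicate headers.
        headers[name.strip().lower().decode()] = value.strip().decode()
    # Decode the chunked transfer encoding, then drop the header,
    # which is roughly what the chat describes as "TE is stripped".
    decoded, rest = b"", body
    while True:
        size_line, _, rest = rest.partition(b"\r\n")
        size = int(size_line, 16)
        if size == 0:
            break
        decoded += rest[:size]
        rest = rest[size + 2:]  # skip chunk data and its trailing CRLF
    headers.pop("transfer-encoding", None)
    return status_line.decode(), headers, decoded


def reserialise(status: str, headers: dict, body: bytes) -> bytes:
    """Best-effort rebuild from the parsed view; it cannot match the wire bytes."""
    lines = [status] + [f"{k}: {v}" for k, v in sorted(headers.items())]
    return "\r\n".join(lines).encode() + b"\r\n\r\n" + body


if __name__ == "__main__":
    status, headers, body = parse_like_an_api(RAW)
    print(headers, body)
    print(reserialise(status, headers, body) == RAW)
```

The rebuilt bytes differ from `RAW` in header spelling, spacing, order and body framing, which is the information a byte-exact capture would need.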
-
nicolas17
it seems to me that would immediately hit the problem of WARC not supporting HTTP2 :P
-
JAA
Yes, that as well.
-
JAA
You can only write HTTP/1.1 to WARC. Technically, even 0.9 and 1.0 are incompatible.
-
JAA
1.0 *might* work with a bit of generous interpretation. 0.9 definitely doesn't.
-
nicolas17
and translating HTTP2 to HTTP1.1 would likely be frowned upon
-
nicolas17
so now you have to disable HTTP2 browser-wide
-
JAA
I've read somewhere that that's what webrecorder do, and yes, that's also bad.
-
JAA
Or really, worse, because it even misrepresents which HTTP version was used.
-
eightthree
btw, i found this
github.com/archivetheweb/archiver in rust, but no new commits since 1 year, and v0.3 , it does focus on warc 1.1 though
-
eightthree
JAA: what other "places" than here might have reliable enough experts (in warc i guess..or maybe also wacz?) that I could just ask, (and to avoid asking in a specific tool's chatroom/forum, as they are more likely biased).
-
eightthree
I found these 2 awesome lists
github.com/ruarxive/awesome-digital-preservation github.com/iipc/awesome-web-archiving, but the first noted webrecorder as high-fidelity, so I don't know if those listmakers or if any of the projects mentioned are reliable enough by your standards...
-
JAA
eightthree: I'm not aware of any. Even my discussions in the IIPC about this (which is the organisation where the WARC specification is written) were not entirely fruitful. I should revive them though.
-
JAA
It seems that very few people care about spec compliance, which is wild given this data is supposed to survive decades or more.
-
JAA
So far, my only rough rule is that if a software was written by IA, it's probably doing it correctly.
-
eightthree
so I searched for "better than warc" and stumbled upon someone saying HAR is better than warc, and capturing a HAR of the current page is implemented in the F12 devtools I believe... Do you have any link that I can read to see why warc is best of all the archiving formats? When the IIPC itself isn't reliable...
-
JAA
IIPC is reliable, and there are people there that care about accuracy, just many don't.
-
eightthree
or maybe roughly tell me what to google if itll take longer to find link
-
JAA
HAR doesn't preserve the exact HTTP traffic either, just a parsed version of it.
-
JAA
So roughly the same as what webrecorder's tooling produces.
-
JAA
You can always transform a (correct) WARC to a HAR, but the opposite is not possible.
-
JAA
HAR is also awkward to use for anything larger than a single page or maybe a few. There's no concept of compression, and it's a single large JSON object, so appending is hard. I don't recall how binary data is stored, but I believe that's a mess, too. (Maybe base64?)
-
eightthree
JAA: so technically one could make a proper "extension" but compile it right into a browser, if ever a firefox or chromium derivative browser would be willing to do this? I've been annoyed by how singlefile is indeed not always a proper representation and TIL it's not entirely their fault...
-
JAA
Yep, it is base64 indeed.
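-
Editor's note
A rough sketch of the two HAR drawbacks mentioned above, using only Python's standard library. The field names follow the HAR 1.2 layout as far as the editor recalls it, the entry is heavily trimmed and not a complete HAR entry, and `empty_har`, `har_entry`, `append_entry`, the creator name and the timestamp are all placeholders for illustration.

```python
import base64
import json


def empty_har() -> dict:
    """A HAR document is one JSON object with a list of entries inside."""
    return {
        "log": {
            "version": "1.2",
            "creator": {"name": "example-sketch", "version": "0"},  # placeholder
            "entries": [],
        }
    }


def har_entry(url: str, mime: str, body: bytes) -> dict:
    """Trimmed-down entry; binary bodies are stored as base64 text."""
    return {
        "startedDateTime": "1970-01-01T00:00:00.000Z",  # placeholder timestamp
        "request": {"method": "GET", "url": url, "httpVersion": "HTTP/1.1"},
        "response": {
            "status": 200,
            "httpVersion": "HTTP/1.1",
            "content": {
                "size": len(body),
                "mimeType": mime,
                "text": base64.b64encode(body).decode("ascii"),
                "encoding": "base64",
            },
        },
    }


def append_entry(har_text: str, entry: dict) -> str:
    """Appending means parsing and re-serialising the whole document."""
    har = json.loads(har_text)
    har["log"]["entries"].append(entry)
    return json.dumps(har)


if __name__ == "__main__":
    body = bytes(range(256)) * 3  # 768 bytes of binary data
    entry = har_entry("https://example.org/a.bin", "application/octet-stream", body)
    print(len(body), len(entry["response"]["content"]["text"]))
    har_text = append_entry(json.dumps(empty_har()), entry)
    print(len(json.loads(har_text)["log"]["entries"]))
```

Base64 inflates the body by about a third before any JSON overhead, and every append touches the entire file, whereas a WARC file grows by concatenating records.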
-
JAA
Sure, if you modified the browser, it could definitely be done.
-
JAA
But that's obviously not an easy task.
-
JAA
And it's probably why brozzler uses an MITM proxy instead.
-
nicolas17
in theory you could use SSLKEYLOGFILE and capture the SSL'd traffic and decrypt it
-
nicolas17
but that still has the problem of HTTP2
-
nicolas17
would need to turn http2/3 off
-
JAA
Yeah
-
nicolas17
or do MITM so you can tamper with the list of supported protocols
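-
Editor's note
A small sketch of the two ideas above (key logging and keeping HTTP/2 out of the negotiation), shown with Python's own TLS client rather than a browser or a MITM proxy. It assumes Python 3.8+ built against an OpenSSL version that supports key logging; the function name `http1_only_context` and the file name are made up. A browser would instead be started with the `SSLKEYLOGFILE` environment variable set.

```python
import os
import ssl
import tempfile


def http1_only_context(keylog_path: str) -> ssl.SSLContext:
    """Client TLS context that offers only HTTP/1.1 and logs session keys."""
    ctx = ssl.create_default_context()
    # Offer only "http/1.1" via ALPN, so a server cannot select "h2".
    ctx.set_alpn_protocols(["http/1.1"])
    # Write TLS secrets in the key log format that tools such as
    # Wireshark can read to decrypt a packet capture of the session.
    ctx.keylog_filename = keylog_path
    return ctx


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "tls-keys.log")  # hypothetical file name
    ctx = http1_only_context(path)
    print(ctx.keylog_filename)
```

This only controls what the client offers; it does not address the fidelity concern raised in the chat about changing which HTTP version a site would normally be served over.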
-
eightthree
JAA: tor browser baked in noscript, the grapheneos browser also incorporated...a content filter, not sure if it's noscript. I have noticed first hand many times how extensions can stop working when ram use is too high or something, and I guess that's why they didn't want to compromise on security.
-
nicolas17
I have also seen cases where Wireshark/dumpcap loses a packet even though the recipient ack'd it so it wasn't lost in the network, and then the entire stream is fucked
-
lea
nicolas17: why is http2/3 a problem here? SSLKEYLOGFILE should work for them as well
-
DJ
Hey, has anyone checked out github.com/aliparlakci/bulk-downloader-for-reddit or github.com/RedditDownloader/redditdownloader.github.io? I think archivebot is limited in terms of crawling subreddits so I was wondering if anyone was aware of these or if they work.
-
nicolas17
lea: the WARC format has no way to represent captured HTTP2 requests/responses
-
nicolas17
and if you synthesize HTTP1.1-looking syntax from the HTTP2 data, that's not a pristine capture
-
lea
nicolas17: curl has a way of representing http2/3 responses in a format similar to http1.1, could that not be used?
-
lea
ah
-
nicolas17
<JAA> Or really, worse, because it even misrepresents which HTTP version was used.
-
lea
can it not say HTTP/2.0 200 OK in the header instead of HTTP/1.1 200 OK?
-
DJ
Nvm the first one has the 1000 posts API limit, I don't know if the second one does but probably.
-
lea
or is that the thing that is not supported?
-
JAA
lea: WARC captures exactly what was sent over the network (at the application layer). And yes, the spec only supports HTTP/1.1 specifically.
-
» fireonlive wonders if IIPC(?) will update it soonish
-
eightthree
fireonlive: Am I too much of a conspiracy theorist for thinking that it might be intentional that proper copies of websites are so hard to do, spec not updated? Either to fingerprint copycat websites (for google and others to programmatically punish them in their algo, or not show them at all), but also to ensure that MITMs can be noticed?
-
JAA
Yes, you are.
-
eightthree
like google has 20bill usd to send to Apple...I think each year, but they can't keep any archiving standard up to date with any of the other standards they spend heavily to influence and update?
-
JAA
The problem is that there's probably less than a dozen people worldwide who have worked on the WARC spec occasionally over the past two decades.
-
JAA
It's very much a niche, and there's no funding for doing these things.
-
JAA
Google et al. don't care about WARC or even archival.
-
JAA
It's so much of a niche that I was apparently the first person to try to implement a WARC parser based on just the spec, since I ran into several inconsistencies that *have* to come up on every implementation.
-
eightthree
JAA: you seem to not see that the funding money amounts could exist, but don't. ~What does google and other search engines use to show an archived copy of a page?~ Ok, forget that, those arent js copies for secu reasons likely, but say, google translate shows a page with js, what does that use? Why don't they care to make it as close to the original as possible? The companies have the money, they choose not to spend on this....
-
eightthree
... There could be 100s or 1000s working on this long term if they wanted it.
-
eightthree
govts and library/archiving orgs likewise have some budget...
-
JAA
They're companies. They care about making money. Spending effort on exact preservation when it doesn't matter to their service isn't something they will do.
-
JAA
At best, they care about storing an HTTP-equivalent copy of the data, i.e. with headers normalised, transfer encoding stripped, etc., since it's much easier to work with that.
-
eightthree
inconsistencies that _have_ to come up on every implementation.
-
eightthree
what do you mean by this? the other warc parsers always have inconsistencies, but why have?
-
JAA
Not other parsers have inconsistencies, the spec has inconsistencies, and anyone implementing a parser based on the spec would have to run into them.
-
JAA
Since they weren't reported previously, apparently nobody did that.
-
JAA
-
JAA
#71 and #72 in particular are unavoidable if you look at the grammar.
-
eightthree
JAA: maybe this is why , as you were complaining earlier, most people don't care about coding to spec? If the spec has issues...
-
JAA
If the spec has issues and you care about preservation, you raise the issue and get the spec fixed.
-
eightthree
have you seen anything about funding for the people contributing, from iipc or other? who/how many, if anyone is paid to contribute?
-
JAA
So the IIPC is a consortium, and many of the people there are actually employed at various institutions doing digital preservation stuff. That would include some people from IA, from the British Library, etc. I imagine that the part of their employment dedicated to contributing to the IIPC is tiny to nonexistent though.
-
JAA
IIPC does fund some projects in a narrow scope. There's an annual call for proposals.
-
JAA
Or there's supposed to be, anyway. I think it hasn't happened in a couple years now, for a reason they haven't communicated publicly.
-
JAA
(I have considered submitting a proposal in this area before.)
-
eightthree
Call for proposals is now closed
-
eightthree
Proposals due: 15 September 2021
-
eightthree
Projects start: 1 January 2022
-
eightthree
Final report due: by 31 December 2022
-
eightthree
-
JAA
Aye
-
eightthree
hmm so if funding dried up (from only a trickle) at iipc, perhaps the solution is to improve HAR since it's so much more widely deployed? the link on the wikip article links to a draft,
-
eightthree
-
eightthree
with fat warning
-
eightthree
> _DO NOT USE_
-
eightthree
> This document was never published by the W3C Web Performance Working Group and has been abandoned.
-
eightthree
but the document lists itself as the latest,
-
eightthree
> Historical Draft August 14, 2012
-
eightthree
> This version:
-
eightthree
-
eightthree
> Latest version:
-
eightthree
-
eightthree
and searching
w3.org/TR/?filter-tr-name=har shows nothing...
-
eightthree
y was a draft with fat warning so widely deployed over WARC ???
-
eightthree
*deployed instead of warc...
-
eightthree
and do browsers etc all implement their own modified way of capturing HAR file from the current page in the browser? Am I going to have to go on a long hunt on each git/mailing list of each of these
-
eightthree
> The HAR format is supported by various software, including:
-
eightthree
> Charles Proxy
-
eightthree
> Fiddler
-
eightthree
> Firebug
-
eightthree
> Firefox
-
eightthree
> Fluxzy Desktop
-
eightthree
> Google Chrome
-
eightthree
> Internet Explorer 9
-
eightthree
> Microsoft Edge
-
eightthree
> Mitmproxy
-
eightthree
> Postman
-
eightthree
> OWASP ZAP
-
eightthree
> Safari
-
eightthree
to find out how they implement it?
-
JAA
Likely, because I doubt there's any documentation on what they do in detail.
-
eightthree
im using matrix btw and its showing me the scissors icon, so tell me if my message is hard to read...maybe I'll pastebin it...
-
JAA
HAR being JSON is both great and horrible.
-
JAA
Well, you pasted like a dozen messages, yeah.
-
eightthree
HTTP/2, published in 2015 - does anyone implement HAR beyond 1.1?
-
eightthree
-
JAA
Keep AI nonsense out of here.
-
nicolas17
eightthree: where do you think it got that information from?
-
nicolas17
if there isn't any good information on HAR on websites, then there's nowhere the AI could have learned it from and it's just guessing/hallucinating
-
nicolas17
if there is, then look at those websites instead :P
-
eightthree
JAA: sorry
-
JAA
To answer the question, at least Firefox can put HTTP/2 into HAR. Probably HTTP/3 and WebSocket, too.
-
nicolas17
by putting the decoded normalized headers in there?
-
JAA
Yes
-
JAA
It's a massive JSON object.
-
eightthree
searchfox.org/mozilla-central/search?q=har&path=&case=false&regexp=false - 112 results when I check the checkbox for "whole words" when I ctrl-f for har
-
eightthree
JAA: where can I find a detailed "whats missing from HAR that WARC has"
-
JAA
eightthree: By comparing the specs of HAR and WARC in detail. I doubt it's been done before.
-
eightthree
JAA: like, the github repo you linked to earlier for warc, with the 2012 draft I linked for HAR?
-
JAA
Probably? I never looked into what documentation exists on HAR. There might be stuff on MDN or in browsers' documentations, too.
-
eightthree
I guess there's no comparison with firefox's implementation yet...I'll see if I absolutely need to decipher code or if there's something in bugzilla or elsewhere...
-
eightthree
-
eightthree
-
eightthree
[Links (documentation, blog post, etc)]: 'HAR' can be linked to
softwareishard.com/blog/har-12-spec;
-
eightthree
from
bugzilla.mozilla.org/859058#c38, even though they had found the w3c link too, the last mention of what to link to officially as the spec was the above line
-
eightthree
bugzilla.mozilla.org/859058#c7 this honza guy seems to be knowledgeable enough to propose writing a draft update in case extra features were needed. Said 10 years ago though :)
-
eightthree
-
eightthree
-
eightthree
Jan Honza Odvarko, that implemented har in firefox...but in the readme there seems to be a way with user.js setting and then a browser reboot, to not need the extension to automatically save each page, at least that's how it seems...