-
JAA
pabs: FYI, I'm going to let the ArchiveBot jobs for GNOME Bugzilla finish. It might be worth contacting them about the issue history (which is just gone), the XML export (used for programmatic access; it returns the normal view now), and possibly the attachment description page (which returns the attachment itself instead), but I won't have time for that anytime soon.
-
JAA
The actual issues and attachments exist, so at least that will be covered.
-
pabs
JAA: do you have some example URLs that are broken? if so I could file an issue (also if you have a GNOME GitLab account I could CC you)
-
JAA
-
JAA
The attachment description page would be e.g.
bugzilla.gnome.org/attachment.cgi?id=94167&action=edit . This URL was captured but isn't in the WBM yet because the WARC is still sitting on the ArchiveBot pipeline.
-
JAA
I don't have a GNOME GitLab account.
-
pabs
ok, I'll take a look later
-
JAA
Cheers
-
h2ibot
Tech234a edited YouTube (+80, /* Older unlisted videos (July 2021) */ Add…):
wiki.archiveteam.org/?diff=47020&oldid=47013
-
h2ibot
Tech234a edited YouTube (+819, /* Older unlisted videos (July 2021) */ Add…):
wiki.archiveteam.org/?diff=47021&oldid=47020
-
OrIdow6
Is there a channel for Google Drive?
-
rewby
thuban: Your regex doesn't produce any results. (And I've scanned the whole dataset)
-
thuban
fuck, i left an asterisk out. '(file|image):\s*"([^"]*)",'
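[A minimal sketch of the corrected pattern in action; the sample page snippet is invented for illustration:]

```python
import re

# The corrected pattern from above; group 2 captures the quoted URL.
PATTERN = re.compile(r'(file|image):\s*"([^"]*)",')

# Hypothetical snippet in the style of the pages being scanned.
sample = '''
    file: "https://example.com/videos/clip.mp4",
    image: "https://example.com/thumbs/clip.jpg",
'''

matches = PATTERN.findall(sample)          # list of (key, url) tuples
urls = [url for _key, url in matches]
```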
-
rewby
thuban: Yeah. I figured that. Do you care about having file and image separate, or do you just want one big list?
-
thuban
separate, if that wouldn't require any effort on your part; otherwise together
-
thuban
(i _think_ we already got all the thumbnails in the regular ab run, but i need to check)
-
rewby
Cool. That's easy. My system doesn't really do capture groups so I have to do a second pass to get the urls out of the 'image: "<url>"' strings
-
rewby
It gives me a big list of regex matches per warc
-
rewby
And then I post-process from there
-
rewby
I also only process text/<whatever> and application/json entries. I don't match on image or video files, for obvious reasons
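[A rough sketch of the two-pass approach described above: filter records by content type, match whole `key: "<url>",` strings without capture groups, then pull the URLs out in a second pass. Records are modeled as plain (content-type, body) pairs here, standing in for real WARC entries:]

```python
import re

# First pass: whole-string match (no capture groups).
MATCH_RE = re.compile(r'(?:file|image):\s*"[^"]*",')
# Second pass: pull the URL out of each matched string.
URL_RE = re.compile(r'"([^"]*)"')

def extract_urls(records):
    """records: iterable of (content_type, body_text) pairs (stand-in for WARC entries)."""
    urls = []
    for content_type, body in records:
        # Only scan textual and JSON responses; skip image/video payloads.
        if not (content_type.startswith("text/") or content_type == "application/json"):
            continue
        for hit in MATCH_RE.findall(body):
            m = URL_RE.search(hit)
            if m:
                urls.append(m.group(1))
    return urls

records = [
    ("text/html", 'image: "https://example.com/a.jpg",'),
    ("application/json", 'file: "https://example.com/b.mp4",'),
    ("image/jpeg", 'image: "https://example.com/skip.jpg",'),
]
found = extract_urls(records)
```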
-
rewby
thuban: I've updated the regexes and am re-running. It looks to be obtaining urls.
-
thuban
i actually tested it this time, haha
-
rewby
Cool
-
rewby
I'll get you a sample of one file just to check it by you
-
rewby
-
rewby
This look good to you?
-
thuban
yep!
-
rewby
Cool. Still processing the rest. But I'm doing this singlethreaded because I'm lazy.
-
rewby
It's got maybe 10 minutes left
-
rewby
-
thuban
nice
-
rewby
-
thuban
rewby: for some reason i'm getting only 86 unique urls from either of those files when there should be many more.
-
thuban
-
thuban
-
rewby
Uh. Lemme check
-
thuban
idk what your plumbing looks like, but is it possible you ran one warc repeatedly instead of all the warcs? (24 warcs, 24 copies of each url i _do_ have)
-
thuban
(oh wait nvm, 25 warcs)
-
rewby
thuban: d'oh. I ran all the warcs, but I didn't concat the results properly
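[The concatenation step that went wrong might look something like this; file names and layout are hypothetical:]

```python
from pathlib import Path

def concat_unique(result_dir, pattern="*.urls.txt"):
    """Concatenate per-WARC result files and return sorted unique lines."""
    seen = set()
    for path in sorted(Path(result_dir).glob(pattern)):
        seen.update(line.strip() for line in path.read_text().splitlines() if line.strip())
    return sorted(seen)
```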
-
rewby
Lemme fix that
-
thuban
gotcha, thanks
-
rewby
-
rewby
Hm. Still not quite right I think
-
rewby
It's better but still not quite there
-
thuban
yeah... i do expect a few copies of each result (each detail page has a base url plus two possible language parameters), but that doesn't look like what's happening here
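[One way to collapse those language variants when comparing counts; the `lang` parameter name here is a guess for illustration, the real site may use something else:]

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def strip_param(url, param="lang"):
    """Drop one query parameter so language variants collapse to a single key."""
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True) if k != param]
    return urlunsplit(parts._replace(query=urlencode(query)))

urls = [
    "https://example.com/detail?id=7",
    "https://example.com/detail?id=7&lang=en",
    "https://example.com/detail?id=7&lang=fr",
]
unique = {strip_param(u) for u in urls}
```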
-
rewby
I'm double checking a few things.
-
rewby
Hmmm.
-
rewby
I wonder if we're dealing with an encoding problem
-
rewby
thuban: I'm doing another run with some tweaks that might help.
-
rewby
If you still find missing things, I'll have to go and manually dig into the warcs to see what's wrong, because that'll be a bug with my warc reader
-
thuban
i think to confirm anything missing i would have to manually download and zgrep the warcs; that other one was just a lucky spot-check
-
rewby
Fair enough
-
rewby
Just zgrepping doesn't always work
-
thuban
oh?
-
rewby
The problem is that warcs contain raw http responses, which means your content can be encoded a number of ways. It's not uncommon to get a gzipped or brotli-compressed response
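[A sketch of what decoding a stored response body per its Content-Encoding involves; gzip and deflate are stdlib, brotli needs a third-party package:]

```python
import gzip
import zlib

def decode_body(raw, content_encoding):
    """Decode an HTTP response body according to its Content-Encoding header."""
    enc = (content_encoding or "").lower()
    if enc == "gzip":
        return gzip.decompress(raw)
    if enc == "deflate":
        return zlib.decompress(raw)
    if enc == "br":
        import brotli  # third-party: pip install brotli
        return brotli.decompress(raw)
    return raw  # identity / unknown encoding: pass through as-is

# A gzipped body like this is opaque to a plain zgrep over the WARC:
body = gzip.compress(b'image: "https://example.com/t.jpg",')
```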
-
thuban
ah, yeah
-
rewby
There's a lot of screwery going on in this software to try and deal with this
-
thuban
i knew there was a reason i asked you instead of trying to do it myself ;)
-
rewby
thuban: Here's another attempt. I turned off all the "smart"ness. It should've gotten everything unless there was a decoding issue.
transfer.archivete.am/ILUoM/file_unique.txt transfer.archivete.am/xBjlu/image_unique.txt
-
thuban
yeah, that's more consistent with what i was expecting
-
thuban
huh... so it looks like archivebot successfully got everything (except a couple of m3u8s) in the original run. i wonder why playback doesn't work in the wbm?
-
rewby
Are there any POST requests involved?
-
rewby
Or maybe javascript that's unhappy?
-
thuban
lol, the only requests that fail are jwplayer's jwpsrv.js and sharing.js, which have somehow been double-rewritten: e.g.
web.archive.org/web/20210728093807/…ps://ssl.p.jwpcdn.com/6/8/jwpsrv.js .
-
thuban
(single-rewritten does exist in the archive and presumably would work.)
-
rewby
Huh. Interesting quirk
-
thuban
-
thuban
in the 'c.repo' function (which returns the base path jwplayer uses to fetch some of its assets), the url gets rewritten once when the string literal containing the original cdn's url is used, then again when the generated url string is munged for ssl
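[A toy model of the double-rewrite (not the WBM's actual rewriter): a naive pass that prefixes any absolute URL will stack the prefix when it runs a second time over an already-rewritten string:]

```python
WBM_PREFIX = "https://web.archive.org/web/20210728093807/"

def rewrite(url):
    """Naive rewriter: prefixes any absolute URL, without noticing
    that the URL may already have been rewritten."""
    if url.startswith("http"):
        return WBM_PREFIX + url
    return url

cdn = "https://ssl.p.jwpcdn.com/6/8/jwpsrv.js"
once = rewrite(cdn)    # rewritten when the string literal is used
twice = rewrite(once)  # rewritten again after the ssl-munging step
```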
-
thuban
i guess there's no principled way to avoid this...
-
wizards
has anyone archived drivers, manuals, sdks and the like from canon's website? just figured i should ask before trying to archive it myself
-
AK
Got a link to the site and we can check?
-
wizards
-
wizards
-
AK
Hmm, I could give it a go in AB and see how it goes
-
wizards
might work for things like the reference photos, but the section that lists downloads uses js and probably would need manual work to scrape
-
wizards
i was writing a lua script to do exactly that
-
AK
Urgh, same for the manuals, it's all js
-
AK
It's all running through AB now anyway so we at least grab what we can
-
thuban
if you write your script to get the urls for the downloads, we can run that list through archivebot, too, so that at least the files will be in the wayback machine
-
AK
^Forgot about that
-
wizards
in my list of urls, should i include the original urls or the ones they redirect to? since all of them are redirects
-
wizards
-
AK
Original means we'll archive the redirect too
-
thuban
archivebot can follow redirects, so it's probably best to use the originals (since that way both will point to the file)
-
thuban
^ what he said
-
JAA
Uh
-
JAA
Generally yes, but it depends.
-
JAA
If all of the downloads behave like the above, i.e. the actual downloads are on a different host, it's fine.
-
h2ibot
OrIdow6 edited Framasoft (+79, Correction on discovery source):
wiki.archiveteam.org/?diff=47022&oldid=47014