-
lennier1
-
TheTechRobo
lennier1: So yeah, then those environment variables will work.
-
TheTechRobo
Keep in mind that disabling SSL verification, like CURL_CA_BUNDLE="" does (required for warcprox, unless you modify the trusted CA store instead), will spam your terminal with "urllib3: Making insecure connection to localhost. Blablablabla."
-
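(For illustration, a minimal Python sketch of that setup: requests honours CURL_CA_BUNDLE, and the urllib3 warning can be silenced explicitly. The proxy address assumes warcprox's default port, which is an assumption here.)
```python
import os

import requests
import urllib3

# An empty CURL_CA_BUNDLE disables certificate verification in requests,
# which is needed because warcprox re-signs TLS traffic on the fly.
os.environ["CURL_CA_BUNDLE"] = ""

# Silence the insecure-connection warnings mentioned above.
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

# Assumption: warcprox is listening on its default port, 8000.
proxies = {"http": "http://localhost:8000", "https": "http://localhost:8000"}

resp = requests.get("https://example.com/", proxies=proxies)
print(resp.status_code)
```
-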
lennier1
The bottleneck seems to be the request for similar apps. I get timed out for a while if I don't delay about a second between calls to get_similar_app_ids_for_app.
-
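(A sketch of that throttling in Python; `get_similar_app_ids_for_app` is the call from the discussion, assumed to come from whatever scraper library is in use here.)
```python
import time

# Assumption: this is wherever get_similar_app_ids_for_app actually lives.
from scraper import get_similar_app_ids_for_app


def crawl_similar(seed_ids, delay=1.0):
    """Walk the similar-apps graph, sleeping between requests."""
    seen = set(seed_ids)
    queue = list(seed_ids)
    while queue:
        app_id = queue.pop()
        for similar in get_similar_app_ids_for_app(app_id):
            if similar not in seen:
                seen.add(similar)
                queue.append(similar)
        time.sleep(delay)  # ~1 s between calls avoids the timeout
    return seen
```
-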
lennier1
Alternatively, you could brute-force it, but only about one in a thousand IDs in the used range is valid. Unless there's some way to predict which ones are used.
-
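(A brute-force probe could look something like this sketch, using the public iTunes lookup endpoint, which reports resultCount 0 for unused IDs; the ID range below is an arbitrary placeholder.)
```python
import time

import requests


def is_valid_app_id(app_id):
    # The lookup endpoint returns resultCount == 0 for unused IDs.
    resp = requests.get("https://itunes.apple.com/lookup",
                        params={"id": app_id}, timeout=30)
    return resp.json().get("resultCount", 0) > 0


# Arbitrary placeholder slice of the ID space; expect roughly one hit
# per thousand IDs, so this is very slow going.
for app_id in range(1500000000, 1500001000):
    if is_valid_app_id(app_id):
        print(app_id)
    time.sleep(1)  # the same rate-limiting concerns apply here
```
-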
lennier1
How do I use that sitemap? I downloaded apps.apple.com/sitemaps_apps_85_1_20220509.xml.gz and extracted the XML file. I figured it would be readable as text, but it doesn't seem to be. Edge says it has an encoding error.
-
thuban
lennier1: file looks fine to me--extracts without issue, result is pure ascii xml.
-
thuban
is yours 23528629 bytes unzipped? what did you use to extract it?
-
lennier1
That's odd. Extracted with 7zip. sitemaps_apps_85_1_20220509.xml is 414,082 bytes.
-
lennier1
sitemaps_apps_85_1_20220509.xml.gz is 335,329 bytes. SHA1 92B44E456A3D05B89BB47191DD46E3C208611DC0
-
Jake
a9d2324c7bf34222824ee0f45200362f5811f008 *sitemaps_apps_85_1_20220509.xml
-
Jake
ce80bab6dd6b5af7a6213db77b5abe9bd12a8b18 *sitemaps_apps_85_1_20220509.xml.gz
-
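(For anyone following along, a quick way to reproduce those checksums in Python, if you don't have `sha1sum` handy.)
```python
import hashlib


def sha1sum(path, bufsize=1 << 20):
    # Stream the file so large sitemaps don't have to fit in memory.
    h = hashlib.sha1()
    with open(path, "rb") as f:
        while chunk := f.read(bufsize):
            h.update(chunk)
    return h.hexdigest()


for name in ("sitemaps_apps_85_1_20220509.xml.gz",
             "sitemaps_apps_85_1_20220509.xml"):
    print(sha1sum(name), name)
```
-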
thuban
ditto Jake
-
thuban
the gz file (_not_ the extracted xml) is 414082 bytes
-
thuban
i suggest double-checking whether you've invoked 7zip correctly / just using gunzip
-
thuban
(`7z e sitemaps_apps_85_1_20220509.xml.gz` works for me too)
-
lennier1
Wait, what was the size of the file you originally downloaded? I started with 335329 bytes and AFTER using 7zip it's 414082 bytes. Is that file still in a compressed format?
-
JAA
-
thuban
& maybe see what `file` says or try extracting it again. (is it possible you downloaded it with some tool that got confused by gzip transfer-encoding?)
-
ThreeHM
I end up with 334,939 bytes if I gzip the .gz file again, so it might be compressed twice
-
lennier1
Yes, it's compressed twice.
-
JAA
There is no TE or whatever, it sends it as a plain application/octet-stream.
-
lennier1
If I change the .xml extension of the extracted file to .gz, I can get the final .xml file (with no extension).
-
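(That rename trick works because the file is gzip inside gzip. A sketch that handles any number of layers by checking the gzip magic bytes, 1f 8b:)
```python
import gzip

with open("sitemaps_apps_85_1_20220509.xml.gz", "rb") as f:
    data = f.read()

# Keep unwrapping while the payload still looks like gzip; the XML
# itself starts with '<', so this terminates at the right layer.
while data[:2] == b"\x1f\x8b":
    data = gzip.decompress(data)

with open("sitemaps_apps_85_1_20220509.xml", "wb") as f:
    f.write(data)
```
-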
JAA
At least when accessing it with curl. Maybe it does something weird if you use a browser?
-
thuban
er, *content-encoding
-
TheTechRobo
JAA: Nah, browser is dumb too.
-
JAA
No CE either. Just chunked TE.
-
thuban
lol, weird. what _did_ you download it with, lennier1?
-
lennier1
Firefox
-
TheTechRobo
same
-
ThreeHM
Chrome seems to handle it correctly
-
thuban
TheTechRobo: on windows? (lennier1, i assume you're on windows since you mentioned edge)
-
TheTechRobo
thuban: Firefox Developer Edition on Debian
-
TheTechRobo
I don't use Windows
-
lennier1
Windows, yes.
-
lennier1
But yeah, there are a lot of app links in that file. I wonder if the sitemap includes all publicly listed apps.
-
ThreeHM
Found a bug report for this:
bugzilla.mozilla.org/610679 - Opened 12 years ago!
-
Jake
wow
-
thuban
JAA: content-encoding is gzip with `curl --compressed`
-
thuban
(and ofc in the browser)
-
JAA
thuban: Yes, and it's Apple that recompresses it in that case.
-
JAA
I also get the 335329 file with that.
-
JAA
Which is actually the correct behaviour, I think.
-
JAA
`Accept-Encoding: gzip` asks the server to send the requested resource in gzip-compressed form, so it does that. The fact that the resource is already compressed doesn't really come into play there.
-
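(To see that behaviour concretely, a sketch with requests, which, like `curl --compressed`, advertises gzip and transparently strips the Content-Encoding layer again:)
```python
import requests

url = "https://apps.apple.com/sitemaps_apps_85_1_20220509.xml.gz"

# Identity: the server sends the stored .gz file as-is.
plain = requests.get(url, headers={"Accept-Encoding": "identity"})

# Default: requests advertises gzip, the server wraps the .gz in another
# gzip layer (Content-Encoding: gzip), and requests unwraps that layer,
# so .content is the stored .gz bytes again.
wrapped = requests.get(url)

print(len(plain.content), len(wrapped.content))  # should match: both are the stored .gz bytes
```
-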
JAA
Doesn't make it any less confusing though. And so many times CE is abused when it's actually TE compression because many clients don't even support the latter.
-
JAA
(`curl --compressed` also decompresses it again as it's supposed to.)
-
thuban
i mean, arguably apple's server is incorrect to recompress a gzip (since 'Accept-Encoding: gzip' doesn't specifically forbid the 'identity' value)
-
thuban
but yes, curl handles it correctly and ff does not
-
JAA
It's not incorrect, but it's certainly not optimal, yeah.
-
thuban
'correct' as in 'The Right Thing'
-
thuban
anyway, apparently this behavior was a workaround for some combination of bugs in apache and apache being easy to misconfigure. transfer protocols were a mistake :(
-
Jake
I feel like I've seen this brought up here before, haha.
-
JAA
That's where 90+% of HTTP's quirks come from, browsers working around server bugs instead of throwing a brick at the people running broken servers.
-
h2ibot
Usernam edited List of websites excluded from the Wayback Machine (+27):
wiki.archiveteam.org/?diff=48596&oldid=48547
-
FaraiNL
Hi. Quick question, I hope this is the right channel. I've created some WARC files using grab-site which I would like to upload to the Internet Archive. According to the FAQ I have to use the subject keyword "archiveteam" (check) and to "let us know". How do I let the team know? Here via IRC? Also, do I upload just the .warc.gz file, or also the meta-warc.gz? And how do I set the mediatype to web?
-
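(For the mediatype/subject part, a sketch with the `internetarchive` Python library: pip install internetarchive, then run `ia configure` once for credentials. The identifier and filenames below are placeholders.)
```python
from internetarchive import upload

# Metadata applied to the new item; mediatype and subject are the two
# fields the FAQ discussion above is about.
metadata = {
    "mediatype": "web",
    "subject": "archiveteam",
    "title": "grab-site WARCs of example.com",
}

upload(
    "example.com-grab-site-2022",  # placeholder item identifier
    files=["example.com.warc.gz", "example.com-meta.warc.gz"],
    metadata=metadata,
)
```
-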
Jake
Just as a quick tip: uploads from normal users aren't going into the Wayback Machine anymore. Only specific whitelisted users can upload directly and have it included.
-
Jake
Oh, they left. :(
-
Cookie
-
Cookie
I'd like to know how to find out, for any given website -- if any "whole site" archives/swipes have been done. I notice there are wiki pages with this info for many sites, so that's helpful. But I'm interested in a more automated, methodical way of finding out this information without relying on a human updating a wiki page.
-
Sanqui
Cookie, in terms of Archive Team's projects, e.g. for fanfiction.net we try to maintain a wiki page such as
wiki.archiveteam.org/index.php/FanFiction.Net
-
Cookie
That is awesome! Anyone can scrape a website and create a WARC file though, right? And they wouldn't necessarily create or update the relevant wiki page...
-
Sanqui
that's right
-
Sanqui
of course, when presented with a WARC file, it is impossible to simply determine whether it is "complete" or not
-
Sanqui
that's why it's important to have good metadata and record keeping as well
-
Cookie
Yes I have realised that after thinking about it
-
Cookie
But suppose there isn't good metadata (which is likely!)
-
Cookie
But they have uploaded it to the archive anyway
-
Sanqui
in general, archivebot jobs tend to be "complete" (unless aborted, run with aggressive ignores, hit high error rates, etc.)
-
Sanqui
-
Sanqui
so that's another place to check
-
Cookie
Ooh I hadn't seen that yet.
-
Cookie
-
Cookie
The first result looks "big" (2018). The others look "small"
-
Cookie
Can these results be related to the collections here:
archive.org/details/archiveteam_fanfiction
-
Cookie
Oh, no those say 2012
-
Sanqui
Nope, ArchiveBot gets its own collection for all crawls
-
Sanqui
in fact, the WARCs are partitioned and uploaded in separate items along with other crawls running on the same machine
-
Sanqui
also, I suspect the first job is incomplete due to a crash. Mainly because the last segment (00070) was uploaded 7 months after the previous one
-
Sanqui
it might be possible to determine what exactly went on from irc logs, but I don't have time for that atm :)
-
Cookie
Okay is this the list of archivebot collections?
archive.org/details/archivebot?&sort=-publicdate
-
Sanqui
aye
-
Cookie
What is a "GO Pack"? Is that just what every item is called? And I have to examine the metadata to find out what's in it?
-
Sanqui
I believe the names are variable based on the pipeline that uploaded them (there are over 50 machines running archivebot instances)
-
Sanqui
and yeah, that's why the archivebot viewer exists, to make it easier to search through these
-
Cookie
If I search that collection for "fanfiction.net" it doesn't find anything:
archive.org/details/archivebot?query=fanfiction.net&sort=-publicdate
-
Sanqui
yeah, the collection's not really intended for manual browsing, but the data is all there. the viewer shows you which jobs exist for a given domain, and which items contain the data from that job.
-
Cookie
-
Sanqui
that's because it's www.fanfiction.net
-
Cookie
Ohh haha. Where do I log a suggestion to remove the "www" from sites in this index? ;-)
-
Cookie
Anyway the search found it alright
-
Sanqui
github.com/ArchiveTeam/ArchiveBot/issues possibly. I'm not even sure the viewer has its own repo
-
Sanqui
there's a lot of issues with the whole setup, but we're all volunteers here, it's difficult to overhaul the way things are built already
-
Cookie
Thanks for explaining this
-
Sanqui
np
-
Cookie
-
Cookie
In order to answer the question "is it efficient for me to archive fanfiction.net again, now?"
-
Sanqui
-
Cookie
also, what about incremental backups?
-
Sanqui
if you see an "ArchiveBot" collection here, then it's done by ArchiveBot
-
Sanqui
otherwise, none of the collections are related
-
Cookie
Okay, and if it's done by ArchiveBot then that means it's "complete"?
-
Sanqui
not really, actually, because it's possible ArchiveBot saw a link *to* fanfiction.net while archiving another website, and grabbed just one page
-
Sanqui
the only thing by archivebot that's close to complete is this job from 2018
archive.fart.website/archivebot/viewer/job/1bkfa
-
Sanqui
I would answer your question with a YES because there has been no major and documented effort to archive fanfiction.net since 2018
-
Cookie
Okay :-)
-
Sanqui
as long as you use best practices, update the wikipage, etc.
-
Cookie
So does this mean I should point the ArchiveBot at it? Or is it too large and I should explore different crawling methods?
-
Sanqui
archivebot with no offsite links may be sufficient
-
Cookie
Thank you!!
-
Sanqui
somebody (like me I suppose) will have to run the job for you, but you can help monitor it :)
-
Cookie
Sure. I'll read up on how this all works
-
Cookie
And I guess I will ask the archive.org people how their existing fanfiction.net collections can be managed
-
Sanqui
feel free to poke or pm me in the future
-
Cookie
This is probably a dumb question, but if the data in "go" archives wasn't grouped by date crawled, but instead by name of website, wouldn't it be possible to use archive.org's native category and metadata system to search for a specific website archive?
-
Cookie
i.e. instead of this:
archive.org/details/archiveteam_archivebot_go_20181027110001 -- which contains sections of several different website archives, you might have:
archive.org/details/archiveteam_archivebot_go_201810_fanfiction.net - which only contains a portion of one website. Additional crawls would either update that one, or add more archives in the same collection (or topic or tag or whatever)
-
Cookie
It would undoubtedly take more time and effort than is available. I was just wondering if there is anything specific preventing this from happening.
-
bonga
-
bonga
Google is officially removing the apps
-
bonga
But we don't need to worry like with the App Store
-
bonga
They are archived by the APK sites
-
bonga
Reviews are the only unarchived thing
-
JAA
Cookie: We tried that in the past, and it didn't go well. The problem is that IA items have a size limit, and AB jobs can get *much* bigger than that.
-
JAA
So then you need to coordinate items for the same domain between pipelines, which gets really fun...
-
JAA
Ideally, IA would allow searching a collection for filenames, and then this would work automatically.
-
JAA
Also, the grouping isn't by pipeline. Pipelines upload to a staging server, which groups files together until a set exceeds some size threshold, and then that gets uploaded.
-
JAA
So it's more of a time slice of all AB pipelines, although that isn't entirely accurate either.
-
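(Absent filename search, the workaround is crude: walk the collection and inspect each item's file list, e.g. with the `internetarchive` library. This is very slow over a collection that size; the viewer exists precisely so you don't have to do this.)
```python
from internetarchive import get_item, search_items

target = "fanfiction.net"

# Scan the AB collection item by item; IA search can't match filenames,
# but each item's file listing is available via the metadata API.
for result in search_items("collection:archivebot"):
    item = get_item(result["identifier"])
    if any(target in f["name"] for f in item.files):
        print(item.identifier)
```
-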
Cookie
JAA: Thanks for putting my mind at rest
-
Cookie
Filenames might work... But a metadata field specifically for "website name" would be best.
-
JAA
win-raid.com is redirecting to the new Discourse forum since about 2022-05-13 23:50. I finally get to stop my continuous archival of that. :-)
-
thuban
ah, they've left. but we've discussed re-archiving ffn on a few occasions; it is not archivebot-suitable
-
thuban
-
thuban
(last has details of site structure)
-
JAA
Last time I checked (earlier this year), all story pages were behind Buttflare Attack mode, not just with a ridiculous rate limit but entirely.
-
thuban
oof
-
» thuban contemplates writing a chromebot that doesn't suck...
-
thuban
Sanqui, weren't you doing something with puppeteer and warcprox?
-
Doranwen
JAA: yes, I can confirm that ff.n is very much behind the worst mode ever - and it's gotten even worse recently
-
Doranwen
bad enough that even browsing it, it'll sit on a "checking you're human" page for ages - I've resorted to pasting the links of any story I want to read into FFDL because at least I don't have to watch it think, and I can check over and solve a captcha or two as necessary
-
JAA
Doranwen: Oh, there's still room for it to get worse, but I won't mention it here in case they're lurking here and looking for ideas how to do so.
-
Doranwen
LOL
-
Doranwen
well, they should realize they've gotten it a little counterproductive at this point - I mean, I *used* to actually read the fics on the site and only d/l them when I was done - now it's a pain to do anything on the site
-
bonga
-
bonga
Google is removing "abandoned" apps from the Play Store. Let's archive them just in case the APK sites did not.
-
bonga
Let's especially archive the HTML pages, and the reviews as CSV, of Play Store apps over 2 years old
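-
(A hedged sketch of the reviews-to-CSV idea, assuming the third-party google-play-scraper package, `pip install google-play-scraper`; the app ID is a placeholder, and the field names are whatever that library happens to return.)
```python
import csv

# Third-party package assumption: pip install google-play-scraper
from google_play_scraper import reviews

# Fetch a batch of reviews for a placeholder app ID.
result, _token = reviews("com.example.app", lang="en", country="us", count=200)

with open("reviews.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["userName", "score", "at", "content"])
    writer.writeheader()
    for r in result:
        writer.writerow({k: r.get(k) for k in writer.fieldnames})
```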