-
lennier1
-
TheTechRobo
lennier1: So yeah, then those environment variables will work.
-
TheTechRobo
Keep in mind that disabling SSL verification, like CURL_CA_BUNDLE="" does (required for warcprox, unless you modify the trusted CA store instead), will spam your terminal with "urllib3: Making insecure connection to localhost. Blablablabla."
-
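(For illustration, a minimal Python sketch of that setup: requests honours CURL_CA_BUNDLE, and the urllib3 warning can be silenced explicitly. The proxy address assumes warcprox's default port, which is an assumption here.)
```python
import os

import requests
import urllib3

# An empty CURL_CA_BUNDLE disables certificate verification in requests,
# which is needed because warcprox re-signs TLS traffic on the fly.
os.environ["CURL_CA_BUNDLE"] = ""

# Silence the insecure-connection warnings mentioned above.
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

# Assumption: warcprox is listening on its default port, 8000.
proxies = {"http": "http://localhost:8000", "https": "http://localhost:8000"}

resp = requests.get("https://example.com/", proxies=proxies)
print(resp.status_code)
```
-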
lennier1
The bottleneck seems to be the request for similar apps. I get timed out for a while if I don't delay about a second between calls to get_similar_app_ids_for_app.
-
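(A sketch of that throttling in Python; `get_similar_app_ids_for_app` is the call from the discussion, assumed to come from whatever scraper library is in use here.)
```python
import time

# Assumption: this is wherever get_similar_app_ids_for_app actually lives.
from scraper import get_similar_app_ids_for_app


def crawl_similar(seed_ids, delay=1.0):
    """Walk the similar-apps graph, sleeping between requests."""
    seen = set(seed_ids)
    queue = list(seed_ids)
    while queue:
        app_id = queue.pop()
        for similar in get_similar_app_ids_for_app(app_id):
            if similar not in seen:
                seen.add(similar)
                queue.append(similar)
        time.sleep(delay)  # ~1 s between calls avoids the timeout
    return seen
```
-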
lennier1
Alternatively, you could brute-force it, but only about one in a thousand IDs in the used range is valid. Unless there's some way to predict which ones are used.
-
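(A brute-force probe could look something like this sketch, using the public iTunes lookup endpoint, which reports resultCount 0 for unused IDs; the ID range below is an arbitrary placeholder.)
```python
import time

import requests


def is_valid_app_id(app_id):
    # The lookup endpoint returns resultCount == 0 for unused IDs.
    resp = requests.get("https://itunes.apple.com/lookup",
                        params={"id": app_id}, timeout=30)
    return resp.json().get("resultCount", 0) > 0


# Arbitrary placeholder slice of the ID space; expect roughly one hit
# per thousand IDs, so this is very slow going.
for app_id in range(1500000000, 1500001000):
    if is_valid_app_id(app_id):
        print(app_id)
    time.sleep(1)  # the same rate-limiting concerns apply here
```
-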
lennier1
How do I use that sitemap? I downloaded apps.apple.com/sitemaps_apps_85_1_20220509.xml.gz and extracted the XML file. I figured it would be readable as text, but it doesn't seem to be. Edge says it has an encoding error.
-
thuban
lennier1: file looks fine to me--extracts without issue, result is pure ascii xml.
-
thuban
is yours 23528629 bytes unzipped? what did you use to extract it?
-
lennier1
That's odd. Extracted with 7zip. sitemaps_apps_85_1_20220509.xml is 414,082 bytes.
-
lennier1
sitemaps_apps_85_1_20220509.xml.gz is 335,329 bytes. SHA1 92B44E456A3D05B89BB47191DD46E3C208611DC0
-
Jake
a9d2324c7bf34222824ee0f45200362f5811f008 *sitemaps_apps_85_1_20220509.xml
-
Jake
ce80bab6dd6b5af7a6213db77b5abe9bd12a8b18 *sitemaps_apps_85_1_20220509.xml.gz
-
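(For anyone following along, a quick way to reproduce those checksums in Python, if you don't have `sha1sum` handy.)
```python
import hashlib


def sha1sum(path, bufsize=1 << 20):
    # Stream the file so large sitemaps don't have to fit in memory.
    h = hashlib.sha1()
    with open(path, "rb") as f:
        while chunk := f.read(bufsize):
            h.update(chunk)
    return h.hexdigest()


for name in ("sitemaps_apps_85_1_20220509.xml.gz",
             "sitemaps_apps_85_1_20220509.xml"):
    print(sha1sum(name), name)
```
-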
thuban
ditto Jake
-
thuban
the gz file (_not_ the extracted xml) is 414082 bytes
-
thuban
i suggest double-checking whether you've invoked 7zip correctly / just using gunzip
-
thuban
(`7z e sitemaps_apps_85_1_20220509.xml.gz` works for me too)
-
lennier1
Wait, what was the size of the file you originally downloaded? I started with 335329 bytes and AFTER using 7zip it's 414082 bytes. Is that file still in a compressed format?
-
JAA
-
thuban
& maybe see what `file` says or try extracting it again. (is it possible you downloaded it with some tool that got confused by gzip transfer-encoding?)
-
ThreeHM
I end up with 334,939 bytes if I gzip the .gz file again, so it might be compressed twice
-
lennier1
Yes, it's compressed twice.
-
JAA
There is no TE or whatever, it sends it as a plain application/octet-stream.
-
lennier1
If I change the .xml extension of the extracted file to .gz, I can get the final .xml file (with no extension).
-
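(That rename trick works because the file is gzip inside gzip. A sketch that handles any number of layers by checking the gzip magic bytes, 1f 8b:)
```python
import gzip

with open("sitemaps_apps_85_1_20220509.xml.gz", "rb") as f:
    data = f.read()

# Keep unwrapping while the payload still looks like gzip; the XML
# itself starts with '<', so this terminates at the right layer.
while data[:2] == b"\x1f\x8b":
    data = gzip.decompress(data)

with open("sitemaps_apps_85_1_20220509.xml", "wb") as f:
    f.write(data)
```
-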
JAA
At least when accessing it with curl. Maybe it does something weird if you use a browser?
-
thuban
er, *content-encoding
-
TheTechRobo
JAA: Nah, browser is dumb too.
-
JAA
No CE either. Just chunked TE.
-
thuban
lol, weird. what _did_ you download it with, lennier1?
-
lennier1
Firefox
-
TheTechRobo
same
-
ThreeHM
Chrome seems to handle it correctly
-
thuban
TheTechRobo: on windows? (lennier1, i assume you're on windows since you mentioned edge)
-
TheTechRobo
thuban: Firefox Developer Edition on Debian
-
TheTechRobo
I don't use Windows
-
lennier1
Windows, yes.
-
lennier1
But yeah, there are a lot of app links in that file. I wonder if the sitemap includes all publicly listed apps.
-
ThreeHM
Found a bug report for this:
bugzilla.mozilla.org/610679 - Opened 12 years ago!
-
Jake
wow
-
thuban
JAA: content-encoding is gzip with `curl --compressed`
-
thuban
(and ofc in the browser)
-
JAA
thuban: Yes, and it's Apple that recompresses it in that case.
-
JAA
I also get the 335329 file with that.
-
JAA
Which is actually the correct behaviour, I think.
-
JAA
`Accept-Encoding: gzip` asks the server to send the requested resource in gzip-compressed form, so it does that. The fact that the resource is already compressed doesn't really come into play there.
-
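(To see that behaviour concretely, a sketch with requests, which, like `curl --compressed`, advertises gzip and transparently strips the Content-Encoding layer again:)
```python
import requests

url = "https://apps.apple.com/sitemaps_apps_85_1_20220509.xml.gz"

# Identity: the server sends the stored .gz file as-is.
plain = requests.get(url, headers={"Accept-Encoding": "identity"})

# Default: requests advertises gzip, the server wraps the .gz in another
# gzip layer (Content-Encoding: gzip), and requests unwraps that layer,
# so .content is the stored .gz bytes again.
wrapped = requests.get(url)

print(len(plain.content), len(wrapped.content))  # should match: both are the stored .gz bytes
```
-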
JAA
Doesn't make it any less confusing though. And so many times CE is abused when it's actually TE compression because many clients don't even support the latter.
-
JAA
(`curl --compressed` also decompresses it again as it's supposed to.)
-
thuban
i mean, arguably apple's server is incorrect to recompress a gzip (since 'Accept-Encoding: gzip' doesn't specifically forbid the 'identity' value)
-
thuban
but yes, curl handles it correctly and ff does not
-
JAA
It's not incorrect, but it's certainly not optimal, yeah.
-
thuban
'correct' as in 'The Right Thing'
-
thuban
anyway, apparently this behavior was a workaround for some combination of bugs in apache and apache being easy to misconfigure. transfer protocols were a mistake :(
-
Jake
I feel like I've seen this brought up here before, haha.
-
JAA
That's where 90+% of HTTP's quirks come from, browsers working around server bugs instead of throwing a brick at the people running broken servers.
-
h2ibot
Usernam edited List of websites excluded from the Wayback Machine (+27):
wiki.archiveteam.org/?diff=48596&oldid=48547
-
FaraiNL
Hi. Quick question, I hope this is the right channel. I've created some WARC files using grab-site which I would like to upload to the Internet Archive. According to the FAQ I have to use the subject keyword "archiveteam" (check) and to "let us know". How do I let the team know? Here via IRC? Also, do I upload just the .warc.gz file, or also the meta-warc.gz? And how do I set the mediatype to web?
-
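(For the mediatype/subject part, a sketch with the `internetarchive` Python library: pip install internetarchive, then run `ia configure` once for credentials. The identifier and filenames below are placeholders.)
```python
from internetarchive import upload

# Metadata applied to the new item; mediatype and subject are the two
# fields the FAQ discussion above is about.
metadata = {
    "mediatype": "web",
    "subject": "archiveteam",
    "title": "grab-site WARCs of example.com",
}

upload(
    "example.com-grab-site-2022",  # placeholder item identifier
    files=["example.com.warc.gz", "example.com-meta.warc.gz"],
    metadata=metadata,
)
```
-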
Jake
Just as a quick tip: uploads from normal users aren't going into the Wayback Machine anymore. Only specific whitelisted users can upload directly and have it included.
-
Jake
Oh, they left. :(
-
Cookie
-
Cookie
I'd like to know how to find out, for any given website -- if any "whole site" archives/swipes have been done. I notice there are wiki pages with this info for many sites, so that's helpful. But I'm interested in a more automated, methodical way of finding out this information without relying on a human updating a wiki page.
-
Sanqui
Cookie, in terms of Archive Team's projects, e.g. for fanfiction.net we try to maintain a wiki page such as
wiki.archiveteam.org/index.php/FanFiction.Net
-
Cookie
That is awesome! Anyone can scrape a website and create a WARC file though, right? And they wouldn't necessarily create or update the relevant wiki page...
-
Sanqui
that's right
-
Sanqui
of course, when presented with a WARC file, it is impossible to simply determine whether it is "complete" or not
-
Sanqui
that's why it's important to have good metadata and record keeping as well
-
Cookie
Yes I have realised that after thinking about it
-
Cookie
But suppose there isn't good metadata (which is likely!)
-
Cookie
But they have uploaded it to the archive anyway
-
Sanqui
in general, archivebot jobs tend to be "complete" (unless aborted, run with aggressive ignores, hit high error rates, etc.)
-
Sanqui
-
Sanqui
so that's another place to check
-
Cookie
Ooh I hadn't seen that yet.
-
Cookie
-
Cookie
The first result looks "big" (2018). The others look "small"
-
Cookie
Can these results be related to the collections here:
archive.org/details/archiveteam_fanfiction
-
Cookie
Oh, no those say 2012
-
Sanqui
Nope, ArchiveBot gets its own collection for all crawls
-
Sanqui
in fact, the WARCs are partitioned and uploaded in separate items along with other crawls running on the same machine
-
Sanqui
also, I suspect the first job is incomplete due to a crash. Mainly because the last segment (00070) was uploaded 7 months after the previous one
-
Sanqui
it might be possible to determine what exactly went on from irc logs, but I don't have time for that atm :)
-
Cookie
Okay is this the list of archivebot collections?
archive.org/details/archivebot?&sort=-publicdate
-
Sanqui
aye
-
Cookie
What is a "GO Pack"? Is that just what every item is called? And I have to examine the metadata to find out what's in it?
-
Sanqui
I believe the names are variable based on the pipeline that uploaded them (there are over 50 machines running archivebot instances)
-
Sanqui
and yeah, that's why the archivebot viewer exists, to make it easier to search through these
-
Cookie
If I search that collection for "fanfiction.net" it doesn't find anything:
archive.org/details/archivebot?query=fanfiction.net&sort=-publicdate
-
Sanqui
yeah, the collection's not really intended for manual browsing, but the data is all there. the viewer shows you which jobs exist for a given domain, and which items contain the data from that job.
-
Cookie
-
Sanqui
that's because it's www.fanfiction.net
-
Cookie
Ohh haha. Where do I log a suggestion to remove the "www" from sites in this index? ;-)
-
Cookie
Anyway the search found it alright
-
Sanqui
github.com/ArchiveTeam/ArchiveBot/issues possibly. I'm not even sure the viewer has its own repo
-
Sanqui
there's a lot of issues with the whole setup, but we're all volunteers here, it's difficult to overhaul the way things are built already
-
Cookie
Thanks for explaining this
-
Sanqui
np
-
Cookie
-
Cookie
In order to answer the question "is it efficient for me to archive fanfiction.net again, now?"
-
Sanqui
-
Cookie
also, what about incremental backups?
-
Sanqui
if you see an "ArchiveBot" collection here, then it's done by ArchiveBot
-
Sanqui
otherwise, none of the collections are related
-
Cookie
Okay, and if it's done by ArchiveBot then that means it's "complete"?
-
Sanqui
not really, actually, because it's possible ArchiveBot saw a link *to* fanfiction.net while archiving another website, and grabbed just one page
-
Sanqui
the only thing by archivebot that's close to complete is this job from 2018
archive.fart.website/archivebot/viewer/job/1bkfa
-
Sanqui
I would answer your question with a YES because there has been no major and documented effort to archive fanfiction.net since 2018
-
Cookie
Okay :-)
-
Sanqui
as long as you use best practices, update the wikipage, etc.
-
Cookie
So does this mean I should point the ArchiveBot at it? Or is it too large and I should explore different crawling methods?
-
Sanqui
archivebot with no offsite links may be sufficient
-
Cookie
Thank you!!
-
Sanqui
somebody (like me I suppose) will have to run the job for you, but you can help monitor it :)
-
Cookie
Sure. I'll read up on how this all works
-
Cookie
And I guess I will ask the archive.org people how their existing fanfiction.net collections can be managed
-
Sanqui
feel free to poke or pm me in the future
-
Cookie
This is probably a dumb question, but if the data in "go" archives wasn't grouped by date crawled, but instead by name of website, wouldn't it be possible to use archive.org's native category and metadata system to search for a specific website archive?
-
Cookie
i.e. instead of this:
archive.org/details/archiveteam_archivebot_go_20181027110001 -- which contains sections of several different website archives, you might have:
archive.org/details/archiveteam_archivebot_go_201810_fanfiction.net - which only contains a portion of one website. Additional crawls would either update that one, or add more archives in the same collection (or topic or tag or whatever)
-
Cookie
It would undoubtedly take more time and effort than is available. I was just wondering if there is anything specific preventing this from happening.
-
bonga
-
bonga
Google is officially removing the apps
-
bonga
But we don't need to worry like with the App Store
-
bonga
They are archived by the APK sites
-
bonga
Reviews are the only unarchived thing
-
JAA
Cookie: We tried that in the past, and it didn't go well. The problem is that IA items have a size limit, and AB jobs can get *much* bigger than that.
-
JAA
So then you need to coordinate items for the same domain between pipelines, which gets really fun...
-
JAA
Ideally, IA would allow searching a collection for filenames, and then this would work automatically.
-
JAA
Also, the grouping isn't by pipeline. Pipelines upload to a staging server, which groups files together until a set exceeds some size threshold, and then that gets uploaded.
-
JAA
So it's more of a time slice of all AB pipelines, although that isn't entirely accurate either.
-
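(Absent filename search, the workaround is crude: walk the collection and inspect each item's file list, e.g. with the `internetarchive` library. This is very slow over a collection that size; the viewer exists precisely so you don't have to do this.)
```python
from internetarchive import get_item, search_items

target = "fanfiction.net"

# Scan the AB collection item by item; IA search can't match filenames,
# but each item's file listing is available via the metadata API.
for result in search_items("collection:archivebot"):
    item = get_item(result["identifier"])
    if any(target in f["name"] for f in item.files):
        print(item.identifier)
```
-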
Cookie
JAA: Thanks for putting my mind at rest
-
Cookie
Filenames might work... But a metadata field specifically for "website name" would be best.
-
JAA
win-raid.com is redirecting to the new Discourse forum since about 2022-05-13 23:50. I finally get to stop my continuous archival of that. :-)
-
thuban
ah, they've left. but we've discussed re-archiving ffn on a few occasions; it is not archivebot-suitable
-
thuban
-
thuban
(last has details of site structure)
-
JAA
Last time I checked (earlier this year), all story pages were behind Buttflare Attack mode, not just with a ridiculous rate limit but entirely.
-
thuban
oof
-
» thuban contemplates writing a chromebot that doesn't suck...
-
thuban
Sanqui, weren't you doing something with puppeteer and warcprox?
-
Doranwen
JAA: yes, I can confirm that ff.n is very much behind the worst mode ever - and it's gotten even worse recently
-
Doranwen
bad enough that even browsing it, it'll sit on a "checking you're human" page for ages - I've resorted to pasting the links of any story I want to read into FFDL because at least I don't have to watch it think, and I can check over and solve a captcha or two as necessary
-
JAA
Doranwen: Oh, there's still room for it to get worse, but I won't mention it here in case they're lurking here and looking for ideas how to do so.
-
Doranwen
LOL
-
Doranwen
well, they should realize they've gotten it a little counterproductive at this point - I mean, I *used* to actually read the fics on the site and only d/l them when I was done - now it's a pain to do anything on the site
-
bonga
-
bonga
Google is removing "abandoned" apps from the Play Store. Let's archive them just in case the APK sites did not.
-
bonga
Let's especially archive the HTML pages, and the reviews as CSV, of Play Store apps over 2 years old
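-
(A hedged sketch of the reviews-to-CSV idea, assuming the third-party google-play-scraper package, `pip install google-play-scraper`; the app ID is a placeholder, and the field names are whatever that library happens to return.)
```python
import csv

# Third-party package assumption: pip install google-play-scraper
from google_play_scraper import reviews

# Fetch a batch of reviews for a placeholder app ID.
result, _token = reviews("com.example.app", lang="en", country="us", count=200)

with open("reviews.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["userName", "score", "at", "content"])
    writer.writeheader()
    for r in result:
        writer.writerow({k: r.get(k) for k in writer.fieldnames})
```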