00:02:39 TheTechRobo: The library is using requests: https://github.com/digitalmethodsinitiative/itunes-app-scraper/blob/master/itunes_app_scraper/scraper.py
00:02:57 lennier1: So yeah, then those environment variables will work.
00:03:49 Keep in mind that disabling SSL verification, as CURL_CA_BUNDLE="" does (required for warcprox unless you add its CA to the trusted store), will spam your terminal with "urllib3: Making insecure connection to localhost. Blablablabla."
00:04:32 The bottleneck seems to be the request for similar apps. I get timed out for a while if I don't delay about a second between calls to get_similar_app_ids_for_app.
00:05:59 Alternatively you could brute-force it, but only about one in a thousand IDs in the used range are valid. Unless there's some way to predict which are used.
00:24:31 How do I use that sitemap? I downloaded https://apps.apple.com/sitemaps_apps_85_1_20220509.xml.gz and extracted the XML file. I figured it would be readable as text, but it doesn't seem to be. Edge says it has an encoding error.
00:33:28 lennier1: file looks fine to me--extracts without issue, result is pure ASCII XML.
00:34:12 is yours 23528629 bytes unzipped? what did you use to extract it?
00:40:04 That's odd. Extracted with 7zip. sitemaps_apps_85_1_20220509.xml is 414,082 bytes.
00:41:15 sitemaps_apps_85_1_20220509.xml.gz is 335,329 bytes. SHA1 92B44E456A3D05B89BB47191DD46E3C208611DC0
00:42:37 a9d2324c7bf34222824ee0f45200362f5811f008 *sitemaps_apps_85_1_20220509.xml
00:42:37 ce80bab6dd6b5af7a6213db77b5abe9bd12a8b18 *sitemaps_apps_85_1_20220509.xml.gz
00:43:11 ditto Jake
00:43:53 the gz file (_not_ the extracted xml) is 414082 bytes
00:44:19 i suggest double-checking whether you've invoked 7zip correctly / just using gunzip
00:48:24 (`7z e sitemaps_apps_85_1_20220509.xml.gz` works for me too)
00:50:08 Wait, what was the size of the file you originally downloaded? I started with 335329 bytes and AFTER using 7zip it's 414082 bytes. Is that file still in a compressed format?
00:51:48 The file at https://apps.apple.com/sitemaps_apps_85_1_20220509.xml.gz is 414082 bytes.
00:52:31 & maybe; see what `file` says or try extracting it again. (is it possible you downloaded it with some tool that got confused by gzip transfer-encoding?)
00:52:56 I end up with 334,939 bytes if I gzip the .gz file again, so it might be compressed twice
00:53:06 Yes, it's compressed twice.
00:53:34 There is no TE or whatever, it sends it as a plain application/octet-stream.
00:53:51 If I change the .xml extension of the extracted file to .gz, I can get the final .xml file (with no extension).
00:53:55 At least when accessing it with curl. Maybe it does something weird if you use a browser?
00:53:57 er, *content-encoding
00:54:17 JAA: Nah, browser is dumb too.
00:54:20 No CE either. Just chunked TE.
00:55:25 lol, weird. what _did_ you download it with, lennier1?
00:55:39 Firefox
00:55:57 same
00:58:30 Chrome seems to handle it correctly
00:59:53 TheTechRobo: on windows? (lennier1, i assume you're on windows since you mentioned edge)
01:00:13 thuban: Firefox Developer Edition on Debian
01:00:21 I don't use Windows
01:01:02 Windows, yes.
01:02:55 But yeah, there are a lot of app links in that file. I wonder if the sitemap includes all publicly listed apps.
01:06:43 Found a bug report for this: https://bugzilla.mozilla.org/show_bug.cgi?id=610679 - Opened 12 years ago!
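A rough sketch of how the pieces above could fit together: routing the library's requests-based traffic through warcprox via the standard proxy environment variables, disabling certificate verification with CURL_CA_BUNDLE="", and pausing about a second between get_similar_app_ids_for_app calls. The warcprox address, the seed ID, the crawl loop, and the exception handling are illustrative assumptions, not a tested setup; only the AppStoreScraper class and the similar-apps call come from the library linked above.

```python
# Hypothetical sketch: pull App Store "similar apps" data through warcprox
# with itunes-app-scraper. Assumes warcprox is already running on 127.0.0.1:8000.
import os
import time

import urllib3
from itunes_app_scraper.scraper import AppStoreScraper

# Route all requests-based traffic through warcprox and skip certificate
# verification (warcprox re-signs TLS with its own CA).
os.environ["HTTP_PROXY"] = "http://127.0.0.1:8000"
os.environ["HTTPS_PROXY"] = "http://127.0.0.1:8000"
os.environ["CURL_CA_BUNDLE"] = ""  # requests treats this as "don't verify"

# Silence the insecure-connection warning spam mentioned above.
urllib3.disable_warnings()

scraper = AppStoreScraper()
seed_ids = ["284882215"]  # placeholder seed app ID

seen = set(seed_ids)
queue = list(seed_ids)
while queue:
    app_id = queue.pop()
    try:
        similar = scraper.get_similar_app_ids_for_app(app_id)
    except Exception as exc:  # missing/blocked apps, rate limiting, etc.
        print(f"{app_id}: {exc}")
        continue
    for sim_id in similar:
        sim_id = str(sim_id)
        if sim_id not in seen:
            seen.add(sim_id)
            queue.append(sim_id)
    time.sleep(1)  # ~1 s between similar-app requests, per the rate-limit observation
```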
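For the sitemap confusion, a small sketch that downloads the raw file and keeps gunzipping as long as the data still starts with the gzip magic bytes. That sidesteps whatever the browser does with Content-Encoding and would have revealed the double compression immediately; the file name comes from the discussion above, the loop is generic.

```python
# Sketch: fetch Apple's app sitemap and unwrap however many gzip layers it has.
import gzip

import requests

url = "https://apps.apple.com/sitemaps_apps_85_1_20220509.xml.gz"
resp = requests.get(url)
resp.raise_for_status()
data = resp.content  # requests undoes any Content-Encoding, not the file's own gzip

layers = 0
while data[:2] == b"\x1f\x8b":  # gzip magic number
    data = gzip.decompress(data)
    layers += 1

print(f"unwrapped {layers} gzip layer(s), {len(data)} bytes of XML")
with open("sitemaps_apps_85_1_20220509.xml", "wb") as f:
    f.write(data)
```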
01:07:27 wow
01:08:06 JAA: content-encoding is gzip with `curl --compressed`
01:08:58 (and ofc in the browser)
01:17:00 thuban: Yes, and it's Apple that recompresses it in that case.
01:17:14 I also get the 335329-byte file with that.
01:20:03 Which is actually the correct behaviour, I think.
01:21:12 `Accept-Encoding: gzip` asks the server to send the requested resource in gzip-compressed form, so it does that. The fact that the resource is already compressed doesn't really come into play there.
01:22:39 Doesn't make it any less confusing though. And so many times CE is abused when it's actually TE compression because many clients don't even support the latter.
01:26:24 (`curl --compressed` also decompresses it again, as it's supposed to.)
01:34:14 i mean, arguably apple's server is incorrect to recompress a gzip (since 'Accept-Encoding: gzip' doesn't specifically forbid the 'identity' value)
01:34:17 but yes, curl handles it correctly and ff does not
01:34:40 It's not incorrect, but it's certainly not optimal, yeah.
01:35:09 'correct' as in 'The Right Thing'
01:35:29 anyway, apparently this behavior was a workaround for some combination of bugs in apache and apache being easy to misconfigure. transfer protocols were a mistake :(
01:35:57 I feel like I've seen this brought up here before, haha.
01:36:35 That's where 90+% of HTTP's quirks come from, browsers working around server bugs instead of throwing a brick at the people running broken servers.
02:20:44 Usernam edited List of websites excluded from the Wayback Machine (+27): https://wiki.archiveteam.org/?diff=48596&oldid=48547
02:21:40 Hi. Quick question, I hope this is the right channel. I've created some WARC files using grab-site which I would like to upload to the Internet Archive. According to the FAQ I have to use the subject keyword "archiveteam" (check) and to "let us know". How do I let the team know? Here via IRC? Also, do I upload just the .warc.gz file? Or also the meta-warc.gz? And how do I set the mediatype to web?
03:11:51 Just as a quick tip, uploads from normal users aren't going into the Wayback Machine anymore. Only specific whitelisted users can directly upload to be included in the Wayback Machine.
03:14:13 Oh, they left. :(
10:52:43 Hi. I made a post here: https://archive.org/post/1124209/can-i-find-out-how-much-of-a-website-has-already-been-archived
10:54:54 I'd like to know how to find out, for any given website, whether any "whole site" archives/sweeps have been done. I notice there are wiki pages with this info for many sites, so that's helpful. But I'm interested in a more automated, methodical way of finding out this information without relying on a human updating a wiki page.
10:55:41 Cookie, in terms of Archive Team's projects, e.g. for fanfiction.net we try to maintain a wiki page such as https://wiki.archiveteam.org/index.php/FanFiction.Net
10:56:48 That is awesome! Anyone can scrape a website and create a WARC file though, right? And they wouldn't necessarily create or update the relevant wiki page...
10:56:56 that's right
10:57:16 of course, when presented with a WARC file, it is impossible to simply determine whether it is "complete" or not
10:57:25 that's why it's important to have good metadata and record keeping as well
10:57:32 Yes, I have realised that after thinking about it
10:57:51 But suppose there isn't good metadata (which is likely!)
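The upload question at 02:21 never gets a full answer in this log. As one hedged illustration (not an official Archive Team procedure), the internetarchive Python library can set the mediatype and the subject keyword at upload time; the identifier, file names, title and description below are placeholders. Note the 03:11 caveat that such uploads no longer land in the Wayback Machine automatically.

```python
# Hypothetical sketch using the `internetarchive` library
# (pip install internetarchive, then `ia configure` for credentials).
from internetarchive import upload

item_id = "example-grab-site-warc-20220514"  # placeholder; must be unique on archive.org

metadata = {
    "mediatype": "web",        # can only be set when the item is first created
    "subject": "archiveteam",  # keyword requested by the FAQ
    "title": "grab-site crawl of example.com (2022-05-14)",
    "description": "WARC files produced with grab-site.",
}

# Upload the data WARC and the meta WARC together.
responses = upload(
    item_id,
    files=["example.com-2022-05-14.warc.gz", "example.com-2022-05-14-meta.warc.gz"],
    metadata=metadata,
)
for r in responses:
    print(r.status_code, r.request.url)
```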
10:58:06 But they have uploaded it to the archive anyway
10:58:33 in general, archivebot jobs tend to be "complete" (unless aborted, run with aggressive ignores, hit high error rates, etc.)
10:58:38 and they are searchable here: https://archive.fart.website/archivebot/viewer/
10:59:17 so that's another place to check
11:01:30 Ooh, I hadn't seen that yet.
11:01:38 So for example: https://archive.fart.website/archivebot/viewer/domain/www.fanfiction.net
11:02:03 The first result looks "big" (2018). The others look "small"
11:02:44 Can these results be related to the collections here: https://archive.org/details/archiveteam_fanfiction
11:03:10 Oh, no, those say 2012
11:03:11 Nope, ArchiveBot gets its own collection for all crawls
11:03:33 in fact, the warcs are partitioned and uploaded in separate items along with other crawls running on the same machine
11:03:51 also, I suspect the first job is incomplete due to a crash, mainly because the last segment (00070) was uploaded 7 months after the previous one
11:04:07 it might be possible to determine what exactly went on from irc logs, but I don't have time for that atm :)
11:05:09 Okay, is this the list of archivebot collections? https://archive.org/details/archivebot?&sort=-publicdate
11:05:18 aye
11:06:59 What is a "GO Pack"? Is that just what every item is called? And I have to examine the metadata to find out what's in it?
11:07:58 I believe the names are variable based on the pipeline that uploaded them (there are over 50 machines running archivebot instances)
11:08:15 and yeah, that's why the archivebot viewer exists, to make it easier to search through these
11:08:29 If I search that collection for "fanfiction.net" it doesn't find anything: https://archive.org/details/archivebot?query=fanfiction.net&sort=-publicdate
11:09:24 yeah, the collection's not really intended for manual browsing, but the data is all there. the viewer shows you which jobs exist for a given domain, and which items contain the data from that job.
11:09:44 I can't see fanfiction.net here: https://archive.fart.website/archivebot/viewer/domains/f
11:10:17 that's because it's www.fanfiction.net
11:11:08 Ohh haha. Where do I log a suggestion to remove the "www" from sites in this index? ;-)
11:11:49 Anyway, the search found it alright
11:11:53 https://github.com/ArchiveTeam/ArchiveBot/issues possibly. I'm not even sure the viewer has its own repo
11:12:32 there are a lot of issues with the whole setup, but we're all volunteers here; it's difficult to overhaul the way things are built already
11:12:38 Thanks for explaining this
11:12:51 np
11:13:18 I'd still like to relate the results at https://archive.fart.website/archivebot/viewer/domain/www.fanfiction.net to the collections I've listed here: https://archive.org/post/1124209/can-i-find-out-how-much-of-a-website-has-already-been-archived
11:13:49 In order to answer the question "is it efficient for me to archive fanfiction.net again, now?"
11:13:55 https://web.archive.org/web/collections/2022*/fanfiction.net
11:14:07 also, what about incremental backups?
11:14:12 if you see an "ArchiveBot" collection here, then it's done by ArchiveBot
11:14:22 otherwise, none of the collections are related
11:14:46 Okay, and if it's done by ArchiveBot then that means it's "complete"
11:14:48 ?
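On the "how much of this site is already archived" question, one automated (if rough) cross-check is the Wayback Machine's CDX API, which can count captures per year for a domain; a large spike in one year usually points at a deliberate crawl. This is a sketch of that idea, complementary to the ArchiveBot viewer and the wiki rather than a replacement; the query parameters are the commonly documented ones and the limit is arbitrary.

```python
# Sketch: count Wayback Machine captures per year for a domain via the CDX API.
import collections

import requests

def captures_by_year(domain: str) -> dict:
    params = {
        "url": domain,
        "matchType": "domain",  # include subdomains
        "output": "json",
        "fl": "timestamp",      # only the capture timestamps are needed
        "collapse": "digest",   # skip consecutive identical captures
        "limit": "500000",      # arbitrary cap; large sites will be truncated
    }
    resp = requests.get("https://web.archive.org/cdx/search/cdx",
                        params=params, timeout=120)
    rows = resp.json() if resp.text.strip() else []
    counts = collections.Counter(row[0][:4] for row in rows[1:])  # rows[0] is the header
    return dict(sorted(counts.items()))

print(captures_by_year("fanfiction.net"))
```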
11:15:15 not really, actually, because it's possible ArchiveBot saw a link *to* fanfiction.net while archiving another website, and grabbed just one page
11:15:36 the only thing by archivebot that's close to complete is this job from 2018: https://archive.fart.website/archivebot/viewer/job/1bkfa
11:16:20 I would answer your question with a YES because there has been no major and documented effort to archive fanfiction.net since 2018
11:16:36 Okay :-)
11:16:39 as long as you use best practices, update the wiki page, etc.
11:17:14 So does this mean I should point the ArchiveBot at it? Or is it too large and I should explore different crawling methods?
11:17:31 archivebot with no offsite links may be sufficient
11:18:00 Thank you!!
11:18:47 somebody (like me, I suppose) will have to run the job for you, but you can help monitor it :)
11:19:09 Sure. I'll read up on how this all works
11:19:27 And I guess I will ask the archive.org people how their existing fanfiction.net collections can be managed
11:20:48 feel free to poke or pm me in the future
13:16:31 This is probably a dumb question, but if the data in "go" archives wasn't grouped by date crawled, but instead by name of website, wouldn't it be possible to use archive.org's native category and metadata system to search for a specific website archive?
13:18:23 i.e. instead of this: https://archive.org/details/archiveteam_archivebot_go_20181027110001/ -- which contains sections of several different website archives, you might have: https://archive.org/details/archiveteam_archivebot_go_201810_fanfiction.net/ -- which only contains a portion of one website. Additional crawls would either update that one, or add more archives in the same collection (or topic or tag or whatever)
13:36:33 It would undoubtedly take more time and effort than is available. I was just wondering if there is anything specific preventing this from happening.
14:31:43 https://www.androidcentral.com/apps-software/google-play-store-to-get-rid-of-nearly-900000-abandoned-apps
14:31:54 Google is officially removing the apps
14:32:15 But we don't need to worry like with the App Store
14:32:26 They are archived by the APK sites
14:32:39 Reviews are the only unarchived thing
15:27:40 Cookie: We tried that in the past, and it didn't go well. The problem is that IA items have a size limit, and AB jobs can get *much* bigger than that.
15:28:25 So then you need to coordinate items for the same domain between pipelines, which gets really fun...
15:28:49 Ideally, IA would allow searching a collection for filenames, and then this would work automatically.
15:29:35 Also, the grouping isn't by pipeline. Pipelines upload to a staging server, which groups files together until a set exceeds some size threshold, and then that gets uploaded.
15:30:14 So it's more of a time slice of all AB pipelines, although that isn't entirely accurate either.
15:39:15 JAA: Thanks for putting my mind at rest
15:39:58 Filenames might work... But a metadata field specifically for "website name" would be the best.
15:53:12 win-raid.com has been redirecting to the new Discourse forum since about 2022-05-13 23:50. I finally get to stop my continuous archival of that. :-)
18:41:36 ah, they've left.
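The "searching a collection for filenames" idea from 15:28 can be approximated client-side with the internetarchive library by walking items in the archivebot collection and scanning their file lists for a domain, since ArchiveBot WARC names generally include the job's hostname. A slow, hedged sketch of the concept (essentially what the viewer's database precomputes), not something to run over the whole collection.

```python
# Sketch: find ArchiveBot items that contain WARCs for a given domain by
# listing each item's files. Very slow on the full collection; the viewer
# exists precisely because this kind of scan is impractical to do on demand.
from internetarchive import get_item, search_items

def items_with_domain(domain: str, max_items: int = 200):
    hits = []
    for i, result in enumerate(search_items("collection:archivebot")):
        if i >= max_items:  # cap the scan for demonstration purposes
            break
        item = get_item(result["identifier"])
        names = [f.get("name", "") for f in (item.files or [])]
        if any(domain in name for name in names):
            hits.append(item.identifier)
    return hits

print(items_with_domain("fanfiction.net"))
```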
but we've discussed re-archiving ffn on a few occasions; it is not archivebot-suitable
18:42:03 previously: https://hackint.logs.kiska.pw/archiveteam-ot/20210608#c289987, https://hackint.logs.kiska.pw/archiveteam-bs/20210523#c288184, https://hackint.logs.kiska.pw/archiveteam-bs/20210524
18:42:44 (last has details of site structure)
18:43:21 Last time I checked (earlier this year), all story pages were behind Buttflare Attack mode, not just with a ridiculous rate limit but entirely.
18:44:01 oof
18:45:42 * thuban contemplates writing a chromebot that doesn't suck...
18:47:13 Sanqui, weren't you doing something with puppeteer and warcprox?
21:02:01 JAA: yes, I can confirm that ff.n is very much behind the worst mode ever - and it's gotten even worse recently
21:02:59 bad enough that even browsing it, it'll sit on a "checking you're human" page for ages - I've resorted to pasting the links of any story I want to read into FFDL because at least I don't have to watch it think, and I can check over and solve a captcha or two as necessary
21:03:25 Doranwen: Oh, there's still room for it to get worse, but I won't mention it here in case they're lurking here and looking for ideas how to do so.
21:03:38 LOL
21:04:27 well, they should realize they've gotten it a little counterproductive at this point - I mean, I *used* to actually read the fics on the site and only d/l them when I was done - now it's a pain to do anything on the site
22:10:47 https://www.androidcentral.com/apps-software/google-play-store-to-get-rid-of-nearly-900000-abandoned-apps
22:11:37 Google is removing "abandoned" apps from the play store. Let's archive them just in case the APK sites did not.
22:12:03 Let's especially archive html pages and reviews as csv of play store apps over 2 years old
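On the chromebot idea: the pattern presumably behind the puppeteer-plus-warcprox experiments is to point a headless browser at warcprox so that everything the browser fetches, JavaScript-driven requests included, is written into a WARC. Below is a rough Python sketch using Playwright in place of puppeteer; the warcprox address is an assumed default, and nothing here deals with the Cloudflare challenge pages described above.

```python
# Sketch: record a browser session into a WARC by proxying headless Chromium
# through warcprox (assumed to be running with its defaults on 127.0.0.1:8000).
from playwright.sync_api import sync_playwright

URLS = ["https://example.com/"]  # placeholder; real targets would go here

with sync_playwright() as p:
    browser = p.chromium.launch(
        headless=True,
        proxy={"server": "http://127.0.0.1:8000"},
    )
    # warcprox re-signs TLS with its own CA, so ignore certificate errors
    # (or install the warcprox CA into the browser profile instead).
    context = browser.new_context(ignore_https_errors=True)
    page = context.new_page()
    for url in URLS:
        page.goto(url, wait_until="networkidle")
        page.wait_for_timeout(2000)  # give late-loading resources a moment
    browser.close()
```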