00:20:14 Hmm, JCenter seems to be throwing 403 for all files for me. 00:28:12 Mass download any of it before? 00:28:16 Regular Bintray still working 00:28:23 So I think I have this in a working state 00:34:43 JCenter might return. It's supposed to stay working until next year. 00:35:02 But yeah, let's get Bintray itself running ASAP. 00:35:11 Sort of multitasking here 00:35:15 We can't download anything from JCenter without Bintray anyway. 00:35:20 Will try to get it uploaded in a few minutes 00:44:58 Alright https://github.com/OrIdow6/bintray-grab should be good aside from cosmetic/branding things (it's still SMMB in that area) and the backfeed URL 00:45:40 arkiver: ^ 00:52:18 Is there an IRC for bintray? 00:52:33 *a channel 00:52:46 Not so far. 01:00:12 Very rough around the edges, of course, but it should get all info 01:00:23 Site used POST for a bunch of stuff anyway 01:13:56 So I am trying to use a Japanese IP address, with Accept-Language: ja, a realistic UA, am getting the "counter", am using a 10-second + random delay, and am being conservative w/ the URLs I visit, and am still being blocked 01:14:03 From Aimix-Z 01:14:42 Oof 01:21:25 that's some world class automation detection if i've ever seen it 01:22:03 or are you behaving like a normal user 01:22:13 If it stays alive over the weekend I may have enough time to try it more 01:22:16 wow thanks irccloud 01:22:31 i meant to say are you jumping to extremes or are you behaving like a normal user 01:22:37 No, this is a crawler that a human reading logs could detect easily 01:22:43 fair 01:22:57 Unless you were a very methodical user of a text browser 01:23:15 *Unless you suspected they were 02:55:24 maybe they have a human reading logs... and/or detect that only text resources are accessed and not other resources. Kind of like the inverse of a bot trap URL. 
(if the browser doesn't get all the resources normally accessed by a graphical browser, consider the user to be a bot) 02:56:37 I've tried to consider that 02:57:00 So it does get images, and I also get a "counter"/analytics URL that every page got but seemed not to have a purpose 02:57:17 But due to the nature of the grab setup there is a long delay in some cases 02:59:04 bintray -> binnedtray or spilledtray 02:59:14 Well it shuts down in 4 hours 02:59:26 bintray their website says "UPDATE 4/27/2021: We listened to the community and will keep JCenter as a read-only repository indefinitely. Our customers and the community can continue to rely on JCenter as a reliable mirror for Java packages. 02:59:26 " 02:59:36 JCenter ~= main Bintray 02:59:40 how often does grab-site check the 'delay' file? (does it depend on the current delay?) 02:59:46 *!= 03:00:04 JCenter is kind of integrated into Bintray but also standalone. 03:00:30 You can't discover the content on JCenter once Bintray is down, even though it will still be there for a while. 03:00:55 i set it to something VERY large while i fixed up the ignores, but now i've set it back to 0 and it's not started again... any way to signal for a re-check? 03:01:23 I copied that from https://jfrog.com/blog/into-the-sunset-bintray-jcenter-gocenter-and-chartcenter/ which has other date info too 03:02:12 thuban: Yes, it depends on the current delay, and no, there isn't such a signal. 03:02:44 whoops. 15 minutes it is, i guess 03:02:55 Not entirely sure how it's implemented exactly in grab-site, but I think it's similar to AB, which checks the settings after every URL. 03:03:06 So yeah, don't go too high on the delay settings. :-) 03:03:19 15 minutes doesn't sound too bad though. 03:03:30 We frequently use 3 or 5 minutes on AB. 03:05:34 i didn't actually calculate it, i just threw in 1000000 because 100000 seemed too small and i thought i'd be able to change it again 03:09:14 OrIdow6: hi, so this is not ready yet for warrior? 
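For context on the delay-file exchange above: grab-site (like ArchiveBot) only re-reads its settings between URLs, so a huge delay value means the next re-check is a long way off. A minimal sketch of that behavior — file layout and function names are hypothetical, not grab-site's actual internals:

```python
import time

def read_delay(path):
    """Read the current inter-request delay in seconds; 0 on any problem."""
    try:
        with open(path) as f:
            return float(f.read().strip())
    except (OSError, ValueError):
        return 0.0

def crawl(urls, fetch, delay_path="delay"):
    """Fetch each URL, consulting the delay file only after each one.

    This mirrors the behavior described above: if the file says 1000000,
    the crawler sleeps that long before it ever looks at the file again,
    so there is no way to signal an early re-check.
    """
    for url in urls:
        fetch(url)
        time.sleep(read_delay(delay_path))
```

Hence the advice not to go too high on the delay: lowering the value back down only takes effect after the current sleep expires.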
03:09:19 or is it 03:09:30 if yes I'll get it up 03:09:32 right 4 hours 03:09:39 please confirm ^ 03:09:42 or JAA ^ 03:09:49 oh, other question: i forgot to change to my external drive before starting the crawl :( fortunately i _think_ i've got enough space anyway, but is there a way to move data without losing the state? 03:11:15 arkiver: I only looked over it briefly, no idea. 03:11:45 JAA: shut down in 4 hours? 03:11:51 shutdown* 03:12:35 Unsure, I haven't seen a time announced anywhere, but maybe I missed it. 03:12:42 But it's going down today (1 May). 03:12:54 alright 03:13:15 and we got a list of items? 03:13:17 And they've been warning users with brown-outs and whatnot, so I don't expect it to stay longer. 03:13:31 We have a list of users, and everything else can be discovered from there. 03:13:43 haha is that file actually named .zstandard :P 03:14:09 Yep lol 03:14:13 OrIdow6: For the future, it's .zst 03:14:57 arkiver: AFAIK it is, except for branding (which I can go and change now) and backfeed 03:15:08 Which needs an URL (right now it's example.com) 03:15:20 JAA: I know, but I couldn't remember at the time 03:17:22 Well, and technically the file: item type isn't implemented, but as discussed previously that's deliberate at this point as I suspect that's a lot of data that's mostly available elsewhere 03:18:28 JAA: no error messages, but my queue seems to be stuck. any idea why? 03:19:32 ¯\_(ツ)_/¯ 03:19:53 ;_; 03:21:08 By the way, there are also branded subdomains like google.bintray.com. The files are available under different URLs for those, either dl.bintray.com with an expiring token or that subdomain. 03:28:32 oh, it resumed. apparently readline timed out. 03:30:46 OrIdow6: alright, no time for me to test it much so I'll just fix those things and get it started now 03:31:12 (i kind of suspect one of the threads might still be stuck) 03:32:05 OrIdow6: stdout_sorted.txt is your item list right? 03:32:35 OrIdow6: what is all the aimix-z stuff? 
03:33:03 right old code 03:33:06 JAA: Hm, thanks for pointing out that it downloads them differently, may need to handle that differently 03:33:16 arkiver: That's the intended item list, yes 03:33:36 Aimix-Z is another site that seems borderline impossible to archive because it aggressively bans people 03:33:56 ok 03:35:56 OrIdow6: i'm replacing zst with gz 03:36:01 I am thinking of trying to use backfeed to make a super-distributed crawl, where each item is just 3 urls and it recurses around 03:36:46 arkiver: Did I do zst wrong again? 03:36:46 OrIdow6: only thing that needs changing is backfeed? 03:37:00 OrIdow6: no, i'd rather use gz when we're not going dicts 03:37:10 won't save much, and gz is still the default for WARCs in general 03:37:18 doing* 03:38:30 arkiver: Give me a few minutes to quickly fix this thing J A A reminded me of 03:38:46 And this isn't the ideal final version, but I sort of ran out of time 03:38:54 And it should work nonetheless 03:39:02 not ideal is fine now 03:39:27 OrIdow6: ok please PR it to the archiveteam clone 03:39:28 Well, J A A told me of half and reminded me of the other half 03:39:35 Ok 03:42:54 OrIdow6: all fixed and pushed 03:43:11 Testing this change to see if it breaks anything 03:43:15 Thanks 03:44:33 OrIdow6: the change jaa proposed? 03:45:11 items queued 03:45:30 arkiver: He didn't propose a change, he told me about a corner case (roughly) 03:46:04 crap we need a target 03:46:14 EggplantN: you around? or HCross Kaz 03:46:41 OrIdow6: any rough size estimate? 03:46:42 TBs? 03:46:45 or not 03:47:06 All the Brits are probably asleep. 
:-/ 03:47:37 I'd say high GBs or low TBs 03:47:37 yeah i should be as well 03:47:48 ok good 03:47:49 As this should reject all big files 03:48:00 Well, queue them as file:, which isn't implemented yet 03:48:05 Another fun edge case: https://bintray.com/griffon/griffon-plugins?offset=16&max=8&repoPath=%2Fgriffon%2Fgriffon-plugins&sortBy=lowerCaseName&filterByPkgName= 03:48:15 Links to a package that isn't under griffon/griffon-plugins. 03:48:50 Should be able to handle that 03:49:01 As items are users 03:49:35 Yeah, it shows up under sleonidy as well. 03:50:18 In fact, bintray/jcenter is full of these. 03:50:48 I noticed bintray just showed up on the tracker. Is there a channel for that yet? 03:51:01 * endrift scrolls up 03:51:06 I think that means those are also all available under two different URLs. 03:51:06 ah, not yet 03:51:54 What do you mean, shows up? 03:52:09 how does FOS work again 03:52:10 It's included on the user's repos/packages. 03:52:23 So it appears twice. 03:53:24 we'll use a target on FOS 03:53:31 That'll be fun. 03:53:39 OrIdow6: excuse the ping, is the update ready? 03:53:48 Just making the PR 03:53:54 perfect 03:55:08 Alright https://github.com/ArchiveTeam/bintray-grab/pull/1 arkiver 03:56:11 strict.lua was a thing I was using during testing, that would crash upon reading from new variables instead of returning nil 03:57:17 Which apparently I did actually add to git, but oh well 03:58:13 JAA: looks like FOS is still working :P 03:59:11 Yeah, for now. 03:59:19 OrIdow6: started! 03:59:21 people should update 04:00:16 arkiver: Thanks 04:00:26 OrIdow6: why the if not something? 04:00:34 Where? 04:00:46 where we normally check if to_send is nil 04:00:54 before setting the first discovered item 04:01:16 Because strict.lua broke it for some reason 04:01:35 odd 04:02:20 OrIdow6: i think we can just make this multi item size 1 04:02:22 Is drone building a docker image? 
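strict.lua, mentioned above, makes reads of variables that were never assigned crash instead of silently returning nil — handy for catching typos during testing. A loose Python analog of that idea (class name and details are invented for illustration; the real thing is a Lua metatable trick):

```python
class StrictNamespace:
    """Error on reads of never-assigned names instead of yielding a default.

    A rough Python analog of the strict.lua testing helper described
    above, which turns reads of undeclared Lua globals into hard errors.
    """

    def __init__(self):
        object.__setattr__(self, "_vars", {})

    def __setattr__(self, name, value):
        self._vars[name] = value

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        vars_ = object.__getattribute__(self, "_vars")
        if name in vars_:
            return vars_[name]
        raise NameError(f"read of undeclared variable {name!r}")
```

The trade-off mentioned later in the log applies here too: code that deliberately relies on "unset means nil/None" (like the `if not to_send` check) breaks under this kind of strictness.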
04:04:31 changed to multi item size 1 04:04:36 i'll be off now for some sleep 04:04:40 gotta get up early 04:05:15 arkiver: OK 04:05:17 Goodnight 04:05:29 I hope you're not getting up early for this project 04:06:16 no not for this project :) 04:07:06 thanks for the work on this, it's good we at least archive something here 04:07:41 I'll have a rough size estimate for the files in a bit. 04:14:34 Extrapolated from a 1‰ sample of all users, there should be on the order of ten million files with a total size of 10 TB. May easily be off by quite a bit though since it's such a small sample. 04:18:16 Not too bad 04:18:27 Yeah, I'd expect it to vary a lot 04:21:40 Looks like some users 404. 04:21:58 Two examples: sfali, olacabs 04:22:50 It should deal with those correctly 04:23:14 Well, trying it out, I forgot to check for 200, so it makes 3 unnecessary requests 04:23:27 But it correctly succeeds 04:24:33 :-) 04:27:00 Is there a recommended concurrency? 04:27:53 Not yet 04:28:27 Go nuts. I haven't seen any issues at high concurrencies. 04:28:38 (Not running this, but on that sample above.) 04:31:03 Ok, they start 429ing somewhere between 50 and 100 concurrency with qwarc. 04:32:46 Got it 04:32:50 I'm getting some 401s, is that normal? 04:32:57 Where? 04:33:13 example: 80=401 https://api.bintray.com/maven/zdmytriv/vgs-aws-maven/aws-maven/;publish=1 04:33:24 Makes the worker sleep 04:33:53 problem: as much as i'd like to have outlinks on this ah.com run, for context, i'm concerned there won't be time (and i've done enough of the priority content that i don't want to re-run as --no-offsite-links) 04:33:56 solution: add hacky negative-match ignore, then gs-dump-urls skipped and run them in a separate crawl (or even feed them to archivebot) when i'm done, y/n/q? 04:34:01 It shouldn't be going there 04:34:40 thuban: That's what I've been doing on AB, yeah. Negative lookahead ignore. 04:35:00 Make sure to not miss subdomains, URLs with ports, etc. 
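The extrapolation above is straightforward scaling from a uniform sample. Here is the arithmetic with a made-up sample; only the 1‰ fraction and the rough totals (on the order of ten million files, ~10 TB) come from the discussion:

```python
def extrapolate(sample_files, sample_bytes, sample_fraction):
    """Scale counts from a uniform sample of users up to the whole site."""
    scale = 1 / sample_fraction
    return sample_files * scale, sample_bytes * scale

# Hypothetical 1-permille sample: 10,000 files totalling 10 GB
files, size = extrapolate(10_000, 10 * 10**9, 0.001)
# -> roughly 10 million files and 10 TB (10**13 bytes)
```

As noted in the log, a 1‰ sample gives only an order-of-magnitude estimate: a handful of unusually large users in or out of the sample can swing the total considerably.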
04:36:29 JAA: '^((?!alternatehistory.com).)*$' lgty? 04:37:02 lax but this is almost certainly io-bound so i don't know that it matters 04:38:43 I suppose that would work. 04:39:07 I usually do something like ^https?://(?!([^/]*\.)?example.org(:\d+)?/) 04:39:18 Er, example\.org 04:39:59 welp, here goes 04:41:57 thuban: Might want to turn igon on to verify 04:43:00 jodizzle: thanks, but dashboard / gs-dump-urls in_progress look good and i don't want to slow it down 04:43:42 https://github.com/ArchiveTeam/bintray-grab/pull/2 - misc changes - can someone accept this? 05:25:32 JAA: Do you want to accept that? Fine if you defer 05:25:51 jodizzle: Seeing any more errors? I see it's slowed 05:28:09 OrIdow6: I was trying to stop the container gracefully to restart with higher concurrency, but it's still doing backoff from that 401 link. I guess I should just kill it? 05:29:03 Yeah 05:29:14 It will just abort anyway 05:31:06 OrIdow6: Seems fine, merged. 05:31:41 Thanks JAA 05:38:02 Some files are actually served directly on bintray.com, by the way. 05:38:25 E.g. those in the package jfrog/jfrog-mission-control/mc-docker-installer 05:39:12 Er actually, that's an EULA. Great. 05:39:49 AFAICT it does that with small files (threshold somewhere around 1 MB), so that's how I determine it 05:40:22 It gets files directly on the site in the user: item and then queues CDN ones as file: item 05:41:15 Nope, I've seen plenty of small files get served via dl.bintray.com. 05:41:30 But it's a 302 redirect. 05:41:40 That's what I mean 05:41:46 Oh, I see with the EULA 05:42:05 I thought you meant it was a license in a file 05:42:18 Ah 05:42:47 Yeah, no, intermediate page with a scripty button. 05:46:06 And also, https://dl.bintray.com/jfrog/jfrog-mission-control/ is serving completely different files than what's listed on https://bintray.com/jfrog/jfrog-mission-control/mc-docker-installer 05:47:03 Files that aren't even under any project. 
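The lookahead pattern JAA suggests can be sanity-checked quickly. A small Python demo of the same shape of ignore regex (grab-site ignores are regexes matched against the full URL; example.org stands in for the real site):

```python
import re

# Ignore everything that is NOT on example.org: any subdomain, optional
# port, following the corrected pattern quoted above.
ignore = re.compile(r"^https?://(?!([^/]*\.)?example\.org(:\d+)?/)")

assert ignore.match("https://other.site/page")             # offsite: ignored
assert not ignore.match("https://example.org/page")        # onsite: kept
assert not ignore.match("http://www.example.org:8080/x")   # subdomain + port: kept
```

This is also why the simpler `^((?!alternatehistory.com).)*$` form is lax: the unescaped dot and the anywhere-in-URL match would also keep, say, an offsite URL that merely mentions the domain in its query string.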
05:58:02 https://github.com/ArchiveTeam/seesaw-kit/pull/121 - there was an attempt to do a thing 06:00:31 I'll leave that to someone else as I have zero experience with seesaw's web interface. 06:00:55 Alright 06:05:44 where does this get logged to anyways 06:07:46 A website that's currently down. 06:09:52 rip 06:20:38 Good morning world, what is needed here 06:24:37 more hard drives 06:25:01 Hello HCross 06:25:37 Apropos of the hastily-started (which was my fault) Bintray project, workers and preferably a target that's not FOS 06:26:18 Well, for all I know FOS is fine 06:27:15 Shouldn't be much data, and site may shut down in half an hour anyway 06:35:06 Let me get out of bed and I’ll throw workers at it 06:35:23 Rate limits? 06:35:46 I started getting 429s between 50 and 100 concurrent with qwarc. 06:35:59 No idea what that translates to. 06:36:25 Size per item? 06:36:39 Sorry, trying to size this 06:37:15 50 conc with qwarc corresponded to 65 req/s. 06:37:35 Items aren't big, below 1 MB on average. 06:38:58 does grab-site retry on 'Connection closed' errors? 06:40:30 Yes. Not 'Connection refused' though as far as I can see. 06:43:41 hm, ok. i'm seeing 0s erroring without corresponding 200s following; is that just just an ordering issue? 06:47:21 (gs-dump-urls lists one such in 'error' rather than 'todo' or 'in_progress' but i'm not sure whether that status is intended as final) 06:48:48 JAA: sir, I believe you asked for some archivism 06:49:17 that has been delivered 06:50:59 "Note that, unlike wget, wpull puts retries at the end of the queue." oh, hopefully that's it. 
nts, check up on this 06:59:16 I've turned it up a bit 06:59:32 Thanks HCross 06:59:53 if we crash into FOS we can deal with that 07:00:34 Nice 07:01:01 I'm seeing some 502s 07:01:17 methinks Bintray may be distressed 07:01:18 Midnight, seems it's still going 07:01:25 but it's just crossed 8am London and we're still alive 07:01:59 Average response time went from 1 to 7 seconds in the past couple minutes for me. 07:02:21 yep 07:02:32 im still pulling quite hard 07:02:40 but let me know if you want me to back the truck off 07:03:50 thisisfine.png :-) 07:04:04 I'm about to drive the truck in even harder 07:04:44 Response time has come down again to 2.5 s for me (one-minute average). 07:06:24 Archiving Truck has been revved up 07:06:39 and is now crashing head first into the Binary wall 07:06:42 Bintray 07:07:26 Yes Rico, kaboom. 07:15:19 JAA: im getting some really big items 07:15:20 is that normal 07:15:39 Hmm 07:15:52 Have some examples? 07:16:05 Files shouldn't be downloaded yet as I understood it. 07:17:22 unfortunately I don't, as it sped past 07:17:53 are we discovering as we go 07:17:58 Hmm yeah, I see now, average item size is 150-ish MiB. 07:18:03 OrIdow6: Is that expected? 07:18:42 JAA: Items coming in are still mostly under 1 MiB; what do you mean 07:18:46 There is backfeed, but I believe the initial list should already be virtually complete. 07:19:05 ? 07:19:09 I have items in the thousands of URLs 07:19:20 2065=200 https://bintray.com/nus-ncl/generic/services-in-one/1-98bd8b8?versionPath=%2Fnus-ncl%2Fgeneric%2Fservices-in-one%2F1-98bd8b8 07:20:08 see how my items done count dropped, but the size shot up 07:20:26 I'm not sure what versionPath is, but that does look like it should correctly have 1000s of URLs 07:20:31 That item 07:20:57 Oh yeah, I was misreading that graph. 07:21:19 Some items are in the 10s of MB, but most are still small. 
07:21:29 give me a few minutes 07:21:32 and I'll double again 07:22:03 this will be like the opening minutes of Parler again 07:22:19 Found an image of HCross: https://i.ytimg.com/vi/BvXxIWkcWrA/maxresdefault.jpg 07:22:33 "where did all the items go, we queued a ton" "harry claimed them all" _brief pause_ "harry checked them all back in very quickly" 07:22:56 EggplantN: "oh fuck, oh fuck... FUCK" 07:23:18 lol 07:23:41 I think JFrog's servers will fall over before ours this time. 07:23:53 Unless they have some scaling going on. 07:23:59 EggplantN actually phoned me to yell at me over that 07:24:35 Looks like the main site's hosted in Dallas, by the way. 07:25:03 so I'm hauling it all back to the EU 07:25:03 woo 07:25:23 And then back to FOS in California. lol 07:25:58 Oh well, the real fun will be when/if we grab the actual files. 07:26:14 Very rough estimate puts that at 10M files and 10 TB. 07:30:18 if we get that, I'll move over to my California colo 07:30:22 and start going BRRR 07:32:17 this may be an ideal candidate for meta if we need more targets 07:32:30 JAA: shall we make a channel? 07:32:45 Those aren't going to Dallas, by the way. Amazon and Google CDN as far as I've seen. 07:32:59 It does 07:33:10 Because it has a token that expires 07:33:20 So what's queued isn't the CDN URLs, it's the redirects to them 07:34:08 Only very few have tokens. 07:34:16 But I do see redirects. 07:34:39 All the ones I looked at had tokens 07:34:44 Can you give examples? 07:36:09 About three quarters of the ones I've collected in a test run didn't have tokens. 
07:36:12 Perhaps I was biased towards a certain type of file while manually exploring the site 07:37:04 176k of 239k plain dl.bintray.com URLs 07:37:26 A couple random projects that have those: k8ty-app/maven/k8ty-nltk adfactory/maven/adfactory_android est7/maven/rx2errorhandler 07:37:41 I do wonder if they've got an autoscaler that I can crash into harder 07:37:50 if they're in "the cloud" :tm 07:39:04 They seem to be using IBM's hosting. networklayer.com shows up prominently in the routes. 07:39:12 yep 07:39:52 they're hauling me all the way from London on the IBM backbone 07:40:46 I'm going to NY via Level3 first. 07:41:46 JAA: Redirect from what to what? If you mean redirects to dl., it does follow those 07:42:14 ah, I have direct peering with IBM in London 07:42:32 so this is actually very cheap 07:43:01 OrIdow6: .../download_file redirects to dl.bintray.com but without tokens in the latter URL for the majority of projects. 07:43:49 JAA: Oh, I see 07:44:00 Are we already grabbing those? 07:44:10 So the dl. urls themselves can redirect to a CDN or get served directly from dl. 07:44:29 In the former case, they will be queued as file: items; in the latter, they will be fetched as part of the user: item 07:44:40 OH 07:44:58 Ok, that explains some things. 07:45:05 I've been getting some weird 400s on some funky urls. Not sure if this is normal? https://jakel.rocks/up/fd73e7198ba6777f/urls 07:45:34 With some nuance to account for custom subdomains 07:46:03 Jake: That doesn't look right 07:46:21 (As well as 403s on some S3 objects) https://jakel.rocks/up/d4713371c935c8cb/s3-403s 07:47:02 wee 07:47:10 I appear to be downloading most of Kubernetes source code 07:47:16 Right, so we're grabbing all the smaller files, but the larger ones that redirect to Cloudfront get queued to backfeed. 07:47:29 Jake: Do you have the full logs for the first one? 
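To summarize the split being worked out here: dl.bintray.com either serves a file directly (archived inside the current user: item) or 302-redirects to a CDN (queued to backfeed as a file: item, not yet implemented). A sketch of that decision, with invented function and item names — the real Lua script's logic has more nuance (custom subdomains, expiring tokens):

```python
def classify_dl_response(status, location=None):
    """Decide what to do with a response from dl.bintray.com."""
    if status == 302 and location and "bintray.com" not in location:
        # Redirect to an external CDN host: queue a separate file: item.
        return ("queue", "file:" + location)
    if status == 200:
        # Served directly: fetch it as part of the current user: item.
        return ("fetch", None)
    # Anything else (including internal redirects) needs its own handling.
    return ("skip", None)
```

As the exchange above shows, the tricky part is that which path a file takes doesn't correlate neatly with size or token presence.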
07:47:31 6701=200 https://dl.bintray.com/fabric8/fabric8/.images/de/de7821b9943bd0498290d6e45b0a5f336ca53cb0a101817f4858543fb936d3ae/layer.tar 07:47:37 so I should be seeing that? 07:47:49 OrIdow6: No full logs for the first one. I'll see if I can get some. 07:48:05 The second one is an avatar URL that's had some problem extracting; as it's a 403 on S3 I think it's worth leaving the lenient extractor in 07:48:19 I thought we were skipping files entirely for now. But yeah, that's expected then, HCross. 07:48:31 And I guess the average item size will not stay below 1 MB in that case. 07:48:34 ah right 07:48:43 if we're getting a lot of these I may need to rethink a few things 07:49:31 Though they're only the smaller files. Larger ones are on the CDN and not grabbed yet. 07:49:46 Random example of such a CDN redirect: https://dl.bintray.com/kuende/k8s/kube-apiserver 07:51:24 Y’all need fire power or is bincentre close to falling over 07:51:27 I found the 400 again OrIdow6: https://jakel.rocks/up/e0799a8d20ba0c30/bintray 07:51:38 EggplantN: im getting backed off to 1024 seconds 07:51:39 lol 07:51:49 but im going to see if that was a one off 07:51:53 and if I can push harder 07:52:05 I was gonna bring the warriors 07:52:26 I think there's a few issues with the script first 07:52:39 Jake: Thanks 07:53:07 EggplantN: not yet 07:53:11 lets iron out the script 07:53:17 and we'll need targets 07:53:44 FOS seems fine so far? We will need targets for the large files though. 07:54:16 But we don't even know yet when this all gets taken down. 07:54:42 Also, I feel for the poor lad who will get the user:bintray item. 07:54:46 HCross deploy at-offload 07:55:25 will do when needed 07:56:44 https://www.irccloud.com/pastebin/jYMVpd0h/ 07:56:46 JAA: ^ 07:56:49 I presume that's intentional 07:57:04 Yep 07:57:08 Those are the large files. 07:58:51 Jake: I am running the problem item you found with a bunch of debug output right now, may take a while 07:59:02 sure, no problem. 
Should we stop while you do that? 07:59:21 The problem here seems to be that it is getting too much rather than too little 07:59:58 So unless or until stuff like this predominates, I think it's best to continue 08:00:19 Seeing as the site is apparently in the sort of limbo state where it's already supposed to have shut down 08:00:37 I do think a channel would be a good idea 08:00:54 https://twitter.com/steveonjava/status/1387072410868797440 08:00:56 JAA: ^^^ 08:01:02 does that mean we have a reprieve 08:01:26 but we should still go as hard as we can 08:02:04 ark iver writes a project: The channel is created before it's clear there's even going to be a project in the first place 08:02:17 I write a project: The channel may or may not be created after it starts running 08:02:36 and I crawl out of bed, straight to my laptop and start 08:03:01 im sat here in pyjamas, server wrangling 08:03:10 HCross: JCenter != Bintray 08:03:18 ahh 08:03:36 JCenter's index is on Bintray, but it's still kind of separate. 08:03:48 Once Bintray goes down, we can't discover JCenter's content. 08:04:00 ew 08:04:09 Or at least not as far as I know. 08:04:17 They had directory listing before but removed that. 08:06:53 Ah, I see what the problem is 08:09:00 Yeah, let's make a channel. binnedtray? 08:10:04 ashtray? I'm horrible at channel names. 08:10:32 If you want to be confusing, bitbucket 08:10:56 So I had two checks to make sure Jake's problem didn't happen, and made a mistake in one and messed up the other one with a later commit 08:11:23 Oh, "bin" != "bit" 08:12:15 Channel name rules: no confusion, much pun. 08:13:44 Oh, I see atphoenix already suggested binnedtray earlier. 
:-) 08:17:10 :) 08:19:45 which one are we using 08:22:05 By the way, I just pushed commits to my copy of the repo, do *not* merge these, I made a mistake 08:30:06 So it turns out I had it right, I just messed up when reviewing my changes 08:30:40 Since nobody's saying anything about the channel names, I'll just decide: #binnedtray 08:30:44 https://github.com/ArchiveTeam/bintray-grab/pull/3 08:31:14 yeah, #binnedtray is good! 10:03:29 OrIdow6: if you want a channel, create one 17:14:27 In the big PATCHED geocities torrent there is the file geocities.archiveteam.torrent/WORKSHOP/SEEDS.tar.bz2 with the file inside all-seed-47 listing a site that I'd like to have (geocities.com/SiliconValley/Lakes/4468). Sadly in UPPERCASE/geocities-S-i.7z.001 there isn't the Lakes/4468 :-( . I've written a script and 7z-unzipped / tar-listed all the files from the torrent: no luck for Lakes/4468 17:14:35 Is there anywhere else I could look? 17:29:49 (also, thanks for the geocities torrent and other archiving efforts, it's fantastic) 18:48:21 second pass of ah.com non-political chat is done; all threads 3 pages or under should now be completely saved :> 20:57:33 second pass of political chat and fourth pass of non-political chat are done 22:09:32 JAA: TM-exchange trackdata almost done. got to process that data locally to extract the User profiles for another pass 22:37:58 Lovely