03:17:14 just making sure, bintray has been paused for some time, right? just curious, is there something being worked on? considering scaling down my workers
03:19:23 etnguyen03: Join #binnedtray
07:33:49 outlinks from my ah.com crawl: archivebot or the urls project?
07:34:30 (about 300k from the first, smaller forum; second forum not finished yet)
08:56:26 Just wondering if anyone is using / has experimented with an AWS / GCP / Azure "spot"-style fleet of micro instances to get lots of IPs and burst-download projects
09:08:02 are the WARRIOR SUPPORTED messages accurate? they say that yahoo answers is not warrior compatible, but it clearly is
09:11:22 Those are kind of old, back from when a lot of the projects didn't support the warrior. I think every project should work on the new warrior now?
09:15:17 thought so, also some of those projects seem to have been completed, like google sites, reddit and periscope? are they still running? the trackers seem empty
09:16:44 reddit is a continuous project
09:17:02 it regularly gets new tasks since it's effectively a "tail -f" on new content
09:17:46 got it, and the others?
10:14:22 JAA: TM-exchange trackpages are all uploaded to archive.org. Gotta write another quick tool to extract the userpage URLs from that crawl
10:33:42 https://archiveteam.org/ is missing a favicon. I've designed two potential ones.
10:34:21 https://etc.sanqui.net/at_favicon.png - a simple favicon inspired by the 3.5 inch floppy logo and using EGA colors.
10:34:46 https://etc.sanqui.net/atyahoo_favicon.png - modeled after the old Yahoo! favicon
10:35:14 I personally like the second one, even though it doesn't match any logo we currently use; any wiki admin down to implement it?
10:35:47 That would presumably be J R W R
10:36:29 First one looks almost solid black to me, hard to make out details
10:36:38 Except for the corner
10:36:53 yeah, it's difficult to depict a solid black floppy
10:37:09 so i don't think it would make for a good favicon even if I refined it more
10:39:21 Yahoo thing is clever - with the AB dashboard one (which I'd guess, with maybe 40% confidence, you made as well) it establishes a sort of theme
10:41:18 nope, I didn't make that one! but I do like it and it inspired me to mimic the Yahoo one, yeah
10:45:24 https://etc.sanqui.net/at2_favicon.png
10:45:37 here's one that's a bit more closely modeled after the current logo
10:45:49 all are also available as .icos at the same address for ease of use
10:48:26 I like the Yahoo one more, but it seems to me that from a "marketing" angle the floppy one (and the new one is a lot better) makes more sense
10:49:18 But anyhow, it's not like I'm in charge of this
10:49:59 yeah, it's alright, it's not like this is a priority in any way, but I just made a good old bookmark bar and the AT wiki sticks out for not having a favicon haha
10:50:09 so I thought I'd put my lackluster pixel art skills to use
15:14:48 Switched to torrent for the Stack Exchange dump, that was much faster for some odd reason. Extracting now, and after that outlinks should be extracted soon
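[The sort of quick URL-extraction tool mentioned in the 10:14:22 message can be as small as the Python sketch below. It assumes the crawl has been extracted to plain .html files on disk, and the "/usershow/" userpage pattern is a hypothetical placeholder, not TM-exchange's actual URL scheme.]

    # Minimal sketch of a "quick tool" to pull userpage URLs out of a crawl.
    # Assumptions (not from the discussion above): the crawl is a directory
    # of extracted .html files, and userpage links contain "/usershow/" --
    # both are hypothetical placeholders.
    import re
    import sys
    from pathlib import Path

    USERPAGE_RE = re.compile(r'href="(https?://[^"]*/usershow/[^"]*)"')

    def extract_userpage_urls(crawl_dir):
        """Yield each userpage URL found under crawl_dir, deduplicated."""
        seen = set()
        for path in Path(crawl_dir).rglob("*.html"):
            for url in USERPAGE_RE.findall(path.read_text(errors="replace")):
                if url not in seen:
                    seen.add(url)
                    yield url

    if __name__ == "__main__":
        for url in extract_userpage_urls(sys.argv[1]):
            print(url)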
20:49:49 does grab-site have a secret concurrency limit like seesaw-kit?
20:53:42 how do you mean? a hardcoded value that can't be exceeded by config?
20:54:02 otherwise: the server and its response duration affect the effective rate, too
20:58:01 i mean 'a limit beyond which it starts to get flaky'
20:58:57 i'm currently at 20 and trying to decide whether i should go higher (seeing as i'm technically past the deadline already and there are a lot of pages left to do)
21:00:00 DO NOT CHANGE THE CODE
21:00:09 DO NOT FIDDLE WITH THE WARRIOR/PROJECT CODE
21:00:26 dude chill this is grab-site
21:00:27 If you're using grab-site then that is fine
21:00:45 but I would stick to 20; otherwise it does do strange things afaik
21:00:57 ah, oof
21:01:20 Doesn't wpull have a built-in per-domain concurrency limit?
21:04:59 Not an expert, but according to some logs I have, wpull maxes out at 6 connections per (host, port, use_ssl) tuple. Since grab-site uses wpull, you might be limited by that.
21:08:10 If there is a site that is increasingly fragile and I'm afraid will die due to neglect, is it better for me to use grab-site on my own to make sure it gets archived properly, and then upload to archive.org, or to ask one of you to have ArchiveBot do it?
21:10:15 The nice thing about AB, other than convenience, is that the data ends up in the WBM. Uploading to archive.org on your own doesn't do that.
21:10:22 ok
21:11:16 But you can also do both if you're particularly concerned (just be mindful of IA resources)
21:11:35 I seriously doubt it would be over 1GB total
21:11:42 probably not even 100MB
21:12:01 What is the site?
21:12:24 pemberley.com
21:12:53 note its very, very slow load time, despite not having anything wild going on
21:14:37 Okay, sure. Join #archivebot
21:14:40 it should probably only be archived at 1 or 2 concurrency, and slowly
21:15:31 oof, those loading times though
21:17:13 yeah - that's why I'm afraid for it
21:17:41 Early modern literature sites run old-school
21:18:13 that seems to be a static site though?
21:18:16 it's GoDaddy too
21:19:06 it should be blazing fast, and super cheap to host
21:19:13 oh no, that's WordPress
21:19:15 huh
21:20:12 it'd probably be cheaper for them to move to wordpress.com
21:21:29 but anyway - in general, if I find other precarious-looking sites, what's the best way to get them done by ArchiveBot? Make the request in that channel, and wait for an op to queue it for the bot?
21:21:53 Pretty much
21:21:56 That's what I do
21:21:57 basically, yup
21:22:11 (Or if chat scrolls too quickly, throw it in here if it doesn't get seen in ab)
21:22:50 so many mid-sized, deeply linked literary sites, done by hand starting in the late 90s
23:37:53 thuban: wpull is fine at very high concurrencies in principle; it doesn't have the race condition issues seesaw has. But jodizzle's right, there's a hardcoded limit of 6 connections per host/port/use_ssl. Further, all processing is single-threaded. You'll quickly run into bottlenecks on SQLite, HTML parsing, WARC writing, Python's cookie jar, etc. I find that concurrencies above around a dozen are
23:37:59 rarely useful. (Exceptions prove the rule.)
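[The 6-connection cap described above can be pictured as a connection pool keyed by the (host, port, use_ssl) tuple, with a per-key semaphore. The sketch below is a minimal illustration of that idea, not wpull's actual implementation; the names and the pemberley.com example are placeholders.]

    # Illustrative sketch of a per-(host, port, use_ssl) connection cap:
    # each key gets its own semaphore of 6, so no matter how many workers
    # you run, at most 6 requests to one origin are in flight at once.
    # This is NOT wpull's real code, just the shape of the mechanism.
    import asyncio
    from collections import defaultdict

    MAX_PER_HOST = 6  # the hardcoded limit mentioned above

    class KeyedConnectionLimiter:
        def __init__(self, limit=MAX_PER_HOST):
            self._sems = defaultdict(lambda: asyncio.Semaphore(limit))

        async def fetch(self, host, port, use_ssl, do_request):
            # All requests to the same (host, port, use_ssl) share one
            # semaphore, so only `limit` of them ever run concurrently.
            async with self._sems[(host, port, use_ssl)]:
                return await do_request()

    async def demo():
        limiter = KeyedConnectionLimiter()

        async def fake_request():
            await asyncio.sleep(0.1)  # stand-in for a real HTTP fetch
            return "ok"

        # 20 workers against one origin: only 6 proceed at any moment.
        await asyncio.gather(*[
            limiter.fetch("pemberley.com", 443, True, fake_request)
            for _ in range(20)
        ])

    asyncio.run(demo())

[Combined with wpull's single-threaded processing, this is why raising grab-site's --concurrency much past a dozen tends to add contention rather than throughput, matching the observation in the last message.]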