03:17:14 just making sure, bintray has been paused for some time, right? just curious, is there something being worked on? considering scaling down my workers
03:19:23 etnguyen03: Join #binnedtray
07:33:49 outlinks from my ah.com crawl: archivebot or the urls project?
07:34:30 (about 300k from the first, smaller forum; second forum not finished yet)
08:56:26 Just wondering if anyone is using / has experimented with an AWS / GCP / Azure "spot"-style fleet of micro instances to get lots of IPs and burst-download projects
09:08:02 are the WARRIOR SUPPORTED messages accurate? they say that yahoo answers is not warrior compatible, but it clearly is
09:11:22 Those are kind of old, back from when a lot of the projects didn't support the warrior. I think every project should work on the new warrior now?
09:15:17 thought so, also some of those projects seem to have been completed, like google sites, reddit and periscope? are they still running? the trackers seem empty
09:16:44 reddit is a continuous project
09:17:02 it regularly gets new tasks since it's effectively a "tail -f" on new content
09:17:46 got it, and the others?
10:14:22 JAA: TM-exchange trackpages are all uploaded to archive.org. Gotta write another quick tool to extract the userpage URLs from that crawl
10:33:42 https://archiveteam.org/ is missing a favicon. I've designed two potential ones.
10:34:21 https://etc.sanqui.net/at_favicon.png - a simple favicon inspired by the 3.5 inch floppy logo and using EGA colors.
10:34:46 https://etc.sanqui.net/atyahoo_favicon.png - modeled after the old Yahoo! favicon
10:35:14 I personally like the second one, even though it doesn't match any logo we currently use; any wiki admin down to implement it?
10:35:47 That would presumably be J R W R
10:36:29 First one looks almost solid black to me, hard to make out details
10:36:38 Except for the corner
10:36:53 yeah, it's difficult to depict a solid black floppy
10:37:09 so i don't think it would make for a good favicon even if I refined it more
10:39:21 Yahoo thing is clever - with the AB dashboard one (which I'd guess, with maybe 40% confidence, you made as well) it establishes a sort of theme
10:41:18 nope, I didn't make that one! but I do like it and it inspired me to mimic the Yahoo one, yeah
10:45:24 https://etc.sanqui.net/at2_favicon.png
10:45:37 here's one that's a bit more closely modeled after the current logo
10:45:49 all are also available as .icos at the same address for ease of use
10:48:26 I like the Yahoo one more, but it seems to me that from a "marketing" angle the floppy one (and the new one is a lot better) makes more sense
10:49:18 But anyhow, it's not like I'm in charge of this
10:49:59 yeah, it's alright, it's not like this is a priority in any way, but I just made a good old bookmark bar and the AT wiki sticks out for not having a favicon haha
10:50:09 so I thought I'd put my lackluster pixel art skills to use
15:14:48 Switched to torrent for the Stack Exchange dump, that was much faster for some odd reason. Extracting now, and after that outlinks should be extracted soon
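[The sort of quick URL-extraction tool mentioned in the 10:14:22 message can be as small as the Python sketch below. It assumes the crawl has been extracted to plain .html files on disk, and the "/usershow/" userpage pattern is a hypothetical placeholder, not TM-exchange's actual URL scheme.]

    # Minimal sketch of a "quick tool" to pull userpage URLs out of a crawl.
    # Assumptions (not from the discussion above): the crawl is a directory
    # of extracted .html files, and userpage links contain "/usershow/" --
    # both are hypothetical placeholders.
    import re
    import sys
    from pathlib import Path

    USERPAGE_RE = re.compile(r'href="(https?://[^"]*/usershow/[^"]*)"')

    def extract_userpage_urls(crawl_dir):
        """Yield each userpage URL found under crawl_dir, deduplicated."""
        seen = set()
        for path in Path(crawl_dir).rglob("*.html"):
            for url in USERPAGE_RE.findall(path.read_text(errors="replace")):
                if url not in seen:
                    seen.add(url)
                    yield url

    if __name__ == "__main__":
        for url in extract_userpage_urls(sys.argv[1]):
            print(url)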
20:49:49 does grab-site have a secret concurrency limit like seesaw-kit?
20:53:42 how do you mean? a hardcoded value that can't be exceeded by config?
20:54:02 otherwise: the server and its response duration affect the effective rate, too
20:58:01 i mean 'a limit beyond which it starts to get flaky'
20:58:57 i'm currently at 20 and trying to decide whether i should go higher (seeing as i'm technically past the deadline already and there are a lot of pages left to do)
21:00:00 DO NOT CHANGE THE CODE
21:00:09 DO NOT FIDDLE WITH THE WARRIOR/PROJECT CODE
21:00:26 dude chill this is grab-site
21:00:27 If you're using grab-site then that is fine
21:00:45 but I would stick to 20; otherwise it does do strange things afaik
21:00:57 ah, oof
21:01:20 Doesn't wpull have a built-in per-domain concurrency limit?
21:04:59 Not an expert, but according to some logs I have, wpull maxes out at 6 connections per (host, port, use_ssl) tuple. Since grab-site uses wpull, you might be limited by that.
21:08:10 If there is a site that is increasingly fragile and I'm afraid will die due to neglect, is it better for me to use grab-site on my own to make sure it gets archived properly, and then upload to archive.org, or to ask one of you to have ArchiveBot do it?
21:10:15 The nice thing about AB, other than convenience, is that the data ends up in the WBM. Uploading to archive.org on your own doesn't do that.
21:10:22 ok
21:11:16 But you can also do both if you're particularly concerned (just be mindful of IA resources)
21:11:35 I seriously doubt it would be over 1GB total
21:11:42 probably not even 100MB
21:12:01 What is the site?
21:12:24 pemberley.com
21:12:53 note its very, very slow load time, despite not having anything wild going on
21:14:37 Okay, sure. Join #archivebot
21:14:40 it should probably only be archived at 1 or 2 concurrency, and slowly
21:15:31 oof, those loading times though
21:17:13 yeah - that's why I'm afraid for it
21:17:41 Early modern literature sites run old-school
21:18:13 that seems to be a static site though?
21:18:16 it's GoDaddy too
21:19:06 it should be blazing fast, and super cheap to host
21:19:13 oh no, that's WordPress
21:19:15 huh
21:20:12 it'd probably be cheaper for them to move to wordpress.com
21:21:29 but anyway - in general, if I find other precarious-looking sites, what's the best way to get them done by ArchiveBot? Make the request in that channel, and wait for an op to queue it for the bot?
21:21:53 Pretty much
21:21:56 That's what I do
21:21:57 basically, yup
21:22:11 (Or if chat scrolls too quickly, throw it in here if it doesn't get seen in ab)
21:22:50 so many mid-sized, deeply linked literary sites, done by hand starting in the late 90s
23:37:53 thuban: wpull is fine at very high concurrencies in principle; it doesn't have the race condition issues seesaw has. But jodizzle's right, there's a hardcoded limit of 6 connections per host/port/use_ssl. Further, all processing is single-threaded. You'll quickly run into bottlenecks on SQLite, HTML parsing, WARC writing, Python's cookie jar, etc. I find that concurrencies above around a dozen are
23:37:59 rarely useful. (Exceptions prove the rule.)
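[The 6-connection cap described above can be pictured as a connection pool keyed by the (host, port, use_ssl) tuple, with a per-key semaphore. The sketch below is a minimal illustration of that idea, not wpull's actual implementation; the names and the pemberley.com example are placeholders.]

    # Illustrative sketch of a per-(host, port, use_ssl) connection cap:
    # each key gets its own semaphore of 6, so no matter how many workers
    # you run, at most 6 requests to one origin are in flight at once.
    # This is NOT wpull's real code, just the shape of the mechanism.
    import asyncio
    from collections import defaultdict

    MAX_PER_HOST = 6  # the hardcoded limit mentioned above

    class KeyedConnectionLimiter:
        def __init__(self, limit=MAX_PER_HOST):
            self._sems = defaultdict(lambda: asyncio.Semaphore(limit))

        async def fetch(self, host, port, use_ssl, do_request):
            # All requests to the same (host, port, use_ssl) share one
            # semaphore, so only `limit` of them ever run concurrently.
            async with self._sems[(host, port, use_ssl)]:
                return await do_request()

    async def demo():
        limiter = KeyedConnectionLimiter()

        async def fake_request():
            await asyncio.sleep(0.1)  # stand-in for a real HTTP fetch
            return "ok"

        # 20 workers against one origin: only 6 proceed at any moment.
        await asyncio.gather(*[
            limiter.fetch("pemberley.com", 443, True, fake_request)
            for _ in range(20)
        ])

    asyncio.run(demo())

[Combined with wpull's single-threaded processing, this is why raising grab-site's --concurrency much past a dozen tends to add contention rather than throughput, matching the observation in the last message.]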