-
etnguyen03
just making sure, bintray has been paused for some time right? just curious is there something being worked on? considering scaling down my workers
-
jodizzle
etnguyen03: Join #binnedtray
-
thuban
outlinks from my ah.com crawl: archivebot or the urls project?
-
thuban
(about 300k from the first, smaller forum; second forum not finished yet)
-
Daloader
Just wondering if anyone is using / has experimented with an AWS / GCP / Azure "Spot" style fleet of micro instances to get lots of IPs and burst download projects
-
Zopolis4
are the WARRIOR SUPPORTED messages accurate? they say that yahoo answers is not warrior compatible, but it clearly is
-
Jake
Those are kind of old, back when the warrior was not supported by a lot of the projects. I think every project should work on the new warrior now?
-
Zopolis4
thought so, also some of those projects seem to have been completed, like google sites, reddit and periscopes? are they still running? the trackers seem empty
-
masterX244
reddit is a continuous project
-
masterX244
it regularly gets new tasks since it's effectively a "tail -f" on new content
-
Zopolis4
got it, and the others?
-
masterX244
JAA: TM-exchange trackpages are all uploaded to archive. Gotta write me another quick tool to extract the userpage-urls from that crawl
-
Sanqui
archiveteam.org is missing a favicon. I've designed two potential ones.
-
Sanqui
etc.sanqui.net/at_favicon.png - a simple favicon inspired by the 7 inch floppy logo and using EGA colors.
-
Sanqui
etc.sanqui.net/atyahoo_favicon.png - modeled after the old Yahoo! favicon
-
Sanqui
I personally like the second one, even though it doesn't match any logo we currently use; any wiki admin down to implement it?
-
OrIdow6
That would presumably be J R W R
-
OrIdow6
First one looks almost solid black to me, hard to make out details
-
OrIdow6
Except for the corner
-
Sanqui
yeah, it's difficult to depict a solid black floppy
-
Sanqui
so i don't think it would make for a good favicon even if I refined it more
-
OrIdow6
Yahoo thing is clever - with the AB dashboard one (which I am going to guess at like 40% that you made as well) it establishes a sort of theme
-
Sanqui
nope, I didn't make that one! but I do like it and it inspired me to mimic the Yahoo one, yeah
-
Sanqui
here's one that's a bit more closely modeled after the current logo
-
Sanqui
all are also available as .icos at the same address for ease of use
-
OrIdow6
I like the Yahoo one more, but it seems to me that from a "marketing" angle the floppy one (and the new one is a lot better) makes more sense
-
OrIdow6
But anyhow, it's not like I'm in charge of this
-
Sanqui
yeah, it's alright, it's not like this is a priority in any way, but I kind of, just made a good old bookmark bar and the AT wiki is sticking out for not having a favicon haha
-
Sanqui
so I thought I'd put my lackluster pixel art skills to use
-
masterX244
Switched to torrent at the Stackexchange dump, that was much faster for some odd reason. Extracting now and after that outlinks should be extracted soon
-
thuban
does grab-site have a secret concurrency limit like seesaw-kit?
-
masterX244
how do you mean? hardcoded value that can't be exceeded by config?
-
masterX244
otherwise: the server and its response duration affect the effective rate, too
-
thuban
i mean 'limit beyond which it starts to get flaky'
-
thuban
i'm currently at 20 and trying to decide whether i should go higher (seeing as i'm technically past the deadline already and there are a lot of pages left to do)
-
HCross
DO NOT CHANGE THE CODE
-
HCross
DO NOT FIDDLE WITH THE WARRIOR/PROJECT CODE
-
thuban
dude chill this is grab-site
-
HCross
If you're using grab-site then that is fine
-
HCross
but I would stick to 20 otherwise it does do strange things afaik
-
thuban
ah, oof
-
jodizzle
Doesn't wpull have a built-in per-domain concurrency limit?
-
jodizzle
Not an expert, but according to some logs I have, wpull maxes out at 6 connections per (host, port, use_ssl) tuple. Since grab-site uses wpull, you might be limited by that.
-
LeighR
If there is a site that is increasingly fragile and I'm afraid will die due to neglect, is it better for me to use grab-site on my own to make sure it gets archived properly, and then upload to archive.org, or to ask one of you to have ArchiveBot do it?
-
jodizzle
The nice thing about AB other than convenience is that the data ends up in the WBM. Uploading to archive.org on your own doesn't do that.
-
LeighR
ok
-
jodizzle
But you can also do both if you're particularly concerned (just be mindful of IA resources)
-
LeighR
I seriously doubt it would be over 1GB total
-
LeighR
probably not even 100MB
-
jodizzle
What is the site?
-
LeighR
pemberley.com
-
LeighR
note its very, very slow load time, despite not having anything wild going on
-
jodizzle
Okay, sure. Join #archivebot
-
LeighR
it should probably only be archived at 1 or 2 concurrency, and slowly
-
nyany
oof, those loading times though
-
LeighR
yeah - that's why I'm afraid for it
-
LeighR
Early modern literature sites run old-school
-
nyany
that seems to be a static site though?
-
nyany
it's GoDaddy too
-
LeighR
it should be blazing fast, and super cheap to host
-
nyany
oh no, that's wordpress
-
nyany
huh
-
LeighR
it'd probably be cheaper for them to move to wordpress.com
-
LeighR
but anyway - in general, if I find other precarious-looking sites, what's the best way to get them done by ArchiveBot? Make the request in that channel, and wait for an op to ask the bot?
-
AK
Pretty much
-
AK
That's what I do
-
nyany
basically, yup
-
AK
(Or if chat scrolls too quickly, throw it in here if it doesn't get seen in ab)
-
LeighR
so many mid-sized, deeply linked literary sites, done by hand starting in the late 90s
-
JAA
thuban: wpull is fine at very high concurrencies in principle; it doesn't have the race condition issues like seesaw. But jodizzle's right, there's a hardcoded limit of 6 connections per host/port/use_ssl. Further, all processing is single-threaded. You'll quickly run into bottlenecks on SQLite, HTML parsing, WARC writing, Python's cookie jar, etc. I find that concurrencies above around a dozen are rarely useful. (Exceptions confirm the rule.)
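
The per-host cap described above can be illustrated with a small simulation. This is only a sketch of the general idea (a semaphore keyed on the (host, port, use_ssl) tuple), not wpull's or grab-site's actual code. The names `PER_HOST_LIMIT`, `simulate` and `peak_connections` are made up for the illustration, and the value 6 is simply the figure quoted in the chat.

```python
import asyncio
from collections import defaultdict

# Assumption: the cap of 6 is the number quoted in the chat, not read from wpull's source.
PER_HOST_LIMIT = 6


async def simulate(jobs, workers, per_host_limit=PER_HOST_LIMIT, fetch_time=0.001):
    """Run `workers` concurrent workers over `jobs`.

    Each job is a (host, port, use_ssl) tuple. Returns the peak number of
    simultaneous "connections" observed for each tuple.
    """
    queue = asyncio.Queue()
    for key in jobs:
        queue.put_nowait(key)

    # One semaphore per (host, port, use_ssl) tuple, created on first use.
    limits = defaultdict(lambda: asyncio.Semaphore(per_host_limit))
    active = defaultdict(int)
    peak = defaultdict(int)

    async def worker():
        while True:
            try:
                key = queue.get_nowait()
            except asyncio.QueueEmpty:
                return
            # Wait here if this host already has per_host_limit requests in flight.
            async with limits[key]:
                active[key] += 1
                peak[key] = max(peak[key], active[key])
                await asyncio.sleep(fetch_time)  # stand-in for the actual fetch
                active[key] -= 1

    await asyncio.gather(*(worker() for _ in range(workers)))
    return dict(peak)


def peak_connections(jobs, workers, per_host_limit=PER_HOST_LIMIT):
    """Synchronous wrapper around simulate()."""
    return asyncio.run(simulate(jobs, workers, per_host_limit))
```

With a single host and 20 workers, `peak_connections([("example.org", 443, True)] * 50, workers=20)` reports a peak of 6 for that tuple: the remaining 14 workers just sit waiting, which is why raising the overall concurrency past the per-host cap does not help for a single-site crawl.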