-
betamax
JAA: thank you for keeping on top of the election sites (and sorry for not being more involved)
-
betamax
I see that all the lib dem sites are set to larger delays - I assume this is due to rate limiting?
-
JAA
Yeah, they're all hosted on the same IP and have silly rate limits.
-
JAA
1 request per second is already too much and gets your IP banned (connection refused).
-
JAA
Some others are also slowed down for similar reasons.
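
(Note: because the sites above share one IP, a delay would have to be enforced across the whole group, not per hostname. Below is a minimal Python sketch of that idea. It is not how ArchiveBot implements delays; the class name `GroupThrottle`, the group label, the hostnames, and the 5-second figure are all hypothetical.)

```python
import time
from urllib.parse import urlsplit


class GroupThrottle:
    """Enforce a minimum gap between requests that share a rate-limit budget."""

    def __init__(self, default_delay=1.0, clock=time.monotonic, sleep=time.sleep):
        self.default_delay = default_delay
        self.group_of = {}       # hostname -> group label (e.g. hosts on one IP)
        self.group_delay = {}    # group label -> seconds between requests
        self.last_request = {}   # group label -> time of the previous request
        self.clock = clock
        self.sleep = sleep

    def assign(self, host, group):
        """Put a hostname into a group that shares one delay budget."""
        self.group_of[host] = group

    def set_group_delay(self, group, seconds):
        self.group_delay[group] = seconds

    def wait(self, url):
        """Sleep if needed before requesting url; return the seconds slept."""
        host = urlsplit(url).hostname
        group = self.group_of.get(host, host)  # ungrouped hosts stand alone
        delay = self.group_delay.get(group, self.default_delay)
        previous = self.last_request.get(group)
        waited = 0.0
        if previous is not None:
            remaining = delay - (self.clock() - previous)
            if remaining > 0:
                self.sleep(remaining)
                waited = remaining
        self.last_request[group] = self.clock()
        return waited
```

(Usage: call `assign()` for every hostname behind the shared IP, set one delay for the group, then call `wait(url)` before each request.)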
-
betamax
there's also ~15 or so that are stuck with a very large delay (e.g: e6t5bxy2xr04kg98svkax7o7r with 300 second delay) - is this due to issues with the underlying pipeline?
-
JAA
Those with a 300001 delay are all dead, see #archivebot just now.
-
JAA
I'll take care of those when I clean it up.
-
JAA
There's also a growing list of sites that need special treatment in some way. Some are LibDem sites we aborted, then there's a bunch of sites that embed another domain via an iframe etc.
-
JAA
And no worries re involvement, you created the list. :-) Hope you don't mind the thousands of highlights in #archivebot though. :-P
-
EggplantN
if required we can always just set up grab-site with a /24? JAA/betamax :)
-
EggplantN
and do the random IP iptables rule
-
JAA
If someone who isn't me does it, sure! :-)
-
betamax
JAA: I cleaned up the list, but it was the hundreds of volunteers through Democracy Club (democracyclub.org.uk) that did the hard work crowdsourcing the data
-
EggplantN
:P i mean, I can get the box + grab-site configured if you wanna put the jobs into it
-
EggplantN
or just AB them as you are :D
-
JAA
AB is fully automated at this point. So if I can throw a list of domains somewhere, that's fine with me. I don't really want to have to set up uploads, add ignores manually, etc.
-
EggplantN
Fair enough ;)
-
JAA
I occasionally go through to see if a job's running wild and check the list of finished jobs to see which ones went wrong. Otherwise, it's automatic.
-
betamax
I didn't mean for this to be such a big project - I (naively) thought it would be straightforward :)
-
JAA
:-)
-
EggplantN
It's a great idea for a long running project and it falls into the same category as another one i mentioned to JAA recently. It would be best looking into creating a specific tool for these projects. I'll have a brainstorm one night
-
JAA
Aye
-
JAA
Gov sites as well. A few found their way into this list, and it's amazing how much stuff on them has never been archived before.
-
JAA
One was retrieving thousands of PDFs, and only like a hundred were in the WBM.
-
betamax
Yeah. I tried to do my own thing for the US 2018 midterm elections (archiving candidate sites) and ran into complexity issues pretty quickly. I ended up attempting to archive each site using wget with WARC output, then aborting and keeping the partial archive if it took longer than 5 or so minutes, but that isn't a very good technique :D
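
(Note: the "wget with a time limit" approach described above could look roughly like the Python sketch below. It is an illustration under assumptions, not betamax's actual script: the function names, the choice of `--mirror`, `--page-requisites` and `--wait=1`, and the 300-second default are guesses. Only `--warc-file` is implied by the message itself.)

```python
import subprocess


def build_wget_command(url, warc_prefix):
    """Build a wget invocation that crawls url and records a WARC file."""
    return [
        "wget",
        "--mirror",                       # recursive crawl of the site
        "--page-requisites",              # also fetch images, CSS, scripts
        "--wait=1",                       # assumed politeness delay in seconds
        f"--warc-file={warc_prefix}",     # wget adds the .warc.gz extension
        url,
    ]


def archive_with_time_limit(url, warc_prefix, limit_seconds=300, run=subprocess.run):
    """Run wget; return True if it exited by itself, False if it was cut off.

    On timeout, subprocess.run kills the child process and raises
    TimeoutExpired. Whatever wget had written so far stays on disk,
    possibly ending in an incomplete record.
    """
    try:
        run(build_wget_command(url, warc_prefix), timeout=limit_seconds, check=False)
    except subprocess.TimeoutExpired:
        return False
    return True
```

(As the message says, this is a crude technique: a hard cutoff keeps whatever happened to be fetched first, which is not necessarily the most important part of the site.)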
-
betamax
JAA: it's amazing how much gov / council stuff is just deleted. I've been archiving the UK council webcasts for about a year now (archive.org/details/public-i-webcast-archive) but a lot is lost forever because webcasts are deleted after a completely arbitrary time period
-
betamax
(for that project I pull PDFs / meeting minutes from the local government sites if they've been mentioned in the metadata for the webcast)
-
JAA
Yeah
-
Ryz
Mmm, there will always be deleting, it's not just websites that suddenly go poof, no, the more nefarious stuff is stuff deleted while the website still looks alive s:
-
BerndLauert
What's the team's opinion on archiving game modding sites? And porn? I've never seen this being discussed in the archival circles
-
BerndLauert
lol
-
JAA
Porn depends a bit, but game modding absolutely.
-
BerndLauert
there's a degree of overlap there
-
Kaz
we literally grabbed the porn bit of tumblr
-
aarchi
Anyone know how frequently Stack Exchange uploads their dumps to IA? Only the most recent is kept, so I can't tell. archive.org/details/stackexchange
-
Jake
about every 80-90 days I think.
-
Kaz
checks out. Last one 75 days ago, then 159 days, then 249 days, etc
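
(Note: the arithmetic behind "checks out" can be made explicit. The snippet below just takes the three ages quoted above and computes the gaps between consecutive dumps; the variable names are illustrative.)

```python
# Ages of the last three dumps, in days, as quoted in the message above.
days_ago = [75, 159, 249]

# Gap between each dump and the one before it.
gaps = [older - newer for newer, older in zip(days_ago, days_ago[1:])]

print(gaps)  # [84, 90]
```

(Both gaps fall inside the "every 80-90 days" estimate.)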
-
JAA
But also, gross that they only keep the latest dump.
-
JAA
data.stackexchange.com has weekly dumps, by the way.
-
Kaz
yeah, I was just poking around hoping the data was maybe shipped off elsewhere first or something, but doesn't seem to be
-
JAA
So one would have to dig through the logs for that IA item, get the torrent info hashes, and then hope that someone still seeds them.