18:43:16 JAA: thank you for keeping on top of the election sites (and sorry for not being more involved)
18:43:42 I see that all the lib dem sites are set to larger delays - I assume this is due to rate limiting?
18:44:06 Yeah, they're all hosted on the same IP and have silly rate limits.
18:44:46 1 request per second is already too much and gets your IP banned (connection refused).
18:45:02 Some others are also slowed down for similar reasons.
18:45:08 there's also ~15 or so that are stuck with a very large delay (e.g. e6t5bxy2xr04kg98svkax7o7r with a 300 second delay) - is this due to issues with the underlying pipeline?
18:45:31 Those with a 300001 delay are all dead, see #archivebot just now.
18:45:54 I'll take care of those when I clean it up.
18:46:30 There's also a growing list of sites that need special treatment in some way. Some are LibDem sites we aborted, then there's a bunch of sites that embed another domain via an iframe etc.
18:47:13 And no worries re involvement, you created the list. :-) Hope you don't mind the thousands of highlights in #archivebot though. :-P
18:47:17 if required we can always just set up grab-site with a /24? JAA/betamax :)
18:47:27 and do the random IP iptables rule
18:48:10 If someone who isn't me does it, sure! :-)
18:48:32 JAA: I cleaned up the list, but it was the hundreds of volunteers through Democracy Club (https://democracyclub.org.uk/) that did the hard work crowdsourcing the data
18:49:02 :P i mean, I can get the box + grab-site configured if you wanna put the jobs into it
18:49:07 or just AB them as you are :D
18:50:19 AB is fully automated at this point. So if I can throw a list of domains somewhere, that's fine with me. I don't really want to have to set up uploads, add ignores manually, etc.
18:50:56 Fair enough ;)
18:51:04 I occasionally go through to see if a job's running wild and check the list of finished jobs to see which ones went wrong. Otherwise, it's automatic.
18:51:08 I didn't mean for this to be such a big project - I (naively) thought it would be straightforward :)
18:51:37 :-)
18:52:01 It's a great idea for a long-running project, and it falls into the same category as another one I mentioned to JAA recently. It would be worth looking into creating a specific tool for these projects. I'll have a brainstorm one night.
18:52:28 Aye
18:53:10 Gov sites as well. A few found their way into this list, and it's amazing how much stuff on them has never been archived before.
18:53:32 One was retrieving thousands of PDFs, and only like a hundred were in the WBM.
18:54:48 Yeah. I tried to do my own thing for the US 2018 midterm elections (archiving candidate sites) and ran into complexity issues pretty quickly. I ended up attempting to archive each site using wget with WARC output, then aborting and keeping the partial archive if it took longer than 5 or so minutes, but that isn't a very good technique :D
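The wget-plus-timeout approach described just above could look roughly like the sketch below: run wget with WARC output under a hard time limit and keep whatever was written to disk when the limit hits. This is a minimal illustration under assumptions, not the setup actually used; the flags, the 300-second limit, and the function and file names are placeholders, and GNU wget is assumed to be installed.

```python
# Sketch of "wget with WARC output, abort after ~5 minutes, keep the partial archive".
# Timeout, flags, and names are illustrative assumptions, not the original setup.
import subprocess

def grab_candidate_site(url, warc_prefix, timeout_seconds=300):
    cmd = [
        "wget",
        "--recursive",
        "--page-requisites",
        "--warc-file", warc_prefix,  # wget appends .warc.gz to this prefix
        "--warc-cdx",                # also write a CDX index alongside the WARC
        "--no-verbose",
        url,
    ]
    try:
        subprocess.run(cmd, timeout=timeout_seconds, check=False)
    except subprocess.TimeoutExpired:
        # wget gets killed here, but the WARC records written so far stay on disk,
        # which is the "keep the partial archive" part of the technique.
        print(f"{url}: hit the {timeout_seconds}s limit, keeping the partial WARC")

grab_candidate_site("https://example-candidate-site.example", "candidate-site")
```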
18:56:54 JAA: it's amazing how much gov / council stuff is just deleted. I've been archiving the UK council webcasts for about a year now (https://archive.org/details/public-i-webcast-archive), but a lot is lost forever because webcasts are deleted after a completely arbitrary time period.
18:57:30 (for that project I pull PDFs / meeting minutes from the local government sites if they've been mentioned in the metadata for the webcast)
18:57:50 Yeah
19:05:12 Mmm, there'll always be deleting. It's not just websites that suddenly go poof, no, the more nefarious stuff is stuff deleted while the website still looks alive s:
19:13:33 What's the team's opinion on archiving game modding sites? And porn? I've never seen this being discussed in archival circles
19:14:37 https://transfer.archivete.am/inline/bG4mu/aatt.png
19:15:18 lol
19:17:07 Porn depends a bit, but game modding absolutely.
19:18:00 there's a degree of overlap there
19:18:22 we literally grabbed the porn bit of tumblr
22:25:14 Anyone know how frequently Stack Exchange uploads their dumps to IA? Only the most recent is kept, so I can't tell. https://archive.org/details/stackexchange
22:47:56 about every 80-90 days I think.
22:59:04 Every three months per https://meta.stackexchange.com/questions/224873/all-stack-exchange-data-dumps
23:15:20 checks out. Last one 75 days ago, then 159 days, then 249 days, etc
23:15:57 But also, gross that they only keep the latest dump.
23:17:17 https://data.stackexchange.com/ has weekly dumps, by the way.
23:19:36 yeah, I was just poking around hoping the data was maybe shipped off elsewhere first or something, but doesn't seem to be
23:19:56 Relevant discussion: https://meta.stackexchange.com/questions/224873/all-stack-exchange-data-dumps
23:22:07 So one would have to dig through the logs for that IA item, get the torrent info hashes, and then hope that someone still seeds them.
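For reference, the "checks out" arithmetic on the dump cadence above: taking the ages quoted in the log (75, 159, and 249 days), the gaps between consecutive dumps come out to 84 and 90 days, consistent with a roughly quarterly release schedule. A minimal sketch of that check:

```python
# Ages of the three most recent Stack Exchange dumps in days, newest first,
# as quoted in the log above.
ages_days = [75, 159, 249]

# Interval between consecutive dumps = difference of consecutive ages.
intervals = [older - newer for newer, older in zip(ages_days, ages_days[1:])]
print(intervals)  # [84, 90] -> roughly one dump every three months
```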