07:52:36 Brickset forums living on borrowed time... should be down already but it isn't and still some random posts coming in over time. Need to catch them with followup pulls
07:53:37 JAA: did you catch motor-talk fully? They will stay up but if a user opts out on the transfer to new owner they zap its data, not sure if posts get zapped, too
13:26:37 Anyone else hosting in Hetzner get hit by that switch going down?
13:27:23 `Switch fault hel1-dc3-sw_21`
13:27:58 presumably only if you're hosting in hel?
13:29:58 Yep only if you were in dc3 too I'd assume
13:37:07 apparently i'm dc1
15:48:33 masterX244: I did not, wasn't aware content was still in danger. I see that they now have a JS challenge thing. Funnily enough, there's a noscript meta redirect fallback that ... just bypasses it?
18:51:56 Currently archiving a Japanese website with video game information (news in particular) that goes back to 1998; I'm just encountering a problem with grabbing each article (can't seem to find any scraper that can actually grab all of the links in the html, despite them being in plaintext view). I've tried using HTTrack, wget, and such to no avail.
18:51:56 If somebody could give me a hand resolving the problem, thank you in advance.
18:52:43 I've managed to crack the problem for the 1998-2004 URLs but not for 2005-2011: https://nlab.itmedia.co.jp/games/news/0501.html
18:55:36 JAA: that shreddering happens in december, so still a bit of time left. bricksetforums seem to be down finally, there was an hour without login possible before the final kill which allowed me to snag the last few posts
18:56:14 gonna prepare a "cargo train" of WARCs
18:56:41 KrazeeTobi: the links are to a different domain
18:56:57 likely whatever you use sticks to the domain it starts with
18:57:53 Odd, that domain redirects to a valid link...
18:58:46 KrazeeTobi: You'll need to allow it to follow links to the 'gamez' subdomain, yeah.
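Editorial sketch for the 18:56:41 to 18:58:46 exchange: one way to check whether a crawler is skipping links because they point at another host is to list the hostnames an index page links to. The sample HTML, the example.com hostnames, and the helper names below are all hypothetical; the real page and its 'gamez' subdomain are not reproduced here.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collect absolute URLs from every <a href=...> in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


def hosts_linked_from(html, base_url):
    """Count how many links point at each hostname."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return Counter(urlsplit(link).hostname for link in collector.links)


# Hypothetical index page: one onsite link, two links to another subdomain.
SAMPLE_HTML = """
<a href="/news/index.html">index</a>
<a href="http://articles.example.com/1998/a.html">article A</a>
<a href="http://articles.example.com/1998/b.html">article B</a>
"""

counts = hosts_linked_from(SAMPLE_HTML, "http://www.example.com/news/0501.html")
for host, n in counts.most_common():
    print(host, n)
```

Any hostname in the output other than the starting one has to be explicitly allowed in a tool that stays on its starting domain by default.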
18:59:21 With the way archivebot's redirect handling works, it'd recurse properly on those, right?
18:59:35 masterX244: Ack. I also ran Brickset through AB, and that finished in time as well I think.
18:59:38 The behavior I've seen is that it recurses if the redirect source *OR* destination is onsite
18:59:44 pokechu22: For some value of 'properly'.
19:00:04 It wouldn't follow outlinks on the article pages, I think.
19:00:19 Not sure about recursing through onsite links on them.
19:00:21 This is not a couple-thousand we're talking to be fair, so AB may not be a great choice
19:00:23 Probably not that either.
19:00:40 KrazeeTobi: AB frequently does millions of URLs these days.
19:01:01 The 'a couple thousand' statement is a decade old. :-)
19:01:11 Ah. The wiki's info on the bot gave me the impression that it's usually good for smaller- wait is it?
19:01:32 The largest jobs we've run are in the hundreds of millions of URLs.
19:01:53 Well... Okay then, guess I was a bit misled there lol
19:02:02 Ah yeah, the wiki page still says 'a few hundred thousand'.
19:02:14 JAA: main page or the forums?
19:02:29 And I guess that's not too bad as a rule of thumb.
19:02:42 main page staying safe, it was the forums that got more and more expensive and less and less traffic thanks to damned social media
19:02:44 masterX244: The forums, recursion from https://forum.brickset.com/ with many aggressive ignores.
19:03:11 Better to be safe than sorry lol, even if I've got a data hoard onset going on hahahahah
19:03:45 not sure if it got latest stuff, what did you ignore off exactly?
19:04:21 Yeah, it almost certainly didn't get everything posted after the job start.
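Editorial sketch of the rule described at 18:59:38 (recurse if the redirect source or destination is onsite). This only paraphrases the behaviour as observed in the chat; it is not ArchiveBot's code, and the function names and example.com hosts are hypothetical.

```python
from urllib.parse import urlsplit


def is_onsite(url, site_host):
    """True if the URL's hostname is exactly the crawl's starting host."""
    return (urlsplit(url).hostname or "") == site_host


def should_recurse(source_url, destination_url, site_host):
    """Observed rule: recurse when either end of the redirect is onsite."""
    return is_onsite(source_url, site_host) or is_onsite(destination_url, site_host)


SITE = "www.example.com"  # hypothetical starting host

# Onsite URL redirecting offsite: still recursed under this rule.
print(should_recurse("http://www.example.com/a", "http://other.example.com/a", SITE))
# Offsite URL redirecting to another offsite URL: not recursed.
print(should_recurse("http://other.example.com/a", "http://third.example.com/a", SITE))
```

Under this reading the redirect target itself is fetched and parsed, but, as the replies at 18:59:44 to 19:00:23 point out, links found on that offsite page would probably not be followed any further.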
19:04:22 sidenote: my older crawls got imgur-processed already since i kept them locally stored, too even though i sent them off to the wayback
19:04:42 http://archivebot.com/ignores/5vn0jh3mknwgc1j27ungb7dnt?compact=true
19:04:58 ran an ugly "hackery" with a limited crawl off from recent to get a "sync" and then i manually poked for new posts
19:05:59 (there were 2 loginonly sections, crawled them in a separate file that won't get archive.org'd, kept those fully isolated), used my logged in "read markers" to scan quickly if i needed to requeue a thread, too for that manual final sync
19:06:43 Yup, telling HTTrack to grab from the same domain seems to have got it to work -- cheers for the assist everyone
19:08:55 JAA: used this dirty ignore later on to trim off any offsite media ^https://(?!forum\.brickset\.com|us\.v-cdn\.net|secure\.gravatar\.com)
19:09:40 (most users upload straight to the forums and most of the offsite links were already run thru the pipelines, especially imgur before the shredders ran hot)
19:10:18 Why not ignore gravatar too?
19:10:32 It seems unlikely that gravatar will be deleted afterwards - it could be done in a separate job
19:10:51 Lots of the Gravatar URLs redirected back to the forums for the default avatar thingy.
19:11:02 most users were regulars, that amount was minuscule
19:11:25 had a fuckup and had v-cdn.com initially and not v-cdn.net, quick sqlite dump and generating a list fixed it
19:11:39 (aka fixup crawl for that data)
19:11:46 Uh, the forums still seem to be up?
19:12:10 had maintenance status a little bit ago and that seemed the final coffin nail
19:12:18 Ah, probably just caching of the most visited pages, yeah.
19:12:26 Getting the 503 after following a couple links.
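Editorial sketch of how the ignore pattern quoted at 19:08:55 behaves. The negative lookahead means "ignore any https URL whose host does not start with one of the three listed names". The second pattern is a reconstruction of the v-cdn.com typo mentioned at 19:11:25 and is an assumption, as are the sample URL paths and the helper name.

```python
import re

# Pattern quoted in the log at 19:08:55.
IGNORE = re.compile(
    r"^https://(?!forum\.brickset\.com|us\.v-cdn\.net|secure\.gravatar\.com)"
)

# Assumed reconstruction of the earlier typo (v-cdn.com instead of v-cdn.net).
IGNORE_TYPO = re.compile(
    r"^https://(?!forum\.brickset\.com|us\.v-cdn\.com|secure\.gravatar\.com)"
)


def is_ignored(url, pattern=IGNORE):
    """True if the crawler would skip this URL under the given ignore pattern."""
    return pattern.search(url) is not None


# Hypothetical sample URLs; only the hostnames come from the log.
SAMPLE_URLS = [
    "https://forum.brickset.com/discussion/1",
    "https://us.v-cdn.net/uploads/a.jpg",
    "https://i.imgur.com/abc.jpg",
    "http://i.imgur.com/abc.jpg",
]

for url in SAMPLE_URLS:
    print(url, "ignored" if is_ignored(url) else "kept")
```

Two things worth noticing: the pattern only matches https URLs, so plain http offsite links are not ignored by it, and with the typo'd variant every us.v-cdn.net upload is ignored, which is consistent with the fixup crawl described at 19:11:25 to 19:11:39.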
19:12:59 thank god that the latest threads were cache-sticky so i was able to yoink them off
19:13:06 :-)
19:13:38 sidenote: suckled with conc 20 at my end to emergency-yoink everything, wasn't sure when the shredders were going to start
19:13:57 (aka an IDGAF crawl in AT style)
19:14:25 (i can risk burning one or 2 IPs since i got 2 dedis and 2 vservers at hetzner)
19:15:05 Hmm, I wonder whether I should also rerun it through AB just to get the most recent few pages of content.
19:15:46 Can't hurt, and it'll be tiny.
19:18:26 uploading my yoink already
19:19:09 (keeping the older 2019 mirror, too since that might have deleted content that got lost afterwards, you never know the stupid imagehosters)
19:20:12 one thing that i like on grabsite is that i can reuse an ignoreset from a crawl for future ones.
19:29:31 Oh yeah, those Gravatar URLs actually go to vanillicon.com, not back to the forums. Must've mixed that up with something else.
19:30:03 Outlinks from the AB crawl will also go into #// after some minor filtering.
19:42:32 is there a list of domains crawled by IA somewhere? is there a way to submit missing domains in bulk?
19:43:51 I don't believe they make anything like that public.
19:45:44 JAA: do you have an idea about where/whom I should submit the missing ones? (I take it they won't appreciate the "archive now" being spammed to death)
19:47:24 Not a clue. I'd probably ask info⊙ao for advice.
19:48:15 JAA: thanks, will do
20:28:27 Systwi edited IRC (+357, Added some IRC server URLs, sorted the list…): https://wiki.archiveteam.org/?diff=51105&oldid=50458
21:14:03 thank you for preserving the maemo links o/