-
masterX244
Brickset forums are living on borrowed time... they should be down already but aren't, and random posts are still coming in. Need to catch them with follow-up pulls
-
masterX244
JAA: did you catch motor-talk fully? They will stay up, but if a user opts out of the transfer to the new owner they zap that user's data; not sure if the posts get zapped, too
-
AK
Anyone else hosting in Hetzner get hit by that switch going down?
-
AK
`Switch fault hel1-dc3-sw_21`
-
murb
presumably only if you're hosting in hel?
-
AK
Yep only if you were in dc3 too I'd assume
-
murb
apparently i'm dc1
-
JAA
masterX244: I did not, wasn't aware content was still in danger. I see that they now have a JS challenge thing. Funnily enough, there's a noscript meta redirect fallback that ... just bypasses it?
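
A rough sketch of how a crawler could pick up such a meta redirect fallback, using only Python's standard library. The HTML snippet and the `example.org` URLs in the usage note are made up for illustration; nothing here reflects the actual markup motor-talk serves.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class MetaRefreshFinder(HTMLParser):
    """Collects the targets of <meta http-equiv="refresh"> tags."""

    def __init__(self):
        super().__init__()
        self.targets = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("http-equiv", "").lower() != "refresh":
            return
        # The content attribute looks like "0; url=/some/path".
        _, _, rest = attr.get("content", "").partition(";")
        key, _, value = rest.strip().partition("=")
        if key.strip().lower() == "url" and value:
            self.targets.append(value.strip().strip("'\""))


def find_meta_refresh(html, base_url):
    """Return absolute URLs of all meta refresh targets in the page."""
    finder = MetaRefreshFinder()
    finder.feed(html)
    return [urljoin(base_url, target) for target in finder.targets]
```

For example, feeding it `<noscript><meta http-equiv="refresh" content="0; url=/real-page"></noscript>` with a base of `https://example.org/challenge` yields `https://example.org/real-page`, which a crawler without JavaScript could then fetch directly.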
-
KrazeeTobi
Currently archiving a Japanese website with video game information (news in particular) that goes back to 1998; I'm just encountering a problem with grabbing each article (can't seem to find any scraper that can actually grab all of the links in the html, despite them being in plaintext view). I've tried using HTTrack, wget, and such to no avail.
-
KrazeeTobi
If somebody could give me a hand into resolving the problem, thank you in advance.
-
KrazeeTobi
I've managed to crack the problem for the 1998-2004 URLs but not for 2005-2011:
nlab.itmedia.co.jp/games/news/0501.html
-
masterX244
JAA: that shredding happens in December, so there's still a bit of time left. The Brickset forums seem to be down finally; there was an hour without login possible before the final kill, which allowed me to snag the last few posts
-
masterX244
gonna prepare a "cargo train" of WARCs
-
arkiver
KrazeeTobi: the links are to a different domain
-
arkiver
likely whatever you use sticks to the domain it starts with
-
KrazeeTobi
Odd, that domain redirects to a valid link...
-
JAA
KrazeeTobi: You'll need to allow it to follow links to the 'gamez' subdomain, yeah.
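
As a sketch of what "allowing" an extra subdomain means for a crawler's scope check: the decision usually comes down to comparing a link's hostname against a set of permitted hosts. The hostnames used in the usage note are placeholders, not the real site's, and this is not how HTTrack or wget implement it internally.

```python
from urllib.parse import urlsplit


def in_scope(url, allowed_hosts):
    """Return True if the URL's hostname is one of the allowed hosts."""
    host = (urlsplit(url).hostname or "").lower()
    return host in allowed_hosts


def filter_links(links, allowed_hosts):
    """Keep only the links a crawl limited to allowed_hosts would follow."""
    return [link for link in links if in_scope(link, allowed_hosts)]
```

With `allowed_hosts = {"news.example.com"}` a link to `http://gamez.example.com/article.html` is dropped; adding `"gamez.example.com"` to the set lets it through, which is the effect of widening the scope in the scraper's settings.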
-
pokechu22
With the way archivebot's redirect handling works, it'd recurse properly on those, right?
-
JAA
masterX244: Ack. I also ran Brickset through AB, and that finished in time as well I think.
-
pokechu22
The behavior I've seen is that it recurses if the redirect source *OR* destination is onsite
-
JAA
pokechu22: For some value of 'properly'.
-
JAA
It wouldn't follow outlinks on the article pages, I think.
-
JAA
Not sure about recursing through onsite links on them.
-
KrazeeTobi
To be fair, we're not talking about a couple thousand URLs here, so AB may not be a great choice
-
JAA
Probably not that either.
-
JAA
KrazeeTobi: AB frequently does millions of URLs these days.
-
JAA
The 'a couple thousand' statement is a decade old. :-)
-
KrazeeTobi
Ah. The wiki's info on the bot gave me the impression that it's usually good for smaller- wait is it?
-
JAA
The largest jobs we've run are in the hundreds of millions of URLs.
-
KrazeeTobi
Well... Okay then, guess I was a bit misled there lol
-
JAA
Ah yeah, the wiki page still says 'a few hundred thousand'.
-
masterX244
JAA: main page or the forums?
-
JAA
And I guess that's not too bad as a rule of thumb.
-
masterX244
main page staying safe, it was the forums that got more and more expensive and less and less traffic thanks to damned social media
-
JAA
masterX244: The forums, recursion from forum.brickset.com with many aggressive ignores.
-
KrazeeTobi
Better to be safe than sorry lol, even if I've got a data hoard onset going on hahahahah
-
masterX244
not sure if it got the latest stuff, what did you ignore exactly?
-
JAA
Yeah, it almost certainly didn't get everything posted after the job start.
-
masterX244
sidenote: my older crawls got imgur-processed already since i kept them stored locally too, even though i sent them off to the wayback
-
masterX244
ran an ugly "hackery" with a limited crawl off the recent posts to get a "sync" and then i manually poked for new posts
-
masterX244
(there were 2 login-only sections; crawled them into a separate file that won't get archive.org'd and kept those fully isolated. also used my logged-in "read markers" to quickly check whether i needed to requeue a thread for that manual final sync)
-
KrazeeTobi
Yup, telling HTTrack to grab from the same domain seems to have got it to work -- cheers for the assist everyone
-
masterX244
JAA: used this dirty ignore later on to trim off any offsite media ^
(?!forum\.brickset\.com|us\.v-cdn\.net|secure\.gravatar\.com)
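
A small sketch of how a negative-lookahead ignore like this behaves. The pattern as pasted has no anchor, so the `^https?://` prefix below is an assumption added for illustration; the three hostnames are the ones from the message.

```python
import re

# Assumption: the lookahead sits right after the URL scheme. Any URL whose
# host does NOT start with one of the three names matches, i.e. is ignored.
IGNORE = re.compile(
    r"^https?://(?!forum\.brickset\.com|us\.v-cdn\.net|secure\.gravatar\.com)"
)


def is_ignored(url):
    """Return True if the ignore pattern matches and the URL would be skipped."""
    return IGNORE.match(url) is not None
```

Under that assumption `is_ignored("https://i.imgur.com/abc.png")` is true, while URLs on forum.brickset.com, us.v-cdn.net and secure.gravatar.com are kept. The lookahead only checks the start of the hostname, so a stricter version might also require a `/` or end of string after the host.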
-
masterX244
(most users upload straight to the forums, and most of the offsite links had already been run through the pipelines, especially imgur, before the shredders ran hot)
-
pokechu22
Why not ignore gravatar too?
-
pokechu22
It seems unlikely that gravatar will be deleted afterwards - it could be done in a separate job
-
JAA
Lots of the Gravatar URLs redirected back to the forums for the default avatar thingy.
-
masterX244
most users were regulars, that amount was minuscule
-
masterX244
had a fuckup: had v-cdn.com initially instead of v-cdn.net; a quick sqlite dump and generating a list fixed it
-
masterX244
(aka fixup crawl for that data)
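
A minimal sketch of that kind of fixup: pull every URL for the wrongly ignored host out of a SQLite database and turn it into a list for a follow-up crawl. The `urls` table and its single `url` column are hypothetical, built in memory here so the example runs on its own; the crawler's real database schema will differ.

```python
import sqlite3
from urllib.parse import urlsplit


def urls_for_host(conn, host):
    """Return all stored URLs whose hostname equals the given host."""
    rows = conn.execute("SELECT url FROM urls ORDER BY url")
    return [url for (url,) in rows if urlsplit(url).hostname == host]


# Hypothetical stand-in for the crawl database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE urls (url TEXT PRIMARY KEY)")
conn.executemany(
    "INSERT INTO urls VALUES (?)",
    [
        ("https://us.v-cdn.net/a.jpg",),
        ("https://forum.brickset.com/discussion/1",),
        ("https://us.v-cdn.net/b.png",),
        ("https://us.v-cdn.com/typo.jpg",),
    ],
)

# URLs that were skipped because the ignore named the wrong domain.
fixup_list = urls_for_host(conn, "us.v-cdn.net")
```

Writing `fixup_list` out one URL per line gives an input file that a second, non-recursive crawl can work through.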
-
JAA
Uh, the forums still seem to be up?
-
masterX244
had maintenance status a little bit ago and that seemed to be the final nail in the coffin
-
JAA
Ah, probably just caching of the most visited pages, yeah.
-
JAA
Getting the 503 after following a couple links.
-
masterX244
thank god the latest threads were cache-sticky so i was able to yoink them
-
JAA
:-)
-
masterX244
sidenote: pulled with concurrency 20 at my end to emergency-yoink everything, wasn't sure when the shredders were going to start
-
masterX244
(aka an IDGAF crawl in AT style)
-
masterX244
(i can risk burning one or 2 IPs since i got 2 dedis and 2 vservers at hetzner)
-
JAA
Hmm, I wonder whether I should also rerun it through AB just to get the most recent few pages of content.
-
JAA
Can't hurt, and it'll be tiny.
-
masterX244
uploading my yoink already
-
masterX244
(keeping the older 2019 mirror too, since it might have content that was deleted afterwards; you never know with the stupid image hosters)
-
masterX244
one thing that i like about grab-site is that i can reuse an ignoreset from a crawl for future ones.
-
JAA
Oh yeah, those Gravatar URLs actually go to vanillicon.com, not back to the forums. Must've mixed that up with something else.
-
JAA
Outlinks from the AB crawl will also go into #// after some minor filtering.
-
apache2
is there a list of domains crawled by IA somewhere? is there a way to submit missing domains in bulk?
-
JAA
I don't believe they make anything like that public.
-
apache2
JAA: do you have an idea about where/whom I should submit the missing ones? (I take it they won't appreciate the "archive now" being spammed to death)
-
JAA
Not a clue. I'd probably ask info⊙ao for advice.
-
apache2
JAA: thanks, will do
-
h2ibot
Systwi edited IRC (+357, Added some IRC server URLs, sorted the list…):
wiki.archiveteam.org/?diff=51105&oldid=50458
-
pupnik
thank you for preserving the maemo links o/