#archiveteam-bs

07:52

masterX244

Brickset forums living on borrowed time... should be down already but they aren't, and some random posts are still coming in over time. Need to catch them with follow-up pulls
07:53

masterX244

JAA: did you catch motor-talk fully? They will stay up, but if a user opts out of the transfer to the new owner they zap that user's data; not sure if posts get zapped, too
13:26

AK

Anyone else hosting in Hetzner get hit by that switch going down?
13:27

AK

`Switch fault hel1-dc3-sw_21`
13:27

murb

presumably only if you're hosting in hel?
13:29

AK

Yep only if you were in dc3 too I'd assume
13:37

murb

apparently i'm dc1
15:48

JAA

masterX244: I did not, wasn't aware content was still in danger. I see that they now have a JS challenge thing. Funnily enough, there's a noscript meta redirect fallback that ... just bypasses it?
18:51

KrazeeTobi

Currently archiving a Japanese website with video game information (news in particular) that goes back to 1998; I'm just encountering a problem with grabbing each article (can't seem to find any scraper that can actually grab all of the links in the html, despite them being in plaintext view). I've tried using HTTrack, wget, and such to no avail.
18:51

KrazeeTobi

If somebody could give me a hand into resolving the problem, thank you in advance.
18:52

KrazeeTobi

I've managed to crack the problem for the 1998-2004 URLs but not for 2005-2011: nlab.itmedia.co.jp/games/news/0501.html
18:55

masterX244

JAA: that shredding happens in December, so still a bit of time left. The Brickset forums seem to be down finally; there was an hour without login possible before the final kill, which allowed me to snag the last few posts
18:56

masterX244

gonna prepare a "cargo train" of WARCs
18:56

arkiver

KrazeeTobi: the links are to a different domain
18:56

arkiver

likely whatever you use sticks to the domain it starts with
18:57

KrazeeTobi

Odd, that domain redirects to a valid link...
18:58

JAA

KrazeeTobi: You'll need to allow it to follow links to the 'gamez' subdomain, yeah.
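
A minimal sketch of the scoping idea discussed here: collect every link on an index page, then keep the ones whose host is on an explicit allow-list instead of only the starting host. The host `gamez.itmedia.co.jp` is an assumption based on the 'gamez' subdomain mentioned above, and the names `LinkCollector` and `in_scope_links` are hypothetical; this is not how HTTrack or wget implement it.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit

# Hosts the crawl may follow. The first is the index host from the chat;
# the second is an ASSUMED full name for the 'gamez' subdomain.
ALLOWED_HOSTS = {"nlab.itmedia.co.jp", "gamez.itmedia.co.jp"}


class LinkCollector(HTMLParser):
    """Collect absolute URLs from every <a href="..."> in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links against the page they were found on.
            self.links.append(urljoin(self.base_url, href))


def in_scope_links(html, base_url, allowed_hosts=ALLOWED_HOSTS):
    """Return the links on the page whose host is on the allow-list."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return [u for u in collector.links if urlsplit(u).hostname in allowed_hosts]
```

A crawler that only compares against the starting host would drop the article links on the other subdomain, which matches the symptom described; widening the allow-list is the fix.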
18:59

pokechu22

With the way archivebot's redirect handling works, it'd recurse properly on those, right?
18:59

JAA

masterX244: Ack. I also ran Brickset through AB, and that finished in time as well I think.
18:59

pokechu22

The behavior I've seen is that it recurses if the redirect source *OR* destination is onsite
18:59

JAA

pokechu22: For some value of 'properly'.
19:00

JAA

It wouldn't follow outlinks on the article pages, I think.
19:00

JAA

Not sure about recursing through onsite links on them.
19:00

KrazeeTobi

To be fair, this is not a couple thousand URLs we're talking about, so AB may not be a great choice
19:00

JAA

Probably not that either.
19:00

JAA

KrazeeTobi: AB frequently does millions of URLs these days.
19:01

JAA

The 'a couple thousand' statement is a decade old. :-)
19:01

KrazeeTobi

Ah. The wiki's info on the bot gave me the impression that it's usually good for smaller- wait is it?
19:01

JAA

The largest jobs we've run are in the hundreds of millions of URLs.
19:01

KrazeeTobi

Well... Okay then, guess I was a bit misled there lol
19:02

JAA

Ah yeah, the wiki page still says 'a few hundred thousand'.
19:02

masterX244

JAA: main page or the forums?
19:02

JAA

And I guess that's not too bad as a rule of thumb.
19:02

masterX244

main page staying safe, it was the forums that got more and more expensive and less and less traffic thanks to damned social media
19:02

JAA

masterX244: The forums, recursion from forum.brickset.com with many aggressive ignores.
19:03

KrazeeTobi

Better to be safe than sorry lol, even if I've got a data hoard onset going on hahahahah
19:03

masterX244

not sure if it got latest stuff, what did you ignore off exactly?
19:04

JAA

Yeah, it almost certainly didn't get everything posted after the job start.
19:04

masterX244

sidenote: my older crawls got imgur-processed already since i kept them locally stored, too even though i sent them off to the wayback
19:04

JAA

archivebot.com/ignores/5vn0jh3mknwgc1j27ungb7dnt?compact=true
19:04

masterX244

ran an ugly "hackery" with a limited crawl off from recent to get a "sync" and then i manually poked for new posts
19:05

masterX244

(there were 2 login-only sections, crawled them in a separate file that won't get archive.org'd, kept those fully isolated), used my logged-in "read markers" to scan quickly if i needed to requeue a thread, too, for that manual final sync
19:06

KrazeeTobi

Yup, telling HTTrack to grab from the same domain seems to have got it to work -- cheers for the assist everyone
19:08

masterX244

JAA: used this dirty ignore later on to trim off any offsite media ^(?!forum\.brickset\.com|us\.v-cdn\.net|secure\.gravatar\.com)
19:09

masterX244

(most users upload straight to the forums and most of the offsite links were already run through the pipelines, especially imgur before the shredders ran hot)
19:10

pokechu22

Why not ignore gravatar too?
19:10

pokechu22

It seems unlikely that gravatar will be deleted afterwards - it could be done in a separate job
19:10

JAA

Lots of the Gravatar URLs redirected back to the forums for the default avatar thingy.
19:11

masterX244

most users were regulars, that amount was minuscule
19:11

masterX244

had a fuckup and had v-cdn.com initially and not v-cdn.net, quick sqlite dump and generating a list fixed it
19:11

masterX244

(aka fixup crawl for that data)
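
For readers unfamiliar with negative lookaheads, here is a small sketch of how the ignore pattern quoted above behaves. It assumes the crawler tests each ignore pattern against the full URL, scheme included, and it adds an `https?://` prefix that is not in the quoted text (other URLs in this log also appear without a scheme, so the original form is uncertain). The function name `is_ignored` and the sample URLs are hypothetical.

```python
import re

# Ignore pattern from the chat, with an ASSUMED "https?://" prefix.
# It matches (meaning "ignore") any URL whose host part does NOT start
# with one of the three listed hosts.
OFFSITE_IGNORE = re.compile(
    r"^https?://(?!forum\.brickset\.com|us\.v-cdn\.net|secure\.gravatar\.com)"
)


def is_ignored(url: str) -> bool:
    """Return True if the URL would be skipped under this ignore pattern."""
    return OFFSITE_IGNORE.search(url) is not None


if __name__ == "__main__":
    samples = [
        "https://forum.brickset.com/discussion/1",  # kept
        "https://us.v-cdn.net/example.png",         # kept
        "https://us.v-cdn.com/example.png",         # skipped: wrong TLD
        "https://i.imgur.com/example.jpg",          # skipped: offsite media
    ]
    for url in samples:
        print("ignore" if is_ignored(url) else "keep  ", url)
```

The `.com` versus `.net` case shows why a one-character slip in such a pattern silently drops a whole host, which is the kind of gap a fixup crawl has to fill afterwards.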
19:11

JAA

Uh, the forums still seem to be up?
19:12

masterX244

had maintenance status a little bit ago and that seemed the final coffin nail
19:12

JAA

Ah, probably just caching of the most visited pages, yeah.
19:12

JAA

Getting the 503 after following a couple links.
19:12

masterX244

thank god that the latest threads were cache-sticky so i was able to yoink them off
19:13

JAA

:-)
19:13

masterX244

sidenote: suckled with conc 20 at my end to emergency-yoink everything, wasn't sure when the shredders were going to start
19:13

masterX244

(aka an IDGAF crawl in AT style)
19:14

masterX244

(i can risk burning one or 2 IPs since i got 2 dedis and 2 vservers at hetzner)
19:15

JAA

Hmm, I wonder whether I should also rerun it through AB just to get the most recent few pages of content.
19:15

JAA

Can't hurt, and it'll be tiny.
19:18

masterX244

uploading my yoink already
19:19

masterX244

(keeping the older 2019 mirror, too since that might have deleted content that got lost afterwards, you never know the stupid imagehosters)
19:20

masterX244

one thing that i like about grab-site is that i can reuse an ignore set from a crawl for future ones.
19:29

JAA

Oh yeah, those Gravatar URLs actually go to vanillicon.com, not back to the forums. Must've mixed that up with something else.
19:30

JAA

Outlinks from the AB crawl will also go into #// after some minor filtering.
19:42

apache2

is there a list of domains crawled by IA somewhere? is there a way to submit missing domains in bulk?
19:43

JAA

I don't believe they make anything like that public.
19:45

apache2

JAA: do you have an idea about where/whom I should submit the missing ones? (I take it they won't appreciate the "archive now" being spammed to death)
19:47

JAA

Not a clue. I'd probably ask info⊙ao for advice.
19:48

apache2

JAA: thanks, will do
20:28

h2ibot

Systwi edited IRC (+357, Added some IRC server URLs, sorted the list…): wiki.archiveteam.org/?diff=51105&oldid=50458
21:14

pupnik

thank you for preserving the maemo links o/
