-
sep332
pokechu22: thanks!
-
pokechu22
From
blogger.com/profile/01726214044058904036 it seems like there's also
infosec-inmemoriam.blogspot.com which is pretty small; I threw that one in too
-
pokechu22
Thanks for bringing it up
-
mgrandi
Fig.co is going down this Sunday, May 28th
-
masterx244|m
planetminecraft (potential source for mediafire links) got a nice fuckyou in its pagination... it only allows you to switch to pages that are close to your current pagination...
-
masterx244|m
might have to pregenerate URL lists to ensure pagination is done in order if i want to spider that site
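Pregenerating an ordered URL list like masterx244|m describes could look something like this sketch; the base URL, query parameter, and page count are hypothetical placeholders, not planetminecraft's actual scheme.

```python
# Pregenerate an ordered list of pagination URLs so a spider can walk
# them strictly in sequence instead of relying on the site's own
# pagination links. Pattern and page count here are placeholders.
def pregenerate_page_urls(base_url, last_page):
    return [f"{base_url}?page={n}" for n in range(1, last_page + 1)]

for url in pregenerate_page_urls("https://example.com/resources", 5):
    print(url)
```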
-
Zaxoosh
Hello! I'm new here. I've got the warrior client running and everything seems to be working well, but I'm looking into running it permanently and potentially refining the process a bit. Would any of you care to share your workflow and how you have things set up?
-
masterX244
#warrior for that topic
-
masterX244
concurrency=1 and a URL list for starting seems to work for getting that pagination to cooperate...
-
nicolas17
kpcyrd: note that wayback machine exclusions are reversible, the data is not deleted
-
nicolas17
do you think there's anything we could/should do about Letzte?
-
ThreeHM
They have a new website at
letztegeneration.org (I'm getting redirected there when I use the old domain)
-
NickS|m
Hi, I was wondering if archivebot could crawl my dad's blog? Sorry if this is the wrong room; I forget where the right place to ask is. It's
havechanged.blogspot.com
-
JAA
NickS|m: Sure, I have thrown it into ArchiveBot.
-
NickS|m
Thank you so much!
-
NickS|m
He'll be very happy about this
-
JAA
It should all appear in the Wayback Machine within a couple days.
-
spirit
does archivebot retry on errors?
-
pokechu22
Yes
-
spirit
nice
-
pokechu22
It retries any 5xx or 4xx error (other than 404 and 403 and maybe 401) and most network-related errors. The retries wait until it has recursed through everything once, and then pages are retried twice (and anything found during that process is also recursed over)
-
spirit
nice
-
JAA
401 403 404 405 410 are not retried.
-
pokechu22
for particularly unstable sites it's possible to manually requeue errors again, but that requires someone to manually run a script on the database and generally is a pain. Probably won't be needed for artdoxa since there have been 1,357 errors recorded and 771,575 URLs successfully retrieved
-
JAA
200 204 304 are considered successes.
-
JAA
Connection refusals are also not retried.
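Putting pokechu22's and JAA's notes together, the retry policy described above could be sketched as a small classifier. This is an illustrative reconstruction from the conversation, not ArchiveBot's actual code, and the handling of codes outside the listed ranges is a guess.

```python
# Sketch of the retry policy described above: 200/204/304 count as
# successes, 401/403/404/405/410 are permanent failures, and any other
# 4xx/5xx code is retried (as are most network errors, though not
# connection refusals).
SUCCESS = {200, 204, 304}
NO_RETRY = {401, 403, 404, 405, 410}

def classify(status_code):
    if status_code in SUCCESS:
        return "success"
    if status_code in NO_RETRY:
        return "permanent-failure"
    if 400 <= status_code < 600:
        return "retry"
    return "other"  # e.g. redirects, handled separately

print(classify(503))  # retry
print(classify(404))  # permanent-failure
```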
-
pokechu22
huh, apparently new codes were added a few months ago:
mdn/content 2adf8a0 (some from WebDAV but others from
rfc-editor.org/rfc/rfc9110#name-changes-from-rfc-7231 it seems?)
-
pokechu22
421 Misdirected Request seems to be the only new one that's generally applicable
-
spirit
ah dang, they had to restart something which probably caused those errors
-
spirit
probably nothing to worry about in the big picture
-
pokechu22
1,357 errors is pretty small all things considered (and note that those errors could be from off-site links too)
-
pokechu22
The one thing I'm a bit unsure about is how to approach re-running the site later close to when it goes offline (to grab content between now and then). I guess ignoring
artdoxa-images.s3.amazonaws.com/uploads/artwork/image followed by a number less than 200000 would work (something like /image/(\d{1,5}|1\d{5})/ maybe) and the same for some on-site URLs?
-
pokechu22
I'm pretty sure that it was above 200000 when the job was started which does make things easier (a regex for less than, say, 182563, is a lot more painful to write)
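The ignore pattern pokechu22 suggests can be sanity-checked quickly; this just exercises the proposed regex against a few IDs above and below 200000 (the sample paths are made up).

```python
import re

# /image/ followed by a 1-5 digit number, or a 6-digit number starting
# with 1 -- i.e. any ID below 200000.
pattern = re.compile(r"/image/(\d{1,5}|1\d{5})/")

assert pattern.search("/uploads/artwork/image/42/")
assert pattern.search("/uploads/artwork/image/199999/")
assert not pattern.search("/uploads/artwork/image/200001/")
```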
-
spirit
there shouldn't be new content i think
-
spirit
oh there is :\
-
spirit
should be enough to paginate through
artdoxa.com/more?page=2 until mid march is reached
-
masterX244
not sure if archivebot allows starting on a linklist and ignoring any pagination (starting urls ignore the ignore-filter?) to limit what it sees
-
masterX244
(did that once when i had to bypass pagination on a grab-site based crawl)
-
pokechu22
You can do an !a < list job but it's rather annoying to do (it breaks if some of the URLs have more slashes than others, and it also makes adding an ignore later more annoying). Also you'd need to make sure to ignore other things (e.g. the tag pages) to avoid it redownloading everything via those...
-
masterX244
going to have similar fun with grabsite soon, planetminecraft pagination is a bitch since they only allow moving forwards to the visible pages on the pagination bar and any skip ends you on a different page than the URL intends
-
masterX244
(for example trying to go to page 85 from 3 brings you to 4 with the 85 in the URL)
-
spirit
in the worst case i'll just go through those with a slow loop over SPN
-
masterX244
JAA: thanks for the hint that wpull logs ignored URLs in its db too, even when they are skipped
-
masterX244
seems to work so far (the ugly hack with the URLList entry for my crawl). That should yield quite a bunch of juicy outlinks once parsed
-
umgr036
Korean blog website egloos.com, established in 2003, will be closed on June 16, 2023.
-
kpcyrd
I found a snapshot that works:
archive.is/cX7xa and also of the takedown:
archive.is/QM1xu
-
JAA
I threw a bunch of stuff from the orgs in various countries into AB earlier (and also the surviving German things, of course).
-
JAA
And the YouTube channels into #down-the-tube as well.
-
h2ibot
Gridkr edited List of websites excluded from the Wayback Machine (+24,
phcorner.net This URL has been…):
wiki.archiveteam.org/?diff=49833&oldid=49802
-
h2ibot
Yts98 created LINE (+3550, Created page with "{{Infobox project | title =…):
wiki.archiveteam.org/?title=LINE
-
h2ibot
Yts98 edited Template:Instant messengers (+0, Capitalize LINE):
wiki.archiveteam.org/?diff=49835&oldid=49692
-
h2ibot
Yts98 edited Template:Navigation box (+36, Added LINE BLOG and Xuite):
wiki.archiveteam.org/?diff=49836&oldid=49098
-
h2ibot
Yts98 created LINE BLOG (+5257, Created page with "{{Infobox project | title =…):
wiki.archiveteam.org/?title=LINE%20BLOG
-
h2ibot
Cooljeanius edited Deathwatch (+174, /* 2023 */ add home.social):
wiki.archiveteam.org/?diff=49838&oldid=49831