01:38:07 pokechu22: thanks!
01:38:47 From https://www.blogger.com/profile/01726214044058904036 it seems like there's also https://infosec-inmemoriam.blogspot.com/ which is pretty small; I threw that one in too
01:40:22 Thanks for bringing it up
01:55:53 https://www.fig.co/campaigns/psychonauts-2/updates/1707
01:56:07 Fig.co going down this Sunday, May 28th
03:08:21 https://techcrunch.com/2023/05/23/meta-sells-giphy-to-shutterstock-for-53m-after-uk-divestment-order/
12:52:47 planetminecraft (potential source for mediafire links) got a nice fuckyou in its pagination... it only allows you to switch to pages that are close to your current pagination...
12:53:40 might have to pregenerate urllists for ensuring pagination is done in order if i want to spider that site
15:56:53 Hello! I'm new here, I've got the warrior client running and everything seems to be working well, however I'm looking into running it permanently and potentially refining the process around it all. Could any of you care to share your workflow and how you have your things set up?
16:19:01 #warrior for that topic
16:24:49 concurrency=1 and a URL list for starting seems to work for getting that pagination to cooperate...
18:19:21 kpcyrd: note that wayback machine exclusions are reversible, the data is not deleted
18:19:41 do you think there's anything we could/should do about Letzte?
18:39:43 They have a new website at https://letztegeneration.org/ (I'm getting redirected there when I use the old domain)
18:41:07 Might also be a good idea to get their social media: https://twitter.com/AufstandLastGen https://www.youtube.com/@letztegeneration/videos
19:03:41 Hi, I was wondering if archivebot could crawl my dad's blog? Sorry if this is the wrong room; I forget where the right place to ask is. It's https://havechanged.blogspot.com
19:05:04 NickS|m: Sure, I have thrown it into ArchiveBot.
19:05:36 Thank you so much!
19:05:44 He'll be very happy about this
19:06:02 It should all appear in the Wayback Machine within a couple days.
19:15:14 does archivebot retry on errors?
19:15:43 Yes
19:16:44 nice
19:16:56 It retries any 5xx or 4xx error (other than 404 and 403 and maybe 401) and most network-related errors. The retries wait until it's recursed through everything once, and then pages are retried twice (and anything found during that process is also recursed over)
19:17:29 nice
19:18:16 401 403 404 405 410 are not retried.
19:18:21 for particularly unstable sites it's possible to manually requeue errors again, but that requires someone to manually run a script on the database and generally is a pain. Probably won't be needed for artdoxa since there have been 1,357 errors recorded and 771,575 URLs successfully retrieved
19:18:24 200 204 304 are considered successes.
19:20:02 Connection refusals are also not retried.
19:21:21 huh, apparently new codes were added a few months ago: https://github.com/mdn/content/commit/2adf8a015997173bdb5fdcb55835e3eabd57bf40 (some from WebDAV but others from https://www.rfc-editor.org/rfc/rfc9110#name-changes-from-rfc-7231 it seems?)
19:23:29 421 Misdirected Request seems to be the only new one that's generally applicable
19:35:27 ah dang, they had to restart something which probably caused those errors
19:35:34 probably nothing to worry about in the big picture
19:37:30 1,357 errors is pretty small all things considered (and note that those errors could be from off-site links too)
19:39:59 The one thing I'm a bit unsure about is how to approach re-running the site later close to when it goes offline (to grab content between now and then). I guess ignoring https://artdoxa-images.s3.amazonaws.com/uploads/artwork/image/ followed by a number less than 200000 would work (something like /image/(\d{1,5}|1\d{5})/ maybe) and the same for some on-site URLs?
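A quick way to sanity-check the ignore pattern suggested at 19:39:59 is to run it against IDs on both sides of the 200000 boundary. This is only a sketch: it assumes the ignore is applied as a regular expression searched against each URL, and the file name `example.jpg` and the helper `is_ignored` are made up for illustration.

```python
import re

# Pattern quoted in the log: IDs with 1 to 5 digits, or 6 digits starting with 1,
# i.e. any ID below 200000.
OLD_IMAGE = re.compile(r"/image/(\d{1,5}|1\d{5})/")

# URL prefix taken from the log; the trailing file name below is hypothetical.
PREFIX = "https://artdoxa-images.s3.amazonaws.com/uploads/artwork/image/"


def is_ignored(url: str) -> bool:
    """Return True if the URL points at an image ID below 200000."""
    return OLD_IMAGE.search(url) is not None


if __name__ == "__main__":
    for image_id in (1, 99999, 100000, 199999, 200000, 1000000):
        url = f"{PREFIX}{image_id}/example.jpg"
        print(image_id, is_ignored(url))
```

The trailing slash in the pattern is what keeps longer IDs such as 1000000 from matching on their first five or six digits.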
19:41:57 I'm pretty sure that it was above 200000 when the job was started which does make things easier (a regex for less than, say, 182563, is a lot more painful to write)
19:42:29 there shouldn't be new content i think
19:42:44 oh there is :\
19:43:20 should be enough to paginate through https://artdoxa.com/more?page=2 until mid march is reached
19:44:21 not sure if archivebot allows starting on a linklist and ignoring any pagination (starting urls ignore the ignore-filter?) to limit what it sees
19:44:40 (did that once when i had to bypass pagination on a grab-site based crawl)
19:47:39 You can do an !a < list job but it's rather annoying to do (it breaks if some of the URLs have more slashes than others, and it also makes adding an ignore later more annoying). Also you'd need to make sure to ignore other things (e.g. the tag pages) to avoid it redownloading everything via those...
19:50:25 going to have similar fun with grabsite soon, planetminecraft pagination is a bitch since they only allow moving forwards to the visible pages on the pagination bar and any skip ends you on a different page than the URL intends
19:50:49 (for example trying to go to page 85 from 3 brings you to 4 with the 85 in the URL)
19:52:21 in the worst case i'll just go through those with a slow loop over SPN
20:04:22 JAA: thanks for the hint that wpull logs ignored URLs in its db too, even when they are skipped
20:33:01 seems to work so far (the ugly hack with the URLList entry for my crawl). that should yield quite a bunch of juicy outlinks once parsed
20:34:00 Korean blog website egloos.com established in 2003 will be closed on 6/16/2023.
20:57:30 I found a snapshot that works: https://archive.is/cX7xa and also of the takedown: https://archive.is/QM1xu
20:58:53 I threw a bunch of stuff from the orgs in various countries into AB earlier (and also the surviving German things, of course).
20:59:05 And the YouTube channels into #down-the-tube as well.
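The pregenerated URL list mentioned earlier (12:53:40, 19:43:20, 20:33:01) could be built roughly as follows. This is a sketch under assumptions: the `page` query parameter comes from the artdoxa URL in the log, but the helper name `page_urls` and the last page number are hypothetical, and the real cutoff ("until mid march is reached") would have to be found by hand.

```python
from urllib.parse import urlencode


def page_urls(base: str, first: int, last: int, param: str = "page") -> list[str]:
    """Build one URL per listing page, in ascending order."""
    return [f"{base}?{urlencode({param: n})}" for n in range(first, last + 1)]


if __name__ == "__main__":
    # Last page (50) is a placeholder, not a value from the log.
    urls = page_urls("https://artdoxa.com/more", 2, 50)
    # One URL per line, the usual shape for a crawler's URL-list input.
    print("\n".join(urls))
```

Feeding the list in order with a concurrency of 1, as noted at 16:24:49, matters for sites where the server only honours page jumps close to the current page.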
22:39:50 https://minecraft.curseforge.com/mc-mods/248994 this appears to have been disabled :(
22:55:02 Gridkr edited List of websites excluded from the Wayback Machine (+24, https://phcorner.net/ This URL has been…): https://wiki.archiveteam.org/?diff=49833&oldid=49802
22:55:03 Yts98 created LINE (+3550, Created page with "{{Infobox project | title =…): https://wiki.archiveteam.org/?title=LINE
22:55:04 Yts98 edited Template:Instant messengers (+0, Capitalize LINE): https://wiki.archiveteam.org/?diff=49835&oldid=49692
22:55:05 Yts98 edited Template:Navigation box (+36, Added LINE BLOG and Xuite): https://wiki.archiveteam.org/?diff=49836&oldid=49098
22:55:06 Yts98 created LINE BLOG (+5257, Created page with "{{Infobox project | title =…): https://wiki.archiveteam.org/?title=LINE%20BLOG
22:56:02 Cooljeanius edited Deathwatch (+174, /* 2023 */ add home.social): https://wiki.archiveteam.org/?diff=49838&oldid=49831