01:41:00 JustAnotherArchivist edited Onlyfiles (+432, It's dead, Jim): https://wiki.archiveteam.org/?diff=49404&oldid=49375
01:42:00 JustAnotherArchivist edited Onlyfiles (+59): https://wiki.archiveteam.org/?diff=49405&oldid=49404
01:44:00 Mentalist edited Deathwatch (-17): https://wiki.archiveteam.org/?diff=49406&oldid=49397
02:55:08 Re: WBM CDX Stuffs
02:55:08 JAA: Thanks for the response, I really appreciate it.
02:55:08 Do even privileged users in ArchiveTeam have access to the WBM grab items, or is that something I have no chance of touching?
02:55:37 I'm sure Jason has access, but he works at IA, so...
02:56:02 Otherwise, I'm almost certain the answer is no.
05:58:44 JustAnotherArchivist edited Deathwatch (+164, /* 2023 */ Add MyWOT forum): https://wiki.archiveteam.org/?diff=49407&oldid=49406
05:58:54 Haha, qwarc goes brrrrr :-)
06:09:01 Looks good, I should have a complete copy of the topic pages in 20-ish minutes.
06:30:57 Done, and apart from one topic with an SQL error, it ran almost suspiciously smoothly.
06:33:16 For sites with a bunch of JS links on one page, is it possible to pass AB multiple entry points for !a instead of just one?
06:34:50 pabs: !a < exists, but it has various sharp edges that can backfire badly with regards to recursion, which is why it's undocumented. Depending on the URL list, site structure, and retrieval order, it might recurse further or less far than you'd think.
06:36:34 For example, I wanted to archive http://www.hungry.com/ but http://www.hungry.com/people/ only has JS links.
06:49:51 The challenge is that http://www.hungry.com/people/ gives links to e.g. http://www.hungry.com/~alves/ and http://www.hungry.com/~beers/
06:50:47 In this case, an !a < list would probably work, since it's (at least assumed to be) safe if you can do e.g. http://www.hungry.com/~alves http://www.hungry.com/~beers etc. (all URLs without a slash in the path)
06:53:01 Looks like http://www.hungry.com/hungries/hungries.js stores all of that stuff. AB does sometimes directly extract links from JavaScript, but I'm not sure if it would extract these or not (I think it will extract the images like /hungries/watkins0.jpg, but I'm less sure whether it extracts full URLs for whatever reason).
07:00:50 Re: WBM CDX stuff again
07:00:50 Thanks JAA, figures haha, didn't realize Jason was considered part of AT
07:02:10 audrooku|m: Well, he founded it, so yeah. :-) He isn't around much these days though.
07:02:32 JAA: How does qwarc handle threads with multiple pages?
07:03:15 Ah huh, didn't know haha, makes a lot of sense though ;)
07:03:27 (My impression is that it just indiscriminately downloads a series of URLs, and you'd generate an incremental list of thread IDs and download those, in which case you'd need special logic to know if multiple pages exist. But I might be wrong on this.)
07:05:05 pokechu22: qwarc's just a framework for grabbing stuff with very little overhead. It doesn't do anything on its own. You need to write code (called a spec file, I might change that term at some point) to tell it what to actually do. And yes, my code for MyWOT handled topic pagination (and session IDs and the language switcher).
07:06:25 Ah, it's not just a version of !ao < list that doesn't extract images, it actually does know how to look at the response to extract more links (if you tell it how to). OK, good to know.
07:06:42 No, it does not. You need to write that code to make it do that.
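To make "you need to write that code" a bit more concrete: the sketch below is not qwarc's actual API (for that, see the MyWOT spec file linked a few messages later); it is only an illustration of the division of labour described here, with a tiny fetch layer plus user-written logic that handles pagination itself. All names, the forum URL pattern, and the pagination check are made up for the example.

    # Illustrative only -- NOT qwarc code. A minimal fetch layer plus
    # user-written pagination logic, mirroring the split described above.
    import asyncio
    import re

    import aiohttp


    async def fetch_text(session, url):
        """The 'framework' part: fetch one URL and hand back status and body."""
        async with session.get(url) as resp:
            return resp.status, await resp.text()


    async def process_topic(session, topic_id):
        """The 'spec file' part: user code decides how to walk a paginated topic."""
        page = 1
        while True:
            # Hypothetical URL pattern; a real forum would differ.
            url = f'https://forum.example.com/topic/{topic_id}?page={page}'
            try:
                status, body = await fetch_text(session, url)
            except aiohttp.ClientError:
                break
            if status != 200:
                break
            # User-written logic: keep going only while a 'next page' link exists.
            if not re.search(r'rel="next"', body):
                break
            page += 1


    async def main():
        async with aiohttp.ClientSession() as session:
            # The real tool drives concurrency itself; here we just gather a few items.
            await asyncio.gather(*(process_topic(session, i) for i in range(1, 6)))


    asyncio.run(main())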
07:07:28 Right, but I was assuming that it was *just* capable of doing lists without any way of expanding on it.
07:08:07 It provides an interface for 'fetch this URL' with a callback to decide what to do with the response (accept, retry; write to WARC or not). And it handles the HTTP and WARC writing. Plus there's a CLI for actually running it, with concurrency and parallel processes and stuff.
07:08:20 All other logic is user-provided in the spec file.
07:09:03 I did write a spec file a long while ago that does recursive crawls. That's what I was referring to in #archivebot earlier.
07:09:23 But on its own, qwarc doesn't even look at the HTTP body at all. It just downloads it and (unless prevented) writes it to the WARC.
07:09:54 That's also part of why it's so efficient.
07:12:10 This is what the spec file for the MyWOT Forums looks like: https://transfer.archivete.am/GDKC3/forum.mywot.com.py
07:23:34 JAA: As always, feel free to put the outlinks from that site in #// :)
07:25:38 arkiver: Yup, and I'll have a bunch from others as well. :-)
07:25:53 awesome!
08:06:29 pokechu22: Hmm, how would one find out? Guess I could just try !a http://www.hungry.com/hungries/hungries.js and separately !a the other stuff?
08:08:59 I think the extraction behaves differently if you save the JS file itself versus saving a page containing it, so I'm not sure exactly. Probably the easiest thing to do would be just !a http://www.hungry.com/ and watch the log to see whether it finds those pages or not.
08:16:34 In any case, I'm going to sleep.
11:39:32 Did we run all known twu.net sites through AB?
11:39:36 I never got a reply from them, btw.
11:42:34 oh
11:43:27 I see a reply in my spam saying the email to support⊙tn could not be delivered.
16:45:57 I'd have thought there'd be more available logs for IRCs around here -- must be a good sum of wisdom, resources, and history that passes through.
17:30:10 I thought so too.
19:47:24 Hmm, is there an ongoing project where there's constant archiving of Pastebin and Pastebin-like services?
20:11:41 Hi
20:13:33 I was wondering... what is the (Readline timed out) error?
20:13:41 And how do I fix it?
20:29:36 That just means that the site didn't respond in time. If it's happening for everything on the site, the site might have blocked you. If it's happening somewhat randomly, and in particular the site has lots of large files, try setting the concurrency to 1.
20:31:36 Oh, yeah, it's happened randomly. I'll give your solution a shot.
20:52:57 pokechu22: Thank you, it makes it even better... but I've noticed that some "soft 404" pages (or redirects) on other sites return "Readline timeout".
20:54:37 Hmm, that probably depends on the site as well.
20:55:02 At least with ArchiveBot, pages that give an error like that are automatically retried after everything finishes (and will be attempted up to 3 times total).
21:04:36 It would be better if it archived those "soft error" pages so that the website experience (via WARC) is more similar to the original...
21:09:17 If it's an actual 404 (or 3XX or similar) error code, it'll be saved. An example of an actual readline timeout is e.g. http://www.holypeak.com/talent/voiceactor/shuka_saito.html which isn't something that it would make sense to save.
21:10:38 Hmm, actually, that one eventually redirected to a Cloudflare error page (after a minute), so it's not a perfect example as that could be saved, but ArchiveBot gives up after 30 seconds (IIRC).
21:11:32 http://www.ja.net/company/policies/aup.html might be a better example then?
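As a rough illustration of the retry behaviour described above (not ArchiveBot's actual code), the sketch below gives each request 30 seconds and tries a URL up to 3 times in total, matching the numbers mentioned in the conversation; the example URL is just the one cited above and the function name is made up.

    # Illustrative only: timeout-and-retry logic resembling the behaviour
    # described above, not ArchiveBot/wpull internals.
    import requests


    def fetch_with_retries(url, attempts=3, timeout=30):
        for attempt in range(1, attempts + 1):
            try:
                resp = requests.get(url, timeout=timeout)
            except requests.Timeout:
                # Nothing was received in time, so there is nothing to save;
                # this is the 'Readline timed out' situation.
                print(f'{url}: timed out (attempt {attempt}/{attempts})')
                continue
            # An HTTP error status still yields a response body that can be
            # written to a WARC; only a timeout leaves nothing to record.
            return resp
        return None


    fetch_with_retries('http://www.ja.net/company/policies/aup.html')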
21:23:15 Can I extend ArchiveBot's timeout?
21:28:17 I don't know. I assume you're using grab-site, which may or may not let you change that easily; I'm not familiar with it. (#archivebot via IRC doesn't support changing the timeout though.)
21:30:44 I am not using grab-site; I am using the full ArchiveBot system (with pipelines).
22:12:12 Anyway, I was thinking of making ArchiveBot check the website (perhaps for a keyword or something) and archive it if that check passes, and skip it otherwise.
22:12:31 But I don't know how to do that...
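A minimal sketch of that keyword idea, assuming "archive it" just means deciding whether to queue the URL somewhere: fetch the page, look for a keyword, and only then hand the URL on. The URL, keyword, and function name are placeholders; wiring this into an actual ArchiveBot pipeline is left open, as in the conversation.

    # Illustrative only: keyword check before deciding to archive a URL.
    import requests


    def should_archive(url, keyword):
        try:
            resp = requests.get(url, timeout=30)
        except requests.RequestException:
            return False
        return keyword.lower() in resp.text.lower()


    url = 'http://example.com/'
    if should_archive(url, 'keyword of interest'):
        print(f'queue for archiving: {url}')  # e.g. hand the URL to your pipeline
    else:
        print(f'skip: {url}')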