20:40:33 <thuban> hey rewby, do you still have that mass warc outlink scanning setup?
20:40:35 <thuban> can it (a) be run on specific sets of warcs and (b) find specific url-like patterns outside of href/src markup?
20:40:55 <rewby> Yes on all three
20:44:36 <thuban> great! what's the best way to specify the warcs (ab job id? domain? list of files?), and are there any restrictions on the patterns?
20:45:32 <rewby> List of http urls is best. It tries its best to find anything url-like based on a set of regexes
20:46:14 <rewby> I have a script to turn ia collections into input for it
20:51:09 <thuban> hmmm. i'd like to scan https://archive.fart.website/archivebot/viewer/job/9minf for the pattern '(file|image):\s*"([^"]+)",' (where the actual url of interest is in the second group; any urls found elsewhere in the source would be redundant).
20:51:58 <thuban> is that possible, or would i have to match on internal url structure, rather than surrounding text, in order to do any filtering?
20:52:27 <rewby> That should be easy to do
20:53:17 <rewby> I'll look into it tomorrow
20:53:43 <rewby> But should be an easy code patch to make it do what you want
20:55:56 <thuban> ok, cool! let me know if you'd rather have a lookaround/reset-based pattern.
20:56:15 <rewby> Nah, I can't do lookaround
20:56:27 <thuban> yeah, i suspected as much
20:56:41 <rewby> It's basically hyperscan glued to a streaming warc parser
20:57:16 <rewby> So it has the limits that come with that choice of engine
20:58:57 <thuban> makes sense. anyway if you link the results here i will post-process and pass them back to archivebot :)
20:59:09 <rewby> Sure thing
20:59:41 <thuban> thanks very much
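
The capture-group approach discussed above can be illustrated with a small sketch. This is not rewby's scanner: it uses Python's `re` module rather than Hyperscan, and it runs on an in-memory string rather than a streamed WARC. The function name `extract_urls` and the sample text are hypothetical. The second group is written as `[^"]+`, on the assumption that it is meant to capture one or more characters. No lookaround is needed because the URL is taken from the second group instead of being isolated by the match boundaries.

```python
import re

# Pattern from the discussion; the "+" in the second group is an assumption
# (the URL is presumed to be one or more non-quote characters).
PATTERN = re.compile(r'(file|image):\s*"([^"]+)",')


def extract_urls(text):
    """Return the second capture group of every match, in document order."""
    return [m.group(2) for m in PATTERN.finditer(text)]


# Hypothetical page source resembling a player config block.
sample = '''
player.setup({
    file: "https://example.com/media/clip.mp4",
    image: "https://example.com/media/clip.jpg",
    title: "not a url",
});
'''

urls = extract_urls(sample)
print(urls)
```

Only the `file:` and `image:` values are returned; the `title:` value is skipped because the surrounding text does not match, which is the kind of filtering by context (rather than by internal URL structure) asked about in the log.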