20:40:33 hey rewby, do you still have that mass warc outlink scanning setup?
20:40:35 can it (a) be run on specific sets of warcs and (b) find specific url-like patterns outside of href/src markup?
20:40:55 Yes on all three
20:44:36 great! what's the best way to specify the warcs (ab job id? domain? list of files?), and are there any restrictions on the patterns?
20:45:32 List of http urls is best. It tries its best to find anything url like based on a set of regexes
20:46:14 I have a script to turn ia collections into input for it
20:51:09 hmmm. i'd like to scan https://archive.fart.website/archivebot/viewer/job/9minf for the pattern '(file|image):\s*"([^"])",' (where the actual url of interest is in the second group; any urls found elsewhere in the source would be redundant).
20:51:58 is that possible, or would i have to match on internal url structure, rather than surrounding text, in order to do any filtering?
20:52:27 That should be easy to do
20:53:17 I'll look into it tomorrow
20:53:43 But should be an easy code patch to make it do what you want
20:55:56 ok, cool! let me know if you'd rather have a lookaround/reset-based pattern.
20:56:15 Nah, I can't do lookaround
20:56:27 yeah, i suspected as much
20:56:41 It's basically hyperscan glued to a streaming warc parser
20:57:16 So it has the limits that come with that choice of engine
20:58:57 makes sense. anyway if you link the results here i will post-process and pass them back to archivebot :)
20:59:09 Sure thing
20:59:41 thanks very much
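
The post-processing step mentioned near the end of the log could look roughly like the sketch below. It is not the scanner discussed above: it uses Python's standard `re` module, which supports capture groups, to pull the second group out of text that has already been extracted. Two things are assumptions rather than facts from the log: the character class is written as `[^"]+` (as quoted, `[^"]` would capture only a single character), and the `sample` text, the example.com URLs, and the `extract_urls` helper are invented for illustration.

```python
import re

# Pattern from the log, with one ASSUMED change: `[^"]+` instead of `[^"]`,
# because the pattern as quoted would capture only one character.
PATTERN = re.compile(r'(file|image):\s*"([^"]+)",')


def extract_urls(text):
    """Return the second capture group of every match, in order, without duplicates."""
    seen = set()
    urls = []
    for match in PATTERN.finditer(text):
        url = match.group(2)  # group 1 is the key name, group 2 is the URL
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


# Hypothetical page source, invented for illustration only.
sample = '''
playlist: [{
    file: "https://example.com/media/clip.mp4",
    image:"https://example.com/media/clip.jpg",
}]
<a href="https://example.com/other">not matched</a>
'''

if __name__ == "__main__":
    for url in extract_urls(sample):
        print(url)
```

Matching on the surrounding `file:`/`image:` text rather than on URL structure means that URLs appearing only in ordinary href/src markup, like the last line of the sample, are skipped.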