-
thubanhey rewby, do you still have that mass warc outlink scanning setup?
-
thubancan it (a) be run on specific sets of warcs and (b) find specific url-like patterns outside of href/src markup?
-
rewbyYes on all three
-
thubangreat! what's the best way to specify the warcs (ab job id? domain? list of files?), and are there any restrictions on the patterns?
-
rewbyList of http urls is best. It tries its best to find anything url-like based on a set of regexes
-
rewbyI have a script to turn ia collections into input for it
-
thubanhmmm. i'd like to scan archive.fart.website/archivebot/viewer/job/9minf for the pattern '(file|image):\s*"([^"])",' (where the actual url of interest is in the second group; any urls found elsewhere in the source would be redundant).
-
thubanis that possible, or would i have to match on internal url structure, rather than surrounding text, in order to do any filtering?
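
[Editor's sketch, not part of the conversation: a minimal illustration of matching on the surrounding text and keeping only the second capture group, using Python's standard `re` module rather than the scanner discussed here. The pattern as posted has `[^"]` without a quantifier, which would capture a single character; the `+` below is an assumption about what was intended. The sample text and the name `extract_urls` are made up for illustration.]

```python
import re

# Assumption: `+` added after [^"] so group 2 captures the whole quoted value;
# the pattern as posted would capture only one character.
PATTERN = re.compile(r'(file|image):\s*"([^"]+)",')

def extract_urls(text):
    """Return the second capture group of every match, in document order."""
    return [m.group(2) for m in PATTERN.finditer(text)]

# Hypothetical page source: two URLs of interest and one ordinary link.
sample = '''
playlist: [{
    file: "https://example.com/media/clip.mp4",
    image:"https://example.com/media/thumb.jpg",
    title: "not a url of interest",
}]
<a href="https://example.com/other">elsewhere</a>
'''

print(extract_urls(sample))
# ['https://example.com/media/clip.mp4', 'https://example.com/media/thumb.jpg']
```

[The href link is skipped because the filter keys on the `file:`/`image:` prefix, not on the structure of the URL itself.]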
-
rewbyThat should be easy to do
-
rewbyI'll look into it tomorrow
-
rewbyBut should be an easy code patch to make it do what you want
-
thubanok, cool! let me know if you'd rather have a lookaround/reset-based pattern.
-
rewbyNah, I can't do lookaround
-
thubanyeah, i suspected as much
-
rewbyIt's basically hyperscan glued to a streaming warc parser
-
rewbySo it has the limits that come with that choice of engine
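
[Editor's sketch, not part of the conversation: to my understanding, Hyperscan rejects lookaround and reports match offsets rather than capture groups, so one plausible way to get "only the second group" is a two-stage pass. This is a guess at the general shape, not rewby's actual patch. Python's `re` stands in for the streaming engine in stage 1, and the names `PREFILTER`, `EXTRACT`, `scan_spans`, and `urls_from_spans` are hypothetical. The `+` quantifier is an assumption, as the posted pattern has none.]

```python
import re

# Stage 1 stand-in: a pattern with no capture groups and no lookaround, the
# kind of expression an offset-reporting engine can accept. Only the match
# offsets are used, mimicking what such an engine would hand back.
PREFILTER = re.compile(rb'(?:file|image):\s*"[^"]+",')

# Stage 2: an ordinary capturing regex, run only on each small matched span.
EXTRACT = re.compile(rb'(?:file|image):\s*"([^"]+)",')

def scan_spans(payload):
    """Return (start, end) offsets of every prefilter match in the payload."""
    return [(m.start(), m.end()) for m in PREFILTER.finditer(payload)]

def urls_from_spans(payload, spans):
    """Re-match each reported span and keep only the quoted value."""
    urls = []
    for start, end in spans:
        m = EXTRACT.fullmatch(payload[start:end])
        if m:
            urls.append(m.group(1).decode("utf-8", errors="replace"))
    return urls

# Hypothetical record body, as bytes, the way a WARC payload would arrive.
body = b'x file: "http://a.example/b.mp4", y image:"http://a.example/c.png", z'
print(urls_from_spans(body, scan_spans(body)))
```

[The second stage only touches the bytes inside each reported span, so the expensive scan over the full WARC stays with the fast engine.]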
-
thubanmakes sense. anyway if you link the results here i will post-process and pass them back to archivebot :)
-
rewbySure thing
-
thubanthanks very much