-
<arkiver> great improvements in URL extraction from PDFs are in
-
<xkey> arkiver: what lib(?) do you use for that? and how did it improve if I may ask?
-
<xkey> meaning the extraction part
-
<arkiver> convert PDF to HTML, extract URLs from HTML
-
<arkiver> though it's more difficult than it sounds
-
<arkiver> because some URLs are not 'clickable'
-
<arkiver> some are split over multiple lines
-
<arkiver> some do not have https://, etc.
-
<arkiver> xkey: ^
-
<xkey> arkiver: thanks a lot! yea ain't easy, hence my question :) let's see about my Lua skills then haha
-
<arkiver> it pretty much extracts all URLs now
-
<xkey> nice congrats
-
<arkiver> but there are false positives and partial/incorrect URLs being extracted
-
<arkiver> but that'll just give us a few more 404s, not a huge problem
-
<xkey> ack
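
For readers following along, the approach arkiver describes above (convert the PDF to HTML, then extract URLs from the HTML) might look roughly like the sketch below. This is not the project's actual code, which is not shown in this log. It is a Python illustration using only the standard library, and it assumes the PDF has already been converted to HTML by some external tool. The names `extract_urls`, `URL_RE` and `default_scheme` are made up for this example, and the rejoining rule (glue lines together only after a trailing `/` or `-`) is one guess at a workable heuristic. It will not help if the converter puts each line in its own element.

```python
import re
from html.parser import HTMLParser


class _LinkAndTextParser(HTMLParser):
    """Collects href values of <a> tags and all visible text."""

    def __init__(self):
        super().__init__()
        self.hrefs = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

    def handle_data(self, data):
        self.text_parts.append(data)


# Heuristic pattern (an assumption, not a standard): optional http(s) scheme,
# a dotted host name ending in 2+ letters, then an optional path.
URL_RE = re.compile(
    r"(?:https?://)?(?:[a-z0-9-]+\.)+[a-z]{2,}(?:/[^\s<>\"']*)?",
    re.IGNORECASE,
)


def _normalize(candidate, default_scheme):
    # Drop trailing punctuation that usually belongs to the sentence.
    candidate = candidate.rstrip(".,;:!?)]}'\"")
    # Add a scheme for URLs written without one.
    if not re.match(r"https?://", candidate, re.IGNORECASE):
        candidate = default_scheme + candidate
    return candidate


def extract_urls(html, default_scheme="https://"):
    """Return de-duplicated URLs found in links and in plain text."""
    parser = _LinkAndTextParser()
    parser.feed(html)
    parser.close()

    # 1. 'Clickable' URLs: keep only http(s) hrefs.
    found = [h for h in parser.hrefs if re.match(r"https?://", h, re.IGNORECASE)]

    # 2. Non-clickable URLs in the text. First rejoin URLs that were
    #    broken across lines right after a '/' or '-'.
    text = "".join(parser.text_parts)
    text = re.sub(r"(?<=[/\-])[ \t]*\r?\n[ \t]*", "", text)
    for match in URL_RE.finditer(text):
        found.append(_normalize(match.group(0), default_scheme))

    # Preserve first-seen order while removing duplicates.
    return list(dict.fromkeys(found))
```

As noted in the conversation, a pattern this loose produces false positives and partial URLs (for example, a bare `file.pdf` in the text looks like a host name), which in an archiving context mostly costs a few extra 404s.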
-
<that_lurker> Love seeing this on new or obscure urls in wayback machine lounge.kuhaon.fun/folder/5cf8600497379da6/image.png
-
<arkiver> that_lurker: yeah! quite some (and also many important) URLs are archived fast by this project and sometimes are only saved by this project
-
<jacksonchen666> redirected from #discard: a discord hosted image: cdn.discordapp.com/attachments/9675…7224234979328/azealskywallpaper.png (from youtube.com/channel/UClLOsBKtKS8i9N…gkx3qSP1PsWtseM4YJXQBfy7X4CtPYD3GsD)
-
<fireonlive> arkiver: are d*****d cdn urls from #discard dumps ok to be queued here willy-nilly or is that in general too much data?
-
<TheTechRobo> fireonlive: I think it's fine - it's what I've been doing (with his blessing)
-
<TheTechRobo> that may have changed, though
-
<fireonlive> ah oki