15:07:44 great improvements in URL extraction from PDFs are in
15:16:47 arkiver: what lib(?) do you use for that? and how did it improve if I may ask?
15:17:03 meaning the extraction part
15:45:54 convert PDF to HTML, extract URLs from HTML
15:46:07 though it's more difficult than it sounds
15:46:18 because some URLs are not 'clickable'
15:46:23 some are split over multiple lines
15:46:32 some do not have https://, etc.
15:47:41 converting https://github.com/ArchiveTeam/urls-grab/blob/master/urls.lua#L1245-L1270
15:48:03 extracting https://github.com/ArchiveTeam/urls-grab/blob/master/urls.lua#L1128-L1223
15:48:25 xkey: ^
15:49:45 arkiver: thanks a lot! yea ain't easy, hence my question :) let's see about my Lua skills then haha
15:50:17 it pretty much extracts all URLs now
15:50:30 nice congrats
15:50:33 but there are false positives and partial/incorrect URLs being extracted
15:50:44 but that'll just give us a few more 404s, not a huge problem
15:52:31 ack
16:09:25 Love seeing this on new or obscure urls in wayback machine https://lounge.kuhaon.fun/folder/5cf8600497379da6/image.png
16:16:10 that_lurker: yeah! quite some (and also many important) URLs are archived fast by this project and sometimes are only saved by this project
19:30:29 redirected from #discard: a discord hosted image: https://cdn.discordapp.com/attachments/967501430317535353/1023757224234979328/azealskywallpaper.png (from https://www.youtube.com/channel/UClLOsBKtKS8i9N12l6Uza3g/community?lb=Ugkx3qSP1PsWtseM4YJXQBfy7X4CtPYD3GsD)
19:46:38 arkiver: are d*****d cdn urls from #discard dumps ok to be queued here willy-nilly or is that in general too much data?
20:13:26 fireonlive: I think it's fine - it's what I've been doing (with his blessing)
20:13:34 that may have changed, though
20:14:21 ah oki
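[editor's note] The extraction approach discussed above (convert PDF to HTML, then pull URLs out of the text, coping with line-split URLs and URLs missing their scheme) can be sketched roughly as follows. The real implementation is in Lua (the linked urls.lua); this is a hedged Python illustration of the same idea, with made-up regexes and heuristics, not the project's actual logic. As the log notes, the scheme-less heuristic is exactly where false positives and partial URLs creep in.

```python
import re

# Full URLs: anything starting with http:// or https://.
URL_RE = re.compile(r'https?://[^\s<>"\']+')

# Scheme-less candidates like "example.com/page" (heuristic; this is
# the main source of the false positives mentioned in the log).
BARE_RE = re.compile(r'\b(?:[a-z0-9-]+\.)+[a-z]{2,}(?:/[^\s<>"\']*)?', re.I)


def extract_urls(text: str) -> set[str]:
    # Rejoin URLs that a PDF line break split in two: if a URL runs to
    # the end of a line, glue the next line's first token onto it.
    # Crude heuristic; it will occasionally join things it shouldn't.
    joined = re.sub(r'(https?://\S+)\n(\S+)', r'\1\2', text)

    # Strip common trailing punctuation that clings to URLs in prose.
    urls = {u.rstrip('.,);') for u in URL_RE.findall(joined)}

    # Look for scheme-less URLs only in what remains after removing the
    # full ones, so their domains are not extracted a second time.
    rest = URL_RE.sub(' ', joined)
    for m in BARE_RE.findall(rest):
        urls.add('https://' + m.rstrip('.,);'))
    return urls
```

A quick example: `extract_urls("see https://example.org/a\nb and test.net/x")` rejoins the split URL and upgrades the bare domain, yielding `{"https://example.org/ab", "https://test.net/x"}`.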