#//

15:07

arkiver

great improvements in URL extraction from PDFs are in
15:16

xkey

arkiver: what lib(?) do you use for that? and how did it improve if I may ask?
15:17

xkey

meaning the extraction part
15:45

arkiver

convert PDF to HTML, extract URLs from HTML
15:46

arkiver

though it's more difficult than it sounds
15:46

arkiver

because some URLs are not 'clickable'
15:46

arkiver

some are split over multiple lines
15:46

arkiver

some do not have https://, etc.
15:47

arkiver

converting github.com/ArchiveTeam/urls-grab/blob/master/urls.lua#L1245-L1270
15:48

arkiver

extracting github.com/ArchiveTeam/urls-grab/blob/master/urls.lua#L1128-L1223
15:48

arkiver

xkey: ^
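
The approach arkiver describes above (convert the PDF to HTML, then pull URLs out of the HTML) can be illustrated with a rough Python sketch. This is not the code in urls.lua. The names `extract_urls`, `URL_RE`, `_Collector`, and `_normalize` are hypothetical. The rejoin rule for wrapped lines and the default to `https://` for scheme-less matches are assumptions made for the example. It takes already-converted HTML as input and covers the three cases mentioned: URLs that are not clickable, URLs split over multiple lines, and URLs without a scheme.

```python
import re
from html.parser import HTMLParser


class _Collector(HTMLParser):
    """Collects <a href> values and the visible text of an HTML document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)
        elif tag in ("br", "p", "div"):
            # Keep line and block boundaries so separate lines are not glued together.
            self.text_parts.append("\n")

    def handle_data(self, data):
        self.text_parts.append(data)


# Assumed pattern: optional scheme, dotted host with a letters-only TLD, optional path.
# It is deliberately loose, so it also matches things like "file.txt" (false positives).
URL_RE = re.compile(
    r"(?:https?://)?(?:[a-z0-9-]+\.)+[a-z]{2,}(?:/[^\s<>\"']*)?",
    re.IGNORECASE,
)


def _normalize(candidate):
    """Strip trailing punctuation and add a scheme if one is missing."""
    candidate = candidate.rstrip(".,;:!?)")
    if not re.match(r"https?://", candidate, re.IGNORECASE):
        candidate = "https://" + candidate  # assumption: default to https
    return candidate


def extract_urls(html):
    parser = _Collector()
    parser.feed(html)
    parser.close()

    # Clickable URLs: keep only absolute http(s) links.
    found = [h for h in parser.hrefs if h.lower().startswith(("http://", "https://"))]

    # Non-clickable URLs: rejoin a line ending in "/", "_" or "-" with the next line
    # (a heuristic for URLs wrapped over multiple lines), then scan the text.
    text = "".join(parser.text_parts)
    text = re.sub(r"(?<=[/_-])\n(?=\S)", "", text)
    found.extend(_normalize(m.group(0)) for m in URL_RE.finditer(text))

    # De-duplicate while preserving order.
    return list(dict.fromkeys(found))
```

Calling `extract_urls(html)` on a converted page returns a de-duplicated list of candidate URLs. A loose pattern like this will also pick up false positives and partial URLs, which matches the trade-off discussed in the channel: a few extra 404s in exchange for broader coverage.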
15:49

xkey

arkiver: thanks a lot! yea ain't easy, hence my question :) let's see about my Lua skills then haha
15:50

arkiver

it pretty much extracts all URLs now
15:50

xkey

nice congrats
15:50

arkiver

but there are false positives and partial/incorrect URLs being extracted
15:50

arkiver

but that'll just give us a few more 404s, not a huge problem
15:52

xkey

ack
16:09

that_lurker

Love seeing this on new or obscure URLs in the Wayback Machine: lounge.kuhaon.fun/folder/5cf8600497379da6/image.png
16:16

arkiver

that_lurker: yeah! quite a few URLs (many of them important) are archived quickly by this project, and some are saved only by this project
19:30

jacksonchen666

redirected from #discard: a discord hosted image: cdn.discordapp.com/attachments/9675…7224234979328/azealskywallpaper.png (from youtube.com/channel/UClLOsBKtKS8i9N…gkx3qSP1PsWtseM4YJXQBfy7X4CtPYD3GsD)
19:46

fireonlive

arkiver: are d*****d cdn urls from #discard dumps ok to be queued here willy-nilly or is that in general too much data?
20:13

TheTechRobo

fireonlive: I think it's fine - it's what I've been doing (with his blessing)
20:13

TheTechRobo

that may have changed, though
20:14

fireonlive

ah oki
