-
Medowar
efenotypefenotypefenotypefenotypefenotypefenotypefenotypefenotypefenotypefenotypefenotypefenotypefenotypefenotypewww.myfonts.comwww.myfonts.comwww.myfonts.comwww.myfonts.com/
-
JAA
We should probably just filter out anything that can't be a valid domain.
-
JAA
(Ignoring bare IPs for the moment)
-
JAA
DNS has limits for this: 63 chars per label, and 255 chars total including the label lengths.
-
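A minimal sketch of such a validity filter, in Python (hypothetical, not the project's actual code), applying the RFC 1035 limits JAA mentions above:

    import re

    # RFC 1035: each label is at most 63 octets, and the whole name,
    # including the per-label length bytes, is at most 255 octets.
    LABEL_RE = re.compile(r"^[a-z0-9]([a-z0-9-]*[a-z0-9])?$", re.IGNORECASE)

    def is_plausible_domain(host: str) -> bool:
        labels = host.split(".")
        # +1 per label for its length byte, +1 for the terminating root byte
        if sum(len(label) + 1 for label in labels) + 1 > 255:
            return False
        return all(len(label) <= 63 and LABEL_RE.match(label)
                   for label in labels)
-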
» nicolas17 scrolls up
-
» nicolas17 stabs Cheesy
-
datechnoman
arkiver - Just my workers alone are hitting viralcovert.com on average 15k per minute and it's been 502'ing for a few hours. There is some kind of weird loop going on and we are definitely killing the website. Might want to filter it out for this project
-
datechnoman
Removing it will speed things up a fair bit also :D
-
datechnoman
JAA - There is nothing wrong with this url right? :P
-
datechnoman
%80%93%E2%88%920.014zr3.8593.7803.9213.8623.8773.8343.9203.9043.9503.8993.8983.8933.8673.780%E2%80%933.967hf0.0410.0270.0310.0380.0410.0470.0480.0370.0380.0320.0370.0390.0460.025%E2%80%930.048th0.0020.008%E2%80%93%E2%80%930.0030.007%E2%80%930.003%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%88%920.012u%E2%80%93%E2%80%93%E2%80%93%E2%80%930.0070.0030.0020.0040.0030.0020.0020.0030.004%E2%80%93%E2%88%920.019nb%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%930.006%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%88%920.009p0.0120.0380.0160.0190.0150.0290.0170.0170.0100.0200.0160.0090.0160.006%E2%80%930.038y0.0220.0570.0250.0600.0410.0370.0220.0450.0200.0300.0280.0060.0120.005%E2%80%930.095la%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%930.008%E2%80%93%E2%80%93%E2%80%930.005%E2%80%93%E2%80%93%E2%80%93%E2%88%920.010ce%E2%80%930.004%E2%80%93%E2%80%93%E2%80%930.007%E2%80%93%E2%80%93%E2%80%93%E2%80%930.007%E2%80%93%E2%80%93%E2%80%93%E2%88%920.007pr%E2%80%930.005%E2%80%930.005%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%88%920.007nd%E2%80%930.006%E2%80%930.011%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%88%920.011sm%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%930.0110.005%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%88%920.013gd%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%930.008%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%88%920.008dy%E2%80%930.0170.0060.008%E2%80%93%E2%80%93%E2%80%930.007%E2%80%93%E2%80%93%E2%80%930.007%E2%80%93%E2%80%93%E2%88%920.017er%E2%80%930.011%E2%80%930.0040.0040.004%E2%80%930.007%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%88%920.011yb%E2%80%930.009%E2%80%930.0080.0110.004%E2%80%930.003%E2%80%930.005%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%930.0116cations8.048.068.028.028.018.058.038.028.008.018.018.028.048.00%E2%80%938.06feo%E2%88%97=totalironasfe2+;bdl=belowdetectionlimit;n=numberofanalysesaveraged.eu/.
-
datechnoman
arkiver - Here are some examples of some random/odd regex-like URLs occurring in extracts that are increasing failed url counts quite heavily:
transfer.archivete.am/A225h/bad_urls_regex.txt
-
arkiver
JAA: yeah we should, these all come from PDFs - we do some pretty aggressive URL extraction there which leads to bad cases
-
arkiver
but the filtering there to weed out impossible cases needs to be improved
-
arkiver
thanks for the reports there datechnoman
-
arkiver
JAA: are those true maxima? or are they "best practices" (as in, are there known exceptions supported by implementations?)
-
arkiver
however, since the majority of these come from URLs extracted from PDFs, i think it's fine to implement these limits anyway for those URLs
-
arkiver
moving :todo:backfeed to :todo:secondary
-
datechnoman
No worries mate! Always happy to help and provide logs
-
arkiver
datechnoman: and yeah that last URL you posted especially is obviously not correct, so i should filter that out before queuing back
-
arkiver
from what i see there it may have been extracted from some table or graph
-
datechnoman
oohhhh ok interesting. Must be interpreting/extracting it funny
-
datechnoman
Kind of makes sense when we are focusing more on science and medical PDFs that have quite a lot of tables in them
-
datechnoman
For an hour last night I was randomly pulling processed PDF urls out of my workers and checking whether they had hits in archive.org, and they were coming up with no results
-
datechnoman
Great to be grabbing data that would potentially be lost in the future
-
arkiver
well we are doing a lot of 'tricks' to try to extract as many URLs as possible
-
arkiver
for example also URLs split over multiple lines
-
arkiver
or URLs split over multiple lines in tables
-
arkiver
with or without https
-
arkiver
but we pretty much extract every URL in various forms that can be found in a document now
-
arkiver
if someone simply writes "archiveteam.org" (no protocol or /) in a sentence, we will find that in the document
-
arkiver
also if the "org" part is split off to the next line due to line length
-
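A rough Python illustration of that kind of extraction (hypothetical; the real extractor is presumably more elaborate): rejoin wrapped lines, then look for bare domains by TLD.

    import re

    # Hypothetical tiny sample of TLDs; a real extractor would use the
    # full public suffix list.
    TLDS = {"org", "com", "net", "th"}

    BARE_DOMAIN_RE = re.compile(r"((?:[a-z0-9-]+\.)+[a-z]{2,})", re.IGNORECASE)

    def extract_bare_domains(text: str) -> list[str]:
        # Rejoin domains that a line break split, e.g. "archiveteam.\norg"
        joined = re.sub(r"\.\s*\n\s*", ".", text)
        joined = re.sub(r"\s*\n\s*", "", joined)  # crude: drop remaining breaks
        return [m for m in BARE_DOMAIN_RE.findall(joined)
                if m.rsplit(".", 1)[-1].lower() in TLDS]

Dropping line breaks this aggressively also hints at why false positives like the .th example later in the log appear: any run of mashed-together text ending in a real TLD looks like a domain.
-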
arkiver
datechnoman: really great to hear about the PDFs :) and yes indeed! we've been archiving an enormous amount of information here that was completely not archived before
-
arkiver
unfortunately the sources of news or products (meaning the scientific documents) are often not archived because not many people link to them
-
arkiver
everyone just links to the popular article stuff
-
arkiver
all of this will get easier though, we're currently going through an initial bump after recent changes. it will settle down after a few days i believe
-
datechnoman
Yeah, it all takes tweaking and work. All good! Will keep providing logs as I see odd things. You're doing magic that I cannot do, so no judgement
-
arkiver
i see signs of a loop
-
arkiver
suspicious URLs
-
arkiver
viralcovert is filtered out now
-
arkiver
paused as i fix the loop
-
arkiver
the loop seems to be a variation of an older loop, but is significantly different at the same time
-
datechnoman
Ohhhh well glad you picked up on that
-
datechnoman
Great work. I'll standby
-
datechnoman
I will say the code and loop configuration is much more robust and accurate than it was 6 months ago
-
arkiver
update is in and resumed
-
arkiver
datechnoman: well it is actually largely the same i think, just nowadays we have fewer loops because we pretty much came across nearly all loops out there :P
-
arkiver
the last few serious spam loops i had to fix were variations on older loops - which points to changes in the 'spam site software' we have to adapt to
-
imer
the concept of "spam site software" is so wild to me haha makes total sense though
-
imer
arkiver: "viralcovert is filtered out now" in code? still seeing an amount of it on my end
-
arkiver
imer: well the one with the arsae parameter
-
arkiver
ah...
-
arkiver
it's still there as custom: items
-
arkiver
imer: should be out now
-
arkiver
i keep forgetting about custom: items
-
datechnoman
Haha all good. Progress is also a bonus!
-
datechnoman
Way more HTTP 200 codes which is great
-
arkiver
does anyone have some dashboard on which they track status codes, and perhaps other information?
-
datechnoman
Yeah got all of that in grafana
-
datechnoman
Will give you a view of my whole fleet
-
arkiver
i feel like we talked about that before
-
datechnoman
Yeah I gave it to you ages ago and I don't think you ever logged in lol
-
arkiver
or maybe not
-
datechnoman
I can get you re-setup
-
datechnoman
Will show you everything you're after
-
datechnoman
All the worker logs, http codes, queued urls etc
-
arkiver
i'm in :)
-
arkiver
backfeed is staying close to 0! yay!
-
arkiver
very good sign, we may be back to grinding away at backlog
-
datechnoman
Whoop whoop!
-
arkiver
audrooku|m: how important is it that we archive the 1.59 million URLs soon?
-
arkiver
if not important, i would like to wait until the queued URLs are down to near 0 here
-
datechnoman
arkiver - When a page is archived, are we following and archiving the outlinks with a depth of 1? Can't remember where to check that, and if it was turned off because there were too many webpages being queued a while back
-
arkiver
datechnoman: no we are not
-
arkiver
only when we queue URLs in urls-sources do we archive them up to a certain depth
-
arkiver
datechnoman: you can check for the `depth` parameter in custom: items, which shows how deep we'll crawl it
-
datechnoman
Thanks so much. I'll look into that so I don't have to hassle you again :)
-
arkiver
just saw this nice URL that is clearly extracted from a PDF document :P
variablesareexpressedinlogarithmexc…ed10-yearlaggedemploymentdensity.th
-
arkiver
classified as URL due to the .th
-
datechnoman
Offt that's a big boy lol
-
datechnoman
I feel like it isn't a valid url xD
-
arkiver
yeah, but it also fits inside the rules of 63 and 255 chars, so would not be filtered out
-
arkiver
it is what it is, URL extraction from PDFs is messy... but i think we can expect a significant number of false positives if, in return, we extract nearly every URL
-
arkiver
and the false positives do not demand much CPU, for example; it's mostly wasted resources in the sense that they sit idle, rather than being heavily used on useless tasks
-
datechnoman
Yeah will just push through them instead
-
JAA
arkiver: I'm not aware of anything supporting anything longer than 63 chars per label and 255 chars in total. The limits are documented in RFC 1035. The protocol could in theory accommodate larger domains though.
-
JAA
More than 255 chars per label would definitely be impossible because a label is represented by a single length byte plus the label itself.
-
JAA
The first two bits of the length byte are restricted to 00 normally, and 11 is used for compression. So probably that's a pretty rigid limit, too.
-
JAA
(Compression here just means referring back to a previous copy of the same label, so that example.example.org doesn't need to encode 'example' twice.)
-
arkiver
i see
-
arkiver
thanks, i will put some checks in place
-
arkiver
first fixing another interesting spam loop
-
JAA
The 255 chars in total limit is entirely arbitrary in theory; the total length does not appear in the DNS messages, only the individual label lengths.
-
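A simplified Python sketch of that wire layout (illustration only; a real parser must also guard against pointer loops):

    def read_name(msg: bytes, offset: int) -> tuple[str, int]:
        # Each label: one length byte (top two bits 00, so at most 63)
        # followed by the label bytes. A byte with top bits 11 starts a
        # 14-bit compression pointer back to an earlier copy of the name.
        labels = []
        while True:
            length = msg[offset]
            if length & 0xC0 == 0xC0:
                pointer = ((length & 0x3F) << 8) | msg[offset + 1]
                suffix, _ = read_name(msg, pointer)
                labels.append(suffix)
                return ".".join(labels), offset + 2
            if length == 0:  # zero-length root label ends the name
                return ".".join(labels), offset + 1
            labels.append(msg[offset + 1:offset + 1 + length].decode("ascii"))
            offset += 1 + length

Note that no total length appears anywhere in this layout, matching the point above: the 255-octet overall limit is a protocol rule, not something the encoding itself enforces.
-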
arkiver
update is out for the latest looping problems
-
datechnoman
arkiver - assume you are aware of the academia.edu loop
-
datechnoman
Or already put something in place
-
arkiver
yep
-
arkiver
just fixed
-
arkiver
was due to my code change
-
arkiver
datechnoman: ^
-
datechnoman
All good! Falling back asleep haha. Night
-
arkiver
have a good sleep :)
-
datechnoman
Thanks :) fell asleep on the couch 4 hours ago lol oops
-
arkiver
we're going to try to repair PDF documents if they cannot be read by pdftohtml
-
arkiver
update is in!
-
arkiver
we're installing ghostscript now
-
arkiver
to repair PDFs if needed
-
arkiver
that is 60 MB of extra installed data
-
arkiver
... which i think is acceptable
-
arkiver
other than the extra size, i think this should have minimal impact on performance. the majority of PDFs don't need to be repaired for processing with pdftohtml
-
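A sketch of that fallback in Python (hypothetical wiring; the actual project scripts may differ), using Ghostscript's pdfwrite device to re-emit a damaged file:

    import os
    import subprocess
    import tempfile

    def pdf_to_html(path: str) -> bytes:
        # Try pdftohtml first; on failure, let Ghostscript rewrite the
        # PDF (which repairs many structural errors) and retry once.
        try:
            return subprocess.run(["pdftohtml", "-stdout", path],
                                  check=True, capture_output=True).stdout
        except subprocess.CalledProcessError:
            fd, repaired = tempfile.mkstemp(suffix=".pdf")
            os.close(fd)
            try:
                subprocess.run(["gs", "-q", "-o", repaired,
                                "-sDEVICE=pdfwrite", path], check=True)
                return subprocess.run(["pdftohtml", "-stdout", repaired],
                                      check=True, capture_output=True).stdout
            finally:
                os.unlink(repaired)
-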
JAA
Does the warrior-install.sh get executed by non-Warrior workers?
-
arkiver
yes
-
arkiver
we use warrior-install.sh in the same way for the youtube-grab project to make it run on both the warrior and non-warrior
-
JAA
Ah :-)
-
arkiver
i looked before into extracting URLs from doc(x), etc., but will look into that again soon. this would be converting them to PDF and extracting from PDF using pdftohtml (so we don't need separate extraction stuff for various document formats)
-
arkiver
however, last i checked the packages needed for conversion to PDF were quite large, but will look into it again
-
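For reference, the usual headless conversion route (and presumably the source of the package weight mentioned above) looks something like this; purely illustrative:

    import subprocess

    def doc_to_pdf(path: str, outdir: str) -> None:
        # LibreOffice in headless mode converts doc/docx (and more) to
        # PDF, but pulls in a large set of dependencies.
        subprocess.run(["libreoffice", "--headless", "--convert-to", "pdf",
                        "--outdir", outdir, path], check=True)
-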
fireonlive
hmm if it's one of those big burden things maybe it could be split off into a different queue somehow?
-
fireonlive
then you could have workers solely dedicated to pdf link extraction/document conversion/etc
-
fireonlive
and people can brrrrr on urls with whatever IP space
-
fireonlive
could also get more people participating in the extraction part, though don't know if that's a particular bottleneck