01:14:30 <Medowar> 1=0 https://galatea2017201720172017201720172017201720172017201720172017201720172017201720172017201720172017201720172017201720172017201720022017201720172017201720172017fenotypefenotypefenotypefenotypefenotypefenotypefenotypefenotypefenotypefenotypefenotypefenotypefenotypefenotypefenotypefenotypefenotypefenotypefenotypefenotypefenotypefenotypefenotyp
01:14:30 <Medowar> efenotypefenotypefenotypefenotypefenotypefenotypefenotypefenotypefenotypefenotypefenotypefenotypefenotypefenotypewww.myfonts.comwww.myfonts.comwww.myfonts.comwww.myfonts.com/
01:24:30 <JAA> We should probably just filter out anything that can't be a valid domain.
01:24:54 <JAA> (Ignoring bare IPs for the moment)
01:25:57 <JAA> DNS has limits for this: 63 chars per label, and 255 chars total including the label lengths.
03:34:36 * nicolas17 scrolls up
03:34:41 * nicolas17 stabs Cheesy
04:39:37 <datechnoman> arkiver - Just my workers alone are hitting https://www.viralcovert.com/ on average 15k per minute and its been 502'ing for a few hours. There is some kind of weird loop going on and we are definitely killing the website. Might want to filter it out for this project
04:40:55 <datechnoman> Removing it will speed things up a fair bit also :D
04:51:08 <datechnoman> JAA - There is nothing wrong with this url right? :P 	
04:51:08 <datechnoman> Queuing URL http://0.0850.032%E2%80%930.0030.0850.0670.0120.0080.0180.0110.0500.092%E2%80%93%E2%88%920.106mg%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%88%920.004ca0.004%E2%80%930.003%E2%80%93%E2%80%930.005%E2%80%93%E2%80%93%E2%80%930.004%E2%80%93%E2%80%930.003%E2
04:51:08 <datechnoman> %80%93%E2%88%920.014zr3.8593.7803.9213.8623.8773.8343.9203.9043.9503.8993.8983.8933.8673.780%E2%80%933.967hf0.0410.0270.0310.0380.0410.0470.0480.0370.0380.0320.0370.0390.0460.025%E2%80%930.048th0.0020.008%E2%80%93%E2%80%930.0030.007%E2%80%930.003%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%88%920.012u%E2%80%93%E2%80%93%E2%80%93%E2%80%9
04:51:08 <datechnoman> 30.0070.0030.0020.0040.0030.0020.0020.0030.004%E2%80%93%E2%88%920.019nb%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%930.006%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%88%920.009p0.0120.0380.0160.0190.0150.0290.0170.0170.0100.0200.0160.0090.0160.006%E2%80%930.038y0.0220.0570.0250.0600.0410.0370.0220.0450.0200.0300.0280.
04:51:08 <datechnoman> 0060.0120.005%E2%80%930.095la%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%930.008%E2%80%93%E2%80%93%E2%80%930.005%E2%80%93%E2%80%93%E2%80%93%E2%88%920.010ce%E2%80%930.004%E2%80%93%E2%80%93%E2%80%930.007%E2%80%93%E2%80%93%E2%80%93%E2%80%930.007%E2%80%93%E2%80%93%E2%80%93%E2%88%920.007pr%E2%80%930.005%E2%80%930.005%E2%80%93%E2%80%93%E2%80%93%E
04:51:09 <datechnoman> 2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%88%920.007nd%E2%80%930.006%E2%80%930.011%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%88%920.011sm%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%930.0110.005%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%88%920.013gd%E2%8
04:51:09 <datechnoman> 0%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%930.008%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%88%920.008dy%E2%80%930.0170.0060.008%E2%80%93%E2%80%93%E2%80%930.007%E2%80%93%E2%80%93%E2%80%930.007%E2%80%93%E2%80%93%E2%88%920.017er%E2%80%930.011%E2%80%930.0040.0040.004%E2%80%930.007%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%
04:51:10 <datechnoman> 93%E2%80%93%E2%88%920.011yb%E2%80%930.009%E2%80%930.0080.0110.004%E2%80%930.003%E2%80%930.005%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%930.0116cations8.048.068.028.028.018.058.038.028.008.018.018.028.048.00%E2%80%938.06feo%E2%88%97=totalironasfe2+;bdl=belowdetectionlimit;n=numberofanalysesaveraged.eu/.
04:53:38 <datechnoman> arkiver - Here are some examples of some random/odd regex that is occurring for extracts increasing failed urls counts quite heavily: https://transfer.archivete.am/A225h/bad_urls_regex.txt
04:53:39 <eggdrop> inline (for browser viewing): https://transfer.archivete.am/inline/A225h/bad_urls_regex.txt
05:39:03 <arkiver> JAA: yeah we should, these all come from PDFs - we do some pretty aggressive URL extraction there which leads to bad cases
05:39:15 <arkiver> but, the filtering there to filter out impossible cases needs to be improved
05:45:18 <arkiver> thanks for the reports there datechnoman
05:46:30 <arkiver> JAA: are those true maxima? or are they "best practises" (as in, are there known exceptions supported by implementations?)
05:46:55 <arkiver> however, since the majority of these come from URLs extracted from PDFs, i think it's fine to implement these limits anyway for those URLs
05:52:53 <arkiver> moving :todo:backfeed to :todo:secondary
05:55:39 <datechnoman> No worries mate! Always happy to help and provide logs
05:59:16 <arkiver> datechnoman: and yeah that last URL you posted especially is obviously not correct, so i should filter that our before queuing back
05:59:20 <arkiver> from what i see there it may have been extracted from some table or graph
06:00:51 <datechnoman> oohhhh ok interesting. Must be interrupting/extracting it funny
06:01:10 <datechnoman> Kind of makes sense when we are focusing more on science and medical PDF's that have quite a lot of tables in them
06:01:46 <datechnoman> For an hour last night I was randomly pulling PDF urls out of my workers that have been processed and looking if they had hits in archive.org and were coming up with no results
06:02:05 <datechnoman> Great to be grabbing data that would be lost potentially in the future
06:03:29 <arkiver> well we are doing a lot of 'tricks' to try to extract as many URLs as possible
06:03:35 <arkiver> for example also URLs split over multiple lines
06:03:41 <arkiver> or URLs split over multiple lines in tables
06:03:51 <arkiver> with or without https
06:04:26 <arkiver> but we pretty much extract every URL in various forms that can be found in a document now
06:05:04 <arkiver> if someone simply writes "archiveteam.org" (no protocol or /) in a sentence, we will find that in the document
06:05:46 <arkiver> also if the "org" part is split off to the next line due to line length
06:06:22 <arkiver> datechnoman: really great to hear about the PDFs :) and yes indeed! we've been archiving enormous amount of information here that was completely not archived before
06:06:47 <arkiver> unfortunately the sources of news or products (meaning the scientific documents) are often not archived because not many people link to them
06:06:56 <arkiver> everyone just links to the popular article stuff
06:07:29 <arkiver> all of this will get easier though, we're currently going through an initial bump after recent changes. it will settle down after a few days i believe
06:14:06 <datechnoman> Yeah it all take tweaking and work. All good! Will keep providing logs as I see odd things. Your doing magic that I cannot do so no judgement
07:41:07 <arkiver> i see signs of a loop
07:41:19 <arkiver> suspicious URLs
07:47:59 <arkiver> viralcovert is filtered out now
07:57:18 <arkiver> paused as i fix the loop
08:16:10 <arkiver> the loop seems to be a variation of an older loop, but is significantly different at the same time
08:18:33 <datechnoman> Ohhhh well glad you picked up on that
08:18:45 <datechnoman> Great work. I'll standby
08:19:19 <datechnoman> I will say the code and loop configuration is much more robust and accurate than it was 6 months ago
08:20:28 <arkiver> update is in an resumed
08:21:02 <arkiver> datechnoman: well it is actually largely the same i think, just nowadays we have fewer loops because we pretty much came across nearly all loops out there :P
08:21:31 <arkiver> the last few serious spam loops i had to fix were variations on older loops - which points to changes in the 'spam site software' we have to adapt to
08:26:25 <imer> the concept of "spam site software" is so wild to me haha makes total sense though
08:45:28 <imer> arkiver: "viralcovert is filtered out now" in code? still seeing an amount of it on my end
08:59:15 <arkiver> imer: well the one with the arsae parameter
08:59:51 <arkiver> ah...
09:00:01 <arkiver> it's still there as custom: items
09:02:17 <arkiver> imer: should be out now
09:02:24 <arkiver> i keep forgetting about custom: items
09:08:58 <datechnoman> Haha all good. Progress is also a bonus!
09:10:12 <datechnoman> Way more HTTP 200 codes which is great
09:14:40 <arkiver> does anyone have some dashboard on which they track status codes, and perhaps other information?
09:14:55 <datechnoman> Yeah got all of that in grafana
09:15:18 <datechnoman> Will give you a view of my whole fleet
09:15:18 <arkiver> i feel like we talked about that before
09:15:36 <datechnoman> Yeah I gave it to you ages ago and I don't think you ever logged in lol
09:15:40 <arkiver> or maybe not
09:16:54 <datechnoman> I can get you re-setup
09:17:02 <datechnoman> Will show you veeything your after
09:17:14 <datechnoman> All the worker logs, http codes, queued urls etc
09:18:07 <arkiver> i'm in :)
09:21:40 <arkiver> backfeed is staying close to 0! yay!
09:21:56 <arkiver> very good sign, we may be back to grinding away at backlog
09:27:01 <datechnoman> Whoop whoop!
09:37:45 <arkiver> audrooku|m: how important is it that we archive the 1.59 million URLs soon?
09:37:55 <arkiver> if not important, i would like to wait until the queued are down to near 0 here
09:52:03 <datechnoman> arkiver - When a page is archived, are we following the archiving the outlinks with a depth of 1? Cant remember where to check that and if it was turned off as there were too many webpages being queued awhile back
10:12:18 <arkiver> datechnoman: no we are not
10:13:00 <arkiver> only when we queue URLs in urls-sources, we archive them up to a certain depth
10:14:14 <arkiver> datechnoman: you can check for the `depth` parameter in custom: items, which shows how deep we'll crawl it
10:17:22 <datechnoman> Thanks so much. I'll look into that so I don't have to hassle you again :)
10:23:07 <arkiver> just saw this nice URL that is clearly extracted from a PDF document :P http://variablesareexpressedinlogarithmexceptyearsofschooling.spatiallysmoothedmunicipaldataareused.theinstrumentalvariableforemploymentdensityisthespatiallysmoothed10-yearlaggedemploymentdensity.th/
10:23:14 <arkiver> classified as URL due to the .th
10:26:35 <datechnoman> Offt thats a big boy lol
10:27:10 <datechnoman> I feel like it isn't a valid url xD
10:27:34 <arkiver> yeah, but it also fits inside the rules of 63 and 255 chars, so would not be filtered out
10:28:09 <arkiver> it is what it is, URL extraction from PDFs is messy... but i think we can expect a significant number of false positives if we in return extract nearly any URLs
10:28:48 <arkiver> and the false positives do not ask much from CPU for example, it's mostly wasted resources in the sense of that they sit idle, instead of that they are heavily used on useless tasks
10:43:12 <datechnoman> Yeah will just push through them instead
15:37:39 <JAA> arkiver: I'm not aware of anything supporting anything longer than 63 chars per label and 255 chars in total. The limits are documented in RFC 1035. The protocol could in theory accomodate larger domains though.
15:38:56 <JAA> More than 255 chars per label would definitely be impossible because a label is represented by a single length byte plus the label itself.
15:39:50 <JAA> The first two bits of the length byte are restricted to 00 normally, and 11 is used for compression. So probably that's a pretty rigid limit, too.
15:40:21 <JAA> (Compression here just means referring back to a previous copy of the same label, so that example.example.org doesn't need to encode 'example' twice.
15:40:25 <JAA> )
15:40:34 <arkiver> i see
15:40:45 <arkiver> thanks, i will put some checks in place
15:40:50 <arkiver> first fixing another interesting spam loop
15:41:09 <JAA> The 255 chars in total limit is entirely arbitrary in theory; the total length does not appear in the DNS messages, only the individual label lengths.
15:46:03 <arkiver> update is out for a latest looping problems
16:18:35 <datechnoman> arkiver - assume yoh are aware of the https://www.academia.edu/ loop
16:18:43 <datechnoman> Or already put something in place
16:20:00 <arkiver> yep
16:20:02 <arkiver> just fixed
16:20:06 <arkiver> was due to my code change
16:20:19 <arkiver> datechnoman: ^
16:20:36 <datechnoman> All good! Falling back asleep haha. Night
16:21:42 <arkiver> have a good sleep :)
16:22:39 <datechnoman> Thanks :) fell asleep on the couch 4 hours ago  lol oops
16:41:44 <arkiver> we're going to try to repair PDF documents if they cannot be read by pdftohtml
17:26:43 <arkiver> update is in!
17:26:47 <arkiver> we're installing ghostscript now
17:26:59 <arkiver> to repair PDFs if needed
17:27:10 <arkiver> that is 60 MB of extra installed data
17:28:54 <arkiver> ... which i think is acceptable
17:34:37 <arkiver> other than the extra size, i think this should have minimal impact on performance. the majority of PDFs don't need to be repaired for processing with pdftohtml
17:35:12 <JAA> Does the warrior-install.sh get executed by non-Warrior workers?
17:36:40 <arkiver> yes
17:37:05 <arkiver> https://github.com/ArchiveTeam/grab-base-df/blob/master/Dockerfile#L32-L33
17:37:25 <arkiver> we use warrior-install.sh in the same way for the youtube-grab project to make it run on both the warrior and non-warrior
17:37:26 <JAA> Ah :-)
17:38:36 <arkiver> i looked before into extracting URLs from doc(x), etc., but will look into that again soon. this would be converting them to PDF and extracting from PDF using pdftohtml (so we don't need separate extraction stuff for various document formats)
17:38:53 <arkiver> however, last i checked the packages needed for converstion to PDF were quite large, but will look into it again
18:56:41 <fireonlive> hmm if it's one of those big burden things maybe it could be split off into a different queue somehow?
18:57:16 <fireonlive> then you could have workers solely dedicated to pdf link extraction/document conversion/etc
18:57:32 <fireonlive> and people can brrrrr on urls with whatever IP space
18:57:56 <fireonlive> could also get more people participating in the extraction part, though don't know if that's a particular bottlenec
18:57:58 <fireonlive> k