01:14:30 1=0 https://galatea2017201720172017201720172017201720172017201720172017201720172017201720172017201720172017201720172017201720172017201720022017201720172017201720172017fenotypefenotypefenotypefenotypefenotypefenotypefenotypefenotypefenotypefenotypefenotypefenotypefenotypefenotypefenotypefenotypefenotypefenotypefenotypefenotypefenotypefenotypefenotyp
01:14:30 efenotypefenotypefenotypefenotypefenotypefenotypefenotypefenotypefenotypefenotypefenotypefenotypefenotypefenotypewww.myfonts.comwww.myfonts.comwww.myfonts.comwww.myfonts.com/
01:24:30 We should probably just filter out anything that can't be a valid domain.
01:24:54 (Ignoring bare IPs for the moment)
01:25:57 DNS has limits for this: 63 chars per label, and 255 chars total including the label lengths.
03:34:36 * nicolas17 scrolls up
03:34:41 * nicolas17 stabs Cheesy
04:39:37 arkiver - Just my workers alone are hitting https://www.viralcovert.com/ on average 15k per minute and it's been 502'ing for a few hours. There is some kind of weird loop going on and we are definitely killing the website. Might want to filter it out for this project
04:40:55 Removing it will speed things up a fair bit also :D
04:51:08 JAA - There is nothing wrong with this url right? :P
04:51:08 Queuing URL http://0.0850.032%E2%80%930.0030.0850.0670.0120.0080.0180.0110.0500.092%E2%80%93%E2%88%920.106mg%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%88%920.004ca0.004%E2%80%930.003%E2%80%93%E2%80%930.005%E2%80%93%E2%80%93%E2%80%930.004%E2%80%93%E2%80%930.003%E2
04:51:08 %80%93%E2%88%920.014zr3.8593.7803.9213.8623.8773.8343.9203.9043.9503.8993.8983.8933.8673.780%E2%80%933.967hf0.0410.0270.0310.0380.0410.0470.0480.0370.0380.0320.0370.0390.0460.025%E2%80%930.048th0.0020.008%E2%80%93%E2%80%930.0030.007%E2%80%930.003%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%88%920.012u%E2%80%93%E2%80%93%E2%80%93%E2%80%9
04:51:08 30.0070.0030.0020.0040.0030.0020.0020.0030.004%E2%80%93%E2%88%920.019nb%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%930.006%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%88%920.009p0.0120.0380.0160.0190.0150.0290.0170.0170.0100.0200.0160.0090.0160.006%E2%80%930.038y0.0220.0570.0250.0600.0410.0370.0220.0450.0200.0300.0280.
04:51:08 0060.0120.005%E2%80%930.095la%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%930.008%E2%80%93%E2%80%93%E2%80%930.005%E2%80%93%E2%80%93%E2%80%93%E2%88%920.010ce%E2%80%930.004%E2%80%93%E2%80%93%E2%80%930.007%E2%80%93%E2%80%93%E2%80%93%E2%80%930.007%E2%80%93%E2%80%93%E2%80%93%E2%88%920.007pr%E2%80%930.005%E2%80%930.005%E2%80%93%E2%80%93%E2%80%93%E
04:51:09 2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%88%920.007nd%E2%80%930.006%E2%80%930.011%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%88%920.011sm%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%930.0110.005%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%88%920.013gd%E2%8
04:51:09 0%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%930.008%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%88%920.008dy%E2%80%930.0170.0060.008%E2%80%93%E2%80%93%E2%80%930.007%E2%80%93%E2%80%93%E2%80%930.007%E2%80%93%E2%80%93%E2%88%920.017er%E2%80%930.011%E2%80%930.0040.0040.004%E2%80%930.007%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%
04:51:10 93%E2%80%93%E2%88%920.011yb%E2%80%930.009%E2%80%930.0080.0110.004%E2%80%930.003%E2%80%930.005%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%930.0116cations8.048.068.028.028.018.058.038.028.008.018.018.028.048.00%E2%80%938.06feo%E2%88%97=totalironasfe2+;bdl=belowdetectionlimit;n=numberofanalysesaveraged.eu/.
04:53:38 arkiver - Here are some examples of some random/odd regex matches that are occurring for extracts, increasing failed URL counts quite heavily: https://transfer.archivete.am/A225h/bad_urls_regex.txt
04:53:39 inline (for browser viewing): https://transfer.archivete.am/inline/A225h/bad_urls_regex.txt
05:39:03 JAA: yeah we should, these all come from PDFs - we do some pretty aggressive URL extraction there which leads to bad cases
05:39:15 but, the filtering there to filter out impossible cases needs to be improved
05:45:18 thanks for the reports there datechnoman
05:46:30 JAA: are those true maxima? or are they "best practices" (as in, are there known exceptions supported by implementations?)
05:46:55 however, since the majority of these come from URLs extracted from PDFs, i think it's fine to implement these limits anyway for those URLs
05:52:53 moving :todo:backfeed to :todo:secondary
05:55:39 No worries mate! Always happy to help and provide logs
05:59:16 datechnoman: and yeah, that last URL you posted especially is obviously not correct, so i should filter that out before queuing it back
05:59:20 from what i see there it may have been extracted from some table or graph
06:00:51 oohhhh ok interesting. Must be interpreting/extracting it funny
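A minimal sketch (in Python, not the project's actual code) of the kind of validity filter being discussed: reject any candidate hostname that breaks the RFC 1035 limits JAA quotes (63 characters per label, 255 bytes on the wire, which works out to roughly 253 visible characters) or the letters/digits/hyphens label rule. The function name and the decision to ignore bare IPs are assumptions for illustration.

```python
import re

# LDH rule: letters, digits and hyphens only, no leading or trailing hyphen,
# 1-63 characters per label.
LABEL_RE = re.compile(r"^(?!-)[a-z0-9-]{1,63}(?<!-)$", re.IGNORECASE)

def plausible_hostname(host: str) -> bool:
    """Return True if `host` could be a real DNS name (bare IPs ignored here)."""
    host = host.rstrip(".")              # tolerate a trailing root dot
    if len(host) > 253:                  # 255-byte wire limit ~= 253 visible chars
        return False
    labels = host.split(".")
    if len(labels) < 2:                  # require at least a name plus a TLD
        return False
    if labels[-1].isdigit():             # an all-numeric "TLD" can't be right
        return False
    return all(LABEL_RE.match(label) for label in labels)
```

Both URLs pasted above would be rejected by a check like this: their labels run far past 63 characters, and the second one also contains characters outside the letters/digits/hyphens set.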
06:01:10 Kind of makes sense when we are focusing more on science and medical PDFs that have quite a lot of tables in them
06:01:46 For an hour last night I was randomly pulling processed PDF URLs out of my workers and checking whether they had hits in archive.org, and they were coming up with no results
06:02:05 Great to be grabbing data that would potentially be lost in the future
06:03:29 well we are doing a lot of 'tricks' to try to extract as many URLs as possible
06:03:35 for example also URLs split over multiple lines
06:03:41 or URLs split over multiple lines in tables
06:03:51 with or without https
06:04:26 but we pretty much extract every URL in various forms that can be found in a document now
06:05:04 if someone simply writes "archiveteam.org" (no protocol or /) in a sentence, we will find that in the document
06:05:46 also if the "org" part is split off to the next line due to line length
06:06:22 datechnoman: really great to hear about the PDFs :) and yes indeed! we've been archiving an enormous amount of information here that was completely not archived before
06:06:47 unfortunately the sources of news or products (meaning the scientific documents) are often not archived because not many people link to them
06:06:56 everyone just links to the popular article stuff
06:07:29 all of this will get easier though, we're currently going through an initial bump after recent changes. it will settle down after a few days i believe
06:14:06 Yeah it all takes tweaking and work. All good! Will keep providing logs as I see odd things. You're doing magic that I cannot do, so no judgement
07:41:07 i see signs of a loop
07:41:19 suspicious URLs
07:47:59 viralcovert is filtered out now
07:57:18 paused as i fix the loop
08:16:10 the loop seems to be a variation of an older loop, but is significantly different at the same time
08:18:33 Ohhhh well glad you picked up on that
08:18:45 Great work. I'll stand by
08:19:19 I will say the code and loop configuration is much more robust and accurate than it was 6 months ago
08:20:28 update is in and resumed
08:21:02 datechnoman: well it is actually largely the same i think, just nowadays we have fewer loops because we pretty much came across nearly all loops out there :P
08:21:31 the last few serious spam loops i had to fix were variations on older loops - which points to changes in the 'spam site software' we have to adapt to
08:26:25 the concept of "spam site software" is so wild to me haha makes total sense though
08:45:28 arkiver: "viralcovert is filtered out now" in code? still seeing an amount of it on my end
08:59:15 imer: well the one with the arsae parameter
08:59:51 ah...
09:00:01 it's still there as custom: items
09:02:17 imer: should be out now
09:02:24 i keep forgetting about custom: items
09:08:58 Haha all good. Progress is also a bonus!
09:10:12 Way more HTTP 200 codes which is great
09:14:40 does anyone have some dashboard on which they track status codes, and perhaps other information?
09:14:55 Yeah got all of that in grafana
09:15:18 Will give you a view of my whole fleet
09:15:18 i feel like we talked about that before
09:15:36 Yeah I gave it to you ages ago and I don't think you ever logged in lol
09:15:40 or maybe not
09:16:54 I can get you re-set up
09:17:02 Will show you everything you're after
09:17:14 All the worker logs, http codes, queued URLs etc
09:18:07 i'm in :)
09:21:40 backfeed is staying close to 0! yay!
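To illustrate the extraction "tricks" arkiver describes above (06:03:29 onward), here is a rough sketch of what aggressive URL extraction from PDF text can look like: scan both the raw text and a copy with line breaks removed, and accept scheme-less candidates like "archiveteam.org". This is purely illustrative, not the project's actual extractor, and the pattern is deliberately loose, which is exactly why the false positives discussed in this log appear.

```python
import re

# Candidate pattern: optional scheme, dot-separated LDH labels, a TLD, and an
# optional path.  Deliberately loose, mirroring "extract nearly every URL".
URL_RE = re.compile(
    r"(?:https?://)?"                                          # scheme optional
    r"(?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z]{2,}"    # dotted labels + TLD
    r"(?:/[^\s]*)?",                                           # optional path
    re.IGNORECASE,
)

def extract_candidates(pdf_text: str) -> list[str]:
    """Find URL-like strings in text extracted from a PDF.

    Line breaks inside PDFs frequently split a URL ("archiveteam." on one
    line, "org" on the next), so we scan both the raw text and a copy with
    the line breaks removed, then de-duplicate.
    """
    unwrapped = re.sub(r"-?\s*\n\s*", "", pdf_text)   # rejoin wrapped lines
    found = URL_RE.findall(pdf_text) + URL_RE.findall(unwrapped)
    return sorted(set(found))
```

Unwrapping lines before matching is what recovers URLs whose "org" landed on the next line, but it is also what glues table cells and running text together into monsters like the "Queuing URL http://0.0850.032..." example earlier, which is why the validity filter above still matters.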
09:21:56 very good sign, we may be back to grinding away at the backlog
09:27:01 Whoop whoop!
09:37:45 audrooku|m: how important is it that we archive the 1.59 million URLs soon?
09:37:55 if not important, i would like to wait until the queued are down to near 0 here
09:52:03 arkiver - When a page is archived, are we following/archiving the outlinks with a depth of 1? Can't remember where to check that, and whether it was turned off because too many webpages were being queued a while back
10:12:18 datechnoman: no we are not
10:13:00 only when we queue URLs in urls-sources do we archive them up to a certain depth
10:14:14 datechnoman: you can check for the `depth` parameter in custom: items, which shows how deep we'll crawl it
10:17:22 Thanks so much. I'll look into that so I don't have to hassle you again :)
10:23:07 just saw this nice URL that is clearly extracted from a PDF document :P http://variablesareexpressedinlogarithmexceptyearsofschooling.spatiallysmoothedmunicipaldataareused.theinstrumentalvariableforemploymentdensityisthespatiallysmoothed10-yearlaggedemploymentdensity.th/
10:23:14 classified as a URL due to the .th
10:26:35 Offt that's a big boy lol
10:27:10 I feel like it isn't a valid url xD
10:27:34 yeah, but it also fits inside the rules of 63 and 255 chars, so would not be filtered out
10:28:09 it is what it is, URL extraction from PDFs is messy... but i think we can expect a significant number of false positives if in return we extract nearly every URL
10:28:48 and the false positives do not ask much from the CPU, for example; it's mostly wasted resources in the sense that they sit idle, rather than being heavily used on useless tasks
10:43:12 Yeah will just push through them instead
15:37:39 arkiver: I'm not aware of anything supporting anything longer than 63 chars per label and 255 chars in total. The limits are documented in RFC 1035. The protocol could in theory accommodate larger domains though.
15:38:56 More than 255 chars per label would definitely be impossible because a label is represented by a single length byte plus the label itself.
15:39:50 The first two bits of the length byte are restricted to 00 normally, and 11 is used for compression. So probably that's a pretty rigid limit, too.
15:40:21 (Compression here just means referring back to a previous copy of the same label, so that example.example.org doesn't need to encode 'example' twice.)
15:40:34 i see
15:40:45 thanks, i will put some checks in place
15:40:50 first fixing another interesting spam loop
15:41:09 The 255 chars in total limit is entirely arbitrary in theory; the total length does not appear in the DNS messages, only the individual label lengths.
15:46:03 update is out for the latest looping problems
16:18:35 arkiver - assume you are aware of the https://www.academia.edu/ loop
16:18:43 Or already put something in place
16:20:00 yep
16:20:02 just fixed
16:20:06 was due to my code change
16:20:19 datechnoman: ^
16:20:36 All good! Falling back asleep haha. Night
16:21:42 have a good sleep :)
16:22:39 Thanks :) fell asleep on the couch 4 hours ago lol oops
16:41:44 we're going to try to repair PDF documents if they cannot be read by pdftohtml
17:26:43 update is in!
17:26:47 we're installing ghostscript now
17:26:59 to repair PDFs if needed
17:27:10 that is 60 MB of extra installed data
17:28:54 ... which i think is acceptable
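A sketch of the repair step arkiver describes from 16:41:44 onward: try pdftohtml first, and only if it fails, rewrite the file through Ghostscript's pdfwrite device and retry. The function, file handling and exact flags are assumptions for illustration; the project's real invocation may differ, and this assumes a failed conversion shows up as a nonzero exit status.

```python
import subprocess
import tempfile
from pathlib import Path

def pdf_to_html(pdf_path: str) -> str:
    """Convert a PDF to HTML/XML text with pdftohtml, repairing it first
    with Ghostscript if the initial conversion fails."""
    def run_pdftohtml(path: str) -> str:
        # -i: ignore images, -stdout: write the converted document to stdout
        return subprocess.run(
            ["pdftohtml", "-i", "-stdout", path],
            check=True, capture_output=True, text=True,
        ).stdout

    try:
        return run_pdftohtml(pdf_path)
    except subprocess.CalledProcessError:
        # Rewriting the file through Ghostscript's pdfwrite device often
        # fixes broken cross-reference tables and similar damage.
        repaired = Path(tempfile.mkdtemp()) / "repaired.pdf"
        subprocess.run(
            ["gs", "-q", "-o", str(repaired), "-sDEVICE=pdfwrite", pdf_path],
            check=True, capture_output=True,
        )
        return run_pdftohtml(str(repaired))
```

Since most PDFs convert fine on the first attempt, the Ghostscript pass (and its roughly 60 MB of extra installed data) only costs anything on the broken minority, which is the point arkiver makes next.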
17:34:37 other than the extra size, i think this should have minimal impact on performance. the majority of PDFs don't need to be repaired for processing with pdftohtml
17:35:12 Does the warrior-install.sh get executed by non-Warrior workers?
17:36:40 yes
17:37:05 https://github.com/ArchiveTeam/grab-base-df/blob/master/Dockerfile#L32-L33
17:37:25 we use warrior-install.sh in the same way for the youtube-grab project to make it run on both the warrior and non-warrior
17:37:26 Ah :-)
17:38:36 i looked before into extracting URLs from doc(x), etc., but will look into that again soon. this would be converting them to PDF and extracting from PDF using pdftohtml (so we don't need separate extraction stuff for various document formats)
17:38:53 however, last i checked the packages needed for conversion to PDF were quite large, but will look into it again
18:56:41 hmm if it's one of those big-burden things maybe it could be split off into a different queue somehow?
18:57:16 then you could have workers solely dedicated to PDF link extraction/document conversion/etc
18:57:32 and people can brrrrr on urls with whatever IP space
18:57:56 could also get more people participating in the extraction part, though don't know if that's a particular bottleneck
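On the doc(x) idea at 17:38:36: the converter is not named in the log, but LibreOffice in headless mode is one common way to turn office documents into PDFs and would explain the "quite large" package size mentioned; treat it purely as an assumed example, not the project's chosen tool. A sketch of how that step could feed the existing PDF path:

```python
import subprocess
from pathlib import Path

def office_doc_to_pdf(doc_path: str, out_dir: str) -> Path:
    """Convert a doc/docx (or other office format) to PDF so the existing
    PDF URL-extraction path can handle it.  LibreOffice headless mode is an
    assumed converter here, not necessarily what the project would pick."""
    subprocess.run(
        ["libreoffice", "--headless", "--convert-to", "pdf",
         "--outdir", out_dir, doc_path],
        check=True, capture_output=True,
    )
    # LibreOffice names the output after the input file's stem.
    return Path(out_dir) / (Path(doc_path).stem + ".pdf")
```

The resulting PDF could then go through the same pdftohtml-based extraction sketched earlier, keeping a single extraction path as arkiver wants; running the conversion on dedicated workers, as suggested at 18:56:41, would keep the heavy dependency off the general URL workers.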