-
Medowar
efenotypefenotypefenotypefenotypefenotypefenotypefenotypefenotypefenotypefenotypefenotypefenotypefenotypefenotypewww.myfonts.comwww.myfonts.comwww.myfonts.comwww.myfonts.com/
-
JAA
We should probably just filter out anything that can't be a valid domain.
-
JAA
(Ignoring bare IPs for the moment)
-
JAA
DNS has limits for this: 63 chars per label, and 255 chars total including the label lengths.
-
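A minimal sketch of such a validity filter, in Python (hypothetical, not the project's actual code), applying the RFC 1035 limits JAA mentions above:

    import re

    # RFC 1035: each label is at most 63 octets, and the whole name,
    # including the per-label length bytes, is at most 255 octets.
    LABEL_RE = re.compile(r"^[a-z0-9]([a-z0-9-]*[a-z0-9])?$", re.IGNORECASE)

    def is_plausible_domain(host: str) -> bool:
        labels = host.split(".")
        # +1 per label for its length byte, +1 for the terminating root byte
        if sum(len(label) + 1 for label in labels) + 1 > 255:
            return False
        return all(len(label) <= 63 and LABEL_RE.match(label)
                   for label in labels)
-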
» nicolas17 scrolls up
-
» nicolas17 stabs Cheesy
-
datechnoman
arkiver - Just my workers alone are hitting viralcovert.com on average 15k per minute and it's been 502'ing for a few hours. There is some kind of weird loop going on and we are definitely killing the website. Might want to filter it out for this project
-
datechnoman
Removing it will speed things up a fair bit also :D
-
datechnoman
JAA - There is nothing wrong with this url right? :P
-
datechnoman
%80%93%E2%88%920.014zr3.8593.7803.9213.8623.8773.8343.9203.9043.9503.8993.8983.8933.8673.780%E2%80%933.967hf0.0410.0270.0310.0380.0410.0470.0480.0370.0380.0320.0370.0390.0460.025%E2%80%930.048th0.0020.008%E2%80%93%E2%80%930.0030.007%E2%80%930.003%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%88%920.012u%E2%80%93%E2%80%93%E2%80%93%E2%80%930.0070.0030.0020.0040.0030.0020.0020.0030.004%E2%80%93%E2%88%920.019nb%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%930.006%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%88%920.009p0.0120.0380.0160.0190.0150.0290.0170.0170.0100.0200.0160.0090.0160.006%E2%80%930.038y0.0220.0570.0250.0600.0410.0370.0220.0450.0200.0300.0280.0060.0120.005%E2%80%930.095la%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%930.008%E2%80%93%E2%80%93%E2%80%930.005%E2%80%93%E2%80%93%E2%80%93%E2%88%920.010ce%E2%80%930.004%E2%80%93%E2%80%93%E2%80%930.007%E2%80%93%E2%80%93%E2%80%93%E2%80%930.007%E2%80%93%E2%80%93%E2%80%93%E2%88%920.007pr%E2%80%930.005%E2%80%930.005%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%88%920.007nd%E2%80%930.006%E2%80%930.011%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%88%920.011sm%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%930.0110.005%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%88%920.013gd%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%930.008%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%88%920.008dy%E2%80%930.0170.0060.008%E2%80%93%E2%80%93%E2%80%930.007%E2%80%93%E2%80%93%E2%80%930.007%E2%80%93%E2%80%93%E2%88%920.017er%E2%80%930.011%E2%80%930.0040.0040.004%E2%80%930.007%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%88%920.011yb%E2%80%930.009%E2%80%930.0080.0110.004%E2%80%930.003%E2%80%930.005%E2%80%93%E2%80%93%E2%80%93%E2%80%93%E2%80%930.0116cations8.048.068.028.028.018.058.038.028.008.018.018.028.048.00%E2%80%938.06feo%E2%88%97=totalironasfe2+;bdl=belowdetectionlimit;n=numberofanalysesaveraged.eu/.
-
datechnoman
arkiver - Here are some examples of some random/odd regex-like URLs occurring in extracts that are increasing failed url counts quite heavily:
transfer.archivete.am/A225h/bad_urls_regex.txt
-
arkiver
JAA: yeah we should, these all come from PDFs - we do some pretty aggressive URL extraction there which leads to bad cases
-
arkiver
but the filtering there to weed out impossible cases needs to be improved
-
arkiver
thanks for the reports there datechnoman
-
arkiver
JAA: are those true maxima? or are they "best practices" (as in, are there known exceptions supported by implementations?)
-
arkiver
however, since the majority of these come from URLs extracted from PDFs, i think it's fine to implement these limits anyway for those URLs
-
arkiver
moving :todo:backfeed to :todo:secondary
-
datechnoman
No worries mate! Always happy to help and provide logs
-
arkiver
datechnoman: and yeah that last URL you posted especially is obviously not correct, so i should filter that out before queuing back
-
arkiver
from what i see there it may have been extracted from some table or graph
-
datechnoman
oohhhh ok interesting. Must be interpreting/extracting it funny
-
datechnoman
Kind of makes sense when we are focusing more on science and medical PDFs that have quite a lot of tables in them
-
datechnoman
For an hour last night I was randomly pulling processed PDF urls out of my workers and checking whether they had hits in archive.org, and they were coming up with no results
-
datechnoman
Great to be grabbing data that would potentially be lost in the future
-
arkiver
well we are doing a lot of 'tricks' to try to extract as many URLs as possible
-
arkiver
for example also URLs split over multiple lines
-
arkiver
or URLs split over multiple lines in tables
-
arkiver
with or without https
-
arkiver
but we pretty much extract every URL in various forms that can be found in a document now
-
arkiver
if someone simply writes "archiveteam.org" (no protocol or /) in a sentence, we will find that in the document
-
arkiver
also if the "org" part is split off to the next line due to line length
-
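A rough Python illustration of that kind of extraction (hypothetical; the real extractor is presumably more elaborate): rejoin wrapped lines, then look for bare domains by TLD.

    import re

    # Hypothetical tiny sample of TLDs; a real extractor would use the
    # full public suffix list.
    TLDS = {"org", "com", "net", "th"}

    BARE_DOMAIN_RE = re.compile(r"((?:[a-z0-9-]+\.)+[a-z]{2,})", re.IGNORECASE)

    def extract_bare_domains(text: str) -> list[str]:
        # Rejoin domains that a line break split, e.g. "archiveteam.\norg"
        joined = re.sub(r"\.\s*\n\s*", ".", text)
        joined = re.sub(r"\s*\n\s*", "", joined)  # crude: drop remaining breaks
        return [m for m in BARE_DOMAIN_RE.findall(joined)
                if m.rsplit(".", 1)[-1].lower() in TLDS]

Dropping line breaks this aggressively also hints at why false positives like the .th example later in the log appear: any run of mashed-together text ending in a real TLD looks like a domain.
-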
arkiver
datechnoman: really great to hear about the PDFs :) and yes indeed! we've been archiving an enormous amount of information here that was completely not archived before
-
arkiver
unfortunately the sources of news or products (meaning the scientific documents) are often not archived because not many people link to them
-
arkiver
everyone just links to the popular article stuff
-
arkiver
all of this will get easier though, we're currently going through an initial bump after recent changes. it will settle down after a few days i believe
-
datechnoman
Yeah, it all takes tweaking and work. All good! Will keep providing logs as I see odd things. You're doing magic that I cannot do, so no judgement
-
arkiver
i see signs of a loop
-
arkiver
suspicious URLs
-
arkiver
viralcovert is filtered out now
-
arkiver
paused as i fix the loop
-
arkiver
the loop seems to be a variation of an older loop, but is significantly different at the same time
-
datechnoman
Ohhhh well glad you picked up on that
-
datechnoman
Great work. I'll standby
-
datechnoman
I will say the code and loop configuration is much more robust and accurate than it was 6 months ago
-
arkiver
update is in and resumed
-
arkiver
datechnoman: well it is actually largely the same i think, just nowadays we have fewer loops because we pretty much came across nearly all loops out there :P
-
arkiver
the last few serious spam loops i had to fix were variations on older loops - which points to changes in the 'spam site software' we have to adapt to
-
imer
the concept of "spam site software" is so wild to me haha makes total sense though
-
imer
arkiver: "viralcovert is filtered out now" in code? still seeing an amount of it on my end
-
arkiver
imer: well the one with the arsae parameter
-
arkiver
ah...
-
arkiver
it's still there as custom: items
-
arkiver
imer: should be out now
-
arkiver
i keep forgetting about custom: items
-
datechnoman
Haha all good. Progress is also a bonus!
-
datechnoman
Way more HTTP 200 codes which is great
-
arkiver
does anyone have some dashboard on which they track status codes, and perhaps other information?
-
datechnoman
Yeah got all of that in grafana
-
datechnoman
Will give you a view of my whole fleet
-
arkiver
i feel like we talked about that before
-
datechnoman
Yeah I gave it to you ages ago and I don't think you ever logged in lol
-
arkiver
or maybe not
-
datechnoman
I can get you re-setup
-
datechnoman
Will show you everything you're after
-
datechnoman
All the worker logs, http codes, queued urls etc
-
arkiver
i'm in :)
-
arkiver
backfeed is staying close to 0! yay!
-
arkiver
very good sign, we may be back to grinding away at backlog
-
datechnoman
Whoop whoop!
-
arkiver
audrooku|m: how important is it that we archive the 1.59 million URLs soon?
-
arkiver
if not important, i would like to wait until the queued URLs are down to near 0 here
-
datechnoman
arkiver - When a page is archived, are we following and archiving the outlinks with a depth of 1? Can't remember where to check that, and if it was turned off because there were too many webpages being queued a while back
-
arkiver
datechnoman: no we are not
-
arkiver
only when we queue URLs in urls-sources do we archive them up to a certain depth
-
arkiver
datechnoman: you can check for the `depth` parameter in custom: items, which shows how deep we'll crawl it
-
datechnoman
Thanks so much. I'll look into that so I don't have to hassle you again :)
-
arkiver
just saw this nice URL that is clearly extracted from a PDF document :P
variablesareexpressedinlogarithmexc…ed10-yearlaggedemploymentdensity.th
-
arkiver
classified as URL due to the .th
-
datechnoman
Offt that's a big boy lol
-
datechnoman
I feel like it isn't a valid url xD
-
arkiver
yeah, but it also fits inside the rules of 63 and 255 chars, so would not be filtered out
-
arkiver
it is what it is, URL extraction from PDFs is messy... but i think we can expect a significant number of false positives if, in return, we extract nearly every URL
-
arkiver
and the false positives do not demand much CPU, for example; it's mostly wasted resources in the sense that they sit idle, rather than being heavily used on useless tasks
-
datechnoman
Yeah will just push through them instead
-
JAA
arkiver: I'm not aware of anything supporting anything longer than 63 chars per label and 255 chars in total. The limits are documented in RFC 1035. The protocol could in theory accommodate larger domains though.
-
JAA
More than 255 chars per label would definitely be impossible because a label is represented by a single length byte plus the label itself.
-
JAA
The first two bits of the length byte are restricted to 00 normally, and 11 is used for compression. So probably that's a pretty rigid limit, too.
-
JAA
(Compression here just means referring back to a previous copy of the same label, so that example.example.org doesn't need to encode 'example' twice.)
-
arkiver
i see
-
arkiver
thanks, i will put some checks in place
-
arkiver
first fixing another interesting spam loop
-
JAA
The 255 chars in total limit is entirely arbitrary in theory; the total length does not appear in the DNS messages, only the individual label lengths.
-
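A simplified Python sketch of that wire layout (illustration only; a real parser must also guard against pointer loops):

    def read_name(msg: bytes, offset: int) -> tuple[str, int]:
        # Each label: one length byte (top two bits 00, so at most 63)
        # followed by the label bytes. A byte with top bits 11 starts a
        # 14-bit compression pointer back to an earlier copy of the name.
        labels = []
        while True:
            length = msg[offset]
            if length & 0xC0 == 0xC0:
                pointer = ((length & 0x3F) << 8) | msg[offset + 1]
                suffix, _ = read_name(msg, pointer)
                labels.append(suffix)
                return ".".join(labels), offset + 2
            if length == 0:  # zero-length root label ends the name
                return ".".join(labels), offset + 1
            labels.append(msg[offset + 1:offset + 1 + length].decode("ascii"))
            offset += 1 + length

Note that no total length appears anywhere in this layout, matching the point above: the 255-octet overall limit is a protocol rule, not something the encoding itself enforces.
-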
arkiver
update is out for the latest looping problems
-
datechnoman
arkiver - assume you are aware of the academia.edu loop
-
datechnoman
Or already put something in place
-
arkiver
yep
-
arkiver
just fixed
-
arkiver
was due to my code change
-
arkiver
datechnoman: ^
-
datechnoman
All good! Falling back asleep haha. Night
-
arkiver
have a good sleep :)
-
datechnoman
Thanks :) fell asleep on the couch 4 hours ago lol oops
-
arkiver
we're going to try to repair PDF documents if they cannot be read by pdftohtml
-
arkiver
update is in!
-
arkiver
we're installing ghostscript now
-
arkiver
to repair PDFs if needed
-
arkiver
that is 60 MB of extra installed data
-
arkiver
... which i think is acceptable
-
arkiver
other than the extra size, i think this should have minimal impact on performance. the majority of PDFs don't need to be repaired for processing with pdftohtml
-
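A sketch of that fallback in Python (hypothetical wiring; the actual project scripts may differ), using Ghostscript's pdfwrite device to re-emit a damaged file:

    import os
    import subprocess
    import tempfile

    def pdf_to_html(path: str) -> bytes:
        # Try pdftohtml first; on failure, let Ghostscript rewrite the
        # PDF (which repairs many structural errors) and retry once.
        try:
            return subprocess.run(["pdftohtml", "-stdout", path],
                                  check=True, capture_output=True).stdout
        except subprocess.CalledProcessError:
            fd, repaired = tempfile.mkstemp(suffix=".pdf")
            os.close(fd)
            try:
                subprocess.run(["gs", "-q", "-o", repaired,
                                "-sDEVICE=pdfwrite", path], check=True)
                return subprocess.run(["pdftohtml", "-stdout", repaired],
                                      check=True, capture_output=True).stdout
            finally:
                os.unlink(repaired)
-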
JAA
Does the warrior-install.sh get executed by non-Warrior workers?
-
arkiver
yes
-
arkiver
we use warrior-install.sh in the same way for the youtube-grab project to make it run on both the warrior and non-warrior
-
JAA
Ah :-)
-
arkiver
i looked before into extracting URLs from doc(x), etc., but will look into that again soon. this would be converting them to PDF and extracting from PDF using pdftohtml (so we don't need separate extraction stuff for various document formats)
-
arkiver
however, last i checked the packages needed for conversion to PDF were quite large, but will look into it again
-
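For reference, the usual headless conversion route (and presumably the source of the package weight mentioned above) looks something like this; purely illustrative:

    import subprocess

    def doc_to_pdf(path: str, outdir: str) -> None:
        # LibreOffice in headless mode converts doc/docx (and more) to
        # PDF, but pulls in a large set of dependencies.
        subprocess.run(["libreoffice", "--headless", "--convert-to", "pdf",
                        "--outdir", outdir, path], check=True)
-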
fireonlive
hmm if it's one of those big burden things maybe it could be split off into a different queue somehow?
-
fireonlive
then you could have workers solely dedicated to pdf link extraction/document conversion/etc
-
fireonlive
and people can brrrrr on urls with whatever IP space
-
fireonlive
could also get more people participating in the extraction part, though don't know if that's a particular bottleneck