-
AK
Looks like I got the abuse report for the /.well-known/ urls too at some point
-
AK
Oops, just had to reply to the "deadline has expired" email asking Hetzner to please not kill my ips haha
-
fireonlive
weird people/scripts are so uppity about those
-
fireonlive
can we reply with “hi you’re giving off ‘i just got my first cpanel account and these “access logs” things look scary and confusing’ vibes so ima need you to tone that down juuuust a tad”?
-
DLoader
also got abuse mail from hetzner about a hetzner customer getting 8000 requests in 2 hours :D
-
DLoader
I think that was first that wasn't about hitting sinkholes
-
katia
that is 1 request a second approx?
-
DLoader
another one "The owner of the domain www.listvote.com has informed us of "Thousands and thousands of senseless requests. I'm sending 403."."
-
[42]
got the same with hetzner ^
-
TheTechRobo
I wonder what they mean by 'senseless'. Are they calling their own pages useless? :P
-
[42]
it appears to be a bulk abuse report to hetzner for all requests coming from hetzner ips
-
[42]
and hetzner having filtered that by customer
-
AK
Got the same, urgh
-
AK
Not seeing any listvote in the logs at the mo, and the requests are all from the 16th. So we might be clear now
-
Craigle
Ah yep, same here
-
arkiver
pushed an update for better URL extraction from PDFs
-
arkiver
now supporting ' dot ' and ' (dot) ' and ' [dot] '
-
arkiver
some documents show some URLs as
-
arkiver
example dot org/something
-
arkiver
does anyone here know of other ways URLs may be made 'unclickable' in documents?
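A minimal sketch of the ' dot ' / ' (dot) ' / ' [dot] ' refanging described above (the function name and exact patterns are illustrative, not the project's actual code; other "unclickable" styles seen in the wild include hxxp:// and bracketed dots like example[.]org):

```python
import re

# Spelled-out dot separators: ' dot ', ' (dot) ', ' [dot] '.
# Note: the bare 'dot' case can false-positive on ordinary prose
# ("the dot was red"), so real extraction needs more context checks.
DOT_VARIANTS = re.compile(r"\s*(?:\(dot\)|\[dot\]|\bdot\b)\s*", re.IGNORECASE)

def refang(text: str) -> str:
    """Rewrite spelled-out dots back into literal dots."""
    return DOT_VARIANTS.sub(".", text)
```

For example, `refang("example dot org/something")` gives `"example.org/something"`.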
-
JAA
Some people wrote pretty elaborate things for Imgur IIRC.
-
arkiver
hmm i had some in my queuing tool as well yeah
-
AK
Possible bad idea incoming, would there be a way for us to capture strings that had a www. or http:// but didn't end up matching the existing url extraction?
-
AK
If they could be recorded somewhere we might be able to keep improving the url extraction by seeing what was missed from it
-
arkiver
AK: like do you have an example?
-
that_lurker
"The owner of the domain www.listvote.com has informed us of "Thousands and thousands of senseless requests. I'm sending 403."."
-
that_lurker
heh
-
that_lurker
oh seems like everyone else got one too :P
-
AK
Say this was spotted in a file: "example dot org/something". The code might go "Ooh this has a http, but we can't match it using the url extraction". It then queues it off to some file somewhere that we can then review later to either: 1. Manually work out the url and feed back in 2. Work out a regex/pattern/extraction method that would have
-
AK
caught the url and then that can be fed back into the url extraction in the workers
-
AK
Basically crowd sourcing the url extraction based off of what wasn't caught by it
-
arkiver
AK: i'm not sure.
-
arkiver
currently we already queue everything that we think might be a URL in a PDF
-
arkiver
so anything that is not queued was never even considered to possibly be a URL
-
arkiver
so implementing that idea would mean saving all text from which no URL was extracted somewhere, and that could get big
-
AK
good point
-
AK
I was sort of thinking for the items that fit in the "We think this might be a url but we can't quite get a 100% right url from it", but if they get queued then that'll sort them out anyway
-
arkiver
if we know of something like that we'll get a URL out of it in some way and queue it
-
arkiver
we have two categories for pieces of text here:
-
[42]
haven't had abuse mails from hetzner in a while until now, but do you have a sort of template response for that?
-
arkiver
1. this is probably a URL! let's queue it just in case
-
arkiver
2. according to how we see things now, this is definitely not a URL
-
[42]
maybe something to include contact info for throttling or excluding their domain (if that's even done)?
-
arkiver
sometimes category 2 is wrong, but if we had more code in place to determine that something in category 2 is likely a URL (even if we're not sure), it would move to category 1
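The two-category split described above could be sketched like this (the regexes, thresholds, and function name are hypothetical heuristics for illustration, not the project's actual extractor):

```python
import re

# Category 1: probably a URL, queue it just in case.
# Obvious URLs with a scheme or www. prefix:
DEFINITE_URL = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)
# Likelier-than-not candidates: word + (literal dot OR spelled-out dot) + TLD-ish tail.
LIKELY_URL = re.compile(
    r"\b[a-z0-9-]+(?:\s+(?:\(dot\)|\[dot\]|dot)\s+|\.)[a-z]{2,}(?:/\S*)?",
    re.IGNORECASE,
)

def categorize(text: str) -> int:
    """1 = probably a URL, queue it just in case; 2 = not considered a URL."""
    if DEFINITE_URL.search(text) or LIKELY_URL.search(text):
        return 1
    return 2
```

Improving extraction then means widening `LIKELY_URL` so text that used to land in category 2 moves into category 1.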
-
that_lurker
Yeah. Writing statements is always fun :P
-
arkiver
i took out listvote.com for now. it's indeed a loop due to their PDFs
-
[42]
loops as in queueing the same content multiple times?
-
arkiver
no but we pay special attention to queuing all PDFs we come across, and all URLs found in each PDF are queued, etc., so that can create loops
-
arkiver
it's rare though
-
that_lurker
"A statement has been successfully entered for this issue. It will be checked by a staff member and afterwards the relevant ticket will be closed."
-
[42]
gist.github.com/Nothing4You/51d1d89ca37ecab444c67cc8bd32dfa5 does this seem reasonable? if so, it could also serve as template for others
-
arkiver
[42]: i would leave out the "while this issue is not addressed"
-
arkiver
excluded is kind of 'issue fixed' right
-
arkiver
?
-
arkiver
you could even leave out the PDF explanation
-
arkiver
just mention we won't be making more requests to listvote.com
-
[42]
ok
-
[42]
updated the gist
-
arkiver
looks good :)
-
arkiver
[42]: ^
-
that_lurker
make it a rare issue
-
arkiver
yeah could add that
-
[42]
updated
-
that_lurker
yeah makes it sound better :)
-
arkiver
yes
-
[42]
DLoader, Craigle: if you still need to reply to hetzner, see above
-
Craigle
[42] Thanks. I sent one earlier. My usual boilerplate about Archiveteam, no abuse intended, etc. Noted that these weren't "useless requests" but also that the site was removed and would not be accessed again
-
Craigle
It generally covers all the bases with them. If not, I'll deal with whatever they come back with
-
that_lurker
"Oopsie whoopsie my server did an oopsie UwU I wiww make suwe i-it does nyot h-happen again" That would either make them never answer or you would get insta banned :P
-
fireonlive
love it
-
fireonlive
send this :3