-
nicolas17
also $0.01/GB for overage, vs AWS's $0.09/GB :P
-
immibis
that's a pretty normal VPS price if you don't buy from Jeffrey
-
Ryz
Hmm, I see that
eastoregonian.com is constantly processed through here, but an article like
eastoregonian.com/news/local/pertus…10-b736-55b9-b548-f0f3d9c52399.html doesn't appear to be saved
-
Ryz
...Has there been a thorough archiving of the older content, or was this article just unlucky and didn't get archived?
-
JAA
We never went through past articles on these sites.
-
JAA
Just regular retrieval of the homepage + following links from it (once).
-
JAA
That generally holds for everything getting archived regularly here.
-
Ryz
Hmm, there's
eastoregonian.com/sitemap.xml - but it doesn't look like this project accesses it frequently though
-
JAA
We don't follow URLs in sitemaps anyway.
-
datechnoman
We did, but the growth was huge so we had to cull it back, right?
-
JAA
I don't remember that, but could be, yeah.
-
datechnoman
we have no short supply of urls to process lol
-
» nicolas17 feeds Apple's 8GB OS images into urls
-
datechnoman
Optane says no xD
-
fireonlive
i mean i could archivebot that maybe
-
nicolas17
we archivebot'd an iOS version once
-
nicolas17
~30 files, ~200GB
-
nicolas17
took ages
-
datechnoman
Wouldn't take long at all here
-
datechnoman
:p
-
datechnoman
My workers are downloading at 800 Mbps on average
-
nicolas17
datechnoman: yeah, archivebot was getting one file at a time, and processing the warc probably took ages too
-
fireonlive
oof
-
datechnoman
yeah that's what it's designed for
-
datechnoman
Large files not so much
-
fireonlive
3 copies per file!
-
arkiver
immibis: can we stop it please with the stuff like "fascist" hosting providers?
-
arkiver
i also brought this up the other day in #archiveteam-ot
-
immibis
??
-
immibis
Tor literally uses this word for one of its firewall bypass options
-
arkiver
i may be wrong in that case, yes
-
arkiver
a long time ago there was another discussion about some labeling being used to describe a person or some company
-
immibis
you told me in -ot that it's not allowed to talk about actual fascism because archive team is an inclusive space for everyone regardless of politics. Now you're telling me the word also can't be used frivolously to refer to something that exerts an excessive amount of control. Do you just hate the word, or?
-
arkiver
back then i wrote the following about this discussion, on the use of labeling in a political context,
transfer.archivete.am/inline/NzLSU/message.txt (of course this message has some context itself from the discussion back then, but it is still valid)
-
h2ibot
Queuing bot shutting down.
-
h2ibot
Queuing bot started.
-
h2ibot
datechnoman: Restarting unfinished job cmKPUONH for '!a
transfer.archivete.am/HTbgw/filtered_.pdf_output.txt'.
-
arkiver
oh that must be the job datechnoman mentioned earlier
-
Ryz
Does...does this project try to process websites' front pages every day or for certain days?
-
Ryz
Asking since checking
futura-sciences.com - there are some URLs done from this project, but it probably isn't in the listings...?
-
arkiver
Ryz: if it's not on urls-sources, it's not being queued regularly
-
arkiver
Ryz: for every URL we come across, the front page of the website is archived once a month
-
arkiver
Ryz: i see a capture from archiveteam_urls on the first day of every month, which would indeed show it's part of that monthly queuing of domains we come across
-
arkiver
but we only queue a front page in a month if we actually come across a web page of that site in the month
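[Editor's sketch] The monthly rule arkiver describes (one front-page capture per site per month, and only for sites actually encountered that month) can be illustrated as below. This is a minimal sketch, not the project's code: the function name, the in-memory set, and queuing at first sighting rather than on a fixed day are all assumptions.

```python
from datetime import date
from urllib.parse import urlsplit

# Hypothetical in-memory record of (host, year, month) already queued.
_queued = set()


def front_page_to_queue(url, seen_on):
    """Return the front-page URL to queue, or None if this host's
    front page was already queued in the month of `seen_on`."""
    parts = urlsplit(url)
    if not parts.hostname:
        return None
    key = (parts.hostname, seen_on.year, seen_on.month)
    if key in _queued:
        # Already handled this site this month: nothing to queue.
        return None
    _queued.add(key)
    return f"{parts.scheme}://{parts.hostname}/"
```

A site that is never encountered in a given month never produces a key, so its front page is simply not queued that month.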
-
h2ibot
datechnoman: Deduplicating and queuing 9401326 items. (cmKPUONH)
-
Ryz
Hmm... o.o;
-
datechnoman
Thanks for kicking the job arkiver!
-
datechnoman
Don't think there was much left to queue from it
-
imer
arkiver: seeing more of the www.nudetubesex.com spam now (6req/s on my end), seems to be queuing recursively sometimes
transfer.archivete.am/KLzam/2024-03-28_10-55-06.txt
-
imer
seem to be blocking somewhat aggressively, so it's not exploding
-
datechnoman
yeah seem to be safe on that front
-
datechnoman
was hoping we would kill their site lol
-
imer
up to 23/s on my end for nudetube (cc JAA)
-
imer
looks like backfeed is growing too
-
arkiver
going to check now!
-
arkiver
and yeah for the tor project - i think i'm ready on my side, just waiting for the target now (project will not yield a ton of data)
-
imer
arkiver: so, what's happening with the spam? :D
-
arkiver
imer: looking into it
-
arkiver
also looking at latest CDX
-
imer
said that half an hour ago haha, alrighty
-
arkiver
doing a general round of getting rid of stuff we don't need
-
imer
sounds good
-
arkiver
one of the ways of doing that is also looking at the bare CDX containing URLs and seeing what is repeated in there, etc.
-
arkiver
imer: being filtered
-
arkiver
probably some other stuff coming too
-
imer
thanks :)
-
arkiver
need to keep an eye on the expertini URLs, we archived quite a lot of them
-
arkiver
... which may be due to the "=pdf" part in there, which causes us to think it's a PDF and we queue it
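[Editor's sketch] The false positive arkiver suspects can be reproduced with a toy heuristic. The log does not show how the project actually detects PDFs, so both functions and the example URL are hypothetical; the point is only that a substring check fires on a "=pdf" query parameter while a check on the path extension does not.

```python
from urllib.parse import urlsplit


def looks_like_pdf_naive(url):
    # Substring heuristic: fires on any URL that merely mentions "pdf".
    return "pdf" in url.lower()


def looks_like_pdf_by_path(url):
    # Stricter heuristic: only the extension of the path component counts,
    # so query parameters such as "?format=pdf" are ignored.
    return urlsplit(url).path.lower().endswith(".pdf")
```

The stricter check would miss PDFs served from extension-less URLs, so it trades false positives for false negatives.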
-
arkiver
let's see if it runs out sometime soon
-
arkiver
some jose947.com spam, which is not exploding yet i think
-
arkiver
same for paroisses-valdesaone.com
-
arkiver
searchukjobs is together with expertini
-
arkiver
(mostly notes to self)
-
arkiver
otherwise all looking good
-
arkiver
pushed an update with some minor improvements for handling of certain URLs
-
rewby
arkiver: Spinning up target.
-
rewby
arkiver: Target online
-
imer
^can confirm working
-
imer
arkiver: "1711648124 ERROR torsocks[74]: General SOCKS server failure (in socks5_recv_connect_reply() at socks5.c:527)" tor proxy now looks broken?
-
imer
it passed the checkip though
-
imer
gonna let it run, see if it fixes itself
-
BornOn420
It seems like the floodgates have opened on the nudetubesex hack. I see the same behavior now at
pspalls.com,
jsd686.com,
ws.ogutsan.com,
k-hachiken.com
-
myself
Would it be useful to have a !exclude <regex> command? Seems a lot of traffic in here is that but manually.
-
BornOn420
JAA arkiver A lot of that 'nudetubesex' PHP spam on my machines with the domains mentioned above. I see a .*php.*xml URL getting queued every second. And that's excluding the spammy .*php URLs.
-
JAA
Hmm, yeah, that looks bad. :-|
-
BornOn420
good luck at filtering that out without also discarding myIMG.php, shrtnd.php, ndex123_nw.php, etc.
-
nstrom|m
I see tor urls doing stuff but not actually rsyncing anything anywhere
-
nstrom|m
nvm I see it now
-
nstrom|m
guess I'll spin up a few more on it
-
wickerz
Should Tor URLs work on the warrior?
-
wickerz
I see the project available from the list but getting errors when starting. Not sure if on my end or not..
-
wickerz
2024-03-28 20:42:34,693 - seesaw.warrior - ERROR - Error loading pipeline
-
wickerz
Traceback (most recent call last):
-
wickerz
File "/usr/local/lib/python3.9/site-packages/seesaw/warrior.py", line 736, in start_selected_project
-
wickerz
(project, pipeline, config_values) = self.load_pipeline(
-
wickerz
File "/usr/local/lib/python3.9/site-packages/seesaw/warrior.py", line 674, in load_pipeline
-
wickerz
with open(pipeline_path) as f:
-
wickerz
FileNotFoundError: [Errno 2] No such file or directory: '/home/warrior/data/projects/urls-tor-656b405/pipeline.py'
-
wickerz
2024-03-28 20:42:34,694 - seesaw.warrior - WARNING - Project urls-tor did not install correctly and we're ignoring this problem.
-
JAA
Probably not with how it works currently.
-
JAA
arkiver: ^
-
imer
also seem to be missing auto-reclaiming on the tor project
-
imer
or real long ttl
-
JAA
Yeah, not enabled.
-
JAA
We should maybe filter out v2 onions since those can't work anymore.
-
JAA
Or just let them fail into unretrievable, I guess.
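[Editor's sketch] If v2 onions were filtered up front, one way to tell them apart is address length: v2 onion addresses are 16 base32 characters and v3 addresses are 56. The function name and the choice to classify by length alone (no checksum validation) are assumptions for illustration.

```python
import re
from urllib.parse import urlsplit

# Onion addresses use the lowercase base32 alphabet a-z, 2-7.
V2_LABEL = re.compile(r"^[a-z2-7]{16}$")
V3_LABEL = re.compile(r"^[a-z2-7]{56}$")


def onion_version(url):
    """Return 2 or 3 for a v2/v3 onion URL, or None for anything else."""
    host = (urlsplit(url).hostname or "").lower()
    if not host.endswith(".onion"):
        return None
    # Take the label directly before ".onion", ignoring any subdomains.
    label = host[: -len(".onion")].rsplit(".", 1)[-1]
    if V2_LABEL.match(label):
        return 2
    if V3_LABEL.match(label):
        return 3
    return None
```

Items for which this returns 2 could be dropped before they are handed out, instead of being claimed and left to time out.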
-
JAA
A lot of the outstanding claims are that.
-
JAA
I've applied the same reclaim settings as on the main project.
-
JAA
2000s TTL, 3 tries
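[Editor's sketch] Read literally, "2000s TTL, 3 tries" means a claimed item that is not finished within 2000 seconds goes back into the queue, and after three attempts it is given up on. The tracker's real implementation is not shown in this log; the class, the function, and the exact off-by-one behaviour of the try counter are assumptions.

```python
from dataclasses import dataclass

TTL_SECONDS = 2000  # from the log: "2000s TTL"
MAX_TRIES = 3       # from the log: "3 tries"


@dataclass
class Claim:
    item: str
    claimed_at: float  # timestamp (seconds) when the item was handed out
    tries: int         # how many times the item has been handed out so far


def next_state(claim, now):
    """Return 'claimed', 'requeue', or 'unretrievable' for a claim."""
    if now - claim.claimed_at < TTL_SECONDS:
        return "claimed"        # still within its TTL, leave it alone
    if claim.tries < MAX_TRIES:
        return "requeue"        # expired, but attempts remain
    return "unretrievable"      # expired on its last attempt
```

Under this reading, a dead v2 onion item would be handed out three times and then land in unretrievable, which matches the "just let them fail" option.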
-
imer
JAA: can we get a filter for jsd686.com? as BornOn420 said, it's the same issue as the nudetube site
-
imer
> 30req/s on my end
-
imer
can confirm the other mentioned ones are there too, just way less currently
-
JAA
^
jsd686%.com is being filtered now.
-
JAA
Blew up a lot since I checked earlier indeed.
-
JAA
tonaku.com is another one.
-
JAA
This should really be handled differently, but I have no idea how.
-
BornOn420
ws.ogutsan.com and www.euszati.hu and www.pspalls.com as well
-
BornOn420
yep suspecting that one as well
-
imer
don't see the euszati.hu currently
-
BornOn420
See jyqisajojawopy.anvgames.com here as well
-
BornOn420
imer you're right euszati.hu is legit and NOT spam
-
BornOn420
my mistake
-
JAA
There's no clear pattern to these other than /<randomalnum>.php?<randomalnum>.xml which could easily also happen on a legit site.
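[Editor's sketch] The shape JAA describes can be written as a regular expression, which also shows why it is unsafe as a filter on its own: ordinary-looking URLs fit it too. The regex and the example paths are illustrative assumptions, not the filter the project applied (the log says domains were blocked instead).

```python
import re

# /<randomalnum>.php?<randomalnum>.xml, matched against path plus query.
SPAM_SHAPE = re.compile(r"^/[A-Za-z0-9]+\.php\?[A-Za-z0-9]+\.xml$")


def matches_spam_shape(path_and_query):
    """True if the path-and-query string has the spam URL shape."""
    return SPAM_SHAPE.match(path_and_query) is not None


# A random-looking spam path matches...
print(matches_spam_shape("/a8Kd3.php?x9Qz.xml"))
# ...but so does a plausible legitimate one, which is the problem.
print(matches_spam_shape("/index.php?sitemap.xml"))
```

Because the pattern cannot distinguish the two cases, filtering per domain is the more conservative choice.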
-
imer
yeah, probably just block the domains and hope for the best for now
-
imer
can't be that many out there
-
imer
hey look what's climbed to the top in my logtop window :D
1 820 17.83/s ws.ogutsan.com
2 490 10.65/s tonaku.com
3 305 6.63/s www.pspalls.com
-
JAA
Moar filters added for those.
-
imer
thanks JAA
-
BornOn420
Just in: anvgames.com (without the weird subdomains)
-
JAA
I saw that earlier but then it disappeared again.
-
JAA
Added