-
fireonlive
via #nanog: <+rss> Anyone got a contact at OpenAI. They have a spider problem.: Posted by John Levine on Apr 10 As I think I have mentioned before, I have the world's lamest content farm at web.sp.am . Click on a link or two and you'll get the idea. Unfortunately, GPTBot has found it and has not gotten the idea. It has fetched over 3 million pages today. Before someone tells me to fix my robots.txt, this is a content farm so [...]
seclists.org/nanog/2024/Apr/63
-
fireonlive
let's hope we don't stumble on it :p
-
nicolas17
>content farm
-
nicolas17
it's easy to solve that issue
-
nicolas17
shut it down
-
fireonlive
i think it's intended to trap bots
-
fireonlive
>IECC ChurnWare 0.3
-
nicolas17
well, it worked?
-
nicolas17
I don't understand
-
fireonlive
hmm maybe he's looking for bad bots? i'm not too sure either
-
nicolas17
"I made a set of websites linking to each other to trap bots in a loop following those links, now a bot got trapped in a loop following those links, how do I stop it"
-
fireonlive
>A few years ago the bingbot got trapped but fortunately I knew someone at Microsoft who could pass the word. He reported back that while he could not go into detail, there was a great deal of animated conversation at the other end of the hall, and shortly after that it stopped.
-
fireonlive
lol
-
nicolas17
really don't know what he expected
-
fireonlive
don't see anything else about it
-
fireonlive
yeah not sure lol
-
pabs
arkiver: in urls-sources, I note you disable web.archive.org links for a couple of dead blog aggregators that still have working blogs linked from them. should I just copy the blogs URLs into urls-sources instead?
-
pabs
arkiver: btw, found a few more FOSS blog aggregators
ArchiveTeam/urls-sources #30
-
arkiver
pabs: ah that disabling of web.archive.org happens automatically
-
» pabs lol at the filename
-
fireonlive
x3
-
fireonlive
its rss feed is also pumped into #web3
-
datechnoman
arkiver quick status update. Everything has been running pretty smoothly. Only real thing of note is that we have been pumping through a fair bit of porn the last few hours
-
datechnoman
Not too sure if we are wanting that here but worth a mention :)
-
datechnoman
Queue has been stable all day
-
fireonlive
i have a few of those to add :P
-
datechnoman
Haha why am I not surprised you piped up fireonlive :P
-
arkiver
datechnoman: i'm going to do a round through a recent CDX, will probably come across the porn and see if it is a problem
-
imer
can probably restore the stashed data as well, right? no explosions so far
-
arkiver
yep coming up!
-
arkiver
so this is only news sites from which we now get outlinks, i plan on soon adding political/government/research sites too
-
arkiver
datechnoman: imer: todo:secondary stash is moving back in
-
datechnoman
Thanks for that arkiver and I also agree that political/government/research is the next step
-
arkiver
yeah :)
-
arkiver
i see a loop similar to a previous loop in the logs
-
arkiver
it has not escalated yet, but i will add support for it to be killed
-
datechnoman
Sounds like a plan! Weed em out!
-
arkiver
:)
-
arkiver
moving some of these 'share this web page' links to one-time URLs list, so they don't go into the bloom filter
-
datechnoman
That makes sense
-
datechnoman
Is it just another bloom filter? E.g. multiple bloom filters that get queried for different things?
-
arkiver
no, it does not go through a bloom filter at all
-
arkiver
these are usually either those pixels with a one time code in the URL
-
arkiver
or the "share to facebook/twitter/etc." links that only exist on the web page that would be shared itself.
-
arkiver
oh hah, actually they do go through a bloom filter, but we may remove that filter any time :P
-
datechnoman
Haha all good! Also good planning for bloom filter hygiene
-
datechnoman
No point wasting bloom filter resources on one time links etc
-
arkiver
yep
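-
(editor's sketch)
To make the exchange above concrete, here is a minimal sketch of URL dedup with two Bloom filters: a long-lived one for ordinary URLs and a disposable one for one-time URLs (tracking pixels, "share to ..." links), so the latter never take up space in the main filter. This is not the project's actual code. The class names, the filter size, and the substring markers used to spot one-time URLs are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter; size and hash count are illustrative assumptions."""

    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        # Double hashing: derive k bit positions from one SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.size for i in range(self.k)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item):
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(item))


# Hypothetical markers for one-time URLs; a real list would be maintained by hand.
ONE_TIME_MARKERS = ("facebook.com/sharer", "twitter.com/intent/tweet", "/pixel?")


class Dedup:
    def __init__(self):
        self.main = BloomFilter()      # long-lived filter for ordinary URLs
        self.one_time = BloomFilter()  # disposable filter for one-time URLs

    def admit(self, url):
        """Return True if the URL has not been seen in the filter it is routed to."""
        target = self.one_time if any(m in url for m in ONE_TIME_MARKERS) else self.main
        if url in target:
            return False
        target.add(url)
        return True

    def reset_one_time(self):
        # Dropping the disposable filter leaves the main filter untouched.
        self.one_time = BloomFilter()
```

Because a Bloom filter cannot delete entries, keeping throwaway URLs in their own filter is what makes it possible to discard them later without rebuilding the main one.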
-
arkiver
indeed i see more porn stuff than usual
-
katia
👀
-
datechnoman
Ohhh yes. There is quite a lot. Like don't get me wrong, I like porn as much as the next man lol
-
datechnoman
But the videos were starting to pile up in the GBs :O
-
datechnoman
Which is a lot of HTML we could have instead ;)
-
arkiver
updates are in!
-
arkiver
and forced now as minimum
-
datechnoman
Cheers mate! Love your work as always :D
-
datechnoman
Thanks for requesting the stash also :)
-
arkiver
thanks datechnoman :)
-
arkiver
redo stash is now also being fed back in
-
datechnoman
Smick! Then once we clean it all up maybe look at those new outlink sources :D
-
datechnoman
Get a nice comprehensive archiving solution together!
-
arkiver
yeah!
-
arkiver
capturing all the stuff on the internet
-
arkiver
pabs: merged
-
datechnoman
The more important stuff ****
-
datechnoman
haha
-
datechnoman
Can't get everything ;)
-
datechnoman
nor will we ever
-
arkiver
very true!
-
arkiver
but we're well underway to get the most interesting bits
-
imer
doing a lot of gravatar atm (~50%), can't dig into where that might be coming from right now
-
datechnoman
yeah we are smashing through them. They will push through quite fast
-
datechnoman
Going so fast that the websocket is breaking :P
-
arkiver
things are looking good to me
-
arkiver
we got through the gravatar stuff fast
-
fireonlive
datechnoman: :P
-
arkiver
i added the government and political sites as well to have outlinks extracted from
-
fireonlive
sweet
-
fireonlive
juicy pdf links
-
fireonlive
🔗
-
arkiver
the amount of porn in the recent WARCs has been going down
-
arkiver
it's all holding up really surprisingly well
-
» fireonlive 😶
-
arkiver
queue going up a bit, largely URLs related to research
-
arkiver
i'll leave it going while i'm off for the night
-
fireonlive
have fun!
-
datechnoman
Looks good mate. Just needs time to chew through all the new urls
-
datechnoman
For some odd reason my containers didn't update when you made the change and they have been idling out of date :( Just kicked them now
-
datechnoman
Should speed things right up!
-
datechnoman
Now we are zooming again :D
-
imer
oh no target -1s, zooming too fast
-
datechnoman
Na was just some mega uploads causing some backlog. All cleared up already :)
-
datechnoman
Much data, much wow :D