02:22:20 via #nanog: <+rss> Anyone got a contact at OpenAI. They have a spider problem.: Posted by John Levine on Apr 10 As I think I have mentioned before, I have the world's lamest content farm at https://www.web.sp.am/ . Click on a link or two and you'll get the idea. Unfortunately, GPTBot has found it and has not gotten the idea. It has fetched over 3
02:22:20 million pages today. Before someone tells me to fix my robots.txt, this is a content farm so [...] https://seclists.org/nanog/2024/Apr/63
02:22:26 let's hope we don't stumble on it :p
02:51:46 >content farm
02:52:02 it's easy to solve that issue
02:52:04 shut it down
02:52:40 i think it's intended to trap bots
02:52:47 >IECC ChurnWare 0.3
02:53:00 well, it worked?
02:53:48 I don't understand
02:54:12 hmm maybe he's looking for bad bots? i'm not too sure either
02:54:13 "I made a set of websites linking to each other to trap bots in a loop following those links, now a bot got trapped in a loop following those links, how do I stop it"
02:55:49 https://mailman.nanog.org/pipermail/nanog/2024-February/224864.html
02:56:08 >A few years ago the bingbot got trapped but fortunately I knew someone at Microsoft who could pass the word. He reported back that while he could not go into detail, there was a great deal of animated conversation at the other end of the hall, and shortly after that it stopped.
02:56:11 lol
02:56:54 ah found his blog post: https://circleid.com/posts/20120713_silly_bing
02:59:06 really don't know what he expected
02:59:50 don't see anything else about it
03:00:31 yeah not sure lol
03:09:32 arkiver: in urls-sources, I note you disable web.archive.org links for a couple of dead blog aggregators that still have working blogs linked from them. should I just copy the blogs URLs into urls-sources instead?
06:04:35 arkiver: btw, found a few more FOSS blog aggregators https://github.com/ArchiveTeam/urls-sources/pull/30
06:09:16 pabs: ah that disabling of web.archive.org happens automatically
06:10:14 https://github.com/ArchiveTeam/urls-sources/blob/master/3600_web3isgoinggreat_com.txt :D
06:10:26 * pabs lol at the filename
06:10:41 x3
06:11:18 its rss feed is also pumped into #web3
07:30:50 arkiver quick status update. Everything has been running pretty smoothly. Only real thing of note is that we have been pumping through a fair bit of porn the last few hours
07:31:19 Not too sure if we are wanting that here but worth a mention :)
07:32:26 Queue has been stable all day
07:54:11 i have a few of those to add :P
08:32:30 Haha why am I not surprised you piped up fireonlive :P
08:56:22 datechnoman: i'm going to do a round through a recent CDX, will probably come across the porn and see if it is a problem
08:59:21 can probably restore the stashed data as well, right? no explosions so far
09:04:11 yep coming up!
09:04:54 so this is only news sites from which we now get outlinks, i plan on soon adding political/government/research sites too
09:06:14 datechnoman: imer: todo:secondary stash is moving back in
09:10:28 Thanks for that arkiver and I also agree that political/government/research is the next step
09:11:15 yeah :)
09:11:23 i see a loop similar to a previous loop in the logs
09:11:34 it has not escalated yet, but will support for it to be killed
09:12:20 Sounds like a plan! Weed em out!
09:14:42 :)
09:17:07 moving some of these 'share this web page' links to one-time URLs list, so they don't go into the bloom filter
09:24:09 That makes sense
09:24:43 Is it just another bloom filter? Eg; multiple bloom filters that get queried for different things?
09:25:04 no, it does not go through a bloom filter at all
09:25:18 these are usually either those pixels with a one time code in the URL
09:25:41 or the "share to facebook/twitter/etc." links that only exist on the web page that would be shared itself.
09:26:03 oh hah, actually they do go through a bloom filter, but we may remove that filter any time :P
09:27:01 Haha all good! Also good planning for bloom filter hygiene
09:28:26 No point wasting bloom filter resources on one time links etc
09:32:29 yep
09:32:54 indeed i see more porn stuff than usual
09:33:25 👀
09:36:05 Ohhh yes. There is quite a lot. Like don't get me wrong, I like porn as much as the next man lol
09:36:27 But the videos were starting to pile up in the GB's :O
09:37:06 Which is a lot of HTML we could have instead ;)
09:46:40 updates are in!
09:49:44 and forced now as minimum
09:50:49 Cheers mate! Love your work as always :D
09:51:02 Thanks for requesting the stash also :)
09:51:09 thanks datechnoman :)
09:52:10 redo stash is now also being fed back in
09:56:12 Smick! Then once we clean it all up maybe look at those new outlink sources :D
09:56:48 Get a nice comprehensive archiving solution together!
10:00:22 yeah!
10:00:28 capturing all the stuff on the internet
10:02:47 pabs: merged
10:02:53 The more important stuff ****
10:02:54 haha
10:03:01 Cant get everything ;)
10:03:07 nor will we ever
11:23:00 very true!
11:23:09 but we're well underway to get the most interesting bits
11:44:51 doing a lot of gravatar atm (~50%) cant dig where that might be coming from atm
11:49:57 yeah we are smashing through them. They will push through quite fast
11:53:30 Going so fast that the websocket is breaking :P
14:29:04 things are looking good to me
14:29:10 we got through the gravatar stuff fast
15:42:31 datechnoman: :P
15:44:45 i added the government and political sites as well to have outlinks extracted from
15:50:53 sweet
15:51:03 juicy pdf links
15:51:07 🔗
15:56:32 the amount of porn in the recent WARCs has been going down
16:03:28 it's all holding up really surprisingly well
16:06:37 * fireonlive 😶
17:05:49 queue going up a bit, largely URLs related to research
17:06:00 i'll leave it going while i'm off for the night
17:20:05 have fun!
20:52:53 Looks good mate. Just needs time to chew through all the new urls
20:58:00 For some odd reason my containers didn't update when you made the change and they have been idling out of date :( Just kicked them now
20:58:07 Should speed things right up!
21:06:08 Now we are zooming again :D
21:24:55 oh no target -1s, zooming too fast
21:57:36 Na was just some mega uploads causing some backlog. All cleared up already :)
22:41:26 Much data, much wow :D