00:03:01 JAA: another one 30=200 http://dadanja.com/DPPcRBd.php/KIe9h.xml 23=502 http://dadanja.com/m4zBK.php?2Dok.xml
00:03:22 hope this stops soon -_-
00:03:47 Ack, and same
00:13:05 just spot-checking some =0 urls, lots of wordpress blogs must be using bitninja's blacklist :(
00:14:48 BitNinja--
00:14:49 -eggdrop- [karma] 'BitNinja' now has -1 karma!
00:21:28 hopefully retries with a cleaner IP get those
00:28:46 bitninja can suck my ass barf
06:24:23 JAA: i have ways to handle those
06:24:35 JAA: but which ones were being filtered out now to 'fix' this?
06:25:55 JAA: let's let the old onions just fail
06:26:00 actually don't some of them still work?
06:26:54 JAA: in case of these kinds of sites, can we please not add filters but just pause the project until i can look into it?
06:27:23 i really don't want to add unnecessary filters, or we need to have a good view on which should be removed again
06:30:08 arkiver: All the problematic domains mentioned above. These, I think: jsd686.com ws.ogutsan.com www.pspalls.com nijihypogyhozymy.anvgames.com anvgames.com jyqisajojawopy.anvgames.com tonaku.com
06:30:12 And sure, will do in the future.
06:31:45 Regarding v2 onions, I'd be surprised if you could still establish a circuit to them. The vast, vast majority of relays should run a new enough version of Tor by now that they don't support v2 anymore. I do wonder whether we could run our own relays with an older version of Tor and establish a custom circuit through them...
06:32:22 Oh, forgot dadanja.com in the list above.
06:42:42 JAA: but that means there's possibly always a tiny part that is not updated - so we should support that
06:43:27 i can't test well with these being filtered out
06:44:24 JAA: i'll remove the patterns for the websites you listed. will probably let it run for a bit so we get a nice sample, and then i can work on blocking it out
06:47:18 k-hachiken.com too i think
06:47:45 Tor has nice metrics, but it appears that they're currently broken. Welp.
06:48:03 https://metrics.torproject.org/versions.html should have a graph of Tor versions in the network.
06:52:37 i'll be off for 45 minutes or so
06:52:45 well it's pretty awesome we're getting tor now :)
07:04:35 You can add nakedcollegegirlssex.com to the problematic domains with PHP spam
07:09:41 yeah i noticed that one
07:23:42 arkiver: I forgot I still had the terminal open. Here are all the exact filters I added: ^http://jsd686%.com/ ^http://tonaku%.com/ ^http://ws%.ogutsan%.com/ ^http://www%.pspalls%.com/ ^http://k%-hachiken%.com/ ^http://nijihypogyhozymy%.anvgames%.com/ ^http://jyqisajojawopy%.anvgames%.com/ ^http://anvgames%.com/ ^http://dadanja%.com/
08:31:22 arkiver/JAA: spam is back at a medium level if you want to hit the pause button 30=200 http://nakedcollegegirlssex.com/2fJaqH.php?FXcH.xml 36=200 http://jsd686.com/cA1ij.php?v3DH.xml
08:31:44 unless arkiver needs it "blown up"
08:33:54 in which case give it an hour tops :p
08:39:51 Spam is pretty consistent :(
08:40:02 Might spin down for a bit to save some cash
08:54:42 Target is struggling atm anyway
09:04:07 imer: i'm back, a bit of blow up is good
09:04:19 welcome back :)
09:04:22 :)
09:04:40 very curious what solution you'll come up with
09:04:44 huh
09:04:54 140k items in backfeed is not what i would call 'blowing up'
09:05:09 slow targets have been slowing it down :p
09:05:35 imer: the framework for "the solution" is this https://github.com/ArchiveTeam/urls-grab/blob/master/urls.lua#L118-L276
09:06:51 ah, so you can run multiple patterns, and if all match it gets thrown out?
09:07:13 well
09:07:32 so that table contains a URL as key, and for each key it contains a list of patterns.
09:08:12 if the URL we get matches the key, it will check if each of the URLs is discovered on the web page. if yes, it will declare the web page spammy and not queue any new URLs for it
09:08:35 oh, I see.
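(The filters quoted at 07:23:42 are Lua patterns, where `%` escapes the literal `.` and `-`. A minimal sketch of the same first-line-of-defense check, translated into Python regexes purely for illustration; the function name `is_filtered` is hypothetical and not from the project's code:)

```python
import re

# The Lua patterns from 07:23:42, translated to Python regexes.
# Lua's `%.` and `%-` escape literal characters, i.e. `\.` and `\-` in regex.
SPAM_FILTERS = [
    r"^http://jsd686\.com/",
    r"^http://tonaku\.com/",
    r"^http://ws\.ogutsan\.com/",
    r"^http://www\.pspalls\.com/",
    r"^http://k\-hachiken\.com/",
    r"^http://nijihypogyhozymy\.anvgames\.com/",
    r"^http://jyqisajojawopy\.anvgames\.com/",
    r"^http://anvgames\.com/",
    r"^http://dadanja\.com/",
]

def is_filtered(url: str) -> bool:
    """Return True if the URL matches any spam filter pattern (and so
    would be dropped before being queued)."""
    return any(re.match(p, url) for p in SPAM_FILTERS)

print(is_filtered("http://jsd686.com/cA1ij.php?v3DH.xml"))  # True
print(is_filtered("http://example.com/index.html"))         # False
```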
that's smart
09:08:55 yeah it's the next "line of defense" after simple filtering of URLs
09:10:08 and to prevent having to read and search the entire web page (which is costly for CPU), we use URLs discovered by Wget-AT and match against those
09:22:57 good stuff
09:25:27 That's really smart!
09:25:31 I like it. Will be very effective
09:25:45 When will that roll out?
09:27:21 datechnoman: This has been in use for at least almost a year. Just needs another rule to handle this spam trap, I guess.
09:27:32 ohhh gotcha lol... silly me
09:28:03 Re-read the line and I misread it >.<
09:29:27 A bit over a year, in fact, first introduced on 2023-03-21 and then refactored into the current table structure a couple of days later.
10:16:22 rewby: can we decrease the megaWARC size for urls-tor to 1 GB? so they are uploaded more frequently
10:18:03 i still need to add to the megaWARC factory that it pumps out a megaWARC at least once a day
10:18:23 maybe that should be done instead of the 1 GB limit
15:38:49 well, nothing seems to be exploding after i removed those patterns from the filters list
15:38:54 todo:backfeed stays pretty low
20:11:16 arkiver: probably due to target issues? still seeing those two at the top at ~20%ish of requests
20:11:24 seems like a waste of resources tbh
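(The second line of defense described around 09:07:32-09:10:08 can be sketched as below. This is an illustration in Python of the logic in the linked urls.lua, not the actual implementation; the table name `SPAM_SIGNATURES`, the function names, and the example signature entry for dadanja.com are all hypothetical:)

```python
import re

# Hypothetical signature table: key is a pattern matched against the item
# URL, value is a list of URLs that must ALL be present among the URLs
# Wget-AT discovered on the page before the page is declared spammy.
SPAM_SIGNATURES = {
    r"^http://dadanja\.com/": [
        "http://dadanja.com/DPPcRBd.php/KIe9h.xml",
        "http://dadanja.com/m4zBK.php?2Dok.xml",
    ],
}

def is_spammy(item_url: str, discovered_urls: set) -> bool:
    """A page is spammy if its URL matches a signature key and every
    signature URL was discovered on the page. Checking the set of
    discovered URLs avoids scanning the full page body, which would be
    costly in CPU time."""
    for pattern, required in SPAM_SIGNATURES.items():
        if re.match(pattern, item_url) and all(u in discovered_urls for u in required):
            return True
    return False

def urls_to_queue(item_url, discovered_urls):
    """Queue nothing from a spammy page; otherwise queue everything found."""
    if is_spammy(item_url, set(discovered_urls)):
        return []
    return list(discovered_urls)
```

The design point from 09:10:08 is that the match runs against URLs already extracted by Wget-AT rather than against the raw HTML, so the extra check is cheap even on large pages.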