00:01:19 also $0.01/GB for overage, vs AWS's $0.09/GB :P
02:41:08 that's a pretty normal VPS price if you don't buy from Jeffrey
02:46:38 Hmm, I see that https://www.eastoregonian.com/ is constantly processed through here, but an article like https://www.eastoregonian.com/news/local/pertussis-whooping-walla-walla-county/article_0890b110-b736-55b9-b548-f0f3d9c52399.html doesn't appear to be saved
02:47:05 ...Has there been a through archiving of the more past count, or was this article just unlucky and didn't get archied?
02:47:08 *archived
02:47:29 We never went through past articles on these sites.
02:48:01 Just regular retrieval of the homepage + following links from it (once).
02:49:04 That generally holds for everything getting archived regularly here.
02:50:52 Hmm, there's https://www.eastoregonian.com/sitemap.xml - but it doesn't look like this project access them frequently though
02:53:24 We don't follow URLs in sitemaps anyway.
02:55:23 We did but the growth was huge so had to cull it back right>
02:55:25 ?*
02:58:50 I don't remember that, but could be, yeah.
03:11:34 we have no short supply of urls to process lol
03:17:09 * nicolas17 feeds Apple's 8GB OS images into urls
03:20:29 Optane says no xD
03:41:00 i mean i could archivebot that maybe
03:42:45 we archivebot'd an iOS version once
03:43:00 ~30 files, ~200GB
03:43:02 took ages
03:45:38 Wouldnt take long at all here
03:45:39 :p
03:46:20 My workers are downloading on average 800mbps
03:47:02 datechnoman: yeah, archivebot was getting one file at a time, and processing the warc probably took ages too
03:50:08 oof
03:54:27 yeah thats what its designed for
03:54:31 Large files not so much
04:03:19 3 copies per file!
06:50:45 immibis: can we stop it please with the stuff like "fascist" hosting providers?
06:51:00 i also brought this up the other day in #archiveteam-ot
06:51:01 ??
06:51:34 Tor literally uses this word for one of its firewall bypass options
06:52:25 i may be wrong in that case, yes
06:54:08 a long time ago there was another discussion about some labeling being used to describe a person or some company
06:55:12 you told me in -ot that it's not allowed to talk about actual fascism because archive team is an inclusive space for everyone regardless of politics. Now you're telling me the word also can't be used frivolously to refer to something that exerts an excessive amount of control. Do you just hate the word, or?
06:55:35 back then i wrote the following about this discussion, on the use of labeling in a political context, https://transfer.archivete.am/inline/NzLSU/message.txt (of course this message has some context itself from the discussion back then, but it is still valid)
06:58:49 Queuing bot shutting down.
06:59:01 Queuing bot started.
06:59:03 datechnoman: Restarting unfinished job cmKPUONH for '!a https://transfer.archivete.am/HTbgw/filtered_.pdf_output.txt'.
06:59:39 oh that must the job datechnoman mentioned earlier
07:00:00 Does...does this project try to process websites' front pages every day or for certain days?
07:00:25 Asking since checking https://www.futura-sciences.com/ - there's some URLs done from this project, but it probably isn't in the listings...?
07:01:05 Ryz: if it's not on urls-sources, it's not being queued regularly
07:01:22 Ryz: for every URLs we come across, the front page of the website is archived once a month
07:02:40 Ryz: i see a capture from archiveteam_urls on the first day of every month, which would indeed show it's part of that monthly queuing of domains we come across
07:03:02 but we only queue a front page in a month if we actually come across a web page of that site in the month
07:03:18 datechnoman: Skipped 4203 invalid URLs: https://transfer.archivete.am/13M2pB/filtered_.pdf_output.txt.bad-urls.txt (cmKPUONH)
07:03:19 datechnoman: Fixed 1 unprintable URLs: https://transfer.archivete.am/K3myf/filtered_.pdf_output.txt.not-printable.txt (cmKPUONH)
07:03:20 datechnoman: Skipped 1 very long URLs: https://transfer.archivete.am/LDhzE/filtered_.pdf_output.txt.skipped.txt (cmKPUONH)
07:03:21 datechnoman: Deduplicating and queuing 9401326 items. (cmKPUONH)
07:03:42 Hmm... o.o;
08:09:19 Thanks for kicking the job arkiver!
08:09:33 Don't think there was much left to queue from it
10:55:31 arkiver: seeing more of the www.nudetubesex.com spam now (6req/s on my end), seems to be queuing recursively sometimes https://transfer.archivete.am/KLzam/2024-03-28_10-55-06.txt
10:55:31 inline (for browser viewing): https://transfer.archivete.am/inline/KLzam/2024-03-28_10-55-06.txt
11:23:53 seem to be blocking somewhat aggressively, so it's not exploding
11:24:20 yeah seem to be safe on that front
11:24:30 was hoping we would kill their site lol
12:07:05 we do seem to be slowing down gradually though https://share.dl.je/2024/03/2024-03-28_12-06-57_Xi335uzUKi.png
14:21:26 up to 23/s on my end for nudetube (cc JAA)
14:22:10 looks like backfeed is growing too
15:28:24 going to check now!
16:06:48 and yeah for the tow project - i think i'm ready on my side, just waiting for the target now (project will not yield a ton of data)
16:09:16 arkiver: so, what's happening with the spam? :D
16:10:21 imer: looking into it
16:10:26 also looking at latest CDX
16:10:36 said that half an hour ago haha, alrighty
16:10:50 doing a general round of getting rid of stuff we don't need
16:11:00 sounds good
16:11:17 one of the ways of doing that is also looking at the bare CDX containing URLs and seeing what is repeated in there, etc.
16:17:35 imer: being filtered
16:17:45 probably some other stuff coming too
16:19:47 thanks :)
16:34:58 need to keep an eye on the expertini URLs, we archived quite a lot of them
16:35:33 ... which may be due to the "=pdf" part in there, which causes us to think it's a PDF and we queue it
16:35:38 let's see if it runs out sometimes soon
16:38:06 some jose947.com spam, which is not exploding yet i think
16:39:34 same for paroisses-valdesaone.com
16:40:43 searchukjobs is together with expertini
16:48:22 (mostly notes to self
16:48:22 )
16:53:20 otherwise all looking good
16:53:39 pushed an update with some minor improvements for handling of certain URLs
17:33:07 arkiver Spinning upt arget.
17:49:12 arkiver: Target online
17:52:09 ^can confirm working
17:52:47 arkiver: "1711648124 ERROR torsocks[74]: General SOCKS server failure (in socks5_recv_connect_reply() at socks5.c:527)" tor proxy now looks broken?
17:53:01 it passed the checkip though
17:53:32 https://transfer.archivete.am/UPHC3/2024-03-28_17-53-24.txt logs
17:53:32 inline (for browser viewing): https://transfer.archivete.am/inline/UPHC3/2024-03-28_17-53-24.txt
17:54:09 gonna let it run, see if it fixes itself
18:14:40 It seems like the floodgates have opened on the nudetubesex hack. I see the same behavior now at http://www.pspalls.com/, http://jsd686.com/, http://ws.ogutsan.com/, http://k-hachiken.com/
18:16:47 Would it be useful to have a !exclude command? Seems a lot of traffic in here is that but manually.
18:39:25 JAA arkiver A lot of that 'nudetubesex' PHP spam on my machines with the domains mentioned above. I see a .*php.*xml URL getting queued every second. And that's excluding the spammy .*php URLs.
18:41:02 Hmm, yeah, that looks bad. :-|
18:49:35 good luck at filtering that out without also discarding myIMG.php, shrtnd.php, ndex123_nw.php, etc.
19:29:10 I see tor urls doing stuff but not actually rsyncing anything anywhere
19:45:19 nvm I see it now
19:45:40 guess I'll spin up a few more on it
20:43:47 Should Tor URLs work on the warrior?
20:45:43 I see the project available from the list but getting errors when starting. Not sure if on my end or not..
20:46:36 2024-03-28 20:42:34,693 - seesaw.warrior - ERROR - Error loading pipeline
20:46:37 Traceback (most recent call last):
20:46:37 File "/usr/local/lib/python3.9/site-packages/seesaw/warrior.py", line 736, in start_selected_project
20:46:37 (project, pipeline, config_values) = self.load_pipeline(
20:46:37 File "/usr/local/lib/python3.9/site-packages/seesaw/warrior.py", line 674, in load_pipeline
20:46:37 with open(pipeline_path) as f:
20:46:37 FileNotFoundError: [Errno 2] No such file or directory: '/home/warrior/data/projects/urls-tor-656b405/pipeline.py'
20:46:38 2024-03-28 20:42:34,694 - seesaw.warrior - WARNING - Project urls-tor did not install correctly and we're ignoring this problem.
20:50:01 Probably not with how it works currently.
20:50:02 arkiver: ^
22:50:34 also seem to be missing auto-reclaiming on the tor project
22:50:44 or real long ttl
22:55:14 Yeah, not enabled.
22:55:32 We should maybe filter out v2 onions since those can't work anymore.
22:55:41 Or just let them fail into unretrievable, I guess.
22:56:02 A lot of the outstanding claims is that.
23:06:07 I've applied the same reclaim settings as on the main project.
23:06:15 2000s TTL, 3 tries
23:12:11 JAA: can we get a filter for jsd686.com? as BornOn420 said the same issue as the nudetube site
23:13:25 > 30req/s on my end
23:14:03 can confirm the other mentioned ones are there too, just way less currently
23:15:48 ^http://jsd686%.com/ is being filtered now.
23:16:08 Blew up a lot since I checked earlier indeed.
23:16:47 tonaku.com is another one.
23:17:14 This should really be handled differently, but I have no idea how.
23:20:46 ws.ogutsan.com and www.euszati.hu and www.pspalls.com as well
23:20:56 29=502 http://nijihypogyhozymy.anvgames.com/PRdK.php?5CBOTtmC.xml
23:21:09 yep suspecting that one as well
23:22:21 18=200 http://ws.ogutsan.com/UfxeMosf.php?c0apq.xml 37=0 http://anvgames.com/NRIx.php?JZ2KPX9p.xml 39=0 http://www.pspalls.com/7fLRI.php?oDniq5.xml
23:22:28 dont see the euszati.hu currently
23:22:31 See jyqisajojawopy.anvgames.com here as well
23:23:39 imer you're right euszati.hu is legit and NOT spam
23:23:43 my mistake
23:25:03 There's no clear pattern to these other than /.php?.xml which could easily also happen on a legit site.
23:25:15 yeah, probably just block the domains and hope for the best for now
23:25:44 cant be that many out there
23:26:08 hey look whats climbed to the top in my logtop window :D 1 820 17.83/s ws.ogutsan.com 2 490 10.65/s tonaku.com 3 305 6.63/s www.pspalls.com
23:28:07 Moar filters added for those.
23:28:19 thanks JAA
23:34:17 Just in: anvgames.com (without the weird subdomains)
23:35:22 I saw that earlier but then it disappeared again.
23:35:47 Added
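
Editor's note: the approach settled on above (block the affected domains, and treat the /.php?.xml shape only as a hint because it can also occur on legitimate sites) can be sketched roughly as follows. This is a minimal illustration, not the project's actual filter code, which the log does not show. The names `BLOCKED_DOMAINS`, `host_is_blocked`, `matches_php_xml_shape` and `should_skip` are hypothetical, and the two regular expressions are an assumption derived only from the sample URLs quoted in the log.

```python
import re
from urllib.parse import urlsplit

# Domains reported as affected in the log; the set itself is illustrative.
BLOCKED_DOMAINS = {
    "jsd686.com",
    "tonaku.com",
    "ws.ogutsan.com",
    "www.pspalls.com",
    "anvgames.com",
}

# Assumed shape of the spam URLs: one short script name ending in .php,
# queried with a bare token ending in .xml (e.g. /UfxeMosf.php?c0apq.xml).
PHP_PATH = re.compile(r"^/[A-Za-z0-9]+\.php$")
XML_QUERY = re.compile(r"^[A-Za-z0-9]+\.xml$")


def host_is_blocked(host, blocked=BLOCKED_DOMAINS):
    """True if host is a blocked domain or any subdomain of one."""
    host = host.lower().rstrip(".")
    return any(host == d or host.endswith("." + d) for d in blocked)


def matches_php_xml_shape(url):
    """True if the URL has the /<name>.php?<token>.xml shape.

    Only a hint for manual review: the same shape can occur on legit sites.
    """
    parts = urlsplit(url)
    return bool(PHP_PATH.match(parts.path) and XML_QUERY.match(parts.query))


def should_skip(url):
    """Skip a URL purely on its domain, as decided in the discussion."""
    return host_is_blocked(urlsplit(url).hostname or "")
```

Matching on the domain suffix also covers the random subdomains seen for anvgames.com, while a host such as www.euszati.hu stays untouched unless someone adds it explicitly.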