02:47:46 spam loop /books on various domains: https://transfer.archivete.am/G2tRz/at-urls-2024-09-05T02-44-15.log
02:47:47 inline (for browser viewing): https://transfer.archivete.am/inline/G2tRz/at-urls-2024-09-05T02-44-15.log
02:58:16 probably need arkiver for a proper fix, not sure if it could be blocked w/ regex filter w/o risk of collateral damage
03:55:19 Good pickup. I was wondering why the queue suddenly started blowing out
03:55:42 Maybe JAA has an idea if a temp fix can be put in place in the meantime
03:56:53 Not sure, it's quite messy.
03:58:01 There's at least three domain labels, and the rest of the URL is /books/[a-z]+/[0-9]+\.html$, but that doesn't seem particularly restrictive.
04:00:24 All good. Wasnt sure if there was some kind of magic
05:36:51 !a https://transfer.archivete.am/U8fpy/pubmed_doi_identifiers.txt
05:36:51 datechnoman: Registering fq9zfPaK for '!a https://transfer.archivete.am/U8fpy/pubmed_doi_identifiers.txt'
05:37:00 datechnoman: Deduplicating and queuing 103977 items. (fq9zfPaK)
05:37:05 datechnoman: Deduplicated and queued 103977 items. (fq9zfPaK)
09:56:50 putting 50x20 on urls to help unfuck the mess ive made :D
10:11:53 monoxane <3
10:11:55 Thanks! :D
12:55:31 JAA arkiver - Could I please hassle either of you to filter out (somehow) the annoying loop for baongoc.vn?
https://transfer.archivete.am/B37cB/baongoc.vn_loop_urls.txt
12:55:31 inline (for browser viewing): https://transfer.archivete.am/inline/B37cB/baongoc.vn_loop_urls.txt
12:56:06 This has been going on for weeks and isnt huge but its constant and im seeing 2000+ hits across workers every 5 mins so worth cleaning up if simple enough
12:56:26 No rush but greatly appreciate any tweak :)
13:05:31 yeah those are weird , I don't understand how they keep being 200s even w all that junk added after the filename
13:06:43 looks like it's actually dynamically generating pdf content each time based on the url
13:09:23 Exactly, and that is why we keep archiving the same PDF as the url is dynamically being generated
13:09:48 We have been smothering their website for weeks and im surprised they havent started blocking our IP's lol
13:27:24 datechnoman: yes i will check it in an hour
13:30:16 Thanks mate! No rush :)
13:31:09 I've cranked right up, so ive been combing over the logs to looks at loops and stuff
13:31:19 I about to head off to bed so ill catch round!
13:32:23 100k urls per min my Grafana reckons. Not bad at all :D
13:36:30 not bad good size
13:36:36 I cannot beat that :P
13:41:33 Your doing just fine!
13:42:06 I have 3000 concurrent connects going lol
13:42:28 Connections/concurrency
13:43:21 Can finally destroy the backlog now that mildom is slowly down and not hogging IA ingest
13:44:28 Very nice to see todo going down slowly
15:09:48 nstrom|m: hmm yeah that is an annoying one
15:09:57 JAA: you did nothing around the /books/ loop right?
15:10:16 95% of the time the annoying ones are some chinese site
15:15:36 maybe you have some time to peek into the PRs while you're in the code 👀
15:15:58 knecht: yeah that too
15:25:39 knecht: i do see cloudflare on xrel.to, which is a problem
15:27:04 a problem with the frequency? i suppose it could be turned down a good notch
15:27:28 knecht: merged and left a comment
15:27:36 no problem with frequency
15:28:05 great!
thank you very much
15:29:26 :)
15:43:03 those /books/ URLs just give me a 404 now :/
15:43:08 JAA: nstrom|m ^ FYI
15:49:06 most seem to be 404, a few are still working
15:49:07 39=200 http://yanchixian.miraclekoifood.com/books/hlgjw/15812428.html
15:49:11 just saw that one fly by
15:50:11 yay also not 404 here
15:50:12 thanks!
15:50:37 ah i see /template/default/moban in there
15:51:47 so it's just their latest spam version of https://github.com/ArchiveTeam/urls-grab/blob/master/urls.lua#L248-L271 again
15:54:46 yep definitely looked like a familiar MO
15:54:50 4=200 http://jinglexian.thetakoma.com/books/hjgnyhiz/1442363.html another alive one if it helps
15:55:06 18=200 http://hongkouqu.mycookielicious.com/books/toyhjdi/4649645.html
15:58:10 arkiver: Yeah, I didn't touch them.
16:10:16 very helpful thanks nstrom|m
16:10:25 nstrom|m: do you perhaps have one more for me?
16:10:32 testing a solution
16:11:12 checking
16:12:22 26=200 http://fujinshi.thenydog.com/books/ygicdp/142316.html
16:13:03 1=200 http://neimengguzizhiqu.dqs7755.com/books/aljxyb/72588552.html
16:30:40 datechnoman: the baongoc.vn loop is gone
16:30:42 thanks nstrom|m
16:33:57 thank you!
16:56:17 AK: is it correct you have nothing running anymore for URLs on hel1?
16:56:27 nstrom|m: do you perhaps have a longer log as well?
16:56:42 they are ignored now, but i didn't find out yet where the URLs come from
16:59:18 sent via pm
17:00:37 I have pretty short log retention though so might not be that useful
17:01:27 yeah let's see
17:05:36 couldn't fine the source well
17:09:12 haha they blocked my IP
17:13:35 a fix is in
17:13:40 i'm not sure this fixes the source...
17:13:42 we'll see
17:13:53 i'll move URLs to secondary tomorrow
17:15:25 i'm off now
17:16:18 loops will stand out when moving items to secondary, as they are queued and enlarged in :todo:backfeed, so i want to be around when that is being done
21:45:35 monoxane looks like code update didn't roll out to your workers FYI