07:50:30 Oh, that explains the CPU usage being so low, targets can't handle us
08:39:02 rewby - with imgur slowing right down, can we give #// and telegram a target bump?
08:39:53 that would be good yes if possible
08:43:13 Been chugging for the past few days :( but can understand as imgur is/was the priority
08:58:14 nyuuzyou: I have only now queued that list of domains you gave a few weeks ago, sorry for the delay
08:58:29 do you happen to have other lists already in the meantime?
08:58:51 it's queued under todo:secondary
08:59:13 was only just below 600k items
08:59:49 actually we might want to get these up to a depth of like 2
09:33:31 I'll try to have a poke at it
10:22:05 Thanks rewby! :)
10:25:38 datechnoman: How's it looking now?
10:25:45 optane9 is chugging at 3 Gbps constantly
10:26:26 I'm adding buneary to see if it can help
10:27:11 we're going through the last bits of backlog here, and then incoming data should drop again
10:27:47 Having optane and buneary should be plenty. Thanks so much!
10:29:29 Not seeing the spikes and drops anymore. Can throw my normal worker count back at it
10:29:39 wooh :)
10:30:29 Can't wait to start queuing things again :D
10:30:39 Been waiting for the backlog of old URLs to clear out before adding more new stuff
10:42:07 we now archive robots.txt, sitemaps, favicon, etc.
10:42:28 (monthly)
10:42:51 but there are also things like ads.txt and security.txt. should we start archiving these monthly as well? https://en.wikipedia.org/wiki/Ads.txt https://en.wikipedia.org/wiki/Security.txt
10:43:01 if anyone has ideas, please let me know!
10:44:02 perhaps https://en.wikipedia.org/wiki/Well-known_URI
16:46:50 https://well-known.dev/resources/
16:55:15 There's also the new `.well-known/ai-plugin.json` for OpenAI plugins
17:00:12 yeah
17:00:23 we might start archiving these for all domains
17:07:44 Sounds great!
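A rough sketch of what "archiving these for all domains" could look like on the queuing side: expanding each domain into the per-site files named above. This is only an illustration, not the project's actual code. The function name `candidate_urls`, the `CANDIDATE_PATHS` list, and the `/sitemap.xml` location are assumptions; the ads.txt, security.txt, and ai-plugin.json paths follow their usual published locations.

```python
from urllib.parse import urlunsplit

# Hypothetical list of per-domain files, based on the ones mentioned in the chat.
# "/sitemap.xml" is only the conventional location; real sitemaps may live elsewhere.
CANDIDATE_PATHS = [
    "/robots.txt",
    "/sitemap.xml",
    "/favicon.ico",
    "/ads.txt",
    "/.well-known/security.txt",
    "/.well-known/ai-plugin.json",
]


def candidate_urls(domain, scheme="https"):
    """Return one URL per candidate path for the given bare domain."""
    # Normalise case and drop a trailing root dot ("example.com." -> "example.com").
    host = domain.strip().lower().rstrip(".")
    return [urlunsplit((scheme, host, path, "", "")) for path in CANDIDATE_PATHS]


if __name__ == "__main__":
    for url in candidate_urls("example.com"):
        print(url)
```

Keeping the paths in one list means a new file type (anything else from the well-known URI registry, say) would be a one-line addition.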
22:40:35 !a https://transfer.archivete.am/R9exx/sitemap_urls_march_april_2023.txt
22:41:14 datechnoman: Skipped 5091 invalid URLs: https://transfer.archivete.am/VSKH7/sitemap_urls_march_april_2023.txt.bad-urls.txt (for 'https://transfer.archivete.am/R9exx/sitemap_urls_march_april_2023.txt')
22:41:15 datechnoman: Deduplicating and queuing 244308 items. (for 'https://transfer.archivete.am/R9exx/sitemap_urls_march_april_2023.txt')
22:41:32 datechnoman: Deduplicated and queued 244308 items. (for 'https://transfer.archivete.am/R9exx/sitemap_urls_march_april_2023.txt')
22:42:52 Juicy sitemaps pulled from various crawls :D
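For anyone preparing a list like this before handing it to the bot, here is a minimal sketch of the same two steps the replies describe: setting aside invalid URLs and deduplicating the rest. It is not the bot's implementation. `split_and_dedupe` is a hypothetical helper, and treating "valid" as "http or https with a host" is an assumption about what the bot checks.

```python
from urllib.parse import urlsplit


def split_and_dedupe(lines):
    """Return (good, bad): unique valid URLs in first-seen order, and rejected lines."""
    seen = set()
    good, bad = [], []
    for raw in lines:
        url = raw.strip()
        if not url:
            continue  # ignore blank lines
        try:
            parts = urlsplit(url)
        except ValueError:
            # urlsplit raises on some malformed inputs, e.g. a broken IPv6 host
            bad.append(url)
            continue
        # Assumed validity rule: http(s) scheme and a non-empty host.
        if parts.scheme not in ("http", "https") or not parts.netloc:
            bad.append(url)
            continue
        if url not in seen:
            seen.add(url)
            good.append(url)
    return good, bad


if __name__ == "__main__":
    sample = ["https://a.example/x", "https://a.example/x", "not a url"]
    good, bad = split_and_dedupe(sample)
    print(len(good), "to queue,", len(bad), "skipped")
```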