04:33:33 Oh
04:34:06 arkiver: I noticed that https://web.archive.org/web/collections/20230000000000*/https://www.svt.se/ does not display anything from archiveteam_urls, which seemed odd. Turns out the URL is commented out as a duplicate, even though that appears to be the only entry for it.
04:34:28 Line 100125 in 43200_wikidata_Q11033_mass-media.wikidata.txt: #REASON=DUPLICATE#random=RANDOM;keep_random=1;all=1;keep_all=1;depth=1;url=http://www.svt.se/
04:34:45 (As of 712433bc)
04:36:31 Hmm, or does the collection list exclude archiveteam_urls? https://web.archive.org/web/20230000000000*/https://www.svt.se/nyheter/lokalt/uppsala/ (not commented out, same file) shows plenty of snapshots, nothing in the collection view.
04:37:11 There are in fact very regular snapshots of the homepage from here. Huh
04:38:33 Yeah, we certainly fetch it and its links regularly, so the bug is just the collection view on the WBM, I guess.
04:38:54 Could you update the Wikidata files in urls-sources sometime please?
13:29:27 arkiver: were the random_chars.word.de domains supposed to be filtered out or just ignored? Seeing some still, for example 7k9d.wk-giesa.de qo3x.unternehmen-mut.de 87oj.initiative-pro-gd.de (those are alive); can pull more from logs if needed
13:36:35 https://transfer.archivete.am/inline/8tbcL/2023_11_12_urls_spam_domains_de.txt
13:41:28 Some filtering so only domains with lots of subdomains are listed: https://transfer.archivete.am/KjXoP/2023-11-12_13-41-04.txt
15:27:58 Oh, are we paused?
15:32:03 Seems so
18:50:01 Yeah, sorry for not pinging before
18:50:08 Paused since the queue was growing fast due to spam
18:50:19 (hopefully unpaused soon again)
20:27:50 Hello. I recently installed the Warrior and wanted to contribute to the URLs project. I have a fresh install, but the project does not move past "GettingItemFromTracker". I already rebooted my Warrior.
20:29:04 Is this a known issue? Is the project paused?
20:30:59 Paused currently
20:31:01 Hey MetaWonderrat :) - yeah, it's currently paused due to an issue with some spam URLs but should hopefully be unpaused soon
20:32:21 What URLs are on the list, and how can one know which are spam?
20:33:35 I believe, but could be wrong, it was one of those cases where one link leads to another on the same site and it becomes an infinitely growing loop
20:33:49 So we just have to flush those out so the queue doesn't grow to infinity
20:36:04 Can I see the list somewhere? As I read it, most of it is public somewhere.
20:43:00 Feel free to contribute to the Telegram project in the meantime until URLs is back
20:44:22 I already switched over.
20:45:25 At first I tried URLTeam2, but that keeps giving me 404 errors
20:48:23 Is there a guess when "soon" is, regarding the URLs project?
20:50:19 No idea, but Telegram will keep chugging along with items. I suggest running at concurrency 2 personally for minimal bans, but the bans are semi-random tbh
20:51:29 concurrency 2 personally? bans?
20:51:53 I am new to the Archive Warrior thing
21:11:32 Yeah, if you run at high concurrency, the site you're scraping might trigger an automatic system that makes it give back an error, so running under that is sort of important.
21:12:36 If it happens, it's usually only for a short time, though that depends on the project. Some are as short as 20 minutes, and others up to days. I'm not sure about Telegram in particular
21:12:45 concurrency?
21:12:58 requests per minute?
21:13:54 or is it something else?
21:19:56 It's the number of concurrent threads the tool runs at once. Not a particular requests-per-minute rate, which makes it a little trickier to accurately see how many you can run at once, but 2 is fine
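[Editor's note: since "concurrency" confuses a couple of people above, here is a minimal Python sketch of the distinction being drawn: concurrency caps how many downloads are in flight at once, not how many requests go out per minute. The names fetch_item and ITEMS are hypothetical placeholders, not actual Warrior project code.]

import asyncio

CONCURRENCY = 2  # how many items run at the same time; this is NOT a rate limit
ITEMS = [f"item-{n}" for n in range(6)]  # hypothetical placeholder work queue

async def fetch_item(item: str) -> None:
    # Stand-in for a real download; the sleep simulates network time.
    await asyncio.sleep(1)
    print(f"done: {item}")

async def main() -> None:
    # The semaphore ensures at most CONCURRENCY fetches run at once.
    # The resulting requests per minute depends entirely on how long each
    # fetch takes, which is why concurrency alone doesn't map to a request
    # rate and why ban thresholds are hard to predict from it.
    sem = asyncio.Semaphore(CONCURRENCY)

    async def bounded(item: str) -> None:
        async with sem:
            await fetch_item(item)

    await asyncio.gather(*(bounded(i) for i in ITEMS))

asyncio.run(main())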
21:21:40 I'm going to take out the extraction of URLs from special interest pages
21:21:44 We're not able to handle it right now
21:21:51 But we'll try to put it back in sometime later
21:24:01 Thank you. I will check back later. I got to go now. Thanks for the info. :]
23:55:58 Update is in
23:56:06 We now queue to the imgur project
23:56:07 JAA: ^
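[Editor's note: a small sketch of the check described at 04:34:06, i.e. scanning the Wikidata URL file for entries commented out as duplicates that never appear uncommented anywhere else in the file. The "#REASON=DUPLICATE#" prefix and the ";url=" field follow the line quoted at 04:34:28; everything else assumed here about the file format is a guess.]

DUP_PREFIX = "#REASON=DUPLICATE#"

def url_of(line: str) -> str | None:
    # Entries are ;-separated key=value pairs; pull out the url field.
    for part in line.strip().split(";"):
        if part.startswith("url="):
            return part[len("url="):]
    return None

def orphaned_duplicates(path: str) -> list[str]:
    active, commented = set(), set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            url = url_of(line)
            if url is None:
                continue
            if line.startswith(DUP_PREFIX):
                commented.add(url)
            elif not line.startswith("#"):
                active.add(url)
    # URLs marked DUPLICATE that never appear uncommented.
    return sorted(commented - active)

print(orphaned_duplicates("43200_wikidata_Q11033_mass-media.wikidata.txt"))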
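[Editor's note: the "some filtering so only domains with lots of subdomains are listed" step at 13:41:28 isn't shown in the log; this is one plausible way to do it, assuming a plain list of hostnames on stdin. The two-label "registrable domain" heuristic and the threshold of 10 are assumptions; a real tool would use the Public Suffix List so that e.g. .co.uk domains group correctly.]

import sys
from collections import defaultdict

THRESHOLD = 10  # assumed cutoff for "lots of subdomains"

def registrable(host: str) -> str:
    # Naive heuristic: keep the last two labels of the hostname,
    # e.g. 7k9d.wk-giesa.de -> wk-giesa.de.
    return ".".join(host.rsplit(".", 2)[-2:])

def main() -> None:
    subs = defaultdict(set)
    for line in sys.stdin:
        host = line.strip().lower()
        if host:
            subs[registrable(host)].add(host)
    # Report domains with many distinct subdomains, busiest first.
    for domain, hosts in sorted(subs.items(), key=lambda kv: -len(kv[1])):
        if len(hosts) >= THRESHOLD:
            print(f"{len(hosts)}\t{domain}")

if __name__ == "__main__":
    main()

Run as, for example: python count_subdomains.py < hostnames.txt (the script name is arbitrary).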