-
JAA
Oh
-
JAA
arkiver: I noticed that web.archive.org/web/collections/20230000000000*/https://www.svt.se does not display anything from archiveteam_urls, which seemed odd. Turns out the URL is commented out as a duplicate, even though that appears to be the only entry for it.
-
JAA
Line 100125 in 43200_wikidata_Q11033_mass-media.wikidata.txt : #REASON=DUPLICATE#random=RANDOM;keep_random=1;all=1;keep_all=1;depth=1;url=
svt.se
-
JAA
(As of 712433bc)
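
For readers unfamiliar with the entry format, here is a minimal sketch of how a line like the one quoted above could be split into its parts. The layout is inferred only from that single quoted line (an optional `#REASON=...#` prefix marking a disabled entry, then `;`-separated `key=value` fields ending in `url=`); the function name `parse_source_line` and the returned dictionary keys are hypothetical, and the real tooling may parse these files differently.

```python
import re

# Assumed layout, inferred from one quoted line only:
#   optional "#REASON=<reason>#" prefix, then ";"-separated key=value
#   fields, with "url=" as the final field.
_PREFIX = re.compile(r"^#REASON=(?P<reason>[^#]*)#")


def parse_source_line(line):
    """Split one entry into disabled reason, parameters, and URL."""
    line = line.strip()
    reason = None
    match = _PREFIX.match(line)
    if match:
        # The entry is commented out; remember why and drop the prefix.
        reason = match.group("reason")
        line = line[match.end():]
    # Split off the URL first so ";" or "=" inside it cannot break parsing.
    head, sep, url = line.partition("url=")
    if not sep:
        raise ValueError("no url= field found")
    params = {}
    for field in head.split(";"):
        if field:
            key, _, value = field.partition("=")
            params[key] = value
    return {"disabled_reason": reason, "params": params, "url": url}
```

An entry whose `disabled_reason` is not `None` would be skipped by a loader built on this sketch, which matches the behaviour described for the commented-out line.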
-
JAA
Hmm, or does the collection list exclude archiveteam_urls? web.archive.org/web/20230000000000*…//www.svt.se/nyheter/lokalt/uppsala (not commented out, same file) shows plenty of snapshots, nothing in the collection view.
-
JAA
There are in fact very regular snapshots of the homepage from here. Huh
-
JAA
Yeah, we certainly fetch it and its links regularly, so the bug is just the collection view on the WBM, I guess.
-
JAA
Could you update the Wikidata files in urls-sources sometime please?
-
imer
arkiver: were the random_chars.word.de domains supposed to be filtered out or just ignored? seeing some still, for example 7k9d.wk-giesa.de qo3x.unternehmen-mut.de 87oj.initiative-pro-gd.de (those are alive), can pull more from logs if needed
-
imer
some filtering so only domains with lots of subdomains are listed:
transfer.archivete.am/KjXoP/2023-11-12_13-41-04.txt
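
The kind of filtering described above can be sketched as follows. This is not the script that produced the linked file; it is a hedged illustration that assumes the parent domain is simply the last two labels of the hostname (enough for `.de` names like those mentioned, but a real version would need the Public Suffix List). The function name `domains_with_many_subdomains` and the `min_subdomains` threshold are hypothetical.

```python
from collections import defaultdict


def domains_with_many_subdomains(hostnames, min_subdomains=3):
    """Group hostnames by parent domain and keep only busy parents."""
    subs = defaultdict(set)
    for host in hostnames:
        labels = host.lower().rstrip(".").split(".")
        if len(labels) < 3:
            continue  # bare domain, no subdomain label to count
        # Assumption: the registrable domain is the last two labels.
        parent = ".".join(labels[-2:])
        subs[parent].add(".".join(labels[:-2]))
    return {
        parent: sorted(names)
        for parent, names in subs.items()
        if len(names) >= min_subdomains
    }
```

Feeding it hostnames pulled from logs returns only the parent domains that have at least `min_subdomains` distinct subdomains, each with its sorted list of subdomain labels.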
-
AK
Oh are we paused?
-
project10
Seems so
-
arkiver
yeah, sorry for not pinging before
-
arkiver
paused since the queue was growing fast due to spam
-
arkiver
(hopefully unpaused soon again)
-
MetaWonderrat
Hello. I recently installed the Warrior and wanted to contribute to the URLs project. I have a fresh install, but the project does not move past "GettingItemFromTracker". I already rebooted my Warrior.
-
MetaWonderrat
Is this a known issue? Is the project paused?
-
that_lurker
paused currently
-
fireonlive
hey MetaWonderrat :) - yeah it's currently paused due to an issue with some spam urls but should hopefully be unpaused soon
-
MetaWonderrat
What URLs are on the list and how can one know which are spam?
-
fireonlive
i believe, but could be wrong, it was one of those cases where one link leads to another on the same site and it kinda becomes an infinitely growing loop
-
fireonlive
so just have to flush those out so the queue doesn't grow to infinity
-
MetaWonderrat
Can I see the list somewhere? As I read it most of it is public somewhere.
-
Flashfire42
feel free to contribute to the telegram project in the meantime until urls is back
-
MetaWonderrat
I already switched over.
-
MetaWonderrat
At first I tried URLTeam2 but that keeps giving me 404 errors
-
MetaWonderrat
Is there a guess when "soon" is, regarding the URL project?
-
Flashfire42
No idea but telegram will keep chugging along with items. I suggest running at concurrency 2 personally for minimal bans but the bans are semi random tbh
-
MetaWonderrat
concurrency 2 personally? bans?
-
MetaWonderrat
I am new in the Archive warrior thing
-
vokunal|m
Yeah, if you run at high concurrency, the site you're scraping might trigger an automatic system that makes it give back errors, so staying under that threshold is sort of important.
-
vokunal|m
If it happens, it's usually only for a short time, though that depends on the project. Some are as short as 20 minutes, and others up to days. I'm not sure about Telegram in particular
-
MetaWonderrat
concurrency?
-
MetaWonderrat
requests per minute?
-
MetaWonderrat
or is it something else?
-
vokunal|m
it's the number of concurrent threads of the tool that run at once. Not a particular requests per minute, which makes it a little more tricky to accurately see how many you can run at once, but 2 is fine
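
To make the difference between concurrency and requests per minute concrete, here is a rough back-of-the-envelope sketch. It assumes each concurrent worker sends one request at a time, back to back, with a known average time per request; real projects vary a lot, so the numbers are only illustrative. The function name `approx_requests_per_minute` is hypothetical.

```python
def approx_requests_per_minute(concurrency, seconds_per_request):
    """Estimate the request rate for a given number of parallel workers.

    Assumes every worker issues requests back to back, so each worker
    completes 60 / seconds_per_request requests per minute.
    """
    if concurrency < 0:
        raise ValueError("concurrency must not be negative")
    if seconds_per_request <= 0:
        raise ValueError("seconds_per_request must be positive")
    return concurrency * 60.0 / seconds_per_request
```

With the same concurrency, a site that answers quickly sees more requests per minute than a slow one, which is why a concurrency setting does not map to one fixed request rate.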
-
arkiver
i'm going to take out the extraction of URLs from special interest pages
-
arkiver
we're not able to handle it right now
-
arkiver
but we'll try to put it back in sometime later
-
MetaWonderrat
Thank you. I will check back later. I got to go now. Thanks for the info.:]
-
arkiver
update is in
-
arkiver
we now queue to the imgur project
-
arkiver
JAA: ^