-
h2ibot
datechnoman: Deduplicating and queuing 9968402 items. (CPB2reGi)
-
h2ibot
datechnoman: Deduplicated and queued 9968402 items. (CPB2reGi)
-
Notrealname1234
Now I get why my request was declined; I should have read the wiki page 🙁
-
h2ibot
datechnoman: Deduplicating and queuing 9967898 items. (U6Jj7i7F)
-
h2ibot
datechnoman: Deduplicating and queuing 9969829 items. (EWZIoYOf)
-
h2ibot
datechnoman: Deduplicated and queued 9967898 items. (U6Jj7i7F)
-
h2ibot
datechnoman: Deduplicated and queued 9969829 items. (EWZIoYOf)
-
h2ibot
datechnoman: Deduplicating and queuing 9968409 items. (agmbVXRJ)
-
thuban
i'm not 100% sure whether they're suitable here, as there are a few hosts with a fair number of urls and i haven't done any filtering, but it would be nice to get them if we can
-
datechnoman
thuban - If we were to run the website URLs you provided through this channel, it would only grab the homepage, sitemap URLs, and the assets on those pages. We can easily do that, but it will miss all of the data you are after
-
thuban
i'm aware, it's fine
-
thuban
(i've already submitted appropriate lists to projects that accept them)
-
datechnoman
roger no worries. Have you queued up the blogger/blogspot ones in that project channel etc?
-
datechnoman
If that is what you are referring to above sorry
-
datechnoman
(Just double checking)
-
thuban
yes, that's what i meant (np)
-
h2ibot
datechnoman: Deduplicating and queuing 10068 items. (x56ocfmB)
-
h2ibot
datechnoman: Deduplicated and queued 10068 items. (x56ocfmB)
-
h2ibot
datechnoman: Deduplicating and queuing 19721 items. (dJKSbgxJ)
-
h2ibot
datechnoman: Deduplicated and queued 19721 items. (dJKSbgxJ)
-
h2ibot
datechnoman: Deduplicating and queuing 4953 items. (wdzNUPi9)
-
h2ibot
datechnoman: Deduplicated and queued 4953 items. (wdzNUPi9)
-
datechnoman
thuban - ^^^^ queued
-
thuban
thank you!
-
datechnoman
No worries :)
-
h2ibot
datechnoman: Deduplicated and queued 9968409 items. (agmbVXRJ)
-
h2ibot
datechnoman: Deduplicating and queuing 19 items. (44ECTPXv)
-
h2ibot
datechnoman: Deduplicated and queued 19 items. (44ECTPXv)
-
h2ibot
datechnoman: Deduplicating and queuing 9970047 items. (RNck25fz)
-
arkiver
datechnoman: did you find out what was wrong with the previous lists?
-
arkiver
i did not have a look yet (if you don't know, i will still have a look)
-
datechnoman
arkiver - I'm going on a hunch and believe it's due to really malformed URLs or something like that
-
datechnoman
I've created a URL cleaning process to properly clean them before throwing them at the bot, and I'm testing it atm
-
datechnoman
I'm also throwing a stack more workers in to pick up the workload
-
h2ibot
datechnoman: Deduplicated and queued 9970047 items. (RNck25fz)
-
arkiver
hmm
-
arkiver
but that cleaning process should be done by the bot
-
datechnoman
Well that successfully queued everything there
-
arkiver
when i have time i'll check your list and make the bot able to handle whatever is problematic in there
-
datechnoman
Odd. The only thing I did differently was properly splitting the URLs, as some were doubled up on each line and stuff
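A pre-cleaning pass like the one described above might look something like this — a hypothetical sketch (the actual process wasn't shared), splitting doubled-up lines at `http(s)://` boundaries and dropping anything that doesn't parse as a plausible URL:

```python
import re
from urllib.parse import urlsplit

def clean_url_lines(lines):
    """Split lines holding several concatenated URLs; keep valid http(s) ones."""
    cleaned = []
    for line in lines:
        # "http://a.com/xhttp://b.com/y" splits at each scheme boundary.
        for url in filter(None, re.split(r'(?=https?://)', line.strip())):
            parts = urlsplit(url)
            if parts.scheme in ('http', 'https') and parts.netloc:
                cleaned.append(url)
    return cleaned
```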
-
datechnoman
Appreciate it mate. Love your work :)
-
arkiver
you too :)
-
datechnoman
Just working towards a common goal :)
-
datechnoman
Should get us some nice clean data
-
datechnoman
Did some spot checking and some of it was already picked up by this project. It's amazing the reach that this project has
-
arkiver
i'm experimenting with and then pushing out the update to archive outlinks from news sites
-
arkiver
this may create some loops, as in news->news->news->news URLs, but should be fine
-
datechnoman
ack no worries. I remember you mentioning this a few days ago. Good news (pardon the pun) is that news sites are typically hosted behind CDNs, so we smash through them pretty fast and easily with minimal CPU overhead, so it shouldn't be too much of an issue
-
arkiver
yeah!
-
datechnoman
I personally think it's well worth the effort. News is important no matter where you are in the world!
-
arkiver
datechnoman: i'm going to move your queued items to secondary
-
arkiver
and the current secondary to redo
-
datechnoman
ack no worries
-
datechnoman
Do what you gotta do. We did notice a fair bit of spam in the secondary FYI
-
datechnoman
It's floating around "out" atm
-
datechnoman
Hopefully will die out with the multiple retries
-
datechnoman
Discovering lots of new URLs from the PDF documents :)
-
arkiver
yeah :)
-
arkiver
alright my implementation seems to be working
-
arkiver
datechnoman: can you wait please with queuing more big lists?
-
arkiver
i want to push this out when your lists are gone, to better see the effects of the change
-
datechnoman
For sure arkiver. I have a stack of lists and have been slow feeding so more than happy to hold off :)
-
arkiver
thanks!
-
arkiver
going to be exciting to get this update in
-
datechnoman
No worries at all. I guess with everything being in redo and secondary you could enable it right? As it will all feed to the backlog and we can use that as the metric?
-
arkiver
i'd rather not, currently backfeed stays large due to the large lists that were fed in
-
arkiver
so it's more difficult to estimate what part of URLs queued to todo:backfeed comes from the update, and which part from your lists
-
datechnoman
Ack, fair call. I'll stay spun up for the next 24 hours to smash through it all so you can roll out your update :)
-
arkiver
or are you fine with me stashing your queued lists away for a bit?
-
arkiver
datechnoman: ^
-
arkiver
i'll be off for an hour and then do that if you are fine with it
-
arkiver
i'll also stash todo:redo away then
-
datechnoman
This is your show mate so do as you please arkiver. All I would say is that I'd like it to be requeued once we smash through the news sites outlinks
-
datechnoman
They are more important anyway
-
arkiver
hah no no, it's our show!
-
imer
arkiver: can you look into filtering out skinlookingyounger.com?
transfer.archivete.am/iNVsl/skinlookingyounger.com.log don't seem to be successful for me, but there's a lot of it (~37% of urls on my end)
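A share figure like imer's ~37% can be reproduced with a quick host-frequency count over a URL list — a sketch assuming one URL per entry:

```python
from collections import Counter
from urllib.parse import urlsplit

def domain_share(urls):
    """Fraction of the list each host accounts for, largest first."""
    hosts = Counter(urlsplit(u).netloc for u in urls)
    total = sum(hosts.values())
    return [(host, count / total) for host, count in hosts.most_common()]
```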
-
imer
todo is growing quite rapidly too
-
arkiver
imer: added a filter shortly before you messaged :)
-
arkiver
for skinlookingyounger
-
arkiver
not yet vilinkv
-
imer
the filter may not be working then, unless you mean in code
-
imer
but good :)
-
arkiver
i'm back in an hour
-
datechnoman
Can confirm both are spamming up my workers to the point that I don't see many other urls from other domains coming through
-
datechnoman
See you when you get back!
-
arkiver
ah
-
arkiver
through custom items
-
arkiver
skinlookingyounger is out now
-
arkiver
the vilinkv.shop one looks similar to an older pattern, need to look closer at that when i'm back
-
imer
thanks!
-
arkiver
paused until then
-
imer
can confirm that's doing something - speed is going way up
-
imer
probably a good idea
-
arkiver
yep
-
arkiver
nice, i checked a random PDF and found an open directory
-
arkiver
we should start attempting to find open directories here perhaps
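Spotting open directories could start from a crude marker check on fetched pages — a heuristic sketch, not how the project actually detects anything:

```python
import re

# Markers typical of Apache/nginx autoindex pages (heuristic, not exhaustive).
_MARKERS = (
    re.compile(r'<title>\s*Index of /', re.I),
    re.compile(r'<h1>\s*Index of /', re.I),
    re.compile(r'Parent Directory', re.I),
)

def looks_like_open_directory(html):
    """True if the page carries common directory-listing markers."""
    return any(p.search(html) for p in _MARKERS)
```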
-
fireonlive
arkiver: i love that idea
-
arkiver
i think we can do it, and relatively easily too
-
arkiver
anyway i'm off for an hour
-
arkiver
fireonlive: yeah :)
-
fireonlive
:D
-
fireonlive
i'm off to bed
-
fireonlive
ttyt :)
-
datechnoman
Good night fireonlive!
-
datechnoman
Def worth pausing the project. Was exploding
-
datechnoman
Quickly spun down my fleet as I reckon I'll need to re-jig them for high-density and IO processing for the news sites. arkiver - ping me when we go live and I'll get spun up with the correct profile
-
datechnoman
(was spun up for PDF/sitemap processing)
-
arkiver
back
-
datechnoman
welcome back :)
-
arkiver
thanks
-
arkiver
datechnoman: imer: if you're interested, the skinlookingyounger loop was actually very similar to previous loops, with a small difference. i have now added support for it by adding
ArchiveTeam/urls-grab c0f44fa
-
imer
good stuff
-
arkiver
i removed the filters for skinlookingyounger, since they'll be handled in the code now. means they may still be handed out, but won't create a loop anymore (and take very little resources)
-
arkiver
the loop for vilinkv.shop is interesting
-
datechnoman
Awesome thanks so much! arkiver
-
arkiver
it's due to URLs like
tiib.vilinkv.shop/.well-known/openid-configuration redirecting to a different domain, which then gets the various 'special URLs' queued, which link to other domains, etc.
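One way to dampen the redirect-driven loop described here is to hand out the 'special URLs' at most once per host — a hypothetical mitigation sketch, not the fix that actually went in (the probe paths are assumed for illustration):

```python
from urllib.parse import urlsplit

# Paths assumed for illustration; the project's real probe list may differ.
SPECIAL_PATHS = ('/.well-known/openid-configuration', '/robots.txt')

class ProbeGuard:
    """Hand out special-path probes at most once per host, so a probe that
    redirects to a fresh domain cannot keep spawning probes indefinitely."""
    def __init__(self):
        self._seen_hosts = set()

    def probes_for(self, url):
        host = urlsplit(url).netloc
        if not host or host in self._seen_hosts:
            return []
        self._seen_hosts.add(host)
        return ['https://' + host + path for path in SPECIAL_PATHS]
```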
-
arkiver
we'll want to support that in the code as well to prevent future loops like this (which will surely occur)
-
datechnoman
There is always something ey :/
-
arkiver
well
-
arkiver
in the beginning there were a lot of loops
-
arkiver
but now there are not a ton of them
-
datechnoman
Mind you, I like your tactic of actually blocking the pattern that sites use instead of filtering
-
arkiver
really the more we support in the code, the less we have to fix as we move along
-
datechnoman
I can see lots of things skipped these days so the filtering works (skipped by the workers)
-
arkiver
yep
-
datechnoman
Yeah exactly!
-
datechnoman
Much more efficient and solves the greater issue
-
datechnoman
Also keeps the bloom filter and backfeed happy
-
arkiver
and turns out spam sites tend to use the same "spam software" (? or just the same owner), so blocking patterns helps
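Pattern-level blocking of the shared "spam software" shape might be sketched as follows; the patterns here are invented for illustration and are not the project's actual filters:

```python
import re

# Invented URL shapes; the real filters live in the project code.
SPAM_PATTERNS = [
    re.compile(r'^https?://[a-z0-9.-]+\.shop/\.well-known/'),
    re.compile(r'[?&](?:ref|aff)=[0-9a-f]{8,}$'),
]

def is_spam_shaped(url):
    """True if the URL matches a known spam-software shape."""
    return any(p.search(url) for p in SPAM_PATTERNS)
```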
-
arkiver
indeed!
-
that_lurker
arkiver++
-
eggdrop
[karma] 'arkiver' now has 19 karma!
-
arkiver
:P
-
that_lurker
datechnoman++
-
datechnoman
arkiver++
-
eggdrop
[karma] 'datechnoman' now has 7 karma!
-
eggdrop
[karma] 'arkiver' now has 20 karma!
-
arkiver
imer: your logs helped a lot, btw
-
that_lurker
imer++
-
eggdrop
[karma] 'imer' now has 2 karma!
-
arkiver
lol
-
arkiver
that_lurker also helped a lot by adding karmas
-
that_lurker
that_lurker++
-
eggdrop
[karma] self karma is a selfish pursuit.
-
that_lurker
damn :P
-
arkiver
:)
-
datechnoman
that_lurker++
-
eggdrop
[karma] 'that_lurker' now has 4 karma!
-
datechnoman
I got you mate
-
datechnoman
No one left behind
-
imer
eggdrop++
-
eggdrop
[karma] 'eggdrop' now has 16 karma!
-
arkiver
vilinkv.shop loop handled now too
-
arkiver
moving out current todo:redo and todo:secondary
-
datechnoman
Roger. Moment of silence for the data
-
datechnoman
:(
-
datechnoman
Lol
-
imer
how goes the moving? :)
-
datechnoman
Haha I was going to ask the same thing. Wondering if I go to bed or stay up for a few if we get rolling
-
datechnoman
arkiver - How are we lookin?
-
datechnoman
Well im gonna get some rest. Will look into this tomorrow morning. Night all!
-
imer
good night
-
imer
heading out soon myself as well (not to bed)
-
nyany
is this project currently paused?
-
that_lurker
ping arkiver did something break?
-
nyany
that_lurker: the project is likely paused right now while they deal with the above issues
-
imer
yep paused while arkiver moves things around, seems to have disappeared though
-
nyany
yup lol
-
nyany
The good news is that while urls is currently undergoing surgery, we desperately need more 1x1 workers on roblox
-
arkiver
time to get rolling
-
arkiver
this new method may give us some new loops that need eliminating
-
arkiver
running
-
nyany
oooooh
-
arkiver
with this new "outlinks from news sites" feature, we're also getting a lot of social media share URLs. i'm going to look into pushing those into the 'one-time lists', so they don't go into the bloom filter
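For context on the trade-off mentioned above: URLs added to the bloom filter are deduplicated permanently (with a small false-positive rate), while one-time lists bypass it. A minimal bloom filter sketch, not the project's implementation:

```python
import hashlib

class BloomFilter:
    """Minimal bloom filter: no false negatives, small false-positive rate."""
    def __init__(self, bits=1 << 20, hashes=4):
        self.bits, self.hashes = bits, hashes
        self.array = bytearray(bits // 8)

    def _positions(self, item):
        # Derive k bit positions from salted SHA-256 digests.
        for i in range(self.hashes):
            digest = hashlib.sha256(f'{i}:{item}'.encode()).digest()
            yield int.from_bytes(digest[:8], 'big') % self.bits

    def add(self, item):
        for pos in self._positions(item):
            self.array[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.array[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```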
-
arkiver
hmm looking at 26 TiB/day currently
-
arkiver
which is a lot
-
arkiver
i think this is an initial wave of URLs though, like we've seen before with new features, so this number should go down
-
arkiver
backfeed going down in size now, good, might have been an initial bump
-
arkiver
the stuff in the main todo queue is requeued items from claims, which we have a filter in place for.
-
arkiver
(i want to get these out of the way from claims)
-
arkiver
all is looking very good!
-
arkiver
todo:backfeed is near 0 now
-
nyany
lol
-
nyany
i wonder why
-
arkiver
i don't see any serious loops
-
nyany
Sigh. petition to rename this project to whatgoesaroundcumsaround because PORN
-
katia
👀
-
arkiver
nyany: i don't see much of it now?
-
nyany
lol, sorry, that was an off the record remark
-
arkiver
rates are going down now as expected :)
-
arkiver
if this keeps looking good in the coming days, we'll also turn it on for political and government sites!
-
fuzzy8021
i did finally decide i have given up on getting on hetzner's good side, so the boxes i still have with them are running this again
-
arkiver
fuzzy8021: ah :/ sorry to hear
-
arkiver
were the problems back then mostly about IP addresses in the ranges we now block by default?
-
arkiver
well we have found some PDFs (PDFs leading to more PDFs, etc.), hopefully it won't last too long
-
arkiver
yeah lots of science related PDFs
-
arkiver
an example: just saw
journals.biologists.com/toolbox/dow…/10.1242_jcs.259365/1/jcs259365.pdf getting archived, and it had 181 URLs extracted and queued back - most of which were doi.org URLs. so those will be resolved, leading to more PDFs, etc.
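Outlink extraction from a PDF's text layer is often just pattern matching — a rough sketch, assuming the text has already been extracted from the PDF:

```python
import re

# Rough matcher; trims punctuation that line-wrapped PDF text glues onto URLs.
URL_RE = re.compile(r'''https?://[^\s<>"')\]]+''')

def extract_outlinks(text):
    """Return URLs found in extracted PDF text, stripped of trailing punctuation."""
    return [m.group().rstrip('.,;') for m in URL_RE.finditer(text)]
```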
-
arkiver
but that cycle should end at some point, i don't see 'bad looking' loops
-
arkiver
paused for a bit as i investigate why scholar.google.com are not getting URLs discovered
-
arkiver
solved
-
fireonlive
datechnoman: :)
-
Notrealname1234
Have there been any cases of the URLs project DDoSing any website?
-
nyany
That's possible with any one of our DPoS projects
-
nyany
I get all my most important things from that website
-
arkiver
uh
-
arkiver
i could improve some stuff there yeah
-
arkiver
it's due to PDF extraction
-
arkiver
with improve, all i mean is get rid of the repeated .
-
arkiver
i _think_ the URL is technically still valid with repeated . taken out
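The "repeated ." cleanup could be as small as collapsing dot runs in the path — a guess at the shape of the fix, not the actual update:

```python
import re
from urllib.parse import urlsplit, urlunsplit

def collapse_repeated_dots(url):
    """Collapse runs of '.' in the path (a PDF-extraction artifact);
    scheme and host are left untouched. Real '..' segments would need care."""
    parts = urlsplit(url)
    path = re.sub(r'\.{2,}', '.', parts.path)
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))
```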
-
arkiver
i'm tired now and might make mistakes, so will make that update tomorrow
-
datechnoman
Good Morning All. Everything seems to be running very smoothly this morning :D great work arkiver!
-
datechnoman
Also great to hear we can support scholar.google.com
-
datechnoman
That is something we definitely want to support :)
-
Ryz
Mmm, could
video.sindonews.com be archived and checked more often here? I don't think I see it being checked frequently via
web.archive.org/web/20240000000000*/https://video.sindonews.com
-
JAA
nyany: The internet is really, really great...