-
nstrom|m
urls is doing more bandwidth than usual because there's a ton of pdfs in queue, I believe
-
datechnoman
we are discovering tonnes of new URLs and sources so the queue is growing rapidly
-
datechnoman
Nothing we can't handle
-
datechnoman
*cracks out the credit card* lol
-
nicolas17
discovery rate is nearly 1x (at which point the calculated ETA becomes infinity)
-
nicolas17
discovering 101 items for every 100 items completed
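
The ETA remark above can be made concrete with a small sketch. It assumes a steady completion rate and a fixed ratio of discovered to completed items; the function and parameter names are illustrative and are not taken from the tracker's code.

```python
import math

def queue_eta(queue_size, completed_per_hour, discovered_per_completed):
    """Hours until the queue drains, or math.inf if it never does.

    Every completed item removes one entry from the queue and adds
    `discovered_per_completed` new ones, so the net drain rate is
    completed_per_hour * (1 - discovered_per_completed).
    """
    net_rate = completed_per_hour * (1.0 - discovered_per_completed)
    if net_rate <= 0:
        # Discovery keeps up with (or outpaces) completion.
        return math.inf
    return queue_size / net_rate

# 101 items discovered per 100 completed: the queue never empties.
print(queue_eta(1_000_000, 5000, 1.01))  # inf
# Half an item discovered per completed item: 400.0 hours.
print(queue_eta(1_000_000, 5000, 0.5))   # 400.0
```

As the ratio approaches 1 from below the denominator approaches zero, which is why the calculated ETA blows up near 1x.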
-
» fireonlive watches over datechnoman debt ratio
-
datechnoman
fireonlive i need someone like you to look out for me. It's an addiction....
-
fireonlive
i feel you :(
-
datechnoman
We will eventually get to a point where we discover less lol... eventually...
-
fireonlive
my debt utilization ratio is in the 90s
-
fireonlive
:x
-
fireonlive
debt->credit
-
fireonlive
eventually :p
-
fireonlive
one whole internet archived
-
datechnoman
Haha well if we are including the mortgage then I'm fk'ed lol
-
datechnoman
Even with cheeky thousands of dollars we would be blocked by IA ingest haha
-
datechnoman
The real question is, when is WBM going to hit 1 trillion pages?
-
fireonlive
:P we'll set mortgages and cars aside :D
-
fireonlive
ooh indeed
-
fireonlive
i better start posting more memes for JAA to archive
-
fireonlive
:3
-
JAA
:-)
-
datechnoman
Need to set up a bot that auto grabs any link and !a it into the correct channel lol
-
datechnoman
e.g. imgur link will be !a in imgur channel, other links !a in #//
-
JAA
But you can't !a individual URLs here, only lists, I think.
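
The routing half of that bot idea could look roughly like the sketch below. It only picks a channel for a link and does not issue any command, since, as noted above, individual URLs may not be accepted here anyway. The "#imgur" channel name and the mapping table are assumptions for illustration; only "#//" comes from the discussion.

```python
from urllib.parse import urlsplit

# Hypothetical mapping from domain to project channel.
CHANNEL_BY_DOMAIN = {"imgur.com": "#imgur"}
DEFAULT_CHANNEL = "#//"

def channel_for(url):
    """Return the channel a link would be routed to.

    The URL must include a scheme (https://...), otherwise urlsplit
    cannot find a hostname and the default channel is returned.
    """
    host = (urlsplit(url).hostname or "").lower()
    for domain, channel in CHANNEL_BY_DOMAIN.items():
        # Match the domain itself and any of its subdomains.
        if host == domain or host.endswith("." + domain):
            return channel
    return DEFAULT_CHANNEL
```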
-
fireonlive
-
h2ibot
fireonlive: Registering SkBQiE2b for '!a dl.fireon.live/404'
-
h2ibot
fireonlive: Not a transfer.archivete.am URL. (SkBQiE2b)
-
h2ibot
fireonlive: Something went wrong. (SkBQiE2b)
-
fireonlive
indeed
-
fireonlive
...why is my 404 a 200
-
fireonlive
...why is anything a 200
-
fireonlive
:|
-
JAA
>:-(
-
» fireonlive takes caddy behind the barn
-
» TheTechRobo throws a planet.osm at datechnoman
-
JAA
Yeah, that happened before on AB with a blind !ao.
-
TheTechRobo
yeah, that was fun
-
TheTechRobo
Didn't that crash a few jobs?
-
TheTechRobo
Or am I misremembering?
-
TheTechRobo
Would be nice if pipelines could set a per-file size maximum that would fail the URL if it started to exceed that
-
fireonlive
*non existent memories of --large*
-
fireonlive
I think the proposal was to catch the out of space error and retry later when space was made or something like that
-
JAA
Not sure how many other jobs it affected, but it definitely crashed at least the !ao < job.
-
JAA
And yes, that's the idea, make the error non-fatal.
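
A rough sketch of both ideas from this exchange: a per-file size cap that fails the URL, and an out-of-space error that is treated as "retry later" rather than fatal. This is not how the actual pipeline code works; `save_with_limit`, `FileTooLarge`, and the returned status strings are made up for illustration.

```python
import errno
import os

class FileTooLarge(Exception):
    """Raised when a download exceeds the per-file size limit."""

def _discard(path):
    # Remove a partial file, ignoring the case where it is already gone.
    try:
        os.remove(path)
    except FileNotFoundError:
        pass

def save_with_limit(chunks, path, max_bytes):
    """Write chunks to path; return 'ok', 'too_large', or 'retry_later'."""
    written = 0
    try:
        with open(path, "wb") as out:
            for chunk in chunks:
                written += len(chunk)
                if written > max_bytes:
                    raise FileTooLarge(path)
                out.write(chunk)
        return "ok"
    except FileTooLarge:
        # Fail only this URL; other jobs are unaffected.
        _discard(path)
        return "too_large"
    except OSError as exc:
        if exc.errno == errno.ENOSPC:
            # Disk full: free the partial file and let the caller requeue.
            _discard(path)
            return "retry_later"
        raise
```

Checking the size while streaming means an unexpectedly huge file is abandoned once it crosses the limit instead of after the disk has filled up.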
-
datechnoman
Looks like we hit the curve and are gaining now!
-
arkiver
rewby: we're going to launch a URLs Tor project! :) can we have a target for it? it can be on the same machine as the regular URLs project - i don't expect a ton of data
-
arkiver
it would have
-
arkiver
archiveteam_urlstor_
-
arkiver
urlstor_
-
arkiver
Archive Team URLs Tor:
-
arkiver
and the project is urls-onion on the tracker
-
TheTechRobo
cool!
-
BornOn420
Plenty of nudetubesex.com/eeQ9Yv3.php and nudetubesex.com/Wpcne.php?JvSeYeO.xml URLs in my logs, but these all redirect to one specific page: 171kj.cc
-
BornOn420
Can we filter these out?
-
BornOn420
(oh, and no nudity)
-
datechnoman
I did notice that. Just a bunch of SPAM JAA ^^^^
-
arkiver
datechnoman: please ping me too
-
arkiver
i feel like some of these can be filtered out differently than adding a pattern, but not sure either
-
arkiver
problem with patterns is that they stay in kind of "forever" - so resource use due to them increases over time as more are added
-
arkiver
maybe some can be merged together or deleted later on, but not entirely sure yet how to check that
-
imer
Anyone else's containers acting up? (looking at stats, that's a yes) I have some that are seemingly stuck
-
arkiver
imer: stuck on what?
-
arkiver
i can have a look in a bit of time
-
imer
2024-03-27T13:17:06.563547423Z Starting MoveFiles for Item and then silence
-
arkiver
oh crap
-
arkiver
sorry, left something in :/
-
arkiver
imer: fixed
-
imer
oh, good. thought something was broken on my end
-
arkiver
there was a debug sleep of 1000 seconds in there, that i forgot to take out
-
imer
oops haha
-
arkiver
this large recent update is for the introduction of the urls-tor-grab project, it will be based on urls-grab with some stuff replaced simply in pipeline.py
-
immibis
tor urls: cool. Does it need an external tor proxy? will i2p be done eventually?
-
AK
Wonder if there's some way of tracking the number that get filtered out. And then maybe that being an opt in thing we could run on some of them. e.g. "This filter hasn't actually filtered anything for 30 days, we can probably remove it".
-
AK
Would only work for filters added to workers though
-
arkiver
immibis: not sure about i2p
-
arkiver
AK: yeah the thing is though that many of these filters are in to prevent expanding loops. once one URL goes through and is not filtered out, it expands into 2 URLs, 4, etc. (or more at each step)
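
The hit-tracking idea from this exchange could be sketched as below. `TrackedFilter` and `stale_filters` are hypothetical names, not part of urls-grab, and the patterns in any usage are placeholders.

```python
import re
import time

SECONDS_PER_DAY = 86400

class TrackedFilter:
    """A URL filter pattern that records how often and when it matched."""

    def __init__(self, pattern, now=None):
        self.regex = re.compile(pattern)
        self.hits = 0
        self.last_hit = None
        self.added = time.time() if now is None else now

    def matches(self, url, now=None):
        if self.regex.search(url):
            self.hits += 1
            self.last_hit = time.time() if now is None else now
            return True
        return False

def stale_filters(filters, max_idle_days=30, now=None):
    """Return filters that have not matched anything for max_idle_days."""
    now = time.time() if now is None else now
    cutoff = now - max_idle_days * SECONDS_PER_DAY
    stale = []
    for f in filters:
        # A filter that never matched is measured from when it was added.
        last_seen = f.added if f.last_hit is None else f.last_hit
        if last_seen < cutoff:
            stale.append(f)
    return stale
```

Given the point about expanding loops, an idle filter is only a candidate for review: a pattern that blocks a loop at its first URL may show very few hits precisely because it is working.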
-
JAA
arkiver: Poking Drone so urls-tor-grab will build once pushed.
-
arkiver
thanks JAA :)
-
immibis
can the Wayback Machine handle tor?
-
arkiver
yes
-
AK
Running the tor worker, anything we need to do or is it just run the workers and they handle all the tor bit for us?
-
arkiver
just run docker and you're all done
-
arkiver
it will run tor for you
-
AK
Alright, gimme a ping when you've got a built image and I'll spin up a (low concurrency to start) few workers
-
arkiver
sounds good :)
-
arkiver
we'll start shortly!
-
arkiver
working on the final bits
-
immibis
some Tor sites have custom captchas, like I think dread has one
-
immibis
it probably won't be a terribly complete archive. well it's only targeted urls right, should be ok
-
arkiver
we would not get the content behind captchas
-
arkiver
which is the same as for the general web, we currently don't attempt to solve captchas
-
arkiver
i think we're ready
-
arkiver
FYI AK coming up
-
arkiver
it's up!
-
arkiver
almost
-
fireonlive
exciting news!
-
Darken
Hello, keep getting this error on the project and it's looping, not sure if anyone has reported it yet: transfer.archivete.am/137OT4/error.txt
-
arkiver
up!
-
Darken
Legend arkiver
-
arkiver
Darken: i can't replicate :/
-
arkiver
Darken: this is urls-grab right? (not Tor)
-
Darken
You fixed it
-
Darken
and yes it was
-
Darken
when you said up! it fixed
-
arkiver
not sure what fixed it but good :P
-
arkiver
pausing tor project
-
arkiver
why is there non-onion stuff in the project
-
fireonlive
onion sites linking to clear net?
-
AK
Seeing a fair few `Failed ConnectTor for Item` on some of mine, not sure if that's a me problem
-
arkiver
fireonlive: no those would be queued back to the urls-grab project
-
fireonlive
ahh ok
-
fireonlive
is there a separate tracker leaderboard for urls-tor?
-
arkiver
found the problem
-
arkiver
yeah urls-onion, maybe i should change that to urls-tor
-
fireonlive
ahh onion
-
fireonlive
ye tor would match the image/title i suppose
-
imer
tor: getting a timeout in checkip after supposedly connecting
-
imer
oh, it worky
-
arkiver
imer: did it take many tries?
-
arkiver
i think a new route needs to be requested upon a timeout, will get that in later this week if it is a solution indeed
-
imer
arkiver: after update it just worked first try, don't see it submitting the results though.. it just kinda gave up? running on conc 1: transfer.archivete.am/iXZmR/2024-03-27_18-07-34.txt
-
arkiver
oh yeah we don't have a target yet
-
arkiver
(bad error message - it's actually saying there's no target)
-
imer
aah
-
imer
all good then
-
arkiver
rewby: i changed the name on the tracker of the project from urls-onion to urls-tor FYI
-
arkiver
imer: did you cut some lines out of that log?
-
imer
yes
-
imer
uh, no
-
imer
I truncated the length
-
arkiver
can you post the full log?
-
imer
sure
-
imer
-
arkiver
ah okey looking fine
-
imer
45s for checkip, might need a longer timeout?
-
arkiver
yeah
-
arkiver
will make it 120 seconds
-
arkiver
we also have a 120 second timeout on retrieving a URL with Wget-AT
-
arkiver
imer: done, 120 seconds now
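
The checkip behaviour discussed here (a 120 second timeout, plus requesting a new route when a check times out) might be sketched like this. The two callables are placeholders supplied by the caller; this is an assumption about the shape of the logic, not the code in urls-tor-grab.

```python
def connect_with_retries(check_ip, request_new_route, timeout=120, max_tries=5):
    """Try the IP check up to max_tries times, switching routes on timeout.

    check_ip(timeout) should return the observed IP or raise TimeoutError.
    request_new_route() should ask the Tor client for a fresh circuit.
    Returns the IP on success, or None if every attempt timed out.
    """
    for attempt in range(1, max_tries + 1):
        try:
            return check_ip(timeout)
        except TimeoutError:
            # No point requesting a new route after the final attempt.
            if attempt < max_tries:
                request_new_route()
    return None
```

Passing the two operations in as functions keeps the retry policy separate from however the Tor daemon is actually controlled.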
-
imer
will these be visible via WBM then or just in a special collection?
-
arkiver
they'll be in the archiveteam_urlstor collection
-
arkiver
and they will be visible in the Wayback Machine
-
imer
cool
-
fireonlive
:3
-
arkiver
project is unpaused
-
arkiver
(i had to clean up a mess i made)
-
arkiver
enjoy :)
-
arkiver
i'll be off now
-
arkiver
when we have a target, the party can truly start
-
fireonlive
🥳
-
fireonlive
arkiver++
-
eggdrop
[karma] 'arkiver' now has 18 karma!
-
nicolas17
hm
-
nicolas17
my VPS is already running a tor daemon, I don't think I can afford the RAM for a second :P
-
fireonlive
🤔oO( nicolas17 + tor = ? )
-
nicolas17
what are you speculating about
-
fireonlive
your use cases :D
-
nicolas17
ah
-
nicolas17
at one point it was
-
nicolas17
"digitalocean gives me 1TB/mo upload and I barely use it, I'll get my money's worth by letting a tor relay burn through the rest"
-
nicolas17
now I'm also checking for changes in opensource.samsung.com file lists, and my VPS IP has been banned like a month ago already :P
-
Terbium
oof 1TB/mo egress is smol
-
immibis
compared to hetzner yes. compared to aws no.
-
fireonlive
ahh :)
-
Terbium
digitalocean--
-
eggdrop
[karma] 'digitalocean' now has -1 karma!
-
Terbium
aws--
-
eggdrop
[karma] 'aws' now has -1 karma!
-
fireonlive
but digitalocean has the best peering ever :P
-
imer
-
fireonlive
very nice
-
JAA
Nice
-
JAA
I was going to suggest Shrek. :-P
-
Terbium
Shrek--
-
eggdrop
[karma] 'Shrek' now has -1 karma!
-
immibis
aws--
-
eggdrop
[karma] 'aws' now has -2 karma!
-
Terbium
azure--
-
eggdrop
[karma] 'azure' now has -1 karma!
-
Terbium
gcp--
-
eggdrop
[karma] 'gcp' now has -1 karma!
-
Terbium
cloudflare--
-
fireonlive
is there any provider we ++?
-
fireonlive
:3
-
imer
we might be slowing down expertini.com, there's also some regional? redirect stuff going on from the looks of it: transfer.archivete.am/14XnYB/2024-03-27_21-14-46.txt (some 500s, lots of redirects)
-
imer
hetzner++
-
eggdrop
[karma] 'hetzner' now has 1 karma!
-
imer
ah, http -> https redirects
-
fireonlive
cloudflare--
-
eggdrop
[karma] 'cloudflare' now has -1 karma!
-
fuzzy8021
any thoughts on if there would be abuse messages on tor stuff?
-
immibis
what do you mean?
-
immibis
some fascist hosting providers prohibit all use of tor. the rest don't care unless they get an abuse message, which only happens to exit nodes, which you aren't
-
fuzzy8021
good deal. thanks
-
datechnoman
Man my cluster was cooked after those code changes. All the containers were hung from the pipeline issue. Just rolled everything and it appears to be working again as normal
-
nicolas17
Terbium: I pay $6/mo, and as I said in normal conditions I don't even use most of that 1TB