-
datechnoman
arkiver - does this include the PDF extractions I sent you? Happy to start creating more dumps if required. Just didn't want to waste time and energy processing through all of them if they aren't needed
-
datechnoman
Happy to queue them directly in here also so you don't have to worry about being a part of the process :)
-
imer
probably still working through my list
-
imer
(commoncrawl extract)
-
datechnoman
Sweet! Sounds good
-
datechnoman
Get all that good loot :P
-
Ryz
Hmm, thoughts on adding realty.ria.ru into this project? Don't think any of it has been run through here checking on WBM...and considering the job that's running on AB ><;
-
Ryz
Also saying because the job's taking forever because of having to catch up with the most recent articles ><;
-
datechnoman
No lack of horsepower here lol
-
arkiver
datechnoman: no, it does not include that yet!
-
datechnoman
Ooohhhhh. More goodies to come! :O arkiver
-
datechnoman
I like it!
-
fireonlive
datechnoman++
-
fireonlive
arkiver++
-
eggdrop
[karma] 'datechnoman' now has 6 karma!
-
eggdrop
[karma] 'arkiver' now has 17 karma!
-
datechnoman
Well I guess I'll start processing more pdf dumps if we are now processing them
-
arkiver
so we have now gathered 20k .onion URLs, which are stashed at tracker.archiveteam.org/urls-onion
-
datechnoman
I believe we have a pdf stash one also from what I see on the queueing of warriors
-
datechnoman
?
-
AK
Abuse report came in from cloudblock.espresso-gridpoint.net to hetzner. Wow are they aggressive. 48 hours to respond which is way sooner than normal. Anyone else seen one for accessing www.wb-automation.com?
-
imer
nothing here so far, got a bitninja one the other day (been a while)
-
AK
haha yeah bitninja sent one to my home isp, so had fun telling Zen that bitninja are stupid and I don't care. If they send another I might tone stuff down at home though
-
imer
running this at home would sketch me out (since isps/law enforcement can be incompetent)
-
AK
var re
-
AK
*:facepalm:
-
h2ibot
datechnoman: Deduplicating and queuing 9401326 items. (cmKPUONH)
-
arkiver
datechnoman: is this part of what you sent to me?
-
datechnoman
Nah, it isn't
-
arkiver
ah okay
-
arkiver
let's not yet queue large batches of PDFs - they can be resource intense to process
-
arkiver
so best to do that when "the timing is right"
-
arkiver
but maybe these are from a different source that needs queuing now
-
datechnoman
ack all good mate. I purposely did a small one (800MB) compared to the multiple GB ones I sent you
-
arkiver
yeah, i will do the big batches later :)
-
datechnoman
These ones are from Twitter outlink sources
-
arkiver
sounds very nice! :)
-
arkiver
i see 2 million items in the queue
-
arkiver
so the majority was already queued earlier i guess
-
datechnoman
I won't queue any more for now. Just wanted to clear it off my server so I can get batching on those large ones for you
-
arkiver
yeah :)
-
arkiver
thank you!
-
arkiver
also we'll start archiving those .onion URLs we collected
-
datechnoman
All good! Anytime. Keeps me busy and I want to contribute more than just throwing some money/workers at projects
-
arkiver
you're doing pretty awesome stuff with getting these tons of URLs
-
datechnoman
I've got a few billions atm lol
-
datechnoman
billion**
-
datechnoman
With that being said I need to run some dedup over the different source subsets
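
A minimal sketch of deduplicating URLs across several source subsets, assuming each subset can be read as a stream of one URL per line. The function name `dedup_urls` and the sample lists are hypothetical, not the tooling used in this channel. It keeps a 16-byte digest per URL in memory, which would not scale to billions of entries on its own; at that size a sort-based or sharded approach is the usual alternative.

```python
import hashlib
from typing import Iterable, Iterator


def dedup_urls(sources: Iterable[Iterable[str]]) -> Iterator[str]:
    """Yield each URL once, in first-seen order, across all sources."""
    seen = set()
    for source in sources:
        for line in source:
            url = line.strip()
            if not url:
                continue  # skip blank lines
            # Keep a fixed-size digest rather than the full URL to bound memory per entry.
            digest = hashlib.blake2b(url.encode("utf-8"), digest_size=16).digest()
            if digest not in seen:
                seen.add(digest)
                yield url


if __name__ == "__main__":
    # Hypothetical sample subsets standing in for per-source URL dumps.
    twitter_outlinks = ["https://example.com/a\n", "https://example.com/b\n"]
    commoncrawl = ["https://example.com/b\n", "https://example.com/c\n"]
    for url in dedup_urls([twitter_outlinks, commoncrawl]):
        print(url)
```

Because the function is a generator, the deduplicated stream can be written straight to a new file without holding the URLs themselves in memory.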
-
arkiver
at some point we may be able to go through all of them
-
arkiver
in the far future
-
datechnoman
and I need to optimise better how I'm processing / storing them all
-
arkiver
how are you currently storing them?
-
datechnoman
Yeah for sure. Costs me nothing to keep them around (other than the cost of the drive which is in the past) ;)
-
datechnoman
I have a 130TB server at home
-
datechnoman
Looks like 4 million have queued so far and still going
-
arkiver
oh yeah it's still ongoing
-
arkiver
so these are stored as simple files and you're grepping through them?
-
datechnoman
They are split up in .zst files (thousands of them) in data set / data source folder structures
-
datechnoman
Then I just zcat and grep over them by streaming them from my storage server
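
A rough sketch of the "stream and grep" workflow described above: walk a folder tree of compressed files and yield lines matching a pattern, without unpacking anything to disk. The function name `grep_compressed` and its parameters are hypothetical. The default opener uses `gzip` from the standard library only so the sketch runs without extra packages. Note that `zcat` conventionally handles gzip, while the zstd tools ship `zstdcat` for `.zst` files, so for `.zst` data one would pass `suffix=".zst"` and an opener backed by a zstd-aware reader.

```python
import gzip
import re
from pathlib import Path
from typing import IO, Callable, Iterator


def _open_gzip_text(path: Path) -> IO[str]:
    # Default opener: decompress on the fly and decode as text.
    return gzip.open(path, "rt", encoding="utf-8", errors="replace")


def grep_compressed(
    root: Path,
    pattern: str,
    suffix: str = ".gz",
    opener: Callable[[Path], IO[str]] = _open_gzip_text,
) -> Iterator[str]:
    """Yield matching lines from every compressed file under root."""
    regex = re.compile(pattern)
    # rglob descends through the data set / data source folders.
    for path in sorted(root.rglob(f"*{suffix}")):
        with opener(path) as handle:
            for line in handle:  # streamed line by line
                if regex.search(line):
                    yield line.rstrip("\n")
```

For example, `grep_compressed(Path("dumps"), r"\.pdf$")` would lazily yield every line ending in `.pdf` under a hypothetical `dumps` directory.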
-
datechnoman
I've been lazy the last 2-3 weeks and haven't processed any URLs. Been busy with life. Will get back onto it soon though
-
nyany
Sorry, just jumping in on this discussion
-
nyany
If you're looking at archiving .onion URLs, it might be worth noting that some providers do not take kindly to any form of Tor running on their network
-
arkiver
yeah it would be "opt-in" of course meaning one needs to actually run the project
-
nyany
I don't quite know if that'd be enough, or if it'd be fair
-
nyany
I'm not sure of how feasible it'd be, but might it be better to create a separate copy of the grabber that includes the functionality for grabbing Tor? or simply grab Tor as a separate project?
-
datechnoman
I don't want to be touching tor at all
-
arkiver
nyany: yes on that last one
-
arkiver
that is the plan actually
-
nyany
Oh, I see
-
datechnoman
I'd be asking to be banned on all cloud platforms at that point :o
-
nyany
Well I'm glad I spoke up, thanks for that clarification
-
nyany
OVH appears to be relay friendly
-
nyany
as is Hetzner
-
datechnoman
It says try to avoid those hosters?
-
datechnoman
Still wouldn't want to risk it IMO
-
nyany
They say to avoid them because there are already a metric ton of them on the network
-
nyany
It also says to avoid Frantech, but they're openly pro-Tor
-
nyany
I use two of their servers for this project actually
-
datechnoman
Ahh misread that
-
imer
still baffling to me even running a relay can get you in trouble. exit nodes I get
-
nyany
Unfortunately, some providers are largely ignorant of the correct ways to use Tor, and are influenced by what they see online
-
nyany
Most are familiar with it as "that thing criminals use" and as soon as they see that word you can bet it's in their TOS
-
nyany
Anyways, on a slightly more on-topic note, a bunch of my workers are getting "max connections reached" errors for the target "optane9.targets.atinfra.net"
-
imer
target has been a bit sad for a while, if the last update from rewby is still accurate it's on IA's end
-
nyany
oh that kind of issue
-
nyany
ok
-
imer
well, running into the target's safety limit to not overwhelm IA
-
nyany
Yeah I'm familiar with that problem, it's unfortunate
-
imer
IA stats aren't public anymore :( so can't see how things look on their end
-
datechnoman
Terribly slow. IA must be getting hammered by other people also
-
datechnoman
Workers are basically all idle :(
-
datechnoman
IA seems to be catching up now yay. Traffic is flowing
-
JAA
pabs: ^
-
h2ibot
JAA: Deduplicating and queuing 5693454 items. (fR5vvnxF)
-
h2ibot
JAA: Deduplicated and queued 5693454 items. (fR5vvnxF)
-
datechnoman
arkiver - My !a import job (cmKPUONH) appears to have never completed. Might need a kick
-
datechnoman
definitely no rush. Plenty of backlog lol
-
datechnoman
Just want to make sure it doesn't block up a slot for other people
-
rewby
Optane9 is unstuck
-
rewby
I don't know whatever it is you're doing, but you're doing a lot of it
-
nicolas17
I was merely running 1 deviantart container and getting lots of optane9 errors ._.
-
rewby
Yes, and?
-
rewby
That target does a lot of things
-
rewby
Telegram and urls are currently doing a good job of using every last bit of disk space
-
nicolas17
I just mean "it wasn't me!" >.>
-
nicolas17
let's blame fireonlive for no reason
-
rewby
It was more aimed at the people throwing large jobs into both telegram and urls
-
rewby
This is usually either arkiver or datechnoman.
-
» JAA whistles.
-
rewby
And not that I care that much. It'll all even out over time.
-
JAA
Yeah, backpressure's nice. :-)
-
rewby
Yeah, it'll fix itself eventually
-
rewby
Just making sure people (specifically JAA and arkiver) are aware of what the big hitters are in case they need to shift capacity to critical projects.
-
JAA
Aye :-)
-
rewby
Also, people were really backlogged man.
-
rewby
Once the first round of uploads started finishing after my fixes, optane9 got hit with a solid 9gbps for a good 30 seconds
-
rewby
And it's still spiking up to 5gbps as packs finish uploading
-
JAA
Nice
-
datechnoman
rewby - just making sure you're getting value for money on the hardware
-
datechnoman
Thank you very much for all your hard work
-
datechnoman
Didn't want to ping you as I assumed it might have been IA S3 being slow
-
datechnoman
Optane9 is a beast