-
nimaje
don't you normally want to avoid a thundering herd, and do that by using randomized delays and exponential backoff? why is it intended there?
-
rewby
Basically, it has to do with how backpressure works on these
-
rewby
When disks fill up beyond 80%, they stop accepting new connections
-
rewby
And then once they drop back below a safe level they start accepting new uploads again
-
rewby
At 100% usage of available bandwidth, you will see the disks fill up, then it chews on it again, and then when it has space, it's immediately full and has enough to chew on again
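A minimal Go sketch of the watermark behaviour described here: stop accepting uploads above a high watermark, start again once usage drops back to a safe level. The 80%/70% thresholds, the path, and the gate function are illustrative assumptions, not the real target scripts.
```go
package main

import (
	"fmt"
	"syscall"
)

const (
	highWatermark = 0.80 // stop accepting new uploads above this usage
	lowWatermark  = 0.70 // start accepting again once usage drops below this
)

// diskUsage returns the used fraction of the filesystem at path.
func diskUsage(path string) (float64, error) {
	var st syscall.Statfs_t
	if err := syscall.Statfs(path, &st); err != nil {
		return 0, err
	}
	total := float64(st.Blocks) * float64(st.Bsize)
	avail := float64(st.Bavail) * float64(st.Bsize)
	return 1 - avail/total, nil
}

// updateGate applies the hysteresis: close above the high watermark, reopen
// only once usage has fallen back below the low watermark.
func updateGate(accepting bool, usage float64) bool {
	switch {
	case accepting && usage >= highWatermark:
		return false
	case !accepting && usage <= lowWatermark:
		return true
	default:
		return accepting
	}
}

func main() {
	usage, err := diskUsage("/data/incoming") // hypothetical upload spool
	if err != nil {
		panic(err)
	}
	accepting := updateGate(true, usage)
	fmt.Printf("usage=%.0f%% accepting=%v\n", usage*100, accepting)
}
```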
-
rewby
This way no part of a target is underutilized
-
rewby
It basically never runs out of things to process
-
rewby
So that way the targets are running as fast as they can processing as much as they can
-
rewby
If people scale down below this level, then there's bandwidth/throughput going spare
-
rewby
Because the only way you don't get this behaviour out of targets is to underutilize them
-
rewby
There's asterisks on this of course
-
rewby
If you're scaling up from "not enough capacity" to "more than enough" it actually makes sense to do some limiting for a bit to smooth the flow of data so everything can steady state properly
-
rewby
But I can do that from the tracker, that's not something worker runners should need to bother themselves with
-
rewby
And also, there's another asterisk here in that some targets actually can process faster than they have network capacity
-
rewby
So they'll never really back up usually
-
rewby
Because you literally cannot write to them fast enough to trigger this
-
rewby
But even that asterisk has asterisks
-
rewby
Because they very much can back up if some worst-case behaviour of various programs and connections happens
-
rewby
Example being if the IA is overloaded, then yes they can't upload as fast as you load into them
-
rewby
Or if temp files pile up due to rsync edge cases
-
imer
I'd hazard a guess the actual number of "can I write to you now" requests isn't that high either, so risk of self-ddos is low?
-
rewby
There's a 400 conn hard limit anyway
-
rewby
If it rejects a conn due to a "max connections reached (-1)", it's because the disks are full
-
rewby
If it's "max connections reached (400)" it's just you lost the lottery of the herd
-
rewby
And the target has said "this many people, no more"
-
rewby
And even without that, these targets are all nvme mostly
-
rewby
They can handle thousands of parallel connections no problem
-
rewby
My record is somewhere in the thousands per second range
-
rewby
(On a single target)
-
rewby
I appreciate that people are trying to help by going "oh errors, I should slow down"
-
rewby
But the thing is, targets are weird
-
arkiver
so for "why should we keep trying to upload?" rewby explains the reasons. and "why isn't keeping trying to upload a problem?" imer notes the reason
-
rewby
Generally, just keep going
-
rewby
Let them break tm
-
rewby
I'll deal with the mess when I get to it
-
rewby
And sometimes it's just a bit of peak load
-
rewby
and the system will process it eventually
-
rewby
arkiver: Imer is sort of yes, sort of no with this.
-
rewby
I've definitely self-ddosed targets
-
rewby
The thing is, the limiting factor on them is not parallel connections
-
rewby
Or even rsyncs per second
-
arkiver
challenge accepted :P
-
rewby
Sure, it's more efficient to do bigger uploads
-
arkiver
inb4 someone physically goes in and breaks rewby's stuff :)
-
imer
in the context of why no exponential backoff: "cause it's probably not needed"
-
rewby
It's probably not needed and actually gets in the way of the reason I noted above for wanting 100% util
-
rewby
Exponential backoff would settle below 100% load
-
arkiver
yeah
-
rewby
So just let 'em bang on the targets
-
rewby
They can defend themselves
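For contrast with exponential backoff, a rough Go sketch of the retry behaviour being recommended: a short, roughly constant delay with a little jitter, so workers are knocking again the moment the target frees up instead of settling below full utilisation. uploadOnce, the error, and the delay values are all placeholders.
```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

var errConnLimit = errors.New("max connections reached")

// uploadOnce is a stand-in for the real rsync/HTTP upload attempt; here it
// pretends the target turns us away a couple of times before there is room.
func uploadOnce(attempt int) error {
	if attempt < 3 {
		return errConnLimit
	}
	return nil
}

func main() {
	for attempt := 0; ; attempt++ {
		if err := uploadOnce(attempt); err == nil {
			fmt.Println("upload accepted on attempt", attempt)
			return
		}
		// Short fixed-ish delay with a little jitter, and no exponential
		// growth, so the worker is back as soon as space frees up.
		delay := 5*time.Second + time.Duration(rand.Intn(5))*time.Second
		fmt.Printf("attempt %d rejected, retrying in %v\n", attempt, delay)
		time.Sleep(delay)
	}
}
```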
-
imer
thanks for explaining, very insightful :)
-
rewby
Let me know if it's happening for an extended period of time
-
rewby
Because that usually indicates something I need to look into, but it's not a reason for you to stop
-
rewby
Usually that thing is "I need to get more capacity here"
-
arkiver
hopefully we'll be in calm waters again soon if nothing else pops up
-
arkiver
these 2 months are extreme
-
rewby
Alternatively, someone has just dumped a large pile of urls into the queue and there's a quick surge of stuff. (👀 arkiver)
-
arkiver
oops
-
arkiver
:P
-
rewby
Eh it's fine
-
arkiver
oh right
-
rewby
Just takes an hour or so to suck down
-
rewby
I don't bother scaling for just a single dump
-
rewby
So I'll just let y'all marinade in the -1 errors until it's swallowed them all
-
arkiver
we have a ton of URLs to go through that have been stashed away from the current reddit situation
-
arkiver
but we'll go through those slowly later
-
rewby
Eh, reddit has dedicated targets
-
rewby
So you can blow this up all you want
-
arkiver
i mean outlinks from reddit that will be queued here eventually
-
arkiver
and recently the queue increases have been due to spam problems
-
arkiver
but those seem to have largely gone away now (maybe new ones soon)
-
rewby
I consider that a "y'all" problem. :P
-
rewby
I'll upload whatever you feed me
-
arkiver
:P
-
arkiver
FYI those reddit outlinks are currently stashed here
tracker.archiveteam.org/urls-stash-reddit
-
rewby
I honestly want to see what we can get optane9 up to
-
rewby
It's been doing 5gbps fairly stably the last few days
-
rewby
Down to 3 atm
-
arkiver
pretty awesome!
-
arkiver
lot of data
-
arkiver
expensive though
-
rewby
?
-
rewby
For the IA, yes
-
rewby
optane9 is a dedicated hardware target
-
rewby
It's not leased or anything
-
rewby
I know the owner of the rack and the ISP that provides network to it
-
rewby
More amused at the large numbers than anything.
-
rewby
Outbound to IA is peered anyway
-
rewby
Especially v6 traffic is ezpz
-
arkiver
yeah was mostly talking about IA
-
arkiver
how long has optane9 been around?
-
rewby
Uh. A while
-
rewby
Year or two?
-
rewby
It's been almost exactly a year since I last did a reinstall on this box
-
threedeeitguy
related question, how much ram do the targets tend to run? Lots of ideas rattling around and all of them bad :p
-
rewby
It used to be known as 9inch
-
rewby
No we don't do ramdisks
-
rewby
It's been discussed
-
rewby
optane9's solution is perhaps easy to guess
-
arkiver
rewby: right! yes i knew this as 9inch then
-
rewby
It disappeared for a bit due to a hardware issue
-
rewby
Then was brought back
-
rewby
And we called it optane9 because that was slightly less crude and didn't raise eyebrows when someone visited the rack
-
imer
"It disappeared..." just like data on a ramdisk? ;)
-
rewby
No, we actually emptied it properly
-
rewby
More of a thing that it had issues with some parts of it so it wasn't safe to run
-
rewby
And, well, 2022 shipping of computer parts
-
rewby
Need I say more
-
rewby
In consumer space, not as bad
-
rewby
Enterprise stuff like servers tho
-
rewby
hoo boy
-
rewby
I think we ended up getting parts from a decommed server to fix this one up
-
threedeeitguy
rewby not a ram disk. I've been digging into the couple of rust rsync implementations that are out there. As I said, many bad ideas 😝
-
rewby
Rsync's not the issue
-
rewby
Megawarc factory is
-
rewby
rsync is literally the least bottlenecked part of this
-
rewby
Sure, it's rsync that gives you "errors"
-
rewby
But again, that's literally because the backpressure scripts we have instruct it to send those
-
rewby
If you see an rsync connection limit error, it's literally never an rsync problem
-
arkiver
rewby: you've done some work on making various parts of the megaWARC factory less resource intensive right?
-
arkiver
for example the one moving files around
-
rewby
I've done chunker yeah
-
rewby
I can't do much about megawarc itself at the moment
-
rewby
The main issue is just that it's all mediated by the file system
-
rewby
And there's no feedback after rsync completes
-
rewby
The moment the upload itself finishes, I *cannot* lose the file
-
rewby
So it all has to be persisted to disk
-
rewby
And due to the amount of tiny files and raw throughput, this needs NVMe SSDs
-
rewby
(Or a very large amount of sata SSDs, and even then you can't really top the nvmes)
-
rewby
This is all fine and dandy
-
rewby
Except
-
rewby
SSDs have limited write cycles
-
rewby
Consider every TB that goes through a target gets written to the ssds
-
rewby
Consumer SSDs have total write durabilities of less than half a petabyte
-
rewby
Prosumer ones usually approach 800TB
-
rewby
Maybe 1PB if you've got a good one
-
rewby
Our record is blowing out an NVMe in 3 months
-
rewby
And when I say blow out, I mean "it's gone so far the controller's failed on it"
-
rewby
The only way to actually deal with this is with fancy enterprise SSDs that are a) very very expensive and b) have a huge amount of non-provisioned capacity to allow them to have write durability in petabytes
-
rewby
Or special types like optane which are designed to have insanely low latency and super high durability
-
rewby
I've written up a design doc in the past of what I want in a new pipeline
-
threedeeitguy
I understand that rsync is not the limit. But if you controlled the rsync implementation then surely it would be possible to tie the tracker and the targets together so that the targets understand what data they are receiving. Something like the tracker issues a batch ID that the worker then has to provide to the target before upload is allowed
-
threedeeitguy
to commence. Data then has an extra state so instead of Upload > Tracker Done and now must not lose this file we have Upload > in memory > Tracker *Done* > Megawarc factory in memory > Megawarc shipped to disk > Tracker Done for realzies.
-
rewby
Yeah but like, at that point why use rsync
-
rewby
We can ship custom clients
-
rewby
We control the client side too
-
rewby
And using http(s) instead actually makes things easier since then you can use cool stuff like QUIC/HTTP3 and normal load balancers
-
threedeeitguy
True. Mind sharing the design doc?
-
rewby
Uh. I'll have to do some digging.
-
rewby
Remind me in 3 hours okay?
-
rewby
I have a meeting coming up I need to deal with
-
threedeeitguy
yep, np.
-
rewby
The basic idea was "instead of waiting for a whole chunk to complete, just write them out to disk in one go"
-
rewby
You can write out megawarcs incrementally
-
rewby
So don't return "complete" until it's written out as a megawarc
-
rewby
This saves a lot of resources
-
rewby
And also saves SSDs a lot because it's only a single write at that point
-
rewby
And also far fewer parallel IO requests
-
rewby
Instead of thousands of little IO requests a second
-
rewby
It's just like 3-4 parallel megawarcs writing out
-
threedeeitguy
ah nice. I've been reading over the warc spec but hadn't got as far as megawarcs, being able to stream into it is good.
-
rewby
Which means you can start using hdds and such more efficiently
-
rewby
The thing about warcs
-
rewby
They're a record based format
-
rewby
So the unit is the record
-
rewby
Be it a request record
-
rewby
Or a response
-
rewby
The compression is *also* record based
-
rewby
Each record is individually compressed
-
rewby
This way the WBM can seek into files
-
rewby
And accessing a record is an O(1) operation instead of O(n)
-
rewby
megawarcs are literally nothing more than just concatenated warcs
-
rewby
(asterisk)
-
rewby
(There's some details around compression dictionaries)
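A small Go demonstration of the property that makes this work: independently compressed gzip members can simply be concatenated and the result is still one valid stream, while each member remains readable on its own given its offset. The WARC record contents here are just stand-ins.
```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// gzipMember compresses one record as an independent gzip member.
func gzipMember(record string) []byte {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	zw.Write([]byte(record))
	zw.Close()
	return buf.Bytes()
}

func main() {
	// Two "records", compressed independently, then simply concatenated,
	// which is essentially what a megawarc is at the byte level.
	mega := append(gzipMember("WARC/1.1 request record\r\n"),
		gzipMember("WARC/1.1 response record\r\n")...)

	// The concatenation still decompresses end to end as one stream.
	zr, err := gzip.NewReader(bytes.NewReader(mega))
	if err != nil {
		panic(err)
	}
	out, _ := io.ReadAll(zr)
	fmt.Printf("%q\n", out)
}
```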
-
rewby
The main thing the megawarc factory does is read all the files in the chunk and decompress them as a quick sanity check and then write the compressed data to the end of the megawarc
-
rewby
And does a quick bit of record keeping in a json file to help locate what parts of the megawarc came from where
-
rewby
But the primary action is just a fancy concat
-
rewby
(With zstd ones it outputs a skippable frame with the dict first)
-
threedeeitguy
I like big disks and I cannot lie. Some of the ztsd stuff makes my head hurt, my background is very much SharePoint and higher level ms stuff so this is all new (and much more interesting :) )
-
rewby
Welcome to low level bullshit
-
rewby
Where it's all about doing as much as we can with as little hardware as possible
-
imer
i've done something similar recently, having the client wait until the data has settled in its final state is definitely the way to go if the client can retry on failure/hold until processing is done
-
imer
less ways for data to get lost as well if it's either ready or not and needs to be retried from scratch
-
rewby
Yeah
-
rewby
Waiting for the megawarc to complete isn't reasonable
-
rewby
Some projects do like one megawarc per week
-
rewby
But waiting for the data to be committed to a megawarc on disk should be very doable
-
rewby
And that means we can rambuffer all the inbound stuff
-
rewby
Most targets have 128G or more ram if they're hardware
-
rewby
Or 16 if VM targets
-
imer
yep, waiting for it to be uploaded to IA would be great, no way of losing data then, but not really realistic :)
-
rewby
It's also worth pointing out the current pipeline is very single threaded
-
rewby
As in, you can't write out one warc until the previous one is written
-
rewby
And that is entirely blocked on how long it takes to decompress
-
rewby
(Because, again, decompressing is used as a sanity check to see if the data is even a little valid)
-
rewby
So if we can decompress while the data is coming into the target and then once it's received properly into a rambuffer, immediately write it out to a megawarc
-
rewby
That'd be amazing
-
rewby
And much faster
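A rough sketch of the concurrency shape being described, assuming uploads are already buffered in RAM: many goroutines test-decompress mini WARCs in parallel while a single writer appends the verified (still compressed) buffers to the megawarc. Names and the plumbing are invented for illustration.
```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
	"sync"
)

type verified struct {
	name string
	data []byte // the still-compressed mini WARC, held in RAM
}

// validate test-decompresses one upload; the output goes nowhere, it is purely
// a sanity check, and only valid uploads are forwarded to the writer.
func validate(name string, compressed []byte, out chan<- verified, wg *sync.WaitGroup) {
	defer wg.Done()
	zr, err := gzip.NewReader(bytes.NewReader(compressed))
	if err == nil {
		_, err = io.Copy(io.Discard, zr)
	}
	if err != nil {
		fmt.Println("rejecting", name+":", err)
		return
	}
	out <- verified{name: name, data: compressed}
}

func main() {
	uploads := map[string][]byte{} // filled by the receiving side in reality
	out := make(chan verified)
	var wg sync.WaitGroup
	for name, data := range uploads {
		wg.Add(1)
		go validate(name, data, out, &wg) // one validator per upload, uses all the cores
	}
	go func() { wg.Wait(); close(out) }()

	// Single append point: the megawarc itself is still written sequentially.
	for v := range out {
		fmt.Println("appending", v.name, len(v.data), "bytes to the megawarc")
	}
}
```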
-
imer
Just have to be mindful of the decompressed size if you want to keep it all in memory, definitely need some sanity limits in place
-
rewby
You don't need to store the decompressed results
-
rewby
The current pipeline literally outputs to /dev/null
-
rewby
-
datechnoman
Ahhh my shit fell over
-
datechnoman
Thats why optane9 is quiet :P
-
rewby
datechnoman: The first part of this rant might be interesting to you
-
datechnoman
I just read through the whole convo
-
datechnoman
Good info
-
datechnoman
Good to know :)
-
datechnoman
I've gotta get some sleep but will get my cluster back up tomorrow to chew some data. Time to work on that reddit backlog once the queue is empty
-
imer
ooh, you'd stream-"decompress and parse" to see if it's all valid and then just append the original to megawarc, that's even better
-
imer
(simplified, as you said megawarc does a bit more?)
-
arkiver
imer: short summary on what megawarc does
-
arkiver
reads the compressed WARC into buffer
-
arkiver
uses that same buffer to:
-
arkiver
- check if it decompresses correctly
-
arkiver
- write to megaWARC (combined WARC)
-
arkiver
- extract some metadata from warcinfo record at start
-
arkiver
so it reads once and does those things
-
arkiver
writing to megaWARC happens while testing if the WARC is valid, so if we at some point conclude the WARC is invalid, we cut the megaWARC back to the size it had before we started appending this WARC to it
-
arkiver
an invalid WARC is then stored in a tar file, which is uploaded alongside the megaWARC (and should thus be empty if nothing went wrong)
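A hedged Go sketch of that append/validate/truncate dance: copy the compressed WARC into the megaWARC while test-decompressing the same bytes, truncate back on failure, and keep a small JSON index of offsets. File names, the index format, and the error handling are assumptions, not the actual megawarc factory code.
```go
package main

import (
	"compress/gzip"
	"encoding/json"
	"io"
	"os"
)

type indexEntry struct {
	Name   string `json:"name"`
	Offset int64  `json:"offset"`
	Size   int64  `json:"size"`
}

func appendWarc(mega *os.File, path string) (*indexEntry, error) {
	start, _ := mega.Seek(0, io.SeekEnd)
	src, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer src.Close()

	// Copy the compressed bytes into the megaWARC while feeding the same
	// bytes through a gzip reader to /dev/null as the validity check.
	zr, err := gzip.NewReader(io.TeeReader(src, mega))
	if err == nil {
		_, err = io.Copy(io.Discard, zr)
	}
	if err != nil {
		// Invalid WARC: cut the megaWARC back to its previous size.
		mega.Truncate(start)
		mega.Seek(start, io.SeekStart)
		return nil, err
	}
	end, _ := mega.Seek(0, io.SeekEnd)
	return &indexEntry{Name: path, Offset: start, Size: end - start}, nil
}

func main() {
	mega, _ := os.OpenFile("chunk-00001.megawarc.gz", os.O_CREATE|os.O_RDWR, 0o644)
	defer mega.Close()
	var index []indexEntry
	for _, f := range []string{"upload-a.warc.gz", "upload-b.warc.gz"} {
		if e, err := appendWarc(mega, f); err == nil {
			index = append(index, *e)
		} // invalid uploads would go to the error tar instead (not shown)
	}
	out, _ := json.Marshal(index)
	os.WriteFile("chunk-00001.megawarc.json", out, 0o644)
}
```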
-
imer
cool, thanks for explaining
-
imer
from what I understood (in an ideal future pipeline) rew_by wanted to decouple parsing warc -> writing to file so you can prepare multiple warcs in parallel and write to megawarc once determined to be good
-
imer
so just thinking out loud, I enjoy this stuff if you couldnt tell :)
-
arkiver
well I guess the parsing would be faster
-
imer
definitely want to look at helping out on the dev side once I have more time on my hands
-
arkiver
but it would double the disk IO?
-
arkiver
on the reading side
-
imer
not if you have it buffered in memory
-
arkiver
right
-
arkiver
so
-
arkiver
if we assume the WARCs still need to be written to disk, and the CPU part is taking much less time than the disk IO part, this may not help much
-
arkiver
was the plan to keep the megaWARC itself in memory as well?
-
myself
isn't the current process more-or-less single-threaded? Does it end up cpu-bound or i/o-bound?
-
arkiver
basically writing to the megaWARC is still happening sequentially, so if CPU is not a big part of this chain anyway, all we'll do with this is accumulate WARCs in memory waiting to be written to disk
-
arkiver
unless the megaWARC itself stays in memory and we can submit this very fast to IA (meaning sending a single WARC at a time is fast enough)
-
arkiver
myself: i think disk IO - but not sure. if disk IO bound, then yeah not sure if this will help much
-
JAA
arkiver: The long-term idea is to avoid writing the individual WARCs to disk at all. Instead, have long-running megawarcing processes, uploads go to a tmpfs, get queued to the megawarcing, and once confirmed written to the megawarc, the upload completes (which causes the worker to delete its copy). That halves the disk I/O.
-
arkiver
JAA: it would make us even more reliant on IA throughput
-
arkiver
is disk IO really the biggest problem here?
-
JAA
IA wouldn't come into the equation at all. The megawarc would still be written to the target's disk.
-
JAA
It just avoids the disk roundtrip of the miniwarcs.
-
arkiver
ah i see
-
arkiver
hmm
-
JAA
I mean, in theory, we could even avoid writing the megawarc to disk and do multi-part IA uploads, but that sounds like a bad idea.
-
arkiver
did you try out multi part uploads to IA?
-
JAA
Yeah. The uploads themselves are fine, but the completion is slow.
-
JAA
Every part gets written to the item server normally with an archive.php task, including snowballing.
-
JAA
It computes the checksums for each part etc.
-
JAA
Then the completion task runs individually and joins the parts.
-
arkiver
i see
-
arkiver
it may be reasonable in GB chunks
-
JAA
I'd need to check my task logs for what other annoying things I saw at the time.
-
JAA
But I basically concluded that it wasn't usable for anything at scale.
-
arkiver
i like the megaWARC writing idea
-
JAA
Yeah
-
arkiver
python does not have a ton of overhead for this i'd think?
-
arkiver
some, but not enough to cause problems, bottleneck would still be writing the file to disk
-
arkiver
and in case of very large files we'd start writing them to disk anyway and queue them up for direct archiving
-
JAA
Using Python for what?
-
arkiver
this idea
-
JAA
Which part of it? :-P
-
arkiver
receiving WARCs, processing them, writing the megaWARC to disk
-
JAA
Hmm, it might not be fast enough for the networking part.
-
JAA
Handling hundreds of connections with a total throughput in the gigabit/s isn't Python's strength.
-
arkiver
i wish i had more experience with languages that are not Python or C (or limited Lua)
-
JAA
See 3461553593 for an example task of multi-part completion that took over 1.5 hours. Bit bigger than our megawarcs, but still close enough to get an idea of how it might perform.
-
arkiver
ouch
-
arkiver
though
-
JAA
The double hashing + copying to the mirror server are the slow parts.
-
arkiver
performance greatly differs over time on these IA machines
-
arkiver
they're often over used
-
JAA
Well, and apparently concatenating the files also took almost 40 minutes there.
-
arkiver
hmm
-
arkiver
is it rewriting the entire thing?
-
JAA
Which isn't too surprising since it's reading and writing from the same HDD.
-
JAA
Well, yes, it has to.
-
arkiver
hmm
-
arkiver
i'm not sure if this is problematic
-
JAA
Each part upload results in one file in the spool dir, and then it essentially cats those to get the complete file.
-
arkiver
do you still have the code you used to do this?
-
JAA
ia-upload-stream in my little-things
-
arkiver
and can it handle checking the hash?
-
arkiver
confirming it
-
JAA
I've since implemented single-part upload because my experience was so meh.
-
arkiver
or wait it checks that for every part of course
-
arkiver
nvm
-
JAA
The client sends the Content-MD5 for each part, which gets checked by IA on receipt I believe.
-
arkiver
so this is 170 GB
-
arkiver
it doesn't look too bad to me honestly
-
JAA
Then IA copies the part from the S3 endpoint to the item's spool dir.
-
arkiver
it's not too bad to upload our megaWARC in, say, 1 GB chunks like this
-
JAA
Then it calculates its hashes, and copies it to the mirror server.
-
arkiver
yeah
-
JAA
And then on completion, it concats on the primary server, again calculates hashes, and again copies to the mirror.
-
arkiver
so doubles this
-
arkiver
but IA servers are usually not busy with disk IO stuff, CPU is nowadays the bottleneck
-
rewby
arkiver: We're not doing this in python
-
JAA
It'd be nice if the mirror server could also concat simultaneously and then the rsync would just transfer the timestamp etc.
-
arkiver
so yes it wastes resources a bit (but perhaps resources that would otherwise not be used). i wonder how much this would save us? if it's significant this is worth considering seriously
-
rewby
I have Opinions :tm:
-
JAA
But I guess that won't happen.
-
rewby
But I'm in a meeting
-
arkiver
JAA: maybe in the future
-
arkiver
or
-
arkiver
actually these things happen if they become problems ;)
-
arkiver
it's not a problem now
-
rewby
These issues with the megawarc are already a problem
-
arkiver
no one uses it, so no need to optimize
-
rewby
But again, will elaborate later
-
arkiver
thanks rewby whenever you have time
-
arkiver
(make sure to ping me please)
-
imer
JAA: IA S3 mirrors unfinished s3 upload parts? that's an odd decision, also afaik S3 spec doesn't force parts to be uploaded in order, but might be mistaken there
-
imer
so concat-as-you-go might not be a thing you can do
-
arkiver
imer: yes on all that
-
arkiver
it's possible just not implemented
-
JAA
imer: No, S3 doesn't mirror, but the archive.php task that copies data from S3 to the item servers does.
-
arkiver
also it'll still have to hash the individual pieces of course to confirm hashes
-
JAA
And I'm not suggesting concat-as-you-go, which is indeed not easy.
-
JAA
I don't think the MD5 calculation on the S3 endpoint is the bottleneck here.
-
JAA
And it calculates that anyway I believe to return an ETag (which you have to supply for completion).
-
JAA
(The ETag doesn't have to be a hash, of course, but it is in the case of IA.)
-
imer
so, upload parts to S3 -> S3 mp upload complete -> concat to item server + mirror? seems pretty optimized if you need the parts concat'ed into one file
-
JAA
Upload parts to S3, they get copied to item server and mirrored. Complete, they get merged on the item server and mirrored.
-
imer
mh, lots of copies. in the s3-compatible api i implemented recently it doesn't concat the file on completion, but assembles it transparently on-demand, but that's probably not feasible for IA's usecase
-
JAA
That would definitely be awkward, yeah.
-
imer
i'd say leave the uncompleted parts in a staging area and only concat to final storage once done, but thats got the tradeoff of losing data if staging area explodes (although not sure how much of a concern that really is if communicated right?)
-
imer
definitely not an easy problem
-
arkiver
for IA the most important part is data integrity and not losing data
-
arkiver
for that the current solution works well
-
imer
yep, sounds very reasonable
-
JAA
The S3 endpoints are basically the only place where data can be lost currently, as I understand it.
-
JAA
But perhaps those are backed by a RAID 1 at least.
-
JAA
For my own uploads, I don't trust it until the data shows up on the item itself and has been mirrored.
-
JAA
(I.e. until no archive.php tasks are pending on the item anymore)
-
imer
Thats why I like the "have the client wait until <upload> is secure" so much, if anything goes wrong on the path to final storage just respond with an error and the client can simply retry, makes reasoning about all the error cases a lot easier if you can (worst case) hand it back for retrying
-
imer
but yeah, tradeoffs
-
rewby
Okay
-
rewby
So
-
rewby
arkiver: here's your ping, prepare for a rant (JAA)
-
rewby
I have a number of issues with the current pipeline
-
rewby
Let us for a second ignore completely what data is being moved and look at it purely from a disk io perspective
-
rewby
Every byte that comes into the network interface gets written to disk, read and then written again and finally read once more
-
rewby
*however*
-
rewby
This is where filesystems and files come into play
-
rewby
We get say 100 - 400 parallel uploads
-
rewby
So 400 tiny files being written
-
rewby
Then read one by one
-
rewby
And written into say 1 big file
-
rewby
Which is then read (and sent to IA)
-
rewby
In theory fine, but consider the characteristics of this IO
-
rewby
The net -> packer part is entirely tiny files
-
rewby
Many of them
-
rewby
And packer -> net is just one big file
-
rewby
(mostly)
-
rewby
This means that at steady state, if we have 1gbps coming in
-
rewby
We are writing 2gbps to disk
-
rewby
With a mixture of tiny files and big ones
-
rewby
This workload is absolutely atrocious for hdds, too much seeking
-
rewby
With that many tiny files the hdds will seek to all hell and throughput suffers
-
rewby
Even setups with dozens of disks can only sustain 1-2 gbps of this
-
rewby
Whereas a single nvme drive will happily do more than 24 hdds can do in the tiny files department
-
rewby
But this brings another problem
-
rewby
nvmes (and other ssds) have limited write durability
-
rewby
On pro (and prosumer) gear you often get quoted a TBW number
-
rewby
Aka how many TB can you write to this disk before you start running into failures.
-
rewby
Depending on quality (and thus price) of the disk this can be from 400TB to 40PB
-
rewby
With price going up fairly exponentially as you approach the PB numbers
-
rewby
(At work we have a disk with a durability of like 30PB, but on the flip side those samsungs cost several grand)
-
rewby
If we combine this with the write amplification factor of 2
-
rewby
You may start seeing the problem
-
rewby
Even modest projects often end up around 100TB around here these days
-
rewby
and that's ignoring the really big ones like urls, shreddit, etc that really push the numbers
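Some illustrative back-of-the-envelope arithmetic for this point, assuming the write amplification factor of 2 mentioned above; the project sizes and endurance ratings below are example figures, not measurements.
```go
package main

import "fmt"

func main() {
	const writeAmplification = 2.0 // every byte in gets written twice before upload

	projects := map[string]float64{ // project sizes in TB, example figures
		"modest project": 100,
		"big project":    1000,
	}
	drives := map[string]float64{ // rated endurance in TBW, example figures
		"cheap consumer NVMe": 300,
		"prosumer NVMe":       800,
		"enterprise NVMe":     7000,
	}

	for p, size := range projects {
		written := size * writeAmplification
		fmt.Printf("%s: %.0f TB through a target means ~%.0f TB written to its SSD\n", p, size, written)
		for d, tbw := range drives {
			fmt.Printf("  that is %.0f%% of a %s rated for %.0f TBW\n", written/tbw*100, d, tbw)
		}
	}
}
```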
-
rewby
hel1, for example, has a really nice ssd
-
rewby
But even that's suffered: Data Units Written: 7,879,804,265 [4.03 PB]
-
rewby
However, that machine has been doing this for a while
-
rewby
But the real shocker is the reads:
-
rewby
Data Units Read: 1,165,376,309 [596 TB]
-
rewby
Consider that thanks to the wonderful linux kernel page cache, our ratio between reads and writes is 4
-
rewby
We write 4 times as much data to these disks as we read
-
rewby
In other words, this workload is the absolute worst for SSDs
-
rewby
Because 75% of the time we write data and then never read it
-
rewby
And simply overwrite it again later
-
rewby
(also consider this ssd is 500G and how many total drive rewrites this poor thing has suffered)
-
JAA
Fun fact: AB pipelines have basically the exact same problem. Lots of tiny files getting written, then read again, then appended to the WARC, then deleted. The read/write ratio is typically closer to 8 there though.
-
rewby
This means we have really hard requirements on these devices
-
rewby
They need to do *throughput* and deal with tons of writes
-
rewby
This makes it really hard to source hardware
-
rewby
The only way to reasonably acquire such fast disks without having to worry about our writes, is cloud
-
rewby
Which is why we use a lot of hetzner cloud
-
JAA
But yeah, if we streamed megawarcs to disk from memory-buffered miniwarcs, we could just use servers with a few HDDs and a bunch of RAM.
-
rewby
Yeah so that's where I was going
-
JAA
I.e. standard servers available everywhere.
-
rewby
My preferred solution is to take all the tiny high-iops stuff
-
rewby
And keep that in ram
-
rewby
And just get a machine with like 12 drives, set up one packer per drive
-
rewby
*or per pair
-
JAA
Yep, that'd be ideal.
-
rewby
And just write out ~100MB/s per drive(pair)
-
rewby
Even shite harddrives will happily sit at 100MB/s all day every day
-
rewby
And if they only ever write a single file
-
rewby
Then no seeking to worry about
-
JAA
I'd do mirrored pairs as a layer of protection against data loss.
-
rewby
As I said, drive pair
-
rewby
RAID1
-
JAA
Yup
-
rewby
But either way
-
rewby
It'd be infinitely better than this mess
-
rewby
Because there is another problem with the current pipeline
-
rewby
And that is space
-
rewby
which may sound stupid
-
rewby
But hear me out
-
JAA
I said in the past that I think this could be done with an rsync wrapper (like rrsync), but I don't believe that's true anymore. It would confirm the upload to the clients too soon. So we'd need to replace the uploads entirely. There's already curl uploads to HTTP targets in the code though, so that's not a big deal.
-
rewby
On disk we keep: 1. unfinished uploads, 2. files waiting to complete a chunk (up to 15G per chunk), 3. chunks waiting to be packed, 4. chunks currently packing, 5. the output of chunks currently packing (so essentially each currently packing chunk takes up 2x space), and 6. megawarcs to upload
-
rewby
Consider the following alternative pipeline:
-
threedeeitguy
Write mwarc > check for pipe out > either Read mwarc if free upload slot or start another one if space > repeat sounds nice.
-
rewby
1. Client uploads file to ram. 2. While uploading the decompression testing stuff happens (in parallel) 3. Once file is uploaded it is immediately written to megawarc (or error tar) 4. repeat previous steps until megawarc done 5. megawarc gets moved to uploaders
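A hedged Go sketch of what steps 1-3 of that pipeline could look like: the upload streams into a RAM buffer while being gzip-validated, broken uploads are rejected with an HTTP error so the item fails and gets retried, and the buffer would then be handed to the megawarc writer. Paths, limits, and status codes are assumptions.
```go
package main

import (
	"bytes"
	"compress/gzip"
	"io"
	"net/http"
)

const maxUpload = 10 << 30 // refuse anything we can't hold in RAM (made up)

func handleUpload(w http.ResponseWriter, r *http.Request) {
	var buf bytes.Buffer
	body := http.MaxBytesReader(w, r.Body, maxUpload)

	// Validate while receiving: every byte read into the RAM buffer is also
	// run through a gzip decoder whose output we throw away.
	zr, err := gzip.NewReader(io.TeeReader(body, &buf))
	if err == nil {
		_, err = io.Copy(io.Discard, zr)
	}
	if err != nil {
		// Broken upload: reject it outright instead of writing an error tar;
		// the worker's item fails and is retried later.
		http.Error(w, "invalid warc: "+err.Error(), http.StatusUnprocessableEntity)
		return
	}

	// buf now holds the verified compressed WARC in memory; it would be
	// handed to the megawarc writer, and "complete" only returned once it
	// has been appended to the megawarc on disk (not shown).
	_ = buf
	w.WriteHeader(http.StatusCreated)
}

func main() {
	http.HandleFunc("/upload", handleUpload)
	http.ListenAndServe(":8080", nil)
}
```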
-
rewby
This may not sound that much better
-
rewby
But
-
rewby
The current megawarc factory is singlethreaded
-
threedeeitguy
masive reduction in dupe data.
-
rewby
So the time needed to pack a single file is entirely dependent on how fast a single cpu core can gunzip or zstd -d
-
threedeeitguy
And allows you to go as wide as you have disks
-
rewby
So we need to often run 4 or more of these in parallel
-
rewby
Because server cpus aren't great on single thread
-
rewby
But excellent at having lots of cores
-
rewby
But this brings another problem
-
threedeeitguy
overclocked threadripper ftw
-
rewby
Lets assume a pack is 15G
-
rewby
Each packer needs space for 2 (in and out)
-
rewby
So each packer is 30G
-
rewby
If we need 4 of them, that's 120G
-
rewby
Consider that the hetzner vms we use have 160G of ssd in total
-
rewby
And that's a large number by cloud standards
-
rewby
And that's not even accounting for the parallel uploaders
-
rewby
Or the inbound
-
rewby
So not only does the current setup need nvmes, and fast ones
-
rewby
It also needs big ones to make good throughput
-
rewby
Because of how many packs we need to keep on disk in parallel
-
rewby
So this drives the cost up even further
-
rewby
Some of you may want to chime in with links to your favorite online store where a 1 or 2 TB nvme is only like 100-300 euro
-
rewby
But for those, I refer you to my point on TBW
-
rewby
Because if you check the spec sheets, you'll quickly see those numbers aren't very impressive on the cheap ssds
-
threedeeitguy
vs a 24 bay shelf that only needs cheapo hdds and could have 12 "workers"... I see the appeal.
-
rewby
Let's consider samsung's 980 pro line
-
rewby
Nobody here would argue them to be bad ssds I think (although one might argue about their firmware incident last year)
-
rewby
-
rewby
That's the data sheet
-
rewby
Under their warranty you see that a 500GB 980 pro is only warranted to 300TB
-
rewby
Experience tells me they'll do up to 600T before the controller conks out
-
rewby
I know that because we've blown 4 of them up
-
threedeeitguy
yeah. I've nearly killed my first 970s in just a few years of desktop use.
-
rewby
Died in the line of target duty
-
rewby
After like 4 months
-
rewby
Sure they're not expensive
-
rewby
But if you have to buy new ones every few months...
-
rewby
So this is another reason I like my proposed pipeline
-
JAA
Just manipulate the SMART data and exchange them on warranty. /s
-
rewby
It is much more able to deal with the nature of server cpus
-
rewby
Because our target hw usually has at least 32 cores
-
rewby
We can just barely use them because we don't have enough fast space to have enough packers running to actually use them
-
rewby
My proposal for a new pipeline basically goes from file uploaded to file written to megawarc in a matter of seconds
-
rewby
Because by the time the upload finishes, the decompression test is also done
-
imer
samsungs "enterprise" pm9a3 aren't too expensive anymore, you get 1 drive write/day for 5 years, so just under ~7pb for the 3.7tb one for example (for 280€), but everything you said still applies
-
rewby
imer: Not good enough
-
rewby
Consider
-
imer
yep.
-
rewby
4T per day
-
rewby
Is like 47MB/s
-
rewby
That's less than half a gig
-
rewby
That is *fuck all* bandwidth by our scale
-
imer
still gonna wear through it fast
-
imer
less fast than consumer gear, but still fast
-
rewby
Sure, but you see my point
-
rewby
The machines with the best record so far are the one running on optane (no shit) and the one running on kioxia enterprise drives that has pulled the silicon lottery
-
rewby
Because those are sat at 255% used
-
rewby
And still going
-
rewby
Again, in my ideal world we only write to disks once
-
rewby
We buffer all the inbound stuff in memory
-
rewby
And we do the validation of the inbound data "live"
-
rewby
So when you do a read(socket) of like 2MB, you also validate that 2MB
-
rewby
And that way we can really use the properly multicore stuff
-
rewby
Like, at that point looking into those ARM machines becomes interesting
-
JAA
We can also reject faulty uploads at that point rather than writing them to an error tar.
-
rewby
Yep
-
rewby
That was gonna be my other point
-
rewby
We can just decline broken files
-
rewby
And then cause the whole item to fail
-
rewby
So it gets retried later
-
imer
I was about to ask that as well hah
-
JAA
Yup, since these uploads would probably use HTTPS anyway, we have some nice status codes at our disposal. :-)
-
rewby
And of course, if we do this, one of my requirements would be to build in a lot of monitoring.
-
rewby
As in
-
rewby
Have counters for throughput
-
rewby
For rejected files
-
rewby
For parallel uploads
-
rewby
Etc
-
rewby
etc
-
threedeeitguy
imer Assuming 10GB/s paper napkin says 27 drives in z2 (on a single box) or 128k with a 5yr lifespan. And im pretty sure we do more than 10GB/s
-
rewby
(aka prometheus this shit up real good)
-
rewby
GB/s or gbps
-
rewby
Difference
-
rewby
I mostly measure in gigabit
-
rewby
Because in the end, from my perspective, network throughput is the end measure
-
rewby
A target is only as good as its steady-state (as in, it's not backlogging and receive/transmit are roughly balanced) network throughput
-
rewby
I don't particularly care if a project ends up being 1TB or 1PB
-
rewby
If it's 1TB being done over the span of a day
-
rewby
Or 1PB over the span of 1000 days
-
rewby
Both are the same to me
-
rewby
And need equal capacity
-
rewby
Just for different durations
-
rewby
My capacity planning is purely on throughput
-
threedeeitguy
GB/s with lots of rounding for easy maths. I figured it's such an outlandish number that it's not worth doing accurately. Also I should not be trusted with a calculator, fixed maths below.
-
threedeeitguy
Assuming 10GB/s at the disk and 50MB/s per drive then paper napkin says 27 drives in z2 (on a single box) at a cost of 7.5k with a 5yr lifespan. And I'm pretty sure we do more than 10GB/s
-
rewby
optane9 likes to do 5gbps
-
rewby
that's gigabit
-
rewby
So call it like 500-600 MB/s
-
imer
threedeeitguy: yeah, there's no point. I was just saying you *can* get ~10x better ones (compared to consumer drives) for not much more money nowadays, 10x better isn't good enough though
-
threedeeitguy
Yeah. Just gotta cut that cost by 10x ;)
-
rewby
If we go real fast, a goal would be 10gbps because everything I get these days is 10gbps (or more)
-
rewby
*per target
-
rewby
So 1.25GB/s is the threshold number
-
rewby
So you'd need about 25 worth of throughput
-
rewby
However, a raidz2 wouldn't work
-
rewby
I think
-
rewby
The reason is how raidz2 works
-
rewby
raidz2 is essentially a parity system
-
rewby
Which exploits a property of xor
-
rewby
Consider a xor b = c
-
rewby
If you lose any single one of a b or c, you can xor the remaining 2 to recreate the missing one
-
rewby
This is the basic premise of raid5 and raidz1
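A tiny demonstration of that XOR property; the block contents are arbitrary.
```go
package main

import "fmt"

// xorBlocks XORs two equal-length blocks byte by byte.
func xorBlocks(x, y []byte) []byte {
	out := make([]byte, len(x))
	for i := range x {
		out[i] = x[i] ^ y[i]
	}
	return out
}

func main() {
	a := []byte("data block A")
	b := []byte("data block B")
	parity := xorBlocks(a, b) // c = a XOR b, stored on the parity disk

	// Pretend disk A died: rebuild it from b and the parity.
	rebuiltA := xorBlocks(b, parity)
	fmt.Printf("rebuilt A: %q\n", rebuiltA)
}
```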
-
rewby
Raidz2/raid6 just adds a second, independently computed parity on top of that
-
rewby
Either way, you're bound to the performance of a single disk
-
rewby
All that is to say, you definitely can make the current system work
-
rewby
But it requires a lot of ssds
-
rewby
And expensive ones
-
rewby
We work around some of this with hetzner targets by just having lots of them
-
rewby
Because cloud
-
rewby
Sure, optane9 can do like 3gbps
-
rewby
And to equal that I need like 6 hetzner vms
-
rewby
Or even more
-
threedeeitguy
be interesting to see what cpu reqs are to decompress at 10gbps
-
rewby
But at least I can get the hetzner vms
-
rewby
But if we could use our resources more efficiently
-
rewby
That'd be so amazing
-
rewby
Like, ram is no problem
-
imer
-
rewby
We don't run on latest gen hardware
-
rewby
Most of it is late ddr3 or early ddr4
-
rewby
The industry is moving to ddr5
-
rewby
Both of these are getting stupidly cheap
-
rewby
For 300 usd you can easily get 256G of ram
-
rewby
Even ddr4
-
rewby
ddr3 is even cheaper
-
rewby
So I'm not too worried about exhausting ram
-
rewby
All of our dedicated boxes already have 128G or more
-
rewby
And since we can ship whatever uploader code we want
-
rewby
We don't need to be restricted to just "what can curl do" or "what can rsync do"
-
rewby
If we find we need to do multipart uploads
-
rewby
Or build in some upload resuming or whatever
-
rewby
No problem
-
threedeeitguy
Yeah I saw a full tray of 24 16gb ddr3 modules for £99 earlier today. Crazy.
-
rewby
My point exactly
-
rewby
Ram is cheap and performs well
-
rewby
In the design doc that I'm currently trying to find
-
rewby
I had an idea of the first request to a target doing a couple of things
-
rewby
It would allocate the buffer in ram
-
rewby
And reject you if no more space
-
rewby
And also return you a "permalink" of sorts to that target
-
rewby
So if we stick a loadbalancer in front of things, you can basically bypass that after the first request
-
rewby
And always end up at the same target to finish your upload
-
rewby
Additionally, by not accepting uploads for which we have no space, we don't get a pile of tiny temp files everywhere
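A hypothetical sketch of that first-request flow: the target either promises RAM for the upload and returns a direct URL to itself, or says no up front. The header name, RAM budget, hostname, and path are all made up, and releasing the reservation after the upload is not shown.
```go
package main

import (
	"fmt"
	"net/http"
	"strconv"
	"sync/atomic"
)

const ramBudget = 64 << 30 // bytes of RAM this target is willing to promise

var reserved atomic.Int64

func handleReserve(w http.ResponseWriter, r *http.Request) {
	size, err := strconv.ParseInt(r.Header.Get("X-Upload-Size"), 10, 64)
	if err != nil || size <= 0 {
		http.Error(w, "missing or bad X-Upload-Size", http.StatusBadRequest)
		return
	}
	if reserved.Add(size) > ramBudget {
		reserved.Add(-size) // roll back; tell the worker to try another target
		http.Error(w, "no buffer space here, try elsewhere", http.StatusServiceUnavailable)
		return
	}
	// "Permalink" straight to this target, bypassing the load balancer for
	// the rest of this upload's requests.
	fmt.Fprintf(w, "https://target-07.example.org/upload/%d\n", size)
}

func main() {
	http.HandleFunc("/reserve", handleReserve)
	http.ListenAndServe(":8080", nil)
}
```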
-
rewby
This also has another advantage
-
rewby
If some project is a mix of tiny and huge files
-
rewby
We can deploy targets of mismatched capacities
-
rewby
Where only some of them can handle the big ones
-
rewby
And it'd all be fine
-
threedeeitguy
I was pondering that earlier. Central orchestrator that hands out links to workers saying go here there's space waiting for you vs dumb lb and retry if you hit a full one
-
rewby
Because the smaller ones would just say "nah bro, try someone else"
-
rewby
Yeah, that'd be ideal, but you still need this mechanism
-
rewby
My original design had this
-
JAA
Yep, and we could also have a slow separate target which just writes the individual uploads to disk anyway, for really big items.
-
rewby
You sent the first request to the orchestrator, it'd sort the list of targets by "who can likely fit this"
-
rewby
And then try that list until it finds someone who can
-
rewby
That way if a target filled up between the last list update and "now" it'd handle it the less efficient but still good way
-
rewby
Yep
-
rewby
Indeed JAA
-
rewby
That was part of my thoughts
-
rewby
For those cases where we have truly huge files
-
rewby
We just have one box that writes to disk anyways
-
rewby
And be the target of last resort
-
rewby
You can get really creative with dispatching logic once you can guarantee that sending someone to a target will actually complete an upload
-
rewby
And when you have accurate accounting of how much space has already been "allocated" to in-flight
-
rewby
Because a target with 10GB free and 10 1G files in flight is as good as full
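A small sketch of that accounting, assuming a simple reserve/release counter per target: an upload is only admitted if free space minus everything already in flight still fits it. All names and numbers are illustrative.
```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

type target struct {
	mu       sync.Mutex
	free     int64 // bytes currently free on this target
	inFlight int64 // bytes already promised to in-progress uploads
}

var errFull = errors.New("target effectively full")

// reserve either claims space for an upload or rejects it up front, so a
// target with 10G free and 10x1G in flight correctly reports itself as full.
func (t *target) reserve(size int64) error {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.free-t.inFlight < size {
		return errFull
	}
	t.inFlight += size
	return nil
}

// release is called when the upload finishes (committed) or is aborted.
func (t *target) release(size int64, committed bool) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.inFlight -= size
	if committed {
		t.free -= size
	}
}

func main() {
	t := &target{free: 10 << 30} // 10 GiB free
	for i := 0; i < 11; i++ {
		err := t.reserve(1 << 30) // eleven 1 GiB uploads; the last is refused
		fmt.Printf("upload %d: %v\n", i, err)
	}
}
```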
-
JAA
-
rewby
Yeah
-
rewby
It's not the best doc
-
rewby
And probably definitely needs improving
-
JAA
It exists. That's better than most docs around here. :-P
-
rewby
If we want to do this properly, should be a etherpad or wiki page
-
rewby
And people can spend some time proposing edits and such
-
rewby
One thing this design doc also did which I think is nice is track which items are in each warc and who uploaded them
-
rewby
Because in the past if we had to go find out who did what or where corrupt data came from it was always a pain
-
rewby
If people are willing to actually help make this a reality, feel free to reach out
-
rewby
I'm perpetually overworked myself and thus never got around to properly doing this. I have some prototype work that's incomplete and doesn't work
-
rewby
I'm more than happy to mentor someone on the details of how targets work and to help flesh out ideas
-
rewby
I have one note though, no python or nodejs
-
rewby
Both are atrocious to deal with when doing things that really need multithreading
-
rewby
And their package management makes me cry
-
rewby
And also, I have a preference for languages with a good static type system for a few reasons
-
imer
so rust?
-
threedeeitguy
Works for me.
-
rewby
Something like go or rust is fine for me
-
rewby
Reason being: You can verify invariants much easier
-
rewby
You can get a variable in and be sure what you can and cannot do with it
-
rewby
If I need to know what operations I can do on something, or what something is, figuring that out with python is awful
-
threedeeitguy
rust has been my first foray into anything lower than c# or typescript and I kinda love it.
-
rewby
Because I need to trace that through the whole call tree just to figure out what fields it has
-
rewby
I'm really into rust since early last year as well
-
threedeeitguy
Python is my own personal hell.
-
rewby
But I know some of our other tooling is golang
-
rewby
I enjoy rust's type system
-
rewby
But also acknowledge its async is painful
-
imer
rewby: guessing this is gonna be a more long term thing, so i'll throw my name in the hat (currently still too busy with stuff for serious commitments, but that should relax soon)
-
rewby
It's a more long term thing yea
-
rewby
I don't expect this to be done in a week or even a month
-
rewby
But atm nobody is working on it
-
imer
definitely more familiar with rust here (and yes, async is painful sometimes)
-
rewby
I'm too busy patching holes and keeping up with *gestures at imgone, shreddit, etc*
-
threedeeitguy
I helped a friend through the coding for his astrophysics degree. They had to use python for everything. It was soooooo slow. Doing the same thing in a SQL query or with c# was hundreds or thousands of times faster for some of the problems.
-
rewby
As I said, if there's interest, we can set up a group (channel) for the development (or use -dev) and get people set up with some repos and guidance
-
rewby
And get some documentation space set up and such
-
rewby
(I am a team lead for infrastructure on my day job, all too familiar with this stuff)
-
threedeeitguy
I have plenty of time, persuading my brain to focus is another matter. Would be more than game for giving it a go though.
-
rewby
If y'all are willing to give it a go, I can make a kanboard happen with some basic tasks
-
rewby
And whomever feels like it can pick a task and do it
-
threedeeitguy
Sounds good.
-
imer
sure
-
fireonlive
!help
-
h2ibot
fireonlive: The following commands are available: (for '')
-
h2ibot
fireonlive: !help: Print this help message. (for '')
-
h2ibot
fireonlive: !a: Deduplicate and archive a list of URLs hosted on transfer.archivete.am. CAREFUL, DDOS. (for '')
-
fireonlive
hmm
-
fireonlive
-
h2ibot
fireonlive: Invalid privileges, need one of ('@', '+').
-
fireonlive
ah ok
-
JAA
I archived it with AB already, and also the other links in the post.
-
fireonlive
ah, thanks!
-
fireonlive
always on the ball :)