-
nimaje
don't you normally want to avoid a thundering herd, and do that by using randomized delays and exponential backoff? why is it intended there?
-
rewby
Basically, it has to do with how backpressure works on these
-
rewby
When disks fill up beyond 80%, they stop accepting new connections
-
rewby
And then once they drop back below a safe level they start accepting new uploads again
-
rewby
At 100% usage of available bandwidth, you will see the disks fill up, then it chews on it again, and then when it has space, it's immediately full and has enough to chew on again
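A minimal Go sketch of the watermark behaviour described here: stop accepting uploads above a high watermark, start again once usage drops back to a safe level. The 80%/70% thresholds, the path, and the gate function are illustrative assumptions, not the real target scripts.
```go
package main

import (
	"fmt"
	"syscall"
)

const (
	highWatermark = 0.80 // stop accepting new uploads above this usage
	lowWatermark  = 0.70 // start accepting again once usage drops below this
)

// diskUsage returns the used fraction of the filesystem at path.
func diskUsage(path string) (float64, error) {
	var st syscall.Statfs_t
	if err := syscall.Statfs(path, &st); err != nil {
		return 0, err
	}
	total := float64(st.Blocks) * float64(st.Bsize)
	avail := float64(st.Bavail) * float64(st.Bsize)
	return 1 - avail/total, nil
}

// updateGate applies the hysteresis: close above the high watermark, reopen
// only once usage has fallen back below the low watermark.
func updateGate(accepting bool, usage float64) bool {
	switch {
	case accepting && usage >= highWatermark:
		return false
	case !accepting && usage <= lowWatermark:
		return true
	default:
		return accepting
	}
}

func main() {
	usage, err := diskUsage("/data/incoming") // hypothetical upload spool
	if err != nil {
		panic(err)
	}
	accepting := updateGate(true, usage)
	fmt.Printf("usage=%.0f%% accepting=%v\n", usage*100, accepting)
}
```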
-
rewby
This way no part of a target is underutilized
-
rewby
It basically never runs out of things to process
-
rewby
So that way the targets are running as fast as they can processing as much as they can
-
rewby
If people scale down below this level, then there's bandwidth/throughput going spare
-
rewby
Because the only way you don't get this behaviour out of targets is to underutilize them
-
rewby
There's asterisks on this of course
-
rewby
If you're scaling up from "not enough capacity" to "more than enough" it actually makes sense to do some limiting for a bit to smooth the flow of data so everything can steady state properly
-
rewby
But I can do that from the tracker, that's not something worker runners should need to bother themselves with
-
rewby
And also, there's another asterisk here in that some targets actually can process faster than they have network capacity
-
rewby
So they'll never really back up usually
-
rewby
Because you literally cannot write to them fast enough to trigger this
-
rewby
But even that asterisk has asterisks
-
rewby
Because they very much can back up if some worst-case behaviour of various programs and connections happens
-
rewby
Example being if the IA is overloaded, then yes they can't upload as fast as you load into them
-
rewby
Or if temp files pile up due to rsync edge cases
-
imer
I'd hazard a guess the actual number of "can I write to you now" requests isn't that high either, so risk of self-ddos is low?
-
rewby
There's a 400 conn hard limit anyway
-
rewby
If it rejects a conn due to a "max connections reached (-1)", it's because the disks are full
-
rewby
If it's "max connections reached (400)" it's just you lost the lottery of the herd
-
rewby
And the target has said "this many people, no more"
-
rewby
And even without that, these targets are all nvme mostly
-
rewby
They can handle thousands of parallel connections no problem
-
rewby
My record is somewhere in the thousands per second range
-
rewby
(On a single target)
-
rewby
I appreciate that people are trying to help by going "oh errors, I should slow down"
-
rewby
But the thing is, targets are weird
-
arkiver
so for "why should we keep trying to upload?" rewby explains the reasons. and "why isn't keeping trying to upload a problem?" imer notes the reason
-
rewby
Generally, just keep going
-
rewby
Let them break tm
-
rewby
I'll deal with the mess when I get to it
-
rewby
And sometimes it's just a bit of peak load
-
rewby
and the system will process it eventually
-
rewby
arkiver: Imer is sort of yes, sort of no with this.
-
rewby
I've definitely self-ddosed targets
-
rewby
The thing is, the limiting factor on them is not parallel connections
-
rewby
Or even rsyncs per second
-
arkiver
challenge accepted :P
-
rewby
Sure, it's more efficient to do bigger uploads
-
arkiver
inb4 someone physically goes in and breaks rewby's stuff :)
-
imer
in the context of why no exponential backoff: "cause it's probably not needed"
-
rewby
It's probably not needed and actually gets in the way of the reason I noted above for wanting 100% util
-
rewby
Exponential backoff would settle below 100% load
-
arkiver
yeah
-
rewby
So just let 'em bang on the targets
-
rewby
They can defend themselves
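For contrast with exponential backoff, a rough Go sketch of the retry behaviour being recommended: a short, roughly constant delay with a little jitter, so workers are knocking again the moment the target frees up instead of settling below full utilisation. uploadOnce, the error, and the delay values are all placeholders.
```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

var errConnLimit = errors.New("max connections reached")

// uploadOnce is a stand-in for the real rsync/HTTP upload attempt; here it
// pretends the target turns us away a couple of times before there is room.
func uploadOnce(attempt int) error {
	if attempt < 3 {
		return errConnLimit
	}
	return nil
}

func main() {
	for attempt := 0; ; attempt++ {
		if err := uploadOnce(attempt); err == nil {
			fmt.Println("upload accepted on attempt", attempt)
			return
		}
		// Short fixed-ish delay with a little jitter, and no exponential
		// growth, so the worker is back as soon as space frees up.
		delay := 5*time.Second + time.Duration(rand.Intn(5))*time.Second
		fmt.Printf("attempt %d rejected, retrying in %v\n", attempt, delay)
		time.Sleep(delay)
	}
}
```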
-
imer
thanks for explaining, very insightful :)
-
rewby
Let me know if it's happening for an extended period of time
-
rewby
Because that usually indicates something I need to look into, but it's not a reason for you to stop
-
rewby
Usually that thing is "I need to get more capacity here"
-
arkiver
hopefully we'll be in calm waters again soon if nothing else pops up
-
arkiver
these 2 months are extreme
-
rewby
Alternatively, someone has just dumped a large pile of urls into the queue and there's a quick surge of stuff. (👀 arkiver)
-
arkiver
oops
-
arkiver
:P
-
rewby
Eh it's fine
-
arkiver
oh right
-
rewby
Just takes an hour or so to suck down
-
rewby
I don't bother scaling for just a single dump
-
rewby
So I'll just let y'all marinade in the -1 errors until it's swallowed them all
-
arkiver
we have a ton of URLs to go through that have been stashed away from the current reddit situation
-
arkiver
but we'll go through those slowly later
-
rewby
Eh, reddit has dedicated targets
-
rewby
So you can blow this up all you want
-
arkiver
i mean outlinks from reddit that will be queued here eventually
-
arkiver
and recently the queue increases have been due to spam problems
-
arkiver
but those seem to have largely gone away now (maybe new ones soon)
-
rewby
I consider that a "y'all" problem. :P
-
rewby
I'll upload whatever you feed me
-
arkiver
:P
-
arkiver
FYI those reddit outlinks are currently stashed here
tracker.archiveteam.org/urls-stash-reddit
-
rewby
I honestly want to see what we can get optane9 up to
-
rewby
It's been doing 5gbps fairly stably the last few days
-
rewby
Down to 3 atm
-
arkiver
pretty awesome!
-
arkiver
lot of data
-
arkiver
expensive though
-
rewby
?
-
rewby
For the IA, yes
-
rewby
optane9 is a dedicated hardware target
-
rewby
It's not leased or anything
-
rewby
I know the owner of the rack and the ISP that provides network to it
-
rewby
More amused at the large numbers than anything.
-
rewby
Outbound to IA is peered anyway
-
rewby
Especially v6 traffic is ezpz
-
arkiver
yeah was mostly talking about IA
-
arkiver
how long has optane9 been around?
-
rewby
Uh. A while
-
rewby
Year or two?
-
rewby
It's been almost exactly a year since I last did a reinstall on this box
-
threedeeitguy
related question, how much ram do the targets tend to run? Lots of ideas rattling around and all of them bad :p
-
rewby
It used to be known as 9inch
-
rewby
No we don't do ramdisks
-
rewby
It's been discussed
-
rewby
optane9's solution is perhaps easy to guess
-
arkiver
rewby: right! yes i knew this as 9inch then
-
rewby
It disappeared for a bit due to a hardware issue
-
rewby
Then was brought back
-
rewby
And we called it optane9 because that was slightly less crude and didn't raise eyebrows when someone visited the rack
-
imer
"It disappeared..." just like data on a ramdisk? ;)
-
rewby
No, we actually emptied it properly
-
rewby
More of a thing that it had issues with some parts of it so it wasn't safe to run
-
rewby
And, well, 2022 shipping of computer parts
-
rewby
Need I say more
-
rewby
In consumer space, not as bad
-
rewby
Enterprise stuff like servers tho
-
rewby
hoo boy
-
rewby
I think we ended up getting parts from a decommed server to fix this one up
-
threedeeitguy
rewby not a ram disk. I've been digging into the couple of rust rsync implementations that are out there. As I said, many bad ideas 😝
-
rewby
Rsync's not the issue
-
rewby
Megawarc factory is
-
rewby
rsync is literally the least bottlenecked part of this
-
rewby
Sure, it's rsync that gives you "errors"
-
rewby
But again, that's literally because the backpressure scripts we have instruct it to send those
-
rewby
If you see an rsync connection limit error, it's literally never an rsync problem
-
arkiver
rewby: you've done some work on making various parts of the megaWARC factory less resource intensive right?
-
arkiver
for example the one moving files around
-
rewby
I've done chunker yeah
-
rewby
I can't do much about megawarc itself at the moment
-
rewby
The main issue is just that it's all mediated by the file system
-
rewby
And there's no feedback after rsync completes
-
rewby
The moment the upload itself finishes, I *cannot* lose the file
-
rewby
So it all has to be persisted to disk
-
rewby
And due to the amount of tiny files and raw throughput, this needs NVMe SSDs
-
rewby
(Or a very large amount of sata SSDs, and even then you can't really top the nvmes)
-
rewby
This is all fine and dandy
-
rewby
Except
-
rewby
SSDs have limited write cycles
-
rewby
Consider every TB that goes through a target gets written to the ssds
-
rewby
Consumer SSDs have total write durabilities of less than half a petabyte
-
rewby
Prosumer ones usually approach 800TB
-
rewby
Maybe 1PB if you've got a good one
-
rewby
Our record is blowing out an NVMe in 3 months
-
rewby
And when I say blow out, I mean "it's gone so far the controller's failed on it"
-
rewby
The only way to actually deal with this is with fancy enterprise SSDs that are a) very very expensive and b) have a huge amount of non-provisioned capacity to allow them to have write durability in petabytes
-
rewby
Or special types like optane which are designed to have insanely low latency and super high durability
-
rewby
I've written up a design doc in the past of what I want in a new pipeline
-
threedeeitguy
I understand that rsync is not the limit. But if you controlled the rsync implementation then surely it would be possible to tie the tracker and the targets together so that the targets understand what data they are receiving. Something like the tracker issues a batch ID that the worker then has to provide to the target before upload is allowed
-
threedeeitguy
to commence. Data then has an extra state so instead of Upload > Tracker Done and now must not lose this file we have Upload > in memory > Tracker *Done* > Megawarc factory in memory > Megawarc shipped to disk > Tracker Done for realzies.
-
rewby
Yeah but like, at that point why use rsync
-
rewby
We can ship custom clients
-
rewby
We control the client side too
-
rewby
And using http(s) instead actually makes things easier since then you can use cool stuff like QUIC/HTTP3 and normal load balancers
-
threedeeitguy
True. Mind sharing the design doc?
-
rewby
Uh. I'll have to do some digging.
-
rewby
Remind me in 3 hours okay?
-
rewby
I have a meeting coming up I need to deal with
-
threedeeitguy
yep, np.
-
rewby
The basic idea was "instead of waiting for a whole chunk to complete, just write them out to disk in one go"
-
rewby
You can write out megawarcs incrementally
-
rewby
So don't return "complete" until it's written out as a megawarc
-
rewby
This saves a lot of resources
-
rewby
And also saves SSDs a lot because it's only a single write at that point
-
rewby
And also far fewer parallel IO requests
-
rewby
Instead of thousands of little IO requests a second
-
rewby
It's just like 3-4 parallel megawarcs writing out
-
threedeeitguy
ah nice. I've been reading over the warc spec but hadn't got as far as megawarcs, being able to stream into it is good.
-
rewby
Which means you can start using hdds and such more efficiently
-
rewby
The thing about warcs
-
rewby
They're a record based format
-
rewby
So the unit is the record
-
rewby
Be it a request record
-
rewby
Or a response
-
rewby
The compression is *also* record based
-
rewby
Each record is individually compressed
-
rewby
This way the WBM can seek into files
-
rewby
And accessing a record is an O(1) operation instead of O(n)
-
rewby
megawarcs are literally nothing more than just concatenated warcs
-
rewby
(asterisk)
-
rewby
(There's some details around compression dictionaries)
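A small Go demonstration of the property that makes this work: independently compressed gzip members can simply be concatenated and the result is still one valid stream, while each member remains readable on its own given its offset. The WARC record contents here are just stand-ins.
```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// gzipMember compresses one record as an independent gzip member.
func gzipMember(record string) []byte {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	zw.Write([]byte(record))
	zw.Close()
	return buf.Bytes()
}

func main() {
	// Two "records", compressed independently, then simply concatenated,
	// which is essentially what a megawarc is at the byte level.
	mega := append(gzipMember("WARC/1.1 request record\r\n"),
		gzipMember("WARC/1.1 response record\r\n")...)

	// The concatenation still decompresses end to end as one stream.
	zr, err := gzip.NewReader(bytes.NewReader(mega))
	if err != nil {
		panic(err)
	}
	out, _ := io.ReadAll(zr)
	fmt.Printf("%q\n", out)
}
```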
-
rewby
The main thing the megawarc factory does is read all the files in the chunk and decompress them as a quick sanity check and then write the compressed data to the end of the megawarc
-
rewby
And does a quick bit of record keeping in a json file to help locate what parts of the megawarc came from where
-
rewby
But the primary action is just a fancy concat
-
rewby
(With zstd ones it outputs a skippable frame with the dict first)
-
threedeeitguy
I like big disks and I cannot lie. Some of the ztsd stuff makes my head hurt, my background is very much SharePoint and higher level ms stuff so this is all new (and much more interesting :) )
-
rewby
Welcome to low level bullshit
-
rewby
Where it's all about doing as much as we can with as little hardware as possible
-
imer
i've done something similar recently, having the client wait until the data has settled in its final state is definitely the way to go if the client can retry on failure/hold until processing is done
-
imer
less ways for data to get lost as well if it's either ready or not and needs to be retried from scratch
-
rewby
Yeah
-
rewby
Waiting for the megawarc to complete isn't reasonable
-
rewby
Some projects do like one megawarc per week
-
rewby
But waiting for the data to be committed to a megawarc on disk should be very doable
-
rewby
And that means we can rambuffer all the inbound stuff
-
rewby
Most targets have 128G or more ram if they're hardware
-
rewby
Or 16 if VM targets
-
imer
yep, waiting for it to be uploaded to IA would be great, no way of losing data then, but not really realistic :)
-
rewby
It's also worth pointing out the current pipeline is very single threaded
-
rewby
As in, you can't write out one warc until the previous one is written
-
rewby
And that is entirely blocked on how long it takes to decompress
-
rewby
(Because, again, decompressing is used as a sanity check to see if the data is even a little valid)
-
rewby
So if we can decompress while the data is coming into the target and then once it's received properly into a rambuffer, immediately write it out to a megawarc
-
rewby
That'd be amazing
-
rewby
And much faster
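A rough sketch of the concurrency shape being described, assuming uploads are already buffered in RAM: many goroutines test-decompress mini WARCs in parallel while a single writer appends the verified (still compressed) buffers to the megawarc. Names and the plumbing are invented for illustration.
```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
	"sync"
)

type verified struct {
	name string
	data []byte // the still-compressed mini WARC, held in RAM
}

// validate test-decompresses one upload; the output goes nowhere, it is purely
// a sanity check, and only valid uploads are forwarded to the writer.
func validate(name string, compressed []byte, out chan<- verified, wg *sync.WaitGroup) {
	defer wg.Done()
	zr, err := gzip.NewReader(bytes.NewReader(compressed))
	if err == nil {
		_, err = io.Copy(io.Discard, zr)
	}
	if err != nil {
		fmt.Println("rejecting", name+":", err)
		return
	}
	out <- verified{name: name, data: compressed}
}

func main() {
	uploads := map[string][]byte{} // filled by the receiving side in reality
	out := make(chan verified)
	var wg sync.WaitGroup
	for name, data := range uploads {
		wg.Add(1)
		go validate(name, data, out, &wg) // one validator per upload, uses all the cores
	}
	go func() { wg.Wait(); close(out) }()

	// Single append point: the megawarc itself is still written sequentially.
	for v := range out {
		fmt.Println("appending", v.name, len(v.data), "bytes to the megawarc")
	}
}
```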
-
imer
Just have to be mindful of the decompressed size if you want to keep it all in memory, definitely need some sanity limits in place
-
rewby
You don't need to store the decompressed results
-
rewby
The current pipeline literally outputs to /dev/null
-
rewby
-
datechnoman
Ahhh my shit fell over
-
datechnoman
Thats why optane9 is quiet :P
-
rewby
datechnoman: The first part of this rant might be interesting to you
-
datechnoman
I just read through the whole convo
-
datechnoman
Good info
-
datechnoman
Good to know :)
-
datechnoman
I've gotta get some sleep but will get my cluster back up tomorrow to chew some data. Time to work on that reddit backlog once the queue is empty
-
imer
ooh, you'd stream-"decompress and parse" to see if it's all valid and then just append the original to megawarc, that's even better
-
imer
(simplified, as you said megawarc does a bit more?)
-
arkiver
imer: short summary on what megawarc does
-
arkiver
reads the compressed WARC into buffer
-
arkiver
uses that same buffer to:
-
arkiver
- check if it decompresses correctly
-
arkiver
- write to megaWARC (combined WARC)
-
arkiver
- extract some metadata from warcinfo record at start
-
arkiver
so it reads once and does those things
-
arkiver
writing to megaWARC happens while testing if the WARC is valid, so if we at some point conclude the WARC is invalid, we cut the megaWARC back to the size it had before we started appending this WARC to it
-
arkiver
an invalid WARC is then stored in a tar file, which is uploaded alongside the megaWARC (and should thus be empty if nothing went wrong)
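A hedged Go sketch of that append/validate/truncate dance: copy the compressed WARC into the megaWARC while test-decompressing the same bytes, truncate back on failure, and keep a small JSON index of offsets. File names, the index format, and the error handling are assumptions, not the actual megawarc factory code.
```go
package main

import (
	"compress/gzip"
	"encoding/json"
	"io"
	"os"
)

type indexEntry struct {
	Name   string `json:"name"`
	Offset int64  `json:"offset"`
	Size   int64  `json:"size"`
}

func appendWarc(mega *os.File, path string) (*indexEntry, error) {
	start, _ := mega.Seek(0, io.SeekEnd)
	src, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer src.Close()

	// Copy the compressed bytes into the megaWARC while feeding the same
	// bytes through a gzip reader to /dev/null as the validity check.
	zr, err := gzip.NewReader(io.TeeReader(src, mega))
	if err == nil {
		_, err = io.Copy(io.Discard, zr)
	}
	if err != nil {
		// Invalid WARC: cut the megaWARC back to its previous size.
		mega.Truncate(start)
		mega.Seek(start, io.SeekStart)
		return nil, err
	}
	end, _ := mega.Seek(0, io.SeekEnd)
	return &indexEntry{Name: path, Offset: start, Size: end - start}, nil
}

func main() {
	mega, _ := os.OpenFile("chunk-00001.megawarc.gz", os.O_CREATE|os.O_RDWR, 0o644)
	defer mega.Close()
	var index []indexEntry
	for _, f := range []string{"upload-a.warc.gz", "upload-b.warc.gz"} {
		if e, err := appendWarc(mega, f); err == nil {
			index = append(index, *e)
		} // invalid uploads would go to the error tar instead (not shown)
	}
	out, _ := json.Marshal(index)
	os.WriteFile("chunk-00001.megawarc.json", out, 0o644)
}
```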
-
imer
cool, thanks for explaining
-
imer
from what I understood (in an ideal future pipeline) rew_by wanted to decouple parsing warc -> writing to file so you can prepare multiple warcs in parallel and write to megawarc once determined to be good
-
imer
so just thinking out loud, I enjoy this stuff if you couldnt tell :)
-
arkiver
well I guess the parsing would be faster
-
imer
definitely want to look at helping out on the dev side once I have more time on my hands
-
arkiver
but it would double the disk IO?
-
arkiver
on the reading side
-
imer
not if you have it buffered in memory
-
arkiver
right
-
arkiver
so
-
arkiver
if we assume the WARCs still need to be written to disk, and the CPU part is taking much less time than the disk IO part, this may not help much
-
arkiver
was the plan to keep the megaWARC itself in memory as well?
-
myself
isn't the current process more-or-less single-threaded? Does it end up cpu-bound or i/o-bound?
-
arkiver
basically writing to the megaWARC is still happening sequentially, so if CPU is not a big part of this chain anyway, all we'll do with this is accumulate WARCs in memory waiting to be written to disk
-
arkiver
unless the megaWARC itself stays in memory and we can submit this very fast to IA (meaning sending a single WARC at a time is fast enough)
-
arkiver
myself: i think disk IO - but not sure. if disk IO bound, then yeah not sure if this will help much
-
JAA
arkiver: The long-term idea is to avoid writing the individual WARCs to disk at all. Instead, have long-running megawarcing processes, uploads go to a tmpfs, get queued to the megawarcing, and once confirmed written to the megawarc, the upload completes (which causes the worker to delete its copy). That halves the disk I/O.
-
arkiver
JAA: it would make us even more reliant on IA throughput
-
arkiver
is disk IO really the biggest problem here?
-
JAA
IA wouldn't come into the equation at all. The megawarc would still be written to the target's disk.
-
JAA
It just avoids the disk roundtrip of the miniwarcs.
-
arkiver
ah i see
-
arkiver
hmm
-
JAA
I mean, in theory, we could even avoid writing the megawarc to disk and do multi-part IA uploads, but that sounds like a bad idea.
-
arkiver
did you try out multi part uploads to IA?
-
JAA
Yeah. The uploads themselves are fine, but the completion is slow.
-
JAA
Every part gets written to the item server normally with an archive.php task, including snowballing.
-
JAA
It computes the checksums for each part etc.
-
JAA
Then the completion task runs individually and joins the parts.
-
arkiver
i see
-
arkiver
it may be reasonable in GB chunks
-
JAA
I'd need to check my task logs for what other annoying things I saw at the time.
-
JAA
But I basically concluded that it wasn't usable for anything at scale.
-
arkiver
i like the megaWARC writing idea
-
JAA
Yeah
-
arkiver
python does not have a ton of overhead for this i'd think?
-
arkiver
some, but not enough to cause problems, bottleneck would still be writing the file to disk
-
arkiver
and in case of very large files we'd start writing them to disk anyway and queue them up for direct archiving
-
JAA
Using Python for what?
-
arkiver
this idea
-
JAA
Which part of it? :-P
-
arkiver
receiving WARCs, processing them, writing the megaWARC to disk
-
JAA
Hmm, it might not be fast enough for the networking part.
-
JAA
Handling hundreds of connections with a total throughput in the gigabit/s isn't Python's strength.
-
arkiver
i wish i had more experience with languages that are not Python or C (or limited Lua)
-
JAA
See 3461553593 for an example task of multi-part completion that took over 1.5 hours. Bit bigger than our megawarcs, but still close enough to get an idea of how it might perform.
-
arkiver
ouch
-
arkiver
though
-
JAA
The double hashing + copying to the mirror server are the slow parts.
-
arkiver
performance greatly differs over time on these IA machines
-
arkiver
they're often over used
-
JAA
Well, and apparently concatenating the files also took almost 40 minutes there.
-
arkiver
hmm
-
arkiver
is it rewriting the entire thing?
-
JAA
Which isn't too surprising since it's reading and writing from the same HDD.
-
JAA
Well, yes, it has to.
-
arkiver
hmm
-
arkiver
i'm not sure if this is problematic
-
JAA
Each part upload results in one file in the spool dir, and then it essentially cats those to get the complete file.
-
arkiver
do you still have the code you used to do this?
-
JAA
ia-upload-stream in my little-things
-
arkiver
and can it handle checking the hash?
-
arkiver
confirming it
-
JAA
I've since implemented single-part upload because my experience was so meh.
-
arkiver
or wait it checks that for every part of course
-
arkiver
nvm
-
JAA
The client sends the Content-MD5 for each part, which gets checked by IA on receipt I believe.
-
arkiver
so this is 170 GB
-
arkiver
it doesn't look too bad to me honestly
-
JAA
Then IA copies the part from the S3 endpoint to the item's spool dir.
-
arkiver
it's not too bad to upload our megaWARC in, say, 1 GB chunks like this
-
JAA
Then it calculates its hashes, and copies it to the mirror server.
-
arkiver
yeah
-
JAA
And then on completion, it concats on the primary server, again calculates hashes, and again copies to the mirror.
-
arkiver
so doubles this
-
arkiver
but IA servers are usually not busy with disk IO stuff, CPU is nowadays the bottleneck
-
rewby
arkiver: We're not doing this in python
-
JAA
It'd be nice if the mirror server could also concat simultaneously and then the rsync would just transfer the timestamp etc.
-
arkiver
so yes it wastes resources a bit (but perhaps resources that would otherwise not be used). i wonder how much this would save us? if it's significant this is worth considering seriously
-
rewby
I have Opinions :tm:
-
JAA
But I guess that won't happen.
-
rewby
But I'm in a meeting
-
arkiver
JAA: maybe in the future
-
arkiver
or
-
arkiver
actually these things happen if they become problems ;)
-
arkiver
it's not a problem now
-
rewby
These issues with the megawarc are already a problem
-
arkiver
no one uses it, so no need to optimize
-
rewby
But again, will elaborate later
-
arkiver
thanks rewby whenever you have time
-
arkiver
(make sure to ping me please)
-
imer
JAA: IA S3 mirrors unfinished s3 upload parts? that's an odd decision, also afaik S3 spec doesn't force parts to be uploaded in order, but might be mistaken there
-
imer
so concat-as-you-go might not be a thing you can do
-
arkiver
imer: yes on all that
-
arkiver
it's possible just not implemented
-
JAA
imer: No, S3 doesn't mirror, but the archive.php task that copies data from S3 to the item servers does.
-
arkiver
also it'll still have to hash the individual pieces of course to confirm hashes
-
JAA
And I'm not suggesting concat-as-you-go, which is indeed not easy.
-
JAA
I don't think the MD5 calculation on the S3 endpoint is the bottleneck here.
-
JAA
And it calculates that anyway I believe to return an ETag (which you have to supply for completion).
-
JAA
(The ETag doesn't have to be a hash, of course, but it is in the case of IA.)
-
imer
so, upload parts to S3 -> S3 mp upload complete -> concat to item server + mirror? seems pretty optimized if you need the parts concat'ed into one file
-
JAA
Upload parts to S3, they get copied to item server and mirrored. Complete, they get merged on the item server and mirrored.
-
imer
mh, lots of copies. in the s3-compatible api i implemented recently it doesn't concat the file on completion, but assembles it transparently on-demand, but that's probably not feasible for IA's usecase
-
JAA
That would definitely be awkward, yeah.
-
imer
i'd say leave the uncompleted parts in a staging area and only concat to final storage once done, but thats got the tradeoff of losing data if staging area explodes (although not sure how much of a concern that really is if communicated right?)
-
imer
definitely not an easy problem
-
arkiver
for IA the most important part is data integrity and not losing data
-
arkiver
for that the current solution works well
-
imer
yep, sounds very reasonable
-
JAA
The S3 endpoints are basically the only place where data can be lost currently, as I understand it.
-
JAA
But perhaps those are backed by a RAID 1 at least.
-
JAA
For my own uploads, I don't trust it until the data shows up on the item itself and has been mirrored.
-
JAA
(I.e. until no archive.php tasks are pending on the item anymore)
-
imer
Thats why I like the "have the client wait until <upload> is secure" so much, if anything goes wrong on the path to final storage just respond with an error and the client can simply retry, makes reasoning about all the error cases a lot easier if you can (worst case) hand it back for retrying
-
imer
but yeah, tradeoffs
-
rewby
Okay
-
rewby
So
-
rewby
arkiver: here's your ping, prepare for a rant (JAA)
-
rewby
I have a number of issues with the current pipeline
-
rewby
Let us for a second ignore completely what data is being moved and look at it purely from a disk io perspective
-
rewby
Every byte that comes into the network interface gets written to disk, read and then written again and finally read once more
-
rewby
*however*
-
rewby
This is where filesystems and files come into play
-
rewby
We get say 100 - 400 parallel uploads
-
rewby
So 400 tiny files being written
-
rewby
Then read one by one
-
rewby
And written into say 1 big file
-
rewby
Which is then read (and sent to IA)
-
rewby
In theory fine, but consider the characteristics of this IO
-
rewby
The net -> packer part is entirely tiny files
-
rewby
Many of them
-
rewby
And packer -> net is just one big file
-
rewby
(mostly)
-
rewby
This means that at steady state, if we have 1gbps coming in
-
rewby
We are writing 2gbps to disk
-
rewby
With a mixture of tiny files and big ones
-
rewby
This workload is absolutely atrocious for hdds, too much seeking
-
rewby
With that many tiny files the hdds will seek to all hell and throughput suffers
-
rewby
Even setups with dozens of disks can only sustain 1-2 gbps of this
-
rewby
Whereas a single nvme drive will happily do more than 24 hdds can do in the tiny files department
-
rewby
But this brings another problem
-
rewby
nvmes (and other ssds) have limited write durability
-
rewby
On pro (and prosumer) gear you often get quoted a TBW number
-
rewby
Aka how many TB can you write to this disk before you start running into failures.
-
rewby
Depending on quality (and thus price) of the disk this can be from 400TB to 40PB
-
rewby
With price going up fairly exponentially as you approach the PB numbers
-
rewby
(At work we have a disk with a durability of like 30PB, but on the flip side those samsungs cost several grand)
-
rewby
If we combine this with the write amplification factor of 2
-
rewby
You may start seeing the problem
-
rewby
Even modest projects often end up around 100TB around here these days
-
rewby
and that's ignoring the really big ones like urls, shreddit, etc that really push the numbers
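Some illustrative back-of-the-envelope arithmetic for this point, assuming the write amplification factor of 2 mentioned above; the project sizes and endurance ratings below are example figures, not measurements.
```go
package main

import "fmt"

func main() {
	const writeAmplification = 2.0 // every byte in gets written twice before upload

	projects := map[string]float64{ // project sizes in TB, example figures
		"modest project": 100,
		"big project":    1000,
	}
	drives := map[string]float64{ // rated endurance in TBW, example figures
		"cheap consumer NVMe": 300,
		"prosumer NVMe":       800,
		"enterprise NVMe":     7000,
	}

	for p, size := range projects {
		written := size * writeAmplification
		fmt.Printf("%s: %.0f TB through a target means ~%.0f TB written to its SSD\n", p, size, written)
		for d, tbw := range drives {
			fmt.Printf("  that is %.0f%% of a %s rated for %.0f TBW\n", written/tbw*100, d, tbw)
		}
	}
}
```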
-
rewby
hel1, for example, has a really nice ssd
-
rewby
But even that's suffered: Data Units Written: 7,879,804,265 [4.03 PB]
-
rewby
However, that machine has been doing this for a while
-
rewby
But the real shocker is the reads:
-
rewby
Data Units Read: 1,165,376,309 [596 TB]
-
rewby
Consider that thanks to the wonderful linux kernel page cache, our ratio between reads and writes is 4
-
rewby
We write 4 times as much data to these disks as we read
-
rewby
In other words, this workload is the absolute worst for SSDs
-
rewby
Because 75% of the time we write data and then never read it
-
rewby
And simply overwrite it again later
-
rewby
(also consider this ssd is 500G and how many total drive rewrites this poor thing has suffered)
-
JAA
Fun fact: AB pipelines have basically the exact same problem. Lots of tiny files getting written, then read again, then appended to the WARC, then deleted. The read/write ratio is typically closer to 8 there though.
-
rewby
This means we have really hard requirements on these devices
-
rewby
They need to do *throughput* and deal with tons of writes
-
rewby
This makes it really hard to source hardware
-
rewby
The only way to reasonably acquire such fast disks without having to worry about our writes, is cloud
-
rewby
Which is why we use a lot of hetzner cloud
-
JAA
But yeah, if we streamed megawarcs to disk from memory-buffered miniwarcs, we could just use servers with a few HDDs and a bunch of RAM.
-
rewby
Yeah so that's where I was going
-
JAA
I.e. standard servers available everywhere.
-
rewby
My preferred solution is to take all the tiny high-iops stuff
-
rewby
And keep that in ram
-
rewby
And just get a machine with like 12 drives, set up one packer per drive
-
rewby
*or per pair
-
JAA
Yep, that'd be ideal.
-
rewby
And just write out ~100MB/s per drive(pair)
-
rewby
Even shite harddrives will happily sit at 100MB/s all day every day
-
rewby
And if they only ever write a single file
-
rewby
Then no seeking to worry about
-
JAA
I'd do mirrored pairs as a layer of protection against data loss.
-
rewby
As I said, drive pair
-
rewby
RAID1
-
JAA
Yup
-
rewby
But either way
-
rewby
It'd be infinitely better than this mess
-
rewby
Because there is another problem with the current pipeline
-
rewby
And that is space
-
rewby
which may sound stupid
-
rewby
But hear me out
-
JAA
I said in the past that I think this could be done with an rsync wrapper (like rrsync), but I don't believe that's true anymore. It would confirm the upload to the clients too soon. So we'd need to replace the uploads entirely. There's already curl uploads to HTTP targets in the code though, so that's not a big deal.
-
rewby
On disk we keep: 1. unfinished uploads, 2. files waiting to complete a chunk (up to 15G per chunk), 3. chunks waiting to be packed, 4. chunks currently packing, 5. the output of chunks currently packing (so essentially each currently packing chunk takes up 2x space), and 6. megawarcs to upload
-
rewby
Consider the following alternative pipeline:
-
threedeeitguy
Write mwarc > check for pipe out > either Read mwarc if free upload slot or start another one if space > repeat sounds nice.
-
rewby
1. Client uploads file to ram. 2. While uploading the decompression testing stuff happens (in parallel) 3. Once file is uploaded it is immediately written to megawarc (or error tar) 4. repeat previous steps until megawarc done 5. megawarc gets moved to uploaders
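A hedged Go sketch of what steps 1-3 of that pipeline could look like: the upload streams into a RAM buffer while being gzip-validated, broken uploads are rejected with an HTTP error so the item fails and gets retried, and the buffer would then be handed to the megawarc writer. Paths, limits, and status codes are assumptions.
```go
package main

import (
	"bytes"
	"compress/gzip"
	"io"
	"net/http"
)

const maxUpload = 10 << 30 // refuse anything we can't hold in RAM (made up)

func handleUpload(w http.ResponseWriter, r *http.Request) {
	var buf bytes.Buffer
	body := http.MaxBytesReader(w, r.Body, maxUpload)

	// Validate while receiving: every byte read into the RAM buffer is also
	// run through a gzip decoder whose output we throw away.
	zr, err := gzip.NewReader(io.TeeReader(body, &buf))
	if err == nil {
		_, err = io.Copy(io.Discard, zr)
	}
	if err != nil {
		// Broken upload: reject it outright instead of writing an error tar;
		// the worker's item fails and is retried later.
		http.Error(w, "invalid warc: "+err.Error(), http.StatusUnprocessableEntity)
		return
	}

	// buf now holds the verified compressed WARC in memory; it would be
	// handed to the megawarc writer, and "complete" only returned once it
	// has been appended to the megawarc on disk (not shown).
	_ = buf
	w.WriteHeader(http.StatusCreated)
}

func main() {
	http.HandleFunc("/upload", handleUpload)
	http.ListenAndServe(":8080", nil)
}
```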
-
rewby
This may not sound that much better
-
rewby
But
-
rewby
The current megawarc factory is singlethreaded
-
threedeeitguy
masive reduction in dupe data.
-
rewby
So the time needed to pack a single file is entirely dependent on how fast a single cpu core can gunzip or zstd -d
-
threedeeitguy
And allows you to go as wide as you have disks
-
rewby
So we need to often run 4 or more of these in parallel
-
rewby
Because server cpus aren't great on single thread
-
rewby
But excellent at having lots of cores
-
rewby
But this brings another problem
-
threedeeitguy
overclocked threadripper ftw
-
rewby
Lets assume a pack is 15G
-
rewby
Each packer needs space for 2 (in and out)
-
rewby
So each packer is 30G
-
rewby
If we need 4 of them, that's 120G
-
rewby
Consider that the hetzner vms we use have 160G of ssd in total
-
rewby
And that's a large number by cloud standards
-
rewby
And that's not even accounting for the parallel uploaders
-
rewby
Or the inbound
-
rewby
So not only does the current setup need nvmes, and fast ones
-
rewby
It also needs big ones to make good throughput
-
rewby
Because of how many packs we need to keep on disk in parallel
-
rewby
So this drives the cost up even further
-
rewby
Some of you may want to chime in with links to your favorite online store where a 1 or 2 TB nvme is only like 100-300 euro
-
rewby
But for those, I refer you to my point on TBW
-
rewby
Because if you check the spec sheets, you'll quickly see those numbers aren't very impressive on the cheap ssds
-
threedeeitguy
vs a 24 bay shelf that only needs cheapo hdds and could have 12 "workers"... I see the appeal.
-
rewby
Let's consider samsung's 980 pro line
-
rewby
Nobody here would argue them to be bad ssds I think (although one might argue about their firmware incident last year)
-
rewby
-
rewby
That's the data sheet
-
rewby
Under their warranty you see that a 500GB 980 pro is only warranted to 300TB
-
rewby
Experience tells me they'll do up to 600T before the controller conks out
-
rewby
I know that because we've blown 4 of them up
-
threedeeitguy
yeah. I've nearly killed my first 970s in just a few years of desktop use.
-
rewby
Died in the line of target duty
-
rewby
After like 4 months
-
rewby
Sure they're not expensive
-
rewby
But if you have to buy new ones every few months...
-
rewby
So this is another reason I like my proposed pipeline
-
JAA
Just manipulate the SMART data and exchange them on warranty. /s
-
rewby
It is much more able to deal with the nature of server cpus
-
rewby
Because our target hw usually has at least 32 cores
-
rewby
We can just barely use them because we don't have enough fast space to have enough packers running to actually use them
-
rewby
My proposal for a new pipeline basically goes from file uploaded to file written to megawarc in a matter of seconds
-
rewby
Because by the time the upload finishes, the decompression test is also done
-
imer
samsungs "enterprise" pm9a3 aren't too expensive anymore, you get 1 drive write/day for 5 years, so just under ~7pb for the 3.7tb one for example (for 280€), but everything you said still applies
-
rewby
imer: Not good enough
-
rewby
Consider
-
imer
yep.
-
rewby
4T per day
-
rewby
Is like 47MB/s
-
rewby
That's less than half a gig
-
rewby
That is *fuck all* bandwidth by our scale
-
imer
still gonna wear through it fast
-
imer
less fast than consumer gear, but still fast
-
rewby
Sure, but you see my point
-
rewby
The machines with the best record so far are the one running on optane (no shit) and the one running on kioxia enterprise drives that has pulled the silicon lottery
-
rewby
Because those are sat at 255% used
-
rewby
And still going
-
rewby
Again, in my ideal world we only write to disks once
-
rewby
We buffer all the inbound stuff in memory
-
rewby
And we do the validation of the inbound data "live"
-
rewby
So when you do a read(socket) of like 2MB, you also validate that 2MB
-
rewby
And that way we can really use the properly multicore stuff
-
rewby
Like, at that point looking into those ARM machines becomes interesting
-
JAA
We can also reject faulty uploads at that point rather than writing them to an error tar.
-
rewby
Yep
-
rewby
That was gonna be my other point
-
rewby
We can just decline broken files
-
rewby
And then cause the whole item to fail
-
rewby
So it gets retried later
-
imer
I was about to ask that as well hah
-
JAA
Yup, since these uploads would probably use HTTPS anyway, we have some nice status codes at our disposal. :-)
-
rewby
And of course, if we do this, one of my requirements would be to build in a lot of monitoring.
-
rewby
As in
-
rewby
Have counters for throughput
-
rewby
For rejected files
-
rewby
For parallel uploads
-
rewby
Etc
-
rewby
etc
-
threedeeitguy
imer Assuming 10GB/s paper napkin says 27 drives in z2 (on a single box) or 128k with a 5yr lifespan. And im pretty sure we do more than 10GB/s
-
rewby
(aka prometheus this shit up real good)
-
rewby
GB/s or gbps
-
rewby
Difference
-
rewby
I mostly measure in gigabit
-
rewby
Because in the end, from my perspective, network throughput is the end measure
-
rewby
A target is only as good as its steady-state (as in, it's not backlogging and receive/transmit are roughly balanced) network throughput
-
rewby
I don't particularly care if a project ends up being 1TB or 1PB
-
rewby
If it's 1TB being done over the span of a day
-
rewby
Or 1PB over the span of 1000 days
-
rewby
Both are the same to me
-
rewby
And need equal capacity
-
rewby
Just for different durations
-
rewby
My capacity planning is purely on throughput
-
threedeeitguy
GB/s with lots of rounding for easy maths. I figured it's such an outlandish number that it's not worth doing accurately. Also I should not be trusted with a calculator, fixed maths below.
-
threedeeitguy
Assuming 10GB/s at the disk and 50MB/s per drive then paper napkin says 27 drives in z2 (on a single box) at a cost of 7.5k with a 5yr lifespan. And I'm pretty sure we do more than 10GB/s
-
rewby
optane9 likes to do 5gbps
-
rewby
that's gigabit
-
rewby
So call it like 500-600 MB/s
-
imer
threedeeitguy: yeah, there's no point. I was just saying you *can* get ~10x better ones (compared to consumer drives) for not much more money nowadays, 10x better isn't good enough though
-
threedeeitguy
Yeah. Just gotta cut that cost by 10x ;)
-
rewby
If we go real fast, a goal would be 10gbps because everything I get these days is 10gbps (or more)
-
rewby
*per target
-
rewby
So 1.25GB/s is the threshold number
-
rewby
So you'd need about 25 worth of throughput
-
rewby
However, a raidz2 wouldn't work
-
rewby
I think
-
rewby
The reason is how raidz2 works
-
rewby
raidz2 is essentially a parity system
-
rewby
Which exploits a property of xor
-
rewby
Consider a xor b = c
-
rewby
If you lose any single one of a b or c, you can xor the remaining 2 to recreate the missing one
-
rewby
This is the basic premise of raid5 and raidz1
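A tiny demonstration of that XOR property; the block contents are arbitrary.
```go
package main

import "fmt"

// xorBlocks XORs two equal-length blocks byte by byte.
func xorBlocks(x, y []byte) []byte {
	out := make([]byte, len(x))
	for i := range x {
		out[i] = x[i] ^ y[i]
	}
	return out
}

func main() {
	a := []byte("data block A")
	b := []byte("data block B")
	parity := xorBlocks(a, b) // c = a XOR b, stored on the parity disk

	// Pretend disk A died: rebuild it from b and the parity.
	rebuiltA := xorBlocks(b, parity)
	fmt.Printf("rebuilt A: %q\n", rebuiltA)
}
```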
-
rewby
Raidz2/raid6 just adds a second, independently computed parity on top of that
-
rewby
Either way, you're bound to the performance of a single disk
-
rewby
All that is to say, you definitely can make the current system work
-
rewby
But it requires a lot of ssds
-
rewby
And expensive ones
-
rewby
We work around some of this with hetzner targets by just having lots of them
-
rewby
Because cloud
-
rewby
Sure, optane9 can do like 3gbps
-
rewby
And to equal that I need like 6 hetzner vms
-
rewby
Or even more
-
threedeeitguy
be interesting to see what cpu reqs are to decompress at 10gbps
-
rewby
But at least I can get the hetzner vms
-
rewby
But if we could use our resources more efficiently
-
rewby
That'd be so amazing
-
rewby
Like, ram is no problem
-
imer
-
rewby
We don't run on latest gen hardware
-
rewby
Most of it is late ddr3 or early ddr4
-
rewby
The industry is moving to ddr5
-
rewby
Both of these are getting stupidly cheap
-
rewby
For 300 usd you can easily get 256G of ram
-
rewby
Even ddr4
-
rewby
ddr3 is even cheaper
-
rewby
So I'm not too worried about exhausting ram
-
rewby
All of our dedicated boxes already have 128G or more
-
rewby
And since we can ship whatever uploader code we want
-
rewby
We don't need to be restricted to just "what can curl do" or "what can rsync do"
-
rewby
If we find we need to do multipart uploads
-
rewby
Or build in some upload resuming or whatever
-
rewby
No problem
-
threedeeitguy
Yeah I saw a full tray of 24 16gb ddr3 modules for £99 earlier today. Crazy.
-
rewby
My point exactly
-
rewby
Ram is cheap and performs well
-
rewby
In the design doc that I'm currently trying to find
-
rewby
I had an idea of the first request to a target doing a couple of things
-
rewby
It would allocate the buffer in ram
-
rewby
And reject you if no more space
-
rewby
And also return you a "permalink" of sorts to that target
-
rewby
So if we stick a loadbalancer in front of things, you can basically bypass that after the first request
-
rewby
And always end up at the same target to finish your upload
-
rewby
Additionally, by not accepting uploads for which we have no space, we don't get a pile of tiny temp files everywhere
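A hypothetical sketch of that first-request flow: the target either promises RAM for the upload and returns a direct URL to itself, or says no up front. The header name, RAM budget, hostname, and path are all made up, and releasing the reservation after the upload is not shown.
```go
package main

import (
	"fmt"
	"net/http"
	"strconv"
	"sync/atomic"
)

const ramBudget = 64 << 30 // bytes of RAM this target is willing to promise

var reserved atomic.Int64

func handleReserve(w http.ResponseWriter, r *http.Request) {
	size, err := strconv.ParseInt(r.Header.Get("X-Upload-Size"), 10, 64)
	if err != nil || size <= 0 {
		http.Error(w, "missing or bad X-Upload-Size", http.StatusBadRequest)
		return
	}
	if reserved.Add(size) > ramBudget {
		reserved.Add(-size) // roll back; tell the worker to try another target
		http.Error(w, "no buffer space here, try elsewhere", http.StatusServiceUnavailable)
		return
	}
	// "Permalink" straight to this target, bypassing the load balancer for
	// the rest of this upload's requests.
	fmt.Fprintf(w, "https://target-07.example.org/upload/%d\n", size)
}

func main() {
	http.HandleFunc("/reserve", handleReserve)
	http.ListenAndServe(":8080", nil)
}
```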
-
rewby
This also has another advantage
-
rewby
If some project is a mix of tiny and huge files
-
rewby
We can deploy targets of mismatched capacities
-
rewby
Where only some of them can handle the big ones
-
rewby
And it'd all be fine
-
threedeeitguy
I was pondering that earlier. Central orchestrator that hands out links to workers saying go here there's space waiting for you vs dumb lb and retry if you hit a full one
-
rewby
Because the smaller ones would just say "nah bro, try someone else"
-
rewby
Yeah, that'd be ideal, but you still need this mechanism
-
rewby
My original design had this
-
JAA
Yep, and we could also have a slow separate target which just writes the individual uploads to disk anyway, for really big items.
-
rewby
You sent the first request to the orchestrator, it'd sort the list of targets by "who can likely fit this"
-
rewby
And then try that list until it finds someone who can
-
rewby
That way if a target filled up between the last list update and "now" it'd handle it the less efficient but still good way
-
rewby
Yep
-
rewby
Indeed JAA
-
rewby
That was part of my thoughts
-
rewby
For those cases where we have truly huge files
-
rewby
We just have one box that writes to disk anyways
-
rewby
And be the target of last resort
-
rewby
You can get really creative with dispatching logic once you can guarantee that sending someone to a target will actually complete an upload
-
rewby
And when you have accurate accounting of how much space has already been "allocated" to in-flight
-
rewby
Because a target with 10GB free and 10 1G files in flight is as good as full
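A small sketch of that accounting, assuming a simple reserve/release counter per target: an upload is only admitted if free space minus everything already in flight still fits it. All names and numbers are illustrative.
```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

type target struct {
	mu       sync.Mutex
	free     int64 // bytes currently free on this target
	inFlight int64 // bytes already promised to in-progress uploads
}

var errFull = errors.New("target effectively full")

// reserve either claims space for an upload or rejects it up front, so a
// target with 10G free and 10x1G in flight correctly reports itself as full.
func (t *target) reserve(size int64) error {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.free-t.inFlight < size {
		return errFull
	}
	t.inFlight += size
	return nil
}

// release is called when the upload finishes (committed) or is aborted.
func (t *target) release(size int64, committed bool) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.inFlight -= size
	if committed {
		t.free -= size
	}
}

func main() {
	t := &target{free: 10 << 30} // 10 GiB free
	for i := 0; i < 11; i++ {
		err := t.reserve(1 << 30) // eleven 1 GiB uploads; the last is refused
		fmt.Printf("upload %d: %v\n", i, err)
	}
}
```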
-
JAA
-
rewby
Yeah
-
rewby
It's not the best doc
-
rewby
And probably definitely needs improving
-
JAA
It exists. That's better than most docs around here. :-P
-
rewby
If we want to do this properly, should be a etherpad or wiki page
-
rewby
And people can spend some time proposing edits and such
-
rewby
One thing this design doc also did which I think is nice is track which items are in each warc and who uploaded them
-
rewby
Because in the past if we had to go find out who did what or where corrupt data came from it was always a pain
-
rewby
If people are willing to actually help make this a reality, feel free to reach out
-
rewby
I'm perpetually overworked myself and thus never got around to properly doing this. I have some prototype work that's incomplete and doesn't work
-
rewby
I'm more than happy to mentor someone on the details of how targets work and to help flesh out ideas
-
rewby
I have one note though, no python or nodejs
-
rewby
Both are atrocious to deal with when doing things that really need multithreading
-
rewby
And their package management makes me cry
-
rewby
And also, I have a preference for languages with a good static type system for a few reasons
-
imer
so rust?
-
threedeeitguy
Works for me.
-
rewby
Something like go or rust is fine for me
-
rewby
Reason being: You can verify invariants much easier
-
rewby
You can get a variable in and be sure what you can and cannot do with it
-
rewby
If I need to know what operations I can do on something, or what something is, figuring that out with python is awful
-
threedeeitguy
rust has been my first foray into anything lower than c# or typescript and I kinda love it.
-
rewby
Because I need to trace that through the whole call tree just to figure out what fields it has
-
rewby
I'm really into rust since early last year as well
-
threedeeitguy
Python is my own personal hell.
-
rewby
But I know some of our other tooling is golang
-
rewby
I enjoy rust's type system
-
rewby
But also acknowledge its async is painful
-
imer
rewby: guessing this is gonna be a more long term thing, so i'll throw my name in the hat (currently still too busy with stuff for serious commitments, but that should relax soon)
-
rewby
It's a more long term thing yea
-
rewby
I don't expect this to be done in a week or even a month
-
rewby
But atm nobody is working on it
-
imer
definitely more familiar with rust here (and yes, async is painful sometimes)
-
rewby
I'm too busy patching holes and keeping up with *gestures at imgone, shreddit, etc*
-
threedeeitguy
I helped a friend through the coding for his astrophysics degree. They had to use python for everything. It was soooooo slow. Doing the same thing in a SQL query or with c# was hundreds or thousands of times faster for some of the problems.
-
rewby
As I said, if there's interest, we can set up a group (channel) for the development (or use -dev) and get people set up with some repos and guidance
-
rewby
And get some documentation space set up and such
-
rewby
(I am a team lead for infrastructure on my day job, all too familiar with this stuff)
-
threedeeitguy
I have plenty of time, persuading my brain to focus is another matter. Would be more than game for giving it a go though.
-
rewby
If y'all are willing to give it a go, I can make a kanboard happen with some basic tasks
-
rewby
And whomever feels like it can pick a task and do it
-
threedeeitguy
Sounds good.
-
imer
sure
-
fireonlive
!help
-
h2ibot
fireonlive: The following commands are available: (for '')
-
h2ibot
fireonlive: !help: Print this help message. (for '')
-
h2ibot
fireonlive: !a: Deduplicate and archive a list of URLs hosted on transfer.archivete.am. CAREFUL, DDOS. (for '')
-
fireonlive
hmm
-
fireonlive
-
h2ibot
fireonlive: Invalid privileges, need one of ('@', '+').
-
fireonlive
ah ok
-
JAA
I archived it with AB already, and also the other links in the post.
-
fireonlive
ah, thanks!
-
fireonlive
always on the ball :)