-
pabs
JAA: sounds about right. TBH this is the first time I heard of Freed-ora
-
monoxane
yo so how hard would it be to get some more targets online if we had the storage + network to provide for it
-
monoxane
if you've seen pixiv over the last 2 days you may have seen that me and a few friends have thrown some 100g boxes at it and are currently bottlenecked by the 2 online targets
-
monoxane
we know the targets need to offload to IA at an appropriate speed, but have quite a bit of available storage to buffer ourselves with
-
monoxane
at a point we were hitting 7.5gbps from the source but are now limited by the targets' disks filling up and stopping connections 😔
-
monoxane
its less of a thing for this in particular but we're trying to work out how we can provide some infra for the next "oh shit its going down in 24 hours" site scrape
-
monoxane
500gbit of bandwidth, a /24, and 100tbit of local storage will help some of those a fair bit 😉
-
JAA
Pinging some relevant people: rewby HCross arkiver ^
-
monoxane
we're also working on rewriting an api compatible warrior that will scale much higher
-
monoxane
for reference last night we had 3328 warrior threads running across 13 nodes for shits n gigs, and were nowhere near capacity
-
monoxane
also considering rolling a new version of the megawarc factory with some improvements, the real question is how does it get from the targets to IA and what do we need to do to facilitate that
-
monoxane
and yes, aware that IA only has ~20gbps S3 capacity, we'd be egress shaping down to about 5gbps, hence the fuck off massive target cache to hold it for a bit
-
monika
monoxane could you clarify on the "api compatible" warrior? are you modifying the existing warrior or writing one from scratch
-
monika
i believe modifying warrior code is a big no no
-
monoxane
new one that does the same thing with the same apis just less jank and some more options to allow us to vertically scale easier and with an updated docker image
-
monika
JAA what's your opinion ^
-
nepeat
i'd be interested in learning more and supporting this warrior improvement
-
nepeat
personally, i'd love to add on prom metrics and getting the logging to fit the structlog format to work with my systems
-
monoxane
im not the guy doing that so i might be wrong on whats actually happening, but we've found that one of the main limiting factors of the warrior is its concurrency settings and the inability to disable things like the web ui
-
monoxane
and also the fact that some of the python libs used in it are effectively vaporware that haven't been updated since 2017
-
monika
if you run the bare project containers the UI is already disabled
-
monika
atdr.meo.ws/archiveteam/<PROJECT>-grab
-
monika
allows for 20 concurrency too
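A minimal sketch of running one of these bare project containers directly, using the image path mentioned above; the `--concurrent` flag and trailing downloader-nickname argument follow the usual ArchiveTeam container invocation, and both the project name and the nick here are placeholders:

```shell
# Run a project grab container directly, no warrior wrapper.
# "example-grab" and "yournick" are placeholders, not real values.
docker run -d --restart=unless-stopped \
  atdr.meo.ws/archiveteam/example-grab \
  --concurrent 20 yournick
```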
-
monoxane
ooo we did not know that
-
monika
go crazy
-
monoxane
that is going to make a massive difference
-
monoxane
aight the warrior isnt being changed anymore :)
-
monoxane
but we are gonna write our own cluster agent and c2 implementation :P
-
nepeat
ditching k8s already?
-
monoxane
no, still using k8s, just writing a controller that handles the deployment and configuration of those bare images instead of the warrior
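A rough sketch of what such a controller might stamp out per project: a plain Deployment wrapping the bare grab image, with concurrency and downloader name passed as container args. The image name, args, and replica count below are assumptions for illustration, not anyone's actual manifests:

```yaml
# Hypothetical per-project Deployment a controller could render.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-grab
spec:
  replicas: 50
  selector:
    matchLabels: {app: example-grab}
  template:
    metadata:
      labels: {app: example-grab}
    spec:
      containers:
        - name: grab
          image: atdr.meo.ws/archiveteam/example-grab
          args: ["--concurrent", "20", "yournick"]
```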
-
monoxane
we are already working on that but via the warrior, knowing about the bare images is a massive game changer
-
monoxane
hm these dont seem to actually contain anything though 😔
-
monika
huh?
-
monoxane
at least the pixiv-2-grab one's dockerfile literally only has a from line in it
-
nepeat
this is the dockerfile to refer to
-
monoxane
ah okay, thats some fucky shit i haven't seen before :P
-
monoxane
will play around with it after i finish my actual job for the day lol
-
OrIdow6
arkiver: See above, they have dropped their plan to modify "the warrior"
-
monoxane
yes now we're just gonna bypass it 😆
-
monoxane
we dont wanna do anything that will screw anyone else here but there are definitely challenges with scaling warrior to 3000+ instances over 10+ nodes and actually managing it
-
nepeat
i like nomad, it's simple and has scaled up with my 100-300 instances well
-
monoxane
yea the other thing is the nodes we're using already have k3s and are running some other workloads, so we cant just jump to nomad
-
nepeat
ah, preexisting prod
-
monoxane
yes, if you knew what these nodes usually do you'd be absolutely shocked that we can run AT workloads on them, and also absolutely not surprised at all that we can pin 500gbps
-
monoxane
but dont worry its all approved by the owners :)
-
OrIdow6
I haven't been following this conversation enough to know the meaning of "bypass it", but basically, the hard rules are:
- don't modify wget-lua/wget-at, including messing with the build process to get it to accept wider ranges of library versions
- don't modify Seesaw or the other libraries it uses
- don't modify the project scripts
- keep a clean, vanilla connection from wget and the project scripts to the Internet
-
monoxane
understood we’ll definitely be sticking to that
-
monoxane
i mean we’ll be running the project containers directly not managed through warrior
-
nepeat
that's what most of us hardcore users do
-
nepeat
you're definitely on the right path to hauling top rates
-
monoxane
we dont care about the leaderboards lmao, even considered randomising the DOWNLOADER ids so other people dont get discouraged by 1 name munching 10tb a day
-
monoxane
its more, if we can help in an "oh fuck" situation where theres 24 hours to get an entire site archived, we'll put in everything we've got
-
monoxane
because i've been part of some of those where even with all the capacity we've had, some content was still lost, and in a couple of cases it was a fair bit of content
-
schwarzkatz|m
Appreciate the work you guys do, monoxane!
-
Jake
(also related to earlier conversation, it's easier if you use a known downloader name so that you can be contacted)
-
monoxane
yea we're gonna use some sort of team name when its all up and running
-
monoxane
instead of just my nick lol
-
nepeat
kinda wondering, how up to date are all of the archive team repos?
-
neggles
"don't modify Seesaw or the other libraries it uses" aww
-
neggles
I believe the current plan was to use MagnusInstitute or possibly MagnusArchivist as downloader name, TBC though
-
neggles
OrIdow6: would it be OK to rework the warrior docker image somewhat so it's a bit more... modern, for lack of a better way to put it? I was digging through repos and whatnot last night piecing together how it all works and... oof.
-
OrIdow6
neggles: I don't know what that implies exactly
-
OrIdow6
The core that you shouldn't modify is in the READMEs under "Distribution-specific setup"
-
OrIdow6
And to my understanding the warrior, Docker images, etc. are basically just wrappers around a preconfigured version of this
-
OrIdow6
But I don't know the details of those, and if you want specifics you should wait around for someone who does
-
neggles
OK, no problem
-
neggles
don't want to step on anyone's toes; I have a local 3/4-ish-complete copy of what I'm talking about, it's mostly a slightly cleaner build process (same steps, same sources, similar end result) just with bullseye underneath, theoretically arm64 support, and a few more things configured through environment variables (webui port, UID/GID)
-
rewby
So the thing is: don't run custom builds of wget-at. It causes issues
-
rewby
Mostly around compression and or file integrity
-
rewby
And upgrading the base distro changes lib versions, which then causes the above
-
rewby
As for targets, we don't generally accept them from just anyone who shows up randomly. Once data is on a target it is really hard to figure out what needs to be redone if that target disappears.
-
rewby
Notably, we only accept targets in the form of bare metal or vms. We have provisioning playbooks for them
-
rewby
Also, they destroy ssds
-
rewby
And HDDs are not gonna keep up
-
rewby
Also, 100T isn't much
-
rewby
I have targets with that much sitting around as well
-
rewby
I can look into reshuffling a few things
-
rewby
Also, monoxane, do *NOT* use team names. That is forbidden. We will ban you if we discover this.
-
monoxane
oop okay
-
monoxane
will not
-
rewby
We have had many issues with this before
-
rewby
In the past, people have used team names and then one member's infra fucks up and we need them to stop. Inevitably that person is unreachable and the other members can't get to that specific bit of infra. We end up banning the whole thing because that's the most granular tool we have.
-
rewby
This has happened multiple times.
-
rewby
So we prohibit team names in general now
-
monoxane
yea that makes heaps of sense i dont know why i didnt think about it
-
rewby
Yeah, each person's infra needs a unique uploader name
-
rewby
If you wanna do TeamBlah-monoxane then by all means go for it
-
neggles
does it qualify as one person's infra if all the workers are being managed from a central point, and go idle if they can't talk to it?
-
rewby
Uh. Unsure.
-
monoxane
we'll take that as a no then
-
monoxane
dont wanna antagonise
-
rewby
Basically, each uploader name should be associated with the person who can run sudo poweroff
-
monoxane
may have come in a bit too hot with the ideas
-
monoxane
copy
-
neggles
the whole "we need any one of us to be able to hit the kill switch" thing did occur to us
-
rewby
Even if you don't have the skills to fix the issue, you can at least shut the thing down, you feel?
-
rewby
Yeah so, that "any one of us" idea has been tried before
-
rewby
It never works out in practice
-
rewby
So we want to be able to tracker-ban one control domain
-
BPCZ
rewby: would a target with 5PiB of flash and 100PiB of hdd be of much use?
-
monoxane
BPCZ isnt that just the IA :P
-
BPCZ
Though if a target going missing is an issue that might be an issue since that system is my testing ground :/
-
rewby
BPCZ: Depends on the networking, how much abuse against cpu and flash you're willing to take and how long it's available for
-
rewby
Yeah, no testing grounds
-
rewby
Targets going missing without >24h notice is a Big Problem
-
monoxane
someone buy a VAST cluster already
-
BPCZ
VAST is dog shit
-
rewby
Just give me bare metal tbh
-
rewby
That usually works best
-
BPCZ
Understandable
-
rewby
I have a whole Ansible system to provision and manage metal
-
rewby
Not OSS because, like a lot of AT code, it was all written with -2 hours of notice/planning
-
BPCZ
:( but we could clean it up
-
rewby
Specifically I just hardcoded a ton of secrets in it because I had a deadline of a few hours
-
BPCZ
lol
-
rewby
It's the secrets bit that's the issue
-
BPCZ
Wish I could contribute hardware but that’s a big nono, I can chuck ungodly amounts of compute and ephemeral storage around but most OSS projects get annoyed when you show up, do 5x the work they’ve done in 3 years, then disappear
-
rewby
Our problem is ephemeral is a big no for targets
-
rewby
Workers, sure
-
rewby
And i can scale targets up if need be.
-
rewby
I've just been sick for the last 3 weeks and haven't been able to babysit them like I usually do
-
nepeat
oooooo ansible scripts
-
nepeat
i've been trying to research the backend infra and a lot of the stuff seems stale for that
-
BPCZ
Paperclips and chewing gum
-
rewby
Targets aren't that complicated tbh, it's mostly OSS except for my provisioning code
-
rewby
Tracker...
-
rewby
Talk to Kaz. He's been on a journey to RE that thing
-
nepeat
heh
-
nepeat
is the current tracker code open sourced?
-
rewby
Only F.usl really knows how that thing works.
-
rewby
You assume all of it even has a source code repo
-
rewby
Bold
-
nepeat
HAHA OH GOD
-
nepeat
my inner sre cries a little
-
BPCZ
>ruby
-
rewby
Same
-
BPCZ
Off to a terrible start
-
nepeat
ruby is cool!
-
rewby
Oh trackerproxy isn't ruby
-
rewby
It's all redis and nix+lua
-
rewby
*nginx
-
rewby
Damn autocorrect
-
BPCZ
I wish there was better docs on the infrastructure, seems neat
-
nepeat
+1
-
rewby
Same here
-
nepeat
i'd love to make some changes that would improve my quality of life with my infra
-
monoxane
+1
-
monoxane
i’ll just make my own with blackjack and hookers and an ia s3 key /s
-
nepeat
hell yeah prom exporters and structlogs
-
monoxane
too much work
-
rewby
Using your own S3 key wouldn't work btw
-
monoxane
yea i know
-
nepeat
spicy
-
rewby
You don't have access to the magical collections where we drop things.
-
BPCZ
IA is using S3
-
monoxane
it only lets you upload via the site doesn’t it
-
BPCZ
Now?
-
BPCZ
Sadage
-
nepeat
s3 compatible, not actual s3
-
monoxane
the web ui upload from ia is an s3 thing
-
rewby
It's an S3 "compatible" endpoint
-
rewby
We call it s3
-
monoxane
and yea not s3 from amazon, just the protocol
-
BPCZ
Thank god ok
-
nepeat
everyone implements s3 compatible apis
-
neggles
S3 =/= AWS S3
-
rewby
It's cursed
-
BPCZ
I don’t even know if IA has multiple tape libraries yet
-
rewby
It's all hdds
-
rewby
Afaik
-
nepeat
i've heard they're running ceph these days?
-
monoxane
yea i think it’s hdd with a little bit of flash in front for web stuff
-
BPCZ
Probably too much effort to keep a library alive, those bastards always have issues
-
monoxane
there’s a page on the site talking about petabox
-
monoxane
somewhere else talks about s3 on top of it too
-
monoxane
which is where i got the idea to just ask for a key from :P
-
rewby
Also, re SRE cries. You really don't wanna know the tracker. Some of it is Debian wheezy
-
monoxane
they’d absolutely say no though
-
BPCZ
If it’s Ceph then S3 is just gratis
-
monoxane
tell me to piss right off and never come back
-
rewby
You can get keys piss easy
-
rewby
Make an account on the IA and go to your profile
-
monoxane
not long lasting ones though
-
rewby
It'll give them
-
monoxane
oh interesting
-
rewby
They're just account creds iirc
-
rewby
The thing is, we have collections with special flags that make the wbm index them
-
monoxane
yea and they'd probably revoke them if i uploaded at 10gbps
-
monoxane
yeap
-
rewby
Randos cant just upload warcs and have them show up in the wbm
-
nepeat
reliability and automation would be great things to look at
-
neggles
most of what struck me as I was digging through code piecing together how this stuff works was, idk, disappointment? but the existential kind
-
nepeat
not pure brute force...
-
rewby
But our collections are special
-
rewby
And have restricted uploader access
-
rewby
But all of the IA side is managed by ark.iver
-
rewby
I get a set of S3 creds and a collection to shove stuff into
-
rewby
If you see us discuss vars, that's our slang for the info I need from him to interface with IA
-
rewby
Oh trust me, I wanna replace so much of it
-
nepeat
kinda curious, has something like vault been looked at for keeping the secrets outside of env files?
-
rewby
But there's only so many hours in a day and I'm overworked as is
-
neggles
IA is important, AT is important, but it seems like there's... can't find the right way to say it but "oh come on, companies spend tens of millions on <next stupid internet fad> but *none* of them feel like giving any real resources to something that actually does some good?"
-
rewby
Looked at? Sure. But time is limited for most of us.
-
rewby
Note that we have 0 budget
-
rewby
We fund this ourselves
-
neggles
yeah, absolutely not having a go at anyone here
-
rewby
Target costs are split between me and like 4-5 other people who all pay for the hardware they donate
-
rewby
But importantly, I have names, phone numbers, addresses etc
-
rewby
We know where to send goons if someone fucks off
-
neggles
I guess i'm just kinda surprised none of the tech giants have decided to get themselves some positive press by throwing a (for them) miniscule amount of funding and resources at this
-
rewby
We don't have an org
-
neggles
surprised isn't the right word, disappointed
-
rewby
Which makes that hard
-
nepeat
some of us work for the tech giants ;)
-
BPCZ
Some of us would prefer dirty money not get involved
-
nepeat
i wouldn't say the money's dirty
-
rewby
Money would be nice to finance proper target hw.
-
rewby
Or at least pay hosting bills
-
nepeat
it's what makes it possible for people like me to spin up a lot of instances for the warrior IPs
-
neggles
all money is dirty depending on how you look at it, but that's a whole other question, and if it doesn't come with any strings attached other than "tell people we did this" that's fine
-
rewby
From archiveteam.org: Archive Team is a loose collective of rogue archivists, programmers, writers and loudmouths dedicated to saving our digital heritage.
-
rewby
This makes money hard
-
neggles
(that sounded wrong, s/that's fine/i wouldn't have a problem with it at least/)
-
neggles
nepeat: the org whose resources we are making use of do have a /22 or so available
-
BPCZ
I’m kind of surprised IA can’t provide a reasonable set of targets
-
nepeat
BPCZ: this isn't the IA
-
rewby
We're not the IA
-
rewby
They graciously deal with the storage and retrieval parts of web archiving for us
-
neggles
(and they don't get nearly enough funding either, hence the relatively low amount of ingress they can handle)
-
rewby
Which is more than we could ask for anyways
-
neggles
yeah
-
nepeat
wondering, how can i help with some of the infra and client code?
-
nepeat
me putting out my thoughts is one thing for the overburdened team but i like to get my hands dirty and implement said thoughts
-
neggles
well, to say what we probably should've opened wit- heh nepeat that's p much what I was about to say
-
neggles
monoxane builds k8s-based application orchestration stacks for a living
-
rewby
I have a decently interesting design for new target software. But not had the time to implement it.
-
rewby
Also, F.usl has been working on a new tracker for years, might need help
-
monoxane
yea i’m kubelord, 80% of my job is building kube applications to orchestrate hundreds of gbit of traffic and the orchestration for the orchestrators to make it all manageable from a unified web interface
-
rewby
I personally don't trust kube for targets
-
rewby
This data is very persistent and not redundant
-
monoxane
replacing the warrior with a kubernetes controller that runs the direct job containers is gonna be a 3 day job at most, will look at it over christmas
-
monoxane
oh yea for targets it’s absolutely not the right tool
-
» rewby is the target person
-
nepeat
containerized targets would be very fucky, storage would have to be separated to force that to work...
-
nepeat
pretty much creating target2.0 if you are doing that
-
neggles
that's not particularly difficult if you're running on baremetal
-
neggles
but it's probably not worth the effort
-
rewby
I have plans for new target software
-
nepeat
agreed, given targets aren't disposable
-
monoxane
but for collection at scale? kube, a 100gbe host, and a /24 will give up to 4000 concurrent downloads across an entire public ip range in seconds
-
rewby
To: not destroy ssds as much and go faster
-
monoxane
ramdisk time :P
-
rewby
NO
-
rewby
Data loss
-
nepeat
:openeyescryinglaughing:
-
rewby
Again, if we lose uploaded data, it's gone
-
monoxane
yea ik
-
rewby
And we have no good way of figuring out what was lost
-
monoxane
1pb of zeusrams when
-
monoxane
actually a bluefield2 and some nvmeof would make a wonderful target
-
neggles
"if we lose data after it hits the target we can't tell what we lost" seems like a problem worth solving
-
rewby
Also, one of my servers is under 1.5 years old. Its ssds have 3.5PiB written
-
rewby
neggles: again, i have plans
-
BPCZ
Hahah I happen to know of a project trying to do multi tbps persisted storage via kube
-
rewby
I just need to write it down
-
BPCZ
It’s going poorly
-
monoxane
also that, maybe it’s worth adding another step to the tracker for “egresses to ia”
-
neggles
oh yeah no i'm not suggesting it's easy
-
neggles
cause doing what mono just suggested doubles tracker load (and it sounds like the tracker is a bit of a black box at the moment?)
-
schwarzkatz|m
are there even any good news regarding that site lately
-
schwarzkatz|m
why is it so awfully quiet here currently, where is everybody :c
-
rewby
schwarzkatz|m: It's not quiet?
-
joepie91|m
that way you optimize for scraping the high-result-count ones first
-
joepie91|m
I believe that this is part of Google's n-gram dataset somewhere
-
joepie91|m
hm, I thought there was a letter dataset also
-
joepie91|m
(which afaik is used in google's language detection thingem)
-
madpro|m
<rewby> "Also, F.usl has been working..." <- 🥲
-
BPCZ
monoxane: how does one become a kubelord
-
monoxane
a lot of "wtf how the fuck does that work" and reading golang code
-
neggles
if my own attempts are anything to go by, the first step involves creating & recreating your cluster 27 times in 3 different configurations before you find one that doesn't have a showstopping problem that rears its head after you're 3/4 done
-
neggles
(assuming you don't want to pay <cloud provider> half a kidney)
-
monoxane
lmao also that
-
rewby
That tracks with my experience
-
monoxane
it took me 8 tries to make a kube cluster, now i can do it in 10 min from bare os
-
neggles
oh the other option is to pay red hat $texas for openshift
-
nepeat
boring
-
neggles
or go dig up all the OSS components of openshift and do it yourself
-
BPCZ
Oh ok so I’m most of the way there then. I write go for work and write kube oci providers and modify core kube crap to pass in hardware that’s not supposed to be passed in just yet
-
BPCZ
Just need to get to the standing up a cluster part … most of the time I barely figure out a process once and just have an ansible playbook for next time
-
neggles
having spent the better part of this year attempting to stand up a cluster that doesn't have some incredibly stupid limitation that makes me throw my hands up in defeat and forget about it for a month
-
neggles
good luck >.>
-
nepeat
this is overcomplicating the overcomplicated setup
-
BPCZ
neggles: I mean all clusters have limitations. I work in distributed systems and clusters professionally. Kube just isn’t used heavily for the big stuff
-
monoxane
BPCZ if you really wanna get standing up clusters down, do kubernetes-the-hard-way, like 4 times over, and you will know everything about the internals and why things are like they are
-
BPCZ
Thanks!
-
neggles
the problem with k8s related stuff, from where i'm standing anyway, is it's all focused on "too big" or "too small"
-
nepeat
keep it simple. for my AT stuff, i got nomad (containers) + vault (mtls certs) + loki (logs!)
-
monoxane
(doesnt have to be gcloud, its just what they use as demo env)
-
neggles
there are a lot of ways to spin it up on single hosts that work quite well, are very straightforward, and behave
-
neggles
and a lot of ways to spin it up on <cloud provider> that work very well, are easy to manage, and cost an unpredictably-large fortune
-
monoxane
and yea, 1 node: easy, 2 to 6: incredibly painful, 6 to 1000: easy af
-
BPCZ
Did kube ever grow network topology knowledge? I recall that being a sticking point a while back
-
neggles
still a big problem.
-
BPCZ
Figures
-
neggles
there are several potential solutions, no clear winner
-
neggles
the frontrunner seems to be cilium
-
monoxane
its a problem but its got a whole lot better now, you can do l3 super easy with stuff like cilium or kube-router that dont rely on internal tunnels between nodes
-
monoxane
big thing about cilium is it does ebpf offloading so all the inter-pod stuff is done in the kernel and offloaded to the nic, instead of in userspace like the older CNIs
-
neggles
and you can handle rerouting traffic to the 'correct' node without overwriting the source address
-
monoxane
and also yknow just use bird to advertise everything between the nodes over bgp instead of cry when the vxlan is broken for no reason
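For reference, the bird-over-BGP approach described here can be as small as a direct + bgp protocol pair per peer; this is a hedged sketch in bird2 syntax, and the addresses and the private ASN are made up:

```
# bird2 sketch: advertise this node's connected routes to a peer
# over iBGP. 10.0.0.x and AS 64512 are illustrative values.
protocol direct {
  ipv4;
}
protocol bgp node2 {
  local 10.0.0.1 as 64512;
  neighbor 10.0.0.2 as 64512;
  ipv4 {
    import all;
    export all;
  };
}
```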
-
monoxane
looking at you flannel
-
nepeat
+1 to using bgp lol
-
neggles
tl;dr it's getting a lot better, rapidly, but it's still not there yet
-
neggles
bit of an xkcd competing standards problem
-
nepeat
i just have wireguard tunnels and bgp to route my throwaway networks when i spin it up
-
BPCZ
Yeah I don’t trust kube for the workloads, and neither does google. Iirc they use nomad for some stuff, but the devs I’ve talked to over there say everything falls over when you get into the high hundred thousand messages a second range with nodes
-
nepeat
been looking at netmaker and got it rolled out for this iteration of my cluster
-
monoxane
i am currently working on standing up a cluster with 3 nodes in 3 locations connected via ipsec tunnels + bgp + kube-router for shits n gigs
-
neggles
google use borg, which is not k8s, but is not not k8s
-
BPCZ
Specific workload, they don’t use borg for it
-
nepeat
kinda curious, any of you got dashboards for the AT stuff yet?
-
monoxane
the best one to look at for implementaion and scale imo is spotify
-
BPCZ
They use a few thousand node nomad cluster
-
schwarzkatz|m
rewby: what do you mean, quiet?
-
monoxane
they run 98% of workloads in 13 globally distributed clusters with the capability to hard failover any cluster's traffic to any other site in under 5 seconds, they manage it all with an internal tool they're making open source called Backstage
-
BPCZ
Sounds cool
-
monoxane
my work's clusters are a fair bit smaller and in a completely different ballpark, we just have 6 nodes running ~110 pods total but the application stack is designed to be entirely fault tolerant internally so any service or any node goes down and we're still good
-
monoxane
most of the clusters are completely offline most of their life too
-
neggles
there was a big sportsball event you might've heard about recently; i will not elaborate further
-
monoxane
yea and another, and another :P
-
nepeat
i can't say anything about what i do but it's reinforced some good ideas for my personal setups, this included
-
nepeat
oh god not the world cup
-
monoxane
kube runs the video routing for the superbowl and 80% of global live sports tv
-
BPCZ
I wish companies would actually rework applications they chuck into kube. I had to de-kube something recently because the company wrapped a stateful system into kube and washed their hands like that would be fine and kube would recover things better than other options
-
monoxane
oh yea ours is kube from the ground up, you cannot forklift existing workloads into kube and expect it to go well
-
BPCZ
nepeat: you can just say you professionally scan everyone’s butthole while they sleep it’s ok we get it. Companies just really like to know what our bowels are doing
-
monoxane
lmao
-
nepeat
lmao
-
monoxane
but which type of scan
-
monoxane
optical or something more exotic like ground penetrating radar through the roof?
-
nepeat
i just work for a place that inspires creativity and brings joy...
-
monoxane
narrator: it does not
-
BPCZ
All the scans, WiFi, roomba radar, brain wave from your sexual partners. If it could detect butthole the kube workload nepeat works on tries to collect it
-
BPCZ
Mousewitz?
-
rewby
schwarzkatz|m: In response to: 11:54 <schwarzkatz|m> why is it so awfully quiet here currently, where is everybody :c
-
BPCZ
This whole conversation reminds me I need to be planning my next job and figuring out where to live next.
-
BPCZ
SF or Seattle seem to be the two big options
-
neggles
nepeat: so bytedance :v
-
rewby
Anyone happen to know the deadline for pixiv?
-
schwarzkatz|m
What is happening with this dumb matrix thing, sorry for posting duplicate messages
-
schwarzkatz|m
Deadline was 2022-12-15, that’s when their TOS changed
-
rewby
schwarzkatz|m: Yeah your matrix stuff is bork. I tried checking my matrix alt and it's delayed like mad.
-
rewby
*Ah*
-
rewby
Right okay
-
rewby
I'm gonna move some stuff around
-
schwarzkatz|m
I think it’s only happening in the mobile app though, I have the same problem with discord sometimes
-
rewby
monoxane: You were complaining about target limits? Right?
-
nepeat
oh neat
-
neggles
rewby: wound some more capacity in?
-
neggles
lets see how it looks this side...
-
rewby
It's provisioning
-
rewby
Just hit it as hard as you can
-
rewby
I'll scale it up to meet
-
rewby
I've hit the *deploy hetzner cloud* buttons
-
neggles
"just hit it as hard as you can" <- you may live to regret that
-
rewby
Trust me, I've seen worse
-
monoxane
rewby we have 1.4tbps online right now lol
-
rewby
And I have backpressure
-
rewby
The system doesn't accept more data than it can take
-
schwarzkatz|m
Argh I hate mobile apps
-
rewby
If you hit a target too hard, it'll just shut off inbound and process what it has on disk
-
rewby
I can easily scale this into 16-20 gbps
-
neggles
more the "scale up to meet" heh
-
rewby
Yea I guess
-
rewby
I can scale as high as IA has inbound on s3
-
neggles
looks like the source is now the limiter
-
rewby
Here's your reminder: There's several projects with deadlines in the next 10 days
-
rewby
pixiv, uploadir, vlive and buzzvideo are the main ones I know of
-
rewby
So throw spare capacity at those
-
BPCZ
Can’t believe I went home for Christmas and can’t even do this during the holiday
-
neggles
sounds like we should spin up some more workers pointed at those other projects then
-
neggles
~3000 of them seems to be about all pixiv can handle
-
nepeat
oh man, i see the pixiv spike
-
rewby
I'm not done scaling everything
-
monoxane
ill have you know we're currently doing 20gbps
-
rewby
I'm well aware yes
-
neggles
pixiv has definitely run out of outbound, can't pull more than 300mbit or so from it on top of what we're hitting
-
rewby
I have metrics
-
neggles
so
-
neggles
time to point some at the others?
-
rewby
Yes
-
rewby
I have three separate ansibles going on trying to move stuff around
-
neggles
workin' on it
-
monoxane
uploadir has 0 tasks available, so id say we should focus on the others
-
rewby
211k out
-
rewby
Hm
-
rewby
Lemme flush that
-
rewby
If you refresh you'll see uploadir tasks
-
monoxane
cool got em
-
neggles
just provisioning some more VMs :)
-
rewby
Hm. I think I'm hitting an IA bottleneck
-
rewby
Lemme investigate
-
monoxane
its likely its saturated s3 ingress though
-
rewby
There's 2 lbs
-
rewby
I think 10g each?
-
monoxane
yeap
-
rewby
I need permission to override and use the other one though
-
rewby
I have asked, but can't do much until I hear back
-
monoxane
valid
-
monoxane
bit of fun eh? :P
-
rewby
Just let the targets fill up and workers back off, when I hear back I can get the throughput up
-
rewby
My inbound on targets was like 16gbps
-
rewby
Right up until disks started filling up
-
monoxane
we were doing 20.02gbps at peak
-
monoxane
from the core network
-
rewby
On the upside, pixiv and vlive are looking to be done in ~24 hours according to my numbers
-
monoxane
sweet
-
rewby
vlive in <6 hours
-
neggles
vlive seems to have the most source capacity
-
rewby
We also don't have too many items there
-
rewby
Either way, when arkiver wakes up he's gonna have a field day finding more items and things to archive
-
neggles
there are six more spare boxes - much smaller, "only" 8 core, but with 10G links - i've just been handed keys to
-
neggles
used to be minecraft servers
-
rewby
I hit a spicy 18.6gbps inbound just a few ago
-
neggles
is uploadir stalled/already down?
-
neggles
seeing basically zero action out of those
-
rewby
Shouldn't be. But IIRC there's a speed limit on that
-
neggles
ah
-
monoxane
IA just cracked 20gbps inbound
-
Doomaholic
Holy crap
-
rewby
Over half of that is us
-
rewby
tbh, we've done better
-
neggles
um, maybe silly question but where do the -grab containers store the files between pull/push?
-
neggles
internally
-
rewby
In /grab
-
neggles
not in a subdir?
-
rewby
I don't remember
-
neggles
fairo
-
neggles
i guess i opened one that hasn't started yet
-
rewby
Data storage is synchronous
-
rewby
So it always does download -> upload -> download -> upload
-
rewby
So if it's waiting for work, it won't have any data stored
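The behaviour rewby describes can be sketched as a one-item-at-a-time loop; the function names here are stand-ins, not the real seesaw pipeline API:

```python
import time

def worker_loop(request_item, download, upload, max_items=None, poll_interval=30):
    """Synchronous grab worker: data only sits on disk between
    the download and upload steps of a single item, so an idle
    worker (no item assigned) holds no data at all."""
    done = 0
    while max_items is None or done < max_items:
        item = request_item()          # ask the tracker for work
        if item is None:
            time.sleep(poll_interval)  # waiting for work: /grab stays empty
            continue
        path = download(item)          # data lands under /grab only now
        upload(path)                   # ...and leaves again once pushed out
        done += 1
```

This is why an inspected container that hasn't started an item yet shows nothing under `/grab`.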
-
neggles
ah
-
neggles
some of these video files are big enough for kube to go "hey, you didn't ask for disk" and boot the pods
-
neggles
ah all under /grab/data excellent
-
monoxane
yea im currently looking at a cluster that's half evicted pods because they used 15gb of ephemeral storage 😆
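For reference, Kubernetes evicts a pod that exceeds its ephemeral-storage limit, and under node disk pressure it evicts pods that exceed their requests first; declaring both makes the behaviour predictable and lets the scheduler reserve room for `/grab`. A minimal manifest sketch (pod and image names are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pixiv-grab            # illustrative name
spec:
  containers:
    - name: grab
      image: atdr.meo.ws/archiveteam/pixiv-grab   # illustrative image ref
      resources:
        requests:
          ephemeral-storage: "20Gi"   # scheduler reserves this much node disk
        limits:
          ephemeral-storage: "40Gi"   # exceeding this gets the pod evicted
```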
-
rewby
Yeah, the video files are big
-
neggles
ok it's happy now
-
rewby
For people who were asking to donate target hw: This is what we do to disks: Data Units Written: 6,793,004,499 [3.47 PB]
-
rewby
That's in 1.5 years
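Assuming that counter is the NVMe SMART "Data Units Written" field (counted in units of 1000 × 512-byte sectors), the quoted figure checks out and works out to a sustained write rate of roughly 73 MB/s over those 1.5 years:

```python
# Sanity-check rewby's SMART figure. NVMe "Data Units Written" are
# counted in units of 1000 * 512 bytes = 512,000 bytes each.
data_units = 6_793_004_499
total_bytes = data_units * 512_000
petabytes = total_bytes / 1e15            # ~3.48 PB, matching the quote

seconds = 1.5 * 365 * 86_400              # 1.5 years
mb_per_sec = total_bytes / seconds / 1e6  # sustained average write rate
```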
-
madpro|m
I mean, Archive Team cannot be the only people making software for this nowadays. Can it?
-
madpro|m
There are tons of companies that do crawling for a business, surely they have open-sourced some more robust trackers by now?
-
madpro|m
Not that I know, as I have been searching for myself as well for the past 2 years or so.
-
neggles
anyone with a functional wide-scale web crawler / ripper is not going to hand that out for free
-
neggles
that's a surefire way to stop it working
-
madpro|m
I cannot say I'm nearly as skeptical, seeing other projects like Hadoop in distributed computing
-
neggles
hadoop is not the expensive/proprietary/"magic" part of a hadoop-based workflow though
-
neggles
it's worthless without the rules and flows and transforms etc
-
neggles
while the value of a crawler (on a commercial level anyway) comes from being able to skip around things trying to block crawlers
-
rewby
The thing with these kinds of trackers: they are tied closely to your workflow.
-
rewby
They are very specific
-
neggles
same goes for hadoop setup
-
rewby
If you tried to make an end-all-be-all tracker you'd end up with something as complex as kube
-
neggles
or SAP
-
rewby
With a much smaller market
-
madpro|m
Well there you go
-
rewby
So instead people make trackers that are good enough for their workflow
-
neggles
ERP systems are the perfect example; they do everything for everyone, but they do it by having 27,000 different modules that can be wired together in practically infinite ways
-
rewby
But then you end up being very very tied to your company
-
monoxane
i think pixiv might need a purge too, there's 1m out but it never goes below 99.95k and surely there's not a million jobs being processed rn lmao
-
rewby
I'll give it a look in a sec
-
madpro|m
Better close this tangent, before discussion shifts back to pixiv.
-
rewby
I'm actually disappointed in my upload rate
-
rewby
I've done 25G to them before
-
rewby
I'm not too worried about pixiv
-
rewby
Tldr: It doesn't recycle jobs from the out-list until todo is empty
-
rewby
And there's 8M in todo
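A toy model of that recycling policy (not the real tracker code, just the behaviour described: the out list is only drawn from once todo is empty):

```python
from collections import deque

def claim(todo: deque, out: list):
    """Hand out one item: todo first, and only recycle from the
    out list (oldest claim first) when todo has run dry."""
    if todo:
        item = todo.popleft()
    elif out:
        item = out.pop(0)      # recycle a stale claim only when todo is empty
    else:
        return None
    out.append(item)           # every claimed item sits on the out list
    return item
```

With millions of items in todo, the out count therefore stays high even though most of those claims are long stale.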
-
monoxane
im not either its just a bit of a high number
-
monoxane
ah okay i didnt know that
-
rewby
Also, monoxane, the bare -grab containers will do concurrency up to 20
-
monoxane
yes we know
-
rewby
kk
-
madpro|m
For now, in terms of tracker development we should look to making do with what we have. The IA wiki and GitHub have a long way to go in terms of documentation.
-
madpro|m
Exploiting our own resources and all that.
-
monoxane
every single one of the 3000 containers currently running across 20 nodes are on max concurrency
-
rewby
Ah okay
-
monoxane
we are entirely restricted by IA's ingest right now
-
neggles
"haha kubernetes go brrr"
-
rewby
I've redirected vlive to a pile of spinning rust
-
rewby
To sink your data into
-
Doomaholic
Bless
-
monoxane
excellent
-
neggles
ooooh one of these is a 5900X
-
Doomaholic
Delicious
-
monoxane
i think the next bottleneck might actually be rsync connections on the targets
-
monoxane
like 70% of my pods are sitting here idle waiting to retry dumping
-
monoxane
it's error 400 not -1 so it's not the disk-full cutoff
-
arkiver
thanks for the ping OrIdow6
-
arkiver
still reading some backlog
-
arkiver
monoxane: are there several people running under a single 'team name'?
-
arkiver
neggles: feel free to make a PR on the warrior docker image
-
arkiver
monoxane: if you have a ton of IPs available - telegram could definitely benefit from that, we got quite some backlog to work through
-
arkiver
on uploadir - roughly half of the items were 404
-
MrSolid
hi guys
-
MrSolid
can you please help me archive the website ac-web.org
-
arkiver
MrSolid: what is the reason?
-
arkiver
ac-web.org is not loading for me
-
MrSolid
site's been down for months since being sold to new owners; trying to migrate to a new community so information isn't lost
-
arkiver
well if the site is down, we can't archive it
-
MrSolid
its up for me thats odd
-
MrSolid
maybe someones crawling it right now haha
-
arkiver
in any case, sounds like a site we should archive yes
-
MrSolid
thank you arkiver
-
arkiver
loading very slowly now
-
MrSolid
i just hope the new owner doesnt shut the site down again before its archived
-
monoxane
arkiver not any more, for a little bit yesterday there were a couple people using one name but after we got a harsh no we split and each person controlling a set of nodes is using a different name
-
monoxane
will switch some of them to telegram in the morning
-
arkiver
monoxane: sounds good - separate names are definitely better for keeping track of who's doing what
-
arkiver
and yeah as rewby said, feel free to prepend something to the names to show people as being part of the same group
-
monoxane
yea will probably do that at some point, the team name thing was more because some people don’t want to be identified so those people are just using team name suffixed with country, identifiable enough for someone to tell ‘em to stop if it’s broken but not to be worked out from the leaderboard
-
arkiver
right yeah
-
arkiver
so what is this group of people?
-
monoxane
friends, some of which work at a tier 1 global isp and have some resources at their disposal
-
arkiver
pretty awesome
-
monoxane
the 20gbps we were pulling today didn’t even make a single pixel increase in their usage charts (outside of the routes going to targets and the sources)
-
arkiver
watch out that if you were to run a project like the URLs project (outlinks from various sources), it may contain any URL you can find online
-
monoxane
it was approved because it’s just a fun little load test on their links 😆
-
arkiver
though I'd say it is one of our most valuable projects
-
arkiver
telegram is likely very safe to run
-
arkiver
hah :) sounds good
-
monoxane
we will likely only run at full tilt when there’s an “oh fuck” event where we have 24 hours to pull an entire site
-
arkiver
alright
-
monoxane
and just leave a couple nodes running on a range of projects
-
arkiver
yeah we have some end of year shutdown going on at the moment
-
monoxane
>just a couple nodes
-
monoxane
i say this as if they’re not 100gbe directly attached to an isp core
-
arkiver
for the current short term projects, bandwidth is the bottleneck somewhere along the way
-
arkiver
but for the long term projects, IPs are the bottleneck
-
monoxane
yea, we have some potential solutions to the ip bottleneck
-
monoxane
one of which involves giving a single node an entire /24 😅
-
arkiver
"couple nodes" with each a different /24?
-
arkiver
that'd be pretty awesome :)
-
monoxane
the only problem with that is burning /24s is less justifiable than burning 20gbps out of a 20+tbit network
-
arkiver
yeah, which is likely also why our long term projects have IPs as the bottleneck rather than bandwidth
-
rewby
I think I'm currently still burning two /24s on telegram
-
rewby
Or rather, I'm burning someone else's /24s
-
arkiver
rewby: and it is really making a difference!
-
arkiver
we're slowly working through the huge telegram backlog
-
arkiver
note though that we currently cannot keep up with newly discovered group posts (we can only keep up with newly discovered channel posts)
-
arkiver
i'm stashing the group posts at another project at the moment,
tracker.archiveteam.org/telegram-groups-temp , which now has 4 billion items
-
arkiver
so we'll just feed that in slowly whenever there is room
-
arkiver
it's already very good we can keep up with channel posts however, we're discovering and archiving many of them
-
Jake
(I missed quite the night here!)
-
mgrandi
@arkiver: how are you guys doing telegram? The web view of groups ?
-
mgrandi
Also, update on the FA forums, I'm pretty sure that's not what the GDPR means , and also lol, like that's going to stop anyone
forums.furaffinity.net/threads/foru…rd-coming-soon.1682702/post-7381985
-
arkiver
mgrandi: yes
-
arkiver
on telegram
-
ivan
"I dont know how that works or if it can take as many messages or forum pages this site has." haha
-
mgrandi
@arkiver: that is the easiest way yeah, I have a lot of experience with tdlib but it's daunting how many things to support so the web view probably is the easiest way for now!
-
schwarzkatz|m
what is their concern with GDPR on an archived website
-
schwarzkatz|m
I don't really get it
-
arkiver
mgrandi: yeah, and the web view can go into the Wayback Machine
-
schwarzkatz|m
according to deathwatch,
zhihu.com/club/explore will stop working on 12-26. it's probably a good idea to archive/grab all links from these pages beforehand
-
arkiver
schwarzkatz|m: as in,
zhihu.com is shutting down?
-
arkiver
hmm
-
arkiver
i missed that on deathwatch
-
schwarzkatz|m
it says only /explore
-
arkiver
hmm yeah but I see the entire thing (zhihu.com) is shutting down next year?
-
schwarzkatz|m
looks like it
-
arkiver
fun project
-
arkiver
any idea where 'Circles' comes from in Zhihu Circles?
-
arkiver
JAA: do you know if anything was done for the furaffinity forums?
-
schwarzkatz|m
looks like a bunch of api calls to get, I'll try to grab them from /explore
-
arkiver
the deathwatch page says "Zhihu Circles" is removing that public access, is Zhihu Circles all of zhihu.com ?
-
schwarzkatz|m
a circle seems to be a /club/[0-9]+
-
schwarzkatz|m
so all items on /explore are circles
-
arkiver
i see. thank you
-
mgrandi
schwarzkatz|m: the guy probably doesn't know what he is talking about or is thinking that we would be taking the legit forum database
-
rewby
Well then, today was an *experience*
-
rewby
We're aware pixiv, vlive, etc are having target issues
-
rewby
It's actually intentional
-
rewby
HCross and I have paused high activity projects at the moment. We have too much backlogged data to process and we need to be careful with the IA.
-
rewby
Announcements in project specific channels in a few.
-
neggles
rewby: ah, sounds like we might've sent it a bit too hard
-
JAA
arkiver: I thought Fur Affinity was thrown into AB, but apparently not. I'll take a look later. Might also qwarc it.
-
h2ibot
Arcorann edited Deathwatch (+292, /* 2023 */):
wiki.archiveteam.org/?diff=49262&oldid=49255
-
schwarzkatz|m
JAA: would collecting all thread & subforum urls be helpful?
-
datechnoman
Worst case throw Fur Affinity in #Y project to be trawled through
-
JAA
schwarzkatz|m: Not needed, with qwarc, I'd probably just bruteforce thread IDs anyway.
-
schwarzkatz|m
with pagination? :O
-
JAA
Of course.
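The bruteforce-with-pagination approach can be sketched like this. The domain and URL pattern are illustrative: real XenForo thread URLs carry a slug (`threads/some-title.123/`), though bare-ID URLs typically redirect to the canonical form, and a real crawl would stop paginating at the first 404 rather than enumerate a fixed count:

```python
def thread_urls(base, max_thread_id, pages_per_thread):
    """Yield candidate XenForo thread/page URLs for sequential thread IDs.

    base: site root, e.g. "https://example.com" (illustrative).
    Every thread gets its first page plus page-2..page-N candidates.
    """
    for tid in range(1, max_thread_id + 1):
        yield f"{base}/threads/{tid}/"
        # A real qwarc run would stop on the first missing/empty page;
        # here we simply enumerate a fixed number of candidates.
        for page in range(2, pages_per_thread + 1):
            yield f"{base}/threads/{tid}/page-{page}"
```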
-
JAA
I've archived XenForo forums before with qwarc, so just need to adjust domains and am probably good to go.
-
schwarzkatz|m
okay then, let me know if I could help otherwise :)
-
JAA
If they can take the load, I could grab it all in hours. Chances are they can't though. :-)
-
datechnoman
Fur Affinity appears to be Cloudflare-backed, including their images, so they'll be able to process high throughput i'd say
-
arkiver
JAA: alright, sounds good
-
arkiver
and with bruteforcing thread IDs, you could still somehow get the pages?
-
arkiver
outlink can of course go into #// :)
-
JAA
datechnoman: That doesn't really mean much as it just depends on what the backend server is. Could be a RPi in someone's closet for all we know.
-
JAA
arkiver: Yes, I will get thread pagination. No images etc., but those can be extracted later and fed to #// along with the outlinks, yeah.
-
datechnoman
Fair call JAA. I guess the CDN helps with throughput and load but the backend processing of the requests is a different story. You can tell my main focus is the #// which is everything all over the place
-
JAA
Yeah, it certainly helps with things cached on the CDN. When bruteforcing threads, most won't be in the cache.
-
arkiver
sounds good
-
JAA
schwarzkatz|m: So forum.lacartoonerie.com is NXDOMAIN now. It had been down since the end of November anyway, but I guess that means it definitely won't be coming back.
-
schwarzkatz|m
good that we got it then :)
-
JAA
Your grab is on IA?
-
JAA
The ArchiveBot job didn't get far.
-
schwarzkatz|m
I thought you got it all :/
-
JAA
No, it got errors pretty soon after I started it. That's why I asked about whether you had also seen timeouts in your crawl.
-
JAA
I don't think it managed to retrieve much more after that.
-
JAA
-
JAA
Missing the -meta.warc.gz though, do you still have that?
-
schwarzkatz|m
that's unfortunate then
-
schwarzkatz|m
my grab is partially in WBM since I at first used SPN exclusively
-
JAA
Looks like someone else also did something in September, but it's in WARCZone:
archive.org/details/warc_forum_lacartoonerie_com_20220927
-
schwarzkatz|m
I have deleted all files after I uploaded that, looks like I didn't see that one
-
JAA
Oof
-
schwarzkatz|m
what's in there?
-
JAA
Log
-
JAA
Less important than the data I suppose, but yeah, please upload it on future grabs.
-
schwarzkatz|m
will do
-
JAA
Do we know of any list of projects on SourceHut that will be removed? If not, can someone try to compile one?
sourcehut.org/blog/2022-10-31-tos-update-cryptocurrency
-
schwarzkatz|m
searching for related words turns up maybe less than 20 public repos in total. maybe it's a good idea to get these and then archive all 1058 repos?
-
JAA
Sounds reasonable.
-
JAA
Not sure about archiving all repos actually, but sounds like it shouldn't be too big. Unless there are a dozen copies of Linux and Chromium on it. :-|
-
arkiver
"how about we just get everything?" "sounds reasonable" :P
-
arkiver
hahaha yeah!
-
JAA
I will grab all of sr.ht eventually anyway (when that bot is ready), I'm just not entirely certain it's worth doing that now.
-
JAA
Yeah, as expected, there are at least a couple copies of the Linux repo. Those would be duplicated.
-
schwarzkatz|m
contains also non cryptocurrency stuff, didn't sort that out
-
JAA
Is it 1058 repos or 1058 projects? Projects can have multiple repos, I think.
-
schwarzkatz|m
projects then :D
-
JAA
Thanks for the list, will do the magic later.
-
schwarzkatz|m
great
-
JAA
And I might just throw
sr.ht into AB and add aggressive ignores to get a general record of what's on there.
-
JAA
The project pages should have some records of the (short) commit IDs, too, which could be used to verify mirrors, for example.
-
JAA
arkiver: Heard anything from GeoLog?
-
JAA
Ah, the repos are on a separate domain anyway, right. So it'd grab those and not recurse further, which is even better.
-
JAA
SourceHut does also support unlisted repos, which would be tricky to find.
-
arkiver
JAA: no, nothing
-
arkiver
ACTUALLY
-
arkiver
got a reply literally few hours ago
-
arkiver
:)
-
JAA
:-)
-
Ryz
Ooo, reply? O:
-
Ryz
arkiver?
-
pabs
JAA: #swh folks pointed me at this rejection of an API to list all SourceHut repos:
lists.sr.ht/~sircmpwn/sr.ht-dev/patches/4859
-
pabs
JAA: btw, could you pastebin a link of the sr.ht repos you archive into #swh (libera) so they can grab them too?