When voltage is dropping (or phase is falling out of alignment) don't you want to shed load to stabilize? The issue seems to be more that too much load could shed too quickly causing wild oscillations in grid conditions from an undervolt to an overvolt. Seems like the correct, and not terribly complicated, solution to this is to have a process whereby the grid operator can request large customers to shed load and move to backup power temporarily.
That seems like such a reasonable suggestion that I would be shocked if such a thing does not already exist and is simply just not reactive enough and too manual.
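To make the "don't shed too much too quickly" part concrete, here is a minimal sketch of a stepped shed request with a deadband. Everything in it is an assumption for illustration (the function name `shed_request_mw`, the per-unit thresholds, the 5 MW step); it is not how any real grid operator's demand-response program works.

```python
def shed_request_mw(voltage_pu, currently_shed_mw, *, low=0.95, high=1.02,
                    step_mw=5.0, max_shed_mw=50.0):
    """Return the new total load (MW) that large customers are asked to shed.

    voltage_pu is the measured voltage in per-unit (1.0 = nominal).
    All thresholds and step sizes are made-up illustrative values.
    """
    if voltage_pu < low:
        # Undervoltage: ask for one more step, never everything at once.
        return min(currently_shed_mw + step_mw, max_shed_mw)
    if voltage_pu > high:
        # Overvoltage: hand load back one step at a time.
        return max(currently_shed_mw - step_mw, 0.0)
    # Inside the deadband: hold steady so conditions can settle.
    return currently_shed_mw


shed = 0.0
for v in [0.93, 0.94, 0.97, 1.03, 1.00]:
    shed = shed_request_mw(v, shed)
    print(f"voltage={v:.2f} pu -> shed {shed:.0f} MW")
```

The step limit and the deadband are the two pieces that keep the request from swinging between "shed everything" and "restore everything" on every measurement.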
SpaceX was launching a modest % of the LEO constellation but after the Blue Origin failure, SpaceX is the only launch provider who can fill that gap and actually let LEO deliver on contracted time.
Please don't misunderstand me, I'm no Musk sycophant though I do love SpaceX and Starlink. I want us to have multiple providers of super cheap space launch capabilities and multiple diverse LEO satellite constellations (3-4 on a global scale makes sense I think?).
I'm sure Blue Origin will get there some day and I'm sure LEO will get there too (maybe even in the 2028 window if they expand their SpaceX launch partnership).
In fact the article comes dangerously close to admitting that there is correlation without causation. It opens with:
> Here is the short version. In 2012, Caltrain budgeted its electrification project — the backbone of the Peninsula's transit future and a prerequisite for high-speed rail to ever reach San Francisco — at roughly $1.5 billion. By 2017 that number had ballooned to $1.9 billion. In between, the Town of Atherton sued.
While I don't agree with what Atherton did here (in general, I did not look at the specifics), you have to be fairly negligent to think you're going to build something in California without a massive legal headache. This is a legislative problem which it sounds like, for this narrow case, the legislature actually solved (shockingly to me). I find it hard to blame the residents of the city for exercising their rights.
> This is a legislative problem which it sounds like, for this narrow case, the legislature actually solved (shockingly to me). I find it hard to blame the residents of the city for exercising their rights.
Filing frivolous lawsuits is also a right but we don't withhold our criticism of that practice. What Atherton did seems like the wealthy person's equivalent of that, down to it being dismissed. Legal? yes. Cynical and amoral, also yes.
I agree, and I do not take issue with the general complaint (frivolous lawsuits). I am merely pointing out that your ire should be directed more at the legislature, not at the people.
This is almost definitely an issue of equipment failure.
Cooling in datacenters is, like everything else, both over- and under-provisioned.
It's overprovisioned in the sense that the big heat exchange units are N+1 (or, in very critical and smaller-load facilities, 2N/3N). This is done because you need to regularly take these down for maintenance work, they have a relatively high failure rate compared to traditional DC components, and they need mechanical repairs that require specialized labor and long lead times. In a bigger facility it's not uncommon to have cooling be N+3 or more when N becomes a bigger number, because you're effectively always servicing something or have something down waiting for a blower assembly which needs to be literally made by a machinist with a lathe because that part doesn't exist anymore, but that's still cheaper than replacing the whole unit.
The systems are also under-provisioned in the sense that if all the compute capacity in the facility suddenly went from average power draw to 100% power draw, you would overload the cooling capacity; you would commonly overload things in the electrical and other paths too. Oversubscription like this is just the nature of the industry.
In general neither of these things poses a real problem because compute loads don't spike to 100% of capacity and when they do spike they don't spike for terribly long and nobody builds facilities on a knife-edge of cooling or power capacity.
The problem comes when you have the intersection of multiple events.
You designed your cooling system to handle 200% of average load which is great because you have lots of headroom for maintenance/outages.
Repair guy comes on Tuesday to do work on a unit and finds a bad bearing, has to get it from the next state over so he leaves the unit off overnight to not risk damaging the whole fan assembly (which would take weeks to fabricate).
The two adjacent cooling units are now working JUST A BIT harder to compensate and one of them also had a motor which was just slightly imbalanced or a fuse which was loose and warming up a bit and now with an increased duty cycle that thing which worked fine for years goes pop.
Now you're minus two units in an N+2 facility. Not really terrible, remember you designed for 200% of average load.
That 3rd unit on the other side of the first failed unit, now under way more load, also has a fault. You're now minus 3 in an N+2 facility.
Still, not catastrophic because really you designed for 200% of average load.
The thing is, it's now 4AM, the onsite ops guy can't fix these faults and needs to call the vendor who doesn't wake up till 7AM and won't be onsite till 9.
Your load starts ramping up.
Everything up above happens daily in some datacenter in the USA. It happens in every datacenter probably once a year.
What happens next is the confluence of events which puts you in the news.
One of your bigger customers decides now is a great time to start a huge batch processing job. Some fintech wants to run a huge model before market open or some oil firm wants to do some quick analysis of a new field.
They spin up 10000 new VMs.
Normally, this is fine, you have the spare capacity.
But, remember, you planned cooling capacity for 200% of AVERAGE load, and these are not nodes which are busy but not terribly busy. These are nodes doing intense, optimized number-crunching work, which means they draw max power and thus expel max waste heat.
Not only has your load in terms of aggregate number of machines spiked but their waste heat impact is also greater on average.
Boom, cascading failure, your cooling is now N-4.
Server fans start ramping up faster which consumes more power.
Your cooling is now N-5.
Alarms are blaring all over the place.
Safeties on the cooling units start to trip as they exceed their load and refrigerant pressures rise.
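For anyone who wants to see the arithmetic behind "200% of average load stops being enough", here is a rough sketch. The facility numbers (1000 kW average heat load, ten 200 kW units, a batch job pushing load to 1.5x average) are invented for illustration and the helper name `cooling_headroom` is hypothetical. It also treats cooling as one shared pool, which real facilities only approximate.

```python
def cooling_headroom(units_installed, units_down, unit_capacity_kw, heat_load_kw):
    """Return remaining cooling headroom in kW (negative means overloaded)."""
    working = units_installed - units_down
    return working * unit_capacity_kw - heat_load_kw


# Hypothetical facility: 1000 kW average heat load, 10 units of 200 kW each,
# i.e. cooling sized for 200% of average load.
AVERAGE_KW = 1000
UNITS = 10
UNIT_KW = 200

scenarios = [
    ("normal day", 0, 1.0),
    ("three units down, average load", 3, 1.0),
    ("three units down, batch job at 1.5x average", 3, 1.5),
    ("five units down, batch job at 1.5x average", 5, 1.5),
]
for name, down, load_factor in scenarios:
    margin = cooling_headroom(UNITS, down, UNIT_KW, AVERAGE_KW * load_factor)
    print(f"{name}: {margin:+.0f} kW")
```

With these made-up numbers, three failed units are survivable at average load (+400 kW) but not once the batch job lands (-100 kW), and each further trip digs the hole deeper.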
Reminds me of when I did Noogler training back in the day and one of the talks described a cascading failure at a datacenter, starting with a cat which was too curious near a power conditioner and briefly conducted.
It's cold up here in the winter, sadly, and the residual heat from even totally passive components like switchgear is enough to warm things up enough to attract them. 0.001% of 1MW of power (10W) is still quite warm. (I have no idea how much switchgear leaks, but I know they are warm even in winter outdoors.)
And, yeah, the rest of the writeup is also an amalgamation of some panic-inducing experiences in my life.
There are often little bits of Neal Stephenson or Andy Weir novels which sound a little like this, describing a technical fault in a plot-driven way (often as a cascade), and I do find those to be uniquely enjoyable. I'm sure there are other authors who do similar things, though maybe "cloud/AI data center" stories should be its own micro-genre, given how crucial these things are to society.
I'd expect someone like AWS to just throttle machines before overloading their cooling. Because they probably can do that, while e.g. a data center that just rents the space can't really throttle their customers nicely.
Reducing clock speeds, even if they could do that -- and I'm not sure they can, given how Nitro is designed -- would be problematic since a lot of customer workloads assume homogeneous nodes.
But they did load-shed. Perhaps not soon enough, but the reason this is publicly known is because they reduced the amount of heat being produced.
Right, exactly, I highly doubt the facility went into any kind of actual uncontrolled thermal rise. This is news because they had to take such drastic actions. I'm sure it's common that they force spot prices up (probably way up) to compensate for reduced capacity due to events, and I'm sure they even sometimes fake no capacity for similar reasons. No capacity means "I don't want to turn on your node", not merely "I don't have any more physical servers I could turn up for you".
This is news because they powered off some non-preemptible customer loads, which actually makes me wonder if you saw that chain of events occur here.
spot prices rise -> new instance availability goes to 0 -> preemptible instances go dark -> normal instances go dark.
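That chain reads like an ordered escalation ladder, which could be sketched roughly as below. This is pure speculation in code form: the stage names, the utilization thresholds, and the `stage_for` helper are all hypothetical and say nothing about how any cloud provider actually does it.

```python
from enum import IntEnum


class ShedStage(IntEnum):
    NORMAL = 0
    RAISE_SPOT_PRICE = 1
    NO_NEW_INSTANCES = 2
    STOP_PREEMPTIBLE = 3
    STOP_NORMAL = 4


# Hypothetical thresholds: fraction of available cooling capacity in use.
THRESHOLDS = [
    (0.80, ShedStage.RAISE_SPOT_PRICE),
    (0.90, ShedStage.NO_NEW_INSTANCES),
    (0.97, ShedStage.STOP_PREEMPTIBLE),
    (1.00, ShedStage.STOP_NORMAL),
]


def stage_for(utilization):
    """Return the most severe stage whose threshold has been reached."""
    stage = ShedStage.NORMAL
    for limit, candidate in THRESHOLDS:
        if utilization >= limit:
            stage = candidate
    return stage


for u in (0.50, 0.85, 0.95, 0.98, 1.10):
    print(f"{u:.2f} -> {stage_for(u).name}")
```

The cheap, invisible levers come first and powering off non-preemptible customer loads is the last rung, which is why reaching it makes the news.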
It's harder and harder to throttle machines, with hardware segmentation capabilities effectively passing through hardware components "intact".
A decade ago it was trivial to just tell the hypervisor to reduce the cpu fraction of all VMs by half and leave half unallocated. Now, it's much more complicated and definitely would be user visible.
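As a sketch of what "reduce the CPU fraction by half" can look like on a plain Linux host, assume each VM runs in its own cgroup v2 group and is limited through that group's `cpu.max` file, which takes a quota and a period in microseconds. That setup is my assumption for illustration, and the helper `cpu_max_line` is hypothetical. It only builds the string and does not write anything.

```python
def cpu_max_line(vcpus, fraction, period_us=100_000):
    """Build a cgroup v2 cpu.max value ("<quota_us> <period_us>").

    vcpus    -- number of virtual CPUs the VM was sized for
    fraction -- share of that CPU time the VM may still use (0 < fraction <= 1)
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    # Quota is total CPU time per period, summed across all of the VM's CPUs.
    quota_us = int(vcpus * period_us * fraction)
    return f"{quota_us} {period_us}"


print(cpu_max_line(4, 1.0))   # full allocation for a 4-vCPU VM
print(cpu_max_line(4, 0.5))   # same VM capped to half
```

This only works when the host scheduler sits between the guest and the cores. With hardware passed through more or less intact there is no equivalent single knob, which is the point above.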
The cooling units don't fail just because they get to 100% duty cycle. That's pretty much "normal operation", you just get... higher efficiency coz the cooling side is warmer.
Not according to POTUS math. You can have 200%, 500%, 600%, 1200%. You just have to say it enough and people will question if they really might not understand percentages enough, and just go with it.
I would have thought with all the data centers being built the parts for cooling systems would be standardized with replacements available from Grainger immediately.
City: just take a walk through Manhattan and in a block or two look at the giant open-pit excavation with a 200-year-old morass of undocumented infrastructure under the street. This is before you even try to run fiber up to units in buildings which were built before electricity was standard. I am hardly saying it can't be done, simply that it is not as easy as density makes it seem.
State: just drive two hours north of NYC and (if you're not still in Manhattan) you'll be in some fantastic areas of the state, but there the exact opposite problem exists.
Of note, I do think both of these problems are solvable and we should fundamentally solve them. It's just that anybody who thinks it's easy or cheap to do so is being myopic. If spent wisely, though, it could be a very useful investment of our money.
Do you think the wilds two hours north of NYC are more or less difficult for laying fibre lines than between homes literally in the Alps? 60% of Switzerland is Alps. Not exactly a cake walk for infrastructure development.
And why would they need open pit excavation for FTTH in NYC? Are there not existing trenches and under-street ducting for cables already in most of the city? Surely there are going to be some tricky areas, but how do the other utilities like phones and electric work on their cabling?
> It also writes files in it's own uninterpretable format to object storage, so if you lose the metadata store, you lose your data.
That's so confusing to me I had to read it five times. Are you saying that the underlying data is actually mangled or gone, or merely that you lose the metadata?
One of the greatest features of something like this, to me, would be durable access to my data even without JuiceFS in a bad situation. Even if JuiceFS totally messes up, my data is still in S3 (and with versioning etc., even if JuiceFS mangles or deletes my data, it's still there). So odd to design this kind of software and lose this property.
FUSE generally has low overall performance because of an additional data transfer process between the kernel space and user space, which is less than ideal for AI training.
As I understand it, if the metadata is lost then the whole filesystem is lost.
I think this is a common failure mode in filesystems. For example, in ZFS, if you store your metadata on a separate device and that device is destroyed, the whole pool is useless.
I remember being amazingly excited to have saved up enough money to go to the store and buy a 33.6 modem (an amazing upgrade from my 14.4).
A year or so later I upgraded to a v.92 only to realize my ISP (I think it was IDT at the time) didn't support that and only supported some other 56k "standard" (details are sketchy on this, I was like 12). I was devastated and it was too late to drive back to Computer City to exchange it for the correct one.
It is very janky. For the speed camera I have an old Core i5 running YOLOv8 on the integrated GPU, and it can just /barely/ handle 30FPS of inference. The code is all Python and vibe coded (for science). The speed camera needs a perpendicular view to work best for how I set it up (measuring two reference points with a known distance). So the ALPR camera is separate, and I basically just buffer video and built this ultra janky scheme where I call an HTTP endpoint and it saves the last few seconds, and then I batch process to associate the plate later in the web app. It is all CSV and plain files; this is a perfect append-only DB scenario. Eventually it will need the wonders of the big data format SQLite probably, but I am sure Claude will know what to do ;) The long-term solution would be to have a proper radar circuit and two cameras facing both road directions to capture the rear plate, as people often don't use front plates here even though they are required to by law.
The point, though, is that you don't need a lot of GPU power to do, say, YOLOv8 inference on the pre-trained models, and OpenCV makes this all pretty darn easy.
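For the curious, the two-reference-point measurement boils down to distance over elapsed time, with time taken from frame numbers. A minimal sketch, assuming you already know which frames the vehicle crossed each reference line in (the function name and the 10 m spacing are made up for the example, not taken from my actual code):

```python
def speed_kmh(distance_m, frame_a, frame_b, fps=30.0):
    """Estimate speed from the frames at which a vehicle crossed two lines.

    distance_m -- known distance between the two reference points
    frame_a    -- frame index when the vehicle crossed the first point
    frame_b    -- frame index when it crossed the second point
    """
    frames = frame_b - frame_a
    if frames <= 0:
        raise ValueError("second crossing must come after the first")
    seconds = frames / fps
    return distance_m / seconds * 3.6  # m/s -> km/h


# 10 m covered in 30 frames at 30 FPS is 1 second, i.e. 10 m/s.
print(speed_kmh(10.0, 100, 130))  # 36.0
```

Timing is only as fine as one frame, so at 30 FPS a faster car crossing the same 10 m in fewer frames gets a coarser estimate; a longer baseline or a higher frame rate tightens it up.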