SE Radio 709: Bryan Cantrill on the Data Center Control Plane
Bryan Cantrill, the co-founder and CTO of Oxide Computer Company, speaks with host Jeremy Jung about challenges in deploying hardware on-premises at scale. They discuss the difficulty of building out Samsung data centers with off-the-shelf hardware, how vendors silently substitute components, causing performance problems, and why AWS and Google build their own hardware. Bryan describes the security vulnerabilities and poor practices built into many baseboard management controllers, the purpose of a control plane, and his experiences building one in NodeJS while struggling with the runtime's future during his time at Joyent. He explains why Oxide chose Rust for its control plane and the OpenSolaris-derived Illumos as the operating system for their vertically integrated rack-scale hardware, which is designed to help address a number of these key challenges. Brought to you by IEEE Computer Society and IEEE Software magazine.
---
Show Notes
#### Related Episodes
- SE Radio 413: Spencer Kimball on CockroachDB
- SE Radio 690: Florian Gilcher on Rust for Safety-Critical Systems
#### Related Links
- Oxide Computer
- Oxide and Friends
- Illumos
- Platform as a Reflection of Values
---
#### Transcript
Transcript brought to you by IEEE Software magazine.
This transcript was automatically generated. To suggest improvements in the text, please contact [email protected] and include the episode number.
Jeremy Jung 00:00:18 Hey, this is Jeremy Jung for Software Engineering Radio, and today I'm talking to Bryan Cantrill. He's the co-founder and CTO of Oxide Computer Company, he was previously the CTO of Joyent, and he also co-authored the DTrace tracing framework while he was at Sun Microsystems. Bryan, welcome to Software Engineering Radio.
Bryan Cantrill 00:00:38 Awesome, thanks for having me. It’s great to be here.
Jeremy Jung 00:00:41 You're the CTO of a company that makes computers, but I think before we get into that: a lot of people who build software now have the actual computer abstracted away; they're using AWS or some kind of Cloud service. So, I thought we could start by talking about data centers, because you were previously working at Joyent, and I believe you got bought by Samsung, and you've previously talked about how you had to figure out how to run things at Samsung's scale. So yeah, how was your experience with that? What were the challenges there?
Bryan Cantrill 00:01:21 Yeah, so Joyent was a Cloud computing pioneer. We competed with the likes of AWS and then later GCP and Azure. And we were operating at some scale, right? We had a bunch of machines, a bunch of DCs, but ultimately, we were a VC-backed company and a small company, certainly by Samsung standards. The reason, by the way, that Samsung bought Joyent is that Samsung's Cloud bill was, let's just say, extremely large. They were spending an enormous amount of money every year on the public Cloud. And they realized that in order to secure their fate economically, they had to be running on their own infrastructure; renting it all just did not make sense. And there was not really a product that Samsung could go buy that would give them that on-prem Cloud; the state of the market really offered nothing in that regard.
Bryan Cantrill 00:02:19 And so they went looking for a company and bought Joyent. And when we were on the inside of Samsung, we learned about Samsung scale. And Samsung loves to talk about Samsung scale. And I've got to tell you, it is more than just chest thumping: the sheer number of devices, the number of customers, just this absolute size. They really wanted to take us to levels of scale that we certainly had not seen. The reason for buying Joyent was to be able to stand up their own infrastructure, so we went and bought a bunch of hardware. And I remember just thinking, God, I hope Dell is somehow magically better. I hoped the problems that we had seen in the small would just go away. Hope is of course a terrible strategy, and it was a terrible strategy here too.
Bryan Cantrill 00:03:09 Because when you scale out, the problems that you used to see once or twice you now see all the time, and they become absolutely debilitating. And we saw a whole series of really debilitating problems, in many ways comically debilitating in terms of showing just how bad the state of the art is. And it should be said, we had great software and great software expertise, and we were controlling our own system software. But even controlling your own system software, your own host OS, your own control plane, which is what we had at Joyent, ultimately you're limited. You can obviously solve the problems that are in your own software, but the problems that are beneath you: the problems that are in the hardware platform, the problems that are in the componentry beneath you, the problems that are in the firmware.
Bryan Cantrill 00:04:09 Those problems become unresolvable, and they are deeply, deeply frustrating. And we just saw a bunch of them; again, they were comical in retrospect. And I'll give you a couple of concrete examples just to give you an idea of what you're looking at. One of our data centers had really pathological I/O latency. We had a very database-heavy workload, and this was right at the period where you were still deploying on rotating media, on hard drives. All-flash did not make economic sense when we did this in 2016. It would probably be interesting to know when the last time was that actual hard drives made sense, because I feel like this was close to it. So, we had a bunch of pathological I/O problems, but we had one data center in which the outliers were actually quite a bit worse, and there was so much going on in that system.
Bryan Cantrill 00:05:07 It took us a long time to figure out why, because when you're seeing worse I/O, naturally you want to understand what the workload is doing. You're trying to take a first-principles approach: what's the workload doing? This was a very intensive database workload to support the object storage system that we had built, called Manta, and the metadata tier was stored in Postgres. And that was just getting absolutely slaughtered, ultimately very I/O bound with these kind of pathological I/O latencies, as we were trying to peel away the layers to figure out what was going on. And I finally got to this point: okay, at the device layer, at the disk layer, we are seeing pathological outliers in this data center that we're not seeing anywhere else.
Bryan Cantrill 00:05:58 And that does not make any sense. And the thought occurred to me: well, maybe we have a different rev of firmware on our HGST drives (HGST, now part of Western Digital, made the drives we had everywhere). So maybe I had a firmware bug; it would not be the first time in my life at all that I would have a drive firmware issue. And I went to go pull the firmware rev and I'm like, Toshiba makes hard drives? I had no idea that Toshiba even made hard drives, let alone that they were in our data center. I'm like, what is this? And as it turns out, this is part of the challenge when you don't have an integrated system, which, not to pick on them, Dell doesn't have: Dell would routinely make substitutes. It's kind of like going to Instacart or whatever and they're out of the thing that you want.
Bryan Cantrill 00:06:55 Someone makes a substitute, and sometimes that's okay, but it's not okay in a data center, where you really want to develop and validate an end-to-end integrated system. And in this case, Toshiba does make hard drives, or they did, but they were basically not competitive, and they were not competitive in part for the reasons that we were discovering. They had serious firmware issues. These were drives that would simply stop acknowledging any reads for on the order of 2,700 milliseconds, a long time: 2.7 seconds. And that was a drive firmware issue, but it highlighted a much deeper issue, which was the simple lack of control that we had over our own destiny. And it's an example among many where Dell is making a decision that marginally lowers the cost of what they are providing you, but is then giving you a system that they shouldn't have any confidence in, because it's not one that they've actually designed, and they leave it to the customer, the end user, to make these discoveries.
Bryan Cantrill 00:08:02 And these things happen up and down the stack. And not just to pick on Dell, because it's true for HPE, it's true for Supermicro, it's true for your switch vendors, it's true for storage vendors: the one that is left actually integrating these things and trying to make the whole thing work is the end user sitting in their data center. There's not a product that they can buy that gives them elastic infrastructure, a Cloud in their own DC; the product that you buy is the public Cloud. When you go to the public Cloud, you don't worry about this stuff because it's AWS's issue or it's GCP's issue. And they are the ones that get this to ground. And this was kind of the eye-opening moment, not a surprise: they're not Dell customers, they're not HP customers, they're not Supermicro customers.
Bryan Cantrill 00:08:53 They have designed their own machines, to varying degrees depending on which one you're looking at; they've taken the clean sheet of paper. And the frustration that we had at Joyent, and then at Samsung wondering what was next, is that what they built was not available for purchase for the data center. You could only rent it in the public Cloud. And our big belief is that public Cloud computing is an important revolution in infrastructure. It doesn't feel like a very deep thought, but Cloud computing is a really important revolution, and it shouldn't only be available to rent; you should be able to actually buy it. And there are a bunch of reasons for doing that. The one we saw at Samsung is economics, which I think is still the dominant reason: it just does not make sense to rent all of your compute in perpetuity.
Bryan Cantrill 00:09:43 But there are other reasons too. There's security, there's risk management, there's latency; there are a bunch of reasons why one might want to own one's own infrastructure. So the genesis for Oxide was coming out of this very painful experience. And, a long answer to your question about what it was like to be at Samsung scale: those are the kinds of things we saw. In our other data centers, we didn't have Toshiba drives, we only had the HGST drives, but it's only when you get to this larger scale that you begin to see some of these pathologies. And these pathologies then are really debilitating for those who are trying to develop a service on top of them. So, it was very educational in that regard, and we're very grateful for the experience at Samsung in terms of opening our eyes to the challenge of running at that kind of scale.
Jeremy Jung 00:10:33 And during that time at Joyent, when you experienced some of these issues, was it more a case of not having enough servers for these issues to show up often? So, if something did happen, you might say, well, this one's not working, so maybe we'll just replace the hardware. What was the thought process when you were working at that smaller scale, and how did these issues affect you?
Bryan Cantrill 00:10:56 Yeah, at the smaller scale you see fewer of them, right? You might see something and go, that's weird, we kind of saw this in one machine, versus seeing it in a hundred or a thousand or 10,000. So, you just see them less frequently, and as a result they are less debilitating. When you go to that larger scale, those things that were unusual now become routine, and they become debilitating. So, it really is in many regards a function of scale. And then it was also a little bit dispiriting that the substrate we were building on really had not improved. If you buy a computer server, an x86 server, there is a very low layer of firmware, the BIOS, the Basic Input Output System, the UEFI BIOS, and this is an abstraction layer that has existed since the '80s and hasn't really meaningfully improved.
Bryan Cantrill 00:11:50 But beyond that, this lowest layer of platform enablement software is really only impeding the operability of the system. You look at the baseboard management controller, which is the computer within the computer: there is an element in the machine that needs to handle environmentals, that needs to operate the fans and so on, and that traditionally has been the baseboard management controller. And that architecturally just hasn't improved in the last two decades. And it's a proprietary piece of silicon, generally from a company that no one's ever heard of called ASPEED, which is written in all caps, so I guess it needs to be screamed. ASPEED has a proprietary part, and infamously there is a root password encoded effectively in silicon. Anyone who goes deep into these things is like, oh my god, are you kidding me? When we first started Oxide, the Wi-Fi password was a fraction of the ASPEED root password for the BMC. It's kind of like a little BMC humor. But it was just dispiriting that the state of the art was still basically personal computers running in the data center. And that's part of what was the motivation for doing something new.
Jeremy Jung 00:12:59 And for the people using these systems, whether it’s the baseboard management controller or it’s the BIOS or UEFI component, what are the actual problems that people are seeing?
Bryan Cantrill 00:13:17 Oh man, you are going to have some fraction of your listeners, maybe a big fraction, who are like, yeah, what are the problems? That's a good question. And then you're going to have the people that actually deal with these things, whose heads have already hit the desk, who are like, what are the problems? What are the non-problems? What actually works? That's a shorter answer. There are so many problems, and they spread to the horizon, so you can start wherever you want. But as a really concrete example: the BMC, the computer within the computer, needs to be on its own network. So now you have not one network, you've got two networks, and that network, by the way, is the network that you're going to log into to reset the machine when it's otherwise unresponsive.
Bryan Cantrill 00:14:05 By going into the BMC, you are able to control the entire machine. Well, now I've got a second network that I need to manage. What is running on the BMC? Well, it's running some ancient, ancient version of Linux. So how do I patch that? How do I manage the vulnerabilities in that? Because if someone is able to root your BMC, they control the system. And now you've got to go deal with all of the operational hair around that: how do you upgrade that system, updating the BMC? You've got this second, shadow infrastructure that you have to go manage, generally not open source. There's something called OpenBMC which people use to varying degrees, but you're generally stuck with the proprietary BMC.
Bryan Cantrill 00:14:51 So you're generally stuck with iLO from HPE, or iDRAC from Dell, or Supermicro's BMC, and it is just excruciating pain. And this is assuming, by the way, that everything is behaving correctly. The problem is that these things often don't behave correctly, and then the consequence of them not behaving correctly is really dire, because it's at that lowest layer of the system. So, I'll give you a concrete example. A customer reported this to me, so I won't disclose the vendor, but let's just say that a well-known vendor had an issue where their temperature sensors were broken and would always read basically the wrong value. So the BMC had to invent its own different kind of thermal control loop, and it would index on the actual inrush current: they would look at the current that's going into the CPU to adjust the fan speed.
Bryan Cantrill 00:15:47 That's an interesting idea. It doesn't work, because that's not the temperature. So that software would crank the fans whenever you had an inrush of current, and this customer had a workload that would spike the current. When it would spike the current, the fans would kick up and then slowly wind down over time. Well, this workload was spiking the current faster than the fans would wind down, but not fast enough to actually heat up the part. And ultimately, over a very long time and a very painful investigation, this customer determined: my fans are cranked in my data center for no reason, blowing cold air. And this is on the order of a hundred watts per server of energy that you shouldn't be spending, and it comes down to this kind of broken software-hardware interface at the lowest layer that has real, meaningful consequence in terms of hundreds of kilowatts across a data center.
Bryan Cantrill 00:16:44 So this stuff has very, very real consequence, and it's such a shadowy world. Part of the reason that those of your listeners who have dealt with this will have their heads hitting the desk is because it is really aggravating to deal with problems at this layer. You feel powerless; you don't control or really see the software that's on these things. It's generally proprietary. You are relying on your vendor, and your vendor is telling you, boy, I don't know, you're the only customer seeing this. The number of times I have heard that! And I have pledged that we're not going to say that at Oxide, because it's such an awful thing to say. You're the only customer seeing this? It feels like you're blaming me for my problem. And what you begin to realize is that to a degree these folks are speaking their own truth, because the folks that are running at real scale, at hyperscale, aren't Dell, HP, or Supermicro customers. They've done their own thing. So Dell's not seeing that problem, because their customers are not running at the same scale. But you only have to run at modest scale before these things become overwhelming in terms of the headwind that they present to people who want to deploy infrastructure.
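The runaway-fan failure mode Bryan describes can be sketched in a few lines. This is a toy simulation with made-up numbers, not the vendor's actual firmware: fan speed is keyed to current spikes rather than temperature, and because the spikes recur faster than the fans can wind down, the fans stay pinned high even though the part never heats up.

```javascript
// Toy simulation of a thermal control loop that keys fan speed off CPU
// inrush current instead of temperature. All numbers are illustrative.
function simulateFans({ ticks, spikeEvery, fanDecayPerTick }) {
  let fanSpeed = 0.2; // fraction of max; 0.2 is the idle floor
  const history = [];
  for (let t = 0; t < ticks; t++) {
    if (t % spikeEvery === 0) {
      fanSpeed = 1.0; // controller cranks fans on any current spike
    } else {
      fanSpeed = Math.max(0.2, fanSpeed - fanDecayPerTick);
    }
    history.push(fanSpeed);
  }
  return history;
}

// Spikes arrive every 5 ticks; fans wind down only 0.1 per tick, so they
// can never fall below 0.6 before the next spike re-cranks them.
const h = simulateFans({ ticks: 100, spikeEvery: 5, fanDecayPerTick: 0.1 });
const minAfterWarmup = Math.min(...h.slice(10));
console.log(minAfterWarmup); // stays around 0.6: fans pinned high, blowing cold air
```

The point of the sketch is the mismatch of time constants: the proxy signal (current) recovers slower in the controller than it recurs in the workload, so the feedback loop never settles.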
Jeremy Jung 00:17:57 Yeah. So maybe to help people get some perspective, at what point do you think people start noticing or start feeling these problems? Because I imagine that if you just have a few racks or…
Bryan Cantrill 00:18:14 A couple racks, and you're just wondering? No, no, no: I think anyone who deploys any number of servers, especially now, especially if your experience is only in the Cloud, is going to be like, what the hell is this? Just to get this thing working at all is so hairy and so congealed, right? It's not designed. Nobody who is setting up a rack of servers is going to think to themselves, yes, this is the right way to do it, this all makes sense. It's a bag of bolts; it's a bunch of parts that you're putting together. And so even at the smallest scales that stuff is painful; architecturally it's painful at the small scale, but at least you can get it working. The stuff that then becomes debilitating at larger scale are the things that are worse than just, this thing is a mess to get working.
Bryan Cantrill 00:19:03 It's like the fan issue, where you are now seeing this over hundreds of machines or thousands of machines. So, it is painful at more or less all levels of scale. There is no level at which the PC, which is really what this is, this is the personal computer architecture from the 1980s, is the right thing to go deploy. Especially if what you are trying to run is elastic infrastructure, a Cloud. Because the other thing is, we've been talking a lot about that hardware layer, and hardware is just the start. You've got to go put software on that and actually run it as elastic infrastructure. So, you need a hypervisor, yes, but you need a lot more than that. You need a distributed database, you need web endpoints, you need a CLI, you need all the stuff that you need to actually go run an actual service of compute or networking or storage. And even for compute, there's a ton of work to be done, and compute is by far, I would say, the simplest of the three. When you look at network services and storage services, there's a whole bunch of stuff that you need to go build in terms of distributed systems to offer that as a Cloud. So, it is painful at more or less every level if you are trying to deploy Cloud computing on-prem.
Jeremy Jung 00:20:21 And for someone who doesn’t have experience building or working with this type of infrastructure, when you talk about a control plane, what does that do in the context of the system?
Bryan Cantrill 00:20:35 So the control plane is everything between your API request and that infrastructure being acted upon. So, you go say, hey, I want to provision a VM. Okay, great, we've got a whole bunch of things we're going to provision with that. We're going to provision a VM, we're going to get some storage that's going to go along with it, which is going to come out of a network storage service, and we've got a virtual network that we're going to either create or attach to. We've got a whole bunch of things we need to go do for that. For all these things, there are metadata components that we need to keep track of, beyond the actual infrastructure that we create. And then we need to go act on the actual compute elements, the host OS, the switches, what have you, and actually go create these underlying things and then connect them.
Bryan Cantrill 00:21:18 And there's of course the challenge of just getting that working, which is a big challenge, but then there's getting that working robustly: when you go to provision a VM, all those steps need to happen, and what happens if one of those steps fails along the way? One thing we're very mindful of is these long tails: generally our VM provisioning happens within this time, but we get these long tails where it takes much longer. What's going on? Where in this process are we spending time? And there's a whole lot of complexity that you need to go deal with there. That whole distributed system itself needs to be reliable and available. So, what happens if you pull a sled, or if a sled fails? How does the system deal with that?
Bryan Cantrill 00:22:03 How does the system deal with getting another sled added to the system? How do you grow this distributed system, and then how do you update it? How do you go from one version to the next? And all of that has to happen across an air gap, where this is going to run as part of the computer. So, it's fractally complicated: there is a lot of complexity here in the software system, and all of that we call the control plane. And what exists at AWS, at GCP, at Azure, when you are hitting an endpoint that's provisioning an EC2 instance for you, is an AWS control plane that is doing all this, with some of these similar aspects and certainly some of these similar challenges.
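The provisioning flow Bryan outlines, a chain of steps where any step can fail partway through, is often handled with saga-style orchestration: run the steps in order, and if one fails, undo the ones that already completed. Here is a minimal, hypothetical sketch (step names are made up, and everything is synchronous for simplicity; a real control plane, Oxide's included, is far more involved):

```javascript
// Saga-style sketch of VM provisioning: each step knows how to apply
// itself and how to undo itself if a later step fails.
function provisionVm(request, steps) {
  const completed = [];
  for (const step of steps) {
    try {
      step.run(request);
      completed.push(step);
    } catch (err) {
      // Unwind already-completed steps in reverse order so we don't
      // leak half-created resources.
      for (const done of completed.reverse()) {
        done.undo(request);
      }
      throw new Error(`provision failed at ${step.name}: ${err.message}`);
    }
  }
  return { ok: true };
}

// Illustrative steps only; real steps would talk to storage, network, etc.
const log = [];
const steps = [
  { name: 'allocate-storage',
    run: () => log.push('alloc'),
    undo: () => log.push('free') },
  { name: 'attach-network',
    run: () => { throw new Error('no capacity'); },
    undo: () => log.push('detach') },
];

let failure = null;
try {
  provisionVm({ vcpus: 4 }, steps);
} catch (e) {
  failure = e.message;
}
// After the failed network attach, the storage allocation has been
// released: log is ['alloc', 'free'], and failure names the broken step.
```

The sketch captures only the happy-path-plus-unwind shape; the hard parts Bryan mentions, such as making the orchestrator itself survive crashes and upgrades, are exactly what make a real control plane fractally complicated.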
Jeremy Jung 00:22:43 And for people who have run their own servers with something like say VMware or Hyper-V or Proxmox, are those in the same category?
Bryan Cantrill 00:22:55 Yeah, a little bit. vSphere, yes; VMware ESX alone, no. VMware ESX is kind of a key building block upon which you can build something that is a more meaningful distributed system. When it's just a machine that you're provisioning VMs on, well, you as the human might be the control plane. That's a much easier problem. But when you've got tens, hundreds, thousands of machines, you need to do it robustly. You need something to coordinate that activity, you need to pick which sled you land on, you need to be able to move these things, you need to be able to update that whole system. That's when you're getting into a control plane. So, some of these things have kind of edged into a control plane; certainly VMware, now Broadcom, has delivered something that's kind of Cloud-ish.
Bryan Cantrill 00:23:36 I think that for folks that are truly born on the Cloud, it still feels somewhat like you're going backwards in time when you look at these kinds of on-prem offerings. But it's got these aspects to it, for sure. And some of these other things, when you're just looking at KVM or just looking at Proxmox, you kind of need to connect to other, broader things to turn them into something that really looks like manageable infrastructure. And many of those projects are either proprietary products like vSphere, or you are dealing with open-source projects that are not necessarily aimed at the same level of scale. You look again at Proxmox, or you look at OpenStack, and OpenStack is just a lot of things, right? OpenStack was kind of a free-for-all for every infrastructure vendor, and there was a time when people were like, aren't you worried about all these companies that are coming together for OpenStack? I'm like, haven't you ever worked for a company? Companies don't get along, by the way; having multiple companies work together on a thing is bad news, not good news. And I think one of the things that OpenStack has definitely struggled with is that there are so many different vendor elements in there that it's very much not a product; it's a project that you're trying to run. But that very much is similar, certainly in spirit.
Jeremy Jung 00:24:56 And so I think this is kind of alluding to earlier: the piece that allows you to allocate compute and storage and manage networking gives you that experience of, I can go to a web console or I can use an API, and I can spin up machines and get them all connected. At the end of the day the control plane is allowing you to do that in, hopefully, a user-friendly way.
Bryan Cantrill 00:25:22 That's right. Yep. And in order to do that in a modern way, it's not just a user-friendly way: you really need to have a CLI and a web UI and an API, and those all need to be drawn from the same single ground truth. You don't want any of those to be an afterthought relative to the others. You want to have the same way of generating all of those different endpoints, those entries into the system.
Jeremy Jung 00:25:45 And if you take your time at Joyent as an example, what kind of tools existed for that, versus how much did you have to build in-house, as far as the hypervisor and managing the compute and all that?
Bryan Cantrill 00:25:59 Yeah, so we built more or less everything in-house. And I think, over time, we've gotten slightly better tools. For example, when we were building a Cloud at Joyent, there wasn't really a good distributed database, so we were using Postgres as our database for metadata, and there were a lot of challenges. Postgres is not a distributed database; it's running with a primary-secondary architecture, and there are a bunch of issues there, many of which we discovered the hard way. When we were coming to Oxide, you had much better options to pick from in terms of distributed databases. There was a period, one that now seems potentially brief in hindsight, of really high-quality open-source distributed databases. So, there were really some good ones to pick from. We built on CockroachDB, on CRDB, so that was a really important component that we had at Oxide that we didn't have at Joyent.
Bryan Cantrill 00:26:55 So I wouldn't say we were rolling our own distributed database; we were just using Postgres and dealing with an enormous amount of pain there. In terms of the surround on top of that, a control plane is much more than a database, obviously, and there's a whole bunch of software that you need to go write to be able to transform these API requests into something that is reliable infrastructure, right? And there's a lot to that, especially when networking gets in the mix, when storage gets in the mix; there are a whole bunch of complicated steps that need to be done. At Joyent, in part because of the history of the company, and look, this is just not going to sound good, but it is what it is and I'm just going to own it: we did it all in Node. Which I know right now just sounds like you built it with Tinker Toys. Okay, did you really think you could build the skyscraper with Tinker Toys?
Bryan Cantrill 00:27:49 Well, we had greater aspirations for the Tinker Toys once upon a time. But let's just say that that experiment did ultimately end in a predictable fashion, and we decided that maybe Node was not going to be the best decision long term. Joyent was the company behind NodeJS back in the day; Ryan Dahl worked for Joyent, and then we landed Node in a foundation in about 2015, something like that, and began to consider our world beyond Node. A big tool that we had in the arsenal when we started Oxide is Rust. Indeed, the name of the company is a tip of the hat to the language that we were pretty sure we were going to be building a lot of stuff in, namely Rust, and Rust has been huge for us, a very important revolution in programming languages.
Bryan Cantrill 00:28:37 I think what has been surprising is the sheer number of layers at which we use Rust. We've done our own embedded firmware in Rust; in the host operating system, which is still largely in C, very big components are in Rust; the hypervisor processes are all Rust; and then of course the control plane distributed system on top of that is all Rust. So that was a very important thing that we very much did not need to build ourselves; we were able to really leverage a terrific community. We were also able to use, and we'd done this at Joyent as well, but at Oxide we've used Illumos as the host OS component, and our variant is called Helios. We've used bhyve as that kind of internal hypervisor component. We've made use of a bunch of different open-source components to build this thing, which has been really, really important for us, and open-source components that didn't exist even five years prior, which is part of why we felt that 2019 was the right time to start the company. That's when we started Oxide.
Jeremy Jung 00:29:32 And you had mentioned that at Joyent you had tried to build this in Node. What were the issues or the challenges that you had doing that?
Bryan Cantrill 00:29:43 Oh boy. Yeah, again, we kind of had higher hopes in 2010, I would say, when we set out on this. The problem that we had, writ large: JavaScript was really designed to allow as many people on Earth to write a program as possible. Which is good, I mean, that's a laudable goal, but the problem is it's much more difficult to write rigorous software. We want to be able to write rigorous software, and it's actually okay if it's a little harder to write, if that leads us to more rigorous artifacts. But in JavaScript, just as a concrete example, there is nothing to prevent you from referencing a property that doesn't exist. So if you fat-finger a property name, you are relying on something to tell you, by the way, I think you've misspelled this, but because there is no type definition for this thing, it doesn't know that you've got one spelled correctly and one spelled incorrectly; the misspelled one is just undefined.
Bryan Cantrill 00:30:42 So you've got this typo lurking in what you want to be rigorous software, and if you don't execute that code, you won't know it's there. Then you do execute that code and now you've got an undefined object, and now that's either going to be an exception or, again, it depends on how it's handled. It can be really difficult to determine the origin of that error, and that is a programmer error. And one of the big challenges that we had with Node is that programmer errors and operational errors, like "I'm out of disk space" is an operational error, those get conflated, and it becomes really hard. In fact, I think the language wanted to make it easier to just kind of drive on in the event of all errors, and that's actually not what you want to do if you're trying to build a reliable, robust system.
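A minimal sketch of the failure mode Bryan describes; the object and property names here are hypothetical, purely for illustration:

```javascript
// A hypothetical config object: nothing in JavaScript stops us from
// reading a property that was never defined.
const config = { maxRetries: 3 };

// Fat-fingered property name: no compile error, no runtime warning.
// The expression silently evaluates to undefined.
const retries = config.maxRetires;       // typo for maxRetries
console.log(typeof retries);             // "undefined"

// The programmer error only surfaces later, far from its origin,
// when the undefined value is finally used:
try {
  retries.toFixed(0);                    // throws at a distance
} catch (err) {
  console.log(err instanceof TypeError); // true
}
```

The distance between the typo and the eventual TypeError is exactly what makes the origin of the error hard to determine in production.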
Bryan Cantrill 00:31:35 So we had no end of issues. We've got a lot of experience developing rigorous systems, again coming out of operating systems development and so on, and we brought some of that rigor, if strangely, to JavaScript. So one of the things that we did is we brought a lot of postmortem diagnosability and observability to Node. If one of our Node processes died in production, we would get a core dump from that process, a core dump that we could meaningfully process. So we did a bunch of kind of wild stuff, I mean wild stuff, where we could make sense of the JavaScript objects in a binary core dump. These were things that we thought were really important, and the rest of the world just looks at this like, what the hell is this? It was so out of step because we were trying to bridge two disconnected cultures: one developing really rigorous software and really designing it for production diagnosability, and the other really designing software to run in the browser, for anyone to be able to kind of liven up a webpage, right?
Bryan Cantrill 00:32:45 That is kind of the origin of LiveScript and then JavaScript, and we were kind of the only ones sitting at the intersection of that. When you are the only ones sitting at that kind of intersection, you're kind of fighting a community all the time. We just realized that there were so many things that the community wanted to do where we felt like, no, this is going to make software less diagnosable, it's going to make it less robust. And then you realize, we're the only voice in the room because we have desires for this language that it doesn't have for itself, and when you realize you're in a bad relationship with software, it's time to move on. And in fact, that was several years after we'd already kind of broken up with Node, and it was a bit of an acrimonious breakup.
Bryan Cantrill 00:33:28 There was a famous slash infamous fork of Node called io.js, because the community thought that Joyent was not being an appropriate steward of NodeJS and was not allowing more things to come into Node. And of course we felt that we were being a careful steward, and that we were actively resisting those things that would cut against its fitness for a production system, but that's not the way the community saw it. I think we knew before the fork that this was not working and we needed to get this thing out of our hands, and whatever the right hands were, it needed to be in a foundation. So we had kind of gone through that breakup, and maybe it was two years after that that a friend of mine who was running Node Summit, and who has unfortunately now passed away, Charles Beeler, came to me in 2017. Charles was a venture capitalist, a great guy.
Bryan Cantrill 00:34:27 He's like, I really want you to keynote Node Summit. And I'm like, Charles, I'm not going to do that. I've got nothing nice to say. I'm the last person you want to keynote. He's like, oh, if you have nothing nice to say, you should definitely keynote. And I'm like, oh God, okay, here we go. He's like, no, I really want you to talk about the Joyent breakup with Node.js. I'm like, oh man. And that led to a talk that I'm really happy that I gave, because it was a very important talk for me personally, called Platform as a Reflection of Values, really looking at the values that we had for Node and the values that Node had for itself, and how they didn't line up. There's nobody in the Node community who's like, I don't want rigor, I hate rigor. It's just that if they had to choose between rigor and making the language approachable, they would choose approachability every single time.
Bryan Cantrill 00:35:23 They would never choose rigor. And that was a big eye-opener. If you watch this talk: I knew that the audience was going to be filled with people who had been a part of the fork, in 2014 I think it was, the io.js fork, so I set a little bit of a trap for the audience. I talked about the values that we had and the aspirations we had for Node, the aspirations that Node had for itself, and how they were different. And I'm like, look, in hindsight a fracture was inevitable, and in 2014 there finally was a fracture; do people know what happened in 2014? And if you listen to that talk, everyone says, almost in unison, io.js. I'm like, oh, right, io.js. Right. That's not what I was thinking of.
Bryan Cantrill 00:36:11 And I go to the next slide, and it's a tweet from a guy named TJ Holowaychuk, who was the most prolific contributor to Node, and it was his tweet, also in 2014, before the io.js fork, explaining that he was leaving Node and that he was going to Go. If you turn the volume all the way up, you can hear the audience gasp, and it's just delicious, because the community had never really confronted why TJ left. And I went through a couple of folks, Felix and a bunch of other early Node folks who were there in 2010 and were leaving in 2014, and they were going to Go, primarily. And they were going because they were sick of the same things that we were sick of. They had hit the same things that we had hit, and they were frustrated.
Bryan Cantrill 00:36:59 I really do believe that platforms reflect their own values, and when you are making a software decision, you should select for values that align with the values that you have for that software. That's way more important than other things that people look at. I think people look at, for example, quote-unquote community size way too frequently. Community size is like, eh, maybe it can be fine. There are strengths and weaknesses to both, just as there's a strength to being in a big city versus a small town. Me personally, I'll take the small community more or less every time, because the small community is almost always self-selecting based on values, and for the same reason that I like working at small companies or on small teams, there's a lot of value to be had in a small community. That's not to say that large communities are valueless. But again, long answer to your question of where things went south with Joyent and Node: they went south because the values that we had and the values the community had didn't line up, and that was a very educational experience, as you might imagine.
Jeremy Jung 00:38:12 Yeah. And given that you mentioned how, because of those values, some people moved from Node to Go, and in the end, for much of what Oxide is building, you ended up using Rust: what would you say are the values of Go and Rust, and how did you end up choosing Rust?
Bryan Cantrill 00:38:33 Yeah, well, so, Go. I mean, I understand why people moved from Node to Go, but Go to me was kind of a lateral move. There were a bunch of things there: Go is still garbage collected, which I didn't like. Go is also very strange in that there are these kind of autocratic decisions that are very bizarre. Generics is kind of a famous one, right? Go, as a point of principle, didn't have generics, even though the innards of Go itself actually did have generics. It's just that you, the Go user, weren't allowed to have them. And I just think that the arguments against generics were kind of disingenuous, and indeed they ended up adopting generics. And then there's some super weird stuff, like they're very anti-assertion, which is like, what? How is someone against assertions? It doesn't even make any sense, but it's like, oh, nope, there's a whole screed on it.
Bryan Cantrill 00:39:25 Nope, we're against assertions. And then they're against versioning; that's another thing, where Rob Pike has kind of famously been like, you should always just run at the latest commit. And you're like, does that make sense for the things we actually build? So there are a bunch of things like that where you're just like, okay, this is just exhausting. I mean, there are some things about Go that are great and plenty of other things that I'm just not a fan of. In the end, Go cares a lot about compile time; it's super important to Go, right, that it has very quick compile times. And I'm like, okay, but compile time is not unimportant, it doesn't have zero importance, but I've got other things that are lots more important than that.
Bryan Cantrill 00:40:03 What I really care about is that I want a high-performing artifact, and I wanted garbage collection out of my life. I've got to tell you, garbage collection to me is an embodiment of this larger problem of where you put cognitive load in the software development process, and garbage collection is right for plenty of other people and the software that they want to develop. But for me and the software that I want to develop, infrastructure software, I don't want garbage collection, because I can solve the memory allocation problem. I know when I'm done with something or not. Whether that's in C or elsewhere, it's really not that hard to not leak memory in a C-based system, and you can give yourself a lot of tooling that allows you to diagnose where memory leaks are coming from.
Bryan Cantrill 00:40:54 So that is a solvable problem. There are other challenges with that, but when you are developing a really sophisticated system that is using garbage collection, you spend as much time trying to dork with the garbage collector to convince it to collect the thing that you know is garbage. You are like, I've got this thing, I know it's garbage, and now I need to use these tips and tricks to get the garbage collector to collect it. It feels like every Java performance issue goes to some -XX flag: use the other garbage collector, whatever one you're using, use a different one, use a different approach. You're in the worst of all worlds, because the reason garbage collection is helpful is that the programmer doesn't have to think at all about this problem, but now you're actually dealing with these long pauses in production, you're dealing with all these other issues where actually you need to think a lot about it, and it's kind of witchcraft, this black box that you can't see into.
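The -XX flag-twiddling Bryan is describing looks roughly like this on the HotSpot JVM. The flags shown are real HotSpot options, but the application name and heap sizes are purely illustrative:

```shell
# Tuning rather than fixing: pick a collector and a pause target and
# hope the pauses move somewhere less visible.
java -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -Xms8g -Xmx8g -jar app.jar

# Still pausing? Swap in a different collector entirely and retest.
java -XX:+UseZGC -Xms8g -Xmx8g -jar app.jar
```

None of these knobs change the application's logic; they only reshuffle when and how the opaque collector runs, which is the "black box" complaint above.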
Bryan Cantrill 00:41:46 So it's like, what problem have we solved exactly? So the fact that Go had garbage collection, it's like, eh, no, I do not want that, and then you get all the other weird fatwas and everything else. I'm like, no thank you; Go is a no-thank-you for me. I get why people like it or use it, but that was not going to be it. I'm like, I want C, but there are things I didn't like about C too. I was looking for something that was going to give me the deterministic kind of artifact that I got out of C, but I wanted library support, and C is tough because it's all convention; there are just a bunch of other things that are thorny. And I remember thinking vividly in 2018, well, it's Rust or bust; I'm going to go get into Rust.
Bryan Cantrill 00:42:29 And I did what a lot of people were doing at that time, and have been doing since, of really getting into Rust and really learning it, appreciating the difference in the model, for sure, the ownership model people talk about; that's obviously very important. But it was the error handling that blew me away, and the idea of algebraic types; I had never really had algebraic types, and error handling is where you really appreciate these things. How do you deal with a function that can either succeed and return something, or fail? The way C deals with that is bad, with these kinds of sentinel values for errors. Does negative one mean failure? Does zero mean failure? In some C functions, zero means failure; traditionally in Unix, zero means success. And then it's like, well, what if you want to return a file descriptor?
Bryan Cantrill 00:43:22 And then it's like, okay, then zero through positive N will be a valid result and negative numbers will be errors. And is it negative one with errno set, or is the negative number itself the error? It's all convention, right? People do all those different things, and it's all convention, and it's easy to get wrong, easy to have bugs, can't be statically checked, and so on. Then what Go says is, well, you're going to have two return values, and then you're going to have to constantly check all of these all the time, which is also kind of gross. JavaScript is like, hey, if we see an error, we'll throw an exception. There are a bunch of reasons I don't like that.
Bryan Cantrill 00:43:56 And then you look at what Rust does, where it's like, no, no, no, we're going to have these algebraic types, which is to say this thing can be this thing or that thing, but it has to be one of these. And by the way, you don't get to process this thing until you conditionally match on one of those things; you're going to have to have a pattern match on this thing to determine if it's a this or a that. And the Result type is a generic: it's going to be either an Ok that contains the thing you want to return, or it's going to be an Err that contains your error, and it forces your code to deal with that. What that does is shift the cognitive load from the person operating this thing in production to the actual developer in development. And I love that shift; that shift to me is really important, and that's what I was missing. That's what Rust gives you. Rust forces you to think about your code as you write it, but as a result you have an artifact that is much more supportable, much more sustainable, and much faster.
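A small sketch of the contrast being drawn here: C-style sentinel returns versus Rust's Result, which the compiler forces you to match on before you can touch the value. The function name and inputs are illustrative, not from the episode:

```rust
use std::num::ParseIntError;

// In C this might return -1 on failure and smuggle details through errno.
// In Rust the signature itself says: you get Ok(u16) or Err, and you
// cannot use the value without deciding which one you have.
fn parse_port(s: &str) -> Result<u16, ParseIntError> {
    s.trim().parse::<u16>()
}

fn main() {
    // The compiler rejects any attempt to use the result directly;
    // we must pattern-match on the algebraic type.
    match parse_port("8080") {
        Ok(port) => println!("listening on {port}"),
        Err(e) => println!("bad port: {e}"),
    }

    // Combinators like map operate only on the Ok variant,
    // so the Err case is carried through instead of a sentinel.
    let doubled = parse_port("80").map(|p| p * 2);
    assert_eq!(doubled, Ok(160));
}
```

The point is not the parsing itself but the shape of the API: the failure case is a first-class value in the type, not a convention the caller may forget to check.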
Jeremy Jung 00:45:04 Yeah, it sounds like you would rather take the time during development to think about these issues, because dealing with garbage collection or error handling at runtime, when you're trying to solve a problem, is much more difficult than having dealt with it to start with.
Bryan Cantrill 00:45:24 Yeah, absolutely. And I just think that if it's infrastructure software, the question that you should ask when you're writing software is: how long is this software going to live? How many people are going to use this software? If you are writing an operating system, the answer is that this thing you're writing is going to live for a long time; if we just look at plenty of aspects of the system that have been around for decades, it's going to live for a long time, and many, many, many people are going to use it. Why would we not expect the people writing that software to take on more cognitive load when they're writing it, to give us something that's going to be a better artifact? Now conversely, you're like, hey, I kind of don't care about this; I just want to see if this whole thing works.
Bryan Cantrill 00:46:10 I'm just stringing this together; this software will be lucky if it survives until tonight, and then who cares? Garbage collection is fine if you're prototyping something, whatever. And this is why you really do get different technology choices depending on the way that you want to solve the problem at hand. For the software that I want to write, I do like that cognitive load being upfront. Although I think the thing that is really wild, the twist that I don't think anyone really saw coming, is that in an LLM age, the cognitive load upfront almost needs an asterisk on it, because so much of that can be assisted by an LLM. And I would like to believe, and maybe this is me being optimistic, that in the LLM age we will see that Rust is a great fit, because the LLM itself can get a lot of feedback about whether the software that's written is correct or not, much more so than it can in other environments.
Jeremy Jung 00:47:08 Yeah, that is an interesting point, in that I think when people first started trying out LLMs to code, they were really good at these maybe looser languages like Python or JavaScript, and initially not so good at something like Rust. But it sounds like as that improves, if the LLM can write it, then because of the rigor or the memory management or the error handling that the language is forcing you to do, it might end up being a better choice for people using LLMs.
Bryan Cantrill 00:47:47 Yes. Yeah, yeah, absolutely. It gives you more certainty in the artifact that you've delivered. I mean, you know a lot about a Rust program that compiles; there are certain classes of errors that you don't have, that you actually don't know about in a C program or a Go program or a JavaScript program. I think that's going to be really important. I think we are on the cusp, and maybe we've already seen it, of this kind of great bifurcation in the software that we write, where rigorous software becomes much more important. We'll have this foundational software that we're going to rely on as much more bedrock, and then we're going to have much more software where that rigor is not as much of a constraint, because the constraints are the ability for it to be customized to my need and done very quickly. It's going to feel like two different worlds, and in an exciting way; I think the future is definitely exciting for software.
Jeremy Jung 00:48:40 Another interesting decision about the Oxide computer is that you chose an operating system that I think most people aren't familiar with, rather than a Linux or a FreeBSD. You chose, I believe it's Illumos; is that how you pronounce it?
Bryan Cantrill 00:48:57 Yeah, yeah, yeah. I mean, yes. So, Illumos very much inherits from OpenSolaris, which inherits from the Solaris heritage, which was SunOS 4.x before that. Solaris itself is this unholy love child of SunOS 4.x, the true kind of BSD lineage, and the AT&T lineage in Unix, SVR4. In many ways, I know that people haven't necessarily heard of it, but it is true Unix in a real historical sense, in a way that Linux and even the BSDs are not, and there are aspects where it shows. But in terms of it being an idiosyncratic decision, it's not one that we took lightly. People assume, oh, it's a bunch of old Sun folks; of course they're going to pick Illumos. And we had a lot of operational experience with SmartOS at Joyent.
Bryan Cantrill 00:49:50 But it's more nuanced than that, I would say. It is true that a bunch of the technologies that we have developed over the years, we developed for good reasons, and we don't want to be without them. I mentioned the postmortem diagnosability of JavaScript; postmortem diagnosability is really important to us, and to me, debuggability is really important, and debuggability is not something that other operating systems have taken that seriously, to put it bluntly. You can judge a lot about an operating system by its built-in debugger, and there's not really a built-in debugger present for Linux. But it's more than just that. If you look at Linux in particular, Linux is a kernel, and this is something that Torvalds makes clear, in his defense, at more or less every juncture: Linux is a kernel.
Bryan Cantrill 00:50:33 And this is the very famous Stallmanism of, what you are calling Linux, I call GNU/Linux. It's obviously, on the one hand, a little bit ridiculous; on the other hand, it's not wrong, in that the software you need to actually have a functional system is a lot more than just Linux. You need a libc, right? And there's actually more than one libc: is it going to be musl, is it going to be glibc? There are some other alternate libcs. And part of the problem is, if you're going to use the Linux kernel as the basis for a host operating system, that's nowhere near enough, because you now need to build an entire distro, effectively, of all these different tools on top of that. And the maintenance burden of that is off the charts.
Bryan Cantrill 00:51:16 And it's actually funny: one of our colleagues, Laura Abbott, came from Red Hat, and I'm like, well, this will be interesting, because again, my mind was more open than it had been in a long time to, maybe we should use Linux, I don't know. And Laura felt strongly that we shouldn't, but for different reasons. Laura was very concerned about what the distro management burden would be. She had really seen that up front at Red Hat, and she's like, do not make this decision lightly; you've got to make all these other decisions. I remember at the time being like, okay, yeah, I know, I get that. But then in the years since, I'm like, oh wow, that's a big deal, a bigger deal than I realized. So, we don't have to do that.
Bryan Cantrill 00:51:56 We get an operating system that's got a lot of stuff built in. Now of course it means that, talk about small communities, this is definitely a small community, and there are challenges there too. But it is one that has allowed us to get the technologies that we really need: ZFS, containers, DTrace, virtual networking, a bunch of things that we really needed, we had in place, and it then allowed us to move the system in the dimensions that we needed and wanted to move it. So for us, it's been the right decision. I think the internet is kind of disbelieving that we were actually at a real juncture there, but we were. We hand-on-heart were actually at the moment where I was most potentially intrigued, or at least wanted to explore something like Linux, and Torvalds was just on an absolute bender against ZFS, which he's done occasionally. And I'm like, dude, this is the wrong time to be on this ZFS temper tantrum, which again, he's done a couple of times. And a problem with Linux is that ZFS is not a first-class file system; it has always been on the outside because of these kind of ridiculous licensing concerns. And it was a reminder of, oh yeah, right. This… Yeah, okay. I think we'll go our own way. Thank you.
Jeremy Jung 00:53:05 And this choice of your operating system as a user of an Oxide computer, would they…?
Bryan Cantrill 00:53:11 No. You don't know. Okay. I mean, it's just like, if you're a user of AWS, you've got no idea what Nitro is, or Annapurna, or these other things. Do you know whether they use KVM at AWS? No, you've got no idea. I mean, do they use KVM or Xen? The answer is, well, it depends on your instance type and the age of the thing. But as a user, that's opaque to you. What you've got is: you want to hit an API endpoint, you want to provision an instance, and you want to run an operating system of your choice. You want to run Windows or Linux or FreeBSD or whatever in your cloud. That's what you want to do, and that's what we enable you to do.
Jeremy Jung 00:53:49 And the Oxide machine itself, I don't think we've really explained physically what it is. What are people going to be receiving? What are the components exactly? How is it different? What would I actually be buying and getting into my data center? Maybe you could talk a little bit about what that is.
Bryan Cantrill 00:54:09 Yeah, so what you get is a crate, a very large crate, and in that crate is a rack. It's an Oxide-designed rack, and that rack has two switches and 32 compute sleds in it, and it has been designed holistically: we have designed all of these parts to work together. We've got some very basic things. We've got a DC busbar in the rack, so we've got a power shelf that rectifies from AC to DC, and then we run DC up and down the back of the rack, and our compute sleds blind-mate into that DC power. That is basic, and all of the hyperscalers have DC busbar-based systems. But you can't buy a DC busbar-based system from Dell, HP, or Supermicro, because in order to have a busbar-based system, you really need to be designing at least at the rack level: you can't just buy an Oxide sled, you need to buy the rack that is going to contain it.
Bryan Cantrill 00:55:12 And then we very much designed the rack around the switch. For the switch, we actually have a cabled backplane, and in addition to blind-mating into power, our compute sleds blind-mate into networking. The networking is a passive cabled backplane, and it means there's no actual cabling in the sled itself. If you want to take a sled out, you just take a sled out; there are no cables. One of the challenges with the traditional server design and the baseboard management controller is that it's got to be on two different networks. Well, ours is on two different networks too, but you don't see any of that, because it's all sitting on that cabled backplane. And alongside our switch, which has that Intel Tofino at its core, there's actually another switch that serves as the switch for the service processors on the system, and that's all coherently interconnected.
Bryan Cantrill 00:55:57 So what you get is a real product, and that thing wheels into the data center, you get power applied, and you are on the tech port doing the basic configuration we need, which is basically: we need to know who to talk to for time, more or less; there are some real basics that we need. We need a BGP session, and we need to be able to actually connect to a network. But once we can do that, we're off, and you're provisioning VMs, you're provisioning storage, you're provisioning virtual networks, just like you're on the public cloud. And one of the things that's been really fun, because this is really painful with traditional infrastructure, with that kind of bag-of-bolts approach: it can take a comically long time, and I'm talking easily months, to get from the gear arriving to the point where devs are on there. You're like, what takes months?
Bryan Cantrill 00:56:44 It's like, well, glad you asked. We got the wrong switch from this vendor. That server came with the wrong rack rails. We're missing the cabling over here. We got it all together, and the software didn't work, and the software vendor is pointing fingers. These things just can take a long time to get up and running. With the Oxide rack, no: the thing wheels in, you power it on, you plug it in, you get a BGP session, and you're provisioning VMs. And one of the experiences that's been really fun is to be with customers as they are doing that for the first time and they're like, oh my God, we're provisioning infrastructure, and I can just let my devs at it, and we just got this thing powered on two hours ago. That is really neat and very vindicating of the approach that we've taken. It's really exciting.
Jeremy Jung 00:57:33 And for a listener who's never had to do it the traditional way, what are all the different things that they would need to get, instead of this rack, to have the equivalent?
Bryan Cantrill 00:57:44 You've got to go buy a switch from a switch vendor. You're buying your compute from someone else, maybe you're buying storage from someone else, and you're cabling it all together. You've got your servers; your BMCs have got to be on a network; you've got a separate switch for that. And that's all just the hardware. Then you've got to get software on it: what are you doing there, are you getting OpenStack? Are you doing VMware? Are you doing something else? Then that's all got to work; you've got to get that image on there. You're just kind of building this thing as you go. You're the one literally cobbling this thing together, and it's just not a product experience.
Jeremy Jung 00:58:17 Yeah, and I think the interesting part was something you talked about earlier too, that if something isn't behaving the way you expect, for example the I/O latency or maybe some kind of compute issue, it seems like in most of those cases you really have no one to turn to. You can go to Dell or HPE, and they'll probably just shrug.
Bryan Cantrill 00:58:39 Well, yeah, there was a tweet from around when we were starting the company, which I think may have made it into a slide deck or two, where someone said that they were dealing with a support issue with Dell and EMC and they felt like they were talking to their divorced parents. It's not necessarily malicious, because a lot of the problem is that you have these systems that are each well designed on their own (or at least designed on their own; "well designed" may be giving them way too much credit), and then they have an issue when you put them together that each can plausibly blame on the other. And this happens a lot, where these two things are each behaving somewhat reasonably on their own, but you put them together and the result is terrible. And the problem is, it's the end user that is left saying, well, who helps me with my terrible problem? It's like, great, you've each made a convincing case that it's not your problem, so I guess it's my problem. And that is what is really, really frustrating to people.
Jeremy Jung 00:59:44 And I’m curious, back when you were at Sun and Sun was selling their own servers, if somebody had a problem with one of those servers, was that a case where they could go to Sun and then Sun would actually go and figure out, okay, what’s the problem here?
Bryan Cantrill 01:00:01 On its best days, yeah, on its best days. Those were not always its best days. Certainly, the vision that I always had for Sun systems is that you'd be able to do that; it was not always the case. I started a storage group inside of Sun called Fishworks in 2006, and we partly wanted to really make good on this idea of systems thinking. The FISH in Fishworks stands for Fully Integrated Software and Hardware. And we learned a whole set of painful lessons because we were still relying on quote-unquote commodity hardware there and got bit by a bunch of firmware we didn't control. So throughout my career I have tried to make good on this promise, but I really feel it's only at Oxide that we've been able to control our fate sufficiently from top to bottom to really make good on it. And it shows in the experiences that people are having, which is really great.
Jeremy Jung 01:00:53 And it sounds like maybe that’s primarily because the software running on each of these components is in your control, so you’re able to trace through and see where the actual problem is, rather than having all these proprietary vendors?
Bryan Cantrill 01:01:09 Yes, it is that, for sure; we just have more of the components under our own control, and I don't want to understate the importance of that. But I think it's also the sense of responsibility that we take. Our view is, we're going to own this problem no matter what. And we've now had a couple of concrete examples of customer issues where, at the end, it's not exactly an Oxide problem; it falls in a gap between Oxide and a system we were talking to. And I think it's a point of pride that we take responsibility for taking that all the way to root cause. Sorry to be plugging our own podcast here; the internet jokes that we are actually a podcasting company that is merely creating computers for content creation.
Bryan Cantrill 01:01:53 And I'm not sure they're wrong. And so, whenever we have any kind of crisis at Oxide, in terms of debugging a problem or what have you, we're like, this is going to be great content. And we had one of these, an episode that we called "How Is Other Networks," where we had a customer with a particular switch talking to the Oxide rack, and the Oxide rack and the router were each making reasonable decisions on their own, but you added it up and it was a nightmare. We were able to get that debugged to root cause, and even though our behavior was reasonable and its behavior was reasonable, we wanted to take responsibility for that. So I think it's twofold: it's the fact that we control all these layers of the stack, but then it's also very much the sense that we have suffered at the other end of this. And what we want to be able to do is really take on that responsibility and get a customer actually righted, wherever the problem may lie.
Jeremy Jung 01:02:44 I think that’s a good spot to end it on, but anything else you wanted to mention?
Bryan Cantrill 01:02:49 No, it’s great. I know, we hit on a bunch of things. Thanks for the wide-ranging conversation. This was great.
Jeremy Jung 01:02:54 If people want to check out what’s going on at Oxide, check out your podcast.
Bryan Cantrill 01:02:59 Yeah, Oxide and Friends. You can join us live in the Discord; I would love to have people join us live. But check out the podcast feed, and check out the RFDs; those are all out there as well. And then check out the repos. Everything we've talked about today is open source, so people can dive in there and see what we're actually doing. We really pride ourselves on that transparency, so we leave nothing to the imagination. I think that sometimes people are like, okay, maybe you guys could share just a little bit less, so maybe it's transparency to a fault, but there's a lot to dive into if you want to dive into Oxide.
Jeremy Jung 01:03:31 I think for the people who are actually managing these data centers and working with traditional vendors, transparency is exactly what they want to hear.
Bryan Cantrill 01:03:42 It is, yes. On that transparency: we did have an early customer that had an issue, and we wanted to get back to them and let them know we were working on their issue. The customer's like, no, no, I know you're working on it; I'm watching the GitHub issue, actually. It's great. And it's like, oh, okay. We love that. We love the fact that people can see what we're working on. We have always felt there's much more to gain than there is to lose.
Jeremy Jung 01:04:04 Very cool. Well, Bryan, thank you for chatting with me today. This is fun.
Bryan Cantrill 01:04:08 Absolutely, thanks for having me. This was a lot of fun. And sorry to take you into the filth that is the modern data center.
Jeremy Jung 01:04:16 All right, well this has been Jeremy Jung for Software Engineering Radio. Thanks for listening.
[End of Audio]
---
[Original source](https://se-radio.net/2026/02/se-radio-709-bryan-cantrill-on-the-data-center-control-plane/)