
SE Radio 685: Will Wilson on Deterministic Simulation Testing


---

In this episode, Will Wilson, CEO and co-founder of Antithesis, explores Deterministic Simulation Testing (DST) with host Sriram Panyam. Wilson was part of the pioneering team at FoundationDB that developed this testing approach; the company was later acquired by Apple in 2015. After seeing that even sophisticated organizations lacked robust testing for distributed systems, Wilson co-founded Antithesis in 2018 to make DST commercially available.

Deterministic simulation testing runs software in a fully controlled, simulated environment in which all sources of non-determinism are eliminated or controlled. Unlike traditional testing or chaos engineering, DST operates in a separate environment from production, allowing for aggressive fault injection without risk to live systems. The key breakthrough is perfect reproducibility — any bug found can be recreated exactly using the same random seed.

Antithesis built “The Determinator,” a custom deterministic hypervisor that simulates entire software stacks including virtual hardware, networking, and time. The system can compress years of stress testing into shorter timeframes by running simulations faster than wall-clock time. All external interfaces that could introduce non-determinism (network calls, disk I/O, system time) are mocked or controlled by the simulator.

The approach has proven effective with major organizations including MongoDB, Palantir, and Ethereum. For Ethereum’s critical “Merge” upgrade in 2022, Antithesis found and helped fix several serious bugs that could have been catastrophic for the live network. The platform typically finds bugs that traditional testing methods miss entirely — such as those arising from rare race conditions, complex timing issues, and unexpected system interactions.

This episode is sponsored by Monday Dev.

---


Show Notes

#### Related Episodes

- SE Radio 241: Kyle Kingsbury on Consensus in Distributed Systems

- SE Radio 282: Donny Nadolny on Debugging Distributed Systems

#### Related Resources

- Deterministic Simulation Testing for our Entire SaaS

- The crazy MongoDB bug

- NATS bug: Hunting for one-in-a-million bugs in NATS

- Interview with a DST practitioner: Antithesis-Driven Testing

- Write-up on Ethereum testing: Testing the Ethereum Merge

- FoundationDB paper (PDF) – see section 4 and section 6.2

- 2015 talk at StrangeLoop: Testing Distributed Systems w/ Deterministic Simulation by Will Wilson

- A more recent talk: Testing a Single-Node, Single Threaded, Distributed System Written in 1985 by Will Wilson

---

Transcript

Transcript brought to you by IEEE Software magazine.

This transcript was automatically generated. To suggest improvements in the text, please contact [email protected] and include the episode number and URL.

Sri Panyam 00:00:18 Hello and welcome to Software Engineering Radio. This is your host, Sri Panyam. Today I’m joined by Will Wilson, CEO and co-founder of Antithesis. Will started off in biotech and went on to work on distributed systems at FoundationDB, Apple and Google. And then he eventually went on to found Antithesis, an autonomous testing platform based on Deterministic Simulation Testing (DST). Will, welcome to the show.

Will Wilson 00:00:45 Thank you so much for having me.

Sri Panyam 00:00:47 For our listeners who may not be familiar with the concept, could you share what is DST or Deterministic Simulation Testing?

Will Wilson 00:00:54 Yeah, sure. So let’s start with the way that people normally test software. So the way you normally test software is first you write some code and then you try and think about what situations you want to cover with a test, what situations you want to make sure are going to work in real life, and then you write some tests to cover those situations. And then you occasionally run those tests. Maybe you do them on every CI build or maybe you do them on every PR, maybe you do them only when you’re about to release your software, but they cover those cases and you hope that when you deploy to production, everything is going to go okay. Now let me ask you a question. Do you believe if all your tests pass that everything is going to be okay?

Sri Panyam 00:01:35 I would like to think so.

Will Wilson 00:01:36 But do you actually?

Sri Panyam 00:01:38 Well, if the last 20 years have shown me anything, no.

Will Wilson 00:01:42 Exactly. So why is that? Well, it’s because, I just said you are writing the tests that cover the cases that you thought were important to cover, but the things that are actually going to screw you in production are the situations that you did not think of ahead of time. They’re the things you did not think to cover. And so what that means is this conventional strategy to writing tests is very, very good for catching regressions. If there’s something that I know my software does correctly and I accidentally break it in a new version, if I have a test for it, that’s pretty good at catching that. But a test is very bad at finding out if my software actually works in the real world in all the unanticipated or unexpected situations that the real world will throw at me. So deterministic simulation testing is a very different strategy which tries to actually give us the confidence that if our test passed, things will go well in the real world.

Will Wilson 00:02:38 And the way it does that is basically by constructing a model of the real world, a whole system simulation where random requests come in and ask your software to do things and tasks are going to run in random orders and these requests are going to have slightly randomized parameters. And also the things that happen in the real world that can go wrong are also going to happen. So occasionally requests will fail, requests to downstream dependent systems will fail, network requests may get delayed or reordered, threads may randomly pause and run later. The garbage collector may just turn on and that whole node may slow down. All the stuff that causes people to get paged in the middle of the night and causes weird production behaviors that nobody can understand, we’re going to do all that too. And we’re going to do it at a vastly accelerated rate so that we cover many, many years or centuries worth of real-world badness in just a few seconds or minutes in our tests.

Will Wilson 00:03:36 And if our software survives all that and things actually go okay, we're now going to have the confidence to actually deploy and to actually feel things will be all right when we have it in the real world. The final ingredient to this, the part that makes it deterministic, is that our model of the real world, unlike the real world itself, is perfectly deterministic. Meaning it's all driven from a random number generator that's under our control. So this is very important because sometimes what happens is, let's say that a packet takes 21 milliseconds to get from service A to service B. And if it takes 19 milliseconds instead, something goes wrong, right? There's some distributed race condition, something loses, and as a result there's some bug that a user can see. You may have great difficulty replicating that situation again, right? Maybe it only took that long because of some very rare set of circumstances. But the determinism of DST means that any set of circumstances, no matter how rare, no matter how unusual, we can perfectly replicate. And so if we see a bug once, we can see it as many times as we want and we can debug it with very, very high efficiency. How you actually achieve that is sort of the hard part. But once you have it, it's an incredible asset.
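The reproducibility Wilson describes comes down to threading one seeded random number generator through every decision the simulator makes. A minimal Python sketch of the idea (the events and state here are purely illustrative, not Antithesis's actual simulator):

```python
import random

def run_simulation(seed, n_events=1000):
    """Run one simulated 'world' entirely from a seeded RNG.

    Every source of nondeterminism -- event choice, fault injection,
    parameters -- is drawn from `rng`, so the same seed always
    replays the identical history.
    """
    rng = random.Random(seed)
    state = {"balance": 0}
    trace = []
    for _ in range(n_events):
        # The simulator, not the OS scheduler, decides what happens next.
        event = rng.choice(["deposit", "withdraw", "network_delay", "crash_node"])
        trace.append(event)
        if event == "deposit":
            state["balance"] += rng.randint(1, 100)
        elif event == "withdraw":
            state["balance"] -= rng.randint(1, 100)
    return state, trace

# Same seed => bit-for-bit identical run, so any bug found with
# seed 42 can be replayed exactly, as many times as needed.
s1, t1 = run_simulation(42)
s2, t2 = run_simulation(42)
assert s1 == s2 and t1 == t2
```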

Sri Panyam 00:04:58 Awesome, thank you. Before I dive in, I would like to get some of the origin story here. I hear it started off at FoundationDB. Could you share some of that background?

Will Wilson 00:05:08 Yeah. So this is sort of a fun story. Basically, even before FoundationDB, my co-founder Dave Scherer had another company called Visual Sciences, which you can think of as having made a very, very early web analytics product that was backed by a very powerful analytic database, sort of a very early version of Google's BigQuery. And this would've been in the early 2000s, so it was really very, very ahead of its time. So when Dave was building that big analytic data store, he was really interested in using model-based techniques to look for sort of long-tail performance pathologies and strange high-latency corner cases. And so he built this little model-based simulation of the database to see if he could find these issues. It's in some ways similar to formal methods, if you're familiar with that.

Will Wilson 00:06:03 But he was using it to analyze performance rather than correctness. The problem that Dave ran into was, of course, once he had this all working in the model, how do you know that your real code is behaving anything like the model? And so that company got sold and that never went anywhere, but it stuck around in the back of his mind. And so, when we were doing FoundationDB, he had the same idea again this time to use it to study correctness rather than performance. But then he had the further idea, what if we just made the actual program into the model? Then you would never have to worry that, oh, I did some analysis on this model and it gave me this result, but actually my real world code has a bug in it so it doesn’t perfectly match that specification.

Will Wilson 00:06:49 So the trouble is, real world code has things like concurrency and it has things like it reads the clock and it gets random numbers from the operating system and it does all kinds of things that make it fundamentally non-deterministic and make it so that if you run some battery of tests or conduct some simulation, the result might vary from time to time. And so we set about writing FoundationDB in a very special way so that it could be perfectly simulated. We did things like we used our own user space concurrency primitives and scheduling so that the kernel was never deciding which task to run. It’s sort of similar to the Node.js concurrency model, but we were doing that in C++ long beforehand. We obviously had to mock out or stub out any form of network communication between nodes and handle that in process.

Will Wilson 00:07:43 We had a fake clock and a fake random number generator that got accessed instead of the ones provided by the OS. And as we sort of knocked away each of these forms of randomness and each of these forms of non-determinism, the simulation became more controllable until we eventually got to a point where it really was like you could really just run a program and it would simulate an entire interacting FoundationDB cluster with possibly dozens or hundreds of simulated nodes speaking over simulated networks with randomly introduced networking delays and congestion behavior and whatever else you want. And this was just such an insanely productive mode in which to work. People who are familiar with FoundationDB these days mostly know it for its reputation, which is being one of the highest quality databases ever made, never losing your data, never doing anything wrong.
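The in-process primitives Wilson lists, a fake clock and a simulated network whose delays come from the controlled RNG, can be sketched roughly like this (a toy illustration, not FoundationDB's actual C++ machinery):

```python
import heapq
import random

class SimClock:
    """Virtual time: advances only when the simulator says so,
    never by reading the real OS clock."""
    def __init__(self):
        self.now = 0.0

class SimNetwork:
    """In-process 'network': messages are queued with a simulated
    delivery time instead of going through real sockets."""
    def __init__(self, clock, rng):
        self.clock = clock
        self.rng = rng
        self.queue = []  # (deliver_at, seq, dest, msg); seq breaks ties
        self.seq = 0

    def send(self, dest, msg):
        # Delay is drawn from the controlled RNG -- the simulator can
        # make it tiny or huge, and replay either choice exactly.
        delay = self.rng.uniform(0.001, 0.5)
        heapq.heappush(self.queue, (self.clock.now + delay, self.seq, dest, msg))
        self.seq += 1

    def step(self):
        """Deliver the next message and jump virtual time forward by fiat."""
        deliver_at, _, dest, msg = heapq.heappop(self.queue)
        self.clock.now = deliver_at
        return dest, msg

# Hypothetical usage: one simulated node sends to another.
clock = SimClock()
net = SimNetwork(clock, random.Random(1))
net.send("node_b", "hello")
dest, msg = net.step()
assert dest == "node_b" and clock.now > 0
```

Because delivery order and timing come only from the seeded RNG, rerunning with the same seed reproduces the exact same packet schedule.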

Will Wilson 00:08:39 And it’s true, we got that, and we got that because of this technique. But the thing that made almost an even bigger difference was that we were able to get that with a very, very small team. Because what this technique enables is this incredible form of pair programming with the computer where you write some code and then you run some simulations and you say, did I introduce any bugs? And you know, if the answer is yes, then you just try again. It’s like having an Oracle or a Genie that can tell you if your code is correct. It’s this incredible, incredible accelerator for an entire team. And that was what made us really excited about this technique and I think this is why so many others have started adopting it and picking it up as well.

Sri Panyam 00:09:23 Interesting. That reminds me, as we go deeper: to build a simulation, you need a model. So how do you avoid the paradox of, hey look, I don't know what I don't know, and yet I built this model?

Will Wilson 00:09:35 Yeah, that is a really, really good question. And basically the answer is you have to err a little bit on the side of your model being worse than reality rather than better than reality.

Sri Panyam 00:09:45 Because you an example of that, what worse looks.

Will Wilson 00:09:47 Yeah, sure. So your network in your data center is occasionally going to have periods of being down and it’s going to occasionally have periods where it’s a little bit slow and it’s occasionally going to have periods where it’s working great and these periods in your network are probably totally uncorrelated with what your program is doing. And so suppose that there’s some bug and the bug only happens if your program is in the middle of some particular operation. You’re doing some, I don’t know, you’re doing some transaction on your database and at exactly that moment the network does something a little bit wrong. Or it’s a little bit slower than you expect or whatever. And only if these two things happen at the same time this bug happens, and let’s say the bug is really, really bad, in order to make that bug happen very reliably and very reproducibly in our tests, we don’t want to make the network completely uncorrelated from what your program is doing.

Will Wilson 00:10:48 We want to make it basically as malicious as it can be. We want to try making the network bad at each different sort of moment that your program could be doing something. And we want to artificially increase the probability that this network badness happens at exactly the right spot to trigger that bug. Because that way you see it in your test like in a few seconds or a few minutes, rather than having to wait for days or weeks, which you would have to do if it were happening in production. And so that’s kind of an example of where if you make the simulation more hostile than reality, it can catch stuff quicker than it would happen in the real world and also ensures that you don’t miss things that can happen in the real world.
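The "malicious network" idea can be sketched as a fault injector whose probability of misbehaving is boosted while the program is in an interesting state; the rates and the critical-section flag below are illustrative assumptions, not real Antithesis parameters:

```python
import random

def inject_network_fault(rng, in_critical_section):
    """Decide, from the controlled RNG, whether the simulated
    network misbehaves right now.

    Instead of faults arriving at a flat rate uncorrelated with the
    workload (as in the real world), the simulator boosts the fault
    probability while the program is in an interesting state, e.g.
    mid-transaction, so a bug that needs "network glitch at exactly
    the wrong moment" surfaces in seconds instead of weeks.
    """
    base_rate = 0.01      # background badness, rest of the time
    boosted_rate = 0.30   # "as malicious as it can be" at the worst moment
    p = boosted_rate if in_critical_section else base_rate
    return rng.random() < p

rng = random.Random(0)
faults_in_txn = sum(inject_network_fault(rng, True) for _ in range(10_000))
faults_idle = sum(inject_network_fault(rng, False) for _ in range(10_000))
# Faults cluster at the moments most likely to expose the bug.
assert faults_in_txn > faults_idle
```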

Sri Panyam 00:11:31 So what would be a good, I suppose, evaluation or identification function to know when or what a bad network is? For example, a network loss of 10% may be really good in some applications, but really bad in others. So how do you decide?

Will Wilson 00:11:47 That's totally true. And furthermore, there can be a very dangerous pattern where if you make these parameters a little bit wrong, you could accidentally get yourself in a situation where the simulation isn't really testing anything. Consider the corner case where we just make it so the network never transports anything at all. Now your program never does anything, and then all your tests are green because we didn't find any bugs, because your program didn't do anything. That would be a bad outcome. And so there's a few techniques that we can use to avoid this and to make the simulation really good and really well tuned instead. One is just, let's look at things like code coverage, or the more sophisticated version of this, which is behavioral coverage. If you have certain corner cases in your code that you know are interesting and are important to cover, you should just log a little message when that situation happens. And then at the end of a test run, we should make an assertion that we're able to see all those messages. That we basically got at some point to every one of those situations. And the generalization of that is something that we call sometimes assertions. They're the dual or the opposite of normal assertions. Rather than asserting that something always or never happens, we're asserting that it sometimes happens, because that's an important indication about whether our test is working effectively.

Sri Panyam 00:13:10 Is that in a probabilistic model or is it truly sometimes?

Will Wilson 00:13:14 Right, so it's at least once. We're saying that if I run my program a thousand times, I want to guarantee that I see this behavior at least once; that is a property of my testing system. I should be able to make the situation occur. The ultimate guarantor, though, is just: are you still seeing bugs in production? Basically, I think for people, regardless of whether they're doing DST or not, if you ever see a bug in production, that is actually two bugs: it's a bug in your program, and it's also a bug in your tests. And you should maybe not solve them in that order; maybe you should solve the bug in your tests first. Who knows what else it will turn up. This is a practice that requires a little bit of discipline to get into, but once you do, it's so incredibly powerful. Just understand why it is that you didn't see that bug the minute it was introduced. And often it will point to an entire area of your software, or of your organization or whatever, that is not working correctly.
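A sometimes assertion of the kind Wilson describes can be sketched in a few lines; the class and situation names here are hypothetical, not the Antithesis SDK's actual API:

```python
class SometimesAssertions:
    """The dual of an ordinary assertion: instead of "this must never
    happen", assert that across a whole batch of runs this situation
    happens at least once. If it never fires, the tests have stopped
    exercising that corner of the code, which is itself a test bug."""
    def __init__(self):
        self.declared = set()
        self.reached = set()

    def sometimes(self, name, condition):
        # Called from inside the code under test at interesting
        # corner cases, like logging a little message.
        self.declared.add(name)
        if condition:
            self.reached.add(name)

    def check(self):
        """Run at the end of a test batch: every declared situation
        must have been observed at least once."""
        missed = self.declared - self.reached
        if missed:
            raise AssertionError(f"never reached: {sorted(missed)}")

# Hypothetical usage across many simulated runs:
s = SometimesAssertions()
for attempt in range(100):
    s.sometimes("cache_miss", attempt % 7 == 0)      # fires on some runs
    s.sometimes("leader_reelection", attempt == 42)  # fires on one run
s.check()  # passes: both situations occurred at least once
```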

Sri Panyam 00:14:18 So it sounds like there was a bootstrap phase here for a team to onboard. Let's say I'm running a game, a multiplayer game running on the internet, that's probably as low as it gets. What would the bootstrap phase for me be to adopt this?

Will Wilson 00:14:33 So this gets a little bit into what our company does. Basically, up until we came around, this technique was very, very powerful, but it was a giant pain to adopt. Because it required you to do things like mock out your dependencies and somehow take control of concurrency and scheduling within your program, simulate all of your network connections; there's sort of a lot that had to be done. And there were applications and use cases where reliability and correctness were so important that people would do all these things. I know of actually quite a large number of people now, because they come talk to me, who actually started doing this in the years after FoundationDB got acquired by Apple. The team at Dropbox that worked on their synchronization technology built their own DST. TigerBeetle is a very well-known financial transactions database.

Will Wilson 00:15:31 They built their own DST. It’s quite good, you know, a number of people actually in the finance or FinTech or crypto space built their own DST and did very good job at it because they really wanted to catch any kind of problem before it hit abroad because it could result in losing tons of money. And obviously other database vendors were also doing this as were a number of teams that got influenced by what we had done at FoundationDB. And by the way, after I went to work at Google, I learned that there were teams inside Google that had done this and just never told anybody about it. So I think this is a technique that’s been independently reinvented a few times, but it remained a thing that only very, very committed people who really felt like their program had to be perfectly reliable would do because the adoption costs were high.

Will Wilson 00:16:25 So our new company Antithesis tries to drastically reduce the adoption cost for this style of testing by basically doing, like solving two, like of the problems that you and I have already discussed. One is how do you take an arbitrary collection of software and make it deterministic. And prior to us being around that took a lot of work and what we’ve done is write a hypervisor, which is deterministic and you can take an arbitrary collection of software and just put it in the hypervisor and now it’s deterministic too. And so that already on its own makes this much easier. The second problem is the one you touched on, which is what’s the network drop rate. What is the way of tuning all these parameters of the simulation to find all the bugs and to find them as quickly as possible? This is a very hard optimization machine learning science problem.

Will Wilson 00:17:22 And it’s sort of unreasonable, in my opinion, for every team on earth to try and solve this independently. And so we have also just done a huge amount of research and a huge amount of development to make this simulation self-tuning and to make it very intelligent and very good at finding the correct parameters and the correct settings. So you just don’t even have to think about that. And so the goal here is to make it much, much easier for people to adopt this stuff, which hopefully just makes them a lot faster at writing software and also produce much higher quality software.

Sri Panyam 00:17:59 I want to quickly touch back on a point you mentioned about the cost. What were the costs for these organizations to adopt DST in their own setting, engineering-wise and financially?

Will Wilson 00:18:09 Yeah, I think if you don't have a solution like Antithesis, then the question is really dominated by your dependencies. Because the hard thing about DST is that unless you have a hypervisor, you need to somehow make it so all your dependencies are deterministic too. And that really does just hugely limit what you can do. Most software in the world is not deterministic. It's got concurrency, it's got threads, it checks timers, it does all kinds of stuff. When we were developing FoundationDB, we were so influenced by this DST style of development that we actually started removing our dependencies to make it easier to test in this way. So very early versions of our database were using Apache ZooKeeper for coordination, for instance. And that just became a thing that we could no longer accept because it made it harder to simulate. And so we deleted ZooKeeper and wrote our own Paxos implementation that could be simulated, and that made sense for us. But that's a very, very hard thing to ask somebody to do if they're just starting to adopt this technology, which is one reason that I think thus far it's only happened in relatively specialized places.

Sri Panyam 00:19:27 It's almost like asking them to reinvent or reimplement what you guys are doing.

Will Wilson 00:19:32 Yeah, it's just a very high barrier to entry, and that's what we're trying to fix.

Sri Panyam 00:19:37 Fair enough. That brings me back to the hypervisor that you had mentioned. I mean, can you tell me more about that?

Will Wilson 00:19:43 Sure. So we started from the question of how could we make DST easy to adopt? And it's, well, we need to make it so that everybody can make their program deterministic super easily. And then it's, well, if your program's written in Java, you could try and make a deterministic JVM, and if your program's written in C, I guess you could try and make a deterministic Linux user space. Actually Facebook did that; that project's called Hermit. But we're going to have customers who use every language, so we can't just play whack-a-mole here doing all these things. And then it's, well, so much of the value is being able to test microservices where service A is talking to service B over a network. But now we need to make that networking deterministic. How are we going to do that? Oh wow, we need some crazy networking library that we can drop in.

Will Wilson 00:20:30 But everybody does network differently. How do we like. So very, very quickly you get to, well I guess we should try and get under all these things and make a deterministic operating system, but then you hit another problem. Well first of all, not everybody in the world uses Linux and we would love to be able to have customers on Windows someday and so on. But also the interface between the Linux kernel and everything running on top of it is really complicated and it’s always changing. And so maintaining a perfect deterministic Linux as a small company just seemed a very tall order. And then we were, you know what, you know what is a very simple interface that has not changed in 40, 50 years, the X86 CPU instructions. Look, so let’s just get under the operating system and let’s emulate a deterministic computer where no matter what the operating system does, no matter what interrupts its sets, no matter how it, is managing memory or doing anything from the outside or rather from the inside, it’s like living in a perfect isolated hermetic, deterministic universe and it, it can’t actually cause any non-determinism.

Will Wilson 00:21:49 And then once you build that, it’s so easy to adopt. Now people don’t have to change their software at all. They just give it to us. We run it inside the magic box and it’s perfectly deterministic. So it’s really very, very powerful.

Sri Panyam 00:22:03 How does this overcome the, I mean, this box is still in isolation. Is it external to your TCP stack itself, or the network layer itself? So how do you reconcile that?

Will Wilson 00:22:13 So basically what we do is we take our customer's multiple microservices, representing a bunch of different physical nodes in their compute backend, plus whatever their dependencies are. Maybe they have a database or maybe they've got some API server or whatever, and we take them all and we put them inside the same guest operating system environment, which is Linux. And then we connect them over TCP locally inside the VM. So they can speak over network connections between different Docker containers running all in the same version of Linux. And then we can actually reach in and interfere with those network connections. We can slow the packets, we can drop packets, we can do whatever. We can create simulated partitions. And we can correlate all this with the code that's running, the thing I told you before, because we see what code is running, because we control the operating system.

Will Wilson 00:23:09 And so it’s actually a very complete solution. It works very well. The one thing that’s a little bit of friction is if you are highly dependent on some third-party web service that you cannot put in a container and have a hard time mocking that is going to cause a problem. So we try and reduce that friction for our customers also by writing very good, very high-quality mocks for a lot of common web services. So for example, we have a whole fake AWS that we can run inside the simulation with you and if your program uses AWS, it just runs in there and uses the fake one and can’t even tell the difference. We actually test our own stuff that way because we’re very heavy AWS users, we can do the same for Snowflake and GCP and Stripe and you know, whatever.

Will Wilson 00:23:53 We can write a good mock for it. And that’s nice because basically just a few of those covers the vast majority of APIs that people use. And if there’s a long tail, you know, I think we’ll get to it someday. The other thing that the hypervisor gives us, which is very, very powerful, is extremely efficient and rapid snapshotting and restore of the whole state of the guest operating system. So if you use a VMware hypervisor for instance, you can snapshot that VM and then you can reload it. And that takes a little bit of time, but you can do it. Our hypervisor is optimized to perform that operation incredibly quickly and efficiently. More than that, even when it’s running multiple VMs simultaneously on the same computer, those VMs can actually address the same physical memory if their local version of that memory hasn’t changed.

Will Wilson 00:24:55 So we’re basically doing copy on right on all of the memory of the guest operating system. So now if we run a hundred VMs in parallel on the same machine, they’re not using 100 times the memory. We’re basically deduplicating all of the memory between all the different VMs, which is a thing that I don’t know anybody else whose hypervisor can do that. I’m sure there is one somewhere, but what it means is that we can now test your stuff really efficiently. Because we can use huge amounts of parallelism without drastically increasing the memory requirements, which just at the end of the day means that we can find bugs faster and we can find rare bugs faster.

Sri Panyam 00:25:30 It's interesting, because the implication of this means that A, you're kind of running the entire customer world in a single sandbox, and you can also do things like simulate the connection between A and B as if they were in the same zone or same region, or even across regions. Because you want to know that, this is for DR and for various kinds of fault tolerance.

Will Wilson 00:25:52 That's exactly right. All that we have to do is really change the latency on the virtual connection or...

Sri Panyam 00:25:58 And error rates.

Will Wilson 00:25:59 And error rates. And also, you know, correlated failures. We'll make things in the same DR region fail together more often, but not across regions. And people use us to test DR all the time. People also love using us to test live upgrades, because that's a very scary situation for many teams. You have some big service deployed and you have customers who are using it, and you want to incrementally upgrade your cluster. While the upgrade is rolling out, you're going to have servers with both the old version and the new version serving requests simultaneously. Does that all work? It can be very hard to test that, but it's quite easy to test it with our environment.

Sri Panyam 00:26:37 Interesting. So two things here from a quote unquote cost perspective. What is the biggest installation that you've run so far on this? You know, how many services, how many instances, RAM, cores?

Will Wilson 00:26:50 Yeah, so the biggest one that I know of: we have a customer, WarpStream, which was recently acquired by Confluent. So I think they may be our largest one. They developed a Kafka replacement with blob storage on S3. And so we started out just testing their Kafka replacement and making sure that that was good, and the things that were really durable were really durable and so on. And then what happened was they added all the rest of their SaaS architecture into the simulation as well. So the same simulation is not just testing their Kafka and their operations with S3 and so on, it's also testing user signups and, you know, people using their admin APIs to create clusters. And it's totally crazy. It is actually far beyond what I've ever done with DST, but it works.

Sri Panyam 00:27:43 How big was it?

Will Wilson 00:27:43 I would have to go look it up. I do not know the answer off the top of my head, but I think there were many, many different services involved.

Sri Panyam 00:27:50 Okay, fair enough. In terms of their own testing phase, typically how long do customers run this for? Is it a parallel thing? Is it an ongoing thing? What is the workflow for them?

Will Wilson 00:28:01 Yeah, that’s a great question. So I think there are two main ways that people do this, and one of our long-term product goals (actually not even long-term; a relatively short-term one now) is to bridge these two ways of using it. So the first way people use this is sort of offline, as a very intensive check phase, maybe overnight. Or maybe they do tons and tons of testing before rolling out a release. So the idea here is you spend all day writing code, and whatever the last version of your code on the default branch is, you send that to us when people leave work, and then we test it all night, and in the morning you come in and you see what your results are. And that can be cool. It means that we can find pretty rare, pretty hard-to-find bugs, because we’re testing for 12 hours or whatever, but they’re still fresh in your head and it’s probably code that you changed yesterday.

Will Wilson 00:28:57 So it’s still high productivity to work on those problems. The other way of using it, which is much more similar to the way that we used it at FoundationDB (well, I guess we actually did both there), is in the loop while you’re coding. So you write some code, you make some change, you kick off a test. If you want to do it this way, you need to make it so that your tests can run very fast or can give you good information very quickly, so that within 15 minutes, or maybe an hour at most, you get some answer: was there anything obviously broken with this? And that can be a very, very powerful iteration loop. I think the ideal is to have both, and moreover to have them working together.

Will Wilson 00:29:43 It would be cool if, for example, you are submitting builds throughout the day and you’re running tests on them, and then anything which at the end of the day you’re still iterating on, but which was green, where we didn’t find a bug yet, overnight we do hundreds more CPU hours of simulation and look much more deeply for bugs in any of those versions. Then we give you a report when you get in in the morning, and you can sort of pick up where you left off. I think that would be really cool, and that’s sort of what we’re aiming to get towards.

Sri Panyam 00:30:12 Hmm, interesting.

Sri Panyam 00:30:47 You brought up correlation with the code base earlier. Tell me more. At what level is the correlation? Is it at the compiled binary level? Is it at the actual source code level, based on coverage? How does that work?

Will Wilson 00:30:59 Right, so basically we have very good integration with a large number of different language toolchains. It works a little differently depending on the language. For some of the languages we actually just come in and transform your binary afterwards and you don’t have to do anything at all; we do that for Java. For other languages, we give you some flags that you pass to your compiler; we do this for C++ and Rust, for instance. And the result in either case is the same binary that you would run in production, except it tells us when each line of code gets reached. So far this is very similar to instrumenting for code coverage or something like that. But the key is we’re not just going to use this for reporting coverage to you; we’re also going to use it for this kind of correlation stuff, for saying, oh, we’re in code block A, we’ve never tried pausing a thread while we’re in code block A before.

Will Wilson 00:32:00 Let’s do that now. Or, this microservice is running function X and simultaneously this other one over here is also running function X. Let’s interrupt the network connection between them now and see what happens. And we’re going to do that very much in parallel. And then the fact that the hypervisor lets us cheaply snapshot and restore stuff means that we can go back to situations that we found interesting and try exploring more from that point. And that basically lets us amplify the probability of very, very rare events to find really, really hard to discover bugs. I’ll give you a concrete example because that was a little bit vague. Suppose you have a bug and in order for this bug to happen, you need to be in the middle of a transaction rollback and there needs to be some network problem. And at that moment something has to trigger a leader election in your cluster or something.

Will Wilson 00:33:00 If each of those events has a one in 1,000 chance of happening at any given moment, that’s a one in a billion bug. That’s very, very hard for a test to find normally. But if you’re running in production on thousands of nodes and whatever, your customers will totally hit that bug. So this is a classic example of a bug that normal testing would never, ever find and that would cause you to get paged. But what we can do, because we can see what code is running, is when we hit that one in 1,000 transaction rollback, we say, aha, we made a transaction rollback, we’ve never seen that before. Let’s do more testing from this point onwards. And for each of the tries we do from that point, if it doesn’t hit the next event of interest, we can just reset to the transaction rollback and try more stuff from there. It’s like if you’re playing a computer game and you can save your game anytime you want: it’s easier to make progress because you can always just reload your game and go back to the point where things were going well. And so basically what this means is that for these bugs that require multiple rare things to happen in sequence, we can find them at a vastly accelerated rate compared to traditional techniques.
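The save-and-reload amplification described above can be illustrated with a toy model: three independent one-in-1,000 events must occur in sequence, and restoring to the last snapshot after each failed try means progress is never lost. This is a hypothetical sketch of the probability argument, not Antithesis's implementation:

```python
import random

# Toy model of checkpoint-and-restore exploration. Three independent
# rare events (each with probability p) must happen in sequence;
# restoring to the last snapshot after a failed try keeps progress,
# so the search cost is additive rather than multiplicative.

def search_with_checkpoints(p, rng, events=3):
    attempts = 0
    milestones = 0
    while milestones < events:
        attempts += 1          # one simulated try from the snapshot
        if rng.random() < p:
            milestones += 1    # snapshot here; never lose this progress
    return attempts

rng = random.Random(7)
attempts = search_with_checkpoints(1 / 1000, rng)
# Roughly 3 * 1000 tries on average with restore, versus on the
# order of 1000**3 when every failed try restarts from scratch.
```

The expected cost with checkpoints is about `events / p` tries; without them it is about `(1/p) ** events`, which is the one-in-a-billion figure from the example.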

Sri Panyam 00:34:14 You know, I love the example of saving the game and going back again. I spent my nineties doing that in all those old games.

Will Wilson 00:34:20 Yeah. Well, we actually use computer games a lot for testing our DST. We believe that a good testing system should be able to beat a computer game in the interest of testing it. And indeed, if you go to our website, we have many videos of Antithesis playing classic computer games and beating them.

Sri Panyam 00:34:37 I saw the Mario example at Gamescom last year. So back to the example. What you said was, when you hit that one in a thousand transaction rollback scenario, you still need that scenario to be hit. Is there a way you can kind of say, look, this block of code actually could get hit through some sequence of code paths, and can you force that to happen proactively instead of waiting for it?

Will Wilson 00:34:59 So that’s a great question. We try very hard to make it so that everything gets hit, and there’s a number of techniques that we use to do that. One is a, I think, very underused technique called swarm testing, which is applicable even to people who are not doing DST. Just everybody needs to know about this technique, okay? And so the basic idea of swarm testing is you sometimes should run all of your tests with certain features of your program turned off. That sounds counterintuitive, right? Why would you want to do that? Well, imagine a situation where, let’s say, you have some data structure and you have two randomized operations that you can do on the data structure. One of them adds an element to it and one takes an element out. If both of these things happen with 50% probability, the contents of the data structure will never get very large.

Will Wilson 00:35:52 Let’s say you want to get a hundred things into the data structure, and then some garbage collection routine or some compaction happens. That would require a one in 2^100 chance if you’re randomly adding and removing things. But if we just temporarily disable the remove function, then you’re guaranteed to get a whole bunch of adds in a row, and then you’re going to trigger that compaction routine. And so this is a very motivating example, and you want to do this with everything, including with your fault injection. One of the most common mistakes that customers would make, in the early days before we did this automatically, is they would say, oh, I want my testing to be maximally powerful, so I want to turn on every single fault. And the problem is, there are some faults that will hide bugs.

Will Wilson 00:36:44 Imagine there’s a bug where your program gets into some really bad state, but if you restart the program, the bug goes away. Well, now if we are injecting a fault that occasionally kills nodes, that can hide that bug. And so what you actually want to do is sometimes run with killing nodes completely turned off, because that might let you find some bug. And it’s exactly the same with everything else: with what kinds of requests you generate and every other kind of behavior that you want to provoke in the system. Now, it’s still possible to get this wrong. It’s still possible to overlook things. And so that’s why we come back to the sometimes assertions, the logging, that kind of stuff. If you know that transaction rollbacks are an important situation, you should be adding, you know, a sometimes assertion or a log to your rollback function. And then we should be looking at every test run and saying, did we make rollbacks happen? And if not, that should be a red test failure. It could mean that your program has changed (maybe you accidentally disabled the rollback function, and now we’ve found a real bug), but it could also mean that the tests are weak. It could mean that for whatever reason we’re not making transactions roll back enough, and that could be very, very important information.
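A hypothetical sketch of swarm testing on the add/remove example above: each run independently enables or disables operations, so add-only runs occur often enough to push the structure past the size threshold that triggers the rare path (exactly the kind of event a sometimes assertion would track). The names and the size-100 threshold are illustrative:

```python
import random

# Hypothetical swarm-testing sketch for the add/remove example.
# The "compaction" threshold and all names are illustrative.

def swarm_run(rng, steps=200):
    ops = {
        "add": lambda s: s.add(rng.randrange(1000)),
        "remove": lambda s: s.discard(rng.randrange(1000)),
    }
    # Swarm step: each operation is independently on or off this run,
    # so some runs are add-only and can grow the structure large.
    enabled = [name for name in ops if rng.random() < 0.5] or ["add"]
    state = set()
    compacted = False  # stand-in for the rare, size-triggered path
    for _ in range(steps):
        ops[rng.choice(enabled)](state)
        if len(state) >= 100:
            compacted = True  # a "sometimes" assertion would track this
    return enabled, compacted

# With add and remove both always on, reaching 100 elements is
# astronomically unlikely; swarm runs hit it routinely.
hits = sum(swarm_run(random.Random(i))[1] for i in range(50))
```

A real harness would fail the batch if `hits` stayed at zero, which is the "did we make rollbacks happen" check from the conversation applied to compaction.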

Sri Panyam 00:37:59 So in the example of the transaction rollback, the other triggering event was a leader election. Again, how do you draw the correlation between what’s happening at the lowest bits-and-bytes level and what’s happening at the more application level? How do you say, hey, that actually is a leader election?

Will Wilson 00:38:17 Yeah, yeah, yeah. So the easiest way to do this, the way that requires the least investment from a user, is you just look at lines of code. Every line of code in your program has a purpose. Some of the lines of code are very boring; they run all the time. The code that logs a string is probably running all the time in your program, and so we’re going to see it running all the time and we are quickly going to get bored of it.

Sri Panyam 00:38:42 Across all the programs — because it could be an external event or external service.

Will Wilson 00:38:47 Totally, totally true. But anything that runs rarely, that runs only sometimes, or that runs in very specific conditions is interesting. You could also have, for example, a line of code that very commonly runs together with some other line of code, but very, very rarely runs by itself. That could be something very interesting. So you’re not just looking at which lines of code run or don’t run at all; you’re actually also looking at the correlations between them, when they run together with each other or not, what order they run in, and so on. And this is a kind of analysis that human beings are very bad at doing, but it’s a kind of analysis that AI is actually very good at doing, machine learning and so on. And so you can sort of look at the history of what the software has done and then ask: is this novel, is this interesting? Is this very similar to what I’ve seen before, or is it very different? And then you can use that as an input into where you want to continue testing.

Sri Panyam 00:39:49 So are you using AI here to do the correlation?

Will Wilson 00:39:53 Yeah, yeah. It’s funny, we actually find that for many, many application domains and in many, many situations it’s overkill. You actually find almost all the bugs with much simpler methods once you’ve got some decent randomness in there and some decent fault injection. But we are currently doing some very cool research with this that we think is going to drastically improve our results, which I cannot quite talk about yet, but we’ll probably be announcing soon.

Sri Panyam 00:40:30 Could you guys give a hint?

Will Wilson 00:40:32 I’ve said too much.

Sri Panyam 00:40:34 Sure. I can’t wait to see it. What about limitations? I mean, cost sounds like one, but when would this be completely inapplicable or ill-advised?

Will Wilson 00:40:46 Yeah, great question. People often focus on costs, and I think that this is very silly. Of course I would say that, but no, I think it is very silly, because basically what people are currently doing is costing them so much. Like, you go and you talk to engineering leaders, engineering managers, and you’re like, how much money do you spend on testing? And they’re like, oh, we don’t spend any money at all. We don’t buy any tools. And then you’re like, okay, how much time do your engineers spend answering pages, triaging production issues, writing tests, maintaining tests that are broken, blah, blah, blah. And they’re like, oh my God, they spend all their time doing those things. I’m like, okay, great. How many engineers do you have? How much are you paying them? You’re spending a huge amount on this stuff.

Will Wilson 00:41:31 People don’t always think about it that way, but they are. And I think that’s one of the challenges: making people realize that they’re already spending millions and millions of dollars on this. In terms of what is a bad fit, I think the main things now are: number one, if your program is basically just a straight line in some sense (it’s very easy to test in a conventional way because there’s not a lot of concurrency or network chatter; it’s just a small stateless function that does something, and you can write some unit tests for it pretty easily), that probably doesn’t justify the pain and hassle of trying to test it this way. If you’re running on exotic hardware, we certainly cannot support you today. And if you’re using a vast number of third-party external dependencies, that’s something that we’re going to have a very hard time dealing with for now, because we haven’t written a mock for every single one of them yet. I would say those are the main categories of limitations.

Sri Panyam 00:42:35 I want to contrast it with one, I don’t want to call it a technique, but one different idea. Let’s say programmers or library developers could inject annotations in their code base that explicitly assign error probabilities. How compatible would that be with what you’re doing, and how much would that alleviate that initial pain?

Will Wilson 00:42:58 I think that’s a very interesting idea. So, one thing that we actually do encourage our customers to do, if you’re thinking about annotating, is a very, very powerful technique called bugification. And the idea of bugification is: suppose you have some code which usually does one thing, but very rarely could do something else, right? For example, it could be writing to a disk. Usually it writes all the data to the disk. But you know, as every systems programmer has learned the hard way, sometimes it writes only some of the data to the disk and gives you a special return code that says, hey, I only wrote some of the data. And by the way, networks work the same way.

Sri Panyam 00:43:40 Well, usually flush and fsync are actually not on by default, so…

Will Wilson 00:43:43 Well, that’s true too. Yeah. So anytime you have something like this, that usually does one thing but very rarely does something else, it’s actually great, if you are the one maintaining that code, to add a little macro or annotation around it which says: hey, if I’m running in test, I’m going to do the rare thing on purpose sometimes. I’m going to roll a random number, and maybe 10% of the time I’m just not going to write all the data. Or 10% of the time this thing is going to throw an error, even though it doesn’t have to, just to test what happens on the other side. Is somebody catching that error correctly? Is somebody handling that return value correctly? That can drastically accelerate bug discovery as well.

Sri Panyam 00:44:29 Okay, cool. Switching back to real-world examples, I heard about Ethereum’s, you know, the Merge, and you also mentioned WarpStream. What were some of the trickiest, kind of scariest bugs that you helped them uncover?

Will Wilson 00:44:43 Specifically for Ethereum or for anybody?

Sri Panyam 00:44:45 Oh, I also wanted to bring up WarpStream because you mentioned it earlier.

Will Wilson 00:44:48 Yeah, yeah. For Ethereum, the scariest bug we found, I can talk about now because it was long ago fixed. We found a very, very scary denial-of-service bug, which I believe was actually in their protocol, and they had to change their protocol to address it. Their former head of security was the one telling us about this. He basically said that it would’ve qualified for the very highest tier of their bug bounties. It was a bug that would allow you to essentially DDoS the Ethereum blockchain very easily by sending certain maliciously crafted messages to nodes. You could basically cause them to shut down, or start using 100% CPU and not doing anything else, or something like that. It was pretty, pretty alarming. One of the very first really scary bugs we found was for MongoDB. This was a huge win for us and for them and for our relationship together.

Will Wilson 00:45:42 We had been collaborating with them, but our thing was brand new and it was experimental, and they were trying it out, and, you know, it was sort of, hey, let’s just see how this goes. Not serious or anything. And then right before they were going to ship a major release, we found a very rare but very severe data corruption bug in their WiredTiger storage engine, which could basically corrupt completely arbitrary data in any Mongo instance. And, you know, it was very rare and it was very hard to make happen, but with the perfect determinism of DST, we could make it happen as much as they wanted and enable them to debug it really, really efficiently. And so that was a very cool win for them and for us. And that bug has been written up in extensive detail by one of their engineers, both on our blog and also on their Jira. I can provide a link for that if you want.

Sri Panyam 00:46:44 Yes please.

Will Wilson 00:46:44 Cool. It’s a really good one. It’s WiredTiger bug number 9500. I still remember the number.

Sri Panyam 00:46:51 What was the trigger for this bug? I mean, what was the rare trigger that made it happen?

Will Wilson 00:46:56 Yeah, I’m trying to remember the exact details; this was a few years ago. But basically it involved their storage engine, and their storage engine is a highly, highly concurrent thing. Storage engines have to be highly concurrent because disks are very slow, and so you need a lot of throughput and a lot of latency hiding in order to have a storage engine with good performance. But you also don’t want to take a ton of locks everywhere, because if you do that, then operations targeting different parts of your database could block each other, and that would be bad for performance too. And so much of the difficulty of writing a storage engine is that you’re trying to hide all the weird performance characteristics and weird abstractions of a disk and make it just appear very uniform, even when you have operations that could logically depend on each other or not.

Will Wilson 00:47:48 And I think we found a bug that involved some lock that should have been taken that wasn’t, but it only happened if there was a compaction occurring at the same time that something else happened, and you needed a particular shape of query. I’ll send you the writeup. It’s a pretty gnarly one. I used to be a database guy, and I looked at this bug and I was like, oh my god, I would not know how to fix this. It was pretty bad. I’m trying to think of other really scary bugs that we found for people. So we’ve done a bunch of testing for NATS, which is one of the Linux Foundation or CNCF projects.

Will Wilson 00:48:32 They’re sort of a distributed, software-defined network. And they also have a bunch of cool storage primitives and consensus primitives and stuff like that. We found a bug in their Raft implementation, which I think they also wrote up for our blog. It’s a really cool bug. It requires several things to go wrong: a node has to get excluded from the cluster while something is happening, and then it gets restarted and it comes back and it thinks… it’s one of these classic distributed systems bugs. Yeah, I mean, there’s no shortage of crazy bug stories around here. A lot of our customers have fun writing them up for us, so there’s actually a decent number of these on our blog.

Sri Panyam 00:49:11 Oh, thank you. Can you share some metrics around how you quantify this? I guess I don’t want to use the ROI word, but you know, the before and after, once they see it. How do you kind of quantify it?

Will Wilson 00:49:24 Yeah, so the customer who’s done the most quantification for us was actually Mongo. They did their own internal study and analysis of their use of us, and I think the results actually surprised them a little bit. They actually surprised us too. They had, in one period of time, a hundred really very significant bugs introduced in their development process: release-blocking bugs that could have done something quite bad in the real world. And I should be really clear here: these are bugs that came nowhere close to getting to customers. These were all fixed, and Mongo is really good at this stuff; I think the fact that they’re studying this and quantifying this really shows you how seriously they take it. But basically, of those 100 bugs, 77 were only found by us, which is pretty cool. And then here’s the next part: of those bugs, they found that the average time to fix them was massively lower. If we found the bug, it was fixed in half as much time, and it could be fixed by much more junior engineers.

Sri Panyam 00:50:35 Is that because your discovery would identify all the scenarios that would need to be triggered for that bug, as opposed to guessing what’s going on?

Will Wilson 00:50:42 Yeah, a huge part of it is that we have the determinism, and so we can just replay that bug. So much of the hard part of debugging is you never have all the information you need, and you may not be able to get that bug to happen again. An analogy I really like to use is that it’s a police investigation. When a server crashes, you have to put up the yellow tape so nobody’s going to step on that crime scene and ruin the evidence or something. Anything that you do in production could potentially make the bug go away, and then you don’t get it back. And so really senior engineers get good at guessing. They know these systems like the back of their hand and they’re like, oh, I’ve seen this.

Will Wilson 00:51:21 I saw this one time three years ago; I know what this is. But as soon as you have determinism, it’s like you’re doing that police investigation, but you have a time machine and you can just go back in time and watch the accident happen, watch the murder happen, and now you know who did it. That’s so different. It’s such a huge change. It actually implies really big changes in how all parts of the SDLC work. If you think about it, people log way too much. People log so much data. This is why observability vendors make so much money; it’s all log storage. But why do people log so much data? Because they know that some issue could happen, and if it’s a one in a million issue, but it’s really important, they may not be able to make it happen again.

Will Wilson 00:52:10 And so you better hope that you had that log statement in there so that you can put together the clues and figure it out. But again, if you have a time machine, you don’t need to log all that data. You just go back to before the bug and you watch it happen, or you take a core dump at the time, or you dynamically enable your logs before the bug hits and then turn them off again afterwards. There’s such a different attitude and a different approach that becomes possible once you have this capability.

Sri Panyam 00:52:40 Damn. Yeah. That’s awesome. What’s getting in the way of adoption?

Will Wilson 00:52:44 Oh, I think just that it’s brand new. Like I said, the very first people that we know of who were doing this started doing it in 2010, and it was all hidden inside this company until 2015. And then word got out, but it was only possible for extremely fanatical, crazy people to do it, because the barrier to adoption was so high. And still a lot of people wanted to do it because it was very effective. But we came out of stealth and launched this company one year ago, so we are really, really brand new. Most of the people who we talk to, I mean, I think we sound like crazy people to them. We’re trying to explain to them that there’s this completely different way to have confidence in your software, to get way more confidence in it and to save yourself tons of time. And I think, as with anything, the computer industry, the tech industry, loves to think that we are very, very forward-looking, but actually people are creatures of habit. And that’s especially true about developer tools. And so I think a lot of this is just, we need a little bit of time to move the industry in this new direction.

Sri Panyam 00:53:56 I think it’s only now that time-traveling debuggers are kind of taking off.

Will Wilson 00:54:01 That’s right, that’s right. And that technology has been around for 10, 20, 30 years. GDB was able to do this decades ago.

Sri Panyam 00:54:09 Yeah. Looking at integration now, obviously I’m guessing if I was a frontend developer, I may not be as motivated to get on DST. At the same time, on the other extreme, if I was a kernel developer or a database vendor or a storage developer, this seems extremely, extremely valuable. Where in that spectrum would be the sweet spot to start considering this?

Will Wilson 00:54:33 So we’ve actually started seeing really a lot of traction with application developers, but so far it’s specifically in the sector of FinTech and finance and blockchain stuff. They’re definitely doing something more abstract, more high level than a database developer or a kernel or systems developer. But the thing is that what they’re doing is so important, and it’s so important that it not make mistakes, that it justifies this much higher level of scrutiny. So we recently began working with Ramp, the corporate card company. They have a service that decides whether to authorize charges on the card, and it’s really bad if that fails to respond to a request or authorizes a charge twice; that would not be good. And so they really want to test it very, very effectively, and they’ve been doing that with us. We recently also started working with a large hedge fund. I can’t say their name, but it’s a big one, and it’s a similar kind of situation. They’re moving hundreds of millions of dollars around. They really care that that works correctly, and they really care that every operation is running in the right way.

Sri Panyam 00:55:40 Looking forward, I think the big buzzword now is AI, with AI writing so much code. If you’ve seen the last few days, it’s just coding for hours and generating millions of lines of code. How do you see all that impacting security, bugs, vulnerabilities? It sounds like an opportunity.

Will Wilson 00:56:01 Yeah, I think so. I think basically AI can write a lot of code now, and the thing is that people are very tempted to just sort of ship it and not look too carefully at it, which is not a very good idea if you’re in a certain line of work. And so I think it is a huge opportunity for people to try and develop more sophisticated approaches to software verification. It really highlights that writing code is no longer the bottleneck. Actually, writing code was never the bottleneck, but people thought it was. But now writing code is really, really not the bottleneck. The bottleneck is making sure it does the thing, and that’s much harder.

Sri Panyam 00:56:40 Yeah, I mean, writing it is not the bottleneck; reading it is.

Will Wilson 00:56:42 Yeah, exactly, exactly. Well, that’s the other problem with AI of course, is it generates a whole bunch of code for you and then as soon as you want to modify anything about it, you’re basically starting from scratch.

Sri Panyam 00:56:52 How are people opting into this? You know, do you have to change your approach and how you approach companies? What’s your strategy now?

Will Wilson 00:57:00 It hasn’t really affected our approach that much, to be totally honest. We have a lot of customers who are doing a lot of AI code generation, and they’re using us to test the results. And probably that segment of our customers will continue to grow over time. Nothing about what we’re doing has to change fundamentally to accommodate this use case. You know, it just means there are more bugs for us to find.

Sri Panyam 00:57:24 Fair enough. Talking about bugs, you mentioned, was it Mongo that had a high bug bounty? I’m sorry, it was for Ethereum?

Will Wilson 00:57:30 That was Ethereum. Yeah, yeah,

Sri Panyam 00:57:31 Yeah. Did you guys qualify for that Bug Bounty or?

Will Wilson 00:57:33 No, since we were a vendor for them, we didn’t qualify for it. I wish we had; it would’ve been a lot of money. In the early days of the company, we actually thought about using bug bounties as the primary source of revenue, because some of them get pretty high and this thing is actually quite good at finding security bugs that human beings sometimes miss. And so that was certainly a thing we thought about. In the end, we thought that was a little bit adversarial, and that we would rather be embedded with our customers’ engineering teams, making them successful and improving their lives, making it easier for them to do their job. We thought that in the long term, that would lead to better customer relationships than this coming-in-from-the-outside-and-claiming-bounties model. But I assume somebody’s going to be using either DST or AI or something else to claim bug bounties very soon.

Sri Panyam 00:58:32 Hmm. Fair enough. It’s an exciting kind of future out there. As we wrap up, in closing thoughts, what advice would you give to our listeners about improving their own testing practices or evolving their platforms to get ready for DST?

Will Wilson 00:58:47 It’s really the thing I said towards the start of this. I think the single most important piece of advice is: any time you see a bug in production, any single time, it is two bugs. It is a bug in your code, but it’s also a bug in your tests. And you should demand more from your tests. You should cultivate the attitude that it is unacceptable for things to go wrong in production, and that if a thing does go wrong in production, you should be asking yourself, what could we have done differently that would mean we found this ahead of time?

Sri Panyam 00:59:23 Awesome. For our listeners who want to learn more, how can they do so and where?

Will Wilson 00:59:27 Ah, yeah. There’s a lot of people who have written good stuff about this. We have some material on our website, antithesis.com. Of the other people who have talked the loudest and the most about DST, the FoundationDB team has written a bunch of stuff about this. They’re now at Apple, and a lot of them are at Snowflake, but they’ve written a ton of good stuff. The FoundationDB paper that came out a few years ago in SIGMOD is very good; I can send you a link to that. It’s got a whole section on how we did DST. The people at TigerBeetle have written some really good stuff about this; they’ve done a very good job at it. So I think those are some pretty good resources. And then I gave a talk at Strange Loop back in 2015 about the FoundationDB DST approach. That was almost 10 years ago now, but it still has some good stuff.

Sri Panyam 01:00:12 Nice, thank you. Before we wrap up, is there anything that we haven’t covered that you’d like to share for our listeners?

Will Wilson 01:00:17 Nope. I think this was a ton of fun. Thank you so much.

Sri Panyam 01:00:20 Thank you. Thank you Will, thanks for being on the show. Looking forward to seeing you live.

Will Wilson 01:00:25 Cool.

Sri Panyam 01:00:26 Cheers. [End of Audio]

---

[Original source](https://se-radio.net/2025/09/se-radio-685-will-wilson-on-deterministic-simulation-testing/)
