SE Radio 697: Philip Kiely on Multi-Model AI
Philip Kiely, software developer relations lead at Baseten, speaks with host Jeff Doolittle about multi-agent AI, emphasizing how to build AI-native software beyond simple ChatGPT wrappers. Kiely advocates for composing multiple models and agents that take action to achieve complex user goals, rather than just producing information. He explains the transition from off-the-shelf models to custom solutions, driven by needs for domain-specific quality, latency improvements, and economic sustainability, which introduces the engineering challenge of inference engineering. Kiely stresses that AI engineering is primarily software engineering with new challenges, requiring robust observability and careful consideration of trust and safety through evals and alignment. He recommends an approach of iterative experimentation to get started with multi-agent AI systems. Brought to you by IEEE Computer Society and IEEE Software magazine.
---
Show Notes
#### Related Episodes
- SE Radio 633: Itamar Friedman on automated testing with generative AI
- SE Radio 603: Rishi Singh on using Gen AI for test code generation
- SE Radio 626: Ipek Ozkaya on Gen AI for software architecture
- SE Radio 680: Luke Hinds on Privacy and Security of AI Coding Assistants
#### From IEEE Computer Society
- https://www.computer.org/csdl/proceedings-article/cai/2025/240000b617/289J2jTxdIs
- https://www.computer.org/csdl/magazine/it/2025/01/10893880/24sGq0TJUzu
- https://www.computer.org/csdl/magazine/it/2025/04/11125703/29aJ2GUXcPe
- https://www.computer.org/csdl/magazine/mi/2025/01/10916416/24RVwgmja6s
- https://www.computer.org/csdl/proceedings-article/cai/2025/240000b603/289J1ZAVOYo
---
#### Transcript
Transcript brought to you by IEEE Software magazine.
This transcript was automatically generated. To suggest improvements in the text, please contact [email protected] and include the episode number and URL.
Jeff Doolittle 00:00:18 Welcome to Software Engineering Radio. I'm your host, Jeff Doolittle. I'm excited to invite Philip Kiely as our guest on the show today for a conversation about multi-agent AI. Topics around machine learning and artificial intelligence have come up in previous episodes. Some examples are Episode 633, when Itamar Friedman appeared on the show to discuss automated testing with generative AI; Episode 603, which featured Rishi Singh on using Gen AI for test code generation; and Episode 626, which featured Ipek Ozkaya on Gen AI for software architecture. Philip Kiely leads software developer relations at Baseten. Prior to joining Baseten in 2022, he worked across software engineering and go-to-market for a variety of startups. Outside of work, you'll find Philip practicing martial arts, reading a new book, or cheering for his adopted Bay Area sports teams. Philip appeared previously on Software Engineering Radio in Episode 426 for a discussion about writing for software developers. Philip, welcome back to the show.
Philip Kiely 00:01:20 Hey Jeff, thank you so much for having me.
Jeff Doolittle 00:01:22 Glad you’re here. Multi-agent AI. What are we going to talk about in this episode, Philip?
Philip Kiely 00:01:29 Today we're going to talk about how to build things that aren't ChatGPT wrappers when you want to build what we call at Baseten AI-native software. So that's companies that started out building AI-native products, as well as existing companies who are adopting AI and trying to become AI native. It's about more than just producing a wrapper around a single model call. It's about composing multiple models together into a unique and differentiated system that can actually achieve your user's goals.
Jeff Doolittle 00:02:03 So give us more details on what you mean. It sounds like you’re saying most teams start by doing what you’re calling a wrapper around something like ChatGPT or maybe some other models. What exactly does that entail?
Philip Kiely 00:02:15 Yeah, so when you look at the sort of evolution of an AI product, you might have something that starts out as an add-on. Let's say you add a chat window or a search window to an existing product.
Jeff Doolittle 00:02:28 Everybody's doing this, right?
Philip Kiely 00:02:30 Not exactly, but then there are a lot of companies now that are building really exciting and new products that weren't possible before we added all of these AI capabilities. An example of that is something like Descript, which is a tool that I use all the time as a content creator, where I upload the video that I made, I get a transcript, and then I can type in the transcript to actually edit the video. This is a brand-new capability, and that requires more than one model. It requires transcription, it requires language models, it requires the ability to actually have a model that takes action on an object like the video sequence. Or if you think about a company like, I don't know, Sourcegraph or Zed, who are building code editors, it's more than just ask a question, get an answer, paste it into your code base. You're actually integrating the context of the code base into an agentic framework and then powering that with multiple different models, not just one. All of these products, they're using a bunch of models, they're coordinating and orchestrating them together, and they're also figuring out how to do all of that in a way that's fast for the user from a latency perspective and creates feasible unit economics from a business perspective.
Jeff Doolittle 00:03:49 Okay. For people that are just getting started, are there benefits sometimes to just using maybe off-the-shelf AI tools and just wrapping them?
Philip Kiely 00:03:58 Yeah, a hundred percent. When you’re getting started and you just need to introduce some kind of intelligence to your product, you should absolutely go grab an off the shelf model. Something like Gemini, Claude, GPT-5, all of these are great models. They’re very capable, they have a wide range of capabilities and they’re going to show you immediately what’s possible at the frontier of intelligence. But as you advance in the product development cycle, as you roll out to production to hundreds of thousands, millions of users, you start to understand the specific needs of your product and your audience and need to build a tailored AI strategy to accommodate those.
Jeff Doolittle 00:04:40 Okay. Now, some of this stuff is still new. Things change so fast that I don't want to assume that every listener knows what the word agentic means, which you just used. So could you just describe briefly what is agentic AI, as opposed to just what we've thought of as chat AI, this sort of stuff that's been common and is now very well known?
Philip Kiely 00:05:00 Absolutely. We've got a lot of fun buzzwords here, and that's what helps us feel smart in the AI industry. This is not a video podcast, but I'm actually wearing a shirt right now that says artificially intelligent. And I use all these buzzwords to feel naturally intelligent. Agentic AI basically just means going from an AI that produces information to an AI that takes action. At a technical level, this is implemented through something called function calling or tool use, where you add the capability for your model to take a list of tools, pick which one it wants to use, and return that selection as well as inputs to the tool. Then your agent framework or your product environment is going to be able to take that suggestion from the model, like, hey, do this action, and actually execute that function call. Use that tool to take whatever action the agent has suggested.
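The loop Kiely describes can be sketched roughly like this. Everything here is an illustrative stand-in, not any particular framework's API: the JSON shape, the `get_weather` tool, and the registry are invented for the example.

```python
import json

def get_weather(city: str) -> str:
    # Stub standing in for a real tool (an API call, a database query, etc.).
    return f"Sunny in {city}"

# The list of tools the model is told it has access to.
TOOLS = {"get_weather": get_weather}

def execute_tool_call(model_output: str) -> str:
    # The model only returns a tool selection plus inputs; the agent framework
    # (not the model) actually executes the function, and the result is fed
    # back into the next generation as context.
    call = json.loads(model_output)
    return TOOLS[call["tool"]](**call["args"])

print(execute_tool_call('{"tool": "get_weather", "args": {"city": "Tokyo"}}'))
# prints: Sunny in Tokyo
```

The key design point is the separation: the model proposes an action, and the surrounding product environment decides whether and how to execute it.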
Jeff Doolittle 00:05:58 Okay, so I might have an AI experience that just gives me information like ChatGPT, but no action is being taken. The next step, once we go to an agent, now it can do something with a tool, a function call, as you said. And now in this episode we're also going to talk about multi-agent AI. That sounds like even the next level of composition. So, what happens now at that level?
Philip Kiely 00:06:21 Yeah, so ChatGPT itself has become increasingly agentic. For example, if you think about ChatGPT Pro, which does all of the searching ahead of the reasoning chain, doing a web search is an agentic action. Your model is not capable of simply just doing that. You need to build a scaffold around it, you need to tell it that it has access to this tool, tell it that it can request this tool and then wait for a response from it. And when we think about composing multiple agents together, that really comes from having more than one model in many cases. So everyone, when we're talking about generative AI, we love to talk about large language models, LLMs, stuff like that, GPT-5, Gemini 2.5, and the open-source ones, Kimi and Qwen, all those guys, but there are a lot of other modalities out there in the generative AI industry. We can create images and videos from text. We can use embeddings to encode the semantic meaning of text and use that to do retrieval, use that to do memory context, recommendation, all that kind of stuff. We can interface with the user through new modalities like voice. There are all these other capabilities outside of the core language model intelligence that we need to compose together in order to build one of these more advanced systems.
Jeff Doolittle 00:07:43 Speaking of a more advanced system, there's an example we talked about before the show that it might be helpful to dive into, which is the D&D AI assistant. So let's talk through, maybe if a team is just getting started, as you mentioned before, maybe they want to use some off-the-shelf model. Let's start there, and then walk us through a story of how this team might grow. Like maybe eventually it generates video and all this crazy stuff. Let's start simple, right? Maybe it's all text based and it's just driving stuff. And then how would that begin to grow more into this agentic flow? And then how might that grow into multi-model flow? And then when would you maybe want to transition from using a GPT-5 or a Claude to using a different model, maybe one of your own?
Philip Kiely 00:08:27 This is a great question. So this is a problem that product teams face all the time: they have this vision, right? Which is, in this case, Jeff and I, we were kind of joking around before the show, like what would be a fun AI product that I would enjoy building? And I play a lot of Dungeons and Dragons, or at least I used to before I was working in AI all the time. So in a D&D game, you have a bunch of your friends together all playing different characters, and then one person who's the game manager or the Dungeon Master, who's in charge of actually orchestrating the game. You can think of the job of the Dungeon Master as almost being a video game engine for all your friends, making the actual actions happen.
Philip Kiely 00:09:09 So we're thinking, well, what if you could just do AI instead, so that everyone could play instead of one person having to kind of orchestrate the game? If I was to build a V0 of this kind of system, I would think first from a product perspective, okay, I want to build an assistant for a Dungeon Master. I want to make it easier and more fun to do this role rather than trying to replace it right away with an entire end-to-end system. And so, what that would probably look like is a text-based interface. Like you said, I can just spin up a large language model, I can use something off the shelf, and then the first component that I'm going to need to introduce is some kind of context and some kind of memory. So, I'm going to want to be able to feed my model information, like what are the stats of different monsters that they might run across, and what is the equipment that they might need? What is the terrain that they're in? All that kind of stuff. So, for that we start to think about something called Retrieval-Augmented Generation, or RAG, which has been a really hot topic in the industry for the last couple years. With RAG, you can introduce new context into your model dynamically at runtime. And what that lets you do is ground the model in facts and reality, or in our case, a specific fictional world, and you achieve that using embedding models.
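A minimal sketch of the retrieval half of RAG for the D&D example. The "embedding" here is a toy word-overlap vector so the snippet stays self-contained; a real system would call an embedding model and get dense vectors, and the documents are invented stat blocks:

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "embedding": a bag-of-words vector over lowercase tokens.
    # A real system would use an embedding model here instead.
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# The campaign knowledge base: monster stats, terrain notes, and so on.
documents = [
    "Hydra: 15 hit dice, regenerates heads, vulnerable to fire",
    "Goblin: 1 hit die, cowardly, travels in packs",
    "Swamp terrain: movement halved, stealth checks harder",
]

def retrieve(query: str, k: int = 1) -> list[str]:
    # Rank documents by similarity to the query and keep the top k.
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

# Retrieved context is injected into the prompt at runtime to ground the model.
context = retrieve("What are the hydra's stats?")[0]
prompt = f"Context: {context}\n\nQuestion: What are the hydra's stats?"
print(context)
```

The grounding happens in that last step: whatever the retriever returns becomes part of the prompt, so the model answers from the campaign's facts rather than its general training data.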
Jeff Doolittle 00:10:35 Okay, so now we've got stage one: we've got our assistant that helps the human Dungeon Master, I guess, work less, right? They're getting help. Now what's the next phase of evolution here? You know, real quick, what's the product vision? Where are we heading? And then we can go back to the steps we're taking to get there. It's like, where does this thing end?
Philip Kiely 00:10:53 Yeah. The product vision that we're building toward here, and you know, again, this is just completely hypothetical, is what if you could have a system where my friends and I show up and we turn on the system and we can all just play D&D together. We come up with characters, we come up with scenarios, and no matter what we do, the game is able to update itself, it's able to interact with us, it's able to create characters for us to interact with. Basically an entire generative gaming experience delivered exactly the same way that an experienced human GM would, but in a way that all of us can play together instead of one person having to do that role.
Jeff Doolittle 00:11:36 And maybe it’s creatively generating the story, but maybe eventually it generates video snippets or…
Philip Kiely 00:11:42 Exactly, yeah. On the modality side, when you build systems like this that you're trying to deploy to a large number of people and have be a very engaging AI experience, you really quickly have to move beyond text. No one gets together to play D&D by email. You get together in a room and you're talking and you're rolling dice. And so, you want to introduce new capabilities to the system. First off, you need to be able to transcribe the conversation. You need to be able to diarize the conversation, which means attribute who's saying what lines, so you know what's happening with each character. And then you also need to give voice to the AI. You need it to be able to speak back to the players, not just produce text on screen. So, these are all great examples of the kind of work that has to happen in a multimodal agent system where you're bringing in inputs from different sources.
Philip Kiely 00:12:41 And then we can also see a little bit more concretely what it means to be an agent. For example, rolling dice, that's a huge part of D&D. It determines whether or not your attacks hit, whether or not you are able to take certain actions. And it's very easy for a computer to quote unquote roll a dice. All they need to do is run a little script to generate a random number. But an LLM is not capable of generating a random number on its own. If you ask it for a random number, the distribution is going to be heavily weighted, because it's predicting a likely next token, not a random next token, right? So, we can introduce a tool. Let's say we have a Python function called roll dice, and you tell the function how many sides the dice has. So as the game goes on, the agent system would very frequently say, like, oh, I as the GM am simulating a monster attacking the player.
Philip Kiely 00:13:37 And whenever an attack happens, I need to roll a 20-sided dice to see whether or not it hits. So, I would call this function roll dice, and I need to pass it the argument 20. And that'll be the actual response from the model. The model won't output the monster hit the player or the monster missed. The model will output, hey, please roll a 20-sided dice. The agent framework is going to pick that up, do that action, give it back to the model. And then you have a new prompt, you have a new generation cycle that's going to say, okay, it was a 19, the monster hit the player, and then it's going to take that information and speak it out loud. You can see how we're very quickly moving to a very complex system where all of these models have to work together in sequence very quickly, and they are passing information back and forth in a sort of contextual memory layer, rather than relying on a single prompt to run this entire system end to end.
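The dice round trip Kiely walks through, as a sketch. The JSON request format and the `roll_dice` name are illustrative; the point is that the model emits a request, the framework executes it, and the numeric result goes back into the next prompt:

```python
import json
import random

def roll_dice(sides: int) -> int:
    # The deterministic capability the LLM lacks: an actually random number.
    return random.randint(1, sides)

# The model doesn't narrate the outcome directly; it emits a tool request:
model_output = '{"tool": "roll_dice", "args": {"sides": 20}}'

call = json.loads(model_output)
result = roll_dice(**call["args"])  # the agent framework executes the call

# The result is injected into the next generation cycle for narration:
next_prompt = f"The d20 came up {result}. Narrate whether the monster's attack hits."
```

Each turn of the game is therefore at least two model calls with a tool execution in between, which is where the sequencing and latency concerns discussed later come from.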
Jeff Doolittle 00:14:35 Yeah. Composing together different tools and different agents that are specifically capable in particular ways, rather than just throwing it at ChatGPT and expecting it to, as you said, generate random numbers for example, which it's not going to be good at, right?
Philip Kiely 00:14:52 Exactly. It’s always very surprising what capabilities AI models have and don’t have. Sometimes I ask a model to one shot an entire page of code that I’ve been thinking about, and it just gets it done perfectly on the first try. But there’s other very simple actions that models can’t do unless you give it the tools to do so.
Jeff Doolittle 00:15:12 Absolutely. I've attempted "generate a system" just to see what would happen, and you wouldn't be surprised that it's just a big mess by giving it that much all at once. And then of course there's the opposite end of the spectrum: how many B's are in blueberry? But you know, when you consider tokenization and things of that nature, people get bent out of shape: oh, the AI can't tell you that. Well, it's a tool, it may not be able to do that, but it's really good at a bunch of other things.
Philip Kiely 00:15:33 Exactly. And you can always use tool calls to address the weakness of a foundation model. If you give it a Python script that says, hey, do this substring matching problem, that's a very simple Python script to implement. It can write that script for you and then run that script for you and then tell you how many B's are in blueberry. You just need to give it access to the tools and the runtime environment to do that.
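The kind of tiny script a model can write and then execute through a code tool to sidestep its tokenization blind spot; the function name is just for illustration:

```python
def count_letter(word: str, letter: str) -> int:
    # Plain substring counting: trivial for code, unreliable for an LLM
    # working over tokens rather than individual characters.
    return word.lower().count(letter.lower())

print(count_letter("blueberry", "b"))   # prints: 2
print(count_letter("strawberry", "r"))  # prints: 3
```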
Jeff Doolittle 00:16:00 Yeah, interesting. So, there’s a few directions we can go, but I want to start with, let’s say we’ve gotten this Dungeons and Dragons AI agent to a certain point, but everything we’ve done so far, we’re using off the shelf models. When would you start to make that transition and what would you do?
Philip Kiely 00:16:18 Yeah, so there's a few places where I see AI teams start to run into some of the limitations of off-the-shelf models. A big one is domain-specific quality. So, when we think about our system, we are trying to, for example, have a fun role-playing voice where it's going to be adventurous and exciting and engaging in the same way that, say, a book or movie dialogue would be. Your average off-the-shelf model does not have very much personality like that. You can get a long way with prompting, but you can always tell when some text is written by a certain model. So, this is even more important, of course, in real-world systems. I work with a lot of companies in the healthcare space where they're investing a ton of money in training and fine-tuning models that do better, and do well enough that they can actually be used in the real world for both front- and back-office healthcare things.
Philip Kiely 00:17:20 It's important in finance, it's important in education. There are all these places where you want to train, or fine-tune, a model so that it performs better in your domain. That's the first reason people start looking: domain-specific quality. The second reason people start looking is latency. We're now talking about four or five models running, and sometimes running multiple calls each, in between every single user action. I ask the system, you know, hey, my guy swung his sword, do you hit the Hydra? And if it's waiting four or five seconds to tell me the answer to that, it's not a very good player experience. Latency is important in literally every single part of AI. It's especially important in voice AI, when people are speaking to a system in real time. And then the third thing is economics. Like if I'm making this app and it is blowing up and going viral, and I am paying linearly more in token costs for every single user who signs up and starts using it, it's going to get really expensive really quick. Of course, there's a ton of money in the AI space. I can always raise a round and just continue to gain market share, but at some point I'm going to have to start thinking about the unit economics of my product and making sure that it makes sense and that I'm operating a sustainable business in the long term.
Jeff Doolittle 00:18:45 What’s interesting is those three factors you just mentioned were true before everybody started adopting AI, it sounds like it’s just software engineering.
Philip Kiely 00:18:55 That's true. That's maybe, if you take nothing else away from this episode, is that AI engineering is just software engineering with some new tricks to it. I think a perfect example of this is we see so many times where we have a model that we've optimized, we've shaved every single millisecond off of the pre-fill and the decode and all these inference steps. We've put it up on some auto-scaling GPU clusters, co-located it next to the customer. You know, we've gotten everything perfect down to the millisecond, and we're like, hey, the benchmark for this is 200 milliseconds, and they come back to us and say, dude, I'm seeing 250, 300 in production, like what's going on? It's the client code, it's session reuse, or it's some pass-through, some redirect proxy that's adding a few dozen milliseconds here or there. There's so many times when the problems that we run into in production around these AI systems are less related to the model itself or the inference service or any of the cool stuff that I like to talk about, and are just pure, ordinary software engineering that still needs to be done to an incredibly high degree of quality and craft in this new industry.
Jeff Doolittle 00:20:11 It makes sense. And you talked about multiple steps, for example. Well, if I have an agent that's orchestrating a five-step process and those steps can't be parallelized, well, any one step can slow the entire thing. It doesn't matter if four out of five steps are 20 milliseconds if one of the steps is 2000, for example.
Philip Kiely 00:20:28 Exactly.
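Doolittle's arithmetic, spelled out; the step names and timings are made up for illustration:

```python
# Latency of a sequential agent pipeline is the sum of its steps,
# so the slowest step sets the floor on end-to-end response time.
steps_ms = {"transcribe": 20, "retrieve": 20, "plan": 20, "tool_call": 20, "narrate": 2000}

total = sum(steps_ms.values())
bottleneck = max(steps_ms, key=steps_ms.get)
print(total, bottleneck)  # prints: 2080 narrate
```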
Jeff Doolittle 00:20:29 Yeah. So, we're right back to that. With these systems, let's switch gears a little bit. There's obviously a lot of information passing back and forth now between the human and these systems. If we're talking about Dungeons and Dragons and a game, maybe the risk isn't super high of nefarious things happening with our data, but the more we hand off capabilities to an agent, it seems like trust becomes more and more important. Speak some to how do we handle issues around sensitive data or around agents that maybe have power that is a little dangerous to be handing them? You know, I think of the agent I saw the other day that did an rm -rf, and fortunately it had the right path afterwards, but I sort of had that moment of, oh my gosh, what if it was just slash? I would've been in big trouble.
Philip Kiely 00:21:17 Absolutely. There's a few different things here. I think number one is just making sure that you understand the tools you're giving the agent access to. You can choose what tools it has; if it doesn't have access to a given tool, then it can't take that action. The second thing is human in the loop for more sensitive actions. Let's say I was making a stock trading agent that was going to manage my portfolio on my behalf. Maybe I want to have it text me every time it's going to make a trade for more than a thousand dollars, so that I can just give it a quick thumbs up and make sure it's not doing something dumb. The final thing is, you know, more on the evals side, like making sure that we're vigorously testing our models, making sure that we're evaluating our models on the actual domain that we're going to be deploying them in, rather than just looking at the launch blog post and being like, wow, this has a higher version number than the last model, so it must be better, and just cowboying it straight into production.
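The human-in-the-loop gate for the hypothetical trading agent might look like this. The threshold and the approval stub are illustrative; a real system would send the text message and block until the user replies:

```python
APPROVAL_THRESHOLD = 1_000  # dollars; trades above this need a human thumbs-up

def request_human_approval(trade: dict) -> bool:
    # Stand-in for texting the user and waiting for their reply.
    print(f"Waiting for approval: {trade}")
    return True

def execute_trade(trade: dict) -> str:
    # Small trades run straight through; large ones pause for a human.
    if trade["amount"] > APPROVAL_THRESHOLD and not request_human_approval(trade):
        return "rejected by human"
    return "executed"

print(execute_trade({"ticker": "ACME", "amount": 250}))    # no approval needed
print(execute_trade({"ticker": "ACME", "amount": 5_000}))  # asks the human first
```

This is the "yellow light" pattern Kiely mentions next: the agent keeps its autonomy for routine actions while sensitive ones get a checkpoint.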
Philip Kiely 00:22:21 So it's a multi-step process. It's having a clear understanding of what good looks like, having some safeguards in place around what systems you're giving it access to at all, and then having that sort of yellow light system of the human in the loop to prevent any kind of nefarious actions. You know, in general, I find myself pretty trusting of AI agents; the folks working on alignment are doing a pretty good job these days. I think of it more around the idea of preventing mistakes rather than preventing nefarious behavior, because nefarious behavior, that's still coming from humans. You know, we have all the same problems around, you know, with Baseten, we've got a lot of GPUs, people sign up and try and, like, crypto mine and stuff. There's all of the same human nefarious problems that you have to solve in this agent world, and then you have to add in the safeguards against mistakes from the AI system as well.
Jeff Doolittle 00:23:19 So two terms you mentioned, I want to make sure we define for listeners, evals and alignment. Can you describe what those terms mean in case they’re unfamiliar to listeners?
Philip Kiely 00:23:28 Absolutely. So, an eval is a quality benchmark. Generally, it's used in the industry to mean one that you've created yourself or one that you've created with your own product in mind. An example of an eval for a D&D system would be, let's say you have a model that is deciding what level of monster to give your characters, and you want to make sure it's evenly matched. You might make a set of good party setups and a set of good monsters and make sure that the model is able to match them appropriately. That's an example of a domain-specific eval. As for alignment, that's more of a philosophical question in the AI industry of making sure that our models are prompted and are constrained and fine-tuned to be helpful, to be harmless, to be useful. So that's making sure that the models aren't swearing, aren't creating negative images, all that kind of stuff. And for the related problems, you can have evals for alignment. But yeah, generally the models that get released to the public these days are quite well aligned. They are very vigorously tested, and we've seen them be, you know, very, very useful in general. So that's why I said, you know, alignment is pretty good these days.
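The monster-matching eval Kiely sketches could be wired up like this; the cases and the stubbed model call are invented for illustration, and in practice the stub would be replaced by a real LLM call scored the same way:

```python
# A toy domain-specific eval: given a party level, the model must pick an
# evenly matched monster. Each case pairs an input with the expected answer.
cases = [
    {"party_level": 1, "expected": "goblin"},
    {"party_level": 5, "expected": "ogre"},
    {"party_level": 10, "expected": "hydra"},
]

def model_pick_monster(party_level: int) -> str:
    # Stub standing in for an actual model call.
    return {1: "goblin", 5: "ogre", 10: "hydra"}[party_level]

# Score: fraction of cases where the model's pick matches the expectation.
score = sum(model_pick_monster(c["party_level"]) == c["expected"] for c in cases) / len(cases)
print(f"eval accuracy: {score:.0%}")  # prints: eval accuracy: 100%
```

Run against a real model, a score like this becomes the regression gate before a new model or prompt ships, instead of trusting the launch blog post.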
Jeff Doolittle 00:24:58 Yeah. To what extent is alignment being done by retraining the internal models versus, you know, maybe additional pipeline checks, you know, before you actually hit the model?
Philip Kiely 00:25:10 Yeah, so again, all of these sort of processes are going to be a multi-stage thing. You have, first off, the curation of the training data. It's hard for a model, not impossible, but it's hard for a model to generalize outside of the training data. If you make sure that your data is curated to be the sort of things that you want the model to reproduce, that's great. The second thing is reinforcement learning fine-tuning, which is basically where you are giving the model a reward function and saying, hey, I'm going to reward positive behavior and I'm going to ding negative behavior. That's a very powerful alignment tool. Then what happens is after you train a model, you create a base model, which is basically a pool of intelligence, but it's not instruct-tuned, so it's not necessarily ready to have a conversation yet.
Philip Kiely 00:26:03 It's more just able to complete sentences. You take that base model, and through that instruction tuning process you do additional fine-tuning to make sure that it's handling instructions correctly. For example, you know, it's refusing to help people understand how to make chemical weapons. And then finally, at the deployment level, it is the responsibility of the developer to make sure that the model's being used appropriately. There's a lot of new security issues and new avenues of research here. I think Simon Willison's written a lot about prompt injection, which is the idea of an AI system where you're able to take a model that's designed to do one specific thing, like Dungeons and Dragons, and inject a prompt that makes it forget its original context and say, hey, instead I want you to help me, like, understand how to build weapons, right? There's a lot of stages and a lot of interesting research there. But for most practical purposes, the models that are in production today are quite reliable. They are quite well aligned, and we're seeing them get a lot of success across all kinds of domains. There's always going to be funny issues. I saw a year or so ago, some guy tricked the Chevy chatbot into selling him a car for a dollar. There's always going to be fun prompt injection things like that, that point out some of the holes in these systems. But those get patched over pretty quickly.
Jeff Doolittle 00:27:32 Yeah, fun. Unless you’re Chevy in that case, in which case, not so fun.
Philip Kiely 00:27:35 But I don’t think he actually got the car.
Jeff Doolittle 00:27:37 It’s probably true, it’s
Philip Kiely 00:27:39 Just a Chatbot.
Jeff Doolittle 00:27:40 Yeah. Although I have seen the cases where customer data has been leaked from doing weird things, like using the Apache language that almost nobody understands, but the LLM did, and somehow, yeah, you know. Anyway, so there's interesting stuff there. So, we talked before about maybe you get to a point where it's expensive or there's latency issues, and so maybe you need to host your own model, but at that point you're going to take on all of the burden of what we just described. Are you not?
Philip Kiely 00:28:05 Yeah. So, there's the sort of AI engineering layer of the evals, of the system building, of the coordination of the models, the prompting, the context, all that kind of stuff. And you're going to have to do that work whether you're using open source or closed source or custom models that you trained yourself. But when you do go to open-source models, or you do go to those custom models because you're looking for that domain-specific quality, you are looking for those unit economics, you are looking for that better latency, you add a new type of engineering challenge, and that's inference engineering. So basically, you have this entire domain of both model performance and runtime optimization, and multi-cloud GPU infrastructure orchestration, that you've added into your engineering team's workload in order to make all that stuff happen.
Jeff Doolittle 00:28:55 Great. Now again, not assuming everybody is all caught up in all this AI stuff, you used another word that I think it's useful to define. What is inference?
Philip Kiely 00:29:04 Inference is everything. Haven’t you seen the billboard? No.
Jeff Doolittle 00:29:10 So all over the street.
Philip Kiely 00:29:11 Yes. For those listeners not in San Francisco, we've spent a lot on buses and billboards recently talking about inference. What is inference? There's two phases of a model's life. There's the training phase, which is kind of what we were talking about, where you're feeding data to the model, helping it get better. And then there's the inference phase. With inference, the model weights, or the actual underlying values and architecture of the model, those are frozen, they're done, the model's not going to change. What you need to do now is actually use the model to generate responses to user queries, and that's inference. Two phases: training is creating the model; inference is actually using the model in production. One misconception I'll address really quick about machine learning systems, and I've heard this from the highest levels of big enterprises, even at conferences and stuff, is the idea that somehow a model learns while it's live in production.
Philip Kiely 00:30:06 That's, I think, a very easy misconception to have because it's literally called machine learning. Isn't it learning more? No. Creating a system that actually adapts to what's happening in production takes a ton of engineering work to gather that data and continuously retrain and redeploy these models. There are companies that have been able to build that kind of sophisticated system. Obviously, all the frontier labs, OpenAI, Claude, all those are able to use those free user chats and continue to make their models faster. Sorry, not faster, but better. But yeah, the thing to understand about inference in its purest form is that the model is fixed, it's done, it's fully baked, and now you're just using it over and over in production to actually generate business value. And so users…
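The distinction Kiely draws here can be sketched in a few lines of Python. This is a toy illustration, not a real LLM: a two-parameter linear model stands in for the network, but the lifecycle is the same — the weights change during training and are frozen at inference time.

```python
# Toy illustration of the two phases of a model's life.

def train(data, epochs=2000, lr=0.01):
    """Training phase: the weights are updated on every step."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in data:
            err = (w * x + b) - y
            w -= lr * err * x   # weights change here...
            b -= lr * err
    return w, b                 # ...and are then frozen

def infer(weights, x):
    """Inference phase: the frozen weights are only read, never updated."""
    w, b = weights
    return w * x + b

weights = train([(1, 2), (2, 4), (3, 6)])  # learns roughly y = 2x
print(infer(weights, 10))                  # close to 20.0; the model only generates now
```

No matter how many times `infer` is called in production, `weights` never moves — which is exactly the misconception the discussion addresses.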
Jeff Doolittle 00:30:54 Right. Now, let's take that a little further. Let's say I want to build my own model, and then, you know, I do the training and now it's locked, right? Now we're doing inference, the model's not changing, but I want to use multi-agent AI to build the next model. So maybe I have a judging AI that knows how to do my evals right and make sure there's alignment according to some of the quality characteristics that I care about. How would I create a pipeline that would now allow me to say, hey, start maybe auto-training the next model and seeing how it performs according to this automated judge? Can you carry that analogy forward? Are people doing that, and how might people do that for themselves if they want to take this, you know, that far?
Philip Kiely 00:31:39 Yeah. That's one of the most sophisticated setups that a company might have. Generally, you see these setups at companies where creating new foundation models is core to their business, right? Like, they are research labs. And in those cases, absolutely, what they're doing is they're taking each iteration of the model, deploying it into production, seeing how it works, gathering user feedback, and using that to create fine-tuning data sets, using that to create training data sets, and putting those into the next version of the model. They're also doing a lot of other stuff. They're doing architecture research, they're trying to expand things to larger scale, trying to get more and higher-quality training data from other sources. It's one small part of a larger system of making better and better models. But the way you set up that system is, like you said, you have the model one in production, you have a data-gathering mechanism, you have that evaluation mechanism. You're generally not going to be retraining your model overnight. That's more of a setup that was common in the ML world, where you were sort of continuously retraining these very small, couple-hundred-million-parameter type models and redeploying them for increased predictive accuracy. The iteration cycles are slower in the AI world, where you might be releasing new models on a monthly or multi-month cadence, which is what we generally see from the best labs.
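The deploy-gather-evaluate-retrain loop described here can be sketched as an eval-gated release function. The training and evaluation functions below are toy stand-ins for what are, in reality, very expensive model-training runs and eval suites; the names, log format, and threshold are all invented for illustration.

```python
# Hypothetical sketch of an eval-gated model release loop.

def release_next_version(current, feedback_log, train, evaluate, threshold=0.9):
    # Turn production feedback into a fine-tuning data set.
    dataset = [(fb["prompt"], fb["preferred"]) for fb in feedback_log]
    candidate = train(current, dataset)
    # The "judging" step: only ship the new version if it passes evals.
    return candidate if evaluate(candidate) >= threshold else current

# Toy stand-ins for the real training and eval steps.
train = lambda model, data: f"{model}+tuned({len(data)} pairs)"
evaluate = lambda model: 0.95 if "tuned" in model else 0.5

log = [{"prompt": "q1", "preferred": "a1"}, {"prompt": "q2", "preferred": "a2"}]
print(release_next_version("model-v1", log, train, evaluate))
```

The key property is the gate: a candidate that fails evals is discarded and the current model stays in production — which matches the "a lot of work has to be thrown away" point that follows.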
Jeff Doolittle 00:33:06 Right. But in the meantime, I imagine there are multiple potential models that don't fulfill whatever the criteria might be. They don't meet the evals, they're not aligned properly, they're too easy to jailbreak, whatever it might be, and then they select from the ones that fulfill the requirements, and that's the one that gets released.
Philip Kiely 00:33:23 Absolutely. There is unfortunately a lot of work that has to be thrown away in a non-deterministic system, but ultimately you are able to get to the best and not just get stuck at like some local maximum.
Jeff Doolittle 00:33:36 But as soon as you create your own models, you're going to have to carry that burden, whether you automate it or not.
Philip Kiely 00:33:42 Yeah, absolutely. So that's why a lot of companies choose to, instead of creating models from scratch, do lightweight fine-tunes on top of existing models. I've talked a few times about this idea of a foundation model. That's going to be one of the big models from a major lab; in the open-source world, this might be something like Llama, Mistral, Qwen, Kimi. You're taking one of these foundation models and then you are applying a much smaller data set. These models are trained on trillions of tokens — you know, every book ever written, everything ever said on the internet, that kind of scale of data. Fine-tuning data can be much smaller in scale. You might have a few tens or hundreds of thousands of question-and-answer pairs that are relevant to the behavior that you want to generate. And you are able to create a lightweight modification to these models. You know, AI inference is just matrix math. It's just a bunch of matrix multiplications. Your fine-tune is just one more matrix in that sequence that everything's getting filtered through to slightly adjust the behavior of the model. And creating one of those, while still a lot of engineering work, and still something that should only be invested in after a lot of other things in the system are figured out, is hundreds of thousands of times faster, cheaper, and easier than training entire models from scratch.
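The "one more matrix" idea can be made concrete with plain Python lists. This is a hand-rolled illustration of a LoRA-style low-rank adapter, not any particular library's API; all the matrices and numbers are invented.

```python
# Sketch of a lightweight fine-tune as "one more matrix" added to frozen weights.

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

# Frozen base weights from the foundation model (trillions of these in practice).
W = [[1.0, 0.0], [0.0, 1.0]]

# The fine-tune trains two small low-rank factors B·A instead of all of W.
B_factor = [[0.1], [0.2]]           # d x r, here rank r = 1
A_factor = [[0.5, 0.5]]             # r x d

delta = matmul(B_factor, A_factor)  # the lightweight "extra matrix"
W_tuned = add(W, delta)             # behavior shifts slightly; W itself never changed

x = [[1.0, 1.0]]
print(matmul(x, W_tuned))
```

Because only the small factors are trained, the number of new parameters is a tiny fraction of the base model's — which is why a fine-tune is so much cheaper than pretraining from scratch.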
Jeff Doolittle 00:35:06 Yeah, that makes sense. So let's say our Dungeons and Dragons game is really taking off now, and we're bumping up against some of those factors you mentioned before, whether it's the economic factors or the latency factors, that might cause you to say we need to transition to what you've described before as a dedicated deployment. So can you unpack for our listeners what exactly those next steps might be, using that example? Maybe you still use a foundation model for some aspects of it, but there's some piece where you say, we're going to move this now to a dedicated deployment. What does that mean and what does that entail?
Philip Kiely 00:35:45 Absolutely. To start with, we're just doing a little bit more software engineering. We've always had multi-tenant services and single-tenant services. With an AI model API off the shelf, what I'm getting from someone like OpenAI is one endpoint; everyone else has the same endpoint, we're all hitting it, and our traffic gets mixed together. But when I want to start doing my own thing, having really stable P99 latencies and buying my tokens in bulk, like going to Costco rather than one at a time at the gas station, I'm going to need to set up my own GPUs to run these models. And this is why we've switched to open models here, of course, because I can't just take the GPT-5 weights and run it. They won't let me. It's their secret sauce. But if I take one of these open-source models like Qwen or something and I put it up on a GPU, now it's only my users who are using it.
Philip Kiely 00:36:41 So if my friend's startup gets a spike of traffic, their traffic isn't affecting mine. It's the same noisy neighbor problem we've had in past APIs. It's just magnified by the fact that now APIs are backed by these huge, expensive GPUs doing these very intensive calculations. So yeah, I'm going to have to secure these GPUs — I have to actually go get them, go lease them or rent them — and then I'm going to need to install a runtime on these GPUs so that I can actually serve the model. And then I'm going to need to scale up and down, replicating this system over and over again in order to meet the demand, where maybe more people are playing on Friday and Saturday night, so I need more GPUs then, and fewer GPUs at 10 o'clock on a Tuesday when everyone should be at work.
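Scaling replicas up for Friday night and down for Tuesday morning reduces to a small capacity calculation. A hypothetical sketch — the function name, throughput figures, and bounds are invented for illustration, not any platform's actual autoscaling policy:

```python
# Hypothetical replica autoscaling for a dedicated deployment.
import math

def replicas_needed(requests_per_sec, tokens_per_request, tokens_per_sec_per_gpu,
                    min_replicas=1, max_replicas=32):
    demand = requests_per_sec * tokens_per_request  # total tokens/sec to serve
    n = math.ceil(demand / tokens_per_sec_per_gpu)  # GPUs needed to cover demand
    return max(min_replicas, min(n, max_replicas))  # clamp to configured bounds

# Friday-night spike vs. a quiet Tuesday (numbers invented for illustration).
print(replicas_needed(200, 500, 10_000))  # prints 10: spike needs 10 replicas
print(replicas_needed(5, 500, 10_000))    # prints 1: quiet hours scale to the floor
```

Real autoscalers also account for cold-start time, queue depth, and headroom, but the core idea — replicate the GPU-plus-runtime unit to match demand — is the same.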
Jeff Doolittle 00:37:31 Right. Another term you just used, again, in the interest of clarification, you said weights and you said, you know, GPT-5, we don’t get to use their weights, they won’t let us, that’s their secret sauce. Can you unpack that? What are weights?
Philip Kiely 00:37:45 Absolutely. A large language model, or most generative AI models — like I said, it's all just a bunch of matrix math, right? The weights are a giant matrix. Think about a model having, say, a trillion parameters; I'm sure we've heard model size talked about in terms of parameters. A parameter is just one number. If you think about a simple regression model, where it's y = ax + b, a and b are parameters; that is a two-parameter model. And you can scale up that example to, okay, now I have a trillion parameters. Each one is just a number, and these numbers are in this neural network. They weigh, or they adjust, the probability of a certain token being produced. So that's why they're called the model weights. The parameters are influencing or weighing on the output of the model, and so they're the weights. The weights of a model are what determine whether a model's open source or closed source. Closed-source models, you don't know what the weights are. Open-source models, you do. The weights are on Hugging Face; you can download them, you can use them, you can run them in your own company, you can run them on your own hardware.
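The jump from two parameters to billions can be made concrete: a weight is just a number, and a model's size is the count of numbers in its stack of matrices. The layer shapes below are invented for illustration; real architectures vary.

```python
# The two-parameter example, concretely: a and b are the "weights".
def two_param_model(x, a=2.0, b=1.0):
    return a * x + b       # y = ax + b: a two-parameter model

print(two_param_model(3))  # prints 7.0

# An LLM is the same idea at scale: count the numbers in a stack of matrices.
layer_shapes = [(4096, 4096), (4096, 11008), (11008, 4096)]  # made-up shapes
n_params = sum(rows * cols for rows, cols in layer_shapes)
print(n_params)            # ~107 million parameters in just these three matrices
```

Repeat that across dozens of layers and the count climbs into the billions — each entry still just one number weighing on the output.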
Jeff Doolittle 00:39:01 Great. What other pieces are needed if you want to build and host — let's start with your own dedicated inference platform, and then maybe we want to grow beyond that into a dedicated multi-agent platform, or at least pieces of it?
Philip Kiely 00:39:15 Yeah, so the two pieces are the infrastructure and the runtime. The infrastructure, again, is getting the GPUs and orchestrating them. The runtime is really exciting, because with these model weights that you have, you have various options for things like inference engines. Some popular tools are vLLM, SGLang, TensorRT-LLM. There's also Transformers and Diffusers, which are the sort of underlying technologies, based on PyTorch, that run these generative models. Transformers for most things; Diffusers are going to be for your image models, right? That turn noise into an image. So, what's vLLM? What's SGLang? What do all these things do? Well, they're inference engines. You load the weights into the engine, you put that system in a container, just a Docker container. Again, we're just doing real software engineering, right? Put that on a GPU, and the inference engine is actually responsible for running those matrix multiplications to produce inference on the GPU.
Philip Kiely 00:40:22 Your choice of inference engine, and the way you configure it, has a massive impact on the actual speed and throughput of your eventual production system. Because these are all very complex calculations, they're bottlenecked in various places. For example, your time to first token, the time it takes to generate that first little bit of output, is going to be mostly bound on compute, like how many FLOPS you have on your GPU. And then decode, which is what generates every single one of those subsequent tokens, is going to be bound on your memory bandwidth on the GPU. We have two different bottlenecks right next to each other in these two phases of inference. You have to figure out, you know, systems and algorithms and optimizations to get around those bottlenecks, to be able to serve all this traffic really fast, and to be able to serve it in a way that utilizes the GPU to the maximum capacity so that you're not wasting money on underutilized hardware. That's inference engineering, right? It's both getting the GPU and also using it effectively, and it's a really exciting thing to work on.
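The two bottlenecks described here — compute-bound prefill and memory-bandwidth-bound decode — can be estimated on the back of an envelope. The hardware figures below are rough assumptions in the ballpark of a modern data-center GPU, not vendor specifications, and the "2 FLOPs per parameter per token" rule is a standard approximation.

```python
# Back-of-envelope estimate of the two inference bottlenecks (assumed figures).

params = 70e9          # a 70B-parameter model
bytes_per_param = 2    # fp16/bf16 weights
flops = 1e15           # ~1 PFLOP/s of tensor compute (assumed)
mem_bw = 3.35e12       # ~3.35 TB/s of HBM bandwidth (assumed)

# Prefill (time to first token): compute-bound, ~2 FLOPs per param per token.
prompt_tokens = 2000
prefill_s = (2 * params * prompt_tokens) / flops

# Decode: memory-bound, since every new token re-reads all weights from HBM.
per_token_s = (params * bytes_per_param) / mem_bw

print(f"prefill: {prefill_s:.2f}s for {prompt_tokens} prompt tokens")
print(f"decode ceiling: {1 / per_token_s:.0f} tokens/sec at batch size 1")
```

The decode number is why batching matters so much: the weights are read from memory once per step regardless of batch size, so larger batches amortize that bandwidth cost across many requests.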
Jeff Doolittle 00:41:25 This may seem like a really basic question, but why do we need GPUs? Why aren't CPUs sufficient in this space?
Philip Kiely 00:41:31 So in the ML world — when I joined Baseten four years ago, we were a very inference-focused company, but we still were working a lot with ML models. Like, I ran all kinds of XGBoost models and stuff on CPUs, and they ran fast, but those things were tiny. The models today, first off, they're huge. You know, these billions or trillions of parameters — you need a ton of memory to support them. So that's where the GPUs come in, with their hundreds of gigabytes of high-bandwidth memory. And then, more importantly, they have the tensor cores. As we've talked about, model inference is a matrix-math, linear-algebra type of system. And so, these GPUs are optimized to do all of these things in parallel. GPUs are throughput machines; they're great at parallelizing this kind of math. You actually can run models on CPUs, it's just really slow. If you have a local CPU and some local RAM — I mean, if you have a Raspberry Pi, you can run a really tiny model. It's just going to be slow. And for our production systems, we're trying to make things real time.
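The memory argument alone makes the case. A quick sizing check, assuming 2 bytes per parameter (fp16/bf16) — the model sizes are illustrative:

```python
# Why CPUs fall short: weight memory alone, before activations or KV cache.
def model_memory_gb(params_billions, bytes_per_param=2):
    return params_billions * 1e9 * bytes_per_param / 1e9

print(model_memory_gb(0.1))   # an old-school ~100M-param ML model: 0.2 GB, CPU-friendly
print(model_memory_gb(70))    # prints 140.0: a 70B LLM needs 140 GB of weights alone
print(model_memory_gb(1000))  # prints 2000.0: a trillion params spans many GPUs
```

A 0.2 GB XGBoost-era model fits comfortably in CPU RAM; 140 GB of weights that must be re-read for every generated token is what demands GPUs with high-bandwidth memory.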
Jeff Doolittle 00:42:41 Yeah. You talk about a matrix, and you know, for those who took what we call calculus BC or C in the United States, you know what matrices are, but those are typically two by two. How big are these matrices that we're talking about in this space?
Philip Kiely 00:42:55 Billions.
Jeff Doolittle 00:42:57 Well, but just in the dimensionality, right?
Philip Kiely 00:42:59 Yeah, yeah. Dimensionality, you know, hundreds of thousands of dimensions in each direction.
Jeff Doolittle 00:43:05 So we can’t even conceive of four dimensions in our head.
Philip Kiely 00:43:08 Yeah. So the thing is, there's a lot of math here, don't get me wrong. And I certainly have learned a lot, but have a ton more to learn, about the underlying mechanics of all of this at the very lowest level. But I got a B in linear algebra, I got a B in statistics in college. Like, I am good at math, but not great at it. You don't have to be some Einstein-level mathematician to grasp the mechanics of inference and to understand how to make it faster. It's just like the software engineering we talked about: introducing constraints to the system, and iterative experimentation and evaluation. The math's important, and it's really fun and interesting to learn, but as someone working in the system every day — I'm living proof, right here on a podcast talking about it — you don't have to be a PhD in math to understand how some of this stuff works.
Jeff Doolittle 00:44:05 And as you said before, traditional software engineering challenges still apply in this context. Correct?
Philip Kiely 00:44:11 Absolutely. Like most of AI engineering is just new versions of software engineering with some fun non-deterministic behavior sprinkled in.
Jeff Doolittle 00:44:20 Are there any specific aspects of software engineering that really should be emphasized if you are working on self-hosting or building multi-agent AI systems?
Philip Kiely 00:44:31 I think a big one is the idea of configuration. When we think about a model inference engine, there are hundreds of different parameters that you can adjust. You can adjust the quantization, you can adjust the speculation algorithm, you can adjust the batch sizes, the sequence lengths. There are all of these different…
Jeff Doolittle 00:44:56 Temperature.
Philip Kiely 00:44:56 Yes. The temperature. Exactly. There are all of these different configurations that you can mess with. What's the best one? Well, it depends. One thing that I've seen happen over the last couple years, and a trend that is continuing, is that a lot of these configurations used to be set at deploy time, and now they're set dynamically at runtime. As we think about software infrastructure, a big trend across the industry for a long time has been: how do we take something that used to be static and make it dynamic, make it adjust to the real-world conditions it's operating under? The exact same thing is happening in inference engineering, where we're going from, okay, we're going to run one benchmark for performance, do as well as we can on that benchmark, and then just roll out this system, to having a system that automatically adjusts to the nature of the traffic and to the performance it's seeing, in order to continually improve.
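A minimal sketch of a deploy-time setting becoming a runtime one: choosing a batch size from live queue depth instead of hard-coding it at deploy. The function and its numbers are hypothetical, not any engine's actual policy.

```python
# Hypothetical runtime-dynamic configuration: batch size chosen from live traffic.

def dynamic_batch_size(queue_depth, max_batch=64):
    """Larger batches under heavy traffic (throughput), small when light (latency)."""
    return max(1, min(queue_depth, max_batch))

print(dynamic_batch_size(3))    # prints 3: light traffic, keep latency low
print(dynamic_batch_size(500))  # prints 64: heavy traffic, cap for throughput
```

The same pattern applies to the other knobs mentioned — speculation settings, sequence-length limits — each moving from a fixed deploy-time value to a function of observed conditions.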
Jeff Doolittle 00:45:58 Well, that speaks to the importance of observability. How are companies doing observability in this space to ensure that they're getting good results out of these systems?
Philip Kiely 00:46:08 We use all the standard observability tools. We have our Grafana dashboards and our Sentry alerts and stuff. Again, if these models are going to be core, mission-critical parts of your software stack, they need to be treated like core, mission-critical parts of your software stack. You should get paged when latencies elevate. You should get paged when GPU nodes cycle. A big part of our observability is around performance: just making sure that we're keeping an eye on traffic, keeping an eye on latencies, keeping those steady, keeping an eye on errors, making sure that there aren't too many requests timing out, that kind of stuff. And there's also the AI observability layer, in terms of the model outputs, making sure that those are continuously of high quality. I have a lot less experience there, being that I work more on the infrastructure side than on the AI engineering side, but from my understanding, it's basically keeping an eye on those eval numbers, making sure that they're trending in the right direction. And also just listening to your users — having open lines of communication, making sure that users are continuously happy with the service that they're getting from the models.
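The "page when latencies elevate" idea can be sketched as a p99 check over a window of request latencies. The thresholds and sample data below are invented; real systems compute this in their metrics backend rather than in application code.

```python
# Sketch of a p99 latency alert over a sliding window of requests.
import math

def p99(latencies_ms):
    ordered = sorted(latencies_ms)
    idx = max(0, math.ceil(0.99 * len(ordered)) - 1)  # nearest-rank percentile
    return ordered[idx]

def should_page(latencies_ms, threshold_ms=500):
    return p99(latencies_ms) > threshold_ms

window = [120] * 98 + [480, 900]  # two slow requests in a 100-request window
print(p99(window))                          # prints 480
print(should_page(window))                  # prints False: under the 500ms threshold
print(should_page(window, threshold_ms=400))  # prints True: tighter SLO would page
```

This is also why the earlier discussion emphasized stable P99s for dedicated deployments: the tail, not the average, is what pages you.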
Jeff Doolittle 00:47:18 What other responsibilities do you have to take on when you either host your own models or you’re building multi-agent systems?
Philip Kiely 00:47:25 So GPUs are great technology. Like, I would not have this career in inference engineering without the existence of GPUs. I'm very thankful for them. But they're also kind of a pain to work with sometimes. If anyone's ever run a long-running task on GPUs in production, in a data center, they know that GPUs do not stay up 99.99% of the time. Nothing close to that; you're lucky to have a couple of nines in there. But if you are building, say, an assistant for doctors that every single doctor in America is going to use to make sure that they're delivering the best possible care, which is something that one of our customers has built, then you need to have that extremely high reliability. Because if you are interfering with a medical process, that can literally be life or death.
Philip Kiely 00:48:18 And so the big responsibility that you take on, on the inference engineering and infrastructure side, when you switch to models that you're running yourself, is to go from these very flaky and unreliable hardware systems to a software layer that orchestrates all of this: has active reliability, spans multiple clouds just in case one of your cloud providers goes down, spans multiple regions so you're close to every user, can fail over from cluster to cluster, can cycle nodes automatically when they have issues. This software orchestration layer becomes your responsibility as well, and it is very, very critical, because reliability is one of the differentiators today in the AI industry. Everyone complains when ChatGPT goes down; everyone's always asking on Twitter, like, hey, does this model seem dumb today? Or, you know, what's going on under the hood in this system? When you take that in-house, you do gain control over all of that. But again, you also gain all of that responsibility, and depending on what you're working on, that can be a very, very heavy responsibility.
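Cluster-to-cluster failover can be sketched as trying a prioritized list of clusters and falling through on failure. The cluster names and the simulated outage below are entirely invented for illustration.

```python
# Hypothetical sketch of multi-cloud, multi-region failover for inference requests.

def route_request(prompt, clusters, call):
    for cluster in clusters:
        try:
            return cluster, call(cluster, prompt)
        except ConnectionError:
            continue  # cluster or node down: fail over to the next one
    raise RuntimeError("all clusters down")

def flaky_call(cluster, prompt):
    if cluster == "aws-us-east":  # simulate a provider outage
        raise ConnectionError
    return f"response from {cluster}"

clusters = ["aws-us-east", "gcp-us-central", "oci-eu-west"]
print(route_request("hello", clusters, flaky_call))
# prints ('gcp-us-central', 'response from gcp-us-central')
```

Production versions layer on health checks, latency-aware ordering, and automatic node cycling, but the core control flow — a prioritized pool spanning providers, with fall-through on failure — is the responsibility being described.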
Jeff Doolittle 00:49:28 Everything you just said about orchestration across multiple regions and failover and high availability, that still just sounds like software engineering.
Philip Kiely 00:49:37 It’s, we
Jeff Doolittle 00:49:38 We can’t just sprinkle AI on it, Philip, and just have it work?
Philip Kiely 00:49:41 There’s a reason they let me on Software Engineering Radio, instead of having to create AI Engineering Radio. It’s a field that I think should be very welcoming to everyone with a background in any kind of software engineering work because it’s all needed again. We got to do the same stuff that we’ve done for all of our previous systems. We have to adapt it. Of course, the systems are bigger, they’re faster, they’re less deterministic. But we have to do a lot of the same engineering work as an industry to make these things grow and scale that we did in every single one of the last generations of technology.
Jeff Doolittle 00:50:16 Yeah. Let’s say someone wants to start transitioning. They’ve been adding chatbots on top of their existing systems, and now they’re listening to this show, or they’ve been reading other things, and they’re saying, I want to get into this agentic world and I want to start looking even at multi-agent AI. What should they do to get started? Or what approaches might they take?
Philip Kiely 00:50:39 I would really recommend an approach of iterative experimentation. I think the easiest way for this to go wrong is to decide that you want to build the single greatest AI system to ever exist, and you're going to do it today. Obviously, you should have a lot of ambition working in this space. This space is growing fast; people are building unbelievable things. But it's important to stay grounded in the day-to-day work of: how can I make something a little bit better today? That might be introducing a new modality, saying, okay, I'm going to add voice to my chat. It might be introducing new context, saying, I'm going to give my agent access to a new type of data that it hasn't had before. And as we've been talking about, it could be something entirely unrelated to AI. You could be making better tools for agents to use. You could be cleaning up your docs so that it's easier to find what information to pull. You could be optimizing the latency of your networking stack so that you're not having users hang for 50 milliseconds every time they want to make a call. There's so much engineering work to do around making AI systems better and more reliable, both within and outside of AI engineering, that there's someplace for everyone to contribute.
Jeff Doolittle 00:51:54 So Philip, what’s been your journey? How did you get into this space?
Philip Kiely 00:51:58 Like I said, everyone has a role to play in the AI engineering industry, and mine was joining pretty early, actually. I joined this company called Baseten. Back then, it was a pre-revenue team, and we were working on ML and AI inference. I wish that I could tell you that I had some great market insight, where I knew this was going to be huge, you know, a year before ChatGPT came out. I did not. I was looking for a job, I wanted a stable paycheck so I could pay rent, and my only thesis, if I had one, was like, hey, these people are really kind and really smart, and I'd like to work with them on whatever it is they're doing. So I joined this little company, and I was very fortunate to have the opportunity to learn on the job over the last four years, to transition into more of this developer relations and inference engineering space, where I'm able to work directly with some of the fastest-growing AI companies in the world, where I'm able to go speak at NVIDIA GTC, AWS re:Invent, the AI Engineer World's Fair, and some of these great conferences, and talk about the things that I'm learning from our engineers about how to make these systems faster and more reliable.
Philip Kiely 00:53:08 So yeah, it’s been a great journey growing with the industry, from getting in even ahead of ChatGPT blowing up and having the opportunity to learn along the way.
Jeff Doolittle 00:53:19 That’s great. And tell us a little bit about what Baseten does and how that relates to multi-agent AI.
Philip Kiely 00:53:25 Absolutely. So Baseten is an inference platform, and we've come a long way since that tiny company I joined four years ago. Actually, as we're recording today, I am super happy that we're announcing our Series D financing. We've raised at over a $2 billion valuation, which is really exciting, and that's just a validation of this industry that we're in and the importance of inference in this market. What Baseten does: think of any fast-growing AI startup you can think of; they're probably either using or should be using Baseten to run the inference that powers their products. So they're taking their open-source and their custom models, deploying them with us, and we're helping them make them faster — lower latency, more scalability, more throughput, all of that kind of stuff. And it's been a really cool place to work because I've learned about every different modality of model, every different use case. I now know way too much about where you can find GPUs, from Finland to Australia. We've got this cool map where we just kind of show where all the GPUs are.
Jeff Doolittle 00:54:30 Everybody’s hunting for GPUs.
Philip Kiely 00:54:32 Yeah, exactly. And what we've been able to do with this multi-cloud capacity management software that we've built is pool the GPUs from 10+ different cloud providers, have a single unified layer of compute that we're able to draw from, and use that to run mission-critical inference for all these super-fast-growing AI companies, and now also for larger companies, enterprises, which has been just the most interesting learning experience I've been a part of in my entire career.
Jeff Doolittle 00:55:02 Great. Well, if people want to find out more about what you’re up to, where can they go?
Philip Kiely 00:55:06 So I've done a lot of work — I've been working on it for 10 years — I've got SEO on my name. So just Google Philip Kiely and you'll find me. I finally beat out that guy who is a doctor in Vermont who has my same name. But the best place to connect is going to be, number one, Twitter, and number two, LinkedIn, and you can always just hit me up, send me a DM, send me an email. I love talking about AI engineering, I love talking about inference, and I'd be thrilled to hear what you're working on.
Jeff Doolittle 00:55:37 Philip, thanks so much for joining me on the show.
Philip Kiely 00:55:39 Jeff, thanks so much for having me. It’s been a pleasure.
Jeff Doolittle 00:55:41 This is Jeff Doolittle for Software Engineering Radio. Thanks for listening.
[End of Audio]
---
[Original source](https://se-radio.net/2025/12/se-radio-697-philip-kiely-on-multi-model-ai/)