Let's build a tool-using agent: giving hands to the brain in a vat

Posted 2026-03-06 by Eric Rescorla

At this point, if you haven't heard about "agentic AI", you haven't just been living under a rock but under a huge pile of rocks. However, even if you have heard of agentic AI, you may have only some idea of what it actually means. If so, you've come to the right place. In this post we're going to build a simple tool-using AI agent and try to get some sense of what it's actually doing. Here's a typical definition of agentic AI, from IBM:
In other words, agentic AI doesn't just talk to you but can have side effects in the real world. For instance, you might ask it to book travel for you, do some web searching, send emails, etc. The interesting question here is how. The first thing to understand is that a large language model (LLM) is what it says on the tin: a language model, which means that it operates at the level of text. At a high level, an LLM takes in a string of text (the prompt) and then emits some other text (the response). It's common to talk about this as an autocomplete or predictive system where the LLM emits the most likely text to come after the prompt, but for our purposes, it doesn't matter: the important thing is that the LLM just manipulates text. Everything we're going to do in this post is downstream of that fact.

Preliminaries #

AI models can either be local (running on your machine or a machine you control) or hosted (running on infrastructure operated by the model provider). In general, the hosted models are a lot more capable but require a lot of compute power; you can still get fairly far with a local model if you just want to do simple stuff. With a local model, you can provide input to the model directly, but for a hosted model you need to use some interface provided by the model provider. For interactive use, this is often some kind of chat interface, such as ChatGPT, but for programmatic use model providers give you some kind of HTTP API. These APIs are conceptually similar but subtly different, so you need to write your app slightly differently for each platform. Although you can access local models directly, as a practical matter it's convenient to use something like Ollama, which is an engine that allows you to run a large number of models—and actually knows how to automatically download[1] them—and also provides a common HTTP API which you can use just as you would a model provider's API.
We'll be writing our examples using Ollama so that they can work with local models—thus allowing us to do some internal instrumentation—but Ollama can also bridge to the APIs for big model providers, so we can use the same code with Gemini, Claude, etc, allowing us to demonstrate things with better models. In practice, you would usually not talk to the HTTP API directly, but instead download some local library (e.g., ollama-js) which takes care of the HTTP API mechanics. In this case, however, I want to be able to show what's actually happening, so we're going to be writing to the API directly, using the built-in nodejs fetch API. To do this, we're going to have a trivial JS API client, shown below.
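Concretely, the client might look something like this (a sketch, not the post's actual code: the endpoint is Ollama's /api/chat, but the function names here are illustrative):

```javascript
// Minimal Ollama chat client using the built-in nodejs fetch API.
// The request/response shapes follow Ollama's /api/chat endpoint.
const OLLAMA_URL = "http://localhost:11434/api/chat";

// Build the JSON body for a chat request.
function makeRequest(model, messages, tools) {
  const body = { model, messages, stream: false };
  if (tools) body.tools = tools; // only include tools if we have some
  return body;
}

// Send the request and return the assistant's message object,
// i.e., { role: "assistant", content: "...", tool_calls?: [...] }.
async function chat(model, messages, tools) {
  const res = await fetch(OLLAMA_URL, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(makeRequest(model, messages, tools)),
  });
  const json = await res.json();
  return json.message;
}
```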
A Simple Chatbot #

We're going to warm up by building a simple chatbot, which is comparatively trivial[2] given this kind of model. We just need to connect it up to some interface that reads text from the user and sends it to the model, as shown in the diagram below.
We can then write a trivial chatbot like so.
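A sketch of such a handler (the chatFn argument stands in for whatever Ollama API client you're using; the names here are illustrative, not the post's actual code):

```javascript
// A stateless chat handler: each user input is sent to the model on
// its own, with no conversation history. chatFn is assumed to be an
// async function (model, messages) -> assistant message.
function makeHandler(chatFn, model) {
  return async function handleInput(userText) {
    const messages = [{ role: "user", content: userText }];
    const reply = await chatFn(model, messages);
    return reply.content;
  };
}
```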
Note that this code makes use of a chat framework[3] that just loops around input and passes the results to the handler function shown here. This lets us handle stuff like reading from the terminal and/or opening the input all in one place so you can focus on the main code. Anyway, here's an example interaction.
So far so good, but now try to have a conversation. For example:
WTF? I just told you the color of my shirt. What's going on is that the model itself is stateless: it just takes in a string of input and produces output, so as far as the model is concerned, when I asked about my shirt color, this is the first thing I said.
If we want to have a conversation, we actually need to play back the entire conversation with each request to the API. We do this by keeping a running context: a list of every message in the conversation so far, which we append each new message to and resend in full with each request. You'll notice that each entry in the context contains a role field indicating who produced the message (the user or the model), along with the message text itself.
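A sketch of a context-keeping handler (chatFn again stands in for an Ollama API client; the names are illustrative):

```javascript
// A chat handler with memory: we keep the whole conversation in a
// context array and replay it on every request, since the model
// itself is stateless.
function makeChatbot(chatFn, model) {
  const context = []; // every message so far, oldest first
  return async function handleInput(userText) {
    context.push({ role: "user", content: userText });
    const reply = await chatFn(model, context);
    context.push(reply); // remember what the model said, too
    return reply.content;
  };
}
```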
Congratulations, we now have a primitive but functional chatbot.

Tool calling #

What we've built so far is just a brain in a vat: we can feed it text and it responds with other text, but it can't do anything that has side effects. That's all great, but the examples we gave above (booking travel, etc.) require the ability to do things out in the world—or at least on the Internet—so we need to enable that somehow. The way this is done is by giving the LLM something called a "tool". LLM tools are kind of like API calls in traditional programming languages: they are functions that let the LLM do something. Using tools with an LLM is conceptually simple: you tell the model what tools are available (the tool definitions), the model responds with a request to call one of them, and the wrapper code around the model executes the call and feeds the result back.
Tool Definitions #

The tool definition is basically just a JSON expression, like so:
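Here's a plausible definition for the room-temperature tool used later in this post, in the OpenAI-style format that Ollama accepts (the name get_temperature and its parameter are my assumptions, not necessarily the post's actual code):

```javascript
// A tool definition: JSON metadata describing the tool to the model.
// The parameters field is a JSON Schema describing the arguments.
const getTemperatureTool = {
  type: "function",
  function: {
    name: "get_temperature",
    description: "Get the current temperature in a given room",
    parameters: {
      type: "object",
      properties: {
        room: { type: "string", description: "The name of the room" },
      },
      required: ["room"],
    },
  },
};
```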
This should be reasonably self-explanatory, but just in case:
Think about the tool definition as API documentation for the LLM: it tells it about the tool, what it does, and how to call it.

The Model Calls Tool #

In order to call the tool, the model provides a response that carries the information about which tool to call, and with what arguments, as text. For instance:
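Here's roughly what such a response looks like as an Ollama chat message (the shape follows Ollama's /api/chat API; the specific values are illustrative):

```javascript
// A tool-call response from the model: instead of content for the
// user, the message carries a tool_calls array naming the tool and
// its arguments.
const toolCallMessage = {
  role: "assistant",
  content: "",
  tool_calls: [
    {
      function: {
        name: "get_temperature",
        arguments: { room: "living room" },
      },
    },
  ],
};
```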
This says exactly what you think it says, namely "I would like to call the named tool with these arguments."

Tool Execution #

The way that the tool actually gets called is that the wrapper code detects that the model's output is actually a tool call and calls the tool rather than printing the output (or whatever it would ordinarily do with it). In other words, you need to update the agent wrapper code to be something like this:
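A sketch of such a wrapper loop (chatFn stands in for an Ollama API client; toolImpls is an assumed map from tool names to implementation functions; all names here are illustrative):

```javascript
// Agent wrapper loop: keep calling the model, executing any tool
// calls it makes, until it produces a plain text response.
async function runAgent(chatFn, model, context, tools, toolImpls) {
  for (;;) {
    const reply = await chatFn(model, context, tools);
    context.push(reply);
    if (!reply.tool_calls || reply.tool_calls.length === 0) {
      return reply.content; // a plain response: show it to the user
    }
    // A tool call: look up the implementation by name, run it, and
    // feed the result back to the model as a "tool" message.
    for (const call of reply.tool_calls) {
      const impl = toolImpls[call.function.name];
      const result = await impl(call.function.arguments);
      context.push({ role: "tool", content: String(result) });
    }
  }
}
```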
This code is comparatively simple, but let's work through it in pieces. Whenever we send a request to the model API, we can get one of two responses: an ordinary text response intended for the user, or a request to call a tool.
In the first case, we just display the response to the user and then read the user's next input, just as with our original chatbot code. What's new here is the tool call request. In this case, we don't want to display the result to the user, but instead intercept the response and call the appropriate tool. Handily, the tools are named, so we can just look up the appropriate implementation by name and call it. Once we have the response, we can add it to the context and call the completion API again. Here's a simple exchange:
If you look closely, you'll notice something interesting. I never told the model which room I wanted it to get the temperature for; it just hallucinated "living room". This behavior is actually nondeterministic and model dependent (I'm using mistral-small). Some fraction of the time, the model actually refuses to give me an answer and instead asks what room I want:
Providing the response works exactly as you'd expect, with our agent providing the following context:
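A sketch of what that context might look like at this point (message shapes follow the Ollama chat API; the exact wording is illustrative):

```javascript
// The full context after one round of clarification and tool calling:
// the model sees this entire history on every request.
const context = [
  { role: "user", content: "What's the temperature?" },
  { role: "assistant", content: "Which room would you like the temperature for?" },
  { role: "user", content: "The living room." },
  {
    role: "assistant",
    content: "",
    tool_calls: [{ function: { name: "get_temperature", arguments: { room: "living room" } } }],
  },
  { role: "tool", content: "25" },
];
```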
As you can see, the context here includes everything that has happened so far, namely: my original question, the model's request for clarification, my answer, the model's tool call, and the tool's response.
Just as before, the model is stateless, so if we don't remind it that it called a tool, it doesn't have any context for what the answer "25" is. The key point here is that all the tool action happens in the wrapper code. The LLM has no idea how the tool works; it just knows whatever the wrapper told it about what each tool does and then whatever the wrapper says the tool did. And in fact, my implementation of the temperature tool doesn't measure anything; it just makes up an answer. Notice also that the agent wrapper doesn't really know anything about the tools either; it's just importing the list of tools from a separate module.

Multi-Round Tool Execution #

Our wrapper code will keep looping until the model returns some response, so this means we can have multiple rounds of tool execution. So, for instance, we can have a simple thermostat which turns on the heat if we are below some target temperature. To do this, we first need to give the model a new tool:
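Something like the following definition would do (the name set_heater and its parameter are my invention, not necessarily the post's actual code):

```javascript
// A tool definition for controlling the heater, in the same
// OpenAI-style format Ollama accepts for tools.
const setHeaterTool = {
  type: "function",
  function: {
    name: "set_heater",
    description: "Turn the heater on or off",
    parameters: {
      type: "object",
      properties: {
        on: { type: "boolean", description: "true to turn the heater on" },
      },
      required: ["on"],
    },
  },
};
```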
This is implemented as:
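A sketch of the corresponding implementation (the names are illustrative): like the temperature tool, it doesn't have to touch real hardware; here it just records the requested state and reports back.

```javascript
// Implementation of the heater tool: record the requested state and
// return a short description of what happened for the model to read.
let heaterOn = false;
async function setHeater(args) {
  heaterOn = args.on;
  return heaterOn ? "heater is now on" : "heater is now off";
}
```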
Then with the right instructions...
You might notice that my instructions here are pretty verbose, and in particular that I'm telling it to use the tool to turn on the heat. What's going on here is that I'm fighting with the model: mistral-small only has about 20B parameters and so it's not really smart enough to figure things out if you're not super explicit. Here's what happened when I didn't tell it that it had a tool to turn on the heat:
This is the same reason I'm having the instructions spell out each step so explicitly.
These results also aren't totally reliable (LLMs usually do not have deterministic behavior), so if you try this yourself you may need to run the program a couple of times to get the desired result. You'd probably get a better result if you were using a smarter model, but I picked something that would run well on low-end machines, because the next thing I want to do is go a level deeper into what's actually going on, and that requires a model I can run locally.

Internals #

As I said above, we're using the Ollama API rather than a local library so that you can see the actual data we're sending to the API, but that's just the first layer of the onion, because each LLM has its own idiosyncratic syntax which Ollama translates to and from. For example, here is what our initial shirt prompt turns into when we send it to Mistral and Gemma (one of Google's open weight models) respectively:[4]

Mistral #
Gemma #
This is, as they say, a "rich text". The first thing to notice is that neither of these prompts is JSON. Instead, Ollama has taken our JSON API input and translated it into this stuff, which we'll generously call "structured". However, each model has made its own idiosyncratic choices:
All this is just hidden by Ollama, which has a pretty fancy templating engine that lets each downloadable model specify how to translate to and from the Ollama API and the model-specific stuff. I think the coolest thing here, though, is what's at the end of the Gemma prompt, which is basically the framing of a model response left incomplete, ready for the model to fill it in. What's going on here? Well, recall that an LLM is basically a completion machine, and it's trying to continue the conversation, so basically we're telling the model "the next thing that's going to happen in this conversation is that the model is going to say something". OpenAI's open models do the same thing. Here's gpt-oss-20b:
The responses from Mistral and Gemma are about what you would expect:

Mistral #
Gemma #
There don't seem to be any delimiters here, so this could be a bug in my instrumentation, but I think that's actually what's going on.[6] Now take a look at what gpt-oss looks like:
This is really cool, because we're actually now getting two kinds of output: the model's reasoning about what to say, and then the response actually intended for the user.
This is an important clue to what's actually going on under the hood. Reformatting it to make it clearer:[7]
What we've got here is a "reasoning" model, and what that means in practice is that it produces its "thinking" process out loud as part of the model output, and after that thinking is done ("Let's produce" in the text above) it actually produces the output that's intended for the user. What this output shows, though, is that it's still all text production—albeit with a lot of tuning—basically what's happening is that the model just produces the reasoning text first and then produces the output that follows—in a literal sense!—from that reasoning.

Tool-Calling #

Now let's see what happens when we call a tool. Here's Mistral and gpt-oss after I removed the system prompts (the version of Gemma I'm using didn't want to do tool calling, so I didn't show it, but you can use functiongemma):

Mistral #
GPT #
Holy mixture of formats, Batman. In both cases we have quasi-XML with embedded JSON. With Mistral, the JSON is just inlined into the response. And finally, here are the actual tool calls, which are about what you would expect:

Mistral #
GPT #
Don't ask me why the GPT tool calls are in the commentary channel. That's just how things are.

Model Context Protocol #

Tools aren't the only way that an AI model can interact with the outside world. For example, Anthropic has developed something called the Model Context Protocol (MCP), which is a way for models to interact with external resources (tools, data, etc.).
MCP Architecture: from modelcontextprotocol.io[9]

The arrows connecting the host process and the servers are MCP, which is a fairly simple JSON-RPC protocol. Like tool calling, MCP is generic in that it specifies how to talk to external resources but doesn't specify any details about the resources themselves. Instead, the servers are responsible for providing descriptions of the resources, which can currently be any of: tools (functions the client can call), resources (data the server can provide), or prompts (templates for interacting with the model).
The client can interrogate the server to learn about each of these resources, which come packaged in convenient descriptions just like we saw with tools. For instance, here is an example tool description from the MCP spec:
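The spec's example describes a weather tool, roughly like the following (reproduced from memory, so treat the details as approximate):

```javascript
// An MCP tool description as a server would return it from a
// tools/list request: a name, a description, and a JSON Schema for
// the arguments, here called inputSchema.
const mcpToolDescription = {
  name: "get_weather",
  description: "Get current weather information for a location",
  inputSchema: {
    type: "object",
    properties: {
      location: {
        type: "string",
        description: "City name or zip code",
      },
    },
    required: ["location"],
  },
};
```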
This should look incredibly familiar, because it's basically the same thing as you would feed in for a tool description with Ollama. This is actually the tool description format that Claude uses, where things are named a little differently (e.g., inputSchema rather than parameters). It's important to realize that using server-side resources via MCP is isomorphic to tool calling. Recall that the model doesn't know how the tools are implemented; it just knows that they exist. This means that if you have an LLM which knows how to call tools, you can make it do MCP just by creating a translation layer that exposes the MCP-provided tools as if they were regular tools, as shown below:
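A sketch of that translation layer (the mcpClients here are assumed to expose listTools() and callTool() wrappers around MCP's JSON-RPC messages; all names are illustrative, not the actual MCP SDK API):

```javascript
// Merge tools from MCP servers with local tools into one flat list,
// remembering where each tool lives so calls can be dispatched back.
async function collectTools(mcpClients, localTools) {
  const defs = [];  // tool definitions to hand to the LLM
  const impls = {}; // name -> async function implementing the tool
  for (const client of mcpClients) {
    for (const tool of await client.listTools()) {
      defs.push({
        type: "function",
        function: {
          name: tool.name,
          description: tool.description,
          parameters: tool.inputSchema, // MCP's name for the schema
        },
      });
      impls[tool.name] = (args) => client.callTool(tool.name, args);
    }
  }
  for (const t of localTools) {
    defs.push(t.definition);
    impls[t.definition.function.name] = t.impl;
  }
  return { defs, impls };
}
```

The LLM never sees the difference: every tool looks the same in the definitions list, and the wrapper routes each call to the right place.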
In this setup, we actually have three sources of tools, namely the two MCP servers and then a local tool. The agent wrapper just collects all the tools and provides them to the LLM without distinguishing where they live, and then is responsible for dispatching the tool requests to wherever they need to go. You can handle resource requests the same way, with the resource just being a specialized kind of tool that reads static data. Prompts are a little different, and I'm not going to handle them here.

Server to Client #

There is a bit more to MCP than this. In particular, MCP includes functions to let the server ask the client to access the LLM on its behalf (e.g., ask for a completion). However, these too don't require anything new from the model, but are just implemented in the wrapper code, which accesses the model on the server's behalf. The important thing to realize here is that the LLM doesn't need to know anything about MCP at all, because this part of MCP is just tool calling wearing a different hat. As long as the model is set up for tool calling, which has a simple request/response model, we can do all the translation to MCP in the deterministic agent wrapper code (which doesn't require any model tooling). If someone invented a new version of MCP with totally different syntax, we wouldn't need to change the LLM at all, just update the wrapper.

The Bigger Picture #

The amazing thing is really how much we are doing with how little. Our final agent program is less than 350 lines, and though we'd obviously need proper error handling, etc., the functionality here is actually the core of a real agentic tool. We get that power by composing a bunch of simple components:
That's all there is. Using it for real work would mostly consist of (1) adding a full suite of tools and (2) using it with a good model rather than the local ones we're using here. (2) is actually quite straightforward because Ollama supports cloud models, translating to the cloud APIs instead of to the local model interfaces. This leaves us with the tools, but the tools aren't about AI; they're just the same kinds of APIs that you'd write for any programming task. What makes all this possible is that while the models are trained to use tools generically, they aren't trained to use any specific tools. That means that all you have to do to add new capabilities is to write new tools and tell the model about them. The model can then work forward from your instructions and the tools it knows about in order to figure out what tools it needs to call and in what order so that it can accomplish whatever it is you asked it to do.[11] This setup gives us two avenues for increasing the power of the system. First, we can make the model smarter so it's better at figuring out what to do with the tools it has. We saw that already above, where we had to remind the model that it had tools available, but with a better model that wouldn't be necessary. Second, we can give the model more tools to work with. These avenues are independent but work together: you can make your existing AI-based system better by adding more tools, but then if you replace your model with a smarter one, it will instantly get better with the tools it has.