Ask most people what an AI agent is, and they describe a chatbot. You type a question, it types back an answer, the answer sounds confident, and that's the product.
On a construction project, that mental model breaks fast. A confident answer about RFI 204 that's wrong, or three weeks out of date, or quietly invented, isn't a minor bug. It's the kind of thing that ends up quoted in a meeting, then in an email, then in a claim.
The language model that writes the sentence is the easiest part of the system to build. It's also the least important part when it comes to whether you can trust what it says.
The shift worth understanding is this: the intelligence in an AI agent isn't in how it talks. It's in everything that happens before it's allowed to.
A language model is a mouth, not a brain
A large language model is a remarkably good pattern matcher. Given a prompt, it produces text that statistically resembles the kind of text that should follow. It does this whether or not it has ever seen your project, your RFIs, or your schedule.
This is easy to forget because the output reads so naturally. Ask a model directly "what's the status of RFI 204 on the Mill Street job," and it may answer fluently, with a plausible-sounding status, a plausible-sounding date, and a plausible-sounding name attached. None of it has to be true. The model isn't lying in the way a person lies. It's doing exactly what it was built to do: continue the pattern.
- Mouth: generates language, fluently and on demand, regardless of what it actually knows.
- Brain: decides what the mouth is allowed to talk about, based on what's actually in the project record.
An AI agent is the system built around the model to supply that second part. The model never sees a question cold. By the time it generates anything, the brain has already decided what the question is really asking, what evidence is relevant, and whether there's enough of it to answer responsibly.
The agent's first job is figuring out what it's being asked
Two questions can look similar and require completely different work.
"What's the status of RFI 204" is a lookup. There's a record, it has a status field, the answer is a fact retrieval problem.
"Why are we behind on level 3 framing" is not a lookup. There's no single document with that answer. It requires pulling together RFIs, change directives, daily reports, and schedule activities, then reasoning across all of them to find a pattern.
An agent's first move is classifying which kind of question it's facing, before it tries to answer either one. This is intent detection: routing the question to the right kind of retrieval, rather than handing every question the same generic search and hoping for the best.
It gets more complicated in practice. A single message often contains more than one question ("what's the status of RFI 204, and has it affected the framing schedule"), and the agent has to recognise both, handle each on its own terms, and bring the results back together in one answer. Get this step wrong, and everything downstream is built on the wrong foundation, no matter how good the retrieval or the model is.
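To make the classification step concrete, here is a minimal sketch, assuming a keyword heuristic in place of a real classifier. The names `split_questions`, `classify`, and `detect_intents`, and the two intent labels, are invented for illustration and are not Storia's implementation. A production system would more likely use a trained classifier or a model call for this step.

```python
import re
from dataclasses import dataclass

# Hypothetical intent labels; a real system would likely define more.
LOOKUP = "lookup"      # one record, one field
ANALYSIS = "analysis"  # needs reasoning across several records


@dataclass
class SubQuestion:
    text: str
    intent: str


def split_questions(message: str) -> list[str]:
    """Split a message into sub-questions on '?', ';' or ', and' boundaries."""
    parts = re.split(r"\?|;|,\s+and\s+", message)
    return [p.strip() for p in parts if p.strip()]


def classify(question: str) -> str:
    """Keyword heuristic standing in for a real intent classifier."""
    q = question.lower()
    if q.startswith(("why", "how")) or "affected" in q or "impact" in q:
        return ANALYSIS
    return LOOKUP


def detect_intents(message: str) -> list[SubQuestion]:
    """Classify each sub-question separately so each can be routed on its own."""
    return [SubQuestion(q, classify(q)) for q in split_questions(message)]


for sub in detect_intents(
    "what's the status of RFI 204, and has it affected the framing schedule"
):
    print(sub.intent, "->", sub.text)
```

The point of the sketch is the shape, not the heuristic: the message is split first, and each piece gets its own label before any retrieval happens.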
| | Chatbot | Agent |
|---|---|---|
| First step | Sends the question straight to the model | Classifies what kind of question it is |
| Source of answer | The model's training data and the prompt | The project record, retrieved on demand |
| Multiple questions in one message | Often answers only one, or blends them | Detects and handles each separately |
| When it doesn't have an answer | Generates something plausible anyway | Can decide not to answer |
This is also, quietly, what a lot of "AI co-pilots" bolted onto existing construction software are. A chat window appears in the corner of a familiar tool, and it looks like the agent column. Underneath, it's often just a model with a search box wired in: no real classification of what's being asked, no following the links between records, no way to say it came up short. It looks like an agent. It behaves like a chatbot.
Then it goes and gets the evidence, the same way a person would
Once the agent knows what's being asked, it has to go get the evidence, and that's a series of decisions, not a single search. Does this question need a document lookup, a follow-the-links search through related records, a schedule query, or some combination of the three?
This is where the project's structure does the heavy lifting. An RFI links to the directive that responded to it. A directive links to the schedule activity it affected. A change request links to the change order it became. The agent doesn't need to guess these connections from text similarity alone. It can follow them, the same way a person would pull a thread from one document to the next.
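Following links rather than guessing from text similarity can be sketched as a bounded graph walk. This is an illustration under assumptions: the record IDs, the `LINKS` table, and the `follow_links` function are all made up for the example, and a real project record would live in a database rather than a dictionary.

```python
from collections import deque

# Hypothetical project record: each record ID maps to the records it links to.
LINKS = {
    "RFI-204": ["DIR-17"],          # RFI -> directive that responded to it
    "DIR-17": ["ACT-L3-FRAMING"],   # directive -> schedule activity it affected
    "ACT-L3-FRAMING": [],
    "CR-9": ["CO-4"],               # change request -> change order it became
    "CO-4": [],
}


def follow_links(start: str, links: dict[str, list[str]], max_hops: int = 3) -> list[str]:
    """Breadth-first walk from `start`, returning record IDs in visit order.

    `max_hops` caps how far the walk goes from the starting record.
    """
    seen = {start}
    order = [start]
    queue = deque([(start, 0)])
    while queue:
        record, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand records already at the hop limit
        for nxt in links.get(record, []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append((nxt, hops + 1))
    return order


print(follow_links("RFI-204", LINKS))
```

The hop limit is the design choice worth noticing: a status lookup might use a limit of zero or one, while a "why" question might need several.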
What 'tool calling' actually means here
When people say an agent "calls tools," they mean the model itself decides, mid-reasoning, which of these retrieval steps it needs. It might check the schedule first, realise the delay traces back to an RFI, then go fetch that RFI and the directive that followed it, before it has enough to respond.
This is also where the earlier classification step pays off. A status lookup might need one targeted search. A "why" question might need to follow three or four links across related records before there's enough to say anything useful. The agent decides how many links are enough, and when to stop.
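The loop behind tool calling can be sketched roughly as follows. To keep the example runnable without any model, a hand-written `choose_next` function stands in for the model's decision, and the three tools return canned, made-up data. Every name here (`query_schedule`, `fetch_rfi`, `fetch_directive`, `run_agent`) is hypothetical.

```python
from typing import Optional


# Hypothetical tools returning canned data; real ones would query project systems.
def query_schedule(activity: str) -> dict:
    return {"tool": "query_schedule", "activity": activity, "delayed_by": "RFI-204"}


def fetch_rfi(rfi_id: str) -> dict:
    return {"tool": "fetch_rfi", "id": rfi_id, "answered_by": "DIR-17"}


def fetch_directive(directive_id: str) -> dict:
    return {"tool": "fetch_directive", "id": directive_id}


TOOLS = {
    "query_schedule": query_schedule,
    "fetch_rfi": fetch_rfi,
    "fetch_directive": fetch_directive,
}


def choose_next(question: str, evidence: list[dict]) -> Optional[tuple[str, str]]:
    """Stand-in for the model: pick the next tool call, or None to stop."""
    if not evidence:
        return ("query_schedule", "L3 framing")
    last = evidence[-1]
    if "delayed_by" in last:
        return ("fetch_rfi", last["delayed_by"])
    if "answered_by" in last:
        return ("fetch_directive", last["answered_by"])
    return None  # nothing left to follow


def run_agent(question: str, max_steps: int = 5) -> list[dict]:
    """Call tools one at a time until the policy stops or the step budget runs out."""
    evidence: list[dict] = []
    for _ in range(max_steps):
        step = choose_next(question, evidence)
        if step is None:
            break
        name, arg = step
        evidence.append(TOOLS[name](arg))
    return evidence


for item in run_agent("Why are we behind on level 3 framing"):
    print(item)
```

In this sketch the stopping decision and the step budget are separate: the policy stops when there is nothing left to follow, and `max_steps` is a safety cap in case it never does.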
The brain's real job is knowing when to say "I don't know"
Here's the uncomfortable truth about earlier generations of these assistants: when they couldn't find a relevant document, they often answered anyway. We saw this with a question as simple as "what's the current revision of the level 4 mechanical specs." An earlier version of the assistant found nothing current in its search results and still cited a revision that had been superseded months earlier. The model wasn't malfunctioning. Nothing in the system was checking whether "we found nothing current" should produce an answer at all.
A wrong answer that arrives instantly and sounds certain is more dangerous than a slow one, because nobody double-checks the thing they already believe.
The fix isn't a smarter model. It's a system that checks its own evidence before it speaks. If the retrieval step comes back thin, contradictory, or simply doesn't cover the question, the agent's job is to say so, point to what it did find, and suggest where a person should look next. That's a less satisfying answer than a confident paragraph. It's also the only kind of answer worth building a workflow around.
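A check like that can be very small. The sketch below assumes each retrieved document carries a revision label and a superseded flag. The `Doc` type, its fields, and `answer_or_abstain` are invented for illustration, and a real evidence check would weigh more than these two signals.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    doc_id: str
    revision: str
    superseded: bool


def answer_or_abstain(question: str, docs: list[Doc]) -> dict:
    """Answer only when the retrieved evidence is current and consistent."""
    current = [d for d in docs if not d.superseded]

    # Thin evidence: nothing current was retrieved, so decline and show what was found.
    if not current:
        return {
            "answered": False,
            "message": "No current document found; check the source system.",
            "found": [d.doc_id for d in docs],
        }

    # Contradictory evidence: current documents disagree on the revision.
    revisions = {d.revision for d in current}
    if len(revisions) > 1:
        return {
            "answered": False,
            "message": "Current documents disagree on the revision.",
            "found": [d.doc_id for d in current],
        }

    # Evidence is sufficient: answer and cite the sources alongside it.
    return {
        "answered": True,
        "revision": current[0].revision,
        "sources": [d.doc_id for d in current],
    }


print(answer_or_abstain(
    "what's the current revision of the level 4 mechanical specs",
    [Doc("M-401", "Rev B", superseded=True)],
))
```

In the superseded-revision case described earlier, a gate like this returns a refusal plus the document it did find, instead of handing the stale revision to the model as if it were an answer.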
This is the difference between an assistant that's occasionally impressive and one that's reliably useful. Impressive is a demo. Reliable is something a project manager can act on without re-checking it against the source files first, because the source files are right there in the answer.
Where this is going next
The next step for these agents isn't a bigger model. It's longer chains of reasoning: an agent that follows a question from an RFI to the directive that answered it to the schedule activity it touched to the change order it produced, all in one pass, without losing the thread or the sources along the way. That's less "ask a question, get an answer" and more "ask a question, watch it investigate."
That's the direction Storia's agents are heading: not a faster mouth, but a better-informed brain behind every one of them, one that treats your project's documents and records as something to reason over and connect, not just look up one at a time.
Amin Bayatpour, AI Engineer at Storia, working on GenAI, RAG, and agentic systems. Reach out at info@storiatechnologies.com if you want to see how this kind of reasoning shows up across Storia's agents.



