TL;DR — Key Takeaways
- Start with one narrow, measurable task — not a general-purpose assistant
- Use a model router (we use OpenRouter) instead of locking into one vendor — different models for different roles save 40-70% on cost with no quality loss
- Skip heavyweight frameworks (LangGraph, CrewAI) for most use cases — a 200-line tool-calling loop with explicit state is easier to debug, cheaper to maintain, and gives you full control
- Build retrieval from day one: chunk your knowledge base by H2 headings, embed with text-embedding-3-small, query with vector search
- Follow the 10-20-70 rule: 10% algorithms, 20% infrastructure, 70% people and process
Introduction
When I speak with CTOs across fintech and proptech organizations, the question has shifted from "should we build agents?" to "which of the 12 frameworks should we commit to?" My answer usually disappoints them: none of them, for most use cases.
At Mobile Reality we've shipped 75+ production deployments across our AI automation practice — from our internal CMS agent that autonomously edits SEO articles, to client-facing underwriting systems processing thousands of loan applications. After that many deployments, a pattern emerges: the teams that ship fastest and maintain cheapest are the ones that reject the "pick a framework" framing entirely.
This guide walks through how we actually build agents — the model router, the hand-rolled tool-calling loop, the vector retrieval layer, and the tradeoffs behind each choice. You'll see concrete code from our production CMS agent (the one that wrote the first draft of many sections on themobilereality.com/blog) — not generic pseudocode.
Gartner projects that 40% of enterprise applications will embed AI agents this year — up from less than 5% in 2025. With the market heading to $47.1B by 2030, enterprise leaders are no longer asking "if." They're asking: how do we build something maintainable?
What Is an AI Agent? The Five Types You'll Actually Encounter
Before writing code, map your use case to one of five agent archetypes. Picking the wrong type is the most expensive mistake I see — teams overbuild "autonomous learning" systems when a 50-line reflex agent would do the job.
Simple Reflex and Model-Based Reflex Agents
Simple reflex agents are deterministic rule-runners. If-then logic, no memory, no planning. A sales-tax calculator that applies rates by zip code is a simple reflex agent. Our initial Google Ads alerting service — which pinged Slack when CPC crossed a threshold — was one, too. These agents are cheap to build, impossible to hallucinate, and break the moment input deviates from the expected shape.
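To make the contrast concrete, here is roughly what a CPC-threshold alert boils down to. This is a minimal sketch with placeholder names (CPC_THRESHOLD, SLACK_WEBHOOK_URL), not our actual service:

```javascript
// A simple reflex agent: one rule, no memory, no planning.
const CPC_THRESHOLD = 2.5; // illustrative value

async function checkCpcAndAlert(campaign) {
  // Rule: if cost-per-click crosses the threshold, ping Slack. Nothing else.
  if (campaign.cpc <= CPC_THRESHOLD) return;

  await fetch(process.env.SLACK_WEBHOOK_URL, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      text: `CPC alert: ${campaign.name} is at $${campaign.cpc.toFixed(2)} (threshold: $${CPC_THRESHOLD})`,
    }),
  });
}
```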
Model-based reflex agents extend this by maintaining an observable world state. A commercial HVAC controller tracking occupancy, outdoor temperature, and open windows is model-based — it infers hidden variables from observable ones. These agents need more compute but handle unfamiliar scenarios that would otherwise require human triage.
Goal-Based, Utility-Based, and Learning Agents
Goal-based agents plan. They decompose a high-level objective into steps and pick the best sequence. Our internal SEO-fix agent is goal-based — given "make this article pass 47 SEO rules," it calls get_article, then find_text, then replace_text in whatever order the reasoning model decides. No fixed pipeline; the model picks the next tool based on what the previous tool returned.
Utility-based agents add numerical optimization on top. Instead of binary "goal met / not met," they weigh tradeoffs — cost vs. quality, latency vs. accuracy, user satisfaction vs. compliance risk. A benefit-planning system balancing premium costs against employee satisfaction across hundreds of plan variations is utility-based.
Learning agents refine their strategy through experience. These are the rarest in production because they need high-quality feedback loops, which most businesses don't have. When they work (property valuation, ad bid optimization), they outperform static models on drifting distributions. When they don't, they amplify bad signal into worse behavior.
The honest truth: 80% of production agents we ship are goal-based with a tool-calling loop. Learning agents are cool at conferences; goal-based agents pay rent.
How We Build Agents at Mobile Reality: The Architecture
Here's the architecture we've converged on after 75+ deployments. It's unfashionably simple, which is exactly why it scales.
The Stack
- Model provider: OpenRouter, single API, 300+ models, per-request model selection, transparent pricing
- Orchestration: Hand-rolled tool-calling loop in TypeScript/JavaScript, no LangGraph, no CrewAI, no n8n
- Memory: Conversation history (in-memory during the run) + vector retrieval (for persistent knowledge)
- Vector store: Convex native vectorSearch, no separate Pinecone/Weaviate/pgvector
- Embeddings: openai/text-embedding-3-small via OpenRouter
- Runtime: Node.js in a React app for client-facing agents, Convex actions for server-side
Why This Stack (And Why Not LangGraph)
Every production system we inherited from other agencies was built on LangGraph or LangChain. Every one of them was a rewrite candidate within 18 months. The failure mode is always the same: a framework that abstracts the prompt, the state, and the control flow into "nodes" and "edges" — until a production bug requires you to trace why the model picked tool X over tool Y, and you realize the abstraction has hidden the only thing that matters.
A hand-rolled loop is ~200 lines. You can read it start to finish. When something goes wrong — and with agents, something always goes wrong — you console.log the messages array and see exactly what the model saw.
Frameworks make sense when you have dozens of engineers who need a common vocabulary. For a 5-person AI team shipping one agent per quarter, they're a tax.
Step-by-Step: Building a Production Agent
Step 1: Define the Mission in One Sentence
Every agent we ship starts with a single sentence on a whiteboard:
"This agent will [specific action] when [trigger] to achieve [measurable outcome]."
For our SEO-fix agent: "This agent will apply targeted edits to an article when the SEO verification tool reports failed rules, to achieve a passing score without human review."
Narrowness is the feature. A well-defined agent beats a general-purpose "AI assistant" on every metric — cost, latency, reliability, maintainability.
Step 2: Choose Your Models by Role, Not by Vendor
This is where most teams leave 50%+ of their budget on the table. There is no single "best model." Different roles in the agent need different things: the orchestrator needs strong tool-calling, the content writer needs fluent prose, the auditor needs reasoning, the lightweight utility calls just need speed.
Here's our actual role-to-model mapping from production (model-config.js):
export const AGENT_ROLES = {
orchestrator: {
default: 'z-ai/glm-5', // strong tool-calling, cheap
requiresTools: true,
},
contentWriter: {
default: 'moonshotai/kimi-k2', // excellent long-form prose
},
seoAuditor: {
default: 'openai/gpt-5.2', // deep reasoning for analysis
},
lightweight: {
default: 'openai/gpt-4.1-mini', // meta tags, quick tasks
},
webSearch: {
default: 'z-ai/glm-5', // native web_search plugin
requiresWebPlugin: true,
},
vision: {
default: 'openai/gpt-4.1-mini', // alt text generation
requiresVision: true,
},
};
The orchestrator is cheap (GLM-5) because tool-calling is mostly structured output. Expensive reasoning (GPT-5.2) is reserved for qualitative analysis. Writing is delegated to a model trained on long-form prose (Kimi K2). The total cost per article edit dropped ~60% when we moved from "use GPT-4 for everything" to role-based routing.
OpenRouter makes this trivial. Every call hits https://openrouter.ai/api/v1/chat/completions — you just swap the model field. One API key, one billing relationship, zero vendor lock-in.
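To illustrate, here is a minimal sketch of the routing glue: a getModel helper that resolves a role from AGENT_ROLES, plus the request it feeds. The helper name matches the snippets in this guide; the body is illustrative rather than our exact production code:

```javascript
// Resolve a role to its default model from AGENT_ROLES (defined above).
function getModel(role) {
  return AGENT_ROLES[role].default;
}

// Every role hits the same OpenRouter endpoint; only the `model` field changes.
async function chatCompletion(role, messages) {
  const response = await fetch('https://openrouter.ai/api/v1/chat/completions', {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({ model: getModel(role), messages }),
  });
  return response.json();
}
```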
Step 3: Build the Tool-Calling Loop
Here's the shape of every agent loop we write. It's deliberately boring:
const messages = [
{ role: 'system', content: systemPrompt },
{ role: 'user', content: userRequest },
];
let iteration = 0;
const MAX_ITERATIONS = 400;
while (iteration < MAX_ITERATIONS) {
iteration++;
// One model call
const result = await streamChatCompletionWithTools({
model: getModel('orchestrator'),
messages,
tools: AGENT_TOOLS,
signal: abortController.signal,
});
// Append assistant response to conversation
messages.push({
role: 'assistant',
content: result.fullText || null,
tool_calls: result.toolCalls,
});
// No tool calls → model is done
if (!result.toolCalls?.length) break;
// Execute each tool call and append result
for (const tc of result.toolCalls) {
const toolResult = await executeToolCall(tc.name, tc.arguments, articleRef);
messages.push({
role: 'tool',
tool_call_id: tc.id,
content: JSON.stringify(toolResult),
});
}
}
That's the core. Everything else is production hygiene:
- Abort controller: every request is abortable mid-stream — users can stop a runaway agent
- Duplicate detection: hash each tool call; if the same call repeats back-to-back, break out (the model is looping)
- Tool call limit: cap at 400 calls per run to bound cost
- Streaming: surface partial output to the UI so users see progress, not a spinner
- Progressive state: tools mutate articleRef.current directly — if the run aborts, all work so far is preserved
This is the entire pattern. No graph compiler, no state machine DSL, no "supervisor agent." Just a while loop and a messages array.
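The duplicate-detection guardrail from that list is only a few lines. One way to do it, assuming each tool call carries a name and its arguments:

```javascript
// Break out of the loop when the model issues the exact same tool call twice
// in a row, a reliable signal that it is stuck rather than making progress.
let lastCallSignature = null;

function isDuplicateToolCall(toolCall) {
  const signature = `${toolCall.name}:${JSON.stringify(toolCall.arguments)}`;
  const isDuplicate = signature === lastCallSignature;
  lastCallSignature = signature;
  return isDuplicate;
}

// Inside the while loop, before executing tool calls:
// if (result.toolCalls.some(isDuplicateToolCall)) break;
```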
Step 4: Define Tools as Data
Tools are JSON schemas. The model sees them, decides which to call, and your code executes them. OpenAI's function-calling format is the de facto standard — every model on OpenRouter that supports tools accepts it.
Here's one of our ten production tools:
{
type: 'function',
function: {
name: 'replace_text',
description:
'Replaces the first occurrence of old_text with new_text. ' +
'old_text must match exactly (whitespace and formatting).',
parameters: {
type: 'object',
properties: {
old_text: { type: 'string', description: 'Exact text to find.' },
new_text: { type: 'string', description: 'Replacement text.' },
},
required: ['old_text', 'new_text'],
},
},
}
The hard part isn't the schema — it's what happens when the model gets the text wrong. Models hallucinate whitespace, convert straight quotes to smart quotes, paraphrase instead of quoting verbatim. If your replace_text tool only does strict string matching, it will fail 30% of the time and the agent will spiral into retry loops.
Our production replace_text uses six fallback matching strategies before giving up:
- Exact string match — the happy path
- Trimmed match — strip leading/trailing whitespace
- Unicode normalization — smart quotes → straight, em-dash → hyphen
- Whitespace collapse — run of spaces/newlines → single space
- Line-trimmed match — trim each line independently
- First-last-line anchor — find the first and last line of the target, match everything between
This fuzzy-matching layer is worth more than the model choice. It turned our tool-success rate from 68% to 97% with no change to the prompt.
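To show the shape of it, here is a sketch of how the first four strategies can chain together. This is illustrative only; the production version also implements the line-trimmed and first-last-line anchor strategies and maps loose matches back to exact spans in the original text:

```javascript
// Try progressively looser matching strategies until one locates old_text.
const normalizers = [
  (s) => s,                                 // 1. exact match
  (s) => s.trim(),                          // 2. trimmed match
  (s) => s.replace(/[\u2018\u2019]/g, "'")  // 3. unicode normalization
          .replace(/[\u201C\u201D]/g, '"')
          .replace(/\u2014/g, '-'),
  (s) => s.replace(/\s+/g, ' ').trim(),     // 4. whitespace collapse
];

function findMatchStrategy(article, oldText) {
  for (let i = 0; i < normalizers.length; i++) {
    const normalize = normalizers[i];
    if (normalize(article).includes(normalize(oldText))) {
      return i; // index of the first strategy that matched
    }
  }
  return -1; // fall through to line-trimmed and anchor strategies
}
```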
Step 5: Delegate Generation to a Specialized Model
A subtle but high-leverage pattern: the orchestrator shouldn't generate prose. It should delegate.
When our SEO-fix agent needs to rewrite a paragraph, it calls generate_text — a tool that makes a separate OpenRouter call to Kimi K2:
async function callGenerationModel(apiKey, instruction, context) {
const response = await fetch(
'https://openrouter.ai/api/v1/chat/completions',
{
method: 'POST',
headers: { Authorization: `Bearer ${apiKey}`, 'Content-Type': 'application/json' },
body: JSON.stringify({
model: getModel('contentWriter'), // 'moonshotai/kimi-k2'
messages: [
{ role: 'system', content: 'You are an expert SEO content writer...' },
{ role: 'user', content: `## Instruction\n${instruction}\n\n## Context\n${context}` },
],
}),
},
);
// ...
}
This decouples orchestration from generation. The orchestrator can be a cheaper, faster model focused on "what should happen next." The writer can be a different model entirely — one chosen for prose quality. You can change either independently.
It also gives you observability: when output quality drops, you know which call to inspect.
Memory: What Actually Works in Production
Every "build an AI agent" guide talks about "short-term, episodic, and long-term memory" with diagrams of Redis clusters and vector databases. In production, here's what we actually run:
Conversation Memory: Just an Array
Our short-term memory is a React useRef holding an array of messages. It lives for the duration of the agent run, gets passed in full on every model call, and is discarded when the run ends.
const messagesRef = useRef([]);
// ... after each tool call:
messagesRef.current = [...messagesRef.current, toolResultMessage];
That's it. No Redis, no Mem0, no LangChain ConversationBufferMemory. A JavaScript array.
The "but what about token limits?" objection rarely bites in practice. Agent runs are bounded (our max is 400 iterations). Modern models have 200k+ context windows. If you hit the limit, you have a different problem — your agent's goal is too broad.
Persistent Knowledge: Vector Retrieval
For knowledge that persists across runs — past client projects, Clutch reviews, published blog posts — we use Convex's native vector search. Here's the real ingestion flow:
- Chunk the source by H2 headings (chunkByH2)
- Embed each chunk with openai/text-embedding-3-small via OpenRouter
- Insert into the knowledgeChunks table with the embedding indexed
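The chunker itself is small. One plausible implementation of chunkByH2 (a sketch, not necessarily our exact code) splits the markdown on H2 headings and keeps any preamble as its own chunk:

```javascript
// Split a markdown document into chunks, one per H2 section.
// Text before the first H2 becomes its own chunk so nothing is dropped.
function chunkByH2(markdown) {
  const chunks = [];
  let current = { heading: null, lines: [] };

  for (const line of markdown.split('\n')) {
    if (line.startsWith('## ')) {
      if (current.lines.length) chunks.push(current);
      current = { heading: line.slice(3).trim(), lines: [] };
    }
    current.lines.push(line);
  }
  if (current.lines.length) chunks.push(current);

  return chunks.map((c) => ({ heading: c.heading, text: c.lines.join('\n').trim() }));
}
```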
Querying is three lines:
const queryEmbedding = await generateEmbedding(args.query);
const searchResults = await ctx.vectorSearch("knowledgeChunks", "by_embedding", {
vector: queryEmbedding,
limit: 5,
filter: args.source ? (q) => q.eq("source", args.source) : undefined,
});
No separate vector DB to operate. No embedding sync job. Convex handles indexing automatically. If you're already on a BaaS with built-in vector support (Convex, Supabase with pgvector), use it. The marginal gain from a dedicated vector DB is not worth the operational cost until you're at tens of millions of vectors.
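On the agent side, retrieval is exposed as just another tool. A sketch of what that wrapper can look like: Convex's vectorSearch returns IDs and scores, so the full chunks are loaded afterward (fetchChunksByIds is an assumed helper, not part of Convex):

```javascript
// A retrieval tool: embed the query, run vector search, return the top chunks
// as text the orchestrator can read like any other tool result.
async function searchKnowledge(ctx, query, source) {
  const queryEmbedding = await generateEmbedding(query);

  const results = await ctx.vectorSearch('knowledgeChunks', 'by_embedding', {
    vector: queryEmbedding,
    limit: 5,
    filter: source ? (q) => q.eq('source', source) : undefined,
  });

  // vectorSearch returns { _id, _score } entries; load the documents separately.
  const chunks = await fetchChunksByIds(ctx, results.map((r) => r._id));
  return chunks.map((c) => `## ${c.heading}\n${c.text}`).join('\n\n');
}
```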
Verified Facts: Just a JSON File
For canonical company facts — services we offer, team bios, portfolio highlights — we don't use retrieval at all. We use a JSON file that a tool returns verbatim:
case 'get_core_knowledge':
return { success: true, result: JSON.stringify(companyKnowledge, null, 2) };
Why? Because "verified company data" doesn't need similarity search. It needs to be exactly right, 100% of the time. When an agent writes about Mobile Reality, it calls get_core_knowledge() and gets the authoritative JSON — no embedding drift, no retrieval ranking, no nearest-neighbor weirdness.
The lesson: not every knowledge store should be a vector database. Static, small, must-be-exact data belongs in a file. Large, fuzzy, similarity-searchable data belongs in a vector store. Use both.
Essential Tools for Building AI Agents
Here's the toolkit we actually reach for:
| Need | Tool | Why |
|---|---|---|
| Model access | OpenRouter | Single API for 300+ models, per-request routing, one bill |
| Orchestration | Hand-rolled loop in TypeScript | No lock-in, debuggable, ~200 LOC |
| Streaming SDK | Native fetch with SSE parsing | No wrapper needed; the API is straightforward |
| Vector store | Convex or Supabase + pgvector | Built-in, no separate ops |
| Embeddings | openai/text-embedding-3-small via OpenRouter | Cheap, 1536 dims, good enough for 95% of cases |
| Prompt versioning | Langfuse | Move prompts out of code; version, A/B test, edit without deploys |
| LLM observability | Langfuse + PostHog AI | Trace every call: prompt version, input, output, tokens, latency |
| Evals | PromptFoo or Braintrust | You cannot ship agents without evals — PromptFoo runs in CI |
Notable omissions: LangChain, LangGraph, CrewAI, LlamaIndex, Mem0, Pinecone, Weaviate. We've tried all of them. We don't run any in production today.
This isn't dogma — if you're doing something genuinely novel (reinforcement learning loops, complex multi-agent choreography with distinct processes), some of these earn their keep. For the "agent that uses tools to do a job" pattern that covers 90% of enterprise use cases, they're weight.
The 10-20-70 Rule for AI Success
After shipping the technical stack, the question becomes: why do some agents get adopted and others die on the shelf?
The 10-20-70 rule, originally described by BCG, explains it better than any architecture diagram:
- 10% → Algorithms: model selection, prompt engineering, fine-tuning
- 20% → Technology infrastructure: vector stores, embeddings, webhooks, retry logic, observability
- 70% → People and process redesign: workflow mapping, training, change management
We routinely see teams invert this — 80% of budget on algorithm perfection, 5% on rollout. Those projects produce beautiful demos and zero business value.
The 10% algorithm slice is where this guide has lived so far. It's where engineers want to spend time. It's also the slice with the lowest marginal return once you hit "good enough" — foundation models are commoditized, and the gap between GPT-4.1-mini and a fine-tuned custom model for most tasks is smaller than teams expect.
The 20% infrastructure slice is where agents go from demo to production. Retry logic on rate-limited APIs, PII scrubbing before embedding, audit trails for every tool call, cost tracking per agent run, graceful degradation when a model provider goes down. The boring stuff that determines whether you can leave the system unattended.
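As one concrete slice of that 20%, here is a minimal retry-with-backoff wrapper for rate-limited model calls. A sketch, not our exact production code:

```javascript
// Retry a request on rate limits (429) and transient server errors (5xx),
// backing off exponentially so a flapping provider doesn't burn the budget.
async function withRetry(makeRequest, maxAttempts = 4) {
  let response;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    response = await makeRequest();
    if (response.status !== 429 && response.status < 500) return response;

    const delayMs = 500 * 2 ** (attempt - 1); // 500ms, 1s, 2s, ...
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  return response; // still failing after maxAttempts; let the caller decide
}
```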
The 70% people slice determines whether the agent gets used. One of our clients built a perfect maintenance-scheduling agent and watched adoption stall at 12%. We didn't change a line of code. We ran two workshops with floor supervisors, redesigned the escalation flow so the agent's suggestions routed through their existing ticketing system, and adoption hit 80% within a month. The agent didn't change. The process around it did.
How Much Does It Cost to Build an AI Agent?
Prices below are what we actually quote for client engagements. They bake in the architecture choices above — teams building on heavyweight frameworks will see 30-60% higher numbers for equivalent scope, mostly in ongoing maintenance.
| Agent Type | Cost Range | Timeline | Example |
|---|---|---|---|
| Basic reactive | $2,500–$5,000 | 1-2 weeks | Webhook + Slack notifications |
| Intermediate contextual | $5,000–$15,000 | 4-10 weeks | Tool-calling loop + vector retrieval |
| Advanced autonomous | $15,000–$50,000 | 16-24 weeks | Multi-step orchestration with evals and audit trails |
| Enterprise multi-agent | $50,000–$150,000+ | 6-12 months | Multiple specialized agents with shared knowledge layer |
The biggest hidden cost is observability. Every production agent we run has structured logging on every tool call, every model response, every token count. Bolting this on after launch costs 2-3x what building it in costs. If you take one piece of advice from this guide beyond the architecture: log every tool call to a queryable store from day one.
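What "log every tool call to a queryable store" means in practice is roughly this. A sketch, assuming some store you already run (a Convex table, Postgres, whatever) behind a hypothetical logStore.insert:

```javascript
// Wrap tool execution so every call lands in a queryable log with enough
// context to reconstruct the run: which run, which tool, what arguments,
// whether it succeeded, and how long it took.
async function executeAndLogToolCall(runId, toolCall, articleRef) {
  const startedAt = Date.now();
  const result = await executeToolCall(toolCall.name, toolCall.arguments, articleRef);

  await logStore.insert('toolCallLogs', {
    runId,
    tool: toolCall.name,
    arguments: toolCall.arguments,
    success: result.success ?? true,
    durationMs: Date.now() - startedAt,
    createdAt: startedAt,
  });

  return result;
}
```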
Other cost drivers to plan for:
- Annual maintenance: 15-30% of initial build cost — models get deprecated, APIs change, prompts drift
- Inference cost: for high-volume agents, this can exceed engineering cost — role-based model routing (Step 2) is the single biggest lever
- Evals infrastructure: don't skip. An agent without evals is an agent you cannot improve
Conclusion
The agents we ship that deliver business value share a few traits: narrow goals, role-based model routing via OpenRouter, hand-rolled tool-calling loops instead of frameworks, vector retrieval only where it earns its keep, and relentless focus on the 70% of the work that isn't code.
If you're starting this week: pick one narrow task, build it as a tool-calling loop against OpenRouter, and ship it behind a feature flag to 5% of users. Iterate from there. The goal is production, not architecture.
If you're building complex, compliance-sensitive systems and want a team that's shipped this 75+ times — my team at Mobile Reality designs, builds, and operates these systems end-to-end. We'll bring the architecture; you bring the domain knowledge.
Related reading:
- Generative UI: How AI Creates Dynamic User Interfaces
- Business Automation with AI Agents
- Structured LLM Output Without JSON Schemas
Frequently Asked Questions
Should I use LangGraph, CrewAI, or another framework to build AI agents?
Skip heavyweight frameworks for most use cases. After 75+ production deployments, we consistently see that a 200-line hand-rolled tool-calling loop in TypeScript is easier to debug, cheaper to maintain, and gives you full control. Frameworks like LangGraph hide the state and control flow in abstractions that make production bugs nearly impossible to trace. They only make sense if you have dozens of engineers who need a common vocabulary.
How do I choose which LLM to use for my AI agent?
Don't lock into one vendor. Use a model router like OpenRouter and assign different models to different roles. Our production setup uses GLM-5 for the orchestrator (cheap, strong tool-calling), Kimi K2 for content writing (excellent prose), GPT-5.2 for SEO auditing (deep reasoning), and GPT-4.1-mini for lightweight tasks. This role-based routing cut our costs by 60% without any quality loss.
What kind of memory do I actually need for production AI agents?
Far less than you think. For short-term memory, use a simple array of messages that lives for the duration of the agent run—no Redis, no Mem0, no LangChain memory abstractions. Modern models have 200k+ context windows, and agent runs should be bounded anyway (we cap at 400 iterations). For persistent knowledge, use vector search with embeddings, but keep verified facts in a JSON file that tools return verbatim—similarity search is wrong for data that must be exactly right.
How do I prevent my AI agent from making errors or getting stuck in loops?
Build explicit guardrails into your hand-rolled loop, not framework-level abstractions. Our production pattern includes: an abort controller so every request is cancellable mid-stream, duplicate detection to break out when the same tool call repeats back-to-back, a hard cap of 400 tool calls per run, and six fallback matching strategies for text replacement tools (exact match, trimmed, unicode normalization, whitespace collapse, line-trimmed, and first-last-line anchor). These predictable failures beat hoping the model reasons its way out.
What is the 10-20-70 rule and why does it matter for AI agent projects?
The 10-20-70 rule, from BCG's digital transformation research, states that success depends on 10% algorithms, 20% technology infrastructure, and 70% people and process redesign. Teams routinely invert this—spending 80% on model perfection and 5% on rollout. The result is beautiful demos with zero business value. We saw one client with a perfect maintenance-scheduling agent stall at 12% adoption until we redesigned the escalation flow to route through their existing ticketing system—no code changes, just process. Adoption hit 80% within a month.
Discover more on AI-based applications and genAI enhancements
Artificial intelligence is revolutionizing how applications are built, enhancing user experiences, and driving business innovation. At Mobile Reality, we explore the latest advancements in AI-based applications and generative AI enhancements to keep you informed. Check out our in-depth articles covering key trends, development strategies, and real-world use cases:
- AI Development Costs 2026: Cut Budgets 3x With AI Tools
- The Role of AI in the Future of Software Engineering
- Unleash the Power of LLM AI Agents in Your Business
- Generative AI in software development
- AI Arbitrage Agency 2026: Scale Business Decisions 5X Faster
- Generate AI Social Media Posts for Free!
- Mastering Automated Lead Generation for Business Success
- Generative UI: AI-Driven User Interfaces Transforming Design
- Generative vs Agentic AI: Key Differences for Business 2026
Our insights are designed to help you navigate the complexities of AI-driven development, whether integrating AI into existing applications or building cutting-edge AI-powered solutions from scratch. Stay ahead of the curve with our expert analysis and practical guidance. If you need personalized advice on leveraging AI for your business, reach out to our team — we’re here to support your journey into the future of AI-driven innovation.
