TL;DR — Key Takeaways
- Start with one narrow, measurable task — not a general-purpose assistant
- Use a model router (we use OpenRouter) instead of locking into one vendor — different models for different roles save 40-70% on cost with no quality loss
- Skip heavyweight frameworks (LangGraph, CrewAI) for most use cases — a 200-line tool-calling loop with explicit state is easier to debug, cheaper to maintain, and gives you full control
- Build retrieval from day one: chunk your knowledge base by H2 headings, embed with text-embedding-3-small, query with vector search
- Follow the 10-20-70 rule: 10% algorithms, 20% infrastructure, 70% people and process
Introduction
When I speak with CTOs across fintech and proptech organizations, the question has shifted from "should we build agents?" to "which of the 12 frameworks should we commit to?" My answer usually disappoints them: none of them, for most use cases.
At Mobile Reality we've shipped 75+ production deployments across our AI automation practice — from our internal CMS agent that autonomously edits SEO articles, to client-facing underwriting systems processing thousands of loan applications. After that many deployments, a pattern emerges: the teams that ship fastest and maintain cheapest are the ones that reject the "pick a framework" framing entirely.
This guide walks through how we actually build agents — the model router, the hand-rolled tool-calling loop, the vector retrieval layer, and the tradeoffs behind each choice. You'll see concrete code from our production CMS agent (the one that wrote the first draft of many sections on themobilereality.com/blog) — not generic pseudocode.
Gartner projects that 40% of enterprise applications will embed AI agents this year — up from less than 5% in 2025. With the market heading to $47.1B by 2030, enterprise leaders are no longer asking "if." They're asking: how do we build something maintainable?
What Is an AI Agent? The Five Types You'll Actually Encounter
Before writing code, map your use case to one of five agent archetypes. Picking the wrong type is the most expensive mistake I see — teams overbuild "autonomous learning" systems when a 50-line reflex agent would do the job.
Simple Reflex and Model-Based Reflex Agents
Simple reflex agents are deterministic rule-runners. If-then logic, no memory, no planning. A sales-tax calculator that applies rates by zip code is a simple reflex agent. Our initial Google Ads alerting service — which pinged Slack when CPC crossed a threshold — was one, too. These agents are cheap to build, impossible to hallucinate, and break the moment input deviates from the expected shape.
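To make the contrast concrete, here is roughly what a CPC-threshold alert boils down to. This is a minimal sketch with placeholder names (CPC_THRESHOLD, SLACK_WEBHOOK_URL), not our actual service:

```javascript
// A simple reflex agent: one rule, no memory, no planning.
const CPC_THRESHOLD = 2.5; // illustrative value

async function checkCpcAndAlert(campaign) {
  // Rule: if cost-per-click crosses the threshold, ping Slack. Nothing else.
  if (campaign.cpc <= CPC_THRESHOLD) return;

  await fetch(process.env.SLACK_WEBHOOK_URL, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      text: `CPC alert: ${campaign.name} is at $${campaign.cpc.toFixed(2)} (threshold: $${CPC_THRESHOLD})`,
    }),
  });
}
```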
Model-based reflex agents extend this by maintaining an observable world state. A commercial HVAC controller tracking occupancy, outdoor temperature, and open windows is model-based — it infers hidden variables from observable ones. These agents need more compute but handle unfamiliar scenarios that would otherwise require human triage.
Goal-Based, Utility-Based, and Learning Agents
Goal-based agents plan. They decompose a high-level objective into steps and pick the best sequence. Our internal SEO-fix agent is goal-based — given "make this article pass 47 SEO rules," it calls get_article, then find_text, then replace_text in whatever order the reasoning model decides. No fixed pipeline; the model picks the next tool based on what the previous tool returned.
Utility-based agents add numerical optimization on top. Instead of binary "goal met / not met," they weigh tradeoffs — cost vs. quality, latency vs. accuracy, user satisfaction vs. compliance risk. A benefit-planning system balancing premium costs against employee satisfaction across hundreds of plan variations is utility-based.
Learning agents refine their strategy through experience. These are the rarest in production because they need high-quality feedback loops, which most businesses don't have. When they work (property valuation, ad bid optimization), they outperform static models on drifting distributions. When they don't, they amplify bad signal into worse behavior.
The honest truth: 80% of production agents we ship are goal-based with a tool-calling loop. Learning agents are cool at conferences; goal-based agents pay rent.
How We Build Agents at Mobile Reality: The Architecture
Here's the architecture we've converged on after 75+ deployments. It's unfashionably simple, which is exactly why it scales.
The Stack
- Model provider: OpenRouter, single API, 300+ models, per-request model selection, transparent pricing
- Orchestration: Hand-rolled tool-calling loop in TypeScript/JavaScript, no LangGraph, no CrewAI, no n8n
- Memory: Conversation history (in-memory during the run) + vector retrieval (for persistent knowledge)
- Vector store: Convex native vectorSearch, no separate Pinecone/Weaviate/pgvector
- Embeddings: openai/text-embedding-3-small via OpenRouter
- Runtime: Node.js in a React app for client-facing agents, Convex actions for server-side
Why This Stack (And Why Not LangGraph)
Every production system we inherited from other agencies was built on LangGraph or LangChain. Every one of them was a rewrite candidate within 18 months. The failure mode is always the same: a framework that abstracts the prompt, the state, and the control flow into "nodes" and "edges" — until a production bug requires you to trace why the model picked tool X over tool Y, and you realize the abstraction has hidden the only thing that matters.
A hand-rolled loop is ~200 lines. You can read it start to finish. When something goes wrong — and with agents, something always goes wrong — you console.log the messages array and see exactly what the model saw.
Frameworks make sense when you have dozens of engineers who need a common vocabulary. For a 5-person AI team shipping one agent per quarter, they're a tax.
Step-by-Step: Building a Production Agent
Step 1: Define the Mission in One Sentence
Every agent we ship starts with a single sentence on a whiteboard:
"This agent will [specific action] when [trigger] to achieve [measurable outcome]."
For our SEO-fix agent: "This agent will apply targeted edits to an article when the SEO verification tool reports failed rules, to achieve a passing score without human review."
Narrowness is the feature. A well-defined agent beats a general-purpose "AI assistant" on every metric — cost, latency, reliability, maintainability.
Step 2: Choose Your Models by Role, Not by Vendor
This is where most teams leave 50%+ of their budget on the table. There is no single "best model." Different roles in the agent need different things: the orchestrator needs strong tool-calling, the content writer needs fluent prose, the auditor needs reasoning, the lightweight utility calls just need speed.
Here's our actual role-to-model mapping from production (model-config.js):
export const AGENT_ROLES = {
orchestrator: {
default: 'z-ai/glm-5', // strong tool-calling, cheap
requiresTools: true,
},
contentWriter: {
default: 'moonshotai/kimi-k2', // excellent long-form prose
},
seoAuditor: {
default: 'openai/gpt-5.2', // deep reasoning for analysis
},
lightweight: {
default: 'openai/gpt-4.1-mini', // meta tags, quick tasks
},
webSearch: {
default: 'z-ai/glm-5', // native web_search plugin
requiresWebPlugin: true,
},
vision: {
default: 'openai/gpt-4.1-mini', // alt text generation
requiresVision: true,
},
};
The orchestrator is cheap (GLM-5) because tool-calling is mostly structured output. Expensive reasoning (GPT-5.2) is reserved for qualitative analysis. Writing is delegated to a model trained on long-form prose (Kimi K2). The total cost per article edit dropped ~60% when we moved from "use GPT-4 for everything" to role-based routing.
OpenRouter makes this trivial. Every call hits https://openrouter.ai/api/v1/chat/completions — you just swap the model field. One API key, one billing relationship, zero vendor lock-in.
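To illustrate, here is a minimal sketch of the routing glue: a getModel helper that resolves a role from AGENT_ROLES, plus the request it feeds. The helper name matches the snippets in this guide; the body is illustrative rather than our exact production code:

```javascript
// Resolve a role to its default model from AGENT_ROLES (defined above).
function getModel(role) {
  return AGENT_ROLES[role].default;
}

// Every role hits the same OpenRouter endpoint; only the `model` field changes.
async function chatCompletion(role, messages) {
  const response = await fetch('https://openrouter.ai/api/v1/chat/completions', {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({ model: getModel(role), messages }),
  });
  return response.json();
}
```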
Step 3: Build the Tool-Calling Loop
Here's the shape of every agent loop we write. It's deliberately boring:
const messages = [
{ role: 'system', content: systemPrompt },
{ role: 'user', content: userRequest },
];
let iteration = 0;
const MAX_ITERATIONS = 400;
while (iteration < MAX_ITERATIONS) {
iteration++;
// One model call
const result = await streamChatCompletionWithTools({
model: getModel('orchestrator'),
messages,
tools: AGENT_TOOLS,
signal: abortController.signal,
});
// Append assistant response to conversation
messages.push({
role: 'assistant',
content: result.fullText || null,
tool_calls: result.toolCalls,
});
// No tool calls → model is done
if (!result.toolCalls?.length) break;
// Execute each tool call and append result
for (const tc of result.toolCalls) {
const toolResult = await executeToolCall(tc.name, tc.arguments, articleRef);
messages.push({
role: 'tool',
tool_call_id: tc.id,
content: JSON.stringify(toolResult),
});
}
}
That's the core. Everything else is production hygiene:
- Abort controller: every request is abortable mid-stream — users can stop a runaway agent
- Duplicate detection: hash each tool call; if the same call repeats back-to-back, break out (the model is looping)
- Tool call limit: cap at 400 calls per run to bound cost
- Streaming: surface partial output to the UI so users see progress, not a spinner
- Progressive state: tools mutate articleRef.current directly — if the run aborts, all work so far is preserved
This is the entire pattern. No graph compiler, no state machine DSL, no "supervisor agent." Just a while loop and a messages array.
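The duplicate-detection guardrail from that list is only a few lines. One way to do it, assuming each tool call carries a name and its arguments:

```javascript
// Break out of the loop when the model issues the exact same tool call twice
// in a row, a reliable signal that it is stuck rather than making progress.
let lastCallSignature = null;

function isDuplicateToolCall(toolCall) {
  const signature = `${toolCall.name}:${JSON.stringify(toolCall.arguments)}`;
  const isDuplicate = signature === lastCallSignature;
  lastCallSignature = signature;
  return isDuplicate;
}

// Inside the while loop, before executing tool calls:
// if (result.toolCalls.some(isDuplicateToolCall)) break;
```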
Step 4: Define Tools as Data
Tools are JSON schemas. The model sees them, decides which to call, and your code executes them. OpenAI's function-calling format is the de facto standard — every model on OpenRouter that supports tools accepts it.
Here's one of our ten production tools:
{
type: 'function',
function: {
name: 'replace_text',
description:
'Replaces the first occurrence of old_text with new_text. ' +
'old_text must match exactly (whitespace and formatting).',
parameters: {
type: 'object',
properties: {
old_text: { type: 'string', description: 'Exact text to find.' },
new_text: { type: 'string', description: 'Replacement text.' },
},
required: ['old_text', 'new_text'],
},
},
}
The hard part isn't the schema — it's what happens when the model gets the text wrong. Models hallucinate whitespace, convert straight quotes to smart quotes, paraphrase instead of quoting verbatim. If your replace_text tool only does strict string matching, it will fail 30% of the time and the agent will spiral into retry loops.
Our production replace_text uses six fallback matching strategies before giving up:
- Exact string match — the happy path
- Trimmed match — strip leading/trailing whitespace
- Unicode normalization — smart quotes → straight, em-dash → hyphen
- Whitespace collapse — run of spaces/newlines → single space
- Line-trimmed match — trim each line independently
- First-last-line anchor — find the first and last line of the target, match everything between
This fuzzy-matching layer is worth more than the model choice. It turned our tool-success rate from 68% to 97% with no change to the prompt.
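To show the shape of it, here is a sketch of how the first four strategies can chain together. This is illustrative only; the production version also implements the line-trimmed and first-last-line anchor strategies and maps loose matches back to exact spans in the original text:

```javascript
// Try progressively looser matching strategies until one locates old_text.
const normalizers = [
  (s) => s,                                 // 1. exact match
  (s) => s.trim(),                          // 2. trimmed match
  (s) => s.replace(/[\u2018\u2019]/g, "'")  // 3. unicode normalization
          .replace(/[\u201C\u201D]/g, '"')
          .replace(/\u2014/g, '-'),
  (s) => s.replace(/\s+/g, ' ').trim(),     // 4. whitespace collapse
];

function findMatchStrategy(article, oldText) {
  for (let i = 0; i < normalizers.length; i++) {
    const normalize = normalizers[i];
    if (normalize(article).includes(normalize(oldText))) {
      return i; // index of the first strategy that matched
    }
  }
  return -1; // fall through to line-trimmed and anchor strategies
}
```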
Step 5: Delegate Generation to a Specialized Model
A subtle but high-leverage pattern: the orchestrator shouldn't generate prose. It should delegate.
When our SEO-fix agent needs to rewrite a paragraph, it calls generate_text — a tool that makes a separate OpenRouter call to Kimi K2:
async function callGenerationModel(apiKey, instruction, context) {
const response = await fetch(
'https://openrouter.ai/api/v1/chat/completions',
{
method: 'POST',
headers: { Authorization: `Bearer ${apiKey}`, 'Content-Type': 'application/json' },
body: JSON.stringify({
model: getModel('contentWriter'), // 'moonshotai/kimi-k2'
messages: [
{ role: 'system', content: 'You are an expert SEO content writer...' },
{ role: 'user', content: `## Instruction\n${instruction}\n\n## Context\n${context}` },
],
}),
},
);
// ...
}
This decouples orchestration from generation. The orchestrator can be a cheaper, faster model focused on "what should happen next." The writer can be a different model entirely — one chosen for prose quality. You can change either independently.
It also gives you observability: when output quality drops, you know which call to inspect.
Memory: What Actually Works in Production
Every "build an AI agent" guide talks about "short-term, episodic, and long-term memory" with diagrams of Redis clusters and vector databases. In production, here's what we actually run:
Conversation Memory: Just an Array
Our short-term memory is a React useRef holding an array of messages. It lives for the duration of the agent run, gets passed in full on every model call, and is discarded when the run ends.
const messagesRef = useRef([]);
// ... after each tool call:
messagesRef.current = [...messagesRef.current, toolResultMessage];
That's it. No Redis, no Mem0, no LangChain ConversationBufferMemory. A JavaScript array.
The "but what about token limits?" objection rarely bites in practice. Agent runs are bounded (our max is 400 iterations). Modern models have 200k+ context windows. If you hit the limit, you have a different problem — your agent's goal is too broad.
Persistent Knowledge: Vector Retrieval
For knowledge that persists across runs — past client projects, Clutch reviews, published blog posts — we use Convex's native vector search. Here's the real ingestion flow:
- Chunk the source by H2 headings (chunkByH2)
- Embed each chunk with openai/text-embedding-3-small via OpenRouter
- Insert into the knowledgeChunks table with the embedding indexed
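The chunker itself is small. One plausible implementation of chunkByH2 (a sketch, not necessarily our exact code) splits the markdown on H2 headings and keeps any preamble as its own chunk:

```javascript
// Split a markdown document into chunks, one per H2 section.
// Text before the first H2 becomes its own chunk so nothing is dropped.
function chunkByH2(markdown) {
  const chunks = [];
  let current = { heading: null, lines: [] };

  for (const line of markdown.split('\n')) {
    if (line.startsWith('## ')) {
      if (current.lines.length) chunks.push(current);
      current = { heading: line.slice(3).trim(), lines: [] };
    }
    current.lines.push(line);
  }
  if (current.lines.length) chunks.push(current);

  return chunks.map((c) => ({ heading: c.heading, text: c.lines.join('\n').trim() }));
}
```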
Querying is three lines:
const queryEmbedding = await generateEmbedding(args.query);
const searchResults = await ctx.vectorSearch("knowledgeChunks", "by_embedding", {
vector: queryEmbedding,
limit: 5,
filter: args.source ? (q) => q.eq("source", args.source) : undefined,
});
No separate vector DB to operate. No embedding sync job. Convex handles indexing automatically. If you're already on a BaaS with built-in vector support (Convex, Supabase with pgvector), use it. The marginal gain from a dedicated vector DB is not worth the operational cost until you're at tens of millions of vectors.
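On the agent side, retrieval is exposed as just another tool. A sketch of what that wrapper can look like: Convex's vectorSearch returns IDs and scores, so the full chunks are loaded afterward (fetchChunksByIds is an assumed helper, not part of Convex):

```javascript
// A retrieval tool: embed the query, run vector search, return the top chunks
// as text the orchestrator can read like any other tool result.
async function searchKnowledge(ctx, query, source) {
  const queryEmbedding = await generateEmbedding(query);

  const results = await ctx.vectorSearch('knowledgeChunks', 'by_embedding', {
    vector: queryEmbedding,
    limit: 5,
    filter: source ? (q) => q.eq('source', source) : undefined,
  });

  // vectorSearch returns { _id, _score } entries; load the documents separately.
  const chunks = await fetchChunksByIds(ctx, results.map((r) => r._id));
  return chunks.map((c) => `## ${c.heading}\n${c.text}`).join('\n\n');
}
```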
Verified Facts: Just a JSON File
For canonical company facts — services we offer, team bios, portfolio highlights — we don't use retrieval at all. We use a JSON file that a tool returns verbatim:
case 'get_core_knowledge':
return { success: true, result: JSON.stringify(companyKnowledge, null, 2) };
Why? Because "verified company data" doesn't need similarity search. It needs to be exactly right, 100% of the time. When an agent writes about Mobile Reality, it calls get_core_knowledge() and gets the authoritative JSON — no embedding drift, no retrieval ranking, no nearest-neighbor weirdness.
The lesson: not every knowledge store should be a vector database. Static, small, must-be-exact data belongs in a file. Large, fuzzy, similarity-searchable data belongs in a vector store. Use both.
Essential Tools for Building AI Agents
Here's the toolkit we actually reach for:
| Need | Tool | Why |
|---|---|---|
| Model access | OpenRouter | Single API for 300+ models, per-request routing, one bill |
| Orchestration | Hand-rolled loop in TypeScript | No lock-in, debuggable, ~200 LOC |
| Streaming SDK | Native fetch with SSE parsing | No wrapper needed; the API is straightforward |
| Vector store | Convex or Supabase + pgvector | Built-in, no separate ops |
| Embeddings | openai/text-embedding-3-small via OpenRouter | Cheap, 1536 dims, good enough for 95% of cases |
| Prompt versioning | Langfuse | Move prompts out of code; version, A/B test, edit without deploys |
| LLM observability | Langfuse + PostHog AI | Trace every call: prompt version, input, output, tokens, latency |
| Evals | PromptFoo or Braintrust | You cannot ship agents without evals — PromptFoo runs in CI |
Notable omissions: LangChain, LangGraph, CrewAI, LlamaIndex, Mem0, Pinecone, Weaviate. We've tried all of them. We don't run any in production today.
This isn't dogma — if you're doing something genuinely novel (reinforcement learning loops, complex multi-agent choreography with distinct processes), some of these earn their keep. For the "agent that uses tools to do a job" pattern that covers 90% of enterprise use cases, they're weight.
The 10-20-70 Rule for AI Success
After shipping the technical stack, the question becomes: why do some agents get adopted and others die on the shelf?
The 10-20-70 rule, originally described by BCG, explains it better than any architecture diagram:
- 10% → Algorithms: model selection, prompt engineering, fine-tuning
- 20% → Technology infrastructure: vector stores, embeddings, webhooks, retry logic, observability
- 70% → People and process redesign: workflow mapping, training, change management
We routinely see teams invert this — 80% of budget on algorithm perfection, 5% on rollout. Those projects produce beautiful demos and zero business value.
The 10% algorithm slice is where this guide has lived so far. It's where engineers want to spend time. It's also the slice with the lowest marginal return once you hit "good enough" — foundation models are commoditized, and the gap between GPT-4.1-mini and a fine-tuned custom model for most tasks is smaller than teams expect.
The 20% infrastructure slice is where agents go from demo to production. Retry logic on rate-limited APIs, PII scrubbing before embedding, audit trails for every tool call, cost tracking per agent run, graceful degradation when a model provider goes down. The boring stuff that determines whether you can leave the system unattended.
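As one concrete slice of that 20%, here is a minimal retry-with-backoff wrapper for rate-limited model calls. A sketch, not our exact production code:

```javascript
// Retry a request on rate limits (429) and transient server errors (5xx),
// backing off exponentially so a flapping provider doesn't burn the budget.
async function withRetry(makeRequest, maxAttempts = 4) {
  let response;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    response = await makeRequest();
    if (response.status !== 429 && response.status < 500) return response;

    const delayMs = 500 * 2 ** (attempt - 1); // 500ms, 1s, 2s, ...
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  return response; // still failing after maxAttempts; let the caller decide
}
```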
The 70% people slice determines whether the agent gets used. One of our clients built a perfect maintenance-scheduling agent and watched adoption stall at 12%. We didn't change a line of code. We ran two workshops with floor supervisors, redesigned the escalation flow so the agent's suggestions routed through their existing ticketing system, and adoption hit 80% within a month. The agent didn't change. The process around it did.
How Much Does It Cost to Build an AI Agent?
Prices below are what we actually quote for client engagements. They bake in the architecture choices above — teams building on heavyweight frameworks will see 30-60% higher numbers for equivalent scope, mostly in ongoing maintenance.
| Agent Type | Cost Range | Timeline | Example |
|---|---|---|---|
| Basic reactive | $2,500–$5,000 | 1-2 weeks | Webhook + Slack notifications |
| Intermediate contextual | $5,000–$15,000 | 4-10 weeks | Tool-calling loop + vector retrieval |
| Advanced autonomous | $15,000–$50,000 | 16-24 weeks | Multi-step orchestration with evals and audit trails |
| Enterprise multi-agent | $50,000–$150,000+ | 6-12 months | Multiple specialized agents with shared knowledge layer |
The biggest hidden cost is observability. Every production agent we run has structured logging on every tool call, every model response, every token count. Bolting this on after launch costs 2-3x what building it in costs. If you take one piece of advice from this guide beyond the architecture: log every tool call to a queryable store from day one.
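What "log every tool call to a queryable store" means in practice is roughly this. A sketch, assuming some store you already run (a Convex table, Postgres, whatever) behind a hypothetical logStore.insert:

```javascript
// Wrap tool execution so every call lands in a queryable log with enough
// context to reconstruct the run: which run, which tool, what arguments,
// whether it succeeded, and how long it took.
async function executeAndLogToolCall(runId, toolCall, articleRef) {
  const startedAt = Date.now();
  const result = await executeToolCall(toolCall.name, toolCall.arguments, articleRef);

  await logStore.insert('toolCallLogs', {
    runId,
    tool: toolCall.name,
    arguments: toolCall.arguments,
    success: result.success ?? true,
    durationMs: Date.now() - startedAt,
    createdAt: startedAt,
  });

  return result;
}
```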
Other cost drivers to plan for:
- Annual maintenance: 15-30% of initial build cost — models get deprecated, APIs change, prompts drift
- Inference cost: for high-volume agents, this can exceed engineering cost — role-based model routing (Step 2) is the single biggest lever
- Evals infrastructure: don't skip. An agent without evals is an agent you cannot improve
Conclusion
The agents we ship that deliver business value share a few traits: narrow goals, role-based model routing via OpenRouter, hand-rolled tool-calling loops instead of frameworks, vector retrieval only where it earns its keep, and relentless focus on the 70% of the work that isn't code.
If you're starting this week: pick one narrow task, build it as a tool-calling loop against OpenRouter, and ship it behind a feature flag to 5% of users. Iterate from there. The goal is production, not architecture.
If you're building complex, compliance-sensitive systems and want a team that's shipped this 75+ times — my team at Mobile Reality designs, builds, and operates these systems end-to-end. We'll bring the architecture; you bring the domain knowledge.
Related reading:
- Generative UI: How AI Creates Dynamic User Interfaces
- Business Automation with AI Agents
- Structured LLM Output Without JSON Schemas
Frequently Asked Questions
Should I use LangGraph, CrewAI, or another framework to build AI agents?
Skip heavyweight frameworks for most use cases. After 75+ production deployments, we consistently see that a 200-line hand-rolled tool-calling loop in TypeScript is easier to debug, cheaper to maintain, and gives you full control. Frameworks like LangGraph hide the state and control flow in abstractions that make production bugs nearly impossible to trace. They only make sense if you have dozens of engineers who need a common vocabulary.
How do I choose which LLM to use for my AI agent?
Don't lock into one vendor. Use a model router like OpenRouter and assign different models to different roles. Our production setup uses GLM-5 for the orchestrator (cheap, strong tool-calling), Kimi K2 for content writing (excellent prose), GPT-5.2 for SEO auditing (deep reasoning), and GPT-4.1-mini for lightweight tasks. This role-based routing cut our costs by 60% without any quality loss.
What kind of memory do I actually need for production AI agents?
Far less than you think. For short-term memory, use a simple array of messages that lives for the duration of the agent run—no Redis, no Mem0, no LangChain memory abstractions. Modern models have 200k+ context windows, and agent runs should be bounded anyway (we cap at 400 iterations). For persistent knowledge, use vector search with embeddings, but keep verified facts in a JSON file that tools return verbatim—similarity search is wrong for data that must be exactly right.
How do I prevent my AI agent from making errors or getting stuck in loops?
Build explicit guardrails into your hand-rolled loop, not framework-level abstractions. Our production pattern includes: an abort controller so every request is cancellable mid-stream, duplicate detection to break out when the same tool call repeats back-to-back, a hard cap of 400 tool calls per run, and six fallback matching strategies for text replacement tools (exact match, trimmed, unicode normalization, whitespace collapse, line-trimmed, and first-last-line anchor). These predictable failures beat hoping the model reasons its way out.
What is the 10-20-70 rule and why does it matter for AI agent projects?
The 10-20-70 rule, from BCG's digital transformation research, states that success depends on 10% algorithms, 20% technology infrastructure, and 70% people and process redesign. Teams routinely invert this—spending 80% on model perfection and 5% on rollout. The result is beautiful demos with zero business value. We saw one client with a perfect maintenance-scheduling agent stall at 12% adoption until we redesigned the escalation flow to route through their existing ticketing system—no code changes, just process. Adoption hit 80% within a month.
Discover more on AI-based applications and genAI enhancements
Artificial intelligence is revolutionizing how applications are built, enhancing user experiences, and driving business innovation. At Mobile Reality, we explore the latest advancements in AI-based applications and generative AI enhancements to keep you informed. Check out our in-depth articles covering key trends, development strategies, and real-world use cases:
- AI Development Costs 2026: Cut Budgets 3x With AI Tools
- The Role of AI in the Future of Software Engineering
- Unleash the Power of LLM AI Agents in Your Business
- Generative AI in software development
- AI Arbitrage Agency 2026: Scale Business Decisions 5X Faster
- Generate AI Social Media Posts for Free!
- Mastering Automated Lead Generation for Business Success
- Generative UI: AI-Driven User Interfaces Transforming Design
- Generative vs Agentic AI: Key Differences for Business 2026
Our insights are designed to help you navigate the complexities of AI-driven development, whether integrating AI into existing applications or building cutting-edge AI-powered solutions from scratch. Stay ahead of the curve with our expert analysis and practical guidance. If you need personalized advice on leveraging AI for your business, reach out to our team — we’re here to support your journey into the future of AI-driven innovation.
