Building an AI Arbitrage Stack: Our Production Engineering Blueprint

TL;DR — Key Takeaways

  • AI arbitrage stacks = signal generation (slow, LLM-driven) + execution (fast, deterministic). Don't put LLMs in the hot path.
  • Our real stack: Next.js + Cloudflare Workers + Durable Objects + PostgreSQL with pgvector + OpenRouter (with direct Anthropic/OpenAI fallback) + Langfuse for prompt versioning
  • Skip the Airflow/Kubeflow mythology — for 95% of AI arbitrage workloads, Cloudflare Queues + event-driven workers deliver the same outcome with a fraction of the operational burden
  • Prompts belong in Langfuse (versioned, A/B testable), not in code
  • State machines beat free-form agent loops when correctness matters more than flexibility

Introduction

In AI Arbitrage Agency: How Mobile Reality Delivers Scalable Intelligence, we laid out what AI arbitrage is and why it works: using machine intelligence to make high-frequency decisions no human team can match. In How to Build AI Agents, we went one layer deeper — the agent pattern we use inside those systems, built on OpenRouter and hand-rolled tool-calling loops instead of heavyweight frameworks.

This article goes one layer deeper still: the engineering stack that powers the most ambitious AI applications we've shipped. Two systems in particular — HyperIntelligence (collaborative AI workspace) and HyperFund (AI fundraising platform) — represent the architectural patterns we've battle-tested for high-throughput, low-latency AI operations. The same blueprint underpins arbitrage systems we've built for fintech and proptech clients.

If you've read other "AI arbitrage stack" guides, you've seen the standard shopping list: Airflow, Kubeflow, TFX, SageMaker, Vertex AI, Kafka, Spark Streaming. We run none of those in production. Here's what we actually use — and why.

The Two-Layer Principle: Signal vs. Execution

The single biggest architectural decision in any arbitrage stack is the separation of signal generation from execution.

  • Signal layer — identifies opportunities. Can be slow (seconds to minutes). Uses ML models, LLMs, statistical analysis. Tolerates occasional errors because downstream has checks.
  • Execution layer — acts on signals. Must be fast (milliseconds). Deterministic. Idempotent. Never calls an LLM in the hot path.

Teams that collapse these layers into a single "AI decides and acts" pipeline produce demos that look impressive and systems that lose money. We keep them strictly separate.

This mirrors the agent pattern from our agent-building guide: the orchestrator (signal) delegates writing to Kimi K2 (generation), while the actual text mutation happens in deterministic code. Same principle, different domain.
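
Here's the shape of that split as a pair of Cloudflare Workers handlers — a minimal sketch, with illustrative binding names and stubbed-out generateSignal / passesRiskChecks / placeOrder helpers rather than our production code:

```typescript
// Sketch: signal layer proposes via a scheduled handler; execution layer
// consumes from a queue with deterministic, rule-based checks. No LLM below the queue.
interface Signal {
  id: string;      // client-generated — used for idempotent execution
  asset: string;
  edgeBps: number;
  ts: number;
}

interface Env {
  SIGNAL_QUEUE: Queue<Signal>; // illustrative binding name
}

// Illustrative helpers — stand-ins for the real signal model and broker client.
declare function generateSignal(env: Env): Promise<Signal | null>;
declare function passesRiskChecks(s: Signal, env: Env): Promise<boolean>;
declare function placeOrder(s: Signal, env: Env): Promise<void>;

export default {
  // Signal layer: slow, LLM-driven, allowed to be wrong — it only proposes.
  async scheduled(_ctrl: ScheduledController, env: Env): Promise<void> {
    const signal = await generateSignal(env);        // LLMs + stats; seconds are fine
    if (signal) await env.SIGNAL_QUEUE.send(signal); // hand off — never execute inline
  },

  // Execution layer: fast, deterministic, idempotent — no LLM calls, ever.
  async queue(batch: MessageBatch<Signal>, env: Env): Promise<void> {
    for (const msg of batch.messages) {
      const s = msg.body;
      if (Date.now() - s.ts > 2_000) { msg.ack(); continue; }         // stale signal: drop
      if (!(await passesRiskChecks(s, env))) { msg.ack(); continue; } // rules as code
      await placeOrder(s, env);                                       // dedupes on s.id
      msg.ack();
    }
  },
};
```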

The Core Stack

Here's what we actually run across HyperIntelligence, HyperFund, and the arbitrage-style systems we build for clients. No Airflow. No Kubeflow. No TFX.

Runtime: Next.js + Cloudflare Workers + Durable Objects

Frontend is Next.js 15/16 on Vercel. Backend is split: Next.js API routes for simple request/response, and Cloudflare Workers for everything streaming, long-running, or stateful. For persistent connections (multi-turn chat state, continuous inference streams), Cloudflare Durable Objects hold the session context between calls.
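
In sketch form, the session-holding pattern looks like this — standard Durable Object storage API; the class and field names are illustrative:

```typescript
// Durable Object holding multi-turn chat state between requests.
// One instance per session ID, globally addressable — no external store needed.
interface ChatMsg { role: 'user' | 'assistant'; content: string }

export class SessionDO {
  constructor(private state: DurableObjectState) {}

  async fetch(req: Request): Promise<Response> {
    const history = (await this.state.storage.get<ChatMsg[]>('history')) ?? [];
    const { userMessage } = await req.json<{ userMessage: string }>();
    history.push({ role: 'user', content: userMessage });
    await this.state.storage.put('history', history); // survives between calls
    return Response.json({ turns: history.length });
  }
}
```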

Why not AWS Lambda? Cold starts. Cloudflare Workers' V8 isolates boot in <5 ms; a Node Lambda cold-starts in 100-500 ms. For signal generation that runs hundreds of times per minute, that gap compounds.

Why not a monolithic Node backend on ECS? Because arbitrage workloads are bursty. Cloudflare Workers scale to zero between bursts and to thousands of concurrent isolates during them — no autoscaling config, no idle cost.

Data: PostgreSQL + pgvector, Not a Separate Vector DB

Both HyperIntelligence and HyperFund run on PostgreSQL (Supabase-hosted in HyperFund, self-managed in HyperIntelligence) with the pgvector extension for embeddings. No Pinecone. No Weaviate. No Qdrant.

This is the same conclusion we reached for the knowledge layer in our CMS agent — a built-in vector store inside your existing DB beats a separate service until you're operating at tens of millions of vectors. For arbitrage signal stores (historical spreads, past trade outcomes, market regime embeddings), Postgres + pgvector is the right floor.

Embeddings are generated via OpenRouter (openai/text-embedding-3-small, 1536 dims) and written asynchronously through a Cloudflare Queue (EMBEDDING_QUEUE) in HyperIntelligence. Ingestion never blocks inference.
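
A sketch of that async path — assuming OpenRouter's OpenAI-compatible embeddings endpoint and an illustrative Postgres binding on the consumer Worker:

```typescript
// Queue consumer: generate embeddings off the request path, write to pgvector.
interface EmbedJob { docId: string; text: string }

interface Env {
  OPENROUTER_API_KEY: string;
  SQL: { query(text: string, params: unknown[]): Promise<unknown> }; // illustrative PG client
}

export default {
  async queue(batch: MessageBatch<EmbedJob>, env: Env): Promise<void> {
    for (const msg of batch.messages) {
      const res = await fetch('https://openrouter.ai/api/v1/embeddings', {
        method: 'POST',
        headers: {
          Authorization: `Bearer ${env.OPENROUTER_API_KEY}`,
          'Content-Type': 'application/json',
        },
        body: JSON.stringify({
          model: 'openai/text-embedding-3-small', // 1536 dims
          input: msg.body.text,
        }),
      });
      const { data } = await res.json<{ data: { embedding: number[] }[] }>();
      // pgvector accepts the '[0.1,0.2,...]' literal form for vector columns
      await env.SQL.query(
        'UPDATE documents SET embedding = $1 WHERE id = $2',
        [JSON.stringify(data[0].embedding), msg.body.docId],
      );
      msg.ack(); // failures just retry — inference never waits on this
    }
  },
};
```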

For object storage (uploaded documents, generated artifacts, market data snapshots), we use Cloudflare R2. S3-compatible API, zero egress fees, co-located with Workers.

LLM Access: OpenRouter First, Direct SDKs for Specialized Features

Our default is OpenRouter — one API, 300+ models, per-request model selection. Role-based routing lives in a presets.ts config that maps roles to model IDs with pricing metadata baked in:

  • Fast reasoning / tool calling → z-ai/glm-5 or openai/gpt-4.1-mini
  • Long-form generation → moonshotai/kimi-k2 or anthropic/claude-sonnet-4-6
  • Deep analysis → openai/gpt-5.2 or anthropic/claude-opus-4-7
  • Embeddings → openai/text-embedding-3-small
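
A condensed sketch of what such a presets.ts can look like — the model IDs come from the list above, while the pricing numbers are illustrative placeholders, not quotes:

```typescript
// presets.ts sketch: role → model map with pricing metadata baked in.
export type Role = 'fast' | 'longform' | 'deep' | 'embeddings';

export interface ModelPreset {
  primary: string;
  fallback?: string;
  inputPerMTok: number;  // USD per 1M input tokens (illustrative)
  outputPerMTok: number; // USD per 1M output tokens (illustrative)
}

export const presets: Record<Role, ModelPreset> = {
  fast:       { primary: 'z-ai/glm-5', fallback: 'openai/gpt-4.1-mini', inputPerMTok: 0.5, outputPerMTok: 1.5 },
  longform:   { primary: 'moonshotai/kimi-k2', fallback: 'anthropic/claude-sonnet-4-6', inputPerMTok: 1, outputPerMTok: 3 },
  deep:       { primary: 'openai/gpt-5.2', fallback: 'anthropic/claude-opus-4-7', inputPerMTok: 5, outputPerMTok: 15 },
  embeddings: { primary: 'openai/text-embedding-3-small', inputPerMTok: 0.02, outputPerMTok: 0 },
};
```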

But OpenRouter isn't parity for every feature. When we need Anthropic's extended thinking or prompt caching (10× read discount, 1.25× write premium baked into our cost estimates), we call the Anthropic SDK directly. When we need Cerebras for sub-second inference on specific reasoning paths (HyperFund uses this), we call Cerebras directly.
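
The direct-SDK path for prompt caching looks roughly like this — cache_control is Anthropic's published API; the cached context variable is an illustrative stand-in:

```typescript
import Anthropic from '@anthropic-ai/sdk';

declare const LARGE_STABLE_CONTEXT: string; // e.g. regime docs reused across many calls

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

const res = await anthropic.messages.create({
  model: 'claude-sonnet-4-6',
  max_tokens: 1024,
  system: [
    {
      type: 'text',
      text: LARGE_STABLE_CONTEXT,
      cache_control: { type: 'ephemeral' }, // 1.25× to write, ~0.1× to read back
    },
  ],
  messages: [{ role: 'user', content: 'Score this spread opportunity.' }],
});
```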

Pattern: OpenRouter for breadth and cost flexibility, direct SDKs when a provider's unique feature justifies the integration cost.

Signal Layer: Agent Loops or Finite State Machines (Pick Carefully)

Two patterns, two different use cases:

Tool-calling agent loop — HyperIntelligence's runAgentStream() runs the same while-loop we described in the agent-building guide: model call → tool call → tool result → model call, repeat until done. Tools include web_scrape (via Firecrawl), document_read, knowledge_search (pgvector lookup), phase_transition. Great when the flow is open-ended and the model should decide next steps.
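
Compressed to its essentials, the loop looks like this — non-streaming for brevity, assuming an OpenAI-compatible OpenRouter client and an illustrative tool registry:

```typescript
import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'https://openrouter.ai/api/v1',
  apiKey: process.env.OPENROUTER_API_KEY,
});

type ToolImpls = Record<string, (args: unknown) => Promise<unknown>>;

async function runAgent(
  messages: OpenAI.ChatCompletionMessageParam[],
  tools: OpenAI.ChatCompletionTool[],
  impls: ToolImpls,
): Promise<string | null> {
  for (let step = 0; step < 20; step++) {            // hard cap on iterations
    const res = await client.chat.completions.create({
      model: 'moonshotai/kimi-k2',
      messages,
      tools,
    });
    const msg = res.choices[0].message;
    messages.push(msg);
    if (!msg.tool_calls?.length) return msg.content; // no tool call → model is done
    for (const call of msg.tool_calls) {             // run each requested tool
      const result = await impls[call.function.name](JSON.parse(call.function.arguments));
      messages.push({ role: 'tool', tool_call_id: call.id, content: JSON.stringify(result) });
    }
  }
  throw new Error('Agent exceeded step budget');
}
```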

Finite state machine (FSM) — HyperFund uses typescript-fsm to drive multi-phase workflows: Welcome → Analyzing → Generating → Generated, with explicit transitions and fallback routes. Each state calls a different model via a different prompt loaded from Langfuse. Better when the flow is known, correctness matters, and you need test coverage per transition.

For arbitrage signal generation, we lean FSM. "Detect regime → compute features → score opportunity → verify → emit" is a known sequence. Agent loops shine for exploratory analysis (why did the spread blow out last Tuesday?), not for hot signal paths.
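
Here's that sequence as a typescript-fsm machine — a sketch with illustrative state names and stubbed handlers; t() wires fromState, event, toState, and an optional callback:

```typescript
import { StateMachine, t } from 'typescript-fsm';

enum S { Idle, DetectingRegime, ComputingFeatures, Scoring, Verifying, Emitted, Failed }
enum E { Tick, RegimeDetected, FeaturesReady, Scored, Verified, Error }

// Handlers are plain async functions — each one unit-testable in isolation.
declare function detectRegime(): Promise<void>;
declare function computeFeatures(): Promise<void>;
declare function scoreOpportunity(): Promise<void>;
declare function verifySignal(): Promise<void>;
declare function emitSignal(): Promise<void>;

const signal = new StateMachine<S, E>(S.Idle, [
  t(S.Idle,              E.Tick,           S.DetectingRegime,   detectRegime),
  t(S.DetectingRegime,   E.RegimeDetected, S.ComputingFeatures, computeFeatures),
  t(S.ComputingFeatures, E.FeaturesReady,  S.Scoring,           scoreOpportunity),
  t(S.Scoring,           E.Scored,         S.Verifying,         verifySignal),
  t(S.Verifying,         E.Verified,       S.Emitted,           emitSignal),
  t(S.Scoring,           E.Error,          S.Failed),           // explicit fallback routes
  t(S.Verifying,         E.Error,          S.Failed),
]);

await signal.dispatch(E.Tick); // every transition is an assertable test case
```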

Execution Layer: Deterministic, Idempotent, Boring

The execution layer does not call LLMs. It takes signals as input and applies rules as code:

  • Position sizing — formula-driven (Kelly fraction / volatility-adjusted). No model inference.
  • Slippage budget — reject fills worse than N bps of target price. Hard-coded.
  • Exposure limits — sum of open positions < max per asset / per venue / per strategy. Checked on every order.
  • Idempotency — every order carries a client-generated ID; the execution worker dedupes on retry.
  • Kill switch — a Durable Object holds a single enabled: boolean that every execution worker reads before firing. Flip it to halt everything in <1 second.

This is where Cloudflare Workers + Durable Objects shine. The kill switch state lives in one Durable Object; every Worker instance reads it on every decision. No distributed consensus, no Redis, no Zookeeper — just a single-instance DO that's globally addressable.
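
A sketch of that kill switch using the standard Durable Object APIs — binding and route names are illustrative:

```typescript
// Single global Durable Object: the one source of truth for "are we trading?"
export class KillSwitch {
  constructor(private state: DurableObjectState) {}

  async fetch(req: Request): Promise<Response> {
    if (req.method === 'POST') { // flip the switch: POST { "enabled": false }
      const { enabled } = await req.json<{ enabled: boolean }>();
      await this.state.storage.put('enabled', enabled);
    }
    const enabled = (await this.state.storage.get<boolean>('enabled')) ?? false;
    return Response.json({ enabled });
  }
}

interface Env { KILL_SWITCH: DurableObjectNamespace }

// Every execution Worker calls this before firing an order.
async function tradingEnabled(env: Env): Promise<boolean> {
  const stub = env.KILL_SWITCH.get(env.KILL_SWITCH.idFromName('global'));
  const res = await stub.fetch('https://do/kill-switch'); // one global instance
  const { enabled } = await res.json<{ enabled: boolean }>();
  return enabled;
}
```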

Observability: Langfuse + PostHog, Not Datadog Everywhere

  • Langfuse — every LLM call goes through it. Full trace: prompt version, input, output, token cost, latency. Prompts themselves live in Langfuse (versioned, A/B testable, editable without deploys). If signal quality drops, we compare trace distributions across prompt versions to find the regression.
  • PostHog — product analytics + @posthog/ai for inference tracing with user context.
  • Pino + OpenTelemetry — structured logs from Workers, exported to whatever OTel collector fits the client's existing observability stack.

The prompt-versioning point is non-negotiable. Hard-coded prompts are where AI projects go to die. After six months you have fifteen engineers tweaking strings in fifteen files with no versioning, no rollback, no A/B. Put them in Langfuse on day one.
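
Day one looks roughly like this — a sketch assuming Langfuse's JS SDK getPrompt/compile flow, with an illustrative prompt name and variables:

```typescript
import { Langfuse } from 'langfuse';

const langfuse = new Langfuse(); // keys from LANGFUSE_* environment variables

// Fetch the production-labeled version — editable and rollback-able in the
// Langfuse UI, no deploy required.
const prompt = await langfuse.getPrompt('signal-scorer', undefined, { label: 'production' });
const compiled = prompt.compile({ asset: 'ETH/USDC', regime: 'high-vol' });
// Send `compiled` to the model and tag the generation with the prompt version,
// so trace distributions can be compared across versions when quality drops.
```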

Testing: PromptFoo for LLM, Vitest for Code

HyperFund ships PromptFoo test suites for its LLM routing layer — explicit test cases for router decisions, guidance generation, and output verification. Not heuristics, not vibes — a test harness that fails CI when a prompt change regresses behavior.

For an arbitrage system, the equivalent is a backtest harness with frozen market data: deterministic replay of past scenarios against the current signal model. If the new model/prompt version underperforms the production version on historical data, the PR doesn't merge.
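
A sketch of that gate — all model and data names are illustrative stand-ins for whatever your harness loads:

```typescript
// CI backtest gate: replay frozen scenarios through candidate vs. production
// signal models; block the merge on regression.
interface Scenario { features: number[]; realizedPnlBps: number }

declare const frozenScenarios: Scenario[];                       // loaded by the harness
declare function candidateModel(f: number[]): Promise<number>;   // PR version
declare function productionModel(f: number[]): Promise<number>;  // deployed version

const THRESHOLD = 0.7; // only count trades the model would actually have taken

async function backtest(score: (f: number[]) => Promise<number>): Promise<number> {
  let pnlBps = 0;
  for (const s of frozenScenarios) {
    if ((await score(s.features)) > THRESHOLD) pnlBps += s.realizedPnlBps;
  }
  return pnlBps;
}

const [candidate, production] = await Promise.all([
  backtest(candidateModel),
  backtest(productionModel),
]);
if (candidate < production) {
  throw new Error(`Backtest regression: ${candidate} bps < ${production} bps — blocking merge`);
}
```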

What We Deliberately Don't Use (And Why)

  • Apache Airflow — Python-centric DAG runner with heavyweight infra. Cloudflare Queues + scheduled Workers do 95% of the job with 5% of the operational burden.
  • Kubeflow / Kubernetes for ML — Unless you're retraining foundation models, a managed vector DB + managed LLM APIs remove the entire "run our own Kubernetes" requirement.
  • TFX / full ML pipelines — Most arbitrage signal models fit in a Python script + a cron job. Full TFX pipelines are cargo cult for teams that aren't doing full-lifecycle ML.
  • LangChain / LangGraph — We explained this in the agent-building guide: a 200-line tool-calling loop is easier to debug than a framework that abstracts the prompt, state, and control flow.
  • Kafka — Cloudflare Queues cover 99% of event-driven ingestion for signal generation. Kafka makes sense at petabyte scale, not at arbitrage-signal scale.
  • Separate vector DB (Pinecone, Weaviate, Qdrant) — pgvector in your existing Postgres is free and handles millions of vectors. Migrate only when you hit actual bottlenecks.
  • TensorFlow.js / ONNX.js in the browser — Arbitrage inference doesn't belong on the client. Run models server-side and stream results.

Risk Management: The Unsexy Part That Actually Matters

Risk layer components we implement in every production arbitrage system:

  • Position sizing — Kelly fraction capped at 25% of full-Kelly, volatility-adjusted. Formula-driven, no ML.
  • Stop-loss / trailing stop — hard limit per position, enforced at execution layer regardless of signal.
  • Exposure caps — per asset, per venue, per strategy, per correlated-asset-cluster. Pre-trade check.
  • Circuit breakers — if realized P&L crosses a daily drawdown threshold, the kill-switch DO flips and execution halts. Human must re-enable.
  • Model staleness detection — every signal model has a max age. If features haven't been recomputed recently enough, signals are rejected.
  • Inventory hedging — for market-making style strategies, hedge positions trigger automatically when inventory skews beyond configured bands.
  • Audit trail — every decision (signal generated, order placed, order rejected, fill received) written to an append-only log. Regulators will ask.

Most of these have direct analogues in non-trading AI systems: output safety checks, rate limits, drift detection, explainable denials. The discipline transfers.
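
As a worked example of the first item on the list above — fractional Kelly with a volatility adjustment; parameter names and the vol-scaling choice are illustrative:

```typescript
// Formula-driven position sizing: 25% of full Kelly, scaled down when
// realized vol exceeds the portfolio's vol target. No model inference.
function positionSize(params: {
  winProb: number;      // p, from the signal model's calibrated score
  payoffRatio: number;  // b, average win / average loss
  equity: number;       // account equity in quote currency
  vol: number;          // realized volatility of the asset
  targetVol: number;    // portfolio volatility target
}): number {
  const { winProb: p, payoffRatio: b, equity, vol, targetVol } = params;
  const fullKelly = (b * p - (1 - p)) / b;                 // f* = (bp − q) / b
  const capped = Math.max(0, Math.min(fullKelly, 1)) * 0.25; // 25% of full Kelly
  const volScale = Math.min(1, targetVol / vol);           // shrink when vol is elevated
  return equity * capped * volScale;
}
```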

Decision Framework: When to Build This Stack

Not every business needs the full stack above. Build it when:

  • Your decision frequency is seconds to minutes (not hours or days)
  • You have clean, high-frequency data — arbitrage needs statistical significance
  • Your margins can absorb 3-6 months of infrastructure build before revenue offsets cost
  • Someone internally owns AI/ML ops literacy — this stack requires it

Skip it when:

  • Decisions take days/weeks and analysts do fine
  • Regulatory rules require human approval for every decision
  • Volume is low enough that a Google Sheet + periodic review wins
  • You're in the "we should use AI somehow" phase — build something narrower first (start with a focused AI agent)

As we argued in the AI arbitrage agency post, we'll tell you in the discovery call if your use case doesn't warrant this architecture. Overbuilding is the most common failure mode we see.

Cost and Timeline

  • Signal-only MVP (pgvector + single-model pipeline + basic dashboard) — $5,000–$15,000, 2-3 weeks
  • Signal + manual execution (above + audit trail + approval UI) — $15,000–$30,000, 4-6 weeks
  • Full autonomous (signal + execution + risk layer + observability + backtest harness) — $30,000–$60,000+, 2-4 months

The two biggest cost levers: build observability in from day one (retrofitting it costs 2-3× as much) and version prompts in Langfuse from the start (hard-coded prompts are the #1 source of post-launch chaos).

Conclusion

The arbitrage stack we've converged on — Next.js + Cloudflare Workers + Durable Objects + Postgres/pgvector + OpenRouter + Langfuse — is the product of ~75 production deployments across fintech, proptech, and our own AI-native products (HyperIntelligence, HyperFund, Flaree). It's deliberately boring, deliberately small, and deliberately free of the heavyweight ML infrastructure most guides recommend.

If you're thinking about building an AI arbitrage system, start with our agency-level overview to decide whether arbitrage fits your business. Then read our agent-building guide to understand the core loop. Use this article to design the surrounding infrastructure.

If you want us to build it — my team at Mobile Reality designs, builds, and operates these systems end-to-end. The discovery call is free; the honest assessment is faster than the sales pitch.

Frequently Asked Questions

Why separate signal generation from execution layers?

Signal generation identifies opportunities using LLMs and statistical analysis that can tolerate occasional errors and run slowly (seconds to minutes), whereas execution acts on signals and must be deterministic, idempotent, and fast (milliseconds) without ever calling an LLM in the hot path. Teams that collapse these layers create impressive demos but production systems that lose money because inference latency and unpredictability corrupt time-sensitive decision-making.

Why skip heavyweight orchestration like Airflow, Kubeflow, or Kafka?

For 95% of AI arbitrage workloads, Cloudflare Queues combined with event-driven Workers deliver identical outcomes with a fraction of the operational burden, eliminating cold starts (sub-5ms versus 100-500ms on AWS Lambda) and scaling to zero between bursts without autoscaling configuration. Most arbitrage signal models fit in a Python script plus a cron job, making heavyweight pipeline tools like Airflow or Kubernetes cargo cult infrastructure unless you are retraining foundation models.

Should I use a separate vector database like Pinecone or Weaviate?

No — PostgreSQL with the pgvector extension handles millions of vectors without the operational overhead of a separate service, co-locating embeddings with your existing transaction data and eliminating network hops during inference. Both HyperIntelligence and HyperFund rely on pgvector for embeddings; migrate to a dedicated vector store only when you actually hit tens of millions of vectors and the bottlenecks become real rather than speculative.

What risk management components are essential for an AI arbitrage system?

Every production system requires formula-driven position sizing (such as Kelly fraction capped at 25%), hard stop-loss limits enforced at the execution layer regardless of signal, exposure caps per asset and venue, circuit breakers that flip a kill-switch Durable Object when daily drawdown thresholds are breached, and an append-only audit trail for every decision. Additional safeguards include model staleness detection to reject signals built on outdated features and automatic inventory hedging when positions skew beyond configured bands.

How do I know if my business needs this full architecture?

Build this stack only if your decision frequency is seconds to minutes, you have clean high-frequency data supporting statistical significance, your margins can absorb three to six months of infrastructure investment before revenue offsets cost, and someone internally owns AI/ML ops literacy. Skip it for decisions taking days or weeks, regulatory environments requiring human approval for every action, or low volumes where a Google Sheet with periodic review wins.
