Question 1

What is a custom LLM, and when is it worth building one instead of just calling an API?

Accepted Answer

A custom LLM is a small, open-source model fine-tuned on your data for one specific task, then self-hosted in your own infrastructure. It pays off when you run the same kind of request thousands of times (so a giant general model is overkill and the per-token bill hurts), when regulated or proprietary data can't leave your walls, when you'd rather pay a fixed monthly infrastructure cost than an unbounded usage meter, or when you need a specific output schema returned reliably. For low or moderate volume, a frontier API is usually cheaper and we'll tell you so. The economics flip in your favor at high, repetitive volume.
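The point where the economics flip can be estimated with a break-even calculation. The sketch below is only illustrative: the function name and every input (fixed monthly cost, tokens per request, API price) are placeholder assumptions, not quotes from any provider.

```python
def breakeven_requests_per_month(
    fixed_monthly_cost: float,
    tokens_per_request: int,
    api_price_per_million_tokens: float,
) -> float:
    """Requests per month at which a flat self-hosting bill equals the API bill."""
    api_cost_per_request = tokens_per_request * api_price_per_million_tokens / 1_000_000
    return fixed_monthly_cost / api_cost_per_request


# Placeholder inputs chosen for illustration only.
threshold = breakeven_requests_per_month(
    fixed_monthly_cost=2_000.0,
    tokens_per_request=1_500,
    api_price_per_million_tokens=10.0,
)
print(f"Self-hosting breaks even at about {threshold:,.0f} requests/month")
```

Below the threshold the API is cheaper; above it, the fixed cost wins. Engineering time and maintenance are left out of this sketch and would push the threshold higher.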

Question 2

Which open-source base models do you fine-tune?

Accepted Answer

We pick an open-source base model to fit the task and budget, mostly from the Llama 3, Gemma, and Mistral families. We start from the narrowest business case and choose the smallest model that meets the quality bar, because a smaller model is cheaper to host and faster to serve. We don't chase one giant model for everything.
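The "smallest model that meets the bar" rule can be written down as a simple selection step. This is a minimal sketch: the candidate names, parameter counts, and accuracy numbers are hypothetical placeholders, and in practice the accuracy would come from evaluating each candidate on your own held-out set.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    name: str                 # hypothetical label, not a real checkpoint
    params_billions: float    # model size
    heldout_accuracy: float   # measured on your own held-out examples


def smallest_passing(candidates: list[Candidate], quality_bar: float) -> Optional[Candidate]:
    """Return the smallest candidate whose held-out accuracy meets the bar."""
    passing = [c for c in candidates if c.heldout_accuracy >= quality_bar]
    if not passing:
        return None  # nothing qualifies; revisit the data or the bar
    return min(passing, key=lambda c: c.params_billions)


candidates = [
    Candidate("model-small", 1.0, 0.82),
    Candidate("model-medium", 8.0, 0.93),
    Candidate("model-large", 70.0, 0.95),
]
choice = smallest_passing(candidates, quality_bar=0.90)
print(choice.name if choice else "no candidate meets the bar")
```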

Question 3

How much does it actually cost to run a self-hosted LLM?

Accepted Answer

A self-hosted model is a fixed infrastructure cost, not a per-token tax. From our own work: GPU containers ran about $2.76/hour, so keeping a model live 24/7 is roughly $2,000/month ($2.76 × 24 hours × 30 days), or about half that if you run it only during extended business hours (around 12 hours a day). The catch is cold start, around 8 minutes, which we mitigate by keeping the function warm or using a managed runtime. Even so, for a high-volume use case that's dramatically cheaper than paying around $1,000/month per seat for a frontier model. Open-source doesn't mean free; it means a predictable cost you control.
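The hourly-to-monthly arithmetic can be sketched as follows. The $2.76/hour rate comes from the answer above; the 30-day month and the 12-hour partial schedule are assumptions of this sketch, so the output is an estimate and a real bill will differ with actual uptime and billing granularity.

```python
HOURLY_RATE_USD = 2.76  # GPU container rate quoted in the answer above


def monthly_cost(
    hours_per_day: float,
    days_per_month: int = 30,          # assumption: flat 30-day month
    hourly_rate: float = HOURLY_RATE_USD,
) -> float:
    """Estimated monthly cost of keeping one GPU container running."""
    return hourly_rate * hours_per_day * days_per_month


always_on = monthly_cost(24)   # live around the clock
half_day = monthly_cost(12)    # assumption: about 12 hours a day

print(f"24/7:        ${always_on:,.0f}/month")
print(f"12 hours/day: ${half_day:,.0f}/month")
```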

Question 4

How do you fine-tune on our data without exposing PII?

Accepted Answer

Fine-tuning runs on your data, so preparing it safely is the first step. We collect and structure task examples, then anonymize personally identifiable information (PII) before any data touches a training run. Deployment is on-premises or in your own cloud (AWS, Kubernetes, or a managed runtime like Bedrock), inside your network boundary and behind your existing auth, so every prompt and response stays inside your infrastructure. Nothing is sent to a third-party API, which is what makes regulated and proprietary workloads possible at all.
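As a rough illustration of the anonymization step, the sketch below masks e-mail addresses and phone-like numbers with regular expressions before an example is stored. The patterns and placeholder tags are assumptions made for this example, not the pipeline described above, and regexes alone are not sufficient: names, addresses, and IDs generally need entity recognition and human review.

```python
import re

# Deliberately simple patterns, for illustration only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def anonymize(text: str) -> str:
    """Replace e-mail addresses and phone-like numbers with placeholder tags."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text


sample = "Contact Jane at jane.doe@example.com or +1 (555) 010-9999."
print(anonymize(sample))
```

Note that the name "Jane" passes through untouched, which is exactly the kind of gap a production anonymizer has to close.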

Question 5

What if we don't have a training dataset yet?

Accepted Answer

That's common, and it doesn't block the project. We deploy agents into your process to collect and structure the training data first, then use that data to build the model. Examples of the task done correctly are the fuel for fine-tuning, so when you don't already have them, the agent-driven data pipeline becomes the first phase of the engagement rather than a prerequisite you have to solve on your own.
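However the examples are collected, they usually end up as structured records a training job can read. A minimal sketch, assuming a one-JSON-object-per-line (JSONL) layout with hypothetical `input` and `output` field names and made-up sample tickets:

```python
import json


def to_jsonl(examples: list[tuple[str, str]]) -> str:
    """Serialize (input, expected_output) pairs as one JSON object per line."""
    lines = []
    for task_input, expected_output in examples:
        if not task_input.strip() or not expected_output.strip():
            continue  # skip incomplete examples rather than train on them
        record = {"input": task_input, "output": expected_output}
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


collected = [
    ("Summarize ticket: login fails after reset", "Category: auth; Priority: high"),
    ("", "Category: billing; Priority: low"),  # incomplete, will be dropped
]
jsonl = to_jsonl(collected)
print(jsonl)
```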

Question 6

How accurate can a small fine-tuned model really get?

Accepted Answer

Every model goes through held-out evaluation before it ships. A concrete data point from our internal MDMA model: a first pass on the smallest possible model reached around 94% on the training eval. Held-out accuracy started at 40%, and adding three worked examples to a tiny system prompt lifted it to around 60%, which serves as our proof-of-concept baseline, with a clear path toward around 95% on a larger model. A small model fine-tuned on one task can match or beat a giant general model on that task: faster, cheaper, and more reliably formatted.
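The difference between a training score and a held-out score is easiest to see on a toy example. The sketch below is an assumption-laden illustration, not the evaluation harness used for the model above: it splits synthetic examples, then scores a stand-in "model" that has only memorized its training set, using exact-match accuracy.

```python
import random


def split_examples(examples, heldout_fraction=0.2, seed=0):
    """Shuffle deterministically and set aside a held-out slice."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n_heldout = max(1, int(len(shuffled) * heldout_fraction))
    return shuffled[n_heldout:], shuffled[:n_heldout]


def exact_match_accuracy(predict, examples):
    """Fraction of examples where the prediction equals the expected output."""
    correct = sum(1 for prompt, expected in examples if predict(prompt) == expected)
    return correct / len(examples)


# Synthetic data; a real set would hold task inputs and correct outputs.
examples = [(f"input-{i}", f"output-{i}") for i in range(10)]
train, heldout = split_examples(examples)
memory = dict(train)


def memorizing_model(prompt):
    """Stand-in model that only knows what it saw during training."""
    return memory.get(prompt, "unknown")


train_acc = exact_match_accuracy(memorizing_model, train)
heldout_acc = exact_match_accuracy(memorizing_model, heldout)
print(f"train: {train_acc:.0%}, held-out: {heldout_acc:.0%}")
```

A pure memorizer scores perfectly on training data and fails on unseen inputs, which is why only the held-out number says anything about how the model will behave in production.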

Question 7

Where do you deploy the model, and can it be air-gapped?

Accepted Answer

We deploy your private LLM where your data already lives: your own cloud (AWS, Kubernetes), a managed runtime like Bedrock, or an air-gapped environment, then integrate it with your internal systems. You get the model behind your own auth, network, and audit controls, with data residency and cost under your control. The model connects to your internal tools and data sources so it works inside your workflows instead of being a siloed chat box. You own the model weights and the deployment, with no dependency on one provider's pricing, rate limits, or roadmap.

Question 8

What is the DSL input format, and why does it matter?

Accepted Answer

Tokens cost money and add latency, so the input format matters as much as the model. We tested full JSON, compact JSON, and a custom compact DSL (domain-specific language) for the same task, and the shortest DSL format gave the best results on the fewest tokens. Users never write the DSL by hand: an agent or CLI generates it from a plain request. For repeated workflows with a stable schema, this is a big part of what makes self-hosting a small model economical, because fewer tokens per call means lower cost and faster responses at scale.
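To make the format comparison concrete, the sketch below encodes one record three ways and compares sizes. The record, the `to_dsl` function, and the `name:type!` syntax are invented for this example and are not the DSL described above. Character count is used as a rough proxy for tokens; a real comparison would use the target model's tokenizer.

```python
import json

# Hypothetical record describing a small form.
record = {
    "component": "form",
    "fields": [
        {"name": "email", "type": "text", "required": True},
        {"name": "age", "type": "number", "required": False},
    ],
}

full_json = json.dumps(record, indent=2)
compact_json = json.dumps(record, separators=(",", ":"))


def to_dsl(rec: dict) -> str:
    """Encode the record in a made-up compact syntax; "!" marks a required field."""
    parts = [rec["component"]]
    for field in rec["fields"]:
        marker = "!" if field["required"] else ""
        parts.append(f"{field['name']}:{field['type']}{marker}")
    return "|".join(parts)


dsl = to_dsl(record)
sizes = {
    "full_json": len(full_json),
    "compact_json": len(compact_json),
    "dsl": len(dsl),
}
print(dsl)
print(sizes)
```

The savings come from dropping repeated keys and punctuation, which only works because the schema is stable and known to both the generator and the model.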

Custom LLM Development Services

Why choose open-source LLMs?

Full data control

No vendor lock-in

Lower cost at scale

Security & compliance

See It In Action: Free PoC On Your Own Data

Our LLM fine-tuning process

1. Data preparation & PII anonymization

2. Base model selection

3. Supervised fine-tuning (SFT)

4. Input format engineering (DSL)

5. Reinforcement & preference tuning

6. Evaluation & deployment

Internal case study: the MDMA model

What this unlocks

Model ownership

Predictable cost

Data privacy

Task specialization

Agent-driven data pipeline

Low latency at scale

Own Your AI: Train a Model Built for Your Business

From business case to owned model

Define the business case

Prepare data & anonymize PII

Select base model

Fine-tune & evaluate

Deploy on-premise & optimize cost

Security & on-premise deployment

Your cloud, your control

Integrated with your systems

MDMA, our Generative UI engine

Why build a custom LLM with Mobile Reality?

When a custom LLM is the right call

What you get

Our recommendation

Frequently Asked Questions