Civilized Infrastructure

Self-hosting LLMs without setting your money on fire

Feb 14, 2026

15 min read

The pitch for self-hosted inference is usually framed as a privacy story. Your data stays on your infra, you control the model behavior, you don't have vendor lock-in. All true. But the more compelling case for most startups is the economics.

At low volume, hosted APIs are obviously the right choice. The economics flip somewhere between 100K and 10M tokens per day depending on your model size, traffic patterns, and GPU procurement strategy. Most AI-first products hit that inflection point earlier than they expect.

The real cost of hosted APIs

The headline cost per million tokens is not what you actually pay. The real cost includes:

  • context window inflation: long system prompts, conversation history, and retrieved context are all billed as input tokens on every call
  • retry costs when you hit rate limits
  • the premium you pay for availability SLAs you probably don't need
  • the tax of not being able to optimize the model for your specific use case
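To see how fast the headline number inflates, here is a back-of-the-envelope sketch. Every figure below is an illustrative assumption, not a quote from any provider; plug in your own token counts and prices.

```python
# Effective cost per million *useful* output tokens once overhead is counted.
# All numbers are illustrative assumptions, not real provider pricing.
input_price = 3.00    # $ per 1M input tokens (assumed)
output_price = 15.00  # $ per 1M output tokens (assumed)

system_prompt = 1_500  # tokens resent with every call
history = 4_000        # conversation context resent with every call
retrieved = 2_000      # RAG context resent with every call
user_message = 200     # what the user actually typed
output = 500           # useful output tokens per call
retry_rate = 0.05      # fraction of calls retried after rate limits

input_tokens = system_prompt + history + retrieved + user_message
cost_per_call = (input_tokens * input_price + output * output_price) / 1e6
cost_per_call *= 1 + retry_rate  # retries rebill the full context

print(f"${cost_per_call:.4f}/call -> "
      f"${cost_per_call / output * 1e6:.0f} per 1M useful output tokens")
# With these assumptions: ~$0.032/call, ~$64 per 1M useful output tokens,
# versus the $15 headline output price.
```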

When self-hosting makes sense

The threshold is lower than most people think. If you're processing more than 1M tokens per day and your use case tolerates 7B-13B model quality, you should be running the numbers on self-hosted inference. And if your use case requires behavioral control (a consistent persona, specific refusals, domain specialization), the control argument often outweighs the cost argument.
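"Running the numbers" can be as simple as the break-even sketch below. The blended hosted price, GPU rental rate, and throughput figures are assumptions for illustration; substitute your own quotes and benchmarks.

```python
# Break-even: hosted API spend vs. renting one GPU around the clock.
# All figures are illustrative assumptions.
hosted_price = 5.00  # $ per 1M tokens, blended input/output (assumed)
gpu_hourly = 1.20    # $ per hour for one A100/L40S-class GPU (assumed)
throughput = 2_000   # tokens/sec sustained with batching (assumed, model-dependent)

gpu_daily = gpu_hourly * 24
capacity = throughput * 86_400  # tokens/day one GPU can serve
breakeven = gpu_daily / hosted_price * 1e6

print(f"GPU: ${gpu_daily:.2f}/day, up to {capacity / 1e6:.0f}M tokens/day")
print(f"Break-even vs. hosted: {breakeven / 1e6:.1f}M tokens/day")
# With these assumptions the lines cross near 5.8M tokens/day -- a fraction
# of one GPU's ~173M tokens/day capacity, which is why the flip point lands
# inside the 100K-10M range rather than far beyond it.
```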

The practical setup

The stack we've settled on for most engagements (a sketch of how the pieces fit together follows the list):

  • vLLM for serving — best throughput/latency tradeoff for batched inference in production
  • RunPod Serverless for compute — GPU pricing that scales with traffic, cold starts under 15 seconds with the right configuration
  • LoRA adapters for customization — fine-tune without merging, swap without redeploying the base model
  • FastAPI in front for auth, routing, and the business logic layer
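Wired together, the stack looks roughly like this. A minimal sketch, assuming vLLM's OpenAI-compatible server is running on its default port with one LoRA adapter registered; the model, adapter name, paths, and key handling are all placeholders.

```python
# Assumed setup: vLLM's OpenAI-compatible server runs separately, e.g.
#   vllm serve meta-llama/Llama-3.1-8B-Instruct \
#       --enable-lora --lora-modules support-bot=/adapters/support-bot
# "support-bot" and the paths are placeholders for your own adapter.
import httpx
from fastapi import FastAPI, Header, HTTPException

app = FastAPI()
VLLM_URL = "http://localhost:8000/v1/chat/completions"  # vLLM's default port
VALID_KEYS = {"sk-demo"}  # stand-in for real key storage

@app.post("/v1/chat")
async def chat(payload: dict, authorization: str = Header(default="")):
    # Auth and routing live here, not in vLLM.
    if authorization.removeprefix("Bearer ") not in VALID_KEYS:
        raise HTTPException(status_code=401, detail="invalid API key")

    # Requests address a LoRA adapter by the name registered via --lora-modules.
    payload["model"] = "support-bot"

    async with httpx.AsyncClient(timeout=60.0) as client:
        resp = await client.post(VLLM_URL, json=payload)
        resp.raise_for_status()
        return resp.json()
```

Because requests select an adapter by name, swapping in a new fine-tune means registering another adapter, not redeploying the base model.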

The cold start problem is real but solvable. The key is keeping at least one warm worker running at all times (the cost is small), and designing your application to handle latency variance gracefully.
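Here is what handling latency variance gracefully can look like on the client side, as a minimal sketch: a tight timeout for the common warm-worker case, and one retry with a budget sized for a cold start. The endpoint URL and both budgets are assumptions to tune against your own measurements.

```python
import httpx

ENDPOINT = "https://inference.example.com/v1/chat"  # placeholder URL
WARM_BUDGET = 5.0         # seconds: typical latency against a warm worker (assumed)
COLD_START_BUDGET = 25.0  # seconds: ~15s cold start plus inference time (assumed)

async def generate(payload: dict) -> dict:
    async with httpx.AsyncClient() as client:
        try:
            # Common case: a warm worker answers quickly. Fail fast if not.
            resp = await client.post(ENDPOINT, json=payload, timeout=WARM_BUDGET)
        except httpx.TimeoutException:
            # Likely a cold start: retry once with a budget that covers spin-up.
            resp = await client.post(ENDPOINT, json=payload, timeout=COLD_START_BUDGET)
        resp.raise_for_status()
        return resp.json()
```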

What you don't get

Self-hosting gives you cost control and behavioral control. It doesn't give you the latest model weights, safety research, or the continuous improvement that hosted API providers invest in. For most production AI products, that's an acceptable tradeoff. For products where model quality is the primary differentiator, it might not be.