AI / Self-Hosted LLM / Mental Health Tech
Guide
An AI companion for first responders.
Client: Guide
Role: Fractional CTO, AI Architecture
Timeline: 2024 – Ongoing
The problem
First responders die by suicide at rates well above the general population, and most won't talk to a therapist. Guide's bet is that a peer community, anchored by an always-on AI companion, can reach people that traditional mental health tools can't.
The product needed an AI that could do something most chatbots can't: hold a persona, participate in small group conversations like a member rather than a tool, and stay available at 3am when a paramedic is alone in the truck after a bad call.
The constraints were real:
The persona had to be consistent across thousands of conversations and never break character into "as an AI assistant" mode.
The unit economics had to work for a mental health startup, not a Series C SaaS company.
The data was sensitive. Off-the-shelf hosted APIs were a non-starter for both privacy and behavioral control.
The approach
We made three architectural calls early that defined the build.
Self-hosted inference, not API calls.
Hosted LLMs were ruled out for privacy, persona drift, and cost at scale. We stood up a vLLM-based serving stack on RunPod Serverless, which gave us GPU pricing that flexed with traffic and full control over model behavior.
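What that looks like in practice, as a minimal sketch: a RunPod serverless worker that loads the engine once at cold start and reuses it across requests. The model name and sampling parameters here are illustrative, not Guide's actual configuration.

```python
import runpod
from vllm import LLM, SamplingParams

# Loaded once per worker at cold start; every request after that reuses
# the warm engine, which is what makes the serverless economics work.
llm = LLM(model="meta-llama/Llama-2-13b-chat-hf")  # illustrative base model

def handler(job):
    prompt = job["input"]["prompt"]
    params = SamplingParams(temperature=0.7, max_tokens=256)
    outputs = llm.generate([prompt], params)
    return {"text": outputs[0].outputs[0].text}

runpod.serverless.start({"handler": handler})
```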
LoRA adapters, not merged weights.
Instead of fine-tuning and merging a custom model, we serve the base model with a hot-swappable LoRA adapter. This gave us tighter control over the persona scaling factor at inference time, and the ability to iterate on the persona without retraining and redeploying a 13B+ parameter model every cycle.
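In vLLM, that swap is a per-request argument rather than a redeploy. A sketch, with placeholder adapter names and paths; the effective persona scale comes from lora_alpha / r in the adapter's own config, which in this sketch is the tuning knob between iterations:

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# The base model loads once; adapters attach per request.
llm = LLM(model="meta-llama/Llama-2-13b-chat-hf", enable_lora=True, max_lora_rank=64)

persona = LoRARequest(
    lora_name="persona_v3",            # placeholder adapter name
    lora_int_id=3,
    lora_path="/adapters/persona_v3",  # scaling is read from the
                                       # adapter_config.json at this path
)

outputs = llm.generate(
    ["<prompt assembled by the layer above the model>"],
    SamplingParams(temperature=0.7, max_tokens=256),
    lora_request=persona,
)
print(outputs[0].outputs[0].text)
```

Shipping a new persona version is uploading a new adapter directory and changing one name, not retraining or redeploying the base model.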
Persona as architecture, not prompt.
The persona isn't a system prompt bolted onto a generic model. It's a fine-tuned adapter trained on curated dialogue, with conversation memory and group dynamics handled in a layer above the model.
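A rough sketch of what "a layer above the model" means here; the table and column names are hypothetical, not Guide's schema. The group layer pulls recent turns from Postgres and renders them as a transcript in which the companion is just another named member, so the adapter completes the next turn in character:

```python
import psycopg2

COMPANION_NAME = "Sam"  # hypothetical persona name, for illustration

def build_group_context(conn, group_id: int, max_turns: int = 30) -> str:
    """Render recent group history as a transcript the adapter continues,
    with the companion appearing as an ordinary member."""
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT author_name, body FROM messages
            WHERE group_id = %s
            ORDER BY created_at DESC LIMIT %s
            """,
            (group_id, max_turns),
        )
        rows = list(reversed(cur.fetchall()))  # oldest first
    transcript = "\n".join(f"{author}: {body}" for author, body in rows)
    # The persona's identity comes from the fine-tuned adapter, not from
    # instructions embedded in this string.
    return f"{transcript}\n{COMPANION_NAME}:"
```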
The work
The serving stack
vLLM for high-throughput batched inference
LoRA adapter loaded at runtime with configurable scaling
RunPod Serverless for cold-start-tolerant GPU allocation
FastAPI gateway in front for auth, rate limits, and routing (sketched after this list)
Postgres for conversation state and group context
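A stripped-down sketch of that gateway, assuming vLLM is exposed through its OpenAI-compatible HTTP server; the in-memory rate limiter and env var names are illustrative (production would back this with Redis or similar):

```python
import os
import time

import httpx
from fastapi import FastAPI, Header, HTTPException

app = FastAPI()
VLLM_URL = os.environ.get("VLLM_URL", "http://vllm:8000/v1/completions")
API_KEYS = set(filter(None, os.environ.get("GATEWAY_KEYS", "").split(",")))
_last_call: dict[str, float] = {}

def check_auth(key: str) -> None:
    if key not in API_KEYS:
        raise HTTPException(status_code=401, detail="invalid key")

def check_rate(key: str, min_interval: float = 1.0) -> None:
    # Naive per-key throttle, for illustration only.
    now = time.monotonic()
    if now - _last_call.get(key, 0.0) < min_interval:
        raise HTTPException(status_code=429, detail="slow down")
    _last_call[key] = now

@app.post("/generate")
async def generate(payload: dict, x_api_key: str = Header(...)):
    check_auth(x_api_key)
    check_rate(x_api_key)
    async with httpx.AsyncClient(timeout=60.0) as client:
        resp = await client.post(VLLM_URL, json=payload)
    resp.raise_for_status()
    return resp.json()
```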
The application stack
React Native client (iOS and Android)
Backend on Render with deep-link infrastructure on links.theguideapp.com using apple-app-site-association and assetlinks.json
PostgreSQL migrated from AWS RDS to Render via logical replication: zero downtime, no dump-and-restore (sketched below)
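The replication move deserves a sketch because it is so much calmer than a dump-and-restore. DSNs and object names below are placeholders; it assumes wal_level=logical on the RDS source and the schema already copied to the target with pg_dump --schema-only:

```python
import psycopg2

SOURCE_DSN = "host=old-rds.example.com dbname=app user=admin"     # placeholder
TARGET_DSN = "host=new-render.example.com dbname=app user=admin"  # placeholder

# Source (RDS): publish every table.
src = psycopg2.connect(SOURCE_DSN)
src.autocommit = True
with src.cursor() as cur:
    cur.execute("CREATE PUBLICATION migrate_pub FOR ALL TABLES;")

# Target (Render): subscribe. Postgres copies existing rows, then streams
# live changes, so the app keeps writing to RDS while the target catches up.
dst = psycopg2.connect(TARGET_DSN)
dst.autocommit = True  # CREATE SUBSCRIPTION cannot run inside a transaction
with dst.cursor() as cur:
    cur.execute(
        "CREATE SUBSCRIPTION migrate_sub CONNECTION %s PUBLICATION migrate_pub;",
        (SOURCE_DSN,),
    )

# Cutover: once replication lag is near zero, pause writes briefly, point
# the app at the target, and drop the subscription.
```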
The unglamorous work
Eval harness for persona consistency, refusal behavior, and crisis-language detection (a sketch follows this list)
Observability on token costs, latency p95/p99, and cold start frequency
A cost model that lets the team decide which conversations get the larger model
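As a flavor of that harness, here is the persona-break check reduced to its simplest form. The phrase lists are illustrative stand-ins; real crisis detection should not rest on keywords alone:

```python
import re

# Illustrative patterns only; the production lists are longer and reviewed.
PERSONA_BREAKS = [
    r"\bas an ai\b",
    r"\blanguage model\b",
    r"\bi(?:'m| am) (?:an? )?(?:ai|assistant|chatbot)\b",
]
CRISIS_SIGNALS = [
    r"\bwant to die\b",
    r"\bend it all\b",
]

def evaluate(text: str) -> dict:
    lowered = text.lower()
    return {
        "persona_break": any(re.search(p, lowered) for p in PERSONA_BREAKS),
        "crisis_flag": any(re.search(p, lowered) for p in CRISIS_SIGNALS),
    }

# Every candidate adapter runs against a fixed prompt suite; a single
# persona_break anywhere fails the release.
assert evaluate("As an AI assistant, I can't weigh in on that.")["persona_break"]
```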
The outcome
A self-hosted, persona-driven AI that participates in real groups of first responders, at a unit cost that lets Guide actually grow.
More importantly: the team has the levers. They can tune the persona without a vendor in the loop, control where data lives, and adjust GPU spend as a dial rather than a contract negotiation.
Tech stack
vLLM
LoRA
RunPod Serverless
Python
FastAPI
PostgreSQL
React Native
Render
Logical Replication
Want this kind of work on your team?
We read every message. Real reply within two business days.