AI / Self-Hosted LLM / Mental Health Tech
Guide
An AI companion for first responders.
Client: Guide
Role: Fractional CTO, AI Architecture
Timeline: 2024 – Ongoing
The problem
First responders die by suicide at rates well above the general population, and most won't talk to a therapist. Guide's bet is that a peer community, anchored by an always-on AI companion, can reach people that traditional mental health tools can't.
The product needed an AI that could do something most chatbots can't: hold a persona, participate in small group conversations like a member rather than a tool, and stay available at 3am when a paramedic is alone in the truck after a bad call.
The constraints were real:
The persona had to be consistent across thousands of conversations and never break character into "as an AI assistant" mode.
The unit economics had to work for a mental health startup, not a Series C SaaS company.
The data was sensitive. Off-the-shelf hosted APIs were a non-starter for both privacy and behavioral control.
The approach
We made three architectural calls early that defined the build.
Self-hosted inference, not API calls.
Hosted LLMs were ruled out for privacy, persona drift, and cost at scale. We stood up a vLLM-based serving stack on RunPod Serverless, which gave us GPU pricing that flexed with traffic and full control over model behavior.
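What that looks like in practice, as a minimal sketch: a RunPod serverless worker that loads the engine once at cold start and reuses it across requests. The model name and sampling parameters here are illustrative, not Guide's actual configuration.

```python
import runpod
from vllm import LLM, SamplingParams

# Loaded once per worker at cold start; every request after that reuses
# the warm engine, which is what makes the serverless economics work.
llm = LLM(model="meta-llama/Llama-2-13b-chat-hf")  # illustrative base model

def handler(job):
    prompt = job["input"]["prompt"]
    params = SamplingParams(temperature=0.7, max_tokens=256)
    outputs = llm.generate([prompt], params)
    return {"text": outputs[0].outputs[0].text}

runpod.serverless.start({"handler": handler})
```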
LoRA adapters, not merged weights.
Instead of fine-tuning and merging a custom model, we serve the base model with a hot-swappable LoRA adapter. This gave us tighter control over the persona scaling factor at inference time, and the ability to iterate on the persona without retraining and redeploying a 13B+ parameter model every cycle.
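In vLLM, that swap is a per-request argument rather than a redeploy. A sketch, with placeholder adapter names and paths; the effective persona scale comes from lora_alpha / r in the adapter's own config, which in this sketch is the tuning knob between iterations:

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# The base model loads once; adapters attach per request.
llm = LLM(model="meta-llama/Llama-2-13b-chat-hf", enable_lora=True, max_lora_rank=64)

persona = LoRARequest(
    lora_name="persona_v3",            # placeholder adapter name
    lora_int_id=3,
    lora_path="/adapters/persona_v3",  # scaling is read from the
                                       # adapter_config.json at this path
)

outputs = llm.generate(
    ["<prompt assembled by the layer above the model>"],
    SamplingParams(temperature=0.7, max_tokens=256),
    lora_request=persona,
)
print(outputs[0].outputs[0].text)
```

Shipping a new persona version is uploading a new adapter directory and changing one name, not retraining or redeploying the base model.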
Persona as architecture, not prompt.
The persona isn't a system prompt bolted onto a generic model. It's a fine-tuned adapter trained on curated dialogue, with conversation memory and group dynamics handled in a layer above the model.
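A rough sketch of what "a layer above the model" means here; the table and column names are hypothetical, not Guide's schema. The group layer pulls recent turns from Postgres and renders them as a transcript in which the companion is just another named member, so the adapter completes the next turn in character:

```python
import psycopg2

COMPANION_NAME = "Sam"  # hypothetical persona name, for illustration

def build_group_context(conn, group_id: int, max_turns: int = 30) -> str:
    """Render recent group history as a transcript the adapter continues,
    with the companion appearing as an ordinary member."""
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT author_name, body FROM messages
            WHERE group_id = %s
            ORDER BY created_at DESC LIMIT %s
            """,
            (group_id, max_turns),
        )
        rows = list(reversed(cur.fetchall()))  # oldest first
    transcript = "\n".join(f"{author}: {body}" for author, body in rows)
    # The persona's identity comes from the fine-tuned adapter, not from
    # instructions embedded in this string.
    return f"{transcript}\n{COMPANION_NAME}:"
```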
The work
The serving stack
vLLM for high-throughput batched inference
LoRA adapter loaded at runtime with configurable scaling
RunPod Serverless for cold-start-tolerant GPU allocation
FastAPI gateway in front for auth, rate limits, and routing (sketched after this list)
Postgres for conversation state and group context
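A stripped-down sketch of that gateway, assuming vLLM is exposed through its OpenAI-compatible HTTP server; the in-memory rate limiter and env var names are illustrative (production would back this with Redis or similar):

```python
import os
import time

import httpx
from fastapi import FastAPI, Header, HTTPException

app = FastAPI()
VLLM_URL = os.environ.get("VLLM_URL", "http://vllm:8000/v1/completions")
API_KEYS = set(filter(None, os.environ.get("GATEWAY_KEYS", "").split(",")))
_last_call: dict[str, float] = {}

def check_auth(key: str) -> None:
    if key not in API_KEYS:
        raise HTTPException(status_code=401, detail="invalid key")

def check_rate(key: str, min_interval: float = 1.0) -> None:
    # Naive per-key throttle, for illustration only.
    now = time.monotonic()
    if now - _last_call.get(key, 0.0) < min_interval:
        raise HTTPException(status_code=429, detail="slow down")
    _last_call[key] = now

@app.post("/generate")
async def generate(payload: dict, x_api_key: str = Header(...)):
    check_auth(x_api_key)
    check_rate(x_api_key)
    async with httpx.AsyncClient(timeout=60.0) as client:
        resp = await client.post(VLLM_URL, json=payload)
    resp.raise_for_status()
    return resp.json()
```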
The application stack
React Native client (iOS and Android)
Backend on Render with deep-link infrastructure on links.theguideapp.com using apple-app-site-association and assetlinks.json
PostgreSQL migrated from AWS RDS to Render via logical replication: zero downtime, no dump-and-restore (sketched below)
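The replication move deserves a sketch because it is so much calmer than a dump-and-restore. DSNs and object names below are placeholders; it assumes wal_level=logical on the RDS source and the schema already copied to the target with pg_dump --schema-only:

```python
import psycopg2

SOURCE_DSN = "host=old-rds.example.com dbname=app user=admin"     # placeholder
TARGET_DSN = "host=new-render.example.com dbname=app user=admin"  # placeholder

# Source (RDS): publish every table.
src = psycopg2.connect(SOURCE_DSN)
src.autocommit = True
with src.cursor() as cur:
    cur.execute("CREATE PUBLICATION migrate_pub FOR ALL TABLES;")

# Target (Render): subscribe. Postgres copies existing rows, then streams
# live changes, so the app keeps writing to RDS while the target catches up.
dst = psycopg2.connect(TARGET_DSN)
dst.autocommit = True  # CREATE SUBSCRIPTION cannot run inside a transaction
with dst.cursor() as cur:
    cur.execute(
        "CREATE SUBSCRIPTION migrate_sub CONNECTION %s PUBLICATION migrate_pub;",
        (SOURCE_DSN,),
    )

# Cutover: once replication lag is near zero, pause writes briefly, point
# the app at the target, and drop the subscription.
```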
The unglamorous work
Eval harness for persona consistency, refusal behavior, and crisis-language detection (a sketch follows this list)
Observability on token costs, latency p95/p99, and cold start frequency
A cost model that lets the team decide which conversations get the larger model
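As a flavor of that harness, here is the persona-break check reduced to its simplest form. The phrase lists are illustrative stand-ins; real crisis detection should not rest on keywords alone:

```python
import re

# Illustrative patterns only; the production lists are longer and reviewed.
PERSONA_BREAKS = [
    r"\bas an ai\b",
    r"\blanguage model\b",
    r"\bi(?:'m| am) (?:an? )?(?:ai|assistant|chatbot)\b",
]
CRISIS_SIGNALS = [
    r"\bwant to die\b",
    r"\bend it all\b",
]

def evaluate(text: str) -> dict:
    lowered = text.lower()
    return {
        "persona_break": any(re.search(p, lowered) for p in PERSONA_BREAKS),
        "crisis_flag": any(re.search(p, lowered) for p in CRISIS_SIGNALS),
    }

# Every candidate adapter runs against a fixed prompt suite; a single
# persona_break anywhere fails the release.
assert evaluate("As an AI assistant, I can't weigh in on that.")["persona_break"]
```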
The outcome
A self-hosted, persona-driven AI that participates in real groups of first responders, at a unit cost that lets Guide actually grow.
More importantly: the team has the levers. They can tune the persona without a vendor in the loop, control where data lives, and adjust GPU spend as a dial rather than a contract negotiation.
Tech stack
vLLM
LoRA
RunPod Serverless
Python
FastAPI
PostgreSQL
React Native
Render
Logical Replication
Want this kind of work on your team?
We read every message. Real reply within two business days.