How to ship voice AI inside a regulated perimeter (without lying to your compliance officer).

The problem nobody puts on the demo

Every voice AI vendor demo opens with a frictionless customer call: clean audio, a happy customer, a tidy transcript. Then you sign the contract and try to put that same agent in front of a regulated insurance flow, and the wheels come off. PII spills into transcripts. Consent isn't captured. Audit logs are JSON blobs nobody can query. The compliance officer asks one question — "prove to me this call was lawful" — and the whole stack falls silent.

This is the part of voice AI that nobody publishes a tutorial about. So here's ours.

The reference architecture we ship

We build every regulated voice deployment on three layers, in this order:

Telephony + media — usually Twilio or a regional SIP trunk. The job here is just to land the audio safely inside our perimeter and tag every call with a call_id that lives forever.
Inference + orchestration — Ultravox for speech-to-speech, with a thin orchestrator (TypeScript service, no framework) that holds the state machine. Why our own state machine and not LangGraph? Because regulated flows have hard rules — you cannot quote a premium before consent is captured — and we want those rules in plain code that an auditor can read.
Storage + audit — Postgres for the operational data, S3 (with object lock + KMS) for the raw audio, and a separate append-only audit_events table that no service can UPDATE or DELETE from.

The thing that makes this work isn't any one piece. It's that consent, retention, and audit are first-class concerns from the schema up — not a logging middleware bolted on at the end.
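To make the "rules in plain code" point concrete, here is a minimal sketch of a transition table for the orchestrator. The state and trigger names are illustrative assumptions, not a production flow; the only rule taken from the text is that a premium cannot be quoted before consent is captured.

```typescript
// Hypothetical states and triggers for a quote call. A real flow would have
// more of both (escalation, retries, hang-ups).
type CallState =
  | "awaiting_recording_consent"
  | "awaiting_processing_consent"
  | "collecting_details"
  | "quoting"
  | "awaiting_outcome_consent"
  | "closed";

type Trigger =
  | "recording_consent_given"
  | "processing_consent_given"
  | "details_complete"
  | "quote_delivered"
  | "outcome_consent_given";

// The whole flow as a lookup table: any pair not listed here is illegal.
const TRANSITIONS: Record<CallState, Partial<Record<Trigger, CallState>>> = {
  awaiting_recording_consent: { recording_consent_given: "awaiting_processing_consent" },
  awaiting_processing_consent: { processing_consent_given: "collecting_details" },
  collecting_details: { details_complete: "quoting" },
  quoting: { quote_delivered: "awaiting_outcome_consent" },
  awaiting_outcome_consent: { outcome_consent_given: "closed" },
  closed: {},
};

interface TransitionResult {
  ok: boolean;
  from: CallState;
  trigger: Trigger;
  to: CallState;
}

// Returns the next state, or ok=false with the state unchanged when the
// trigger is not allowed from the current state.
function transition(from: CallState, trigger: Trigger): TransitionResult {
  const to = TRANSITIONS[from][trigger];
  if (to === undefined) {
    return { ok: false, from, trigger, to: from };
  }
  return { ok: true, from, trigger, to };
}

// The hard rule: "quoting" is only reachable after both consent states,
// so checking the state is enough to know consent was captured.
function canQuotePremium(state: CallState): boolean {
  return state === "quoting";
}
```

In this sketch a blocked transition returns ok=false rather than throwing, so the caller can record it as a guardrail hit and keep the call alive.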

Consent, the way the regulator actually wants it

Consent in voice AI has three moments, and you must capture all three:

Recording consent — captured before the agent says anything substantive. The opening prompt is a script your legal team signs off on, and the customer's response is timestamped against call_id and against the prompt version (because prompts change weekly and an auditor will ask "what exactly did we read them in March?").
Data processing consent — separate from recording. The customer is consenting to your AI processing their data, not just to being recorded.
Outcome consent — the customer agreeing to whatever the call produced (a quote, a policy change, a callback). This one is the most often skipped and the most painful when it's missing.

Each of these lands in audit_events as a row keyed by call_id, with event_type, consent_version, prompt_version, agent_version, and transcript_offset_ms. When the regulator asks, you can reconstruct exactly what was said, when, by whom, and what the customer agreed to.
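The row shape above could be sketched in TypeScript roughly as follows. The column names come from the text; the granted and recorded_at fields, the three event_type values, and both helper functions are assumptions made for illustration.

```typescript
// The three consent moments, as hypothetical event_type values.
type ConsentEventType =
  | "recording_consent"
  | "data_processing_consent"
  | "outcome_consent";

interface ConsentEvent {
  call_id: string;
  event_type: ConsentEventType;
  consent_version: string;
  prompt_version: string;
  agent_version: string;
  transcript_offset_ms: number;
  granted: boolean; // assumption: a refusal is recorded too, as granted=false
  recorded_at: string; // assumption: ISO 8601 wall-clock timestamp
}

const REQUIRED_CONSENTS: ConsentEventType[] = [
  "recording_consent",
  "data_processing_consent",
  "outcome_consent",
];

// Builds a frozen row so nothing downstream can edit it before it is written.
function buildConsentEvent(
  input: Omit<ConsentEvent, "recorded_at">,
  now: Date = new Date(),
): ConsentEvent {
  const offset = input.transcript_offset_ms;
  if (offset < 0 || Math.floor(offset) !== offset) {
    throw new RangeError("transcript_offset_ms must be a non-negative integer");
  }
  return Object.freeze({ ...input, recorded_at: now.toISOString() });
}

// Lists which of the three consent moments have no granted row for a call.
function missingConsents(events: ConsentEvent[], callId: string): ConsentEventType[] {
  return REQUIRED_CONSENTS.filter(
    (required) =>
      !events.some(
        (e) => e.call_id === callId && e.event_type === required && e.granted,
      ),
  );
}
```

A check like missingConsents is one way to catch the skipped outcome consent before the call closes, rather than during the audit.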

Audit logging that survives a real audit

The trap with audit logging is writing too much and querying too little. We log five things, religiously:

State transitions in the conversation state machine (with the trigger and the next state)
Tool calls the agent made (with arguments and result hash — not the raw result, because the raw result might contain PII you don't want in audit forever)
Consent events (see above)
Escalations to a human agent (and the reason code)
Policy guardrail hits — when a guardrail blocked the agent from doing something

That's it. No request/response dumps. No "info" logs. Audit is for proving compliance, not for debugging. Debugging logs live somewhere else, with 30-day retention.
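One way to keep the list that short is to make the five event types the only shapes the audit writer accepts. The sketch below is an assumption-heavy illustration: the field names, the choice of SHA-256, and the in-memory AuditLog class are all hypothetical, and the class only mimics the append-only table (in a real deployment the database would enforce that, not application code).

```typescript
import { createHash } from "node:crypto";

// The five audit event types as a closed union. Anything else fails to compile.
type AuditEvent =
  | { event_type: "state_transition"; call_id: string; from: string; trigger: string; to: string }
  | { event_type: "tool_call"; call_id: string; tool: string; args: Record<string, unknown>; result_sha256: string }
  | { event_type: "consent"; call_id: string; consent_type: string; consent_version: string }
  | { event_type: "escalation"; call_id: string; reason_code: string }
  | { event_type: "guardrail_hit"; call_id: string; guardrail: string };

// Hashes the tool result so the audit row can prove what was returned
// without storing it. Caveat: JSON.stringify depends on key insertion order,
// so a real implementation would want a canonical serialization first.
function hashResult(result: unknown): string {
  const serialized = JSON.stringify(result) ?? "null";
  return createHash("sha256").update(serialized).digest("hex");
}

// Builds a tool_call row from the raw result, keeping only its hash.
function toolCallEvent(
  callId: string,
  tool: string,
  args: Record<string, unknown>,
  result: unknown,
): AuditEvent {
  return {
    event_type: "tool_call",
    call_id: callId,
    tool,
    args,
    result_sha256: hashResult(result),
  };
}

// In-memory stand-in for the audit table: it exposes append and read only.
class AuditLog {
  private readonly events: AuditEvent[] = [];

  append(event: AuditEvent): void {
    this.events.push(Object.freeze(event));
  }

  forCall(callId: string): AuditEvent[] {
    return this.events.filter((e) => e.call_id === callId);
  }
}
```

Because the union has no generic "info" or "debug" member, adding a sixth kind of audit row becomes a deliberate type change rather than a stray log line.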

What we'd do differently next time

After Hayah we changed two things in the template. First, we now version the consent script itself as a content-collection entry, so the script and the prompt that reads it are tied together at deploy time. Second, we built a replay tool that takes a call_id and walks the auditor through the call as a timeline — transcripts, consent events, tool calls, all in one view. The auditor stopped asking for screenshots. That alone was worth a week of engineering.
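The core of a replay tool like that is a merge: pull every record for one call_id and order it by offset into the call. The sketch below is a hypothetical, in-memory version of that step. The row shapes and function names are assumptions, and a real tool would read from the transcript store and audit_events instead of arrays.

```typescript
// Hypothetical row shapes for the three sources the replay view combines.
interface TranscriptLine {
  call_id: string;
  offset_ms: number;
  speaker: "agent" | "customer";
  text: string;
}

interface ConsentRow {
  call_id: string;
  transcript_offset_ms: number;
  event_type: string;
  consent_version: string;
}

interface ToolCallRow {
  call_id: string;
  offset_ms: number;
  tool: string;
  result_sha256: string;
}

interface TimelineEntry {
  offset_ms: number;
  kind: "transcript" | "consent" | "tool_call";
  summary: string;
}

// Merges all three sources for one call into a single list ordered by
// offset into the call.
function buildTimeline(
  callId: string,
  transcript: TranscriptLine[],
  consents: ConsentRow[],
  toolCalls: ToolCallRow[],
): TimelineEntry[] {
  const entries: TimelineEntry[] = [];
  for (const t of transcript) {
    if (t.call_id !== callId) continue;
    entries.push({ offset_ms: t.offset_ms, kind: "transcript", summary: `${t.speaker}: ${t.text}` });
  }
  for (const c of consents) {
    if (c.call_id !== callId) continue;
    entries.push({
      offset_ms: c.transcript_offset_ms,
      kind: "consent",
      summary: `${c.event_type} (script ${c.consent_version})`,
    });
  }
  for (const tc of toolCalls) {
    if (tc.call_id !== callId) continue;
    entries.push({ offset_ms: tc.offset_ms, kind: "tool_call", summary: `${tc.tool} -> ${tc.result_sha256}` });
  }
  return entries.sort((a, b) => a.offset_ms - b.offset_ms);
}

// Renders the timeline as plain text, one entry per line.
function renderTimeline(entries: TimelineEntry[]): string {
  return entries.map((e) => `[${e.offset_ms} ms] ${e.kind}: ${e.summary}`).join("\n");
}
```

Sorting on a single offset field is what lets the auditor see a consent event sitting between the two transcript lines it belongs to.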

Takeaway

Voice AI in a regulated perimeter isn't a model problem. It's a data discipline problem. Get consent versioning, audit immutability, and replay tooling right on day one, and the rest of the build is just product. Get them wrong and you'll be rebuilding from the schema up six months in.