Why we chose Ultravox over the obvious vendors.

The shortlist nobody tells you about

When you go shopping for voice AI infrastructure in 2026, the obvious vendors come up first: OpenAI's Realtime API, ElevenLabs Conversational, Retell, Vapi. They demo well. They have logos on their homepages. They're what your CTO heard about at a conference. We evaluated all of them for a regulated MENA deployment and ended up shipping on Ultravox. Here's the actual decision matrix, not the marketing version.

Latency: the only number that matters

Conversational latency is the difference between an AI that feels like a human and one that feels like a 2002 IVR. The bar we hold ourselves to is < 800ms from end-of-speech to start-of-response, end-to-end, on a 4G connection in Dubai. That number includes: VAD endpointing, network egress, model inference, TTS first audio packet, and network return.
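
A minimal sketch of how that budget could be tracked per turn, assuming you can timestamp each of the five stages. The class name, field names, and the example values are hypothetical; the values are placeholders for illustration, not measurements.

```python
from dataclasses import dataclass

BUDGET_MS = 800  # end-of-speech to start-of-response bar from the text


@dataclass
class TurnTiming:
    """Per-stage timings for one conversational turn, in milliseconds."""
    vad_endpointing: float
    network_egress: float
    model_inference: float
    tts_first_packet: float
    network_return: float

    def total_ms(self) -> float:
        # End-to-end latency is the sum of every stage in the path.
        return (self.vad_endpointing + self.network_egress
                + self.model_inference + self.tts_first_packet
                + self.network_return)

    def within_budget(self, budget_ms: float = BUDGET_MS) -> bool:
        # The bar is strictly under the budget, not at it.
        return self.total_ms() < budget_ms


# Placeholder values for illustration only, not measured numbers.
example = TurnTiming(200, 60, 250, 120, 60)
print(example.total_ms(), example.within_budget())
```

Breaking the total out by stage is the useful part: when a turn blows the budget, you want to know whether endpointing, the network, or the model ate the time.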

Here's what we measured, p50, on the same hardware and the same test corpus:

OpenAI Realtime: 1.1s. Good when it works, but the variance is brutal (p95 around 2.3s).
ElevenLabs Conversational: 950ms. Better, but its endpointing was aggressive and cut off slow speakers.
Retell / Vapi: both wrap underlying providers, so latency depended on the provider underneath, plus 100–200ms of orchestration overhead.
Ultravox: 620ms p50, 880ms p95. The architecture is end-to-end speech-to-speech, so it doesn't pay the STT → LLM → TTS tax.
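
If you want to reproduce this kind of comparison, p50 and p95 can be computed from raw per-turn latencies along these lines. This is a sketch using the nearest-rank definition of a percentile, which is one of several conventions and not necessarily the one behind the figures above; the function names are hypothetical.

```python
import math


def percentile(samples_ms, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples_ms)
    # 1-based rank; multiply before dividing to keep the arithmetic exact.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def summarize(samples_ms):
    """Report the two figures worth comparing across vendors."""
    return {
        "p50": percentile(samples_ms, 50),
        "p95": percentile(samples_ms, 95),
    }
```

Report both. A vendor with a good p50 and a bad p95 still produces a long pause on a noticeable share of turns.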

For a customer who's used to talking to a human, somewhere around 600ms is where they stop noticing they're talking to a machine. At a 620ms p50, Ultravox was the only option we measured that got close. That's the win.

Language support: it's never just "does it speak the language"

Our deployment had to handle Gulf Arabic, MSA, and English, often code-switched mid-sentence. Every vendor on the shortlist claimed Arabic support. The reality:

Some vendors trained on MSA only and produced unusable output for dialectal speech.
Some had Arabic STT but English-only TTS, which is useless for a voice agent.
Two vendors handled code-switching by detecting language per utterance — so a customer saying "my deductible يعني the amount I pay" would get half a response in English and half in Arabic, both confused.

Ultravox handled code-switched Gulf Arabic + English in a single utterance without us writing custom orchestration. That alone collapsed two weeks of planned prompt engineering into zero.
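
To test this on your own traffic, you first need to know which utterances in your corpus are code-switched. A rough sketch that tags transcripts by script, assuming text transcripts are available: it only tells Arabic script from Latin script, so it cannot separate Gulf Arabic from MSA and it will miss Arabic typed in Latin letters. The function names are hypothetical.

```python
import unicodedata


def script_of(word):
    """Classify a word as 'arabic', 'latin', or 'other' by its first letter."""
    for ch in word:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name.startswith("ARABIC"):
                return "arabic"
            if name.startswith("LATIN"):
                return "latin"
            return "other"
    return "other"  # no letters at all (digits, punctuation)


def script_runs(utterance):
    """Split an utterance into consecutive runs of words sharing a script."""
    runs = []
    for word in utterance.split():
        script = script_of(word)
        if runs and runs[-1][0] == script:
            runs[-1][1].append(word)
        else:
            runs.append((script, [word]))
    return [(script, " ".join(words)) for script, words in runs]


def is_code_switched(utterance):
    """True when a single utterance mixes Arabic and Latin script."""
    scripts = {script for script, _ in script_runs(utterance)}
    return {"arabic", "latin"} <= scripts


print(script_runs("my deductible يعني the amount I pay"))
```

The example sentence from above splits into three runs inside one utterance, which is exactly the case a per-utterance language detector has to collapse into a single label.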

Data residency: the deal-breaker

Our customer needed inference to run inside a specific jurisdiction. That's a compliance requirement, not a preference. Most vendors offer "data residency" as a marketing line that means "we'll store your transcripts in your region" — but the model inference itself happens in us-east-1. That's not residency. That's a transcript backup.

Ultravox let us run inference inside our own AWS account, in the right region, with our own KMS keys. That made the entire compliance story tractable in one architecture review instead of a six-month vendor negotiation.
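
One way to keep the distinction between "transcripts stored in region" and "inference runs in region" honest is to list every component that touches customer audio and check each one. A minimal sketch; the component names and the `me-central-1` region are assumptions for illustration, not the actual deployment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str    # what the component does
    region: str  # where it physically runs


def residency_violations(components, allowed_regions):
    """Return the names of components running outside the allowed regions."""
    return [c.name for c in components if c.region not in allowed_regions]


# Hypothetical vendor setup: storage is in region, inference is not.
vendor_setup = [
    Component("transcript_store", "me-central-1"),
    Component("model_inference", "us-east-1"),
    Component("kms_keys", "me-central-1"),
]

print(residency_violations(vendor_setup, {"me-central-1"}))
```

A setup like the hypothetical one here would pass a "where are transcripts stored" questionnaire and still fail the check, because the inference component is the one outside the jurisdiction.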

What we gave up

This isn't a free win. Ultravox is younger than the obvious vendors. Their dashboard is thinner. Their SDK has rough edges. We file a support ticket roughly once a month and the team responds, but we're not in a tier where we get an account manager calling us back.

For a marketing demo, the obvious vendors are still easier. For a regulated production deployment with hard latency and residency requirements, the tradeoff was clear.

Takeaway

Vendor selection for voice AI in 2026 isn't about feature lists. It's about three things: p50 latency under your real network conditions, language coverage measured on your customers' actual speech, and where the inference physically runs. Get those three answered honestly and the choice usually makes itself.