Building Voice AI Systems That Don't Embarrass You in Production
What I've learned from putting AI voice agents on real customer calls across India, the Middle East, and Latin America. Latency budgets, language switching, and why a single hallucination is a customer-impact event, not a chat-window inconvenience.
By Vikas Goel
A demo voice agent and a production voice agent share roughly nothing. They use similar models, similar pipelines, similar APIs — and behave completely differently the moment a real customer is on the line.
I'm the CTO at Nexiva, the AI voice agent platform we built at blackNgreen. We're now live across India, the Middle East, and Latin America, handling inbound service queries, outbound sales, and collections for telecom and BFSI customers. We launched at MWC Barcelona in 2025, and the gap between the demo we showed there and the system that runs production traffic is, honestly, where most of the engineering happened.
This post is the operating manual I wish I'd had two years ago.
Latency is not a number — it is a budget
The single biggest difference between toy voice agents and production ones is how they think about latency. A toy agent thinks of it as a single number: "the model responds in X milliseconds." A production agent thinks of it as a budget that gets spent across a pipeline.
Here's a typical Nexiva turn:
- Endpointing (deciding the customer is done speaking): ~150ms
- ASR (speech-to-text): ~250ms
- Context assembly + tool routing decision: ~50ms
- LLM inference (first token): ~400ms
- LLM streaming completion: ~600ms (overlapped with the next stage)
- TTS (text-to-speech, streaming): ~300ms
- Network jitter and audio buffering: ~150ms
If you add those up naively you get roughly 1,900ms, which would be conversationally unacceptable. The trick is that most of the stages are overlapped and streamed: ASR streams partials, the LLM streams tokens, and TTS starts speaking before the full response is generated. The number that actually matters is time-to-first-audible-syllable after the customer stops talking. Below ~700ms feels natural. Around 1.2 seconds, customers start saying "hello?" Above 2 seconds, they hang up.
Engineering for this means treating latency like a financial budget. Every component has an allocation. When something goes over budget, something else has to come in under. We've ripped out individually better components because they pushed the system past its conversational latency ceiling.
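To make that concrete, here's a minimal sketch of what budget-style accounting can look like per turn. The stage names and allocations mirror the list above; the class, the overlap flags, and which stages actually overlap are illustrative assumptions, not Nexiva's internals.

```python
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    budget_ms: int
    overlapped: bool  # streams under a later stage, off the critical path

# Allocations from the turn breakdown above. In a real stack, ASR
# partials and TTS chunking hide more of the serial cost than shown here.
PIPELINE = [
    Stage("endpointing", 150, overlapped=False),
    Stage("asr", 250, overlapped=False),
    Stage("context_and_routing", 50, overlapped=False),
    Stage("llm_first_token", 400, overlapped=False),
    Stage("llm_streaming", 600, overlapped=True),  # hidden under TTS
    Stage("tts_first_chunk", 300, overlapped=False),
    Stage("network_and_buffering", 150, overlapped=False),
]

NATURAL_MS = 700       # under this, the turn feels conversational
HANGUP_RISK_MS = 2000  # over this, callers give up

def naive_total(stages: list[Stage]) -> int:
    """The misleading number: every stage summed serially."""
    return sum(s.budget_ms for s in stages)

def time_to_first_audio(stages: list[Stage]) -> int:
    """The number that matters: only non-overlapped stages sit on the
    critical path to the first audible syllable."""
    return sum(s.budget_ms for s in stages if not s.overlapped)

def over_budget(measured_ms: dict[str, int]) -> list[str]:
    """Flag stages that blew their allocation on a given turn."""
    return [
        f"{s.name}: {measured_ms[s.name]}ms > {s.budget_ms}ms budget"
        for s in PIPELINE
        if measured_ms.get(s.name, 0) > s.budget_ms
    ]
```

The useful output is the per-turn over_budget report: it turns "the agent feels slow" into "TTS is blowing its allocation on a measurable fraction of turns," which is something you can actually fix.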
Language switching is the test that exposes everything
In India, customers code-switch mid-sentence — Hindi, English, sometimes a third language thrown in. In the Middle East, you have Arabic with English technical terms. In Latin America, Spanish with Portuguese borrowings on the borders. A voice agent that handles any one language well but breaks on switches is not deployable in any of these markets.
The mistake most teams make is to treat language detection as a step at the start of the call. That doesn't work — language switches mid-conversation, sometimes mid-sentence. The right model is continuous bilingual perception: ASR running in a configuration that tolerates switching, an LLM that handles multilingual generation natively, and TTS that can swap voice characteristics without the change being jarring to the customer.
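To make continuous perception concrete, here's a minimal sketch of the detection side, assuming a streaming ASR that attaches a per-partial language distribution (the tracker, the probability format, and the voice map are all illustrative, not Nexiva's internals):

```python
from collections import deque

class LanguageTracker:
    """Tracks the active language continuously from ASR partials,
    instead of detecting it once at the start of the call."""

    def __init__(self, default_lang: str, window: int = 5):
        self.recent = deque(maxlen=window)  # last few per-partial votes
        self.active = default_lang

    def observe(self, lang_probs: dict[str, float]) -> str:
        # lang_probs is the ASR's per-partial language distribution,
        # e.g. {"hi": 0.7, "en": 0.3} on a Hindi-English code-switch.
        self.recent.append(max(lang_probs, key=lang_probs.get))
        # Switch only on a stable majority so one noisy partial
        # doesn't flip the TTS voice mid-sentence.
        top = max(set(self.recent), key=self.recent.count)
        if self.recent.count(top) > len(self.recent) // 2:
            self.active = top
        return self.active

# The active language selects voice parameters on the same TTS,
# rather than routing into a separate translated pipeline.
VOICES = {"hi": "hi-voice-1", "en": "en-voice-1", "ar": "ar-voice-1"}
```

The design point is the vote: a sustained switch should follow the customer within a few words, but a single misrecognized partial shouldn't jerk the voice around.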
The other mistake is treating non-English languages as a translation problem. They are not. Phonetics, social register, the way numbers and names are pronounced: none of it survives translation. Production-quality voice agents in Hindi or Arabic require Hindi-quality and Arabic-quality systems end to end, not English systems with a translation layer bolted on. We learned this the hard way.
Hallucination at the bottom of the pyramid
In a chat interface, a hallucination is recoverable. The user reads the response, doesn't believe it, asks again. The cost is annoyance.
In a voice interface, a hallucination is an action. The customer hears something authoritative-sounding, asks a follow-up assuming it was true, the agent doubles down, and now you have a customer who has been told their bill is paid when it isn't, or that their plan includes international roaming when it doesn't. The cost is real-world consequences — and a regulatory complaint if the customer is in a market with strong telecom oversight.
The architectural answer to this in Nexiva is grounded answering with explicit refusal paths. Anything that touches account state, money, or product specifics has to come from a tool call that returns structured data, not from the model's parametric memory. The model's job in those moments is to phrase the data, not produce it. And when a tool call fails or returns ambiguous data, the agent has to be willing to say "let me transfer you to someone who can verify this." Most of the work in building production voice AI is teaching the system when not to answer.
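Here's a minimal sketch of that pattern. Every name in it (fetch_billing_status, llm_phrase, transfer_to_human, ToolError) is a hypothetical stand-in for whatever tool layer and model client you actually run:

```python
class ToolError(Exception):
    """A backend lookup failed or timed out."""

def fetch_billing_status(customer_id: str) -> dict | None:
    ...  # calls the billing system; returns structured data

def llm_phrase(facts: dict, question: str) -> str:
    ...  # model turns verified facts into a spoken sentence

def transfer_to_human(customer_id: str, reason: str) -> str:
    ...  # hands the call to an agent and logs why

def answer_billing_question(customer_id: str, question: str) -> str:
    try:
        record = fetch_billing_status(customer_id)
    except ToolError:
        # Lookup failed: refuse and transfer rather than let the
        # model guess from parametric memory.
        return transfer_to_human(customer_id, reason="billing lookup failed")

    if not record or record.get("status") == "ambiguous":
        # Ambiguous data gets the same treatment as a failure.
        return transfer_to_human(customer_id, reason="billing state unclear")

    # The model's job here is phrasing the record, not producing it.
    return llm_phrase(facts=record, question=question)
```

The shape is the same for anything touching money, plans, or account state: the tool returns the facts, the model wraps them in words, and every failure mode lands on a refusal or a transfer, never on a guess.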
What I would tell my past self
Three things I now know that I didn't believe enough two years ago:
Treat the evaluation regime as the product. Whatever your eval looks like in month one, it will be wrong by month three. Customer phrasings change, product changes, edge cases emerge. Build the eval system as if it is a first-class component — versioned, instrumented, regression-tested. The team that owns evaluations is doing more important work than the team that owns the model. This is the same insight that pushed me toward the ThinkerWave research programme — the question of how an AI system defines what success looks like is, on reflection, the central question of production AI.
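As a sketch of what first-class can mean here: versioned cases plus a regression gate wired into deploys. The schema and tolerance below are illustrative assumptions, not our actual harness.

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    case_id: str
    eval_version: str          # evals are versioned like code
    audio_ref: str             # recorded customer utterance to replay
    expected_intent: str
    must_not_claim: list[str]  # facts the agent may never assert unverified

def regression_gate(run: dict[str, float], baseline: dict[str, float],
                    tolerance: float = 0.01) -> bool:
    """Block a rollout if any tracked metric regresses past tolerance."""
    ok = True
    for metric, old in baseline.items():
        new = run.get(metric, 0.0)
        if new < old - tolerance:
            print(f"REGRESSION: {metric} {old:.3f} -> {new:.3f}")
            ok = False
    return ok
```

The must_not_claim field is the voice-specific part: it encodes the hallucination constraints from the previous section as testable assertions rather than hopes.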
Recovery is more important than capability. A voice agent that handles 95% of cases beautifully and 5% catastrophically is unshippable. A voice agent that handles 80% well, 15% acceptably, and 5% by gracefully transferring to a human is shippable. Engineer the recovery paths first.
The infrastructure underneath matters more than the model on top. Most of the wins in Nexiva over the last year came from things that were not the model: better endpointing, smarter retry logic, more responsive transfer flows, more accurate intent classification. The model is one component in a larger system. Treat it that way.
The unromantic part
Building voice AI in 2026 is profoundly unromantic. The fun parts — the model, the demos, the impressive moments — are maybe 10% of the work. The other 90% is observability, eval discipline, edge case handling, recovery paths, telco integration, regulatory compliance, audio quality, and on-call rotations. If you're building in this space, that 90% is where the moat is. The model is a commodity. The system around it isn't.
If you're working on production voice AI and want to swap notes, reach out. And if you want to read more about the agent architecture stuff, see my piece on AI agents reshaping business.
- Voice AI
- AI agents
- engineering
- Nexiva