
How to Build a Sub-500ms Voice Agent From Scratch: A Deep Dive

TL;DR

A developer built a fully functional voice agent with under 500ms end-to-end latency in roughly one day using ~$100 in API credits. The key breakthroughs: switching from OpenAI to Groq for inference (cutting first-token latency from 300–500ms down to ~80ms), pre-warming WebSocket connections to ElevenLabs, and deploying in the EU instead of running locally. The full code is open source. This is one of the most practical, honest breakdowns of real-time voice AI architecture published so far.


What the Sources Say

The original writeup at ntik.me is a refreshingly candid engineering post-mortem. The author (Nick Tikhonov) doesn’t just show you the finished product — he walks through the failures, the latency numbers at each stage, and the specific decisions that made the difference.

The Core Architecture

At its heart, the voice agent is a turn-taking loop managing two states:

  • The user is speaking (agent listens)
  • The agent is speaking (user listens)

That sounds simple. In practice, it means you’re orchestrating four separate services in real time:

  1. Twilio — handles telephony and ingests 8kHz μ-law audio packets
  2. Deepgram Flux — does turn detection (figuring out when the user stopped talking) plus transcription
  3. An LLM — generates the response (Groq llama-3.3-70b or gpt-4o-mini)
  4. ElevenLabs — converts text to speech

The author makes a critical architectural point early on: this is not a sequential pipeline. Voice agents require continuous real-time orchestration. If you think of it as “transcribe → generate → speak” in discrete steps, you’ll be stuck at 1.5+ seconds. The entire thing has to stream.
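The two-state loop can be sketched as a tiny event-driven state machine. Everything below is illustrative: the event names and the queue are stand-ins for the real Twilio/Deepgram/ElevenLabs signals, not the author's actual implementation.

```python
import asyncio
from enum import Enum, auto

class TurnState(Enum):
    USER_SPEAKING = auto()   # agent listens
    AGENT_SPEAKING = auto()  # user listens

class VoiceAgent:
    """Toy turn-taking loop. Real inputs (Twilio audio, Deepgram turn
    events, TTS playback completion) are reduced to string events."""

    def __init__(self):
        self.state = TurnState.USER_SPEAKING
        self.events = asyncio.Queue()

    async def run(self, max_turns=1):
        turns = 0
        while turns < max_turns:
            event = await self.events.get()
            if event == "end_of_turn" and self.state is TurnState.USER_SPEAKING:
                # Turn detection says the user stopped: start responding.
                self.state = TurnState.AGENT_SPEAKING
            elif event == "playback_done" and self.state is TurnState.AGENT_SPEAKING:
                # Agent audio finished: hand the floor back to the user.
                self.state = TurnState.USER_SPEAKING
                turns += 1
        return turns

def simulate(events, max_turns=1):
    agent = VoiceAgent()
    async def main():
        task = asyncio.create_task(agent.run(max_turns))
        for e in events:
            await agent.events.put(e)
        return await task, agent.state
    return asyncio.run(main())
```

The point of the sketch is that nothing here is a pipeline stage: events arrive asynchronously and the loop reacts, which is what "continuous real-time orchestration" means in practice.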

The Latency Journey (This Is the Good Part)

The post includes an honest progression of what the numbers actually looked like:

Configuration                 | End-to-End Latency
Local dev machine (Turkey)    | ~1.7 seconds
Deployed to EU                | ~790ms
EU + switched to Groq         | ~400ms

Two things jump out here. First, geography alone accounted for more than half the latency before any code optimization touched anything. Running the orchestration server in Turkey while hitting APIs hosted in the US/EU added ~900ms just in network round-trips; deploying to Railway.app in the EU cut the end-to-end figure from ~1.7s to ~790ms.

Second — and this is the headline insight — model inference dominates total latency. The author is explicit: “The TTFT (time to first token) accounts for more than half of the total latency.” Groq’s infrastructure achieved ~80ms TTFT vs. OpenAI’s 300–500ms. That single swap took the system from “almost usable” to genuinely fast.
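TTFT is easy to measure yourself: time from issuing the request to receiving the first streamed token. A minimal sketch, using a simulated token stream in place of a real provider's streaming response:

```python
import time

def measure_ttft(token_stream):
    """Return (first_token, seconds_until_first_token) for any iterable
    of streamed tokens. Point this at your provider's streaming response."""
    start = time.perf_counter()
    first = next(iter(token_stream))
    return first, time.perf_counter() - start

def fake_stream(delay_s=0.05):
    # Stand-in for a streaming LLM with a fixed ~50ms TTFT.
    time.sleep(delay_s)
    yield "Hello"
    yield ", world"
```

Run the same measurement against each candidate provider from your actual deployment region; per the post, the numbers differ by 4x or more.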

The Three Optimizations That Actually Moved the Needle

1. Connection pooling to ElevenLabs

Every new WebSocket handshake to ElevenLabs costs ~300ms. Pre-warming a pool of connections and keeping them open eliminated that overhead entirely. This is the kind of thing that’s obvious in retrospect and invisible until you measure it.
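A pre-warmed pool fits in a few lines. Here `open_connection` is a hypothetical stand-in for the ElevenLabs WebSocket handshake; the pool just makes sure that cost is paid before a call arrives, not during one.

```python
import asyncio
from collections import deque

class ConnectionPool:
    """Sketch of a pre-warmed connection pool. `open_connection` stands
    in for the ~300ms ElevenLabs WebSocket handshake."""

    def __init__(self, open_connection, size=3):
        self._open = open_connection
        self._size = size
        self._pool = deque()

    async def warm_up(self):
        # Pay every handshake cost up front, in parallel.
        conns = await asyncio.gather(*(self._open() for _ in range(self._size)))
        self._pool.extend(conns)

    async def acquire(self):
        if self._pool:
            return self._pool.popleft()  # already open: ~0ms
        return await self._open()        # pool exhausted: cold handshake

    def release(self, conn):
        self._pool.append(conn)
```

A real version also needs keep-alive pings and replacement of dropped connections, which this sketch omits.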

2. Token-level streaming into TTS

Don’t wait for the full LLM response before starting TTS. As soon as the first tokens arrive from the LLM, start feeding them to ElevenLabs. The author pipes LLM tokens directly into the TTS stream and forwards the resulting audio packets to Twilio immediately. You’re playing the beginning of the sentence while the model is still generating the end of it.
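The pattern looks like this in outline. `synthesize` and `send_audio` are hypothetical hooks standing in for ElevenLabs and Twilio; the point is only that nothing waits for the full response.

```python
import asyncio

async def stream_llm_tokens(text):
    # Stand-in for a streaming LLM response: tokens "arrive" one at a time.
    for token in text.split():
        await asyncio.sleep(0)
        yield token + " "

async def pipe_tokens_to_tts(tokens, synthesize, send_audio):
    """Forward each token to TTS the moment it arrives, and each audio
    chunk onward the moment it is synthesized."""
    sent = 0
    async for token in tokens:
        audio = await synthesize(token)  # TTS starts before the LLM finishes
        await send_audio(audio)          # caller hears audio while generation continues
        sent += 1
    return sent
```

Real TTS streaming works on phrase-sized chunks rather than single tokens (prosody suffers otherwise), but the control flow is the same.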

3. Barge-in handling

When the user interrupts mid-response, the system has to immediately cancel in-flight generation, tear down TTS, and flush any buffered audio. Get this wrong and the agent keeps talking over the user, which feels broken. The author built interrupt detection on top of Deepgram’s turn detection signals.
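In asyncio terms, barge-in is task cancellation plus a buffer flush. This is a sketch of the shape, not the author's code; the class and method names are invented for illustration.

```python
import asyncio

class BargeInController:
    """On user interrupt: cancel in-flight LLM generation, cancel TTS,
    and flush any audio that has been synthesized but not yet played."""

    def __init__(self):
        self.llm_task = None
        self.tts_task = None
        self.audio_buffer = []

    async def on_user_speech(self):
        # Called when turn detection fires while the agent is speaking.
        for task in (self.llm_task, self.tts_task):
            if task and not task.done():
                task.cancel()
        self.audio_buffer.clear()  # drop unplayed audio so the agent goes silent now
```

Getting the flush right matters as much as the cancellation: if buffered audio keeps draining to the phone, the agent still talks over the user for a second or two.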

The VAD Journey

The initial implementation used Silero VAD (voice activity detection) for figuring out when the user finished speaking. This worked locally but introduced its own latency and complexity. Switching to Deepgram Flux for combined turn detection + transcription simplified the architecture and improved reliability. Having one service handle both “is the user still talking?” and “what did they say?” turned out to be cleaner than stitching two systems together.
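The "one service, one event stream" simplification can be sketched as a single dispatcher. The message shapes below are illustrative, not Deepgram's actual schema: the point is that partial transcripts and the turn boundary arrive on the same stream.

```python
def handle_flux_message(msg, state):
    """Dispatch one message from a combined STT + turn-detection stream.
    Field names here are invented for illustration."""
    if msg["type"] == "transcript":
        # Partial transcript: accumulate while the user is still talking.
        state["partial"] += msg["text"]
    elif msg["type"] == "end_of_turn":
        # Turn boundary: promote the partial into a final utterance.
        state["final"] = state["partial"].strip()
        state["partial"] = ""
    return state
```

With a separate VAD, the same logic needs two callbacks from two services whose clocks and chunk boundaries don't agree, which is where the stitched-together version got complicated.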

What Didn’t Work (And Why That’s Valuable)

The 1.7-second local latency isn’t presented as embarrassing — it’s the baseline that makes the optimization story legible. The author is clear that the biggest gains came from infrastructure decisions, not clever code:

  • Wrong region → +900ms
  • Cold WebSocket connections → +300ms
  • Slow inference provider → +200–400ms

That’s over a second of latency that had nothing to do with algorithm design. For anyone building voice AI, this framing is genuinely useful: measure your infrastructure before you optimize your code.


Pricing & Alternatives

The author used ~$100 in API credits to build and test the whole thing. Here’s a rough breakdown of the services involved and their tradeoffs:

Component            | What Was Used        | Notable Alternative          | Tradeoff
Telephony            | Twilio               | Vonage, Telnyx               | Twilio has the best docs/ecosystem
STT + turn detection | Deepgram Flux        | Whisper (OpenAI), AssemblyAI | Deepgram is faster for streaming
LLM inference        | Groq (llama-3.3-70b) | OpenAI gpt-4o-mini           | Groq ~4x faster TTFT, less flexible
TTS                  | ElevenLabs           | Cartesia, PlayHT             | ElevenLabs most natural, pricier
Hosting              | Railway.app (EU)     | Fly.io, Render               | Proximity to API endpoints matters

The choice of Groq is the most opinionated one here. The author started with gpt-4o-mini (300–500ms TTFT) before switching to Groq’s llama-3.3-70b (~80ms TTFT). For a voice agent where humans expect sub-300ms response times to feel natural, that difference is the line between “impressive demo” and “broken product.”


The Bottom Line: Who Should Care?

Developers building voice AI products should read this post top to bottom. It’s the rare engineering writeup that includes actual latency numbers at each stage — not just the final result.

Indie hackers and AI startup founders will appreciate that this was built in ~1 day for ~$100. The barrier to building a working voice agent is lower than most people think. The barrier to building one that feels good is where the real work is.

LLM infrastructure teams should take note of the Groq vs. OpenAI comparison. When TTFT is literally more than half your total latency, inference speed becomes a product feature, not an implementation detail.

Anyone surprised by their own voice agent latency: if you haven’t measured the network distance between your orchestration server and your API providers, do that before touching your code. The biggest gains here came from geography, not algorithms.

The code is open source at github.com/NickTikhonov/shuo. Whether you use it directly or just as a reference architecture, it’s one of the better practical examples of streaming voice AI that’s publicly available right now.


Sources