<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Neha's Substack]]></title><description><![CDATA[My personal Substack]]></description><link>https://blog.polos.dev</link><image><url>https://substackcdn.com/image/fetch/$s_!5Npj!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2752c1b-fbc2-4835-aa11-dfa5acbe0b47_574x574.jpeg</url><title>Neha&apos;s Substack</title><link>https://blog.polos.dev</link></image><generator>Substack</generator><lastBuildDate>Tue, 07 Apr 2026 05:28:10 GMT</lastBuildDate><atom:link href="https://blog.polos.dev/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Neha Deodhar]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[ndeodhar@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[ndeodhar@substack.com]]></itunes:email><itunes:name><![CDATA[Neha Deodhar]]></itunes:name></itunes:owner><itunes:author><![CDATA[Neha Deodhar]]></itunes:author><googleplay:owner><![CDATA[ndeodhar@substack.com]]></googleplay:owner><googleplay:email><![CDATA[ndeodhar@substack.com]]></googleplay:email><googleplay:author><![CDATA[Neha Deodhar]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[I Built a Coding Agent That Fixes GitHub Issues - in just a few lines of code]]></title><description><![CDATA[I built a coding agent that picks up GitHub issues, writes the fix in a sandbox, and pings me on Slack when the PR is ready for approval - without writing a single line of Docker, Slack or state management code.]]></description><link>https://blog.polos.dev/p/i-built-a-coding-agent-that-fixes</link><guid 
isPermaLink="false">https://blog.polos.dev/p/i-built-a-coding-agent-that-fixes</guid><dc:creator><![CDATA[Neha Deodhar]]></dc:creator><pubDate>Wed, 25 Feb 2026 00:23:52 GMT</pubDate><enclosure url="https://substackcdn.com/image/youtube/w_728,c_limit/KYVBpdZ_5eM" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I built a coding agent that picks up GitHub issues, writes the fix in a sandbox, and pings me on Slack when the PR is ready for approval - without writing a single line of Docker, Slack or state management code. Here&#8217;s how.</p><h2><strong>The Workflow</strong></h2><p>A new GitHub issue triggers the workflow. The agent comments on the issue and clones the repo into a sandboxed Docker container.</p><p>Two agents take over - planner and coder. Both get built-in sandbox tools automatically: shell execution, file read/write, edit, glob, grep, web search. I just defined the agent goals. Polos handled the sandbox lifecycle, tool wiring, and coordination.</p><p>The coder finishes. The workflow pauses for human review.
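</p><p>Polos handles the suspend/resume mechanics for that pause, but the control flow of an approval gate is easy to picture on its own. A toy sketch in TypeScript - the names here are illustrative, not the Polos API:</p>

```typescript
// Toy human-in-the-loop gate: the workflow awaits an external decision
// before continuing. (Illustrative only - not the Polos API.)
function approvalGate() {
  let resolve!: (approved: boolean) => void;
  const decision = new Promise<boolean>((res) => (resolve = res));
  return {
    decision,
    approve: () => resolve(true),  // e.g. wired to a Slack "approve" action
    reject: () => resolve(false),
  };
}

async function reviewStep(gate: { decision: Promise<boolean> }): Promise<string> {
  // The workflow parks here until a human decides.
  const approved = await gate.decision;
  return approved ? "open the PR" : "discard the branch";
}
```

<p>The real thing differs in one important way: a suspended Polos workflow is persisted, so it survives restarts and consumes no compute while it waits.</p><p>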
I get a Slack notification, review the diff, approve, PR is live.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;typescript&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-typescript">const sandbox = sandboxTools({
  env: 'docker',
  scope: 'session',
  docker: { image: 'node:20-slim', memory: '2g' },
});

const planner = defineAgent({
  id: 'planner',
  model: anthropic('claude-sonnet-4-5'),
  systemPrompt: 'Analyze the issue and create an execution plan.',
  tools: [...sandbox],
});

const coder = defineAgent({
  id: 'coder',
  model: anthropic('claude-sonnet-4-5'),
  systemPrompt: 'Implement the plan. Read, write, and test code.',
  tools: [...sandbox],
});</code></pre></div><p>I didn&#8217;t have to figure out how to create the Docker container, execute commands inside it, manage file system access, or keep the same sandbox alive across multiple tool calls within a session. Polos manages the full sandbox lifecycle - creation, tool execution, persistence across calls, and cleanup.</p><p>Full working example: <strong><a href="https://github.com/polos-dev/polos-examples/tree/main/github-issue-fixer/typescript">TypeScript</a></strong> | <strong><a href="https://github.com/polos-dev/polos-examples/tree/main/github-issue-fixer/python">Python</a></strong></p><h3><strong>Demo</strong></h3><p>3-minute video:</p><div id="youtube2-KYVBpdZ_5eM" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;KYVBpdZ_5eM&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/KYVBpdZ_5eM?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p>For this demo, I used a fork of the Zod repo and gave the agent an existing issue. Within seconds it commented on the issue and started working in the sandbox. A few minutes later, Slack notification - coder finished, ready for review. Approved from my phone. PR was live.</p><h2><strong>What I Used</strong></h2><p>I built this with <a href="https://github.com/polos-dev/polos">Polos</a>, an open-source runtime for AI agents.
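</p><p>A core idea inside such a runtime is durable execution: record the result of every side effect in a log, and on retry return the recorded result instead of re-running the effect. A toy sketch of that trick - illustrative only, not Polos internals:</p>

```typescript
// Toy durable-step runner: a completed step's result is read back from
// the log on replay instead of re-executing its side effect.
type Log = Map<string, unknown>;

async function step<T>(log: Log, id: string, effect: () => Promise<T>): Promise<T> {
  if (log.has(id)) return log.get(id) as T; // replay: skip the side effect
  const result = await effect();            // first run: execute it
  log.set(id, result);                      // checkpoint before moving on
  return result;
}
```

<p>Wrap each side effect in a step like this and a crashed run can simply be re-executed from the top: finished steps return instantly from the log, and execution resumes at the first step with no recorded result.</p><p>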
What I got out of the box:</p><ul><li><p><strong>Sandboxed execution</strong> - agents run inside managed Docker containers with built-in tools for shell, files, and web search</p></li><li><p><strong>Slack integration</strong> - @mention agents, get responses in thread, receive notifications when agents need input</p></li><li><p><strong>Durable workflows</strong> - agent fails at step 47 of 50, resumes from 47</p></li><li><p><strong>Observability</strong> - OpenTelemetry tracing for every tool call and decision</p></li><li><p><strong>LLM agnostic</strong> - any provider via Vercel AI SDK and LiteLLM</p></li></ul><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">curl -fsSL https://install.polos.dev/install.sh | bash
npx create-polos
cd my-project &amp;&amp; polos dev</code></pre></div><p>GitHub repo: <a href="https://github.com/polos-dev/polos">github.com/polos-dev/polos</a></p><p>100% open source. Python and TypeScript.</p><p>If you&#8217;re building agents that do real work, run commands, touch real systems - give it a try. I&#8217;d love to hear what you build.</p>]]></content:encoded></item><item><title><![CDATA[Why I Built Polos: Durable Execution for AI Agents]]></title><description><![CDATA[When I started building AI agents, getting a demo working was easy.]]></description><link>https://blog.polos.dev/p/why-i-built-polos-durable-execution</link><guid isPermaLink="false">https://blog.polos.dev/p/why-i-built-polos-durable-execution</guid><dc:creator><![CDATA[Neha Deodhar]]></dc:creator><pubDate>Mon, 09 Feb 2026 16:00:57 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!n5v4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e7a29c4-a7ca-4ea3-963e-44c50f2df74a_1318x1422.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>When I started building AI agents, getting a demo working was easy.
But once we put them in production, things got complicated fast.</p><p>We needed a bunch of infra:</p><ul><li><p><strong>Kafka</strong> to pass events between agents</p></li><li><p><strong>Retry logic</strong> - but if an agent fails halfway through, you can&#8217;t just restart it. It may have already charged the customer or sent an email.</p></li><li><p><strong>Concurrency control</strong> so we didn&#8217;t blow through our OpenAI quota</p></li><li><p><strong>Observability</strong> to actually see what the agents were doing</p></li></ul><p>We ended up bolting together Kafka, durable execution frameworks, a bunch of heavyweight infrastructure - just to run agents reliably. And then we were stuck operating all of it. Time we should&#8217;ve spent building the actual product!</p><p>I realized every team building agents hits the same wall. We&#8217;re missing an AI-native platform that handles this out of the box.</p><p>That&#8217;s why I built Polos.</p><h2><strong>What is Polos?</strong></h2><p>Polos is a durable execution platform for AI agents.
It gives you stateful infrastructure to run long-running, autonomous agents reliably at scale - with a built-in event system, so you don&#8217;t need to bolt on Kafka or RabbitMQ.</p><p>You write plain Python or TypeScript. No DAGs, no graph syntax. Polos handles the durability, the retries, the coordination.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!n5v4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e7a29c4-a7ca-4ea3-963e-44c50f2df74a_1318x1422.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!n5v4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e7a29c4-a7ca-4ea3-963e-44c50f2df74a_1318x1422.png 424w, https://substackcdn.com/image/fetch/$s_!n5v4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e7a29c4-a7ca-4ea3-963e-44c50f2df74a_1318x1422.png 848w, https://substackcdn.com/image/fetch/$s_!n5v4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e7a29c4-a7ca-4ea3-963e-44c50f2df74a_1318x1422.png 1272w, https://substackcdn.com/image/fetch/$s_!n5v4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e7a29c4-a7ca-4ea3-963e-44c50f2df74a_1318x1422.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!n5v4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e7a29c4-a7ca-4ea3-963e-44c50f2df74a_1318x1422.png" width="1318" height="1422" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4e7a29c4-a7ca-4ea3-963e-44c50f2df74a_1318x1422.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1422,&quot;width&quot;:1318,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:317602,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://ndeodhar.substack.com/i/187253934?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e7a29c4-a7ca-4ea3-963e-44c50f2df74a_1318x1422.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!n5v4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e7a29c4-a7ca-4ea3-963e-44c50f2df74a_1318x1422.png 424w, https://substackcdn.com/image/fetch/$s_!n5v4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e7a29c4-a7ca-4ea3-963e-44c50f2df74a_1318x1422.png 848w, https://substackcdn.com/image/fetch/$s_!n5v4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e7a29c4-a7ca-4ea3-963e-44c50f2df74a_1318x1422.png 1272w, https://substackcdn.com/image/fetch/$s_!n5v4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e7a29c4-a7ca-4ea3-963e-44c50f2df74a_1318x1422.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This workflow survives crashes, resumes mid-execution, and pauses for approval - with zero manual checkpointing.</p><h2><strong>What Polos Gives You</strong></h2><p><strong>Durable state.</strong> If your agent crashes on step 18 of 20, it resumes from step 18. Not step 1. Every side effect - LLM calls, tool executions, API requests - is checkpointed. If your agent already charged Stripe via a tool call before the crash, that charge isn&#8217;t repeated on resume. Polos replays the result from its log. No wasted LLM calls, no duplicate charges, no double-sends.</p><p><strong>Global concurrency.</strong> System-wide rate limiting so one rogue agent can&#8217;t exhaust your entire API quota.
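</p><p>The idea behind a concurrency key can be shown with a toy per-key limiter - a sketch of the concept, not the Polos API:</p>

```typescript
// Toy per-key concurrency limiter: at most `limit` tasks run at once
// under the same key; the rest wait their turn.
class KeyedLimiter {
  private active = new Map<string, number>();
  private waiters = new Map<string, Array<() => void>>();

  constructor(private limit: number) {}

  async run<T>(key: string, task: () => Promise<T>): Promise<T> {
    while ((this.active.get(key) ?? 0) >= this.limit) {
      // Over the limit for this key: park until a slot frees up.
      await new Promise<void>((res) => {
        const queue = this.waiters.get(key) ?? [];
        queue.push(res);
        this.waiters.set(key, queue);
      });
    }
    this.active.set(key, (this.active.get(key) ?? 0) + 1);
    try {
      return await task();
    } finally {
      this.active.set(key, (this.active.get(key) ?? 0) - 1);
      this.waiters.get(key)?.shift()?.(); // wake one waiter for this key
    }
  }
}
```

<p>Every call made under the same key - say, one key per provider account - then shares a single budget, no matter which agent issued it.</p><p>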
Queues and concurrency keys give you fine-grained control.</p><p><strong>Human-in-the-loop.</strong> Pause execution for hours or days, wait for a user signal or approval, and resume with full context. Paused agents consume zero compute.</p><p><strong>Exactly-once execution.</strong> Charging Stripe, sending an email - all actions happen once, even if you retry the workflow. Polos checkpoints every side effect.</p><p><strong>Built-in observability.</strong> Trace every tool call, every decision. See why your agent chose Tool B over Tool A.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3XzG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bce057a-596f-4e3c-9046-47642b6bb682_3002x1550.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3XzG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bce057a-596f-4e3c-9046-47642b6bb682_3002x1550.png 424w, https://substackcdn.com/image/fetch/$s_!3XzG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bce057a-596f-4e3c-9046-47642b6bb682_3002x1550.png 848w, https://substackcdn.com/image/fetch/$s_!3XzG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bce057a-596f-4e3c-9046-47642b6bb682_3002x1550.png 1272w, https://substackcdn.com/image/fetch/$s_!3XzG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bce057a-596f-4e3c-9046-47642b6bb682_3002x1550.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!3XzG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bce057a-596f-4e3c-9046-47642b6bb682_3002x1550.png" width="1456" height="752" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6bce057a-596f-4e3c-9046-47642b6bb682_3002x1550.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:752,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:515447,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://ndeodhar.substack.com/i/187253934?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bce057a-596f-4e3c-9046-47642b6bb682_3002x1550.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3XzG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bce057a-596f-4e3c-9046-47642b6bb682_3002x1550.png 424w, https://substackcdn.com/image/fetch/$s_!3XzG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bce057a-596f-4e3c-9046-47642b6bb682_3002x1550.png 848w, https://substackcdn.com/image/fetch/$s_!3XzG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bce057a-596f-4e3c-9046-47642b6bb682_3002x1550.png 1272w, https://substackcdn.com/image/fetch/$s_!3XzG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bce057a-596f-4e3c-9046-47642b6bb682_3002x1550.png 1456w" 
sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Z25I!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cff5755-6087-4c03-b1db-8b16da52b392_2998x1558.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!Z25I!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cff5755-6087-4c03-b1db-8b16da52b392_2998x1558.png 424w, https://substackcdn.com/image/fetch/$s_!Z25I!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cff5755-6087-4c03-b1db-8b16da52b392_2998x1558.png 848w, https://substackcdn.com/image/fetch/$s_!Z25I!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cff5755-6087-4c03-b1db-8b16da52b392_2998x1558.png 1272w, https://substackcdn.com/image/fetch/$s_!Z25I!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cff5755-6087-4c03-b1db-8b16da52b392_2998x1558.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Z25I!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cff5755-6087-4c03-b1db-8b16da52b392_2998x1558.png" width="1456" height="757" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2cff5755-6087-4c03-b1db-8b16da52b392_2998x1558.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:757,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:464748,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://ndeodhar.substack.com/i/187253934?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cff5755-6087-4c03-b1db-8b16da52b392_2998x1558.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Z25I!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cff5755-6087-4c03-b1db-8b16da52b392_2998x1558.png 424w, https://substackcdn.com/image/fetch/$s_!Z25I!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cff5755-6087-4c03-b1db-8b16da52b392_2998x1558.png 848w, https://substackcdn.com/image/fetch/$s_!Z25I!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cff5755-6087-4c03-b1db-8b16da52b392_2998x1558.png 1272w, https://substackcdn.com/image/fetch/$s_!Z25I!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cff5755-6087-4c03-b1db-8b16da52b392_2998x1558.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" 
stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2><strong>See It In Action</strong></h2><p>Here&#8217;s a short demo showing <strong>crash recovery</strong> in practice:</p><ol><li><p>We start an order workflow. The agent charges the customer via Stripe - the charge succeeds, and Polos checkpoints the result.</p></li><li><p>Since the order amount is flagged as unusual, the workflow suspends for a fraud review.</p></li><li><p>While waiting for the fraud team, the worker crashes.</p></li><li><p>In most frameworks, this workflow is dead. You&#8217;d need to handle the failure manually and risk charging the customer again.</p></li><li><p>With Polos, we simply start a new worker. When the fraud team approves, Worker 2 picks up the workflow exactly where it left off.</p></li><li><p>Stripe is not called again - Polos replays the result from its log. 
The confirmation email is sent, and the workflow completes.</p></li></ol><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;8c7e09a3-a5b0-472a-8a4c-ece3dab80988&quot;,&quot;duration&quot;:null}"></div><h2><strong>How It Works</strong></h2><p>Polos has three components:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!a2u5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F896e71de-8a32-419c-8ce1-b66c2c9a6210_3024x1694.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!a2u5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F896e71de-8a32-419c-8ce1-b66c2c9a6210_3024x1694.png 424w, https://substackcdn.com/image/fetch/$s_!a2u5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F896e71de-8a32-419c-8ce1-b66c2c9a6210_3024x1694.png 848w, https://substackcdn.com/image/fetch/$s_!a2u5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F896e71de-8a32-419c-8ce1-b66c2c9a6210_3024x1694.png 1272w, https://substackcdn.com/image/fetch/$s_!a2u5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F896e71de-8a32-419c-8ce1-b66c2c9a6210_3024x1694.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!a2u5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F896e71de-8a32-419c-8ce1-b66c2c9a6210_3024x1694.png" width="1456" height="816" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/896e71de-8a32-419c-8ce1-b66c2c9a6210_3024x1694.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:816,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:264122,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://ndeodhar.substack.com/i/187253934?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F896e71de-8a32-419c-8ce1-b66c2c9a6210_3024x1694.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!a2u5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F896e71de-8a32-419c-8ce1-b66c2c9a6210_3024x1694.png 424w, https://substackcdn.com/image/fetch/$s_!a2u5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F896e71de-8a32-419c-8ce1-b66c2c9a6210_3024x1694.png 848w, https://substackcdn.com/image/fetch/$s_!a2u5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F896e71de-8a32-419c-8ce1-b66c2c9a6210_3024x1694.png 1272w, https://substackcdn.com/image/fetch/$s_!a2u5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F896e71de-8a32-419c-8ce1-b66c2c9a6210_3024x1694.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Orchestrator</strong>: manages workflow state, persists every side effect to a durable log, handles event routing, scheduling, and concurrency control. If a worker dies, the orchestrator knows exactly where execution left off and schedules the workflow on a different worker.</p><p><strong>Workers</strong>: run your code. They connect to the orchestrator, pick up workflow steps, execute them (including LLM calls and tool invocations), and report results back. Workers are stateless and horizontally scalable.</p><p><strong>SDK</strong>: what you import in your code. 
Provides the @workflow decorator, Agent class, and the WorkflowContext that gives you durable steps, suspend/resume, events, and concurrency primitives.</p><p>Under the hood, Polos captures the result of every side effect - tool calls, API responses, time delays - as a durable log. If your process dies, Polos replays the workflow from the log, returning previously-recorded results instead of re-executing them. Your agent&#8217;s exact local variables and call stack are restored in milliseconds.</p><p>Completed steps are never re-executed - so you never pay for an LLM call twice.</p><h2><strong>Get Involved</strong></h2><p>Polos is open source: <strong><a href="http://github.com/polos-dev/polos">github.com/polos-dev/polos</a></strong></p><p>Star us on GitHub, join the <strong><a href="https://discord.gg/ZAxHKMPwFG">Discord</a></strong>, and give it a spin. We&#8217;re building this in the open and would love your feedback and contributions.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.polos.dev/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Neha's Substack! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[How Prompt Caching Cut My API Costs by 60%: A Real-World Experiment]]></title><description><![CDATA[If you&#8217;re building multi-turn conversations or agentic workflows, you&#8217;re probably re-sending the same tokens over and over again - tool definitions, system prompts, and conversation history.]]></description><link>https://blog.polos.dev/p/how-prompt-caching-cut-my-api-costs</link><guid isPermaLink="false">https://blog.polos.dev/p/how-prompt-caching-cut-my-api-costs</guid><dc:creator><![CDATA[Neha Deodhar]]></dc:creator><pubDate>Sun, 08 Feb 2026 18:01:27 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!20iM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a450181-6e7a-44cf-afe3-f31fbaf25511_2310x740.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>If you&#8217;re building multi-turn conversations or agentic workflows, you&#8217;re probably re-sending the same tokens over and over again - tool definitions, system prompts, and conversation history. Every single request. That&#8217;s expensive.</p><p>Prompt caching lets you cache static content and pay just 10% of the normal input token price on subsequent requests. 
I ran an experiment with Anthropic&#8217;s Claude Sonnet 4.5 to see how much this actually saves in practice.</p><p>But first, a bit of background on how prompt caching works...</p><h2><strong>How Prompt Caching Works Under the Hood</strong></h2><p>When you send a request with caching enabled, Anthropic computes a hash of your prompt content up to each cache breakpoint (defined in the next section). If that exact hash exists in the cache from a recent request, the system skips reprocessing those tokens entirely - it just loads the cached computation state and continues from there. This is why cache reads are so cheap (10% of the base price): you&#8217;re not paying for the model to process those tokens again, just to retrieve the precomputed state.</p><p>The default cache lifetime is 5 minutes, but here&#8217;s the key detail: <strong>the cache refreshes at no additional cost each time it&#8217;s used</strong>. So if you&#8217;re actively conversing with the model and hitting the cache every minute or two, that 5-minute window keeps resetting. The cache only expires after 5 minutes of <em>inactivity</em>.
For active conversations or agentic loops, this means your cache essentially stays warm indefinitely. Anthropic also offers a 1-hour TTL at a higher write cost (2x base price instead of 1.25x) for workflows where requests are more spread out.</p><h3><strong>Cache Breakpoints and the Hierarchy</strong></h3><p>A <strong>cache breakpoint</strong> is where you place cache_control: {"type": "ephemeral"} in your request. It tells Anthropic: &#8220;cache everything from the start of the request up to this point.&#8221; You can have up to 4 breakpoints per request.</p><p>The cache follows a strict hierarchy: <strong>tools &#8594; system &#8594; messages</strong>. This ordering matters because caches are cumulative - each level builds on the previous ones. Here&#8217;s how invalidation works:</p><ul><li><p><strong>Change your tools?</strong> The entire cache invalidates (tools, system, and messages).</p></li><li><p><strong>Change your system prompt?</strong> The tools cache survives, but system and messages caches invalidate.</p></li><li><p><strong>Change a message?</strong> Tools and system caches survive, but the messages cache from that point forward invalidates.</p></li></ul><p>This is why placing breakpoints strategically matters. If your tools rarely change but your system prompt updates daily, put separate breakpoints on each. That way, a system prompt change doesn&#8217;t force you to re-cache the tools.</p><p>One key difference from OpenAI: <strong>OpenAI caches prompts automatically</strong> with no configuration needed.
Anthropic requires explicit cache_control breakpoints, giving you more control over what gets cached and when, but requiring more upfront thought about your caching strategy.</p><h2><strong>The Experiment</strong></h2><p>I set up a 7-turn conversation with Claude Sonnet 4.5 that included:</p><ul><li><p><strong>Tool definitions</strong> (~2K tokens of function schemas)</p></li><li><p><strong>A detailed system prompt</strong> (~6K tokens of instructions and context)</p></li><li><p><strong>Growing conversation history</strong> (accumulating with each turn)</p></li></ul><p>I ran the same conversation four different ways</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!20iM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a450181-6e7a-44cf-afe3-f31fbaf25511_2310x740.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!20iM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a450181-6e7a-44cf-afe3-f31fbaf25511_2310x740.png 424w, https://substackcdn.com/image/fetch/$s_!20iM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a450181-6e7a-44cf-afe3-f31fbaf25511_2310x740.png 848w, https://substackcdn.com/image/fetch/$s_!20iM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a450181-6e7a-44cf-afe3-f31fbaf25511_2310x740.png 1272w, https://substackcdn.com/image/fetch/$s_!20iM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a450181-6e7a-44cf-afe3-f31fbaf25511_2310x740.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!20iM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a450181-6e7a-44cf-afe3-f31fbaf25511_2310x740.png" width="1456" height="466" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1a450181-6e7a-44cf-afe3-f31fbaf25511_2310x740.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:466,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:176254,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://ndeodhar.substack.com/i/187252757?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a450181-6e7a-44cf-afe3-f31fbaf25511_2310x740.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!20iM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a450181-6e7a-44cf-afe3-f31fbaf25511_2310x740.png 424w, https://substackcdn.com/image/fetch/$s_!20iM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a450181-6e7a-44cf-afe3-f31fbaf25511_2310x740.png 848w, https://substackcdn.com/image/fetch/$s_!20iM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a450181-6e7a-44cf-afe3-f31fbaf25511_2310x740.png 1272w, https://substackcdn.com/image/fetch/$s_!20iM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a450181-6e7a-44cf-afe3-f31fbaf25511_2310x740.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a></figure></div><h2><strong>The Results</strong></h2><p>Here&#8217;s what each request cost across the four strategies:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QYs_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6701015e-b949-45ee-98b7-111adce83ed6_1906x996.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp"
srcset="https://substackcdn.com/image/fetch/$s_!QYs_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6701015e-b949-45ee-98b7-111adce83ed6_1906x996.png 424w, https://substackcdn.com/image/fetch/$s_!QYs_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6701015e-b949-45ee-98b7-111adce83ed6_1906x996.png 848w, https://substackcdn.com/image/fetch/$s_!QYs_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6701015e-b949-45ee-98b7-111adce83ed6_1906x996.png 1272w, https://substackcdn.com/image/fetch/$s_!QYs_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6701015e-b949-45ee-98b7-111adce83ed6_1906x996.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!QYs_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6701015e-b949-45ee-98b7-111adce83ed6_1906x996.png" width="1456" height="761" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6701015e-b949-45ee-98b7-111adce83ed6_1906x996.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:761,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:233051,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://ndeodhar.substack.com/i/187252757?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6701015e-b949-45ee-98b7-111adce83ed6_1906x996.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!QYs_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6701015e-b949-45ee-98b7-111adce83ed6_1906x996.png 424w, https://substackcdn.com/image/fetch/$s_!QYs_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6701015e-b949-45ee-98b7-111adce83ed6_1906x996.png 848w, https://substackcdn.com/image/fetch/$s_!QYs_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6701015e-b949-45ee-98b7-111adce83ed6_1906x996.png 1272w, https://substackcdn.com/image/fetch/$s_!QYs_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6701015e-b949-45ee-98b7-111adce83ed6_1906x996.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" 
stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2><strong>Cost Breakdown</strong></h2><p>To see why caching wins, look at the economics:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!c1iD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86c9d5bf-68de-44fa-96ce-0aa2828122dd_2158x858.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!c1iD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86c9d5bf-68de-44fa-96ce-0aa2828122dd_2158x858.png 424w, https://substackcdn.com/image/fetch/$s_!c1iD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86c9d5bf-68de-44fa-96ce-0aa2828122dd_2158x858.png 848w, https://substackcdn.com/image/fetch/$s_!c1iD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86c9d5bf-68de-44fa-96ce-0aa2828122dd_2158x858.png 1272w, https://substackcdn.com/image/fetch/$s_!c1iD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86c9d5bf-68de-44fa-96ce-0aa2828122dd_2158x858.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!c1iD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86c9d5bf-68de-44fa-96ce-0aa2828122dd_2158x858.png" width="1456" height="579" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/86c9d5bf-68de-44fa-96ce-0aa2828122dd_2158x858.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:579,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:188314,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://ndeodhar.substack.com/i/187252757?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86c9d5bf-68de-44fa-96ce-0aa2828122dd_2158x858.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!c1iD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86c9d5bf-68de-44fa-96ce-0aa2828122dd_2158x858.png 424w, https://substackcdn.com/image/fetch/$s_!c1iD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86c9d5bf-68de-44fa-96ce-0aa2828122dd_2158x858.png 848w, https://substackcdn.com/image/fetch/$s_!c1iD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86c9d5bf-68de-44fa-96ce-0aa2828122dd_2158x858.png 1272w, https://substackcdn.com/image/fetch/$s_!c1iD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86c9d5bf-68de-44fa-96ce-0aa2828122dd_2158x858.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Without caching, you pay full price to process every token on every request. With caching, the first request pays a <strong>25% premium to populate the cache</strong>. But every subsequent request reads those tokens at <strong>just 10% of the base price</strong>.
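</p><p>To sanity-check those multipliers, here&#8217;s a quick back-of-the-envelope model in Python. The prices are Anthropic&#8217;s list rates for Claude Sonnet 4.5 ($3 per million input tokens, with the 1.25x write and 0.1x read multipliers from above); the token counts are round illustrative numbers, not the exact figures from my runs:</p>

```python
# Back-of-the-envelope prompt-caching economics (output tokens omitted,
# since they cost the same with or without caching).
# Assumed Claude Sonnet 4.5 list prices, USD per input token:
BASE = 3.00 / 1e6    # normal input
WRITE = 3.75 / 1e6   # 5-minute cache write (1.25x base)
READ = 0.30 / 1e6    # cache read (0.1x base)

def cost_without_cache(prefix, turns, growth):
    # Every turn reprocesses the static prefix plus all history so far.
    return sum((prefix + t * growth) * BASE for t in range(turns))

def cost_with_cache(prefix, turns, growth):
    # Turn 1 writes the static prefix to the cache; each later turn reads
    # everything cached so far and writes only the new conversation tokens.
    total = prefix * WRITE
    for t in range(1, turns):
        total += (prefix + (t - 1) * growth) * READ + growth * WRITE
    return total

# Shape of the experiment: ~8K-token static prefix (tools + system),
# 7 turns, ~1K new conversation tokens per turn.
uncached = cost_without_cache(8000, 7, 1000)
cached = cost_with_cache(8000, 7, 1000)
assert cached < uncached
```

<p>On these rough numbers the cached input bill comes out about 3x smaller; adding back the output tokens (identical either way) pulls the end-to-end savings toward the 60% I measured.</p><p>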
Over a 7-turn conversation, the math overwhelmingly favors caching.</p><h2><strong>What&#8217;s Actually Happening Under the Hood</strong></h2><p>Let&#8217;s look at the cache behavior (tokens) for the fully cached strategy:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gd5D!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26f160b3-c860-4b6c-b0a5-c39e656cce3f_1436x990.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gd5D!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26f160b3-c860-4b6c-b0a5-c39e656cce3f_1436x990.png 424w, https://substackcdn.com/image/fetch/$s_!gd5D!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26f160b3-c860-4b6c-b0a5-c39e656cce3f_1436x990.png 848w, https://substackcdn.com/image/fetch/$s_!gd5D!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26f160b3-c860-4b6c-b0a5-c39e656cce3f_1436x990.png 1272w, https://substackcdn.com/image/fetch/$s_!gd5D!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26f160b3-c860-4b6c-b0a5-c39e656cce3f_1436x990.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gd5D!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26f160b3-c860-4b6c-b0a5-c39e656cce3f_1436x990.png" width="526" height="362.63231197771586" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/26f160b3-c860-4b6c-b0a5-c39e656cce3f_1436x990.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:990,&quot;width&quot;:1436,&quot;resizeWidth&quot;:526,&quot;bytes&quot;:141282,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://ndeodhar.substack.com/i/187252757?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26f160b3-c860-4b6c-b0a5-c39e656cce3f_1436x990.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!gd5D!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26f160b3-c860-4b6c-b0a5-c39e656cce3f_1436x990.png 424w, https://substackcdn.com/image/fetch/$s_!gd5D!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26f160b3-c860-4b6c-b0a5-c39e656cce3f_1436x990.png 848w, https://substackcdn.com/image/fetch/$s_!gd5D!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26f160b3-c860-4b6c-b0a5-c39e656cce3f_1436x990.png 1272w, https://substackcdn.com/image/fetch/$s_!gd5D!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26f160b3-c860-4b6c-b0a5-c39e656cce3f_1436x990.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
></svg></button></div></div></div></a></figure></div><p><strong>Request 1</strong> is the &#8220;cold start&#8221; - nothing is cached yet, so we write 8,093 tokens to the cache. This actually costs <em>more</em> than no caching (cache writes are 1.25x the base input price).</p><p><strong>Request 2</strong> onward is where caching pays off. We&#8217;re reading thousands of tokens from cache at 0.1x the normal price, and only writing the new conversation turns.</p><p>By <strong>Request 7</strong>, we&#8217;re reading 13,464 tokens from cache and only writing 362 new tokens. That&#8217;s why the cost dropped to $0.01.</p><h2><strong>The Counterintuitive First Request</strong></h2><p>Notice that the first request with caching enabled ($0.06) actually costs <em>more</em> than without caching ($0.05).
This is the cache write penalty - you pay 25% extra to populate the cache.</p><p>But this pays for itself immediately. By Request 2, the cached version is already cheaper ($0.02 vs $0.04), and the gap only widens from there.</p><p><strong>The breakeven point is just 2 requests.</strong></p><h2><strong>Why &#8220;Tools Only&#8221; and &#8220;Partial&#8221; Performed the Same</strong></h2><p>In my experiment, caching just tools vs. caching tools + system prompt showed identical costs. This is because both approaches left the <strong>conversation history uncached</strong>, and that&#8217;s what dominated the cost.</p><p>By Request 7, the fully cached experiment was reading 13,464 tokens from cache. The tools and system prompt together account for maybe 8K of those tokens. The remaining 5K+ is conversation history that accumulated over the 7 turns.</p><p>When you only cache the static prefix (tools, or tools + system), you&#8217;re still reprocessing that growing conversation history on every single request. The marginal savings from caching an extra few thousand tokens of system prompt get swamped by the cost of reprocessing 10K+ tokens of conversation.</p><p><strong>The real gains came from caching the conversation itself</strong> - that&#8217;s where the &#8220;fully cached&#8221; strategy pulled ahead.</p><h3><strong>Implementation Tips</strong></h3><p>Here&#8217;s the pattern that worked (using the Anthropic Python SDK):</p><pre><code><code>import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=4096,
    tools=[
        # ... your tools here ...
        {
            "name": "final_tool",
            "description": "...",
            "input_schema": {...},
            "cache_control": {"type": "ephemeral"}  # Cache breakpoint 1
        }
    ],
    system=[
        {
            "type": "text",
            "text": "Your detailed system prompt...",
            "cache_control": {"type": "ephemeral"}  # Cache breakpoint 2
        }
    ],
    messages=[
        # ... previous conversation turns ...
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Latest user message",
                    "cache_control": {"type": "ephemeral"} # Cache breakpoint 3
                }
            ]
        }
    ]
)</code></code></pre><p><strong>Key points:</strong></p><ol><li><p>Put cache_control on the <strong>last item</strong> in each section you want cached</p></li><li><p>The cache is hierarchical: tools &#8594; system &#8594; messages</p></li><li><p>Place your cache breakpoint at the end of each turn to incrementally cache the conversation</p></li></ol><h2><strong>When to Use Prompt Caching</strong></h2><p>Caching makes sense when you have:</p><ul><li><p><strong>Long system prompts</strong> (instructions, examples, documentation)</p></li><li><p><strong>Large context windows</strong> (RAG documents, code files)</p></li><li><p><strong>Multi-turn conversations</strong> (chatbots, agents)</p></li><li><p><strong>Repetitive tool definitions</strong> (same tools across many requests)</p></li></ul><p>It&#8217;s especially powerful for agentic workflows where you might make 10-20 API calls in a single task, each building on the previous context.</p><h2><strong>The Bottom Line</strong></h2><p>For a 7-turn conversation:</p><ul><li><p><strong>No caching</strong>: $0.32</p></li><li><p><strong>Full caching</strong>: $0.13</p></li><li><p><strong>Savings</strong>: 60%</p></li></ul><p>The cache writes cost extra on the first request, but you break even by Request 2 and save significantly from there. For any multi-turn application, prompt caching is essentially free money.</p><div><hr></div><p><em>Experiment run with Claude Sonnet 4.5. Actual savings will vary based on your prompt structure and conversation length.</em></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.polos.dev/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Neha's Substack! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[The Distributed Systems Problem: Why AI Agents Break in Production]]></title><description><![CDATA[If you&#8217;ve shipped an AI agent to production, you know the Day 2 problem.]]></description><link>https://blog.polos.dev/p/the-distributed-systems-problem-why</link><guid isPermaLink="false">https://blog.polos.dev/p/the-distributed-systems-problem-why</guid><dc:creator><![CDATA[Neha Deodhar]]></dc:creator><pubDate>Sun, 08 Feb 2026 01:32:35 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!5Npj!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2752c1b-fbc2-4835-aa11-dfa5acbe0b47_574x574.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>If you&#8217;ve shipped an AI agent to production, you know the Day 2 problem. <strong>Over 90% of AI agents fail before production.</strong> The models work. The infrastructure doesn&#8217;t. We&#8217;re running agents - long-running, stateful, autonomous workflows - on systems designed for stateless request-response.</p><p>When a 45-minute agent workflow dies at step 38 because your server restarted, you&#8217;ve lost more than an error log. You&#8217;ve burned tokens, API quota, and user trust. 
And unlike a failed HTTP request, you can&#8217;t just &#8220;retry&#8221; - the agent&#8217;s reasoning state is gone.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.polos.dev/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Neha's Substack! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>I&#8217;ve spent the last year building AI infrastructure. The tooling is shockingly immature. Every team hits the same failure modes and rebuilds the same broken solutions: brittle retry logic, state management hacks, rate limiting Band-Aids.</p><p>The problem isn&#8217;t the agents. It&#8217;s the <strong>infrastructure gap</strong>.</p><div><hr></div><h2><strong>Agents Break Every Infrastructure Assumption</strong></h2><h3><strong>Assumption #1: Fast &amp; Stateless</strong></h3><p><strong>Traditional apps:</strong> Handle request &#8594; return response &#8594; forget everything. Average response time: 50ms.</p><p><strong>Agents:</strong> Multi-step reasoning loops that run for minutes or hours. Call 20 different APIs. Make stochastic decisions at each step.</p><p><strong>The break:</strong> A single network blip or worker restart loses the entire workflow state. You can&#8217;t &#8220;just replay&#8221; because the agent&#8217;s call stack, local variables, and reasoning context are gone. 
No amount of retries will help you.</p><div><hr></div><h3><strong>Assumption #2: Concurrency is Request-Scoped</strong></h3><p><strong>Traditional apps:</strong> Each request is isolated. One user&#8217;s failed request doesn&#8217;t affect anyone else.</p><p><strong>Agents:</strong> Multiple agents running in parallel, all sharing the same API quotas, all racing to read and modify shared state.</p><p><strong>The break:</strong></p><p><strong>The rate limit cascade:</strong> Agent A is debugging a loop and burns through your OpenAI quota in 30 seconds. Agents B through Z all start failing with 429s. Your entire system is down because one agent misbehaved.</p><p><strong>The state race:</strong> Agent 1 reads user context at 10:00:00. Agent 2 modifies it at 10:00:01. Agent 1 writes based on stale data at 10:00:02. The user&#8217;s context is now corrupted, and you have no idea which agent&#8217;s view was &#8220;correct.&#8221;</p><p><strong>The resource starvation:</strong> One runaway agent spawns 50 parallel reasoning branches and starves everything else of memory and compute.</p><p>You need <strong>system-wide rate limiting</strong> (not per-agent), <strong>distributed locks</strong>, and <strong>transactional state management</strong>. None of this exists in standard application frameworks. You&#8217;re on your own.</p><div><hr></div><h3><strong>Assumption #3: Deterministic Execution</strong></h3><p><strong>Traditional software:</strong> Same input &#8594; same output. Bugs are reproducible. You read the stack trace, fix the code, deploy.</p><p><strong>Agents:</strong> Stochastic decision-making at every step. The same prompt can produce different tool calls, different reasoning paths, different outcomes.</p><p><strong>The break:</strong> Your agent hallucinates a tool call. Or it enters an infinite reasoning loop. 
Or it chooses Tool B when Tool A was obviously correct.</p><p>Your logs tell you <em>what</em> happened (&#8220;Agent called send_email with invalid parameters&#8221;). They don&#8217;t tell you <em>why</em> (was the context corrupted? did the model misinterpret the schema? did a previous tool call return bad data?).</p><p>Traditional monitoring - CPU graphs, error rates, p99 latency - is useless here. You need decision-level observability: <em>why</em> did the agent make that choice?</p><div><hr></div><h3><strong>Assumption #4: User-Scoped Auth</strong></h3><p><strong>Traditional apps:</strong> User clicks button &#8594; auth token attached to request &#8594; action performed &#8594; token discarded.</p><p><strong>Agents:</strong> Act on your behalf <em>while you&#8217;re offline</em>. Call 15 different APIs autonomously. Make decisions that require different permission levels depending on context.</p><p><strong>The break:</strong> Traditional OAuth wasn&#8217;t built for this. It assumes a human is present to click &#8220;Authorize&#8221; and handle browser redirects. Agents are headless - they act while the user is offline, asleep, or unreachable.</p><p>You can&#8217;t give the agent your master API key - that&#8217;s a security disaster. You can&#8217;t pre-scope every possible action - you don&#8217;t know what the agent will need until runtime.</p><p>You need <strong>dynamic, just-in-time credential delegation</strong>. Short-lived tokens scoped to exactly what the agent needs <em>right now</em>. And when the agent wants to do something sensitive - delete a database, charge a credit card - you need <strong>human-in-the-loop approval gates</strong> that pause execution, wait for a signature (sometimes for hours), and resume <em>without losing state</em>.</p><p>Your auth infrastructure doesn&#8217;t do this.
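</p><p>A minimal sketch of that just-in-time pattern, in Python: a credential broker mints a token that is both time-boxed and scoped to the single action the agent is about to take. The broker, scope names, and helper functions below are hypothetical, purely for illustration - no real provider&#8217;s API is implied.</p>

```python
import secrets
import time

def mint_scoped_token(scopes, ttl_seconds=60):
    # Hypothetical broker call: a short-lived bearer token bound to explicit scopes
    return {
        "token": secrets.token_urlsafe(32),
        "scopes": set(scopes),
        "expires_at": time.time() + ttl_seconds,
    }

def is_allowed(tok, required_scope):
    # A tool call proceeds only if the scope was granted and the token is still live
    return required_scope in tok["scopes"] and time.time() < tok["expires_at"]

tok = mint_scoped_token(["stripe:charges:read"], ttl_seconds=60)
print(is_allowed(tok, "stripe:charges:read"))   # True
print(is_allowed(tok, "stripe:charges:write"))  # False: scope never granted
```

<p>An approval gate is the same idea one level up: the sensitive call blocks until a human-issued token with the required scope arrives, however long that takes.</p><p>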
You&#8217;re going to build it yourself, and you&#8217;re going to get it wrong the first three times.</p><div><hr></div><h3><strong>Assumption #5: Services Talk Via Contracts</strong></h3><p><strong>Traditional microservices:</strong> Rigid REST or gRPC interfaces. Typed schemas. You call GET /api/v2/items?sort=price&amp;limit=10 and you get exactly what you asked for.</p><p><strong>Agents:</strong> Communicate in high-level intents. &#8220;Find me the best option.&#8221; &#8220;Summarize the user&#8217;s preferences.&#8221; &#8220;Coordinate with the scheduling agent to find a time.&#8221;</p><p><strong>The break:</strong> Agent-to-agent communication isn&#8217;t just JSON over HTTP. It requires:</p><ul><li><p><strong>Context propagation:</strong> Agent B needs to know what Agent A was thinking, not just what data it returned.</p></li><li><p><strong>Shared working memory:</strong> A persistent space where agents can read and write state <em>across different servers and lifetimes</em>.</p></li><li><p><strong>Reasoning handoff:</strong> Agent A partially solves a problem, hands off to Agent B with full context, Agent B picks up where A left off.</p></li></ul><p>None of this exists in your service mesh. You&#8217;re going to build a custom &#8220;agent communication layer&#8221; and spend six months debugging context drift.</p><div><hr></div><h2><strong>What Reliable Agents Actually Require</strong></h2><p>If you want agents to work in production - not in a demo, not in a Jupyter notebook, but under real load with real users - you need infrastructure that doesn&#8217;t exist yet.</p><h3><strong>1. Durable Execution</strong></h3><p>Not checkpoints. Not retries. <strong>Guaranteed resumption.</strong></p><p>If a worker node dies mid-workflow, the agent must resume <em>on another node</em> with its call stack, local variables, and reasoning state intact. Progress isn&#8217;t &#8220;saved&#8221; - it&#8217;s <em>guaranteed</em>. 
The agent picks up at the exact line of code where it stopped, as if nothing happened.</p><p>This is how Temporal works for workflows. Agents need the same semantics. However, unlike a standard Temporal workflow, an agent&#8217;s path isn&#8217;t hardcoded in a DSL - it&#8217;s generated live by an LLM. This makes the durability requirement even more extreme: you aren&#8217;t just persisting data; you&#8217;re persisting a dynamic, evolving reasoning chain.</p><div><hr></div><h3><strong>2. Global Concurrency Control</strong></h3><p><strong>System-wide rate limiting:</strong> All agents share a single budget for OpenAI calls. If Agent A is burning tokens, Agents B-Z slow down proportionally. No cascading failures.</p><p><strong>Distributed coordination:</strong> Agents competing for the same resource (user state, external API, database row) use distributed locks or optimistic concurrency control. State corruption is impossible.</p><p><strong>Resource quotas:</strong> Runaway agents are killed before they starve the system. No single agent can take down production.</p><div><hr></div><h3><strong>3. Decision-Level Observability</strong></h3><p>Traditional logs: &#8220;Agent called Tool B at 10:00:03.&#8221;</p><p>What you need: &#8220;Agent chose Tool B over Tool A because the user&#8217;s context indicated preference X, and Tool A&#8217;s output schema didn&#8217;t match the downstream agent&#8217;s expectations.&#8221;</p><p>You need to trace the <em>reasoning</em>, not just the execution. Every decision point, every branch, every piece of context that influenced the outcome.</p><p>And when an agent fails, you need to replay its <em>thought process</em>, not just its API calls.</p><div><hr></div><h3><strong>4. Delegated Identity &amp; Dynamic Scoping</strong></h3><p><strong>Just-in-time credentials:</strong> When an agent needs to call Stripe, the infrastructure issues a short-lived token scoped to exactly the Stripe API and the specific user context. 
The token expires in 60 seconds. The agent never sees your master key.</p><p><strong>Approval gates:</strong> When the agent wants to execute a sensitive action, execution pauses. A notification goes to the user. The user approves or rejects (sometimes after several hours). Execution resumes <em>with full state intact</em>. The agent doesn&#8217;t restart from scratch.</p><p><strong>Audit trails:</strong> Every action the agent takes is logged with full attribution. Who authorized it? What context led to the decision? What credentials were used?</p><div><hr></div><h3><strong>5. Agent Communication Primitives</strong></h3><p><strong>Shared memory:</strong> A durable, transactional store where agents can read and write state. Agents running on different servers, at different times, can access the same working memory.</p><p><strong>Context propagation:</strong> When Agent A hands off to Agent B, B receives not just data but <em>the history of reasoning that produced that data</em>. B doesn&#8217;t start from zero.</p><p><strong>Handoff semantics:</strong> Agent A pauses. Agent B takes over. Agent A resumes later. The infrastructure manages the transition without dropping state.</p><div><hr></div><h2><strong>The Infrastructure Maturity Gap</strong></h2><p>Here&#8217;s what most teams do today:</p><ul><li><p>Build agents with Python scripts, cron jobs, and manual restarts</p></li><li><p>Add retry logic in application code (and get it wrong)</p></li><li><p>Store state in Redis or Postgres with custom serialization (and lose data during crashes)</p></li><li><p>Rate-limit by hoping agents don&#8217;t call OpenAI too fast</p></li><li><p>Debug failures by reading logs and guessing what the agent was thinking</p></li></ul><p><strong>Every team rebuilds the same broken infrastructure.</strong> The solutions are fragile, incomplete, and impossible to test.</p><p>Your team spends more time on infrastructure duct tape than on the agent itself. 
The work that actually differentiates your product - better reasoning, smoother user experience - barely gets attention.</p><div><hr></div><h2><strong>The Shift We Need</strong></h2><p><strong>From</strong>: Application-layer duct tape</p><p><strong>To:</strong> Infrastructure that handles state, concurrency, auth, and observability <em>so you don&#8217;t have to</em></p><div><hr></div><p>Agents are powerful. But right now, they&#8217;re production nightmares.</p><p>The teams that win won&#8217;t be the ones with the best prompts or the biggest models. They&#8217;ll be the ones who solved the infrastructure problem.</p><p>It&#8217;s time to stop building agents like scripts and start building them like distributed systems.</p><p>Because that&#8217;s what they are.</p><p></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.polos.dev/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Neha's Substack! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Modern Voice Agent Architectures: A Deep Dive]]></title><description><![CDATA[Voice agents have become increasingly sophisticated, enabling natural human-computer interactions across various applications from virtual assistants to customer service agents.]]></description><link>https://blog.polos.dev/p/modern-voice-agent-architectures</link><guid isPermaLink="false">https://blog.polos.dev/p/modern-voice-agent-architectures</guid><dc:creator><![CDATA[Neha Deodhar]]></dc:creator><pubDate>Wed, 05 Nov 2025 19:55:13 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/2d7ac426-9fa4-4c7c-9230-9b13f51c0790_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://blog.polos.dev/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://blog.polos.dev/subscribe?"><span>Subscribe now</span></a></p><p>Voice agents have become increasingly sophisticated, enabling natural human-computer interactions across various applications from virtual assistants to customer service agents. 
When designing these systems, developers typically choose between two architectural approaches, each with distinct advantages and trade-offs.</p><ol><li><p><strong>The Modular Pipeline</strong>: Speech-to-Text (STT) &#8594; LLM &#8594; Text-to-Speech (TTS): A  decomposed system where specialized components handle discrete functions in sequence.</p></li><li><p><strong>The Unified Approach</strong>: Speech-to-Speech Models: An end-to-end approach where a single integrated system processes audio input and generates audio output with minimal intermediate transformations.</p></li></ol><p>Let&#8217;s examine each approach in detail.</p><h2><strong>1. Modular Pipeline (STT &#8594; LLM &#8594; TTS)</strong></h2><p>In this architecture, three components operate as a pipeline:</p><p><strong>Speech-to-Text (STT)</strong>: This is the first stage of the pipeline. It captures and converts the user&#8217;s audio input into text transcription, often with specialized acoustic and language models.</p><p><strong>Large Language Model (LLM)</strong>: This is the brain of the agent. It processes the transcribed text, performs reasoning, calls tools and external APIs, manages context or memory, and generates appropriate responses.</p><p><strong>Text-to-Speech (TTS)</strong>: This is the final stage that synthesizes the LLM&#8217;s text response into spoken audio output.</p><p>Most modern implementations stream these components to reduce latency, allowing the agent to begin formulating responses even before the user finishes speaking.</p><h3><strong>Critical Auxiliary Components: VAD and Turn Detection</strong></h3><p>Beyond these three core components, effective voice agents require two additional systems:</p><p><strong>Voice Activity Detection (VAD)</strong>: This component identifies when a user is speaking versus when there is silence or background noise. VAD is essential for determining when to start and stop processing audio, conserving computational resources and reducing latency. 
High-quality VAD systems can distinguish between human speech and other sounds, preventing false activations.</p><p><strong>Turn Detection</strong>: This component determines when a user has completed their thought or utterance, signaling to the agent that it&#8217;s time to respond. Effective turn detection is crucial for natural conversation flow and prevents the agent from interrupting users mid-sentence. Turn detection may use a combination of silence duration, prosodic features (intonation patterns), and semantic completeness to identify appropriate response moments.</p><p>Consider this scenario for <strong>turn detection</strong>:</p><p><em><strong>&#8220;I&#8217;m looking for tickets to the concert &#8230; [brief pause] on Friday.&#8221;</strong></em></p><p>A naive turn detector might interpret the pause after &#8220;concert&#8221; as the end of the user&#8217;s turn and trigger a response. However, a sophisticated turn detector would recognize that the semantic content may be incomplete or that the intonation suggests more information is coming, and would wait for the complete utterance before prompting the agent to respond.</p><p>Newer STT models are increasingly incorporating these capabilities directly. For example, AssemblyAI&#8217;s Universal and OpenAI&#8217;s realtime models now offer built-in VAD and preliminary turn detection features, simplifying the architecture while potentially improving responsiveness.</p><h3>Example Agent</h3><p>Below is an example agent in LiveKit using modular pipeline architecture. It uses:</p><ul><li><p>Assembly AI for Speech-to-Text</p></li><li><p>OpenAI for LLM</p></li><li><p>Cartesia for Text-to-Speech</p></li></ul><p>This example uses an edited version of LiveKit&#8217;s example from <a href="https://github.com/livekit-examples/agent-starter-python">https://github.com/livekit-examples/agent-starter-python</a> </p><pre><code><code>import logging

from dotenv import load_dotenv
from livekit.agents import (
    Agent,
    AgentSession,
    JobContext,
    JobProcess,
    MetricsCollectedEvent,
    RoomInputOptions,
    WorkerOptions,
    cli,
    inference,
    metrics,
)
from livekit.plugins import noise_cancellation, silero
from livekit.plugins.turn_detector.multilingual import MultilingualModel

logger = logging.getLogger("agent")

load_dotenv(".env")


class Assistant(Agent):
    def __init__(self) -&gt; None:
        super().__init__(
            instructions="""You are a helpful voice AI assistant. The user is interacting with you via voice, even if you perceive the conversation as text.
            You eagerly assist users with their questions by providing information from your extensive knowledge.
            Your responses are concise, to the point, and without any complex formatting or punctuation including emojis, asterisks, or other symbols.
            You are curious, friendly, and have a sense of humor.""",
        )


def prewarm(proc: JobProcess):
    proc.userdata["vad"] = silero.VAD.load()


async def entrypoint(ctx: JobContext):
    ctx.log_context_fields = {
        "room": ctx.room.name,
    }

    # Set up a voice AI pipeline using OpenAI, Cartesia, AssemblyAI, and the LiveKit turn detector
    session = AgentSession(
        stt=inference.STT(model="assemblyai/universal-streaming", language="en"),
        llm=inference.LLM(model="openai/gpt-4.1-mini"),
        tts=inference.TTS(
            model="cartesia/sonic-3", voice="9626c31c-bec5-4cca-baa8-f8ba9e84c8bc"
        ),
        turn_detection=MultilingualModel(),
        vad=ctx.proc.userdata["vad"],
        preemptive_generation=True,
    )

    # Metrics collection, to measure pipeline performance
    usage_collector = metrics.UsageCollector()

    @session.on("metrics_collected")
    def _on_metrics_collected(ev: MetricsCollectedEvent):
        metrics.log_metrics(ev.metrics)
        usage_collector.collect(ev.metrics)

    async def log_usage():
        summary = usage_collector.get_summary()
        logger.info(f"Usage: {summary}")

    ctx.add_shutdown_callback(log_usage)

    # Start the session, which initializes the voice pipeline and warms up the models
    await session.start(
        agent=Assistant(),
        room=ctx.room,
        room_input_options=RoomInputOptions(
            # For telephony applications, use `BVCTelephony` for best results
            noise_cancellation=noise_cancellation.BVC(),
        ),
    )

    # Join the room and connect to the user
    await ctx.connect()


if __name__ == "__main__":
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint, prewarm_fnc=prewarm))</code></code></pre><h3><strong>Advantages</strong></h3><ol><li><p><strong>Independent Scaling &amp; Optimization:</strong> Each component is a separate service and can be scaled or optimized independently. For example, you can scale the STT workers based on user load and the LLM workers based on computational intensity.</p></li><li><p><strong>Flexibility &amp; Vendor Lock-in Mitigation:</strong> This is the ultimate &#8220;plug-and-play&#8221; architecture. You can easily swap AWS Transcribe for a fine-tuned Whisper model or switch from GPT-4 to Claude for the LLM without major system rewrites.</p></li><li><p><strong>Tool and RAG Integration:</strong> LLMs can easily integrate with external tools, databases, and APIs between the STT and TTS stages.</p></li><li><p><strong>Explainability and Audit:</strong> The architecture inherently produces clean, traceable text transcripts at the STT output, which is critical for logging, compliance, fine-tuning, and downstream analytics.</p></li></ol><h3><strong>Limitations</strong></h3><ol><li><p><strong>Latency</strong>: Each transition between components in the pipeline introduces some latency, potentially affecting conversation flow and making the interaction feel less spontaneous.</p></li><li><p><strong>Error Propagation</strong>: Errors in earlier components (e.g., STT misrecognition) cascade through the pipeline.</p></li><li><p><strong>Context Loss</strong>: Prosodic information like tone or emphasis may be lost when converting speech to text, obscuring the user&#8217;s intent and limiting the ability to respond with genuine emotional context.</p></li></ol><h2><strong>2. Unified Approach: Speech-to-Speech Models</strong></h2><p>This newer architecture uses end-to-end models that process audio directly and generate audio responses with minimal intermediate steps.
While text may still be used internally as a latent representation, the primary pathway is audio-to-audio, eliminating explicit modality conversions.</p><p>These models are trained to maintain conversational context and generate responses in real-time, often producing more natural-sounding interactions including non-verbal backchannels (like &#8220;mm-hmm&#8221; or &#8220;uh-huh&#8221;).</p><p>Below is an example agent in LiveKit using OpenAI&#8217;s realtime model.</p><pre><code>import logging

from dotenv import load_dotenv
from livekit.agents import (
    Agent,
    AgentSession,
    JobContext,
    JobProcess,
    MetricsCollectedEvent,
    RoomInputOptions,
    WorkerOptions,
    cli,
    inference,
    metrics,
)
from livekit.plugins import openai, noise_cancellation, silero
from livekit.plugins.turn_detector.multilingual import MultilingualModel


logger = logging.getLogger("agent")

load_dotenv(".env")


class Assistant(Agent):
    def __init__(self) -&gt; None:
        super().__init__(
            instructions="""You are a helpful voice AI assistant. The user is interacting with you via voice, even if you perceive the conversation as text.
            You eagerly assist users with their questions by providing information from your extensive knowledge.
            Your responses are concise, to the point, and without any complex formatting or punctuation including emojis, asterisks, or other symbols.
            You are curious, friendly, and have a sense of humor.""",
        )


def prewarm(proc: JobProcess):
    proc.userdata["vad"] = silero.VAD.load()


async def entrypoint(ctx: JobContext):
    ctx.log_context_fields = {
        "room": ctx.room.name,
    }

    session = AgentSession(
        llm=openai.realtime.RealtimeModel(voice="marin"),
        turn_detection=MultilingualModel(),
        vad=ctx.proc.userdata["vad"],
    )

    # Metrics collection, to measure pipeline performance
    usage_collector = metrics.UsageCollector()

    @session.on("metrics_collected")
    def _on_metrics_collected(ev: MetricsCollectedEvent):
        metrics.log_metrics(ev.metrics)
        usage_collector.collect(ev.metrics)

    async def log_usage():
        summary = usage_collector.get_summary()
        logger.info(f"Usage: {summary}")

    ctx.add_shutdown_callback(log_usage)

    # Start the session, which initializes the voice pipeline and warms up the models
    await session.start(
        agent=Assistant(),
        room=ctx.room,
        room_input_options=RoomInputOptions(
            # For telephony applications, use `BVCTelephony` for best results
            noise_cancellation=noise_cancellation.BVC(),
        ),
    )

    # Join the room and connect to the user
    await ctx.connect()


if __name__ == "__main__":
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint, prewarm_fnc=prewarm))</code></pre><h3><strong>Advantages</strong></h3><ol><li><p><strong>Lower Latency</strong>: Direct audio processing enables faster turn-taking and more natural conversation flow.</p></li><li><p><strong>Preserved Prosody</strong>: Maintains acoustic features like tone, emphasis, and rhythm throughout processing. This results in the model:</p><ul><li><p>Responding with appropriate tone, e.g., a sympathetic voice to a frustrated user.</p></li><li><p>Generating natural backchanneling (&#8220;mm-hmm,&#8221; &#8220;uh-huh&#8221;) precisely timed during the user&#8217;s speech.</p></li></ul></li><li><p><strong>Simplified Deployment:</strong> Less operational complexity than coordinating three distinct, streaming components.</p></li></ol><h3><strong>Limitations</strong></h3><ol><li><p><strong>Black-Box Constraint:</strong> This is a major hurdle for enterprise deployments. The lack of an explicit, auditable text transcript makes debugging, explainability, and compliance more challenging. If the agent fails, diagnosing whether it was a &#8220;speech understanding&#8221; or &#8220;reasoning&#8221; failure is difficult.</p></li><li><p><strong>Reasoning Capabilities:</strong> The reasoning capabilities of these models lag behind those of similarly sized text-based LLMs.</p></li><li><p><strong>Reduced Flexibility:</strong> Hard to swap or fine-tune one aspect (e.g., improving speech recognition).</p></li></ol><p>For comparison, here are two voice recordings of agent interactions built using the two architectures.
The first recording uses an agent built with the modular pipeline approach (STT &#8594; LLM &#8594; TTS), while the second uses OpenAI&#8217;s realtime speech model showcasing the unified approach.</p><p><strong>Modular STT -&gt; LLM -&gt; TTS agent</strong></p><div class="native-audio-embed" data-component-name="AudioPlaceholder" data-attrs="{&quot;label&quot;:null,&quot;mediaUploadId&quot;:&quot;c4803895-b305-4525-9dee-f393412cb45b&quot;,&quot;duration&quot;:30.249796,&quot;downloadable&quot;:false,&quot;isEditorNode&quot;:true}"></div><p><strong>Speech-to-speech agent using OpenAI</strong></p><div class="native-audio-embed" data-component-name="AudioPlaceholder" data-attrs="{&quot;label&quot;:null,&quot;mediaUploadId&quot;:&quot;0e567bf6-e777-4e49-96c5-c8ef2a77af1b&quot;,&quot;duration&quot;:22.204082,&quot;downloadable&quot;:false,&quot;isEditorNode&quot;:true}"></div><p></p><h2><strong>Conclusion</strong></h2><p>The choice between these architectures depends on your specific requirements, resources, and the nature of the voice interactions you&#8217;re designing. The modular pipeline approach is more widely used currently and offers flexibility and explainability at the cost of some latency and potential error propagation, while unified speech models provide more natural interactions but with less visibility and flexibility.</p><p>As the field evolves, we&#8217;ll see the realtime speech models continue to get better at handling complex interactions, supporting multiple languages, and integrating with external systems. 
Their capabilities will expand while maintaining the conversational fluidity that makes them compelling, potentially narrowing the flexibility gap with modular systems.</p><p>For technical teams building voice agents today, understanding these architectural trade-offs is essential for delivering experiences that meet user expectations for both functionality and naturalness.</p><p></p><div><hr></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.polos.dev/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Neha's Substack! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item></channel></rss>