Everyone wants to build agents. Fewer people talk about keeping them running.
The demo works. The prototype is impressive. Then you try to run it in production for real users, and things start breaking in ways the tutorial never mentioned. Not because the agent is wrong - because the infrastructure around it isn't ready.
After building agentic systems across multiple products, here's what I've learned the hard way.
The Agent Isn't the Hard Part
The LLM call is the easy part. What breaks production agents is everything surrounding it:
- Retry logic: what happens when a tool call fails halfway through a multi-step task?
- State management: if the agent crashes on step 7 of 12, where does it resume?
- Context loss: agents working across sessions have no memory of previous runs unless you build scaffolding that makes prior work discoverable
- Cost runaway: without explicit budgets and stop conditions, a stuck agent keeps looping and your bill keeps climbing
Google's research on agentic infrastructure puts it plainly: entangled workflows complicate debugging, and unpredictability from shifting models or data sources compounds the problem. The transition from informal testing to rigorous engineering discipline is where most teams struggle.
Context Loss Is the Silent Killer
This one catches everyone eventually. Agents working on long-horizon tasks lose context between sessions. The next run starts cold, unaware of what the previous run completed, failed, or partially finished.
Anthropic's engineering team ran into this building their own internal agents. Their solution: explicit progress files, feature lists, and strict single-feature-per-session boundaries. The goal is to make previous work discoverable - not just for the current agent, but for the next agent instance that picks up where the last one left off.
It sounds simple. It requires real engineering discipline to get right.
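Here is a minimal sketch of what a progress file can look like, assuming a single JSON file per project. The file layout, the status strings, and the function names (load_progress, record_feature, next_feature) are my own illustration, not Anthropic's implementation.

```python
import json
from pathlib import Path
from typing import List, Optional


def load_progress(path: Path) -> dict:
    """Return saved progress, or a fresh record if no earlier session left one."""
    if path.exists():
        return json.loads(path.read_text())
    return {"features": {}, "notes": []}


def record_feature(path: Path, feature: str, status: str, note: str = "") -> dict:
    """Store one feature's status so the next session can discover it."""
    progress = load_progress(path)
    progress["features"][feature] = status  # hypothetical statuses: "done", "failed"
    if note:
        progress["notes"].append(f"{feature}: {note}")
    # Write to a temp file, then rename, so a crash never leaves a half-written file.
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(progress, indent=2))
    tmp.replace(path)
    return progress


def next_feature(path: Path, feature_list: List[str]) -> Optional[str]:
    """Pick the first feature not yet marked done: one feature per session."""
    statuses = load_progress(path)["features"]
    for feature in feature_list:
        if statuses.get(feature) != "done":
            return feature
    return None
```

A new session would call next_feature first, work on that one feature, and call record_feature before it exits, whether it succeeded or not.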
Failures Need to Be First-Class Citizens
Most agent code handles the happy path. Production demands you handle every failure mode explicitly:
- Should the agent retry? With exponential backoff and jitter?
- Should it fall back to cached data?
- Should it escalate to a human?
- What's the max number of retries before it stops?
AWS's architecture guidance on resilient agents recommends bounded retry limits with explicit stop conditions - maximum tool calls, timeouts, confidence thresholds. Without these, autonomy becomes unpredictability.
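The questions above map onto a small wrapper. This is a sketch under my own assumptions: call_with_retries, EscalateToHuman, and the default limits are hypothetical names and values, and the delay uses the "full jitter" variant of exponential backoff (a random wait between zero and the capped exponential delay).

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class EscalateToHuman(Exception):
    """Raised when retries are exhausted and no fallback is available."""


def call_with_retries(
    tool: Callable[[], T],
    max_retries: int = 3,          # hard stop: at most max_retries + 1 attempts
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    fallback: Optional[Callable[[], T]] = None,   # e.g. a cached-data lookup
    sleep: Callable[[float], None] = time.sleep,  # injectable for tests
) -> T:
    last_error: Optional[Exception] = None
    for attempt in range(max_retries + 1):
        try:
            return tool()
        except Exception as exc:  # real code should catch only retryable errors
            last_error = exc
            if attempt == max_retries:
                break
            # Exponential backoff with full jitter, capped at max_delay.
            delay = random.uniform(0, min(max_delay, base_delay * 2 ** attempt))
            sleep(delay)
    if fallback is not None:
        return fallback()
    raise EscalateToHuman(f"gave up after {max_retries + 1} attempts") from last_error
```

The order matters here: retry first, fall back second, escalate last. Your system may want a different order, which is exactly why the policy should be explicit in code you own.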
Anthropic's multi-agent engineering team learned this directly: full production tracing lets you diagnose why an agent failed, not just that it failed. They also built systems that resume from exactly where errors occurred - so a failure mid-task doesn't mean starting over.
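Resuming where the error occurred can be as plain as saving the index of the next step alongside the state. The sketch below assumes steps are functions that take and return a JSON-serializable dict; run_with_checkpoints and the checkpoint format are hypothetical, not a description of Anthropic's system.

```python
import json
from pathlib import Path
from typing import Callable, List

Step = Callable[[dict], dict]


def run_with_checkpoints(steps: List[Step], checkpoint: Path) -> dict:
    """Run steps in order, saving state after each one.

    If a step raises, the checkpoint still points at that step, so calling
    this function again resumes there instead of starting over.
    """
    if checkpoint.exists():
        saved = json.loads(checkpoint.read_text())
    else:
        saved = {"next_step": 0, "state": {}}

    state = saved["state"]
    for index in range(saved["next_step"], len(steps)):
        state = steps[index](state)  # may raise; nothing is saved for this step
        checkpoint.write_text(json.dumps({"next_step": index + 1, "state": state}))
    return state
```

One caveat with this design: the failed step runs again on resume, so each step needs to be safe to repeat.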
Reduce Abstraction, Build With Basics
Frameworks help you prototype fast. They can hide failure modes at scale.
Anthropic's own experience is telling: when they moved agents to production, they found that frameworks which helped during prototyping became liabilities at scale. The framework's retry logic may not match your system's needs. Its state management may not survive your infrastructure failures. Build with basic components where it counts.
This doesn't mean avoid frameworks entirely. It means understand what they're doing, and be ready to replace the parts that don't hold up.
What This Looks Like in Practice
The infrastructure that actually works in production isn't glamorous:
- Progress checkpoints written to disk or database after each meaningful step
- Explicit budgets: max tool calls, max tokens, max wall-clock time
- Structured logs at every decision point, not just errors
- A human escalation path that's actually reachable
- Tests that run the full agent end-to-end, not just unit tests of individual tools
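Two items from that list, explicit budgets and structured logs, fit in a few dozen lines. This is a rough sketch: Budget, BudgetExceeded, log_decision, and the default limits are names and numbers I made up for illustration, so tune them to your own workload.

```python
import json
import logging
import time
from dataclasses import dataclass, field

logger = logging.getLogger("agent")


class BudgetExceeded(Exception):
    """Raised when the agent crosses any of its explicit limits."""


@dataclass
class Budget:
    max_tool_calls: int = 50
    max_tokens: int = 200_000
    max_seconds: float = 600.0
    tool_calls: int = 0
    tokens: int = 0
    started_at: float = field(default_factory=time.monotonic)

    def charge(self, tokens: int = 0, tool_calls: int = 0) -> None:
        """Record usage and stop the run as soon as any limit is exceeded."""
        self.tokens += tokens
        self.tool_calls += tool_calls
        if self.tool_calls > self.max_tool_calls:
            raise BudgetExceeded("max tool calls exceeded")
        if self.tokens > self.max_tokens:
            raise BudgetExceeded("max tokens exceeded")
        if time.monotonic() - self.started_at > self.max_seconds:
            raise BudgetExceeded("max wall-clock time exceeded")


def log_decision(event: str, **fields) -> str:
    """Emit one JSON line per decision point, not only on errors."""
    line = json.dumps({"event": event, "ts": time.time(), **fields}, sort_keys=True)
    logger.info(line)
    return line
```

In the agent loop, you would call budget.charge(...) after every model or tool call and log_decision("tool_selected", tool=..., step=...) wherever the agent picks a path, then treat BudgetExceeded as a trigger for the human escalation path.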
The compound nature of errors in agentic systems means a small issue that's annoying in traditional software can derail an agent entirely. One bad tool call sends the agent down a completely different trajectory.
Building for that reality is the actual job.
The Takeaway
Agent capabilities are moving fast. The infrastructure discipline required to run them reliably is catching up more slowly. The teams winning in production aren't the ones with the most sophisticated agents - they're the ones who built boring, reliable scaffolding around them.
What's the hardest production infrastructure problem you've hit with agents? I'd like to know.