Everyone wants to build agents. Fewer people talk about keeping them running.

The demo works. The prototype is impressive. Then you try to run it in production for real users, and things start breaking in ways the tutorial never mentioned. Not because the agent is wrong - because the infrastructure around it isn't ready.

After building agentic systems across multiple products, here's what I've learned the hard way.


The Agent Isn't the Hard Part

The LLM call is the easy part. What breaks production agents is everything surrounding it.

Google's research on agentic infrastructure puts it plainly: entangled workflows complicate debugging, and unpredictability from shifting models or data sources compounds the problem. The transition from informal testing to rigorous engineering discipline is where most teams struggle.

Context Loss Is the Silent Killer

This one catches everyone eventually. Agents working on long-horizon tasks lose context between sessions. The next run starts cold, unaware of what the previous run completed, failed, or partially finished.

Anthropic's engineering team ran into this building their own internal agents. Their solution: explicit progress files, feature lists, and strict single-feature-per-session boundaries. The goal is to make previous work discoverable - not just for the agent, but for the next agent instance that picks up where the last one left off.

It sounds simple. It requires real engineering discipline to get right.
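
As a rough illustration of the progress-file idea, here is a minimal sketch. The JSON layout and the field names (`features`, `status`, `notes`) are my own assumptions for the example, not Anthropic's actual format. Each session loads the file, picks exactly one unfinished feature, and records the outcome before it exits.

```python
import json
from pathlib import Path
from typing import Optional

# Assumed (hypothetical) layout:
# {"features": [{"name": "...", "status": "pending|done|failed", "notes": "..."}]}

def load_progress(path: Path) -> dict:
    """Return saved progress, or an empty record on a cold start."""
    if path.exists():
        return json.loads(path.read_text())
    return {"features": []}

def save_progress(path: Path, progress: dict) -> None:
    """Write to a temp file first, then swap it in, so a crash mid-write
    does not leave a half-written progress file behind."""
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(json.dumps(progress, indent=2))
    tmp.replace(path)

def next_feature(progress: dict) -> Optional[dict]:
    """Pick the first feature that is not done yet (one feature per session)."""
    for feature in progress["features"]:
        if feature["status"] != "done":
            return feature
    return None

def record_outcome(progress: dict, name: str, status: str, notes: str) -> None:
    """Update the named feature so the next session can see what happened."""
    for feature in progress["features"]:
        if feature["name"] == name:
            feature["status"] = status
            feature["notes"] = notes
            return
    raise KeyError(name)
```

The `notes` field matters as much as the status: a later session needs to know why something failed or what was left half-done, not only that it is unfinished.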

Failures Need to Be First-Class Citizens

Most agent code handles the happy path. Production demands you handle every failure mode explicitly.

AWS's architecture guidance on resilient agents recommends bounded retry limits with explicit stop conditions - maximum tool calls, timeouts, confidence thresholds. Without these, autonomy becomes unpredictability.
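
To make the stop conditions concrete, here is a small sketch of a bounded agent loop. It is not taken from the AWS guidance. The `StopLimits` defaults are arbitrary, and `step` is a hypothetical callable standing in for one agent iteration (model call plus tool call) that returns an answer and a confidence score.

```python
import time
from dataclasses import dataclass
from typing import Callable, Tuple

@dataclass
class StopLimits:
    max_tool_calls: int = 10        # hard cap on iterations
    timeout_seconds: float = 30.0   # wall-clock budget for the whole task
    min_confidence: float = 0.8     # accept an answer at or above this score

def run_bounded(
    step: Callable[[int], Tuple[str, float]],  # hypothetical: returns (answer, confidence)
    limits: StopLimits,
) -> Tuple[str, str]:
    """Run `step` until one stop condition fires; return (reason, last_answer)."""
    deadline = time.monotonic() + limits.timeout_seconds
    answer = ""
    for call_index in range(limits.max_tool_calls):
        # The timeout is checked between steps; it does not interrupt a running step.
        if time.monotonic() >= deadline:
            return "timeout", answer
        answer, confidence = step(call_index)
        if confidence >= limits.min_confidence:
            return "confident", answer
    return "max_tool_calls", answer
```

Returning the stop reason alongside the answer lets the caller tell "finished" apart from "gave up", which is usually what decides whether to escalate to a human.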

Anthropic's multi-agent engineering team learned this directly: full production tracing lets you diagnose why an agent failed, not just that it failed. They also built systems that resume from exactly where errors occurred - so a failure mid-task doesn't mean starting over.
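
One way to picture tracing plus resume-from-failure is the sketch below. It is an illustration under my own assumptions, not Anthropic's implementation: steps are plain callables that return strings, the checkpoint is a small JSON file, and the trace is a JSON-lines log. It also assumes that rerunning the failed step is safe.

```python
import json
import time
from pathlib import Path
from typing import Callable, List

def run_with_checkpoint(
    steps: List[Callable[[], str]], checkpoint: Path, trace: Path
) -> bool:
    """Run steps in order, skipping any that a previous run already completed.

    Returns True when every step has finished, False if a step failed.
    """
    # Index of the first step that has not completed yet (0 on a fresh run).
    start = 0
    if checkpoint.exists():
        start = json.loads(checkpoint.read_text())["next_step"]

    with trace.open("a") as log:
        for index in range(start, len(steps)):
            event = {"step": index, "started_at": time.time()}
            try:
                event["output"] = steps[index]()
                event["status"] = "ok"
            except Exception as exc:
                # Record why the step failed, not just that it failed.
                event["status"] = "error"
                event["error"] = repr(exc)
                log.write(json.dumps(event) + "\n")
                return False
            log.write(json.dumps(event) + "\n")
            # Persist progress only after the step has succeeded.
            checkpoint.write_text(json.dumps({"next_step": index + 1}))
    return True
```

Calling the function again with the same checkpoint path picks up at the step that failed, and the trace file keeps the failed attempt next to the successful retry.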

Reduce Abstraction, Build With Basics

Frameworks help you prototype fast. They can hide failure modes at scale.

Anthropic's own experience is telling: when they moved agents to production, they found that frameworks which helped during prototyping became liabilities at scale. The framework's retry logic may not match your system's needs. Its state management may not survive your infrastructure failures. Build with basic components where it counts.

This doesn't mean avoid frameworks entirely. It means understand what they're doing, and be ready to replace the parts that don't hold up.
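
Retry logic is a good example of a part that is cheap to own. The sketch below uses only the standard library. The function name, defaults, and backoff policy are illustrative choices of mine, not a recommendation from any of the sources.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")

def call_with_retry(
    fn: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,  # injectable so tests need not wait
) -> T:
    """Call `fn`, retrying only the listed exception types with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the original error
            # Exponential backoff plus random jitter to avoid synchronized retries.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay))
    raise ValueError("max_attempts must be at least 1")
```

Because `retry_on` is explicit, anything you did not list (a validation error, say) fails immediately instead of being retried behind your back.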

What This Looks Like in Practice

The infrastructure that actually works in production isn't glamorous.

The compound nature of errors in agentic systems means a small issue that's annoying in traditional software can derail an agent entirely. One bad tool call sends the agent down a completely different trajectory.
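
A back-of-the-envelope calculation shows how quickly this compounds. It rests on a simplifying assumption of mine: every step succeeds independently with the same probability. Real agents are messier, so treat this as intuition rather than measurement.

```python
def task_success_probability(step_success: float, steps: int) -> float:
    """Chance that every one of `steps` independent steps succeeds."""
    if not 0.0 <= step_success <= 1.0:
        raise ValueError("step_success must be between 0 and 1")
    return step_success ** steps

if __name__ == "__main__":
    # Per-step reliability of 95% over tasks of increasing length.
    for steps in (1, 5, 20, 50):
        print(steps, round(task_success_probability(0.95, steps), 3))
```

Under these assumptions, a step that works 95% of the time gives a 20-step task only about a 36% chance of finishing without a single failure.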

Building for that reality is the actual job.

The Takeaway

Agent capabilities are moving fast. The infrastructure discipline required to run them reliably is catching up more slowly. The teams winning in production aren't the ones with the most sophisticated agents - they're the ones who built boring, reliable scaffolding around them.

What's the hardest production infrastructure problem you've hit with agents? I'd like to know.

References

  1. Effective harnesses for long-running agents - Anthropic Engineering
  2. How we built our multi-agent research system - Anthropic Engineering
  3. Build resilient generative AI agents - AWS Architecture Blog
  4. Agentic AI Infrastructure in Practice - Google Research