Welcome to AIEdTalks’ Newsletter!

In today's edition:

  • Your Multi-Agent system needs a Circuit Breaker

Let’s dive in.

Today’s Edition

AI TOPIC
Your Multi-Agent system needs a Circuit Breaker

On April 29, 2026, an engineer published a post-mortem that's become a rite of passage. Their nightly pipeline — an agent that summarized and categorized documents — hit a transient error around 11 PM and started retrying. It never stopped. By 7 AM it had made thousands of identical, failing tool calls, each one billed. The bill was $437. The fix took twenty minutes. The loop had run for eight hours, and nothing — no alert, no threshold, no guard — had fired.

The instinct after a story like this is "we'll add a kill switch." That instinct is right but incomplete. The kill switch is a button a human presses after noticing. What you actually need is the thing that notices and acts on its own, with no human in the loop at 3 AM. That thing is a circuit breaker, and it's not a new idea — it's one of the oldest patterns in distributed systems, borrowed from the electrical panel in your house. The interesting part is why agents need it far more than the microservices it was invented for.

What a circuit breaker actually is

The pattern comes from Michael Nygard's Release It! (2007) and was hardened at scale by Netflix's Hystrix in the early 2010s. It wraps a call you don't fully trust, and it has three states:

  • Closed — normal operation. Calls pass through. The breaker counts failures.

  • Open — too many failures, too fast. The breaker trips and stops calling the failing thing entirely. Requests fail instantly instead of piling onto a service that's already drowning. This is the whole point: fail fast, stop the bleeding, don't make it worse.

  • Half-open — after a cooldown, the breaker lets a single test request through. Succeeds? Close the breaker, resume normally. Fails? Back to open, wait again.

That's it. The genius is the open state — refusing to call something that's broken, so a downstream failure doesn't become an upstream catastrophe. In microservices, "broken" means a dead database. In agentic systems, "broken" means a loop, a retry storm, a runaway cost, or an agent that's confidently wrong. Same machinery, new failure modes.

You'll hear "circuit breaker agent" used two ways, and both are valid. The narrow sense is the pattern wrapping your tool and model calls. The broader sense is a dedicated supervisor — an agent or governor whose only job is to watch the fleet's health and trip the system when a worker misbehaves. Start with the pattern; graduate to the supervisor when you have more than a couple of agents in play.

Why multi-agent systems need this more than microservices do

A REST endpoint behaves predictably: same input, same output. An LLM agent doesn't — same prompt, different reasoning path, different cost. That non-determinism is exactly why you can't trust an agent to stop itself. Four reasons the need is sharper here:

Each agent is blind to global health. Agent A has no idea Agent B just failed calling the same endpoint a second ago. There's no shared nervous system unless you build one. Every agent re-discovers the outage independently, in parallel.

Retry logic multiplies instead of adds. This is the one that bites teams on LangGraph and similar frameworks. A per-node retry policy looks sensible: say, a cap of 10 attempts on a flaky tool (stop_after_attempt(10) in tenacity terms, or max_attempts=10 in a LangGraph RetryPolicy). But put that tool behind ten parallel agents and a dead service now absorbs up to 100 requests (10 agents × 10 attempts) before anyone gives up. Per-node backoff spaces out each node's own retries, but it doesn't propagate a "this service is down" signal across the graph. Each node is polite in isolation and catastrophic in aggregate.
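
To make the multiplication concrete, here is a small simulation. It is a sketch, not tied to any framework: the names (SharedBreaker, simulate) are hypothetical, the agents run one after another so the count is deterministic, and the half-open recovery step is left out for brevity.

```python
class SharedBreaker:
    """Tiny fleet-wide breaker: opens after fail_max failures from ANY agent."""

    def __init__(self, fail_max=5):
        self.fail_max = fail_max
        self.failures = 0

    @property
    def is_open(self):
        return self.failures >= self.fail_max

    def record_failure(self):
        self.failures += 1


def simulate(num_agents=10, attempts_per_agent=10, breaker=None):
    """Count how many requests reach a service that is completely down."""
    requests_sent = 0
    for _agent in range(num_agents):
        for _attempt in range(attempts_per_agent):
            if breaker is not None and breaker.is_open:
                break  # fail fast: this agent stops touching the service
            requests_sent += 1  # the call goes out and fails
            if breaker is not None:
                breaker.record_failure()
    return requests_sent


print(simulate())                                   # per-node retries only: 100
print(simulate(breaker=SharedBreaker(fail_max=5)))  # one breaker shared by all: 5
```

The only difference between the two runs is where the failure count lives. Once it is shared, the first agent's five failures are enough to spare the other nine agents from rediscovering the outage.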

The LLM treats errors as a reason to retry, not stop. Hand a model a tool result that says "DB connection failed" and it does the helpful thing: it tries again. It can't tell a transient blip from a total outage — retrying is what a good assistant does. Returning the raw error string to the model is one of the most common ways to cause a retry storm.

Failures cascade through reasoning. One agent's wrong output becomes the next agent's confident input. There's no stack trace for "the reasoning was bad." A breaker that trips on the first sign of pathology is often the only thing standing between one bad step and a fully corrupted run.

What should trip the breaker — the use cases

A breaker is only as good as its trip conditions. These are the signatures worth watching, roughly in order of how often they're the culprit:

  • Identical repeated calls. Same tool, same arguments, N times in a row. This is the loop. Trip at 3.

  • Consecutive failures or high error rate. A common, battle-tested threshold is 5 failures in 2 minutes, or >50% of calls failing in the last 60 seconds. Open the breaker, redirect away from the failing worker.

  • Consecutive rate-limit (429) responses. If the agent keeps getting throttled and keeps hammering, it isn't backing off correctly. Trip for ~60 seconds, and double the cooldown each time the breaker has to reopen (exponential reopen).

  • Per-run cost or token ceiling crossed. Budget the task. Cross the ceiling, trip, log why. (This is the direct fix for the $437 night.)

  • Permission-boundary violation: fail closed. The agent reaches for a data source or API outside its granted scope. This isn't a loop, but the breaker model applies perfectly: the instant a boundary is crossed, halt and log with full context. When in doubt, fail closed, not open. ("Fail closed" here is the security sense, deny by default. In breaker terms, that means the circuit opens and calls stop.)
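
Two of the trip conditions above, the identical-call loop and the per-run cost ceiling, fit in a few lines each. The following is a minimal sketch: the class names (BreakerTripped, RepetitionGuard, BudgetGuard) and the idea of passing in a per-call cost in dollars are assumptions made for illustration, and the guard is meant to be checked before each tool call, so the third identical attempt is the one that gets blocked.

```python
import json


class BreakerTripped(Exception):
    """Raised when a guard decides the run must stop."""


class RepetitionGuard:
    """Trips when the same tool is called with the same arguments N times in a row."""

    def __init__(self, max_repeats=3):
        self.max_repeats = max_repeats
        self._last_key = None
        self._streak = 0

    def check(self, tool_name, arguments):
        # Canonical JSON so {"a": 1, "b": 2} and {"b": 2, "a": 1} count as identical.
        key = (tool_name, json.dumps(arguments, sort_keys=True, default=str))
        self._streak = self._streak + 1 if key == self._last_key else 1
        self._last_key = key
        if self._streak >= self.max_repeats:
            raise BreakerTripped(
                f"{tool_name} called {self._streak}x in a row with identical arguments"
            )


class BudgetGuard:
    """Trips when cumulative spend for one run crosses a ceiling."""

    def __init__(self, ceiling_usd):
        self.ceiling_usd = ceiling_usd
        self.spent_usd = 0.0

    def charge(self, cost_usd):
        self.spent_usd += cost_usd
        if self.spent_usd > self.ceiling_usd:
            raise BreakerTripped(
                f"run cost ${self.spent_usd:.2f} exceeded ceiling ${self.ceiling_usd:.2f}"
            )
```

In use, the orchestrator would call guard.check(tool_name, arguments) before dispatching a tool call and budget.charge(cost) after each billed call, and treat BreakerTripped as a signal to halt the run and log the message.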

How to implement it well

A minimal three-state breaker is about forty lines. Here's the spine:

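import time


class CircuitOpenError(RuntimeError):
    """Raised instead of calling fn while the breaker is open."""


class CircuitBreaker:
    # Single-threaded sketch: add a lock if several threads share one breaker.
    def __init__(self, fail_max=5, reset_timeout=60):
        self.fail_max = fail_max            # consecutive failures before tripping
        self.reset_timeout = reset_timeout  # seconds before the half-open test
        self.failures = 0
        self.state = "closed"
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            # monotonic() is immune to wall-clock adjustments, unlike time.time()
            if time.monotonic() - self.opened_at >= self.reset_timeout:
                self.state = "half-open"    # cooldown over: let a test call through
            else:
                raise CircuitOpenError("Circuit open: failing fast.")

        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._record_failure()
            raise

        self._record_success()
        return result

    def _record_failure(self):
        self.failures += 1
        # A failed half-open test reopens immediately, whatever the count.
        if self.failures >= self.fail_max or self.state == "half-open":
            self.state, self.opened_at = "open", time.monotonic()

    def _record_success(self):
        # Any success closes the breaker and clears the failure streak.
        self.failures = 0
        self.state = "closed"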

Wrap your tool and model calls in breaker.call(...). Pair it with the repetition check from the runaway-loops playbook (trip on 3 identical calls) and you've covered the two most common failure shapes.

Sensible starting defaults (tune to your traffic):

| Guard                     | Starting value    | Why                                 |
|---------------------------|-------------------|-------------------------------------|
| Failures before open      | 5 in 2 min        | Below this is transient noise       |
| Reset timeout (half-open) | 30–60 s           | Long enough for a blip to clear     |
| Identical-call trip       | 3                 | The loop signature                  |
| Consecutive 429s          | 3                 | Backoff is broken, stop hammering   |
| Per-run cost ceiling      | 10–20× median run | Kills runaways, spares healthy runs |
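
The "5 in 2 min" default is a time-windowed count, not a simple counter of consecutive failures. One way to track it is a sliding window of failure timestamps. This is a sketch under assumptions: FailureWindow and its parameters are hypothetical names, and the clock is injectable only so the behavior can be tested without waiting.

```python
import time
from collections import deque


class FailureWindow:
    """Counts failures inside a sliding time window (default: 5 failures in 120 s)."""

    def __init__(self, max_failures=5, window_seconds=120.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.window_seconds = window_seconds
        self._clock = clock  # injectable so tests can fake the passage of time
        self._failure_times = deque()

    def record_failure(self):
        self._failure_times.append(self._clock())

    def should_trip(self):
        cutoff = self._clock() - self.window_seconds
        # Drop failures that have aged out of the window.
        while self._failure_times and self._failure_times[0] <= cutoff:
            self._failure_times.popleft()
        return len(self._failure_times) >= self.max_failures
```

A breaker would call record_failure() wherever it currently increments its counter and open when should_trip() returns True. Old failures expire on their own, so an occasional blip spread over an hour never accumulates into a trip.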

Three implementation rules that matter more than the code:

  1. Put it outside the agent, in shared state. The agent is the thing malfunctioning — it can't police itself. Track breaker state at the graph/orchestrator level and have the router enforce it, so a tripped breaker actually stops the system instead of being a suggestion the LLM ignores. A gateway or proxy in front of the API is the most robust place of all.

  2. Don't feed raw errors back to the model. Translate them. The model doesn't need "ConnectionError"; it needs to be routed away from the dead path entirely.

  3. Always build the half-open recovery. A breaker that trips and never tests recovery is just an outage you caused yourself.
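
Rule 2 can be sketched as a thin wrapper around each tool. The function names and the shape of the result dictionary below are assumptions for illustration, not any framework's API. The point is that the raw exception goes to your logs, while the model sees a neutral, explicitly non-retryable status.

```python
def translate_tool_error(tool_name):
    """Build the model-visible result for a failed tool call (no raw error text)."""
    return {
        "tool": tool_name,
        "status": "unavailable",
        "retryable": False,
        "message": (
            f"{tool_name} is temporarily unavailable. Do not call it again in this run; "
            "continue with the information you already have or report the step as skipped."
        ),
    }


def safe_tool_call(tool_name, fn, *args, **kwargs):
    """Run a tool and return (model_visible_result, log_detail).

    log_detail is None on success; on failure it carries the raw error
    for human-facing logs and must not be placed in the prompt.
    """
    try:
        result = fn(*args, **kwargs)
    except Exception as exc:
        return translate_tool_error(tool_name), f"{type(exc).__name__}: {exc}"
    return {"tool": tool_name, "status": "ok", "result": result}, None


def broken_lookup():
    raise ConnectionError("DB connection failed")


visible, detail = safe_tool_call("db_lookup", broken_lookup)
print(visible["status"])  # unavailable
print(detail)             # ConnectionError: DB connection failed
```

The translated message alone is still only a suggestion to the model. It works best combined with rule 1, where the router refuses to dispatch further calls to a tool whose breaker is open.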

When the breaker opens, you want a graceful fallback hierarchy, not a dead end: try an alternate specialist agent, then a simpler rule-based path, then a cheaper model, and finally a human escalation queue. Each rung trades capability for reliability.
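
The fallback hierarchy can be expressed as an ordered list of handlers tried in turn. What follows is a minimal sketch: run_with_fallbacks and the four handlers are hypothetical stand-ins (the handlers just return strings or raise), and a real system would put actual agents, rules, models, and a queue behind them.

```python
def run_with_fallbacks(task, rungs):
    """Try each (name, handler) rung in order and return the first success.

    Each handler takes the task and either returns a result or raises.
    """
    skipped = []
    for name, handler in rungs:
        try:
            result = handler(task)
        except Exception as exc:
            skipped.append(f"{name}: {type(exc).__name__}")
            continue
        return {"handled_by": name, "result": result, "skipped": skipped}
    raise RuntimeError(f"all fallbacks failed: {skipped}")


# Hypothetical handlers, for illustration only.
def alternate_specialist(task):
    raise RuntimeError("circuit open")  # pretend this rung's breaker is open too


def rule_based_path(task):
    if "invoice" in task:
        return "category=finance"
    raise ValueError("no rule matched")


def cheaper_model(task):
    return "category=unknown (low confidence)"


def human_queue(task):
    return "escalated to human review"


rungs = [
    ("alternate_specialist", alternate_specialist),
    ("rule_based_path", rule_based_path),
    ("cheaper_model", cheaper_model),
    ("human_queue", human_queue),
]

print(run_with_fallbacks("categorize invoice 123", rungs)["handled_by"])  # rule_based_path
```

Recording which rungs were skipped, and why, keeps the degradation visible: a run that quietly landed on the cheaper model should show up in your logs as a degraded run, not a normal one.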

Build or buy

You don't have to write this from scratch. The pattern has a mature toolchain behind it, nearly two decades in the making (resilience4j, Hystrix's descendants), and a fast-growing agent-native one: open-source proxies like llm-circuit sit transparently between your agent and the API, and gateways like Portkey and TrueFoundry bundle circuit breaking with rate limiting and provider failover. The build-vs-buy call is the usual one: a proxy is the cheapest insurance if you're framework-agnostic; if you're on LangGraph or similar, check what its state and routing layer already let you enforce before adding a dependency.

The takeaway

The $437 night wasn't a freak event — it's the default behavior of any agent that can retry and can't reason about global health. A circuit breaker is the smallest piece of infrastructure that turns "ran all night, no alert" into "tripped in seconds, logged, recovered." It's not novel and it's not glamorous. It's the pattern your distributed-systems colleagues have leaned on for decades, and your agents — blind, parallel, non-deterministic, and eager to retry — need it even more than their services did.

Wrap the call. Trip on the loop, the failures, and the budget. Enforce it outside the agent. Build the half-open path. Then go to bed knowing the thing that notices at 3 AM isn't you.

👋 That’s All Folks!

Before you go, just a few public service announcements:

  • Do you have a topic in mind you'd like us to cover? DM me.

  • Looking to sponsor AIEdTalks’ Newsletter? DM me, and we’ll get back to you ASAP.

See you soon,

AIEdTalks’ Newsletter Team
