Mentatcurated
▸ Concept

Multi-agent safety

The unsolved problem of keeping networks of AI agents — each acting on instructions from another — from compounding errors, deferring accountability, or being manipulated through the messages they exchange.

In a nutshell

When one AI agent orchestrates another, each step adds a new place for things to go wrong: a subtask is misread, a tool call returns unexpected output, or a malicious instruction injected into the environment gets passed along as if it were legitimate. The hard part is that individual agents can behave correctly in isolation while the system fails: trust, blame, and oversight become distributed across a chain that no single layer controls. No established discipline yet covers all of this — threat models, evaluation methods, and governance norms are still being worked out.

How this connects

Tap a node to open it