As Large Language Models (LLMs) have gotten more powerful, we’ve started thinking of them not just as text-in, text-out models, but as “agents”[1] that can take problems, perform actions, and arrive at solutions. Despite the significant advancements in LLM agentic capabilities in the last year (OpenAI o3, Anthropic Computer Use), it’s still a non-trivial challenge to plug agents effectively into existing institutions and enterprise products.
While LLM-based agents are deceptively capable of low-complexity automations, anyone building real agentic products is likely running into a common set of challenges:
While 90% accuracy might work for something like ChatGPT, that doesn’t cut it for products that aim to approach (or possibly replace) human-level capabilities.
Their efficacy rapidly degrades as you introduce enterprise-specific complexity (e.g., every piece of product-specific context or constraint you prompt the agent with).
Enterprise data is messy, and while human employees can be trained over months to cope with this, an agent will struggle to handle large amounts of nuance and gotchas.
The larger and more capable the agent, the harder it is to evaluate, make low-risk changes, and parallelize improvements across an engineering team.
While you may initially try using human-in-the-loop, parameter-based fine-tuning, or reducing agent-facing complexity — these will eventually come to limit your scale, margin, and product capabilities. Many of these problems also don’t necessarily go away when using GPT-{N+1}, as model “reasoning” and “intelligence” can be orthogonal to an AI developer’s own ability to accurately provide the right structure, context, and assumptions.
Multi-Agent Systems
My proposal is that the primary way to solve these issues long term will be through decomposing agentic systems into an organization of subdomain-specific subagents. I think of this as akin to human-based organizational design where individual human employees with specialized roles are organized to solve complex problems (e.g., running a SaaS company).
By breaking down the monolithic “agent” this way, we get subagents that:
Own and abstract away the complexity of their subdomain (~ a software engineer owns the codebase complexity, an account executive owns the complexity of a specific account)
Will communicate with other subagents in semi-structured natural language (~ tickets, structured meetings/channels)
Can be evaluated and improved independently without risking a degradation to the whole system (~ performance reviews, mentorship, termination)
These properties allow you to greatly mitigate those common issues with enterprise-grade agentic systems:
Complexity is managed by keeping per-subagent complexity low (e.g. many subagents with short prompts rather than a single agent with a large prompt) and a team of AI developers can work on these in parallel.
Reliability is improved through modular evaluation and fault isolation (e.g., a poor-performing subagent is unlikely to cause the entire system to fail, and if part of the system does fail, it should be easy to isolate which subagent was responsible).
Subagents also fall into two primary types:
Frontend Subagents who interact directly with users outside the organization. They must handle translation from external to internal terminology (i.e. what do they actually want?) and external-facing tone/outputs. They often own customer interaction and conversational state. (~ sales, support, marketing, etc)
Backend Subagents who interact only internally with other subagents to solve various subproblems. They own data nuances and proprietary internal workflows. Often they are stateless. (~ engineering, product, managers, etc)
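The frontend/backend split above can be sketched in code. This is a minimal illustration, not a real framework: the `call_llm` stub, class names, and prompts are all hypothetical stand-ins for whatever LLM API and agent abstractions you actually use. The key distinction it shows is that frontend subagents carry conversational state while backend subagents are stateless:

```python
from dataclasses import dataclass, field

def call_llm(system_prompt: str, message: str) -> str:
    """Stand-in for a real chat-completion API call (hypothetical)."""
    return f"response({system_prompt!r}, {message!r})"

@dataclass
class BackendSubagent:
    """Internal worker: stateless, owns one subdomain's prompt and data nuances."""
    system_prompt: str

    def handle(self, task: str) -> str:
        return call_llm(self.system_prompt, task)

@dataclass
class FrontendSubagent:
    """User-facing: owns conversational state and external-facing tone."""
    system_prompt: str
    history: list = field(default_factory=list)

    def handle(self, user_message: str) -> str:
        self.history.append(("user", user_message))
        reply = call_llm(self.system_prompt, user_message)
        self.history.append(("agent", reply))
        return reply
```

In practice each subagent would also carry its own tools, retrieval sources, and evaluation suite; the point is that those all stay scoped to one subdomain.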
While I typically try to avoid anthropomorphizing LLMs, drawing tight parallels with human-centered organizational design makes multi-agent systems significantly more intuitive to design and manage. For those into systems thinking, it would be interesting to see how these architectures align with how you already see human-based organizations.
Multi-Agent Architectures
While “decompose a big problem into smaller problems” is a trivial answer to many kinds of engineering problems, it can be unclear what this means for LLM-based agents specifically. Based on agents I’ve built and seen in the wild, I’ve identified three main multi-agent architectures, along with their trade-offs.
Assembly Line Agents
The “assembly line” (aka vertical) architecture puts the subagents in a linear sequence starting with a frontend subagent, then several backend subagents, and a final frontend subagent that produces the answer. It’s best for problems that have a shared sequence of steps for all inputs.
Features are implemented by adding more intermediate backend subagents.
Failures occur when handling out-of-domain questions that don’t fit the predetermined sequence of steps, requiring one of the alternatives below.
Examples
A basic prompt-to-website builder. The system works in stages: first writing a PRD, then building the frontend and backend one piece at a time. The final subagents ensure quality and produce the user-facing output.
[user prompt] → Build Site Requirements → Build Frontend Components → Build Frontend → Build Backend Schemas → Build Backend → Perform QA → Documentation → [website]
MetaGPT: Meta Programming for a Multi-Agent Collaborative Framework
Variations
Early stopping — an intermediate subagent can decide to abort or prevent further processing
Parallelism — intermediate subagents can run in parallel (i.e. as a DAG) depending on their dependencies
Self-consistency — run the full flow or part of the flow multiple times and pick (using a heuristic or another LLM) the best output
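The assembly line (including the early-stopping variation) can be sketched as an ordered list of stage functions, each consuming the previous stage’s output. The stage names and `call_llm` stub below are hypothetical, condensed from the website-builder example:

```python
def call_llm(stage_prompt: str, payload: str) -> str:
    """Stand-in for a real chat-completion API call (hypothetical)."""
    return f"{stage_prompt}: {payload}"

# Ordered stages; each entry is (name, stage function). A stage returns None
# to abort the pipeline (the "early stopping" variation).
STAGES = [
    ("requirements", lambda p: call_llm("Write a PRD for", p)),
    ("frontend",     lambda p: call_llm("Build the frontend from", p)),
    ("backend",      lambda p: call_llm("Build the backend from", p)),
    ("qa",           lambda p: call_llm("QA the site described by", p)),
]

def run_assembly_line(user_prompt: str) -> str:
    payload = user_prompt
    for name, stage in STAGES:
        payload = stage(payload)
        if payload is None:  # early stopping: this stage vetoed further work
            return f"Aborted at stage '{name}'."
    return payload
```

The parallelism variation replaces the flat list with a DAG of stages; self-consistency wraps `run_assembly_line` in a loop and picks the best of N outputs.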
Call Center Agents
The “call center” (aka horizontal) architecture stratifies requests over subdomain-specific frontend subagents. It’s best for handling very diverse sets of inputs and outputs and when functionality is fairly correlated with specific subdomains. Each subagent is expected to produce an appropriate customer-facing response.
Features can be added by simply adding more subdomain frontend subagents.
Failures occur when answers need to join information from several different subdomains, requiring a manager-worker architecture.
Examples
A basic travel assistant. The user prompt is routed using a keyword heuristic to a subagent dedicated to that question. The user speaks exclusively with that subdomain expert unless the subagent decides to transfer to another one.
[user prompt] →
Weather Assistant → [forecast, weather advice]
Flight Booking Assistant → [flight recommendations, tickets]
Hotel Booking Assistant → [hotel recommendations, tickets]
Car Booking Assistant → [car recommendations, tickets]
Variations
Advanced routing — there are several mechanisms for initial routing: basic heuristics, the user themselves via a UI, or another LLM
Transfers — for cross-subdomain questions, or if a subagent fails, it can transfer the conversation to another subagent
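A minimal sketch of the keyword-heuristic routing from the travel-assistant example. All keywords, prompts, and the `call_llm` stub are hypothetical; a real router might be a classifier, a UI choice, or another LLM, as noted above:

```python
def call_llm(system_prompt: str, message: str) -> str:
    """Stand-in for a real chat-completion API call (hypothetical)."""
    return f"{system_prompt} answering: {message}"

# Keyword heuristic -> subdomain-specific frontend subagent prompt.
ROUTES = {
    "weather": "You are a weather assistant.",
    "flight":  "You are a flight booking assistant.",
    "hotel":   "You are a hotel booking assistant.",
    "car":     "You are a car booking assistant.",
}

def route(user_prompt: str) -> str:
    for keyword, system_prompt in ROUTES.items():
        if keyword in user_prompt.lower():
            return call_llm(system_prompt, user_prompt)
    # No keyword matched: fall back to a generalist (or transfer/escalate).
    return call_llm("You are a general travel assistant.", user_prompt)
```

Note the failure mode described above is visible here: a prompt spanning flights and hotels still lands on a single subagent, which is what pushes you toward manager-worker.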
Manager-Worker Agents
The “manager-worker” architecture uses an orchestrator frontend subagent to task internal backend subagents with different pieces of the problem. The backend worker subagent outputs are then used by the orchestrator to form the final output. It’s best for problems that require complex joins from several subdomains and when the output format is fairly standard among all types of inputs. Unlike the call center architecture, the manager is solely responsible for compiling a user-facing response.
Features are implemented by adding more worker subagents.
Failures occur when the manager becomes too complex, requiring breaking the manager itself into either an assembly line or call center-style agent.
Examples
An advanced travel assistant. The user input is passed into a manager who asks experts (via tool-use) subdomain-specific questions. The expert responses are then compiled by the manager into the final answer.
[user prompt] →
Travel Manager
Flights Expert
Hotels Expert
Car Rental Expert
Weather Expert
→ [recommendations, bookings]
Variations
Sync/async — tasks for backend subagents can either block the orchestrator (the tool-call returns the worker’s response) or run asynchronously (the tool-call returns a promise)
Worker recursion — backend subagents can request responses from other backend subagents
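The manager-worker flow, in its simplest synchronous form, can be sketched as below. The expert names, prompts, and `call_llm` stub are hypothetical, and for brevity the manager queries every expert; in a real system the manager LLM would decide which workers to tool-call and with what subdomain-specific questions:

```python
def call_llm(system_prompt: str, message: str) -> str:
    """Stand-in for a real chat-completion API call (hypothetical)."""
    return f"{system_prompt} -> {message}"

# Backend worker subagents the manager can "tool-call" (names hypothetical).
EXPERTS = {
    "flights": "You are a flights expert.",
    "hotels":  "You are a hotels expert.",
    "weather": "You are a weather expert.",
}

def manager(user_prompt: str) -> str:
    # Sync variation: each tool-call blocks until the worker responds.
    findings = {
        name: call_llm(prompt, user_prompt) for name, prompt in EXPERTS.items()
    }
    # Unlike the call center, the manager alone compiles the user-facing answer.
    briefing = "; ".join(f"{name}: {answer}" for name, answer in findings.items())
    return call_llm("You are a travel manager. Compile a final answer.", briefing)
```

The async variation would have each tool-call return a future the manager awaits later; worker recursion means an expert itself calls other experts before answering.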
Open Questions
As far as I can tell, these patterns (or some variant) will become increasingly part of modern LLM-agent system design over the next few years.
There are, however, still some open questions:
How much will this cost? It’s implementation-dependent whether moving towards this structure will save money. On one hand, subagents reduce “unused” prompt instructions and enable better semantic caching, but on the other hand, they require some amount of per-subagent instruction overhead.
What are the actual tools and frameworks for building these? I use custom frameworks for agent management, but CrewAI and LangGraph look promising. As for good third-party tools for multi-agent evaluation — I haven’t seen one.
How important is building a GenAI engineering team modeled around a multi-agent architecture? One useful property of this organization is that it’s intuitive how to split the AI development work across human AI developers. This may matter in a 1- to 3-year timespan, but eventually agent-iteration itself might be abstracted away by more powerful AI dev tools.
How much will LLM-agent system design change when we get increasingly intelligent models? I suspect some level of subagent organization will be required for at least the next 10 years. The biggest change may be increased complexity-per-subagent and a reduced effort to “prompt engineer” vs just throwing large amounts of data into the model’s context.
There’s also a large disconnect right now between the full capabilities of frontier models and the abilities of agentic products. It’s easy to see why “AGI is almost here!!” is seen as hype (and to some extent it is) when the actual AI-branded tools and copilots we see as consumers can be fairly underwhelming. I think this is because foundation model improvements (the hype) are far outpacing enterprise agent development (what we see), and that as the industry figures this out (e.g. by adopting better LLM-agent system design and multi-agent architectures) we’ll start to see more “this-is-so-good-it’s-scary” AI products.
[1] The definition of “agents” has become a bit controversial. When I use it, I’m referring to all Anthropic-defined “agentic systems”. However, these multi-agent paradigms are only really useful for “Agents…where LLMs dynamically direct their own processes and tool usage, maintaining control over how they accomplish tasks.”