We are rapidly entering a world with lots and lots of AI agents, built on lots and lots of different frameworks. There have been previous efforts at defining a common protocol for interacting with agents, but now that we HAVE lots of good agents, the need is more urgent.
The goals of this proposal are to enable interoperability between agents built using different frameworks like LangGraph, Smol Agents, Atomic Agents, etc... Notably, our emphasis is on allowing two (or more) agents to collaborate together, rather than providing a common User->Agent interface (although that is a partial side-effect of this proposal).
You can read some background on our motivations for this project.
Our goal is to let multiple AI agents, built on different software stacks, collaborate on a task. As a concrete example, assume I have built my "Personal Assistant" agent which helps me with my daily tasks (it has access to my email, calendar, etc...). I want my agent to be able to use the Browser Use agent for browser automation tasks, AND the GPT Researcher agent to perform long research tasks. I could code my Personal Assistant by hand to accomplish this. The intention of this proposal is to define a protocol that makes such integration easy and extensible to other agents.
The current standard for "teams of agents" is to support agents as tools - re-using the function calling protocol to allow Agent A to invoke Agent B. This approach assumes that agents look like synchronous functions. You invoke the agent with a set of parameters, and then wait for it to return a unitary result.
This is the wrong model for agents. Agent operations may run for a long time, take different paths, and generate lots of intermediate results while they run. They may need to stop and ask a human for input. None of these characteristics fit well into a synchronous function call model (this is the same reason we build large concurrent systems using event driven architectures rather than RPC).
The correct model for AI Agents is actually the actor model, created back in 1973! Actors are independently operating entities which only access their own private state and communicate asynchronously by passing messages between them. This model naturally lets us fit our long-running, asynchronous, interruptible agents into a unified framework.
We propose that the correct model is not to "define down" agents as tools, but rather to generalize: "everything is an agent", including tools. All coordination happens via asynchronous message passing. If we adopt this model then agent cooperation is very natural, and tools and agents are interchangeable. Today I can use the hand-coded "web browser tool", but tomorrow I can swap it out for a true agent (like BrowserUse) which performs the job better.
(One caveat is that the LLM tool calling protocol only has a single LLM completion pass to 'observe' the results of a tool call. So if our 'tool' is an agent generating an output stream, what is the input to the 'observe' phase? This is still an open design question. 'Cache the events' and provide them all as the result is the easiest model. One could imagine progressively feeding the sub-agent results to the caller, like "Here are preliminary results from that tool call: ... Keep waiting for more output.")
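As a sketch of the "cache the events" approach, a caller could wrap a sub-agent behind a tool-style interface by collecting its intermediate output and returning it as a single observation. The `AgentClient` object, the snake_case event type strings, and the dict field names here are assumptions for illustration, not part of the spec:

```python
# A minimal sketch of the "cache the events" approach: wrap a sub-agent behind a
# tool-style interface by collecting its event stream into one observation.
# `AgentClient` and the event field names are illustrative assumptions only.

def call_agent_as_tool(client: "AgentClient", prompt: str) -> str:
    """Run a sub-agent and collapse its event stream into a single tool result."""
    started = client.run({"type": "chat", "input": prompt})      # -> RunStarted
    chunks: list[str] = []
    for event in client.get_events(started["run_id"], stream=True):
        if event["type"] == "text_output":
            chunks.append(event["content"])                      # cache intermediate text
        elif event["type"] == "run_completed":
            break                                                # final event of the run
    # The cached output becomes the single 'observe' input for the calling LLM.
    return "\n".join(chunks)
```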
An agent is defined as a named software entity which advertises a set of supported operations. A client can run the agent to perform an operation by sending it a run request message. Subsequently the agent will publish a stream of events relevant to the operation until eventually it publishes a run completed event.
All events between run started and run completed are considered a single run. The client can send another request to the same agent, and this is considered the next run. Agent memory is preserved across sequential runs, and together those runs constitute a thread (analogous to a web session). Threads are started automatically, but clients can also elect to start a new thread with any operation request.
This gives us the following model:
Agent
--> Thread
--> Run
--> Events
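Purely as an illustration (the identifiers and snake_case type strings are made up), a single thread with two sequential runs might be recorded like this:

```python
# Illustrative (made-up) data showing the Agent -> Thread -> Run -> Events nesting.
thread = {
    "agent": "personal_assistant",
    "thread_id": "thread-123",          # one "session" with the agent
    "runs": [
        {
            "run_id": "run-1",          # first request
            "events": [
                {"type": "run_started"},
                {"type": "text_output", "content": "Checking your calendar..."},
                {"type": "run_completed", "finish_reason": "success"},
            ],
        },
        {
            "run_id": "run-2",          # follow-up request; agent memory carries over
            "events": [
                {"type": "run_started"},
                {"type": "run_completed", "finish_reason": "success"},
            ],
        },
    ],
}
```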
Note that this spec is aimed at interoperability amongst agents in a trusted environment. We do not specify user authentication, nor any authz/authn mechanism between agents.
Agents must implement the following logical operations:
describe - Requests the agent to return its description (name and operations).
configure - Send a ConfigureRequest to configure some aspect of the agent's environment.
run - Send the agent a request to process. A request could start a new thread or continue one already in progress.
get events - Returns available events, or waits for more events from an active run.
Agents can advertise one or more supported operations via the describe protocol. For convenience our protocol assumes that every agent supports a generic "ChatRequest" operation type which contains a single text request (like a ChatGPT user prompt). Agents should implement this request by publishing intermediate TextOutput events (string messages) and a final RunCompleted event which contains a single string result. This "lowest-common-denominator" operation allows us to integrate almost any agent that supports a basic conversational interface.
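Here is a minimal sketch of how an agent might implement this generic chat operation. The generator shape and the "result" field name on RunCompleted are implementation choices for illustration, not mandated by the protocol:

```python
# Sketch of the lowest-common-denominator "chat" operation. The generator shape
# and the "result" field on RunCompleted are assumptions for illustration.
import uuid


def handle_chat_request(input_text: str, thread_id: str | None = None):
    run_id = str(uuid.uuid4())
    thread_id = thread_id or str(uuid.uuid4())   # a null thread_id starts a new thread
    yield {"type": "run_started", "run_id": run_id, "thread_id": thread_id}
    # ...the agent does its work, publishing intermediate TextOutput events...
    yield {"type": "text_output", "run_id": run_id, "content": "Looking into that..."}
    # The final event carries the single string result.
    yield {"type": "run_completed", "run_id": run_id,
           "finish_reason": "success", "result": "Here is my answer."}
```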
run(requestObject, thread_id, run_context)
Requests an agent to start an operation.
The requestObject specifies the details of the request and references an _operation_ defined by the agent.
If 'thread_id' is null, then a new Thread is started (agent short-term memory is initialized). If 'thread_id' is not null, then this operation continues an existing Thread.
"run_context" can pass additional metadata into the operation. A notable example is a "user_context" value which could identify the requesting user.
<-- returns a RunStarted object
get_events(run_id, stream=True)
Streams output events from the agent until _RunCompleted_ which should be the final event.
As you can see from this pseudo-code, much of our protocol lies in the definitions of the input and output events to the agent.
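For example, a caller exercising these two logical operations might look like the following sketch. The `agent` client object and field names are hypothetical; only the call pattern (run, stream events, continue the thread) is prescribed by the protocol:

```python
# Hypothetical caller exercising run() and get_events(); only the call pattern
# (run -> stream events -> continue the thread) is prescribed by the protocol.
started = agent.run({"type": "chat", "input": "Plan my week"}, thread_id=None)

for event in agent.get_events(started["run_id"], stream=True):
    print(event["type"], event.get("content", ""))
    if event["type"] == "run_completed":
        break

# A second request on the same thread_id continues the Thread (memory preserved).
followup = agent.run({"type": "chat", "input": "Move Friday's meeting"},
                     thread_id=started["thread_id"])
```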
Below are casual descriptions of the main types/events in the system. These will be formalized via JSON Schemas.
# == result of the "describe" API
type AgentDescriptor:
name: string
purpose: string
endpoints: list[string] - list of supported API endpoints
operations: list[AgentOperation]
tools: list[string] - for information purposes
# agent operations
type AgentOperation:
name: string
description: string
input_schema: Optional formal schema
output_schema: Optional formal schema
type DefaultChatOperation(AgentOperation):
name: chat
description: send a chat request
input_schema: [input: string]
output_schema: [output: string]
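As an illustration, a describe response for a hypothetical research agent could look like the following (all values, including the "deep_research" operation, are made up):

```python
# Made-up example of an AgentDescriptor as returned by "describe".
descriptor = {
    "name": "researcher",
    "purpose": "Performs long-running web research tasks",
    "endpoints": ["/describe", "/run", "/get_events", "/stream_request"],
    "operations": [
        {
            "name": "chat",
            "description": "send a chat request",
            "input_schema": {"input": "string"},
            "output_schema": {"output": "string"},
        },
        {
            "name": "deep_research",
            "description": "research a topic and produce a report artifact",
            "input_schema": {"topic": "string", "max_sources": "int"},
            "output_schema": None,
        },
    ],
    "tools": ["web_search", "pdf_writer"],
}
```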
# == Event base type
type Event:
id: int # incrementing event index, only unique within a Run
run_id: <uuid> # the Run that generated this event
thread_id: <uuid> # the Thread that this event is part of
agent: string # Identifier for the agent, defaults to the name
type: string # event type identifier
role: string # generally one of: system, assistant, user, tool
depth: int # indicates the caller-chain depth where this event originated
# == Request types
type ConfigureRequest: # pass configuration to the agent
args: dict
type Request:
logging_level: string # request additional logging detail from the agent
request_metadata: dict # opaque additional data to the request. Useful for things like:
# user_id, current_time, ...
type ChatRequest(Request):
input: string
type CancelRequest(Request): # cancel a request in progress
type ResumeWithInput(Request): # tell an agent to resume from WaitForInput
request_keys: dict # key, value pairs
# Implementations can implement new Request types. An example might be 'ChatWithFileUpload' which
# would include a file attachment with the user input.
# == Response events
type RunStarted: # the agent has started processing a request
run_id
type WaitForInput: # the agent is waiting on caller input
request_keys: dict # Requested key value, description pairs
type TextOutput(Event): # the agent generated some text output
content: string
type ToolCall(Event): # agent is calling a tool
function_name: string
args: dict
type ToolResult(Event): # a tool call returned a result
function_name: string
text_result: string # text representation of the tool result
type ArtifactGenerated(Event): # the agent generated some artifact
name: string
id: string
url: string
mime_type: string
type ToolTextOutput(Event): # tool call generated some text output
content: string
type ToolError(Event): # a tool encountered an error
content: string
type CompletionCall(Event): # agent is requesting a completion from the LLM
type CompletionResult(Event): # the result of an LLM completion call
type RunCompleted(Event): # the agent turn is completed
finish_reason: string [success, error, canceled]
An agent must support these events at minimum:
ChatRequest, RunStarted, RunCompleted
To make the operation of an agent visible, it should support these events:
TextOutput, ToolCall, ToolResult, ToolError, CompletionCall, CompletionResult
All other events are optional.
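Concretely, a single event on the wire carries the base Event fields plus the type-specific payload. The values below are made up, and the snake_case type strings are an assumption for illustration:

```python
# Made-up example of a TextOutput event showing the base Event fields.
event = {
    "id": 7,                  # incrementing index, unique only within the Run
    "run_id": "run-1",
    "thread_id": "thread-123",
    "agent": "researcher",
    "type": "text_output",
    "role": "assistant",
    "depth": 0,               # caller-chain depth where this event originated
    "content": "Found 3 relevant papers so far...",
}
```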
The most analogous existing API is the OpenAI Assistants API. We use similar, but not identical, nouns:
| OpenAI | Agent Protocol |
| --- | --- |
| Assistant | Agent |
| Thread | Thread |
| Run | Run |
| Steps | Events |
| Messages | Events |
Many apps and libraries have been built around the streaming completion API defined by OpenAI. To support broader compatibility, we provide a stream_request endpoint which takes a Request input object and immediately streams result events back to the client via SSE. This endpoint operates conceptually in a similar manner to the standard completion endpoint.
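A caller could consume this endpoint with any SSE-capable client. Here is a rough sketch using the requests library, anticipating the REST binding defined below; the URL, payload shape, and "data:" framing are assumptions based on standard SSE:

```python
# Rough sketch of consuming /stream_request over SSE using requests.
# The URL, payload shape, and "data:" framing assume a standard SSE response.
import json
import requests

url = "http://localhost:8000/researcher/stream_request"   # hypothetical agent path
payload = {"type": "chat", "input": "Summarize recent papers on agent protocols"}

with requests.post(url, json=payload, stream=True) as resp:
    for line in resp.iter_lines(decode_unicode=True):
        if not line or not line.startswith("data:"):
            continue                                  # skip keep-alives and comments
        event = json.loads(line[len("data:"):].strip())
        print(event["type"], event.get("content", ""))
        if event["type"] == "run_completed":
            break
```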
The protocol can be implemented on multiple transport types. For reference purposes we define a REST API that all agents should support. Other transports are optional (websocket, etc...).
Basic discovery endpoint
# List agents available at this endpoint
/ -> list[name, path] pairs
All other endpoints are relative to the agent's path:
# Get the agent's descriptor
/describe -> AgentDescriptor
# Send the agent a request to process
/run (Request) -> Event|None
params:
wait: bool # if true, wait for the agent's response and return an Event;
# otherwise the agent returns only the HTTP status code.
# Get events from a request. If stream=False then the agent will return any events queued since
# the last `get_events` call (basic polling mechanism). If stream=True then the endpoint will
# publish events via SSE
/get_events (run_id)
params:
stream: bool
since: event_id # pass the last event_id and any later events will be returned
# Convenience route that starts a new Run and streams back the results in one call
/stream_request (Request)
<-- events via SSE
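To make the shape of the REST binding concrete, here is a minimal, non-normative Flask sketch of the required routes for a single "echo" agent. Persistence, streaming, and error handling are omitted, and the in-memory event store is purely illustrative:

```python
# Minimal, non-normative Flask sketch of the required REST routes for one agent.
# Event storage is a plain in-memory dict; real implementations would persist runs.
import uuid
from flask import Flask, jsonify, request

app = Flask(__name__)
EVENTS: dict[str, list[dict]] = {}          # run_id -> list of events

@app.get("/")
def index():
    # Basic discovery endpoint: list agents available at this host.
    return jsonify([{"name": "echo", "path": "/echo"}])

@app.get("/echo/describe")
def describe():
    return jsonify({"name": "echo", "purpose": "Echoes chat input",
                    "operations": [{"name": "chat", "description": "send a chat request"}]})

@app.post("/echo/run")
def run():
    body = request.get_json()
    run_id = str(uuid.uuid4())
    thread_id = body.get("thread_id") or str(uuid.uuid4())
    # This toy agent completes synchronously and queues all of its events at once.
    EVENTS[run_id] = [
        {"id": 0, "type": "run_started", "run_id": run_id, "thread_id": thread_id},
        {"id": 1, "type": "text_output", "run_id": run_id, "content": body.get("input", "")},
        {"id": 2, "type": "run_completed", "run_id": run_id, "finish_reason": "success"},
    ]
    if request.args.get("wait"):
        return jsonify(EVENTS[run_id][0])   # return the RunStarted event
    return "", 202

@app.get("/echo/get_events")
def get_events():
    run_id = request.args["run_id"]
    since = int(request.args.get("since", -1))
    return jsonify([e for e in EVENTS.get(run_id, []) if e["id"] > since])
```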
**Optional endpoints**
GET /runs/{run_id} -> Returns the status of a request
GET /threads -> Returns a list of persisted Threads
GET /get_events/{thread_id} -> Returns all events for a Thread in chronological order
Example event flows:
# retrieve agent operations
GET /describe
# configure an agent
POST /configure (ConfigureRequest)
-> RunCompleted
# Run the agent, passing a chat prompt to the agent
POST /run (ChatRequest(input), wait=True)
-> RunStarted (contains 'run_id' and 'thread_id')
# Stream output events from the agent
GET /get_events/{run_id}?stream=True
# Continue a thread
POST /run (ChatRequest(thread_id=?))
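The same basic flow, expressed against the REST binding with the requests library and the polling form of get_events. The base URL and agent path are hypothetical:

```python
# The basic flow above, expressed with the requests library. The base URL and
# agent path are hypothetical; query/body shapes follow the endpoints above.
import time
import requests

BASE = "http://localhost:8000/echo"          # hypothetical agent path

started = requests.post(f"{BASE}/run", params={"wait": True},
                        json={"type": "chat", "input": "Hello there"}).json()

# Poll (stream=False) until the run completes.
last_id, done = -1, False
while not done:
    events = requests.get(f"{BASE}/get_events",
                          params={"run_id": started["run_id"],
                                  "stream": False, "since": last_id}).json()
    for ev in events:
        print(ev["type"], ev.get("content", ""))
        last_id, done = ev["id"], ev["type"] == "run_completed"
    if not events:
        time.sleep(0.5)                      # basic polling back-off
```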
Human in the Loop
POST /run (ChatRequest(input), wait=True)
-> RunStarted (contains 'run_id' and 'thread_id')
# Stream output events from the agent
GET /get_events/{run_id}?stream=True
<- WaitForInput event received (the run is paused)
..caller prompts for input...
POST /run (ResumeWithInput(run_id=?, request_keys=...))
GET /get_events/{run_id}?stream=True
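A caller loop for this flow might look like the following sketch. How the caller collects the human's answers is up to the application; the input() prompt, snake_case type strings, and base URL are illustrative assumptions:

```python
# Sketch of a caller handling WaitForInput: prompt the user, then resume the run.
# The transport calls mirror the REST flow above; field names follow the spec types.
import time
import requests

BASE = "http://localhost:8000/assistant"     # hypothetical agent path

def follow_run(run_id: str) -> None:
    """Poll a run's events, pausing to collect human input when asked."""
    last_id, done = -1, False
    while not done:
        events = requests.get(f"{BASE}/get_events",
                              params={"run_id": run_id, "since": last_id}).json()
        for ev in events:
            last_id = ev["id"]
            if ev["type"] == "wait_for_input":            # the run is paused
                answers = {key: input(f"{desc}: ")        # ask the human for each key
                           for key, desc in ev["request_keys"].items()}
                requests.post(f"{BASE}/run",
                              json={"type": "resume_with_input",
                                    "run_id": run_id, "request_keys": answers})
            elif ev["type"] == "run_completed":
                done = True
        if not events:
            time.sleep(0.5)                               # basic polling back-off
```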
Canceling a Request
You can interrupt a long-running agent run:
POST /run (ChatRequest(input), wait=True)
GET /get_events/{run_id}?stream=True
POST /run (CancelRequest(run_id=?))
GET /get_events/{run_id}?stream=True
<-- RunCompleted (finish_reason=canceled)
Artifact example
An agent uses a PDFWriter tool to create a PDF file that the caller can download:
POST /run (ChatRequest(input), wait=True)
GET /get_events/{run_id}?stream=True
<-- ArtifactGenerated
(caller displays the artifact to the user)
Persisted Threads
Caller lists available Threads, then requests the event history from a Thread:
GET /threads
GET /get_events/{thread_id=?}