supercog-ai/agent-protocol
Agent Protocol (draft - 2025 edition)

We are rapidly entering a world with lots and lots of AI agents, built on lots and lots of different frameworks. There have been previous efforts at defining a common protocol for interacting with agents, but now that we HAVE lots of good agents, the need is more urgent.

The goals of this proposal are to enable interoperability between agents built using different frameworks like LangGraph, Smol Agents, Atomic Agents, etc... Notably, our emphasis is on allowing two (or more) agents to collaborate together, rather than providing a common User->Agent interface (although that is a partial side-effect of this proposal).

You can read some background on our motivations for this project.

Goal by example

Our goal is to let multiple AI agents, built on different software stacks, collaborate on a task. To make a concrete example, assume I have built my "Personal Assistant" agent which helps me with my daily tasks (it has access to my email, calendar, etc...). I want my agent to be able to use the Browser Use agent for browser automation tasks, AND the GPT Researcher agent to perform long research tasks. I could code my Personal Assistant by hand to accomplish this. The intention of this proposal is to define a protocol where such integration would be easy and extensible to other agents.

Why tool calling is the wrong paradigm

The current standard for "teams of agents" is to support agents as tools - re-using the function calling protocol to allow Agent A to invoke Agent B. This approach assumes that agents look like synchronous functions. You invoke the agent with a set of parameters, and then wait for it to return a unitary result.

This is the wrong model for agents. Agent operations may run for a long time, take different paths, and generate lots of intermediate results while they run. They may need to stop and ask a human for input. None of these characteristics fit well into a synchronous function call model (this is the same reason we build large concurrent systems using event driven architectures rather than RPC).

The correct model for AI Agents is actually the actor model, created back in 1973! Actors are independently operating entities which only access their own private state and communicate asynchronously by passing messages between them. This model naturally allows us to fit our long-running, asynchronous, interruptible agents into a unified framework.

Tools as Agents

We propose that the correct model is not to "define down" agents as tools, but rather to generalize "everything is an agent", including tools. All coordination happens via asynchronous message passing. If we adopt this model then agent cooperation is very natural, and tools and agents are interchangeable. Today I can use the hand-coded "web browser tool", but tomorrow I can swap it out for a true agent (like BrowserUse) which performs the job better.

(One caveat is that the LLM tool calling protocol only has a single LLM completion pass to 'observe' the results of a tool call. So if our 'tool' is an agent generating an output stream, what is the input to the 'observe' phase? This is still an open design question. 'Cache the events' and provide them all as the result is the easiest model. One could imagine progressively feeding the sub-agent results to the caller, like "Here are preliminary results from that tool call: ... Keep waiting for more output.")

Agent definition

An agent is defined as a named software entity which advertises a set of supported operations. A client can run the agent to perform an operation by sending it a run request message. Subsequently the agent will publish a stream of events relevant to the operation until eventually it publishes a run completed event.

All events between run started and run completed are considered a single run. The client can send another request to the same agent, and this is considered the next run. Agent memory is preserved across sequential runs, and together those runs constitute a thread (analogous to a web session). Threads are started automatically, but clients can also elect to start a new thread with any operation request.

Thus this model:

Agent
    --> Thread
        --> Run
            --> Events

Exclusions

Note that this spec is aimed at interoperability amongst agents in a trusted environment. We do not specify user authentication, nor any authz/authn between agents.

Base elements of the protocol

Agents must implement the following logical operations:

describe - Requests the agent to return its description (name and operations)

configure - Send a ConfigureRequest to configure some aspect of the agent's environment

run - Send the agent a request to process. A request could start a new thread or continue one already in progress.

get events - Returns available events, or waits for more events from an active run

Agent operations

Agents can advertise one or more supported operations via the describe protocol. For convenience our protocol assumes that every agent supports a generic "ChatRequest" operation type which contains a single text request (like a ChatGPT user prompt). Agents should implement this request by publishing intermediate TextOutput events (string messages) and publishing a final RunCompleted event which contains a single string result. This "lowest-common-denominator" operation allows us to integrate almost any agent that supports a basic conversational interface.

Pseudo-code example

run(requestObject, thread_id, run_context)
    Requests an agent to start an operation. 
    The requestObject specifies the details of the request and references an _operation_ defined by the agent.
    If 'thread_id' is null, then a new Thread is started (agent short-term memory is initialized).
    "run_context" can pass additional metadata into the operation. A notable example is the "user_context".
    which could identify the requesting user.
    If thread_id is not null, then this operation continues an existing Thread.

  <-- returns a RunStarted object

get_events(run_id, stream=True)
    Streams output events from the agent until _RunCompleted_ which should be the final event.

As you can see from this pseudo-code, much of our protocol lies in the definitions of the input and output events to the agent.
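As an illustration, the run/get_events pseudo-code above can be sketched as a minimal in-memory agent implementing the generic chat operation. This is a hedged sketch, not part of the spec: the EchoChatAgent class, its queue-based event store, and the exact dict shapes of the events are all assumptions for demonstration.

```python
import queue
import uuid

class EchoChatAgent:
    """Illustrative in-memory agent: echoes a chat request as a stream of events."""

    def __init__(self, name="echo"):
        self.name = name
        self._events = {}  # run_id -> queue of event dicts

    def run(self, chat_request: dict, thread_id=None):
        """Start a run: publish intermediate TextOutput events, then RunCompleted."""
        run_id = str(uuid.uuid4())
        q = queue.Queue()
        self._events[run_id] = q
        for word in chat_request["input"].split():
            q.put({"type": "TextOutput", "run_id": run_id, "content": word})
        q.put({"type": "RunCompleted", "run_id": run_id,
               "finish_reason": "success", "result": chat_request["input"]})
        # The caller gets a RunStarted object back immediately.
        return {"type": "RunStarted", "run_id": run_id}

    def get_events(self, run_id):
        """Yield events for a run until RunCompleted, which is the final event."""
        q = self._events[run_id]
        while True:
            event = q.get()
            yield event
            if event["type"] == "RunCompleted":
                break

agent = EchoChatAgent()
started = agent.run({"type": "ChatRequest", "input": "hello agent world"})
events = list(agent.get_events(started["run_id"]))
```

A real agent would run asynchronously and publish events as work progresses; the shape of the exchange (RunStarted, then a stream ending in RunCompleted) is the same.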

Schema definitions

Below are casual descriptions of the main types/events in the system. These will be formalized via JSON Schemas.

# == result of the "describe" API

type AgentDescriptor:
    name: string
    purpose: string
    endpoints: list[string] - list of supported API endpoints
    operations: list[AgentOperation]
    tools: list[string] - for information purposes

# agent operations 

type AgentOperation:
    name: string
    description: string
    input_schema: Optional formal schema
    output_schema: Optional formal schema

type DefaultChatOperation(AgentOperation):
    name: chat
    description: send a chat request
    input_schema: [input: string]
    output_schema: [output: string]

# == Event base type

type Event:
    id: int             # incrementing event index, only unique within a Run
    run_id: <uuid>      # the Run that generated this event
    thread_id: <uuid>   # the Thread that this event is part of
    agent: string       # Identifier for the agent, defaults to the name
    type: string        # event type identifier
    role: string        # generally one of: system, assistant, user, tool
    depth: int          # indicates the caller-chain depth where this event originated

# == Request types

type ConfigureRequest:         # pass configuration to the agent
    args: dict

type Request:
    logging_level:  string # request additional logging detail from the agent
    request_metadata: dict   # opaque additional data to the request. Useful for things like:
                             # user_id, current_time, ...

type ChatRequest(Request):
    input: string

type CancelRequest(Request): # cancel a request in progress

type ResumeWithInput(Request): # tell an agent to resume from WaitForInput
    request_keys: dict  # key, value pairs

# Implementations can implement new Request types. An example might be 'ChatWithFileUpload' which
# would include a file attachment with the user input. 

# == Response events

type RunStarted(Event): # the agent has started processing a request
    run_id

type WaitForInput(Event):   # the agent is waiting on caller input
    request_keys: dict      # Requested key value, description pairs

type TextOutput(Event): # the agent generated some text output
    content: string 

type ToolCall(Event):   # agent is calling a tool
    function_name: string
    args: dict

type ToolResult(Event): # a tool call returned a result
    function_name: string
    text_result: string     # text representation of the tool result

type ArtifactGenerated(Event): # the agent generated some artifact
    name: string
    id: string
    url: string
    mime_type: string

type ToolTextOutput(Event): # tool call generated some text output
    content: string

type ToolError(Event):
    content: string         # a tool encountered an error

type CompletionCall(Event): # agent is requesting a completion from the LLM

type CompletionResult(Event): # the result of an LLM completion call

type RunCompleted(Event): # the agent turn is completed
    finish_reason: string   [success, error, canceled]
    

The minimum Event set

An agent must support these events at minimum:

ChatRequest, RunStarted, RunCompleted

To make the operation of an agent visible, it should support these events:

TextOutput, ToolCall, ToolResult, ToolError, CompletionCall, CompletionResult

All other events are optional.
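As a sketch, the minimum set could be modeled with Python dataclasses. The spec will formalize these as JSON Schemas, so the field defaults and the snake_case type strings below are assumptions, not normative:

```python
from dataclasses import dataclass, field

@dataclass
class Event:
    # Base fields shared by all events, per the Event base type above.
    id: int            # incrementing event index, unique only within a Run
    run_id: str
    thread_id: str
    agent: str
    type: str
    role: str = "assistant"
    depth: int = 0

@dataclass
class ChatRequest:
    input: str
    logging_level: str = "info"
    request_metadata: dict = field(default_factory=dict)

@dataclass
class RunStarted(Event):
    type: str = "run_started"

@dataclass
class RunCompleted(Event):
    finish_reason: str = "success"  # one of: success, error, canceled
    type: str = "run_completed"
```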

Relation to OpenAI APIs

The most analogous API is the OpenAI Assistants API. We use similar, but not identical, nouns:

| OpenAI | Agent Protocol |
| --- | --- |
| Assistant | Agent |
| Thread | Thread |
| Run | Run |
| Steps | Events |
| Messages | Events |

Many apps and libraries have been built around the streaming completion API defined by OpenAI. To support broader compatibility, we provide a stream_request endpoint which takes a Request input object and immediately streams result events back to the client via SSE. This endpoint operates conceptually in a similar manner to the standard completion endpoint.
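On the client side, consuming that SSE stream could look like the following sketch. The payload here is canned sample data rather than a live HTTP response, and parse_sse assumes one JSON object per `data:` line:

```python
import json

def parse_sse(raw: str):
    """Yield one JSON-decoded event per SSE 'data:' line."""
    for line in raw.splitlines():
        if line.startswith("data:"):
            yield json.loads(line[len("data:"):].strip())

# Canned sample of what a stream_request response body might carry.
sample = (
    'data: {"type": "RunStarted", "run_id": "r1"}\n\n'
    'data: {"type": "TextOutput", "content": "working..."}\n\n'
    'data: {"type": "RunCompleted", "finish_reason": "success"}\n\n'
)
events = list(parse_sse(sample))
```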

Protocol as REST Endpoints

The protocol can be implemented on multiple transport types. For reference purposes we define a REST API that all agents should support. Other transports are optional (websocket, etc...).

Basic discovery endpoint

    # List agents available at this endpoint
    /   -> list[name, path] pairs


All other endpoints are relative to the agent's path:

    # Get the agent's descriptor
    /describe -> AgentDescriptor

    # Send the agent a request to process
    /run (Request) -> Event|None
        params: 
            wait: bool  # if true, wait for the agent's response and return an Event;
                        # otherwise return only the HTTP status code.

    # Get events from a request. If stream=False then the agent will return any events queued since
    # the last `get_events` call (basic polling mechanism). If stream=True then the endpoint will
    # publish events via SSE
    /get_events (run_id)
        params:
            stream: bool
            since: event_id     # pass the last event_id and any later events will be returned

    # Convenience route that starts a new Run and streams back the results in one call
    /stream_request (Request)
        <-- events via SSE
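To illustrate the polling mode (stream=False with the `since` parameter), here is a hedged client-side sketch. fetch_events is a hypothetical stand-in for the HTTP GET against /get_events, and the canned batches simulate the server's queued events:

```python
def poll_until_complete(fetch_events, run_id):
    """Repeatedly fetch queued events, passing the last seen event id as `since`."""
    seen, last_id = [], None
    while True:
        for event in fetch_events(run_id, since=last_id):
            seen.append(event)
            last_id = event["id"]
            if event["type"] == "RunCompleted":
                return seen
        # A real client would sleep or back off here between polls.

# Canned transport returning the run's events in two batches.
batches = [
    [{"id": 1, "type": "RunStarted"}, {"id": 2, "type": "TextOutput"}],
    [{"id": 3, "type": "RunCompleted"}],
]
def fetch_events(run_id, since=None):
    return batches.pop(0) if batches else []

events = poll_until_complete(fetch_events, "r1")
```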

**Optional endpoints**

    GET /runs/{run_id}              -> Returns the status of a Run
    GET /threads                    -> Returns a list of persisted Threads
    GET /get_events/{thread_id}     -> Returns all events for a Thread in chronological order

Example event flows:

# retrieve agent operations
GET /describe

# configure an agent
POST /configure (ConfigureRequest)
    -> RunCompleted

# Run the agent, passing a chat prompt to the agent
POST /run (ChatRequest(input), wait=True)
    -> RunStarted (contains 'run_id' and 'thread_id')

# Stream output events from the agent
GET /get_events/{run_id}?stream=True

# Continue a thread
POST /run (ChatRequest(thread_id=?))

Human in the Loop

POST /run (ChatRequest(input), wait=True)
    -> RunStarted (contains 'run_id' and 'thread_id')

# Stream output events from the agent
GET /get_events/{run_id}?stream=True

<- WaitForInput event received (the run is paused)
..caller prompts for input...

POST /run (ResumeWithInput(run_id=?))
GET /get_events/{run_id}?stream=True
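The resume step in the flow above can be sketched on the caller side. handle_events and prompt_user are illustrative helpers, not protocol names; the event stream is canned sample data:

```python
def handle_events(events, prompt_user):
    """Scan events; if the run pauses, build a ResumeWithInput request from user answers."""
    for event in events:
        if event["type"] == "WaitForInput":
            # request_keys maps each requested key to a human-readable description.
            answers = {key: prompt_user(key, desc)
                       for key, desc in event["request_keys"].items()}
            return {"type": "ResumeWithInput",
                    "run_id": event["run_id"],
                    "request_keys": answers}
    return None

stream = [
    {"type": "RunStarted", "run_id": "r1"},
    {"type": "WaitForInput", "run_id": "r1",
     "request_keys": {"api_key": "API key for the billing service"}},
]
resume = handle_events(stream, prompt_user=lambda key, desc: "sk-test")
```

The caller would then POST the returned ResumeWithInput back to /run and resume streaming events.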

Canceling a Request

You can interrupt a long-running agent run:

POST /run (ChatRequest(input), wait=True)

GET /get_events/{run_id}?stream=True

POST /run (CancelRequest(run_id=?))

GET /get_events/{run_id}?stream=True
<-- RunCompleted (finish_reason=canceled)

Artifact example

An agent uses a PDFWriter tool to create a PDF file that the caller can download:

POST /run (ChatRequest(input), wait=True)

GET /get_events/{run_id}?stream=True
<-- ArtifactGenerated
(caller displays the artifact to the user)

Persisted Threads

Caller lists available Threads, then requests the event history from a Thread:

GET /threads
GET /get_events/{thread_id=?}

About

A protocol for cooperation between agents built on different tech stacks
