
Support for realistic multi-step rollouts via async vLLM API #3284

Open

@BjarniHaukur

Feature request

I propose adding a new OpenAI-compatible vLLM API server for use with the GRPOTrainer.

The implementation mirrors the weight syncing logic from trl/scripts/vllm_serve.py, but offloads most complexity to the existing vllm.entrypoints.openai.api_server infrastructure.
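To make the layering concrete, here is a minimal sketch (not the PR's actual code) of the idea: take the FastAPI app built by vllm.entrypoints.openai.api_server and mount a TRL-style weight-sync route next to the standard /v1 endpoints. The route name and request fields follow trl/scripts/vllm_serve.py; the wiring here is purely illustrative.

```python
# Illustrative sketch only: a TRL-style weight-sync route alongside vLLM's
# OpenAI-compatible routes. In trl/scripts/vllm_serve.py the tensor data
# itself arrives over an NCCL broadcast; this shows just the control plane
# that tells the server a parameter update is coming.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()  # stand-in for the app from vllm.entrypoints.openai.api_server

class UpdateWeightsRequest(BaseModel):
    name: str         # e.g. "model.layers.0.mlp.up_proj.weight"
    dtype: str        # e.g. "torch.bfloat16"
    shape: list[int]  # tensor shape to receive over the communicator

@app.post("/update_named_param/")
async def update_named_param(request: UpdateWeightsRequest):
    # Real implementation: receive the broadcast tensor from the trainer
    # process and load it into the model executor's weights.
    ...
```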

This enables training on significantly more complex rollouts than the standard synchronous .generate() endpoint can support. By supporting the OpenAI API interface, it also allows seamless integration with a wide range of existing agent frameworks and products.
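Because the server speaks the standard OpenAI protocol, any agent or client can target it unchanged. A minimal sketch, assuming the server runs locally on port 8000 (the host, port, and model name are placeholders):

```python
from openai import OpenAI

# Point the stock OpenAI client at the local vLLM server; the api_key is
# unused by a local server but required by the client constructor.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",  # placeholder model name
    messages=[{"role": "user", "content": "Fix the failing test in tests/test_parser.py"}],
)
print(response.choices[0].message.content)
```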

This direction is a step toward reproducing pipelines like OpenHands LM 32B. I strongly suspect that Claude 3.7 Sonnet was trained in a similar fashion, iteratively reinforced using rollouts generated through its own Claude Code scaffolding.

Motivation

Currently, TRL only supports synchronous, batched .generate() calls for inference. This restricts the kinds of rollouts that can be generated, especially in domains that benefit from multi-step approaches, tool use, or environment interaction.

I've been using TRL for my Master's thesis on reinforcement learning for language models in the program-repair domain. In several GRPO experiments, I repeatedly hit the same limitation: with .generate(), all context construction, planning, and feedback extraction must happen within a single call. For example, in tasks from SWE-Gym, the model must generate code edits for real repositories. To do that in one .generate() call, the user has to manually construct the relevant repo context up front and later parse outputs such as diffs to extract reward signals. This makes experimentation slow and always feels like reinventing the wheel.

Rather than building ad-hoc scaffolding from scratch, I began exploring how to integrate existing coding agents like Aider directly into the training loop. These agents already support rich workflows such as repo mapping, diff parsing, and iterative interaction—and they use the OpenAI API interface. Enabling TRL to train models through this interface would allow us to run them in situ, inside the same environment they’re meant to be deployed in.

This proposal aims to bridge that gap and enable more realistic, multi-step training workflows as a first-class feature in TRL.

Your contribution

I have an initial working implementation in PR #3285, which introduces vllm_serve_openai_compatible.py.

I intend to tie up the remaining loose ends and test the approach properly, both for functional correctness and for throughput.

The draft PR also includes a few project-specific utilities (WIP) to illustrate how this can be used in practice. For example, it shows how to parallelize existing Aider instances that interact with this server to generate training data.
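As a rough illustration of what that utility looks like, here is a sketch (not the PR's code) that fans out one non-interactive Aider process per task against the local server. The repo paths and task strings are made up, and the Aider CLI flags (--message, --openai-api-base, --yes) should be checked against your Aider version.

```python
# Rough sketch of the parallelization utility: run several Aider instances
# against the local OpenAI-compatible vLLM server at once, one per task.
import subprocess
from concurrent.futures import ThreadPoolExecutor

SERVER = "http://localhost:8000/v1"  # the OpenAI-compatible vLLM server

def run_rollout(repo_dir: str, task: str) -> int:
    """Run one non-interactive Aider session against the local server."""
    return subprocess.run(
        ["aider", "--message", task, "--openai-api-base", SERVER, "--yes"],
        cwd=repo_dir,
    ).returncode

tasks = [
    ("repos/task_0", "Fix the failing test in tests/test_parser.py"),
    ("repos/task_1", "Resolve the TypeError raised in utils/io.py"),
]

# One Aider process per task; the server batches the concurrent requests.
with ThreadPoolExecutor(max_workers=8) as pool:
    exit_codes = list(pool.map(lambda args: run_rollout(*args), tasks))
```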

One open issue is how to reliably access the full conversation history for each rollout. Since the API calls happen internally within the agent, we cannot assume access to a .get_conversation_history() method or similar. One possible approach is to record all requests and responses server-side and map them back to the original prompt to reconstruct the complete rollouts to train on.
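One possible shape for that server-side recording, sketched below under the assumption that the agent keeps the original task prompt at the head of every request it makes. The RolloutRecorder class and its keying scheme are hypothetical:

```python
from collections import defaultdict

class RolloutRecorder:
    """Hypothetical server-side log of chat-completion calls, keyed by the
    original task prompt, used to stitch requests back into full rollouts."""

    def __init__(self):
        self._log = defaultdict(list)  # prompt -> [(messages, completion), ...]

    def record(self, messages: list[dict], completion: str) -> None:
        # Key on the first user message; this assumes the agent keeps the
        # original task prompt at the head of every request it makes.
        key = next(m["content"] for m in messages if m["role"] == "user")
        self._log[key].append((messages, completion))

    def reconstruct(self, prompt: str) -> list[dict]:
        calls = self._log[prompt]
        if not calls:
            raise KeyError(f"no recorded calls for prompt: {prompt!r}")
        # The last request in a rollout carries the longest message history;
        # appending its completion yields the full conversation to train on.
        messages, completion = max(calls, key=lambda c: len(c[0]))
        return messages + [{"role": "assistant", "content": completion}]
```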

I’d be happy to align the implementation with TRL’s design goals and iterate toward something mergeable.
