
Support for realistic multi-step rollouts via async vLLM API #3284

Open

@BjarniHaukur

Feature request

I propose adding a new OpenAI-compatible vLLM API server for use with the GRPOTrainer.

The implementation mirrors the weight syncing logic from trl/scripts/vllm_serve.py, but offloads most complexity to the existing vllm.entrypoints.openai.api_server infrastructure.
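To make the layering concrete, here is a minimal sketch (not the PR's actual code) of the idea: take the FastAPI app built by vllm.entrypoints.openai.api_server and mount a TRL-style weight-sync route next to the standard /v1 endpoints. The route name and request fields follow trl/scripts/vllm_serve.py; the wiring here is purely illustrative.

```python
# Illustrative sketch only: a TRL-style weight-sync route alongside vLLM's
# OpenAI-compatible routes. In trl/scripts/vllm_serve.py the tensor data
# itself arrives over an NCCL broadcast; this shows just the control plane
# that tells the server a parameter update is coming.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()  # stand-in for the app from vllm.entrypoints.openai.api_server

class UpdateWeightsRequest(BaseModel):
    name: str         # e.g. "model.layers.0.mlp.up_proj.weight"
    dtype: str        # e.g. "torch.bfloat16"
    shape: list[int]  # tensor shape to receive over the communicator

@app.post("/update_named_param/")
async def update_named_param(request: UpdateWeightsRequest):
    # Real implementation: receive the broadcast tensor from the trainer
    # process and load it into the model executor's weights.
    ...
```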

This enables training on significantly more complex rollouts than the standard synchronous .generate() endpoint can support. By supporting the OpenAI API interface, it also allows seamless integration with a wide range of existing agent frameworks and products.
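Because the server speaks the standard OpenAI protocol, any agent or client can target it unchanged. A minimal sketch, assuming the server runs locally on port 8000 (the host, port, and model name are placeholders):

```python
from openai import OpenAI

# Point the stock OpenAI client at the local vLLM server; the api_key is
# unused by a local server but required by the client constructor.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",  # placeholder model name
    messages=[{"role": "user", "content": "Fix the failing test in tests/test_parser.py"}],
)
print(response.choices[0].message.content)
```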

This direction is a step toward reproducing pipelines like OpenHands LM 32B. I strongly suspect that Claude 3.7 Sonnet was trained in a similar fashion, iteratively reinforced using rollouts generated through its own Claude Code scaffolding.

Motivation

Currently, TRL only supports synchronous, batched .generate() calls for inference. This restricts the kinds of rollouts that can be generated, especially in domains that benefit from multi-step approaches, tool use, or environment interaction.

I've been using TRL for my Master's thesis on reinforcement learning for language models in the program-repair domain. In several GRPO experiments, I repeatedly hit the same limitation: with .generate(), all context construction, planning, and feedback extraction must happen within a single call. For example, in tasks from SWE-Gym, the model must generate code edits for real repositories. To do that in one .generate() call, the user has to manually construct the relevant repo context up front and later parse outputs such as diffs to extract reward signals. This makes experimentation slow and always feels like reinventing the wheel.

Rather than building ad-hoc scaffolding from scratch, I began exploring how to integrate existing coding agents like Aider directly into the training loop. These agents already support rich workflows such as repo mapping, diff parsing, and iterative interaction—and they use the OpenAI API interface. Enabling TRL to train models through this interface would allow us to run them in situ, inside the same environment they’re meant to be deployed in.

This proposal aims to bridge that gap and enable more realistic, multi-step training workflows as a first-class feature in TRL.

Your contribution

I have an initial working implementation in PR #3285, which introduces vllm_serve_openai_compatible.py.

I intend to tie up the remaining loose ends and test the approach properly, both for functional correctness and for throughput.

The draft PR also includes a few project-specific utilities (WIP) to illustrate how this can be used in practice. For example, it shows how to parallelize existing Aider instances that interact with this server to generate training data.
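As a rough illustration of what that utility looks like, here is a sketch (not the PR's code) that fans out one non-interactive Aider process per task against the local server. The repo paths and task strings are made up, and the Aider CLI flags (--message, --openai-api-base, --yes) should be checked against your Aider version.

```python
# Rough sketch of the parallelization utility: run several Aider instances
# against the local OpenAI-compatible vLLM server at once, one per task.
import subprocess
from concurrent.futures import ThreadPoolExecutor

SERVER = "http://localhost:8000/v1"  # the OpenAI-compatible vLLM server

def run_rollout(repo_dir: str, task: str) -> int:
    """Run one non-interactive Aider session against the local server."""
    return subprocess.run(
        ["aider", "--message", task, "--openai-api-base", SERVER, "--yes"],
        cwd=repo_dir,
    ).returncode

tasks = [
    ("repos/task_0", "Fix the failing test in tests/test_parser.py"),
    ("repos/task_1", "Resolve the TypeError raised in utils/io.py"),
]

# One Aider process per task; the server batches the concurrent requests.
with ThreadPoolExecutor(max_workers=8) as pool:
    exit_codes = list(pool.map(lambda args: run_rollout(*args), tasks))
```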

One open issue is how to reliably access the full conversation history for each rollout. Since the API calls happen internally within the agent, we cannot assume access to a .get_conversation_history() method or similar. One possible approach is to record all requests and responses server-side and map them back to the original prompt to reconstruct the complete rollouts to train on.
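One possible shape for that server-side recording, sketched below under the assumption that the agent keeps the original task prompt at the head of every request it makes. The RolloutRecorder class and its keying scheme are hypothetical:

```python
from collections import defaultdict

class RolloutRecorder:
    """Hypothetical server-side log of chat-completion calls, keyed by the
    original task prompt, used to stitch requests back into full rollouts."""

    def __init__(self):
        self._log = defaultdict(list)  # prompt -> [(messages, completion), ...]

    def record(self, messages: list[dict], completion: str) -> None:
        # Key on the first user message; this assumes the agent keeps the
        # original task prompt at the head of every request it makes.
        key = next(m["content"] for m in messages if m["role"] == "user")
        self._log[key].append((messages, completion))

    def reconstruct(self, prompt: str) -> list[dict]:
        calls = self._log[prompt]
        if not calls:
            raise KeyError(f"no recorded calls for prompt: {prompt!r}")
        # The last request in a rollout carries the longest message history;
        # appending its completion yields the full conversation to train on.
        messages, completion = max(calls, key=lambda c: len(c[0]))
        return messages + [{"role": "assistant", "content": completion}]
```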

I’d be happy to align the implementation with TRL’s design goals and iterate toward something mergeable.
