
Is it possible to make prompts dynamic (or iterable datasets) in GRPO training? #3474


Open
onlyjokers opened this issue May 21, 2025 · 1 comment
Labels
✨ enhancement New feature or request 🏋 GRPO Related to GRPO

Comments

@onlyjokers

Feature request

I'd like to be able to dynamically adjust the next prompt to be fed to the policy model based on the completion it generates.

Motivation

I would like to enhance the model's ability to generate GLSL code through GRPO.

I do this by adding a new reward function that tries to execute the GLSL code generated by the model; if it runs correctly, the reward also checks whether the image rendered by the GLSL matches what the prompt requested.
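The reward described above could be sketched roughly as follows. This is a minimal sketch assuming TRL's `GRPOTrainer` reward-function interface (a callable receiving `prompts` and `completions` and returning one float per completion); the `try_run_glsl` and `image_matches_prompt` helpers are hypothetical stubs standing in for a real shader compiler and image comparison.

```python
# Hypothetical sketch of a GLSL reward function. The two helpers below are
# placeholders: a real version would compile/run the shader and compare the
# rendered image against the prompt's request.

def try_run_glsl(code: str) -> bool:
    """Stub: pretend the shader runs iff it defines main()."""
    return "void main" in code

def image_matches_prompt(code: str, prompt: str) -> bool:
    """Stub: a real check would render the shader and compare images."""
    return True  # placeholder

def glsl_reward(prompts, completions, **kwargs):
    """One float per completion, in the shape GRPOTrainer expects."""
    rewards = []
    for prompt, completion in zip(prompts, completions):
        if not try_run_glsl(completion):
            rewards.append(0.0)   # code failed to run
        elif image_matches_prompt(completion, prompt):
            rewards.append(1.0)   # runs and matches the request
        else:
            rewards.append(0.5)   # runs but renders the wrong image
    return rewards
```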

I observed that the model did improve at first. However, as the difficulty of the prompts increased, almost all of the model's completions were wrong, so no completion in a group could gain a relative advantage. I would therefore like to dynamically decide how long the model should stay at a given difficulty stage.
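One way to express "stay at a stage until the policy gains an advantage" is a curriculum-driven prompt source. The sketch below is a hypothetical illustration, not TRL API: an iterable that yields prompts from the current stage and only advances once the mean of recent rewards clears a threshold (the `record_reward` hook would be called from the reward function).

```python
# Hypothetical curriculum sketch: prompts advance to the next difficulty
# stage only after recent rewards average above a threshold, so the policy
# keeps training on a stage where almost everything fails.
from collections import deque

class CurriculumPrompts:
    def __init__(self, stages, threshold=0.5, window=64):
        self.stages = stages          # list of prompt lists, easy -> hard
        self.stage = 0                # index of the current difficulty stage
        self.threshold = threshold    # mean reward needed to advance
        self.recent = deque(maxlen=window)

    def record_reward(self, reward: float):
        """Call from the reward function after scoring each completion."""
        self.recent.append(reward)
        full = len(self.recent) == self.recent.maxlen
        if full and sum(self.recent) / len(self.recent) >= self.threshold:
            if self.stage < len(self.stages) - 1:
                self.stage += 1       # model cleared this stage: move on
                self.recent.clear()

    def __iter__(self):
        i = 0
        while True:                   # cycle prompts of the current stage
            prompts = self.stages[self.stage]
            yield {"prompt": prompts[i % len(prompts)]}
            i += 1
```

In practice this could be wrapped in a streaming/iterable dataset so the trainer pulls the next prompt lazily, which is what makes the prompt stream reactive to the completions.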

Your contribution

I'm currently trying to build a version that supports this feature.

@github-actions github-actions bot added 🏋 GRPO Related to GRPO ✨ enhancement New feature or request labels May 21, 2025
@qgallouedec
Member

This could be relevant: #3226
