Feature request
I'd like to be able to dynamically choose the next prompt fed to the policy model based on the completions it generates for the current one.
Motivation
I would like to enhance the model's ability to generate GLSL code through GRPO.
I do this by adding a new reward function that tries to execute the GLSL code generated by the model and, if it runs correctly, checks whether the image rendered by the GLSL matches the one requested in the prompt.
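To make the two-stage check concrete, here is a minimal sketch of such a reward function. It is not the actual implementation: the `render` callable (which would compile and run the shader), the flattened-pixel image representation, and the `run_bonus` and `tolerance` values are all hypothetical placeholders.

```python
from typing import Callable, Optional, Sequence

# Assumption: an image is a flat sequence of pixel values in [0, 1].
Image = Sequence[float]


def glsl_reward(
    code: str,
    target: Image,
    render: Callable[[str], Optional[Image]],  # hypothetical: returns None if the shader fails
    run_bonus: float = 0.2,   # assumed partial reward for code that runs but draws the wrong image
    tolerance: float = 0.05,  # assumed maximum mean absolute pixel difference for a match
) -> float:
    """Return 0.0 if the shader does not run, run_bonus if it runs, 1.0 if the image matches."""
    try:
        image = render(code)
    except Exception:
        return 0.0  # compile or runtime failure
    if image is None:
        return 0.0
    if not target or len(image) != len(target):
        return run_bonus  # it ran, but the output cannot match the target
    mean_abs_diff = sum(abs(a - b) for a, b in zip(image, target)) / len(target)
    return 1.0 if mean_abs_diff <= tolerance else run_bonus
```

The partial reward for code that merely runs is a design choice rather than something required: it gives the policy a small signal even when the rendered image is wrong.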
I observed that the model did improve at first. However, as the prompts became more difficult, almost all of its completions were wrong. Because GRPO computes advantages relative to the other completions in the same group, a group in which nearly every completion fails yields almost no relative advantage, so the policy model stops learning. I would therefore like to be able to decide dynamically how much longer the model needs to stay at a given difficulty stage before moving on.
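The problem and one possible remedy can be sketched as follows. The advantage formula below is a common GRPO formulation (reward minus group mean, divided by group standard deviation plus a small epsilon), not necessarily the exact one used by any particular trainer. `StageController`, its `window` and `advance_at` parameters, and the idea of ordered difficulty stages are hypothetical names introduced only for illustration.

```python
from collections import deque
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-4):
    """Group-relative advantages: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


class StageController:
    """Hypothetical curriculum controller.

    Stays at the current difficulty stage until the mean reward over the
    last `window` groups reaches `advance_at`, then moves to the next stage.
    """

    def __init__(self, num_stages, window=4, advance_at=0.7):
        self.stage = 0
        self.num_stages = num_stages
        self.advance_at = advance_at
        self.recent = deque(maxlen=window)

    def update(self, rewards):
        """Record one group's rewards and return the stage to sample from next."""
        self.recent.append(mean(rewards))
        window_full = len(self.recent) == self.recent.maxlen
        can_advance = self.stage < self.num_stages - 1
        if window_full and can_advance and mean(self.recent) >= self.advance_at:
            self.stage += 1
            self.recent.clear()  # start a fresh window for the new stage
        return self.stage


# When every completion in a group fails, all advantages are zero.
print(group_advantages([0.0, 0.0, 0.0, 0.0]))  # [0.0, 0.0, 0.0, 0.0]
```

In this sketch the training loop would call `update` after scoring each group and draw the next prompt from the returned stage. That feedback path from completions to prompt selection is what this feature request asks for.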
Your contribution
I'm currently working on a version that supports this feature.