dyyyyyyyy/FAPO-GenRM-4B
Text Generation • 4B
FAPO: Flawed-Aware Policy Optimization for Efficient and Reliable Reasoning. Project Page: https://fapo-rl.github.io/
Note: 4B Generative Reward Model for FAPO Reinforcement Learning.
Note: 32B FAPO Reasoning Model Trained with Generative Reward.
Note: Training and Evaluation Dataset for FAPO-GenRM-4B (Generative Reward Model).
Note: Training and Evaluation Dataset for FAPO-32B (Reasoning Model).