Our reward design combines correctness, preference, and efficiency. Preference only counts when the...

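- The reward design combines correctness, preference, and efficiency.
- Preference counts toward the score only when the answer is correct.
- This keeps the model from optimizing for answers that sound better but are wrong.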

Our reward design combines correctness, preference, and efficiency. Preference only counts when the answer is correct. This keeps the model from optimizing for better-sounding wrong answers. https://t.co/VbJ1M4o26w
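
The gating described in the post can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `combined_reward`, the `RewardWeights` container, the weight values, and the choice to add the terms linearly are all assumptions, and the post does not say whether the efficiency term is also gated on correctness (here it is left ungated).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    # Hypothetical weights; the post gives no numeric values.
    correctness: float = 1.0
    preference: float = 0.5
    efficiency: float = 0.1


def combined_reward(
    is_correct: bool,
    preference_score: float,
    efficiency_score: float,
    weights: RewardWeights = RewardWeights(),
) -> float:
    """Combine correctness, preference, and efficiency into one scalar.

    The preference term is included only when the answer is correct,
    so a wrong answer gains nothing from sounding better.
    """
    correctness_term = weights.correctness if is_correct else 0.0
    # Gate: preference contributes nothing for an incorrect answer.
    preference_term = weights.preference * preference_score if is_correct else 0.0
    # Assumption: efficiency is not gated; the post does not specify this.
    efficiency_term = weights.efficiency * efficiency_score
    return correctness_term + preference_term + efficiency_term
```

With this shape, two wrong answers that differ only in preference score receive the same reward, which is the property the post describes.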

© 2026 X Corp.