Direct Preference Optimization: Your Language Model Is Secretly a Reward Model