RLHF (Reinforcement Learning from Human Feedback)
RLHF is the training recipe that turned raw language models (good at predicting text) into aligned assistants (good at following instructions helpfully and safely). It was popularized by InstructGPT (…