5.2 Reinforcement Learning Fine-tuning

Although much of the industry had all but written off reinforcement learning as unsuitable for language models, many institutions and researchers continue to explore using it to fine-tune all or part of a language model's parameters. OpenAI is the most prominent example: ChatGPT fine-tunes its language model with Proximal Policy Optimization (PPO), a mature reinforcement learning algorithm proposed by OpenAI itself. To date, PPO remains the most widely used RL algorithm successfully applied to language models. Let's look at how this fine-tuning process can be described from a graphical perspective using RL algorithms.
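Before turning to the graphical description, a minimal sketch may help make the idea concrete. The snippet below shows the clipped surrogate loss at the heart of PPO, written over per-token log-probabilities as it is commonly applied when fine-tuning a language model. The function name, tensor shapes, and toy data are illustrative assumptions, not a description of OpenAI's actual training code.

```python
# Illustrative sketch: PPO clipped surrogate loss over per-token log-probs.
# Names and shapes are hypothetical; this is not any specific library's API.
import torch

def ppo_clip_loss(new_logprobs: torch.Tensor,
                  old_logprobs: torch.Tensor,
                  advantages: torch.Tensor,
                  clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped surrogate objective from PPO, averaged over generated tokens.

    new_logprobs: log pi_theta(a_t | s_t) under the model being fine-tuned
    old_logprobs: log pi_theta_old(a_t | s_t) under the policy that sampled the data
    advantages:   per-token advantage estimates (e.g. from GAE)
    """
    # Probability ratio between the updated policy and the sampling policy.
    ratio = torch.exp(new_logprobs - old_logprobs)
    # Unclipped and clipped surrogate terms; PPO takes the element-wise minimum.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Negate because optimizers minimize, while PPO maximizes the surrogate.
    return -torch.min(unclipped, clipped).mean()

# Toy usage: one sequence of 5 generated tokens.
new_lp = torch.randn(5, requires_grad=True)
old_lp = new_lp.detach() + 0.01 * torch.randn(5)
adv = torch.randn(5)
loss = ppo_clip_loss(new_lp, old_lp, adv)
loss.backward()  # gradients flow back into the language model parameters
```

The clipping keeps each update close to the policy that generated the samples, which is one reason PPO is favored for fine-tuning large language models, where aggressive policy updates can quickly degrade generation quality.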
