Q-Learning Challenges: Off-Policy RL Algorithm Struggles to Scale for Complex, Long-Horizon Problems

Q-learning is not yet scalable

Does reinforcement learning (RL) scale? Over the past few years, we've seen that next-token prediction scales, denoising diffusion scales, contrastive learning scales, and so on, to the point where we can train models with billions of parameters on scalable objectives that can eat up as much data as we can throw at them. So what about RL? Does it scale like all the other objectives? Apparently, it does. In 2016, RL achieved superhuman-level performance in games like Go and C...