Diffusion Beats Autoregressive in Data-Constrained Settings
TLDR:
If you are compute-constrained, use autoregressive models; if you are data-constrained, use diffusion models.
Motivation
Progress in AI over the past decade has largely been driven by scaling compute and data. The recipe from GPT-1 to GPT-5 has appeared straightforward: train a larger model on more data, and the result is a more capable system.
[Figure: scaling plot from the Chinchilla paper]
Yet a central question remains: will this recipe continue to hold as we scale from GPT-6 toward GPT-N?
Many analysts and researc...
Read more at blog.ml.cmu.edu