From Multi-Head to Latent Attention: The Evolution of Attention Mechanisms
What is attention?

In any autoregressive model, the prediction of future tokens is based on some preceding context. However, not all tokens within this context contribute equally to the prediction; some are more relevant than others. The attention mechanism addresses this by allowing the model to selectively concentrate on the important context words while generating each output word or token. Consider the popular example...
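As a rough illustration of the idea, here is a minimal sketch of standard scaled dot-product attention in NumPy. The function name, shapes, and toy data are illustrative assumptions, not taken from the article:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Sketch: weight each value by how relevant its key is to the query."""
    d_k = Q.shape[-1]
    # Similarity scores between every query and every key, scaled by sqrt(d_k)
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax over the keys turns scores into attention weights that sum to 1
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output is a weighted sum of the values, so the model "attends"
    # more to the context tokens with higher relevance
    return weights @ V

# Toy example (hypothetical): 3 context tokens with 4-dimensional embeddings
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 4)
```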
Read more at vinithavn.medium.com