News Score: Score the News, Sort the News, Rewrite the Headlines

Forcing Flash Attention onto a TPU and Learning the Hard Way · Archer Zhang

This is the fifth post in a series on LLM internals. Part 1 covered attention, Part 2 covered generation, Part 3 covered the Flash Attention algorithm, and Part 4 implemented it on a GPU with Triton. This post takes the Triton kernel from Part 4 and ports it to a TPU. A lot of good learning came out of Part 4, but while working in Colab I couldn't help noticing that a TPU is offered in the free tier. I figured: what if I just take Part 4's flash attention and port it to a TPU? I kno...

Read more at archerzhang.me
