News Score: Score the News, Sort the News, Rewrite the Headlines

Forcing Flash Attention onto a TPU and Learning the Hard Way · Archer Zhang

This is the fifth post in a series on LLM internals. Part 1 covered attention, Part 2 covered generation, Part 3 covered the Flash Attention algorithm, and Part 4 implemented it on a GPU with Triton. This post takes the Triton kernel from Part 4 and ports it to a TPU. A lot of good learning came out of Part 4, but while working in Colab I couldn't help noticing that a TPU is offered in the free tier. I figured: what if I just take Part 4's flash attention and port it to a TPU? I kno...

Read more at archerzhang.me
