Connor Davis explores transformer circuits and mechanistic interpretability to understand AI language models mathematically; aims to prevent harmful AI behavior through reverse-engineering neural networks.

Intuitions for Transformer Circuits

In a previous post on language modeling, I implemented a GPT-style transformer. Lately I’ve been learning mechanistic interpretability to go deeper and understand why the transformer works on a mathematical level.

This post is a brain dump of what I’ve learned so far after reading A Mathematical Framework for Transformer Circuits (herein: “Framework”) and working through the Intro to Mech Interp section on ARENA. My goal is to describe my current intuition for the paper, especially parts I was co...