Emergent Introspective Awareness in Large Language Models
Published October 29th, 2025
We investigate whether large language models can introspect on their internal states. It is difficult to answer this question through conversation alone, as genuine introspection cannot be distinguished from confabulations. Here, we address this challenge by injecting representations of known concepts into a model’s activations, and measuring the influence of these manipulations on the model’s self-reported states. We find that models can, in ce...
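The injection technique described above can be illustrated with a toy sketch. This is not the paper's implementation: the concept vector derivation (difference of means over contrasting prompts), the scale `alpha`, and the dimensionality are all illustrative assumptions; the general approach is sometimes called activation steering.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # toy hidden-state dimensionality (assumption)

# Hypothetical concept vector: difference of mean activations between
# prompts that evoke a concept and neutral prompts (one common recipe
# for deriving steering directions; an assumption, not the paper's exact method).
concept_acts = rng.normal(0.0, 1.0, size=(32, d)) + 2.0 * np.eye(d)[0]
baseline_acts = rng.normal(0.0, 1.0, size=(32, d))
concept_vec = concept_acts.mean(axis=0) - baseline_acts.mean(axis=0)

def inject(hidden: np.ndarray, vec: np.ndarray, alpha: float = 4.0) -> np.ndarray:
    """Add a scaled concept direction to a hidden state at some layer."""
    return hidden + alpha * vec / np.linalg.norm(vec)

h = rng.normal(0.0, 1.0, size=d)   # an activation from one layer (toy)
h_steered = inject(h, concept_vec)

# Quantify the manipulation: projection of the activation onto the concept
# direction before and after injection.
unit = concept_vec / np.linalg.norm(concept_vec)
proj_before = h @ unit
proj_after = h_steered @ unit
```

After injection, `proj_after` exceeds `proj_before` by exactly `alpha`, since the added term is a unit vector in the concept direction scaled by `alpha`. In the actual experiments, the measured quantity is instead the model's self-reported state under the manipulation.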
Read more at transformer-circuits.pub