Mechanical Interpretability

We introduce OthelloGPT, a GPT model trained on Othello, to understand how LLMs learn internal representations. Despite simple training, it develops structured gameplay understanding—early layers detect board patterns, while deeper layers track dynamic moves. Using Sparse Autoencoders, we decode strategic features like tile stability, offering insights into LLM learning. This framework helps analyze representations in transformers and LLMs.

Jan 13, 2025