
How I got started in mechanistic interpretability

by Qinyuan Ye, May 25, 2026.


TL;DR: The reason was quite random but the journey turned out to be rewarding.


Part I: Stories

How it started (Apr-Sep 2024)

In April 2024, I was told that my thesis proposal didn’t quite meet the bar, and I was required to write a new work that had sufficient intellectual merit, technical depth, and was strongly connected to my thesis topic, “cross-task generalization abilities of language models.” At the end of my fifth year, I was terrified.

What would that be? On top of the requirements above, I also wanted to work on something I was genuinely curious about, since I believe my best work was produced when I was curious, and I was worried that I wouldn’t have the luxury to do it anymore after the PhD program. It was a blessing and a curse at the same time—curiosity and inspiration don’t come with a fixed schedule.

I kept searching hard. One day I came across the topic of mechanistic interpretability in a seminar course, where two papers were featured. I read the IOI circuit paper (1), which I found intriguing—reading it was like reading a detective novel, where I was guided to put pieces together. The second paper (2), discussing how the IOI circuit was reused in other tasks, also encouraged me to work in this direction because it seemed like a good alignment with my thesis topic. It led me to the idea of applying interpretability techniques to understand cross-task generalization in language models.

I had a hidden agenda as well. I decided not to go for academia after my thesis proposal. That meant I had a lot of technical job interviews ahead of me, and I guessed many of them would be about transformers. Preparing myself for those interviews via memorization would be boring, so I decided to try something different—I wanted to learn transformers by heart (and to nail those interviews) by interpreting them.

All of these were good reasons, but there was also one important downside—I knew little about mechanistic interpretability. I was scared of the math and the level of detail in interpretability papers, and I had no idea what this project would lead to. With all of the uncertainties, this winding journey began.

Finding the right problem (Sep-Dec 2024)

I was the only one in my lab interested in mechanistic interpretability, so I started by doing some solo random explorations. I made several pivots before landing on the final project, and here’s what happened.

First inspiration: Probing. One question I’ve been wanting to investigate for a long time was “how does post-training change model internals?” The first piece in my thesis (3) was on fine-tuning models for task generalization (now referred to as SFT in post-training), so this might create a nice consistency with my thesis.

One hypothesis I have is that since LMs can develop linguistic structures during pre-training (4, 5), perhaps post-training enables task generalization because the models develop (or strengthen) task structures during post-training.

To study this, I used the probing techniques in Tenney et al., 2019 (4). I was able to reproduce the findings (a staged pipeline of POS tagging → parsing → NER → semantic roles → coreference) on BERT and RoBERTa, but I wasn’t able to find similar trends in newer model families like OPT and Llama-3. Additionally, there were no clear patterns in how these structures changed between the base model and the post-trained model. After two months or so, I gave up on this thread.

Second inspiration: Compositionality. While the probing direction didn’t work out, it led me to think about finding “task structures” in language models, and my intuition suggested the structures may be compositional. The IOI circuit paper also introduced the nice idea to interpret models with pairs of contrast tasks (IOI vs ABC), so creating contrast task pairs to isolate such structures seemed to be a promising direction. Some structures that I thought about at that time include:

I played with these tasks with an interactive tool called Information Flow Route (7), where I was able to get some intuitions. For example, when using Llama-3.1-8B, the important edges in the computation graph for the base prompt “5+4=9, 2+3=5, 8+3=11” were always before layer 16. For the counterfactual prompt “5+4=19, 2+3=15, 8+3=21” I saw some important edges going from “=” to “19” at layer 24, and from “19” to the next “=” at layer 25. This suggests that the model is indeed processing the task with two sequential steps.

Final project: Off-by-k addition. It was probably too ambitious to cover all the structures I was thinking about in one paper. The finding above on modified arithmetic looked interesting. Additionally, this arithmetic task offered a lot of controllability—in many models, numbers are always single-token, and I could control the number of shots and the offset value. So, I decided to take a closer look.

I used offsets k=1,2,5,10 and three ~7B-size language models. Using the interactive tool mentioned earlier and its attribution method, I found out that a few attention heads in the late layers of the LM were responsible for the off-by-k addition behavior. More importantly, for the same model and different offsets, it was usually the same set of heads being responsible.
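To make the setup concrete, here is a minimal sketch of how base and off-by-k prompt pairs like the ones above could be generated. The function names, the comma-separated layout, and the query `6+1=` are assumptions for illustration; this is not the code used in the paper.

```python
import random


def format_demos(pairs, k=0):
    """Render (a, b) pairs as 'a+b=c' demonstrations, with every answer shifted by k."""
    return [f"{a}+{b}={a + b + k}" for a, b in pairs]


def make_contrast_pair(pairs, query, k):
    """Return (base_prompt, shifted_prompt) sharing the same demos and open query.

    The base prompt shows standard addition; the shifted prompt shows
    off-by-k addition. Both end with the same unanswered query.
    """
    qa, qb = query
    open_query = f"{qa}+{qb}="
    base = ", ".join(format_demos(pairs, 0) + [open_query])
    shifted = ", ".join(format_demos(pairs, k) + [open_query])
    return base, shifted


def sample_pairs(n_shots, seed=0, low=0, high=9):
    """Sample n_shots random operand pairs (hypothetical operand range)."""
    rng = random.Random(seed)
    return [(rng.randint(low, high), rng.randint(low, high)) for _ in range(n_shots)]


demos = [(5, 4), (2, 3), (8, 3)]
base, shifted = make_contrast_pair(demos, query=(6, 1), k=10)
print(base)     # 5+4=9, 2+3=5, 8+3=11, 6+1=
print(shifted)  # 5+4=19, 2+3=15, 8+3=21, 6+1=
```

Varying `k` and the number of demonstrations gives the two controls mentioned above: the offset value and the number of shots.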

Learning the tools (Dec 2024)

After reading the IOI paper, I told myself I wanted to write papers like that—interesting and with high scientific rigor. To do that for off-by-k addition, I needed to first familiarize myself with the tools in the paper.

I first spent one or two days refreshing my knowledge about transformers (using Andrej Karpathy’s NanoGPT tutorial). Then I spent a few weeks finishing two chapters of the Arena Tutorial (Intro to Mech Interp and Indirect Object Identification) while juggling my other responsibilities.

These tutorials are very carefully designed! I really liked that they were interactive: I got to have hands-on experience, do a lot of exercises to check my understanding, and have a lot of fun. All at my own pace. I felt prepared after doing all the exercises—I adapted what I learned to my problem, off-by-one addition, and indeed I found a circuit.

Turning it into a full paper (Jan-May 2025)

When the class read the IOI circuit paper together in September 2024, we discussed what the broader impact of a paper like this was, since its main contribution is explaining the mechanism of a very toy task. “It’s hard to justify for publication if all you do is identify a circuit, unless you’re the first paper to do that,” said someone from the class.

This set a very high bar for upcoming circuit-style interpretability papers. Having the observations on off-by-one addition and identifying the circuit was a huge relief for me: I knew this could become a paper (and thus I could graduate by the end of my sixth year!). But conceiving a paper was one thing; actually writing it and making it a good one was another challenge entirely.

While much more work was needed, at this point I had some confidence and knew that I wasn’t shooting in the dark. And this gave me the most fun part of the project, where I got to be the “detective.” Sometimes I solved mysteries, sometimes I validated or invalidated my guesses. In all cases, I was happy.

Generalization. Many interpretability works start from a small, concrete problem but later describe it as a case of something more general (e.g., modular addition circuit → investigating “grokking” (8), entity tracking circuit → investigating the effect of fine-tuning (9)). I tried to follow these examples and enrich my work from a broader perspective, and I tried to link the findings back to my thesis topic of cross-task generalization.

My hypothesis was that the +1 circuit I found was reused in many different tasks, and it enabled task-level generalization—language models being able to perform unseen tasks on the fly. To justify this, I needed to find task pairs that used this circuit. This became one of the most enjoyable parts of the project: I made several hypotheses, and my success rate was high—the language model was mostly working as I expected it to. The circuit was reused in weird tasks that I created like shifted MMLU and distantly related tasks like Caesar cipher. Moreover, I got inspiration from some of my earlier work on prompt engineering (10), and found that the circuit was reused in base-8 addition, though in an unintended way.

Parallelization. In the beginning of the project, my naive guess was that there would be one single attention head that wrote out the +1 function to the residual stream (and that would be all). I found nine heads instead, and it made me wonder why their effects were not cumulative, i.e., if nine heads said “+1”, why was the effect not “+9”? This became something that kept me awake at night.

This was finally answered with the function vector style analysis (11) that I did, where I found that the 9 heads sort of corresponded to 9 subspaces or bases, and the 9 heads were tuning the 9 knobs in parallel. This was the kind of aha moment where I made sense of everything, and I was amazed that transformers had developed such beautiful structures from their training. In this very toy task, I got to see the serial, parallel, and compositional structures in these artificial minds. Perhaps these basic structures are just the fundamental ingredients for the endless things that these models can do.
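Here is a toy picture of that "nine knobs" intuition. It is not the analysis from the paper: the dimensions, the coordinate-aligned subspaces, and the made-up "+1" vector are all assumptions chosen to make the arithmetic obvious. If each head writes only its own slice of the function, the contributions add up to one "+1", not nine.

```python
# Toy illustration only: all sizes and values below are made up.
N_HEADS = 9
DIM_PER_HEAD = 2  # hypothetical size of each head's subspace

# A hypothetical "+1" function vector in an 18-dimensional residual stream.
function_vector = [float(i + 1) for i in range(N_HEADS * DIM_PER_HEAD)]


def head_output(head):
    """Head `head` writes only the part of the function vector in its own subspace."""
    start, end = head * DIM_PER_HEAD, (head + 1) * DIM_PER_HEAD
    return [x if start <= i < end else 0.0 for i, x in enumerate(function_vector)]


def add(u, v):
    """Element-wise sum of two vectors."""
    return [a + b for a, b in zip(u, v)]


def dot(u, v):
    """Inner product of two vectors."""
    return sum(a * b for a, b in zip(u, v))


# Summing the nine heads' writes recovers the function vector exactly once.
total = [0.0] * len(function_vector)
for h in range(N_HEADS):
    total = add(total, head_output(h))

# Contrast case: if every head wrote the full vector, the sum would be 9x larger.
redundant = [N_HEADS * x for x in function_vector]

print(total == function_vector)               # True
print(dot(head_output(0), head_output(1)))    # 0.0
print(redundant == function_vector)           # False
```

In this picture, removing one head removes one slice of the function rather than one of nine redundant copies of it.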

The paper resulting from the exploration was accepted to ICLR 2026. I had some long but good fights with the reviewers amidst the OpenReview incident. I was quite excited at the news of its acceptance. It was a celebration of the work itself, and also a celebration of a new research mindset that I picked up. I re-learned many things throughout: how to get started on a new direction, how to manage uncertainty and pivot, and most importantly, how to do research at my own pace and enjoy it. I hope this time I learned these things right.


Part II: Reflections

How do I feel about doing mech interp?

My reason for starting to work on mechanistic interpretability was quite uncommon—I stumbled into this field. I’m not part of established interpretability groups or fellowship programs. My research direction is not safety-focused. The work was mostly carried out on my own. I want to summarize my experience, in case someone like me is thinking about entering the field.

What I like

What I find hard


Ideas for Future Work

My project left me with a bunch of new ideas. In my current job, I have to pursue some other research directions, but I think it might be helpful to write down these ideas and perhaps I will revisit them in my free time. I have grouped these ideas into two categories: further investigation of function induction as a short-term, direct follow-up, and automated interpretability as a broader, longer-term line of research.

Further investigation on function induction

Reuses of the circuit. In my paper, I found 4 task pairs where the +1 circuit is reused. There could be more, and it would be interesting to develop a scalable method for discovering them. There are two possible strategies to find them: (a) Task-level hypothesis testing: We can manually propose more task pairs, and examine whether the circuit is reused there. The proposal process could be augmented with an AI. (b) Sentence-level brute-force: We can enumerate sentences in a corpus and check whether the circuit is active on each of them. We can then summarize patterns from the highlighted sentences.

Expanding to more functions. In our work, the scope was limited to the function of +1. We found that the circuit generalizes to +k and letters. Could there be other functions? Do they reuse the same set of heads and circuit? In particular, can we link our findings to any real-world behaviors of the model that are concerning, e.g., models being sycophantic or repeating their mistakes from earlier in the context?

Pre-training dynamics. How does the circuit emerge from pre-training? Can we reproduce our results on open models (e.g., OLMo 3), study the emergence of the circuit, and trace it back to certain pre-training datapoints? I’m seeing a lot of work on pre-training dynamics and data attribution these days and thought this could be interesting to study. In particular, I’d love to see whether function induction heads evolve from standard induction heads during pre-training.

Automated/AI-assisted interpretability

One thing that I found hard in my project was discussing my findings alongside those of prior work, which was built on older models. It’s hard to have an apples-to-apples comparison. For example, I would love to know which heads are FV heads in Gemma-2 (9B), but the FV head paper (11) was originally done with Llama-2 (7B). I could have reproduced FV head experiments with Gemma-2 (9B) myself, but that would have required a significant amount of work.

Companies will keep releasing new models. Their new capabilities will inspire new interpretability efforts, but then it would be hard to consolidate new findings with old findings if they are not based on the same model. Based on this, I have a few proposals:

A Collaborative Wikipedia for Interpretability Findings. If someone else finds the FV heads in Gemma-2 (9B), I hope there is a structured and organized place for them to share the results, so that I don’t have to repeat it. LMs are like a huge puzzle, and all researchers are trying to put pieces together. What if we bring this puzzle online and make it a collaborative effort? Envisioning a future where agents conduct interpretability research autonomously, this Wikipedia can be seen as a shared “agent memory”.

Note: Existing cool efforts along this line—NeuronPedia; Attention Motif.

Automated Circuit Reproduction. Finding FV heads in Gemma-2 (9B) could be very suitable for coding agents. I wrote a research proposal about two years ago on building AI agents to reproduce AI research, because I think a prerequisite to getting AI agents to make new scientific discoveries is to reproduce old findings; plus reproduction could be more easily verified by humans. While working on this interp paper, I realized that reproducing older interpretability findings on newer models could be a well-suited application. This is because the circuit discovery procedure is very structured and repetitive, we have the ground truth on the old models for reference, and coding agents as of 2026 could be sufficiently powerful for this.

The biggest benefit of this approach is its massive scalability. For now, our knowledge of the (behavior, circuit, model) grid is limited to isolated points, but reproducibility agents would allow us to systematically explore and fill up that entire space.
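As a rough sketch of what "filling up the grid" could mean in practice, the snippet below tracks which (behavior, model) cells already have a documented circuit and lists the ones a reproduction agent would still need to cover. The data layout and the `missing_cells` helper are hypothetical; the single filled cell reuses the FV head example mentioned earlier.

```python
from itertools import product

# findings[(behavior, model)] -> free-form note about the circuit, if one is documented.
# Only the FV head example from this post is filled in; everything else is a placeholder.
findings = {
    ("FV heads", "Llama-2 (7B)"): "reported in prior work (11)",
}


def missing_cells(behaviors, models, findings):
    """Return every (behavior, model) cell that has no documented circuit yet."""
    return [cell for cell in product(behaviors, models) if cell not in findings]


behaviors = ["FV heads"]
models = ["Llama-2 (7B)", "Gemma-2 (9B)"]

todo = missing_cells(behaviors, models, findings)
print(todo)  # [('FV heads', 'Gemma-2 (9B)')]
```

Each entry in `todo` would be one reproduction job, and a finished job would write its result back into `findings`.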

Interpretability-as-a-service. Sometimes we are annoyed by unexpected behaviors of language models, e.g., hallucinating, preferring specific words (“goblin”). Sometimes we are simply curious about what models are thinking under the hood, e.g., indirect object identification, off-by-one addition. Interpretability is a powerful tool to study these behaviors case by case and on demand. For now, this is manually done by researchers; however, if we document our research well and accumulate sufficient “human demonstrations”, we can train agents to do it, and turn it into a “service” for non-interpretability researchers.

This could happen sooner than we think. The field already has high-quality educational materials and highly reproducible code. Plus, researchers sometimes live-stream their coding and paper-reading sessions—all of which would serve as excellent training data for these agents.

One day, with a single click, we could kick off an agent research team to study a neuron, a head, a circuit, or a specific phenomenon, and perhaps add a new puzzle piece to the canvas (the Wikipedia mentioned above).


If you’ve read this far, thank you! This research project was a really special experience for me, so I wrote a long blog post—to document some of the good times I had during my PhD and to remind myself why I wanted to do research in the first place.

The blog post jumps around a bit and covers a lot of ground—much like the research journey itself. If any of the ideas above sound interesting, I’d love to connect. If you’re thinking about getting started in mechanistic interpretability, I’d be happy to share what I know. I can be reached at qinyuany@usc.edu.