How I got started in mechanistic interpretability

by Qinyuan Ye, May 25, 2026.

Part I: Stories
Part II: Reflections
- How do I feel about doing mech interp?
  - What I like
  - What I find hard
- Ideas for future work
  - Further investigation on function induction
  - Automated/AI-assisted interpretability

TL;DR: The reason was quite random but the journey turned out to be rewarding.

Part I: Stories

How it started (Apr-Sep 2024)

In April 2024, I was told that my thesis proposal didn’t quite meet the bar, and I was required to produce a new piece of work that had sufficient intellectual merit and technical depth and was strongly connected to my thesis topic, “cross-task generalization abilities of language models.” At the end of my fifth year, I was terrified.

What would that be? As if meeting the requirements above weren’t hard enough, I also wanted to work on something I was genuinely curious about—I believe my best work was produced when I was curious, and I was worried that I wouldn’t have the luxury to do it anymore after my PhD. It was a blessing and a curse at the same time—curiosity and inspiration don’t come with a fixed schedule.

I kept searching hard. One day, I came across the topic of mechanistic interpretability in a seminar course, where two papers were featured. I read the IOI circuit paper (1), which I found intriguing. Reading it was like reading a detective novel, where I was guided to put pieces together. The second paper (2), discussing how the IOI circuit was reused in another task, further encouraged me to pursue this direction, as it appeared to align well with my thesis topic. It led me to the idea of applying interpretability techniques to achieve a deeper understanding of cross-task generalization in language models.

I had a hidden agenda as well. I decided not to go for academia after my thesis proposal. That meant I had a lot of technical job interviews ahead of me, and many of them would be about transformers. Preparing myself for those interviews via memorization would be boring, so I decided to try something different—I wanted to learn transformers by heart (and to nail those interviews) by interpreting them.

All of these were good reasons, but there was also one important downside—I knew little about mechanistic interpretability. I was intimidated by the mathematics and level of detail when I first read interpretability papers, and I had no idea where this project would lead. With so many uncertainties ahead, I began this winding journey, and it ultimately turned out to be a special and rewarding one.

Finding the right problem (Sep-Dec 2024)

I was the only one in my lab interested in mechanistic interpretability, so I started by doing some solo random explorations. I made several pivots before landing on the final project, and here’s what happened.

First inspiration: Probing. One question I’ve been wanting to investigate for a long time was “how does post-training change model internals?” The first piece in my thesis (3) was on fine-tuning models for task generalization (sometimes referred to as SFT in post-training), so studying this problem might create a nice consistency with my thesis.

One hypothesis I have is that since LMs can develop linguistic structures during pre-training (4, 5), perhaps post-training enables task generalization because the models develop (or strengthen) task structures during post-training.

To study this, I used the probing techniques in Tenney et al., 2019 (4). I was able to reproduce the findings (a staged pipeline of POS tagging → parsing → NER → semantic roles → coreference) on BERT and RoBERTa, but I wasn’t able to find similar trends in newer (decoder-only) model families like OPT and Llama-3. Additionally, there were no clear patterns on how these structures change when comparing the base and the post-trained model. After two months or so, I gave up on this thread.

Second inspiration: Compositionality. While the probing direction didn’t work out, it led me to think about finding “task structures” in language models, and my intuition suggested the structures may be compositional. The IOI circuit paper also introduced the nice idea to interpret models with pairs of contrast tasks (IOI vs ABC), so creating contrast task pairs to isolate such structures seemed to be a promising direction. Some structures that I thought about at that time include:

- Opposite tasks, e.g., “Which answer is correct?” vs. “is incorrect?”
- Parallel tasks, e.g., “Name three cities in the United States.” (6)
- Sequential tasks
  1. Multi-hop reasoning, e.g., “The president of South Korea was born in the city of”
  2. Modified arithmetic, e.g., “Add 10 to the final output. 5+4=19, 2+3=15, 8+3=21”

I played with these tasks with an interactive tool called Information Flow Route (7), where I was able to get some intuitions. For example, when using Llama-3.1-8B, the important edges in the computation graph for the base prompt “5+4=9, 2+3=5, 8+3=11” were always before layer 16. For the counterfactual prompt “5+4=19, 2+3=15, 8+3=21” I saw some important edges going from “=” to “19” at layer 24, and from “19” to the next “=” at layer 25. This hinted that the model was processing the task with two sequential steps.

Final project: Off-by-k addition. It was probably too ambitious to cover all the structures I was thinking about in one paper. The finding above on modified arithmetic looked interesting. Additionally, this arithmetic task offered a lot of controllability—in many models, numbers are always single-token, and I could control the number of shots and the offset value. So, I decided to take this one step further.
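To make the controllability concrete, here is a minimal sketch of how such prompts could be generated. The function names and the exact comma-separated format are my own assumptions for illustration (the format simply mirrors the example prompts quoted above), not the code used in the paper.

```python
from typing import List, Tuple


def off_by_k_prompt(pairs: List[Tuple[int, int]], k: int = 0) -> str:
    """Format few-shot addition examples whose answers are shifted by k.

    k = 0 gives the standard-addition base prompt; any other k gives the
    counterfactual off-by-k prompt. (Hypothetical helper for illustration.)
    """
    return ", ".join(f"{a}+{b}={a + b + k}" for a, b in pairs)


def with_query(pairs: List[Tuple[int, int]], query: Tuple[int, int], k: int = 0) -> str:
    """Append an unanswered query so that a model has to complete the sum."""
    a, b = query
    return f"{off_by_k_prompt(pairs, k)}, {a}+{b}="


# The two knobs mentioned above: the shots (how many, which ones) and the offset k.
shots = [(5, 4), (2, 3), (8, 3)]
base_prompt = off_by_k_prompt(shots)           # standard addition
shifted_prompt = off_by_k_prompt(shots, k=10)  # off-by-10 addition

print(base_prompt)
print(shifted_prompt)
print(with_query(shots, (6, 1), k=10))
```

Because the base and shifted prompts differ only in the answers, any difference in the model's internal computation between them can be attributed to the offset.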

I used offsets k=1,2,5,10 and three ~7B-size language models. Using the interactive tool mentioned earlier (7) and its attribution method, I found that a few attention heads in the late layers of the LM were responsible for the off-by-k addition behavior. More importantly, for the same model and different offsets, it was usually the same set of heads being responsible.

Learning the tools (Dec 2024)

After reading the IOI paper, I told myself I wanted to write papers like that—interesting and with high scientific rigor. To do that for off-by-k addition, I needed to first familiarize myself with the tools in the paper.

I first spent one or two days refreshing my knowledge about transformers (using Andrej Karpathy’s NanoGPT tutorial). Then I spent a few weeks finishing two chapters of the Arena Tutorial (Intro to Mech Interp and Indirect Object Identification) while juggling my other responsibilities.

These tutorials are very carefully designed! I really liked that they were interactive: I got to have hands-on experience, do a lot of exercises to check my understanding, and have a lot of fun. All at my own pace. I felt prepared after doing all the exercises—I adapted what I learned to my problem, off-by-one addition, and indeed I found a circuit.

Turning it into a full paper (Jan-May 2025)

When the class read the IOI circuit paper together in September 2024, we discussed what the broader impact of a paper like this was, since its main contribution is explaining the mechanism of a toy task. “It’s hard to justify for publication if all you do is identify a circuit, unless you’re the first paper to do that,” said someone from the class.

This set a very high bar for upcoming circuit-style interpretability papers. Having the observations on off-by-one addition and identifying the circuit was a huge relief for me—I knew this could become a paper, and that I might be able to graduate by my sixth year. But conceiving a paper was one thing; actually writing it and making it a good one was another challenge entirely.

While much more work was needed, at this point I had some confidence and knew that I wasn’t shooting in the dark. And this gave me the most fun part of the project, where I got to be the “detective.” Sometimes I solved mysteries, sometimes I validated or invalidated my guesses. In all cases, I was happy.

Generalization. Many interpretability works start from a small, concrete problem but later describe it as a case of something more general (e.g., modular addition circuit → investigating “grokking” (8), entity tracking circuit → investigating the effect of fine-tuning (9)). I tried to follow these examples and enrich my work from a broader perspective, and I tried to link the findings back to my thesis topic of cross-task generalization.

My hypothesis was that the +1 circuit I found was reused in many different tasks, and thus it enabled task-level generalization—language models being able to perform unseen tasks on the fly. To justify this, I needed to find task pairs that used this circuit. This became one of the most enjoyable parts of the project: I made several hypotheses, and my success rate was high—the language model was mostly working as I expected it to. The circuit was reused in weird tasks that I created like shifted MMLU and distantly related tasks like Caesar cipher. Moreover, I got inspiration from some of my earlier work on prompt engineering (10), and found that the circuit was reused in base-8 addition, though in an unintended way.

Parallelization. In the early phase of the project, my naive guess was that there would be one single attention head that wrote out the +1 function to the residual stream (and that would be all). I found nine heads instead, and it made me wonder why their effects were not cumulative, i.e., if nine heads said “+1”, why was the effect not “+9”? This became something that kept me awake at night.

This was finally answered with the function vector style analysis (11) that I did, where I found that the 9 heads sort of corresponded to 9 subspaces or bases, and they were effectively tuning 9 knobs in parallel. This was the kind of aha moment where I made sense of everything, and I was amazed that transformers had developed such beautiful structures from their training. In this very toy task, I got to see the serial, parallel, and compositional structures in these artificial minds. Perhaps these basic structures are the fundamental ingredients behind the amazing things these models can do.
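A toy illustration of why parallel heads need not add up to “+9” is sketched below. It is not the analysis from the paper: the nine axis-aligned directions and the unit-sized writes are assumptions chosen only to contrast the two pictures (every head pushing the same direction vs. each head turning its own knob).

```python
NUM_HEADS = 9


def add(u, v):
    """Element-wise sum of two vectors, standing in for writes to a shared stream."""
    return [x + y for x, y in zip(u, v)]


def basis(i, dim=NUM_HEADS):
    """Unit vector along axis i: the 'knob' owned by head i in this toy."""
    return [1.0 if j == i else 0.0 for j in range(dim)]


# Scenario A (my naive guess): every head writes the same "+1" direction,
# so the contributions stack along a single axis.
stacked = [0.0] * NUM_HEADS
for _ in range(NUM_HEADS):
    stacked = add(stacked, basis(0))

# Scenario B (toy version of the parallel picture): head i writes only along
# its own axis, so no single direction is amplified nine times.
parallel = [0.0] * NUM_HEADS
for i in range(NUM_HEADS):
    parallel = add(parallel, basis(i))

print(stacked)   # one coordinate carries all nine writes
print(parallel)  # nine coordinates, one write each
```

In scenario A the shared coordinate grows with the number of heads, while in scenario B each coordinate is set exactly once, which is the sense in which the heads behave like nine separate knobs rather than nine votes for “+1”.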

The paper resulting from the exploration was accepted to ICLR 2026. I had some long but good fights with the reviewers amidst the OpenReview incident. I was quite excited at the news of its acceptance. It was a celebration of the work itself, and also a celebration of a new research mindset that I picked up. I re-learned many things throughout: how to get started on a new direction, how to manage uncertainty and pivot, and most importantly, how to do research at my own pace and enjoy it. I hope this time I learned these things right.

Part II: Reflections

How do I feel about doing mech interp?

My path into mechanistic interpretability was somewhat unconventional—I stumbled into the field rather than being trained within established groups or fellowship programs. I came in already committed to research on generalization and have mostly viewed interpretability as a tool for pursuing that agenda, rather than as an end in itself. While I follow research in AI safety from time to time, my own work is not safety-focused. As a result, I might have entered the field with a different set of motivations than many interpretability researchers, and I want to share a few reflections from that perspective.

What I like

Beginner friendly. There are high-quality educational materials and open-source tools for mechanistic interpretability. With some initial investment and by working through the key parts of the tutorials, you can get to a point where you’re ready to start your first interpretability project.
Academic budget friendly. All experiments in my paper could be run on a single A6000 GPU. With the compute situation at my PhD lab, it’s not quite possible to post-train 7B models decently, but I can still interpret 7B-sized models within a reasonable timeline.
Intellectually satisfying. My project really kept me awake at night! I couldn’t sleep when my circuit was still incomplete. I kept making hypotheses. I kept wondering. I kept thinking. And when I found the answers to these questions, I was extremely satisfied. I got to experience the best part of research.
Many unknowns and opportunities. I sometimes think of mechanistic interpretability as a kind of powerful, general-purpose tool. It can often be used to better explain phenomena in one’s prior research direction, leading to a deeper understanding. If you’re missing a final piece in your thesis and struggling to graduate (as I was), it might be a good option to explore.
Reviewers. Several reviewers of our submission were very thoughtful. They held the work to a high standard and pushed me to use more precise language and think more deeply about the results. They also helped move the study forward by suggesting ways to identify a more complete circuit and better understand the underlying mechanism. I’m grateful for the academic training I received through this process.
Community. I presented the work at the Mech Interp Workshop at NeurIPS 2025. I liked that the workshop was not just about bringing people together and sharing their work; it was also about discussing how the field moves forward and what we should do as a community. To that end, the workshop featured lightning talks from different perspectives, as well as a talk on how to get your research funded. As a past workshop organizer myself, I’m in awe.

What I find hard

Hard to get a job. Full-time interpretability roles are rare in general, as compared to other roles that will create economic value more directly. Only a few companies can afford to start an interpretability team, and the bar is very high. I did my full-time job search around February to May in 2025, when the paper was still in the making, so I had little luck in getting an interpretability position. If securing a full-time job after your PhD is a concern, you may need to consider the opportunity cost of choosing one research topic over another. This may sound overly utilitarian, but I found job searching as a new grad to be far more challenging than I had anticipated—I really struggled to find a position.
Hard to write an interpretability paper for a general audience. My work was submitted multiple times (NeurIPS, Workshop, ICLR), and I found that reviewers can have widely diverging opinions. Reviewers from the interp community are more familiar with the language and terminology. Their concerns tend to be more technical, e.g., the cleanness of your circuit. They tend to be critical and often seriously challenge the rigor of the work (in a good way). Convincing reviewers with a general ML background feels different. To begin with, it’s hard to explain interpretability techniques within one page to these reviewers. Moreover, a general audience will likely expect practical takeaways that can immediately improve language models, i.e., “how is this relevant to me?” Personally, I think the field of mechanistic interpretability is still young and growing, and we should allow time and space for work without immediate impact. However, I also understand that the acceptance bar is high, so general ML reviewers may view this differently.

Ideas for future work

My project left me with a number of new ideas. In my current job, I have to prioritize other research directions, but I think it would be helpful to write down these ideas as an exercise and possibly revisit them in my free time. I have grouped these ideas into two categories: further investigation of function induction as a short-term, direct follow-up, and automated interpretability as a broader, longer-term line of research.

Further investigation on function induction

Reuses of the circuit. In my paper, I found 4 task pairs where the +1 circuit is reused. There could be more, and it would be interesting to develop a scalable method for discovering them. There are two possible strategies: (a) Task-level hypothesis testing: We can manually propose more task pairs and examine whether the circuit is reused in each of them. The proposal process could be augmented with an AI. (b) Sentence-level brute force: We can enumerate sentences in a corpus and check whether the circuit is in use on each one. We can then summarize patterns from the sentences that get flagged.

Expanding to more functions. In our work, the scope was limited to the function of +1. We found that the circuit generalizes to +k and letters. Could there be other functions? Do they reuse the same set of heads and circuit? In particular, can we link our findings to any real-world model behaviors that are concerning, such as sycophancy or the tendency to repeat mistakes from past context?

Pre-training dynamics. How does the circuit emerge from pre-training? Can we reproduce our results on open models (e.g., OLMo 3), study the emergence of the circuit, and trace it back to certain pre-training datapoints? I’m seeing a lot of work on pre-training dynamics and data attribution these days and thought this could be interesting to study. In particular, I’d love to see whether function induction heads evolve from standard (or backup) induction heads during pre-training.

Automated/AI-assisted interpretability

One thing that I found hard in my project was discussing my findings alongside those of prior work, which was built on older models. It’s hard to have an apples-to-apples comparison. For example, I would love to know which heads are FV heads in Gemma-2 (9B), but the FV head paper (11) was originally done with Llama-2 (7B). I could have reproduced the FV head experiments with Gemma-2 (9B) myself, but that would have required a significant amount of work.

Companies will keep releasing new models. Their new capabilities will inspire new interpretability efforts, but then it would be hard to consolidate new findings with old findings if they are not based on the same model. Given this, I have a few proposals:

A Collaborative Wikipedia for Interpretability Findings. If someone else finds the FV heads in Gemma-2 (9B), I hope there is an organized and accessible place for them to share the results, so that I don’t have to repeat it. LMs are like a huge puzzle, and all researchers are trying to put pieces together. What if we bring this puzzle online and make it a collaborative effort? If we envision a future where agents conduct interpretability research autonomously, this Wikipedia can be seen as a shared “agent memory”.

Note: Existing cool efforts along this line—NeuronPedia; Attention Motif.

Automated Circuit Reproduction. Finding FV heads in Gemma-2 (9B) could be very suitable for coding agents. I wrote a research proposal about two years ago on building AI agents to reproduce AI research, because I think a prerequisite to getting AI agents to make new scientific discoveries is to reproduce old findings; plus reproduction could be more easily verified by humans. While working on this interp paper, I realized that reproducing older interpretability findings on newer models could be a well-suited application. This is because the circuit discovery procedure is very structured and repetitive, we have the ground truth on the old models for reference, and coding agents as of 2026 could be sufficiently powerful for this.

The biggest benefit of this approach is its massive scalability. For now, our knowledge of the (behavior, circuit, model) grid is limited to isolated points, but reproducibility agents would allow us to systematically explore and fill up that entire space.
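The grid can be pictured as a simple lookup table whose empty cells form the work queue for reproduction agents. The sketch below is hypothetical: the record format and the helper name are my own, and the only entries are the two cells discussed in this post (FV heads documented for Llama-2 (7B), not yet for Gemma-2 (9B)).

```python
from itertools import product

# (behavior, model) -> note on where the circuit is documented.
# Only the cell mentioned in this post is filled in; everything else is unknown.
documented = {
    ("function vector heads", "Llama-2 (7B)"): "reported in (11)",
}

behaviors = ["function vector heads"]
models = ["Llama-2 (7B)", "Gemma-2 (9B)"]


def missing_cells(behaviors, models, documented):
    """List every (behavior, model) pair that has no documented circuit yet."""
    return [
        (behavior, model)
        for behavior, model in product(behaviors, models)
        if (behavior, model) not in documented
    ]


# Each missing cell is a candidate reproduction task for an agent.
todo = missing_cells(behaviors, models, documented)
print(todo)
```

Adding a new model or behavior to the lists immediately exposes the cells that still need to be filled, which is what makes the approach scale.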

Interpretability-as-a-service. Sometimes we are annoyed by unexpected behaviors of language models, e.g., hallucinating, emergent misalignment, preferring specific words (“goblin”). Sometimes we are simply curious about what models are thinking under the hood, e.g., indirect object identification, off-by-one addition. Interpretability is a powerful tool to study these behaviors case by case and on demand. For now, this is manually done by researchers; however, if we document our research well and accumulate sufficient “human demonstrations”, we can train agents to do it, and turn it into a “service” for non-interpretability researchers.

This could happen sooner than we think. The field already has high-quality educational materials and highly reproducible code. Plus, researchers sometimes live-stream their coding and paper-reading sessions—all of which would serve as excellent training data for these agents.

One day, with a single click, we could kick off an agent research team to study a neuron, a head, a circuit, or a specific phenomenon, and perhaps add a new puzzle piece to the canvas (the Wikipedia mentioned above).

If you’ve read this far, thank you! This research project was a really special experience for me, so I wrote a long blog post—to document some of the good times I had during my PhD and to remind myself why I signed up to do research in the first place.

The blog post jumps around a bit and covers a lot of ground—much like the research journey itself. If any of the ideas above sound interesting, I’d love to connect. If there is anything that I could have done differently, I’d love to hear your feedback. If you’re thinking about getting started in mechanistic interpretability, I’d be happy to share what I know. I can be reached at qinyuany@usc.edu.