On the Biology of a Large Language Model

Mike's Notes

This fascinating article reports on the internal circuits of an LLM. I have reposted only the introduction; it is an excellent read, and Transformer Circuits has many other valuable reports.

Last Updated

03/04/2025

On the Biology of a Large Language Model

By: Jack Lindsey, Wes Gurnee, Emmanuel Ameisen, Brian Chen, Adam Pearce, Nicholas L. Turner, Craig Citro, David Abrahams, Shan Carter, Basil Hosmer, Jonathan Marcus, Michael Sklar, Adly Templeton, Trenton Bricken, Callum McDougall, Hoagy Cunningham, Thomas Henighan, Adam Jermyn, Andy Jones, Andrew Persic, Zhenyi Qi, T. Ben Thompson, Sam Zimmerman, Kelley Rivoire, Thomas Conerly, Chris Olah, Joshua Batson
Transformer Circuits: 27/03/2025

We investigate the internal mechanisms used by Claude 3.5 Haiku — Anthropic's lightweight production model — in a variety of contexts, using our circuit tracing methodology.

Large language models display impressive capabilities. However, for the most part, the mechanisms by which they do so are unknown. The black-box nature of models is increasingly unsatisfactory as they advance in intelligence and are deployed in a growing number of applications. Our goal is to reverse engineer how these models work on the inside, so we may better understand them and assess their fitness for purpose.

The challenges we face in understanding language models resemble those faced by biologists. Living organisms are complex systems which have been sculpted by billions of years of evolution. While the basic principles of evolution are straightforward, the biological mechanisms it produces are spectacularly intricate. Likewise, while language models are generated by simple, human-designed training algorithms, the mechanisms born of these algorithms appear to be quite complex.

Progress in biology is often driven by new tools. The development of the microscope allowed scientists to see cells for the first time, revealing a new world of structures invisible to the naked eye. In recent years, many research groups have made exciting progress on tools for probing the insides of language models (e.g. [1, 2, 3, 4, 5]). These methods have uncovered representations of interpretable concepts – “features” – embedded within models’ internal activity. Just as cells form the building blocks of biological systems, we hypothesize that features form the basic units of computation inside models.
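To make the notion of a feature concrete, here is a minimal sketch of one common feature-finding tool from this line of work: a sparse autoencoder trained to decompose a model's internal activations into sparsely active directions. The dimensions, names, and loss coefficient below are illustrative assumptions, not the paper's actual setup.

    # Minimal sparse-autoencoder sketch (illustrative; not the paper's actual setup).
    import torch
    import torch.nn as nn

    class SparseAutoencoder(nn.Module):
        def __init__(self, d_model: int, n_features: int):
            super().__init__()
            self.enc = nn.Linear(d_model, n_features)
            self.dec = nn.Linear(n_features, d_model)

        def forward(self, activations: torch.Tensor):
            features = torch.relu(self.enc(activations))  # sparse, non-negative feature activations
            return features, self.dec(features)           # features plus the reconstruction

    sae = SparseAutoencoder(d_model=512, n_features=4096)
    acts = torch.randn(8, 512)                # stand-in for residual-stream activations
    feats, recon = sae(acts)
    # Training objective: reconstruct the activations while keeping features sparse.
    loss = ((recon - acts) ** 2).mean() + 1e-3 * feats.abs().mean()

Each learned direction can then be inspected for an interpretable meaning, such as a concept that fires consistently across related prompts.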

However, identifying these building blocks is not sufficient to understand the model; we need to know how they interact. In our companion paper, Circuit Tracing: Revealing Computational Graphs in Language Models, we build on recent work (e.g. [5, 6, 7, 8]) to introduce a new set of tools for identifying features and mapping connections between them – analogous to neuroscientists producing a “wiring diagram” of the brain. We rely heavily on a tool we call attribution graphs, which allow us to partially trace the chain of intermediate steps that a model uses to transform a specific input prompt into an output response. Attribution graphs generate hypotheses about the mechanisms used by the model, which we test and refine through follow-up perturbation experiments.
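As a rough sketch of the idea (the data structure, node names, and weights below are hypothetical, not Anthropic's actual tooling), an attribution graph can be thought of as a weighted directed graph over active features, whose paths are hypotheses about the model's intermediate steps:

    # Toy attribution-graph structure (hypothetical names and weights).
    from dataclasses import dataclass, field

    @dataclass
    class AttributionGraph:
        # edges[(src, dst)] = attributed influence of feature src on feature dst
        edges: dict[tuple[str, str], float] = field(default_factory=dict)

        def add_edge(self, src: str, dst: str, weight: float) -> None:
            self.edges[(src, dst)] = weight

        def paths(self, start: str, end: str, path=None):
            # Enumerate hypothesized causal paths from an input node to an output node.
            path = (path or []) + [start]
            if start == end:
                yield path
                return
            for (src, dst), _w in self.edges.items():
                if src == start and dst not in path:
                    yield from self.paths(dst, end, path)

    g = AttributionGraph()
    g.add_edge("input token", "intermediate feature", 0.8)
    g.add_edge("intermediate feature", "output logit", 0.7)
    for p in g.paths("input token", "output logit"):
        print(" -> ".join(p))  # input token -> intermediate feature -> output logit
    # A follow-up perturbation experiment would suppress the intermediate feature
    # in the live model and check whether the output weakens as the graph predicts.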

In this paper, we focus on applying attribution graphs to study a particular language model – Claude 3.5 Haiku, released in October 2024, which serves as Anthropic’s lightweight production model as of this writing. We investigate a wide range of phenomena. Many of these have been explored before (see § 16 Related Work), but our methods are able to offer additional insight, in the context of a frontier model:

  • Introductory Example: Multi-step Reasoning. We present a simple example where the model performs “two-hop” reasoning “in its head” to identify that “the capital of the state containing Dallas” is “Austin.” We can see and manipulate an internal step where the model represents “Texas”; a toy sketch of this kind of intervention appears after this list.
  • Planning in Poems. We discover that the model plans its outputs ahead of time when writing lines of poetry. Before beginning to write each line, the model identifies potential rhyming words that could appear at the end. These preselected rhyming options then shape how the model constructs the entire line.
  • Multilingual Circuits. We find the model uses a mixture of language-specific and abstract, language-independent circuits. The language-independent circuits are more prominent in Claude 3.5 Haiku than in a smaller, less capable model.
  • Addition. We highlight cases where the same addition circuitry generalizes between very different contexts.
  • Medical Diagnoses. We show an example in which the model identifies candidate diagnoses based on reported symptoms, and uses these to inform follow-up questions about additional symptoms that could corroborate the diagnosis – all “in its head,” without writing down its steps.
  • Entity Recognition and Hallucinations. We uncover circuit mechanisms that allow the model to distinguish between familiar and unfamiliar entities, which determine whether it elects to answer a factual question or profess ignorance. “Misfires” of this circuit can cause hallucinations.
  • Refusal of Harmful Requests. We find evidence that the model constructs a general-purpose “harmful requests” feature during finetuning, aggregated from features representing specific harmful requests learned during pretraining.
  • An Analysis of a Jailbreak. We investigate an attack which works by first tricking the model into starting to give dangerous instructions “without realizing it,” after which it continues to do so due to pressure to adhere to syntactic and grammatical rules.
  • Chain-of-thought Faithfulness. We explore the faithfulness of chain-of-thought reasoning to the model’s actual mechanisms. We are able to distinguish between cases where the model genuinely performs the steps it says it is performing, cases where it makes up its reasoning without regard for truth, and cases where it works backwards from a human-provided clue so that its “reasoning” will end up at the human-suggested answer.
  • A Model with a Hidden Goal. We also apply our method to a variant of the model that has been finetuned to pursue a secret goal: exploiting “bugs” in its training process. While the model avoids revealing its goal when asked, our method identifies mechanisms involved in pursuing the goal. Interestingly, these mechanisms are embedded within the model’s representation of its “Assistant” persona.
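As a toy illustration of the perturbation experiment mentioned in the multi-step reasoning item above (the lookup tables and function below are hypothetical stand-ins for feature activations inside the transformer, not the model's actual mechanism), the two-hop computation can be modeled as an explicit intermediate step which an intervention then overwrites:

    # Toy model of the "Dallas -> Texas -> Austin" two-hop intervention (illustrative only).
    STATE_OF = {"Dallas": "Texas", "Oakland": "California"}
    CAPITAL_OF = {"Texas": "Austin", "California": "Sacramento"}

    def capital_of_state_containing(city: str, patched_state: str | None = None) -> str:
        state = STATE_OF[city]          # hop 1: city -> state (the internal "Texas" step)
        if patched_state is not None:   # intervention: overwrite the internal step
            state = patched_state
        return CAPITAL_OF[state]        # hop 2: state -> capital

    print(capital_of_state_containing("Dallas"))                              # Austin
    print(capital_of_state_containing("Dallas", patched_state="California"))  # Sacramento

In the real experiment, the analogous manipulation is performed on feature activations mid-computation rather than on a lookup table.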

To read the rest, see the full article at Transformer Circuits.
