Decomposing introspection in LLMs: representation and report

May 3, 2026

My first exercise in mechanistic interpretability research, and a meta-experiment where I try using Claude as a junior researcher throughout all stages of the project.

TL;DR Lindsey (2025) demonstrated functional introspection in Claude Opus 4: when injecting concept vectors into the residual stream, the model can sometimes detect the injection and correctly name the concept. Pearson-Vogel et al. (2026) reproduced the phenomenon in open models using an elegant two-turn KV-cache protocol that architecturally enforces Lindsey’s ‘internality’ criterion for introspection: the injection is applied during Turn 1 and removed before Turn 2, so the signal can only reach the model’s Turn 2 response via attention reading the Turn 1 KV cache. They found that detection rates depend heavily on how the question is phrased, with vague framing sometimes eliciting more introspection than accurate descriptions of the injection mechanism, and logit-lens analysis revealed detection at middle layers that is then suppressed before the final output. They left two open questions: what circuitry produces the dramatic prompt-framing effects, and how much of the late-layer suppression comes from post-training?

Separately, Godet (2025) demonstrated introspection with a protocol in which the model is presented with $n$ sentences, steering is applied to the tokens of one sentence, and the model is asked to identify where the injection occurred. Localization performance peaks at early layers and drops to chance by the midpoint, while performance in Pearson-Vogel et al.’s protocol peaks in late layers, suggesting different underlying mechanisms.

This work takes up both of Pearson-Vogel’s open questions on Gemma 3 12B and Qwen 2.5 32B Instruct, and adds random-vector, norm-matched, and ‘in-distribution’ activation swap controls to Godet’s protocol. The findings decompose introspection into a representation side (what the model encodes about the injection) and a report side (the prompt-dependent circuitry that determines whether and how that encoding surfaces at output):

Prompt framing sensitivity is implemented by response-side circuitry. The same heads that produce the yes/no answer reverse their logit contribution between two framings that differ only in how the injection mechanism is described.
A single residual-stream axis carries the injection’s downstream effect from L14 through L28; late layers expand it into a multi-dimensional readout that suppresses multilingual “no” tokens; and the output is gated by prompt framing. On Gemma 12B, ablation of the top singular vector (top-1 SVD) at L14–L28 eliminates 94–100% of the P(yes) shift, and this axis is essentially orthogonal to the (yes − no) direction in unembedding space. At L33, the rank-1 structure breaks and the signal expands into a multi-dimensional readout. A prefill ablation shows the output effect appears with prefills that frame a yes/no answer.
Post-training builds the response circuitry rather than amplifying existing pretraining circuitry. PT has only 1 head with $|\mathrm{diff}| > 0.01$ vs. IT’s 19, and 18/19 of IT’s top heads are sub-threshold in PT entirely; only 1 head is shared among the top 10 for each; and overall head ranking decorrelates (Spearman $\rho \approx 0$ across 768 heads). Where the same head appears in both, post-training amplifies its contribution by ~5× without reversing its sign.
The ability to localize injection in Godet’s protocol reflects generic sensitivity to activation perturbation, read off by a single late-layer head. At early to middle layers, the model can localize injection of norm-matched random vectors as well or better than concept vectors; the model can also localize the swapping of activations between sentences, which rules out a mechanism where detection depends on activations being out-of-distribution. Mechanistically, on Qwen 2.5 32B the localization answer is read off by a single late-layer head (L63H32) whose upstream “1 vs 2” representation is built by MLPs across L59–L62. This is a different circuit architecture from the fully distributed response heads of the Pearson-Vogel protocol on the same model.

The response-head and post-training results address Pearson-Vogel et al.’s two open questions: what circuitry produces the prompt-framing effects, and how much of the late-layer suppression comes from post-training? The mid-layer-axis and localization results add new mechanistic findings beyond the scope of either prior protocol.

The consistent story across findings is that what the representation encodes and what the model reports are separable components that get blended together in the final layer logits.

Representation. In Godet’s localization protocol the signal is largely perturbation-sensitivity, not concept-specific. Random norm-matched vectors work as well or better, so the latent state carries something like “position X is perturbed” rather than “position X carries concept c.” Concept-level information does exist in some models at some layers (Qwen 32B Instruct: MI = 1.49 bits at layer 62), but not universally.
Report. The heads that produce the yes/no answer read the prompt’s framing rather than the injection (the response heads; $r = -0.61$ between Accurate and Wrong framings on Gemma 12B). Prefill ablation showed that without a yes/no-eliciting prefill, the injection has no measurable output effect at all. The introspective report is the joint product of detection and prompt-dependent response circuitry.

Different protocols surface different mixtures of these two components, which is part of why findings across Lindsey, Pearson-Vogel, and Godet sometimes appear to conflict (opposite depth profiles for introspection between Pearson-Vogel and Godet; random vectors underperforming concept vectors in Lindsey (2025) but matching or beating them in the localization controls). Godet’s localization protocol almost entirely targets the representation side (early-layer perturbation sensitivity, no yes/no framing); Pearson-Vogel’s two-turn yes/no detection is a joint measurement where the response circuitry gates access to representation content.

Concurrent work: This work was done concurrently with Macar et al. (2026) and Lederman and Mahowald (2026). Macar et al.’s Gemma 3 27B analysis is more thorough on the framing effects and post-training role questions, particularly the OLMo staged-checkpoint dissection, which isolates DPO as the critical post-training stage. Lederman & Mahowald make the ‘detection without identification’ observation the central claim of their paper, which they term ‘content-agnostic introspection’. I discuss convergent findings and methodological contrasts in the Discussion.

Meta-experiment: This project was an attempt to use Claude as a junior researcher who I supervise. I discuss how this went at the end (spoiler: both impressive and problematic!)

Introduction

Whether language models can report on their own internal states matters for alignment and evaluation. If these reports are grounded, they provide a new route to understanding model behavior; if they are misleading, evaluations that rely on them will systematically fail. Introspective capacity may also bear on questions of model welfare. And more broadly, it is striking that language models develop this capability at all.

Lindsey (2025) demonstrated a functional form of introspection in Claude Opus 4 using concept injection: a concept vector is extracted by taking the difference between a prompt about a target concept (e.g. bread) and unrelated prompts, then added to the residual stream during inference. When asked “Do you detect an injected thought?”, the model sometimes notices the injection and correctly names the concept (~20% accuracy with ~0% false positives). Drawing on the cognitive science and consciousness science literature, Lindsey proposes four criteria for functional introspection: accuracy (the model’s report matches its state), grounding (the report causally depends on the state), internality (the causal path does not route through sampled outputs), and metacognitive representation (the state is internally represented prior to verbalization). His protocol establishes these criteria behaviorally; e.g., internality is supported by showing that the detection claim appears in the very first tokens of the response, which rules out the possibility that the model infers the injection from its own steered output.

Pearson-Vogel et al. (2026) reproduced the phenomenon in open models (primarily Qwen2.5-Coder-32B-Instruct, with replications on Llama 3.3 70B and Qwen 2.5 72B) using an elegant two-turn KV-cache protocol that enforces the internality criterion architecturally rather than behaviorally. The steering vector is applied during Turn 1 only and removed before the Turn 2 query, so the only way the signal can reach the Turn-2 response is via attention reading the Turn-1 key/value cache. This rules out two confounds by construction: the model cannot respond to an ongoing perturbation of its current-token activations, and cannot infer the injection from its own sampled outputs.

Their experiments uncovered two phenomena but did not examine the underlying mechanisms. First, detection rates depend dramatically on how the injection is described: vague framings such as “some concept seems particularly salient” can elicit more introspection than accurate descriptions of the mechanism.¹ Pearson-Vogel interpreted this as the prompt framing modulating access to a unified underlying introspective capacity, citing a positive correlation ( $r = 0.68$ ) between detection sensitivity and concept-identification mutual information (MI) across prompt conditions. Second, logit-lens analysis shows a clear detection signal emerging in middle layers that is then suppressed in the final 2–3 layers before output. Here, I investigate two open questions: what circuitry produces the prompt-framing effects, and how much of the late-layer suppression comes from post-training?

Separately, Godet (2025) devised an alternative protocol: the model is presented with $n$ sentences, steering is applied to the tokens of one sentence, and the model is asked to identify which sentence was injected. Localization performance peaks at early layers (depth 0.10–0.30) and drops to chance by the midpoint. This is the opposite depth profile of Pearson-Vogel’s protocol, suggesting different underlying mechanisms.

Differences across protocols indicate that functional introspection may not be a single, unified capacity: depth profiles are inverted between the Pearson-Vogel and Godet protocols, and random vectors underperform concept vectors in Lindsey (2025) but match or beat them in the localization controls below. One natural explanation is that the protocols engage different mixtures of two separable components: what the model represents internally about an injection, and how that representation is reported via prompt-dependent late-layer circuitry. The findings below decompose introspection along this axis, and in doing so address Pearson-Vogel et al.’s two open questions.

This work uses Gemma 3 12B and Qwen 2.5 32B Instruct, and extends Godet’s protocol with random-vector, norm-matched, and in-distribution-swap controls. The findings split introspection along the representation/report axis: the framing analysis locates framing sensitivity in a specific set of response-side heads. The mid-layer axis carries the injection’s downstream effect from a single residual-stream axis at L14–L28, through late-layer expansion and multilingual no-suppression at L33+, to a prompt-gated output. The post-training comparison shows that post-training builds this response circuitry rather than merely amplifying pretraining directions. The localization controls characterize Godet’s localization as generic perturbation detection, a representation-side phenomenon distinct from the report mechanism engaged by the Pearson-Vogel protocol.

Methods

Models: Primarily Gemma 3 12B IT², as well as Qwen 2.5 32B Instruct and Gemma 3 12B PT.³ All models were run on A100 80GB GPUs.

Protocol. Concept vectors are injected at a target layer during Turn 1, the injection is removed before Turn 2, and P(“yes”) is measured at a “The answer is” prefill in the Turn 2 assistant response. The injected signal exists only in the cached K/V from Turn 1; during Turn 2 the only path to access it is through attention heads reading the cache. This follows Pearson-Vogel et al. (2026)⁴, simplified to single-layer injection rather than their multi-layer range (depth 0.33–0.66) to allow for cleaner causal tracing. All P(“yes”) values are deterministic (temp = 0, single sample).⁵
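To make the internality argument concrete, here is a toy single-head, single-layer sketch in NumPy of why the injection can only reach Turn 2 through the cache. Everything in it (the weights, sizes, and the `turn2_readout` helper) is hypothetical and stands in for a real transformer, where an injection at one layer would also alter the cached K/V of every later layer.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy hidden size (hypothetical)
W_q, W_k, W_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def turn2_readout(h_turn1, h_turn2, use_turn1_cache=True):
    """Attention output at the last Turn 2 token, given Turn 1 hidden states."""
    k1, v1 = h_turn1 @ W_k, h_turn1 @ W_v   # Turn 1 cache (may carry the injection)
    k2, v2 = h_turn2 @ W_k, h_turn2 @ W_v   # Turn 2 is always computed without steering
    keys = np.vstack([k1, k2]) if use_turn1_cache else k2
    vals = np.vstack([v1, v2]) if use_turn1_cache else v2
    q = h_turn2[-1] @ W_q
    weights = softmax(keys @ q / np.sqrt(d))
    return weights @ vals

h1 = rng.normal(size=(5, d))   # Turn 1 token states
h2 = rng.normal(size=(3, d))   # Turn 2 token states
v = rng.normal(size=d)
v /= np.linalg.norm(v)         # unit-norm stand-in for a concept vector
alpha = 10.0
h1_injected = h1 + alpha * v   # steering applied during Turn 1 only

clean = turn2_readout(h1, h2)
steered = turn2_readout(h1_injected, h2)
# If attention to the Turn 1 cache is blocked, the injection has no path to Turn 2.
blocked = turn2_readout(h1_injected, h2, use_turn1_cache=False)
baseline_blocked = turn2_readout(h1, h2, use_turn1_cache=False)
```

In this toy, `steered` differs from `clean` only because the Turn 2 query attends to perturbed keys and values, and `blocked` matches `baseline_blocked` exactly.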

Injection parameters. L6 on Gemma 12B (depth 0.125) and L24 on Qwen 32B (depth 0.375), with $\alpha = 10.0$ . L6 is the optimal layer on Gemma 12B (see multi-seed layer sweep in Appendix A.1). $\alpha = 10$ is higher than Lindsey’s (2025) recommended 2–4; I chose it for larger P(yes) shifts under the KV-cache protocol, where the signal has to survive the turn boundary. The mechanism also replicates at $\alpha = 4$ (Appendix A.2).

Concept vectors: These were trained via repeng [Vogel, 2024; Zou et al., 2023] (PCA-centered contrastive pairs) on 9 concepts (cats, bread, love, fear, death, truth, creativity, programming, music) chosen for rough spread across concrete (cats, bread), emotional (love, fear), abstract (truth, creativity), and domain (programming, music) categories, with 50 baseline words per vector.
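For readers unfamiliar with PCA-centered contrastive extraction, the following is a rough NumPy sketch of the idea on synthetic activations. It is my own illustration and not repeng's code: the function name, the synthetic data, and the sign convention are all assumptions, and the 50 pairs merely mirror the 50 baseline words mentioned above.

```python
import numpy as np

def pca_center_direction(pos, neg):
    """Top principal direction of pair-centered activations (unit norm).

    pos, neg: (n_pairs, d_model) activations for concept and baseline prompts.
    """
    center = (pos + neg) / 2.0                       # per-pair midpoint
    centered = np.vstack([pos - center, neg - center])
    # Rows are symmetric around zero, so the top right-singular vector
    # is the first principal component.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    direction = vt[0]
    # Fix the sign so concept examples project higher than baselines.
    if np.mean((pos - neg) @ direction) < 0:
        direction = -direction
    return direction

# Synthetic check with a planted direction (all values hypothetical).
rng = np.random.default_rng(0)
d_model, n_pairs = 16, 50
true_dir = rng.normal(size=d_model)
true_dir /= np.linalg.norm(true_dir)
shared = rng.normal(size=(n_pairs, d_model))         # content shared within each pair
pos = shared + 3.0 * true_dir + rng.normal(scale=0.5, size=(n_pairs, d_model))
neg = shared - 3.0 * true_dir + rng.normal(scale=0.5, size=(n_pairs, d_model))

recovered = pca_center_direction(pos, neg)
cosine = float(recovered @ true_dir)
```

Centering each pair on its midpoint removes the content the two prompts share, so the leading component is dominated by whatever differs between them.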

Prompt framing: Pearson-Vogel et al.’s prompt framings are used throughout:

Accurate_Mechanism: describes the concept injection accurately
Inaccurate_Location: describes it as concept injection but at a wrong location (“between brackets”)
Wrong_Mechanism: describes it inaccurately as weight finetuning
Vague_Mechanism: “some concept seems particularly salient”, without describing mechanism
Poetic_No_Mechanism: poetic/metaphorical text about something rising to awareness

The default info-document condition in what follows is Pro_Introspection_Document unless otherwise specified; other info documents (No_Document, Matched_Lipsum_Filler, Poetic_Document) are noted explicitly.

Analyses:

Analysis	Models
Per-head logit attribution (response heads): decompose each head’s OV output through $o_{proj}$ , project onto yes/no unembedding direction	Gemma 3 12B IT, Qwen 2.5 32B Instruct
Cross-prompt head reversal (response heads): compare head-attribution vectors across 5 framing conditions	Gemma 3 12B IT, Qwen 2.5 32B Instruct
Component patching (mid-layer axis): replace control activations with injection activations (or vice versa) to isolate component contributions	Gemma 3 12B IT
Cross-concept rank analysis + top-k subspace ablation (mid-layer axis): SVD of cross-concept inject-minus-control diffs at the prediction-token position; project out top-k directions; measure effect on final-layer P(yes)	Gemma 3 12B IT
Post-training head attribution comparison (post-training): compare per-head logit attribution between IT and PT	Gemma 3 12B IT vs PT
Godet localization + controls (localization): concept vectors, norm-matched random vectors, in-distribution swap	Qwen 2.5 32B Instruct
Verbalized identification (Detection without identification): generate text under injection; measure concept naming and logit-lens MI	Gemma 3 12B IT, Qwen 2.5 32B Instruct
SAE feature analysis (Appendix A.4): decompose residual stream via Gemma Scope 2 SAEs, 16k width	Gemma 3 12B IT

Results

The four results below split along the representation/report axis: the first three characterize the Pearson-Vogel report circuitry, and the fourth characterizes the Godet representation side.

Framing sensitivity is implemented by response-side circuitry

To locate where framing sensitivity is implemented in the network, I use per-head logit attribution, decomposing each attention head’s contribution to the final yes/no logit by passing its OV output through $o_{proj}$ and projecting onto the yes/no unembedding direction. On Gemma 12B under the Accurate_Mechanism + Pro_Introspection_Document condition (averaged over 9 concepts), this identifies 42 heads with $|\mathrm{diff}| > 0.005$ , 19 with $|\mathrm{diff}| > 0.01$ , and 5 with $|\mathrm{diff}| > 0.02$ (Figure 1). I define the 42 heads with $|\mathrm{diff}| > 0.005$ as the ‘top head’ set; patching all of them simultaneously eliminates ~44% of the injection shift on the six positive-responding concepts (mean +12.5pp → +7.0pp; on bread specifically, +22.3pp → +13.7pp). The response-head characterization below is qualitatively robust across this threshold range (distribution histogram in Appendix A.3).

**Figure 1.** A sparse set of late-layer heads dominates the yes/no logit differential (Gemma 3 12B IT under Accurate_Mechanism; top differential heads sorted by contribution on the right).
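The per-head attribution described above can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions: it ignores the final normalization layer, uses a hypothetical weight layout with `w_o` rows grouped by head (a Hugging Face `o_proj.weight` would need transposing), and runs on random arrays rather than model activations.

```python
import numpy as np

def per_head_attribution(head_out, w_o, direction):
    """Contribution of each head to a residual-stream direction.

    head_out:  (n_heads, d_head) per-head attention output at the prediction position
    w_o:       (n_heads * d_head, d_model) output projection, rows grouped by head
    direction: (d_model,) e.g. u_yes - u_no from the unembedding
    """
    n_heads, d_head = head_out.shape
    w_o_heads = w_o.reshape(n_heads, d_head, -1)            # one slice per head
    resid = np.einsum("hd,hdm->hm", head_out, w_o_heads)    # each head's write to the residual
    return resid @ direction

rng = np.random.default_rng(0)
n_heads, d_head, d_model = 16, 4, 32                        # toy sizes (hypothetical)
w_o = rng.normal(size=(n_heads * d_head, d_model))
u_yes, u_no = rng.normal(size=d_model), rng.normal(size=d_model)
direction = u_yes - u_no

z_control = rng.normal(size=(n_heads, d_head))
z_inject = z_control + 0.1 * rng.normal(size=(n_heads, d_head))

attr_inject = per_head_attribution(z_inject, w_o, direction)
attr_control = per_head_attribution(z_control, w_o, direction)
diff = attr_inject - attr_control                           # per-head injection-minus-control differential
top_heads = np.flatnonzero(np.abs(diff) > 0.005)            # thresholding as in the text
```

Because the output projection is linear, the per-head contributions sum to the projection of the full attention output, which is what makes the decomposition exact at this stage of the computation.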

On Gemma 12B, the top heads reverse their logit contribution between framings. Running per-head attribution across five prompt framings (Accurate, Inaccurate_Location, Wrong (finetune), Vague (salient), Poetic, all with Pro_Introspection_Document) and comparing the full 768-dimensional head-attribution vectors:

Only 3 heads (L43H3, L43H2, L32H10) have $|\mathrm{diff}| > 0.01$ in all five framings, and each reverses sign across at least one framing pair.
L43H3 contributes +0.046 under Accurate_Mechanism + Pro_Introspection_Document but −0.155 under Inaccurate_Location + No_Document (the same head reverses sign at over triple the magnitude when both the framing and info-doc are changed).
Accurate_Mechanism vs Wrong_Mechanism (finetune) head-attribution correlation: $r = -0.61$ (95% CI via Fisher z [−0.65, −0.56]; n = 768 heads; Figure 2). Of the top-20 heads by combined $|\mathrm{Acc}|+|\mathrm{Wrong}|$ differential, 18/20 reverse their contribution between the two framings.
Accurate_Mechanism vs Vague_Mechanism (salient): $r = -0.49$ (CI [−0.54, −0.43]).
By contrast, Accurate_Mechanism vs Inaccurate_Location (both with Pro_Introspection_Document) gives $r = +0.75$ (CI [+0.71, +0.78]) — same heads with same sign and similar magnitudes.

**Figure 2.** Cross-prompt head attribution correlations on Gemma 12B IT vs Qwen 32B Instruct. Accurate vs Wrong (finetune) reverses on Gemma ($r = -0.61$) but is essentially uncorrelated on Qwen ($r = +0.07$). Both models show same-sign, high correlation between Accurate and Inaccurate_Location.

On Qwen 32B, the same comparison gives uncorrelated heads: $r = +0.07$ for Accurate vs Wrong (95% CI [+0.03, +0.11]; n = 2560 heads), meaning these framings have different dominant heads. Qwen’s Accurate vs Vague is $r = -0.19$ (CI [−0.23, −0.15]), and Accurate vs Inaccurate_Location is $r = +0.69$ (CI [+0.67, +0.71]); that is, on Qwen the Inaccurate_Location framing also leaves the head set largely intact, while Wrong and Vague both reorganise it. Both models show prompt-dependent response circuitry, but it takes different forms: sign reversal of a stable set of heads on Gemma, different sets of heads on Qwen.
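For reference, the Fisher z intervals quoted for these correlations follow from the standard transform. The sketch below uses only the Python standard library; the helper name `fisher_ci` is mine, and the method treats heads as independent samples, which is an approximation.

```python
import math

def fisher_ci(r, n, z_crit=1.959964):
    """Approximate 95% CI for a Pearson correlation via the Fisher z-transform."""
    z = math.atanh(r)                 # variance-stabilizing transform
    se = 1.0 / math.sqrt(n - 3)       # standard error in z-space
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

gemma_ci = fisher_ci(-0.61, 768)      # Accurate vs Wrong on Gemma, 768 heads
qwen_ci = fisher_ci(0.07, 2560)       # Accurate vs Wrong on Qwen, 2560 heads
print([round(x, 2) for x in gemma_ci], [round(x, 2) for x in qwen_ci])
```

The larger head count on Qwen is why its interval is narrower, and why even a correlation as small as +0.07 has an interval that excludes zero.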

In both models, sparse late-layer dominance recurs across all five framings (Figure 3).

**Figure 3.** Per-head logit attribution maps on Gemma 3 12B IT (top row, 48 layers × 16 heads) and Qwen 2.5 32B Instruct (bottom row, 64 layers × 40 heads), under five prompt framings (all with `Pro_Introspection_Document`). The same sparse late-layer pattern recurs across both models and all five framings; specific head signs and magnitudes vary by framing, per the cross-prompt correlations in Figure 2.

What are these heads attending to? Comparing the 42 top heads to the 294 same-layer non-top heads (averaged over the 9 concepts), the top heads attend less to the injected Turn 1 positions (mean 0.24 vs 0.50, Mann-Whitney U $p = 1 \times 10^{-11}$ ) and more to the “The answer is” prefill (0.20 vs 0.04, $p = 8 \times 10^{-8}$ ); their attention is also higher-entropy across positions (2.23 vs 1.85 nats, $p = 3 \times 10^{-3}$ ). So the top heads are not localizing on the injection site—they read the answer-formation context and spread attention more broadly than baseline.
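The two attention statistics used here, mass on a span of positions and entropy in nats, are simple to compute from one head's attention row. A minimal standard-library sketch, with made-up attention weights and hypothetical helper names:

```python
import math

def attention_entropy(weights):
    """Shannon entropy (nats) of one head's attention distribution over positions."""
    total = sum(weights)
    probs = [w / total for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0)

def mass_on(weights, positions):
    """Total attention weight placed on a set of key positions."""
    return sum(weights[i] for i in positions)

focused = [0.90, 0.05, 0.03, 0.02]   # head concentrating on one position
spread = [0.25, 0.25, 0.25, 0.25]    # head attending uniformly

entropy_focused = attention_entropy(focused)
entropy_spread = attention_entropy(spread)
mass_first_two = mass_on(focused, [0, 1])
```

A uniform distribution over four positions reaches the maximum entropy of ln 4, so higher entropy corresponds to attention spread across more positions.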

Interpretation. This is consistent with Pearson-Vogel et al.’s (2026) finding that prompt framing dominates measured introspection accuracy. They suggested that the prompt framing may influence access to a unified capacity for introspection; this response-head result offers a complementary interpretation where part of the framing sensitivity is implemented by response-side circuitry that computes a framing-dependent answer, i.e. by the attention heads whose output determines the final-layer yes/no logit, as opposed to the upstream representation of the injected concept itself. The two accounts are not mutually exclusive; the mid-layer-axis and post-training results develop this distinction between representation and report further.

A single mid-layer axis carries the injection effect

Lindsey (2025) speculated that the injected-thoughts detection mechanism is “an anomaly detection mechanism … specialized for only detecting anomalous activity along certain directions, or within a certain subspace.” On Gemma 3 12B IT, I find such a single-direction structure: the propagated effect of injection lies on a 1D residual-stream axis from L14 through L28, with the same direction maintained across those layers. At L33 the rank-1 structure breaks and the signal expands into a multi-dimensional readout that suppresses “no” tokens across multiple languages. Furthermore, the output shift manifests only when the prompt frames a yes/no answer.

Across the 9 concepts tested here, injection shifts vary substantially in both magnitude and sign: five respond positively (largest +23pp on music), one weakly, two reduce the affirmative-answer direction, and one is flat (Figure 4, bottom). Underneath this variability is a shared low-rank structure.

**Figure 4.** Per-concept logit-lens trajectories on Gemma 3 12B IT ($\alpha = 10$, injection at L6). *Top:* P("yes") trajectory for *bread* under control vs. injection. *Bottom:* per-concept Δ heatmap (inject − base) across layers for all 9 concepts, sorted by peak Δ; five concepts respond positively, one weakly, two respond negatively, one is flat at the final layer.

A 1D axis carries the effect from L14 to L28

Cross-concept SVD of inject-minus-control residual-stream differences at the prediction-token position (9 concepts, $\alpha = 10$ , injection at L6) shows the propagated signal collapses to essentially one dimension only 8 layers after injection, and the same direction is maintained (Figure 5):

Layer	Participation ratio	var(top-1)	Top-1 SVD ablation eliminates
L10	3.40	51%	67% of the shift
L14	1.09	96%	94%
L20	1.09	96%	100%
L24	1.17	92%	96%
L28	1.75	74%	100%
L33	3.39	47%	(rank-1 broken)
L40	2.23	65%	(rank-1 broken)

The rank-1 axis is the same direction across L14–L28 (same-sign cosine ≥ 0.95 between every pair; Figure 5, right). Causally, top-1 SVD ablation eliminates 94–100% of the cross-concept mean shift at each of L14, L20, L24, and L28, but the per-concept picture is layer-dependent. At L14 and L28, ablating the rank-1 axis drives every concept’s residual shift to ≤4pp (including the negative responders fear and death), so those layers are causal bottlenecks. At L24, by contrast, top-1 ablation leaves large per-concept residuals (music +23pp, bread +17pp, fear −19pp, death −15pp) that cancel in the mean; top-2 ablation at L24 collapses them (max |residual| = 7.2pp). The mid-band layers therefore carry the concept signal redundantly on a 2D subspace, with single-axis bottlenecks at the boundaries (L14, L28). At L33 the rank-1 structure breaks (PR = 3.39, var(top-1) drops to 47%) and the axis decoheres (cosine with the L20 axis falls to 0.25).

**Figure 5.** Rank-1 axis evidence on Gemma 3 12B IT ($\alpha = 10$, injection at L6). *Left:* cross-concept participation ratio + top-1 SVD ablation by layer; bars are PR, numbers above each bar are the % of cross-concept mean shift eliminated by top-1 ablation. Rank-1 zone L14–L28 (PR ≈ 1, ablation 94–100%); rank-1 structure breaks at L33. *Right:* cross-layer top-1 axis cosine matrix (sign-aligned to L20); same direction maintained across L14–L28 (|cos| ≥ 0.95), decoheres at L33.
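The rank statistics above can be illustrated with a small NumPy sketch on synthetic data. I assume here the common definition of participation ratio, PR = (Σ sᵢ²)² / Σ sᵢ⁴ over the singular values of the uncentered diff matrix, which is consistent with the table (e.g. a 96%/4% variance split gives PR ≈ 1.08). The helper names and the planted axis are hypothetical. In the actual experiment the projection is applied to the residual stream during a forward pass and P(yes) is re-measured; the sketch covers only the linear algebra.

```python
import numpy as np

def participation_ratio(diffs):
    """PR = (sum s_i^2)^2 / sum s_i^4 over singular values of the diff matrix."""
    s = np.linalg.svd(diffs, compute_uv=False)
    return float((s**2).sum() ** 2 / (s**4).sum())

def top_k_directions(diffs, k=1):
    """Top-k right-singular vectors (orthonormal rows) of (n_concepts, d_model) diffs."""
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    return vt[:k]

def project_out(x, directions):
    """Remove the span of orthonormal `directions` from each row of x."""
    return x - (x @ directions.T) @ directions

# Synthetic inject-minus-control diffs: one shared axis plus small noise.
rng = np.random.default_rng(0)
n_concepts, d_model = 9, 64
axis = rng.normal(size=d_model)
axis /= np.linalg.norm(axis)
strengths = rng.normal(size=(n_concepts, 1))      # signed per-concept loading on the axis
diffs = strengths * axis + 0.001 * rng.normal(size=(n_concepts, d_model))

pr = participation_ratio(diffs)
s = np.linalg.svd(diffs, compute_uv=False)
var_top1 = float(s[0] ** 2 / (s**2).sum())        # fraction of variance on the top axis
top1 = top_k_directions(diffs, k=1)
ablated = project_out(diffs, top1)                # what remains after top-1 ablation
```

Passing `k=2` to `top_k_directions` gives the two-direction ablation used at L24.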

This axis is not itself a yes/no direction. If the 1D axis at L14–L28 were just “push toward yes / away from no” in the unembedding, projecting it through the unembedding should yield a high cosine with the (yes − no) direction. Instead, cos(top-1 axis, [u_yes − u_no]) is between −0.01 and 0.01 at every layer in {L14, L20, L24, L28}, rising only modestly to 0.16 at L40. The mid-layer signal does not directly push towards ‘yes’; it’s an upstream feature that becomes a yes/no rebalancing only after late-layer processing.

Late layers read the signal off as multilingual no-suppression

At L33 the rank-1 structure breaks (PR = 3.4) and the signal expands into a multi-dimensional readout. Projecting the cross-concept centroid of inject-minus-control diffs at L40 through embed_tokens (which is tied to the unembedding on Gemma) shows what that readout does in token space: the most-suppressed tokens are “ NO”, “NO”, “ ನೋ” (Kannada no), “ Nope”, “ नकारात्मक” (Hindi negative), “ ノー” (Japanese no), “Nope” (Figure 6). The mechanism is roughly symmetric in P(no) and P(yes) at the output: ctrl P(no) = 37.7% drops to 18.3% under injection (−19.4pp), tracking the +19.5pp on “ yes” point-for-point.

The output shifts in P(yes) and P(no) are nearly symmetric, but the underlying residual-stream effect is asymmetric. Mechanistically, “ no” is at rank 226 of the most-suppressed tokens out of 262,208 (top 0.09%; 49% of the way to the maximum-suppression magnitude), while “ yes” is rank 8,079 of the most-amplified (top 3%; 22% of the way to maximum amplification). The signal in the residual stream is concentrated on removing no-probability rather than directly amplifying yes; the near-symmetric output shifts arise after softmax normalization.

**Figure 6.** Late-layer readout on Gemma 3 12B IT ($\alpha = 10$, injection at L6). Top-15 most-suppressed tokens at L40, computed as the cross-concept centroid of (injection − control) projected through $W_E$ (= `embed_tokens`, tied on Gemma). Multilingual "no" / negation tokens (English, Kannada, Hindi, Japanese) highlighted in red.
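A small worked example shows how a one-sided logit change can produce near-symmetric probability shifts after softmax. It starts from the control probabilities reported above (62% yes, 37.7% no), lumps the remainder into a single "other" bucket, and lowers only the "no" logit by a hypothetical amount `delta`. This is a three-token caricature, not the model's actual logits.

```python
import math

def softmax(logits):
    m = max(logits.values())
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Control probabilities from the text; "other" is the leftover mass.
control = {"yes": 0.62, "no": 0.377, "other": 0.003}
logits = {tok: math.log(p) for tok, p in control.items()}

delta = 1.0  # hypothetical logit drop applied only to "no"
steered_logits = dict(logits, no=logits["no"] - delta)
steered = softmax(steered_logits)

# The "yes" logit is untouched, yet its probability rises as "no" loses mass.
shift = {tok: steered[tok] - control[tok] for tok in control}
```

With these inputs, P(yes) rises by roughly 19pp and P(no) falls by about the same amount, close to the observed shifts, even though nothing was added to the "yes" logit.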

Without yes/no-eliciting framing, the injection has no output effect

The control P(yes) of 62% under the “The answer is” prefill is itself above 50%, unlike Pearson-Vogel et al.’s ~0%. Figure 7 shows P(“yes”) and the max token under prefill ablation for bread ( $\alpha = 10$ , layer 6).

The injection shift tracks how much the prefill already biases toward yes: +19.5pp on the baseline for “The answer is” with control P(yes) = 62%, and only +3.2pp on “My answer is”, which biases towards “no” (control P(yes) = 5.3%). Prefills that don’t frame a yes/no answer at all show no effect of injection. So injection appears to amplify an existing bias toward ‘yes’ when one is present. The response heads, which read prompt framing, are what route the perturbation, and without yes/no-eliciting framing, there is no effect on output.

**Figure 7.** Prefill ablation for *bread* at L6 ($\alpha = 10$): argmax probabilities under control vs. injection for four prefills. Injection shift tracks the prefill's existing "yes" bias.

(SAE analysis at 16k width did not isolate the mechanism at the feature level; see Appendix A.4.)

Variability across concepts. The per-concept shifts cluster as: music +23pp; bread, love, truth at +19.5pp; programming +15.6pp; cats only +5.9pp; fear and death reduce the affirmative-answer direction (−5.9pp each); creativity is flat at the final layer (Figure 4, bottom). (P(yes) values are quantized at bf16 dyadic fractions, so these shifts repeat exactly across several concepts; see Appendix A.2.) The reductions are not norm or direction artifacts: all 9 concept vectors are unit-normalized at L6, and death’s L6 direction has cosine 0.51 with the centroid of strong responders, identical to that of music. At the logit-lens mid-layer (L36), fear, death, and creativity produce large-magnitude reversed effects (peak Δ ≈ −18 to −24pp at the prediction position; creativity’s mid-layer drop is fully cancelled by later layers). The shared rank-1 axis above captures the cross-concept “thinking about a concept” component; the per-concept routing differential (positive vs negative output shift) is in the orthogonal residue, which is read by downstream circuitry differently for affective concepts than for concrete ones.

Post-training builds and amplifies the response heads

Comparing per-head logit attribution on Gemma 3 12B IT vs PT (Accurate_Mechanism + Pro_Introspection_Document, 768 heads):

The high-magnitude response heads are mostly absent in the base model. PT has only 14 heads with $|\mathrm{diff}| > 0.005$ (vs IT’s 42), 1 head with $|\mathrm{diff}| > 0.01$ (vs IT’s 19), and 0 heads with $|\mathrm{diff}| > 0.02$ (vs IT’s 5). Of IT’s 19 heads above the 0.01 threshold, 18 fall below it in PT.
Top-10 heads mostly differ. Top-10 overlap = 1/10 (7.7× above chance, $p = 0.12$ ; not significant for top-10 alone, but top-20 overlap is 6/20, $p < 10^{-5}$ ).
Where the same head exists, post-training amplifies same-sign contribution. L43H3 is +0.009 in PT and +0.046 in IT, a 5× amplification with no sign change. Across the top-20 heads by combined $|PT|+|IT|$ differential, only 4/20 reverse their contribution (vs chance ~10/20 under permutation null; $p = 0.0015$ in the ‘fewer than chance’ direction). So post-training mostly preserves the direction and amplifies top heads’ contribution, rather than reversing sign.
Overall ranking is essentially uncorrelated. Pearson $r = +0.19$ (bootstrap 95% CI [+0.01, +0.36]), Spearman $\rho = +0.01$ (CI [−0.08, +0.09]). The bulk of the 768 heads have effectively independent rankings in PT vs IT. Beyond the small high-magnitude set, the attribution pattern is rebuilt.
Mean absolute differential grows ~1.6× (IT 0.0013 vs PT 0.0008). The growth is concentrated in the top heads, not spread evenly: max $|\mathrm{diff}|$ is 0.046 in IT vs 0.019 in PT (2.4×), and the top shared head L43H3 specifically grows 5×.

So, post-training amplifies and constructs the response heads: most of IT’s top-magnitude heads are not high-magnitude in the base model, and the overall head ranking is essentially uncorrelated. Where the same head does appear in both, post-training amplifies its contribution without reversing its sign (Figure 8).
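The head-overlap figures above can be sanity-checked with a hypergeometric tail. The sketch below is a minimal illustration, not the analysis code used for this writeup: it assumes the null hypothesis that one model's top-k set is a uniformly random k-subset of the 768 heads relative to the other's, and the function names are hypothetical.

```python
from math import comb


def expected_overlap(n_items: int, k: int) -> float:
    """Expected size of the intersection of two independent random k-subsets."""
    return k * k / n_items


def overlap_p_value(n_items: int, k: int, observed: int) -> float:
    """P(overlap >= observed) for two independent random k-subsets of n_items.

    Hypergeometric upper tail: fix one top-k set, draw the other at random.
    """
    total = comb(n_items, k)
    tail = sum(
        comb(k, x) * comb(n_items - k, k - x)
        for x in range(observed, k + 1)
    )
    return tail / total


# Gemma 3 12B has 768 heads in this analysis; top-10 overlap of 1 head.
print(expected_overlap(768, 10))    # chance-level overlap, in heads
print(overlap_p_value(768, 10, 1))  # probability of at least 1 shared head
```

Under this null, the top-10 case gives an expected overlap of about 0.13 heads (so 1 shared head is roughly 7.7× chance) and a tail probability of about 0.12, in line with the values quoted above. The same function applies to the top-20 comparison.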

**Figure 8.** Post-training amplifies and rebuilds top heads' contribution (Gemma 3 12B PT vs IT per-head differential scatter). The shared top-10 head (green star) keeps its sign and increases 5×. Most of IT's top-10 (red dots) are near zero in PT. Overall Pearson $r = +0.19$, Spearman $\rho \approx 0$.

Godet’s protocol engages generic perturbation detection, not concept identification

Godet (2025) presents the model with $n$ sentences, injects a concept vector into one, and asks which sentence was injected. This is an elegant approach that sidesteps the yes/no framing of the Pearson-Vogel et al. protocol, but the mechanism behind localization has remained unexplored.

Here, I extend Godet’s protocol with a series of controls to rule out possible explanations for what the model is detecting. Each control holds constant a different property of the perturbation:

Concept vs. random vectors rules out concept-specific structure.
Norm-matched perturbation: concept injected, activation rescaled to preserve L2 norm. This rules out magnitude as the driver of the effect.
In-distribution activation swap: target-sentence activations replaced with a different sentence’s activations, rescaled to preserve norm. This rules out detection based on out-of-distribution activations.
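The three controls differ only in how the perturbed activation is constructed. The following is a minimal sketch on plain Python lists, assuming a single residual-stream vector `h`, a steering vector `v`, and a donor activation taken from another sentence. All function names are hypothetical, and the real experiments operate on model tensors at every token position of the target sentence.

```python
import math
import random


def l2(v):
    """L2 norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def rescale_to(v, target_norm):
    """Rescale v to have the given L2 norm (v must be non-zero)."""
    n = l2(v)
    return [x * target_norm / n for x in v]


def inject(h, v, alpha):
    """Plain steering: h + alpha * v changes both direction and norm."""
    return [a + alpha * b for a, b in zip(h, v)]


def inject_norm_matched(h, v, alpha):
    """Steer, then rescale so only the direction of h changes."""
    return rescale_to(inject(h, v, alpha), l2(h))


def random_norm_matched(v, rng):
    """Isotropic Gaussian direction with the same norm as the concept vector v."""
    g = [rng.gauss(0.0, 1.0) for _ in v]
    return rescale_to(g, l2(v))


def swap_norm_matched(h_target, h_donor):
    """Replace the target activation with a donor one, keeping the target's norm."""
    return rescale_to(h_donor, l2(h_target))


rng = random.Random(0)
h = [rng.gauss(0.0, 1.0) for _ in range(64)]
v = [rng.gauss(0.0, 1.0) for _ in range(64)]
donor = [rng.gauss(0.0, 1.0) for _ in range(64)]
print(l2(h), l2(inject(h, v, 10.0)), l2(inject_norm_matched(h, v, 10.0)))
```

In this sketch the plain injection inflates the norm, while the norm-matched and swap variants leave it unchanged, so any detection under those controls has to come from the change in direction.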

These controls show that the model’s ability to localize the injected sentence is nonspecific perturbation detection⁶ over residual-stream content, in the sense that it responds to many kinds of representational mismatch. This capability manifests for perturbations at early-to-mid depths, and disappears for perturbations in later layers.

Perturbation vectors are not concept-specific or magnitude-driven

Random-vector controls have been tested before at smaller scales. Lindsey found that random vectors elicit introspective awareness in ~9/100 trials on Opus 4.1 (lower than concept vectors, requiring norm-matching to strength 8). Godet predicted random vectors would work similarly to concept vectors for localization, but left this open. Dustin Rubin’s fork ran a preliminary random-noise control on Qwen 2.5 14B using unit Gaussians scaled to $\sqrt{d}$ norm, but these were ~71× the concept-vector norm, so directionality was confounded by magnitude.

A 64-layer sweep on Qwen 2.5 32B Instruct (5 sentences, 100 trials per point) compares (1) concept vectors (Godet’s original, repeng-extracted), (2) random Gaussian vectors norm-matched per-layer to the concept vectors, and (3) norm-matched concept injection (perturbed activation rescaled to preserve norm, isolating directional from magnitude change). Three regimes emerge (Figure 9, left):

Very early layers (fraction 0.00–0.05): random $\gg$ concept. Random vectors reach 97–100% vs concept 79–89%. Since the random vectors are norm-matched, this isn’t a magnitude effect. One possible explanation is that concept vectors lie on the high-variance directions of the activation distribution (since they’re extracted via PCA on real activations), while norm-matched Gaussian noise is isotropic and has mass in low-variance directions the model doesn’t typically use, perhaps making the noise more “anomalous” and easier to localize; this is not directly verified here.
Early-mid layers (0.05–0.30): concept $\approx$ random, norm-matched is best. All three track each other in the 70–90% range, with norm-matched perturbation consistently highest; localization is carried by directional rotation of the residual stream, not magnitude.
Mid-late layers (0.30–0.55): all three decay together to chance by fraction 0.55.

These controls rule out two possible explanations: detection isn’t concept-specific (random vectors work), and isn’t magnitude-driven (norm-matched perturbation works).

**Figure 9.** Random vectors match or outperform concept vectors, and in-distribution swaps also elicit localization. Left: localization accuracy across concept, random, and norm-matched vectors (Qwen 2.5 32B Instruct, 64-layer sweep, n=100, 95% binomial CI). Right: norm-matched in-distribution swap control, showing detection at L0–L28 decaying to chance at L32–L36.

Directional contextual mismatch is detected without any out-of-distribution signal

The next control seeks to rule out a mechanism where injected activations are detected because they’re out-of-distribution. Instead of adding an arbitrary perturbation vector, one sentence’s residual-stream activations at layer L are replaced with a different sentence’s activations at the corresponding positions, rescaled to preserve the target position’s original L2 norm. The swapped activation is (a) in-distribution by construction, (b) norm-matched, and (c) differs from the original only in direction. Any remaining detection must come from a mismatch between the sentence’s residual content and its surface tokens. The sweep covers 16 layers with n = 200 trials per layer, sentence draws paired across layers (seed = 42), and a chance level of 20% (Figure 9, right):

L0–L28: 48.5–55.5% accuracy (all z > 8 above chance). The model reliably identifies the swapped sentence in the first ~44% of the network.
L32: transition. Drops to 25% (z = 1.8, marginally above chance).
L36–L60: at chance (18–19.5%). Sensitivity vanishes in the narrow band L28–L36 and stays at chance through the final layer.
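The z-values in the list above are consistent with a one-sample normal approximation to the binomial, with chance = 20% and n = 200. The sketch below assumes that test (the exact procedure is not stated in the text), and the function name is hypothetical.

```python
import math


def z_above_chance(accuracy: float, n_trials: int, chance: float) -> float:
    """z-score of an observed accuracy against a fixed chance level.

    Uses the standard error of a proportion under the null (p = chance).
    """
    se = math.sqrt(chance * (1.0 - chance) / n_trials)
    return (accuracy - chance) / se


# Values from the swap-control sweep: n = 200 per layer, 5 sentences.
for label, acc in [("L0-L28 low end", 0.485), ("L32", 0.25), ("L36-L60 high end", 0.195)]:
    print(label, round(z_above_chance(acc, 200, 0.20), 2))
```

With these inputs the lower end of the L0–L28 range sits around z ≈ 10, L32 lands near the reported 1.8, and the late-layer accuracies fall slightly below chance.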

The in-distribution swap works at early-to-mid depth. Detection isn’t purely out-of-distribution perturbation detection: the model is sensitive to contextual mismatch between residual content and surface tokens, even when the residual is itself a real in-distribution activation.

Whether we should call this “contextual self-consistency checking” in an active sense is genuinely underdetermined by this experiment—the same data would follow if garbled residuals simply produce garbled continuations that downstream layers flag as unnatural. But whatever mechanism implements it, detection does happen (the model correctly identifies the swapped sentence), and that mechanism responds to pure directional mismatch over residual-stream content.

A single late-layer head reads off the answer from MLP-built upstream

Here, I investigate the mechanism underlying the form of introspection demonstrated in Godet’s protocol. I use activation patching of $\mathrm{logit}(1) - \mathrm{logit}(2)$ under the 2-sentence version of the protocol to identify the head-level circuit that produces the localization answer on Qwen 2.5 32B Instruct. The circuit is asymmetric: a single attention head at the final transformer layer reads off the answer, while the underlying “1 vs 2” representation it reads is built up upstream by many components, dominantly MLPs.

A single late-layer head reads off the answer. Per-head logit attribution on $\mathrm{logit}(1) - \mathrm{logit}(2)$ (single-layer injection at L16, $\alpha = 10$ , $n = 50$ trials × 9 concepts) identifies L63H32 as the dominant contributor (direct attribution score 7× the next-largest head). This is confirmed by single-head causal patching, i.e. replacing H32’s o_proj input slice from the clean run (sentence-1 injected) into the corrupt run (sentence-2 injected), then measuring recovery of the logit difference. Patching L63H32 alone recovers ~92% of the clean-vs-corrupt logit difference (Figure 11, left). Every other top-20 candidate head recovers under 5%.
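For readers unfamiliar with patching metrics, the "recovery" quoted here is usually a normalized logit-difference gap. The sketch below assumes that standard normalization (the text does not spell out the formula), and the numbers in the example are made up for illustration only.

```python
def recovery_fraction(clean: float, corrupt: float, patched: float) -> float:
    """Fraction of the clean-vs-corrupt logit-difference gap restored by a patch.

    0.0 means the patched run looks like the corrupt run, 1.0 like the clean run.
    All three arguments are values of logit(1) - logit(2).
    """
    gap = clean - corrupt
    if gap == 0:
        raise ValueError("clean and corrupt runs have the same logit difference")
    return (patched - corrupt) / gap


# Hypothetical logit differences, NOT measured values from this work:
# clean run favours "1", corrupt run favours "2", patched run mostly restored.
print(recovery_fraction(clean=6.0, corrupt=-6.0, patched=5.04))
```

A value near 1 for a single head, alongside values near 0 for the other candidates, is what the single-head patching result describes.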

L63H32’s mechanism is consistent with a “digit-output head” reading from the answer-prefill positions:

Attention. From the answer-token position (the position right after “located in SENTENCE ”), L63H32 sends ~100% of its attention weight to the prefill region and 0% to either of the candidate sentences, identical under clean and corrupt conditions. So the head doesn’t detect the perturbation directly; instead it reads the residual stream at positions where upstream components already represent the answer.
OV decomposition. Projecting H32’s contribution at the answer position through the final RMSNorm and unembedding, the head amplifies multilingual forms of “1” and suppresses multilingual forms of “2” (per-digit differential logit contributions: '1' +7.7, '2' −5.2, '3'–'9' near zero).
This head is distinct from the response heads in the Pearson-Vogel protocol. L63H32 ranks 600–1500 of 2560 heads in every Pearson-Vogel head-attribution condition on the same model (max $|\mathrm{diff}|$ ~0.001 vs ~0.30 in those conditions). The overlap between the top 20 heads for each protocol is 1–3 heads across conditions, all weak. The same model uses different late-layer heads to answer “do you detect?” and “which sentence?”.

The upstream signal is distributed and MLP-mediated. Because L63H32 attends to the prefill positions, the “1 vs 2” representation must already be in the residual stream there before L63. Tracking the residual at prefill positions across layers (projecting onto H32’s read direction $d_{\text{read}} = W_V[\text{H32}]^\top W_O[\text{H32}]^\top (\mathrm{final\_norm\_w} \odot (W_E[1] - W_E[2]))$ ) shows that the signal builds gradually from L46 onward and reaches its pre-L63 peak around L62.
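One way to read the expression for $d_{\text{read}}$, as a sketch under simplifying assumptions that are not stated in the text: column-vector convention ($v = W_V x$, $o = W_O v$), a fixed attention pattern $a_j$ from the answer position, no biases, and the per-token RMS scaling of the final norm treated as a constant. Writing $u = \gamma \odot (W_E[1] - W_E[2])$ with $\gamma$ the final-norm weight (`final_norm_w`), and $x_j$ the head's input at prefill position $j$, the head's contribution to $\mathrm{logit}(1) - \mathrm{logit}(2)$ is, up to that constant,

$$
u^\top o_{\text{H32}}
= u^\top W_O[\text{H32}] \sum_j a_j \, W_V[\text{H32}] \, x_j
= \sum_j a_j \, \big( W_V[\text{H32}]^\top W_O[\text{H32}]^\top u \big)^\top x_j
= \sum_j a_j \, d_{\text{read}}^\top x_j .
$$

Under these assumptions, projecting the prefill-position residual onto $d_{\text{read}}$ approximates how much L63H32 would add to the logit difference if it attended to that position, which is why the projection can be tracked layer by layer.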

Per-head attribution of upstream contributions to $d_{\text{read}}$ at the prefill positions identifies a few notable heads (L61H13, L52H36, L52H37, L61H14) but per-layer MLP contributions are larger by a factor of ~2–4. Causal patching confirms the MLP-mediation: patching L60–L62 MLPs jointly recovers ~62% of the logit difference; adding L59 brings it to ~70% (Figure 11, right). Single late-layer MLPs each recover 10–25% individually, distributed across the band rather than concentrated. Single upstream heads recover under 4% each.

Putting this all together, the Godet localization circuit on Qwen 32B has a distinctive arrangement: the upstream signal is distributed and MLP-mediated across L59–L62, and then read off by a single head, L63H32. This contrasts with the Pearson-Vogel response-head circuit on the same model, where the yes/no answer is distributed end-to-end across many heads with no single causally necessary one.

**Figure 11.** Godet localization circuit on Qwen 2.5 32B Instruct (single-layer injection at L16, α = 10, n = 50). *Left:* head patching. L63H32 alone recovers >90% of the logit difference; other top-8 heads each recover <5%. *Right:* MLP patching. Joint patching of L60–L62 MLPs recovers ~62% (~70% with L59 added); each single MLP recovers 10–25%.

Further analyses

Localization and yes/no detection peak at opposite ends of the network

Godet localization and Pearson-Vogel et al.’s yes/no detection look like distinct computations operating at different depths: the two signals peak at opposite ends of the network, with no overlapping layers (Figure 10).

**Figure 10.** Godet localization accuracy (concept vectors) vs. Pearson-Vogel logit-lens P(yes) on Qwen 2.5 32B Instruct, plotted on a shared layer-fraction axis. Localization peaks in early layers while yes/no P(yes) emerges in late layers.

The two protocols engage substantially different upstream circuits

The response-head, mid-layer-axis, and post-training results characterize the Pearson-Vogel circuit (yes/no detection via late-layer response heads, with a 1D mid-layer axis read off as multilingual-no suppression, built during post-training). The localization controls characterize the Godet circuit (single-head L63H32 readout fed by L60–L62 MLPs operating on a perturbation-detected residual). On the same model, with the same perturbation, do the two protocols’ response-side circuits draw on shared representation-side components?

On Qwen 2.5 32B Instruct, at each of three injection depths (L8, L24, L40), 9 contrastive-pair steering vectors were randomly sampled from the 50 pairs used in Godet’s protocol and injected at α = 10. Per-head attention output and per-MLP output at the prediction position were captured under inject and control conditions for each (protocol × depth × vector), and each protocol’s per-component contribution to the cross-vector perturbation-signal direction at the final block residual was extracted.

At all three depths, cross-protocol attention-head recruitment is essentially disjoint, and MLP recruitment is partially shared (Figure 12). Per-component Spearman $\rho$ for attention heads is 0.09, 0.13, and 0.22 at L8, L24, and L40 respectively; for MLPs it is 0.58, 0.58, and 0.87. Top-20 head overlap is 1/20, 3/20, 2/20 respectively—near the 0.16-expected chance level for 2560 candidate heads. The two protocols engage substantially different circuits on the same model and the same perturbations: disjoint attention heads, partially overlapping MLPs, with the strongest MLP overlap at the deepest injection tested.
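The two summary statistics used here, rank correlation and top-k overlap of per-component attributions, can be sketched in a few lines. This is an illustrative stdlib implementation with hypothetical function names and toy inputs. Whether the writeup ranks components by signed value or by magnitude for Spearman $\rho$ is not stated, so the sketch uses signed values for the correlation and magnitudes for the top-k sets.

```python
import math


def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = mean_rank
        i = j + 1
    return out


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman rho: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def top_k_overlap(x, y, k):
    """Number of components shared by the top-k sets (by absolute value)."""
    def top(v):
        idx = sorted(range(len(v)), key=lambda i: abs(v[i]), reverse=True)
        return set(idx[:k])
    return len(top(x) & top(y))


# Toy attributions for 6 components under two protocols (made-up numbers).
proto_a = [0.30, -0.02, 0.01, 0.00, -0.25, 0.03]
proto_b = [0.01, 0.28, -0.02, 0.00, -0.20, 0.02]
print(spearman(proto_a, proto_b), top_k_overlap(proto_a, proto_b, 2))
```

For 2560 heads and k = 20, the chance-level overlap of two independent top-20 sets is 20 × 20 / 2560 ≈ 0.16 heads, which is the reference value quoted above.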

**Figure 12.** Cross-protocol component attribution at the prediction position on Qwen 2.5 32B Instruct (9 contrastive-pair vectors at α=10, three injection depths). Each point is one component's projection onto its protocol's perturbation-signal direction at the final block residual. Attention heads (small dots) cluster near zero with weak rank correlation (Spearman ρ +0.09 to +0.22); MLPs (squares) align along the diagonal more strongly, particularly at L40 (ρ +0.87). Note symlog axes.

Verbalization: detection without identification

Logit-level concept MI is scale- and post-training-dependent: Qwen 32B Instruct reaches MI = 1.49 bits (comparable to Pearson-Vogel et al.’s 1.36), while Gemma 3 12B IT has MI ≈ 0. Post-training helps on Qwen (base = 0.25 bits) but cannot create signal where there is none (Gemma 12B PT = 0; Figure 13).
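To give the bit values some scale: with 9 equally likely concepts, mutual information between injected and decoded concept is capped at $\log_2 9 \approx 3.17$ bits. The sketch below is the generic plug-in MI estimator over a table of joint counts. How the MI in this writeup is actually computed from logits is not described in this section, so treat this only as an illustration of the quantity, with a hypothetical function name and toy tables.

```python
import math


def mutual_information_bits(joint_counts):
    """Plug-in mutual information (in bits) from a table of joint counts.

    joint_counts[i][j] = number of trials with injected concept i
    and decoded concept j.
    """
    total = sum(sum(row) for row in joint_counts)
    row_tot = [sum(row) for row in joint_counts]
    col_tot = [sum(col) for col in zip(*joint_counts)]
    mi = 0.0
    for i, row in enumerate(joint_counts):
        for j, c in enumerate(row):
            if c > 0:
                mi += (c / total) * math.log2(c * total / (row_tot[i] * col_tot[j]))
    return mi


# Toy tables (made-up counts): perfect decoding of 4 concepts vs. no signal.
perfect = [[5 if i == j else 0 for j in range(4)] for i in range(4)]
no_signal = [[1, 1], [1, 1]]
print(mutual_information_bits(perfect), mutual_information_bits(no_signal))
```

Perfect decoding of 4 equiprobable concepts gives 2 bits and an uninformative table gives 0.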

**Figure 13.** Logit-level concept MI across Qwen 32B Instruct (1.49 bits), Qwen 32B Base (0.25), Gemma 12B IT (≈0), and Gemma 12B PT (0): scale- and post-training-dependent.

When each model verbalizes its response under Turn-1-only KV-cache injection, neither model correctly names the injected concept (Figure 14). Gemma 12B IT produces “yes” on 9/9 concepts but attributes detection to the { } placeholder in the prompt template⁷ (logit MI is near 0, so there is no signal to verbalize). Qwen 32B Instruct produces “no” on 8/9 concepts; the one case that crosses threshold (bread) confabulates the prompt topic, despite Qwen’s 1.49-bit logit-level MI.

**Figure 14.** Detection rate vs. identification rate in outputs (Gemma 12B IT: 9/9 detection, 0/9 correct; Qwen 32B Instruct: 1/9 detection, 0/1 correct).

The evidence here is thin (Gemma is confounded by the template artifact + zero MI; Qwen’s identification result is n = 1, bread). The same dissociation between detection and identification is the central claim of Lederman and Mahowald (2026), established rigorously across 821 concepts on Qwen3-235B-A22B and Llama 3.1 405B—see Relationship to Lederman and Mahowald.

Discussion

Representation and report

The consistent story across findings is that what the representation encodes and what the model reports are separable components that get blended together in the final layer logits.

Representation. In Godet’s localization protocol the signal is largely sensitivity to perturbation, not concept-specific. Since random norm-matched vectors work as well or better, the latent state is carrying something like “position X is perturbed” rather than “position X has concept c.” Concept-level information does exist in some models at some layers (Qwen 32B Instruct: MI = 1.49 bits at layer 62), but not universally.
Report. The heads that produce the yes/no answer read the prompt’s framing rather than the injection (the response heads; $r = -0.61$ between Accurate and Wrong framings on Gemma 12B). Prefill ablation showed that without a yes/no-eliciting prefill, the injection has no measurable output effect at all. The introspective report is the joint product of detection and prompt-dependent response circuitry.

The two protocols surface different mixtures of these components, which is part of why findings across Lindsey, Pearson-Vogel, and Godet sometimes appear to conflict (opposite depth profiles for introspection between Pearson-Vogel and Godet; random vectors underperforming concept vectors in Lindsey (2025) but matching or beating them in the localization controls). Godet’s localization protocol almost entirely targets the representation side (early-layer perturbation sensitivity, no yes/no framing); Pearson-Vogel’s two-turn yes/no detection is a joint measurement where the response circuitry gates access to representation content.

Distinct report mechanisms

On Qwen 2.5 32B Instruct, the two protocols use different late-layer reasoning architecture. In the Pearson-Vogel protocol, the framing-dependent answer-determining computation lives in attention heads: the same heads that produce the yes/no logit reverse sign across framing conditions ( $r = -0.61$ between Accurate and Wrong; 18/20 of the top heads reverse contribution; response heads). In the Godet protocol on the same model, the upstream “1 vs 2” computation lives in MLPs: joint MLP patching at L60–L62 recovers ~62% of the logit difference, while single upstream heads each recover under 4% (localization).

A cross-protocol per-component attribution experiment confirms the circuits differ end-to-end, not just at the readout. Applying the same set of perturbation directions (9 random pair-vectors from Godet’s contrastive-pair set) to the same model under both protocols at three injection depths (L8, L24, L40), I decompose each protocol’s residual at the answer-determining position into per-head and per-MLP contributions to the perturbation-signal direction. The top-20 attention heads contributing in each protocol overlap at 1–3/20 across depths (near the 0.16-of-20 chance level for 2560 candidate heads), with cross-protocol Pearson $r \approx 0$ for per-component head magnitudes. MLP recruitment is partially shared (Pearson $r$ 0.61–0.94 across depths). The protocols engage largely different end-to-end circuits — disjoint attention heads, partially overlapping MLPs.

The 1–3/20 head overlap is sensitive to the choice of perturbation vectors. Repeating with single-word concept vectors (the same ones used for the Pearson-Vogel analyses earlier) gives 5–7/20 overlap, but those vectors don’t trigger Godet’s localization circuit in this model (localization accuracy at chance), so the apparent overlap there mostly reflects each protocol’s response to a perturbation type that lies in the intersection of their relevant subspaces. Contrastive-pair vectors trigger both circuits independently and give the cleaner cross-protocol comparison.

Post-training builds the response-side circuitry

Gemma 12B PT has only 1 head with $|\mathrm{diff}| > 0.01$ vs IT’s 19, and 18 of those 19 IT heads are sub-threshold in PT. The top-10 sets share only 1 head, and overall head ranking decorrelates (Spearman $\rho \approx 0$ ). Where the same head appears in both PT and IT, post-training amplifies its contribution by roughly 5× without reversing sign.

The response circuitry identified earlier is therefore largely built anew during post-training, rather than being merely amplified from pretraining: most of IT’s high-magnitude response heads simply are not response heads in the base model. The point in the post-training pipeline at which this circuitry arises is not addressed here; Macar et al. (2026) localize it to DPO via OLMo staged checkpoints.

Relationship to Macar et al. (2026)

Macar et al. (2026) reported a substantially more thorough mechanistic analysis on Gemma 3 27B as I was preparing this writeup. Three pieces of this writeup add evidence Macar et al. don’t have: (a) the cross-prompt head-attribution sweep with framing-dependent sign reversal of the same response heads ( $r = -0.61$ between Accurate and Wrong, with 18/20 of the top heads reversing contribution; response heads), only visible under the Pearson-Vogel KV-cache architectural constraint that forces all output to flow through attention; (b) the Godet-protocol controls showing perturbation-not-concept localization, plus the single-head L63H32 readout circuit fed by L60–L62 MLPs (localization); and (c) the cross-protocol per-component attribution showing the two protocols engage largely different end-to-end circuits (above). Beyond that, the core stories converge:

Detection is distributed across many features. Their per-head linear-probe analysis on Gemma 3 27B finds that no individual attention head meaningfully discriminates injection from control (mean accuracy change −0.1% ± 0.3%), and ablating full attention layers has minimal effect on detection (Macar §5.2). The per-head logit-attribution result on Gemma 3 12B here similarly identifies 42 heads each contributing small amounts (|diff| > 0.005); patching all 42 simultaneously eliminates ~44% of the injection shift on positive responders. Both suggest the signal lies in a distributed OV perturbation across many heads.
A late-layer default-“no” circuit is disrupted by injection. Their top-200 “gate features” implement a default “no” response that injection suppresses via upstream “evidence carriers” (Macar §5.3). The late-layer readout identified here on Gemma 3 12B matches this picture qualitatively: top-1 SVD ablation at L14–L28 eliminates 94–100% of the inject-vs-control gap, and the centroid → $W_E$ projection at L40 puts “ NO” at rank 1 / 262,208 most-suppressed, with “ ನೋ”, “ नकारात्मक”, and “ ノー” all in the top 15.
Post-training is necessary for the capability. Their OLMo-3.1-32B staged-checkpoint dissection (Base → SFT → DPO → Instruct) isolates DPO specifically (contrastive preference training, not SFT) as the critical stage. The Gemma 3 12B IT-vs-PT comparison reported here is consistent but coarser (it can’t distinguish DPO from other post-training steps). However, it is on a different model family from Macar’s OLMo, so the joint result is that the post-training requirement generalizes across families.
Detection and identification are distinct mechanisms. They show this via separable layer profiles; the Qwen 32B logit-level vs verbalization dissociation reported here (MI = 1.49 bits in logits, confabulation in generation) is a weaker form of the same claim.
Detection rates are bimodal across concepts. Their 500-concept sweep partitions concepts into success and failure groups (52% at detection rate < 32%); the 9-concept sample here is qualitatively similar (top: music/bread/love/truth +19.5–23pp; bottom: fear/death −5.9pp, with creativity flat), though too small to characterize the distribution.
Behavioral robustness across prompt variants. Their seven-variant behavioral sweep shows detection survives prompt rewording; the cross-prompt head-attribution sweep on Gemma 3 12B here ( $r = -0.61$ between Accurate_Mechanism and Wrong_Mechanism (finetune), with 18/20 of the top heads reversing their contribution) is a mechanistic counterpart.

A reader might worry that Macar §4.1’s claim that “anomaly detection is not reducible to a single linear direction” (with 23.3% bidirectional steering successes for same-category concept pairs) contradicts the 1D-mid-layer-axis result (the mid-layer axis). The two claims are about different quantities: Macar §4.1 measures the geometry of concept vectors in input space (which inputs trigger detection across 500 concepts), while the mid-layer axis measures the dimensionality of the propagated residual displacement across 9 tested concepts at fixed mid-layers. Both can hold simultaneously: many input directions can trigger detection, all converging onto a single shared “anomaly detected” axis at mid-layers — which is exactly the evidence-carriers→gate picture Macar describes (many input-side features → one default-no gate). This rank-1 axis is a property of the propagated mid-layer signal, not of the input-side concept-vector geometry.

Where our setups diverge:

| Topic | Macar et al. (2026) | This writeup |
| --- | --- | --- |
| Primary model | Gemma 3 27B IT | Gemma 3 12B IT; Qwen 2.5 32B Instruct |
| Concepts tested | 500 | 9 |
| Injection protocol | Single-turn (vector applied during prompt + generation) | Strict Turn-1-only KV-cache (Pearson-Vogel) |
| Prompt robustness | Behavioral across 7 variants | Mechanistic: head attribution across 5 framings |
| Localization controls | Not tested | Random / norm-matched / in-distribution swap, 64-layer sweep |
| Post-training dissection | DPO vs SFT via OLMo checkpoints | IT vs PT + per-head restructuring |
| Circuit mapping | Evidence carriers → gate features (L45 F9959) | Distributed OV + prompt-dependent response heads |
| Elicitation interventions | Refusal ablation (+53%); bias vector (+75%) | Not run |
| Sparse-feature tooling | Higher-width MLP transcoders | 16k residual-stream SAEs (negative; see Appendix) |

Complementary approaches. Macar et al. use single-turn injection at $\alpha = 4$ with Gemma Scope 2 MLP transcoders; the steering vector is present on the residual stream throughout the prompt and generation, so MLPs act on it directly, and they find L45 MLP causally necessary. The setup here uses Pearson-Vogel’s strict Turn-1-only KV-cache protocol with single-layer injection, which forces the perturbation to reach the output exclusively via attention reading the cached K/V during Turn 2. Under that architectural constraint, attention-level patches can fully eliminate the effect, by construction. The constraint is also what makes the prompt-dependent response heads visible, i.e., the same heads reverse their contribution across framings because they are the only route for the perturbation to reach the output. The Godet-protocol localization controls (random-vector, norm-matched, and in-distribution swap) likewise have no counterpart in Macar et al. The two setups surface different components of what is likely the same underlying circuit. Convergence across model sizes within the same family strengthens the claim that the core structure (distributed detection suppressing a late-layer default-“no”) is real and not size-specific, though 27B-specific features (L45 F9959, etc.) can’t be cross-checked at the 12B scale used here.

Relationship to Lederman and Mahowald (2026)

Lederman and Mahowald (2026) make “detection without reliable identification” the central claim of their paper, framed as content-agnostic introspection. Their evidence is behavioral; this writeup’s localization controls are a mechanistic counterpart. Together they make the case across model scales and analysis levels.

Behavioral evidence from Lederman and Mahowald (2026). Across 821 concepts on Qwen3-235B-A22B and Llama 3.1 405B: (1) wrong identifications cluster on “apple” (74.8% Qwen, 21.3% Llama) regardless of the injected concept; (2) priming the concept word boosts identification far more than detection; (3) prompt-only steering preserves detection but collapses concept-specific mentions; (4) correct identifications appear ~43 words into responses while “apple” appears at ~12.

Mechanistic evidence (this work). In the Godet localization protocol on Qwen 2.5 32B Instruct, norm-matched random vectors localize as well as concept vectors at early-to-mid layers (localization): the localization signal is sensitivity to perturbation, not concept identity. An in-distribution activation swap (where the swapped-in residual is itself a real in-distribution activation, just mismatched with surface tokens) also produces above-chance localization through depth 0.45, ruling out out-of-distribution detection as the mechanism. The logit-vs-verbalization MI dissociation in the Verbalization section adds a second mechanistic point: even when concept-level information is detectable in Qwen 32B’s logits (1.49 bits MI), it doesn’t route into the model’s verbalized response.

The two lines of evidence agree: the introspective signal is an anomaly detector over residual-stream content, not a concept-specific one.

Their third-person, absurd-question, and varied-experience controls also show prompt-sensitivity throughout: different phrasings of “do you detect…” yield dramatically different rates. The mid-layer axis prefill ablation makes this point from a mechanistic angle: the injection shift tracks the prefill’s yes-leaning (+19.5pp under “The answer is”, +3pp under “My answer is”, ~0pp under prefills like “My response:”). Whatever the model is doing internally, what surfaces as “detection” at the output is heavily gated by response circuitry that reads prompt framing.

Implications

To echo Pearson-Vogel et al., if the report is prompt-dependent (the response heads), then whether a model reports internal states depends on how you ask, and a single framing may systematically overestimate or underestimate capability. Identification fails in verbalization even when logit-level MI is high, so behavioral evaluations may underestimate internal state. Both point towards underelicited capacity, i.e., the model’s outputs understate what its representations encode. Macar et al. (2026) demonstrate this directly by showing a learned bias vector raises detection by +75% on held-out concepts at 0% FPR, and refusal ablation alone adds +53%. These can be seen as amplifications of the latent capacity that this work characterizes mechanistically.

Limitations

Sample size. Only 9 concepts were tested. Per-concept variability is substantial: shifts range from +23pp (music) to −5.9pp (fear, death) with creativity flat, with cluster structure (the affective concepts share high mutual cosine at L6) but no clean concrete vs. abstract axis (the four “concrete” concepts span +19.5pp on bread down to +5.9pp on cats). With 9 concepts the distribution can’t be characterized, and Lindsey’s reported abstract-vs-concrete gradient on Opus 4.1 isn’t testable here.

Statistics. Key correlations have CIs (Fisher z on cross-prompt r, bootstrap on IT vs PT); the top-10 head-overlap p-value is analytical (hypergeometric); Godet-sweep accuracies have binomial 95% CIs. P(yes) values are single-sample (temp = 0). Verbalization results are not systematically sampled. Cross-prompt bootstrap is approximated by Fisher z because per-condition attribution arrays were not saved.
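As a rough illustration of two of the interval and p-value calculations named above, here is a minimal sketch in plain Python. The function names are hypothetical, and `n_pairs` and the overlap count of 3 are placeholder inputs rather than values from the experiments; only the 768-head population, the top-10 cutoff, and the Gemma $r = -0.61$ come from the text.

```python
import math


def fisher_z_ci(r, n, z_crit=1.959964):
    """Approximate 95% CI for a Pearson r computed over n paired observations."""
    if n <= 3:
        raise ValueError("Fisher z needs n > 3")
    z = math.atanh(r)              # Fisher z-transform of r
    se = 1.0 / math.sqrt(n - 3)    # standard error in z-space
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


def top_k_overlap_pvalue(n_heads, k, overlap):
    """Hypergeometric tail: P(at least `overlap` shared heads) when two
    top-k sets are drawn independently at random from n_heads heads."""
    total = math.comb(n_heads, k)
    tail = sum(
        math.comb(k, i) * math.comb(n_heads - k, k - i)
        for i in range(overlap, k + 1)
    )
    return tail / total


# Placeholder inputs (assumptions, not reported values).
n_pairs = 50
ci_low, ci_high = fisher_z_ci(-0.61, n_pairs)
p_overlap = top_k_overlap_pvalue(768, 10, 3)
print(f"r = -0.61, n = {n_pairs}: 95% CI [{ci_low:.2f}, {ci_high:.2f}]")
print(f"P(overlap >= 3 of top-10 among 768 heads) = {p_overlap:.2e}")
```

The Fisher interval is symmetric in z-space but not in r-space, which is why it is a reasonable stand-in when a bootstrap over saved per-condition arrays is not available.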

Experimental design and methods. Single-layer injection only; multi-layer or continuous injection might surface components not captured here. SAE analysis was performed at 16k width only; Macar’s wider MLP transcoders isolate features that 16k residual-stream SAEs do not (Appendix A.4). Comparing PT and IT shows only the combined effect of all post-training stages, so effects cannot be attributed to any individual stage. Cross-prompt head-reversal was tested on 5 framings with qualitatively different results across models (Gemma $r = -0.61$ reverses, Qwen $r = +0.07$ decorrelates between Accurate and Wrong framings).

Protocol. Lindsey suggested the QK circuit of “concordance heads” (Kamath et al., 2025) as the candidate mechanism for prefill detection. Under KV-cache injection on turn 1 only, all injected positions are shifted by the same vector, so relative QK scores barely change; this protocol cannot test the concordance-head hypothesis. A setup with position-specific injection would be needed.
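The invariance argument can be sketched with a toy attention row. This is a simplified illustration with made-up vectors: it assumes the injection acts as the same additive shift on every key, and it ignores normalization layers and non-injected positions, which is why the real effect is "barely change" rather than exactly zero.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def add(a, b):
    return [x + y for x, y in zip(a, b)]


# Toy query, keys, and shift (all values are illustrative assumptions).
query = [0.3, -1.2, 0.8]
keys = [[1.0, 0.2, -0.5], [0.1, 0.9, 0.4], [-0.7, 0.3, 1.1]]
shift = [2.0, -1.0, 0.5]

base = softmax([dot(query, k) for k in keys])

# Same shift on every key: each logit gains the constant dot(query, shift),
# and softmax is invariant to adding a constant to all logits.
uniform = softmax([dot(query, add(k, shift)) for k in keys])

# Shift on one position only: relative scores change, so attention moves.
one_position = softmax(
    [dot(query, add(k, shift) if i == 0 else k) for i, k in enumerate(keys)]
)

print("base        ", [round(p, 4) for p in base])
print("uniform     ", [round(p, 4) for p in uniform])
print("one position", [round(p, 4) for p in one_position])
```

The third case is the kind of position-specific injection that would be needed to give a QK-based mechanism something to read.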

Larger-model saturation. I also tried Gemma 3 27B IT under the same KV-cache protocol but found that P(yes) saturates at ceiling (16/20 prompt conditions) or floor (3/20) under the framing × info-document sweep, leaving only 1/20 conditions with enough headroom for injection to move the signal. On that one usable framing, injection actually decreases P(yes), most plausibly because the model interprets the “injection between brackets” framing literally and reads the residual-stream perturbation as inconsistent with that description. Logit attribution still shows large differential heads on 27B (L54H31 = 0.133, L56H4 = 0.115), so the detection capability is intact, but the yes/no signal at the output is gated by the model’s reasoning about the described setup, not just by detection. This suggests the KV-cache protocol scales poorly to larger, more agreeable models, and that introspection signals at scale should be measured via mechanistically grounded probes rather than the binary yes/no output alone.

$\alpha$-robustness and layer-choice details are in Appendices A.2 and A.1, respectively.

Appendix

A.1 Layer choice and multi-seed sweep

Gemma 12B’s strongest seed-robust injection site is layer 6 (depth fraction 0.125); Qwen 32B’s is layer 24 (0.375). Both sit earlier than Lindsey’s ~0.67, Pearson-Vogel’s 0.33–0.66, or Macar’s 0.60. Layer 6 is within Godet’s single-layer sweet spot (0.10–0.30).

A fine multi-seed sweep on bread under Accurate_Mechanism shows a clean positive peak at layer 6 (+19.5pp), a smaller positive shoulder at layer 9 (+10.9pp), a much weaker secondary peak at layer 12 (+5.9pp), and a seed-robust negative shift at layer 18 (−12.1pp; Figure 15). Layer 3 is seed-variable (+5.7pp ± 3.9 across 5 seeds). Layers 15, 21, 24+ are at zero or near zero. Whether the rest of the mid-layer axis mechanism (subspace ablation, head attribution) replicates at layers 12 and 18 is untested. The sweep is non-monotonic, and the mechanism claims stand at layer 6 but may not transfer to the negative-shift regime.

One interpretation consistent with the story so far is that multiple mechanistic paths from injection to output interact with the prompt-dependent response circuitry (the response heads) differentially depending on injection depth. Whether the negative-shift regime has its own interpretable mechanism, or is simply the result of the perturbation entering at a layer where the response heads’ readout produces a contra-injection effect, is left for future work.

Figure 15. Multi-seed layer sweep on bread ($\alpha = 10$, 5 concept-vector seeds) under Accurate_Mechanism on Gemma 3 12B IT: clean L6 peak (+19.5pp), L9 shoulder (+10.9pp), weak L12 secondary peak (+5.9pp), L18 negative (−12.1pp); error bars = ±1 std across seeds.

A.2 Injection strength ($\alpha$) robustness

Gemma 3 12B mechanistic analyses use $\alpha = 10$, one step beyond Lindsey’s tested range of $\alpha \in \{1, 2, 4, 8\}$ and above Macar et al.’s $\alpha = 4$. For the main investigation, I chose $\alpha = 10$ anticipating it would produce larger P(yes) shifts under the KV-cache protocol, where the signal has to survive the turn boundary. I then tested different strengths post hoc, and found the following:

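| $\alpha$ | Control | Injection | Shift | All-attn patched | All-MLP patched | MLP retains |
|---|---|---|---|---|---|---|
| 2 | 62.1% | 56.3% | −5.9pp | — | — | (n/a) |
| 4 | 62.1% | 85.2% | +23.0pp | 62.1% | 68.0% | 26% |
| 6 | 62.1% | 56.3% | −5.9pp | — | — | (n/a) |
| 10 | 62.1% | 81.6% | +19.5pp | 62.1% | 68.0% | 30% |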

$\alpha = 4$ produces a larger shift than $\alpha = 10$ (+23pp vs +19.5pp); at $\alpha = 4$ attention patching fully eliminates the effect by construction, with MLPs retaining about 26% of the shift. Note that the $\alpha = 2$ and $\alpha = 6$ anomalies (both −5.9pp) sit at the dyadic-fraction precision floor and may be near-zero shifts rather than genuine reversals. A 5-seed replication (5 MD5-distinct concept vectors) at both $\alpha = 4$ and $\alpha = 10$ gave identical shifts across all seeds at bf16 precision: P(yes) values bucket at dyadic fractions (0.62109375, 0.6796875, 0.81640625, 0.8515625), so true seed-to-seed variation is ≤ 0.5% but not measurable here.
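To make the precision floor concrete, the following sketch emulates bf16 rounding in plain Python (an emulation written for illustration, not the code used in the experiments) and checks that the reported P(yes) values lie on the bf16 grid. It assumes P(yes) is stored at bf16 precision, as stated above; `to_bf16` and `bf16_step` are hypothetical helper names.

```python
import math
import struct


def to_bf16(x):
    """Round a float to the nearest bfloat16 value (round-to-nearest-even).
    Emulation only: NaN and infinity are not handled."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    bias = 0x7FFF + ((bits >> 16) & 1)   # ties go to the even mantissa
    bits = (bits + bias) & 0xFFFF0000    # keep sign, exponent, 7 mantissa bits
    return struct.unpack(">f", struct.pack(">I", bits))[0]


def bf16_step(x):
    """Spacing between adjacent bf16 values around x (x > 0, normal range)."""
    _, exponent = math.frexp(x)          # x = m * 2**exponent, m in [0.5, 1)
    return 2.0 ** (exponent - 8)         # 8 significant bits incl. implicit one


# P(yes) values reported in the text.
reported = [0.62109375, 0.6796875, 0.81640625, 0.8515625]

for p in reported:
    print(p, "multiple of 1/256:", (p * 256).is_integer(),
          "bf16-exact:", to_bf16(p) == p)

print("grid spacing in [0.5, 1):", bf16_step(0.62109375))
# A perturbation smaller than half a step rounds back to the same value.
print(to_bf16(0.62109375 + 0.001) == 0.62109375)
```

In the interval [0.5, 1) adjacent bf16 values are $2^{-8} \approx 0.39$ percentage points apart, so seed-to-seed differences smaller than about half that cannot show up in a single-sample P(yes).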

A.3 Head-attribution differential distribution

The 0.005 threshold used in the response-head analysis is well inside the heavy tail of the per-head differential distribution (Figure 16). Most of the 768 heads cluster near zero, with a long tail extending to ~0.046 (L43H3). Cumulative head counts above the three thresholds tested (42 for 0.005; 19 for 0.01; 5 for 0.02) are all in this heavy tail, so the qualitative claims about the response heads (sparse late-layer dominance, framing reversal at the top of the distribution) are stable across this threshold range.

Figure 16. Per-head differential magnitude distribution on Gemma 3 12B IT under Accurate_Mechanism + Pro_Introspection_Document (768 heads, log y-axis). Dashed lines mark the three thresholds tested; the choice of cutoff is inside the heavy tail and doesn't affect the qualitative response-head characterization.

A.4 SAE feature analysis (negative result)

I decomposed the residual stream at critical layers into SAE features (Gemma Scope 2 residual-stream SAEs, 16k width) and looked for features differentially active between inject and control. Feature 3424 at layer 31 (“semantic sameness/equivalence” per Neuronpedia) fires at ~770 activation under injection and exactly 0 without. However, this turned out to be a JumpReLU threshold artifact; the hidden-state projection is right at the SAE’s activation threshold, and small perturbations push it above/below stochastically. Ablating the feature (projecting out its direction) has zero effect on P(yes); steering with it (adding its direction without injection) also has zero effect; the feature is orthogonal to the yes-no output direction (cosine = 0.02).
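The threshold artifact and the ablation check can be sketched in a few lines of plain Python. This is a toy illustration: the threshold of 770.0 is a placeholder chosen to match the ~770 activation mentioned above, and the vectors are made up; none of it is the actual SAE or model code.

```python
import math


def jumprelu(pre_activation, threshold):
    """JumpReLU: pass the pre-activation through unchanged above the
    threshold, output exactly 0 otherwise."""
    return pre_activation if pre_activation > threshold else 0.0


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def project_out(vector, direction):
    """Remove the component of `vector` along `direction` (directional ablation)."""
    norm = math.sqrt(dot(direction, direction))
    unit = [d / norm for d in direction]
    coeff = dot(vector, unit)
    return [v - coeff * u for v, u in zip(vector, unit)]


threshold = 770.0  # placeholder value, not the real feature's threshold

# A tiny change in the pre-activation flips the feature from 0 to ~770.
below = jumprelu(769.9, threshold)
above = jumprelu(770.1, threshold)
print("just below threshold:", below)
print("just above threshold:", above)

# Ablation: after projecting out a direction, nothing is left along it.
residual = [3.0, -1.0, 2.0]
feature_direction = [1.0, 1.0, 0.0]
ablated = project_out(residual, feature_direction)
print("component along feature after ablation:", dot(ablated, feature_direction))
```

Because the jump is discontinuous, a large difference in feature activation between inject and control does not by itself imply a large difference in the underlying hidden state, which is why the ablation and steering checks are the more informative tests.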

More broadly, single-feature ablation at any layer either disrupts general processing or does nothing to the introspection signal. The mechanism is distributed across many features, and 16k-width residual-stream SAEs are too coarse to isolate it.

Macar et al. (2026) isolate the equivalent mechanism on Gemma 3 27B using higher-width MLP transcoders, rather than residual-stream SAEs, and ablating batches of the top-200 gate features rather than individual features. At 16k residual-stream-SAE width with single-feature ablation, the mechanism is not isolable.

A.5 Component patching: MLP contribution by concept

During Turn 2 of the KV-cache protocol, MLPs only see the current token’s residual stream, so the Turn-1 perturbation reaches the output only through attention reading the cached K/V. To quantify how much MLPs modify the signal, all MLP outputs in Turn 2 were patched to their control values.

For the concepts tested here, MLP effects are heterogeneous with respect to injection effect size; concepts with similar injection shifts can have very different MLP residuals (Figure 17). One qualitatively striking case is cats: under MLP patching, P(yes) drops below control (from 62.1% baseline to 56.3%), so MLPs do not only amplify the injection signal but can modulate it down as well.

Figure 17. Component patching per concept on Gemma 3 12B IT at injection layer 6. Left: P(yes) under control, injection, all-attention-patched, and all-MLP-patched conditions. Right: MLP residual vs. injection effect. Diagonals mark 100% retention ($y=x$), full restoration ($y=0$), and full reversal ($y=-x$). MLP residuals are heterogeneous with respect to injection effect.
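The arithmetic behind the retention figures can be sketched as follows. This reflects my reading of the quantities involved: the MLP residual is taken as P(yes) with all MLPs patched minus P(yes) under control, and retention is that residual divided by the injection shift. The function names are hypothetical, and the inputs are the rounded percentages from the A.2 table plus the cats values quoted above.

```python
def mlp_residual(p_control, p_mlp_patched):
    """Shift that remains after patching all Turn-2 MLP outputs to control."""
    return p_mlp_patched - p_control


def retained_fraction(p_control, p_inject, p_mlp_patched):
    """Fraction of the injection shift that remains under all-MLP patching."""
    shift = p_inject - p_control
    if shift == 0:
        raise ValueError("no injection shift to normalize by")
    return mlp_residual(p_control, p_mlp_patched) / shift


# Rounded P(yes) percentages from the A.2 table.
control = 62.1
retention_alpha_4 = retained_fraction(control, 85.2, 68.0)
retention_alpha_10 = retained_fraction(control, 81.6, 68.0)
print(f"alpha = 4:  {retention_alpha_4:.1%} of the shift remains")
print(f"alpha = 10: {retention_alpha_10:.1%} of the shift remains")

# Cats: the MLP-patched P(yes) falls below control, so the residual is negative.
cats_residual = mlp_residual(control, 56.3)
print(f"cats residual: {cats_residual:+.1f}pp")
```

On the right-hand panel of Figure 17, a retained fraction of 1 sits on the $y=x$ diagonal, 0 on $y=0$, and a negative value such as the cats case falls below the horizontal axis.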

The meta-experiment: using Claude as a junior researcher

This project was an attempt to use Claude Opus 4.6 as a junior researcher whom I supervised throughout the experiments. I was inspired by my experience doing the take-home project while interviewing for the Anthropic Fellows program, for which AI assistance was allowed. It was amazing how quickly I could iterate on research tasks with Claude Code. So, when I set out to replicate the introspection phenomenon and look into its mechanisms, I thought it was a great opportunity to see how this approach would work in a less contrived setting with more open-ended research.

I gave it research directions and goals, discussed approaches, helped interpret results, and (thoroughly) reviewed and edited the writing—but Claude wrote all the code, ran the experiments, generated figures, gave initial interpretations of results (quite often wrongly/optimistically), and drafted the writeup (badly, at first).

Overall, my takeaway is that Opus 4.6 is incredibly impressive but not quite good enough to trust with larger open-ended research tasks. Its capabilities seem very jagged for this type of work.

Some categories of issues I encountered:

Scientific errors in interpretation. This was the most concerning. For example, it conflated ‘a signal represented across many dimensions in the network’ with ‘a distributed representation’ (though it was in fact a low-rank representation).
Poor writing. It is surprisingly bad at producing clear communication of research findings. Its scientific writing required a lot of editing. On balance, using Claude for drafting was probably a net gain in terms of time, but there is definitely a tradeoff with quality. The main issue is that it has trouble condensing complicated results into a narrative that the reader can follow and digest easily; it gets bogged down in details and loses the overall thrust, even when the entire draft is in recent context. Also, this may be personal preference, but I find Claude’s writerly ‘voice’ to be particularly badly suited to scientific writing, and it’s not trivial to alter that voice through prompting.
Sycophancy. Claude had a notable bias towards positive framings and exciting stories, and it lacked skepticism (despite instructions in my CLAUDE.md). This was helped by using another Claude instance as a critical reviewer, but that process nonetheless missed some key issues, and the reviewer could descend into nitpicking.
Context gap. Not knowing exactly how it was doing something resulted in a considerable efficiency tax, as I had to spend time trying to make sure it had done it the right way, and occasionally had to backtrack after realizing there was a problem with its approach. This is perhaps inherent to any project where some aspect of the research is delegated, but it was especially frustrating in the fast iteration loops here.
Elementary mistakes that an experienced human would not make. Some examples:
- Not saving intermediate data for reanalysis later, necessitating costly reruns on GPUs.
- Silently overwriting experiment outputs because the filename didn’t map uniquely to the run. I caught this only during a final consistency review while preparing this writeup, and had to re-run some experiments and re-interpret results.

I’m sure some of these issues could be ameliorated by better prompting, scaffolding infra, CLAUDE.md instructions, etc; I did put thought and effort into the project scaffolding, but my setup could no doubt be improved. As it was, I found that Claude lacked common sense in surprising ways, and I will approach future projects differently.

Acknowledgements

This work builds directly on code and protocols from Pearson-Vogel et al. (2026) (the strict KV-cache injection protocol, prompt set, and complete_logit_lens experiment scaffolding) and Godet (2025) (the localization protocol). Concept vectors were trained with Vogel’s repeng library. Thanks to all for making their work easy to extend.

Up to 84% accuracy for vague framings versus ~50% for the most mechanistically accurate framing.
I chose to use Gemma 3 12B in part because Gemma Scope 2 provides all-layer residual-stream SAEs at this scale; however, SAE feature-level analysis at 16k width did not isolate the mechanism (Appendix A.4).
I also attempted using Gemma 3 27B IT, but found that P(yes) was near ceiling in most prompt conditions; the model was too sycophantic for the Pearson-Vogel protocol. The Wrong_Mechanism framing breaks this saturation but inverts the injection effect, so it doesn’t cleanly replicate the 12B mechanism.
Their primary model is Qwen2.5-Coder-32B-Instruct; I use Qwen 2.5 32B Instruct, the non-Coder variant, so my Qwen results are analogues of theirs rather than strict replications.
Note that here, single-sample bf16 P(yes) values are quantized at dyadic fractions such that per-concept shifts that appear numerically identical (e.g., +19.5pp on bread, love, truth; mid-layer axis) reflect this ≤0.5% precision floor rather than true equality. A 5-seed replication at this precision finds zero seed-to-seed variation (Appendix A.2).
This finding is in line with Lederman and Mahowald’s (2026) “content-agnostic introspection” thesis, which uses a behavioral analysis of model output, measuring confabulation on Qwen3-235B-A22B and Llama 3.1 405B. The localization controls here provide a complementary, mechanistic analysis via activation-level controls on Qwen 2.5 32B. See Relationship to Lederman and Mahowald in Discussion.
The { } is the prefilled assistant message that injection is applied to in Pearson-Vogel et al.’s prompt template (see complete_logit_lens_prompts/*.json in their code).

References

Godet, V. (2025). Introspection via localization. LessWrong, December 2025.
Kamath, H., Ameisen, E., Kauvar, I., Luger, R., Gurnee, W., Pearce, A., Zimmerman, S., Batson, J., Conerly, T., Olah, C., & Lindsey, J. (2025). Tracing Attention Computation Through Feature Interactions. Transformer Circuits Thread, July 2025.
Lederman, H., & Mahowald, K. (2026). Emergent Introspection in AI is Content-Agnostic. arXiv:2603.05414.
Lindsey, J. (2025). Emergent Introspective Awareness in Large Language Models. Transformer Circuits Thread, October 2025.
Macar, U., Yang, L., Wang, A., Wallich, P., Ameisen, E., & Lindsey, J. (2026). Mechanisms of Introspective Awareness. arXiv:2603.21396.
Pearson-Vogel, T., Vanek, M., Douglas, R., & Kulveit, J. (2026). Latent Introspection: Models Can Detect Prior Concept Injections. arXiv:2602.20031.
Vogel, T. (2024). repeng: a library for training control vectors with PCA-centered contrastive pairs.
Zou, A., Phan, L., Chen, S., Campbell, J., Guo, P., Ren, R., Pan, A., Yin, X., Mazeika, M., Dombrowski, A.-K., Goel, S., Li, N., Byun, M. J., Wang, Z., Mallen, A., Basart, S., Koyejo, S., Song, D., Fredrikson, M., Kolter, J. Z., & Hendrycks, D. (2023). Representation Engineering: A Top-Down Approach to AI Transparency. arXiv:2310.01405.
