In Gurnee et al. (2025), Anthropic showed that language models can represent a counting variable tied to line formatting. In their setup, the model effectively tracks the distance since the previous line break, which helps it maintain consistent wrapped-text generation.
This project asks whether the same behavior appears in open-weight models and whether we can recover the underlying representation with fully open tooling. We focus on model families with different training data and scales to test how robust this phenomenon is outside the original closed-model setting. Code for this project is available at github.com/corl-team/counting_manifolds.
What We Reproduced
We reproduced the wrapped-text evaluation pipeline by converting long documents into width-limited lines and measuring newline prediction directly. We then ran layer-wise linear probes on hidden states to identify where character-offset information is most recoverable, and complemented that analysis with PCA-based geometry and manifold visualizations of the position representation. Finally, we examined sparse autoencoders (SAEs) to isolate individual features that track position.
What Is New in This Reproduction
Our main extension is breadth and comparability across model families including GPT-2, Pythia, Llama, Gemma, and Qwen, rather than focusing on a single lineage. This allows us to separate architecture and data effects from simple scale effects, and shows that model size alone does not determine newline competence. We also provide an SAE-focused analysis workflow that captures non-monotonic position features and explicitly compares those features against AutoInterp descriptions to reveal interpretation failure modes.
TL;DR
- Several modern open-weight families predict wrapped newlines reliably, while GPT-2 remains weak even at larger sizes (including GPT-2-XL), and Pythia improves sharply by the Pythia-410m scale.
- Position information is clearly present in hidden states but usually peaks in mid layers, not consistently in the earliest layers.
- SAE features recover fine-grained position structure, including hill-shaped tuning curves that encode position without simple linear trends.
- AutoInterp labels for SAE features often miss the actual mechanistic role of these features in counting.
We now follow this pipeline step by step: first the data setup and labeling scheme, then a tokenizer-agnostic newline definition, then behavioral comparisons across models, and finally hidden-state and SAE-level analysis.
- Gurnee, W., Ameisen, E., Kauvar, I., Tarng, J., Pearce, A., Olah, C., & Batson, J. (2025). When Models Manipulate Manifolds: The Geometry of a Counting Task. Transformer Circuits Thread. https://transformer-circuits.pub/2025/linebreaks/index.html
Setup
We adopted the line-wrapping setup from the original paper.
Specifically, we filtered documents from the HuggingFaceFW/fineweb dataset to include only those that (i) contained no newline characters in the raw text (i.e., each document was a single long line) and (ii) yielded at least 10 lines after wrapping.
We sampled 1,000 such documents.
We then wrapped each document into 150-character lines using Python's textwrap module and assigned each token a position index equal to the number of characters since the most recently inserted newline (\n).
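As a concrete illustration, here is a minimal sketch of this labeling step, assuming a fast HuggingFace tokenizer that supports offset mapping; the function and variable names are ours, and the document filtering described above is omitted:

```python
import textwrap
from transformers import AutoTokenizer

WIDTH = 150  # wrap width in characters, as described above
tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder tokenizer

def wrap_and_label(doc: str):
    """Wrap a single-line document and label each token with its
    character offset since the most recently inserted newline."""
    wrapped = "\n".join(textwrap.wrap(doc, width=WIDTH))
    enc = tokenizer(wrapped, return_offsets_mapping=True, add_special_tokens=False)
    positions = []
    for start, _ in enc["offset_mapping"]:
        last_nl = wrapped.rfind("\n", 0, start)  # index of the previous newline, or -1
        positions.append(start - last_nl - 1 if last_nl != -1 else start)
    return enc["input_ids"], positions
```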
Having established the dataset and labels, the next step is to define what counts as a newline prediction in a tokenizer-agnostic way.
How We Determine \n for Different Tokenizers
Comparing model families requires a tokenizer-agnostic newline definition, because tokenizers may encode line breaks as standalone tokens or merge them with spaces and other characters.
We use one rule everywhere: a token counts as a newline token if its decoded text contains at least one \n.
We apply this rule consistently in both metrics. For probability plots, we sum the next-token probability over all vocabulary items containing \n. For accuracy, a prediction is counted as correct when the predicted token contains \n at a wrapped position where a line break is expected.
This does not separate pure newline tokens from mixed tokens (newline plus extra text), but it directly matches the question we care about: does the model place a line break at the right location, independent of tokenizer boundaries?
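A minimal sketch of this rule and the two metrics, assuming a HuggingFace tokenizer and a next-token probability vector; the helper names are illustrative rather than the project's exact code:

```python
import torch
from transformers import PreTrainedTokenizer

def newline_token_ids(tokenizer: PreTrainedTokenizer) -> list[int]:
    """Vocabulary ids whose decoded text contains at least one newline."""
    return [i for i in range(len(tokenizer)) if "\n" in tokenizer.decode([i])]

def newline_probability_mass(next_token_probs: torch.Tensor, nl_ids: list[int]) -> float:
    """Total next-token probability assigned to newline-containing tokens."""
    return next_token_probs[nl_ids].sum().item()

def newline_top1_correct(next_token_probs: torch.Tensor, tokenizer: PreTrainedTokenizer) -> bool:
    """At an expected wrap position, the top-1 prediction counts as correct
    if its decoded text contains a newline."""
    predicted_id = int(next_token_probs.argmax())
    return "\n" in tokenizer.decode([predicted_id])
```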
Which Models Can Predict \n Correctly?
In our first experiment, we tested whether open-source language models could correctly predict newlines in wrapped text. We evaluated a range of model sizes from the GPT-2 and Pythia families, as well as the more recent Gemma-2, Gemma-3, and Qwen3 families. We report newline prediction accuracy using the tokenizer-agnostic newline rule described in the previous section.
The family-level comparison shows a clear quantitative split. GPT-2 stays low on newline accuracy, ranging from 12.3% to 19.7% across Small through XL. Pythia shows a sharp transition between Pythia-160m and Pythia-410m (36.1% to 63.8%) and reaches 66.4% at 1.8B. Gemma is consistently strong (Gemma-3: 63.9%-75.1%, Gemma-2: 71.1%-78.1%), and Qwen3 is strong from 1.7B upward (60.7%-67.8%; 0.6B is 39.0%).
This pattern suggests that scale alone is not enough to learn line-position counting. A direct comparison is GPT-2-XL (19.7%) versus Pythia-410m (63.8%): a gap of more than 44 points in favor of the much smaller model. The effect appears to depend more on training data composition and formatting exposure.
Accuracy shows whether the top prediction is correct at wrap points. To see how strongly models lean toward newline even when it is not top-1, we next inspect total newline probability mass.
The probability-mass view supports the same conclusion at a finer level. Models that learned the wrapped-text pattern assign substantial next-token probability to newline-containing tokens at expected wrap points, while weaker models fail to align that mass as consistently.
This is also reflected in aggregate probabilities at expected wrap positions: GPT-2 remains low (0.053-0.099), while stronger models are much higher (for example, Pythia-410m at 0.391, Gemma-2-9B at 0.575, and Qwen3-8B at 0.550). Taken together, the accuracy and probability evidence indicates a genuine representational gap in how strongly different model families internalize line-break structure.
Given this behavioral gap, the next question is representational: where in the hidden states is line-position information stored, and how compact is that signal?
Predicting Token Position
We next ask where line-position information is most accessible in the network. At each layer, we fit a linear probe that predicts character offset since the last \n from hidden states, and report R². This is correlational rather than causal, but it localizes the layers where the position signal is strongest.
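A minimal sketch of the layer-wise probe, assuming hidden states have already been collected for each layer along with the character-offset labels; Ridge regression is our illustrative choice of linear probe, not necessarily the exact estimator used:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

def probe_r2_per_layer(hidden_states: np.ndarray, offsets: np.ndarray) -> list[float]:
    """hidden_states: (n_layers, n_tokens, d_model); offsets: (n_tokens,).
    Returns one held-out R^2 score per layer."""
    scores = []
    for layer_h in hidden_states:
        X_tr, X_te, y_tr, y_te = train_test_split(
            layer_h, offsets, test_size=0.2, random_state=0
        )
        probe = Ridge(alpha=1.0).fit(X_tr, y_tr)
        scores.append(probe.score(X_te, y_te))  # R^2 on held-out tokens
    return scores
```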
Starting with the default view, Llama3.1-8B reaches a peak probe R² of 0.791 at layer 5, indicating strong recoverability of token position in mid layers. More broadly, peak R² is 0.868 for Gemma-2-9B (layer 11), 0.852 for Qwen3-8B (layer 5), 0.744 for Pythia-160m (layer 3), and 0.724 for Pythia-410m (layer 6). GPT-2 is substantially weaker, peaking at 0.363 (layer 6), while Gemma-3-4B-pt reaches 0.491 (layer 18).
In contrast to Anthropic’s original result, where peaks appeared in very early layers, the best layer is never in the first two layers for any tested model here. Across models, best layers are 3, 5, 5, 6, 6, 11, and 18 (median 6), so the position signal is usually early-to-mid (layers 3-6), with later-layer exceptions in Gemma-2-9B (11) and Gemma-3-4B-pt (18).
Next, for the best-performing layer, we checked whether this “counting signal” lives in a compact subspace. We averaged the hidden state for each position, ran PCA, and reported cumulative explained variance for the first n principal components. The variance curves suggest that top-3 components already capture most of the signal in most models, so we use a top-3 elbow by default. We keep two explicit exceptions: Gemma-2-9B uses top-4 because PC4 adds a large jump (93.9% to 97.6%), while GPT-2 is left without a fixed elbow because its spectrum is weaker and less cleanly separated.
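A minimal sketch of this subspace check, assuming the per-position mean hidden states at the chosen layer are already available; names are illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA

def cumulative_pca_variance(position_means: np.ndarray, n_components: int = 6) -> np.ndarray:
    """position_means: (n_positions, d_model), the mean hidden state per character offset.
    Returns cumulative explained-variance ratios for the first n principal components."""
    pca = PCA(n_components=n_components).fit(position_means)
    return np.cumsum(pca.explained_variance_ratio_)

# usage: cum_var = cumulative_pca_variance(position_means); cum_var[2] -> PC1-3, cum_var[5] -> PC1-6
```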
For the default Llama3.1-8B view (layer 5), PC1-3 already explains 97.1% of variance, and PC1-6 explains 98.0%, so most structure is concentrated in the first three components. Across models, PC1-3 explains 93.9%-98.2% for Gemma-2, Gemma-3, Qwen3, and both Pythias (78.8% for GPT-2), while PC1-6 explains 96.6%-98.7% for those stronger models (81.7% for GPT-2). Following the original Anthropic setup, we inspect both 3D component groups (PC1-3 and PC4-6) in the projection plots below.
These manifold views are consistent with the probe and PCA results: models with high linear recoverability show smooth, position-ordered trajectories, while weaker models show less coherent geometry.
Why Do We See This Shape?
At its core, the rotation-direction check asks one question: as token position increases, does the point rotate in a consistent direction? You can imagine each position as a point on a clock face. We compute an angle for each point, compare consecutive angles, and average those local turns. If that average is positive, the rotation is counterclockwise; if negative, it is clockwise. If the local turns keep the same sign most of the time, the helix is clean; if the signs flip often, the geometry is noisy.
To connect this directly to the clock-style view from the addition and feature-level mechanistic literature (Kantamneni & Tegmark, 2025; Nanda et al., 2023), we use the following angle-difference metric in the PC1-PC2 plane:

$$\theta_p = \operatorname{atan2}(\mathrm{PC2}_p,\ \mathrm{PC1}_p), \qquad \Delta\theta_p = \operatorname{wrap}\left(\theta_{p+1} - \theta_p\right) \in (-\pi, \pi]$$

Here, $\Delta\theta_p > 0$ means counterclockwise rotation and $\Delta\theta_p < 0$ means clockwise rotation. Sign consistency is the fraction of steps whose $\Delta\theta_p$ has the same sign as the mean turn $\overline{\Delta\theta}$. In this metric, Llama3.1-8B (layer 5) rotates counterclockwise with 99.2% sign consistency, and Qwen3-8B is similar at 97.7%. Gemma-2-9B and GPT-2 rotate clockwise instead (80.0% and 57.7% sign consistency), which shows that helix direction is model-dependent even when the manifold remains position-ordered.
We then fit the unwrapped phase with a linear model and read off the period from the slope:

$$\theta^{\mathrm{unwrapped}}_p \approx \omega\, p + \phi_0, \qquad T = \frac{2\pi}{|\omega|}$$

This gives fitted periods of 154.7 tokens for Llama3.1-8B, 154.2 for Qwen3-8B, 155.1 for Gemma-2-9B (with the opposite rotation sign), and 177.1 for GPT-2. Notably, the fitted period for the stronger models is about 155, which matches the wrapped-string length scale in this setup.
| Model | Best layer | Rotation (PC1-2) | Sign consistency | Fitted period (tokens) |
|---|---|---|---|---|
| Llama3.1-8B | 5 | Counterclockwise | 99.2% | 154.7 |
| Qwen3-8B | 5 | Counterclockwise | 97.7% | 154.2 |
| Gemma-2-9B | 11 | Clockwise | 80.0% | 155.1 |
| GPT-2 | 6 | Clockwise | 57.7% | 177.1 |
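For concreteness, here is a minimal sketch of how these rotation and period statistics can be computed from the PC1-2 projection of the position means; the variable names are ours and the exact implementation may differ:

```python
import numpy as np

def rotation_stats(pc12: np.ndarray) -> tuple[float, float, float]:
    """pc12: (n_positions, 2) projection of the position means onto PC1 and PC2.
    Returns the mean local turn, sign consistency, and fitted period in tokens."""
    theta = np.arctan2(pc12[:, 1], pc12[:, 0])
    # wrap consecutive angle differences into (-pi, pi]
    dtheta = np.angle(np.exp(1j * np.diff(theta)))
    mean_turn = dtheta.mean()  # > 0: counterclockwise, < 0: clockwise
    consistency = float((np.sign(dtheta) == np.sign(mean_turn)).mean())
    # fit the unwrapped phase with a line; the period is 2*pi / |slope|
    slope, _ = np.polyfit(np.arange(len(theta)), np.unwrap(theta), 1)
    period = 2 * np.pi / abs(slope)
    return float(mean_turn), consistency, float(period)
```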
Relative to the original Anthropic result, our reproduction more often shows the cleanest helix in PC1-3, while PC4-6 is usually less consistent, matching the variance concentration in the first three components.
Next, we move to SAEs to ask which individual features carry the position signal.
- Kantamneni, S., & Tegmark, M. (2025). Language Models Use Trigonometry to Do Addition. arXiv Preprint arXiv:2502.00873. 10.48550/arXiv.2502.00873
- Nanda, N., Rajamanoharan, S., Kramár, J., & Shah, R. (2023). Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level. In Alignment Forum. https://www.alignmentforum.org/posts/iGuwZTHWb6DFY3sKB/fact-finding-attempting-to-reverse-engineer-factual-recall
Finding Position Features in SAEs
Next, we look for individual SAE features that encode character offset in open-source SAEs. We use GemmaScope and LlamaScope (He et al., 2024; Lieberum et al., 2024), together with GPT-2 SAEs from Gao et al. (2025), for this analysis. For each model, we take the SAE from the layer with the highest probe R², compute the mean feature activation at each position, and then select features whose activation patterns vary strongly with position.
Some features activate only within a narrow range of positions, producing “hill-shaped” tuning curves (for example, a bump around positions 5–10). Such features can encode position very well even when their activation is non-monotonic across the position axis. Ranking features by activation variance across positions captures these patterns more reliably than a monotonic-trend criterion would.
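A minimal sketch of this selection step, assuming the mean SAE activation per position has already been computed; the shapes and names are illustrative:

```python
import numpy as np

def top_position_features(mean_acts: np.ndarray, k: int = 20) -> np.ndarray:
    """mean_acts: (n_positions, n_features), the mean SAE activation per character offset.
    Ranks features by activation variance across positions, which also surfaces
    hill-shaped (non-monotonic) tuning curves that a linear trend would miss."""
    variance = mean_acts.var(axis=0)
    return np.argsort(variance)[::-1][:k]  # indices of the k most position-sensitive features
```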
For the default Llama3.1-8B view (layer 5), the top variance-ranked features are 2424, 72720, 73327, 126269, and 4041. The default synced feature 126269 is rank 4 and is clearly non-monotonic, peaking near position 62. For Gemma-2-9B (layer 11), the top features are 124789, 114007, 862, 41464, and 44341, and the default feature 44341 is also non-monotonic, peaking near 69. GPT-2 (layer 6) still has position-sensitive features (104860, 72264, 6666, 117724, 22611), but their tuning is less clean and more edge-biased.
Across the top-20 features, peak positions span almost the full range (0-149) in all three models, which suggests a distributed bank of position-tuned features rather than a single scalar counter.
AutoInterp descriptions are often only partially aligned with this position-tracking behavior, so the activation curves are still necessary for interpretation. Excluding undefined early positions, feature ranking in the view below is computed from position-wise activation variance at the selected layer.
Finally, we test whether the identified SAE features actually preserve the same geometry as the full residual stream. For each position, we take the mean hidden state at the selected layer, then compare two trajectories in PCA space: (1) the original full-state trajectory, and (2) the trajectory obtained after projecting to the span of the selected SAE features. We use the same PCA coordinates for both views, so direct overlap means the selected features capture the position manifold well, while systematic gaps indicate position information that is still outside this SAE subspace.
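A minimal sketch of this comparison, assuming access to the per-position mean hidden states and the decoder directions of the selected SAE features; we use a least-squares projection onto their span, and the identifiers are illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA

def compare_trajectories(position_means: np.ndarray, feature_dirs: np.ndarray, n_components: int = 6):
    """position_means: (n_positions, d_model); feature_dirs: (n_selected, d_model),
    the SAE decoder rows of the selected features.
    Returns the full-state and SAE-span trajectories in shared PCA coordinates."""
    # least-squares projection of each mean state onto the span of the selected decoder directions
    coeffs, *_ = np.linalg.lstsq(feature_dirs.T, position_means.T, rcond=None)
    projected = (feature_dirs.T @ coeffs).T
    # shared PCA coordinates fitted on the full states, so both trajectories are directly comparable
    pca = PCA(n_components=n_components).fit(position_means)
    return pca.transform(position_means), pca.transform(projected)
```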
Excluding undefined early positions, reconstruction quality is strong for Llama3.1-8B and Gemma-2-9B at their best layers: the main and SAE-subspace trajectories nearly overlap in PC1-5 (with weaker agreement in PC6 for Llama). GPT-2 is visibly less stable in this comparison, with larger gaps in the middle components, matching its weaker probe results in the previous section.
This closes the chain from behavior to hidden-state geometry to feature-level mechanisms, which we summarize in the final section.
- Gao, L., la Tour, T. D., Tillman, H., Goh, G., Troll, R., Radford, A., Sutskever, I., Leike, J., & Wu, J. (2025). Scaling and evaluating sparse autoencoders. The Thirteenth International Conference on Learning Representations. https://openreview.net/forum?id=tcsZt9ZNKD
- He, Z., Shu, W., Ge, X., Chen, L., Wang, J., Zhou, Y., Liu, F., Guo, Q., Huang, X., Wu, Z., Jiang, Y.-G., & Qiu, X. (2024). Llama Scope: Extracting Millions of Features from Llama-3.1-8B with Sparse Autoencoders. arXiv Preprint arXiv:2410.20526. 10.48550/arXiv.2410.20526
- Lieberum, T., Rajamanoharan, S., Conmy, A., Smith, L., Sonnerat, N., Varma, V., Kramár, J., Dragan, A., Shah, R., & Nanda, N. (2024). Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2. BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP. 10.48550/arXiv.2408.05147
Results Summary
Across open models, token position is linearly recoverable from hidden states, but both strength and layer location vary by architecture. The table below summarizes the main quantitative results used throughout this post.
| Model | Best layer | Max probe R² | PC1-3 variance at best layer | SAE-span manifold comparison |
|---|---|---|---|---|
| Llama3.1-8B | 5 | 0.791 | 97.1% | Strong alignment (main vs span) |
| Gemma-2-9B | 11 | 0.868 | 93.9% | Strong alignment (main vs span) |
| Qwen3-8B | 5 | 0.852 | 95.3% | Not available in this SAE comparison |
| Gemma-3-4B-pt | 18 | 0.491 | 98.2% | Not available in this SAE comparison |
| GPT-2 | 6 | 0.363 | 78.8% | Weaker alignment than Llama/Gemma-2 |
| Pythia-160m | 3 | 0.744 | 95.4% | Not available in this SAE comparison |
| Pythia-410m | 6 | 0.724 | 97.4% | Not available in this SAE comparison |
For newline prediction quality, results also vary strongly by model family and training setup: Gemma-2-9B reaches 77.9% exact-match on newline tokens, Qwen3-8B reaches 62.5%, and Pythia-410m reaches 63.8%, while GPT-2 family models remain much lower (12.3%-19.7%).
Final Thoughts
Open-weight models do learn internal signals that track position, but where this signal is most recoverable is not consistent across architectures. In our runs, it usually peaked around mid layers rather than the earliest ones. SAE analysis then lets us move from a coarse layer-level statement to specific feature-level mechanisms, and many of the strongest position features show non-monotonic, hill-shaped tuning curves instead of a simple linear trend with position. At the same time, automatic feature labeling remains fragile: AutoInterp descriptions are often plausible-sounding but still miss the actual role of these features in position tracking.
Overall, this behavior appears across most modern LLM families we tested, and it can be studied end-to-end with publicly available models, datasets, and interpretability tooling.