BLOG
Hasith Vattikuti along with others at Trainloop
In an effort to work with the garage door up (Matuschak), this is a collection of some interesting discoveries we made while trying to study how new skills are internalized during fine-tuning. This is not meant to be a full-fledged study, but rather an entry point for asking further questions.
With all the effort that has been going into developing better post-training algorithms for continual learning, we wanted to see if we could improve fine-tuning methods by studying the training dynamics of LoRA adapters. At a high level, we were interested in the following questions:
1. What are the differences between the internal representations of the model before and after fine-tuning?
2. Does the model exhibit distinct "phases" of learning?
3. Are there multiple ways to learn the new skill? What is the geometry of the parameter space where a skill can be said to be "learned"?
Here, we explore the third question, saving the first two for follow-up work. As a minimal model of learning, we chose to study LoRA adapters due to their robustness (Schulman et al.), their simplicity to train, and their ability to give results with only low-rank updates, which could reveal something interesting about the parameter updates that are necessary to learn a skill.
Training Details
Inspired by Morris et al., we used Qwen/Qwen2.5-3B for all our experiments because it is small enough to iterate on quickly. We focused on very low-rank LoRA adapters with the following configs:
All three ranks converged to an average reward of 0.85-0.87, which matches the performance in Morris et al., and we also trained a shorter, 500-step, r=8 version as a baseline, which also achieved the same score. While they all achieved around the same score, the higher ranks converged to it faster.
The LoRA adapters were applied to all linear layers [1] in the base model except the embedding and decoding layers: [q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj].
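As a concrete sketch of what a LoRA adapter on one of these linear layers does, here is a minimal numpy illustration (not our training code; the dimensions are hypothetical, and in practice B is zero-initialized so training starts from the base model):

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 64, 8   # hypothetical layer dims and LoRA rank

W = rng.normal(size=(d_out, d_in))        # frozen base weight
A = rng.normal(size=(r, d_in)) * 0.01     # trainable "down" factor
B = rng.normal(size=(d_out, r)) * 0.01    # trainable "up" factor (zero-init in practice)

dW = B @ A                                # this layer's weight update
x = rng.normal(size=(d_in,))
y = (W + dW) @ x                          # forward pass: base output + low-rank correction

# dW has the full (d_out, d_in) shape but rank at most r.
assert dW.shape == W.shape
assert np.linalg.matrix_rank(dW) <= r
```

Only A and B are trained, so the adapter adds 2 * r * 64 parameters per layer here instead of 64 * 64.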
Findings
Since each model [2] has a different rank of LoRA adapters on it, each has a different number of learnable parameters, so it is difficult to make consistent comparisons across the three ranks. However, we can instead consider the effective update.
For each adapted layer $\ell$, the LoRA factors $A_\ell \in \mathbb{R}^{r \times d_{\text{in}}}$ and $B_\ell \in \mathbb{R}^{d_{\text{out}} \times r}$ produce the weight update $\Delta W_\ell = B_\ell A_\ell$. Then, because the size of $\Delta W_\ell$ is $d_{\text{out}} \times d_{\text{in}}$ regardless of the rank $r$, updates from runs with different ranks live in the same space. We denote the effective update as the concatenated collection of flattened adapters throughout the model. Namely, a rank $r$ effective update at training step $t$ is

$$\Delta\theta_r(t) = \bigoplus_{\ell} \operatorname{vec}\big(B_\ell(t)\, A_\ell(t)\big).$$

Note that $\Delta\theta_r(t)$ has the same dimension for every rank $r$, so trajectories from different LoRA configs are directly comparable.
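The effective update can be assembled by flattening each layer's B @ A product and concatenating across layers. A toy sketch with two hypothetical layer shapes, showing that the resulting vector's length is independent of the rank:

```python
import numpy as np

rng = np.random.default_rng(0)
shapes = [(64, 64), (64, 128)]  # hypothetical (d_out, d_in) of the adapted layers

def make_adapters(r):
    """One (A, B) pair per adapted layer at rank r."""
    return [(rng.normal(size=(r, d_in)), rng.normal(size=(d_out, r)))
            for d_out, d_in in shapes]

def effective_update(adapters):
    """Flatten each layer's B @ A product and concatenate into one vector."""
    return np.concatenate([(B @ A).ravel() for A, B in adapters])

u1 = effective_update(make_adapters(r=1))
u8 = effective_update(make_adapters(r=8))

# The effective update's dimension depends only on the layer shapes,
# never on the LoRA rank, so different ranks can be compared directly.
assert u1.shape == u8.shape == (64 * 64 + 64 * 128,)
```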
Now that we can consistently compare across LoRA configs, we can run PCA to visualize their training trajectories in parameter space (we chose PCA because, as an optimal linear compression method, it handles the low-rank representations well).
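The PCA step amounts to stacking the effective-update vectors from every checkpoint of every run and taking the top singular directions. A toy sketch with synthetic trajectories standing in for the real checkpoints (the dimensions and the planted 2-D structure are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 512, 90  # hypothetical: effective-update dimension, checkpoints across runs

# Synthetic effective updates lying near a 2-D plane plus small noise.
plane = np.linalg.qr(rng.normal(size=(d, 2)))[0]       # orthonormal basis of a plane
coords = rng.normal(size=(n, 2)) * 5.0                 # in-plane coordinates
X = coords @ plane.T + rng.normal(size=(n, d)) * 0.05  # checkpoints + noise

Xc = X - X.mean(axis=0)                 # center before PCA
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = S**2 / np.sum(S**2)         # variance ratio per component
proj = Xc @ Vt[:2].T                    # 2-D coordinates to plot trajectories

assert explained[:2].sum() > 0.9        # the top-2 plane dominates
```

When the checkpoints really do concentrate near a plane, the first two components capture most of the variance, which is exactly the diagnostic used below.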

While the projection onto the first two principal components captures over 90% of the variance in the set of all effective updates (across training steps and across the different models), constructing the basis from a single model's training trajectory reveals that the three runs move in mutually orthogonal directions. Additionally, tracking how far each run moves from its initialization shows that they all move away from each other at approximately the same rate. Also note how small the effective update weights are compared to the base weights that they modify.

So, the 2D PCA projection very much resembles projecting 3 mutually orthogonal vectors onto a plane.

This plane presents itself as a sort of subspace of optimal parameters toward which GRPO steered the LoRA runs, so we felt it was worthwhile to study it to learn about the task's geometry as represented in the LoRA parameter space.


A natural question to ask at this point is just how large this region of good LoRA parameters is, and the answer is that it is huge. Due to compute constraints, here is about a third of that good region:

To verify the robustness of this region, we took a couple of samples from it and found that we could truncate them down to just rank-1 adapters with no loss in performance. In fact, they often did better than anything we actually trained. So, this region really is a space of genuine, low-rank effective updates that instill the skill of GSM8K into the model.
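Truncating an adapter to rank 1 just means keeping the top singular direction of each layer's update. A sketch of that truncation using numpy's SVD (the update matrix here is synthetic, with a deliberately dominant first direction):

```python
import numpy as np

def truncate_rank1(dW):
    """Best rank-1 approximation of a layer's update (Eckart-Young theorem)."""
    U, S, Vt = np.linalg.svd(dW, full_matrices=False)
    return S[0] * np.outer(U[:, 0], Vt[0])

rng = np.random.default_rng(0)
# Hypothetical rank-8 update whose spectrum is dominated by one direction.
B = rng.normal(size=(64, 8)) * np.array([1.0] + [0.05] * 7)  # scale the 8 columns
A = rng.normal(size=(8, 64))
dW = B @ A

dW1 = truncate_rank1(dW)
rel_err = np.linalg.norm(dW - dW1) / np.linalg.norm(dW)

assert np.linalg.matrix_rank(dW1) == 1
assert rel_err < 0.5  # little is lost when one direction dominates
```

If the trained updates behave like this synthetic one, i.e. with one dominant singular direction per layer, rank-1 truncation discards very little, which is consistent with the preserved performance we observed.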

Together, the findings above show that a large space of parameters that "solve" GSM8K up to a good baseline on Qwen2.5-3B consists entirely of very low-rank updates. Furthermore, this large region of parameters was discovered very naturally, by simply computing the directions in which the GRPO algorithm varied the updates the most. These two pieces of information hint that learning GSM8K may inherently require only a low-rank update.
Discussion & Open Questions
In summary, by projecting the effective updates of the models throughout training onto just their first two principal components, we discovered a very robust region of low-rank solutions to GSM8K.
These results likely indicate that GSM8K is an extremely easy problem for models to fine-tune to, at least up to about 87% accuracy, and that its solutions in parameter space are surprisingly easy for LoRA to discover.
Our findings, while not strong enough to make a general conclusion across different tasks and models at various scales, still reveal some interesting research directions.
First, we have the obvious question: how well does this scale? If we use higher LoRA ranks, will the number of principal components needed to capture above 90% of the variance grow slower than O(n)? What about other optimization algorithms, other tasks, larger models, etc.?
Then, we have the more surgical questions:
- How can we make lower ranks converge faster? Is converging faster equivalent to taking a more direct path to the PCA plane?
- How far can we push rank-1 training? Can we selectively concentrate it around just a few modules to reduce the parameter count even further (see [1])?
- What is changing internally in the model throughout training? Can we directly observe knowledge being formed?
We are currently working on a follow-up work that is meant to explore the third bullet point.
Footnotes
One interesting thing that we found is that having adapters on only the middle third of the model performs almost exactly as well as having adapters on all layers, whereas having adapters on only the first or last third of the model performs worse. We didn't look too far into this, but we hypothesize that it is related to the hydra effect (McGrath et al.) and the redundancy in the middle layers of the model (Lawson & Aitchison). The more redundant layers are possibly more "malleable" and can learn new information more easily.
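Restricting adapters to the middle third can be expressed as a list of fully-qualified module names. A sketch of how one might build that list (module naming follows the usual Hugging Face convention for Qwen2-style models, and 36 decoder layers is an assumption about Qwen2.5-3B's depth):

```python
# Restrict LoRA to the middle third of the decoder stack.
NUM_LAYERS = 36  # assumed depth of Qwen2.5-3B
ATTN = ["q_proj", "k_proj", "v_proj", "o_proj"]
MLP = ["gate_proj", "up_proj", "down_proj"]

middle = range(NUM_LAYERS // 3, 2 * NUM_LAYERS // 3)  # layers 12..23
target_modules = [f"model.layers.{i}.self_attn.{p}" for i in middle for p in ATTN]
target_modules += [f"model.layers.{i}.mlp.{p}" for i in middle for p in MLP]

assert len(target_modules) == 12 * 7  # 12 layers x 7 linear projections
```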
Model, here and for the rest of the document, refers to the base model with a set of LoRA adapters applied to it.
References
Matuschak, A. (n.d.). Work with the garage door up. Andy Matuschak’s working notes.
Morris, J. X., Mireshghallah, N., Ibrahim, M., & Mahloujifar, S. (2026). Learning to reason in 13 parameters. arXiv.
McGrath, T., Rahtz, M., Kramár, J., Mikulik, V., & Legg, S. (2023). The Hydra effect: Emergent self-repair in language model computations. arXiv.
Lawson, T. S., & Aitchison, L. (2025). Learning to skip the middle layers of transformers. arXiv.
Schulman, J. (2025, September 29). LoRA without regret. Thinking Machines.

