BLOG
Hasith Vattikuti along with others at Trainloop
In an effort to work with the garage door up (Matuschak), this is a collection of some interesting discoveries we made while trying to study how new skills are internalized during fine-tuning. This is not meant to be a full-fledged study, but rather an entry point for asking further questions.
With all the effort that has been going into developing better post-training algorithms for continual learning, we wanted to see if we could improve fine-tuning methods by studying the training dynamics of LoRA adapters. At a high level, we were interested in the following questions:
1. What are the differences between the internal representations of the model before and after fine-tuning?
2. Does the model exhibit distinct "phases" of learning?
3. Are there multiple ways to learn the new skill? What is the geometry of the parameter space where a skill can be said to be "learned"?
Here, we explore the third question, saving the first two for follow-up work. As a minimal model of learning, we chose to study LoRA adapters due to their robustness (Schulman et al.), their simplicity to train, and their ability to give results with only low-rank updates, which could reveal something interesting about the parameter updates that are necessary to learn a skill.
Training Details
Inspired by Morris et al., we used Qwen/Qwen2.5-3B for all our experiments because it is small enough to iterate on quickly. We focused on very low-rank LoRA adapters with the following configs:
All three ranks converged to an average reward of 0.85-0.87, which matches the performance in Morris et al., and we also trained a shorter, 500-step, r=8 version as a baseline, which also achieved the same score. While they all achieved around the same score, the higher ranks converged to it faster.
The LoRA adapters were applied to all linear layers [1] in the base model except the embedding and decoding layers: [q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj].
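As a concrete sketch of what a LoRA adapter on one of these linear layers does, here is a minimal numpy illustration (not our training code; the dimensions are hypothetical, and in practice B is zero-initialized so training starts from the base model):

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 64, 8   # hypothetical layer dims and LoRA rank

W = rng.normal(size=(d_out, d_in))        # frozen base weight
A = rng.normal(size=(r, d_in)) * 0.01     # trainable "down" factor
B = rng.normal(size=(d_out, r)) * 0.01    # trainable "up" factor (zero-init in practice)

dW = B @ A                                # this layer's weight update
x = rng.normal(size=(d_in,))
y = (W + dW) @ x                          # forward pass: base output + low-rank correction

# dW has the full (d_out, d_in) shape but rank at most r.
assert dW.shape == W.shape
assert np.linalg.matrix_rank(dW) <= r
```

Only A and B are trained, so the adapter adds 2 * r * 64 parameters per layer here instead of 64 * 64.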
Findings
Since each model [2] has a different rank of LoRA adapters on it, each has a different number of learnable parameters, so it is difficult to make consistent comparisons across the three ranks. However, we can instead consider the effective update.
For each adapted layer $\ell$, the LoRA factors $A_\ell \in \mathbb{R}^{r \times d_{\text{in}}}$ and $B_\ell \in \mathbb{R}^{d_{\text{out}} \times r}$ produce the weight update $\Delta W_\ell = B_\ell A_\ell$. Then, because the size of $\Delta W_\ell$ is $d_{\text{out}} \times d_{\text{in}}$ regardless of the rank $r$, updates from runs with different ranks live in the same space. We denote the effective update as the concatenated collection of flattened adapters throughout the model. Namely, a rank $r$ effective update at training step $t$ is

$$\Delta\theta_r(t) = \bigoplus_{\ell} \operatorname{vec}\big(B_\ell(t)\, A_\ell(t)\big).$$

Note that $\Delta\theta_r(t)$ has the same dimension for every rank $r$, so trajectories from different LoRA configs are directly comparable.
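The effective update can be assembled by flattening each layer's B @ A product and concatenating across layers. A toy sketch with two hypothetical layer shapes, showing that the resulting vector's length is independent of the rank:

```python
import numpy as np

rng = np.random.default_rng(0)
shapes = [(64, 64), (64, 128)]  # hypothetical (d_out, d_in) of the adapted layers

def make_adapters(r):
    """One (A, B) pair per adapted layer at rank r."""
    return [(rng.normal(size=(r, d_in)), rng.normal(size=(d_out, r)))
            for d_out, d_in in shapes]

def effective_update(adapters):
    """Flatten each layer's B @ A product and concatenate into one vector."""
    return np.concatenate([(B @ A).ravel() for A, B in adapters])

u1 = effective_update(make_adapters(r=1))
u8 = effective_update(make_adapters(r=8))

# The effective update's dimension depends only on the layer shapes,
# never on the LoRA rank, so different ranks can be compared directly.
assert u1.shape == u8.shape == (64 * 64 + 64 * 128,)
```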
Now that we can consistently compare across LoRA configs, we can run PCA to visualize their training trajectories in parameter space (we chose PCA because, as an optimal linear compression method, it handles the low-rank representations well).
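The PCA step amounts to stacking the effective-update vectors from every checkpoint of every run and taking the top singular directions. A toy sketch with synthetic trajectories standing in for the real checkpoints (the dimensions and the planted 2-D structure are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 512, 90  # hypothetical: effective-update dimension, checkpoints across runs

# Synthetic effective updates lying near a 2-D plane plus small noise.
plane = np.linalg.qr(rng.normal(size=(d, 2)))[0]       # orthonormal basis of a plane
coords = rng.normal(size=(n, 2)) * 5.0                 # in-plane coordinates
X = coords @ plane.T + rng.normal(size=(n, d)) * 0.05  # checkpoints + noise

Xc = X - X.mean(axis=0)                 # center before PCA
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = S**2 / np.sum(S**2)         # variance ratio per component
proj = Xc @ Vt[:2].T                    # 2-D coordinates to plot trajectories

assert explained[:2].sum() > 0.9        # the top-2 plane dominates
```

When the checkpoints really do concentrate near a plane, the first two components capture most of the variance, which is exactly the diagnostic used below.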

While the projection onto the first two principal components captures over 90% of the variance in the set of all effective updates (across training steps and across the different models), constructing the basis from a single model's training trajectory reveals that the three runs move in mutually orthogonal directions. Additionally, tracking how far each run moves from its initialization shows that they all move away from each other at approximately the same rate. Also note how small the effective update weights are compared to the base weights that they modify.

So, the 2D PCA projection very much resembles projecting 3 mutually orthogonal vectors onto a plane.

This plane presents itself as a sort of subspace of optimal parameters toward which GRPO steered the LoRA runs, so we felt it was worthwhile to study it to learn about the task's geometry as represented in the LoRA parameter space.


A natural question to ask at this point is just how large this region of good LoRA parameters is, and the answer is that it is huge. Due to compute constraints, here is about a third of that good region:

To verify the robustness of this region, we took a couple of samples from it and found that we could truncate them down to just rank-1 adapters with no loss in performance. In fact, they often did better than anything we actually trained. So, this region really is a space of genuine, low-rank effective updates that instill the skill of GSM8K into the model.
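Truncating an adapter to rank 1 just means keeping the top singular direction of each layer's update. A sketch of that truncation using numpy's SVD (the update matrix here is synthetic, with a deliberately dominant first direction):

```python
import numpy as np

def truncate_rank1(dW):
    """Best rank-1 approximation of a layer's update (Eckart-Young theorem)."""
    U, S, Vt = np.linalg.svd(dW, full_matrices=False)
    return S[0] * np.outer(U[:, 0], Vt[0])

rng = np.random.default_rng(0)
# Hypothetical rank-8 update whose spectrum is dominated by one direction.
B = rng.normal(size=(64, 8)) * np.array([1.0] + [0.05] * 7)  # scale the 8 columns
A = rng.normal(size=(8, 64))
dW = B @ A

dW1 = truncate_rank1(dW)
rel_err = np.linalg.norm(dW - dW1) / np.linalg.norm(dW)

assert np.linalg.matrix_rank(dW1) == 1
assert rel_err < 0.5  # little is lost when one direction dominates
```

If the trained updates behave like this synthetic one, i.e. with one dominant singular direction per layer, rank-1 truncation discards very little, which is consistent with the preserved performance we observed.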

Together, the findings above show that a large space of parameters that "solve" GSM8K up to a good baseline on Qwen2.5-3B consists entirely of very low-rank updates. Furthermore, this large region of parameters was discovered very naturally, by simply computing the directions in which the GRPO algorithm varied the updates the most. These two pieces of information hint that learning GSM8K may inherently require only a low-rank update.
Discussion & Open Questions
In summary, by projecting the effective updates of the models throughout training onto just their first two principal components, we discovered a very robust region of low-rank solutions to GSM8K.
These results likely indicate that GSM8K is an extremely easy problem for models to fine-tune to, at least up to about 87% accuracy, and that its solutions in parameter space are surprisingly easy for LoRA to discover.
Our findings, while not strong enough to make a general conclusion across different tasks and models at various scales, still reveal some interesting research directions.
First, we have the obvious question: how well does this scale? If we use higher LoRA ranks, will the number of principal components needed to capture above 90% of the variance grow slower than O(n)? What about other optimization algorithms, other tasks, larger models, etc.?
Then, we have the more surgical questions:
- How can we make lower ranks converge faster? Is converging faster equivalent to taking a more direct path to the PCA plane?
- How far can we push rank-1 training? Can we selectively concentrate it around just a few modules to reduce the parameter count even further (see [1])?
- What is changing internally in the model throughout training? Can we directly observe knowledge being formed?
We are currently working on a follow-up work that is meant to explore the third bullet point.
Footnotes
One interesting thing that we found is that having adapters on only the middle third of the model performs almost exactly as well as having adapters on all layers, whereas having adapters on only the first or last third of the model performs worse. We didn't look too far into this, but we hypothesize that it is related to the hydra effect (McGrath et al.) and the redundancy in the middle layers of the model (Lawson & Aitchison). The more redundant layers are possibly more "malleable" and can learn new information more easily.
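Restricting adapters to the middle third can be expressed as a list of fully-qualified module names. A sketch of how one might build that list (module naming follows the usual Hugging Face convention for Qwen2-style models, and 36 decoder layers is an assumption about Qwen2.5-3B's depth):

```python
# Restrict LoRA to the middle third of the decoder stack.
NUM_LAYERS = 36  # assumed depth of Qwen2.5-3B
ATTN = ["q_proj", "k_proj", "v_proj", "o_proj"]
MLP = ["gate_proj", "up_proj", "down_proj"]

middle = range(NUM_LAYERS // 3, 2 * NUM_LAYERS // 3)  # layers 12..23
target_modules = [f"model.layers.{i}.self_attn.{p}" for i in middle for p in ATTN]
target_modules += [f"model.layers.{i}.mlp.{p}" for i in middle for p in MLP]

assert len(target_modules) == 12 * 7  # 12 layers x 7 linear projections
```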
Model, here and for the rest of the document, refers to the base model with a set of LoRA adapters applied to it.
References
Matuschak, A. (n.d.). Work with the garage door up. Andy Matuschak’s working notes.
Morris, J. X., Mireshghallah, N., Ibrahim, M., & Mahloujifar, S. (2026). Learning to reason in 13 parameters. arXiv.
McGrath, T., Rahtz, M., Kramár, J., Mikulik, V., & Legg, S. (2023). The Hydra effect: Emergent self-repair in language model computations. arXiv.
Lawson, T. S., & Aitchison, L. (2025). Learning to skip the middle layers of transformers. arXiv.
Schulman, J. (2025, September 29). LoRA without regret. Thinking Machines.

