On a held-out clinical treatment reasoning and response prediction task, a single OAPL policy update produced a usable model from an effectively unusable baseline, slightly outperforming a matched single-step GRPO run.
Group Relative Policy Optimization (GRPO) is often the first tool we reach for when training a model for a customer via reinforcement learning. GRPO is an online training loop: you generate from the current policy, score generations within a group, update, and repeat. Parts of that process can be batched, but the algorithm still assumes the data comes from the current or very recent policy, which is why rollouts (generations from the policy model) are usually done live during training. An offline training loop, in contrast, is one that can train on data generated entirely in advance, often not even by your model.
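To make the "score generations within a group" step concrete, here is a minimal sketch of group-relative scoring as described above; the function name, the epsilon, and the toy rewards are ours, for illustration only.

```python
# Minimal sketch of the GRPO scoring step, assuming we already have one prompt's
# group of rollouts and a scalar reward for each. Names are illustrative.
from statistics import mean, pstdev

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Score each rollout relative to its own group: (reward - group mean) / group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four rollouts from the *current* policy for one prompt, already scored by the reward.
rewards = [1.0, 0.0, 1.0, 0.0]
print(group_relative_advantages(rewards))  # rollouts above the group mean get positive advantage
```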
OAPL is an RL training approach that is similar in many ways to GRPO. It differs in that it treats the rollout policy as a lagged reference and optimizes directly against data from that older policy: the model generating rollouts is not updated until a specified sync interval. That sync interval, it turns out, can be quite long, meaning the algorithm supports training on a substantial amount of off-policy data.
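As a rough picture of that structure, here is a schematic loop with stand-in objects; DummyPolicy, SYNC_INTERVAL, and the prompts are placeholders rather than the actual OAPL implementation.

```python
# Schematic of a lagged-reference training loop: rollouts come from an older
# snapshot of the policy, which is only refreshed every SYNC_INTERVAL steps.
import copy

class DummyPolicy:
    def __init__(self):
        self.version = 0
    def generate(self, prompt):
        return f"rollout from policy v{self.version} for {prompt!r}"
    def update(self, batch):
        self.version += 1  # stand-in for a gradient step

SYNC_INTERVAL = 10  # the policy lag L

policy = DummyPolicy()                   # the policy being trained
rollout_policy = copy.deepcopy(policy)   # lagged snapshot that generates rollouts

for step in range(1, 101):
    batch = [rollout_policy.generate(p) for p in ("prompt_a", "prompt_b")]
    policy.update(batch)                 # train against data from the older snapshot
    if step % SYNC_INTERVAL == 0:
        rollout_policy = copy.deepcopy(policy)  # sync: the lag resets here
```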

Training models that continuously improve from your data
One problem with GRPO is that you need to simulate an environment that looks like the tasks you're training for and run the model through that environment during training. In a world where models constantly improve from the real-world interactions they encounter, it's much better to train on the real world than on a simulated one. Our question was: can we collect real-world customer traces, generate and score groups after the fact, and do so over a large number of traces (a requirement if we want to avoid constantly deploying a new model to production)?
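Concretely, the "generate and score groups after the fact" part could look roughly like this; build_offline_groups, the trace format, and the minimum group size are illustrative assumptions, not a description of our production pipeline.

```python
# Hedged sketch of offline group construction: take logged traces, bucket them by
# prompt, and keep only prompts with enough rollouts to form a usable group.
from collections import defaultdict

def build_offline_groups(traces, min_group_size=4):
    """traces: iterable of (prompt, response, reward) collected after the fact."""
    groups = defaultdict(list)
    for prompt, response, reward in traces:
        groups[prompt].append((response, reward))
    return {p: rs for p, rs in groups.items() if len(rs) >= min_group_size}

traces = [
    ("prompt_a", "response 1", 1.0),
    ("prompt_a", "response 2", 0.0),
    ("prompt_b", "response 3", 1.0),
]
print(build_offline_groups(traces, min_group_size=2).keys())  # only prompt_a forms a group
```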
We tested that question with OAPL, a new training strategy from Databricks, on a text-only binary clinical prediction task. The model was presented with clinical narratives and instructed to reason and then predict whether progression-free survival crossed a 180-day threshold given a treatment plan. We post-trained Qwen 3.5 27B to specialize in this task, comparing one-step OAPL against a matched GRPO setup. We then stress-tested multi-step OAPL under different amounts of policy lag.
On this task, the answer looks like yes: OAPL let us post-train the model in a single step, with all rollouts generated from the original policy.
Learning from off-policy data
OAPL is interesting here for a practical reason: it is designed to learn from rollouts generated by an older reference policy, rather than requiring every update to stay tightly coupled to freshly sampled data. That matters if the workflow we actually want looks more like this: generate a batch, score it offline, run a compact training job, evaluate, and ship.
The reward also punished unparseable outputs heavily, so some of the early gains came from learning the response contract, not just the label itself.
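As a sketch of what "punished unparseable outputs heavily" might look like in code (the answer-tag contract and the exact penalty values here are illustrative, not the reward we actually used):

```python
# Illustrative reward: parse failures are penalized hard, so early training is
# partly about learning the output format, not just getting the label right.
import re

def reward(response: str, label: str) -> float:
    match = re.search(r"<answer>(yes|no)</answer>", response.lower())
    if match is None:
        return -1.0  # unparseable output: heavy penalty (assumed value)
    return 1.0 if match.group(1) == label else 0.0

print(reward("...reasoning...<answer>yes</answer>", "yes"))  # 1.0
print(reward("I think the answer is probably yes", "yes"))   # -1.0, nothing parseable
```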
The first threshold: can one update do enough?
Before training, the baseline model was effectively unusable on the 50-example held-out slice used for the single-step runs. It only produced parseable answers about 10% of the time, and none of those parseable answers were correct.
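For reading the numbers that follow, this is roughly how we treat parseability and end-to-end correctness; extract_label is a hypothetical stand-in for the real parser.

```python
# Sketch of the evaluation metrics: parse rate counts any parseable answer,
# end-to-end correctness counts parse failures as wrong.
def extract_label(response: str) -> str | None:
    text = response.lower()
    for cand in ("yes", "no"):
        if f"<answer>{cand}</answer>" in text:
            return cand
    return None

def evaluate(responses: list[str], labels: list[str]) -> tuple[float, float]:
    parsed = [extract_label(r) for r in responses]
    parse_rate = sum(p is not None for p in parsed) / len(parsed)
    correctness = sum(p == y for p, y in zip(parsed, labels)) / len(labels)
    return parse_rate, correctness
```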
Single-step results

Results of a single step from different configurations. OAPL with a single step over 1,024 rollouts did best, scoring 4 percentage points higher than a GRPO step with the same 1,024 rollouts. (Evaluated on 50 held-out samples.)
The matched one-step comparison with GRPO is encouraging but not decisive. OAPL landed at 48% end-to-end correctness, compared with 44% for GRPO. That slightly favors OAPL, but the held-out slice is small enough that we should be careful. The safer conclusion is that OAPL looks at least competitive on quality while fitting the batch-first workflow better.
There were two other useful signals in the single-step runs.
First, covering more prompts in the update seemed to matter more than simply throwing far more rollouts at the same basic configuration. Moving from the smaller setup to the 1,024-rollout setup helped. Pushing all the way to 4,096 rollouts did not really move end-to-end correctness beyond the best 1,024-rollout result.
Second, ten steps did not beat one. That’s not a verdict on OAPL per se, but it is a warning that the hyperparameters for a strong first update were not automatically safe for longer training.
A one-step result answers the workflow question in a narrow sense, but it does not tell you whether the method stays stable when you train longer. So we ran a second set of experiments: 100 training steps, same basic setup, and different sync intervals for the reference policy.
When we refreshed the reference every step — the fully on-policy case, or L=1 — the run looked strong early. Training accuracy briefly climbed above 90%. Then the model collapsed into unparseable outputs around step 27 and never recovered. At the tested learning rate, multi-step OAPL with L=1 was unstable.
Policy Lag
The fix tied back to the reason OAPL was interesting in the first place: policy lag.
Instead of forcing the reference policy to refresh every step, we let it lag. We tested sync intervals of 10, 50, and 100 steps. In plain language, that means the model kept training for a while against rollout data generated by an older snapshot.
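One way to see what those intervals imply over a 100-step run is simply to count mid-run refreshes (counting a refresh at every step that is a multiple of L, as in the schematic loop earlier; the real scheduling may differ):

```python
# Count how many times each sync interval refreshes the rollout snapshot before
# a 100-step run finishes (illustrative counting convention).
total_steps = 100
for L in (1, 10, 50, 100):
    refreshes = sum(1 for step in range(1, total_steps) if step % L == 0)
    print(f"L={L:<3} mid-run refreshes: {refreshes}")
# Under this counting, L=100 never refreshes mid-run, so the whole run trains on
# rollouts from the starting policy.
```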
Policy-lag results

Results of 100-step OAPL runs with different sync intervals. A sync interval of 100 dramatically outperforms tighter sync intervals. (Evaluated on 200 held-out samples.)
This table changed how we thought about the problem.
The first takeaway is straightforward: lagged-reference OAPL was dramatically safer than on-policy OAPL. The difference between L=1 and L>1 was not subtle. One version collapsed. The others stayed stable enough to produce clearly better held-out performance.
The second takeaway is that the training dynamics made sense once we accepted that the rollout data was stale. With L=10 and L=50, metrics tended to flatten or drift between syncs and then jump after the reference was refreshed. That is exactly the pattern you would expect if the model is improving while the training stream lags behind it.
The third takeaway is an interesting one: L=100 produced the best final held-out result even though its training reward looked worse during the run. That suggests two things. First, heavy lag did not destabilize OAPL here. Second, stale-rollout training reward is not a reliable proxy for final policy quality when the lag is large.
We do not think that proves “always use L=100.” This was one task and one seed. But it does show that large policy lag is not automatically a problem, and may even act like a useful regularizer in some settings.
What we actually learned
The main result is not that OAPL crushed GRPO on raw quality, but that the two methods fit different operating models. OAPL was designed for learning from rollouts generated by a lagged inference policy, with infrequent syncs and without extra importance-weighting or clipping tricks to make stale data look fresh, while GRPO was designed for online training.
That gives us a clear next application for OAPL in training reasoning tasks: situations where rollouts can be generated in bulk, graded offline, and turned into compact adapter updates on a predictable schedule.