We rebuilt knowledge-work agents as coding agents and measured them on APEX-Agents, Mercor's benchmark for long-horizon professional work. The simple coding agents scored 25% higher, were over 2x faster and less expensive, and set a new SOTA on the benchmark. We then applied the same approach to a real-world logistics agent, finding that the coding-native rewrite reduced failures by 80% and cost by 40%.
In his 1990 essay “The Dynamo and the Computer,” economist Paul A. David explored why computers had failed to deliver measurable productivity gains despite a decade of investment. His answer came from an earlier technological shift: the introduction of the dynamo (electric motor) in manufacturing. Early factories ran every machine off a central line shaft. But with the dynamo, each machine could be powered by its own electric motor.
Initially, the efficiency gains were modest: better speed control and more space between stations. The real gains came later, when operators realized that the layout of the factory floor no longer needed to be designed around the line shaft. Factories were redesigned, and dramatically more productive assembly lines followed.
AI agents are at a similar stage. Companies are using agents to augment human workflows. But should the workflows themselves change to become agent-native instead?
To test this, we redesigned real-world knowledge-work agents as coding agents and measured them on Mercor's APEX-Agents benchmark: 480 expert-graded knowledge-work tasks across 33 simulated worlds in investment banking, law, and consulting. APEX-Agents was built to capture the messy middle of professional work: long horizons, dozens of interlinked files, and real software. Tasks are graded against pass/fail criteria written by domain experts and evaluated at scale by an LM judge.
The reference harness for APEX-Agents is Archipelago, an MCP gateway exposing nine typed servers: mail, calendar, chat, documents, spreadsheets, presentations, PDFs, filesystem, and code. These servers mirror the tools knowledge workers use every day.
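To make "typed servers" concrete, here is a minimal sketch of what a single call through such a gateway looks like at the protocol level, written as the Python dict behind an MCP `tools/call` request. The tool name and argument shape are illustrative placeholders, not Archipelago's actual schema:

```python
# Hypothetical MCP tool call to the spreadsheets server.
# Tool name and arguments are illustrative, not Archipelago's real schema.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "read_range",
        "arguments": {"file": "model.xlsx", "sheet": "DCF", "range": "B2:F40"},
    },
}
```

Every action the agent takes must be expressed as one of these typed calls; the contrast with the code-native approach appears below.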
Over a series of trials, we found that coding agents score much higher and are over twice as efficient. In one trial, Sonnet 4.6 jumped from 23.7% to 42% (+18pp) just by swapping the MCP-based harness for Claude Code. In ablations with Kimi K2.5, we measured a series of performance and efficiency gains from switching to Kimi's native coding harness and removing all typed MCP servers, ultimately achieving 3x higher reward-per-token compared with Archipelago.
Finally, we validated this finding on a real-world agent purpose-built for logistics. After replacing its human-modeled system with a directory of files and Claude Code, failures dropped 80% and cost per task dropped 40%.
In this post, we’ll argue three things:
1. Your logistics/legal/banking/accounting agent should be a coding agent.
2. Any problem can be reframed as a coding problem, and any environment as a coding environment.
3. The model and harness have been RL-trained together so extensively that your custom tools are unnecessary.
Code-native harnesses dramatically impact performance, even for knowledge work
We tested Kimi K2.5 on APEX-Agents through four scaffolds, progressing from Archipelago, to Claude Code (with MCP-based tools), to Kimi-cli (with MCP), and finally to Kimi-cli with only bash, read, and write. Each step changes how the model approaches tasks. Archipelago mirrors human workflows: collecting context from drive, PDFs, and email, and producing outputs in PowerPoint, Excel, and Word docs. The native coding agent removes these 80+ tools and 9 MCP servers in favor of Python execution: to search a PDF, the agent imports pdftotext, and for spreadsheets it uses openpyxl instead of the spreadsheets server and sheets_server.read_range.
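As a minimal sketch of that substitution (file names, sheet names, and ranges here are placeholders), the code-native path is ordinary scripting with off-the-shelf libraries:

```python
import pdftotext                    # replaces the PDF MCP server
from openpyxl import load_workbook  # replaces sheets_server.read_range

# Extract all text from a PDF so it can be searched with plain string ops.
with open("contract.pdf", "rb") as f:
    pages = pdftotext.PDF(f)
full_text = "\n\n".join(pages)

# Read a spreadsheet range directly instead of issuing a typed tool call.
wb = load_workbook("model.xlsx", data_only=True)
ws = wb["Revenue"]
rows = [[cell.value for cell in row] for row in ws["B2:F10"]]
```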
Comparing the code-native agent to Archipelago, we see a dramatic improvement.
All coding agents outperform the default knowledge-work agent.
Kimi K2.5 performs better in its native coding harness than in Claude Code.
Python code execution outperforms MCP-based tool use.
Token use is cut by 60%. The final coding agent uses ~612K tokens on average vs 1.53M originally, and runs in half the wall-clock time (4m 20s vs 9m 1s).
In the coding-agent scaffolds, the models stay on a "happy path" that closely mirrors their training data. Infrastructure errors and retry loops also become less of a problem: the model can write and validate calls against familiar Python APIs far more comfortably than against MCP tool specifications.
Validating the findings on a real production agent
After seeing these results, we replicated them in a real-world customer use case.
We began with the customer's agent, which acts as a logistics coordinator: quoting, booking, replying to carriers, updating shipments. The existing implementation followed current best practices of agent engineering: common workflows were decomposed into custom tools, and a long, detailed system prompt explained how to escalate accessorial charges, what to do when a quote is missing a weight, and other nuances that experienced logistics coordinators understand intuitively.
As an experiment, we rebuilt the whole system as a barebones coding agent. Building on Terminus 2, we translated their custom tools into Python scripts and moved the SOPs into a knowledge base for the agent.
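As an illustration of what that translation looks like (the tool name, endpoint, and fields below are hypothetical sketches, not the customer's actual code), a custom quoting tool might become a small script the agent invokes from bash:

```python
#!/usr/bin/env python3
"""get_quote.py -- hypothetical sketch of a custom quoting tool
rewritten as a plain script a coding agent can call from bash."""
import argparse
import json

import requests  # assumes the carrier API is a simple HTTP endpoint


def main() -> None:
    parser = argparse.ArgumentParser(description="Fetch a shipping quote.")
    parser.add_argument("--origin", required=True)
    parser.add_argument("--destination", required=True)
    parser.add_argument("--weight-lbs", type=float, required=True)
    args = parser.parse_args()

    # Placeholder endpoint; the real integration details differ.
    resp = requests.post(
        "https://api.example-carrier.com/quotes",
        json={
            "origin": args.origin,
            "destination": args.destination,
            "weight_lbs": args.weight_lbs,
        },
        timeout=30,
    )
    resp.raise_for_status()
    print(json.dumps(resp.json(), indent=2))


if __name__ == "__main__":
    main()
```

The agent runs it like any other command, e.g. `python get_quote.py --origin CHI --destination DAL --weight-lbs 1200`, and failures surface as ordinary tracebacks it already knows how to debug.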
We then evaluated the new agent in a complete mock of the production environment and database. Compared with the original logistics agent, the simple coding agent had 80% fewer failed runs, and was 40% cheaper per run. Beyond that, we post-trained specialized models in both harnesses. The coding agent converged 20% faster to a higher score than the original agent.
Are there hidden benefits to MCP?
The quality of artifacts produced by knowledge-work agents can't be fully evaluated by binary criteria alone. For example, a slide deck that looks unprofessional but technically contains the right information would still be unacceptable for most firms. The MCP servers expose a higher layer of abstraction than the core Python libraries, and can result in better-formatted artifacts than those created with raw Python alone.
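To make the abstraction gap concrete, compare the low-level Python route with what a typed server could offer instead. This is a minimal sketch; the high-level MCP call named in the comment is hypothetical:

```python
# Low-level path: the agent assembles a slide piece by piece with python-pptx.
# A typed MCP server might instead expose one high-level call such as
# presentations.add_bullet_slide(title=..., bullets=...) that applies the
# firm's template and styling automatically (tool name hypothetical).
from pptx import Presentation

prs = Presentation()
slide = prs.slides.add_slide(prs.slide_layouts[1])  # "Title and Content" layout
slide.shapes.title.text = "Q3 Revenue Summary"
slide.placeholders[1].text_frame.text = "Revenue up 12% quarter-over-quarter"
prs.save("summary.pptx")
```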
Additionally, raw Python code may give models more opportunities to reward-hack during training. For example, if the grading criteria check for the presence of certain explanations, we teach the model to over-explain its solution, stuffing in every bit of context in the hope of extra reward.
If you train against the binary criteria alone, you're probably not going to get a lovely, extensible spreadsheet that can do sensitivity analysis, and you may similarly miss a beautifully formatted slide deck.
Ultimately, models should learn both paths: to correctly utilize MCP contracts specified by the application providers, and to write code which produces artifacts that professional firms can be proud of.
The harness is the RL environment
We used to think of the model and harness as separate things: the model is the brain, while the harness is the hands and wiring. But now look at what the labs ship: Codex. Claude Code. Gemini CLI. Underneath, all three are architected around bash, read-file, and write-file.
The labs spent unimaginable amounts of compute teaching their models to do useful work using exactly these primitives, against hundreds of thousands of seeded environments. This explains the high scores of coding agents on APEX-Agents: code is the language these models are most familiar with.
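A hedged sketch of what "architected around bash, read-file, and write-file" means in practice: the entire tool surface can be a dispatch over three functions. Names, signatures, and limits here are illustrative, not any lab's actual harness:

```python
import subprocess
from pathlib import Path

# Illustrative primitives; real harnesses add output truncation,
# sandboxing, and much more careful error handling.


def bash(command: str) -> str:
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=120
    )
    return result.stdout + result.stderr


def read_file(path: str) -> str:
    return Path(path).read_text()


def write_file(path: str, content: str) -> str:
    Path(path).write_text(content)
    return f"wrote {len(content)} bytes to {path}"


TOOLS = {"bash": bash, "read_file": read_file, "write_file": write_file}
```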
Next year’s model will be better at bash than this one. It will not be better at your QuoteClient tool. The gap widens every release, so you should build for what the model is being trained against, not for what you wish it knew.

