Cursor Composer 2 puts the focus on long-horizon coding
Cursor Composer 2 is not just another update to an AI coding model. What makes it interesting is the combination of three things that usually do not improve at the same time. It pushes coding quality higher, lowers cost compared with several premium rivals, and introduces a training approach aimed at one of the hardest problems in agentic software engineering: staying useful across very long task chains.
That last point is the real story. Many coding assistants look impressive on short tasks such as writing a function, fixing a clear bug, or explaining a code snippet. Real software work is different. A useful coding agent often has to inspect multiple files, reason about prior changes, test alternative approaches, interpret terminal output, and recover from dead ends. Once that process stretches across dozens or even hundreds of actions, context management becomes a bottleneck. Cursor Composer 2 is built around that constraint.
For developers, that means the model is worth looking at for a more practical reason than benchmark screenshots. It is designed for agentic coding workflows where the model needs to keep going, preserve the right details, and avoid losing the thread halfway through a difficult engineering task.
What Cursor says Composer 2 improves
According to Cursor, Composer 2 delivers large gains on the coding benchmarks the company tracks, including Terminal-Bench 2.0 and SWE-bench Multilingual. Cursor also says this release is the first built on a continued pretraining run, which gave the company a stronger base model before scaling reinforcement learning on top of it.
That detail matters because it changes the training recipe. Earlier iterations relied on reinforcement learning applied to an existing base. Composer 2 adds a stronger underlying foundation and then trains for longer-horizon coding behavior. In other words, the company is not only tuning how the model answers. It is also strengthening the model being tuned.
Cursor prices Composer 2 at $0.50 per million input tokens and $2.50 per million output tokens. It also offers a faster variant with the same intelligence at $1.50 per million input tokens and $7.50 per million output tokens. Cursor says fast is becoming the default option. That pricing positions Composer 2 as a model that is trying to compete on capability and operating cost at the same time.
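Those rates are easy to turn into a back-of-the-envelope session cost. In the sketch below, only the per-million-token prices come from Cursor; the token counts for a "long session" are hypothetical.

```python
# Back-of-the-envelope session costs at Composer 2's published rates.
# The per-million-token prices come from Cursor; the token counts for a
# long agentic session are hypothetical.
def session_cost(input_tokens, output_tokens, in_rate, out_rate):
    """Dollar cost given per-million-token input/output rates."""
    return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate

# Example: 2M input tokens (file reads, terminal output, summaries) and
# 200k output tokens (edits, plans) in one session.
standard = session_cost(2_000_000, 200_000, 0.50, 2.50)  # $1.50
fast = session_cost(2_000_000, 200_000, 1.50, 7.50)      # $4.50
```

Input tokens dominate in agentic work, since the model re-reads files and transcripts far more than it writes, which is why the input rate matters most here.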
Why long-horizon tasks break many coding agents
To understand why Composer 2 matters, it helps to look at what goes wrong in long coding sessions. Most developers have already seen a version of this problem in practice. A model starts well, explores the repository, forms a decent plan, edits several files, and then begins to drift. It forgets an earlier constraint. It repeats work it already did. It loses track of a failed experiment and reintroduces a bad path. Or it misses a crucial piece of terminal feedback from 20 turns earlier.
This is not only a reasoning problem. It is also a memory and context problem.
Agentic coding generates long trajectories. The model reads files, proposes edits, runs commands, interprets errors, updates its plan, and tries again. As the interaction grows, it approaches the model’s context limit. Once that happens, the system needs some form of compaction. If it does nothing, the model cannot continue. If it compacts badly, the model continues with partial memory and degraded judgment.
Traditional approaches tend to use one of two methods. The first is prompted summarization, where a separate step compresses the earlier conversation into a shorter text summary. The second is a sliding window, where older context is dropped and only recent tokens remain visible. Both approaches can work, but both can also discard information that later turns out to be important.
That tradeoff becomes more painful as tasks get harder. On a shallow task, losing a little context may not matter. On a deep software engineering task, it can be the difference between convergence and failure.
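The two traditional strategies can be sketched in a few lines. This is an illustrative simplification, not any particular product's implementation; the message format and token counting are assumptions.

```python
# Minimal sketches of the two traditional compaction strategies.
# Message format and token counting are simplified assumptions.

def sliding_window(messages, budget, count_tokens):
    """Keep only the most recent messages that fit in the token budget."""
    kept, used = [], 0
    for msg in reversed(messages):
        cost = count_tokens(msg)
        if used + cost > budget:
            break  # everything older is dropped, important or not
        kept.append(msg)
        used += cost
    return list(reversed(kept))

def prompted_compaction(messages, summarize):
    """Replace the transcript with a model-written recap (an extra call)."""
    return [{"role": "system", "content": summarize(messages)}]
```

Both lose information by construction: the window drops old turns wholesale, and the prompted recap keeps only what the summarizer happens to deem important at that moment.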
Self-summarization is the key idea behind Composer 2
Cursor’s main technical claim is that Composer 2 has been trained to summarize its own working context as part of the reinforcement learning process. The company calls this self-summarization.
Instead of treating compaction as a separate external trick, Cursor brings it into training itself. When the model reaches a token threshold, it is prompted to summarize the current state of the work. That summary then becomes part of the condensed context used for the next stage of the task. The condensed context includes not just a recap of what happened, but also task state such as the current plan, remaining work, and prior summarization history.
The important part is that these summaries are not a side product. They are part of the behavior being optimized. During training, multiple generations can be chained together by self-summaries rather than handled as a single prompt-and-response pair. The final reward signal covers the full trajectory. That means a useful summary is indirectly rewarded because it helps the model solve the broader task, and a poor summary is penalized because it causes later failure.
This is a meaningful shift in design. Instead of assuming that any generic summary will do, Cursor is trying to teach the model what information is worth preserving in the middle of a real coding process.
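Cursor has not published its implementation, but the inference-time loop it describes might look roughly like the sketch below. Every name, message format, and the threshold here are assumptions for illustration only.

```python
# Illustrative agent loop with self-summarization as Cursor describes it:
# when the transcript nears a token threshold, the same model is prompted
# to compress its own working state, and that summary (plus prior summary
# history) becomes the condensed context for the next stage. All names,
# formats, and the threshold are assumptions, not Cursor's code.

def run_agent(task, model, tools, count_tokens, threshold=100_000):
    context = [{"role": "user", "content": task}]
    summaries = []  # prior summarization history, carried forward
    while True:
        if count_tokens(context) > threshold:
            prompt = context + [{"role": "user", "content":
                "Summarize the current state: plan, remaining work, key findings."}]
            summaries.append(model.generate(prompt)["content"])
            context = [{"role": "system", "content": "\n\n".join(summaries)}]
        action = model.generate(context)
        if action.get("done"):
            return action["content"]
        observation = tools.run(action)  # execute a command, apply an edit
        context += [action, {"role": "tool", "content": observation}]
```

During training, the reward for the whole trajectory flows through these summaries, which is what turns summary quality into a learned behavior rather than a fixed prompt.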
Why this training approach matters in practice
There is a practical reason self-summarization is more interesting than it first sounds. In many agent systems, summarization is treated as overhead. It is necessary, but it costs tokens and often degrades performance. Cursor’s argument is that summarization can become a learned skill rather than a patch.
In its research writeup, the company compares self-summarization with a tuned prompt-based compaction baseline. The baseline uses a long summarization prompt with many structured instructions and produces compacted contexts that average more than 5,000 tokens. By contrast, Composer’s self-summaries reportedly average around 1,000 tokens because the model has learned to identify high-value information more efficiently.
Cursor says this reduces compaction error by 50 percent while using far fewer tokens. If that result holds in broader usage, it has two major implications. First, the model can preserve more useful state across a long task. Second, it can do so with lower token overhead, which matters for responsiveness and cost.
This is especially relevant in real codebase work, where the expensive part is often not one answer but the entire sequence of exploration, testing, revision, and follow through. A model that wastes less context budget on redundant recap can spend more of its budget on useful work.
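Cursor's reported averages make the overhead difference easy to quantify, because the condensed context is re-read on every subsequent turn. Only the ~5,000 versus ~1,000 token averages come from the writeup; the turn count below is hypothetical.

```python
# Compacted context is re-read on every subsequent turn, so summary size
# compounds. Only the per-summary averages come from Cursor's writeup;
# the number of post-compaction turns is hypothetical.
baseline_summary = 5_000   # tokens, tuned prompt-based baseline
learned_summary = 1_000    # tokens, learned self-summaries

turns_after_compaction = 50  # hypothetical

baseline_overhead = baseline_summary * turns_after_compaction  # 250,000
learned_overhead = learned_summary * turns_after_compaction    # 50,000
saved = baseline_overhead - learned_overhead                   # 200,000 input tokens
```

At input-token prices, that recovered budget is pure margin: the same session either costs less or spends more of its context on actual code and terminal output.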
Benchmark performance is only part of the story
Composer 2 has been presented as a strong benchmark step for Cursor. Reporting around the launch highlighted that the model scored 61.7 percent on Terminal-Bench 2.0, above Anthropic’s Claude Opus 4.6 at 58.0 percent, while still trailing OpenAI’s GPT 5.4 at 75.1 percent. On cost, Composer 2 is positioned much lower than some premium coding models, which makes the comparison more favorable for teams that care about efficiency as much as raw peak performance.
Those numbers are useful, but they should be read carefully. Benchmarks help show direction, especially when they test more realistic agent behavior in a terminal environment. But they are still controlled evaluations. Developers should treat them as indicators of capability, not as guarantees of day-to-day outcomes across every repository and workflow.
Still, Terminal-Bench 2.0 is more relevant than lightweight code-generation tests because it measures multi-step software engineering behavior. That makes Composer 2’s result notable. It suggests Cursor is improving where coding agents actually tend to struggle.
A case study that shows the ambition
One of the more revealing examples from Cursor’s research post is a Terminal-Bench task known as make doom for mips. The assignment sounds simple on the surface but requires a chain of engineering steps, testing, debugging, and adaptation. Cursor says an early research checkpoint of Composer was able to solve it correctly over 170 turns.
That is the kind of example that gives context to the long-horizon claim. The point is not that one model can compile Doom in a quirky environment. The point is that solving such a task requires the model to sustain coherent behavior over a long sequence of actions while preserving the right intermediate decisions. Cursor says the model self-summarized more than 100,000 tokens down to about 1,000 tokens that it considered most useful for continuing the work.
How Composer 2 fits into the broader coding model market
Composer 2 also says something about where the coding assistant market is heading. For a while, the dominant pattern was simple model access inside an editor. Then the focus moved toward agents that can inspect repositories, use tools, and execute commands. Now the pressure is shifting again toward persistence across longer workflows.
That shift changes what counts as a competitive edge. Raw intelligence still matters. Speed still matters. Price still matters. But long-horizon reliability is becoming a separate category. A model that is excellent in short bursts but degrades badly over time may be less useful than a slightly weaker model that remains coherent for 100 turns.
Cursor has another advantage here because its product environment gives it a direct path from usage data to agent design. Since Cursor is model agnostic, users can compare models in similar workflows and let the product route tasks through different options. That can help the company refine where its own model is strongest, especially in balancing intelligence, latency, and cost.
What developers should pay attention to before adopting it
If you are evaluating Cursor Composer 2, the most important question is not whether it beats another model on a chart. The better question is where it saves time in your actual development loop.
There are a few areas worth testing closely.
Repository exploration
See how well the model builds an accurate picture of a medium-sized or large codebase. Strong long-horizon behavior starts with good early exploration.
Plan retention
Check whether it remembers decisions made 20 or 50 turns earlier, especially after a few file edits and terminal runs.
Error recovery
Good agents do not just generate code. They recover from wrong assumptions without losing the whole thread.
Token efficiency
If self-summarization works as described, the benefit should appear not only in quality but also in reduced waste during long sessions.
Stability across difficult tasks
Try tasks that require setup, debugging, refactoring, and verification rather than isolated code generation. That is where the architecture is supposed to matter most.
The bigger idea behind Composer 2
The most important thing about Cursor Composer 2 may be that it treats memory management as a first-class model skill. That sounds technical, but it points to a larger shift in AI systems. As agents take on more realistic work, progress will depend less on one-shot brilliance and more on disciplined continuity. Models will need to decide what to keep, what to drop, and how to carry state forward without becoming bloated or forgetful.
Self-summarization is one answer to that problem. It may not be the final answer, and other approaches will likely emerge, including better latent memory systems and stronger external state tracking. But Cursor’s approach is notable because it integrates the problem directly into training rather than leaving it entirely to orchestration layers.