SubQ by Subquadratic is a direct attack on one of the most expensive habits in AI: pretending that models have memory when most systems are really juggling summaries, chunks, embeddings and retrieval pipelines. Subquadratic has launched SubQ as a 12 million token large language model built for long context reasoning, with the promise that entire repositories, months of pull requests and long running agent state can fit into one prompt.
On its own site, Subquadratic describes the product with a simple line: “All your context. Always available.” That is the core claim. Instead of asking developers to slice information into pieces before a model can use it, SubQ is designed to read much larger bodies of information directly.
Why AI memory needed hacks in the first place
Most modern LLMs are built around transformer attention, the mechanism that lets a model weigh each token against the others and decide what matters. The problem is cost. In dense attention, every token is compared with every other token, so doubling the input length does not double the work; it roughly quadruples it.
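A quick back of the envelope count makes that concrete. The sketch below only tallies pairwise comparisons, ignoring constant factors and everything outside attention:

```python
# Dense attention compares every token with every other token,
# so the comparison count grows with the square of the input length.
def dense_attention_pairs(n_tokens: int) -> int:
    return n_tokens * n_tokens

for n in (1_000, 2_000, 128_000, 12_000_000):
    print(f"{n:>12,} tokens -> {dense_attention_pairs(n):>22,} comparisons")

# Doubling 1,000 -> 2,000 tokens quadruples the work (1e6 -> 4e6 comparisons),
# and 12 million tokens implies about 1.44e14 comparisons per attention layer.
```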
This quadratic scaling problem is why many AI products rely on retrieval augmented generation, vector databases, chunking, summarization and multi agent routing. These systems can be useful, but they are also scaffolding around a limitation. The model cannot afford to read everything, so engineers build machinery to decide what it should see.
That machinery introduces tradeoffs. A RAG pipeline may retrieve the wrong chunk. A summary may erase a detail that later becomes important. A coding agent may lose track of a dependency because the relevant file was not selected. The industry has become very good at these workarounds, but they remain workarounds.
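To make those failure modes concrete, here is a deliberately simplified retrieval pipeline, with word overlap standing in for embedding similarity. The names and parameters are illustrative; real systems add reranking, caching and prompt budgeting on top:

```python
# Toy RAG pipeline: chunk a corpus, score chunks against the query,
# and build a prompt from only the top k. The model never sees the rest.
def split_into_chunks(text: str, size: int = 200) -> list[str]:
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def score(chunk: str, query: str) -> int:
    # Stand-in for embedding similarity: count shared words.
    return len(set(chunk.lower().split()) & set(query.lower().split()))

def build_prompt(corpus: list[str], query: str, k: int = 3) -> str:
    chunks = [c for doc in corpus for c in split_into_chunks(doc)]
    top = sorted(chunks, key=lambda c: score(c, query), reverse=True)[:k]
    # If the relevant chunk scores poorly here, the model never reads it.
    return "\n\n".join(top) + "\n\nQuestion: " + query
```

Every step in that pipeline is a place where the right information can be dropped before the model ever runs.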
What makes SubQ by Subquadratic different
Subquadratic says SubQ uses Subquadratic Selective Attention, or SSA. The idea is not to compare every token with every other token. Instead, the model learns which relationships matter and spends compute there. The company frames this as a fully subquadratic sparse attention architecture, built so cost grows roughly linearly with input length.
That distinction matters. Sparse attention is not new. Earlier approaches often used fixed patterns, compressed states or hybrid designs that still kept some dense attention layers. Those systems can reduce cost, but they may fail when important information appears far away or in an unexpected place. Subquadratic argues that SSA is content dependent, meaning the model chooses what to attend to based on the actual prompt rather than a fixed layout.
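Subquadratic has not published SSA's internals, so the snippet below is not its method; it is a generic sketch of content dependent sparse attention, where each query keeps only its top k keys. Note the caveat in the docstring: scoring every key just to find the top k is still quadratic overall, so a truly subquadratic design needs a cheap way to locate candidates, for example hashing or learned routing.

```python
import numpy as np

def topk_sparse_attention(q, K, V, k=8):
    """Content dependent sparse attention for a single query vector.

    Illustrative only: computing scores against every key is O(n) per
    query, hence O(n^2) overall. A subquadratic scheme must find the
    top k candidates without scoring everything.
    """
    scores = K @ q / np.sqrt(q.shape[0])         # similarity of q to each key
    idx = np.argpartition(scores, -k)[-k:]       # keep only the k best keys
    w = np.exp(scores[idx] - scores[idx].max())  # softmax over selected keys
    w /= w.sum()
    return w @ V[idx]                            # attend only where content matters

rng = np.random.default_rng(0)
n, d = 10_000, 64
K, V = rng.standard_normal((n, d)), rng.standard_normal((n, d))
out = topk_sparse_attention(rng.standard_normal(d), K, V)
```

The selection step is the content dependent part: which keys survive depends on the actual prompt, not on a fixed pattern.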
Subquadratic says this changes the economics most dramatically at very long context lengths. Its site claims that at 12 million tokens, SubQ reduces attention compute almost 1,000 times compared with dense attention. The company also says SSA runs 52 times faster than FlashAttention at 1 million tokens. If those numbers hold up under broader independent testing, the practical impact would be significant.
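Taken at face value, the arithmetic is easy to sanity check. Assuming the ratio the company states, a 1,000 fold reduction at 12 million tokens works out to each token attending, on average, to about 0.1% of the context:

```python
# Implication of the claimed ~1,000x compute reduction at 12M tokens.
n = 12_000_000
dense = n * n            # pairwise comparisons in dense attention
sparse = dense / 1_000   # compute implied by the company's claim
per_token = sparse / n   # average number of keys each token attends to
print(f"{per_token:,.0f} keys per token, {per_token / n:.2%} of the context")
# -> 12,000 keys per token, 0.10% of the full 12 million token context.
```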
The benchmark picture is strong but narrow
SubQ arrives with eye catching benchmark claims. In the launch materials shared for SubQ by Subquadratic, the model is reported to score 97% on RULER, a benchmark used to test long context accuracy, at 128K tokens, against 94% for Claude Opus 4.6 in the same summary. The reported cost difference is even more striking: about $8 to run SubQ versus roughly $2,600 on frontier models.
On MRCR v2, a multi needle retrieval benchmark, SubQ is reported at 83, ahead of Opus at 78, GPT 5.4 at 39 and Gemini 3.1 Pro at 23. At the full 12 million token length, beyond the public context windows of current frontier models, SubQ reportedly reached 92% recall. Subquadratic has also said it is targeting 100 million tokens by Q4.
There is also a coding angle. SubQ is reported at 81.8% on SWE Bench, while Anthropic’s Opus 4.7 still leads at 87.6%. That detail is important. SubQ is not being presented as a universal best model across every task. Its strongest case is long context retrieval and software engineering workflows where context size is the bottleneck.
There are caveats. Additional reporting from outlets such as The New Stack and VentureBeat notes that the published benchmark set is relatively narrow and concentrated on the areas where SubQ should shine. Some reports also point out that certain model comparisons were run only once because long context inference is costly. Subquadratic says its results are third party validated, but the market will still want broader evaluations across reasoning, math, multilingual tasks, safety and real production workloads.
What SubQ changes for developers
The most immediate use case is software engineering. Subquadratic’s own site says SubQ can reason across entire repositories, months of pull requests and long running agent state. It gives examples such as the entire Python 3.13 standard library and six months of React development, described as about 1,050 pull requests against the React codebase.
This is why SubQ Code matters. It is a command line coding agent designed to load a whole repository in one pass. In today’s coding agents, a lot of effort goes into deciding which files to inspect, which symbols to follow and when to ask a stronger model for help. A long context model changes the shape of that workflow. Instead of building a perfect file selection strategy, the agent can begin from a much wider view of the codebase.
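Subquadratic has not described how SubQ Code assembles its prompt, but the basic move, flattening a repository into one context rather than selecting files, fits in a few lines. The file extensions and the four characters per token estimate below are assumptions:

```python
from pathlib import Path

# Rough sketch: concatenate a repository into a single prompt. A real
# agent would also skip binaries, build artifacts and vendored code.
def load_repo(root: str, exts=(".py", ".md", ".toml")) -> str:
    parts = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in exts:
            parts.append(f"### {path}\n{path.read_text(errors='ignore')}")
    return "\n\n".join(parts)

context = load_repo(".")
print(f"~{len(context) // 4:,} tokens")  # crude 4 characters per token estimate
```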
Subquadratic is also launching a 12 million token API for developers and enterprise teams. The company says it supports streaming, tool use and endpoints compatible with OpenAI style integrations. That matters because architecture alone is not enough. Developers need the model to fit into existing stacks without forcing a full rewrite.
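If the endpoints are OpenAI compatible as described, existing client code should need little more than a new base URL. The URL and model id below are placeholders, since Subquadratic has not published them:

```python
from openai import OpenAI

# Hypothetical endpoint and model id; not documented values.
client = OpenAI(base_url="https://api.subquadratic.example/v1", api_key="YOUR_KEY")

stream = client.chat.completions.create(
    model="subq-12m",  # placeholder model name
    messages=[{"role": "user", "content": "List every breaking change in this codebase."}],
    stream=True,  # streaming, which the company says the API supports
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="")
```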
The most interesting implication is not that RAG disappears. It is that RAG becomes optional in more places. Search, indexing and memory files still matter for cross session state, permissions, auditability and structured knowledge management. But for one large task, such as reviewing a repository, comparing long contracts or investigating months of operational logs, the model may not need a retrieval layer to guess what to read first.
Why skepticism is healthy
The AI field has heard claims about transformer replacements before. Mamba, RWKV, hybrid architectures and sparse attention variants have all promised better scaling. Many delivered useful ideas, but few displaced dense attention at the frontier. The common failure mode is simple: the architecture gets cheaper, but quality drops when the task requires precise retrieval or broad reasoning.
SubQ may be different because the company is shipping API access and a coding product rather than only publishing a research paper. The team also includes researchers with backgrounds from Meta, Google, Oxford and Cambridge. Its backers include former SoftBank Vision Fund partner Javier Villamizar and Tinder co founder Justin Mateen. Reports differ on the exact seed amount, with figures around $25 million to $29 million, but the funding is enough to make Subquadratic a serious company to watch.
Still, the open question is not whether 12 million tokens sounds useful. It obviously does. The question is whether SubQ can combine three things at once: low cost, reliable long context retrieval and frontier level reasoning. Long context specialists can win important workloads without beating the biggest general models everywhere, but that path leads to a split market rather than a single replacement.