Kimi K2.7 Code, the open-weight coding model that thinks 30% less

Moonshot AI shipped Kimi K2.7 Code on June 12, 2026, and the headline is not a bigger model. It is a smarter one. The new release keeps the same 1 trillion parameter Mixture-of-Experts backbone as K2.5 and K2.6, but reorganises how the model thinks. The result: roughly 30% fewer reasoning tokens per task and double-digit gains on every internal coding benchmark Moonshot tracks.

For teams running long autonomous coding sessions, that combination matters more than raw parameter count. You pay for thinking tokens the same way you pay for output tokens, and an agentic run can spend hundreds of those tokens on a single retry. Cut a third of them and the savings compound across an entire workflow.

What Kimi K2.7 Code actually is

Kimi K2.7 Code is a coding-focused agentic model built on top of K2.6. It is not a general chat model. Moonshot designed it for long-horizon software engineering: planning, editing files, running tools, reading test output, and debugging across many steps without losing the thread.

The architecture is a Mixture-of-Experts setup:

1 trillion total parameters, 32 billion active per token
384 experts, 8 selected per token plus 1 shared
61 layers, including 1 dense layer
MLA attention with SwiGLU feed-forward path
MoonViT vision encoder of 400M parameters for image and video input
Native INT4 quantization
256K context window, exactly 262,144 tokens

The weights ship on Hugging Face as moonshotai/Kimi-K2.7-Code under a Modified MIT licence that allows commercial use with attribution. The repository weighs in at roughly 595 GB on disk, so it works best on a server.

The 30% token cut, and why it changes the cost math

Moonshot frames the token reduction as less overthinking. The model produces shorter reasoning chains for the same task quality, which results in three things.

First, lower output-token cost per task. Reasoning tokens bill as output tokens on most price cards, so a 30% cut shows up directly on the invoice. Second, faster steps, which matters in interactive CLI sessions where every retry adds latency. Third, more steps before hitting the 256K context ceiling, since less of the window is consumed by the model talking to itself.

A concrete example. A 12-hour agentic coding run that previously consumed about 2 million reasoning tokens on K2.6 now uses roughly 1.4 million on K2.7 Code. At Moonshot’s published output rate of $4.00 per million tokens, that single run saves around $2.40 in reasoning costs alone. Multiply by a fleet of agents running daily and the gap becomes material.

Benchmark results

All numbers below are first-party. Moonshot ran K2.7 Code in Kimi Code CLI, GPT-5.5 in Codex with xhigh mode, and Claude Opus 4.8 in Claude Code with xhigh mode. Independent reproductions are still pending.

Coding benchmarks

Kimi Code Bench v2: K2.7 Code scores 62.0, up from K2.6 at 50.9. That is a 21.8% jump. Claude Opus 4.8 sits at 67.4.
Program Bench: K2.7 Code reaches 53.6, up 11.0% from K2.6’s 48.3. Opus 4.8 leads at 63.8.
MLS Bench Lite: the biggest leap, from 26.7 to 35.1, a 31.5% improvement. GPT-5.5 sits just ahead at 35.5.

Agentic benchmarks

The most interesting result is on MCP Mark Verified, a human-checked benchmark for tool use across Notion, GitHub, Filesystem, Postgres, and Playwright. K2.7 Code scores 81.1, beating Opus 4.8 at 76.4. For workflows that hinge on correct tool invocation through the Model Context Protocol, that gap is significant.

The Kimi Claw 24/7 Bench measures persistent multi-day coworking across software engineering, ML research, recruiting, trading, and marketing. K2.7 Code shows consistent gains here too, which lines up with Moonshot’s positioning of the model as a long-horizon agent rather than a single-shot coder.

Two constraints you cannot work around

K2.7 Code has fewer knobs than most commercial APIs, and that is deliberate.

Thinking mode is mandatory. Disabling it returns an API error. Instant mode is not supported. The model also forces preserve_thinking to true, which retains full reasoning content across multi-turn interactions. This is what keeps long agentic loops coherent, but it also means you cannot opt out of the reasoning tax.

Sampling parameters are locked. Temperature is fixed at 1.0, top_p at 0.95, n at 1, and penalties at 0.0. Default max output is 32,768 tokens. Override any of these and the request errors. For teams used to tuning sampling per task, this takes some adjustment.

There is one more rule for multi-step tool use: you must keep reasoning_content from the current turn in context, and set tool_choice to only auto or none. Skip that and the agent loses its train of thought between calls.

Pricing and access

The Moonshot API is OpenAI-compatible and Anthropic-compatible, so existing tooling works with a base URL change. Pricing on the Moonshot platform breaks down as:

Input: $0.95 per million tokens
Output: $4.00 per million tokens
Cache hit: $0.19 per million tokens

The cache-hit rate is worth highlighting. At roughly a fifth of the input price, structuring sessions so that stable context (system prompts, project conventions, repository layout) gets reused makes a real dent in costs. Combined with the 30% reasoning-token cut, K2.7 Code lands well below the closed frontier models on cost-per-quality-token.

You can also self-host with vLLM, SGLang, or KTransformers. The deployment method is identical to K2.5 and K2.6, so existing inference setups can swap in the new weights without infrastructure changes. The transformers version requirement is >=4.57.1, <5.0.0.

Where the model fits in practice

Four use cases line up cleanly with the model’s design.

Repo-scale refactors. Point the agent at a failing test suite, let it read files across modules, edit, rerun tests, and iterate until green. The 256K window holds enough of a codebase that the model does not constantly re-read files.

Pull request review. Feed a diff plus related files and logs in one prompt. The long context keeps everything in scope so risk analysis covers the actual surface area of the change, not just the lines touched.

MCP tool-use workflows. The 81.1 score on MCP Mark Verified translates to reliable tool invocation in production loops: CI checks, ticket updates, file edits, database queries, all coordinated through one agent.

Multimodal debugging. Documentation, screenshots, and a recorded reproduction can share one prompt. For UI bugs or visual regression work, mixing text, image, and video input in a single session is genuinely useful.

Pairing K2.7 Code with an agent framework

Moonshot recommends Kimi Code CLI as the native harness, and it is what the benchmarks were run in. But the OpenAI-compatible endpoint means K2.7 Code drops into provider-agnostic agents too.

Hermes Agent from Nous Research is one example. It is an open-source self-improving agent that builds reusable skills from successful work and persists memory across sessions. Paired with K2.7 Code‘s long context, a skill learned once can be applied across an entire repository in a single session, because the window holds enough of the codebase for the agent to act with full awareness.

The wiring is simple: register Kimi as a custom provider, point Hermes at https://api.moonshot.ai/v1 with the model id kimi-k2.7-code, set the API key, and select the model. Routing through OpenRouter at moonshotai/kimi-k2.7-code works equally well if you prefer one gateway across providers.

To remember

K2.7 Code is the sixth release in the K2 series, following a steady two to three month cadence since the original K2 dropped in July 2025. Each version has built on the same backbone rather than chasing parameter counts. K2.7 optimises the execution layer, the how of token usage, which suggests the next major architecture jump is being saved for K3.

What is worth watching is not whether open weights catch up to closed frontier models on raw scores. It is whether token efficiency becomes the metric teams actually optimise for. A model that scores two points lower but costs 40% less per completed task wins most procurement conversations.