AINews

Weekly Hallucinations: DeepSeek V4, Kimi K2.6, and the uncontrolled hallucinations of OpenAI's new flagship

Codex on Mac is becoming a superapp, Anthropic has published a postmortem on its own bugs, and DeepSeek spent 58 pages explaining why it has the best open-weight model.

OpenAI released GPT-5.5 and positions it as a “flagship for real-world work and autonomous workflows.” Pricing is $5 per million input tokens and $30 per million output tokens, with the Pro variant at $30/$180, exactly twice the per-token price of GPT-5.4. OpenAI softens the hike by arguing the model needs fewer tokens: Artificial Analysis reported a ~40% reduction in token usage on its Intelligence Index, so the final bill rises by only ~20%. Context is 1M in the API and 400K in Codex.
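
A quick back-of-the-envelope check on the “twice the price, only ~20% more per bill” claim. A minimal sketch: the GPT-5.4 prices are derived from the “exactly twice” statement, and the workload sizes are made up purely for illustration.

```python
# Prices are per million tokens. GPT-5.5 prices come from the announcement;
# GPT-5.4 is inferred from "exactly twice as expensive per token".
GPT54 = {"input": 2.50, "output": 15.00}
GPT55 = {"input": 5.00, "output": 30.00}

def task_cost(prices, input_mtok, output_mtok):
    """Cost of a workload given token volumes in millions of tokens."""
    return prices["input"] * input_mtok + prices["output"] * output_mtok

# Hypothetical workload: 10M input and 5M output tokens on GPT-5.4.
old = task_cost(GPT54, 10, 5)
# Artificial Analysis reports ~40% fewer tokens for the same result on GPT-5.5.
new = task_cost(GPT55, 10 * 0.6, 5 * 0.6)

print(f"GPT-5.4: ${old:.2f}  GPT-5.5: ${new:.2f}  delta: {new / old - 1:+.0%}")
# -> GPT-5.4: $100.00  GPT-5.5: $120.00  delta: +20%
```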

The benchmarks back up the flagship positioning. Terminal-Bench 2.0: 82.7%, OSWorld-Verified: 78.7%, SWE-Bench Pro: 58.6%. ARC Prize confirmed ARC-AGI-2 at 85.0% at a cost of $1.87 per task. On AA’s Intelligence Index, the model took clear first place, while GPT-5.5 medium tied with Claude Opus 4.7 max at roughly a quarter of the cost. Dan Shipper from Every tested it on their Senior Engineer benchmark: 62/100 versus 33/100 for Opus 4.7, with the best results coming when Opus wrote the plan and GPT-5.5 implemented it.

The model has one big fly in the ointment: an 86% hallucination rate on AA-Omniscience. Opus 4.7 scores 36% on the same metric, and Gemini 3.1 Pro Preview 50%. If you plan to use it where truth matters more than speed, keep this in mind.

The main event of the day is not the model itself. Alongside the release, the Codex Mac App gained browser control, support for Sheets, Slides, Docs, and PDFs, system dictation, and automatic code review. Back in March, the WSJ wrote that OpenAI was preparing a desktop superapp, and now it is clear what it is being built on.

DeepSeek dropped the long-awaited V4, the first major architecture update since DSV3, and rolled out two model tiers at once. V4 Pro is a 1.6T-parameter MoE with 49B active parameters; V4 Flash is 284B/13B. Both have a 1M context, both are MIT-licensed, and both run on Huawei Ascend through CANN. Pricing is aggressive: Pro is $1.74/$3.48 per million, Flash $0.14/$0.28. The headline item is in the 58-page technical report: a new long-context system that compresses the KV cache to 9.62 GiB for 1M tokens, versus 83.9 GiB for V3.2, an 8.7x reduction. On the AA Intelligence Index, V4 Pro in max mode scored 52, second among open weights after Kimi K2.6. Several researchers called the paper itself “the most important AI text of the year.”
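
To put the report’s KV-cache numbers in per-token terms, here is a minimal sketch using only the two totals quoted above; no architectural details are assumed.

```python
# Both cache totals are for a 1M-token context, as quoted from the V4 report.
GIB = 1024**3
TOKENS = 1_000_000

v32_cache = 83.9 * GIB   # DeepSeek V3.2, 1M-token KV cache
v4_cache = 9.62 * GIB    # DeepSeek V4, same context length

print(f"V3.2: {v32_cache / TOKENS / 1024:.1f} KiB per token")
print(f"V4:   {v4_cache / TOKENS / 1024:.1f} KiB per token")
print(f"compression factor: {v32_cache / v4_cache:.1f}x")
# -> roughly 88.0 vs 10.1 KiB per token, an ~8.7x reduction. That per-token
#    budget is what lets a 1M context fit in a single accelerator's memory.
```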

V4 has the same problem: hallucinations. AA-Omniscience: 94% for Pro, 96% for Flash. And the prices look good right up until you calculate the cost of a full run of that index: V4 Pro ate 190 million output tokens, V4 Flash 240 million. Cheap per token ≠ cheap per task.
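
Here is what a full index run costs in output tokens alone, using the prices and token counts above. Input-token counts are not reported, so treat these as lower bounds.

```python
# "Cheap per token != cheap per task": output-token cost of one full
# AA Intelligence Index run, from the figures quoted in this issue.
runs = {
    # model: ($ per M output tokens, M output tokens consumed)
    "V4 Pro": (3.48, 190),
    "V4 Flash": (0.28, 240),
}

for model, (price, mtok) in runs.items():
    print(f"{model}: ${price * mtok:,.2f} minimum for one index run")
# -> V4 Pro: $661.20, V4 Flash: $67.20, before counting any input tokens.
```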

A couple of days earlier, Moonshot showed Kimi K2.6, a 1T MoE with 32B active parameters and a 256K context, under a Modified MIT license. Moonshot’s own agentic demos make it clear where everything is heading. One run downloaded and optimized Zig inference for Qwen3.5-0.8B over more than 12 hours and 4,000+ tool calls, raising throughput from 15 to 193 tok/sec. Another, after 1,000+ tool calls, reworked the exchange-core matching engine and delivered a +185% gain in median throughput. These are still vendor demos, but they are closer to real work than leaderboard screenshots. On r/LocalLLaMA, posts piled up along the lines of “Kimi K2.6 covers 85% of the tasks I kept Opus 4.7 for.” Given the price difference and open weights, that is a serious signal.

The same week, Xiaomi announced MiMo-V2.5 and V2.5-Pro, making it the third Chinese open-weight player alongside Kimi and DeepSeek. V2.5-Pro is tuned for code and long agentic sessions: SWE-bench Pro 57.2, τ3-Bench 72.9, Claw-Eval 63.8, with 1,000+ autonomous tool calls claimed. The base V2.5 comes with native omnimodality and a 1M context. The family is less hyped than Kimi or DeepSeek, but Artificial Analysis has already added MiMo to its Index, and the Hermes agent picked up the integration within a couple of days.

Last week I wrote about Qwen 3.6 35B-A3B; this week Alibaba released its dense sister model, Qwen 3.6 27B, under Apache 2.0. The difference is in the architecture. The MoE version has 35B total parameters, but only 3B are active on each token — hence “A3B” — which gives ~65 tok/sec on an M5 Max. In the dense model, all 27B work on every token: 24 tok/sec, but with higher accuracy and stability on long instructions. The 27B became the main local-model story of the week.

In coding, the 27B beats Alibaba’s own Qwen3.5-397B-A17B MoE. SWE-bench Verified: 77.2 versus 76.2, SWE-bench Pro: 53.5 versus 50.9, Terminal-Bench 2.0: 59.3 versus 52.5. People write that on an M5 Max through llama.cpp it feels close to Opus on many coding tasks, with the caveat we all understand. With quantization, the model fits into 16GB of VRAM; the sketch below shows the arithmetic. Roughly speaking, you take 35B-A3B when speed matters and 27B when accuracy matters.
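
A minimal sketch of the arithmetic behind both claims. The footprints ignore KV cache and runtime overhead, and the speed comparison is only directional, so treat these as rough estimates rather than measured numbers.

```python
def weight_gb(params_b: float, bits: int) -> float:
    """Approximate weight footprint in GB for a given quantization width."""
    return params_b * 1e9 * bits / 8 / 1e9

print(f"27B dense @ 4-bit: {weight_gb(27, 4):.1f} GB")  # ~13.5 GB, under 16GB
print(f"35B MoE   @ 4-bit: {weight_gb(35, 4):.1f} GB")  # ~17.5 GB: all experts
                                                        # must stay resident

# Decode on memory-bound hardware scales roughly with bytes read per token,
# i.e. with *active* parameters: 3B active vs 27B is a 9x difference, in the
# same direction as the reported 65 vs 24 tok/sec on an M5 Max.
print(f"active-param ratio: {27 / 3:.0f}x")
```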

Meanwhile, Anthropic was weathering a storm of its own. First, Claude Code quietly disappeared from the $20 Pro plan, framed as an A/B test on 2% of new subscribers. Reddit and Twitter exploded within a day; Anthropic pointed to the growing load on the Max tier: Claude Code, Cowork, long asynchronous agents, all of it expensive to serve. Sam Altman dropped a snide “ok boomer” on Twitter. A couple of days later, Claude Code returned to Pro, but the aftertaste remained.

Anthropic also published a postmortem on three bugs that had been quietly undermining Claude Code for a whole month:

  • On March 4, they quietly lowered reasoning effort from high to medium to reduce latency, and only rolled it back on April 7.
  • Starting March 26, a cache bug made Claude lose its reasoning history; the resulting cache misses burned through users’ limits faster than usual.
  • On April 16, a system prompt change limited responses between tool calls to 25 words and noticeably worsened coding; it was rolled back on the 20th.

All three were fixed in v2.1.116, and all subscribers had their limits reset. It is good that Anthropic released this kind of postmortem at all; for AI labs, it is a rare genre.

On the same day as GPT-5.5, OpenAI launched Workspace Agents in ChatGPT for the business, education, and team plans. These are Codex agents that can move through docs, email, chat, code, and external systems, have access to Slack workflows, and can run in the background or on a schedule. It is the same story as Codex beyond coding: the product is shifting toward the team desktop, not a single user in a chat.

GPT-Image-2 blew up the internet: on Image Arena it is #1 across all leaderboards, with text-to-image at 1512, single-image edit at 1513, and multi-image edit at 1464. A +242 Elo lead in text-to-image is generational-change territory. The main thing is that it reads and writes text inside images, producing UI mockups, infographics, and QR codes as fully readable, ready-to-use artifacts. The Thinking variant can review its own output and iterate, and yes, one image can take up to 11 minutes.
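
For intuition on what a +242 Elo gap means head to head, here is the standard Elo expected-score formula; nothing arena-specific is assumed.

```python
def expected_win_rate(elo_gap: float) -> float:
    """Probability the higher-rated side wins a pairwise comparison."""
    return 1 / (1 + 10 ** (-elo_gap / 400))

print(f"{expected_win_rate(242):.0%}")
# -> 80%: the leader is expected to win four out of five matchups.
```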

Cursor signed a $10 billion contract with xAI, with an option for xAI to buy Cursor for $60 billion. The numbers are so big that even the GPT-Image-2 news cycle barely drowned them out. If the option is exercised, it will be the largest M&A deal in AI tooling and, at the same time, an attempt to lock AI coding inside a single ecosystem loop. Controlling both the model and the IDE/environment is very fashionable right now: OpenAI has Codex, Anthropic has Claude Code, and Google potentially has anything through Gemini. Through xAI, Cursor gets a cheap, controllable model plus financing; xAI gets distribution through a tool with millions of active developers.

While Silicon Valley is divvying up AI tools, in our version of the Matrix the red pill is now sold in RuStore, and the blue one through the Turkish App Store.

Stay curious.

Some other interesting reads

Weekly Hallucinations: Claude Opus 4.7, Qwen 35B-A3B, and the end of pull requests (April 20, 2026)
While Twitter argues whether Opus got smarter, Codex has quietly moved into your Mac, and Hermes is turning into a real open-source alternative to OpenClaw.

Weekly Hallucinations: Muse Spark, ChatGPT Pro at $100, and the Myth That Got Real (April 13, 2026)
A model you can't buy via API even for $200 a month, and six models you can touch for $20 through the same old ollama run. And in the middle sits Meta, finally remembering it has three billion users.

Weekly Hallucinations: Claude Code Source Leak, Gemma 4, Qwen 3.6-Plus, and the Tamagotchi They Weren't Supposed to Find (April 06, 2026)
512 thousand lines of TypeScript out in the open, Anthropic sent 8,100 DMCA notices across GitHub, and Google finally shipped an open model with a proper license. You draw your own conclusions.
