The Stack Completes

May 20, 2026

Happy Tuesday. I scan 100+ Chinese-language AI and tech sources daily to find the stories that matter before they reach the English press. Today: Alibaba ran its annual cloud summit in Hangzhou and showed everyone what a vertically integrated AI stack looks like when it actually works. Plus: DeepSeek is building its own Claude Code, China's three telecom giants just turned AI into a utility bill, and a humanoid robot ran 21 kilometers in under 51 minutes without a single sensor looking at the road.

Let's go.

The Stack Completes

Qwen 3.7 Max spent 35 hours optimizing a kernel on hardware it had never encountered during training. At the end, it was 10 times faster than where it started.

That sentence requires context. The task was Extend Attention, a production GPU kernel inside SGLang, the inference serving framework. It's the operator that computes attention scores over long prefix KV caches -- a latency-sensitive, memory-intensive calculation that sits in the hot path of every LLM inference request. The hardware was T-Head's new M890 chip. Qwen 3.7 Max had no hardware documentation, no performance profiling data, no example kernels for this architecture. It had a task description, an existing implementation, and an evaluation script.

Over 1,158 tool calls, the model diagnosed compilation errors, fixed correctness bugs, identified performance bottlenecks by reading its own runtime measurements, and rewrote the kernel architecture multiple times. Speedup progression: 0.33x to 2.58x in the first two hours. 2.58x to 5.37x by hour five. 6.85x to 8.50x over the next twenty hours. And in the final stretch -- hours 32 to 35 -- one more architectural redesign pushed it to 10x. The model was still finding meaningful improvements at hour 30.

Alibaba ran the same task on competing models. GLM 5.1 reached 7.3x. Kimi K2.6 reached 5.0x. DeepSeek V4 Pro reached 3.3x. Qwen 3.6 reached 1.1x.

The gap between 3.3x and 10x is not a benchmark gap. It is a demonstration that the M890 chip and Qwen 3.7 Max are co-adapted to each other's failure modes in a way that no external model can replicate. DeepSeek V4 Pro is one of the best models in the world. When it optimized a kernel on hardware it had never seen, it got 3.3x. Qwen 3.7 Max got 10x on its own hardware. That difference is the vertical stack, in a number.

T-Head introduced the M890 at the same summit. Three times the performance of its predecessor. 144GB HBM. 800GB/s chip-to-chip bandwidth. Paired with ICN Switch 1.0, a dedicated interconnect chip that links 128 M890s into a single supernode with under 150 nanoseconds of latency. The description from 智东西: "128 chips as one machine. Agent concurrent requests arrive like cars on a highway. The lanes are wide enough. The toll booth processes them instantly."

Alibaba has shipped 560,000 T-Head chips. More than 400 companies across 20+ industries are running on them, including China Telecom, SAIC, and Pudong Development Bank. T-Head's roadmap extends through 2028: M890 (Q2 2026), V900 (Q3 2027), J900 (Q3 2028), each generation delivering approximately 3x performance gains. The first time T-Head took the main stage at Alibaba Cloud Summit in 11 years was this week.

The benchmark numbers for Qwen 3.7 Max tell a similar story. GPQA Diamond: 92.4 (Claude Opus 4.6 is 91.3). Terminal Bench 2.0: 69.7 (DeepSeek V4 Pro is 67.9). SWE-Pro: 60.6 (Kimi K2.6 is 59.5). Apex math: 44.5 (DeepSeek V4 Pro is 38.3). IMO benchmark: 90.0 (Kimi K2.6 is 86.0). These are leading scores across every major benchmark category, achieved on the month the corresponding chip shipped.

Qwen 3.7 Max follows Qwen 3.5 Max Preview (March 20) and Qwen 3.6 (April 20). Every month, on the 20th. Alibaba is not releasing models when they are ready. It has committed to a monthly schedule and is executing it. The original lead researcher on Qwen, Lin Junyang, departed in March. The team released two major models in the two months after he left.

An independent market research report from Omdia, published the day before the summit, found that Alibaba Cloud holds 38.1% of China's AI cloud market -- up from 35.8% earlier in 2025, and more than the second, third, and fourth competitors combined. AI revenue at Alibaba Cloud has grown at triple-digit rates for 11 consecutive quarters. The last reported quarter: AI is 30% of cloud revenue. The projection for end of year: over 50%.

There are four companies globally with a complete vertical stack: their own chip, their own cloud, their own frontier model. Google. Microsoft (via its investment in OpenAI and custom silicon). Amazon. And now Alibaba. The others took a decade and hundreds of billions of dollars to build it. Alibaba is doing it under export controls that were supposed to prevent exactly this.

The Briefing

DeepSeek is building its own Harness team, and they named what it competes with. According to 甲子光年, citing people close to the company, DeepSeek has formed an internal team focused on code agent products. DeepSeek researcher Chen Deli confirmed the formation publicly: "We are building a new Harness team for Harness-direction products and research. Simply put, it's competing with Claude Code, building DeepSeek Code Harness." The job posting frames the product philosophy as "Model + Harness = Agent" -- the model handles reasoning, the harness handles everything else: context management, tool calls, file reads, code execution, test loops, git operations. DeepSeek's model quality has been world-class since V3. Its harness has been borrowed by Cursor, Claude Code, and dozens of other tools. Now it wants to own the application layer too.

Andrej Karpathy joined Anthropic to build a team that uses Claude to accelerate pretraining research. His announcement was three sentences: the next few years at the LLM frontier are "especially formative," he's returning to research, he still loves education. He'll be embedded in Anthropic's pretraining team under Nick Joseph, building a sub-team focused on using Claude itself to speed up pretraining. This is the third prominent OpenAI figure to move to Anthropic. Polymarket currently gives Anthropic a 65% probability of having the best model by end of June, versus 4% for OpenAI. Chinese AI observers are reading this as validation that Anthropic's technical choices are compounding in a way that attracts people who could go anywhere -- and that the perceived gap between Chinese and Western frontier models is narrow enough that the difference is increasingly about talent concentration, not training methodology.

China's three telecom giants launched token subscription plans this week, and the demand numbers are hard to believe. China Telecom, China Mobile, and China Unicom each unveiled consumer-grade token packages. China Telecom's entry plan: 9.9 RMB per month ($1.40) for 10 million tokens. The premium individual plan: 49.9 RMB per month ($6.90) for 80 million tokens. China Mobile partnered with Tencent to offer 400,000 tokens for 1 RMB ($0.14) in Shanghai. The context: according to China's National Data Administration, average daily token consumption in China has jumped from 100 billion at the start of 2024 to 140 trillion as of March 2026. That is a 1,400x increase in 15 months. The Chinese telcos are converting their distribution infrastructure -- billing systems, subscriber relationships, national coverage -- into the delivery pipe for a product that already has 140 trillion units of daily demand.

China published its first comprehensive governance framework for AI agents on May 8. The document, jointly issued by the Cyberspace Administration, NDRC, and Ministry of Industry and Information Technology, applies "precision mandatory" standards in high-risk domains -- healthcare, transportation, media, public security -- and gives companies wide latitude below those floors. The framework mandates adoption of AIP (Agent Interconnection Protocol), a national standard for agent-to-agent communication covering identity authentication, capability discovery, collaboration, billing, and audit trails. AIP is designed for a world where thousands of agents from different companies need to interact safely. More than 100 companies are already enrolled in the pilot, including Huawei, Alibaba, China Mobile, Xiaomi, and CATL. Analysis from Huxiu frames the distinction clearly: China is not just regulating agents. It is standardizing the plumbing that agents will run on, and doing it before the infrastructure calcifies around proprietary protocols.

What I Found on Bilibili This Week

The video I want to highlight is from 量子位 (Quantabit). Title: "Humanoid robots crushed the marathon, no vision AI, running blind?" 959,000 views, 12 minutes 40 seconds.

This was the second annual humanoid robot marathon in China. The headline number: 21 kilometers, 50 minutes 26 seconds, fully autonomous. That breaks the human half-marathon world record.

The transcript covers something the press releases don't: how they actually navigate. No visual AI. No cameras processing the environment in real time. The technical stack is RTK (real-time kinematic GPS, centimeter-level precision) plus LiDAR point cloud. RTK answers "where am I." LiDAR answers "what's immediately in front of me." The difference between consecutive point cloud frames answers "how am I moving." The destination is pre-programmed. Everything else is correction.

This is why the robots run in serpentine patterns. "It's like a person walking with their eyes closed," the sensor engineer explains. "The navigation algorithm keeps pulling them back toward the target. But if the correction timing is off, or the robot's posture was wrong when it corrected, the next adjustment will also be slightly off." The path looks strange because it's iterative rather than perceptual.

The winning team, Honor Lightning (荣耀闪电), runs with water-cooled motors. The joint damage during a 21km run is extreme: each footfall lands with 1,400 to 2,100 newtons of force on a 70kg frame, over 30,000 to 50,000 steps. The reporter notes: "By the time they cross the finish line, many of the robots are operating near the edge of their component lifetimes in places you can't see."

The sensor engineer's closing observation: "The core problem isn't the marathon. It's autonomous perception and autonomous intelligence. Running ability is now demonstrated. What comes next is the ability to understand and respond to an open environment." In other words: they've solved locomotion. The next frontier is judgment.

Signals

YMTC is advancing toward an IPO on Shanghai's STAR Market. Yangtze Memory Technologies, China's only mainland integrated 3D NAND flash manufacturer, has completed IPO tutoring registration with the Hubei branch of CSRC, with CITIC Securities as sponsor. Reuters previously reported a formal application as early as mid-June. YMTC matters because memory is the bottleneck constraining domestic AI data center buildout at least as much as compute. A public listing accelerates the capital available for capacity expansion.

China and the United States confirmed formal AI governance dialogue. Following Xi and Trump's state visit meetings, Foreign Ministry spokesman Guo Jiakun confirmed both governments will hold further dialogues on AI governance. Guo's framing: "As two leading AI powers, China and the United States need to work together to promote the development of AI and improve its governance to make sure it will better contribute to the progress of human civilization." Trump characterized the discussions as "standard guard rails." The formal channel matters more than either characterization of it.

Cursor launched Composer 2.5 using Kimi K2.5 as its base model -- still. The new model benchmarks close to Claude Opus 4.6 on SWE-Bench Multilingual (79.8% vs 80.5%) at roughly a tenth of the cost. The interesting detail: Cursor is running post-training on a Chinese model to produce a coding agent that competes with Anthropic's flagship on specialized tasks. Cursor's business currently depends on Anthropic for Claude API access and on Moonshot AI for the model behind Composer. That is a structurally unusual position that will resolve one way or another.

China unveiled a GPU-free supercomputer with 2.45 million domestic CPU cores. The National Supercomputing Center in Shenzhen deployed LineShine, delivering 1.54 exaFLOPS using Armv9-based LX2 processors. No GPUs. The 20,480 computing nodes are connected through Lingqu interconnect at 1.6 Tb/s per node. It is a different architecture choice for HPC workloads where general-purpose compute dominates, and a domestic semiconductor supply chain story: 2.45 million cores on Chinese-designed processors.

The Bigger Picture

There is a pattern running through this week's stories that is worth naming.

Three years ago, the export control thesis was that restricting access to advanced chips would slow China's AI development by limiting training compute. The thesis was not wrong about the mechanism. It was wrong about what the mechanism would produce.

Restriction creates constraint. Constraint creates a design problem. Design problems, when solved by capable engineers with adequate investment, produce solutions that are adapted to the constraint -- and occasionally adapted better than the unconstrained alternative would have been.

T-Head built M890 because H100 was unavailable at scale. The 128-chip supernode with sub-150ns latency was not designed around H100's architecture. It was designed around the specific requirements of agentic workloads: high concurrency, low-latency inter-chip communication, unified training and inference on a single chip. The H100 was designed for something different. T-Head had the option of designing for exactly the problem Alibaba needed to solve.

When Qwen 3.7 Max trained on T-Head hardware throughout its development, and then ran a 35-hour kernel optimization task on M890, it achieved 10x speedup. DeepSeek V4 Pro, which trained on H100s, achieved 3.3x on hardware it had never seen. The gap is not model quality. It is co-adaptation. You only get 10x if the model and the chip were optimized together from the start.

The token economy story follows the same pattern. China's telcos are selling AI tokens at $1.40 per month because domestic model providers drove inference costs to near-zero through competition. That competition was possible because multiple Chinese labs, unable to rely on API access to foreign frontier models, built their own. The same supply independence that looks like a disadvantage in the short term produced 140 trillion tokens of daily domestic consumption in 15 months.

The humanoid robot marathon ran blind because lidar and RTK are cheaper, more reliable, and more available than the high-end cameras and visual AI that would require either expensive foreign components or a visual processing stack that Chinese teams have not fully commoditized yet. So they ran blind. And then they got very good at running blind. 21 kilometers in 50 minutes 26 seconds, world record.

Constraints do not stop capable builders. They redirect them. The redirection produces things that would not have existed without the constraint. That is the China AI story in 2026, running on repeat.

None of this makes Western headlines, even though all of it matters. That's why this newsletter exists.

I exist because this information asymmetry shouldn't.

If you find value in this, tell one person. Or subscribe if you haven't already.

China AI Dispatch

Discussion about this post

Ready for more?