For most of 2024 and 2025, every frontier lab promised the same thing: bigger context windows. 200K, then 1M, then 2M tokens. The marketing slides looked great. The bills did not. Push a real million-token prompt through a standard transformer and the KV cache alone can eat tens of gigabytes of VRAM, while per-token inference cost climbs into territory that breaks any product margin.
On April 24, 2026, DeepSeek quietly shipped the answer. According to MIT Technology Review, the Chinese lab released DeepSeek-V4-Pro (1.6T total parameters, 49B active) and DeepSeek-V4-Flash (284B total, 13B active) under the MIT License, both with a native 1-million-token context window. The headline number, per the official release notes, is that V4-Pro uses only 27% of the per-token inference FLOPs and 10% of the KV cache of its V3.2 predecessor at 1M tokens. V4-Flash drops to 10% of FLOPs and 7% of KV cache.
This post walks through what V4 actually is, why its hybrid attention mechanism matters, how it benchmarks against closed-source competitors, and what you should do if you are building products that depend on long-context reasoning.
What DeepSeek V4 Actually Is
DeepSeek V4 is a Mixture-of-Experts (MoE) language model family released as open weights on Hugging Face. Two variants ship in the family:
- V4-Pro — 1.6 trillion total parameters, 49 billion activated per token. Targets frontier reasoning, coding, and agentic workloads.
- V4-Flash — 284 billion total parameters, 13 billion activated per token. Targets latency-sensitive serving and edge inference at frontier-adjacent quality.
Both share the same architectural innovation: a hybrid attention stack that interleaves two compression strategies across layers instead of running dense self-attention everywhere. Both natively support a 1,048,576-token context. Both are released under MIT, meaning you can fine-tune, distill, and ship them in commercial products without negotiating a license.
The weights are real, not a paper. You can download them today, run them on vLLM or SGLang, and serve them behind your own endpoint.
The Core Idea: Why Attention Was the Bottleneck
Standard transformer attention has two costs that both scale badly with context length:
- Compute grows quadratically with sequence length. A 1M-token prompt is not 8x more expensive than a 128K-token prompt — it is roughly 64x more expensive per layer.
- Memory for the KV cache grows linearly with both context length and model depth. A trillion-parameter model with a 1M-token cache can demand more memory for activations than for weights.
Most "long context" models in 2025 papered over this with linear-attention variants, ring attention, or aggressive quantization. Quality usually suffered — needle-in-a-haystack scores collapsed past 256K, and any task requiring genuine cross-document reasoning fell apart.
DeepSeek's answer, documented in their V4 technical post and analyzed by Andrew Lukyanenko on Medium, is to admit a basic truth: not every layer needs the same level of detail. Early layers benefit from local precision, deep layers benefit from cheap global routing. So they built two attention mechanisms and interleaved them.
Inside the Hybrid Attention: CSA + HCA
The V4 attention stack alternates between two custom mechanisms.
Compressed Sparse Attention (CSA)
CSA compresses keys and values by 4x along the sequence dimension using softmax-gated pooling with a learned positional bias. Each query then runs through a "lightning indexer" that selects only the top-k compressed blocks worth attending to, and a parallel sliding-window branch handles the most recent uncompressed tokens.
The result is a layer that keeps token-level precision for nearby context (the sliding window) and aggregated-block precision for everything older — at roughly a quarter of the standard attention cost per layer.
Heavily Compressed Attention (HCA)
HCA takes the same idea further: compression of roughly 128x, with dense attention running over the resulting tiny block representation. This is intentionally cheap and intentionally lossy. It gives the model a coarse global view of the entire context — enough to know what is roughly where — without paying for token-level resolution it would not have used anyway.
Interleaving CSA and HCA across the depth of the model means every token sees both fine-grained nearby attention and coarse-grained global attention, but you never pay for full dense attention on a 1M-token sequence. According to the technical breakdown on dasroot.net, this is the structural reason V4's KV cache shrinks an order of magnitude while quality holds.
Benchmarks: Where V4 Lands in June 2026
The efficiency story would not matter if quality cratered. It did not. Here is how V4-Pro compares to other frontier models on the benchmarks developers actually care about, drawn from MorphLLM's comparison and BuildFastWithAI's review:
| Model | SWE-bench Verified | 1M-Token NIAH | License | Output $/M tokens |
|---|---|---|---|---|
| DeepSeek V4-Pro-Max | 80.6% | 97% | MIT (open weights) | $0.87 |
| Gemini 3.1 Pro | 80.6% | ~96% | Closed | $10.00 |
| Claude Sonnet 4.6 | 77.2% | ~95% | Closed | $15.00 |
| GPT-5.5 | 74.5% | ~94% | Closed | $10.00 |
| DeepSeek V3.2 | 64.1% | ~88% | MIT | $1.10 |
Two numbers in that table matter more than the rest:
- 80.6% on SWE-bench Verified ties Gemini 3.1 Pro for the top of the open-weights leaderboard and sits within margin of error of the best closed models. SWE-bench is a real-codebase test, not a synthetic benchmark — it asks the model to fix actual GitHub issues end-to-end.
- $0.87 per million output tokens is roughly 1/10 to 1/15 the price of comparable closed models, and that is just the API price. If you host the weights yourself, marginal cost drops further.
The 97% needle-in-a-haystack accuracy at 1M tokens is the part most people miss. A 1M-token window that scores 60% NIAH is a marketing window — useful for stuffing context but not for retrieving anything specific. 97% means you can actually rely on the model to find what you put in there.
What This Changes for Developers
If you have been waiting for the moment open-weights models become a serious option for production agent workloads, this is it. A few concrete shifts to think about:
# Before V4: pay-per-token closed APIs dominate long-context agents
# Typical 500K-token research agent run on Claude Sonnet 4.6:
# ~$8.00 per query, 4-second TTFT
# After V4: same workload on self-hosted V4-Flash
# ~$0.18 per query (electricity + amortized GPU), 1.2-second TTFT
# Plus: weights stay on your hardware, no data leaves the perimeter
Three practical implications:
- Long-context RAG becomes the default. When you can stuff 800K tokens of company documentation into a single prompt for under a dollar, the engineering complexity of vector retrieval is no longer justified for many internal tools. You will see teams rip out their RAG pipelines this year.
- Self-hosting moves from "nice to have" to "obviously cheaper." A single 8xH100 box can serve V4-Flash at production latency. The break-even versus closed APIs lands at roughly 5–10 million tokens per day, which most serious products clear in a week.
- Agentic workflows get longer. Today most coding agents truncate aggressively because each extra 100K of context doubles the cost. With V4's compressed cache, you can keep entire repositories, prior tool calls, and full reasoning traces in scope without thinking about it.
Want the Engineering Deep Dive?
If you prefer a video walkthrough of the architecture and the design choices behind CSA and HCA, Welch Labs put together a clear visual explanation of the attention mechanism, MoE routing, and the training data curriculum DeepSeek used:
Frequently Asked Questions
Is DeepSeek V4 really free to use commercially? Yes. The weights are released under the MIT License on Hugging Face, which permits commercial use, modification, distribution, and private use with no royalty. The only requirement is preserving the copyright notice. The hosted API is separately priced, but you are not obligated to use it.
Can I run DeepSeek V4-Pro on a single GPU? No. V4-Pro's 1.6T parameters require multi-GPU inference even with FP8 quantization — realistically an 8xH100 or 8xH200 node minimum for full-quality serving. V4-Flash (284B total, 13B active) is much more tractable and can run on a single high-VRAM node or a 2x40GB setup with aggressive quantization.
Does the hybrid attention hurt quality on short prompts? In practice, no measurable hit. The CSA layers degrade gracefully to near-standard attention when the sequence is short enough that the sliding window covers everything, and HCA's coarse global view becomes effectively redundant rather than harmful. Benchmark scores on standard short-context evals match or exceed V3.2.
How does V4 compare to Fable 5 for coding? Fable 5 currently leads pure software-engineering benchmarks, but the gap is small (a few points on SWE-bench) and V4 is open weights while Fable 5 is closed and roughly 10–15x more expensive per token. For most production coding workflows the cost-per-fix favors V4 even when raw quality is slightly lower.
Will this kill closed-source frontier models? No — but it tightens the screws hard on closed pricing. When a free, MIT-licensed model matches your benchmark scores at one-tenth the cost, you need to either drop prices or prove a meaningful capability lead that justifies the gap. Expect closed labs to compete on agent frameworks, tool use, and integrated products rather than on raw model quality alone.
The Bottom Line
DeepSeek V4 is the moment long-context inference stopped being a luxury feature. The hybrid attention design is not magic — it is the right architectural acknowledgment that uniform dense attention was always wasteful at 1M tokens. Combined with MIT-licensed weights, frontier benchmark scores, and an order-of-magnitude pricing advantage, V4 has reset the floor for what every other model in 2026 must clear.
If you are building anything that touches long-context reasoning — research agents, code review bots, document analysis pipelines, multi-turn assistants with deep memory — the question is no longer whether to evaluate open-weights models. It is which one of your closed-model dependencies to migrate first.