DeepSeek releases ‘sparse attention’ model that cuts API costs in half
DeepSeek unveils its experimental V3.2-exp model featuring DeepSeek Sparse Attention, an approach that can roughly halve API inference costs in long-context operations by making the transformer architecture more efficient.
Researchers at DeepSeek released a new experimental model, V3.2-exp, on Monday, designed to dramatically lower inference costs in long-context operations. DeepSeek announced the model in a post on Hugging Face and linked to an academic paper hosted on GitHub.
The most significant feature of the new model is DeepSeek Sparse Attention, an intricate system described in detail in the diagram below. In essence, the system uses a module called a "lightning indexer" to prioritize specific excerpts from the context window. A separate "fine-grained token selection system" then chooses specific tokens from within those excerpts to load into the module's limited attention window. Taken together, the two stages allow Sparse Attention models to operate over long portions of context with comparatively small server loads.
[Diagram: DeepSeek Sparse Attention architecture]
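For readers who want a concrete picture of the two-stage idea, here is a minimal sketch in Python. It is not DeepSeek's code: the function names (`lightning_indexer`, `select_tokens`), the block and top-k sizes, and the dot-product scoring are illustrative assumptions, and the real lightning indexer and token selector are learned components rather than the simple heuristics used here.

```python
# Minimal sketch of the two-stage selection described above, NOT DeepSeek's
# actual implementation. Names, block sizes, and scoring rules are assumptions.
import numpy as np

def lightning_indexer(query, keys, block_size=64, top_blocks=4):
    """Cheaply score contiguous blocks of the context and keep the best ones."""
    scores = []
    for start in range(0, keys.shape[0], block_size):
        block = keys[start:start + block_size]
        # Cheap proxy score: similarity of the query to the block's mean key.
        scores.append((start, float(query @ block.mean(axis=0))))
    scores.sort(key=lambda s: s[1], reverse=True)
    return [start for start, _ in scores[:top_blocks]]

def select_tokens(query, keys, block_starts, block_size=64, tokens_per_block=16):
    """Within the chosen blocks, keep only the highest-scoring individual tokens."""
    selected = []
    for start in block_starts:
        block = keys[start:start + block_size]
        per_token = block @ query                      # per-token relevance
        top = np.argsort(per_token)[-tokens_per_block:]
        selected.extend(start + int(i) for i in top)
    return sorted(selected)

def sparse_attention(query, keys, values):
    """Attend only over the selected tokens instead of the full context."""
    idx = select_tokens(query, keys, lightning_indexer(query, keys))
    k, v = keys[idx], values[idx]
    logits = k @ query
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ v

# Toy usage: a 4,096-token "context" with 128-dimensional states.
rng = np.random.default_rng(0)
keys = rng.standard_normal((4096, 128))
values = rng.standard_normal((4096, 128))
query = rng.standard_normal(128)
print(sparse_attention(query, keys, values).shape)  # (128,)
```

The point of the sketch is the cost structure: the attention step only ever touches the handful of tokens that survive both filters, which is why the approach scales better than dense attention as the context grows.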
For long-context operations, the benefits of the system are significant. Preliminary testing by DeepSeek found that the price of a simple API call could be reduced by as much as half in long-context situations. Further testing will be required to build a more robust assessment. However, since the model is open-source and freely available on Hugging Face, it won't be long before third-party tests can assess the claims made in the paper.
DeepSeek's new model is one of a string of recent breakthroughs tackling the problem of inference costs — essentially, the server costs of operating a pre-trained AI model, as distinct from the cost of training it. In DeepSeek's case, the researchers sought ways to enhance the efficiency of the fundamental transformer architecture — and found significant room for improvement.
Based in China, DeepSeek has been an unusual figure in the AI boom, particularly for those who view AI research as a nationalist struggle between the U.S. and China. The company made waves at the beginning of the year with its R1 model, which was trained primarily using reinforcement learning at a significantly lower cost than its American competitors. But the model has not sparked a wholesale revolution in AI training, as some predicted, and the company has receded from the spotlight in the months since.
The new "sparse attention" approach is unlikely to produce the same uproar as R1 — but it could still teach U.S. providers some much-needed tricks to help keep inference costs low.