Our previous posts on DeepSeek v3 and DeepSeek R1 have been very popular, with DeepSeek R1 clearly resonating in the Substack community. Welcome to the new members who joined us recently. Given that DeepSeek R1 is trending in the news, this post attempts to understand and explain why DeepSeek R1 has drawn so much love and attention this week. There are two main factors: its unique reasoning power, and its potential to lead to cheaper models downstream. Let's examine these two factors in detail.
Unique reasoning power
To capture attention, you must distinguish yourself from your competition and offer users something they are really craving. Since the viral ChatGPT release in November 2022, we have mostly seen incremental releases in the form of better models that answer questions more accurately, from closed-source vendors such as OpenAI and open-source vendors such as Meta. DeepSeek took a different approach with R1. DeepSeek R1 distinguishes itself by showing how it is thinking and reasoning about the question it was asked. Although OpenAI was first to ship a reasoning model in the form of o1, DeepSeek R1 bettered o1 by printing out how it was thinking and reasoning. This reasoning ability struck a chord with users, who recognize their own thought process in the way DeepSeek R1 reasons through a problem. More than the actual answer (which turns out to be better on many benchmarks), it is the thinking and reasoning that makes DeepSeek R1 stand out. This thinking and reasoning was honed by DeepSeek's use of Reinforcement Learning (RL) to train DeepSeek R1. While RL has been tried before with LLMs, DeepSeek's approach succeeded because they had a great base model (DeepSeek-v3), and supplemented their RL-only model (DeepSeek-R1-Zero) with some fine-tuning and rejection sampling to yield a better model (DeepSeek-R1).
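To make the rejection sampling step concrete, here is a minimal sketch of the idea: sample several completions per prompt, score them, and keep only the best ones as fine-tuning data. The `generate` and `score` functions below are hypothetical placeholders, not DeepSeek's actual pipeline.

```python
# Minimal sketch of rejection sampling for building fine-tuning data (illustrative only).
# `generate` and `score` are placeholders standing in for an LLM sampling call and a
# reward/correctness check; this is not DeepSeek's actual training code.
import random

def generate(prompt: str, n_samples: int = 16) -> list[str]:
    # Placeholder: in practice, sample n completions from the RL-tuned model.
    return [f"candidate answer {i} to: {prompt}" for i in range(n_samples)]

def score(prompt: str, completion: str) -> float:
    # Placeholder reward: in practice, a rule-based checker or reward model.
    return random.random()

def rejection_sample(prompts: list[str], keep_top: int = 2) -> list[tuple[str, str]]:
    """Keep only the highest-scoring completions per prompt for supervised fine-tuning."""
    kept = []
    for p in prompts:
        candidates = generate(p)
        ranked = sorted(candidates, key=lambda c: score(p, c), reverse=True)
        kept.extend((p, c) for c in ranked[:keep_top])
    return kept

if __name__ == "__main__":
    data = rejection_sample(["What is 17 * 24?"])
    print(f"{len(data)} (prompt, completion) pairs kept for fine-tuning")
```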
To understand this reasoning ability, we tried to improve on our previous integration of DeepSeek R1 on Vector + Wirepod. In our previous post, we needed to disable the system prompt in the query issued by Wirepod in order to have DeepSeek R1 return an intelligent response. However, disabling the system prompt has the unintended consequence that Vector does not animate while speaking (the system prompt carries the animation descriptions). This week we re-enabled the system prompt and experimented with different prompt sequences. The result is that Vector now gives short and concise responses, but can also animate while speaking the final answer. Since other reasoning models like o1 disable the system prompt, this is the first example of Vector animating answers from an LLM reasoning model. Check out our video below, and let us know how you feel about the thinking part (in the beginning) and the animation part (at the end). If you wish to try this out on your Vector, please use our Wirepod git, and follow the instructions in our previous post.
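If you just want to see the thinking separated from the final (speakable) answer, here is a minimal sketch against an OpenAI-compatible endpoint. It assumes a hosted R1 deployment that returns the reasoning inline inside <think>...</think> tags, as many providers do; the base URL and model name are illustrative examples, not the exact Wirepod setup.

```python
# Minimal sketch: split DeepSeek R1's "thinking" from its final answer.
# Assumes an OpenAI-compatible endpoint that returns reasoning inline in
# <think>...</think> tags; base_url and model name below are examples.
import re
from openai import OpenAI

client = OpenAI(base_url="https://api.together.xyz/v1", api_key="YOUR_KEY")

resp = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1",
    messages=[{"role": "user", "content": "Why is the sky blue? Answer in one sentence."}],
)
text = resp.choices[0].message.content

match = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
thinking = match.group(1).strip() if match else ""
answer = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

print("THINKING:\n", thinking)   # the part Vector skips over
print("ANSWER:\n", answer)       # the part Vector speaks and animates
```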
Low compute / cheaper models
Perhaps the more newsworthy consequence of DeepSeek R1 was the rout of tech stocks on Monday, January 27, which sent many semiconductor stocks such as nVidia and Broadcom down by 15-20%. This rout was driven by a narrative that emerged over the weekend that DeepSeek R1 used far fewer compute resources to train ($5.5 Million) than other recent models such as Llama and o1, whose training runs are estimated to have cost $100 Million or more. This narrative fueled speculation that the billions of dollars being invested in GPU-based setups at hyperscalers may sit idle and lie underutilized if it takes far fewer compute resources to train a model. Let's try to understand whether this speculation is true, and what it may imply for the future.
The training costs incurred by DeepSeek are captured in the DeepSeek v3 paper (which was released on Dec 26, 2024, long before the stock market rout of Jan 27, 2025). Here is Table 1 from the DeepSeek v3 paper.
The paper mentions that training each trillion tokens required only 180K nVidia H800 GPU hours, i.e., 3.7 days on a cluster of 2048 H800 GPUs. This training efficiency is achieved by an array of low-level optimizations, such as FP8 quantization of some weights, programming the GPU Streaming Multiprocessors (SMs) to achieve efficient computation-communication overlap, and a few others that are technically challenging to pull off. Nevertheless, for training on a dataset of 14.8 Trillion tokens, DeepSeek took 14.8 * 3.7 ≈ 55 days of training with a compute budget of roughly 2.8 Million GPU hours (please refer to Table 1). At a relatively cheap rate of $2 per H800 GPU per hour, and adding the other computation costs, the total lands at ~$5.58 Million, as Table 1 above shows. For comparison, Llama 3.1 70B took 7.1 Million GPU hours, while Llama 3.1 405B took 30.9 Million GPU hours (note that these GPU hours are on H100 GPUs, which differ from the H800s DeepSeek uses). So DeepSeek definitely seems to have made improvements in the raw computation cost required to train a model.
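For readers who want the back-of-the-envelope math spelled out, here is the same arithmetic in a few lines of Python, using the figures quoted above (the $2/GPU-hour rate is the rental price assumed in the text, and the total covers pre-training only, before the extra context-extension and post-training stages in Table 1).

```python
# Back-of-the-envelope DeepSeek v3 pre-training cost, using the figures quoted above
# (180K H800 GPU hours per trillion tokens, 2048-GPU cluster, $2 per GPU-hour assumed).
gpu_hours_per_trillion_tokens = 180_000
cluster_gpus = 2_048
tokens_trillions = 14.8
rate_per_gpu_hour = 2.0  # USD, assumed rental rate

days_per_trillion = gpu_hours_per_trillion_tokens / cluster_gpus / 24   # ~3.7 days
total_days = tokens_trillions * days_per_trillion                       # ~54-55 days
total_gpu_hours = gpu_hours_per_trillion_tokens * tokens_trillions      # ~2.66M GPU hours
pretraining_cost = total_gpu_hours * rate_per_gpu_hour                  # ~$5.3M before the extra stages

print(f"{days_per_trillion:.1f} days per trillion tokens")
print(f"{total_days:.1f} days of pre-training")
print(f"{total_gpu_hours/1e6:.2f}M GPU hours, ~${pretraining_cost/1e6:.2f}M at $2/GPU-hour")
```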
But before you start thinking that one could train a DeepSeek-like model by just investing $5.5 Million, consider the following. Renting a GPU cluster with 2048 GPUs from a cloud or hyperscaler is pretty much infeasible. So anyone attempting this kind of training must deploy such a cluster in their own private datacenter. An nVidia H800 GPU costs around $40,000, so for 2048 GPUs the cost of the GPUs alone is ~$80 Million. Adding other expensive components such as memory, interconnects, switches, power, etc., my estimate is that the capital expense of deploying such a compute cluster exceeds $500 Million. On top of that, operating power and personnel (DeepSeek's team has about 200 people) add a fair chunk. Thus, in spite of DeepSeek's numbers, such a training exercise remains infeasible for everyone other than the big companies and research labs. But bringing down the compute needed by a considerable factor raises the possibility that we will quickly see more powerful reasoning models from the big guns who have access to large compute in the form of farms of nVidia H100s and, soon, Blackwells.
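The capital-expense arithmetic above is easy to sanity-check; the per-GPU price and the >$500 Million all-in figure are this article's rough assumptions, not vendor quotes.

```python
# Rough capital-expense arithmetic from the paragraph above; the per-GPU price
# and the >$500M all-in estimate are assumptions made in this article.
gpus = 2_048
price_per_h800 = 40_000            # USD, assumed street price per H800
gpu_capex = gpus * price_per_h800
print(f"GPUs alone: ~${gpu_capex/1e6:.0f}M")   # ~$82M (the '~$80 Million' quoted above)
# Memory, interconnects, switches, power and facilities push the estimate past $500M.
```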
The other area where DeepSeek might have an advantage is inference. Reasoning models must process a lot of tokens (while thinking) before they deliver the final answer. Reasoning models are therefore costly, with OpenAI o1 offered at $60 per Million tokens. DeepSeek R1, on the other hand, is offered at $7 per Million tokens (current prices) by Together.ai, and at $2.13 per Million tokens by DeepSeek.com (this option means your traffic is routed to mainland China). If we compare against the numbers from Together.ai (a US provider), R1 is still roughly 8x cheaper than o1.
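To see why reasoning models are expensive, note that the thinking tokens are billed like any other output tokens, so a long chain of thought dominates the cost of a query. The token counts below are hypothetical; the prices are the per-million-token rates quoted above and are subject to change.

```python
# Illustrative cost of one reasoning-heavy query: thinking tokens are billed
# like any other output tokens. Token counts are hypothetical; prices are the
# per-million-token rates quoted above (subject to change).
def query_cost(thinking_tokens: int, answer_tokens: int, price_per_million: float) -> float:
    return (thinking_tokens + answer_tokens) / 1_000_000 * price_per_million

THINKING, ANSWER = 4_000, 300   # hypothetical token counts for one hard question
for name, price in [("o1 ($60/M)", 60.0),
                    ("R1 via Together.ai ($7/M)", 7.0),
                    ("R1 via DeepSeek.com ($2.13/M)", 2.13)]:
    print(f"{name}: ${query_cost(THINKING, ANSWER, price):.3f} per query")
```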
Part of the reason inference is cheap is that DeepSeek v3/R1 is a Mixture of Experts (MoE) model. While the complete model is huge (671 Billion params), only 37B parameters are used to generate each token. An MoE model accomplishes this by routing each token through only a small subset of the network, chosen by a learned gating function. The DeepSeek v3/R1 MoE architecture has 256 expert networks in total. To generate each token, 8 of these experts are chosen dynamically, along with one additional shared expert. Having so many experts means that each expert is tiny and can easily fit in GPU memory. Having 8 routed experts is also important, because each expert can fit on an individual GPU in a cluster of 8 GPUs connected by nVidia's NVLINK interconnect. DeepSeek tailored their model to nVidia's architecture.
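Here is a toy sketch of top-k MoE routing to make the idea concrete. The dimensions and random weights are made up for illustration; the real DeepSeek v3/R1 gate is learned during training, but the routing pattern (256 experts, 8 routed plus 1 shared per token) follows the description above.

```python
# Toy sketch of Mixture-of-Experts token routing with top-k gating (illustrative only).
# DeepSeek v3/R1 uses a learned gate over 256 experts, activating 8 routed experts
# plus 1 shared expert per token; dimensions and weights here are made up.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 64, 256, 8

# One tiny feed-forward "expert" per index, plus one always-active shared expert.
experts = [rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(n_experts)]
shared_expert = rng.standard_normal((d_model, d_model)) * 0.02
gate = rng.standard_normal((d_model, n_experts)) * 0.02  # learned in a real model

def moe_forward(token: np.ndarray) -> np.ndarray:
    logits = token @ gate                        # score every expert for this token
    top = np.argsort(logits)[-top_k:]            # pick the 8 best-scoring experts
    weights = np.exp(logits[top]) / np.exp(logits[top]).sum()  # normalize their scores
    routed = sum(w * (token @ experts[i]) for w, i in zip(weights, top))
    return routed + token @ shared_expert        # shared expert always contributes

out = moe_forward(rng.standard_normal(d_model))
print(out.shape, "- only", top_k, "of", n_experts, "experts were evaluated")
```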
DeepSeek mentions that the minimum deployment size for the prefill portion of v3/R1 (prefill generates the first token) is 32 GPUs, while the minimum size for the decode portion (which generates the remaining tokens) is 320 GPUs. 352 GPUs is a pretty large deployment for inference. But they do not mention the maximum batch size they can support, or the throughput in tokens generated per second from this architecture, which makes it hard to analyze and compare the true cost of inference. Nevertheless, the MoE architecture has a substantial advantage to offer in terms of inference cost. A highly accurate MoE model can therefore undercut its non-MoE counterparts.
Conclusion
DeepSeek R1 certainly offers a breakthrough in reasoning Large Language Models. If you haven't had a chance to try this on the Vector robot, I strongly encourage you to try DeepSeek R1 with any of the vendors that offer it for free. The list currently includes Together.ai, Fireworks.ai, Perplexity, and DeepSeek.com. Please follow up with your thoughts in the comments section below.