AI Ecosystem Updates | Issue #12 [February 19, 2025]
What?
Recently, Chinese AI startup DeepSeek disrupted the AI landscape with its open-source Large Language Model (LLM) DeepSeek-R1, proving that startups can build competitive foundation models without relying on vast numbers of high-end GPUs. This efficiency has drawn global attention, highlighting that AI breakthroughs do not necessarily require unlimited hardware and computational resources. The impact? Built in under two months by a small team with limited technical and financial resources, DeepSeek’s AI assistant – powered by DeepSeek-V3, the foundation of DeepSeek-R1 – displaced OpenAI’s ChatGPT to become the #1 app on the US app stores (Google Play Store and Apple’s App Store). More about the model is available in DeepLearning.AI’s coverage of the release here. The code and weights are freely licensed for both commercial and personal use, including the ability to train new models using R1’s outputs.
Reasoning
DeepSeek-R1 has impressed experts, including OpenAI CEO Sam Altman, with its advanced reasoning abilities. Unlike OpenAI o1, it operates with full transparency, allowing users to see how it arrives at conclusions. The foundation model is open-source, enabling businesses and researchers to modify it freely. DeepSeek-R1 is optimized to handle complex reasoning tasks using Chain-of-Thought (CoT) without needing explicit instructions, excelling in areas such as multi-step logical deduction, mathematical problem-solving, and complex code generation, setting it apart from many existing AI systems. More information about how DeepSeek trained its R1 model can be found in this technical report.
How it Works
DeepSeek-R1 is a version of DeepSeek-V3-Base that was fine-tuned over four stages to enhance its ability to process a CoT. It utilizes a Mixture-of-Experts (MoE) transformer architecture, where only a subset of parameters is activated for each input. In the case of DeepSeek-R1, the model has 671 billion (671B) total parameters, with only 37 billion (37B) active at any given time, optimizing resource efficiency. Each MoE layer includes multiple neural networks, known as experts, and a gating module that determines which experts to engage based on the input. This enables different experts to focus on distinct types of data. By utilizing only a portion of its parameters at a time, the model reduces energy consumption and operates more efficiently than dense models of comparable scale. The model has a context window of 128,000 tokens and is designed to balance computational cost and performance.
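To make the routing idea concrete, below is a minimal sketch of a top-k gated MoE layer in PyTorch. The expert count, layer sizes, and top-k value are toy numbers chosen for illustration, not DeepSeek-V3’s actual configuration, and the routing omits refinements such as shared experts and load-balancing losses.

```python
# Minimal sketch of a top-k gated Mixture-of-Experts layer (illustrative only;
# the expert count and dimensions are toy values, not DeepSeek-V3's actual config).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model=64, d_hidden=256, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Each "expert" is a small feed-forward network.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])
        # The gating module scores every expert for each token.
        self.gate = nn.Linear(d_model, n_experts)

    def forward(self, x):                                 # x: (tokens, d_model)
        scores = F.softmax(self.gate(x), dim=-1)
        weights, idx = scores.topk(self.top_k, dim=-1)    # keep only the top-k experts
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        # Only the selected experts run for each token, so most parameters stay idle.
        for k in range(self.top_k):
            for e in range(len(self.experts)):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k:k+1] * self.experts[e](x[mask])
        return out

tokens = torch.randn(4, 64)
print(MoELayer()(tokens).shape)   # torch.Size([4, 64])
```

In a full model, a layer like this replaces the dense feed-forward block in each transformer layer, which is how the total parameter count can grow far beyond the number of parameters actually used for any single token.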
DeepSeek-R1 underwent multiple rounds of fine-tuning to boost its problem-solving skills. It was first fine-tuned on a synthetic dataset of long-form CoT examples generated through multiple prompting techniques and refined by human annotators. To improve its problem-solving capabilities, the model was then optimized using Group Relative Policy Optimization (GRPO), a Reinforcement Learning (RL) method that rewarded accuracy and structured reasoning, which improved its performance in fields like mathematics and logical reasoning. An additional 600,000 reasoning responses were generated using in-progress versions of R1, along with 200,000 non-reasoning examples, to further refine its outputs. A final round of RL improved its accuracy on reasoning problems and, more generally, its helpfulness and harmlessness. The key insight from this work, demonstrated most clearly by DeepSeek-R1-Zero (described below), is that RL alone, without Supervised Fine-Tuning (SFT), can give rise to reasoning behaviors like self-verification, reflection, and long CoT; DeepSeek-R1 layers comparatively small supervised stages on top of this RL recipe to improve readability and accuracy, marking an important advance for RL-driven models.
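For intuition, here is a minimal sketch of the group-relative scoring idea behind GRPO: several responses are sampled for the same prompt, each is scored by a rule-based reward (e.g., correct final answer, proper formatting), and each response’s advantage is its reward relative to its own group. The reward values below are made up for illustration; DeepSeek’s actual reward design is described in the technical report.

```python
# Minimal sketch of the group-relative advantage used in GRPO (illustrative only;
# the rewards below are made-up scores, not outputs of DeepSeek's reward rules).
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled response relative to the other responses for the same prompt."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Eight sampled answers to one math problem: 1.0 = correct, 0.0 = incorrect.
rewards = [1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
print(group_relative_advantages(rewards).round(2))
# Correct answers receive positive advantages, incorrect ones negative, so the
# policy is pushed toward behaviors that produced correct, well-formed answers.
```

Because the baseline comes from the group itself, GRPO avoids training a separate value model, which is part of what keeps the RL stage comparatively cheap.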
Other Models
DeepSeek-R1-Zero is a companion model that was trained from DeepSeek-V3-Base exclusively through RL, with no supervised fine-tuning. The model independently developed problem-solving strategies when given appropriate incentives. However, this approach led to a higher likelihood of language mixing and unreadable outputs. Additionally, DeepSeek introduced six distilled dense models with parameter sizes of 1.5B, 7B, 8B, 14B, 32B, and 70B. Of these, four were derived from versions of Qwen (from Alibaba), while the remaining two were based on versions of Llama (from Meta).
Benchmark Performance
Performance tests showed DeepSeek-R1 competing closely with OpenAI o1, outperforming it on 5 of 11 benchmarks tested. It excelled in AIME 2024 (a pass rate of 79.8%), MATH-500 (a score of 97.3%), and SWE-Bench Verified, while delivering competitive results on Codeforces (outperforming 96.3% of human participants), GPQA Diamond, and MMLU. On LiveCodeBench, DeepSeek-R1 solved 65.9% of problems correctly, surpassing o1’s 63.4%. Additionally, it outperformed Anthropic’s Claude 3.5 Sonnet on 19 of 21 benchmarks and OpenAI’s GPT-4o on 20 of 21. DeepSeek also introduced distilled models: DeepSeek-R1-Distill-Qwen-32B, which outperformed OpenAI o1-mini across all tested benchmarks, and DeepSeek-R1-Distill-Llama-70B, which surpassed o1-mini on all except Codeforces. Like DeepSeek-R1, the distilled models showcased strong performance with lower resource requirements. Learn more about the performance of DeepSeek-R1 in this technical report.
Access
Interact with DeepSeek-R1 on its official website and use the “DeepThink” option for reasoning tasks. For API integration, use the OpenAI-compatible API. To run DeepSeek-R1 locally, visit the DeepSeek-V3 repository for instructions, resources, and access to the open-source models, including distilled versions. Support for Hugging Face’s Transformers has not been added yet. Learn more here.
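Since the API is OpenAI-compatible, existing OpenAI client code can be pointed at DeepSeek’s endpoint with minimal changes. The sketch below uses the base URL and model name from DeepSeek’s public documentation at the time of writing; treat them, along with the separate reasoning_content field, as details to verify against the current docs, and replace the placeholder API key with your own.

```python
# Minimal sketch of calling DeepSeek-R1 through its OpenAI-compatible API.
# Base URL and model name follow DeepSeek's docs at the time of writing; verify
# against the current documentation before relying on them.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DEEPSEEK_API_KEY",        # placeholder key
    base_url="https://api.deepseek.com",    # DeepSeek's OpenAI-compatible endpoint
)

response = client.chat.completions.create(
    model="deepseek-reasoner",              # DeepSeek-R1 reasoning model
    messages=[{"role": "user", "content": "How many prime numbers are there below 50?"}],
)

message = response.choices[0].message
# DeepSeek returns the chain of thought in a separate, DeepSeek-specific field.
print(getattr(message, "reasoning_content", None))
print(message.content)
```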
The Economic Impact
After the US imposed a GPU export embargo on China, DeepSeek had to find resourceful and cost-effective ways to innovate and optimize compute – and it succeeded. Notably, while AI giant OpenAI’s GPT-4 was trained on tens of thousands of GPUs (25,000 NVIDIA A100 GPUs as per OpenAI, 100,000 NVIDIA H100 GPUs as per a report by Inc42), DeepSeek used only around 2,000 NVIDIA H800 chips – designed specifically for the Chinese market – and kept training costs under $6 Mn. As a result, DeepSeek is able to offer API access to its models at a cost roughly 95% lower than OpenAI’s latest models (including OpenAI o1), and this has upended the assumptions of AI investors globally. Learn more.
DeepSeek’s entry into the market is driving down not just AI training costs but inference costs as well. While OpenAI o1 charges $60 per million output tokens, DeepSeek-R1 offers the same service via its API for just $2.19 (roughly a 27X cost reduction), making high-performance AI much more affordable at scale. Additionally, DeepSeek charges $0.55 per million input tokens ($0.14 for cached inputs), significantly lower than OpenAI o1’s $15 per million input tokens ($7.50 for cached inputs). This massive cost reduction highlights the growing influence of open-weight models. By making advanced AI more accessible, DeepSeek is accelerating innovation for startups, enterprises, and independent developers who can now build AI-powered applications at a fraction of previous costs.
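As a back-of-the-envelope illustration of what these prices mean in practice, the sketch below compares the cost of a hypothetical monthly workload using the per-million-token list prices quoted above (which will change over time, and which ignore cached-input discounts).

```python
# Rough cost comparison using the per-million-token prices quoted above (USD).
# Prices change over time and cached-input discounts are ignored here.
PRICES = {  # model: (input $/M tokens, output $/M tokens)
    "OpenAI o1":   (15.00, 60.00),
    "DeepSeek-R1": (0.55, 2.19),
}

def workload_cost(model, input_millions, output_millions):
    """Cost of a workload given millions of input and output tokens."""
    inp, out = PRICES[model]
    return input_millions * inp + output_millions * out

# Hypothetical workload: 100M input tokens and 20M output tokens per month.
for model in PRICES:
    print(f"{model}: ${workload_cost(model, 100, 20):,.2f}")
# OpenAI o1: $2,700.00 vs DeepSeek-R1: $98.80 -- roughly a 27x difference.
```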
China’s Growing Influence
DeepSeek’s success is also a sign of China’s rapid advancements in AI, despite the US-imposed GPU export embargo. The country has shown a strong capacity to innovate, execute with speed, and thrive under constraints and restrictions. While the US initially led the generative AI revolution, China has been catching up fast. Foundation models like Qwen, Kimi, InternVL, and DeepSeek demonstrate that China is now competing at the highest levels of AI research and development. In fact, in video generation, China has at times taken the lead.
DeepSeek-R1’s open-weight release and detailed technical report are a step forward for high-performance open models. Some US companies push for regulations restricting open-source AI by exaggerating hypothetical risks like human extinction. As open-source models gain traction, businesses worldwide may increasingly rely on AI models shaped by China’s values rather than those of the US, especially if the US tries to restrict open-source AI development. More on this here.
Defying the Scaling Laws
There has been significant hype around scaling up foundation models as the primary driver of AI performance and progress. Notably, AI foundation models have already been trained on most of the available open and proprietary data. However, the next generation of high-profile models has not shown significant performance improvements, with further training offering negligible gains relative to the high costs involved (see more).
The success of DeepSeek-R1 challenges the idea that scaling up model size is the only way to improve AI, following a strategy similar to models like Google’s Gemini and Mistral AI’s Mixtral, which also prioritize efficiency and architectural innovations over sheer parameter count. Instead, smarter training techniques and optimized architectures are proving to be just as effective.
DeepSeek’s innovations were partly driven by the need to work with less-powerful hardware due to the US AI chip embargo, leading to a model that delivers high performance without excessive computational costs. As AI becomes more efficient, businesses will have more opportunities to leverage cutting-edge AI without the need for massive computing power.
The Impact of Open Models
DeepSeek-R1’s release signals a major shift in AI accessibility. As this article by DeepLearning.AI highlights, open-weight models like DeepSeek-R1 and Meta’s Llama 3 are becoming mainstream, reducing costs (see the section The Economic Impact) and increasing competition. This has lowered the cost of accessing state-of-the-art foundation models, and DeepSeek reinforces the trend. With the accessibility bar lowered, AI startups can focus on building Generative AI (GenAI) applications on top of the foundation models.
A growing trend among investors across the globe is to invest in startups that build GenAI applications on top of AI foundation models (both open-source and proprietary), rather than in startups building the foundation models themselves. This is evident from the fact that GenAI funding reached new heights in 2024. The primary reasons are faster returns on investment (foundation models are costly and time-consuming to build) and the fact that foundation-model businesses are more exposed to risks arising from regulatory uncertainty across geographies. By leveraging established foundation models from other companies, startups can instead concentrate on creating unique, value-added services tailored to specific consumer or business needs, enhancing their market appeal and competitive edge.
As the cost of accessing proprietary foundation models declines with the rise of high-performance open-source alternatives, developers and investors are expected to continue favoring GenAI applications over building foundation models. As noted in the section China’s Growing Influence, the success of open-weight models challenges the hype from major tech companies advocating stringent regulation over hypothetical AI risks like human extinction. Additionally, the success of DeepSeek-R1 may spark renewed investor interest in foundation models.
Thoughts
OpenAI o1 started the trend of reasoning models that incorporate CoT reasoning without explicit prompting. However, both o1 and its successor, o3 (so far available as o3-mini), keep their reasoning steps hidden. In contrast, DeepSeek-R1 provides full transparency, allowing users to see its thought process. DeepSeek’s research on distillation highlights the effectiveness of such models as teachers for training smaller models: the distilled models inherit some of the reasoning capabilities of their larger counterparts, improving their accuracy.
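For readers unfamiliar with how such distillation works in practice, here is a minimal, illustrative sketch of the general idea: the teacher’s full responses (reasoning plus final answer) are collected and used as ordinary supervised fine-tuning data for a smaller student. The query_teacher helper and the prompts are placeholders, not DeepSeek’s actual pipeline, which is documented in its technical report.

```python
# Minimal sketch of output-level distillation: gather a teacher model's responses
# (reasoning trace plus final answer) and save them as supervised fine-tuning data
# for a smaller student. `query_teacher` and the prompts are placeholders.
import json

def query_teacher(prompt: str) -> str:
    """Stand-in for a call to the teacher model (e.g., DeepSeek-R1 via its API)."""
    return "<think>...reasoning trace...</think> Final answer: 42"

prompts = [
    "What is 6 * 7?",
    "Is 91 a prime number? Explain your reasoning.",
]

# Each record pairs a prompt with the teacher's full response; a smaller model is
# then fine-tuned on these pairs with a standard SFT recipe.
with open("distillation_sft.jsonl", "w") as f:
    for p in prompts:
        f.write(json.dumps({"prompt": p, "completion": query_teacher(p)}) + "\n")

print(f"Wrote {len(prompts)} distillation examples to distillation_sft.jsonl")
```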
During his visit to India in 2023, OpenAI CEO Sam Altman had famously remarked that India should not even attempt to build foundation LLMs given the compute- and cost-heavy nature of such a project. Ironically, Altman too found DeepSeek-R1’s performance and cost-effectiveness to be “impressive”. Renowned tech investor Marc Andreessen famously referred to the model as “AI’s Sputnik moment.” DeepSeek’s success highlights that mastering AI and foundation models is primarily an execution challenge, driven by talent and speed. It proves that AI innovation is not limited to massive tech companies and could shift the perception that AI development requires vast resources.
DeepSeek’s breakthrough offers hope to countries seeking to carve their own AI path through a resource-efficient and innovation-driven approach. The AI startup’s foundation models not only deliver strong performance but also come with a license that allows their outputs to be used for distillation. With AI models becoming more affordable and adaptable, startups and tech firms have new opportunities to build intelligent applications.
Additionally, advancements of this kind contribute to the evolution of language and multimodal models across various scales. As AI continues to evolve, innovation will no longer be limited to large tech giants, but will instead be available to a broader range of developers and businesses worldwide. However, as this article by DeepLearning.AI puts it, it is uncertain whether this development will reduce the demand for computing power, as lower costs can often drive higher overall spending. In the long run, the demand for intelligence and compute seems limitless, making it likely that usage will continue to grow even as it becomes more affordable.
Lastly, the geopolitical implications of DeepSeek-R1 – such as reliance on the US or China for compute and foundation models, and its impact on international relations, AI regulations, policy decisions, and global competition – will unfold over time. It is crucial for countries to innovate and develop foundation models that leverage native data and reflect their own values. After all, technology should be equitable for all.
If you are interested in learning more, feel free to check out these articles about DeepSeek-R1 and its impact by DeepLearning.AI, and this technical report by DeepSeek. Additionally, here is our coverage of India’s AI ambitions and the DeepSeek impact.
Thank you for reading through! I genuinely hope you found the content useful. Feel free to reach out to us at [email protected] and share your feedback and thoughts to help us make it better for you next time.
Acronyms used in the blog that have not been defined earlier: (a) Artificial Intelligence (AI), (b) Graphics Processing Unit (GPU), (c) United States (US), (d) Chief Executive Officer (CEO), (e) American Invitational Mathematics Examination (AIME), (f) Mathematics Aptitude Test of Heuristics (MATH), (g) Software Engineering Benchmark Verified (SWE-Bench Verified), (h) Graduate-Level Google-Proof Q&A (GPQA), (i) Massive Multitask Language Understanding (MMLU), (j) Application Programming Interface (API), and (k) Million (Mn).