Artificial Intelligence (AI) has evolved over the past few decades, from narrow, rule-based systems to broad, general-purpose models capable of interpreting and generating human-like content. In the midst of this rapid growth, DeepSeek has recently been making headlines, both as a potential competitor to well-known products such as ChatGPT and as a disruptive force in finance, healthcare and various other industries. While technical details highlight its strengths under the hood, the true story of DeepSeek goes beyond algorithms and data – touching on global market fluctuations, international competition, and the broader sense that AI is now woven into the fabric of modern society.
A Glimpse into DeepSeek’s Origins
DeepSeek, founded by Liang Wenfeng, emerged from a cluster of China-based technology firms and research labs, combining expertise in natural language processing (NLP), deep learning, and data analytics.
Although exact figures remain debated, rumour has it that DeepSeek’s models contain hundreds of billions of parameters. If accurate, this places it among the most robust AI systems on the market. Yet what makes DeepSeek noteworthy is not just its scale; it’s the breadth of its training data, spanning multiple industries, languages, and specialized texts. The creators aimed for an AI that could handle diverse tasks—from straightforward language translations to sophisticated financial forecasting—without the need for extensive domain-specific tuning.
At the time of writing, this key selling point about the training data remains a marketing statement rather than a verifiable fact. As of now, no detailed information about the training datasets has been released, making it impossible for the research community to assess the scope, diversity, or quality of the data. Without access to this core component, it’s difficult to evaluate how much of the model’s performance can genuinely be attributed to its training breadth—and this further underscores the need for transparency beyond promotional narratives.
Not Open Source (Yet?)
The single most significant update in DeepSeek’s story is that its developers have announced the open source release of all major components of the model. Yet, even though this statement is clearly visible on the main website, at the moment the DeepSeek models (especially R1) are not open source.
In fact, while DeepSeek markets its models as open source, this claim doesn’t hold up under closer scrutiny. According to the Open Source Initiative (OSI) and the Open Source AI Definition 1.0, a system can only be considered truly open source if all components—code, model architecture, training and fine-tuning datasets, and weights—are made freely and easily accessible under an OSI-approved license. In the case of DeepSeek, although some model weights have been released, there is no comprehensive repository that includes the full codebase, training data, or reproducible pipelines. This falls short of the transparency and accessibility that the term “open source” implies. As the community increasingly calls out these “open-washing” practices, it becomes essential to distinguish between partially released models and those that are genuinely open source. If only weights are available while key components remain closed, then such models should be clearly labeled as “partially open” or “restricted-access”—not open source.
If DeepSeek decides to follow a true open source path, it could be a pivotal moment in AI, rivalling the importance of earlier open source frameworks such as TensorFlow or PyTorch, but with an even bigger pre-trained ‘brain’ behind it.
That said, we’re always open to constructive dialogue. If any readers have additional information or believe our assessment is inaccurate, we’d be happy to review it and update this article accordingly. Transparency benefits everyone—and we welcome any clarification that helps set the record straight.
Balancing Simplicity and Sophistication
Most state-of-the-art AI models rely on a Transformer architecture introduced by researchers several years ago. DeepSeek takes cues from this approach, using attention mechanisms to figure out which words or data points are most relevant. Reports suggest it layers on an additional hierarchical attention system that processes both low-level linguistic details and higher-level themes or meanings. While that may sound highly technical, the practical advantage is fairly simple: DeepSeek can (in theory) read text with a greater sense of nuance—recognizing context more accurately and providing more logical answers.
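To make the attention idea concrete, here is a minimal sketch of standard scaled dot-product attention, the building block DeepSeek reportedly layers its hierarchical scheme on top of. The function and variable names are ours for illustration; the hierarchical variant itself is not reproduced here.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal scaled dot-product attention.

    Q, K, V: arrays of shape (seq_len, d_k) for queries, keys, values.
    Returns the attended values and the attention weights.
    """
    d_k = Q.shape[-1]
    # Similarity between every query and every key, scaled for numerical stability.
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax over keys: how strongly each token attends to every other token.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# Toy usage: 4 tokens with 8-dimensional representations.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out, attn = scaled_dot_product_attention(x, x, x)
print(attn.round(2))  # each row sums to 1: relevance of every token to a query
```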
It’s important to note that two distinct models exist within the DeepSeek-R1 framework: DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero represents the initial version, while DeepSeek-R1 is an evolved model that builds upon R1-Zero, incorporating further improvements and refinements.
In the published paper, DeepSeek highlighted the focus of its R1-Zero model, which includes:
- Reinforcement learning. DeepSeek adopted a large-scale reinforcement learning framework emphasizing reasoning tasks. Rather than relying on traditional reinforcement learning methods guided by human or AI feedback, DeepSeek opted for a rule-based approach. The technique used is called Group Relative Policy Optimization (GRPO), a custom algorithm developed internally by the DeepSeek team (a minimal sketch of the group-relative idea appears after this list).
- Reward Modeling. The research team established a rule-based reward system to train DeepSeek-R1-Zero, in place of the commonly implemented neural reward models. Reward engineering focuses on defining the reinforcement strategy that directs a model’s training.
- Distillation. By employing effective knowledge compression methods, DeepSeek incorporated sophisticated functionalities into models containing as few as 1.5 billion parameters.
- Emergent behaviour network. This development from DeepSeek reveals that intricate reasoning processes can spontaneously surface through reinforcement learning, without any explicit coding of those patterns.
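To give a feel for the group-relative idea behind GRPO, here is a minimal sketch assuming a rule-based reward and a group of sampled answers per prompt: each answer’s advantage is its reward normalized against the group mean and standard deviation, so no learned value critic is needed. The helper names and toy reward rules are hypothetical, not DeepSeek’s implementation.

```python
import numpy as np

def group_relative_advantages(rewards):
    """GRPO-style advantages: normalize each sampled response's reward
    against the mean and std of its own group (no learned value critic)."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def rule_based_reward(answer, ground_truth, formatted_ok=True):
    """Toy rule-based reward: exact-match accuracy plus a small format bonus.
    The real reward rules are more elaborate; this is illustrative only."""
    return float(answer.strip() == ground_truth.strip()) + 0.1 * float(formatted_ok)

# One prompt, a group of sampled answers, and their relative advantages.
ground_truth = "42"
group = ["42", "41", "42", "I refuse to answer"]
rewards = [rule_based_reward(a, ground_truth) for a in group]
print(group_relative_advantages(rewards))  # correct answers get positive advantage
```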
DeepSeek-R1-Zero Performance Insights
What do we know about DeepSeek-R1-Zero’s performance?
In the benchmark table reported in the paper, we see that DeepSeek-R1-Zero is comparable to OpenAI o1, and even surpasses it in some cases. This is a noteworthy achievement, as it underscores the model’s ability to learn and generalize effectively through RL alone. Additionally, the authors report that the performance of DeepSeek-R1-Zero can be further augmented through the application of majority voting (majority voting is used during model evaluation by generating multiple responses and selecting the most frequent answer as the final result). For example, when majority voting is employed on the AIME (American Invitational Mathematics Examination) benchmark, DeepSeek-R1-Zero’s performance escalates from 71.0% to 86.7%, thereby exceeding the performance of OpenAI-o1-0912.
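Majority voting itself is straightforward to express in code. The sketch below uses illustrative sample answers (not real AIME results) and assumes answers can be compared as plain strings.

```python
from collections import Counter

def majority_vote(answers):
    """Pick the most frequent answer among several sampled responses."""
    counts = Counter(a.strip() for a in answers)
    answer, _ = counts.most_common(1)[0]
    return answer

# Sixteen sampled answers to one question; the consensus wins
# even though several individual samples are wrong.
samples = ["204"] * 9 + ["201"] * 4 + ["96"] * 3
print(majority_vote(samples))  # -> "204"
```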
Note: These benchmarks are based on the results reported by the authors of the paper. As with many academic and technical publications, some degree of cherry-picking is expected and not unusual. In the future, this section should be expanded with independent evaluations and third-party replications to provide a more comprehensive and objective assessment of the model’s real-world performance.
Self-evolution Process of DeepSeek-R1-Zero
One particularly interesting point raised by the authors is the emergence of complex behaviors as the model’s test-time computation increases. The paper highlights phenomena such as reflection, where the model revisits and reassesses its own reasoning steps, and the spontaneous exploration of alternative problem-solving strategies. These behaviors are not hard-coded but appear naturally through the model’s interaction with the reinforcement learning environment.
Through reinforcement learning, the model naturally learns to allocate more thinking time when solving reasoning tasks. Amazingly, this occurs without any external adjustments.
The downside of DeepSeek-R1-Zero and its evolution: DeepSeek-R1
Even though the capabilities of this model are remarkable, there are some reasons that pushed DeepSeek to develop a new model, DeepSeek-R1:
- Readability Issues: DeepSeek-R1-Zero’s outputs often suffer from poor readability.
- Language Consistency: It frequently mixes languages within a single response.
To make reasoning processes more readable and share them with the open community, they developed DeepSeek-R1, a method that utilizes RL with human-friendly cold-start data.
Let’s look in detail at the pipeline used to train DeepSeek-R1, which consists of four stages:
- Cold Start
Unlike DeepSeek-R1-Zero, to prevent the early unstable cold start phase of RL training from the base model, for DeepSeek-R1 we construct and collect a small amount of long CoT data to fine-tune the model as the initial RL actor.
To collect such data, we have explored several approaches: using few-shot prompting with a long CoT as an example, directly prompting models to generate detailed answers with reflection and verification, gathering DeepSeek-R1- Zero outputs in a readable format, and refining the results through post-processing by human annotators.
In this work, we collect thousands of cold-start data to fine-tune the DeepSeek-V3-Base as the starting point for RL. Compared to DeepSeek-R1-Zero, the advantages of cold start data include:
• Readability: A key limitation of DeepSeek-R1-Zero is that its content is often not suitable for reading. Responses may mix multiple languages or lack markdown formatting to highlight answers for users. In contrast, when creating cold-start data for DeepSeek-R1, we design a readable pattern that includes a summary at the end of each response and filters out responses that are not reader-friendly. Here, we define the output format as |special_token|<reasoning_process>|special_token|<summary>, where the reasoning process is the CoT for the query, and the summary is used to summarize the reasoning results. (A toy sketch of this pattern follows the list.)
• Potential: By carefully designing the pattern for cold-start data with human priors, we observe better performance compared to DeepSeek-R1-Zero. We believe iterative training is a better way for reasoning models.
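As a rough illustration of the cold-start pattern mentioned in the Readability point, the snippet below wraps a chain of thought and its summary in the delimiter layout described in the paper and applies a toy readability filter. The delimiter string and the filter logic are assumptions for illustration only.

```python
SPECIAL_TOKEN = "|special_token|"  # placeholder; the actual delimiter strings are not published

def format_cold_start_example(reasoning_process, summary):
    """Wrap a chain of thought and its summary in the pattern described
    in the paper: |special_token|<reasoning>|special_token|<summary>."""
    return f"{SPECIAL_TOKEN}{reasoning_process}{SPECIAL_TOKEN}{summary}"

def is_reader_friendly(text, target_language="en"):
    """Toy stand-in for the paper's readability screening: reject empty text
    and (for English targets) text containing CJK characters."""
    has_cjk = any("\u4e00" <= ch <= "\u9fff" for ch in text)
    return bool(text.strip()) and not (target_language == "en" and has_cjk)

example = format_cold_start_example(
    "First compute 12 * 7 = 84, then subtract 4 to get 80.",
    "The answer is 80.",
)
print(example, is_reader_friendly(example))
```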
- Reasoning-oriented Reinforcement Learning

After fine-tuning DeepSeek-V3-Base on the cold start data, we apply the same large-scale reinforcement learning training process as employed in DeepSeek-R1-Zero.
This phase focuses on enhancing the model’s reasoning capabilities, particularly in reasoning-intensive tasks such as coding, mathematics, science, and logic reasoning, which involve well-defined problems with clear solutions.
During the training process, we observe that CoT often exhibits language mixing, particularly when RL prompts involve multiple languages. To mitigate the issue of language mixing, we introduce a language consistency reward during RL training, which is calculated as the proportion of target language words in the CoT.
Although ablation experiments show that such alignment results in a slight degradation in the model’s performance, this reward aligns with human preferences, making it more readable. Finally, we combine the accuracy of reasoning tasks and the reward for language consistency by directly summing them to form the final reward. We then apply RL training on the fine-tuned model until it achieves convergence on reasoning tasks.
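The summed reward described here can be pictured with a small sketch. The word-level language detector below is a deliberate simplification (the paper does not specify how the proportion of target-language words is computed), and the vocabulary set is purely illustrative.

```python
import re

def language_consistency_reward(cot, target_words):
    """Proportion of chain-of-thought words that belong to the target language.
    The target language is approximated here by a vocabulary set; the paper
    does not specify the exact detector, so this is illustrative only."""
    words = re.findall(r"\w+", cot.lower())
    if not words:
        return 0.0
    return sum(w in target_words for w in words) / len(words)

def final_reward(accuracy_reward, cot, target_words):
    """Paper's recipe: directly sum task accuracy and language consistency."""
    return accuracy_reward + language_consistency_reward(cot, target_words)

english_vocab = {"the", "sum", "of", "both", "roots", "is", "two", "so", "answer"}
cot = "the sum of both roots is two so answer 2"
print(final_reward(accuracy_reward=1.0, cot=cot, target_words=english_vocab))
```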
- Rejection Sampling and Supervised Fine-Tuning

When reasoning-oriented RL converges, we utilize the resulting checkpoint to collect SFT (Supervised Fine-Tuning) data for the subsequent round. Unlike the initial cold-start data, which primarily focuses on reasoning, this stage incorporates data from other domains to enhance the model’s capabilities in writing, role-playing, and other general-purpose tasks. Specifically, we generate the data and fine-tune the model as described below.
- Reasoning data

We curate reasoning prompts and generate reasoning trajectories by performing rejection sampling from the checkpoint of the RL training above.
In the previous stage, we only included data that could be evaluated using rule-based rewards. However, in this stage, we expand the dataset by incorporating additional data, some of which use a generative reward model by feeding the ground-truth and model predictions into DeepSeek-V3 for judgment.
Additionally, because the model output is sometimes chaotic and difficult to read, we have filtered out chains of thought with mixed languages, long paragraphs, and code blocks. For each prompt, we sample multiple responses and retain only the correct ones. In total, we collect about 600k reasoning-related training samples.
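In outline, rejection sampling here means “sample many responses, keep only those that pass a correctness check”. The sketch below uses placeholder generate and verify callables standing in for the model call and the rule-based or DeepSeek-V3-as-judge verification.

```python
import random

def rejection_sample(prompt, generate, verify, n_samples=16):
    """Sample several responses for a prompt and keep only those that pass
    verification. generate(prompt) -> str and verify(prompt, response) -> bool
    are placeholders for the real model call and correctness check."""
    kept = []
    for _ in range(n_samples):
        response = generate(prompt)
        if verify(prompt, response):
            kept.append(response)
    return kept

# Toy usage with stub functions standing in for the real model and checker.
gen = lambda p: random.choice(["x = 3", "x = 5", "x = 3 (mixed 语言)"])
ok = lambda p, r: r == "x = 3"
print(rejection_sample("Solve x + 2 = 5", gen, ok, n_samples=8))
```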
- Non-Reasoning data

For non-reasoning data, such as writing, factual QA, self-cognition, and translation, we adopt the DeepSeek-V3 pipeline and reuse portions of the SFT dataset of DeepSeek-V3.
For certain non-reasoning tasks, we call DeepSeek-V3 to generate a potential chain-of-thought before answering the question by prompting. However, for simpler queries, such as “hello”, we do not provide a CoT in response. In the end, we collected a total of approximately 200k training samples that are unrelated to reasoning.
We fine-tune DeepSeek-V3-Base for two epochs using the above curated dataset of about 800k samples.
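For readers who want a mental model of this step, the skeleton below shows a generic two-epoch supervised fine-tuning loop in the HuggingFace/PyTorch style. It is a hedged sketch, not DeepSeek’s training code: the model, tokenizer, and dataset are placeholders, and details such as label masking and distributed training are omitted.

```python
import torch
from torch.utils.data import DataLoader

def sft_two_epochs(model, tokenizer, samples, device="cuda", lr=1e-5):
    """Generic supervised fine-tuning loop: two epochs of next-token
    prediction over curated samples. `samples` is a list of dicts with a
    "text" field; `model`/`tokenizer` are assumed HuggingFace-style."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    loader = DataLoader(samples, batch_size=8, shuffle=True)
    model.train()
    for epoch in range(2):  # "two epochs", as stated above
        for batch in loader:
            enc = tokenizer(batch["text"], return_tensors="pt",
                            padding=True, truncation=True).to(device)
            # Labels equal inputs: standard causal language-modeling loss.
            # (Padding tokens are left unmasked here for brevity.)
            out = model(**enc, labels=enc["input_ids"])
            out.loss.backward()
            optimizer.step()
            optimizer.zero_grad()
```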
- Reinforcement Learning for all Scenarios
To further align the model with human preferences, we implement a secondary reinforcement learning stage aimed at improving the model’s helpfulness and harmlessness while simultaneously refining its reasoning capabilities. Specifically, we train the model using a combination of reward signals and diverse prompt distributions.
For reasoning data, we adhere to the methodology outlined in DeepSeek-R1-Zero, which utilizes rule-based rewards to guide the learning process in math, code, and logical reasoning domains. For general data, we resort to reward models to capture human preferences in complex and nuanced scenarios.
We build upon the DeepSeek-V3 pipeline and adopt a similar distribution of preference pairs and training prompts.
For helpfulness, we focus exclusively on the final summary, ensuring that the assessment emphasizes the utility and relevance of the response to the user while minimizing interference with the underlying reasoning process. For harmlessness, we evaluate the entire response of the model, including both the reasoning process and the summary, to identify and mitigate any potential risks, biases, or harmful content that may arise during the generation process.
Ultimately, the integration of reward signals and diverse data distributions enables us to train a model that excels in reasoning while prioritizing helpfulness and harmlessness.
Source: DeepSeek-R1 Paper
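A rough way to picture the stage-4 reward mix described above: rule-based rewards for verifiable reasoning prompts, learned reward models for general prompts, with helpfulness judged on the final summary only and harmlessness on the full response. The dispatch below is our own sketch with hypothetical function arguments, not DeepSeek’s code.

```python
def stage4_reward(prompt_type, response, summary, ground_truth=None,
                  rule_reward=None, helpfulness_rm=None, harmlessness_rm=None):
    """Sketch of the mixed reward for the final RL stage.

    prompt_type: "reasoning" (math/code/logic with checkable answers) or "general".
    rule_reward, helpfulness_rm, harmlessness_rm are placeholders for a
    rule-based checker and two learned reward models.
    """
    if prompt_type == "reasoning":
        # Verifiable domains: reuse the rule-based reward from R1-Zero training.
        return rule_reward(response, ground_truth)
    # General prompts: helpfulness scored on the final summary only,
    # harmlessness scored on the entire response (reasoning + summary).
    return helpfulness_rm(summary) + harmlessness_rm(response)
```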
Distilled Models
Distilled models are lighter and more efficient versions of large-scale AI systems, created through a process called knowledge distillation. In this approach, a smaller model (student) is trained to replicate the behavior and reasoning capabilities of a larger, more powerful model (teacher).
Moreover, distilled models can be adapted and fine-tuned for specific domains, such as coding, mathematics, or scientific problem-solving, making them highly practical for real-world applications where computational resources are limited, but domain-specific performance is critical.
DeepSeek plans to further explore distillation techniques to bring advanced reasoning capabilities to more efficient, smaller models. As a first step, they distilled DeepSeek-R1’s knowledge into models like Qwen and Llama, fine-tuning them on 800K curated samples using supervised fine-tuning (SFT) only, without reinforcement learning. DeepSeek’s authors claim that this approach significantly improved the performance of smaller models—DeepSeek-R1-Distill-Qwen-1.5B, for instance, outperformed GPT-4o and Claude-3.5-Sonnet on math benchmarks, while larger distilled versions consistently outperformed other instruction-tuned models on reasoning tasks.
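Because the distillation step is plain supervised fine-tuning on teacher-generated samples (no reinforcement learning, no logit matching), the whole procedure fits in a few lines. The helpers below are placeholders for the real generation and SFT steps, not DeepSeek’s pipeline.

```python
def distill_via_sft(teacher_generate, student_finetune, prompts):
    """Distillation as described above: the teacher (DeepSeek-R1) generates
    curated reasoning samples, and a smaller student (e.g. a Qwen or Llama
    base model) is fine-tuned on them with supervised learning only.

    teacher_generate(prompt) -> str and student_finetune(dataset) are
    placeholders for the real generation and SFT steps."""
    dataset = [{"prompt": p, "response": teacher_generate(p)} for p in prompts]
    return student_finetune(dataset)
```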
DeepSeek-R1 Evaluation
Future Work
Looking ahead, DeepSeek aims to enhance general capabilities (e.g., function calling, multi-turn conversations, role-playing, and JSON output), address language mixing issues in multilingual scenarios, and improve prompt engineering, as the model performs best in zero-shot settings. Additionally, future versions will focus on improving performance in software engineering tasks by introducing rejection sampling and asynchronous evaluation methods during the reinforcement learning phase.





