Background: Large language models (LLMs) like Claude-3.5-Sonnet and GPT-4o are widely used in research and education but are limited by high costs and proprietary restrictions. DeepSeek-R1, an open-source LLM developed by DeepSeek-AI, leverages a Mixture-of-Experts (MoE) architecture and multi-stage training to offer a cost-effective alternative. This study evaluates DeepSeek-R1’s reliability for academic and clinical applications compared to Claude-3.5-Sonnet and GPT-4o, focusing on performance, cost efficiency, and limitations such as censorship and data privacy.
Methods: A mixed-methods approach was employed, including benchmark evaluations on MATH-500 (mathematics), HumanEval (programming), MMLU (general knowledge), and MedQA (medical reasoning). A prospective user study with 112 Iranian medical researchers assessed diagnostic accuracy on 50 standardized medical cases spanning internal medicine, pediatrics, and psychiatry. Performance was reported as mean accuracy ± SD, with paired t-tests (significance threshold p<0.05) for pairwise model comparisons and ANOVA for comparisons across all three models. Confidence scores were analyzed using calibration curves, quantified by the Pearson correlation coefficient (r). Cost, latency, and limitations (e.g., censorship, data storage) were evaluated from model documentation and reports.
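For clarity, a minimal sketch of this statistical pipeline is shown below. It is illustrative only: the arrays, sample values, and variable names are assumptions for demonstration, not the study's actual data or analysis code.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Placeholder per-case accuracy scores for the same 50 cases; real inputs
# would be the graded model outputs from the user study.
deepseek = rng.normal(0.792, 0.040, size=50)
claude = rng.normal(0.785, 0.042, size=50)
gpt4o = rng.normal(0.778, 0.041, size=50)

# Paired t-test: the same cases are scored by both models.
t_stat, p_value = stats.ttest_rel(deepseek, claude)
print(f"paired t-test: t = {t_stat:.2f}, p = {p_value:.3f}")

# One-way ANOVA across the three models.
f_stat, f_p = stats.f_oneway(deepseek, claude, gpt4o)
print(f"ANOVA: F = {f_stat:.2f}, p = {f_p:.3f}")

# Calibration: correlate each answer's stated confidence with correctness.
confidence = rng.uniform(0.5, 1.0, size=50)           # placeholder confidences
correct = (rng.uniform(size=50) < confidence).astype(float)
r, r_p = stats.pearsonr(confidence, correct)
print(f"calibration (Pearson r): r = {r:.2f}, p = {r_p:.3f}")
```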
Results: DeepSeek-R1 achieved 97.3% ± 1.2 on MATH-500 and 96.3% ± 1.5 on HumanEval, outperforming both Claude-3.5-Sonnet (MATH-500: 95.1% ± 1.4; HumanEval: 94.2% ± 1.7) and GPT-4o (96.0% ± 1.3; 95.5% ± 1.6). MMLU and MedQA accuracies were comparable across models (DeepSeek-R1: 90.8% ± 2.0 and 85.0% ± 3.2, respectively). In the user study, DeepSeek-R1’s diagnostic accuracy (79.2% ± 4.0) matched Claude-3.5-Sonnet (78.5% ± 4.2, p=0.42) and GPT-4o (77.8% ± 4.1, p=0.51), with strong performance in internal medicine (83% ± 4.5) and pediatrics (81% ± 5.0). DeepSeek-R1 offered roughly 96% cost savings ($0.14 vs. $4.50 per million tokens) and faster generation speed (42 tokens/s). Limitations include a 4k-token output cap, real-time content censorship, and data storage in China.
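As a quick check on the quoted pricing (assuming the per-million-token rates above apply uniformly to a given workload), the implied savings can be computed directly:

```python
# Implied savings from the quoted per-million-token rates.
deepseek_rate = 0.14      # USD per million tokens
proprietary_rate = 4.50   # USD per million tokens
savings = 1 - deepseek_rate / proprietary_rate
print(f"implied savings: {savings:.1%}")  # ~96.9%, i.e. roughly the 96% reported
```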
Conclusion: DeepSeek-R1 is a reliable, cost-effective alternative to proprietary LLMs, excelling in technical and medical reasoning tasks. Its open-source nature enhances accessibility, but censorship and data-privacy concerns warrant careful adoption. These comparative analyses can guide its use in academic and clinical settings and underscore the need for ethical oversight.