

1- Student Research Committee, Golestan University of Medical Sciences, Gorgan, Iran
2- Gorgan Congenital Malformations Research Center, Jorjani Clinical Sciences Research Institute, Golestan University of Medical Sciences, Gorgan, Iran; Department of Medical Genetics, School of Advanced Technologies in Medicine, Golestan University of Medical Sciences, Gorgan, Iran, oladnabidozin@yahoo.com
Abstract:
Background: Large language models (LLMs) like Claude-3.5-Sonnet and GPT-4o are widely used in research and education but are limited by high costs and proprietary restrictions. DeepSeek-R1, an open-source LLM developed by DeepSeek-AI, leverages a Mixture-of-Experts (MoE) architecture and multi-stage training to offer a cost-effective alternative. This study evaluates DeepSeek-R1’s reliability for academic and clinical applications compared to Claude-3.5-Sonnet and GPT-4o, focusing on performance, cost efficiency, and limitations such as censorship and data privacy.
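The Mixture-of-Experts routing mentioned above can be sketched in a few lines. This is an illustrative top-k gate over random linear "experts", not DeepSeek-R1's actual implementation; the expert count, gate projection, and dimensions here are made up for demonstration:

```python
import numpy as np

def moe_forward(x, experts, gate_w, k=2):
    """Route input x to the top-k experts by gating score and
    combine their outputs, weighted by a softmax over those scores."""
    scores = x @ gate_w                      # one gating score per expert
    topk = np.argsort(scores)[-k:]           # indices of the k best experts
    w = np.exp(scores[topk] - scores[topk].max())
    w /= w.sum()                             # softmax over the selected experts
    return sum(wi * experts[i](x) for wi, i in zip(w, topk))

rng = np.random.default_rng(0)
dim, n_experts = 8, 4
# Each "expert" is a small linear map; the gate is a learned projection.
mats = [rng.normal(size=(dim, dim)) for _ in range(n_experts)]
experts = [(lambda M: (lambda x: x @ M))(M) for M in mats]
gate_w = rng.normal(size=(dim, n_experts))
y = moe_forward(rng.normal(size=dim), experts, gate_w, k=2)
print(y.shape)  # (8,)
```

The efficiency argument is that only k of the n experts run per token, so compute per token stays roughly constant as total parameters grow.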
Methods: A mixed-methods approach was employed, including benchmark evaluations across MATH-500 (mathematics), HumanEval (programming), MMLU (general knowledge), and MedQA (medical reasoning). A prospective user study with 112 Iranian medical researchers assessed diagnostic accuracy on 50 standardized medical cases across specialties (internal medicine, pediatrics, psychiatry). Performance was measured as mean accuracy ± SD, with paired t-tests (p<0.05) and ANOVA for comparisons. Confidence scores were analyzed using calibration curves (Pearson r). Cost, latency, and limitations (e.g., censorship, data storage) were evaluated using model documentation and reports.
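The paired t-test and calibration analysis above can be sketched with standard-library code. The per-case scores below are made-up numbers for illustration, not the study's data:

```python
import math
import statistics as st

def paired_t(a, b):
    """Paired t statistic and degrees of freedom for two
    equal-length lists of per-case scores (same cases, two models)."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    t = st.mean(d) / (st.stdev(d) / math.sqrt(n))
    return t, n - 1

def pearson_r(x, y):
    """Pearson correlation, e.g. between stated confidence and accuracy."""
    mx, my = st.mean(x), st.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x)
                   * sum((b - my) ** 2 for b in y))
    return num / den

# Illustrative per-case accuracies for two models on the same cases:
model_a = [0.82, 0.75, 0.80, 0.78, 0.84]
model_b = [0.80, 0.76, 0.77, 0.79, 0.81]
t, df = paired_t(model_a, model_b)
print(f"t={t:.2f}, df={df}")
```

The pairing matters: because both models are scored on the same 50 cases, differencing removes per-case difficulty variance before testing.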
Results: DeepSeek-R1 achieved 97.3% ± 1.2 on MATH-500 and 96.3% ± 1.5 on HumanEval, outperforming Claude-3.5-Sonnet (95.1% ± 1.4, 94.2% ± 1.7) and GPT-4o (96.0% ± 1.3, 95.5% ± 1.6). MMLU and MedQA accuracies were comparable across models (90.8% ± 2.0 and 85.0% ± 3.2, respectively). In the user study, DeepSeek-R1’s diagnostic accuracy (79.2% ± 4.0) matched Claude-3.5-Sonnet (78.5% ± 4.2, p=0.42) and GPT-4o (77.8% ± 4.1, p=0.51), with strong performance in internal medicine (83% ± 4.5) and pediatrics (81% ± 5.0). DeepSeek-R1 offered 96% cost savings ($0.14 vs. $4.50 per million tokens) and higher throughput (42 tokens/s). Limitations include a 4k-token output cap, real-time censorship, and data storage in China.
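As a quick sanity check, the savings percentage and the practical effect of the throughput figure follow directly from the numbers reported above (the $4.50 comparator is the proprietary-model price quoted in the abstract):

```python
# Prices per million tokens as reported in the abstract.
price_deepseek = 0.14      # USD per million tokens
price_proprietary = 4.50   # USD per million tokens

savings = 1 - price_deepseek / price_proprietary
print(f"cost savings: {savings:.1%}")   # ~96.9%, matching the reported ~96%

# At 42 tokens/s, streaming a maximal 4k-token reply takes:
seconds_for_cap = 4_000 / 42
print(f"time to emit 4k tokens: {seconds_for_cap:.0f} s")
```

This also shows why the 4k-token output cap and throughput interact: even at the faster rate, a maximal response takes over a minute and a half to stream.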
Conclusion: DeepSeek-R1 is a reliable, cost-effective alternative to proprietary LLMs, excelling in technical and medical reasoning tasks. Its open-source nature enhances accessibility, but censorship and privacy concerns necessitate careful adoption. Comparative analyses guide its use in academic and clinical settings, emphasizing the need for ethical oversight.
Article Type: Letter to the Editor | Subject: Education Management




Rights and permissions
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.