

1- Student Research Committee, Golestan University of Medical Sciences, Gorgan, Iran
2- Gorgan Congenital Malformations Research Center, Jorjani Clinical Sciences Research Institute, Golestan University of Medical Sciences, Gorgan, Iran; Department of Medical Genetics, School of Advanced Technologies in Medicine, Golestan University of Medical Sciences, Gorgan, Iran, oladnabidozin@yahoo.com
Abstract:
Background: Large language models (LLMs) like Claude-3.5-Sonnet and GPT-4o are widely used in research and education but are limited by high costs and proprietary restrictions. DeepSeek-R1, an open-source LLM developed by DeepSeek-AI, leverages a Mixture-of-Experts (MoE) architecture and multi-stage training to offer a cost-effective alternative. This study evaluates DeepSeek-R1’s reliability for academic and clinical applications compared to Claude-3.5-Sonnet and GPT-4o, focusing on performance, cost efficiency, and limitations such as censorship and data privacy.
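The Mixture-of-Experts routing mentioned above can be sketched in a few lines. This is an illustrative top-k gate over random linear "experts", not DeepSeek-R1's actual implementation; the expert count, gate projection, and dimensions here are made up for demonstration:

```python
import numpy as np

def moe_forward(x, experts, gate_w, k=2):
    """Route input x to the top-k experts by gating score and
    combine their outputs, weighted by a softmax over those scores."""
    scores = x @ gate_w                      # one gating score per expert
    topk = np.argsort(scores)[-k:]           # indices of the k best experts
    w = np.exp(scores[topk] - scores[topk].max())
    w /= w.sum()                             # softmax over the selected experts
    return sum(wi * experts[i](x) for wi, i in zip(w, topk))

rng = np.random.default_rng(0)
dim, n_experts = 8, 4
# Each "expert" is a small linear map; the gate is a learned projection.
mats = [rng.normal(size=(dim, dim)) for _ in range(n_experts)]
experts = [(lambda M: (lambda x: x @ M))(M) for M in mats]
gate_w = rng.normal(size=(dim, n_experts))
y = moe_forward(rng.normal(size=dim), experts, gate_w, k=2)
print(y.shape)  # (8,)
```

The efficiency argument is that only k of the n experts run per token, so compute per token stays roughly constant as total parameters grow.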
Methods: A mixed-methods approach was employed, including benchmark evaluations across MATH-500 (mathematics), HumanEval (programming), MMLU (general knowledge), and MedQA (medical reasoning). A prospective user study with 112 Iranian medical researchers assessed diagnostic accuracy on 50 standardized medical cases across specialties (internal medicine, pediatrics, psychiatry). Performance was measured as mean accuracy ± SD, with paired t-tests (p<0.05) and ANOVA for comparisons. Confidence scores were analyzed using calibration curves (Pearson r). Cost, latency, and limitations (e.g., censorship, data storage) were evaluated using model documentation and reports.
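The paired t-test and calibration analysis above can be sketched with standard-library code. The per-case scores below are made-up numbers for illustration, not the study's data:

```python
import math
import statistics as st

def paired_t(a, b):
    """Paired t statistic and degrees of freedom for two
    equal-length lists of per-case scores (same cases, two models)."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    t = st.mean(d) / (st.stdev(d) / math.sqrt(n))
    return t, n - 1

def pearson_r(x, y):
    """Pearson correlation, e.g. between stated confidence and accuracy."""
    mx, my = st.mean(x), st.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x)
                   * sum((b - my) ** 2 for b in y))
    return num / den

# Illustrative per-case accuracies for two models on the same cases:
model_a = [0.82, 0.75, 0.80, 0.78, 0.84]
model_b = [0.80, 0.76, 0.77, 0.79, 0.81]
t, df = paired_t(model_a, model_b)
print(f"t={t:.2f}, df={df}")
```

The pairing matters: because both models are scored on the same 50 cases, differencing removes per-case difficulty variance before testing.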
Results: DeepSeek-R1 achieved 97.3% ± 1.2 on MATH-500 and 96.3% ± 1.5 on HumanEval, outperforming Claude-3.5-Sonnet (95.1% ± 1.4, 94.2% ± 1.7) and GPT-4o (96.0% ± 1.3, 95.5% ± 1.6). MMLU and MedQA accuracies were comparable across models (90.8% ± 2.0 and 85.0% ± 3.2, respectively). In the user study, DeepSeek-R1’s diagnostic accuracy (79.2% ± 4.0) matched Claude-3.5-Sonnet (78.5% ± 4.2, p=0.42) and GPT-4o (77.8% ± 4.1, p=0.51), with strong performance in internal medicine (83% ± 4.5) and pediatrics (81% ± 5.0). DeepSeek-R1 offered 96% cost savings ($0.14 vs. $4.50 per million tokens) and higher throughput (42 tokens/s). Limitations include a 4k-token output cap, real-time censorship, and data storage in China.
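As a quick sanity check, the savings percentage and the practical effect of the throughput figure follow directly from the numbers reported above (the $4.50 comparator is the proprietary-model price quoted in the abstract):

```python
# Prices per million tokens as reported in the abstract.
price_deepseek = 0.14      # USD per million tokens
price_proprietary = 4.50   # USD per million tokens

savings = 1 - price_deepseek / price_proprietary
print(f"cost savings: {savings:.1%}")   # ~96.9%, matching the reported ~96%

# At 42 tokens/s, streaming a maximal 4k-token reply takes:
seconds_for_cap = 4_000 / 42
print(f"time to emit 4k tokens: {seconds_for_cap:.0f} s")
```

This also shows why the 4k-token output cap and throughput interact: even at the faster rate, a maximal response takes over a minute and a half to stream.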
Conclusion: DeepSeek-R1 is a reliable, cost-effective alternative to proprietary LLMs, excelling in technical and medical reasoning tasks. Its open-source nature enhances accessibility, but censorship and privacy concerns necessitate careful adoption. Comparative analyses guide its use in academic and clinical settings, emphasizing the need for ethical oversight.
Article Type: Letter to the Editor | Subject: Education Management




Rights and permissions
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.