Objective: This study aimed to evaluate the diagnostic performance of three flagship models from three different companies, Chat Generative Pre-trained Transformer-4 Omni (ChatGPT-4o), Claude 3.5 Sonnet, and Gemini 2.0 Flash, on image-based questions in ocular oncology and pathology, and to investigate potential differences between these models and their clinical utility. Materials and Methods: Fifty multiple-choice, image-based questions were randomly selected from the 312 ocular oncology and pathology questions in the OphthoQuestions (www.ophthoquestions.com) database. Each model's answers were compared with the answer key and recorded as correct or incorrect. ChatGPT-4o, Claude 3.5 Sonnet, and Gemini 2.0 Flash, large language models (LLMs) capable of processing images, were included in the study. Cochran's Q test was applied to compare the performance of the three LLMs, and McNemar's test was used for pairwise comparisons. Results: There was a statistically significant difference among the three LLMs (p=0.001, Cochran's Q test). Claude 3.5 Sonnet showed the highest accuracy, correctly answering 84% of the questions, followed by ChatGPT-4o with 80% and Gemini 2.0 Flash with 62%. In the pairwise comparisons, Claude 3.5 Sonnet and ChatGPT-4o were statistically superior to Gemini 2.0 Flash (p=0.002 and p=0.004, respectively). There was no significant difference between Claude 3.5 Sonnet and ChatGPT-4o (p=0.727, McNemar's test). Conclusion: Our results indicate that Claude 3.5 Sonnet and ChatGPT-4o outperform Gemini 2.0 Flash in diagnostic accuracy for ocular oncology and pathology. While LLMs show promise in this field, they require evaluation on larger datasets, and their accuracy must improve before they can be implemented clinically.
Keywords: ChatGPT-4o; Claude-3.5 Sonnet; Gemini 2.0 Flash; ocular oncology; image processing
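The study's analysis compares three related binary samples (correct/incorrect per question, per model) with Cochran's Q, then uses McNemar's test for pairwise follow-up. A minimal dependency-free sketch of both statistics is shown below; the per-question scores and discordant-pair counts are invented for illustration, not the study's data.

```python
# Sketch of the study's statistical comparison. The 0/1 scoring per
# question matches the paper's design; the example data are hypothetical.

def cochrans_q(rows):
    """Cochran's Q statistic for k related binary samples.

    rows: one tuple per question, each with k entries
    (1 = model answered correctly, 0 = incorrect).
    Returns (Q, degrees of freedom).
    """
    k = len(rows[0])
    col_totals = [sum(r[j] for r in rows) for j in range(k)]
    row_totals = [sum(r) for r in rows]
    n = sum(row_totals)
    numerator = k * (k - 1) * sum((g - n / k) ** 2 for g in col_totals)
    denominator = k * n - sum(t * t for t in row_totals)
    return numerator / denominator, k - 1

def mcnemar_chi2(b, c):
    """Continuity-corrected McNemar chi-square from the two discordant
    cells: b = model A correct / model B wrong, c = the reverse."""
    return (abs(b - c) - 1) ** 2 / (b + c)

# Hypothetical per-question scores for (Claude, ChatGPT-4o, Gemini):
scores = [(1, 1, 0), (1, 1, 0), (1, 0, 0), (1, 1, 1)]
q, df = cochrans_q(scores)
# Q is compared against the chi-square critical value for df = 2
# (5.991 at alpha = 0.05) to decide overall significance; pairwise
# McNemar tests then localize which models differ.
```

In practice a statistics package (e.g. `statsmodels.stats.contingency_tables`) would also report exact p-values; the hand-rolled functions above only expose the test statistics.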