Singhal, K. et al. Large language models encode clinical knowledge. Nature 620, 172–180 (2023).
Gu, Y. et al. Domain-specific language model pretraining for biomedical natural language processing. In ACM Transactions on Computing for Healthcare (HEALTH) (eds Lee, I. & Stankovic, J. A.) 3, 1–23 (Association for Computing Machinery, 2022).
Nori, H. et al. Sequential diagnosis with language models. Preprint at https://arxiv.org/abs/2506.22405 (2025).
OpenAI. Introducing GPT-5. https://openai.com/index/introducing-gpt-5/ (2025).
Saab, K. et al. Capabilities of Gemini models in medicine. Preprint at https://arxiv.org/abs/2404.18416 (2024).
Tu, T. et al. Towards conversational diagnostic AI. Preprint at https://arxiv.org/abs/2401.05654 (2024).
Wang, S. et al. LINS: a general medical Q&A framework for enhancing the quality and credibility of LLM-generated responses. Nat. Commun. 16, 9076 (2025).
Arora, R. K. et al. HealthBench: evaluating large language models towards improved human health. Preprint at https://arxiv.org/abs/2505.08775 (2025).
Handler, R., Sharma, S. & Hernandez-Boussard, T. The fragile intelligence of GPT-5 in medicine. Nat. Med. 31, 3968–3970 (2025).
Farquhar, S., Kossen, J., Kuhn, L. & Gal, Y. Detecting hallucinations in large language models using semantic entropy. Nature 630, 625–630 (2024).
Jin, Q. et al. Hidden flaws behind expert-level accuracy of multimodal GPT-4 vision in medicine. NPJ Digit. Med. 7, 190 (2024).
Pfau, J., Merrill, W. & Bowman, S. R. Let’s think dot by dot: hidden computation in transformer language models. In First Conference on Language Modeling (COLM) https://openreview.net/forum?id=NikbrdtYvG (2024).
Geirhos, R. et al. Shortcut learning in deep neural networks. Nat. Mach. Intell. 2, 665–673 (2020).
Acosta, J. N., Falcone, G. J., Rajpurkar, P. & Topol, E. J. Multimodal biomedical AI. Nat. Med. 28, 1773–1784 (2022).
Goodfellow, I. J., Shlens, J. & Szegedy, C. Explaining and harnessing adversarial examples. Preprint at https://arxiv.org/abs/1412.6572 (2015).
Szegedy, C. et al. Intriguing properties of neural networks. Preprint at https://arxiv.org/abs/1312.6199 (2013).
The New England Journal of Medicine: Image Challenge. https://www.nejm.org/image-challenge (2026).
JAMA Network Clinical Challenge. https://jamanetwork.com/collections/44038/clinical-challenge (2026).
Comanici, G. et al. Gemini 2.5: pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. Preprint at https://arxiv.org/abs/2507.06261 (2025).
Anthropic. Claude 3.5 Sonnet. https://www.anthropic.com/news/claude-3-5-sonnet (2024).
OpenAI. GPT-4o system card. https://openai.com/index/gpt-4o-system-card/ (2024).
OpenAI. OpenAI o3 and o4-mini system card. https://openai.com/index/o3-o4-mini-system-card/ (2025).
Wei, J. et al. Chain-of-thought prompting elicits reasoning in large language models. In NIPS ʼ22: Proceedings of the 36th International Conference on Neural Information Processing Systems (eds Koyejo, S. et al.) 24824–24837 (Curran Associates, 2022).
Lau, J. J., Gayen, S., Ben Abacha, A. & Demner-Fushman, D. A dataset of clinically generated visual questions and answers about radiology images. Sci. Data 5, 180251 (2018).
Hu, Y. et al. OmniMedVQA: a new large-scale comprehensive evaluation benchmark for medical LVLM. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) https://doi.org/10.1109/CVPR52733.2024.02093 (IEEE, 2024).
Johnson, A. E. et al. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Sci. Data 6, 317 (2019).
He, X., Zhang, Y., Mou, L., Xing, E. & Xie, P. PathVQA: 30000+ questions for medical visual question answering. Preprint at https://arxiv.org/abs/2003.10286 (2020).
Liu, B. et al. SLAKE: a semantically-labeled knowledge-enhanced dataset for medical visual question answering. Preprint at https://arxiv.org/abs/2102.09542 (2021).
Zhang, X. et al. PMC-VQA: visual instruction tuning for medical visual question answering. Preprint at https://arxiv.org/abs/2305.10415 (2023).
Yue, X. et al. MMMU: a massive multidiscipline multimodal understanding and reasoning benchmark for expert AGI. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) https://doi.org/10.1109/CVPR52733.2024.00913 (IEEE, 2024).
Fleiss, J. L. Measuring nominal scale agreement among many raters. Psychol. Bull. 76, 378–382 (1971).
Wu, Z. et al. DeepSeek-VL2: mixture-of-experts vision-language models for advanced multimodal understanding. Preprint at https://arxiv.org/abs/2412.10302 (2024).
Bai, S. et al. Qwen3-VL technical report. Preprint at https://arxiv.org/abs/2511.21631 (2025).
Li, C. et al. LLaVA-Med: training a large language-and-vision assistant for biomedicine in one day. In NIPS ʼ23: Proceedings of the 37th International Conference on Neural Information Processing Systems (eds Oh, A. et al.) 28541–28564 (Curran Associates, 2023).
Sellergren, A. et al. MedGemma technical report. Preprint at https://arxiv.org/abs/2507.05201 (2025).
