Specialized clinical artificial intelligence (AI) tools are entering medical practice at scale1,2. These proprietary large language model (LLM)-based tools promise superior clinical performance to general-purpose frontier LLMs as a result of domain-specific training or retrieval-augmented generation (RAG)3. Yet, their architectures, base models and training pipelines are not public. Clinicians and health systems must therefore assess their value and safety without independent evidence. Conversely, large training corpora and extensive alignment of frontier LLMs may enable them to challenge clinical AI tools without domain-specific modification. We test this hypothesis by comparing clinical AI tools (OpenEvidence1 and UpToDate Expert AI2) to leading general-purpose LLMs (OpenAI GPT-5.2, Google Gemini 3.1 Pro Preview and Anthropic Claude Opus 4.6). Later, we include auto-enabled Google Search AI Overview as a real-world control frequently encountered by physicians.
Our evaluation (Fig. 1) has three stages: (1) 500 US Medical Licensing Examination-style MedQA4 questions assessing medical knowledge, (2) 500 HealthBench5 items evaluating agreement with expert clinicians and (3) 100 real clinical queries (RCQ) drawn from physician LLM queries during live clinical deployment. The RCQ stage underwent randomized, blinded review by 12 US clinicians, producing 1,800 model–question annotations. The combined analysis spans multiple-choice reasoning, expert clinical judgment and everyday clinician use.
Schematic overview of the comparative analysis between frontier, clinical-specific and search-embedded AI models. The framework integrates automated scoring (MedQA and HealthBench) with high-fidelity blinded and randomized clinician reviews (RCQ) to assess model performance across accuracy, safety and reliability metrics. N, no; USMLE, US Medical Licensing Examination; Y, yes. Blindfold/blinded icon from Tailwind Labs under an MIT license (©Tailwind Labs); other icons from React Icons under a CC BY 4.0 (brain, stethoscope, book, bar chart, doctor, justice scale, chat, shield) or Apache 2.0 (browser, API badge).
General-purpose LLMs outperformed clinical AI tools on the MedQA questions (Fig. 2a and Extended Data Fig. 1a,b). Among frontier LLMs, Gemini achieved the highest accuracy at 97.4% (95% confidence interval (CI) 95.6%–98.5%), followed by GPT at 94.2% (91.8%–95.9%) and Claude at 90.2% (87.3%–92.5%). Clinical tools scored lower, with OpenEvidence achieving an accuracy of 89.6% (86.6%–92.0%) and UpToDate achieving 88.4% (85.3%–90.9%). Gemini outperformed all other models (McNemar P < 1 × 10−4 versus OpenEvidence, UpToDate and Claude; P = 0.02 versus GPT). GPT outperformed OpenEvidence (P = 0.008), UpToDate (P = 0.0004) and Claude (P = 0.04).
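The MedQA intervals above can be recomputed from the reported accuracies. The following is a minimal Python sketch, not the authors' analysis code. It assumes the correct-answer counts implied by the reported percentages (for example, 97.4% of 500 is 487), and it uses the Wilson score interval that the Fig. 2 caption names for panel a. The function name `wilson_interval` and the dictionary names are illustrative.

```python
import math

def wilson_interval(successes, n, z=1.959964):
    """Two-sided Wilson score interval for a binomial proportion (z=1.96 -> 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width

N_QUESTIONS = 500

# Assumption: counts are inferred from the reported accuracies (accuracy x 500).
correct_counts = {
    "Gemini": 487,        # 97.4%
    "GPT": 471,           # 94.2%
    "Claude": 451,        # 90.2%
    "OpenEvidence": 448,  # 89.6%
    "UpToDate": 442,      # 88.4%
}

# Intervals as reported in the text, in percent.
reported_ci = {
    "Gemini": (95.6, 98.5),
    "GPT": (91.8, 95.9),
    "Claude": (87.3, 92.5),
    "OpenEvidence": (86.6, 92.0),
    "UpToDate": (85.3, 90.9),
}

computed_ci = {}
for model, k in correct_counts.items():
    lo, hi = wilson_interval(k, N_QUESTIONS)
    computed_ci[model] = (100 * lo, 100 * hi)
    print(f"{model}: {100 * k / N_QUESTIONS:.1f}% "
          f"(computed {100 * lo:.1f}-{100 * hi:.1f}, reported {reported_ci[model]})")
```

Under these assumptions the computed bounds agree with the reported intervals to one decimal place.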
a, MedQA accuracy (n = 500 questions per model). b, HealthBench score (n = 500 questions per model). c, RCQ mean aggregate clinician rating (1–4-point scale). Each of n = 100 clinical questions was answered by all 6 models and independently rated by 3 of 12 clinicians; 32 question–model pairs identified as refusals were excluded, yielding n = 98 (Gemini), 97 (GPT-5.2), 99 (Claude), 99 (OpenEvidence), 81 (UpToDate) and 94 (Google AI) non-refusal items per model. d, RCQ scores disaggregated by evaluation dimension; sample sizes as in c. e, Refusal rate (n = 100 questions per model). f,g, Proportion flagged for harmful content (f) or hallucination by majority vote (≥2 of 3 raters) (g); sample sizes as in c. The unit of observation is the individual question (a,b,e) or the question–model–rater evaluation (c,d,f,g); each question represents an independent observation and individual raters serve as independent evaluators (biological replicates). No technical replicates were used. No control group was designated, as the study compares all models against one another. In a–d, data are presented as mean ± 95% CI (Wilson score interval in a; mean ± 1.96 × s.e.m. in b–d). Italic letters denote compact letter display (CLD) significance groups. Models sharing a letter do not differ significantly (adjusted P > 0.05). CLD groups were derived from two-sided pairwise McNemar tests with Holm–Bonferroni correction in a, two-sided Wilcoxon signed-rank tests with Holm–Bonferroni correction in b and two-sided Nemenyi post hoc tests following Friedman test (χ2 = 93.65, d.f. = 5, P = 1.15 × 10−18) in c. Higher values indicate better performance in a–d; lower values indicate better performance in e–g.
HealthBench (Fig. 2b) was graded by a panel of LLM judges to mitigate single-model bias. Scores reflect the proportion of rubric points achieved, scaled 0–100. GPT scored highest at 88.0 (95% CI 85.9–90.1), followed by Gemini at 79.3 (76.6–81.9) and Claude at 77.0 (74.2–79.9); both clinical tools scored lower (OpenEvidence scoring 62.6 (59.3–65.9) and UpToDate scoring 61.3 (58.0–64.6)). GPT outperformed all other models (Wilcoxon P < 10−9), and the two clinical tools did not differ (P = 0.6). In theme-level analysis (Extended Data Fig. 1c,d), GPT ranked first or tied for first in all seven categories, while OpenEvidence and UpToDate ranked lowest or tied for lowest in all seven categories, with differences from GPT significant in six of the categories (P ≤ 0.004; exception: responding under uncertainty, P = 1.00).
To develop the RCQ benchmark, we sampled 100 anonymous clinician queries to the NYU Langone Health Insurance Portability and Accountability Act-compliant GPT instance. Six models answered each query, and each response was randomly assigned to three of 12 blinded clinicians, who scored it on four dimensions (clinical correctness, completeness, safety/harm avoidance and clarity) using a 1–4-point scale (Extended Data Fig. 2). We included Google Search AI Overview in the RCQ evaluation because it is routinely encountered by clinicians. After excluding 32 refusals, 568 of the 600 responses remained.
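The RCQ sample sizes can be reconciled with simple bookkeeping. The sketch below is illustrative Python, not the authors' code; it takes the per-model non-refusal counts from the Fig. 2 caption and derives the totals quoted in the text. All variable names are ours.

```python
QUESTIONS = 100
RATERS_PER_RESPONSE = 3

# Non-refusal items per model, as listed in the Fig. 2 caption.
non_refusals = {
    "Gemini": 98,
    "GPT-5.2": 97,
    "Claude": 99,
    "OpenEvidence": 99,
    "UpToDate": 81,
    "Google AI": 94,
}

total_responses = QUESTIONS * len(non_refusals)            # 6 models x 100 queries
total_annotations = total_responses * RATERS_PER_RESPONSE  # 3 raters per response

# Refusals per model are the questions that did not yield a usable response.
refusals = {model: QUESTIONS - kept for model, kept in non_refusals.items()}
retained = sum(non_refusals.values())

print("responses:", total_responses, "annotations:", total_annotations)
print("refusals excluded:", sum(refusals.values()), "retained:", retained)
for model, count in refusals.items():
    print(f"{model}: refusal rate {100 * count / QUESTIONS:.0f}%")
```

The derived figures match the text: 600 responses, 1,800 annotations, 32 refusals and 568 retained responses, with refusal rates of 19% for UpToDate and 6% for Google AI Overview.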
The six models differed significantly (Friedman P < 10−9), with two performance tiers emerging (Fig. 2c). Frontier LLMs formed the first: Gemini (mean aggregate 3.62; 95% CI 3.56–3.68), GPT (3.54; 3.47–3.61) and Claude (3.52; 3.44–3.59), with no significant differences between them. Clinical tools and Google AI Overview followed: OpenEvidence (3.24; 3.17–3.32), UpToDate AI (3.17; 3.09–3.25) and Google AI Overview (3.27; 3.18–3.35), also without significant differences. All nine significant pairwise comparisons were between tiers (rank-biserial r = 0.5–0.9), meaning frontier models outperformed clinical tools on most individual questions, not just on average. After adjusting for rater leniency, clinical AI tools (including Google AI) had 49–87% lower odds of receiving a higher rating than Gemini (odds ratio 0.13–0.51; all P < 0.0001). In a sensitivity analysis using a linear mixed model, this corresponded to 0.36–0.44 points lower on the 1–4-point scale (all P < 0.0001). Google AI Overview scored as well as or better than OpenEvidence and UpToDate AI across all dimensions (Extended Data Fig. 3).
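The "49–87% lower odds" statement follows directly from the reported odds ratio range, since an odds ratio below 1 corresponds to a (1 − OR) × 100% reduction in odds. A short Python illustration of that arithmetic (the function name is ours):

```python
def percent_lower_odds(odds_ratio):
    """Percent reduction in odds implied by an odds ratio below 1."""
    return (1 - odds_ratio) * 100

# Odds ratio range reported in the text, relative to Gemini.
reported_or_range = (0.13, 0.51)

largest_reduction = percent_lower_odds(reported_or_range[0])   # from OR = 0.13
smallest_reduction = percent_lower_odds(reported_or_range[1])  # from OR = 0.51

print(f"{smallest_reduction:.0f}% to {largest_reduction:.0f}% lower odds")
```

Note that this is a reduction in odds, not in probability; the two diverge when the baseline probability of a higher rating is not small.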
The tier structure held across all four dimensions (Fig. 2d). Models differed most on clarity (Kendall’s W = 0.292) and least on clinical correctness (W = 0.141). OpenEvidence scored lowest on clarity (mean 2.84), suggesting its weakness was communication, not knowledge. Qualitatively, incomplete clinical content, safety-critical omissions and disorganized responses were common, particularly for OpenEvidence and Google AI Overview (Extended Data Table 1). UpToDate AI refused 19% of queries (Fig. 2e), more than all other models (1–3%; P < 0.01) except Google AI Overview (6%; P = 0.10). Safety outcomes (Fig. 2f,g) did not differ across models: none of the models produced more harmful content (Cochran’s Q = 4.00, P = 0.55) or hallucinations (Q = 5.00, P = 0.42) than any of the others. All 12 clinicians ranked the models similarly (Kendall’s W = 0.651, P = 2.3 × 10−7), placing frontier LLMs above clinical tools (Extended Data Fig. 4).
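Two of the reported P values can be cross-checked from their test statistics. The sketch below is not the authors' code. It assumes that the P value for inter-rater agreement came from the usual chi-square approximation for Kendall's W, χ² = m(n − 1)W with n − 1 degrees of freedom, taking m = 12 raters and n = 6 models; the source does not state this. It also checks the Friedman statistic from the Fig. 2 caption (χ² = 93.65, d.f. = 5). `chi2_sf` is an illustrative helper written with the standard library only.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df.

    Uses the recurrence Q(a + 1, t) = Q(a, t) + t**a * exp(-t) / gamma(a + 1)
    for the regularized upper incomplete gamma function, with t = x / 2.
    """
    t = x / 2
    if df % 2 == 1:
        q, a = math.erfc(math.sqrt(t)), 0.5   # Q(1/2, t)
    else:
        q, a = math.exp(-t), 1.0              # Q(1, t)
    while a < df / 2:
        q += t**a * math.exp(-t) / math.gamma(a + 1)
        a += 1
    return q

# Assumption: chi-square approximation for Kendall's W (not stated in the source).
n_raters, n_models, kendall_w = 12, 6, 0.651
chi2_from_w = n_raters * (n_models - 1) * kendall_w
p_kendall = chi2_sf(chi2_from_w, n_models - 1)

# Friedman statistic as reported in the Fig. 2 caption.
p_friedman = chi2_sf(93.65, 5)

print(f"Kendall: chi2 = {chi2_from_w:.2f}, P = {p_kendall:.2e}")
print(f"Friedman: P = {p_friedman:.2e}")
```

Under these assumptions the results agree with the reported values (P = 2.3 × 10−7 for Kendall's W and P = 1.15 × 10−18 for the Friedman test), which suggests, but does not confirm, that the same approximation was used.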
This study is an independent, quantitative comparison of clinical AI tools against frontier LLMs using real-world physician queries from the course of care. Clinical AI tools lagged behind frontier models on every evaluation: knowledge, expert alignment and real-world clinical use across multiple dimensions. Google AI Overview, an auto-enabled search feature, matched clinical AI tools in this benchmark.
Because the architectures of proprietary clinical AI tools are inaccessible, we cannot definitively establish a mechanistic explanation for their underperformance against general-purpose models. Evidence shows that RAG, which is likely employed by both OpenEvidence1 and UpToDate Expert AI2, may negatively affect model performance when irrelevant material is retrieved or poorly integrated by the base model6,7,8. Frontier LLMs may simply be better at the knowledge retrieval and reasoning that characterize most medical questions9. They also benefit from faster iteration cycles, larger training corpora and greater alignment than specialist systems. The observed advantages of frontier general-purpose models may reflect the accelerated development of, and investment in, these systems. Should scaling returns diminish, the relative value of domain-specific tuning, curated retrieval and clinician-in-the-loop optimization may increase. Our results should therefore be interpreted as a snapshot of a rapidly evolving landscape rather than a permanent ordering of approaches. In particular, deeply subspecialized medical tasks may favor more sophisticated, domain-specific adaptation9,10,11.
This study has several limitations. Clinical tools lack public application programming interfaces (APIs), so they were queried through browser interfaces, which limited sample size and may have introduced differences in hidden prompts, retrieval behavior and output formatting. Standardized benchmarks have known issues such as data leakage7; models may have been exposed to MedQA or HealthBench during training, though our RCQ benchmark is free from this contamination. HealthBench is an OpenAI-developed benchmark that relies on a small number of physicians for each rubric, and public documentation provides limited detail on its construction and evaluation5. Evaluation of OpenAI models, including the highest-scoring model on HealthBench, GPT-5.2, may be influenced by potential benchmark–developer overlap, including potential similarities in training data, optimization objectives or rubric design. Grading bias is also possible, as frontier models served as both evaluated systems and judges, although we used a multimodel panel to mitigate this effect. Accordingly, we view the blinded clinician evaluation on the RCQ benchmark as the primary evidence in this study, while HealthBench should be interpreted as supplementary.
More broadly, industry-created benchmarks may systematically favor the systems developed by their creators, reinforcing the need for independently constructed evaluation instruments. The RCQ benchmark partially addresses this concern: it is derived from real clinical queries, evaluated by blinded clinicians and free from training-set contamination. Additionally, recently proposed safety-focused evaluations of LLM medical recommendations such as the NOHARM12 framework suggest that knowledge and communication benchmarks may not fully capture clinical risk. Related work also points to health-system-grounded evaluation frameworks, such as institution-specific operational tasks and prediction settings embedded in local clinical workflows, as an important complement to public, industry-authored benchmarks, because they may better capture whether a model is clinically useful in a given care environment13,14.
Finally, our evaluation did not assess response latency or citation quality. These factors are important for real-world clinical deployment and workflow integration, and may differ substantially between API-accessed frontier models and subscription-based clinical tools (Extended Data Table 2). Future work should systematically compare these practical dimensions alongside accuracy and safety.
Clinical AI tools may carry institutional legitimacy and are likely safe for routine use, but our results show that they are not superior to frontier models on knowledge, communication or clinical alignment. The superior performance of frontier models in our study suggests that scale, alignment and cross-domain reasoning may outweigh domain-specific tuning as determinants of medical competency for particular tasks, a finding with implications for procurement, reimbursement and regulatory oversight. The path forward may ultimately lie with hospital-specific LLMs that leverage institutional data13,14 to mitigate external harm15, along with careful use of frontier models for less-sensitive tasks16. As generative LLMs become integrated into healthcare at the enterprise, individual clinician and consumer levels, there is an increasing need for rigorous, independent evaluation on real-world tasks.
