In this study, we probe the ability of LLMs to provide additive support to generalists in the assessment of rare, life-threatening cardiac diseases that typically require subspecialty cardiac care. Further, we address the unmet need for randomized evaluation of LLMs in challenging medical applications. To this end, we curate an open-source, de-identified, real-world clinical dataset of patients suspected to have inherited cardiomyopathies and propose an evaluation rubric for the quality of diagnosis, triage and clinical management of such patients. Blinded subspecialists employed this rubric to evaluate clinical assessments performed by general cardiologists, both with and without LLM assistance, and demonstrated an overall preference for LLM-assisted clinical assessments. Specifically, the subspecialty cardiologists found that AMIE-assisted clinical assessments contained fewer clinically significant errors (11.2% reduction) and omitted important content less often (19.6% reduction) while maintaining equivalent clinical reasoning quality and not introducing erroneous extraneous information. Furthermore, general cardiologists who utilized AMIE reported that the system helped their assessments in more than half of cases (57.0%), did not miss clinically significant findings in 93.5% of cases and reduced assessment time in over half of cases (50.5%).
Our results demonstrate the feasibility of using LLMs to assess patients with rare and life-threatening cardiac conditions. Adapting AMIE to this subspecialist and rarefied domain was highly data-efficient, leveraging iterative feedback from subspecialist experts to enhance the quality of AMIE’s responses using just nine cases. This iterative process, combined with a self-critique step and the incorporation of web search functionality, enabled AMIE to help general cardiologists upskill their clinical assessments to a level preferred by subspecialists. This contrasts with earlier studies using generic, nonspecialized LLMs, which did not achieve comparable clinical performance14.
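Purely as an illustrative sketch, and not as AMIE’s actual implementation (which is not specified here), the self-critique and search steps referred to above can be pictured as a generic inference-time loop; the `generate` and `web_search` helpers, the prompts and the number of critique rounds below are all assumptions introduced for illustration.

```python
# Hypothetical sketch of a self-critique-with-search inference chain.
# `generate` and `web_search` are placeholder callables supplied by the caller;
# they are not AMIE's real interfaces.
from typing import Callable, List


def assess_case(
    case_text: str,
    generate: Callable[[str], str],          # wraps an LLM call (assumed)
    web_search: Callable[[str], List[str]],  # wraps a search API (assumed)
    n_critique_rounds: int = 2,
) -> str:
    """Draft a clinical assessment, then iteratively self-critique and revise,
    grounding each revision in retrieved reference material."""
    draft = generate(
        "Provide a diagnosis, triage level and management plan for this case:\n"
        + case_text
    )
    for _ in range(n_critique_rounds):
        # 1. Ask the model to critique its own draft and list open questions.
        critique = generate(
            "Critique the assessment below for errors, omissions and unsupported "
            f"claims, listing questions that need external evidence.\n\n"
            f"Case:\n{case_text}\n\nAssessment:\n{draft}"
        )
        # 2. Retrieve supporting material (e.g. guideline text) for those questions.
        evidence = "\n".join(web_search(critique))
        # 3. Revise the draft using the critique and the retrieved evidence.
        draft = generate(
            "Revise the assessment, addressing the critique and citing the evidence "
            f"where relevant.\n\nCase:\n{case_text}\n\nAssessment:\n{draft}\n\n"
            f"Critique:\n{critique}\n\nEvidence:\n{evidence}"
        )
    return draft
```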
Our RCT results indicate that LLMs can assist general cardiologists in diagnosing and managing complex cardiac patients. Our evidence suggests that LLMs could help bridge unmet needs in genetic cardiovascular disease and possibly in cardiac care more broadly. While further research could extend our approach to a broader group of specialties, cardiology is an instructive example because it features (a) highly preventable morbidity and mortality, (b) a reliance on an array of clinical investigations (for example, TTEs, ECGs, ambulatory Holter monitors, CMRs, CPXs and, when appropriate, genetic testing) and (c) a substantial deficit in the cardiology workforce. Our findings are particularly noteworthy because access to subspecialist care is a global challenge. The American College of Cardiology has identified a “cardiology workforce crisis,” with lack of access to subspecialty cardiologists being an acute concern6. In the USA, despite five HCM centers of excellence in each of California and New York, 27 states have none4. This has led to more than 60% of patients with HCM in the USA being undiagnosed, with estimates higher globally5. The propensity of inherited cardiomyopathies to cause sudden cardiac death (they are the leading cause of sudden cardiac death in young adults3) exacerbates the problem. Lack of access to appropriate care and long wait times can lead to preventable, premature mortality. LLMs may help identify undiagnosed cases, assist with the triage and prioritization of urgent cases, and streamline management. In this way, LLMs could improve access to specialist care by assisting generalists.
For researchers, our results have a number of implications. First, we have made our data openly available, facilitating rigorous scrutiny of our results and providing a benchmark against which other models can be tested. We have also created and validated a 10-domain evaluation rubric that may be used in future studies. More broadly, this study demonstrates the feasibility of conducting RCTs to evaluate LLMs, establishing a gold-standard evidence framework that should guide future research in this domain. Currently, LLMs are being used in many US health systems via their implementation in electronic medical records software13. This implementation has occurred without a comparable scale of scientific evaluation; the benefits and possible harms are only partially known16. Our results represent a major step toward demonstrating the real-world utility of LLMs in subspecialty care.
The implications for clinicians are twofold. First, our results show a clear, albeit modest, improvement in overall clinical assessment quality. The significant reductions in errors and erroneous extra content, together with the significant improvement in management plan quality, give insight into how LLMs can assist clinically. The general cardiologists demonstrated high precision in diagnostic accuracy and triage decisions; however, their unassisted management of complex patients contained more omission errors than the AMIE-assisted assessments. These clinical improvements were accompanied by enhanced efficiency and increased clinician confidence. As such, we present RCT-level evidence for LLMs improving clinical care overall, driven specifically by improvements in management and reductions in clinical errors and erroneous extra content, with simultaneous improvements in assessment time and provider confidence.
Second, it seems premature to deploy LLMs autonomously, and our RCT was not designed to address this directly: general cardiologists reported that 6.5% of AMIE’s responses contained clinically significant hallucinations. Reassuringly, we showed that when LLMs are deployed with cardiologist oversight, hallucinations are most often identified, and this combination results in fewer errors overall and more preferred assessments. We qualitatively explored the nature of these hallucinations and found that the general cardiologists often described them as ‘mild’, ranging from assuming the patient’s sex to hallucinating the presence of a CMR feature (for example, “LLM stated left ventricular hypertrabeculation on CMR, but this is not explicitly stated”). Notably, the general cardiologists found that, when asked about a hallucination, AMIE would correct itself.
Our study also addresses a meaningful, wider gap in the earlier literature. Prior research has evaluated LLMs in a number of different settings in medicine, from assessing quality in question-answering (spanning medical license examinations as well as open-ended medical questions) to clinical image interpretation and complex diagnostic challenges11,17,18,19,20,21,22,23,24. There is a paucity of prior RCTs in medicine and cardiology. Despite more than 500 observational LLM papers published in 2024, systematic reviews of LLMs in medicine have consistently shown a lack of RCTs25,26,27. In fact, a 2025 systematic review found no RCTs assessing LLMs in cardiology26, while another concluded that “randomized trials or prospective real-world evaluations are needed to establish the clinical utility and safety of model use” and that “real-world trials addressing this possibility remain sparse”28. Beyond cardiology, some RCTs have explored LLMs. A 2024 JAMA Network Open study29 randomized physicians to use GPT-4 versus conventional resources alone for diagnostic reasoning on just six non-real-world clinical vignettes; it found no significant improvement in diagnostic performance (76% versus 74%, P = 0.60) and no significant difference in time spent per case (519 versus 565 seconds, P = 0.20), although GPT-4 alone outperformed both physician groups by 16 percentage points (P = 0.03). Similarly, a 2025 Nature Medicine RCT30 randomized physicians to use GPT-4 versus conventional resources for management reasoning tasks on five expert-developed clinical vignettes and found significant improvements in overall performance (6.5% improvement, P < 0.001), with management decisions improving by 6.1% and diagnostic decisions by 12.1%, suggesting that LLMs may be more effective for treatment planning than for initial diagnostic reasoning, in line with our results.
Observational studies have investigated the utility and potential of LLMs in real-world clinical tasks such as clinical letter generation31, medical information communication32, medical summarization8,33 and triaging mammograms and chest X-rays for tuberculosis34. Existing research on the performance of LLMs in medical subspecialties, such as cardiology35, ophthalmology36, gastroenterology37, neurology35 and surgery38, is also mostly limited to medical question-answering or examination benchmarking. A recent study39 evaluated the ability of ChatGPT to provide accurate cancer treatment recommendations concordant with authoritative guidelines, using fixed question prompts. Another study investigated the diagnostic and triage accuracy of GPT-3 relative to physicians and laypeople using synthetic case vignettes of both common and severe conditions40. A 2024 study41 compared GPT-4 performance with that of human experts in answering cardiology-specific questions drawn from general users’ web queries. Our study is not only one of the first RCTs of LLMs in subspecialty domains42; it is also, to our knowledge, one of the first to use real-world data and to make these data openly available for LLM evaluation.
The existing literature shows mixed results for LLMs in clinical cases8,14. A recent study showed both the potential of and the safety concerns about using LLMs to provide an on-demand consultation service that assists clinicians’ bedside decision-making based on patient electronic health record data43. A 2024 study assessed the ability of LLMs to diagnose abdominal pathologies and showed that LLMs were inferior to clinicians14, though that study was not an RCT, and the authors noted that their results might be improved with fine-tuned LLMs. Although we did not fine-tune for this particular downstream task, our approach, which used a general-purpose LLM equipped with web search and a multistep reasoning chain at inference time, may help explain our contrasting results.
Our study contains a number of important limitations, and the findings should be interpreted with appropriate caution and humility. First, our LLM system was constrained to reviewing text-based reports of investigations rather than the raw multimodal investigations themselves. This introduces the possibility of upstream errors; we attempted to mitigate it by allowing cardiologists in both the assisted and unassisted groups to review the raw imaging and clinical data themselves, and general cardiologists noted a clinically significant omission in the text reports in fewer than 8% of cases. History and physical examination are indispensable components of real clinical practice, but they were not included in this study. This limits the applicability of our work, and future studies should consider settings in which there is prospective interaction with these patients. However, we did offer cardiologists the ability to interact with our LLM. While our study was conducted on real patient cases, we do not consider LLMs ready for safe deployment, and thus we did not deploy our LLM into live, prospective clinical care. If safety standards are met, future studies should assess the performance of LLMs in live, prospective clinical care. An additional limitation is that cardiologists were not blinded to their intervention assignment, introducing a potential performance bias that may have influenced their subjective reports about LLM usefulness and time savings. However, our primary efficacy outcomes were protected from this bias through blinded subspecialist evaluation, and the subjective measures of user experience represent clinically meaningful assessments of technology acceptability that are relevant for real-world implementation. Further, our study relies on subspecialist preference as the primary outcome measure, which introduces inherent subjectivity despite our expert-developed evaluation rubrics. While this approach aligns with recent RCTs of AI-assisted clinical decision-making44,45, preference-based evaluation cannot definitively establish real-world clinical benefit. Evaluating downstream patient outcomes would require prospective studies with long-term follow-up, which is beyond the current scope.
Further limitations of our work include a biased patient sample: patients were selected from a single US center, using only English-language text. It is unclear how well our results will generalize to non-US settings. Additionally, the subspecialist evaluators were from the same institution where AMIE’s prompt engineering was developed, potentially introducing institutional bias, though this is mitigated by the use of different specialists for development versus evaluation, along with a minimal number of held-out examples (nine cases only). Further, our dataset contained patients who had been referred (correctly or incorrectly) for suspected inherited cardiac disease. A less biased population might be drawn from a general cardiology clinic, where the prevalence of inherited disease is lower and the rate of false-positive referrals is therefore possibly higher. However, this patient selection strategy was intentional and aligned with our research objective of evaluating whether general cardiologists, when supported by an LLM, could appropriately manage cases they would typically refer to subspecialty care. A related limitation is that our patients had already completed a number of cardiac diagnostic tests; to help identify undiagnosed cases, LLMs would have to be studied in populations with less complete cardiac investigations. There was insufficient demographic or regional variation in our single-center population to assess the potential for bias or health inequity, an important topic for AI systems in healthcare. This limitation matters because disparities are well documented in the care of patients with inherited cardiomyopathies46 and should be addressed in prospective studies. A further consideration in the implementation of AMIE is the risk of automation bias, whereby clinicians may rely excessively on AI outputs without sufficient scrutiny, potentially leading to inappropriate or unnecessary tests and management decisions. This bias has implications for patient safety, as it may result in increased healthcare costs, procedural risks and heightened patient anxiety. While AMIE demonstrated potential in enhancing cardiologists’ assessments, its use as a clinical aid requires careful oversight to prevent overreliance; for example, AMIE’s sensitive and detailed suggestions could lead to additional tests that are not clinically indicated. To mitigate these risks, clinicians interacting with AMIE must receive appropriate training to critically evaluate its outputs, ensuring that they supplement clinical judgment rather than replace it.
Additionally, our research did not explore the potential benefits and risks from the perspective of patients. The early potential demonstrated here presents an opportunity for participatory research that includes the patient perspective on the many different workflows that LLM-assisted subspecialist consultation could enable. While AMIE’s performance was promising, our evaluation rubric highlighted notable areas for improvement, including diagnosis and triage. The complementary and assistive utility of the technology requires extensive further study before it could be considered safe for real-world use, and there are many other considerations beyond the scope of this work, including regulatory and equity research and validation in a wider range of clinical environments.
In conclusion, AMIE, a research LLM-based AI system, can improve general cardiologists’ assessments of complex cardiac patients. Assistance from AMIE led general cardiologists to produce assessments with significantly fewer errors, lower rates of erroneous extra content and equivalent clinical reasoning, in less time.