Design
We conducted a pragmatic, multicenter, parallel-group cluster-randomized controlled trial across a network of 16 primary care facilities operated by Penda Health in Nairobi and Kiambu counties in Kenya, comparing routine clinical officer-led consultations supported by an LLM to those conducted without LLM assistance. Randomization occurred at the level of the clinical officer, who constituted the unit of clustering, with patients nested within clinical officers and facilities.
Participants
Eligible clinical officers were those registered with the Clinical Officers Council of Kenya, actively providing outpatient care within participating facilities, and willing to use the EMR system for all consultations during the study period. Clinical officers who were not providing clinical care during the study period or who declined participation were excluded. Written informed consent was obtained from clinical officers before enrollment.
All patients whose consultation was led by a participating clinical officer during the study period were considered for inclusion. Eligibility was assessed by research assistants, who also invited patients to participate and obtained written informed consent before the consultation. Patients under 18 years of age attending for any reason were eligible, with those aged 12–18 years providing assent in addition to guardian consent. Adults or children attending for nonacute or planned wellness visits (such as weight checks, vaccinations or routine antenatal care), those unable to provide informed consent owing to impaired mental capacity, those unwilling or unable to be contacted for follow-up, and those requiring immediate emergency stabilization or referral at the time of screening were excluded.
Interventions
In both study arms, clinical officers used the same EMR system for documentation and order entry. In the intervention arm, clinical officers had access to a custom-built LLM-based CDSS feature (called ‘AI Consult’, version 2.0), which was embedded within the EMR. The system used GPT-4o (May 2025 release) with a temperature of 0.1, a top-p of 1.0 and a maximum output limit of 1,024 tokens. During each patient encounter, the LLM analyzed the information entered by the clinical officer (all structured and free-text data fields, excluding patient identifiers) and generated tailored diagnostic and therapeutic guidance. Clinical officers were not required to initiate a separate query. The system was initialized through structured system prompts that defined its clinical role, scope and constraints, to support alignment with Kenyan national treatment guidelines and the broader healthcare context, while retaining generative flexibility within these boundaries. The color-coded alert logic was implemented through a version-controlled prompt that specified explicit severity thresholds, rule-based criteria and constrained output formatting (that is, a JavaScript Object Notation (JSON) schema)26. Although outputs were generated via LLM inference, the use of fixed instructions, explicit severity definitions and few-shot examples constrained model behavior and supported reproducibility.
In the intervention arm, feedback was displayed using a three-tier visual signal, with green indicating no issues, yellow indicating minor issues and red indicating critical concerns, to guide provider attention. The EMR interface allowed clinicians to review the LLM-generated suggestions and selectively incorporate elements into the clinical documentation, including through copy-assisted entry where appropriate. Clinical officers retained autonomy over clinical assessment, documentation, diagnosis, prescribing and referral decisions, and could accept, modify or disregard the system’s suggestions.
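As an illustration of how a constrained JSON output with a three-tier severity signal can be checked before display, the following is a minimal sketch only. The field names (`severity`, `rationale`) and the function `parse_alert` are hypothetical and are not taken from the trial's prompt or schema, which are held in the study repository.

```python
import json
from dataclasses import dataclass

# Hypothetical tiers mirroring the green / yellow / red signal described above.
ALLOWED_SEVERITIES = ("green", "yellow", "red")


@dataclass(frozen=True)
class Alert:
    severity: str   # one of ALLOWED_SEVERITIES
    rationale: str  # short explanation shown to the clinician


def parse_alert(raw: str) -> Alert:
    """Parse one model response and reject anything outside the fixed format."""
    payload = json.loads(raw)  # raises ValueError (JSONDecodeError) on invalid JSON
    if not isinstance(payload, dict):
        raise ValueError("response must be a JSON object")
    if set(payload) != {"severity", "rationale"}:
        raise ValueError(f"unexpected keys: {sorted(payload)}")
    severity = payload["severity"]
    if severity not in ALLOWED_SEVERITIES:
        raise ValueError(f"unknown severity: {severity!r}")
    rationale = payload["rationale"]
    if not isinstance(rationale, str):
        raise ValueError("rationale must be a string")
    return Alert(severity=severity, rationale=rationale)
```

Rejecting any response that falls outside the fixed keys and severity values is one way a rule-based wrapper can keep free-form model output from reaching the interface unchecked.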
Clinical officers randomized to the control arm provided routine care using the same EMR system for documentation, clinical review and order entry, but with the AI Consult 2.0 feature disabled. Clinical officers were expected to follow routine clinical guidelines and practice standards irrespective of study allocation, and no additional incentives related to guideline adherence or documentation were introduced as part of the trial. The control arm therefore reflected usual care conditions within the participating facilities, including access to routine information resources available in clinical practice.
Randomization, allocation and blinding
Randomization occurred at the provider (clinical officer) level to mitigate potential learning effects that might influence patient outcomes. Block randomization was implemented by the study statistician, with variable block sizes of four, six or eight. Clinical officers remained assigned to their randomized study arm throughout the study period and across routine clinical shifts.
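Permuted-block randomization with variable block sizes can be sketched as follows. This is an illustration under stated assumptions rather than the statistician's actual procedure: the function name, the 1:1 allocation ratio, the seed handling and the absence of stratification are all assumptions, since the text specifies only the block sizes.

```python
import random


def block_randomize(n_units, block_sizes=(4, 6, 8),
                    arms=("intervention", "control"), seed=None):
    """Return a list of arm labels, one per unit, built from permuted blocks."""
    rng = random.Random(seed)  # a fixed seed makes the sequence reproducible
    allocation = []
    while len(allocation) < n_units:
        size = rng.choice(block_sizes)               # pick a block size at random
        block = list(arms) * (size // len(arms))     # equal numbers of each arm
        rng.shuffle(block)                           # permute within the block
        allocation.extend(block)
    # The final block may be truncated, so the arms can differ slightly in size.
    return allocation[:n_units]
```

Varying the block size makes the next assignment harder to predict than with a fixed block size, while keeping the arms close to balanced throughout enrollment.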
Patients were assigned to study arms according to the allocation of the clinical officer managing their consultation, consistent with the cluster-randomized design, and all patients within intervention clusters were considered fully exposed to the intervention. Within the EMR, AI Consult 2.0 was activated for clinicians randomized to the intervention arm and disabled for clinicians randomized to the control arm. Because clinicians worked independently within consultation rooms, the risk of contamination between clinicians within the same facilities was considered low.
Blinding of clinical officers was not possible because those assigned to the intervention arm interacted directly with the LLM functionality within the EMR. Participating patients remained blinded to allocation throughout the study. Research assistants conducting patient satisfaction interviews and day 3 and day 14 follow-up assessments were also blinded to participant allocation. Clinical officers retained full autonomy over clinical decision-making irrespective of study allocation and could disregard or follow AI recommendations at their discretion.
Outcomes
The primary outcome was treatment failure within 14 days of the index consultation. Treatment failure was defined as re-presentation to primary care with unresolved symptoms, unplanned escalation to higher-level or emergency care, or a safety or adverse event, including delayed or missed referral, inappropriate prescription, missed diagnosis, life-threatening event or death. Components involving clinical judgment were defined using prespecified criteria and standardized during training of the expert panel before adjudication. Outcome data were collected by research assistants (blinded to clinical officer allocation) in follow-up telephone calls on days 3 and 14 after the initial visit. For participants who were unreachable on initial attempts, additional calls were made daily until contact was achieved or until 2 weeks after the end of follow-up for the last enrolled participant.
Secondary outcomes included the quality of clinical documentation (assessed for appropriateness of diagnosis, comprehensiveness and adequacy of the treatment plan), the management of sentinel conditions (hypertension, type 2 diabetes and malnutrition), the appropriateness of antibiotic and antimalarial prescriptions, and patient satisfaction. Following completion of the consultation, a subset of 900 participants completed a structured same-day patient satisfaction interview administered by trained research assistants who did not have access to clinician allocation, intervention status or the EMR interface. Interviews assessed participants’ perceptions of the consultation, including clinician communication, perceived thoroughness of assessment, clarity of explanations and overall satisfaction.
An independent expert panel of six family physicians (five female and one male), each registered with the Kenya Medical Practitioners and Dentists Council and with 10–16 years of clinical experience in Kenya, adjudicated all primary and clinical secondary outcomes. For adjudication of the primary outcome, panel members reviewed standardized summaries derived from follow-up data and clinical event information, with clinician allocation removed. Adjudicators did not have access to decision-support outputs for primary outcome classification. Two panel members independently reviewed each reported event to determine whether it met criteria for treatment failure and whether it was related to the original presentation. Discrepancies were resolved through discussion, and when consensus could not be reached, a third panel member served as arbiter.
Sample size
The sample size calculation accounted for clustering at the clinical officer level and was powered to detect a 50% relative reduction in treatment failure within 14 days, from an expected failure proportion of 2% in the control arm to 1% in the intervention arm. Assuming a design effect of 1.5, 80% power, a two-sided alpha of 0.05 and 10% loss to follow-up, the target enrollment was 9,000 patient encounters, corresponding to 100 clinical officer clusters with a mean cluster size of approximately 90 encounters.
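The stated inputs can be checked against a standard two-proportion formula inflated by the design effect and the expected loss to follow-up. This is a rough sketch: the text does not say which formula the authors used, so the unpooled-variance normal approximation and the function names below are assumptions.

```python
from math import ceil
from statistics import NormalDist


def total_sample_size(p_control, p_intervention, design_effect,
                      power=0.80, alpha=0.05, loss=0.10):
    """Total encounters across two equal arms (normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # two-sided significance level
    z_beta = z.inv_cdf(power)
    variance = (p_control * (1 - p_control)
                + p_intervention * (1 - p_intervention))
    n_per_arm = (z_alpha + z_beta) ** 2 * variance / (p_control - p_intervention) ** 2
    n_per_arm *= design_effect           # inflate for clustering
    n_per_arm /= (1 - loss)              # inflate for loss to follow-up
    return 2 * ceil(n_per_arm)


def implied_icc(design_effect, mean_cluster_size):
    """Intracluster correlation implied by DE = 1 + (m - 1) * ICC."""
    return (design_effect - 1) / (mean_cluster_size - 1)


n_total = total_sample_size(0.02, 0.01, design_effect=1.5)
icc = implied_icc(1.5, 90)
```

Under these assumptions the formula yields roughly 7,700 encounters, somewhat below the stated target of 9,000; the text does not describe the source of the difference. A design effect of 1.5 with a mean cluster size of 90 implies an intracluster correlation of about 0.006.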
Data collection, management and monitoring
Consultation data were recorded directly within the EMR at the point of care by the clinical officers. Follow-up data were collected using structured electronic case report forms administered by trained research assistants. Data quality was maintained through automated validation checks embedded within the EMR, routine data monitoring queries and periodic review by an independent monitoring team. All electronic data were stored on secure, password-protected servers with encrypted backups. Access was restricted through role-based permissions.
Statistical analysis
The primary analyses followed the ITT principle at the level of the randomized clinician cluster, with participants analyzed according to the allocation of the treating clinician. Missing outcomes were handled using complete-case analysis given the low proportion of missing data. Protocol deviations were defined as encounters in which patients were managed by clinicians assigned to a different study arm than originally recorded, resulting in potential exposure misclassification. These encounters were excluded from per-protocol analyses but retained in the ITT analysis. A detailed classification of protocol deviations and violations is provided in the publicly accessible study repository26. In all analyses, random effects for clinical officer and facility were included to account for clustering. For binary outcomes (including the primary outcome), a mixed-effects logistic regression model was used to estimate the aOR, with its corresponding 95% CI. Ordinal secondary outcomes were analyzed using a mixed-effects proportional-odds logistic regression model. The risk difference was estimated by computing marginal predicted risks for each treatment group from the fitted logistic model and taking their difference, with corresponding credible intervals obtained using Bayesian multilevel logistic regression. A one-stage Bayesian multilevel individual participant data meta-analysis was used to estimate pooled and hospital-specific treatment effects while accommodating sparse data and clustering. Between-hospital heterogeneity was quantified through the random treatment-slope variance. Prespecified exploratory subgroup analyses examined potential effect modification by patient age group, presenting condition (sentinel versus nonsentinel) and consultation timing.
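One way to write the binary-outcome model and the derived estimands is sketched below. The notation is an assumption introduced for illustration: patient i is treated by clinical officer j at facility k, T_j is the officer's allocation, and x_ijk stands for the adjustment covariates, which are not listed in this section.

```latex
\begin{aligned}
\operatorname{logit}\Pr(Y_{ijk}=1) &= \beta_0 + \beta_1 T_j + \mathbf{x}_{ijk}^{\top}\boldsymbol{\gamma} + u_j + v_k,\\
u_j &\sim \mathcal{N}(0,\sigma_u^2), \qquad v_k \sim \mathcal{N}(0,\sigma_v^2),\\
\mathrm{aOR} &= \exp(\beta_1),\\
\mathrm{RD} &= \frac{1}{N}\sum_{i,j,k}\bigl[\hat{\pi}_{ijk}(T=1) - \hat{\pi}_{ijk}(T=0)\bigr].
\end{aligned}
```

Here \hat{\pi}_{ijk}(T=t) denotes the predicted risk for encounter ijk with the allocation set to t, and N is the number of encounters. How the random effects u_j and v_k enter the marginal predictions (for example, averaged over their posterior distribution) is not specified in the text.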
In post hoc analyses, we assessed the effect of the intervention on a composite outcome of hospitalization or death within 14 days, the per-patient cost of LLM use (calculated from tokens generated per consultation multiplied by the unit token cost and summarized as means with 95% CIs), consultation costs by medication category and consultation duration. Consultation costs were compared using multilevel linear regression with an interaction between study arm and drug category, and random effects for clinical officer and facility. Consultation duration was compared using the Wilcoxon rank-sum test, reporting medians, IQRs and the median difference. All post hoc analyses were labeled as nonconfirmatory and interpreted cautiously without adjustment for multiple comparisons.
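The per-patient LLM cost calculation described above (tokens generated per consultation multiplied by the unit token cost, summarized as a mean with a 95% CI) can be sketched as follows. The function name, the normal-approximation interval and the example values are assumptions for illustration and are not trial data or the authors' stated method for the interval.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def cost_summary(tokens_per_consultation, unit_cost_per_token, level=0.95):
    """Return (mean cost, lower CI bound, upper CI bound) per consultation."""
    costs = [t * unit_cost_per_token for t in tokens_per_consultation]
    m = mean(costs)
    se = stdev(costs) / sqrt(len(costs))        # standard error of the mean
    z = NormalDist().inv_cdf(0.5 + level / 2)   # about 1.96 for a 95% interval
    return m, m - z * se, m + z * se


# Hypothetical token counts and unit price, for illustration only.
example_mean, example_low, example_high = cost_summary([800, 1000, 1200], 0.00001)
```

This interval treats consultations as independent; an analysis respecting the trial design would also need to account for clustering by clinical officer.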
Analyses were conducted in R (version 4.5.1). Reporting followed the Consolidated Standards of Reporting Trials (CONSORT)-AI extension for clinical trials evaluating AI interventions42 and the CONSORT 2010 statement: extension to cluster randomised trials43.
Ethics and oversight
All data handling complied with the Kenya Data Protection Act (2019). Personal identifiers (including names, contact details and national identification numbers) were removed before model processing. The decision-support system operated within a secure clinical environment, and only de-identified clinical information required for generating recommendations was processed by the model. Role-based access controls were implemented within the EMR to restrict system access to authorized users. Data were retained in accordance with institutional data governance policies and study approvals. During the consent process, participants were informed that an electronic clinical decision-support tool incorporating AI would be used to assist clinicians during consultations. Participant information sheets and consent form templates are available alongside the study protocol26.
Adverse events and serious adverse events were monitored throughout the study using standardized procedures prespecified in the trial protocol. Suspected events were identified through follow-up assessments conducted by trained research assistants independent of the implementing healthcare organization, and internal review of clinical records where indicated. Treating clinicians retained full responsibility for patient management decisions, including escalation or referral where clinically indicated. An experienced study physician conducted the initial clinical review in consultation with the principal investigator as part of safety oversight and reporting procedures, after which cases were reviewed by the independent expert adjudication panel for final determination and attribution. All serious adverse events, including deaths, were reported to the study sponsor, ethics committees and the independent data and safety monitoring board (DSMB) within 48 h of investigator awareness, with a detailed follow-up report submitted within five working days (or seven calendar days) as additional information became available. The DSMB conducted scheduled safety reviews during the study, including an early review after enrollment of the first 1,000 participants, and had authority to recommend protocol modification, temporary suspension or termination if safety concerns arose. Because the intervention functioned as clinician-facing decision support rather than automated clinical management, formal efficacy stopping rules were not prespecified. Internal operational reviews conducted by the implementing organization did not influence outcome classification or causality determinations for the trial. Full details of safety monitoring and reporting procedures are provided in the published protocol, while a complete, de-identified listing of adverse events and serious adverse events observed during the trial is available in the publicly accessible study repository26.
Ethics approvals
The study received ethics approval from the Amref Health Africa Ethical and Scientific Review Committee (P1817/2025), with additional authorization from Nairobi (NCCG/HWN/REC/752) and Kiambu (HRDU/PAA/04/2025) counties and from the National Commission for Science, Technology and Innovation (P/25/416731). The Kenyan medical device regulator (the Pharmacy and Poisons Board) determined that the product fell outside its oversight scope, and thus no local equivalent to an ‘investigational device exemption’ was submitted.
Ethics and inclusion statement
This study was codesigned and implemented with local researchers and clinicians in Kenya, who were actively involved in study conception, protocol development, contextual adaptation of the intervention, participant recruitment, data collection, clinical evaluation, interpretation of findings and paper preparation. The research addressed a question of direct relevance to the local health system and was conducted within routine primary care settings serving the participating communities. The independent evaluation panel comprised six locally licensed Kenyan physicians with experience in the study context. Authorship reflects substantive contributions to the work, consistent with International Committee of Medical Journal Editors principles, with representation from investigators based in the study setting across study leadership, implementation, analysis and writing roles. Findings were shared in a dissemination workshop involving operational leadership within the implementing healthcare organization, Ministry of Health leadership and other local stakeholders to support service improvement and ensure continued local relevance and benefit.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
