Clinical datasets
All research procedures were conducted in accordance with the Declaration of Helsinki. Clinical tumor board documents and physician notes used to construct the HemaGuide clinical case memory were obtained via the Big Data Repository Used for Artificial Intelligence to Better Understand Hematologic Neoplasia (‘BRAIN’) database, a department-wide research repository established at the Department of Hematology, Oncology and Rheumatology of Heidelberg University Hospital. This study framework was approved by the Ethics Committee of the Medical Faculty of Heidelberg University (reference S-837/2019) and the project ‘HemaGuide – a case-grounded AI agent for clinical decision support in hematological malignancies’ was specifically approved by the Ethics Committee of the Medical Faculty of Heidelberg University (reference S-149/2026). Consent-independent data use was performed on the legal basis of Section 6(1) and (2) of the German Health Data Use Act (Gesundheitsdatennutzungsgesetz): the data originate from the healthcare institution’s own clinical records. In accordance with the BRAIN framework, data are handled under applicable confidentiality and data-protection regulations, with patient information processed locally and in pseudonymized form only; no transfer of patient data to third parties occurs and third parties did not receive access to original source documents. For benchmarking against external open- and closed-source models, cases were fully anonymized before use: on the basis of each pseudonymized original case, a new, clinically equivalent case was constructed that preserves the underlying decision situation in its complexity but precludes re-identification of the original patient. 
As part of this anonymization process, all temporal information relating to diagnosis, treatment course and disease dynamics was systematically altered; cytogenetic and molecular genetic findings were entirely replaced by fictitious yet clinically coherent constellations; rare co-diagnoses and unusual medication combinations that could be identifying in aggregate were substituted with clinically comparable alternatives or generalized; and further potentially identifying details (for example, occupational history, origin and family constellation) were removed. Age values were adjusted only within narrow ranges that preserved the original clinical decision context and never crossed clinically relevant thresholds (for example, transplant eligibility, intensive versus nonintensive therapy stratification or fitness categories per ELN (European LeukemiaNet)/IMWG (International Myeloma Working Group) criteria), such that no influence on routing or treatment recommendations was introduced by the anonymization step. No original genetic data were published or transmitted at any point. All study datasets, intermediate files and logs were stored exclusively on access-controlled infrastructure located on hospital premises and were processed for research purposes only. To ensure data protection, HemaGuide was deployed with a locally hostable inference stack so that case narratives were not transmitted to external model providers.
System architecture
HemaGuide works through a pipeline architecture with the following steps: (1) pre-runtime memory-construction, (2) structured extraction from input cases, (3) contextual enrichment and routing, (4) agentic decision tool selection and execution and (5) context aggregation with transparent reasoning. The architecture implements three specialized decision tools, which are autonomously selected by the agent: guideline mode, advanced mode and molecular mode. Each mode invokes distinct computational tools and input case handling.
Knowledge-base architecture for guidelines and SOP repositories
The system maintains a curated repository of entity-specific treatment algorithms, mainly derived from European guidelines and institutional SOPs. These guidelines are encoded as structured decision flowcharts representing canonical treatment pathways organized hierarchically by disease entity, risk stratification category and treatment line.
Flowcharts are stored as plain-text files (.txt) and designed with explicit decision nodes and branching logic for direct insertion into small LLM contexts without any further processing. This ensures that guideline-based reasoning can operate on authoritative, version-controlled source material, which could also be maintained and updated by an LLM at some point in the future.
Knowledge-base architecture for clinical case memory
A corpus of more than 2,000 institutional tumor board decisions with longitudinal patient follow-up data is indexed in a ChromaDB vector store to enable semantic similarity search. Unlike conventional document-level embedding approaches, we implement section-aware indexing that generates independent embedding vectors for separate clinically meaningful document segments. For each extracted case, the system computes embeddings, for example, for the clinical history narrative, primary diagnosis with staging information and (after the enrichment phase described below) the ‘expanded’ clinical question and a therapy summary. Each embedding is stored with associated metadata including a unique document identifier, section name, entity classification, source file path and fully anonymous case number. This decomposition enables retrieval on the basis of specific clinical dimensions: a query case with an unusual treatment history can match historical cases with similar therapeutic trajectories even when exact diagnoses differ, while a case with a rare cytogenetic profile can match on diagnostic features independently of treatment history. By default, embeddings are generated with a lightweight model suitable for local deployment (embeddinggemma:300m via Ollama, with a maximum dimension of 768); text-embedding-3-large (up to 3,072 dimensions) is available as a cloud-based alternative for faster and more precise processing. The vector store uses cosine similarity as a distance metric with a configurable similarity threshold (default 0.7) to remove lower-confidence matches. We determined this threshold empirically and additionally over-retrieve by a factor of 3 relative to the desired number of similar cases; the retrieved candidates are then re-ranked and filtered by relevance in a further LLM call, in which the model is prompted to act as an expert in analyzing medical similarity.
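As an illustration only, the section-aware indexing described above could be sketched as follows. The section keys, the record layout and the toy_embed stand-in are assumptions made for this sketch; they are not the HemaGuide schema, and a real deployment would call an embedding model and write the records to the vector store.

```python
from typing import Callable, Dict, List

# Hypothetical section keys; the actual field names are not given in the text.
SECTIONS = ("clinical_history", "primary_diagnosis",
            "expanded_question", "therapy_summary")


def build_section_records(case: Dict[str, str],
                          embed: Callable[[str], List[float]]) -> List[dict]:
    """Create one record (id, embedding, metadata) per non-empty section."""
    records = []
    for section in SECTIONS:
        text = case.get(section, "").strip()
        if not text:
            continue  # sections missing from this case are simply not indexed
        records.append({
            "id": f"{case['document_id']}::{section}",
            "embedding": embed(text),
            "metadata": {
                "document_id": case["document_id"],
                "section": section,
                "entity": case["entity"],
                "source_path": case["source_path"],
                "case_number": case["case_number"],
            },
        })
    return records


def toy_embed(text: str) -> List[float]:
    """Stand-in for a real embedding model: three crude text statistics."""
    return [float(len(text)), float(text.count(" ")),
            float(sum(map(ord, text)) % 97)]


example_case = {
    "document_id": "doc-001", "entity": "myeloma",
    "source_path": "cases/doc-001.docx", "case_number": "C-0001",
    "clinical_history": "Fictitious history text.",
    "primary_diagnosis": "Fictitious diagnosis text.",
}
records = build_section_records(example_case, toy_embed)
```

Because every section receives its own vector and carries its section name in the metadata, a query can be restricted to one clinical dimension at search time.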
Deduplication control
We implemented a patient-level deduplication filter that checks the unique, fully anonymized case identifier before evaluation runs, excluding all tumor board documents from the same patient (across different time points and case identifiers). This prevents both retrieval of the query case itself and retrieval of temporally adjacent cases from the same patient’s trajectory.
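A minimal sketch of such a patient-level filter is given below. It assumes a hypothetical lookup table, case_to_patient, that maps each anonymized case identifier to an anonymized patient key; how this mapping is maintained in practice is not described here.

```python
from typing import Dict, List


def exclude_same_patient(candidate_case_ids: List[str],
                         query_case_id: str,
                         case_to_patient: Dict[str, str]) -> List[str]:
    """Drop every candidate belonging to the same patient as the query case."""
    query_patient = case_to_patient[query_case_id]
    return [cid for cid in candidate_case_ids
            if case_to_patient.get(cid) != query_patient]


# Hypothetical identifiers: C-1 and C-2 are two time points of one patient.
case_to_patient = {"C-1": "P-A", "C-2": "P-A", "C-3": "P-B"}
kept = exclude_same_patient(["C-1", "C-2", "C-3"], "C-1", case_to_patient)
```

Filtering on the patient key rather than on the case identifier removes both the query case itself and any earlier or later board presentation of the same patient.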
Preprocessing and information extraction from clinical documents
Clinical tumor board documents or physician’s notes in Microsoft Word format (.docx) undergo a two-phase extraction process to generate semi-structured JSON representations suitable for downstream processing, starting with the classification of the hematological entity. The primary extraction phase employs entity-aware prompts to populate a nine-field clinical schema. We also extract the individual prior therapy lines documented in the patient history and transform them into a standardized format. This module is applied both during preprocessing for memory-building and for input document extraction at runtime. A parallel extraction module processes molecular appendix content, demarcated by the institutional ‘molecular genetics results’ header, to capture three molecular fields: next-generation sequencing variant data (gene symbol, transcript identifier, coding sequence change in Human Genome Variation Society (HGVS) notation, amino acid change and variant allele frequency), fluorescence in situ hybridization findings with aberration percentages and molecular pathology recommendations. Entity-specific gene panels prime extraction for the most frequent drivers without restricting it. Following extraction, a rule-based post-processing module applies entity-specific risk classification schemas to cytogenetic findings to ensure consistency with established prognostic frameworks including the Revised International Staging System for myeloma15 and European LeukemiaNet recommendations for acute leukemias16, as documented classification may be missing or outdated. All extraction calls operate at temperature 0.1 to minimize output variability while preserving extraction accuracy.
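To make the next-generation sequencing variant fields concrete, one possible record type is sketched below. The class name, attribute names and validation rules are assumptions for illustration, as is the choice to store variant allele frequency as a fraction between 0 and 1; the example variant is a generic textbook hotspot and not study data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NgsVariant:
    gene_symbol: str      # e.g. "TP53"
    transcript_id: str    # transcript identifier, e.g. a RefSeq accession
    hgvs_c: str           # coding sequence change in HGVS notation
    protein_change: str   # amino acid change
    vaf: float            # assumed here to be a fraction in [0, 1]

    def __post_init__(self) -> None:
        # Light sanity checks only; these are not a full HGVS validator.
        if not self.hgvs_c.startswith("c."):
            raise ValueError("coding change should start with 'c.'")
        if not 0.0 <= self.vaf <= 1.0:
            raise ValueError("variant allele frequency must lie in [0, 1]")


variant = NgsVariant("TP53", "NM_000546.6", "c.743G>A", "p.Arg248Gln", 0.42)
```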
Agent orchestration for contextual enrichment of attention-focused retrieval
Before decision generation, extracted cases pass through a three-phase enrichment protocol, with the aim of more structured and focused representations optimized for downstream retrieval and language model attention allocation. This enrichment addresses one key limitation of retrieval-augmented generation (RAG)-based applications with smaller language models: the difficulty of identifying relevant information within lengthy or very complex clinical narratives. The enrichment phases proceed sequentially, each implemented as a separate LLM call at temperature 0.1. First, a question expansion call synthesizes the original clinical question with relevant history and diagnostic context to produce a comprehensive problem statement that articulates the specific clinical dilemma, relevant prognostic factors and constraints on treatment selection including comorbidities and patient preferences. Second, a classification call assigns a tumor board type category (initial diagnosis, relapse, refractory disease, second opinion or routine follow-up) on the basis of the expanded question content. Third, a decision expansion call integrates the documented tumor board recommendation with the full clinical context to produce an enriched decision rationale that explicates the reasoning underlying the treatment selection.
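The three sequential calls can be sketched as below. The llm callable, the prompt wording and the dictionary keys are hypothetical placeholders; only the order of the calls, the temperature of 0.1 and the five tumor board categories are taken from the description above. Treating the decision expansion as optional for cases without a documented recommendation is likewise an assumption of this sketch.

```python
from typing import Callable, Dict

BOARD_TYPES = ("initial diagnosis", "relapse", "refractory disease",
               "second opinion", "routine follow-up")


def enrich_case(case: Dict[str, str],
                llm: Callable[[str, float], str],
                temperature: float = 0.1) -> Dict[str, str]:
    """Run question expansion, board-type classification, decision expansion."""
    enriched = dict(case)
    # Phase 1: expand the clinical question with history and diagnosis.
    enriched["expanded_question"] = llm(
        "Expand question: " + case["question"]
        + "\nHistory: " + case["history"]
        + "\nDiagnosis: " + case["diagnosis"], temperature)
    # Phase 2: classify the board type from the expanded question.
    label = llm("Classify board type (" + ", ".join(BOARD_TYPES) + "): "
                + enriched["expanded_question"], temperature).strip().lower()
    enriched["board_type"] = label if label in BOARD_TYPES else "unclassified"
    # Phase 3: only possible when a documented recommendation exists.
    if case.get("decision"):
        enriched["expanded_decision"] = llm(
            "Explain decision: " + case["decision"]
            + "\nContext: " + enriched["expanded_question"], temperature)
    return enriched


calls = []


def stub_llm(prompt: str, temperature: float) -> str:
    """Deterministic stand-in for a model call, used only for this example."""
    calls.append((prompt.split(":")[0], temperature))
    return "Relapse" if prompt.startswith("Classify") else "stub text"


result = enrich_case({"question": "Next line?", "history": "fictitious",
                      "diagnosis": "fictitious", "decision": "fictitious"},
                     stub_llm)
```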
Rather than presenting the language model with extensive unstructured narratives from which relevant precedents must be identified, the enriched representations provide concise, structured question–decision pairs. This attention-focusing mechanism proves particularly valuable when operating with smaller, locally deployed models (≤120B parameters) that exhibit degraded performance when relevant information is distributed across the context window.
Agent orchestration for agentic routing
Following the extraction and enrichment of query cases, an agent routing module evaluates case characteristics to select the appropriate decision tool from our three options: on the basis of the entity of the case, the corresponding entity-specific data from our library is loaded into the context window and the model can assess the complexity and then select the appropriate tool. Cases explicitly flagged as molecular tumor board consultations, identified through the presence of molecular genetics data, automatically route to molecular mode without requiring a language model routing call. For nonmolecular cases, a routing call presents the case summary alongside descriptions of available decision tools and instructs the model to select the appropriate pathway. The routing decision considers whether the clinical question falls within established guideline coverage (favoring guideline mode), whether unusual features suggest benefit from historical precedent review and literature consultation (favoring advanced mode), or whether molecular findings require systematic variant interpretation (requiring molecular mode).
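The routing logic can be summarized in a few lines. In this sketch the llm_route callable and the dictionary keys are hypothetical, and falling back to advanced mode when the model returns an unrecognized label is an assumption rather than documented behavior.

```python
from typing import Callable, Dict

MODES = ("guideline", "advanced", "molecular")


def route_case(case: Dict[str, object],
               llm_route: Callable[[str], str]) -> str:
    """Select a decision tool; molecular cases bypass the LLM routing call."""
    if case.get("molecular_findings"):
        return "molecular"
    choice = llm_route(str(case["summary"])).strip().lower()
    # Assumption: an unparseable answer defaults to the broadest tool.
    return choice if choice in MODES else "advanced"


def must_not_be_called(summary: str) -> str:
    raise AssertionError("routing call should be skipped for molecular cases")


molecular_route = route_case(
    {"summary": "fictitious", "molecular_findings": ["fictitious variant"]},
    must_not_be_called)
standard_route = route_case({"summary": "fictitious"}, lambda s: " Guideline ")
```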
Handling of diagnostically uncertain cases
When entity classification during extraction does not yield a recognized hematological entity, for example, because the diagnosis is pending, ambiguous or outside the system’s current entity list, the system assigns a generic fallback classification rather than forcing an incorrect entity assignment. Cases with explicit molecular genetics data are routed to molecular mode regardless of entity classification certainty. Nonmolecular fallback cases are automatically routed to advanced mode, where the similarity search operates without entity filtering: the system queries the clinical decision memory across all disease groups using only the clinical history embedding vector, thereby broadening the evidence base to capture relevant precedents that may span multiple diagnostic categories. For literature retrieval, the term that is recognized and used as the main diagnosis (and not categorized as a hematological entity) is translated from German to English and then used for PubMed and Crossref queries. If no retrieval source meets the configured similarity threshold and literature searches return no relevant results, the system degrades to a plain LLM generation, transparently flagging the absence of grounding evidence and the unclear diagnosis in the output metadata alongside a failure reason (for example, ‘no sources’, ‘LLM synthesis error’ or ‘all sources irrelevant’). This cascading fallback architecture ensures that diagnostically uncertain cases are never forced into entity-specific guideline pathways, while making the degree of available evidentiary support explicit to the reviewing clinician.
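One way to record the degree of evidentiary support in the output metadata is sketched below. The three failure reasons are those named above, but the conditions that trigger each of them, the function name and the metadata keys are assumptions made for illustration.

```python
from typing import Dict, List, Optional


def grounding_metadata(similar_cases: List[str],
                       articles: List[str],
                       n_relevant: int,
                       synthesis_ok: bool = True) -> Dict[str, object]:
    """Return a flag and failure reason describing available grounding."""
    reason: Optional[str]
    if not similar_cases and not articles:
        reason = "no sources"               # nothing was retrieved at all
    elif n_relevant == 0:
        reason = "all sources irrelevant"   # retrieved, but all filtered out
    elif not synthesis_ok:
        reason = "LLM synthesis error"      # evidence exists, synthesis failed
    else:
        reason = None
    return {"grounded": reason is None, "failure_reason": reason}


ungrounded = grounding_metadata([], [], n_relevant=0)
grounded = grounding_metadata(["C-7"], ["article-1"], n_relevant=2)
```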
Agent orchestration of guideline mode
Guideline mode provides protocol-based decision support for cases conforming to established early-line treatment algorithms. Upon invocation, the tool retrieves the entity-specific treatment flowchart from the repository and constructs a generation prompt that embeds the complete flowchart directly in the language model context. The generation call receives the full enriched case data alongside the flowchart and explicit instructions to follow the algorithmic decision pathway while documenting which flowchart nodes were traversed to reach the recommendation. This transparency requirement enables verification that the generated recommendation logically follows from the guideline structure rather than representing model confabulation. The output comprises a tumor board recommendation stating the proposed treatment regimen and a rationale documenting the decision pathway with explicit flowchart references. Guideline mode is appropriate for initial diagnosis cases with standard risk profiles, where treatment selection follows well-defined algorithmic pathways and the primary value of decision support lies in ensuring guideline adherence and documentation completeness rather than synthesizing novel treatment approaches.
Agent orchestration of advanced mode
For cases exceeding guideline-based decision pathways, HemaGuide employs a multisource evidence synthesis workflow. It integrates our clinical case memory, peer-reviewed literature (PubMed search) and conference proceedings (Crossref search) through deterministic retrieval, followed by a synthesis step in which an LLM call selects the most relevant information and condenses it.
Similar case retrieval
For case retrieval, the clinical case memory ChromaDB vector store is queried using a section matching strategy. For a given query case, the system generates embedding vectors for the clinical sections, then executes similarity searches against the corresponding indexed fields in the knowledge base, with clinical history similarity proving most effective and therefore used exclusively. Results can be aggregated across sections by computing mean similarity scores for each candidate document for identification of cases that exhibit clinical similarity across multiple dimensions rather than single-feature matches. The retrieval applies entity filtering to constrain results to the same hematological malignancy category and enforces a similarity threshold to exclude low-confidence matches in a first step. The default configuration over-retrieves the number of desired similar cases (n = 2) up to the factor of three and excludes low confidence matches below the cosine similarity threshold of 0.7, which turned out to be a reasonable threshold for our clinical case memory. Then, LLM-based re-ranking filters again for the three most similar cases regarding the prior treatments.
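The first retrieval stage (entity filter, cosine threshold and over-retrieval) can be illustrated in plain Python without a vector store. The record layout and function names are assumptions; the defaults mirror the values stated above (two desired cases, factor of three, threshold 0.7), and the subsequent LLM-based re-ranking is not shown.

```python
import math
from typing import List, Optional, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_candidates(query_vec: List[float], index: List[dict],
                        entity: Optional[str], n_desired: int = 2,
                        factor: int = 3,
                        threshold: float = 0.7) -> List[Tuple[float, str]]:
    """Return up to n_desired * factor (score, case_number) pairs."""
    scored = []
    for rec in index:
        if entity is not None and rec["entity"] != entity:
            continue  # entity filter; None searches all disease groups
        score = cosine(query_vec, rec["embedding"])
        if score >= threshold:
            scored.append((score, rec["case_number"]))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[: n_desired * factor]


# Toy two-dimensional vectors standing in for clinical-history embeddings.
index = [
    {"case_number": "A", "entity": "leukemia", "embedding": [1.0, 0.0]},
    {"case_number": "B", "entity": "leukemia", "embedding": [0.9, 0.1]},
    {"case_number": "C", "entity": "leukemia", "embedding": [0.0, 1.0]},
    {"case_number": "D", "entity": "lymphoma", "embedding": [1.0, 0.0]},
]
same_entity = retrieve_candidates([1.0, 0.0], index, entity="leukemia")
all_groups = retrieve_candidates([1.0, 0.0], index, entity=None)
```

Passing entity=None corresponds to the unfiltered search used for cases without a recognized entity.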
PubMed retrieval
Literature retrieval is performed by constructing deterministic PubMed queries from structured case features without language model involvement for maximum reproducibility across executions. Query construction combines entity-specific Medical Subject Heading terms, disease state descriptors derived from the tumor board type classification (for example, ‘relapsed OR refractory’ for relapse cases, ‘newly diagnosed OR first-line’ for initial diagnosis cases) and gene symbols for any actionable variant identified during extraction. The query targets high-evidence publication types: randomized controlled trials, meta-analyses and systematic reviews published within the preceding 5 years, with automatic expansion to 15 years if insufficient results are retrieved.
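A deterministic query builder of this kind might look as follows. The field tags ([MeSH Terms], [pt], [dp]) are standard PubMed search syntax, but the exact query template, the min_results cutoff that defines "insufficient results" and the search callable are assumptions of this sketch.

```python
from datetime import date
from typing import Callable, List, Optional

PUB_TYPES = ('"Randomized Controlled Trial"[pt]', '"Meta-Analysis"[pt]',
             '"Systematic Review"[pt]')
STATE_TERMS = {
    "relapse": "(relapsed OR refractory)",
    "initial diagnosis": '("newly diagnosed" OR "first-line")',
}


def build_query(mesh_term: str, board_type: str, genes: List[str],
                years: int, today: Optional[date] = None) -> str:
    """Assemble a PubMed query string from structured case features."""
    today = today or date.today()
    parts = [f'"{mesh_term}"[MeSH Terms]']
    if board_type in STATE_TERMS:
        parts.append(STATE_TERMS[board_type])
    if genes:
        parts.append("(" + " OR ".join(sorted(genes)) + ")")
    parts.append("(" + " OR ".join(PUB_TYPES) + ")")
    parts.append(f"{today.year - years}:{today.year}[dp]")
    return " AND ".join(parts)


def search_with_expansion(search: Callable[[str], list], mesh_term: str,
                          board_type: str, genes: List[str],
                          min_results: int = 5,
                          today: Optional[date] = None) -> list:
    """Try a 5-year window first, then widen to 15 years if too few hits."""
    hits: list = []
    for years in (5, 15):
        hits = search(build_query(mesh_term, board_type, genes, years, today))
        if len(hits) >= min_results:
            break
    return hits


fixed_day = date(2026, 1, 1)
query = build_query("Multiple Myeloma", "relapse", ["TP53"], 5, fixed_day)
```

Sorting the gene symbols keeps the query string identical across runs for the same case, which is the point of avoiding language model involvement at this step.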
Crossref retrieval
Conference literature retrieval supplements PubMed coverage by querying the Crossref API (application programming interface) for proceedings from major hematology/oncology meetings not consistently indexed in MEDLINE. The module targets three primary venues: the American Society of Hematology Annual Meeting (Blood), the American Society of Clinical Oncology Annual Meeting (Journal of Clinical Oncology) and the European Hematology Association Congress (HemaSphere). Queries are filtered by ISSN to ensure precision, with a default 5-year publication window that automatically extends to 10 years if initial retrieval yields insufficient results.
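For illustration, an ISSN-restricted request to the Crossref works endpoint can be assembled as shown below. The query, filter and rows parameters and the issn and from-pub-date filter names belong to the public Crossref REST API; the ISSN values listed are assumptions that should be verified against the journals' records before use, and the function name is hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Assumed ISSNs for the three venues; verify before relying on them.
VENUE_ISSNS = {
    "Blood": "0006-4971",
    "Journal of Clinical Oncology": "0732-183X",
    "HemaSphere": "2572-9241",
}


def crossref_url(query: str, from_year: int, rows: int = 20) -> str:
    """Build a Crossref /works URL limited to the venues and a start year."""
    filters = [f"issn:{issn}" for issn in VENUE_ISSNS.values()]
    filters.append(f"from-pub-date:{from_year}")
    params = {"query": query, "filter": ",".join(filters), "rows": rows}
    return "https://api.crossref.org/works?" + urlencode(params)


example_url = crossref_url("multiple myeloma relapsed", from_year=2021)
# Decode the query string again to inspect the filter that will be sent.
decoded_filter = parse_qs(urlparse(example_url).query)["filter"][0]
```

Extending the window from 5 to 10 years then only requires calling the function again with an earlier from_year.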
Context synthesis
Retrieved evidence (similar cases and literature abstracts) is ranked again, and an LLM then selects the most promising results on the basis of this ranking. Afterward, the selected information undergoes context tailoring through a synthesis call that extracts clinically relevant insights. The tailoring prompt instructs the model to identify patterns across similar cases (for example, consistent treatment selections for comparable clinical scenarios and outcomes observed with specific regimens) and to extract applicable evidence from the literature (for example, efficacy data, safety signals and guideline recommendations) without prematurely synthesizing these into a treatment decision.
Treatment recommendations in advanced mode
The final generation call receives the complete enriched case, the entity-specific guideline flowchart, the synthesized case precedent summary and the synthesized literature evidence. The model integrates these context sources to generate an annotated treatment recommendation that balances guideline adherence with case-specific considerations informed by clinical case memory precedent and currently available evidence.
Agent orchestration of molecular mode with federated knowledge curation
The molecular mode implements automated somatic variant classification according to the ClinGen/CGC/VICC SOP published by Horak et al.10. This module extends beyond retrieval-augmented generation by orchestrating real-time queries across eight biomedical knowledge bases to systematically evaluate 12 evidence criteria for each detected somatic variant. In addition, it performs a gene matching search in the clinical case memory.
In brief, HGVS notation is translated to genomic coordinates via the Ensembl Variant Effect Predictor REST API17 with fallback to NCBI Variation Services when transcript versions are not recognized. Population frequency is queried from the Genome Aggregation Database (gnomAD v4)18 to retrieve allele frequencies across population subgroups (SBVS1, SBS1 and OP4). Mutational hotspot recurrence is queried from the cancerhotspots.org19 database, with fallback to local Catalogue of Somatic Mutations in Cancer20 data, to identify variants at established mutational hotspots (OS3, OM3 and OP3). Null-variant status in tumor suppressors is identified by rule-based parsing (OVS1 and OM2). Functional domain membership is mapped against UniProt21. In silico pathogenicity predictions are retrieved from MyVariant.info22,23, which aggregates the database of Non-synonymous Functional Predictions (dbNSFP)24 annotations. These predictions are then evaluated across REVEL, CADD, SIFT and PolyPhen-2 (OP1 and SBP1). Finally, the precedence rules specified by Horak et al.10 are applied.
Classification and treatment recommendations in molecular mode
Points from all applicable criteria are summed to yield oncogenicity classifications: oncogenic (≥10 points), likely oncogenic (6 to 9 points), variant of uncertain significance (0 to 5 points), likely benign (−6 to −1 points) or benign (≤−7 points). For variants classified as oncogenic or likely oncogenic, HemaGuide executes targeted literature searches.
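The point-to-class mapping stated above translates directly into code. The thresholds are those given in the text; the function names and the example criterion points are illustrative, and the precedence rules between criteria are deliberately not modeled.

```python
from typing import Dict


def classify_oncogenicity(points: int) -> str:
    """Map a summed criterion score to an oncogenicity class."""
    if points >= 10:
        return "oncogenic"
    if points >= 6:
        return "likely oncogenic"
    if points >= 0:
        return "variant of uncertain significance"
    if points >= -6:
        return "likely benign"
    return "benign"


def classify_from_criteria(criterion_points: Dict[str, int]) -> str:
    """Sum the points of all criteria that were met and classify the total."""
    return classify_oncogenicity(sum(criterion_points.values()))


# Illustrative input only: one strong and two supporting criteria.
example_class = classify_from_criteria({"OS3": 4, "OP4": 1, "OP1": 1})
```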
Retrieved literature is assessed through an LLM and rated by relevance (high, medium and low). Only articles rated as high or medium are retained for the final context.
Agent orchestration for context aggregation and transparent reasoning
Regardless of which decision tool is invoked, all contextual elements are aggregated into a structured prompt for the final generation of a treatment recommendation. The generation call operates at temperature 0.3 to balance output coherence with appropriate response creativity, and produces three outputs: the ‘decision’ stating the treatment recommendation in the format expected by an institutional multidisciplinary tumor board, the ‘reason’ documenting the evidence basis and reasoning pathway and, for molecular mode cases, a human-readable molecular report detailing per-variant classification with criterion-level justification.
This architecture is designed to maintain maximum transparency by documenting all intermediate artifacts. Tool invocations are logged with timestamps and input parameters. Retrieved similar cases are recorded with similarity scores and section-level match details. Together, this audit trail enables post hoc verification of decision pathways and supports the human-in-the-loop review model whereby AI-generated recommendations serve as decision support subject to expert validation rather than autonomous clinical decisions.
CCSS and TBCS for comparing tumor board cases and conference decisions
To standardize comparing medical text-based information within the speciality of hematology, we had expert hematologists develop two assessment scores: (1) the clinical case similarity score (CCSS) and (2) the tumor board concordance score (TBCS).
The CCSS provides a metric for comparing clinical tumor board cases and comprises the following four dimensions: (1) entity match, (2) therapy phase, (3) prior therapy overlap and (4) age (Extended Data Fig. 2a). The points awarded in each dimension yield a total score ranging from 0 to 10: a score <5 indicates low similarity, 5–7 moderate similarity and >7 high similarity.
The TBCS compares the tumor board decisions only and spans the following three dimensions: (1) therapy agents, (2) therapy intention and (3) patient-adapted dosing and risk integration (Fig. 5c). The concordance score ranges from 0 to 10, and a total of 8 points or more indicates concordance.
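The similarity bands of the CCSS can be expressed as a small helper. Only the cut-offs are taken from the text; the function name and the decision to reject totals outside the 0 to 10 range are assumptions.

```python
def ccss_band(total: float) -> str:
    """Translate a CCSS total (0 to 10) into a similarity band."""
    if not 0 <= total <= 10:
        raise ValueError("CCSS total must lie between 0 and 10")
    if total < 5:
        return "low"
    if total <= 7:
        return "moderate"
    return "high"


bands = [ccss_band(score) for score in (4, 5, 7, 8)]
```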
Whenever these scores were applied, at least two physicians performed the evaluation and inter-rater reliability was calculated.
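The reliability statistic is not named in this section. Purely as an example of how agreement between two raters on a binary concordant/discordant label could be quantified, the sketch below computes Cohen's kappa; this choice of statistic is an assumption and may differ from the one used in the study.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_1: Sequence[Hashable],
                 rater_2: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_1) != len(rater_2) or not rater_1:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(rater_1)
    observed = sum(a == b for a, b in zip(rater_1, rater_2)) / n
    counts_1, counts_2 = Counter(rater_1), Counter(rater_2)
    labels = set(counts_1) | set(counts_2)
    expected = sum(counts_1[k] * counts_2[k] for k in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical ratings: 1 = concordant, 0 = discordant.
kappa = cohens_kappa([1, 1, 0, 0], [1, 1, 0, 1])
```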
Silent trial of a prospective real-world testing scenario
We conducted a silent trial at the University Hospital Heidelberg for the total duration of 1 month, during which we obtained and de-identified all tumor board cases that were listed just before the tumor board conference started and processed them in parallel. Following this process, we arrived at a decision by HemaGuide just before the actual conference decision was available. Cases were excluded if they lacked sufficient diagnostic information, were primarily imaging demonstrations without clinical case discussion, or if the tumor board recommendation was just following enrollment into a clinical trial. The silent trial was run with a total case number of 64 (lymphoma 38, myeloma 17 and leukemia 9) and the docx files were fed into the system as they were originally prepared by the physicians for the real conferences.
We compared the results generated by HemaGuide against the ground truth with our TBCS to guarantee consistent outcomes with the external cohort. The concordance comparison was carried out by two independent hematologists and inter-rater reliability was calculated. In cases of disagreement, a third independent reviewer was consulted to reach a final determination.
Large-scale validation with data from another academic center
We used an external cohort of 555 independent and de-identified tumor board cases from Munich University Hospital (LMU), covering 47 distinct hematological entities, to validate HemaGuide. All cases were processed by the agent; the system generated the tumor board decisions and this output was assessed with our TBCS against the ground truth from Munich tumor board documents. The concordance comparison was carried out by two independent hematologists with a third reviewer resolving any discordant assessments. Inter-rater reliability was calculated and indicated for all data points.
Ablation study for the incremental evaluation of architectural components
To quantify the contribution of single system components to decision quality, we conducted a systematic ablation study isolating eleven configuration levels (L0–L10) that progressively add data layers (what the system has access to) and intelligence layers (how the system processes information), ranging from plain LLM baseline as the lower bound (L0; no tools and no additional information) through isolated data layers (L1 guideline flowcharts, L2 clinical case memory, L3 literature retrieval only and L4 molecular tools results only) and progressively enriched combinations (L5–L7: adding enrichments and LLM-intelligence functionality (=intel) to L1–L3, L8: clinical case memory, literature and intel, and L9: molecular and intel) to the full agent with autonomous routing (L10).
We selected a stratified subset of 15 cases of our 45 complex cases to ensure balanced coverage of the complex spectrum across routing paths: 5 cases per mode. Each ablation level was then evaluated on all 15 cases, leading to 165 combinations.
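The size of the evaluation grid follows from the design: eleven levels (L0 to L10) crossed with five cases for each of the three modes. The case labels in this sketch are placeholders.

```python
from itertools import product

levels = [f"L{i}" for i in range(11)]                 # L0 ... L10
modes = ("guideline", "advanced", "molecular")
cases = [f"{mode}-{i}" for mode in modes for i in range(1, 6)]

# Every ablation level is evaluated on every selected case.
grid = list(product(levels, cases))
```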
All 165 HemaGuide decisions were evaluated against the ground truth by two independent hematologists, who were blinded to the ablation level and configuration. Correctness was assessed with a concordance score across the following dimensions: (1) therapy agents (0, 2 or 4 points), (2) therapy intent (0, 1 or 2 points) and (3) dosing risk (0, 2 or 4 points). The individual ratings were combined into a concordance score, with a score of 8 or above indicating concordance between decisions, and inter-rater reliability was computed on the basis of this score.
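A sketch of the scoring arithmetic is given below, assuming that the three dimension ratings are simply summed (their maxima of 4, 2 and 4 add up to the stated 10-point range). The function and parameter names are illustrative.

```python
from typing import Tuple

ALLOWED = {
    "therapy_agents": (0, 2, 4),
    "therapy_intent": (0, 1, 2),
    "dosing_risk": (0, 2, 4),
}


def concordance(therapy_agents: int, therapy_intent: int,
                dosing_risk: int, cutoff: int = 8) -> Tuple[int, bool]:
    """Return the summed score and whether it reaches the concordance cutoff."""
    ratings = {"therapy_agents": therapy_agents,
               "therapy_intent": therapy_intent,
               "dosing_risk": dosing_risk}
    for name, value in ratings.items():
        if value not in ALLOWED[name]:
            raise ValueError(f"{name} must be one of {ALLOWED[name]}")
    total = sum(ratings.values())
    return total, total >= cutoff


full_match = concordance(4, 2, 4)
borderline = concordance(4, 2, 2)
partial = concordance(2, 2, 2)
```

Under this reading, a decision can lose at most two points in total and still count as concordant.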
Human evaluation and benchmarking of HemaGuide versus base models
In the first evaluation step, we performed a blinded, physician-rated comparison of HemaGuide against baseline LLMs. Four resident physicians (postgraduate years 1, 2, 4 and 5, spanning the range of hematology training experience) independently evaluated model outputs across 45 complex malignant hematology cases (15 leukemia, 15 lymphoma and 15 PC dyscrasia case journeys). Case selection was stratified by disease timepoint to ensure coverage of the clinical complexity spectrum encountered in hematological tumor boards: newly diagnosed cases (n = 4–5 per entity), representing standard first-line decision scenarios; relapsed/refractory cases (n = 7–8 per entity), representing the highest-complexity decisions where decision support tools are most clinically relevant; and molecular tumor board cases (n = 3 per entity), designed to evaluate molecular mode performance on variant interpretation tasks. Evaluating physicians were not involved in the design or development of HemaGuide and had no prior exposure to its outputs.
For each model configuration, three independent outputs per case were generated (runs 1–3), yielding 135 model runs and corresponding recommendations per configuration for expert assessment. Reviewers were blinded to model identity and configuration throughout; case assignment order was randomized to minimize order effects. Dimensions Q1–Q5 (tumor board concordance, guideline compliance, patient-context integration, reasoning quality and clinical utility) were assessed on run 1 only to avoid rating fatigue and anchoring bias from repeated exposure to the same case; dimension Q6 (run-to-run consistency) was assessed across all three runs. The evaluation dimensions were: (1) tumor board concordance (alignment with the interdisciplinary tumor board recommendation), (2) guideline compliance (consistency with current national and international guidelines, including Onkopedia, DGHO and ESMO), (3) patient-context integration (correct identification and consideration of prior therapies, disease status and therapy-relevant comorbidities), (4) reasoning quality (transparency and comprehensibility of the rationale), (5) clinical utility (usefulness for decision support in practice) and (6) consistency (stability of the recommendation across three repeated runs given identical input).
We used interdisciplinary tumor board consensus decisions as the reference standard, as these represent the highest routinely documented level of clinical deliberation at our institution. We acknowledge that tumor board decisions are not uniquely correct and that inter-board variability has been documented in literature; the use of single-institution tumor board consensus as ground truth is a pragmatic choice that may introduce systematic bias toward the therapeutic preferences and trial portfolio of our center (‘Discussion’).
All participating resident physicians provided written informed consent before study participation. As the evaluation involved no direct patient contact and utilized only de-identified case material, resident participation was classified as a quality assessment activity and conducted in accordance with institutional guidelines.
Model specifications
The system supports multiple language models to accommodate varying deployment requirements. Cloud-based deployments utilize OpenAI API endpoints (including gpt-5-nano and gpt-5-mini) with JSON mode enforcement for structured extraction outputs. Local and privacy-preserving deployments utilize Ollama-hosted models (including gpt-oss-120b and Qwen3-Next-80B-A3B). A hybrid configuration supports Ollama Cloud endpoints for organizations requiring European data residency without local GPU infrastructure. Embedding generation for the clinical case memory uses embeddinggemma:300m (768 dimensions) for local deployments or text-embedding-3-large (3,072 dimensions) for cloud deployments. Vector storage uses ChromaDB with cosine similarity as a distance metric and persistent storage to enable incremental knowledge base updates.
Quantification and statistical analysis
Data are represented as individual values or as mean ± s.d., unless indicated otherwise. Group sizes (n) and applied statistical tests are indicated in figure legends. Statistical significance and multiple hypothesis testing corrections were assessed as indicated in figure legends. All reported P values are two-tailed, unless indicated otherwise. All analyses were performed using either R v4.5.1 (www.R-project.org) or GraphPad Prism 11.0. For all expert assessment experiments, LLM outputs were blinded to the assessors. Owing to the nature of this study, sample size determination was not applicable, as all available clinical cases from 2024 to 2025 were included in this study.
Data visualization
Data visualization was performed in GraphPad Prism 11.0 or in the R programming environment (v 4.5.1) using the ggplot2 package for heat maps, confusion matrices and dot plots; the fmsb package for radar charts; and the viridis package for color mapping. Figures were produced using Adobe Illustrator 2026.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
