Study population
The UKB is a large prospective cohort study comprising over 500,000 participants aged 40–69 years at recruitment (2006–2010) from across the UK. Extensive phenotyping was performed, including questionnaires, physical assessments, imaging, whole-genome sequencing and linkage to longitudinal electronic health records30. As part of the UK Biobank Plasma Proteomics Project (UKB-PPP)10, approximately 54,000 participants underwent proteomic profiling in plasma samples, an initiative that was funded by a consortium of 13 biopharmaceutical companies. Proteomic profiling procedures, including sample selection and handling, have been described in detail elsewhere10. In the present study, we analyzed data from female participants who had completed a questionnaire on reproductive factors. We excluded individuals with missing data on key covariates (age, sex, BMI, smoking status and self-reported ancestry) or those who failed proteomic quality control, as detailed below. The UKB received ethical approval from the North West Multi-Centre Research Ethics Committee as a research tissue bank (reference 11/NW/0382), and all participants provided written informed consent. All analyses were conducted under UKB application no. 43247.
Phenotype derivation of menstrual cycle day
We used questionnaire data to derive menstrual cycle day. To calculate menstrual cycle day, we used responses to two questions: “How many days is your usual menstrual cycle?” (data field 3710), referring to the typical number of days between menstrual periods, and “How many days since your last menstrual period?” (data field 3700), defined as the number of days since the first day of the most recent period. These questions were only answered by women who indicated they were still menstruating and had not yet reached menopause, based on their response to the question “Have you had your menopause (periods stopped)?” (data field 2724). Cycle day was calculated as the time since the last menstrual period minus the reported cycle length, plus one, such that day 1 corresponds to the first day of menstruation. We restricted this analysis to women under the age of 55 years who reported a regular menstrual cycle length between 21 days and 35 days and were not currently using oral contraceptives (data field 2804).
Proteomic profiling
Untargeted proteomic profiling was performed using the antibody-based Olink Explore 3072 proximity extension assay, which measures 2,941 protein analytes and 2,923 unique proteins10. Detailed assay procedures have been described previously10. Participants with more than 20% missing protein values were excluded, as were proteins missing in more than 10% of participants. Remaining missing values were imputed using the k-nearest neighbor method (k = 10) implemented in the impute R package (version 1.72.3). Protein levels were rank-based inverse normalized and scaled to a mean of 0 and s.d. of 1. We further excluded outlier individuals with a first or second principal component score more than 5 s.d. from the mean, or with a median normalized protein expression (NPX) value more than 5 s.d. from the overall cohort median, as has been done previously31. We assessed the association between protein levels and menstrual cycle day using linear regression, adjusted for age, BMI, smoking status, self-reported ancestry, time between blood draw and proteomic measurement, and batch. An FDR <0.05 was used to denote statistical significance. To compare the proportion of variance explained by menstrual cycle, age and BMI, we fit three separate models per protein, each using a natural cubic spline for one predictor (cycle day, age or BMI) to capture potential nonlinear relationships. For each predictor, spline knots were fixed globally at the 10th, 50th and 90th percentiles, with boundary knots at the observed minimum and maximum, as biologically defined inflection points are not known for most analytes32. For every protein–predictor pair, we reported the unadjusted R2 as the fraction of variance explained by that predictor alone. We then summarized the distributions of R2 across proteins (medians and the proportion with R2 > 0.05) to benchmark the proteome-wide influence of cycle timing against age and BMI.
Clustering analysis
To characterize temporal patterns of circulating proteins across the menstrual cycle, we focused on proteins significantly associated with cycle day (FDR <0.05). Each participant’s derived cycle day was rescaled linearly to a 28-day cycle. For each protein, mean expression was calculated across normalized cycle days, z-score normalized and clustered on the basis of temporal expression profiles using k-means clustering, with the optimal number of clusters determined using the gap statistic. Distinct expression trajectories were visualized with curves smoothed using locally estimated scatterplot smoothing (LOESS), and representative proteins with peak expression in specific phases were identified for each cluster. For interpretative purposes, menstrual phases were overlaid according to established endometrial physiology: menses/regenerative (days 1–6), follicular/proliferative (days 7–13), periovulatory (days 14–15), early luteal/secretory (days 16–19), mid-luteal/secretory (days 20–23), and late luteal/secretory (days 24–28), to illustrate how the data-driven clusters align with known physiological transitions.
Pathway and tissue enrichment analysis
We performed Gene Ontology (GO) enrichment analysis separately for each of the four clusters, as well as for the full set of cycle-associated proteins and for clusters, using the clusterProfiler R package (version 4.17.0)33. Gene symbols were converted to Entrez Gene IDs, and enrichment was conducted across GO ontologies. Significance was determined using an FDR < 0.05. Tissue enrichment of associated proteins was conducted using the TissueEnrich R package (version 1.6.0)34. Genes encoding proteins included on the Olink panel were used as the background set. For tissue-specific enrichment, we used RNA expression data from the Human Protein Atlas35, considering all genes reported as expressed in each tissue. Statistical significance was defined as an FDR < 0.05.
Cell type enrichment analysis
For the proteins significantly associated with the menstrual cycle, we prioritized potentially causal cell types using previously published scRNA-seq data on 71,032 single cells from ten healthy human endometria and their original cell clusters, generated with the 10× Chromium platform by Wang et al.36. In that study, both a lower-throughput Fluidigm C1 dataset (19 donors) and a higher-throughput 10× Chromium dataset (ten donors) were produced. We specifically used the 10× dataset because its substantially larger number of cells provides greater power to detect enrichment in less abundant endometrial cell populations. Quality control was performed in multiple steps with Seurat (version 4.4.0)37. First, genes expressed in fewer than three cells and cells with fewer than 200 transcripts were removed. We then excluded cells with ≤200 or ≥8,000 detected genes and ≥20% mitochondrial reads. The gene expression matrix was normalized and scaled with the NormalizeData() and ScaleData() functions. The top 1,000 variable genes were selected by FindVariableFeatures() for PCA, and batch effects were corrected with Harmony (version 1.0) using the first 20 principal components. Cell identities were taken from the original publication. The two unciliated epithelial subclusters were merged into a single ‘Unciliated epithelia’ category for all downstream analyses. Within each cell type, we investigated differential expression of significantly associated proteins using the FindMarkers() option with the following parameters: only.pos = TRUE, min.pct = 0.1, logfc.threshold = 0.25, test.use = “Wilcoxon” and assay = “RNA”.
Phenome-wide association study
To investigate whether proteins that vary physiologically across the menstrual cycle are also related to reproductive disorders, we conducted a phenome-wide association study (PheWAS) in the full UKB cohort with available Olink proteomic measurements (N = 23,674 women). This allowed us to first identify proteins that vary under normal physiological conditions and subsequently assess whether deviations from these physiological patterns are linked to reproductive pathology. We examined 42 female reproductive health diagnoses11, defined using the International Classification of Diseases, Tenth Revision codes (detailed in Supplementary Table 7), and included only outcomes with at least 20 cases. Associations between protein levels and both prevalent (that is, diagnoses before enrollment) and incident (that is, diagnoses occurring after enrollment) diseases were tested using logistic regression models, adjusted for age, BMI, smoking status, self-reported ancestry, time between blood draw and proteomic measurement, and batch. Statistical significance was defined as an FDR <0.05.
MR
We performed two-sample MR analyses using circulating protein levels as exposures and female reproductive health diagnoses as outcomes. cis-pQTLs from the UKB-PPP10 were used as instrumental variables, defined as genetic variants located within 1 Mb of the transcription start site of the corresponding protein-coding gene. For each protein, we selected genome-wide significant variants (P < 5 × 10−8) and performed LD clumping at r2 < 0.1 to maximize variance explained while retaining largely independent cis signals, consistent with prior proteome-wide MR studies. Variants were clumped using PLINK (version 1.9)38 and an LD reference panel derived from 10,000 randomly selected European-ancestry participants from the UKB. GWAS summary statistics for outcomes were obtained from a recent genome-wide meta-analysis of female reproductive traits11. Instrumental variants absent from the outcome GWAS were replaced with proxy variants in high LD (r2 ≥ 0.8), identified using PLINK and 1000 Genomes Project EUR panel39. For each missing variant, the proxy with the highest r2 within a ±1-Mb window was selected. To minimize weak instrument bias, we restricted analyses to instruments with F-statistics >10 (ref. 40). Causal effects were estimated using the random-effects inverse-variance weighted (IVW) method or the Wald ratio when only one single-nucleotide polymorphism was available, as implemented in the TwoSampleMR R package (version 0.6.17)41. Statistical significance was defined as an FDR <0.05 for IVW or Wald ratio estimates. We conducted multiple sensitivity analyses to assess robustness. First, we performed weighted median and MR-Egger analyses and retained only associations directionally concordant with the primary analysis. Second, we assessed heterogeneity using Cochran’s Q statistic and excluded associations showing evidence of heterogeneity (P < 0.05). Third, directional horizontal pleiotropy was evaluated using the MR-Egger intercept test for all IVW-based analyses, and associations with evidence of pleiotropy (intercept P < 0.05) were excluded. Fourth, associations passing these filters were reanalyzed using a more stringent LD threshold (r2 < 0.01). Fifth, to further support shared causal variants and mitigate confounding due to LD, we performed colocalization analyses using the coloc R package. We tested whether cis-pQTLs and female reproductive trait associations shared the same causal variant within a ±500-kb region using default priors. A posterior probability for colocalization (PP4) >0.8 was considered strong evidence of a shared causal signal. Finally, reverse MR analyses were conducted to evaluate potential reverse causality by testing whether female reproductive health diagnoses influenced circulating protein levels, using the same instruments, filtering criteria and analytical framework.
Proteomic score
To derive a menstrual cycle protein score, we randomly split the main cohort (2,760 women) into a training set (70%) and an independent test set (30%). For both splits, missing protein values were separately imputed using the k-nearest neighbor method (k = 10) and then rank-based inverse normalized, to avoid information leakage. In the training set, we fit a linear regression model, including 2,917 proteins that passed quality control, and penalization by the least absolute shrinkage and selection operator (LASSO) implemented in the glmnet R package (verision 4.1.8). Tenfold cross-validation was used to select the optimal regularization parameter (λ) on the basis of the mean squared error. Covariates included age, BMI, smoking status, ethnicity and batch. The resulting protein score was calculated in the test set as a weighted sum of protein values using non-zero coefficients from the LASSO model. To benchmark the predictive ability of the proteomic score against established hormonal biomarkers, we extracted serum estradiol concentrations (data field 30800). We then fit linear models predicting cycle day using (1) estradiol alone, (2) the proteomic score alone and (3) both predictors jointly. We modeled both exposures using natural cubic splines with knots placed at the 10th, 50th and 90th percentiles of its distribution. Model performance was compared using R2 in the held-out test set.
Statistics and reproducibility
No statistical method was used to predetermine sample size. Sample size was determined by the number of UKB participants with available plasma proteomic measurements and complete information on menstrual cycle characteristics and covariates. Data were excluded only according to predefined quality control criteria described in the Methods. In brief, participants with missing key covariates or failing proteomic quality control were removed, as were individuals with excessive missing protein measurements. Proteins with high missingness were excluded, and remaining missing values were imputed using the k-nearest neighbor method. The experiments were not randomized. The investigators were not blinded to allocation during experiments and outcome assessment. Proteomic measurements were generated as part of the UKB-PPP using standardized protocols and internal quality control procedures independent of the present analyses. All statistical analyses were performed using R. Associations between circulating proteins and menstrual cycle day were assessed using linear regression models with adjustment for relevant covariates. Multiple testing was controlled using the Benjamini–Hochberg FDR procedure, with FDR <0.05 considered statistically significant. Additional analyses, including clustering, enrichment analyses, phenome-wide association analyses, MR and derivation of the proteomic score using LASSO regression with cross-validation, are described in detail above.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
