
Sleep polysomnography predicts 130 health conditions, including cardiovascular disease

Brandon Ballinger

A new paper from Stanford trained a multimodal foundation model on polysomnography (PSG) recordings from 65,000 patients. From one night of sleep, SleepFM predicts 130 conditions with a concordance index (C-Index) of at least 0.75, including all-cause mortality (0.84), dementia (0.85), heart attacks (0.81), heart failure (0.80), chronic kidney disease (0.79), stroke (0.78), and atrial fibrillation (0.78). These discrimination levels rival risk scores built from years of clinical history, including the Framingham Risk Score.

The rest of this post walks through what SleepFM actually predicts, how it was trained, why sleep contains so much cardiovascular signal in the first place, and how it compares to some of the work we’ve done at Empirical on physiological foundation models.

SleepFM: a multimodal contrastive model

SleepFM’s architecture is a multimodal contrastive model. PSG recordings are decomposed into four signal modalities (brain activity, EKG, respiratory, EMG), each passed through a 1D CNN (convolutional neural network) encoder. A channel-pooling transformer then aggregates within each modality, followed by a temporal transformer that captures dependencies across a five-minute context window.
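The encoder stack described above can be sketched in a few lines of PyTorch. This is a hypothetical single-modality path; all dimensions, kernel sizes, and layer counts below are illustrative assumptions, not the paper's actual hyperparameters.

```python
import torch
import torch.nn as nn

class ModalityEncoder(nn.Module):
    """1D CNN that turns each raw signal channel into a sequence of embeddings."""
    def __init__(self, d_model=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, d_model, kernel_size=7, stride=4, padding=3),
            nn.GELU(),
            nn.Conv1d(d_model, d_model, kernel_size=7, stride=4, padding=3),
        )

    def forward(self, x):                          # x: (batch, channels, time)
        b, c, t = x.shape
        z = self.conv(x.reshape(b * c, 1, t))      # encode each channel separately
        return z.reshape(b, c, z.shape[1], z.shape[2])  # (b, c, d, t')

class SleepFMSketch(nn.Module):
    """One modality's path: CNN encoder, channel-pooling, temporal transformer."""
    def __init__(self, d_model=128):
        super().__init__()
        self.encoder = ModalityEncoder(d_model)
        self.channel_pool = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=1)
        self.temporal = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=1)

    def forward(self, x):                          # x: (batch, channels, time)
        z = self.encoder(x)                        # (b, c, d, t')
        b, c, d, t = z.shape
        z = z.permute(0, 3, 1, 2).reshape(b * t, c, d)  # channels become tokens
        z = self.channel_pool(z).mean(dim=1)       # pool within the modality
        z = z.reshape(b, t, d)
        return self.temporal(z)                    # per-timestep embeddings
```

Treating channels as tokens for the pooling transformer is what lets a design like this absorb recordings with different channel counts per modality.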

[Figure: SleepFM's pretraining and fine-tuning architecture: 1D CNN encoders per modality, a channel-pooling transformer, a temporal transformer, and a leave-one-out contrastive loss across BAS, EKG, respiratory, and EMG signals. Adapted from Figure 1 of the paper.]

Pretraining objective: leave-one-out contrastive learning

The novel piece is the pretraining objective itself: leave-one-out contrastive learning (LOO-CL). Rather than predicting the next token, as an LLM does, LOO-CL aligns each modality's embedding with the average of all the other modalities' embeddings at the same timestep. This makes the model robust to missing channels, varying channel counts, and the inevitable heterogeneity of PSG signals across cohorts.

Here’s a comparison of the loss functions of an LLM, SleepFM, and JETS (a JEPA-style model):

| Model | Pretraining objective | Loss |
|---|---|---|
| LLM | Next-token prediction over a discrete vocabulary | $\mathcal{L}_{\text{LLM}} = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})$ |
| SleepFM | Leave-one-out cross-modal contrast | $\mathcal{L}_{\text{LOO-CL}}^{(m)} = -\sum_{n} \log \frac{\exp\!\left(\text{sim}(e_m^{(n)}, \bar{e}_{-m}^{(n)}) / \tau\right)}{\sum_{n'} \exp\!\left(\text{sim}(e_m^{(n)}, \bar{e}_{-m}^{(n')}) / \tau\right)}$ |
| JETS | Latent prediction of masked patches | $\mathcal{L}_{\text{JEPA}} = \sum_{i} \left\lVert P\!\left(E_\theta(x_{\text{ctx}}), z_i\right) - \text{sg}\!\left[E_\phi(x_{\text{tgt},i})\right] \right\rVert_1$ |

Notation: $p_\theta$ is the LLM's softmax over a 30k-200k token vocabulary; $e_m^{(n)}$ is sample $n$'s embedding for modality $m$, and $\bar{e}_{-m}^{(n)} = \frac{1}{M-1} \sum_{m' \neq m} e_{m'}^{(n)}$ is the mean embedding of its sibling modalities; for JEPA, $E_\theta$ is the trainable context encoder, $E_\phi$ is its EMA copy, $P$ is a small predictor head, $z_i$ is positional metadata for masked patch $i$, and $\text{sg}$ is stop-gradient.
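The LOO-CL objective is compact enough to sketch directly. Below is a minimal NumPy version under the definitions above; the batch size, dimensionality, and temperature are illustrative assumptions.

```python
import numpy as np

def loo_contrastive_loss(embs, tau=0.1):
    """Leave-one-out contrastive loss over M modalities.

    embs: (M, N, D) array of M modalities, N samples in the batch, D dims.
    For each modality m, sample n's positive target is the mean embedding
    of the other M-1 modalities for the same sample; the negatives are the
    leave-one-out means for the other samples in the batch.
    """
    M, N, D = embs.shape
    embs = embs / np.linalg.norm(embs, axis=-1, keepdims=True)   # cosine sim
    total = 0.0
    for m in range(M):
        target = np.delete(embs, m, axis=0).mean(axis=0)          # (N, D)
        target = target / np.linalg.norm(target, axis=-1, keepdims=True)
        logits = embs[m] @ target.T / tau                         # (N, N)
        logits -= logits.max(axis=1, keepdims=True)               # stability
        log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        total += -np.mean(np.diag(log_probs))     # positives sit on the diagonal
    return total / M
```

When the modalities agree (each sample's embeddings align across sensors), the loss approaches zero; random embeddings sit near $\log N$.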

Three contrasts map out the design space.

Discrete vs. continuous targets. LLM cross-entropy works because language tokens are discrete and densely predictable. The next word has low entropy given context, and a softmax over a finite vocabulary gives a clean probability target. Continuous-valued physiological signals are the opposite case: most of the per-sample bits are unpredictable noise, and a reconstruction loss would burn capacity on microvolt jitter that has nothing to do with downstream tasks. Both SleepFM and JETS sidestep that by predicting in latent space, not input space.

Cross-modal vs. cross-time supervision. SleepFM’s positive pairs come from different sensors recording the same patient at the same instant. JETS’s positive pairs come from different time windows of the same signal. SleepFM’s setup is essentially free supervision when you have a sleep lab recording four channels simultaneously; JETS’s setup is what works when you have one or two channels but lots of continuous time. They’re solving the same problem (representations without labels) under different data shapes.

Collapse prevention. Both contrastive and predictive losses can collapse to trivial solutions where every input maps to a constant. SleepFM blocks collapse with negatives from other patients in the batch (the InfoNCE denominator). JETS blocks it with the EMA target encoder plus stop-gradient. LLMs don’t have a collapse problem at all, because the cross-entropy target is a one-hot label that can’t be gamed by a degenerate encoder.
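The two collapse guards above fit in a few lines. This is a toy illustration; the decay constant and similarity values are illustrative, not from either paper.

```python
import numpy as np

def ema_update(target_w, online_w, decay=0.996):
    """JEPA-style target encoder: trails the online weights as an EMA.

    Gradients never flow into the target (the stop-gradient), so the
    online encoder can't drag both networks toward a constant output.
    """
    return decay * target_w + (1 - decay) * online_w

def info_nce(pos_sim, neg_sims, tau=0.1):
    """InfoNCE loss for one anchor; negatives come from other patients.

    If the encoder collapsed so that every similarity were equal, the
    loss would saturate at log(K + 1) for K negatives instead of
    reaching zero, which is exactly the penalty that blocks collapse.
    """
    logits = np.concatenate(([pos_sim], neg_sims)) / tau
    logits -= logits.max()                      # numerical stability
    return -(logits[0] - np.log(np.exp(logits).sum()))
```

A fully collapsed encoder (all similarities equal) pays $\log(K+1)$ per anchor; separating the positive from the negatives drives the loss toward zero.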

The practical takeaway is that pretraining objectives aren’t interchangeable. The shape of your data (number of modalities, sampling regularity, channels per sample) does most of the work picking the right loss. SleepFM picked LOO-CL because PSG gives them four time-aligned modalities per patient. JETS picked JEPA because consumer wearables don’t.

Fine-tuning and training data

Once pretrained, the model is adapted to downstream tasks (sleep staging, apnea classification, disease prediction) by freezing the embeddings and training a small LSTM head on top. For disease prediction specifically, the team paired Stanford Sleep Clinic recordings with ICD-coded electronic health records, mapped the codes to PheWAS phecodes, and trained a Cox proportional hazards head on each one. The 585,000 hours of pretraining data is roughly 5 to 25 times more than previous supervised sleep models used, drawn from five cohorts (Stanford Sleep Clinic, BioSerenity, MESA, MROS, and SHHS) so that the model isn't overfitting to any one center's hardware.
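A Cox head like the ones described above trains on the negative partial log-likelihood. Here is a minimal NumPy sketch, assuming risk scores come from a small head on frozen embeddings; ties are ignored (no Breslow/Efron correction), so this is illustrative only.

```python
import numpy as np

def cox_neg_log_partial_likelihood(risk_scores, times, events):
    """Negative Cox partial log-likelihood, averaged over observed events.

    risk_scores: (N,) linear predictor; higher means higher hazard
    times:       (N,) follow-up time until diagnosis or censoring
    events:      (N,) 1 if the phecode was diagnosed, 0 if censored
    """
    order = np.argsort(-times)                 # sort by descending follow-up
    scores, ev = risk_scores[order], events[order]
    # The risk set at subject i's event time is everyone still under
    # observation then, i.e. indices 0..i in descending-time order.
    log_risk_set = np.logaddexp.accumulate(scores)   # running log-sum-exp
    return -np.sum((scores - log_risk_set) * ev) / max(ev.sum(), 1)
```

The loss rewards the head for assigning higher risk to subjects whose events arrive earlier, which is the ranking behavior the C-Index later measures.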

How accurately can a single night of sleep predict cardiovascular events?

[Figure: Six-year predictive accuracy on the SHHS held-out test set (n=2,000) for cardiovascular outcomes, by C-Index (top) and AUROC (bottom). Adapted from Figure 3 of the paper.]

SleepFM evaluates on a six-year prediction horizon, scored with Harrell’s C-Index (the survival-analysis equivalent of AUROC). On the Stanford held-out set, all-cause mortality landed at 0.84, myocardial infarction at 0.81, heart failure at 0.80, stroke at 0.78, and atrial fibrillation at 0.78. Cardiovascular-specific death reached 0.86 on the Sleep Heart Health Study transfer set, congestive heart failure 0.83, coronary heart disease death 0.86, and hypertensive heart disease AUROC 0.88.

For context, the standard ASCVD risk calculator used by virtually every American cardiologist achieves a C-Index in the low to mid 0.7s in modern external validations. SleepFM is hitting that range, and in places exceeding it, from a single night of physiology with no labs, no blood pressure history, and no patient interview.

The model also predicts dementia at 0.85 and chronic kidney disease at 0.79. Neither is strictly cardiovascular, but both share so much pathophysiology with heart disease that they’re worth flagging. Across the 1,041 ICD-derived disease phecodes the team evaluated, 130 reached C-Index ≥0.75 after Bonferroni correction.
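For readers unfamiliar with the metric behind these numbers, Harrell's C-Index is simple to compute. This is a minimal O(n²) sketch with no handling of tied event times, which real implementations (e.g. lifelines) correct for.

```python
def harrell_c_index(risk, times, events):
    """Fraction of comparable pairs the model ranks correctly.

    A pair (i, j) is comparable when subject i has an observed event and
    subject j is still event-free at that time; the pair is concordant
    when i also got the higher risk score. 0.5 is chance, 1.0 is perfect.
    """
    concordant, comparable = 0.0, 0
    for i in range(len(times)):
        if not events[i]:
            continue                             # censored: never the early event
        for j in range(len(times)):
            if times[j] > times[i]:              # j outlived i's event time
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5            # tied scores count as half
    return concordant / comparable
```

Unlike plain AUROC, the pairwise construction lets censored subjects contribute as the "survived longer" side of a pair without ever needing a binary label.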

Why does sleep contain so much cardiovascular signal?

Sleep is the one window in a 24-hour cycle when your autonomic nervous system runs in a relatively controlled, low-noise regime. A few things happen simultaneously.

The respiratory signals capture sleep-disordered breathing (apneas and hypopneas) that are already independent risk factors for stroke, heart failure, and atrial fibrillation. The EKG channel captures arrhythmia burden, heart rate variability, and the autonomic tone shifts between sleep stages. The brain-activity signals capture arousal burden and slow-wave activity, both associated with metabolic and cardiovascular outcomes. EMG captures movement and microarousal patterns.

In other words, an overnight PSG is a multi-system stress test that runs without you doing anything. SleepFM’s contribution is that a foundation model can integrate all of those channels into a single representation that carries more cardiovascular signal than any component on its own. The paper’s ablations show that combining modalities consistently beats any single-modality variant, which fits the multi-system stress test intuition.

SleepFM vs. JETS: clinical sleep labs vs. everyday wearables

We trained our own foundation model, JETS, on a complementary problem. SleepFM and JETS sit at opposite ends of the data spectrum:

| | SleepFM | JETS |
|---|---|---|
| Source | Clinical PSG (sleep lab) | Consumer wearables (Apple Watch, Fitbit, etc.) |
| Modalities | EEG, EOG, EKG, respiratory, EMG | 63 channels including HR, HRV, sleep stages, SpO2 |
| Pretraining scale | 65,000 patients, 585,000 hours | ~3 million person-days |
| Pretraining objective | Multimodal contrastive (LOO-CL) | Joint Embedding Predictive Architecture (JEPA) |
| Setting | One night, dense multi-channel | Continuous, sparse, irregular |
| Headline result | Six-year cardiovascular death C-Index 0.86 | Hypertension AUROC 0.87 |

These two models use opposite input data (PSG gets you a high-resolution snapshot of one night; wearables get you a low-resolution recording of every night for years) but converge methodologically. Both papers reach the same conclusion about pretraining: frozen foundation embeddings plus a tiny downstream head consistently beat end-to-end supervised baselines, even when the supervised model has more parameters and more demographic features. SleepFM's ablations show their pretrained embeddings beat an end-to-end PSG model with identical architecture and parameter count by 5-17% AUROC across categories.

SleepFM is a strong signal that one night of physiology carries enough cardiovascular information to rival decades of risk-score research, and a tailwind for the work we’re doing on wearable foundation models. Read the full paper.
