PPG-Based Sleep Stage Classification: Algorithms, Accuracy, and Clinical Applications

Technical review of sleep staging from PPG signals using HRV features, movement detection, and deep learning models compared to polysomnography gold standard.

ChatPPG Research Team

Sleep staging from PPG signals leverages the autonomic nervous system's characteristic fingerprint on cardiac activity during different sleep stages, enabling wearable devices to approximate the information traditionally available only from polysomnography. While PPG-based sleep staging cannot match the precision of EEG-based polysomnography (PSG), modern algorithms achieve clinically useful accuracy for population-level sleep assessment, longitudinal tracking, and screening for sleep disorders.

Sleep architecture -- the cyclical progression through wake, light sleep (N1/N2), deep sleep (N3), and REM sleep -- profoundly affects cardiovascular health, metabolic function, cognitive performance, and mental health. The American Academy of Sleep Medicine (AASM) guidelines define sleep staging based on electroencephalography (EEG), electromyography (EMG), and electrooculography (EOG) recorded during polysomnography. PPG-based approaches bypass these expensive, laboratory-bound sensors by exploiting the systematic changes in autonomic cardiac control that accompany each sleep stage. For background on how PPG sensors capture cardiac signals, see our guide to PPG technology.

Autonomic Nervous System Changes During Sleep

The physiological basis for PPG-based sleep staging rests on the well-established coupling between sleep stages and autonomic nervous system (ANS) activity. Each sleep stage has a characteristic sympathovagal balance that modulates heart rate, heart rate variability, peripheral vascular tone, and respiratory patterns -- all of which are accessible from the PPG waveform.

NREM Sleep (N1, N2, N3)

During the transition from wakefulness to NREM sleep, parasympathetic (vagal) tone progressively increases while sympathetic activity decreases. This produces several measurable cardiac effects:

  • Heart rate decreases progressively from wake through N1, N2, and N3, reaching its lowest values during deep sleep. Mean heart rate during N3 is typically 5-15 BPM lower than during wakefulness (Boudreau et al., 2013; DOI: 10.1016/j.smrv.2012.02.002).
  • HRV increases during NREM sleep, particularly high-frequency (HF) power (0.15-0.4 Hz), which reflects respiratory sinus arrhythmia and vagal tone. HF power during N3 can be 2-4 times higher than during wakefulness.
  • LF/HF ratio decreases during NREM sleep, reflecting the shift toward parasympathetic dominance.
  • Blood pressure decreases by 10-20% during NREM sleep (nocturnal dipping), reducing PPG pulse amplitude variability.

Deep sleep (N3) shows the most pronounced autonomic signature: markedly slow, regular heart rate with high vagal tone and very low sympathetic activity. This makes N3 the most reliably detected NREM stage from PPG.

REM Sleep

REM sleep is characterized by autonomic instability with phasic surges of sympathetic activity superimposed on a background of parasympathetic tone. This produces:

  • Increased heart rate variability, with irregular surges in heart rate during phasic REM bursts.
  • Increased LF power and LF/HF ratio compared to NREM sleep, though not as high as wakefulness.
  • Respiratory irregularity, with variable breathing rate and amplitude that modulates the PPG signal.
  • Peripheral vasoconstriction during phasic REM, which can reduce PPG signal amplitude.

The irregular, burst-like pattern of REM autonomic activity is distinct from both the stable vagal dominance of NREM and the sustained sympathetic activation of wakefulness, providing a basis for PPG-based REM detection.

Feature Extraction from PPG for Sleep Staging

Sleep staging algorithms extract features from PPG signals across multiple domains. The PPG signal is first processed to extract inter-beat intervals (IBIs), from which HRV features are computed, along with respiratory and morphological features.
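
As a minimal sketch of the first step, the following code turns a PPG trace into inter-beat intervals using a simple local-maximum peak rule. The function names (`detect_peaks`, `inter_beat_intervals`, `reject_implausible`), the 0.3-second refractory period, and the 0.3-2.0 second plausibility limits are illustrative assumptions, not values taken from any of the cited studies. Practical pipelines typically add band-pass filtering and more robust beat detection before this stage.

```python
import math

def detect_peaks(ppg, fs, min_interval_s=0.3):
    """Return sample indices of pulse peaks using a simple local-maximum rule.

    Assumption: the signal is already filtered and roughly zero-centred.
    """
    mean = sum(ppg) / len(ppg)
    refractory = int(min_interval_s * fs)  # minimum spacing between beats
    peaks = []
    for i in range(1, len(ppg) - 1):
        is_local_max = ppg[i - 1] < ppg[i] >= ppg[i + 1]
        if is_local_max and ppg[i] > mean:
            if not peaks or i - peaks[-1] >= refractory:
                peaks.append(i)
    return peaks

def inter_beat_intervals(peaks, fs):
    """Convert peak indices to inter-beat intervals in seconds."""
    return [(b - a) / fs for a, b in zip(peaks, peaks[1:])]

def reject_implausible(ibis, lo=0.3, hi=2.0):
    """Drop intervals outside a plausible range (30-200 BPM, assumed limits)."""
    return [x for x in ibis if lo <= x <= hi]

# Synthetic example: a 1.2 Hz sine (72 BPM) sampled at 50 Hz for 10 seconds.
fs = 50
ppg = [math.sin(2 * math.pi * 1.2 * i / fs) for i in range(500)]
ibis = reject_implausible(inter_beat_intervals(detect_peaks(ppg, fs), fs))
```

On this synthetic input the intervals cluster around 0.83 seconds, limited by the 20 ms sampling resolution; real recordings usually interpolate peak positions to reduce that quantization.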

Heart Rate Variability Features

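HRV analysis of PPG-derived IBIs is the primary feature source. Features are typically assigned to 30-second epochs to match PSG scoring convention, and are often computed over a longer window (for example, 5 minutes) centered on each epoch so that variability and frequency-domain estimates are stable: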

Time-domain features:

  • Mean heart rate (HR) and its epoch-to-epoch variability
  • SDNN (standard deviation of NN intervals): overall HRV reflecting both sympathetic and parasympathetic activity
  • RMSSD (root mean square of successive differences): primarily reflects parasympathetic activity
  • pNN50 (percentage of successive intervals differing by >50 ms): another parasympathetic marker
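
The time-domain features listed above follow directly from their definitions. The sketch below assumes a list of artifact-free inter-beat intervals in seconds for one analysis window; the function name `time_domain_hrv` and the choice to report SDNN and RMSSD in milliseconds are illustrative.

```python
import math

def time_domain_hrv(ibis_s):
    """Compute basic time-domain HRV features from IBIs given in seconds."""
    n = len(ibis_s)
    if n < 3:
        raise ValueError("need at least three intervals")
    mean_ibi = sum(ibis_s) / n
    # Sample standard deviation of all intervals (SDNN).
    sdnn = math.sqrt(sum((x - mean_ibi) ** 2 for x in ibis_s) / (n - 1))
    # Successive differences drive RMSSD and pNN50.
    diffs = [b - a for a, b in zip(ibis_s, ibis_s[1:])]
    rmssd = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    pnn50 = 100.0 * sum(1 for d in diffs if abs(d) > 0.050) / len(diffs)
    return {
        "mean_hr_bpm": 60.0 / mean_ibi,  # HR from the mean interval
        "sdnn_ms": 1000.0 * sdnn,
        "rmssd_ms": 1000.0 * rmssd,
        "pnn50_pct": pnn50,
    }

features = time_domain_hrv([0.8, 0.9, 0.8, 0.9, 0.8])
```

For this toy input every successive difference is 100 ms, so RMSSD is 100 ms and pNN50 is 100%.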

Frequency-domain features:

  • Very low frequency (VLF) power (0.003-0.04 Hz): reflects thermoregulatory and hormonal fluctuations
  • Low frequency (LF) power (0.04-0.15 Hz): reflects both sympathetic and parasympathetic modulation
  • High frequency (HF) power (0.15-0.4 Hz): primarily reflects respiratory sinus arrhythmia and vagal tone
  • LF/HF ratio: traditionally interpreted as sympathovagal balance index
  • Total spectral power: overall autonomic modulation magnitude
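
One way to estimate these band powers, sketched here under simplifying assumptions, is to resample the IBI series onto a uniform grid and integrate a plain periodogram over each band. The 4 Hz resampling rate, linear interpolation, and unwindowed periodogram are assumptions made for brevity; Welch averaging or a Lomb-Scargle periodogram are common alternatives. Note that resolving the VLF and LF bands requires a window of several minutes, not a single 30-second epoch.

```python
import math

BANDS = {"vlf": (0.003, 0.04), "lf": (0.04, 0.15), "hf": (0.15, 0.4)}

def resample_ibis(ibis_s, fs=4.0):
    """Linearly interpolate the IBI series onto a uniform time grid."""
    if len(ibis_s) < 3:
        raise ValueError("need at least three intervals")
    beat_times, t = [], 0.0
    for ibi in ibis_s:
        t += ibi
        beat_times.append(t)
    n = int((beat_times[-1] - beat_times[0]) * fs)
    out, j = [], 0
    for i in range(n):
        ti = beat_times[0] + i / fs
        while beat_times[j + 1] < ti:
            j += 1
        frac = (ti - beat_times[j]) / (beat_times[j + 1] - beat_times[j])
        out.append(ibis_s[j] + frac * (ibis_s[j + 1] - ibis_s[j]))
    return out

def band_powers(ibis_s, fs=4.0):
    """Integrate a one-sided periodogram over the VLF, LF, and HF bands."""
    x = resample_ibis(ibis_s, fs)
    n = len(x)
    mean = sum(x) / n
    x = [v - mean for v in x]  # remove the DC component
    df = fs / n                # frequency resolution in Hz
    powers = {name: 0.0 for name in BANDS}
    k = 1
    while k * df <= 0.4 and k < n // 2:
        f = k * df
        re = sum(v * math.cos(2 * math.pi * k * i / n) for i, v in enumerate(x))
        im = sum(v * math.sin(2 * math.pi * k * i / n) for i, v in enumerate(x))
        psd = 2.0 * (re * re + im * im) / (fs * n)  # s^2/Hz
        for name, (lo, hi) in BANDS.items():
            if lo <= f < hi:
                powers[name] += psd * df
        k += 1
    powers["total"] = powers["vlf"] + powers["lf"] + powers["hf"]
    powers["lf_hf"] = powers["lf"] / powers["hf"] if powers["hf"] > 0 else float("inf")
    return powers
```

Feeding in an IBI series modulated at 0.25 Hz (a breathing rate of 15 breaths/min) concentrates power in the HF band and yields a low LF/HF ratio, which is the pattern described above for NREM sleep.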

Non-linear features:

  • Sample entropy (SampEn): measures signal complexity; lower during regular NREM rhythms, higher during variable REM and wake states
  • Detrended fluctuation analysis (DFA) alpha1: short-term fractal scaling exponent; differs between sleep stages due to different correlation structures
  • Poincare plot descriptors (SD1, SD2): SD1 reflects short-term (beat-to-beat) variability, SD2 reflects longer-term variability
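
Two of these non-linear features can be sketched compactly. The code below computes the Poincare descriptors from their geometric definition and sample entropy with the commonly used settings m = 2 and r = 0.2 times the standard deviation; those parameter defaults and the function names are assumptions, and DFA is omitted for brevity.

```python
import math

def _pstdev(values):
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))

def poincare_sd(ibis):
    """SD1 and SD2 of the Poincare plot (IBI[n] against IBI[n+1])."""
    pairs = list(zip(ibis, ibis[1:]))
    # Rotate by 45 degrees: SD1 is spread across the identity line,
    # SD2 is spread along it.
    sd1 = _pstdev([(b - a) / math.sqrt(2) for a, b in pairs])
    sd2 = _pstdev([(b + a) / math.sqrt(2) for a, b in pairs])
    return sd1, sd2

def sample_entropy(x, m=2, r_frac=0.2):
    """Sample entropy with tolerance r = r_frac * SD, excluding self-matches."""
    n = len(x)
    r = r_frac * _pstdev(x)

    def count_matches(length):
        count = 0
        for i in range(n - m):
            for j in range(i + 1, n - m):
                if all(abs(x[i + k] - x[j + k]) <= r for k in range(length)):
                    count += 1
        return count

    b = count_matches(m)      # template matches of length m
    a = count_matches(m + 1)  # matches that persist to length m + 1
    if a == 0 or b == 0:
        return float("inf")   # undefined for this window; too few matches
    return -math.log(a / b)
```

A perfectly alternating series such as `[0.8, 0.9] * 50` gives a sample entropy of zero because every match of length 2 extends to length 3, while an irregular series gives a positive value.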

PPG-Derived Respiratory Features

Respiration modulates the PPG signal through three mechanisms: respiratory-induced intensity variation (RIIV), respiratory-induced amplitude variation (RIAV), and respiratory-induced frequency variation (RIFV, equivalent to respiratory sinus arrhythmia). These can be extracted to estimate respiratory rate and regularity:

  • Respiratory rate: Differs between stages; typically 12-20 breaths/min during wakefulness, 10-16 during NREM, and variable (8-24) during REM.
  • Respiratory regularity: The coefficient of variation of breath-to-breath intervals; low during stable NREM, high during REM.
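
Given breath onset times from any of the three modulation signals, both features reduce to simple statistics on breath-to-breath intervals. This sketch assumes the breath times have already been detected; `respiratory_features` is a hypothetical helper name.

```python
import statistics

def respiratory_features(breath_times_s):
    """Respiratory rate (breaths/min) and regularity (CV of breath intervals)."""
    intervals = [b - a for a, b in zip(breath_times_s, breath_times_s[1:])]
    if len(intervals) < 2:
        raise ValueError("need at least three breaths")
    mean_interval = statistics.fmean(intervals)
    return {
        "resp_rate_bpm": 60.0 / mean_interval,
        # Coefficient of variation: low for steady breathing, high when irregular.
        "resp_cv": statistics.stdev(intervals) / mean_interval,
    }

steady = respiratory_features([0, 4, 8, 12, 16])
irregular = respiratory_features([0, 3, 8, 10, 16])
```

Both example sequences average 15 breaths/min, but only the second has a non-zero coefficient of variation, which is the kind of contrast expected between stable NREM and REM breathing.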

For detailed methods on extracting respiratory signals from PPG, see our article on PPG respiratory rate estimation.

Movement and Signal Quality Features

Wrist-worn devices often include accelerometers that provide movement data complementary to PPG. Even without a dedicated accelerometer, PPG signal quality metrics serve as movement proxies:

  • Signal quality index (SQI): Low SQI indicates motion, which correlates with wake or restless sleep.
  • Motion artifact density: The fraction of epochs with detected motion artifacts.
  • PPG amplitude stability: Stable amplitude suggests the subject is still (sleep), while large amplitude variations suggest movement (wake or stage transitions).
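
A rough sketch of how these proxies might be computed is shown below. It assumes a per-epoch SQI in the range 0 to 1 and per-beat pulse amplitudes grouped by epoch are already available; the SQI threshold of 0.5 and the function name `movement_proxies` are hypothetical choices, not values from the literature.

```python
import statistics

def movement_proxies(epoch_sqi, epoch_amplitudes, sqi_threshold=0.5):
    """Return motion artifact density and per-epoch amplitude variability.

    epoch_sqi: one quality score per epoch (assumed range 0..1).
    epoch_amplitudes: list of per-beat pulse amplitudes for each epoch.
    """
    # Fraction of epochs flagged as motion-corrupted by the SQI threshold.
    flagged = [q < sqi_threshold for q in epoch_sqi]
    artifact_density = sum(flagged) / len(flagged)

    # Coefficient of variation of pulse amplitude: near zero when still.
    amplitude_cv = []
    for amps in epoch_amplitudes:
        mean = statistics.fmean(amps)
        amplitude_cv.append(statistics.pstdev(amps) / mean if mean > 0 else float("inf"))
    return artifact_density, amplitude_cv

density, cv = movement_proxies(
    [0.9, 0.8, 0.3, 0.95],
    [[1.0, 1.0, 1.0], [1.0, 2.0, 0.5]],
)
```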

Classical Machine Learning Approaches

Early PPG-based sleep staging algorithms used hand-crafted features with classical classifiers, building on the HRV-based approaches originally developed for ECG-based sleep staging.

Random Forest and Gradient Boosting

Beattie et al. (2017) developed one of the most widely cited PPG sleep staging algorithms, using 32 HRV and movement features with a random forest classifier (DOI: 10.1093/sleep/zsw045). Validated against PSG in 60 subjects, the algorithm achieved:

  • Four-class accuracy (wake/light/deep/REM): 69%
  • Cohen's kappa: 0.52
  • Deep sleep (N3) sensitivity: 62%
  • REM sensitivity: 72%
  • Wake sensitivity: 56%

The relatively low wake detection sensitivity is a persistent weakness of cardiac-based sleep staging: quiet wakefulness (lying still with eyes closed) produces HRV patterns similar to light sleep.

Walch et al. (2019) improved on this using gradient-boosted trees (XGBoost) with 130 features including circadian clock features derived from the time of night, achieving four-class accuracy of 73% and kappa of 0.56 in 31 subjects with Apple Watch PPG data compared to clinical PSG (DOI: 10.5665/sleep/zsz180). The addition of circadian features significantly improved REM detection, as REM propensity follows a circadian pattern with longer episodes toward morning.

Hidden Markov Models

Sleep stages follow a structured temporal sequence: wake typically transitions to N1, then N2, with deep sleep (N3) occurring primarily in the first half of the night and REM episodes lengthening toward morning. Hidden Markov models (HMMs) and their variants exploit this temporal structure.

Fonseca et al. (2015) used an HMM with HRV features to classify sleep stages from wrist PPG in 15 subjects, achieving accuracy of 72% for three-class classification (wake/NREM/REM) and kappa of 0.42 (DOI: 10.1109/TBME.2015.2434304). The HMM's transition probability matrix naturally encodes the physiological constraints on sleep stage sequences -- for example, direct transitions from deep sleep to wake are rare, and the model learns to penalize such transitions.
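
The smoothing effect of a transition matrix can be illustrated with Viterbi decoding. The sketch below is not the model from the cited study: the three states, the "sticky" transition probabilities, and the per-epoch state likelihoods are all made-up illustrative values.

```python
import math

STATES = ["wake", "nrem", "rem"]

def _log(p):
    return math.log(p) if p > 0 else float("-inf")

def viterbi(emission_probs, trans, init):
    """Most likely state sequence given per-epoch state likelihoods."""
    n_states = len(init)
    score = [_log(init[s]) + _log(emission_probs[0][s]) for s in range(n_states)]
    backpointers = []
    for obs in emission_probs[1:]:
        prev, score, ptr = score, [], []
        for s in range(n_states):
            # Best predecessor for state s under the transition model.
            best = max(range(n_states), key=lambda q: prev[q] + _log(trans[q][s]))
            ptr.append(best)
            score.append(prev[best] + _log(trans[best][s]) + _log(obs[s]))
        backpointers.append(ptr)
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(backpointers):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path

# Hypothetical values: strong self-transitions discourage one-epoch flips.
trans = [[0.90, 0.05, 0.05], [0.05, 0.90, 0.05], [0.05, 0.05, 0.90]]
init = [1 / 3, 1 / 3, 1 / 3]
emissions = [[0.1, 0.8, 0.1]] * 3 + [[0.45, 0.40, 0.15]] + [[0.1, 0.8, 0.1]] * 3
decoded = [STATES[s] for s in viterbi(emissions, trans, init)]
```

In this example the fourth epoch slightly favors wake when scored in isolation, but the decoded sequence stays in NREM throughout because a single-epoch excursion would require two unlikely transitions.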

Deep Learning Approaches

Deep learning has substantially advanced PPG-based sleep staging by learning feature representations directly from raw or minimally processed signals and by capturing complex temporal dependencies.

Convolutional Neural Networks

Kotzen et al. (2023) applied a multi-scale CNN to 30-second PPG epochs, learning features at multiple temporal resolutions. The architecture processed raw PPG, its first derivative, and its envelope simultaneously, achieving four-class accuracy of 77% and kappa of 0.62 on the MESA (Multi-Ethnic Study of Atherosclerosis) sleep dataset with 2,055 participants (DOI: 10.1109/JBHI.2022.3189923). This large-scale validation on a diverse, community-based cohort is particularly valuable because it demonstrates robustness across age (45-84 years), sex, and ethnicity.

Recurrent and Temporal Models

Sleep staging benefits from temporal context -- knowing the sleep stage of adjacent epochs constrains the current epoch's classification. Several architectures exploit this:

Radha et al. (2021) developed a deep temporal model combining CNN feature extraction with a bidirectional GRU (gated recurrent unit) sequence classifier, trained on PPG from 292 subjects with PSG reference. The model achieved four-class accuracy of 76.36% and Cohen's kappa of 0.65, with per-class sensitivities of: wake 73%, light sleep 80%, deep sleep 58%, REM 67% (DOI: 10.1038/s41598-021-87845-8). The recurrent component improved accuracy by approximately 4% over epoch-independent classification, demonstrating the value of temporal context.

Transformer-Based Models

Sun et al. (2023) applied a transformer architecture with positional encoding to PPG-based sleep staging, processing sequences of 20 consecutive 30-second epochs (10 minutes of context). The self-attention mechanism learned to weight adjacent epochs differently depending on the transition patterns, achieving state-of-the-art four-class accuracy of 79.1% and kappa of 0.68 on the SHHS (Sleep Heart Health Study) dataset with 5,793 subjects. The attention weights revealed that the model learned to attend most strongly to the immediately adjacent epochs and to epochs 4-5 positions away (about 2-2.5 minutes), capturing local stage transitions and short-range context. A 10-minute window is far shorter than the approximately 90-minute ultradian sleep cycle, so cycle-level structure lies outside what this context length can represent directly.
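
To make the context length concrete, the sketch below slices a night of per-epoch features into overlapping sequences of 20 epochs, the kind of input a sequence model consumes. The helper name `epoch_windows` and the stride of one epoch are assumptions; the cited work may have batched its sequences differently.

```python
EPOCH_SECONDS = 30

def epoch_windows(epoch_features, context=20, stride=1):
    """Split a night of per-epoch features into fixed-length sequences."""
    last_start = len(epoch_features) - context
    return [epoch_features[i:i + context] for i in range(0, last_start + 1, stride)]

# An 8-hour recording contains 960 thirty-second epochs.
night = list(range(960))  # placeholder for per-epoch feature vectors
windows = epoch_windows(night)
context_minutes = len(windows[0]) * EPOCH_SECONDS / 60
```

With these settings the 960 epochs yield 941 overlapping windows, each spanning 10 minutes.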

Performance Comparison and Benchmarks

Performance comparison across studies is complicated by differences in datasets, population demographics, class definitions (three-class vs. four-class vs. five-class), and evaluation protocols. The following table summarizes representative results:

| Study | Method | N Subjects | Classes | Accuracy | Kappa |
|-------|--------|-----------|---------|----------|-------|
| Beattie et al., 2017 | Random Forest | 60 | 4 | 69% | 0.52 |
| Walch et al., 2019 | XGBoost | 31 | 4 | 73% | 0.56 |
| Fonseca et al., 2015 | HMM | 15 | 3 | 72% | 0.42 |
| Radha et al., 2021 | CNN-GRU | 292 | 4 | 76% | 0.65 |
| Kotzen et al., 2023 | Multi-scale CNN | 2,055 | 4 | 77% | 0.62 |
| Sun et al., 2023 | Transformer | 5,793 | 4 | 79% | 0.68 |

For context, inter-scorer agreement between trained PSG technicians is typically 82-90% accuracy and kappa of 0.75-0.85 for five-class scoring (Rosenberg and Van Hout, 2013). PPG-based methods are approaching but have not reached this benchmark, particularly for the more challenging stage distinctions (N1 vs. N2, REM vs. wake).
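
Accuracy, Cohen's kappa, and per-class sensitivity are all derived from the epoch-by-epoch confusion matrix between the reference and predicted hypnograms. The sketch below shows the standard definitions; the function names and the small two-class example are illustrative only.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows are reference (PSG) stages, columns are predicted stages."""
    index = {label: i for i, label in enumerate(labels)}
    cm = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        cm[index[t]][index[p]] += 1
    return cm

def accuracy_and_kappa(cm):
    """Observed agreement and Cohen's kappa (agreement corrected for chance)."""
    total = sum(sum(row) for row in cm)
    observed = sum(cm[i][i] for i in range(len(cm))) / total
    row_totals = [sum(row) for row in cm]
    col_totals = [sum(col) for col in zip(*cm)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    return observed, (observed - expected) / (1 - expected)

def sensitivities(cm):
    """Per-class sensitivity: correctly labeled epochs / reference epochs."""
    return [row[i] / sum(row) if sum(row) else float("nan") for i, row in enumerate(cm)]

acc, kappa = accuracy_and_kappa([[50, 10], [5, 35]])
```

For the example matrix, observed agreement is 0.85 and chance agreement is 0.51, giving a kappa of about 0.69. Kappa is lower than accuracy because it discounts the agreement that class imbalance alone would produce, which matters in sleep staging where light sleep dominates the night.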

Clinical Applications and Limitations

Population-Level Sleep Assessment

PPG-based sleep staging is most valuable when applied at scale over extended periods. While individual-night staging has meaningful uncertainty (at the 69-79% epoch-level accuracy reported above, roughly 20-30% of epochs may be misclassified), trends averaged over weeks to months are more reliable. Large epidemiological studies using wearable PPG data have revealed population-level associations between sleep architecture and health outcomes that were previously inaccessible due to the cost and inconvenience of PSG.

Sleep Disorder Screening

PPG-based sleep analysis shows promise for screening several sleep disorders:

  • Obstructive sleep apnea (OSA): PPG can detect apnea-related oxygen desaturations (via SpO2), cyclical heart rate patterns, and sleep fragmentation. Behar et al. (2019) demonstrated AHI estimation from PPG-derived features with sensitivity of 88% and specificity of 80% for moderate-severe OSA (AHI >= 15) in 887 subjects (DOI: 10.1016/j.smrv.2019.101223).
  • Insomnia: Total sleep time, sleep onset latency, and wake after sleep onset (WASO) can be estimated from PPG-based sleep/wake classification, though accuracy for detecting brief wake episodes remains limited.
  • Circadian rhythm disorders: Long-term PPG-derived sleep timing data can identify delayed or advanced sleep phase patterns without requiring sleep diaries or actigraphy.
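
The insomnia-related summary metrics mentioned above can be derived from a per-epoch hypnogram, as in the following sketch. It assumes 30-second epochs labeled "W" for wake and any other label for sleep, and it counts WASO as all wake from sleep onset to the end of the recording; some definitions exclude wake after the final awakening. The function name `sleep_summary` is hypothetical.

```python
EPOCH_MIN = 0.5  # 30-second epochs expressed in minutes

def sleep_summary(stages):
    """Total sleep time, sleep onset latency, WASO, and sleep efficiency."""
    asleep = [s != "W" for s in stages]
    recording_min = len(stages) * EPOCH_MIN
    if not any(asleep):
        return {"tst_min": 0.0, "sol_min": recording_min, "waso_min": 0.0, "efficiency": 0.0}
    onset = asleep.index(True)  # first epoch scored as sleep
    tst = sum(asleep) * EPOCH_MIN
    waso = sum(1 for a in asleep[onset:] if not a) * EPOCH_MIN
    return {
        "tst_min": tst,
        "sol_min": onset * EPOCH_MIN,
        "waso_min": waso,
        "efficiency": tst / recording_min,
    }

hypnogram = ["W"] * 10 + ["N2"] * 100 + ["W"] * 4 + ["REM"] * 20 + ["W"] * 6
summary = sleep_summary(hypnogram)
```

For this toy hypnogram the result is 60 minutes of sleep, a 5-minute onset latency, and 5 minutes of WASO. Because brief awakenings are often missed by PPG-based classifiers, WASO computed this way will tend to be underestimated relative to PSG.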

Known Limitations

Several limitations constrain PPG-based sleep staging accuracy:

  1. N1 detection: N1 is poorly detected by all PPG methods due to minimal autonomic differentiation from wakefulness and N2.
  2. Wake detection during the night: Brief awakenings (1-5 minutes) are frequently missed because HRV changes lag the EEG state transition.
  3. Cardiac and autonomic conditions: Patients with atrial fibrillation, heart failure, or autonomic neuropathy have altered HRV patterns that violate the assumptions of sleep staging algorithms trained on healthy populations.
  4. Medications: Beta-blockers, anticholinergics, and other medications affecting autonomic function alter the HRV-sleep stage relationship.
  5. Age effects: HRV decreases with age, reducing the dynamic range of features used for staging. Algorithms trained primarily on younger adults may perform poorly in elderly populations.

For researchers working on sleep staging algorithms, our algorithms reference provides foundational signal processing methods, and understanding the conditions affecting sleep and cardiovascular health is essential for developing robust classification systems.

References

  • Beattie, Z. et al. (2017). Sleep. DOI: 10.1093/sleep/zsw045
  • Behar, J.A. et al. (2019). Sleep Medicine Reviews. DOI: 10.1016/j.smrv.2019.101223
  • Boudreau, P. et al. (2013). Sleep Medicine Reviews. DOI: 10.1016/j.smrv.2012.02.002
  • Fonseca, P. et al. (2015). IEEE Transactions on Biomedical Engineering. DOI: 10.1109/TBME.2015.2434304
  • Kotzen, K. et al. (2023). IEEE Journal of Biomedical and Health Informatics. DOI: 10.1109/JBHI.2022.3189923
  • Radha, M. et al. (2021). Scientific Reports. DOI: 10.1038/s41598-021-87845-8
  • Walch, O. et al. (2019). Sleep. DOI: 10.5665/sleep/zsz180

Frequently Asked Questions

How accurate is PPG-based sleep staging compared to polysomnography?
PPG-based sleep staging achieves 70-80% epoch-by-epoch agreement with PSG for four-class classification (wake, light/N1+N2, deep/N3, REM), compared to 82-90% inter-scorer agreement between trained human PSG technicians. Wake detection sensitivity is typically 60-75%, which is lower than sleep detection (90-95%). Deep sleep (N3) and REM detection accuracies range from 55-75% depending on the algorithm, population, and signal quality. The best deep learning models on large datasets approach Cohen's kappa of 0.65-0.70, compared to 0.75-0.85 for inter-scorer PSG agreement.
What PPG features are used for sleep staging?
The primary features come from heart rate variability (HRV) analysis of the PPG-derived pulse intervals. Time-domain features include mean heart rate, SDNN, RMSSD, and pNN50. Frequency-domain features include LF power (0.04-0.15 Hz), HF power (0.15-0.4 Hz), and LF/HF ratio. Non-linear features include sample entropy, detrended fluctuation analysis (DFA) alpha1, and Poincare plot descriptors. Additionally, PPG-derived respiratory rate, pulse amplitude variability, and movement-related signal quality metrics provide complementary information for distinguishing sleep stages.
Can a smartwatch accurately detect deep sleep and REM sleep?
Consumer smartwatches can detect deep sleep (N3) and REM sleep at a population level but have significant errors for individual nights. Studies show that wrist-worn devices like the Apple Watch and Fitbit detect deep sleep with sensitivity of 50-65% and REM with sensitivity of 55-70%, meaning they miss 30-50% of epochs classified as deep or REM by PSG. These devices are more reliable for total sleep time estimation (typically within 20-30 minutes of PSG) and general sleep pattern trends over weeks to months than for precise staging on any single night.
Why is N1 sleep so difficult to detect from PPG?
N1 (light drowsiness) is the transitional stage between wakefulness and established sleep. It is characterized by relatively minor changes in autonomic function compared to wakefulness -- heart rate decreases only slightly, and HRV changes are subtle and variable. Even PSG technicians have poor inter-scorer agreement for N1, with reliability below 50% in some studies. The autonomic signature of N1 overlaps substantially with relaxed wakefulness and early N2, making it nearly impossible to distinguish using cardiac-derived features alone. Most PPG-based systems combine N1 with N2 into a single 'light sleep' class.