ChatPPG Editorial

Self-Supervised Learning for PPG: Contrastive Methods and Pre-Training Without Labels

How self-supervised and contrastive learning pre-trains PPG models on unlabeled wearable data, covering SimCLR, BYOL, masked autoencoding, and downstream task performance.

ChatPPG Research Team
7 min read

Self-supervised learning pre-trains PPG models on vast quantities of unlabeled wearable data, learning rich physiological representations without requiring expensive manual annotations. A pre-trained encoder fine-tuned on just a few dozen labeled examples can match, and sometimes outperform, a supervised model trained from scratch on thousands of labeled samples.

This capability is transformative for clinical PPG applications where labeled data is scarce, expensive, or impossible to obtain at scale. Polysomnography-labeled sleep staging, expert-annotated AF episodes, and pain-correlated PPG recordings all require costly clinical procedures. Self-supervised pre-training turns the abundant unlabeled PPG from consumer wearables into a powerful learning signal.

The Labeled Data Bottleneck in PPG Research

Supervised deep learning for PPG requires labeled pairs: (PPG segment, target label). For heart rate estimation, labels come from simultaneous ECG or fingertip pulse oximetry. For sleep staging, labels require overnight polysomnography. For AF detection, labels need cardiologist review of ECG. For blood pressure, labels need intra-arterial catheter or validated cuff measurements.

The result: high-quality labeled PPG datasets rarely exceed a few hundred subjects. The IEEE Signal Processing Cup 2015 dataset has 22 subjects. MESA has ~2,000 subjects with PSG-labeled sleep. MIMIC-III has ~60,000 ICU stays but ECG/PPG alignment is imperfect.

Meanwhile, consumer wearables generate petabytes of unlabeled PPG daily. Self-supervised learning is the bridge between this data abundance and the label scarcity.

Contrastive Learning Fundamentals

Contrastive learning trains an encoder to produce similar representations for two augmented views of the same PPG segment (positive pairs) while pushing representations apart for segments from different physiological states (negative pairs).

SimCLR (Chen et al., 2020) applies two random augmentations to each PPG window, passes both through a shared encoder and projection head, then minimizes NT-Xent (Normalized Temperature-Scaled Cross Entropy) loss. The augmentations are the key design choice.
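As a concrete sketch, the NT-Xent objective can be written in a few lines of NumPy. This is a minimal, non-optimized illustration, not SimCLR's reference implementation; `nt_xent_loss` is our own name, and the inputs are assumed to be the projection-head outputs for two augmented views of the same batch.

```python
import numpy as np

def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent loss for a batch of paired embeddings.

    z1, z2: (batch, dim) projections of two augmented views of the
    same PPG windows. Rows at the same index are the positive pairs;
    every other row in the combined batch acts as a negative.
    """
    z = np.concatenate([z1, z2], axis=0)               # (2N, dim)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit norm -> cosine sim
    sim = z @ z.T / temperature                        # (2N, 2N) similarity matrix
    np.fill_diagonal(sim, -np.inf)                     # exclude self-similarity

    n = z1.shape[0]
    # row i's positive sits at i+n (and vice versa)
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), pos].mean()
```

Identical views drive the positive-pair similarity to its maximum, so the loss should drop sharply relative to mismatched views; that sanity check is a quick way to validate an implementation.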

For PPG specifically, effective augmentations include:

  • Time masking: Randomly zero out 10-30% of the signal in contiguous blocks (inspired by SpecAugment). Forces the model to learn from incomplete signals, which mirrors real-world sensor dropouts.
  • Amplitude scaling: Multiply signal by a random factor in [0.5, 1.5]. Teaches amplitude-invariant representations essential for cross-device generalization.
  • Baseline wander injection: Add low-frequency sinusoidal noise (0.05-0.15 Hz, random phase). Forces robustness to the respiratory baseline wander common in ambulatory recordings.
  • Time stretching: Resample the signal at 0.85-1.15x the original rate. Teaches representations invariant to minor heart rate fluctuations between two views of the same segment.
  • Channel dropout (for multi-wavelength PPG): Randomly drop one of red/green/IR channels. Forces the model to learn from any available wavelength combination.
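The single-channel augmentations above can be sketched as plain NumPy transforms. Function names, gain ranges, and the wander amplitude are our own illustrative choices within the ranges stated above, not a published pipeline.

```python
import numpy as np

def time_mask(x, rng, max_frac=0.3):
    """Zero out one contiguous block covering up to max_frac of the window."""
    x = x.copy()
    width = rng.integers(1, int(len(x) * max_frac))
    start = rng.integers(0, len(x) - width)
    x[start:start + width] = 0.0
    return x

def amplitude_scale(x, rng, low=0.5, high=1.5):
    """Multiply the whole window by one random gain."""
    return x * rng.uniform(low, high)

def baseline_wander(x, rng, fs=25.0, f_low=0.05, f_high=0.15):
    """Add a low-frequency sinusoid with random phase, mimicking respiration."""
    t = np.arange(len(x)) / fs
    f = rng.uniform(f_low, f_high)
    phase = rng.uniform(0.0, 2.0 * np.pi)
    return x + 0.2 * np.sin(2.0 * np.pi * f * t + phase)

def augment(x, rng):
    """Compose the transforms into one random view for contrastive training."""
    return baseline_wander(amplitude_scale(time_mask(x, rng), rng), rng)
```

Calling `augment` twice on the same window yields the two views a SimCLR-style objective compares.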

Mao et al. (2022, Ressl-PPG: Self-Supervised Contrastive Learning for PPG-Based Physiological Measurement, IEEE JBHI) applied SimCLR-style pre-training to PPG from PPG-DaLiA and BIDMC datasets and achieved fine-tuning performance comparable to full supervised training using only 10% of labeled data.

BYOL and Non-Contrastive Methods

Bootstrap Your Own Latent (BYOL, Grill et al., 2020) eliminates negative pairs entirely, using an online network and a momentum-updated target network to avoid representation collapse. For PPG, BYOL-style training has practical advantages:

No large batch requirement: SimCLR needs large batches (512-4096) to see enough negatives per step. Wearable edge training has strict memory limits. BYOL converges with batch sizes of 32-64.

No hard negative mining: In PPG, segments from the same person at different heart rates can be challenging negatives that confuse the contrastive objective. BYOL sidesteps this entirely.

Better stability: BYOL's moving average target network provides stable training targets, which matters when PPG augmentation pipeline variance is high.
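The momentum-updated target network behind these properties is just an exponential moving average of the online network's parameters. A minimal sketch, treating parameters as a dict of arrays (`ema_update` and `tau=0.996` follow the BYOL paper's convention, but the function itself is ours):

```python
import numpy as np

def ema_update(target_params, online_params, tau=0.996):
    """Momentum update of BYOL's target network.

    Each target parameter drifts toward its online counterpart:
        theta_target <- tau * theta_target + (1 - tau) * theta_online
    The target is never updated by gradient descent, which is what
    provides the slowly-moving, stable training signal.
    """
    return {name: tau * target_params[name] + (1.0 - tau) * online_params[name]
            for name in target_params}
```

With `tau` close to 1 the target lags the online network by many steps, which is precisely the stability property noted above.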

Sarkar & Etemad (2020, Self-Supervised ECG Representation Learning for Emotion Recognition, IEEE TAFFC) showed BYOL-analogous approaches on cardiac signals achieve 95% of supervised performance on downstream tasks with 1% of the labeled data.

Masked Autoencoding for PPG

The third major self-supervised paradigm applies the BERT/MAE concept to PPG. A random 50-75% of the PPG signal is masked, and the model learns to reconstruct the missing portions. This is conceptually closer to the original BERT masked language modeling than to contrastive learning.

PPG-MAE: Mask random contiguous windows of the PPG signal until 50-75% of samples are hidden. Train a Transformer encoder-decoder to reconstruct the masked portions. The encoder learns to infer masked segments from context, forcing it to model physiological signal dynamics.
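The mask-and-reconstruct objective can be sketched as follows. This is our own minimal illustration (the function names and the 25-sample patch size, i.e. one second at 25 Hz, are assumptions): patches are hidden at the stated ratio and the loss is computed only on the hidden samples.

```python
import numpy as np

def mask_signal(x, rng, mask_ratio=0.6, patch=25):
    """Split x into fixed-size patches and hide mask_ratio of them.

    Returns the masked signal (hidden patches zeroed) and a boolean
    mask marking the hidden samples.
    """
    n_patches = len(x) // patch
    n_hide = int(round(mask_ratio * n_patches))
    hidden = rng.choice(n_patches, size=n_hide, replace=False)
    mask = np.zeros(len(x), dtype=bool)
    for p in hidden:
        mask[p * patch:(p + 1) * patch] = True
    return np.where(mask, 0.0, x), mask

def reconstruction_loss(pred, target, mask):
    """MSE on masked samples only; visible samples contribute nothing."""
    return float(np.mean((pred[mask] - target[mask]) ** 2))
```

In a real pipeline `pred` would be the Transformer decoder's output for the full window; restricting the loss to masked positions is what forces the encoder to model signal dynamics rather than copy its input.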

Tang et al. (2023, BIOT: Cross-Data Biosignal Foundation Models via Unified Tokenization, NeurIPS) applied masked patch modeling to multi-modal biosignals including PPG and demonstrated that the pre-trained representations transfer effectively to tasks the model was not pre-trained on, including PPG-based sleep staging and AF detection from ECG.

The reconstruction quality of the masked regions is a secondary objective. The primary benefit is the learned encoder representations. Fine-tuning for downstream classification discards the decoder and trains only the encoder and a task-specific head.

Downstream Task Performance

The benchmark comparison across self-supervised methods on PPG:

Heart rate estimation (PPG-DaLiA, stationary conditions): Supervised CNN: 1.8 BPM MAE. SimCLR fine-tuned (10% labels): 2.1 BPM. BYOL fine-tuned (10% labels): 2.0 BPM. Supervised from scratch (10% labels): 4.7 BPM. Self-supervised pre-training closes roughly 90% of the gap to full supervision with 10x less labeled data.

AF detection (PPG-based IBI sequences): Supervised 1D-CNN (full labels): AUC 0.96. Self-supervised + fine-tune (5% labels): AUC 0.93. Fine-tune from scratch (5% labels): AUC 0.81.

Sleep staging (wrist PPG, 4-class): The gap between self-supervised and supervised narrows for complex tasks. With 50 labeled subjects, fine-tuned self-supervised models achieve Cohen's kappa of 0.55-0.60, comparable to supervised models trained on 200+ subjects.

Physiological Consistency as a Self-Supervised Signal

Beyond augmentation-based contrastive learning, PPG-specific self-supervised signals exploit known physiological relationships:

Multi-sensor consistency: Simultaneous PPG from wrist and finger, or red and green channels, are naturally correlated. A contrastive objective that pushes simultaneous multi-site recordings together and non-simultaneous recordings apart learns physiologically meaningful representations without augmentation.

Temporal coherence: Consecutive 10-second PPG segments should have similar heart rate and waveform morphology. A self-supervised objective penalizing large predicted heart rate changes between adjacent segments encodes physiological continuity directly into the learning signal.

Cross-modal alignment: Simultaneous PPG and accelerometer streams from the same wearable are inherently correlated during activity: movement produces motion artifacts in the PPG and a direct signature in the accelerometer. Cross-modal contrastive learning between PPG and IMU learns motion-aware representations directly.
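The temporal-coherence idea above lends itself to a particularly simple loss term. A minimal sketch, assuming the model emits a heart rate prediction per 10-second segment; the function name and the 5 BPM tolerance are our own illustrative choices:

```python
import numpy as np

def temporal_coherence_penalty(hr_pred, max_delta=5.0):
    """Penalize implausible heart-rate jumps between adjacent segments.

    hr_pred: predicted heart rate (BPM) for consecutive 10 s windows.
    Jumps within max_delta BPM are free; larger jumps are penalized
    quadratically, encoding physiological continuity into the loss.
    """
    jumps = np.abs(np.diff(hr_pred))
    excess = np.maximum(jumps - max_delta, 0.0)
    return float(np.mean(excess ** 2))
```

Added as an auxiliary term alongside a contrastive or reconstruction loss, this rewards representations whose downstream predictions vary smoothly in time.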

Practical Implementation

A self-supervised PPG pre-training pipeline for practitioners:

  1. Data collection: Aggregate 100+ hours of unlabeled wrist PPG at 25-50 Hz. Public sources include the large-scale UK Biobank PPG subset, Amazon Halo research datasets, and MIMIC-IV waveform database.

  2. Preprocessing: Apply bandpass filter (0.5-8 Hz), normalize each 10-second window to zero mean, unit variance. No label alignment needed.

  3. Augmentation pipeline: Implement time masking + amplitude scaling as a minimum. Add baseline wander injection and time stretching for improved robustness.

  4. Architecture: A 5-layer 1D-CNN with residual connections and a 128-dim projection head is sufficient for most PPG self-supervised pre-training. Transformer architectures achieve better transfer for complex downstream tasks but require 10-100x more compute.

  5. Pre-training: 100-200 epochs with AdamW, learning rate warmup, cosine annealing. SimCLR requires batch size ≥ 256; BYOL works with 64.

  6. Fine-tuning: Attach task-specific head to frozen encoder. Unfreeze encoder after 5-10 epochs of head-only training. Use lower learning rate (1e-5 to 1e-4) for encoder layers.
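The preprocessing in step 2 can be sketched with SciPy. A minimal sketch under the stated parameters (0.5-8 Hz bandpass, 10-second zero-mean unit-variance windows); the function name and filter order are our own choices.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def preprocess(ppg, fs=25.0, window_s=10):
    """Bandpass 0.5-8 Hz, then z-normalize each 10 s window.

    Returns an array of shape (n_windows, window_s * fs) ready for
    the augmentation pipeline; no labels are involved at any point.
    """
    # 3rd-order Butterworth, applied forward-backward for zero phase shift
    b, a = butter(3, [0.5, 8.0], btype="bandpass", fs=fs)
    filtered = filtfilt(b, a, ppg)

    win = int(window_s * fs)
    n = len(filtered) // win
    windows = filtered[:n * win].reshape(n, win)
    mean = windows.mean(axis=1, keepdims=True)
    std = windows.std(axis=1, keepdims=True) + 1e-8  # guard against flat windows
    return (windows - mean) / std
```

The trailing partial window is simply dropped, which is a common simplification for pre-training where data is abundant.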

For the signal processing foundations this builds on, see our articles on PPG signal quality assessment, deep learning for PPG heart rate estimation, and PPG transformer models. For clinical applications that benefit from these approaches, see PPG sleep staging and arrhythmia classification from PPG.

FAQ

What makes self-supervised learning different from unsupervised learning for PPG? Unsupervised learning (clustering, PCA, autoencoders) discovers structure in data without labels. Self-supervised learning creates its own supervisory signal from the data structure itself: predicting masked portions, or learning that two augmented views of the same segment should be similar. The distinction matters because self-supervised methods produce representations that transfer much better to downstream supervised tasks.

How does contrastive pre-training handle heart rate as a confound? This is a genuine challenge. Two segments from different subjects at the same heart rate could be pushed apart by the contrastive objective even though they share similar physiological representations. Momentum contrast with large queues partially addresses this by including diverse negatives. Supervised contrastive learning variants that use heart rate as a soft label for positive/negative pair assignment provide a cleaner solution.

Can self-supervised PPG models generalize across wearable devices? Device generalization is the primary motivator for amplitude-invariant and frequency-invariant augmentations. Green-LED smartwatch PPG at 25 Hz and red-LED medical pulse oximeter PPG at 125 Hz have very different spectral characteristics. Pre-training with aggressive cross-device augmentation (random downsampling, channel simulation) significantly improves cross-device transfer.

How much unlabeled data is enough for pre-training? Scaling laws for PPG self-supervised pre-training are not yet well-characterized. Empirically, 50+ subjects' worth of continuous recording (roughly 500+ hours) provides substantial benefit over random initialization. Beyond 5,000 hours, diminishing returns set in for most downstream tasks. Data diversity (multiple skin tones, activity types, health conditions) matters more than raw volume.

Is there a risk that self-supervised models learn spurious correlations in PPG? Yes. If the unlabeled pre-training data is systematically biased (e.g., dominated by young, healthy, light-skinned individuals), the learned representations encode those biases and transfer them to downstream tasks. Balanced pre-training datasets and demographic-stratified evaluation are essential for equitable deployment.