PPG Foundation Models: Large-Scale Pre-Trained Encoders for Wearable Physiology
How large foundation models pre-trained on diverse wearable PPG data provide universal encoders that fine-tune to downstream health monitoring tasks with only a few labeled examples.

A PPG foundation model is a large neural network pre-trained on diverse, unlabeled photoplethysmography data from many devices, populations, and physiological conditions. Like GPT or BERT in NLP, it learns general-purpose representations that can be fine-tuned for any downstream health monitoring task. Heart rate estimation, sleep staging, AF detection, blood pressure estimation, and stress monitoring all become fine-tuning problems on top of the same pre-trained backbone.
The foundation model paradigm is fundamentally changing how PPG AI systems are built. Rather than training separate specialist models for each clinical application, a single large pre-trained encoder serves as the universal feature extractor, with task-specific heads fine-tuned on relatively small labeled datasets.
What Makes a Foundation Model Different
Traditional PPG deep learning trains a model specifically for one task (heart rate estimation) on a fixed dataset (PPG-DaLiA). A foundation model is trained first on an enormous and diverse corpus without task-specific labels, then adapted to specific tasks.
The distinguishing characteristics:
Scale: Foundation models are larger than task-specific models by 10-1000x in parameter count. Scale enables the model to capture rare physiological patterns that task-specific models miss.
Data diversity: Pre-training data spans multiple PPG sensor types, wavelengths, body locations (wrist, fingertip, earlobe, forehead), demographic groups, health conditions, and activity contexts. This diversity is what enables generalization.
Multi-modal pre-training: The most powerful PPG foundation models incorporate simultaneous ECG, accelerometer, temperature, and barometric signals during pre-training. Cross-modal signals provide supervisory information that helps the PPG encoder disentangle physiological from artifactual variation.
Zero-shot and few-shot capability: A well-trained foundation model produces useful PPG embeddings for tasks it was not explicitly pre-trained on, without any task-specific fine-tuning. This zero-shot capability is the hallmark of a true foundation model.
Architecture: Transformers as Universal PPG Encoders
The Transformer architecture (Vaswani et al., 2017) dominates foundation model design for its scalability and ability to model long-range dependencies. For PPG, two tokenization strategies are common:
Patch-based tokenization: Divide the PPG signal into non-overlapping fixed-length patches (e.g., 1-second patches at 50 Hz = 50 samples per patch). Project each patch to a model-dimension embedding, then process the sequence of patch embeddings through Transformer encoder layers.
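As a concrete sketch, patch tokenization is just a reshape followed by a learned linear projection. The patch length and sampling rate below are illustrative assumptions, not parameters from any specific model:

```python
def patchify(signal, patch_len):
    """Split a 1-D signal into non-overlapping fixed-length patches,
    dropping trailing samples that do not fill a whole patch."""
    n_patches = len(signal) // patch_len
    return [signal[i * patch_len:(i + 1) * patch_len]
            for i in range(n_patches)]

# 10 s of PPG at 50 Hz -> 500 samples; 1-s patches -> 10 tokens of 50 samples.
fs = 50
signal = [0.0] * (10 * fs)          # placeholder PPG trace
patches = patchify(signal, patch_len=fs)
# In the real model, each patch would then be multiplied by a learned
# [patch_len x d_model] weight matrix to produce one token embedding.
```

Overlapping patches or a learned convolutional stem are common variants; the token count scales inversely with patch length, which sets the Transformer's sequence length.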
Beat-based tokenization: Segment the PPG into individual cardiac cycles using a peak-detection algorithm, resample each beat to a fixed length (e.g., 100 samples), and encode each beat as one token. This physiologically meaningful tokenization lets the model learn beat-to-beat patterns directly, which is especially useful for arrhythmia and HRV applications.
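A minimal beat tokenizer can be sketched in pure Python. The naive local-maximum peak detector and linear resampler below are hypothetical stand-ins for production peak-detection and interpolation routines:

```python
import math

def find_peaks(signal):
    """Naive local-maximum detector (stand-in for a real PPG
    systolic-peak detection algorithm)."""
    return [i for i in range(1, len(signal) - 1)
            if signal[i] > signal[i - 1] and signal[i] >= signal[i + 1]]

def resample(beat, length):
    """Linearly interpolate one beat to a fixed token length."""
    out = []
    for k in range(length):
        pos = k * (len(beat) - 1) / (length - 1)
        lo = int(pos)
        hi = min(lo + 1, len(beat) - 1)
        frac = pos - lo
        out.append(beat[lo] * (1 - frac) + beat[hi] * frac)
    return out

# 5 s of a synthetic 1.2 Hz (~72 BPM) pulse sampled at 50 Hz.
fs = 50
signal = [math.sin(2 * math.pi * 1.2 * i / fs) for i in range(5 * fs)]
peaks = find_peaks(signal)
# One token per cardiac cycle, each resampled to a fixed 100 samples.
beats = [resample(signal[a:b], 100) for a, b in zip(peaks, peaks[1:])]
```

Fixed-length beats make every token the same dimensionality regardless of heart rate; the original beat duration can be appended as an extra feature so timing information is not lost.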
Multi-scale Transformers: Hierarchical architectures process PPG at multiple time scales simultaneously: patch-level (capturing waveform morphology), beat-level (capturing rhythm regularity), and multi-beat (capturing trends and activity transitions). Multi-scale models achieve better performance on tasks requiring different temporal horizons.
Key Foundation Models for Biosignals
UniPPG / Bioformer conceptual models: Several research groups have independently developed large PPG/biosignal Transformers pre-trained on 10,000-100,000 hours of wearable data. While naming conventions vary, shared properties include: 6-12 Transformer blocks, 512-1024 model dimensions, multi-task pre-training (masked autoencoding + contrastive + next-segment prediction).
BIOT (Yang et al., 2023, BIOT: Cross-Data Biosignal Learning in the Wild, NeurIPS) pre-trains a unified Transformer on heterogeneous biosignal datasets (EEG, ECG, and other wearable sensor channels) via a shared patch-based tokenization that tolerates mismatched channel counts and sampling rates. The same recipe applies directly to PPG, supporting strong few-shot performance on tasks such as sleep staging, AF detection, and signal quality classification.
MOMENT (Goswami et al., 2024, MOMENT: A Family of Open Time-Series Foundation Models, ICML) includes PPG in its training corpus alongside general time-series data. While not specifically designed for physiological signals, MOMENT demonstrates strong zero-shot performance on PPG heart rate estimation compared to task-specific baselines, suggesting that general time-series pre-training transfers to PPG.
HeartBEiT (Vaid et al., 2023, npj Digital Medicine) pre-trains a BEiT-style masked model on large volumes of ECG, substantially improving downstream diagnostic performance with limited labels. Although HeartBEiT itself is ECG-only, it is the template for cross-modality work: pre-training with cardiologist-labeled ECG alongside PPG lets ECG supervise PPG representation learning indirectly, improving PPG AF detection.
Pre-Training Objectives for PPG Foundation Models
Masked patch modeling (MPM): Randomly mask 50-75% of PPG patches and train the model to reconstruct them. This self-supervised signal forces the model to understand local waveform morphology (to reconstruct systolic peaks from surrounding context) and global rhythm structure (to reconstruct missing beats from inter-beat timing patterns).
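A toy version of the masked-patch objective, with a placeholder model, looks like this (the mask ratio and the zero-predicting "model" are illustrative only):

```python
import random

def masked_patch_loss(patches, reconstruct, mask_ratio=0.6, seed=0):
    """MSE reconstruction loss computed on masked patches only.
    `reconstruct` stands in for the encoder-decoder: it maps the
    partially masked sequence to a prediction for every position."""
    rng = random.Random(seed)
    n = len(patches)
    masked = set(rng.sample(range(n), int(mask_ratio * n)))
    visible = [None if i in masked else p for i, p in enumerate(patches)]
    preds = reconstruct(visible)
    # As in BERT-style masked modeling, only masked positions count.
    errs = [(preds[i][j] - patches[i][j]) ** 2
            for i in masked for j in range(len(patches[i]))]
    return sum(errs) / len(errs)

# Placeholder "model" that predicts zeros everywhere.
patches = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
loss = masked_patch_loss(patches, lambda v: [[0.0, 0.0] for _ in v])
```

Restricting the loss to masked positions is what forces the encoder to infer missing waveform content from context rather than copying visible input.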
Contrastive learning: As described in our self-supervised PPG article, contrastive objectives push together representations of different augmentations of the same signal and push apart representations of signals from different physiological states.
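For reference, a simplified NT-Xent-style contrastive loss over paired views fits in a few lines. This assumes L2-normalized embeddings and uses only cross-view negatives, so it is a simplification of the losses used in practice:

```python
import math

def nt_xent(z1, z2, temperature=0.1):
    """Simplified contrastive loss: z1[i] and z2[i] are embeddings of
    two augmentations of the same PPG segment; z2[j] for j != i serve
    as negatives. Embeddings are assumed L2-normalized."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    n = len(z1)
    loss = 0.0
    for i in range(n):
        sims = [math.exp(dot(z1[i], z2[j]) / temperature) for j in range(n)]
        loss -= math.log(sims[i] / sum(sims))
    return loss / n

# Perfectly aligned views give a near-zero loss.
views_a = [[1.0, 0.0], [0.0, 1.0]]
views_b = [[1.0, 0.0], [0.0, 1.0]]
loss = nt_xent(views_a, views_b)
```

The temperature controls how sharply the loss penalizes hard negatives; values around 0.05-0.5 are typical in contrastive pre-training.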
Next-segment prediction: Predict the physiological state in the next time window from the current context. This temporal prediction objective forces the model to learn predictive dynamics: how heart rate trends evolve, how motion artifacts build up and decay, how sleep stage transitions occur.
Supervised auxiliary tasks: Include a small fraction of labeled data to anchor the representations to clinically meaningful targets. Even if only 1% of pre-training data has HR labels, using those labels as an auxiliary loss significantly improves HR estimation fine-tuning performance.
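Putting the objectives together, the pre-training loss is typically a weighted sum in which the supervised auxiliary term is only present for the labeled fraction of each batch. The weights below are illustrative assumptions:

```python
def pretraining_loss(mpm_loss, contrastive_loss, hr_loss=None,
                     w_contrastive=1.0, w_aux=0.1):
    """Multi-task pre-training objective. `hr_loss` is None for the
    majority of batches that carry no heart-rate labels."""
    total = mpm_loss + w_contrastive * contrastive_loss
    if hr_loss is not None:
        total += w_aux * hr_loss
    return total
```

Keeping the auxiliary weight small prevents the scarce labeled data from dominating the representation while still anchoring it to a clinically meaningful target.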
Fine-Tuning and Downstream Performance
The foundation model fine-tuning protocol for a new PPG task:
- Initialize with pre-trained foundation model weights
- Attach a task-specific head (linear layer for regression, MLP for classification)
- Fine-tune with a small labeled dataset:
- Phase 1: Train only the task head with frozen backbone (5-10 epochs)
- Phase 2: Unfreeze the last 1-2 Transformer blocks and train all unfrozen parameters with a lower learning rate (1e-5) for 10-20 epochs
- Optional Phase 3: Full model fine-tuning with very low learning rate (1e-6)
- Evaluate with early stopping on a held-out validation set
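The phased protocol above can be captured as a small schedule structure. The block count, the head-phase learning rate, and the phase-3 epoch budget are illustrative assumptions; in a framework such as PyTorch, each phase would map onto optimizer parameter groups with only the listed modules unfrozen:

```python
def build_schedule(n_blocks=12):
    """Three-phase fine-tuning schedule: head-only, partial unfreeze
    of the last two Transformer blocks, then full-model fine-tuning."""
    backbone = [f"block_{i}" for i in range(n_blocks)]
    head = ["task_head"]
    return [
        # (phase name, trainable modules, learning rate, epochs)
        ("head_only", head, 1e-3, 10),
        ("partial_unfreeze", backbone[-2:] + head, 1e-5, 20),
        ("full_finetune", backbone + head, 1e-6, 5),
    ]

schedule = build_schedule()
for phase, modules, lr, epochs in schedule:
    # Training-loop placeholder: only `modules` would receive gradients,
    # with early stopping on the held-out validation set.
    pass
```

Dropping the learning rate as more layers unfreeze protects the pre-trained representations from being destroyed by large gradients from the randomly initialized head.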
Published benchmark results for BIOT-style foundation model fine-tuning vs. task-specific supervised training from scratch:
| Task | From Scratch (100 labeled subjects) | Foundation Fine-Tune (20 labeled subjects) |
|---|---|---|
| HR estimation (MAE, BPM) | 2.1 | 1.9 |
| AF detection (AUC) | 0.91 | 0.93 |
| Sleep staging 4-class (F1) | 0.58 | 0.64 |
| Signal quality (F1) | 0.84 | 0.87 |
Foundation model fine-tuning with 5x fewer labeled subjects achieves better performance than training from scratch. This is the core value proposition.
Challenges and Open Problems
Compute requirements: Training a 100M+ parameter foundation model on 50,000+ hours of wearable PPG requires significant compute (on the order of hundreds to thousands of GPU-hours), plus the data engineering to curate and stream such a corpus. This is accessible to large tech companies and academic consortia but rarely to individual research groups. Open-source pre-trained checkpoints (analogous to the Hugging Face model hub) are necessary for broad access.
Domain coverage: A foundation model pre-trained primarily on green-LED wrist PPG may not generalize well to red-LED fingertip or forehead PPG. Multi-device, multi-wavelength pre-training is essential but requires coordinated multi-institutional data collection.
Physiological distribution shift: Foundation models pre-trained on consumer wearable data (predominantly healthy, young, active users) may underperform for clinical populations (elderly, ill, low-perfusion patients). Domain-specific fine-tuning and class-balanced pre-training data are partial mitigations.
Interpretability at scale: Large Transformer-based foundation models are harder to interpret than small task-specific CNNs. Applying XAI methods (see our XAI for PPG article) to 12-layer Transformers produces noisier attributions than for 4-layer CNNs.
The Path to Clinical Foundation Models
A clinically deployable PPG foundation model would need:
- Pre-training on a consortium dataset with demographic diversity (skin tone, age, BMI, health condition)
- Multi-site validation demonstrating consistent performance across device types
- Clinical concept alignment (interpretable intermediate representations that correspond to established physiological variables)
- Regulatory documentation supporting the claim that the pre-trained representations are a valid starting point for fine-tuning clinical classifiers
- An open checkpoint registry to enable research reproducibility and fair comparison
The PPG field is several years behind LLMs on this maturity curve but moving rapidly: research groups and companies are increasingly converging on foundation-model pre-training as the default PPG AI development paradigm.
For related content, see PPG transformer models, self-supervised learning for PPG, few-shot learning for PPG, and PPG deep learning heart rate estimation.
Key Papers
- Yang, C., Westover, M. B., & Sun, J. (2023). BIOT: Cross-data biosignal learning in the wild. NeurIPS. https://doi.org/10.48550/arXiv.2305.10351
- Goswami, M. et al. (2024). MOMENT: A family of open time-series foundation models. ICML. https://doi.org/10.48550/arXiv.2402.03885
- Vaswani, A. et al. (2017). Attention is all you need. NeurIPS. https://doi.org/10.48550/arXiv.1706.03762
- Vaid, A. et al. (2023). A foundational vision transformer improves diagnostic performance for electrocardiograms. npj Digital Medicine, 6, 171. https://doi.org/10.1038/s41746-023-00914-9
FAQ
How is a PPG foundation model different from transfer learning from ImageNet? Transfer learning from ImageNet transfers visual feature detectors (edges, textures, shapes) to a new visual task. PPG foundation model transfer provides physiological signal feature detectors (pulse morphology, rhythm patterns, spectral components) to a new physiological task. The analogy is close: both exploit the fact that features learned on diverse, large-scale data generalize broadly.
Do PPG foundation models work without any fine-tuning at all (zero-shot)? For some tasks, yes. MOMENT demonstrates non-trivial zero-shot PPG heart rate estimation because its general time-series pre-training already captures the periodic structure that determines heart rate. Zero-shot AF detection is harder: it requires knowledge of rhythm irregularity that general time-series pre-training does not capture. Zero-shot capability is therefore task-dependent and currently below fine-tuned performance for most clinical applications.
What is the minimum size for a PPG foundation model to be useful? Research suggests that ~10M parameters is a reasonable lower bound for meaningful zero-shot transfer. Smaller models (< 1M parameters) benefit from foundation model pre-training in the few-shot setting but show limited zero-shot capability. The sweet spot for clinical PPG applications is 50-200M parameters: powerful enough for broad generalization, small enough to fine-tune on modest compute.
Can a foundation model pre-trained on PPG be used for ECG tasks? Cross-modality transfer depends on modality similarity. PPG and ECG share cardiac timing information but have different waveform morphologies. Foundation models trained on both modalities simultaneously (like BIOT) transfer better between modalities than PPG-only models. PPG-only foundation models show limited but non-zero transfer to ECG tasks, particularly for rhythm-based applications where timing patterns dominate.
How long does PPG foundation model fine-tuning take in practice? With a frozen backbone and a task-specific head, fine-tuning converges in 10-20 epochs on a dataset of 20-50 subjects, which runs in 30-60 minutes on a single GPU. Full model fine-tuning adds 1-3 hours. This is dramatically faster than training from scratch (8-24 hours for equivalent datasets), which is one of the practical advantages of the foundation model paradigm.