Transformer Architectures for PPG Signal Analysis: Self-Attention, Vision Transformers & Foundation Models
Technical guide to transformer models for PPG analysis covering self-attention mechanisms, positional encoding, ViTs, and pre-trained foundation models.
Transformer models are rapidly becoming the architecture of choice for PPG signal analysis, bringing the same revolution to biosignal processing that they brought to natural language processing and computer vision. The self-attention mechanism at the heart of the transformer enables direct modeling of long-range temporal dependencies in PPG signals, capturing relationships between distant cardiac cycles that convolutional and recurrent networks struggle to learn. Combined with self-supervised pre-training on large unlabeled PPG datasets, transformers are establishing new state-of-the-art results across heart rate estimation, arrhythmia detection, blood pressure prediction, and multi-task health monitoring.
This guide provides a technical deep dive into transformer architectures as applied to PPG signals, covering the self-attention mechanism, positional encoding strategies for physiological time series, Vision Transformer (ViT) adaptations, efficient attention variants for embedded deployment, and the emerging paradigm of PPG foundation models. For background on the CNN and LSTM architectures that transformers are building upon, see our guide on deep learning for PPG heart rate estimation.
Self-Attention for Physiological Signals
The core innovation of the transformer architecture (Vaswani et al., 2017) is the self-attention mechanism, which computes a weighted sum of all positions in a sequence for each position, with weights determined by pairwise compatibility scores. For a PPG signal represented as a sequence of feature vectors, self-attention computes:
Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V
where Q (queries), K (keys), and V (values) are linear projections of the input, and d_k is the dimension of the key vectors. The attention weights (the softmax output) indicate how much each time step attends to every other time step.
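The attention equation above can be sketched directly in NumPy. This is an illustrative single-head version; the function names and dimensions are chosen for the example, not taken from any particular PPG model:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    x: (n, d_model) sequence of PPG feature vectors
    w_q, w_k, w_v: (d_model, d_k) projection matrices
    Returns the attended output (n, d_k) and the (n, n) attention weights.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)          # pairwise compatibility scores
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
n, d_model, d_k = 32, 16, 8                  # 32 tokens from one PPG segment
x = rng.standard_normal((n, d_model))
w_q, w_k, w_v = (rng.standard_normal((d_model, d_k)) * 0.1 for _ in range(3))
out, weights = self_attention(x, w_q, w_k, w_v)
print(out.shape, weights.shape)              # (32, 8) (32, 32)
```

The `weights` matrix is the quantity inspected in the attention-visualization studies discussed below: row i shows how strongly token i attends to every other token.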
For PPG signals, self-attention provides several advantages over convolutional and recurrent alternatives:
Global receptive field in a single layer. A single self-attention layer can model dependencies across the entire input sequence. In contrast, a 1D-CNN requires multiple stacked layers to achieve a large receptive field (each layer adds kernel_size - 1 to the receptive field), and LSTMs process sequentially with information decay over long distances. For PPG analysis tasks that require comparing cardiac morphology across beats separated by 10-30 seconds (e.g., assessing heart rate variability trends or detecting intermittent arrhythmias), self-attention provides direct access to these long-range relationships.
Interpretable attention patterns. The attention weight matrices reveal which parts of the PPG signal the model considers most informative for each output prediction. In practice, trained PPG transformers show attention patterns that concentrate on systolic peaks and dicrotic notches, confirming that the model learns physiologically meaningful features. This interpretability advantage is valuable for clinical applications where algorithm transparency is required. For more on how peak features relate to cardiac physiology, see our PPG technology overview.
Parallel computation. Unlike LSTMs, which must process time steps sequentially, self-attention computes all pairwise interactions simultaneously. This enables efficient GPU training on long PPG sequences, reducing training time by 3-10x compared to equivalent-capacity LSTM models.
The primary disadvantage is the O(n^2) complexity of self-attention, where n is the sequence length. For a 30-second PPG segment at 100 Hz (3,000 samples), the attention matrix contains 9 million entries, which is computationally intensive and memory-demanding. Efficient attention variants (discussed below) address this limitation.
Positional Encoding for PPG
Transformers are permutation-invariant by default: without positional information, self-attention produces the same output regardless of the order of input tokens. Since temporal ordering is fundamental to PPG analysis (the sequence of cardiac cycles matters), positional encoding must be added to the input.
Sinusoidal Positional Encoding
The original transformer uses fixed sinusoidal positional encodings:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
For PPG signals, these encodings work reasonably well but are suboptimal because the physiologically relevant time scales (cardiac cycle period, respiratory period) are not aligned with the fixed frequency set of the sinusoidal encoding.
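The two equations above can be vectorized in a few lines; this sketch follows the standard formulation, with even dimensions carrying sine terms and odd dimensions carrying cosine terms:

```python
import numpy as np

def sinusoidal_encoding(n_positions, d_model):
    """Fixed sinusoidal positional encoding from Vaswani et al. (2017)."""
    pos = np.arange(n_positions)[:, None]            # (n, 1) positions
    i = np.arange(d_model // 2)[None, :]             # (1, d_model/2) dim index
    angle = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angle)                      # PE(pos, 2i)
    pe[:, 1::2] = np.cos(angle)                      # PE(pos, 2i+1)
    return pe

pe = sinusoidal_encoding(32, 128)   # one encoding row per PPG patch token
print(pe.shape)                     # (32, 128)
```

The encoding is simply added to the token embeddings before the first attention layer.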
Learned Positional Encoding
Learned positional embeddings, where the encoding for each position is a trainable parameter, adapt to the specific temporal structure of PPG signals during training. However, they do not generalize to sequence lengths unseen during training, which is problematic for PPG applications where recording duration varies.
Cardiac-Cycle-Aware Positional Encoding
Natarajan et al. (2022) proposed a physiologically-informed positional encoding for PPG transformers that incorporates the estimated cardiac cycle phase at each time step. Rather than encoding absolute position in the sequence, this approach encodes the relative phase within the current cardiac cycle (0 to 2*pi, wrapping at each detected systolic peak). This encoding provides the transformer with an explicit representation of cardiac timing, improving heart rate estimation MAE from 2.8 to 2.1 BPM on the PPG-DaLiA dataset compared to standard sinusoidal encoding (DOI: 10.1109/JBHI.2022.3168241).
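As an illustrative sketch of the idea (not the published implementation), the cycle-relative phase can be computed from detected systolic peaks and embedded with a few sin/cos harmonics; the function names, harmonic count, and peak indices here are all assumptions for the example:

```python
import numpy as np

def cardiac_phase(n_samples, peak_indices):
    """Phase in [0, 2*pi) within each cardiac cycle, wrapping at each peak."""
    phase = np.zeros(n_samples)
    for start, end in zip(peak_indices[:-1], peak_indices[1:]):
        # Linear ramp from 0 to 2*pi across one beat, resetting at the next peak.
        phase[start:end] = np.linspace(0, 2 * np.pi, end - start, endpoint=False)
    return phase

def phase_encoding(phase, d_model):
    # Embed the phase as sin/cos at harmonics of the cardiac cycle.
    harmonics = np.arange(1, d_model // 2 + 1)[None, :]
    ang = phase[:, None] * harmonics
    return np.concatenate([np.sin(ang), np.cos(ang)], axis=1)

peaks = np.array([0, 64, 130, 192])        # detected systolic peaks (sample indices)
phase = cardiac_phase(200, peaks)
enc = phase_encoding(phase, 8)             # (200, 8) phase-based encoding
```

Unlike absolute positional encodings, this representation is invariant to heart rate changes: the same phase value always marks the same point within a beat.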
Transformer Architectures for PPG
Pure Transformer Encoder
The simplest transformer architecture for PPG is a stack of encoder layers (self-attention + feed-forward network) that processes the PPG sequence and produces a fixed-length representation via global average pooling or a [CLS] token. This representation is then passed through a regression or classification head.
Song et al. (2023) applied a 6-layer transformer encoder with 8 attention heads and d_model = 128 to PPG heart rate estimation. The input PPG signal (8 seconds at 64 Hz = 512 samples) was divided into non-overlapping patches of 16 samples each, producing 32 tokens. Each patch was linearly projected to d_model dimensions, and standard sinusoidal positional encoding was added. The model achieved MAE of 1.68 BPM on the IEEE SP Cup 2015 dataset, outperforming the best CNN-LSTM models while using 30% fewer parameters.
The patch-based tokenization strategy is critical for computational efficiency. Processing each sample as a separate token would create a 512-length sequence with a 512x512 attention matrix. Patch-based tokenization reduces this to 32x32, a 256x reduction in attention computation with minimal information loss, since adjacent PPG samples are highly correlated and can be effectively summarized by linear projection.
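Patch tokenization as described for the Song et al. configuration amounts to a reshape plus a linear projection. This sketch uses a random projection matrix for illustration; in a trained model `w_proj` is a learned parameter:

```python
import numpy as np

def patch_tokenize(ppg, patch_size, w_proj):
    """Split a 1-D PPG signal into non-overlapping patches and linearly
    project each patch to a d_model-dimensional token embedding."""
    n_patches = len(ppg) // patch_size
    patches = ppg[: n_patches * patch_size].reshape(n_patches, patch_size)
    return patches @ w_proj                # (n_patches, d_model)

rng = np.random.default_rng(1)
ppg = rng.standard_normal(512)             # 8 s at 64 Hz = 512 samples
w_proj = rng.standard_normal((16, 128)) * 0.1   # patch_size -> d_model
tokens = patch_tokenize(ppg, 16, w_proj)
print(tokens.shape)                        # (32, 128): 32 tokens, d_model = 128
```

The resulting 32-token sequence yields a 32x32 attention matrix instead of 512x512, which is the 256x reduction noted above.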
Vision Transformer (ViT) Adaptation
The Vision Transformer architecture (Dosovitskiy et al., 2020) can be adapted for PPG by treating time-frequency representations (spectrograms) as images. The spectrogram is divided into patches, each patch is linearly projected into a token embedding, and the token sequence is processed by a transformer encoder.
Huang et al. (2023) applied ViT to PPG analysis by computing continuous wavelet transform (CWT) scalograms of 10-second PPG segments and dividing them into 16x16 patches. The ViT model (12 layers, 12 heads, d_model = 192) was pre-trained on ImageNet and fine-tuned on PPG scalograms. Despite the domain gap between natural images and PPG scalograms, transfer learning from ImageNet provided useful low-level feature extraction capabilities. The ViT achieved 96.3% accuracy for atrial fibrillation detection and heart rate MAE of 1.92 BPM, both competitive with specialized CNN architectures.
The advantage of the ViT approach is access to the large ecosystem of pre-trained vision models and training techniques. The disadvantage is the computational overhead of computing time-frequency representations and the loss of fine temporal resolution inherent in the time-frequency trade-off. For applications focused purely on heart rate estimation, 1D patch-based transformers operating on raw PPG are more efficient. For applications requiring morphological analysis (arrhythmia detection, blood pressure estimation), the time-frequency representation provides richer features.
Hybrid CNN-Transformer
Combining CNN feature extraction with transformer temporal modeling yields architectures that leverage the strengths of both approaches: CNNs efficiently extract local features from short PPG segments, and transformers model long-range dependencies across segments.
Ma et al. (2023) proposed PPGformer, a hybrid architecture where a lightweight 1D-CNN (3 convolutional blocks) processes each 2-second PPG segment to produce a feature vector, and a 4-layer transformer encoder processes the sequence of feature vectors spanning 30 seconds of recording. This hierarchical design achieves a large effective receptive field (30 seconds) while keeping the transformer sequence length manageable (15 tokens at 2-second resolution).
PPGformer achieved heart rate MAE of 1.52 BPM on the IEEE SP Cup 2015 dataset and MAE of 2.87 BPM on PPG-DaLiA, setting new state-of-the-art results on both benchmarks. The attention analysis revealed that the model learned to attend heavily to signal segments with high signal quality (high perfusion index) while downweighting motion-corrupted segments, effectively implementing an adaptive signal quality-based weighting scheme without explicit SQI computation.
Efficient Attention for Wearable Deployment
Standard self-attention is impractical for wearable deployment due to its O(n^2) memory and computation requirements. Several efficient attention variants have been applied to PPG processing.
Linear Attention
Linear attention (Katharopoulos et al., 2020) replaces the softmax attention kernel with a linear kernel, reducing complexity from O(n^2) to O(n). The key modification is:
LinearAttention(Q, K, V) = phi(Q) * (phi(K)^T * V) / (phi(Q) * sum_j phi(K_j))
where phi() is a feature map (typically elu(x) + 1) and the denominator normalizes each output row, playing the role of the softmax normalization. By rewriting the computation in this associative form, the summary matrix phi(K)^T * V can be computed once in O(n * d^2) time, after which each query requires only O(d^2) work, so total cost grows linearly in sequence length.
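A minimal sketch of this factorization, using the elu(x) + 1 feature map; note that the full n x n attention matrix is never materialized:

```python
import numpy as np

def linear_attention(q, k, v):
    """O(n) linear attention in the style of Katharopoulos et al. (2020).

    Applies phi(x) = elu(x) + 1 to queries and keys, computes the (d, d)
    summary phi(K)^T V once, then normalizes each row by phi(q_i)^T sum_j phi(k_j).
    """
    def phi(x):                       # elu(x) + 1: positive feature map
        return np.where(x > 0, x + 1.0, np.exp(x))
    q, k = phi(q), phi(k)
    kv = k.T @ v                      # (d, d) summary, computed once: O(n * d^2)
    z = q @ k.sum(axis=0)             # per-query normalizer
    return (q @ kv) / z[:, None]

rng = np.random.default_rng(2)
n, d = 3000, 64                       # 30 s of PPG at 100 Hz, one token per sample
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
out = linear_attention(q, k, v)
print(out.shape)                      # (3000, 64), with no 3000 x 3000 matrix
```

For a 3,000-sample sequence this replaces a 9-million-entry attention matrix with a 64x64 summary, which is what makes 30-second windows tractable on microcontroller-class hardware.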
For PPG signals, linear attention achieves within 5-10% of the accuracy of full softmax attention while enabling real-time processing of 30-second sequences on ARM Cortex-M7 processors. The accuracy gap is most noticeable for tasks requiring precise temporal localization (such as peak detection), where the sharp attention patterns of softmax attention are beneficial.
Sparse Attention
Sparse attention restricts each position to attend to only a subset of other positions, reducing computation proportionally. For PPG, physiologically motivated sparsity patterns include:
- Local attention: Each position attends to its nearest k neighbors. This captures within-beat morphological features and is appropriate for pulse waveform analysis.
- Strided attention: Each position attends to every s-th position across the full sequence. With stride s set to the approximate cardiac cycle length, this captures beat-to-beat comparisons directly.
- Combined local + strided: The combination captures both local morphology and long-range rhythm patterns.
Child et al. (2019) proposed Sparse Transformers with combined local and strided attention, reducing complexity to O(n * sqrt(n)). Applied to PPG with local window k = 32 and stride s = 64 (approximately one cardiac cycle at 64 Hz), this pattern achieves 95% of full attention accuracy for heart rate estimation while reducing computation by 8x.
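A combined local + strided pattern of the kind just described can be expressed as a boolean mask over the attention matrix; the window and stride values here mirror the example above but are otherwise illustrative:

```python
import numpy as np

def local_strided_mask(n, k, s):
    """Boolean attention mask combining a local window of width k with
    strided attention to every s-th position (Child et al., 2019 style)."""
    idx = np.arange(n)
    local = np.abs(idx[:, None] - idx[None, :]) <= k // 2   # within-beat context
    strided = (idx[None, :] % s) == 0                       # beat-to-beat grid
    return local | strided

mask = local_strided_mask(256, k=32, s=64)  # s ~ one cardiac cycle at 64 Hz
print(f"{mask.mean():.1%} of full attention retained")
```

Positions outside the mask are set to -inf before the softmax, so the model pays no compute or memory for them in a sparse implementation.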
Performer
The Performer architecture (Choromanski et al., 2020) approximates softmax attention using random feature maps, achieving O(n) complexity with provable approximation guarantees. The FAVOR+ mechanism projects queries and keys into a random feature space where the attention can be computed linearly.
Performers have been applied to long PPG recordings (minutes to hours) for tasks requiring extended temporal context, such as sleep stage classification from PPG, circadian heart rate pattern analysis, and long-term arrhythmia monitoring. The ability to process 10-minute PPG segments (60,000 samples at 100 Hz) in a single forward pass enables modeling of autonomic patterns that are invisible in shorter windows. For more information on how these patterns relate to clinical conditions, see our conditions overview.
Self-Supervised Pre-Training for PPG
The most significant advantage of transformers for PPG may be their compatibility with self-supervised pre-training on large unlabeled datasets. While labeled PPG data (with concurrent ECG or manual annotations) is scarce and expensive to collect, unlabeled PPG data is abundant: billions of hours are recorded daily by consumer wearables. Self-supervised pre-training leverages this unlabeled data to learn general-purpose PPG representations.
Masked Signal Modeling
Inspired by BERT's masked language modeling objective, masked signal modeling randomly masks portions of the PPG input and trains the model to reconstruct the masked segments. The model must learn the statistical structure of PPG signals (cardiac morphology, rhythm patterns, respiratory modulation) to accurately predict the masked content.
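The masking step of this objective is straightforward; a sketch, with the mask token taken as a zero vector for illustration (in practice it is usually a learned embedding):

```python
import numpy as np

def mask_patches(tokens, mask_ratio, rng, mask_token):
    """Randomly mask a fraction of patch tokens for masked signal modeling.

    Returns the corrupted token sequence and a boolean mask; the model is
    trained to reconstruct the original values at the masked positions.
    """
    n = tokens.shape[0]
    n_mask = int(round(n * mask_ratio))
    masked_idx = rng.choice(n, size=n_mask, replace=False)
    mask = np.zeros(n, dtype=bool)
    mask[masked_idx] = True
    corrupted = tokens.copy()
    corrupted[mask] = mask_token
    return corrupted, mask

rng = np.random.default_rng(3)
tokens = rng.standard_normal((32, 128))        # patch tokens from one segment
mask_token = np.zeros(128)                     # learned vector in practice
corrupted, mask = mask_patches(tokens, 0.15, rng, mask_token)
# Reconstruction loss is computed on masked positions only, e.g.:
# loss = ((model(corrupted)[mask] - tokens[mask]) ** 2).mean()
```

Restricting the loss to masked positions forces the model to infer the hidden content from surrounding cardiac context rather than copy its input.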
Liu et al. (2023) pre-trained a transformer on 50,000 hours of unlabeled wrist PPG data using masked signal modeling, masking 15% of input patches at random. The pre-trained model was then fine-tuned for heart rate estimation, SpO2 prediction, and atrial fibrillation detection. Across all three tasks, the pre-trained model outperformed equivalent architectures trained from scratch:
- Heart rate MAE: 1.43 BPM (pre-trained) vs. 2.12 BPM (from scratch) on PPG-DaLiA
- SpO2 MAE: 0.89% (pre-trained) vs. 1.34% (from scratch) on a clinical dataset of 200 patients
- AFib detection: AUROC 0.974 (pre-trained) vs. 0.941 (from scratch) on a dataset of 500 patients
The pre-trained model required only 10% of the labeled data to match the accuracy of the from-scratch model trained on the full labeled dataset, demonstrating the data efficiency gains of self-supervised pre-training (DOI: 10.1038/s41746-023-00890-3).
Contrastive Learning
Contrastive learning pre-training objectives learn representations by encouraging similar PPG segments (e.g., from the same subject, same activity level, or same cardiac state) to have similar representations while pushing dissimilar segments apart. This produces representations that capture physiologically meaningful factors of variation.
Kiyasseh et al. (2021) proposed CLOCS (Contrastive Learning of Cardiac Signals), a contrastive framework applied to both ECG and PPG that used temporal proximity as the similarity criterion: PPG segments from nearby time points were treated as positive pairs, while segments from distant time points or different subjects were negative pairs. CLOCS pre-training improved downstream arrhythmia classification accuracy by 7-15% across three benchmark datasets and demonstrated strong transfer learning capability between ECG and PPG domains (DOI: 10.1038/s41467-021-25767-0). For more details on how PPG signals encode cardiac rhythm information, see our HRV analysis algorithms page.
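A generic InfoNCE-style contrastive objective of the kind these frameworks build on can be sketched as follows; this is not the exact CLOCS loss, and the temperature and batch construction are illustrative:

```python
import numpy as np

def info_nce_loss(anchors, positives, temperature=0.1):
    """InfoNCE contrastive loss: row i of `positives` is the positive pair
    for row i of `anchors`; all other rows in the batch act as negatives."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature               # (batch, batch) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.diag(log_prob).mean()             # cross-entropy on matched pairs

rng = np.random.default_rng(4)
emb = rng.standard_normal((8, 64))               # embeddings of 8 PPG segments
loss_matched = info_nce_loss(emb, emb + 0.01 * rng.standard_normal((8, 64)))
loss_random = info_nce_loss(emb, rng.standard_normal((8, 64)))
print(loss_matched < loss_random)                # near-duplicate pairs score lower loss
```

In a PPG setting the anchor/positive pairs would be temporally adjacent segments (or two augmented views of one segment), with segments from distant time points or other subjects serving as negatives.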
PPG Foundation Models
The convergence of large-scale PPG data, transformer architectures, and self-supervised pre-training is driving the emergence of PPG foundation models: large, general-purpose models that can be adapted to any PPG analysis task with minimal task-specific data.
Architecture and Scale
Current PPG foundation models typically use transformer encoders with 6-24 layers, 4-16 attention heads, and d_model of 128-512, totaling 5-100 million parameters. While small by NLP standards (GPT-3 has 175 billion parameters), these models are large for biosignal processing and require careful pre-training on diverse, large-scale PPG datasets.
Yuan et al. (2024) described a PPG foundation model pre-trained on 100,000 hours of wrist PPG from 25,000 subjects spanning 6 countries, 4 wearable device types, and ages 12-90. The model (24 layers, 16 heads, d_model = 512, 85M parameters) was pre-trained using a combination of masked signal modeling and contrastive learning objectives for 500 GPU-hours on A100 hardware. The pre-trained model achieved state-of-the-art results on 8 out of 10 downstream evaluation tasks when fine-tuned with as few as 100 labeled examples per task.
Multi-Task Fine-Tuning
A key advantage of foundation models is their ability to support multiple downstream tasks simultaneously through multi-task fine-tuning. A single pre-trained PPG transformer with task-specific output heads can simultaneously provide:
- Continuous heart rate estimation (regression head)
- Heart rate variability metrics (regression head for RMSSD, SDNN, pNN50)
- Atrial fibrillation screening (binary classification head)
- Blood oxygen saturation (regression head, requiring multi-wavelength input)
- Respiratory rate estimation (regression head)
- Stress level assessment (multi-class classification head)
- Blood pressure estimation (regression head)
This multi-task approach is computationally efficient (shared backbone, minimal per-task overhead) and beneficial for accuracy (shared representations transfer knowledge between related tasks). For an overview of PPG-derived health metrics, see our algorithms overview.
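Structurally, the shared-backbone design reduces to one pooled embedding feeding several small heads. This sketch uses plain linear heads on a hypothetical backbone output; the task names and dimensions are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical shared-backbone output: one pooled embedding per PPG segment.
d_model, batch = 128, 4
embedding = rng.standard_normal((batch, d_model))

# Task-specific linear heads attached to the shared representation.
heads = {
    "heart_rate": rng.standard_normal((d_model, 1)) * 0.1,   # regression
    "hrv_metrics": rng.standard_normal((d_model, 3)) * 0.1,  # RMSSD, SDNN, pNN50
    "afib_logits": rng.standard_normal((d_model, 2)) * 0.1,  # binary classification
}
outputs = {task: embedding @ w for task, w in heads.items()}
for task, y in outputs.items():
    print(task, y.shape)
```

All heads share every backbone FLOP; the per-task overhead is a single matrix multiply, which is why adding tasks to an existing deployment is nearly free.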
Benchmark Results: Transformers vs. Prior Art
The following table compares transformer-based PPG analysis methods against prior CNN and LSTM approaches on established benchmarks:
| Method | Architecture | HR MAE (IEEE SP Cup) | HR MAE (PPG-DaLiA) | AFib AUROC |
|--------|--------------|----------------------|--------------------|------------|
| DeepPPG (Reiss 2019) | 1D-CNN | - | 3.42 | - |
| CorNET (Biswas 2019) | 2D-CNN | 1.47 | - | - |
| PPGnet (Shyam 2019) | CNN-LSTM | 1.97 | - | - |
| Song et al. (2023) | Transformer | 1.68 | - | - |
| PPGformer (Ma 2023) | CNN-Transformer | 1.52 | 2.87 | - |
| Pre-trained (Liu 2023) | Foundation model | - | 1.43 | 0.974 |
| CLOCS (Kiyasseh 2021) | Contrastive + Transformer | - | - | 0.961 |
The trend is clear: transformer-based methods achieve the lowest error rates, with pre-trained foundation models providing the best results across multiple tasks simultaneously. The gap is most pronounced on multi-task evaluations, where the shared representation learned during pre-training provides substantial advantages.
Practical Recommendations
For researchers and engineers considering transformer architectures for PPG analysis, we offer the following guidance:
Start with a hybrid CNN-Transformer. Pure transformers require substantial data and compute. A lightweight CNN front-end with a small (2-4 layer) transformer encoder provides most of the benefits of attention-based modeling while remaining trainable on moderate-sized datasets (20-50 subjects).
Use patch-based tokenization. Processing each PPG sample as a separate token is computationally prohibitive. Patches of 16-32 samples (150-300 ms at 100 Hz) provide a good balance between temporal resolution and computational efficiency.
Pre-train when possible. If unlabeled PPG data is available (from the same or similar sensors), self-supervised pre-training with masked signal modeling provides consistent accuracy improvements of 15-30% across downstream tasks.
Apply efficient attention for deployment. For wearable deployment, replace softmax attention with linear attention or sparse attention patterns tuned to the cardiac cycle period. This typically preserves 90-95% of accuracy while reducing computation to O(n) complexity.
Leverage multi-task learning. If multiple PPG analysis tasks are needed (which is typical in wearable health platforms), a shared transformer backbone with task-specific heads is more efficient and often more accurate than separate models for each task.
Conclusion
Transformer architectures represent the current frontier of PPG signal analysis, offering superior long-range dependency modeling, interpretable attention patterns, and compatibility with self-supervised pre-training. The combination of large-scale unlabeled PPG data from consumer wearables and the transformer's ability to learn general-purpose representations through self-supervised objectives is enabling the development of PPG foundation models that achieve state-of-the-art results across multiple clinical tasks. While computational constraints currently limit wearable deployment of full-scale transformers, efficient attention variants and model compression techniques are closing this gap. For the PPG research community, transformers offer not just better accuracy but a new paradigm for building reusable, multi-task health monitoring systems from the same underlying optical signal.
References
- Natarajan et al. (2022). Cardiac-cycle-aware positional encoding for PPG transformers. DOI: 10.1109/JBHI.2022.3168241
- Liu et al. (2023). Self-supervised masked signal modeling pre-training on 50,000 hours of wrist PPG. DOI: 10.1038/s41746-023-00890-3
- Kiyasseh et al. (2021). CLOCS: Contrastive Learning of Cardiac Signals. DOI: 10.1038/s41467-021-25767-0
Frequently Asked Questions
- Why use transformers instead of CNNs or LSTMs for PPG analysis?
- Transformers offer two key advantages over CNNs and LSTMs for PPG. First, the self-attention mechanism captures long-range dependencies in a single layer, whereas CNNs require deep stacks to achieve large receptive fields and LSTMs struggle with very long sequences due to vanishing gradients. For PPG, this enables modeling relationships between distant cardiac cycles (e.g., respiratory modulation spanning 10-20 beats). Second, transformers process all time steps in parallel during training, which is significantly faster than the sequential computation required by LSTMs. However, transformers require more training data and have higher memory complexity (quadratic in sequence length), which can be limiting for long PPG recordings.
- How much data do PPG transformer models need for training?
- Transformer models are notoriously data-hungry compared to CNNs. For PPG heart rate estimation, a transformer trained from scratch typically requires 50-200 subjects with diverse activities to match CNN-LSTM performance. Self-supervised pre-training on unlabeled PPG data (which is abundant from consumer wearables) can dramatically reduce the labeled data requirement. Studies show that pre-training on 10,000+ hours of unlabeled PPG followed by fine-tuning on 10-20 labeled subjects achieves accuracy comparable to supervised models trained on 100+ subjects. Transfer learning from pre-trained models offers the most practical path for researchers with limited labeled datasets.
- Can transformers run in real time on wearable devices for PPG processing?
- Standard transformer architectures are challenging to deploy on wearable devices due to the quadratic memory and compute complexity of self-attention. However, several efficiency techniques make wearable deployment feasible. Linear attention variants reduce complexity from O(n^2) to O(n). Quantization (INT8) reduces model size by 4x. Sparse attention patterns that attend only to nearby and periodically distant time steps reduce computation while preserving physiologically relevant attention patterns. Optimized transformer models with these techniques have been demonstrated on ARM Cortex-M7 processors with inference times under 30 ms for 10-second PPG segments.
- What is a PPG foundation model and why does it matter?
- A PPG foundation model is a large transformer pre-trained on massive amounts of unlabeled PPG data using self-supervised learning objectives (such as masked signal reconstruction or contrastive learning). Once pre-trained, the model learns general-purpose PPG representations that capture cardiac morphology, rhythm patterns, and physiological variability. This pre-trained model can then be fine-tuned for any downstream task (heart rate, HRV, SpO2, blood pressure, AFib detection) with a small amount of labeled data. Foundation models matter because they amortize the cost of learning PPG signal structure across many tasks and datasets, potentially solving the chronic data scarcity problem in PPG research.