Deep Learning for PPG Heart Rate Estimation: CNN, LSTM & Hybrid Architectures
Technical guide to CNN and LSTM deep learning architectures for PPG-based heart rate estimation, covering model design, training, and benchmark results.
Deep learning has fundamentally changed the landscape of PPG heart rate estimation, achieving accuracy levels that were unreachable with traditional signal processing just five years ago. Convolutional Neural Networks (CNNs) automatically learn optimal spectral and morphological features from raw PPG data, while Long Short-Term Memory (LSTM) networks capture the temporal dynamics of heart rate evolution. Hybrid CNN-LSTM architectures combine both capabilities to achieve state-of-the-art results on every major benchmark dataset.
This guide provides a comprehensive technical overview of deep learning architectures for PPG-based heart rate estimation, covering input representation, 1D and 2D CNN designs, recurrent architectures, hybrid models, training strategies, and practical deployment considerations. For context on the traditional signal processing methods that deep learning aims to supersede, see our guides on motion artifact removal and Kalman filter-based heart rate tracking.
Why Deep Learning for PPG Heart Rate?
Traditional PPG heart rate estimation follows a sequential pipeline: preprocessing, motion artifact removal, spectral estimation, peak selection, and temporal tracking. Each stage involves design decisions and hand-tuned parameters that are optimized independently. The result is a brittle pipeline where errors in early stages propagate and amplify through subsequent stages.
Deep learning offers an alternative paradigm: end-to-end learning from raw (or minimally preprocessed) PPG and accelerometer signals directly to heart rate output. The network learns to jointly handle noise removal, feature extraction, and temporal tracking in a single optimized system. This joint optimization avoids the suboptimal parameter choices that arise from stage-by-stage design and enables the model to discover signal features that human engineers may not have considered.
The empirical evidence strongly supports this approach. On the IEEE Signal Processing Cup 2015 dataset, which has become the de facto benchmark for PPG heart rate estimation during exercise, deep learning methods occupy 8 of the top 10 positions on published leaderboards. The gap between deep learning and traditional methods widens as motion intensity increases, precisely the regime where hand-designed algorithms struggle most.
However, deep learning is not a universal improvement. Traditional methods maintain advantages in interpretability (understanding why a particular estimate was produced), data efficiency (working with limited training data), computational efficiency (running on low-power processors), and generalization to out-of-distribution scenarios (demographics, sensor types, or activities not represented in training data). The choice between approaches depends on application requirements, available data, and deployment constraints.
1D Convolutional Neural Networks for PPG
One-dimensional CNNs process the raw PPG time series directly, applying learned convolutional filters that detect temporal patterns at multiple scales. The hierarchical feature extraction capability of CNNs naturally maps to the multi-scale structure of PPG signals: short filters capture individual pulse morphology, medium filters capture inter-beat dynamics, and long filters (or deep networks with large receptive fields) capture respiratory and autonomic modulations.
DeepPPG Architecture
Reiss et al. (2019) proposed DeepPPG, one of the first systematic deep learning approaches to PPG heart rate estimation. The architecture processes 8-second windows of 4-channel input (PPG + 3-axis accelerometer) at 64 Hz through a series of 1D convolutional layers:
- Input: 512 samples x 4 channels
- 4 convolutional blocks, each containing: Conv1D (kernel size 5) -> BatchNorm -> ReLU -> MaxPool(2)
- Channel sizes: 64, 128, 256, 256
- 2 fully connected layers: 512, 128
- Output: single heart rate value (regression)
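The block structure above can be written as a short PyTorch sketch. The layer sizes come from the text; the "same" padding (and hence the 256 x 32 flatten size) is an assumption, since the original paper's exact padding is not specified here:

```python
import torch
import torch.nn as nn

class DeepPPGSketch(nn.Module):
    """Sketch of a DeepPPG-style 1D CNN (channel sizes from the text; padding assumed)."""
    def __init__(self):
        super().__init__()
        blocks = []
        in_ch = 4  # PPG + 3-axis accelerometer
        for out_ch in (64, 128, 256, 256):
            blocks += [
                nn.Conv1d(in_ch, out_ch, kernel_size=5, padding=2),
                nn.BatchNorm1d(out_ch),
                nn.ReLU(),
                nn.MaxPool1d(2),
            ]
            in_ch = out_ch
        self.features = nn.Sequential(*blocks)
        # 512 samples halved by 4 pooling stages -> 32 time steps, 256 channels
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(256 * 32, 512), nn.ReLU(),
            nn.Linear(512, 128), nn.ReLU(),
            nn.Linear(128, 1),  # single heart rate value (regression, BPM)
        )

    def forward(self, x):  # x: (batch, 4 channels, 512 samples)
        return self.head(self.features(x))

model = DeepPPGSketch()
hr = model(torch.randn(2, 4, 512))  # two 8-second windows at 64 Hz
print(hr.shape)                     # torch.Size([2, 1])
```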
DeepPPG achieved heart rate MAE of 3.42 BPM on the PPG-DaLiA dataset (15 subjects performing activities of daily living including walking, cycling, driving, and working at a desk), outperforming classical methods including TROIKA (MAE: 8.01 BPM) and a spectral peak tracking baseline (MAE: 12.87 BPM) on the same dataset (DOI: 10.3390/s19143079).
A critical design choice in DeepPPG is the inclusion of accelerometer channels alongside the PPG signal. By providing the raw motion data as input, the network learns to implicitly perform motion artifact removal as part of its feature extraction, eliminating the need for a separate preprocessing stage. Ablation studies showed that removing the accelerometer channels increased MAE from 3.42 to 5.78 BPM, confirming the importance of motion reference information.
Spectrogram-Based 2D CNN
An alternative to processing raw time-domain PPG is to first compute a time-frequency representation (spectrogram) and process it with a 2D CNN. This approach is motivated by the observation that heart rate tracking is fundamentally a time-frequency problem: the goal is to identify and follow a specific frequency component (the cardiac fundamental) across time in the presence of interfering components (motion harmonics).
Biswas et al. (2019) proposed CorNET, which computes the short-time Fourier transform (STFT) of both the PPG and accelerometer signals, producing 2D spectrogram inputs for a lightweight 2D CNN. The architecture uses depthwise separable convolutions to minimize parameter count:
- Input: STFT magnitude spectrograms (33 frequency bins x 20 time frames x 2 channels)
- 3 depthwise separable Conv2D blocks
- Global average pooling
- Dense output layer
- Total parameters: approximately 20,000
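The depthwise separable building block is the key to CorNET's small footprint: each channel is filtered spatially on its own, then a 1x1 convolution mixes channels. A minimal sketch under assumed kernel sizes and channel widths (only the input shape, the three-block structure, global average pooling, and the dense head come from the text):

```python
import torch
import torch.nn as nn

def ds_conv(in_ch, out_ch):
    """Depthwise separable Conv2D: per-channel 3x3 filter, then 1x1 pointwise mix."""
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch),  # depthwise
        nn.Conv2d(in_ch, out_ch, kernel_size=1),                          # pointwise
        nn.BatchNorm2d(out_ch),
        nn.ReLU(),
    )

class SpectrogramCNNSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.blocks = nn.Sequential(ds_conv(2, 32), ds_conv(32, 64), ds_conv(64, 64))
        self.pool = nn.AdaptiveAvgPool2d(1)  # global average pooling
        self.out = nn.Linear(64, 1)          # dense heart rate output

    def forward(self, x):  # x: (batch, 2, 33, 20) = PPG + accel STFT magnitudes
        z = self.pool(self.blocks(x)).flatten(1)
        return self.out(z)

model = SpectrogramCNNSketch()
n_params = sum(p.numel() for p in model.parameters())  # well under 20K
hr = model(torch.randn(1, 2, 33, 20))
```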
Despite its small size, CorNET achieved MAE of 1.47 BPM on the IEEE SP Cup 2015 dataset, competitive with much larger models. The spectrogram representation provides natural frequency resolution that aligns with the heart rate estimation task, and the 2D convolutions can learn patterns like frequency tracking (cardiac line), harmonic relationships (cardiac fundamental and harmonics appearing together), and motion artifact patterns (motion frequency lines correlated across PPG and accelerometer spectrograms) (DOI: 10.1109/ISCAS.2019.8702171).
Burrello et al. (2021) further optimized CorNET for deployment on ultra-low-power microcontrollers, achieving real-time operation on a GAP8 processor (8-core RISC-V at 64 MHz) with only 35 KB of model storage and 1.2 mJ per inference. This demonstrated that spectrogram-based deep learning for PPG heart rate estimation is feasible even on the most resource-constrained wearable platforms.
Recurrent Neural Networks: LSTM and GRU
While CNNs excel at extracting spatial features from fixed-length input windows, they do not inherently model the sequential dependency between consecutive heart rate estimates. Recurrent Neural Networks (RNNs), particularly Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) architectures, are designed for exactly this purpose.
LSTM for Heart Rate Tracking
LSTM networks maintain a cell state that can store information over long time spans, controlled by input, forget, and output gates. For PPG heart rate tracking, the LSTM learns to maintain a running estimate of heart rate and update it based on each new PPG observation, functionally equivalent to an adaptive Kalman filter but with learned (rather than hand-designed) dynamics.
Chang et al. (2019) implemented a 2-layer LSTM with 128 hidden units per layer for PPG heart rate estimation. The network processed a sequence of spectral feature vectors (extracted from 8-second PPG windows at 2-second intervals) and output heart rate estimates at each time step. The LSTM achieved MAE of 2.04 BPM on the IEEE SP Cup 2015 dataset, with notably smooth heart rate trajectories that avoided the sudden jumps characteristic of frame-by-frame spectral estimation.
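A sketch of such a tracker, assuming a 64-dimensional spectral feature vector per 2-second step (the feature dimension is not given in the text):

```python
import torch
import torch.nn as nn

class LSTMTrackerSketch(nn.Module):
    """2-layer LSTM, 128 hidden units per layer, one HR estimate per time step."""
    def __init__(self, n_features=64):  # feature size is an assumption
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden_size=128, num_layers=2, batch_first=True)
        self.out = nn.Linear(128, 1)

    def forward(self, x):      # x: (batch, time_steps, n_features)
        h, _ = self.lstm(x)    # hidden state carries the running HR estimate
        return self.out(h)     # (batch, time_steps, 1)

model = LSTMTrackerSketch()
seq = torch.randn(1, 30, 64)   # 30 spectral feature vectors at 2 s intervals (~1 min)
hr_track = model(seq)          # smooth per-step HR trajectory
```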
The key advantage of the LSTM over the Kalman filter approaches described in our Kalman filter guide is that the LSTM learns its state dynamics from data rather than requiring manual specification of a state-space model. This makes it more adaptable to complex, nonlinear heart rate dynamics and removes the need for noise parameter tuning. However, the LSTM requires substantially more training data and computation.
GRU as a Lightweight Alternative
The GRU architecture simplifies the LSTM by combining the input and forget gates into a single update gate, reducing the parameter count by approximately 25%. For PPG heart rate estimation, where the temporal dynamics are relatively simple (smooth heart rate evolution), the GRU achieves comparable accuracy to the LSTM at lower computational cost.
Chung and Shin (2020) compared LSTM and GRU architectures for PPG heart rate tracking and found no statistically significant difference in accuracy (LSTM MAE: 2.31 BPM, GRU MAE: 2.38 BPM, p = 0.42 on paired t-test), while the GRU required 18% fewer parameters and 15% less inference time. For resource-constrained wearable deployment, the GRU is therefore the preferred recurrent architecture.
Hybrid CNN-LSTM Architectures
The most successful deep learning models for PPG heart rate estimation combine the spatial feature extraction of CNNs with the temporal modeling of LSTMs. The CNN processes each PPG window independently to extract feature vectors, and the LSTM processes the sequence of feature vectors to track heart rate over time.
Architecture Design Principles
The typical CNN-LSTM hybrid follows this structure:
- CNN encoder: Processes each PPG/accelerometer window (5-8 seconds) through convolutional layers to produce a fixed-length feature vector (64-256 dimensions)
- Sequence modeling: LSTM or GRU processes a sequence of feature vectors (10-30 consecutive windows, covering 20-60 seconds of history) to capture temporal context
- Output head: Fully connected layer maps the LSTM hidden state to a heart rate estimate
This architecture separates concerns: the CNN handles the "what" (extracting cardiac and motion features from each window), while the LSTM handles the "when" (tracking how heart rate evolves over time and resolving temporal ambiguities).
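The separation of concerns can be sketched as a single module that runs the CNN encoder over each window, then the LSTM over the resulting feature sequence. All layer sizes here are illustrative, not taken from any specific published model:

```python
import torch
import torch.nn as nn

class CNNLSTMSketch(nn.Module):
    """CNN encoder ("what") per window + LSTM tracker ("when") over the sequence."""
    def __init__(self, feat_dim=128):
        super().__init__()
        # CNN encoder: one PPG/accel window -> one fixed-length feature vector
        self.encoder = nn.Sequential(
            nn.Conv1d(4, 32, 5, padding=2), nn.ReLU(), nn.MaxPool1d(4),
            nn.Conv1d(32, 64, 5, padding=2), nn.ReLU(), nn.AdaptiveAvgPool1d(1),
            nn.Flatten(), nn.Linear(64, feat_dim),
        )
        # Sequence model over consecutive window features
        self.lstm = nn.LSTM(feat_dim, 128, batch_first=True)
        self.head = nn.Linear(128, 1)  # output head: hidden state -> HR estimate

    def forward(self, windows):  # windows: (batch, seq_len, 4 channels, 512 samples)
        b, t, c, n = windows.shape
        feats = self.encoder(windows.reshape(b * t, c, n)).reshape(b, t, -1)
        h, _ = self.lstm(feats)
        return self.head(h)      # (batch, seq_len, 1)

model = CNNLSTMSketch()
hr = model(torch.randn(2, 15, 4, 512))  # 15 windows of temporal context per sequence
```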
Shyam et al. (2019) proposed PPGnet, a CNN-LSTM hybrid that achieved MAE of 1.97 BPM on the IEEE SP Cup 2015 dataset. Their CNN encoder used dilated convolutions with exponentially increasing dilation rates (1, 2, 4, 8, 16) to achieve a large receptive field (covering 2-3 cardiac cycles) with relatively few parameters. The LSTM decoder processed 15 consecutive feature vectors (30 seconds of context), enabling the model to maintain heart rate tracking through brief periods of severe signal corruption.
Multi-Task Learning
Several authors have demonstrated that training the CNN-LSTM to predict auxiliary outputs alongside heart rate improves overall accuracy through multi-task regularization.
Risso et al. (2022) trained a CNN-LSTM to jointly predict heart rate and signal quality index (SQI), with the hypothesis that learning to assess signal quality would help the model identify and appropriately weight reliable versus unreliable signal segments. The multi-task model achieved MAE of 1.73 BPM compared to 2.14 BPM for the single-task heart rate model on the PPG-DaLiA dataset, a 19% improvement from the addition of the SQI auxiliary task.
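A multi-task objective of this kind is typically a weighted sum of the two per-head losses. The sketch below assumes an L1 heart rate loss, an SQI head producing a 0-1 quality score trained with binary cross-entropy, and a weighting of 0.5; none of these specifics are given in the source:

```python
import torch
import torch.nn as nn

# Hypothetical two-headed model outputs for a batch of two windows
hr_pred  = torch.tensor([72.0, 95.0], requires_grad=True)  # BPM
sqi_pred = torch.tensor([0.9, 0.3], requires_grad=True)    # predicted quality, 0-1
hr_true  = torch.tensor([70.0, 90.0])                      # ECG-derived labels
sqi_true = torch.tensor([1.0, 0.2])                        # quality labels

lambda_sqi = 0.5  # assumed auxiliary-task weight
loss = nn.functional.l1_loss(hr_pred, hr_true) + \
       lambda_sqi * nn.functional.binary_cross_entropy(sqi_pred, sqi_true)
loss.backward()   # gradients flow into the shared trunk through both heads
```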
Training Strategies and Data Considerations
Datasets
The availability and quality of training data significantly impact deep learning model performance. Key publicly available datasets for PPG heart rate estimation include:
| Dataset | Subjects | Activities | Duration | Sensor Location | Reference Signal |
|---------|----------|------------|----------|-----------------|------------------|
| IEEE SP Cup 2015 | 12 | Arm exercises | 5 min each | Wrist | ECG |
| PPG-DaLiA | 15 | Daily living | 2.5 hours each | Wrist | ECG |
| WESAD | 15 | Stress study | 2 hours each | Wrist + Chest | ECG |
| BIDMC | 53 | ICU patients | 8 min each | Finger | ECG |
The IEEE SP Cup 2015 dataset, while widely used as a benchmark, is limited by its small size (12 subjects) and narrow activity range (seated arm exercises). Models that achieve very low MAE on this dataset may not generalize well to other populations or activities. The PPG-DaLiA dataset provides more realistic conditions with diverse daily activities but is still limited to 15 subjects.
For building production-quality deep learning models, Reiss et al. (2019) recommend a minimum of 50 subjects with at least 30 minutes of labeled recording per subject, spanning rest, walking, running, and activities of daily living. Data should include diversity in age (18-70+), skin tone (Fitzpatrick I-VI), body composition, and sensor placement variability to ensure robust generalization.
Data Augmentation
Data augmentation is critical for training robust models from limited PPG datasets. Effective augmentation strategies include:
- Time stretching (0.9-1.1x): simulates heart rate variation within a window
- Amplitude scaling (0.5-2.0x): simulates varying signal strength due to sensor contact
- Additive Gaussian noise (SNR 5-20 dB): simulates electronic noise
- Signal mixing: combining cardiac components from one recording with motion artifacts from another
- Channel dropout: randomly zeroing accelerometer channels during training to improve robustness to missing sensor data
- Temporal shifting: random offsets in window alignment relative to cardiac cycles
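Three of the augmentations above (time stretching, amplitude scaling, additive noise at a target SNR) can be sketched in a few lines of NumPy; the ranges are those listed, while the resampling and padding details are implementation choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(ppg, fs=64):
    """Apply time stretch, amplitude scale, and SNR-controlled noise to one window."""
    n = len(ppg)
    # Time stretching (0.9-1.1x) via linear resampling, cropped/padded back to n samples
    factor = rng.uniform(0.9, 1.1)
    stretched = np.interp(np.linspace(0, n - 1, int(n * factor)), np.arange(n), ppg)
    stretched = np.pad(stretched, (0, max(0, n - len(stretched))))[:n]
    # Amplitude scaling (0.5-2.0x), simulating varying sensor contact
    scaled = stretched * rng.uniform(0.5, 2.0)
    # Additive Gaussian noise at a random SNR of 5-20 dB
    snr_db = rng.uniform(5, 20)
    noise_power = np.mean(scaled**2) / (10 ** (snr_db / 10))
    return scaled + rng.normal(0.0, np.sqrt(noise_power), n)

window = np.sin(2 * np.pi * 1.2 * np.arange(512) / 64)  # synthetic 1.2 Hz (72 BPM) pulse
augmented = augment(window)
```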
Shyam et al. (2019) reported that data augmentation improved their model's MAE from 2.8 to 1.97 BPM on the IEEE SP Cup 2015 dataset, a 30% improvement from augmentation alone, with signal mixing and time stretching contributing the most.
Loss Functions
The choice of loss function affects both convergence behavior and the characteristics of heart rate estimation errors. Standard options include:
- Mean Squared Error (MSE): L2 loss penalizes large errors heavily, producing estimates biased toward the mean. Appropriate when all errors are equally costly.
- Mean Absolute Error (MAE): L1 loss is more robust to outlier labels and produces median-biased estimates. Often preferred for PPG HR because occasional labeling errors (from ECG R-peak detection failures) are common.
- Huber loss: Combines MSE for small errors with MAE for large errors, providing robustness to outliers while maintaining smooth gradients near zero. Delta parameter of 5-10 BPM works well for PPG heart rate.
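The Huber loss with a 5 BPM delta, for example, treats a 1 BPM error quadratically but a 20 BPM error (likely a label glitch) only linearly:

```python
import numpy as np

def huber(err, delta=5.0):
    """Huber loss: quadratic within +/- delta BPM, linear beyond."""
    a = np.abs(err)
    return np.where(a <= delta, 0.5 * a**2, delta * (a - 0.5 * delta))

errors = np.array([1.0, 5.0, 20.0])  # heart rate errors in BPM
print(huber(errors))                 # [0.5, 12.5, 87.5]: the outlier is not squared
```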
Model Compression for Wearable Deployment
Deploying deep learning models on wearable devices requires aggressive compression to meet the strict constraints on model size (typically < 500 KB flash), memory (< 256 KB RAM), and inference latency (< 50 ms at 1 Hz update rate).
Quantization
Post-training quantization converts floating-point weights to fixed-point (INT8 or INT4), reducing model size by 4-8x with minimal accuracy degradation. For PPG heart rate models, INT8 quantization typically increases MAE by less than 0.3 BPM while reducing model size by 4x and inference time by 2-3x on ARM processors.
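The core of symmetric per-tensor INT8 quantization fits in a few lines: map the weight range onto the signed 8-bit grid and keep one float scale for dequantization. This is a minimal NumPy illustration of the arithmetic, not a deployment toolchain:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor INT8 quantization: float32 weights -> int8 codes + scale."""
    scale = np.max(np.abs(w)) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

w = np.random.default_rng(1).normal(0.0, 0.1, 1000).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = q.astype(np.float32) * scale  # dequantized weights, for error measurement
print(q.nbytes / w.nbytes)            # 0.25 -> the 4x size reduction cited above
```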
Burrello et al. (2021) demonstrated that mixed-precision quantization (INT8 for convolutional layers, INT16 for recurrent layers) maintained accuracy within 0.15 BPM of the full-precision model while enabling real-time operation on a sub-100 MHz processor.
Pruning and Knowledge Distillation
Structured pruning removes entire filters or channels from convolutional layers based on their importance (measured by weight magnitude, gradient, or feature map contribution). For PPG models, 50-70% of convolutional filters can typically be pruned with less than 0.5 BPM increase in MAE, as PPG signals are relatively low-bandwidth and do not require the rich feature hierarchies needed for image or speech processing.
Knowledge distillation trains a small "student" model to mimic the outputs of a large "teacher" model, transferring the learned representation into a compact architecture. This technique is particularly effective for PPG because the output space is simple (a single continuous value), making it easier for the student to approximate the teacher's behavior. Risso et al. (2022) distilled a 2.3M parameter CNN-LSTM teacher into a 45K parameter student model, with the student achieving MAE of 2.1 BPM compared to the teacher's 1.73 BPM on PPG-DaLiA, a modest accuracy trade-off for a 50x reduction in model size.
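Because the output is a single continuous value, the distillation objective reduces to matching the frozen teacher's predictions alongside the ground-truth labels. The sketch below assumes L1 losses and an equal label/teacher weighting, neither of which is specified in the source:

```python
import torch
import torch.nn as nn

teacher_hr = torch.tensor([71.8, 140.2])  # frozen teacher predictions (BPM)
student_hr = torch.tensor([73.0, 138.5], requires_grad=True)  # student outputs
hr_true    = torch.tensor([72.0, 141.0])  # ECG-derived labels

alpha = 0.5  # assumed weighting between hard labels and teacher outputs
loss = alpha * nn.functional.l1_loss(student_hr, hr_true) + \
       (1 - alpha) * nn.functional.l1_loss(student_hr, teacher_hr)
loss.backward()  # in practice, gradients flow into the 45K-parameter student
```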
Performance Comparison Across Architectures
The following table summarizes published deep learning results for direct comparison; unless footnoted, results are on the IEEE Signal Processing Cup 2015 dataset:
| Architecture | Parameters | MAE (BPM) | Inference Time | Reference |
|---|---|---|---|---|
| 1D-CNN (DeepPPG) | 1.2M | 3.42* | 8 ms (GPU) | Reiss et al. 2019 |
| 2D-CNN (CorNET) | 20K | 1.47 | 1.2 ms (MCU) | Biswas et al. 2019 |
| LSTM (2-layer) | 400K | 2.04 | 12 ms (GPU) | Chang et al. 2019 |
| CNN-LSTM (PPGnet) | 800K | 1.97 | 15 ms (GPU) | Shyam et al. 2019 |
| CNN-LSTM multi-task | 950K | 1.73** | 18 ms (GPU) | Risso et al. 2022 |
*DeepPPG was evaluated on PPG-DaLiA rather than the IEEE SP Cup 2015 dataset. **The Risso et al. result is also from PPG-DaLiA.
For emerging approaches using attention mechanisms and transformer architectures for PPG, see our guide on transformer models for PPG analysis. For practical considerations on deploying heart rate estimation on PPG signal processing systems, see our algorithms overview.
Limitations and Open Challenges
Despite impressive benchmark results, deep learning for PPG heart rate estimation faces several unresolved challenges:
Generalization across populations. Models trained on young, healthy subjects from specific demographic groups may not generalize to elderly populations, patients with cardiovascular conditions, or individuals with different skin pigmentation levels. Melanin absorption differences affect PPG signal amplitude and morphology, and most publicly available datasets lack skin tone diversity.
Sensor variability. Deep learning models can overfit to the specific sensor hardware used in training. Differences in LED wavelength, photodetector sensitivity, sensor-skin coupling, and sampling rate between devices can degrade model performance. Domain adaptation and sensor-agnostic architectures are active research areas.
Interpretability. Understanding why a deep learning model produces a specific heart rate estimate is difficult, which creates challenges for clinical deployment where regulatory requirements demand algorithm transparency. Gradient-weighted class activation mapping (Grad-CAM) and attention visualization provide partial insight, but a complete understanding of learned features remains elusive.
Extreme motion. During very high-intensity activities (jumping, boxing, cross-training), even state-of-the-art deep learning models produce MAE of 8-15 BPM, approaching the error levels of traditional methods. The physical limit is that severe motion can completely obscure the cardiac signal, leaving no recoverable information regardless of algorithm sophistication.
Conclusion
Deep learning has established new accuracy benchmarks for PPG heart rate estimation, with CNN-LSTM hybrid architectures achieving MAE below 2 BPM on standard datasets. The end-to-end learning paradigm eliminates the need for hand-tuned signal processing pipelines and automatically discovers optimal features for heart rate extraction from noisy, motion-corrupted PPG signals. Model compression techniques including quantization, pruning, and knowledge distillation enable deployment on resource-constrained wearable processors, making deep learning a practical choice for commercial heart rate estimation. However, challenges in generalization, interpretability, and extreme motion robustness remain active research areas that will determine the ultimate clinical applicability of these methods.
References
- Reiss et al. (2019). Sensors, 19(14), 3079. DOI: 10.3390/s19143079.
- Biswas et al. (2019). IEEE ISCAS. DOI: 10.1109/ISCAS.2019.8702171.
Frequently Asked Questions
- How accurate is deep learning for PPG heart rate estimation compared to traditional methods?
- Deep learning methods consistently outperform traditional signal processing approaches on benchmark datasets. On the IEEE Signal Processing Cup 2015 dataset, state-of-the-art deep learning models achieve heart rate MAE of 1.5-3.0 BPM, compared to 2.2-4.5 BPM for traditional spectral tracking methods like TROIKA and JOSS, and 6-12 BPM for simple spectral peak picking. The advantage is most pronounced during intense physical activity where motion artifacts are severe. However, deep learning models require substantial training data and computational resources, and their performance depends heavily on the diversity of the training set.
- Can deep learning PPG models run on wearable devices in real time?
- Yes, but with significant architectural constraints. Full-sized models like DeepPPG (millions of parameters) require desktop-class GPUs. However, optimized architectures designed for edge deployment can run on wearable processors. Techniques including model pruning (removing 60-80% of parameters), quantization (INT8 or even INT4 instead of FP32), knowledge distillation (training a small student model from a large teacher), and depthwise separable convolutions can reduce model size to 50-200 KB and inference time to 5-20 ms on ARM Cortex-M7 processors. The microcontroller-optimized CorNET variant of Burrello et al. (2021) demonstrated real-time PPG heart rate estimation on a 64 MHz microcontroller with only 35 KB of model storage.
- How much training data is needed for PPG deep learning models?
- The data requirements depend on model complexity and target generalization. Small 1D-CNN models with fewer than 100K parameters can achieve reasonable performance with 10-20 subjects (approximately 10-20 hours of labeled PPG data). Larger models with millions of parameters typically require 50-200 subjects across diverse demographics, skin tones, and activity types. Data augmentation techniques (time stretching, amplitude scaling, additive noise, signal mixing) can effectively expand limited datasets by 5-10x. Transfer learning from large pre-trained models can reduce the labeled data requirement by 50-70%.
- What input format works best for deep learning PPG models?
- Three main input formats are used: raw time-domain PPG segments (1D signals), time-frequency representations (spectrograms or scalograms), and multi-channel inputs combining PPG with accelerometer data. Raw time-domain input with 1D convolutions is the most computationally efficient and works well for CNN architectures. Spectrogram input with 2D convolutions provides natural time-frequency localization and works well for heart rate tracking during exercise. Multi-channel input with PPG and 3-axis accelerometer data enables end-to-end motion artifact handling and generally achieves the best accuracy, with MAE improvements of 15-30% over PPG-only models.