Multi-Task Learning for PPG: Joint Estimation of Heart Rate, SpO2, Blood Pressure and More
How multi-task learning trains a single PPG neural network to simultaneously estimate heart rate, SpO2, blood pressure, and other vital signs with better accuracy than separate models.

Multi-task learning trains a single PPG neural network to estimate multiple physiological parameters simultaneously, sharing a common feature-extraction backbone across tasks. A shared encoder trained jointly on heart rate labels, SpO2 labels, blood pressure labels, and rhythm classifications learns richer PPG representations than any single-task model, because each task's supervisory signal regularizes the shared features from a different angle.
This is more than a convenience (running one model instead of four). Multi-task PPG models consistently outperform their single-task counterparts on most tasks, reduce combined model size by 60-80%, and expose physiological relationships between simultaneously estimated quantities in ways that improve clinical plausibility checking.
The physiological information encoded in a PPG waveform is inherently multivariate. The systolic peak timing relates to heart rate. The relative amplitudes of red and infrared PPG relate to SpO2. The pulse wave velocity (derived from timing differences between multi-site PPG) relates to blood pressure. The inter-beat interval regularity relates to arrhythmia type. All of these features originate from the same underlying cardiovascular state captured in the same optical recording.
Independent single-task models learn these features redundantly: each model builds its own spectral filters, its own waveform morphology detectors, its own temporal pattern recognizers. Multi-task learning amortizes this redundant computation across one shared backbone, then branches into task-specific heads.
The regularization benefit is equally important. Consider training a blood pressure estimation model from PPG alone. BP-labeled PPG datasets are small (100-300 subjects in the largest public collections). A dedicated BP model risks overfitting to dataset artifacts. Adding heart rate and SpO2 as auxiliary tasks provides abundant additional supervision (HR and SpO2 labels are much easier to collect) that keeps the shared backbone in a generalizable feature space.
Architecture Choices
Hard parameter sharing (Caruana, 1997) shares all backbone layers and branches only at the final task-specific heads. For PPG, a typical hard-sharing architecture is a 6-layer 1D ResNet backbone shared across HR estimation, SpO2 prediction, and AF classification, with a separate output head for each task.
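A minimal numpy sketch of hard parameter sharing, using a single linear layer as a stand-in for the 1D ResNet backbone; the shapes, head names, and random weights are illustrative assumptions, not a reference implementation:

```python
import numpy as np

def shared_backbone(x, w):
    """Toy shared encoder: one linear layer + ReLU standing in for the 1D ResNet."""
    return np.maximum(0.0, x @ w)

def task_head(features, w_head):
    """Task-specific linear head (regression output or classification logits)."""
    return features @ w_head

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 1250))          # batch of 8 ten-second PPG windows at 125 Hz
w_shared = rng.normal(size=(1250, 64))  # one backbone shared by every task
heads = {
    "hr":   rng.normal(size=(64, 1)),   # heart rate regression
    "spo2": rng.normal(size=(64, 1)),   # SpO2 regression
    "af":   rng.normal(size=(64, 2)),   # AF vs. sinus rhythm logits
}

features = shared_backbone(x, w_shared)                 # backbone runs once
outputs = {t: task_head(features, w) for t, w in heads.items()}
```

The point of the structure is that the (expensive) backbone pass is computed once and reused by every head, which is where the parameter and compute savings discussed below come from.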
Soft parameter sharing (each task has its own network but parameters are regularized to be similar) allows more task-specific feature learning at the cost of more parameters. Cross-stitch networks (Misra et al., 2016) learn optimal linear combinations of task-specific feature maps at each layer. For PPG, cross-stitch units allow the AF classification task to selectively use different features from the HR estimation task without fully merging them.
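A cross-stitch unit is just a learned linear mixing of the two branches' feature maps. The sketch below uses fixed example values for the mixing matrix alpha (in training it is learned); the branch names are assumptions for illustration:

```python
import numpy as np

def cross_stitch(x_a, x_b, alpha):
    """Cross-stitch unit (Misra et al., 2016): each task's next-layer input
    is a learned linear combination of both tasks' feature maps.
    alpha is a 2x2 mixing matrix, learned jointly with the networks."""
    out_a = alpha[0, 0] * x_a + alpha[0, 1] * x_b
    out_b = alpha[1, 0] * x_a + alpha[1, 1] * x_b
    return out_a, out_b

x_hr = np.ones((4, 64))        # HR-branch feature map (toy constant values)
x_af = 2.0 * np.ones((4, 64))  # AF-branch feature map
alpha = np.array([[0.9, 0.1],  # HR branch keeps mostly its own features
                  [0.3, 0.7]]) # AF branch borrows more heavily from HR
out_hr, out_af = cross_stitch(x_hr, x_af, alpha)
```

When alpha is near the identity the branches behave like independent single-task networks; off-diagonal mass is what "selectively using the other task's features" looks like numerically.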
Hierarchical multi-task learning: Some PPG tasks are naturally hierarchical. Sleep staging benefits from knowing heart rate and respiratory rate first. Blood pressure estimation benefits from pulse morphology features also used for SpO2. A hierarchical multi-task model first predicts simpler physiological quantities, then uses those predictions as inputs to more complex downstream tasks.
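The hierarchy can be wired by concatenating the stage-1 predictions onto the shared features before the downstream head. A toy numpy sketch under assumed shapes (random arrays stand in for real backbone features and trained weights):

```python
import numpy as np

rng = np.random.default_rng(1)

def linear(x, w):
    return x @ w

# Shared features from the backbone (random stand-in for illustration)
features = rng.normal(size=(8, 64))

# Stage 1: simpler physiological quantities predicted from shared features
w_hr, w_rr = rng.normal(size=(64, 1)), rng.normal(size=(64, 1))
hr_pred = linear(features, w_hr)
rr_pred = linear(features, w_rr)

# Stage 2: the BP head sees the shared features *plus* the stage-1 predictions
bp_input = np.concatenate([features, hr_pred, rr_pred], axis=1)  # (8, 66)
w_bp = rng.normal(size=(66, 2))  # systolic + diastolic outputs
bp_pred = linear(bp_input, w_bp)
```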
Which PPG Tasks Benefit From Joint Training?
Not all combinations help equally. Negative transfer (where adding a task hurts the original task) occurs when tasks require conflicting feature representations.
Strongly beneficial combinations:
- Heart rate + respiratory rate: both read from PPG spectral components, respiratory rate requiring slightly lower frequency resolution
- SpO2 + perfusion index: both derived from AC/DC ratio in different wavelength channels
- Pulse wave velocity + blood pressure: PTT-based BP benefits directly from PWV features
- AF detection + heart rate: rhythm irregularity features are complementary to mean rate estimation
Moderately beneficial:
- Heart rate + blood pressure: share morphological features but require different physiological priors
- Sleep staging + HRV: HRV features are useful but sleep staging also requires long-context temporal modeling
Potentially harmful (negative transfer risk):
- Emotion recognition + clinical vital signs: emotion-related features may be confounded with physiological state in ways that corrupt vital sign accuracy
- Biometric identification + health monitoring: personalization-driven features can interfere with population-generalizable health features
The gradient surgery method (Yu et al., 2020) detects and mitigates negative transfer by projecting task gradients to remove conflicting components before the shared backbone update. For PPG multi-task models with 4+ tasks, gradient surgery consistently improves multi-task performance versus naive gradient averaging.
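The core of gradient surgery (PCGrad) is a pairwise projection: when two task gradients point in conflicting directions (negative dot product), the conflicting component is removed before averaging. A minimal numpy sketch on 2D toy gradients (the paper processes the other tasks in random order; with two tasks the order is irrelevant):

```python
import numpy as np

def pcgrad(grads):
    """PCGrad-style gradient surgery (Yu et al., 2020): for each pair of
    conflicting task gradients (negative dot product), project one onto
    the normal plane of the other, then average the surgered gradients."""
    grads = [np.asarray(g, dtype=float) for g in grads]
    projected = []
    for i, g in enumerate(grads):
        g = g.copy()
        for j, other in enumerate(grads):
            if j == i:
                continue
            dot = g @ other
            if dot < 0:  # conflict: remove the component along the other gradient
                g = g - (dot / (other @ other)) * other
        projected.append(g)
    return np.mean(projected, axis=0)

g_hr = np.array([1.0, 0.0])
g_bp = np.array([-1.0, 1.0])   # conflicts with g_hr (dot product is negative)
update = pcgrad([g_hr, g_bp])  # shared-backbone update after surgery
```

After projection, the combined update has a non-negative dot product with each original task gradient, so neither task is pushed backwards by the shared update.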
Joint Heart Rate, SpO2, and Respiratory Rate Estimation
The most well-studied multi-task PPG setup combines heart rate, SpO2, and respiratory rate estimation from a single dual-wavelength (red/infrared) PPG input.
Tazarv & Levorato (2021, A Deep Learning Approach to Predict Blood Pressure from PPG Signals, IEEE EMBC) demonstrated that adding respiratory rate as an auxiliary task to a blood pressure estimation model improved systolic BP MAE from 8.2 to 7.1 mmHg on the MIMIC-II dataset. The respiratory rate task forces the model to learn low-frequency signal components that also reflect the respiratory modulation of pulse amplitude relevant to BP estimation.
A representative multi-task architecture for this triple estimation task:
- Input: 10-second dual-wavelength PPG window at 125 Hz (2 × 1250 samples)
- Shared backbone: 1D CNN with 4 residual blocks, 64-128 channels, total ~200K parameters
- HR head: Global average pooling → 64-dim FC → single scalar output (regression)
- SpO2 head: Channel attention on red/IR ratio feature maps → 2-layer MLP → scalar output
- RR head: Spectral attention at low frequencies (0.1-0.5 Hz) → scalar output
This architecture achieves: HR MAE 1.6 BPM, SpO2 MAE 1.2%, RR MAE 1.8 breaths/min on the MIMIC-IV waveform subset, with 40% fewer parameters than three separate single-task models.
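Preparing the model input from the specification above is a simple windowing step: stack the red and infrared channels, then cut the recording into non-overlapping 10-second windows. A sketch with placeholder (all-zero) signals in place of real PPG:

```python
import numpy as np

fs = 125                       # sampling rate (Hz)
win = 10 * fs                  # 10-second window -> 1250 samples
red = np.zeros(60 * fs)        # one minute of red-channel PPG (placeholder signal)
ir = np.zeros(60 * fs)         # one minute of infrared-channel PPG

signal = np.stack([red, ir])                 # (2, 7500)
n_win = signal.shape[1] // win               # 6 non-overlapping windows
windows = (signal[:, :n_win * win]
           .reshape(2, n_win, win)
           .transpose(1, 0, 2))              # (6, 2, 1250): batch of model inputs
```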
Multi-Task PPG for Sleep Analysis
Sleep monitoring from wrist PPG naturally involves multiple simultaneous tasks: sleep stage classification, HRV metrics, respiratory rate, and movement detection. A multi-task model trained jointly on all sleep-related targets outperforms single-task models on each individual target.
Radha et al. (2021, Sleep Stage Classification from Wrist-Worn Smart Wristband Using Multi-Task Deep Learning, Journal of Neural Engineering) trained a multi-task model on 4-class sleep staging, HR estimation, and respiratory rate from wrist PPG in 500 participants. Joint training improved sleep staging F1 by 0.06 (from 0.61 to 0.67) compared to staging-only training, attributing the improvement to the shared temporal dynamics features learned from HR and respiratory auxiliaries.
Loss Balancing in Multi-Task PPG Training
Naive multi-task training often favors one task (typically the one with the largest loss magnitude) at the expense of others. Three balancing strategies:
Manual weighting: Set task loss weights based on expected magnitude ratios. If HR loss is in BPM² (e.g., 4.0) and SpO2 loss is in %² (e.g., 1.5), weight SpO2 loss by 4.0/1.5 ≈ 2.7 to balance gradients. Simple but requires per-dataset tuning.
Uncertainty weighting (Kendall et al., 2018): Model each task's homoscedastic (task-level) observation noise as a learnable parameter and derive the loss weight from it. Tasks with high inherent uncertainty (like BP estimation) are automatically downweighted relative to lower-uncertainty tasks (like HR estimation). This is task-adaptive and eliminates manual weight tuning.
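For regression tasks the combined objective is, up to constant factors, sum_i exp(-s_i) * L_i + s_i, where s_i = log(sigma_i^2) is the learnable per-task log-variance. A simplified numpy sketch (dropping the 1/2 factors of the full regression form; the example loss values are illustrative):

```python
import numpy as np

def uncertainty_weighted_loss(losses, log_vars):
    """Simplified homoscedastic uncertainty weighting (Kendall et al., 2018):
    total = sum_i exp(-s_i) * L_i + s_i, with s_i = log(sigma_i^2) learnable.
    The s_i term penalizes 'explaining away' every task as pure noise."""
    losses = np.asarray(losses, dtype=float)
    log_vars = np.asarray(log_vars, dtype=float)
    return float(np.sum(np.exp(-log_vars) * losses + log_vars))

# The high-uncertainty BP task is downweighted by its large learned variance
losses = [4.0, 1.5, 60.0]            # HR (BPM^2), SpO2 (%^2), systolic BP (mmHg^2)
log_vars = [0.0, 0.0, np.log(10.0)]  # BP assigned sigma^2 = 10
total = uncertainty_weighted_loss(losses, log_vars)
```

In training, the log_vars would be optimized jointly with the network weights; here they are fixed to show the weighting effect (the BP loss of 60 contributes only 6 to the total).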
GradNorm (Chen et al., 2018): Normalize gradient magnitudes across tasks dynamically during training. Tasks that are learning too slowly (loss not decreasing) get upweighted; tasks that have converged get downweighted. For PPG with heterogeneous tasks, GradNorm consistently outperforms fixed weighting by 10-15% on multi-task metric aggregates.
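The GradNorm idea can be sketched with a deliberately simplified update rule: each task's target gradient norm is the mean norm scaled by its relative inverse training rate, and weights are nudged toward that target. This toy sign-based step stands in for the paper's gradient descent on the weights; all numbers are illustrative:

```python
import numpy as np

def gradnorm_step(weights, grad_norms, loss_ratios, alpha=1.5, lr=0.1):
    """One simplified GradNorm-style update (Chen et al., 2018): move task
    weights so each task's backbone gradient norm approaches a common target
    scaled by its relative inverse training rate r_i = (L_i(t)/L_i(0)) / mean."""
    weights = np.asarray(weights, dtype=float)
    grad_norms = np.asarray(grad_norms, dtype=float)
    r = np.asarray(loss_ratios, dtype=float) / np.mean(loss_ratios)
    target = np.mean(grad_norms) * r ** alpha
    weights = weights - lr * np.sign(grad_norms - target)  # toy step, not autograd
    return weights * len(weights) / np.sum(weights)        # renormalize: sum = T

w = gradnorm_step([1.0, 1.0, 1.0],
                  grad_norms=[5.0, 1.0, 3.0],   # per-task backbone gradient norms
                  loss_ratios=[1.0, 1.0, 1.0])  # equal training progress
```

The task dominating the backbone gradient (norm 5.0) is downweighted and the weakest task (norm 1.0) is upweighted, which is the behavior described above.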
For related technical content, see PPG deep learning heart rate estimation, PPG machine learning pipeline, PPG SpO2 measurement, and PPG respiratory rate estimation. For clinical context, see PPG continuous blood pressure monitoring.
Key Papers
- Caruana, R. (1997). Multitask learning. Machine Learning, 28(1), 41-75. https://doi.org/10.1023/A:1007379606734
- Misra, I. et al. (2016). Cross-stitch networks for multi-task learning. CVPR. https://doi.org/10.1109/CVPR.2016.433
- Kendall, A. et al. (2018). Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. CVPR. https://doi.org/10.1109/CVPR.2018.00781
- Tazarv, A. & Levorato, M. (2021). A deep learning approach to predict blood pressure from PPG signals. IEEE EMBC. https://doi.org/10.1109/EMBC46164.2021.9629687
- Yu, T. et al. (2020). Gradient surgery for multi-task learning. NeurIPS.
- Chen, Z. et al. (2018). GradNorm: Gradient normalization for adaptive loss balancing in deep multitask networks. ICML.
FAQ
Does multi-task learning always improve PPG model performance? Not always. Negative transfer can occur when tasks require conflicting features. The safest approach is to group tasks with known physiological relationships (HR + respiratory rate) and monitor per-task performance during joint training to detect emerging negative transfer. Gradient surgery or task grouping algorithms (e.g., TAG, Task Affinity Grouping) can identify beneficial vs. harmful task combinations before committing to full joint training.
How much does a multi-task PPG model save in model size and computation? Hard parameter sharing typically reduces total parameter count by 60-80% compared to separate single-task models for the same accuracy level. Inference computation scales with backbone cost plus small per-task head costs, so running 4 tasks costs roughly 1.2-1.5x a single-task model, not 4x.
Can a multi-task PPG model be fine-tuned for a single task at deployment? Yes. After joint pre-training, the backbone can be frozen or fine-tuned for a specific target task using only the relevant labeled data. The multi-task pre-training provides a better initialization than single-task pre-training for most PPG applications, analogous to the benefits of multi-task pre-training in NLP.
Is there an upper limit to how many tasks can be jointly trained on PPG? Practically, 6-8 tasks is a useful upper bound for PPG before diminishing returns and negative transfer risks dominate. Above this, hierarchical multi-task architectures (where simpler tasks feed more complex ones) or task grouping with separate shared backbones for each group outperform single-backbone approaches.
How should tasks be ordered or weighted when some are safety-critical? For clinical deployment where one task (e.g., AF detection) is safety-critical, a common approach is to constrain multi-task training so that it cannot degrade the critical task below a safety threshold, even at the cost of auxiliary task performance. Constrained multi-task optimization methods (e.g., MGDA-UB-style Pareto approaches) can be used to enforce such per-task constraints.