ChatPPG Editorial

Knowledge Distillation for PPG: Compressing Deep Models for Wearable Deployment

How knowledge distillation compresses large PPG neural networks into edge-deployable models for wearables, covering teacher-student training, quantization, pruning, and benchmark results.

ChatPPG Research Team
7 min read

Knowledge distillation transfers the learned capabilities of a large, accurate PPG "teacher" model into a small, fast "student" model that runs efficiently on wearable hardware. The student reaches accuracy it could not attain if trained from scratch on the same limited data, because it learns from the teacher's soft probability outputs rather than hard ground-truth labels alone.

This technique is essential for closing the gap between what state-of-the-art PPG deep learning achieves on server hardware and what wearable devices can run in real time at milliwatt power levels. Without compression techniques such as knowledge distillation, the most accurate PPG models simply cannot run on current smartwatches, fitness bands, or medical wearables.

The Wearable Deployment Constraint

A 2024 Qualcomm whitepaper benchmarks a typical wearable health SoC (Apple S9, Qualcomm QCS6490-class) at 50-100 GOPS for neural network inference. A BERT-Large-equivalent Transformer with 340M parameters requires ~700 GFLOPs per inference pass. The math does not work.

Practical constraints for continuous wearable PPG inference:

  • Model size: < 500 KB flash storage (typical wearable flash allocation for ML models)
  • RAM footprint: < 256 KB SRAM for activations
  • Latency: < 50 ms for 10-second PPG window at 50 Hz (500 samples)
  • Power budget: < 5 mW continuous inference to achieve 24-hour battery life
  • Accuracy target: Within 2-3 BPM MAE of server-class model performance

These constraints rule out most architecture choices. A standard ResNet-50 has roughly 25M parameters and needs about 4 GFLOPs per forward pass. Even a simple CorNET (3 conv layers, 35 KB) sits at the edge of viability.
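
The budget arithmetic is easy to check directly. A quick sketch using the figures quoted in this section (the helper function is illustrative and assumes the SoC sustains its peak GOPS rating):

```python
# Back-of-envelope latency check for the figures quoted above.

def inference_latency_ms(flops_per_pass: float, soc_gops: float) -> float:
    """Latency in milliseconds, assuming the SoC sustains soc_gops GOPS."""
    return flops_per_pass / (soc_gops * 1e9) * 1e3

# BERT-Large-class Transformer (~700 GFLOPs) on a 100 GOPS wearable SoC:
bert_ms = inference_latency_ms(700e9, 100)   # 7000 ms per pass
# CorNET-class CNN (a few MFLOPs) easily fits the 50 ms window budget:
cornet_ms = inference_latency_ms(5e6, 100)   # 0.05 ms per pass

print(f"BERT-Large: {bert_ms:.0f} ms, CorNET-class: {cornet_ms:.3f} ms")
```

Even at the optimistic peak rating, the large Transformer misses the 50 ms latency budget by more than two orders of magnitude.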

Knowledge Distillation Fundamentals

Hinton et al. (2015, Distilling the Knowledge in a Neural Network, arXiv:1503.02531) introduced the core idea. A teacher model (large, accurate, trained to convergence) produces "soft labels": probability distributions over output classes. These soft labels carry more information than one-hot ground truth because they encode the teacher's confidence structure.

For PPG atrial fibrillation detection with classes {sinus, AF, other arrhythmia, noise}, a teacher might output [0.02, 0.91, 0.06, 0.01] for an AF example rather than the hard label [0, 1, 0, 0]. The soft labels reveal that "other arrhythmia" is the most likely confusion category and that noise is highly unlikely, information the student can learn from even when misclassifying.

The distillation loss combines:

  • Hard loss: Standard cross-entropy between student output and ground truth labels
  • Soft loss: KL divergence between student and teacher soft outputs at temperature T (T > 1 smooths the distribution, revealing more class relationship information)
  • Combined loss: L = α × L_hard + (1-α) × L_soft × T²

For PPG heart rate estimation (a regression task), the soft loss becomes an MSE or Huber loss between the student's and teacher's continuous predictions, optionally applied at intermediate outputs as well.
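
A minimal NumPy sketch of the combined classification loss for the four-class AF example above; the function names and the α, T defaults are illustrative, not a reference implementation:

```python
import numpy as np

def softmax(logits, T=1.0):
    z = logits / T
    z = z - z.max()                      # numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, hard_label, T=4.0, alpha=0.3):
    """L = alpha * CE(student, hard) + (1 - alpha) * T^2 * KL(teacher || student)."""
    p_student = softmax(student_logits)          # T = 1 for the hard loss
    hard_loss = -np.log(p_student[hard_label] + 1e-12)

    p_t = softmax(teacher_logits, T)             # soft targets at temperature T
    p_s = softmax(student_logits, T)
    soft_loss = np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)))

    return alpha * hard_loss + (1 - alpha) * T**2 * soft_loss

# Classes: {sinus, AF, other arrhythmia, noise}; AF (index 1) is ground truth.
teacher = np.array([1.0, 4.8, 2.1, 0.3])   # soft labels peak on AF, then "other"
student = np.array([0.5, 3.0, 1.5, 0.2])
print(distillation_loss(student, teacher, hard_label=1))
```

The T² factor keeps the soft-loss gradient magnitude comparable across temperatures, which is why it appears in the combined loss above.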

Teacher Architecture for PPG

The teacher should be the most accurate model achievable without deployment constraints. Common choices:

ResNet-based 1D architecture: 50+ layers, ~2M parameters, trained on all available PPG data with heavy augmentation. Achieves 1.2-1.8 BPM MAE on PPG-DaLiA during walking/running combined.

Transformer-based PPG model: Patch-based tokenization (20-ms patches), 6-8 attention heads, 4-6 Transformer blocks, ~5M parameters. Achieves slightly better performance on complex activities but requires more training data.

Ensemble teacher: Average predictions from 3-5 diverse architectures (CNN, LSTM, Transformer, frequency-domain). Ensemble soft labels are smoother and contain more calibrated uncertainty information, leading to better student performance.

The teacher is trained on GPU-class hardware with full precision, no latency constraints, and maximum model capacity. It never runs on the wearable.

Student Architecture Design

The student architecture is constrained by the wearable hardware:

MobileNet-1D variants: Depthwise separable convolutions replace standard convolutions, reducing computation by 8-9x at modest accuracy cost. A 4-layer 1D MobileNet for PPG can achieve < 80 KB with 8-bit quantization.
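
The parameter saving is easy to verify by counting weights. A sketch with assumed layer sizes (kernel 9, 64 input and 128 output channels, typical of a mid-sized 1D PPG CNN):

```python
def conv1d_params(kernel, c_in, c_out):
    """Parameter count of a standard 1D convolution (bias ignored)."""
    return kernel * c_in * c_out

def depthwise_separable_params(kernel, c_in, c_out):
    """Depthwise (kernel * c_in) followed by 1x1 pointwise (c_in * c_out)."""
    return kernel * c_in + c_in * c_out

std = conv1d_params(9, 64, 128)                 # 73,728 weights
sep = depthwise_separable_params(9, 64, 128)    #  8,768 weights
print(f"standard: {std}, separable: {sep}, reduction: {std / sep:.1f}x")
```

The reduction factor approaches the kernel size as the channel count grows, which is where the 8-9x figure comes from.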

CorNET and microcontroller-friendly architectures: Biswas et al. (2019, CorNET: Deep Learning Framework for PPG-Based Heart Rate Estimation and Biometric Identification in Ambulatory Environment, IEEE TBioCAS) demonstrated a 35 KB, 6-layer 1D CNN achieving 2.1 BPM MAE on the WESAD dataset while running in real time on a 64 MHz Cortex-M7.

LSTM-free temporal modeling: LSTMs have hidden state that requires recurrent memory allocation across time steps, complicating tiling for hardware acceleration. For temporal PPG modeling, dilated causal convolutions (WaveNet-inspired) achieve comparable temporal receptive fields with more hardware-friendly computation patterns.
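
Both properties can be sketched concretely, assuming a WaveNet-style stack of kernel-3 layers with doubling dilations (helper names are illustrative):

```python
import numpy as np

def causal_dilated_conv(x, w, dilation):
    """1D causal convolution: left-pad so output[t] depends only on x[<= t]."""
    k = len(w)
    pad = (k - 1) * dilation
    xp = np.concatenate([np.zeros(pad), x])
    return np.array([sum(w[i] * xp[t + pad - i * dilation] for i in range(k))
                     for t in range(len(x))])

def receptive_field(kernel, dilations):
    """Samples visible to one output after a stack of dilated causal layers."""
    return 1 + sum((kernel - 1) * d for d in dilations)

# A 6-layer stack, kernel 3, dilations 1, 2, 4, 8, 16, 32:
rf = receptive_field(3, [1, 2, 4, 8, 16, 32])   # 127 samples ~ 2.5 s at 50 Hz
print(rf)
```

Unlike an LSTM, each output needs only a fixed window of past inputs, so the computation tiles cleanly onto hardware accelerators with no recurrent state to carry.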

Feature-Based Distillation for PPG

Feature-based distillation goes beyond output matching to align intermediate representations as well:

Feature map mimicry: Student feature maps at each layer are trained to match the teacher's feature maps (after an optional linear adaptor to handle dimensional differences). For PPG, this forces the student to learn frequency-decomposition representations similar to the teacher's learned spectral filters.
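
A minimal sketch of feature map mimicry with a linear adaptor, assuming a 16-channel student layer and a 64-channel teacher layer over a 500-sample window (all shapes illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def feature_mimicry_loss(f_student, f_teacher, W):
    """MSE between adapted student features and teacher features.

    f_student: (C_s, T) student feature map, f_teacher: (C_t, T),
    W: (C_t, C_s) learned linear adaptor bridging the channel mismatch.
    """
    adapted = W @ f_student          # project student channels to teacher width
    return np.mean((adapted - f_teacher) ** 2)

# Toy shapes: student 16 channels, teacher 64 channels, 500-sample window.
f_s = rng.standard_normal((16, 500))
f_t = rng.standard_normal((64, 500))
W = rng.standard_normal((64, 16)) * 0.1
print(feature_mimicry_loss(f_s, f_t, W))
```

In training, W is optimized jointly with the student and discarded at deployment, so the adaptor adds no inference cost on the wearable.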

Attention transfer (Zagoruyko & Komodakis, 2017): Transfer attention maps (L2-normalized sum of squared feature map values) from teacher to student. The student is trained to produce similar attention patterns along the time axis. For PPG CNNs, this ensures the student attends to the same signal regions (systolic peak, diastolic notch timing) as the more powerful teacher.
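
The attention-transfer objective can be sketched in a few lines; shapes are illustrative, and note that student and teacher layers need only matching temporal length, not matching channel counts:

```python
import numpy as np

def attention_map(features):
    """Per-position attention: sum of squared channel activations, L2-normalized.

    features: (C, T) feature map -> (T,) attention vector.
    """
    a = np.sum(features ** 2, axis=0)
    return a / (np.linalg.norm(a) + 1e-12)

def attention_transfer_loss(f_student, f_teacher):
    """L2 distance between normalized student and teacher attention maps."""
    return np.linalg.norm(attention_map(f_student) - attention_map(f_teacher))

rng = np.random.default_rng(1)
f_t = rng.standard_normal((64, 500))   # teacher layer: 64 channels
f_s = rng.standard_normal((16, 500))   # student layer: 16 channels, same length
print(attention_transfer_loss(f_s, f_t))
```

Collapsing the channel dimension before comparison is what lets a 16-channel student mimic a 64-channel teacher with no adaptor at all.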

Progressive distillation (staged compression): Train a medium-sized intermediate model from the teacher, then distill again to the final small student. Each distillation step introduces smaller capacity gaps, improving final student performance. For PPG heart rate estimation, two-stage progressive distillation improved student MAE by 0.3-0.4 BPM compared to single-stage direct distillation.

Post-Training Quantization and Pruning

Knowledge distillation is typically combined with additional compression techniques:

Quantization-aware training (QAT): Simulate 8-bit integer arithmetic during training so the model learns to compensate for quantization noise. For PPG CNNs, QAT achieves < 0.2 BPM accuracy degradation when moving from FP32 to INT8, versus 0.5-1.0 BPM degradation with post-training quantization.
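
The core of QAT is a fake-quantization step in the forward pass: values are rounded to the INT8 grid and immediately dequantized, so training sees the rounding noise. A sketch using asymmetric per-tensor quantization (illustrative, not a specific framework's API):

```python
import numpy as np

def fake_quantize(x, num_bits=8):
    """Quantize-dequantize round trip simulating INT8 inference.

    In QAT this runs in the forward pass so the network learns to tolerate
    the rounding noise; gradients bypass the rounding (straight-through).
    """
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1   # -128..127
    scale = max((x.max() - x.min()) / (qmax - qmin), 1e-12)
    zero_point = np.round(qmin - x.min() / scale)
    q = np.clip(np.round(x / scale + zero_point), qmin, qmax)      # integer codes
    return ((q - zero_point) * scale).astype(x.dtype)

w = np.random.default_rng(2).standard_normal(1000).astype(np.float32)
w_q = fake_quantize(w)
print("max round-trip error:", np.abs(w - w_q).max())
print("distinct values:", len(np.unique(w_q)))
```

The round-trip error is bounded by the quantization step, which is what the student learns to absorb during training rather than after it.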

Structured pruning: Remove entire convolutional filters with low L1 norm. Iterative magnitude pruning with periodic retraining can remove 60-80% of parameters from PPG models with < 10% accuracy loss. A pruned + quantized model can be 10-20x smaller than the original.
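
A sketch of L1-norm structured pruning for one convolutional layer; unlike weight masking, the kept filters form a physically smaller layer (the fraction and shapes are illustrative):

```python
import numpy as np

def prune_filters_l1(conv_weights, keep_fraction=0.3):
    """Structured pruning: keep the filters with the largest L1 norms.

    conv_weights: (n_filters, c_in, kernel) -> pruned weight tensor holding
    only the kept filters, plus their original indices (needed to slice the
    next layer's input channels to match).
    """
    norms = np.abs(conv_weights).sum(axis=(1, 2))        # L1 norm per filter
    n_keep = max(1, int(round(len(norms) * keep_fraction)))
    keep = np.sort(np.argsort(norms)[-n_keep:])          # kept indices, in order
    return conv_weights[keep], keep

rng = np.random.default_rng(3)
w = rng.standard_normal((64, 32, 9))     # 64 filters, 32 in-channels, kernel 9
w_pruned, kept = prune_filters_l1(w, keep_fraction=0.25)
print(w_pruned.shape)                     # (16, 32, 9)
```

Because whole filters disappear, the saving shows up as real FLOP and SRAM reductions on the wearable, not just sparse weight files.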

Neural Architecture Search for wearables: Automated methods like Once-for-All (Cai et al., 2020) train a single supernet and extract subnetworks matching hardware constraints. Applied to PPG, Once-for-All-style NAS finds architectures achieving 1.9 BPM MAE within the 80 KB budget, outperforming manually designed student architectures.

Benchmark: Before and After Distillation

On the PPG-DaLiA dataset (wrist PPG, 8 activities, 15 subjects), using CorNET-style distillation from a ResNet teacher:

| Model | Params | MAE (BPM) | Flash (KB) | Inference on Cortex-M7 |
| --- | --- | --- | --- | --- |
| Teacher (ResNet-50 1D) | 2.1M | 1.8 | 8,400 | N/A (GPU only) |
| Baseline student | 18K | 3.2 | 72 | 8 ms |
| Distilled student | 18K | 2.1 | 72 | 8 ms |
| Distilled + pruned + INT8 | 7K | 2.3 | 28 | 4 ms |

Knowledge distillation narrows the teacher-student gap from 1.4 BPM to 0.3 BPM without changing student architecture or inference cost.

On-Device Personalization After Distillation

The distilled student can be further personalized on-device using a small labeled calibration set from the individual user. As few as 5 minutes of PPG with reference heart rate (from a brief ECG recording or fingertip pulse oximeter) enables fine-tuning the final 1-2 layers of the student model to the user's specific waveform morphology.

Personalized distilled models consistently achieve sub-1 BPM MAE for familiar activity types, with per-user calibration requiring under 50 KB of on-device memory for gradient state.
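
A sketch of this last-layer personalization, assuming the frozen student exposes a D-dimensional penultimate feature vector per window; the data is synthetic and the function name is illustrative:

```python
import numpy as np

def personalize_head(features, hr_reference, w, b, lr=0.1, epochs=5000):
    """Fine-tune only the final linear layer of the frozen distilled student.

    features: (N, D) penultimate-layer outputs for N calibration windows;
    hr_reference: (N,) heart rates from a brief ECG/oximeter recording.
    Only w (D,) and b are updated, so gradient state stays tiny.
    """
    for _ in range(epochs):
        err = features @ w + b - hr_reference
        w = w - lr * (features.T @ err) / len(err)   # MSE gradient, backbone frozen
        b = b - lr * err.mean()
    return w, b

# Toy calibration set: 30 ten-second windows (~5 min), 16-D student features.
rng = np.random.default_rng(4)
X = rng.standard_normal((30, 16))
y = X @ rng.standard_normal(16) + 70.0      # synthetic reference heart rates
w, b = personalize_head(X, y, np.zeros(16), 0.0)
print("calibration MAE (BPM):", np.abs(X @ w + b - y).mean())
```

Keeping the backbone frozen means only a D+1 parameter gradient is ever materialized, which is how the memory cost stays this small on a microcontroller.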

For related content on efficient PPG signal processing, see PPG real-time processing on embedded systems, PPG power consumption design, and the complete ML pipeline for PPG analysis. For the deep learning foundations distillation builds on, see PPG deep learning heart rate estimation.

FAQ

Can knowledge distillation close the entire gap between teacher and student accuracy? Rarely completely. The student architecture's capacity ultimately limits how much the teacher's knowledge can be compressed into it. For PPG, the student typically retains 80-95% of the teacher's performance depending on the compression ratio. The remaining gap can sometimes be recovered through dataset augmentation and architecture search.

Does distillation work if the teacher is not well-calibrated? Poorly calibrated teachers (overconfident predictions) produce soft labels that are too peaked to provide useful relational information. Temperature scaling the teacher outputs to T = 3-5 before distillation recovers useful gradient signal even from overconfident teachers. Ensemble teachers are generally better calibrated than single models.

What is the minimum training dataset size for distillation to help? Distillation provides the most benefit when labeled data is limited (< 20 subjects for PPG). With large labeled datasets (> 200 subjects), the student trained on hard labels directly often approaches distilled performance. For clinical PPG applications where large labeled datasets are uncommon, distillation consistently improves student performance by meaningful margins.

Can the teacher and student use different input formats? Yes, with cross-modal distillation. A teacher processing raw PPG + accelerometer can distill knowledge into a student processing only raw PPG, effectively teaching the student to implicitly model motion artifacts that it cannot directly observe. This cross-modal distillation has shown 15-25% MAE improvements on PPG heart rate during exercise.