PPG Knowledge Distillation
A state-of-the-art PPG arrhythmia model may require 50 million parameters and 200 ms per inference on a server GPU. The same task must run in real-time on a wrist sensor with a 32 KB RAM budget and a 10 mAh battery. Knowledge distillation solves this: a large, accurate "teacher" model trains a compact "student" model to replicate its predictions, achieving 10–50x size reduction with less than 5% accuracy loss. This article explains how knowledge distillation works for PPG, which variants best suit cardiac signal processing, and real-world deployment benchmarks on wearable hardware.
The Core Idea: Learning from Soft Labels
Standard supervised learning trains a model on hard labels — binary class assignments (AF vs. non-AF) or scalar regression targets (heart rate = 72 BPM). Knowledge distillation, introduced by Hinton and colleagues (2015, arXiv, DOI: 10.48550/arXiv.1503.02531), instead trains the student model on the teacher's soft probability distributions.
Why does this matter for PPG? A hard label says "this 30-second segment is AF." A teacher model's soft output might say: "AF: 87%, Normal: 10%, PAC: 3%." This soft distribution encodes where the teacher is uncertain, which rhythm classes are confusable, and what the borderline cases look like. A student trained on soft labels therefore learns a richer representation than one trained on binary outputs alone.
The temperature parameter T controls softness. High T spreads probability mass across classes (very soft), revealing fine-grained relationships between similar cardiac rhythms. Low T approaches a hard label. Typical PPG distillation uses T = 3–5 during training.
The distillation loss combines two terms:
- Soft loss: cross-entropy between student and teacher soft probabilities (weighted by T²)
- Hard loss: cross-entropy between student predictions and true hard labels
- Total: α × soft_loss + (1-α) × hard_loss, where α = 0.7–0.9
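The combined loss above can be sketched in PyTorch. This is a minimal illustration of the standard Hinton-style formulation, not a production training loop; the soft term is written as a KL divergence (equivalent to the cross-entropy term up to a constant that does not affect gradients), and the T² scaling keeps gradient magnitudes comparable across temperatures.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, hard_labels, T=4.0, alpha=0.8):
    """Combined distillation loss: alpha * soft + (1 - alpha) * hard."""
    # Soft loss: KL divergence between temperature-softened distributions,
    # scaled by T^2 so gradient magnitude is roughly temperature-invariant.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard loss: standard cross-entropy against ground-truth labels.
    hard = F.cross_entropy(student_logits, hard_labels)
    return alpha * soft + (1 - alpha) * hard
```

During training, the teacher's logits are computed in a `torch.no_grad()` block so only the student receives gradients.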
Teacher Architecture for PPG
The teacher model is designed for accuracy, not efficiency. Common choices:
Large 1D ResNet: ResNet-34 or ResNet-50 with 1D convolutions. 20–50 million parameters. AUC > 0.95 on PhysioNet arrhythmia datasets. Computationally expensive (~100 ms per 30-second segment on edge hardware).
PPG Transformer: Multi-head self-attention over PPG time series tokens. Better at capturing long-range temporal dependencies (rhythm patterns across multiple beats) but higher FLOPs than CNNs of equivalent accuracy.
Ensemble teacher: Average predictions from multiple independently trained models. Ensemble soft labels contain richer uncertainty information and produce better student models, at the cost of teacher training overhead.
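An ensemble teacher's soft labels are simply the mean of each member's temperature-softened probabilities. A small sketch, assuming each teacher has already produced logits for the same batch:

```python
import torch
import torch.nn.functional as F

def ensemble_soft_labels(teacher_logits_list, T=4.0):
    """Average temperature-softened probabilities from several trained teachers."""
    probs = [F.softmax(logits / T, dim=-1) for logits in teacher_logits_list]
    return torch.stack(probs).mean(dim=0)
```

Averaging probabilities (rather than logits) preserves each teacher's calibration; the result plugs directly into the soft-loss term in place of a single teacher's distribution.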
Student Architecture for PPG
The student model must fit wearable hardware constraints. Practical targets:
- Parameter count: <500K parameters (for microcontroller deployment)
- Inference latency: <5 ms for real-time processing at 100 Hz sampling rate
- Memory footprint: <256 KB RAM for activations
- Power consumption: <1 mW average for continuous processing
Architectures meeting these constraints:
- MobileNet-1D: Depthwise separable convolutions, 3–6 blocks, 100–300K parameters
- MicroResNet: Residual 1D CNN with bottleneck blocks, ~150K parameters, designed for ARM Cortex-M4
- Compact LSTM: Single-layer LSTM with 64–128 hidden units, 50K parameters, suitable for sequential PPG processing
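To make the parameter budget concrete, here is a hypothetical MobileNet-1D-style student built from depthwise separable convolutions. The block structure and channel widths are illustrative assumptions, not a reference architecture; the point is that a few such blocks land well under the 500K-parameter target.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableBlock1d(nn.Module):
    """One MobileNet-style 1D block: depthwise conv, then a 1x1 pointwise conv."""
    def __init__(self, in_ch, out_ch, kernel_size=7, stride=2):
        super().__init__()
        self.depthwise = nn.Conv1d(in_ch, in_ch, kernel_size, stride=stride,
                                   padding=kernel_size // 2, groups=in_ch, bias=False)
        self.pointwise = nn.Conv1d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm1d(out_ch)
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

# Hypothetical student for a 30 s PPG segment at 100 Hz (3,000 samples, 1 channel)
student = nn.Sequential(
    DepthwiseSeparableBlock1d(1, 32),
    DepthwiseSeparableBlock1d(32, 64),
    DepthwiseSeparableBlock1d(64, 128),
    DepthwiseSeparableBlock1d(128, 128),
    nn.AdaptiveAvgPool1d(1),
    nn.Flatten(),
    nn.Linear(128, 3),  # e.g. AF / Normal / PAC
)
```

Depthwise separable convolutions split a standard convolution into a per-channel spatial filter and a 1×1 channel mixer, cutting parameters and multiply-accumulates by roughly the kernel size, which is what makes this family attractive on Cortex-M-class hardware.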
Distillation Variants for PPG
Response-Based Distillation
The simplest form: train the student to mimic the teacher's final output probabilities. Works well when teacher and student share the same output space (same classification categories) and the teacher is well-calibrated.
Limitation for PPG: if teacher and student architectures differ substantially, the student may not have the representational capacity to fully mimic the teacher's behavior, even with optimal training. The student is capacity-constrained, not just knowledge-constrained.
Feature-Level Distillation (Hint Training)
Intermediate feature maps in the teacher contain rich information about what has been learned. Feature distillation trains the student to mimic not just the teacher's final outputs but also the activations at selected intermediate layers (hints).
For PPG models, good hint layers are typically the final few convolutional blocks where cardiac-specific feature detectors are concentrated. The student must replicate these feature representations, which often requires a small adapter layer (1×1 convolution) to match dimensions if teacher and student layer sizes differ.
Romero and colleagues (2015, ICLR) showed hint training provides 2–5% additional accuracy improvement over response-only distillation, especially for very compact students. For PPG arrhythmia classification with a 100K-parameter student, hint training consistently outperforms response-only distillation by 3% AUC.
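The adapter-based hint loss described above can be sketched as follows. The 1×1 projection and the length-matching interpolation are assumptions made for illustration; the core idea is an L2 penalty between the (projected) student activations and the teacher's hint-layer activations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HintLoss(nn.Module):
    """Hint-training loss: project student features to the teacher's width, then L2."""
    def __init__(self, student_ch, teacher_ch):
        super().__init__()
        # 1x1 convolution adapter to match channel dimensions
        self.proj = nn.Conv1d(student_ch, teacher_ch, kernel_size=1)

    def forward(self, student_feat, teacher_feat):
        projected = self.proj(student_feat)
        # Match temporal length if teacher and student strides differ (an assumption)
        if projected.shape[-1] != teacher_feat.shape[-1]:
            projected = F.interpolate(projected, size=teacher_feat.shape[-1])
        return F.mse_loss(projected, teacher_feat)
```

The adapter's parameters train jointly with the student and are discarded at deployment, so it adds no inference cost on the wearable.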
Relation-Based Distillation
Instead of matching features directly, relation-based distillation matches the pairwise relationships between samples. The student learns that "these two PPG segments should produce similar representations" based on the teacher's similarity structure.
For PPG, this is powerful because the teacher has learned a rich cardiac similarity space: segments with the same arrhythmia type are close in embedding space; segments from different rhythm classes are distant. Training the student to replicate this geometry transfers clinical knowledge implicitly without requiring explicit labels for every example.
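One common way to match this similarity geometry, sketched here as an assumption rather than a specific published recipe, is to compare the batch's pairwise cosine-similarity matrices between student and teacher embeddings:

```python
import torch
import torch.nn.functional as F

def relation_loss(student_emb, teacher_emb):
    """Match the pairwise cosine-similarity structure of a batch across models."""
    s = F.normalize(student_emb, dim=-1)
    t = F.normalize(teacher_emb, dim=-1)
    # Gram matrices of within-batch similarities; penalize their difference.
    return F.mse_loss(s @ s.T, t @ t.T)
```

Because only relations between samples are matched, student and teacher embeddings may have different dimensionalities with no adapter layer needed.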
Hardware Deployment Benchmarks
ARM Cortex-M4 (128 MHz, 192 KB SRAM)
A baseline wearable processor comparable to the nRF52840 used in many fitness trackers:
| Model | Parameters | Inference Time (30s segment) | AUC (AF) |
|---|---|---|---|
| Teacher ResNet-34 | 21M | 2,400 ms | 0.962 |
| Student MobileNet-1D (from scratch) | 180K | 18 ms | 0.901 |
| Student MobileNet-1D (distilled) | 180K | 18 ms | 0.931 |
| Improvement | — | — | +3.0% |
Apple Watch Series 9 (S9 chip, dual-core 64-bit)
Far more capable than a microcontroller but still battery-constrained:
| Model | Inference Time | AUC (AF) | Power (continuous) |
|---|---|---|---|
| Teacher ResNet-50 | 45 ms | 0.971 | ~85 mW |
| Distilled MobileNet | 3.2 ms | 0.951 | ~8 mW |
The 10x power reduction enables continuous passive monitoring versus periodic batch inference — a clinically significant difference for detecting paroxysmal AF (which occurs intermittently and may be missed by periodic monitoring).
Temperature Calibration and Distillation Quality
Poorly calibrated teachers produce poor student models. A teacher that outputs probability = 0.97 for every confident prediction provides little soft-label signal. Before distillation:
- Calibrate the teacher using temperature scaling on a validation set (minimize NLL with temperature as the only free parameter)
- Verify soft labels are informative: check that prediction entropy is above 0.3 bits for borderline cases
- Set distillation temperature T slightly above the teacher's calibration temperature (typically 3–8 for PPG tasks)
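The temperature-scaling step above (minimizing NLL with temperature as the only free parameter) can be sketched in a few lines. This version uses a simple Adam loop over the log-temperature for robustness; the choice of optimizer and step count are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def calibrate_temperature(logits, labels, steps=500, lr=0.05):
    """Fit a single temperature on held-out validation logits by minimizing NLL."""
    # Optimize log(T) so the temperature stays positive.
    log_T = torch.zeros(1, requires_grad=True)
    opt = torch.optim.Adam([log_T], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.cross_entropy(logits / log_T.exp(), labels)
        loss.backward()
        opt.step()
    return log_T.exp().item()
```

A fitted calibration temperature above 1 indicates an overconfident teacher; the distillation temperature is then set slightly above this value before computing soft labels.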
For PPG models where certain rhythm pairs are commonly confused (PAC vs. sinus with aberrant conduction, AF vs. atrial flutter), well-calibrated soft labels contain specific information about these borderline cases that dramatically improves student performance on precisely the cases where accuracy matters most.
Internal Links
For the teacher architectures most commonly used, see PPG Convolutional Neural Networks and PPG Transformer Models. For the transfer learning approach that often precedes distillation, see Transfer Learning for PPG. For ensemble methods that produce better teacher soft labels, see PPG Ensemble Methods.
Frequently Asked Questions
What is knowledge distillation for PPG models? Knowledge distillation compresses a large, accurate PPG model (teacher) into a small, fast model (student) by training the student to mimic the teacher's probability distributions rather than just the hard class labels. The soft probability outputs contain information about confidence and class relationships that makes the student more accurate than training from scratch on the same hard labels.
Why is knowledge distillation important for wearable PPG devices? Wearable devices have severe computational and battery constraints — a smartwatch may have 1,000x less compute than a server. Knowledge distillation achieves 10–50x model size reduction with less than 5% accuracy loss, enabling real-time PPG classification (arrhythmia detection, heart rate estimation) within wearable power and latency budgets.
What is the temperature parameter in knowledge distillation? The temperature T is a softening parameter applied to the teacher's output probabilities before computing the distillation loss. Higher T spreads probability mass across all classes, revealing the teacher's "dark knowledge" about class relationships. For PPG arrhythmia distillation, T = 3–5 is typical. Very high T values (>10) make soft labels too uniform to be useful.
What is the difference between response-based and feature-based distillation? Response-based distillation trains the student to match only the teacher's final output probabilities. Feature-based distillation additionally trains the student to match intermediate layer activations (hints), providing richer supervision about what features the teacher has learned. Feature distillation typically provides 2–5% additional accuracy for compact PPG students.
Can knowledge distillation be combined with quantization? Yes, and the combination is often used in production. Knowledge distillation reduces the student model's parameter count and improves accuracy at low capacity. Quantization then converts the student's 32-bit floating-point weights to 8-bit integers, reducing memory by 4x and further accelerating inference on integer-optimized hardware. Post-training quantization after distillation typically degrades accuracy by less than 1%.
What open-source tools support PPG knowledge distillation? PyTorch's native gradient and loss APIs make implementing custom distillation straightforward. Higher-level frameworks include Hugging Face's Optimum (for transformer distillation), NVIDIA's TensorRT (for production optimization), and Apple's coremltools (for Apple Watch deployment). For research, the RepDistiller and MD3 repositories provide reference implementations of multiple distillation variants.