ChatPPG Editorial

Explainable AI for PPG Models: Saliency, SHAP, and Clinical Interpretability

How XAI methods like SHAP, Grad-CAM, LIME, and attention visualization explain PPG deep learning predictions for clinical trust, regulatory compliance, and model debugging.

ChatPPG Research Team
8 min read

Explainable AI (XAI) methods reveal which parts of a PPG waveform drive a deep learning model's prediction. Clinicians, regulators, and patients are right to ask why a model flagged atrial fibrillation or predicted elevated blood pressure, and XAI provides the technical apparatus to answer that question in physiologically meaningful terms.

Without interpretability, even a 99%-accurate PPG model is a black box that clinicians cannot trust, regulators cannot approve, and engineers cannot debug. XAI transforms PPG deep learning from an opaque pattern matcher into a system whose reasoning can be inspected, challenged, and validated against physiological knowledge.

Why Interpretability Matters for PPG AI

The stakes differ by application. For activity-adjusted heart rate estimation, a suboptimal model is an inconvenience. For clinical AF detection, an unexplained false negative could delay treatment for a patient at stroke risk. For ICU PPG monitoring, a model that flags impending hemodynamic instability needs to explain its signal basis to a physician who may choose whether to intervene.

The FDA's AI/ML-based Software as a Medical Device (SaMD) guidance explicitly calls for transparency in how models make predictions. The EU AI Act requires "meaningful explanations" for high-risk AI decisions affecting health. XAI is not merely academic; it is increasingly a regulatory necessity.

Beyond compliance, XAI serves three practical engineering purposes:

  1. Model debugging: When a model performs poorly on a specific population (e.g., dark skin tones), saliency maps reveal whether it is attending to artifact-contaminated spectral regions rather than true pulse-related features.
  2. Dataset bias detection: If saliency analysis shows the model consistently attends to the DC offset of the PPG signal (which correlates with sensor pressure), the model may have learned a dataset artifact rather than physiological signal.
  3. Clinical communication: Visualizing the diastolic notch and systolic peak regions as high-salience features provides physicians with an interpretable explanation frame that links model behavior to known physiology.

Gradient-Based Saliency Methods

The simplest XAI approach computes the gradient of the model output with respect to the input PPG signal. High-gradient regions are the signal samples where small changes most affect the prediction.

Vanilla gradients: Compute |∂y/∂x| for each time sample x and output y. Fast and differentiable through any architecture, but noisy due to gradient saturation in deep networks.
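A minimal numpy sketch of this idea, using central finite differences in place of framework autodiff (in practice the gradient comes from `torch.autograd` or `tf.GradientTape`; the toy model and synthetic signal below are illustrative, not from a real PPG network):

```python
import numpy as np

def vanilla_saliency(model, x, eps=1e-4):
    """Approximate |dy/dx| per time sample via central finite differences.

    A real implementation would backpropagate through the network instead;
    finite differences keep this sketch dependency-free.
    """
    sal = np.zeros_like(x)
    for i in range(len(x)):
        xp, xm = x.copy(), x.copy()
        xp[i] += eps
        xm[i] -= eps
        sal[i] = abs(model(xp) - model(xm)) / (2 * eps)
    return sal

# Toy "model" that only responds to the middle third of the segment
model = lambda x: np.tanh(x[50:100].sum())
# Synthetic 1.2 Hz "PPG" sampled at 50 Hz (about 72 bpm)
x = np.sin(2 * np.pi * 1.2 * np.arange(150) / 50)
sal = vanilla_saliency(model, x)
```

The saliency map is zero outside the region the toy model actually uses, which is exactly the sanity check one runs on a trained PPG model: attribution should vanish over signal regions the physiology says are irrelevant.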

Integrated Gradients (Sundararajan et al., 2017) interpolates between a baseline signal (typically zero or the mean PPG) and the actual input, integrating gradients along the path. This satisfies the completeness axiom: the signed attributions sum to the difference between the model's output on the actual input and on the baseline. For PPG, this produces cleaner attribution concentrated on physiologically meaningful waveform features.
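The path integral is approximated in practice by averaging gradients at interpolation steps. A self-contained sketch, again using finite differences as a stand-in for autodiff (the toy model is illustrative), lets us verify the completeness axiom numerically:

```python
import numpy as np

def grad(model, x, eps=1e-4):
    """Finite-difference gradient; replace with autodiff in practice."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        xp, xm = x.copy(), x.copy()
        xp[i] += eps
        xm[i] -= eps
        g[i] = (model(xp) - model(xm)) / (2 * eps)
    return g

def integrated_gradients(model, x, baseline=None, steps=50):
    """Midpoint-rule approximation of the path integral baseline -> x."""
    if baseline is None:
        baseline = np.zeros_like(x)
    alphas = (np.arange(steps) + 0.5) / steps
    avg = np.mean(
        [grad(model, baseline + a * (x - baseline)) for a in alphas], axis=0
    )
    return (x - baseline) * avg

# Toy model attending only to samples 20-39
model = lambda x: np.tanh(x[20:40].sum())
x = 0.5 * np.sin(np.arange(60) / 5.0)
attr = integrated_gradients(model, x)
```

The check worth carrying over to real models: `attr.sum()` should match `model(x) - model(baseline)` to within discretization error; if it does not, the step count or baseline choice needs revisiting.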

SmoothGrad (Smilkov et al., 2017) adds Gaussian noise to the input n times (typically 50 samples), computes gradients for each noisy version, and averages. This suppresses high-frequency attribution noise and produces smoother maps that are easier for clinicians to interpret.
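SmoothGrad wraps any attribution function, so it composes with the gradient methods above. A short sketch (the stand-in attribution function below simulates noisy vanilla gradients; the noise scale `sigma` is typically chosen as 10-20% of the signal's dynamic range):

```python
import numpy as np

def smoothgrad(saliency_fn, x, n=50, sigma=0.1, seed=0):
    """Average a per-sample attribution map over n noise-perturbed inputs."""
    rng = np.random.default_rng(seed)
    maps = [
        saliency_fn(x + rng.normal(0.0, sigma, size=x.shape)) for _ in range(n)
    ]
    return np.mean(maps, axis=0)

# Stand-in attribution: fixed "true" importance plus input-dependent jitter,
# mimicking the high-frequency noise of raw gradient maps
true_importance = np.concatenate([np.zeros(50), np.ones(50)])
noisy_saliency = lambda x: true_importance + 0.5 * np.sin(37.0 * x)
sg = smoothgrad(noisy_saliency, np.zeros(100), sigma=0.2)
```

Averaging drives the zero-mean jitter toward zero while the persistent structure survives, which is the whole mechanism behind the smoother, clinician-friendlier maps.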

Applied to an AF detection model, integrated gradients consistently highlight the beat-to-beat interval irregularity encoded in the IBI sequence rather than waveform amplitude, which aligns with the known physiological mechanism of AF.

Class Activation Mapping for 1D PPG

Grad-CAM (Gradient-weighted Class Activation Mapping, Selvaraju et al., 2017) is the most widely used XAI method for CNNs. Originally designed for 2D image classification, it adapts naturally to 1D PPG signals.

For a 1D CNN processing a PPG segment:

  1. Extract feature maps from the last convolutional layer
  2. Compute gradients of the predicted class score with respect to each feature map channel
  3. Weight each channel by the global average of its gradients over time (global average pooling of the gradient, keeping its sign)
  4. Sum the weighted feature maps and apply a ReLU to produce a temporal activation map over the input PPG segment

The result is a time-aligned "heat map" showing which portions of the raw PPG waveform were most important for the classification decision.
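The four-step procedure reduces to a few lines of numpy once the feature maps and their gradients have been extracted from the network via the framework's autodiff (the synthetic arrays below are illustrative; the final normalization is a display convention, not part of the method):

```python
import numpy as np

def grad_cam_1d(feature_maps, gradients):
    """Grad-CAM for a 1D CNN.

    feature_maps: (C, T') activations of the last conv layer
    gradients:    (C, T') d(class score)/d(feature map), from autodiff
    Returns a length-T' activation map; upsample to input length as needed.
    """
    weights = gradients.mean(axis=1)              # GAP of gradients per channel
    cam = np.maximum(weights @ feature_maps, 0)   # weighted sum, then ReLU
    return cam / (cam.max() + 1e-8)               # normalize for display

# Synthetic example: channel 0 activates in the second half of the segment
# and has positive gradient; channel 1 is active but has zero gradient.
A = np.zeros((2, 10)); A[0, 5:] = 1.0; A[1, :] = 1.0
G = np.zeros((2, 10)); G[0, :] = 0.5
cam = grad_cam_1d(A, G)
```

Note that the always-active but gradient-free channel contributes nothing: Grad-CAM weights activation by how much it mattered to the class score, not by how strong it was.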

In PPG AF detection, Grad-CAM activation maps typically peak during irregularly spaced beats and at the transition from regular to irregular rhythm. This confirms the model is learning rhythm-based rather than morphology-based features, which is reassuring from a clinical standpoint.

For PPG blood pressure estimation, Grad-CAM analysis has been particularly revealing. Several studies found that BP estimation models attend heavily to the early systolic shoulder and diastolic notch rather than the overall waveform shape, consistent with the known relationship between arterial stiffness markers and blood pressure (Elgendi et al., 2019, The Use of Photoplethysmography for Assessing Hypertension, npj Digital Medicine).

SHAP for PPG Feature Attribution

SHAP (SHapley Additive exPlanations, Lundberg & Lee, 2017) applies cooperative game theory to attribution. Each input feature's SHAP value represents its marginal contribution to the model output, averaged over all possible orderings of features.

For tabular PPG features (spectral power ratios, morphological indices, HRV metrics), SHAP is directly applicable and produces feature importance rankings with statistical confidence intervals.
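The defining computation — averaging each feature's marginal contribution over all orderings — can be written directly for a small feature set. This brute-force version is exponential in the number of features; KernelSHAP and TreeSHAP are the scalable approximations. Feature names and weights here are illustrative:

```python
import math
from itertools import permutations

import numpy as np

def exact_shap(model, x, background):
    """Exact Shapley values by enumerating feature orderings.

    'Absent' features are replaced by their background values; each feature's
    SHAP value is its marginal contribution averaged over all orderings.
    """
    n = len(x)
    phi = np.zeros(n)
    for order in permutations(range(n)):
        z = background.astype(float).copy()
        prev = model(z)
        for j in order:
            z[j] = x[j]          # "reveal" feature j
            cur = model(z)
            phi[j] += cur - prev
            prev = cur
    return phi / math.factorial(n)

# Toy tabular model over three PPG-derived features
# (e.g. a spectral ratio, an HRV index, a notch-depth score)
w = np.array([2.0, -1.0, 0.5])
model = lambda z: float(w @ z)
x = np.array([1.0, 3.0, -2.0])
phi = exact_shap(model, x, np.zeros(3))
```

For a linear model the Shapley value of feature j collapses to w_j * (x_j - background_j), and the values satisfy local accuracy: they sum to the gap between the prediction and the baseline prediction — the property the waterfall plots below visualize.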

For raw waveform models, DeepSHAP and GradientSHAP combine Shapley-value estimation with backpropagation, producing per-sample attributions at a computational cost practical for 1D CNN architectures.

SHAP waterfall plots are particularly effective for PPG clinical communication: they show, for a specific patient's specific PPG segment, which features pushed the AF probability prediction up (red bars) or down (blue bars) relative to the baseline prediction across the population.

A cardiologist reviewing an AF alert can see: "Spectral entropy of IBI sequence (+0.23), longest pause duration (+0.18), pNN50 irregularity (+0.15) all increased AF probability; high dicrotic notch presence (-0.08) and normal pulse wave velocity (-0.06) decreased it."

Attention Visualization for Transformer-Based PPG Models

Transformer architectures for PPG (see our PPG Transformer Models article) produce attention weights that have a natural interpretability structure: the attention matrix shows which time steps the model is attending to when processing each position.

Attention rollout (Abnar & Zuidema, 2020) propagates attention weights through all layers to compute the total influence of each input token on each output representation. For PPG Transformers with patch-based tokenization (e.g., 20-ms patches), attention rollout produces a coarse attention map over the input signal.
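A compact sketch of the rollout recursion, assuming head-averaged attention matrices have already been extracted from each layer (the 0.5/0.5 mixing with the identity is the standard way the method accounts for residual connections):

```python
import numpy as np

def attention_rollout(attn_layers):
    """Attention rollout (Abnar & Zuidema, 2020).

    attn_layers: list of (T, T) head-averaged, row-stochastic attention
    matrices, one per layer, ordered from first to last layer.
    Returns a (T, T) matrix: row i gives the total influence of each input
    token (e.g. 20-ms PPG patch) on output position i.
    """
    T = attn_layers[0].shape[0]
    rollout = np.eye(T)
    for A in attn_layers:
        A_res = 0.5 * (A + np.eye(T))                   # model the residual path
        A_res = A_res / A_res.sum(axis=-1, keepdims=True)  # re-normalize rows
        rollout = A_res @ rollout
    return rollout

# Synthetic 4-layer, 6-token example with random row-stochastic attention
rng = np.random.default_rng(0)
layers = [rng.dirichlet(np.ones(6), size=6) for _ in range(4)]
R = attention_rollout(layers)
```

Each row of the result remains a probability distribution over input patches, so it can be rendered directly as a coarse heat map over the PPG segment.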

Caution on attention as explanation: Multiple studies have shown that attention weights do not necessarily correspond to feature importance (Jain & Wallace, 2019; Wiegreffe & Pinter, 2019). High attention to a region does not mean that region causally drives the prediction. Gradient-based attribution methods applied through attention mechanisms provide more reliable explanations for Transformer PPG models.

LIME for Model-Agnostic PPG Explanation

LIME (Local Interpretable Model-agnostic Explanations, Ribeiro et al., 2016) fits a simple linear model locally around a specific prediction. For PPG, the procedure is:

  1. Segment the PPG signal into contiguous superpixels (e.g., individual cardiac cycles or fixed 0.5-second windows)
  2. Generate perturbed versions by randomly zeroing out segments
  3. Query the model on each perturbed version
  4. Fit a linear model to the perturbation outputs weighted by proximity to the original sample
  5. Use the linear model coefficients as local feature importances

LIME is model-agnostic and works for any PPG architecture: 1D CNNs, LSTMs, XGBoost on extracted features, or Transformers. Its limitation is stochasticity: LIME explanations vary between runs, which complicates clinical use where reproducibility is expected.
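The five steps condense into a short numpy sketch. Segment boundaries, sample counts, and the proximity kernel below are illustrative choices, not the reference implementation's defaults:

```python
import numpy as np

def lime_1d(model, x, n_segments=10, n_samples=200, seed=0):
    """Segment-level LIME for a 1D signal.

    Zeroes out random subsets of fixed-width segments, queries the model,
    and fits a proximity-weighted linear surrogate to the outputs.
    """
    rng = np.random.default_rng(seed)
    bounds = np.linspace(0, len(x), n_segments + 1).astype(int)
    Z = rng.integers(0, 2, size=(n_samples, n_segments))  # 1 = segment kept
    Z[0] = 1                                              # include the original
    y = np.empty(n_samples)
    for k, z in enumerate(Z):
        xp = x.copy()
        for s in range(n_segments):
            if not z[s]:
                xp[bounds[s]:bounds[s + 1]] = 0.0
        y[k] = model(xp)
    # Proximity kernel: perturbations closer to the original count more
    sw = np.sqrt(np.exp(-(n_segments - Z.sum(axis=1)) / 2.0))
    # Weighted least squares with an intercept column
    X = np.column_stack([np.ones(n_samples), Z])
    coef, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return coef[1:]                                       # per-segment importances

# Toy model that only "looks at" samples 30-39, i.e. segment 3
model = lambda s: float(s[30:40].sum())
importances = lime_1d(model, np.ones(100))
```

Because the toy model is exactly linear in the segment indicators, the surrogate recovers segment 3 as the sole driver; rerunning with a different `seed` is the quickest way to see the stochasticity issue noted above.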

Concept Bottleneck Models for Clinical Interpretability

Concept Bottleneck Models (Koh et al., 2020) offer a fundamentally different approach: force the model to make predictions by first explicitly predicting clinically meaningful intermediate concepts, then predicting the final label from those concepts.

For PPG, clinical concepts might include:

  • Dicrotic notch presence/absence
  • Regularity of inter-beat intervals (coefficient of variation)
  • Systolic-to-diastolic ratio
  • Pulse wave velocity estimate
  • Respiratory modulation depth

A concept bottleneck model first predicts these concept values from raw PPG, then predicts AF risk from the concept values. A clinician can inspect and even edit the intermediate concept predictions ("the dicrotic notch is actually present here") and see how that changes the final prediction. This intervention capability provides a level of clinical engagement that post-hoc XAI methods cannot match.
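The two-stage structure and the intervention mechanism fit in a small sketch. The concept extractor, concept set, and logistic weights below are illustrative stand-ins for trained components:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class ConceptBottleneck:
    """Toy two-stage model: signal -> clinical concepts -> AF risk.

    `concept_fn` stands in for a trained concept-prediction network; the
    second stage is a fixed logistic model on the concept vector.
    """
    def __init__(self, concept_fn, w, b=0.0):
        self.concept_fn, self.w, self.b = concept_fn, np.asarray(w), b

    def predict(self, x, override=None):
        c = self.concept_fn(x).copy()
        if override:                       # clinician edits concept values
            for i, v in override.items():
                c[i] = v
        return c, sigmoid(self.w @ c + self.b)

# Concepts: [IBI irregularity (coefficient of variation), dicrotic-notch score]
concept_fn = lambda ibi: np.array([np.std(ibi) / np.mean(ibi), 0.7])
model = ConceptBottleneck(concept_fn, w=[8.0, -2.0], b=-1.0)

ibi = np.array([0.80, 0.95, 0.60, 1.10, 0.70])   # irregular intervals (s)
concepts, risk = model.predict(ibi)
# Intervention: the clinician asserts the notch is clearly present (score 1.0)
_, risk_edited = model.predict(ibi, override={1: 1.0})
```

The edited prediction drops because stronger notch evidence counts against AF in this toy label model; that before/after comparison is exactly the interaction post-hoc methods cannot offer.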

Regulatory and Clinical Deployment Considerations

For FDA 510(k) submissions involving PPG AI models, explanation requirements are evolving. The FDA's 2021 AI/ML action plan and 2023 discussion paper on transparency both emphasize the need for explanations accessible to non-technical clinicians, not just algorithmic developers.

In practice, this means:

  • Explanations should be in physiological terms (pointing to the diastolic notch or an irregular rhythm), not mathematical terms (the gradient magnitude of feature-map channel 42)
  • Explanation methods should be validated: do clinicians who see the explanation make better decisions than those who do not?
  • Explanation uncertainty should be quantified: SHAP confidence intervals, LIME stability metrics

For related content, see our articles on PPG deep learning heart rate estimation, PPG machine learning pipeline, and PPG morphology features.

FAQ

Why can't clinicians just trust a high-accuracy PPG AI model without explanations? Accuracy on a validation dataset does not guarantee correct reasoning. A model could achieve 95% AF detection accuracy by learning a dataset artifact (e.g., all AF recordings were collected on one device with a characteristic noise signature) rather than detecting actual rhythm irregularity. XAI methods reveal this kind of spurious correlation before clinical deployment.

Which XAI method is best for PPG models? There is no universally best method. Integrated gradients and GradientSHAP are most theoretically grounded. Grad-CAM is computationally fastest for CNNs. LIME is most model-agnostic. Concept bottleneck models provide the most clinically actionable explanations. For high-stakes clinical decisions, using two independent methods and checking for agreement is good practice.

Can XAI explanations be wrong or misleading? Yes. All post-hoc XAI methods are approximations. LIME has high variance. Attention weights correlate poorly with causal importance. Even integrated gradients can point to non-causal features if the baseline choice is inappropriate. XAI explanations should be interpreted as hypotheses about model behavior, not ground truth, and should be validated through ablation studies and clinical expert review.

How do XAI methods handle the temporal structure of PPG signals? Methods like vanilla gradients operate at individual sample resolution, which can be too granular for clinical interpretation. Segment-level LIME and attention rollout for Transformer models operate at beat or cycle resolution, which aligns better with how clinicians think about PPG. Choosing the appropriate temporal granularity for the explanation is an important design decision.

Is explainability required for FDA approval of PPG AI? Not explicitly mandated as a technical requirement yet, but the FDA's evolving guidance strongly encourages transparency and expects developers to characterize performance across subgroups (implicitly requiring some level of model understanding). The EU AI Act's mandatory transparency requirements for high-risk AI systems are more prescriptive and apply to CE-marked medical devices using AI in Europe.