PPG Explainable AI
When a deep learning model flags an irregular heartbeat from a wrist PPG recording, what part of the waveform drove that decision? Explainable AI (XAI) techniques answer this question — and the answer matters for clinical deployment, regulatory approval, and debugging. Without explainability, a high-accuracy PPG model is a black box that cannot be audited, trusted, or improved systematically. This article covers the main XAI methods applicable to PPG models, what they reveal about cardiac signal classification, and what the FDA expects from explainable medical AI.
Why Explainability Matters for PPG AI
Clinical acceptance of AI-based diagnostics depends on clinician trust, and trust requires understanding. A cardiologist presented with an "AF detected" flag from a wrist PPG device will ask: which beats were irregular? Did the model respond to true rhythm irregularity or to motion artifact? Is this a real atrial fibrillation episode or a false positive from poor signal quality?
These are not theoretical questions. Black-box PPG models have been shown to learn spurious correlations: a model might achieve high training accuracy by detecting noise patterns correlated with disease in the training set rather than the true physiological markers. Explainability catches this before deployment.
From a regulatory standpoint, the FDA's 2021 AI/ML action plan and the EU's MDR 2024 guidance on AI medical devices both emphasize transparency of AI decision-making. While neither mandates specific XAI techniques, both require documentation of model inputs, decision logic, and failure modes — all of which XAI methods help establish.
Gradient-Based Attribution Methods
Gradient-weighted Class Activation Mapping (Grad-CAM)
Originally developed for 2D CNNs applied to images, Grad-CAM computes how much each spatial location (or, for 1D CNNs, each time point) contributed to a specific class prediction. The gradient of the class score with respect to the feature maps of the final convolutional layer identifies which temporal regions were most important.
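The 1D computation reduces to a few array operations once the feature maps and their gradients are in hand. The sketch below is a minimal numpy illustration, assuming you have already extracted the final conv layer's activations and the class-score gradients from your framework of choice (the toy arrays here are synthetic, not from a real PPG model):

```python
import numpy as np

def grad_cam_1d(feature_maps, gradients):
    """1-D Grad-CAM: weight each channel's feature map by the
    time-averaged gradient of the class score, sum over channels,
    then apply ReLU.

    feature_maps: (channels, time) activations of the last conv layer
    gradients:    (channels, time) d(class score)/d(feature_maps)
    Returns a (time,) saliency curve normalized to [0, 1].
    """
    weights = gradients.mean(axis=1)               # alpha_k: pooled gradients per channel
    cam = np.maximum(weights @ feature_maps, 0.0)  # ReLU(sum_k alpha_k * A_k)
    if cam.max() > 0:
        cam = cam / cam.max()
    return cam

# Toy example: 4 channels, 100 time steps; only channel 2's
# activations affect the (hypothetical) class score.
rng = np.random.default_rng(0)
A = rng.random((4, 100))
G = np.zeros((4, 100))
G[2] = 1.0
cam = grad_cam_1d(A, G)   # saliency tracks channel 2's activations
```

In practice the saliency curve lives at the (downsampled) resolution of the final conv layer and is interpolated back up to the input length before overlaying it on the raw waveform.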
Applied to PPG arrhythmia classification: Grad-CAM highlights the time windows within a 30-second PPG segment that most influenced the prediction. For AF, this should highlight periods of irregular RR intervals. For normal sinus rhythm, highlights should concentrate around consistent beat morphology.
Studies validating PPG Grad-CAM maps against physician annotations show concordance rates of 70–80%: the model attends to the same waveform features cardiologists use, most of the time. Discordances often reveal that the model is responding to noise or artifact rather than true cardiac signals — a debugging insight not available without explainability.
Integrated Gradients
Integrated gradients attribute the prediction by integrating the gradient along a path from a baseline input (typically a flat zero signal) to the actual input. Unlike standard gradients (which only capture local sensitivity at the input point), integrated gradients satisfy two formal axioms: sensitivity (a feature that changes the prediction gets nonzero attribution) and implementation invariance (equivalent models receive equivalent attributions regardless of implementation).
For PPG, integrated gradients produce smoother, more stable attributions than vanilla gradients. The baseline choice matters: using a flat signal as baseline measures attribution relative to "no signal present." Using an average PPG waveform as baseline better captures attributions relative to "normal cardiac activity."
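The path integral is approximated numerically in practice. The sketch below implements the Riemann-sum approximation against a toy differentiable model with an analytic gradient (a quadratic stand-in, not a real PPG network), and verifies the completeness property that attributions sum to the difference in model output between the input and the baseline:

```python
import numpy as np

def integrated_gradients(x, baseline, grad_fn, steps=100):
    """Approximate integrated gradients with a midpoint Riemann sum
    along the straight-line path from `baseline` to `x`.

    grad_fn(p) must return dF/dx evaluated at point p.
    """
    alphas = (np.arange(steps) + 0.5) / steps           # midpoint rule
    path = baseline + alphas[:, None] * (x - baseline)  # (steps, features)
    grads = np.stack([grad_fn(p) for p in path])        # (steps, features)
    return (x - baseline) * grads.mean(axis=0)

# Toy differentiable "model": F(x) = sum(w * x^2), gradient 2*w*x
w = np.array([1.0, -0.5, 2.0])
F = lambda x: float(np.sum(w * x**2))
grad_F = lambda x: 2 * w * x

x = np.array([1.0, 2.0, 0.5])
zero_baseline = np.zeros_like(x)   # "no signal present"
attr = integrated_gradients(x, zero_baseline, grad_F)

# Completeness axiom: attributions sum to F(x) - F(baseline)
print(np.allclose(attr.sum(), F(x) - F(zero_baseline)))  # True
```

Swapping `zero_baseline` for an average-waveform baseline changes only the `baseline` argument; the attributions are then measured relative to "normal cardiac activity" as described above.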
SHAP for PPG Models
SHAP (SHapley Additive exPlanations) applies Shapley values from cooperative game theory to machine learning attribution. SHAP values quantify each feature's contribution to a prediction relative to the expected model output. Unlike gradient methods, SHAP is model-agnostic and works for any ML model — CNNs, LSTMs, transformers, random forests.
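For intuition, the underlying Shapley value can be computed exactly when the feature set is small, by enumerating all coalitions and replacing "absent" features with a background (expected) value. This brute-force sketch uses a hypothetical 3-feature linear scorer (weights and values invented for illustration); real SHAP libraries use the approximations described below because the exact computation is exponential in the number of features:

```python
import numpy as np
from itertools import combinations
from math import factorial

def exact_shapley(model, x, background):
    """Exact Shapley values for a small feature vector: 'absent'
    features take their background value, present features keep
    their actual value. Cost is O(2^n), so n must stay small."""
    n = len(x)
    def value(coalition):
        z = background.copy()
        z[list(coalition)] = x[list(coalition)]
        return model(z)
    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(len(others) + 1):
            for S in combinations(others, k):
                wgt = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                phi[i] += wgt * (value(S + (i,)) - value(S))
    return phi

# Hypothetical linear AF score over 3 features
w = np.array([0.8, 0.3, -0.4])
model = lambda z: float(w @ z)
x = np.array([2.0, 1.0, 3.0])    # this patient's feature values
bg = np.array([1.0, 1.0, 1.0])   # population-average feature values
phi = exact_shapley(model, x, bg)

# For a linear model, Shapley values reduce to w * (x - background)
print(np.allclose(phi, w * (x - bg)))  # True
```

The linear-model identity at the end is a useful sanity check when validating any SHAP implementation.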
DeepSHAP for Neural Networks
For deep learning models, the DeepSHAP algorithm efficiently approximates Shapley values using backpropagation, making it practical for PPG time series with hundreds of time steps. The output is a signed attribution for each time step: positive values indicate that time step pushed the prediction toward the predicted class; negative values indicate it pushed away.
Visualization: plotting SHAP values as a colored overlay on the raw PPG waveform directly communicates to clinicians which waveform features drove the decision. Green overlays indicate positive contributions (pro-AF evidence in an AF detector), red overlays indicate evidence against AF.
KernelSHAP for Non-Deep Models
For traditional ML models trained on extracted PPG features (HRV metrics, morphological indices), KernelSHAP provides feature-level attributions. Example: an AF classifier trained on 15 HRV features produces a SHAP summary showing that overall variability (SDNN) contributed most to the positive prediction, followed by the short-term metric pNN50, while mean heart rate contributed negatively. This is directly interpretable by cardiologists.
LIME for Local Explanations
LIME (Local Interpretable Model-agnostic Explanations) generates a local surrogate model — a simple, interpretable model — that approximates the complex PPG model's behavior in the neighborhood of a specific input. For PPG, this means perturbing the input waveform (masking segments, adding noise) and observing prediction changes to fit a linear approximation.
LIME is computationally expensive (requires many model inference calls per explanation) but model-agnostic. It is particularly useful for explaining predictions on unusual or edge-case waveforms where gradient methods may be unreliable.
Temporal Segment LIME
For PPG time series, divide the signal into temporal segments (e.g., 20 equal-length segments for a 20-second recording). LIME perturbs by including or excluding each segment, then fits a weighted linear model. The linear coefficients identify which temporal segments were most important for the prediction — effectively a coarse-grained saliency map.
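A minimal implementation of this segment-masking scheme is sketched below, assuming a black-box `predict_fn` that maps a 1-D signal to a scalar score (the toy model here scores only one window of the signal, so the surrogate should recover that segment):

```python
import numpy as np

def segment_lime(signal, predict_fn, n_segments=20, n_samples=500, seed=0):
    """Coarse LIME for a 1-D signal: perturb by zeroing out random
    subsets of equal-length temporal segments, then fit a weighted
    linear surrogate whose coefficients score segment importance."""
    rng = np.random.default_rng(seed)
    seg_len = len(signal) // n_segments
    masks = rng.integers(0, 2, size=(n_samples, n_segments))  # 1 = keep
    masks[0] = 1                                              # include the unperturbed input
    preds = np.empty(n_samples)
    for i, m in enumerate(masks):
        z = signal.copy()
        for s in range(n_segments):
            if m[s] == 0:
                z[s * seg_len:(s + 1) * seg_len] = 0.0
        preds[i] = predict_fn(z)
    # Exponential kernel: perturbations closer to the full input weigh more
    dist = 1.0 - masks.mean(axis=1)
    weights = np.exp(-(dist**2) / 0.25)
    # Weighted least squares via sqrt-weighting both sides
    X = np.column_stack([np.ones(n_samples), masks])
    sw = np.sqrt(weights)[:, None]
    coef, *_ = np.linalg.lstsq(X * sw, preds * sw[:, 0], rcond=None)
    return coef[1:]   # per-segment importance

# Toy black box: the score depends only on samples 100..119 (segment 5)
sig = np.ones(400)
predict = lambda z: float(z[100:120].sum())
importance = segment_lime(sig, predict)
print(int(np.argmax(importance)))  # 5
```

The kernel width (0.25 here) is a tunable choice, as in standard LIME; narrower kernels localize the surrogate more tightly around the original input.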
Attention Visualization in Transformer Models
Transformer-based PPG models have a natural explainability mechanism: the attention weights between sequence positions. When an attention head assigns high weight to the connection between two time points, the model is explicitly using information from both to make its prediction.
For AF detection, attention maps from trained PPG transformers show that long-range attention heads learn to compare beat-to-beat intervals across the full recording — exactly the multi-beat comparison cardiologists make when assessing rhythm irregularity. Short-range heads attend to local morphology features (peak shape, dicrotic notch).
See our PPG Transformer Models article for the underlying architecture. Note that attention weights are not a perfect proxy for attribution — high attention weight does not necessarily mean high importance, as the subsequent linear transformation may downweight attended information. Gradient-based attribution applied atop attention weights (the "attention × gradient" method) provides more reliable explanations.
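One simple way to combine the two signals is to scale each attention weight by its gradient before aggregating, as in the sketch below. This is a toy numpy illustration with invented attention and gradient arrays (and a simple column-sum aggregation over queries, which is one of several reasonable choices), not a drop-in transformer utility:

```python
import numpy as np

def attention_times_gradient(attn, attn_grad):
    """Attention x gradient: scale raw attention weights by the
    gradient of the class score w.r.t. those weights, keep positive
    contributions, and average over heads.

    attn, attn_grad: (heads, seq, seq) arrays for one layer.
    Returns a (seq,) relevance score per time step (attended position).
    """
    relevance = np.maximum(attn * attn_grad, 0.0).mean(axis=0)  # (seq, seq)
    scores = relevance.sum(axis=0)   # gradient-weighted "attended-to" mass per position
    if scores.max() > 0:
        scores = scores / scores.max()
    return scores

# Toy layer: 2 heads, 6 positions (rows sum to 1). Head 0 attends
# strongly to position 3, head 1 to position 1 -- but the gradient
# says only head 1's attention affects the class score.
attn = np.full((2, 6, 6), 0.1)
attn[0, :, 3] = 0.5
attn[1, :, 1] = 0.5
grad = np.zeros((2, 6, 6))
grad[1] = 1.0
scores = attention_times_gradient(attn, grad)
print(int(np.argmax(scores)))  # 1
```

Note how raw attention alone would flag position 3 as strongly as position 1; weighting by the gradient correctly discounts the head the downstream transformation ignores.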
Internal Links
For the CNN architectures these XAI methods are applied to, see PPG Convolutional Neural Networks. For the clinical applications requiring regulatory-grade explainability, see PPG Atrial Fibrillation Screening. For machine learning pipeline design that incorporates XAI from the start, see PPG Machine Learning Pipeline.
Clinical Validation of XAI Explanations
Generating an explanation is not the same as validating it. A saliency map that highlights random waveform segments is worse than no explanation — it provides false confidence. XAI validation for PPG requires:
Physician concordance studies: Present XAI-highlighted PPG segments to cardiologists and ask them to rate whether the highlighted features match their clinical reasoning. Studies on ECG explainability show physician agreement with model attributions of 65–80% for correct predictions and as low as 40% for incorrect predictions — a useful debugging signal.
Sanity checks: Does the explanation change when the model is deliberately wrong? Does removing highlighted segments actually change the prediction? Do attributions flip appropriately when the class label changes? These programmatic checks catch explanations that are technically computed correctly but do not faithfully reflect model reasoning.
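The deletion check in particular is easy to automate. The sketch below runs it against a toy linear "model" with gradient-times-input attributions (both invented for illustration): zeroing the top-attributed samples should change the prediction far more than zeroing random ones.

```python
import numpy as np

def deletion_check(model, x, attributions, k=5):
    """Sanity check: zeroing the k most-attributed time points should
    change the prediction more than zeroing k random ones."""
    rng = np.random.default_rng(0)
    top = np.argsort(attributions)[-k:]
    rand = rng.choice(len(x), size=k, replace=False)
    def masked_pred(idx):
        z = x.copy()
        z[idx] = 0.0
        return model(z)
    base = model(x)
    drop_top = abs(base - masked_pred(top))
    drop_rand = abs(base - masked_pred(rand))
    return drop_top, drop_rand

# Toy linear model: only samples 50..59 of a 200-sample signal matter
w = np.zeros(200)
w[50:60] = 1.0
model = lambda z: float(w @ z)
x = np.ones(200)
attr = w * x   # gradient * input, exact for a linear model
drop_top, drop_rand = deletion_check(model, x, attr, k=10)
print(drop_top > drop_rand)  # True
```

An attribution method that fails this comparison on real PPG inputs is likely producing post-hoc rationalizations rather than faithful explanations.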
Consistency across similar inputs: Two PPG segments of the same arrhythmia type should produce similar explanation patterns. High explanation variance across similar inputs indicates model instability or spurious correlations.
Regulatory and Clinical Deployment Considerations
The FDA's predetermined change control plan (PCCP) framework for AI/ML medical devices requires documenting the types of modifications allowed without new 510(k) or PMA submissions. Explainability evidence — showing the model uses physiologically valid features — is essential for establishing PCCP justification.
For CE marking under MDR 2024 in Europe, Annex I requires "sufficient transparency" for AI medical devices. Explainability documentation does not currently need to follow a specific format, but regulatory submissions increasingly include XAI analysis as supporting evidence.
Frequently Asked Questions
What is explainable AI (XAI) for PPG signals? XAI for PPG refers to methods that identify which parts of a waveform drove a model's prediction. Instead of a black-box "AF detected" output, XAI provides a saliency map showing which time windows (irregular intervals, morphological anomalies) contributed most to the prediction. This enables clinical verification, debugging, and regulatory documentation.
What is Grad-CAM and how is it used for PPG models? Grad-CAM (Gradient-weighted Class Activation Mapping) computes the gradient of the predicted class score with respect to feature map activations in the final convolutional layer, then projects this back to the input time axis. For PPG, it produces a heatmap showing which time points within a waveform segment most influenced the classification. It is fast (single forward + backward pass) but only applicable to CNNs.
What is SHAP and why is it useful for cardiac AI? SHAP (SHapley Additive exPlanations) assigns each feature a contribution value based on Shapley game theory. For PPG, features can be individual time steps (in DeepSHAP) or extracted metrics like HRV indices. SHAP values are consistent and locally accurate, making them suitable for explaining individual predictions to clinicians in terms they understand.
Does explainability improve model performance? Not directly — XAI methods interpret existing models rather than improving them. However, XAI can guide model improvement by revealing spurious correlations or dataset biases. If Grad-CAM shows the model is attending to noise artifacts rather than cardiac morphology, you can augment training data to fix this, indirectly improving performance and reliability.
What do regulatory bodies require for explainable medical AI? The FDA does not currently mandate specific XAI techniques but requires documentation of model logic, decision criteria, and failure modes in AI/ML medical device submissions. The EU MDR 2024 requires "sufficient transparency" for AI-based medical devices. In practice, XAI analysis (typically Grad-CAM or SHAP plots with physician concordance data) is increasingly included in regulatory submissions as supporting evidence.
Can XAI methods be applied to PPG transformer models? Yes, though with some caveats. Attention weights provide a natural visualization of which time steps a transformer attended to, but attention weights are not the same as attribution. Gradient-based methods (integrated gradients, SHAP) applied atop transformer architectures provide more reliable feature attribution. Attention rollout and attention flow methods can also improve the reliability of attention-based explanations.
How do I evaluate whether a PPG model explanation is trustworthy? Run sanity checks: verify that removing the most-attributed time points changes the prediction; verify that explanations are consistent across similar inputs; conduct physician concordance studies; check that explanations flip appropriately when the prediction changes. An explanation that passes these checks is more likely to faithfully reflect model reasoning rather than being a post-hoc rationalization.