State Space Models for PPG: Can Mamba Beat Transformers on Long Waveforms?
State space models like Mamba offer linear-time sequence modeling for photoplethysmography, making them a compelling alternative to transformers when PPG windows span minutes or hours of continuous waveform data.

State space models may be among the best candidates for long-window PPG modeling: they can preserve information across thousands of timesteps at near-linear compute, which makes them far more practical than transformers when a wearable stream runs for minutes or hours instead of seconds.
Photoplethysmography is an awkward modality for deep learning. It looks simple at first. A wrist device or fingertip sensor shines light into tissue, measures the reflected or transmitted signal, and produces a neat waveform with a pulse-like shape. But real-world PPG is rarely neat for long. Motion artifacts, contact changes, ambient light leakage, sensor drift, vasomotor changes, respiration, and irregular rhythms can all distort the sequence. Once you move from short clean clips to continuous monitoring, the modeling problem becomes much harder.
That is exactly where state space models, especially modern selective state space models such as Mamba, become interesting. They promise long context reasoning without paying the full quadratic attention cost of transformers. For PPG researchers, that matters because many clinically useful patterns are not isolated inside a single beat. They unfold across dozens, hundreds, or thousands of beats. Sleep stage transitions, vascular tone changes, exercise recovery, arrhythmia burden, signal quality degradation, and context dependent heart rate variability all benefit from long temporal context.
In our earlier looks at transformer approaches for PPG and deep learning for heart rate estimation from PPG, the central tradeoff kept surfacing: richer sequence models usually want more context, but more context gets expensive fast. State space models change that conversation.
Why long context matters in PPG
A lot of PPG pipelines still operate on short windows, often 4 to 30 seconds. That makes sense for practical reasons. Labels are easier to align, training is faster, and inference on edge devices stays manageable. But short windows miss several things that matter in physiologic sensing.
First, many errors in PPG are contextual. A single distorted beat can look pathological or simply noisy depending on what came before and after it. This is one reason signal quality assessment for PPG is so important. Good models need memory of preceding morphology, motion contamination, and baseline dynamics.
Second, heart rhythm and vascular physiology contain information at multiple timescales. Beat level morphology matters, but so do slower trends such as respiratory modulation, autonomic shifts, perfusion changes, and recovery trajectories. A model that sees only a small snippet may estimate pulse rate correctly while missing the physiologic story.
Third, wearable applications increasingly want continuous outputs. Instead of predicting a label for one isolated chunk, systems must monitor streams. That pushes model design toward architectures that can update over time, retain useful state, and scale gracefully with duration.
Transformers can capture long dependencies well, but their standard self-attention cost grows quadratically with sequence length. For high sample rate PPG, or even modestly downsampled multi-minute segments, this becomes painful. You can use chunking, hierarchical attention, or sparse attention, but each workaround introduces complexity and often some loss of temporal fidelity.
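To make that cost gap concrete, here is a back-of-the-envelope comparison. The 64 Hz sample rate is an assumption for illustration; wearables commonly sample anywhere from about 25 to 128 Hz.

```python
# Back-of-the-envelope scaling comparison for a hypothetical 64 Hz PPG stream.
# Full self-attention touches all T^2 pairwise interactions; a state space
# model's recurrence grows linearly with T.
SAMPLE_RATE_HZ = 64

for minutes in (0.5, 5, 60):
    T = int(minutes * 60 * SAMPLE_RATE_HZ)  # sequence length in samples
    attention_pairs = T * T                 # quadratic in T
    ssm_steps = T                           # linear in T
    print(f"{minutes:>5} min: T={T:>7,}  "
          f"attention pairs={attention_pairs:>16,}  ssm steps={ssm_steps:>9,}")
```

At one hour of 64 Hz waveform, the pairwise interaction count climbs into the tens of billions, which is exactly the regime where the workarounds above become unavoidable.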
What a state space model is doing
A classical state space model represents a sequence through a hidden state that evolves over time. At each timestep, the model updates its state based on the current input and then produces an output. In control theory and signal processing, this framing is old and well understood. What changed recently is that deep learning researchers found ways to make state space models expressive enough to compete with attention-based architectures on long sequence tasks.
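As a minimal sketch of that recurrence, a discretized linear state space layer updates a hidden state from each input sample and emits an output. The parameters here are random placeholders for illustration, not the structured initialization the S4 line of work relies on:

```python
import numpy as np

# Minimal discrete-time linear state space recurrence:
#   x_t = A @ x_{t-1} + B * u_t
#   y_t = C @ x_t
# A, B, C are illustrative random parameters, not trained weights.
rng = np.random.default_rng(0)
state_dim = 8

A = rng.normal(scale=0.1, size=(state_dim, state_dim))  # state transition
B = rng.normal(size=(state_dim,))                        # input projection
C = rng.normal(size=(state_dim,))                        # output readout

def ssm_scan(u):
    """Run the recurrence over a 1-D input sequence u."""
    x = np.zeros(state_dim)
    y = np.empty_like(u)
    for t, u_t in enumerate(u):
        x = A @ x + B * u_t   # state carries a compressed history
        y[t] = C @ x          # per-timestep output
    return y

ppg_snippet = rng.normal(size=256)  # stand-in for a raw PPG window
print(ssm_scan(ppg_snippet)[:5])
```

Note that the output at time t depends on the entire history only through the fixed-size state x, which is the property that keeps compute and memory linear in sequence length.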
The structured state space line of work, including the S4 family, showed that carefully parameterized state dynamics can model long range dependencies effectively. The paper Efficiently Modeling Long Sequences with Structured State Spaces helped re-open interest in sequence models that are not built around attention alone. For biosignal modeling, that is attractive because physiological data are already naturally temporal and often benefit from inductive biases related to filtering, memory, and dynamical systems.
Mamba pushed this idea further. In Mamba: Linear-Time Sequence Modeling with Selective State Spaces, the model introduces input dependent selection so that the state update can decide what to keep, what to forget, and what to emphasize. That makes the model less like a fixed filter and more like a content aware recurrent system with strong hardware efficiency. The practical headline is simple: long sequences become much cheaper to process than with full attention.
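To show the selection idea in the simplest possible form, here is a heavily simplified caricature in which a hypothetical per-channel gate, computed from the current input, decides how much state to retain. This illustrates the concept only; it is not the actual Mamba block or its hardware-aware scan:

```python
import numpy as np

# Caricature of input-dependent ("selective") state updates: the retention
# factor is a function of the current input, so the state can forget quickly
# during uninformative spans and retain during informative ones.
rng = np.random.default_rng(1)
state_dim = 8
w_gate = rng.normal(size=(state_dim,))  # hypothetical selection weights
B = rng.normal(size=(state_dim,))       # input projection
C = rng.normal(size=(state_dim,))       # output readout

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def selective_scan(u):
    x = np.zeros(state_dim)
    y = np.empty_like(u)
    for t, u_t in enumerate(u):
        keep = sigmoid(w_gate * u_t)            # input-dependent retention
        x = keep * x + (1.0 - keep) * (B * u_t) # keep, forget, or overwrite
        y[t] = C @ x
    return y

print(selective_scan(rng.normal(size=128))[:5])
```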
Why Mamba is especially interesting for PPG
PPG is not just long. It is long, noisy, and locally redundant. Those three properties make Mamba a natural fit.
1. PPG has long recordings but limited marginal novelty per sample
A continuous PPG waveform can run for hours, yet adjacent samples are highly correlated. Transformers still pay attention costs for all pairwise interactions unless heavily optimized. A selective state space model can compress history into a running latent state, which is often a more sensible match for waveform streams.
2. Useful signal is mixed with nuisance variation
PPG models need to ignore a lot of junk without erasing true physiology. Motion spikes, clipping, baseline wander, low perfusion, and sensor repositioning can dominate large stretches of a recording. Mamba's selective mechanism suggests a useful behavior for this setting: do not memorize every point equally, and do not spend equal compute on every interaction. Instead, learn state updates that preserve salient information and downweight transient corruption.
3. Edge deployment matters
A realistic wearable model may need to run continuously with tight latency and power budgets. Even if training happens in the cloud, inference cost still matters. Linear-time sequence models are appealing here because they offer a path toward longer contexts without exploding memory. That could make advanced morphology-aware monitoring more feasible on-device or near-device.
Where transformers still have advantages
This is not a simple story where Mamba automatically beats transformers. Transformers are still extremely strong, especially when datasets are large, multimodal fusion is important, or interpretability through attention maps is useful. They also benefit from a mature tooling ecosystem and many proven pretraining recipes.
For PPG, transformers can be excellent when:
- sequence lengths are moderate
- you need direct interaction across distant points or across modalities
- you want to align signal with text, metadata, or event markers
- you can afford the compute and memory budget
So the better question is not “Are transformers obsolete?” It is “At what PPG timescale and deployment regime do state space models become the better engineering choice?”
My view is that the crossover point arrives earlier than many teams expect. Once you start working with minute scale or hour scale waveform context, especially from consumer wearables, efficiency stops being a nice-to-have and becomes central to product viability.
A practical architecture for long waveform PPG
If you were building a modern PPG foundation model today, a state space architecture would deserve a serious place on the shortlist. A sensible design could look like this (a code sketch follows the list):
- Front-end signal encoder: use a shallow convolutional or patching stem to convert raw PPG into local features. This stage can denoise slightly, reduce temporal resolution, and capture pulse morphology primitives.
- Selective state space backbone: feed the feature sequence into stacked state space blocks, potentially with residual connections and channel mixing. This is where long-context memory lives.
- Multi-scale heads: different downstream tasks need different temporal granularity. Heart rate estimation wants stable outputs at short intervals, while arrhythmia screening or sleep-related inference may want broader context. Multi-resolution heads can read from shared long-context embeddings.
- Quality-aware gating: add an explicit branch for signal quality or artifact confidence. This matters because poor-quality segments should influence predictions differently. The model can learn to condition its state updates on quality indicators rather than treating every chunk as equally trustworthy.
- Streaming inference mode: preserve recurrent state across windows during deployment. This is where state space models shine. You do not need to rerun full attention over the entire history each time a new chunk arrives.
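Below is a minimal PyTorch sketch of this shape. It is an illustration under stated assumptions, not a reference implementation: nn.GRU stands in for the selective state space backbone (a real build would swap in Mamba or S4 blocks, e.g. from the mamba-ssm package), and every layer size, head, and chunk length is hypothetical.

```python
import torch
import torch.nn as nn

class LongPPGModel(nn.Module):
    """Sketch: conv stem -> recurrent long-context backbone -> multi-scale heads."""

    def __init__(self, d_model=64, state_layers=2):
        super().__init__()
        # Front-end signal encoder: strided convs downsample the raw
        # waveform and extract local pulse-morphology features.
        self.stem = nn.Sequential(
            nn.Conv1d(1, d_model, kernel_size=9, stride=4, padding=4),
            nn.GELU(),
            nn.Conv1d(d_model, d_model, kernel_size=9, stride=4, padding=4),
            nn.GELU(),
        )
        # Long-context backbone (GRU as a stand-in for stacked SSM blocks).
        self.backbone = nn.GRU(d_model, d_model, num_layers=state_layers,
                               batch_first=True)
        # Quality-aware gating branch: per-step artifact confidence in [0, 1].
        self.quality = nn.Sequential(nn.Linear(d_model, 1), nn.Sigmoid())
        # Multi-scale heads: fast per-step pulse-rate readout plus a
        # pooled long-context head (e.g. rhythm or sleep-related inference).
        self.rate_head = nn.Linear(d_model, 1)
        self.context_head = nn.Linear(d_model, 4)

    def forward(self, ppg, state=None):
        # ppg: (batch, 1, samples); state carries memory across chunks.
        feats = self.stem(ppg).transpose(1, 2)     # (batch, steps, d_model)
        h, state = self.backbone(feats, state)
        q = self.quality(h)                        # (batch, steps, 1)
        h = h * q                                  # downweight low-quality spans
        rate = self.rate_head(h)                   # per-step estimate
        context = self.context_head(h.mean(dim=1)) # pooled long-context output
        return rate, context, state

# Streaming inference: persist the recurrent state across incoming chunks
# instead of re-running the model over the full history each time.
model = LongPPGModel().eval()
state = None
with torch.no_grad():
    for _ in range(3):                  # three incoming chunks
        chunk = torch.randn(1, 1, 1024) # stand-in PPG chunk
        rate, context, state = model(chunk, state)
print(rate.shape, context.shape)
```

The streaming loop at the bottom is the point of the design: each new chunk costs a fixed amount of compute because the backbone's state, not the raw history, carries the context forward.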
That kind of design could support continuous monitoring tasks much more naturally than a pure transformer stack.
What success would look like in PPG benchmarks
For Mamba style models to truly matter in PPG, they should not just match transformer accuracy on short benchmark slices. They should win in the settings that are hardest for current pipelines.
Important evaluation targets include:
- Long horizon heart rate tracking across motion and activity transitions
- Arrhythmia or irregular pulse detection where context before and after an event matters
- Sleep and recovery modeling with minute level to hour level temporal structure
- Signal quality aware inference where the model must stay calibrated under noise
- Cross-device generalization across wavelength, form factor, and sampling differences
- Streaming latency and memory footprint on realistic wearable hardware
This is where physiologic modeling should move next. Instead of optimizing only for clip-level accuracy on sanitized datasets, we should ask which architecture remains stable over continuous real-world streams.
Why inductive bias matters for biosignals
There is a broader reason state space models feel promising in health AI. Physiologic signals are generated by dynamical systems. Blood volume changes are not random token sequences. They are consequences of cardiovascular regulation, respiration, movement, autonomic input, device mechanics, and measurement physics.
Architectures that naturally represent state evolution may therefore have a better starting bias than architectures built for discrete token interactions. That does not guarantee better performance, but it does suggest a healthier alignment between model form and signal form.
This is also consistent with the broader digital medicine literature, which repeatedly shows that real-world wearable inference is shaped as much by temporal robustness and deployment constraints as by raw benchmark accuracy. If a model is elegant on a short offline dataset but fragile in continuous monitoring, it is not yet the right model.
Potential failure modes
There are real caveats.
Limited public long-form PPG datasets
A model designed for hour scale context is only as good as the data used to train it. Much of the public PPG literature still focuses on short segments or heavily preprocessed benchmarks. That can mask the very conditions where Mamba should help.
Long memory can accumulate errors
Long memory is useful, but it can also accumulate mistakes. If the model's internal state becomes contaminated by prolonged artifact, predictions may degrade until some reset or correction mechanism intervenes. Training for robust recovery matters.
Interpretability is still hard
Attention maps are not perfect explanations, but they at least provide a familiar visualization tool. State space models may be efficient, yet harder to inspect. In clinical or regulated settings, that can slow adoption unless paired with strong evaluation and confidence estimation.
Not every task needs huge context
If the task is short window pulse rate estimation from clean fingertip PPG, a small convolutional network may still be the better answer. Architecture choice should follow task demands, not hype.
Mamba versus transformer for real PPG workloads
If I had to summarize the comparison in one sentence, it would be this: transformers are still excellent general sequence learners, but Mamba may be better aligned with the economics of continuous physiologic sensing.
That statement matters because PPG is moving from research datasets to persistent wearable monitoring. Once the workload becomes “always on waveform understanding,” model efficiency becomes part of model quality. A method that enables longer context at fixed budget can outperform a theoretically stronger model that must be aggressively truncated.
In practice, I expect the strongest future systems to be hybrid. You might use a state space backbone for continuous low-cost memory, then layer sparse attention or cross-attention where global comparison is most valuable. This would preserve the efficiency advantage while still giving the system some explicit long-range interaction capability.
That hybrid path feels especially likely for multimodal wearables, where PPG is combined with accelerometry, skin temperature, electrodermal activity, or event logs. State space models can carry the stream efficiently, while targeted attention handles cross-modal fusion.
What researchers should test next
To move beyond speculation, the community should run a more serious comparison program (a profiling sketch follows the list):
- Train matched CNN, transformer, S4-like, and Mamba-like models on the same raw PPG streams
- Evaluate at multiple context lengths, from seconds to hours
- Report not only AUROC or MAE, but also memory use, latency, and degradation under motion
- Test streaming deployment with persistent hidden state
- Measure calibration under signal quality shifts and device transfer
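As a starting point, here is a minimal sketch of one axis of that program: wall-clock latency as a function of context length for any candidate model. The 64 Hz rate and the tiny stand-in network are illustrative assumptions; on GPU you would also record peak memory via torch.cuda.max_memory_allocated().

```python
import time
import torch
import torch.nn as nn

def profile_context_lengths(model, sample_rate_hz=64,
                            minutes=(0.5, 2.0, 10.0)):
    """Time a model mapping (batch, 1, samples) -> output across context lengths."""
    model.eval()
    for m in minutes:
        n = int(m * 60 * sample_rate_hz)  # context length in samples
        x = torch.randn(1, 1, n)          # stand-in PPG segment
        with torch.no_grad():
            t0 = time.perf_counter()
            model(x)
            elapsed = time.perf_counter() - t0
        print(f"{m:>5.1f} min ({n:>7,} samples): {elapsed * 1e3:8.1f} ms")

# Stand-in model; swap in the matched CNN / transformer / S4 / Mamba variants.
stand_in = nn.Sequential(nn.Conv1d(1, 16, kernel_size=9, padding=4),
                         nn.GELU(),
                         nn.Conv1d(16, 1, kernel_size=9, padding=4))
profile_context_lengths(stand_in)
```

Running the same harness over all four matched architectures, at the same context lengths, is what would turn the efficiency claims above into measurable differences.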
This kind of benchmark would tell us whether state space models are merely efficient, or genuinely better suited to long-form physiology.
My expectation is that Mamba style architectures will be strongest where three conditions are present at once: very long sequences, noisy real-world acquisition, and limited compute budget. That is almost a perfect description of consumer and ambulatory PPG.
The bigger picture for ChatPPG
For anyone building the next generation of PPG intelligence, state space models should not be treated as a side experiment. They should be treated as a serious architectural candidate for foundation models over continuous waveform streams.
Transformers helped push biosignal modeling forward by proving that sequence context matters. State space models may push it further by making that context cheap enough to use at the timescales physiology actually lives on.
If that happens, the question will stop being whether Mamba can beat transformers on a leaderboard. The real question will be whether it unlocks a new class of continuous, morphology-aware, deployment-ready PPG systems that were previously too expensive to build.
For long waveform monitoring, that is the more important win.
FAQ
What is a state space model in the context of PPG?
A state space model maintains a hidden internal state that updates over time as new PPG samples arrive. Instead of comparing every timestep to every other timestep as full attention does, it compresses useful history into a running representation. That is appealing for long physiologic sequences.
Why might Mamba outperform a transformer on long PPG recordings?
Mamba uses selective state space updates with roughly linear scaling in sequence length, while standard transformers scale quadratically with full self-attention. When PPG windows span minutes or hours, that efficiency can allow the model to process more context and remain practical for deployment.
Does Mamba replace transformers for all PPG tasks?
No. Short, clean, narrowly defined tasks may still be better served by compact CNNs or transformers. Mamba is most compelling when long context, streaming inference, and compute efficiency are all important.
What PPG applications benefit most from long-context models?
Continuous heart rate tracking, arrhythmia screening, recovery monitoring, sleep-related analysis, vascular trend detection, and robust signal quality aware inference are all strong candidates because they depend on patterns that unfold over many beats or long intervals.
Are state space models good for noisy wearable data?
Potentially yes. Their selective memory mechanism may help preserve useful physiology while ignoring transient corruption. But this still has to be validated carefully on real wearable datasets with motion, missingness, and device shift.
What is the main limitation of using Mamba for PPG today?
The biggest limitation is probably data and benchmarking, not theory. Public long-duration, weakly processed, clinically meaningful PPG datasets are still limited, so it is hard to prove the full benefit of long-context architectures without better evaluation setups.
References
- Gu, A., & Dao, T. (2023). Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv. https://arxiv.org/abs/2312.00752
- Gu, A., Goel, K., & Ré, C. (2021). Efficiently Modeling Long Sequences with Structured State Spaces. arXiv. https://arxiv.org/abs/2111.00396
- Digital medicine perspective on machine learning for real-world health sensing. https://doi.org/10.1038/s41746-019-0136-7