Real-Time PPG Signal Processing on Microcontrollers: Embedded Implementation Guide
Implementing real-time PPG signal processing on a microcontroller requires navigating tight constraints on compute cycles, memory, power, and latency that do not exist in desktop or cloud environments. A wrist-worn PPG device running on a 64 MHz ARM Cortex-M4 with 128 KB of RAM must perform analog front-end control, digital filtering, heart rate estimation, motion artifact removal, and SpO2 calculation -- all within the 10 ms inter-sample deadline at 100 Hz sampling, while consuming under 5 mW of processing power to preserve battery life.
This guide provides concrete implementation strategies for each stage of the real-time PPG processing pipeline on resource-constrained microcontrollers, with specific cycle counts, memory budgets, and optimization techniques validated on production hardware. For the algorithmic theory behind these methods, see our PPG signal processing algorithms guide and our PPG technology introduction.
Hardware Platform Selection
ARM Cortex-M Family
The ARM Cortex-M series dominates wearable PPG processors, with each core variant offering different capability-power tradeoffs:
Cortex-M0+ (e.g., STM32L0, SAM D21): 32-bit RISC core at 16-48 MHz with 16-64 KB RAM. No hardware FPU, no DSP extensions. Suitable for basic heart rate estimation using peak detection on pre-filtered PPG. Power consumption of 30-80 microwatts/MHz. The M0+ is used in ultra-low-power sensor nodes where processing is offloaded to a smartphone app via BLE.
Cortex-M4F (e.g., nRF52840, STM32L4, MAX32666): 32-bit RISC core at 48-120 MHz with 64-512 KB RAM. Hardware single-precision FPU, DSP extensions (single-cycle 32-bit MAC, SIMD for 16-bit data). This is the sweet spot for wearable PPG, providing sufficient compute for real-time filtering, heart rate, SpO2, and basic motion artifact removal while maintaining power consumption of 40-100 microwatts/MHz.
Cortex-M7 (e.g., STM32H7, i.MX RT1060): 32-bit RISC core at 200-600 MHz with 512 KB - 2 MB RAM. Hardware FPU (double-precision on many parts), DSP extensions running on a dual-issue superscalar pipeline, and L1 instruction and data caches whose size depends on the part (for example, 16 KB each on STM32H7). Suitable for advanced PPG processing including neural network inference for arrhythmia detection, multi-channel fusion, and simultaneous extraction of multiple physiological parameters.
Cortex-M33 (e.g., nRF5340, STM32U5): ARMv8-M architecture with TrustZone security, DSP extensions, optional FPU. Similar performance to M4F but with security features relevant for medical device firmware and data protection.
RISC-V Alternatives
RISC-V processors are emerging in wearable applications, with the Espressif ESP32-C3 (single-core RISC-V at 160 MHz) and GreenWaves GAP8/GAP9 (multi-core RISC-V with neural network accelerator) offering competitive alternatives. The GAP9 is particularly interesting for PPG applications because its 10-core RISC-V cluster with 1.5 MB L2 memory can execute neural network inference at 50-100x the efficiency of a Cortex-M4F, enabling on-device deep learning for heart rate and arrhythmia detection. We discuss neural network deployment further in our edge AI for PPG guide.
Real-Time Processing Pipeline Architecture
Sample-by-Sample vs. Block Processing
PPG processing algorithms can operate in two modes:
Sample-by-sample processing handles each new ADC sample immediately. IIR filters, adaptive filters, and peak detection naturally operate sample-by-sample. Latency is minimal (one sample period) and memory requirements are small (only filter state variables). This mode suits streaming applications where each sample must be processed before the next arrives.
Block processing accumulates N samples (typically 64-512 at 100 Hz, representing 0.64-5.12 seconds) and processes the entire block at once. FFT-based spectral analysis, wavelet transforms, and batch feature extraction require block processing. Latency equals the block length plus processing time. Memory requirements include the block buffer (N * 2-4 bytes) plus intermediate arrays.
Most practical PPG firmware uses a hybrid architecture: sample-by-sample IIR filtering and adaptive filtering run in the ADC interrupt service routine (ISR), while block-based FFT analysis and feature extraction run as a lower-priority periodic task triggered when a full block has been accumulated.
Interrupt-Driven Data Acquisition
The PPG analog front-end (AFE4404, MAX86141, ADPD4101, or similar) generates sample-ready interrupts at the configured sampling rate. The ISR must complete within one sample period (10 ms at 100 Hz, 4 ms at 250 Hz). A minimal ISR reads the ADC result via SPI or I2C (typically 2-5 microseconds for SPI at 8 MHz), applies per-sample filtering (1-5 microseconds for a 4th-order IIR filter on M4F), stores the result in a circular buffer, and returns. Total ISR duration should be under 20% of the sample period to leave sufficient CPU time for background processing.
A double-buffering scheme for block processing uses two buffers of size N. The ISR fills one buffer while the main loop processes the other. When the ISR fills buffer A, it switches to buffer B and signals the main loop. The main loop processes buffer A, then waits for the next signal. This ensures no samples are dropped during processing as long as block processing completes before the second buffer fills.
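The double-buffering scheme can be sketched in a few lines of C. This is a minimal illustration, not production firmware: the names pingpong_t, pingpong_push(), and pingpong_take() and the block length of 256 are assumptions made for the example, and it relies on single-word flag accesses being atomic, as they are on Cortex-M.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK_LEN 256u  /* samples per block (assumed; the text suggests 64-512) */

typedef struct {
    int32_t buf[2][BLOCK_LEN];     /* two statically allocated blocks */
    volatile uint8_t  fill_idx;    /* buffer the ISR is currently filling */
    volatile size_t   count;       /* samples written to buf[fill_idx] */
    volatile bool     block_ready; /* set by the ISR, cleared by the main loop */
    volatile uint32_t overruns;    /* blocks completed before the previous one was taken */
} pingpong_t;

/* Call from the sample-ready ISR with the (already filtered) sample. */
static void pingpong_push(pingpong_t *pp, int32_t sample)
{
    assert(pp->count < BLOCK_LEN);
    pp->buf[pp->fill_idx][pp->count++] = sample;
    if (pp->count == BLOCK_LEN) {
        if (pp->block_ready) {
            pp->overruns++;        /* main loop was too slow */
        }
        pp->fill_idx ^= 1u;        /* switch to the other buffer */
        pp->count = 0;
        pp->block_ready = true;    /* signal the main loop */
    }
}

/* Call from the main loop. Returns the completed block, or NULL if none. */
static const int32_t *pingpong_take(pingpong_t *pp)
{
    if (!pp->block_ready) {
        return NULL;
    }
    pp->block_ready = false;
    return pp->buf[pp->fill_idx ^ 1u];  /* the buffer the ISR just left */
}
```

The overruns counter makes the "processing must finish before the second buffer fills" condition observable during testing instead of failing silently.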
Fixed-Point Arithmetic Implementation
Q-Format Representation
For processors without hardware FPU (Cortex-M0, M0+, M3), fixed-point arithmetic is essential. The Q-format represents fractional numbers using integer storage:
Q15: 16-bit signed integer representing values in [-1.0, 0.99997]. Resolution is 2^(-15) = 3.05 * 10^(-5). Suitable for normalized PPG signal values and filter coefficients. Multiplication of two Q15 values produces a Q30 result in a 32-bit accumulator, requiring a right-shift by 15 to return to Q15 format.
Q31: 32-bit signed integer representing values in [-1.0, 0.99999999953]. Resolution is 2^(-31) = 4.66 * 10^(-10). Suitable for filter accumulators and intermediate calculations where Q15 precision is insufficient. Used for PPG applications requiring high dynamic range such as SpO2 ratio-of-ratios calculation.
The CMSIS-DSP library (ARM's official DSP software library) provides Q15 and Q31 implementations of all common signal processing functions, with inline assembly optimizations for each Cortex-M variant. Using CMSIS-DSP functions rather than writing custom fixed-point code is strongly recommended because they handle saturation, rounding, and overflow correctly across edge cases.
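To make the Q15 arithmetic concrete, the following sketch shows a rounded, saturating Q15 multiply in portable C. It is for illustration only; production code would normally call the CMSIS-DSP equivalents. The helper names sat_q15(), q15_mul(), and float_to_q15() are hypothetical, and the code assumes the compiler implements right shift of negative integers as an arithmetic shift (true for GCC, Clang, and Arm Compiler).

```c
#include <assert.h>
#include <stdint.h>

typedef int16_t q15_t;  /* same name CMSIS-DSP uses; defined here so the sketch stands alone */

/* Saturate a 32-bit value to the Q15 range [-32768, 32767]. */
static q15_t sat_q15(int32_t x)
{
    if (x > INT16_MAX) return INT16_MAX;
    if (x < INT16_MIN) return INT16_MIN;
    return (q15_t)x;
}

/* Q15 * Q15 gives a Q30 product; round it and shift back to Q15. */
static q15_t q15_mul(q15_t a, q15_t b)
{
    int32_t prod = (int32_t)a * (int32_t)b;  /* Q30 in a 32-bit accumulator */
    prod += 1 << 14;                         /* add half an LSB to round */
    return sat_q15(prod >> 15);              /* back to Q15, saturating */
}

/* Convert a float in [-1, 1] to Q15; +1.0 saturates to 32767. */
static q15_t float_to_q15(float x)
{
    assert(x >= -1.0f && x <= 1.0f);
    float scaled = x * 32768.0f;
    return sat_q15((int32_t)(scaled >= 0.0f ? scaled + 0.5f : scaled - 0.5f));
}
```

The only input pair that overflows is -1.0 * -1.0, which would be +1.0 and is not representable in Q15; the saturation step maps it to 32767.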
Practical Precision Requirements
Heart rate estimation is tolerant of limited precision: even Q15 (15 fractional bits) provides sub-BPM accuracy for peak detection and spectral analysis. SpO2 calculation is more demanding because the ratio-of-ratios (R = (AC_red/DC_red) / (AC_ir/DC_ir)) involves division of small numbers, and any error in R passes directly into the result: with a linear calibration of slope 25 (SpO2 = 110 - 25 * R), an absolute error of 0.04 in R shifts SpO2 by one percentage point. For SpO2, Q31 or single-precision floating-point is recommended.
HRV metrics are the most precision-sensitive because they depend on inter-beat interval (IBI) timing accuracy. At 100 Hz sampling, IBI resolution is 10 ms. Sub-sample interpolation using parabolic or Gaussian peak fitting can improve effective resolution to 1-2 ms but requires floating-point or Q31 arithmetic for the interpolation calculation. The RMSSD metric for short-term HRV is particularly sensitive to IBI timing noise: 5 ms RMS timing error introduces approximately 15% error in RMSSD for typical resting values (Peng et al., 2015, DOI: 10.1088/0967-3334/36/7/1441).
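Parabolic peak fitting needs only the peak sample and its two neighbours. The sketch below fits a parabola through the three points and returns the vertex position as a fraction of a sample. The function names parabolic_offset() and peak_time_ms() are chosen for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Fit a parabola through (-1, y_prev), (0, y_peak), (+1, y_next) and return
 * the vertex position in samples relative to the peak index. For a true
 * local maximum the result lies within [-0.5, 0.5]. */
static float parabolic_offset(float y_prev, float y_peak, float y_next)
{
    float denom = y_prev - 2.0f * y_peak + y_next;
    if (denom >= 0.0f) {
        return 0.0f;  /* flat or not a maximum: keep the integer index */
    }
    return 0.5f * (y_prev - y_next) / denom;
}

/* Convert a refined peak position to a timestamp in milliseconds. */
static float peak_time_ms(uint32_t peak_index, float offset, float fs_hz)
{
    assert(fs_hz > 0.0f);
    return ((float)peak_index + offset) * 1000.0f / fs_hz;
}
```

For example, a peak at sample 100 with an offset of 0.25 at 100 Hz is placed at 1002.5 ms rather than 1000 ms, which is the kind of sub-sample correction that reduces IBI timing noise.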
Digital Filtering on Microcontrollers
IIR Filter Implementation
IIR (Infinite Impulse Response) filters are preferred over FIR filters for embedded PPG because they achieve equivalent frequency response with 5-10x fewer coefficients, directly reducing both computation and memory. A 4th-order Butterworth bandpass filter (0.5-8 Hz at 100 Hz sampling) requires only 10 multiply-accumulates per sample using two cascaded biquad (second-order section) stages of five coefficients each, compared to 60-120 MACs for an equivalent FIR filter.
The CMSIS-DSP arm_biquad_cascade_df1_f32() function implements cascaded biquad filtering with performance of 13 cycles per biquad stage per sample on Cortex-M4F (including function call overhead). A complete 4th-order bandpass (2 biquad stages) processes one sample in approximately 26 cycles, or 0.33 microseconds at 80 MHz. This leaves 99.997% of the CPU time available for other processing at 100 Hz sampling.
Filter coefficient design should use MATLAB, SciPy, or equivalent tools on a development workstation, with coefficients exported as constants for the embedded firmware. Second-order section (SOS) form is mandatory for fixed-point implementations because direct-form higher-order filters suffer from coefficient sensitivity and potential instability in limited-precision arithmetic.
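The structure of a cascaded Direct Form I biquad filter is shown in the sketch below. It is a plain-C illustration of the technique, not the CMSIS-DSP source; the type and function names are assumptions. It uses the SciPy/MATLAB sign convention, where the feedback terms are subtracted. CMSIS-DSP documents the opposite convention (feedback terms added), so a1 and a2 exported from SciPy need to be negated before they are passed to arm_biquad_cascade_df1_f32().

```c
#include <assert.h>
#include <stddef.h>

#define NUM_STAGES 2u   /* a 4th-order bandpass is two second-order sections */

typedef struct {
    float b0, b1, b2;   /* numerator coefficients */
    float a1, a2;       /* denominator coefficients (a0 normalised to 1) */
    float x1, x2;       /* previous two inputs */
    float y1, y2;       /* previous two outputs */
} biquad_t;

/* One Direct Form I step: five multiply-accumulates per stage. */
static float biquad_step(biquad_t *s, float x)
{
    float y = s->b0 * x + s->b1 * s->x1 + s->b2 * s->x2
            - s->a1 * s->y1 - s->a2 * s->y2;
    s->x2 = s->x1;
    s->x1 = x;
    s->y2 = s->y1;
    s->y1 = y;
    return y;
}

/* Run one sample through all stages; each stage feeds the next. */
static float cascade_step(biquad_t *stages, size_t n, float x)
{
    assert(stages != NULL);
    for (size_t i = 0; i < n; i++) {
        x = biquad_step(&stages[i], x);
    }
    return x;
}
```

In firmware the stage array would be a static object initialised with the exported SOS coefficients, with the state fields left at zero.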
FIR Filter Applications
FIR filters are used in specific PPG applications where linear phase response is critical: derivative filters for slope detection, matched filters for template correlation, and decimation filters for sample rate conversion. An 8-tap derivative filter for systolic onset detection requires 8 MACs per sample and provides linear phase, ensuring that timing measurements are not distorted by filter group delay variation.
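The CMSIS-DSP FIR function arm_fir_f32() unrolls its inner loop to compute several output samples per pass, which keeps coefficients and state in registers and achieves approximately 2 cycles per tap per sample on Cortex-M4F. The Cortex-M4 SIMD instructions operate on 8- and 16-bit integer data, so they accelerate arm_fir_q15() rather than the floating-point version. A 32-tap FIR filter thus requires about 64 cycles per sample, or 0.8 microseconds at 80 MHz.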
Heart Rate Estimation Algorithms for Embedded
Peak Detection
Peak detection is the simplest and most computationally efficient heart rate estimation method. The standard approach identifies systolic peaks in the bandpass-filtered PPG signal by finding local maxima exceeding an adaptive amplitude threshold. The inter-beat interval (IBI) between consecutive peaks gives the instantaneous heart rate: HR = 60 / IBI (in BPM, with IBI in seconds).
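A streaming version of this peak detector fits in a small state structure. The sketch below is a simplified illustration: it takes the threshold as a parameter (the caller is assumed to adapt it), adds a refractory period to suppress double detections, and uses names (peak_det_t, peak_det_step(), ibi_to_bpm()) invented for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    float    prev2, prev1;  /* the last two samples */
    uint32_t n;             /* index of the sample about to arrive */
    uint32_t last_peak;     /* index of the last accepted peak */
    bool     have_peak;     /* true once a first peak has been seen */
    uint32_t refractory;    /* minimum spacing between peaks, in samples */
} peak_det_t;

/* Feed one bandpass-filtered sample. Returns the IBI in samples when a new
 * peak is confirmed (one sample late, because a local maximum needs its
 * right-hand neighbour), or 0 otherwise. */
static uint32_t peak_det_step(peak_det_t *d, float x, float threshold)
{
    uint32_t ibi = 0;
    if (d->n >= 2u && d->prev1 > d->prev2 && d->prev1 >= x &&
        d->prev1 > threshold) {
        uint32_t cand = d->n - 1u;  /* index of prev1 */
        if (!d->have_peak) {
            d->have_peak = true;
            d->last_peak = cand;
        } else if (cand - d->last_peak >= d->refractory) {
            ibi = cand - d->last_peak;
            d->last_peak = cand;
        }
    }
    d->prev2 = d->prev1;
    d->prev1 = x;
    d->n++;
    return ibi;
}

/* HR = 60 / IBI, with the IBI converted from samples to seconds. */
static float ibi_to_bpm(uint32_t ibi_samples, float fs_hz)
{
    assert(ibi_samples > 0u);
    return 60.0f * fs_hz / (float)ibi_samples;
}
```

With a 100 Hz sampling rate, an IBI of 80 samples (0.8 s) corresponds to 75 BPM.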
Pan-Tompkins-style algorithms adapted from ECG to PPG use derivative-based enhancement, squaring, and moving-window integration to create a detection function with sharp peaks at each cardiac cycle. Elgendi (2013) proposed a two-event-related moving average (TERMA) approach specifically designed for PPG systolic peak detection, achieving sensitivity of 99.39% and positive predictivity of 99.49% on the MIMIC-II dataset, with computational cost under 50 operations per sample (DOI: 10.1371/journal.pone.0076585).
On a Cortex-M4F, the complete TERMA peak detection pipeline (bandpass filter, two moving averages, threshold comparison, peak validation) requires approximately 150 cycles per sample, or 1.9 microseconds at 80 MHz. Memory requirements are 2 * W * 4 bytes for the moving average buffers (where W is the window length, typically 50-100 samples), plus 20-40 bytes for state variables. Total memory is approximately 500-1000 bytes.
Spectral Methods
FFT-based heart rate estimation computes the power spectrum of a PPG window and identifies the dominant frequency in the cardiac band (0.5-4 Hz). This approach is more robust to waveform distortion than peak detection because it relies on spectral power rather than morphological features.
A 256-point FFT at 25 Hz sampling (for example, a 100 Hz stream decimated by 4) covers 10.24 seconds with frequency resolution of 0.098 Hz (approximately 6 BPM). A 512-point FFT at 25 Hz gives 0.049 Hz resolution (approximately 3 BPM) over 20.48 seconds. The CMSIS-DSP arm_rfft_fast_f32() function computes a 256-point real FFT in approximately 4,500 cycles on Cortex-M4F, or 56 microseconds at 80 MHz. A 512-point FFT requires approximately 10,000 cycles (125 microseconds). These execution times are negligible relative to the block accumulation period.
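The resolution figures follow from the bin spacing fs / N. The sketch below converts bin spacing to BPM and picks the strongest bin inside the cardiac band from a precomputed power spectrum. It assumes the caller has already run the FFT and computed magnitude-squared values; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Width of one FFT bin expressed in beats per minute. */
static float bin_width_bpm(float fs_hz, uint32_t n_fft)
{
    return 60.0f * fs_hz / (float)n_fft;
}

/* Return the frequency (in BPM) of the strongest bin between f_lo_hz and
 * f_hi_hz. power[] holds n_fft/2 magnitude-squared values for bins
 * 0 .. n_fft/2 - 1. */
static float dominant_hr_bpm(const float *power, uint32_t n_fft,
                             float fs_hz, float f_lo_hz, float f_hi_hz)
{
    float df = fs_hz / (float)n_fft;           /* Hz per bin */
    uint32_t k_lo = (uint32_t)(f_lo_hz / df);
    if ((float)k_lo * df < f_lo_hz) {
        k_lo++;                                /* round up to stay in band */
    }
    uint32_t k_hi = (uint32_t)(f_hi_hz / df);  /* round down */
    if (k_hi > n_fft / 2u - 1u) {
        k_hi = n_fft / 2u - 1u;
    }
    assert(k_lo <= k_hi);

    uint32_t k_best = k_lo;
    for (uint32_t k = k_lo; k <= k_hi; k++) {
        if (power[k] > power[k_best]) {
            k_best = k;
        }
    }
    return 60.0f * (float)k_best * df;
}
```

For N = 256 and fs = 25 Hz, one bin is 5.859375 BPM wide, and bin 12 corresponds to 70.3125 BPM. Restricting the search to the 0.5-4 Hz band keeps strong low-frequency components such as baseline wander from being selected.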
Spectral peak tracking across consecutive FFT windows improves robustness by constraining heart rate changes to physiologically plausible rates (typically under 5 BPM change per second). The JOSS (joint sparse spectrum reconstruction) method tracks cardiac and motion spectral peaks simultaneously, providing robust heart rate estimation during exercise (Zhang et al., 2015, DOI: 10.1109/TBME.2014.2359372).
Memory for spectral analysis dominates the embedded budget: a 512-point float32 buffer requires 2048 bytes for the time-domain window, plus 2048 bytes for the FFT output (complex), plus working space. Total spectral analysis memory is approximately 5-8 KB.
SpO2 Calculation Pipeline
Ratio-of-Ratios on Embedded
SpO2 estimation requires simultaneous processing of red (660 nm) and infrared (940 nm) PPG channels. The standard calculation:
R = (AC_red / DC_red) / (AC_ir / DC_ir)
SpO2 = a - b * R
where a and b are empirically determined calibration coefficients (typically a = 110, b = 25 for the standard linear approximation, though quadratic or higher-order curves are used for clinical accuracy).
The AC component is extracted by bandpass filtering (0.5-5 Hz), and the DC component by low-pass filtering (below 0.5 Hz) or exponential moving average. Both channels must be sampled synchronously to avoid timing-related ratio errors. On a dual-channel AFE with TDM sequencing, the red and infrared samples are typically 100-500 microseconds apart within each cycle, which introduces negligible error at cardiac frequencies.
Fixed-point SpO2 calculation requires careful scaling to avoid overflow and maintain precision during the ratio computation. Using Q31 format with intermediate 64-bit accumulation for the division prevents loss of precision. The complete SpO2 pipeline (dual-channel bandpass filtering, AC/DC separation, ratio calculation, calibration lookup) requires approximately 500 cycles per sample pair on Cortex-M4F.
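For comparison with a fixed-point version, a floating-point reference of the ratio-of-ratios step is short enough to read at a glance. The sketch below uses the linear calibration a = 110, b = 25 given above and an exponential moving average for DC tracking. The function names, the clamping to 0-100%, and the use of -1 as an "invalid" marker are assumptions of this example; a real device would use its own empirically determined calibration.

```c
#include <assert.h>

/* Exponential moving average, usable as a DC tracker.
 * Smaller alpha gives a lower cut-off frequency. */
static float ema_step(float *state, float x, float alpha)
{
    assert(alpha > 0.0f && alpha <= 1.0f);
    *state += alpha * (x - *state);
    return *state;
}

/* SpO2 = a - b * R with the linear approximation a = 110, b = 25.
 * Returns -1 when the inputs cannot produce a valid ratio. */
static float spo2_linear(float ac_red, float dc_red, float ac_ir, float dc_ir)
{
    if (dc_red <= 0.0f || dc_ir <= 0.0f || ac_ir <= 0.0f) {
        return -1.0f;                    /* no usable signal */
    }
    float r = (ac_red / dc_red) / (ac_ir / dc_ir);
    float spo2 = 110.0f - 25.0f * r;
    if (spo2 > 100.0f) spo2 = 100.0f;    /* clamp to the physical range */
    if (spo2 < 0.0f)   spo2 = 0.0f;
    return spo2;
}
```

With R = 0.5 the function returns 97.5%. Running the same inputs through the Q31 implementation and comparing against this reference is a simple way to check the fixed-point scaling.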
Averaging and Alarm Logic
Clinical pulse oximeters apply temporal averaging to SpO2 estimates, typically over 4-12 beats or 4-8 seconds, to reduce beat-to-beat variability. The averaging window length trades off responsiveness against stability: shorter windows (4 beats) respond faster to desaturation events but have higher noise. FDA guidance recommends testing with specific desaturation profiles to ensure the device detects clinically significant drops (to 80% SpO2) within 15 seconds.
Alarm logic must handle signal quality assessment: corrupted SpO2 estimates from motion artifacts should be excluded from the averaging buffer rather than corrupting the output. A quality flag based on perfusion index (PI > 0.3%) and waveform template correlation (r > 0.8) gates each estimate before inclusion.
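The gating rule can be expressed as a small averaging buffer that only accepts estimates passing both quality checks. This is a minimal sketch using the thresholds quoted above (PI > 0.3%, r > 0.8); the 8-entry window and the names spo2_avg_t and spo2_avg_update() are assumptions for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define AVG_LEN 8u  /* number of accepted estimates to average (assumed) */

typedef struct {
    float    vals[AVG_LEN];
    uint32_t head;   /* next slot to overwrite */
    uint32_t count;  /* number of valid entries, up to AVG_LEN */
} spo2_avg_t;

/* Offer one SpO2 estimate with its quality metrics. Low-quality estimates
 * are dropped. Writes the current average to *out and returns true once at
 * least one estimate has been accepted. */
static bool spo2_avg_update(spo2_avg_t *a, float spo2, float pi_percent,
                            float template_r, float *out)
{
    assert(out != NULL);
    if (pi_percent > 0.3f && template_r > 0.8f) {
        a->vals[a->head] = spo2;
        a->head = (a->head + 1u) % AVG_LEN;
        if (a->count < AVG_LEN) {
            a->count++;
        }
    }
    if (a->count == 0u) {
        return false;  /* nothing trustworthy to report yet */
    }
    float sum = 0.0f;
    for (uint32_t i = 0; i < a->count; i++) {
        sum += a->vals[i];
    }
    *out = sum / (float)a->count;
    return true;
}
```

A rejected estimate leaves the reported average unchanged, so a burst of motion artifacts holds the last good value instead of dragging it toward a spurious reading.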
Motion Artifact Removal on Embedded
NLMS Implementation
The NLMS adaptive filter is the practical choice for embedded motion artifact removal because it provides good performance with O(N) complexity per sample. A 32-tap NLMS filter using 3-axis accelerometer reference (3 channels, each with 32 taps) requires 96 MACs per sample for the filtering operation, plus 96 updates for coefficient adaptation. Total cost is approximately 200 MACs per sample.
On Cortex-M4F, the SIMD instructions (two 16-bit MACs per cycle) apply to Q15 data, so a Q15 implementation of the 32-tap, 3-channel NLMS filter executes in approximately 100 cycles, or 1.25 microseconds at 80 MHz. CMSIS-DSP provides optimized single-reference NLMS implementations: arm_lms_norm_q15() for Q15 data and arm_lms_norm_f32() for floating point, the latter without the SIMD speedup. For a multi-axis accelerometer reference, three NLMS instances (one per axis) each produce a motion artifact estimate, and the sum of these estimates is subtracted from the PPG.
Memory for a 32-tap, 3-channel NLMS filter: 3 * 32 * 4 = 384 bytes for coefficients, 3 * 32 * 4 = 384 bytes for delay lines, plus state variables. Total is approximately 800 bytes, well within the budget of any Cortex-M4F device.
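A single-channel NLMS step in plain C shows where the MACs and the memory go. This sketch is not the CMSIS-DSP implementation: it handles one reference axis only, uses a simple shifting delay line rather than a circular one, and the names nlms_t and nlms_step() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define NLMS_TAPS 32u

typedef struct {
    float w[NLMS_TAPS];  /* adaptive coefficients */
    float x[NLMS_TAPS];  /* reference delay line, x[0] is the newest sample */
    float mu;            /* step size, 0 < mu < 2 */
    float eps;           /* small regulariser to avoid division by zero */
} nlms_t;

/* Process one sample pair: `ref` is the accelerometer sample, `ppg` the
 * contaminated PPG sample. Returns the error signal, which is the PPG with
 * the estimated motion artifact removed. */
static float nlms_step(nlms_t *f, float ref, float ppg)
{
    assert(f->mu > 0.0f && f->mu < 2.0f);

    /* Shift the delay line and insert the new reference sample. */
    for (size_t i = NLMS_TAPS - 1u; i > 0u; i--) {
        f->x[i] = f->x[i - 1u];
    }
    f->x[0] = ref;

    /* Filter output (artifact estimate) and input energy: one pass. */
    float est = 0.0f;
    float energy = f->eps;
    for (size_t i = 0; i < NLMS_TAPS; i++) {
        est    += f->w[i] * f->x[i];
        energy += f->x[i] * f->x[i];
    }

    /* Normalised coefficient update. */
    float err = ppg - est;
    float g = f->mu * err / energy;
    for (size_t i = 0; i < NLMS_TAPS; i++) {
        f->w[i] += g * f->x[i];
    }
    return err;
}
```

The structure holds 32 coefficients and 32 delay-line samples (256 bytes per axis), consistent with the per-channel budget above. Normalising by the input energy makes the step size independent of accelerometer amplitude, which varies widely between rest and exercise.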
For the algorithmic background on adaptive filtering approaches, see our motion artifact removal guide.
Accelerometer Integration
Most wearable PPG systems include a 3-axis MEMS accelerometer (e.g., Bosch BMA400, ST LIS2DW12, InvenSense ICM-42688) for motion detection and artifact removal. The accelerometer sampling rate should match or exceed the PPG sampling rate (100-250 Hz) and must be time-synchronized with PPG samples to within 1-2 ms for effective adaptive filtering.
Hardware synchronization is achieved by triggering both the PPG AFE and the accelerometer from a shared timer interrupt, or by using the accelerometer's data-ready interrupt to trigger PPG sampling. Software synchronization through timestamp-based interpolation is possible but introduces jitter that degrades adaptive filter performance by 1-3 dB compared to hardware synchronization (Lee et al., 2010, DOI: 10.1109/TBME.2009.2039568).
Memory Management Strategies
Static Allocation
Dynamic memory allocation (malloc/free) is strongly discouraged in real-time embedded systems because of fragmentation, non-deterministic allocation time, and heap overflow risks. All buffers, filter states, and data structures should be statically allocated at compile time.
A typical PPG processing firmware memory map on a 128 KB RAM device:
- Circular sample buffers (PPG + accelerometer): 4-8 KB
- FFT working buffers: 4-8 KB
- Filter coefficient and state storage: 1-2 KB
- Feature extraction buffers: 2-4 KB
- BLE stack: 8-16 KB (vendor-dependent)
- Application stack: 4-8 KB
- Remaining for application data: 80-100 KB
This budget leaves substantial headroom for additional features or larger analysis windows. Memory becomes constrained only when implementing neural network models (which may require 50-200 KB for weights and activations) or when multiple processing pipelines run simultaneously.
Circular Buffers
Circular (ring) buffers are the fundamental data structure for real-time PPG processing. They provide O(1) write (append new sample) and O(1) read (access any sample by offset), with automatic overwriting of the oldest data when full. A circular buffer of size N with a write pointer and read pointer uses exactly N * element_size bytes of sample storage, with no per-element overhead beyond the two indices.
For block processing triggered at regular intervals, the circular buffer size should be at least 2x the block size to allow the block processor to read a complete block while new samples are simultaneously being written. Power-of-two buffer sizes enable efficient modular addressing using bitwise AND instead of modular division.
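The power-of-two addressing trick looks like this in C. The sketch is a minimal single-writer buffer; the size of 512 and the names ring_t, ring_push(), and ring_get() are assumptions for the example.

```c
#include <assert.h>
#include <stdint.h>

#define RB_SIZE 512u            /* must be a power of two */
#define RB_MASK (RB_SIZE - 1u)
_Static_assert((RB_SIZE & RB_MASK) == 0u, "RB_SIZE must be a power of two");

typedef struct {
    int32_t  data[RB_SIZE];
    uint32_t head;              /* total number of samples written so far */
} ring_t;

/* Append a sample, overwriting the oldest one when the buffer is full. */
static void ring_push(ring_t *r, int32_t v)
{
    r->data[r->head & RB_MASK] = v;  /* bitwise AND replaces the modulo */
    r->head++;
}

/* Read the sample written `age` pushes ago; age 0 is the newest sample. */
static int32_t ring_get(const ring_t *r, uint32_t age)
{
    assert(age < RB_SIZE);
    return r->data[(r->head - 1u - age) & RB_MASK];
}
```

Because the head counter is never reduced modulo the buffer size, the unsigned subtraction in ring_get() stays correct across the wrap point, and the counter doubles as a running sample count for timestamping.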
Power Optimization Techniques
Clock Gating and Sleep Modes
The most effective power reduction strategy is minimizing the time the processor spends in active mode. Between PPG samples (typically well over 99% of the time at 100 Hz sampling), the processor should enter a low-power sleep mode (Sleep, Stop, or Deep Sleep depending on the platform and wake-up requirements).
On the nRF52840 (Cortex-M4F at 64 MHz), active mode current is 4.8 mA, System ON sleep (RAM retained, RTC running) is 1.5 microamps, and System OFF is 0.3 microamps. Processing one PPG sample requires approximately 500 cycles (including filtering and peak detection), taking 7.8 microseconds at 64 MHz. At 100 Hz sampling, active time is 780 microseconds per second, or 0.078% duty cycle. The average current for processing is therefore 0.078% * 4.8 mA + 99.922% * 1.5 microamps = approximately 5.2 microamps. This is negligible compared to LED drive current (typically 1-10 mA during pulses).
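The duty-cycle arithmetic is easy to parameterise for other clock rates or workloads. The helper below is a host-side budgeting sketch (it uses double precision and is not meant to run on the target); the function name avg_current_ua() is invented for this example, and wake-up latency and peripheral currents are ignored.

```c
#include <assert.h>

/* Average supply current in microamps for a processor that wakes once per
 * sample, runs `cycles_per_sample` cycles, and sleeps the rest of the time. */
static double avg_current_ua(double cycles_per_sample, double f_cpu_hz,
                             double fs_hz, double i_active_ua,
                             double i_sleep_ua)
{
    assert(f_cpu_hz > 0.0);
    double duty = cycles_per_sample / f_cpu_hz * fs_hz;  /* active fraction */
    return duty * i_active_ua + (1.0 - duty) * i_sleep_ua;
}
```

Calling avg_current_ua(500, 64e6, 100, 4800, 1.5) reproduces the figure above, giving about 5.2 microamps. Raising the per-sample cost to 5,000 cycles increases the duty cycle tenfold, which shows how quickly heavier algorithms start to matter.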
Algorithmic Power-Performance Tradeoff
More computationally intensive algorithms consume more power but may enable lower LED drive current (by extracting usable signals from noisier data) or longer LED-off periods (through predictive sampling). The total system power optimization considers the LED-processor power tradeoff: spending 1 mW more on processing to reduce LED power by 5 mW is a net win.
Compressed sensing (CS) approaches reduce LED activation events by 60-80%, saving 3-15 mW of LED power at the cost of 0.1-0.5 mW of additional processing for signal reconstruction. This tradeoff is almost universally favorable in wearable PPG systems, making CS a key power optimization tool for embedded implementations.
Testing and Validation
Real-Time Verification
Real-time PPG firmware must be verified for timing correctness: no sample drops, no buffer overflows, no deadline misses. GPIO toggling at ISR entry/exit allows oscilloscope measurement of ISR duration and jitter. Logging frameworks that record timestamps of key events (sample acquisition, block processing start/end, BLE transmission) help identify timing violations during extended operation.
Worst-case execution time (WCET) analysis should account for cache misses (on M7), interrupt nesting, and BLE stack activity that may preempt PPG processing. A real-time operating system (RTOS) such as FreeRTOS or Zephyr provides deterministic task scheduling, but adds 2-8 KB of code space and introduces context switch overhead of 5-20 microseconds per switch.
Algorithm Validation Against Benchmarks
Embedded PPG algorithms should be validated against established benchmark datasets before deployment. The IEEE Signal Processing Cup 2015 dataset, the PPG-DaLiA dataset (Reiss et al., 2019), and the MIMIC-III waveform database provide standardized PPG recordings with ground-truth annotations. Cross-validation between desktop (MATLAB/Python) and embedded (C/C++) implementations ensures that fixed-point quantization and memory constraints have not introduced accuracy regressions. Heart rate MAE should match within 0.5 BPM between desktop and embedded implementations on the same test data. For more on PPG algorithm validation methodologies, see our algorithms reference.