Public PPG Datasets and Benchmarks for Research: A Comprehensive Guide

Complete catalog of public PPG datasets and benchmarks for heart rate estimation, SpO2, blood pressure, and atrial fibrillation detection research.

ChatPPG Research Team


Reproducible PPG research depends on standardized datasets and benchmarks, yet the landscape of publicly available PPG data remains fragmented across dozens of repositories with inconsistent formats, annotation quality, and documentation. This guide provides a comprehensive catalog of the most important public PPG datasets, organized by application domain, with practical guidance on their strengths, limitations, and appropriate use cases.

Access to high-quality annotated data is the single largest bottleneck in PPG algorithm development. While clinical data repositories like MIMIC contain massive volumes of waveform data, the annotation quality and demographic diversity often limit their utility for training robust models. Understanding which datasets are appropriate for which research questions, and how to properly benchmark against them, is essential for producing work that advances the field. For background on PPG signal characteristics and acquisition, see our introduction to PPG technology.

Heart Rate Estimation Datasets

Heart rate (HR) estimation from wrist-worn PPG during physical activity is the most extensively benchmarked PPG task, driven by commercial demand for accurate fitness tracking and by several organized competitions that produced standardized datasets.

IEEE Signal Processing Cup 2015

The IEEE SP Cup 2015 dataset, introduced by Zhang et al. (2015), is the single most cited benchmark for PPG heart rate estimation during motion (DOI: 10.1109/JBHI.2014.2311044). It contains recordings from 12 subjects performing a sequence of activities: 30 seconds rest, 1 minute walking (1-2 km/h), 1 minute running (6-8 km/h), 1 minute arm movements, and 30 seconds rest. Each recording includes 2-channel PPG (green wavelength) and 3-axis accelerometer data sampled at 125 Hz, with simultaneous ECG-derived reference heart rate.

The dataset's primary strength is its widespread adoption, enabling direct comparison across methods. Zhang's original TROIKA algorithm achieved mean absolute error (MAE) of 2.34 BPM on this dataset. Subsequent methods have pushed this to below 1.5 BPM, with deep learning approaches (Reiss et al., 2019) achieving 1.17 BPM MAE using temporal convolutional networks. However, the dataset's limitations are significant: only 12 subjects, a fixed activity protocol, and recordings from a single device and body location. The small subject count means that leave-one-subject-out (LOSO) results have high variance, and reported average MAE values can be heavily influenced by one or two difficult subjects.
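The per-subject variance concern can be made concrete with a small sketch: report per-subject MAE alongside the pooled average, since one difficult subject can dominate the mean. The data layout below is hypothetical, chosen only for illustration.

```python
import numpy as np

def per_subject_mae(estimates, references):
    """Per-subject and pooled MAE (BPM) for windowed HR estimates.

    estimates, references: dicts mapping subject ID -> sequence of
    windowed HR values (hypothetical layout, for illustration only).
    """
    maes = {s: float(np.mean(np.abs(np.asarray(estimates[s], dtype=float)
                                    - np.asarray(references[s], dtype=float))))
            for s in estimates}
    vals = np.array(list(maes.values()))
    return maes, float(vals.mean()), float(vals.std())

# Two easy subjects and one hard subject: the hard subject inflates the
# plain average, which the per-subject breakdown makes visible.
est = {"S1": [70, 71], "S2": [80, 79], "S3": [95, 120]}
ref = {"S1": [70, 70], "S2": [80, 80], "S3": [80, 80]}
maes, mean_mae, std_mae = per_subject_mae(est, ref)
```

With only 12 subjects, the per-subject breakdown and the across-subject standard deviation are more informative than a single averaged MAE.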

PPG-DaLiA (PPG Dataset for Daily Life Activities)

PPG-DaLiA, published by Reiss et al. (2019) from the German Research Center for Artificial Intelligence (DFKI), addresses many limitations of the IEEE SP Cup dataset (DOI: 10.3390/s19143079). It includes 15 subjects performing 8 daily life activities: sitting, ascending stairs, playing table soccer, cycling, driving, lunch break, walking, and working at a desk. Each subject's session lasts approximately 2.5 hours, providing substantially more data per subject.

The sensor setup includes 4-channel PPG (2 green, 1 red, 1 infrared) from an Empatica E4 wristband, 3-axis accelerometer, electrodermal activity, and body temperature. ECG from a RespiBAN chest strap provides the reference heart rate at 700 Hz. All signals are synchronized and resampled to 64 Hz for the PPG channels and 32 Hz for the accelerometer.

Using LOSO cross-validation, the baseline methods reported MAE values of 7.82 BPM (SpectroTemporalNet), 7.65 BPM (DeepPPG), and 9.84 BPM (classical spectral tracking). The higher error rates compared to the IEEE SP Cup dataset reflect the greater diversity and difficulty of the activities. This dataset is strongly recommended for evaluating motion artifact removal algorithms in realistic conditions.
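As a point of reference, the "classical spectral tracking" baseline reduces, in its simplest form, to picking the dominant spectral peak in the cardiac frequency band. The sketch below runs on a synthetic window and is not the actual PPG-DaLiA baseline implementation.

```python
import numpy as np

def spectral_hr(ppg, fs, lo=0.7, hi=3.5):
    """Estimate HR (BPM) as the dominant FFT peak in the cardiac band
    (0.7-3.5 Hz, i.e. 42-210 BPM). Minimal spectral-baseline sketch."""
    x = np.asarray(ppg, dtype=float)
    x = x - x.mean()
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    mag = np.abs(np.fft.rfft(x))
    band = (freqs >= lo) & (freqs <= hi)
    return 60.0 * freqs[band][np.argmax(mag[band])]

# Synthetic 8-second window at 64 Hz (PPG-DaLiA's PPG rate): a 1.5 Hz
# (90 BPM) cardiac component plus additive noise.
fs = 64
t = np.arange(0, 8, 1 / fs)
rng = np.random.default_rng(0)
ppg = np.sin(2 * np.pi * 1.5 * t) + 0.3 * rng.standard_normal(t.size)
hr = spectral_hr(ppg, fs)  # close to 90 BPM
```

Note the 8-second window limits frequency resolution to 0.125 Hz, i.e. 7.5 BPM bins, one reason longer windows and peak tracking are used in practice.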

WESAD (Wearable Stress and Affect Detection)

WESAD, also from DFKI (Schmidt et al., 2018), targets stress detection but is valuable for PPG heart rate research due to its multimodal sensor setup (DOI: 10.1145/3242969.3242985). It includes data from 15 subjects wearing both a wrist-worn Empatica E4 and a chest-worn RespiBAN Professional during a protocol designed to induce baseline, stress, and amusement states. The protocol uses the Trier Social Stress Test for stress induction and funny video clips for amusement.

Sensor data includes wrist PPG (64 Hz), wrist accelerometer (32 Hz), wrist EDA, wrist skin temperature, chest ECG (700 Hz), chest EMG, chest EDA, chest respiration, and chest temperature. Affect labels (baseline, stress, amusement, meditation) are provided. The dataset is publicly available and has been used in over 200 publications since its release.

Blood Pressure Estimation Datasets

PPG-based blood pressure estimation is one of the most active and challenging areas of PPG research, and the availability of paired PPG-BP data is critical for developing and validating cuffless blood pressure methods.

MIMIC-III Waveform Database

The MIMIC-III Waveform Database, part of the broader MIMIC-III Clinical Database maintained by the MIT Lab for Computational Physiology (Johnson et al., 2016; DOI: 10.1038/sdata.2016.35), is the largest publicly available source of continuous PPG waveforms with simultaneous arterial blood pressure (ABP). It contains matched waveform records from over 10,000 ICU patients at Beth Israel Deaconess Medical Center, with synchronized PPG, ABP, ECG, and other physiological signals sampled at 125 Hz.

The MIMIC-III database has been the foundation of hundreds of PPG-BP studies, including the influential work by Kachuee et al. (2017) who extracted pulse transit time (PTT) features from ECG-PPG pairs and achieved systolic BP estimation error of 11.17 mmHg standard deviation using an AdaBoost regression model (DOI: 10.1109/JBHI.2015.2514202). El Hajj and Bhatt (2020) used a deep learning approach on the same database and reported MAE of 4.41 mmHg for systolic BP and 2.91 mmHg for diastolic BP.

Critical limitations of MIMIC-III for BP research include: the ICU population is not representative of the general ambulatory population; many patients have arterial line damping artifacts that corrupt the ABP reference; the data is collected during critical illness, introducing pathological hemodynamics; and signal quality varies enormously across records. Researchers must implement careful quality screening (typically removing segments with ABP outside 40-200 mmHg systolic, PPG signal quality index below threshold, and records shorter than minimum duration) before using MIMIC data for model development.

Access requires completion of the CITI training program and a data use agreement through PhysioNet.

VitalDB

VitalDB (Lee and Jung, 2018) is a large-scale surgical vital signs database from Seoul National University Hospital containing 6,388 surgical cases with intraoperative vital sign recordings. It includes continuous arterial blood pressure, PPG (pulse oximetry waveforms), ECG, and dozens of other parameters recorded at varying sample rates (typically 100-500 Hz for waveforms).

The database is accessible through a web interface and Python API (vitaldb.net). For PPG-BP research, VitalDB offers several advantages over MIMIC: the data quality is generally higher because anesthesia records are carefully curated, the surgical population has a wider range of hemodynamic states (from induced hypotension to vasopressor-driven hypertension), and the API simplifies data extraction. Lee et al. (2022) used VitalDB to develop a deep learning model for beat-to-beat BP estimation achieving MAE of 5.1 mmHg for systolic and 3.2 mmHg for diastolic pressures.

University of Queensland Vital Signs Dataset

The University of Queensland (UQ) Vital Signs Dataset (Liu et al., 2012) contains continuous vital sign recordings from 32 surgical cases, including PPG, invasive arterial BP, ECG, and airway pressure signals sampled at 100 Hz. While smaller than MIMIC or VitalDB, this dataset is notable for its careful curation and the accompanying annotations of signal quality events (motion artifacts, sensor disconnections, damped arterial lines).

The dataset has been used in several influential BP estimation studies, including the work of Esmaelpoor et al. (2020) on pulse arrival time (PAT) calibration methods. Its manageable size (32 cases) makes it suitable for initial algorithm prototyping before scaling to larger databases.
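As a sketch of the quantity behind PAT-based methods: pulse arrival time is the per-beat delay from an ECG R-peak to a fiducial point (here the pulse foot) on the PPG. The fiducial times below are hypothetical, and real pipelines also require beat matching and outlier rejection.

```python
import numpy as np

def pulse_arrival_time(r_peaks_s, ppg_feet_s):
    """Per-beat pulse arrival time: delay from each ECG R-peak to the
    next PPG pulse foot. Simplified illustration only; production
    pipelines pair beats explicitly and reject outliers."""
    feet = np.asarray(ppg_feet_s, dtype=float)
    pats = []
    for r in r_peaks_s:
        later = feet[feet > r]          # first foot following this R-peak
        if later.size:
            pats.append(later[0] - r)
    return np.asarray(pats)

# Hypothetical fiducial times (seconds) for three beats.
r_peaks = [0.00, 0.80, 1.60]
feet = [0.22, 1.03, 1.81]
pat = pulse_arrival_time(r_peaks, feet)  # per-beat PAT around 0.21-0.23 s
```

Calibration methods of the kind studied on this dataset then map these per-beat delays to BP via a subject-specific model.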

Sensors PPG-BP Dataset

Liang et al. (2018) published a purpose-built PPG-BP dataset containing recordings from 219 subjects using a custom finger PPG sensor and a clinical sphygmomanometer for reference BP (DOI: 10.3390/s18103183). Each subject has 3 sequential recordings of approximately 2.1 seconds each (approximately 2-3 cardiac cycles per recording, 657 records in total) along with paired systolic and diastolic BP values.

While the dataset's strength is its relatively large subject count and ambulatory population, its limitations are significant: very short recording duration per subject (limiting feature extraction), only cuff BP reference (no beat-to-beat reference), and the custom finger sensor that may not transfer to wrist-worn devices. Nevertheless, it has been widely used for PPG morphology-based BP estimation studies.

Atrial Fibrillation Detection Datasets

PPG-based atrial fibrillation (AF) detection is a rapidly growing research area, driven by the prevalence of wrist-worn devices capable of passive AF screening. For a clinical overview of AF detection capabilities, see our guide on wearable AF detection.

PhysioNet Databases

PhysioNet (Goldberger et al., 2000; DOI: 10.1161/01.CIR.101.23.e215) hosts several databases relevant to PPG-based AF research. The MIT-BIH Atrial Fibrillation Database contains 25 10-hour ECG recordings with rhythm annotations (AF, atrial flutter, junctional rhythm, other). While this is an ECG database, many PPG AF detection algorithms are initially validated on ECG-derived inter-beat intervals to isolate the rhythm analysis component from the PPG-specific signal quality challenges.
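A sketch of the kind of inter-beat-interval (IBI) analysis referred to above: rhythm-irregularity features such as RMSSD and the coefficient of variation can be computed identically on ECG-derived and PPG-derived IBIs, which is what makes ECG validation a useful first step. The example values and any implied thresholds are illustrative, not validated cutoffs.

```python
import numpy as np

def ibi_irregularity(ibis_ms):
    """Two common rhythm-irregularity features on inter-beat intervals:
    RMSSD (root mean square of successive differences, ms) and the
    coefficient of variation. Both rise sharply during AF-like rhythms."""
    ibis = np.asarray(ibis_ms, dtype=float)
    diffs = np.diff(ibis)
    rmssd = float(np.sqrt(np.mean(diffs ** 2)))
    cov = float(np.std(ibis) / np.mean(ibis))
    return rmssd, cov

regular = [800, 805, 798, 802, 801, 799]       # sinus-like IBIs (ms)
irregular = [620, 910, 540, 1050, 700, 880]    # AF-like IBIs (ms)
rmssd_nsr, cov_nsr = ibi_irregularity(regular)
rmssd_af, cov_af = ibi_irregularity(irregular)
```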

The 2017 PhysioNet/Computing in Cardiology Challenge dataset ("AF Classification from a Short Single Lead ECG Recording") contains 8,528 single-lead ECG recordings of 9-60 seconds duration, classified as normal, AF, other rhythm, or noisy. This dataset established the benchmark for short-duration arrhythmia classification, with top-performing algorithms achieving F1 scores of 0.83 for the AF class (Clifford et al., 2017; DOI: 10.22489/CinC.2017.065-469).

MIMIC-III for AF

By cross-referencing the MIMIC-III Waveform Database with the clinical database's diagnostic codes (ICD-9 code 427.31 for atrial fibrillation), researchers can identify PPG recordings from patients with confirmed AF diagnoses. Shashikumar et al. (2017) used this approach to develop a deep learning AF detector from PPG signals, achieving sensitivity of 95.2% and specificity of 95.7% using 30-second PPG windows (DOI: 10.1109/EMBC.2017.8037734).

The limitation of this approach is that ICD diagnostic codes indicate presence of AF during the hospitalization but do not provide beat-level or segment-level rhythm annotations. The patient may have been in sinus rhythm for much of the recording. Reliable use of MIMIC for AF research requires either manual rhythm annotation of PPG segments or the use of the simultaneous ECG channel to derive rhythm labels.

Signal Quality Assessment Datasets

Signal quality assessment (SQA) is critical for all PPG applications, as downstream algorithms produce unreliable results when applied to corrupted signals. Several datasets specifically target SQA development.

Wrist PPG Quality Assessment Dataset

Li and Bhanu (2018) published a wrist PPG quality assessment dataset containing 3,000 10-second PPG segments from wrist-worn devices, manually annotated as "good," "acceptable," or "bad" quality by three independent annotators. Inter-annotator agreement (Cohen's kappa) was 0.72, reflecting the inherent subjectivity of PPG quality assessment. Random forest classifiers using time-domain and frequency-domain features achieved 92.4% classification accuracy on three-class quality assessment.
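Typical time-domain features feeding this kind of quality classifier can be sketched as follows; the specific feature set here is an illustrative assumption, not the published one.

```python
import numpy as np

def sqa_features(ppg):
    """A few time-domain features of the kind used in PPG quality
    classifiers (illustrative selection). Skewness is a widely used
    signal quality index; lag-1 autocorrelation is near 1 for smooth
    pulsatile signals and near 0 for broadband noise."""
    x = np.asarray(ppg, dtype=float)
    xc = x - x.mean()
    std = x.std()
    skewness = float(np.mean(xc ** 3) / (std ** 3 + 1e-12))
    ac1 = float(np.corrcoef(x[:-1], x[1:])[0, 1])
    return {"skewness": skewness, "autocorr_lag1": ac1, "std": float(std)}

# Synthetic 10-second windows at 64 Hz: a clean pulsatile surrogate
# versus broadband noise.
fs = 64
t = np.arange(0, 10, 1 / fs)
clean = np.sin(2 * np.pi * 1.2 * t)
rng = np.random.default_rng(1)
noisy = rng.standard_normal(t.size)
f_clean = sqa_features(clean)
f_noisy = sqa_features(noisy)
```

Such features would then be fed to a classifier (e.g. a random forest, as in the study above) trained against the manual quality labels.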

CapnoBase

The CapnoBase IEEE TBME Benchmark Dataset (Karlen et al., 2010) contains 42 8-minute recordings of PPG, ECG, and capnography from surgical patients, with manual annotations of signal quality and respiratory events. While designed for respiratory rate estimation from PPG, the quality annotations make it valuable for SQA algorithm development. The PPG signals are sampled at 300 Hz with 16-bit resolution, providing high-fidelity waveforms for feature extraction research.

Multi-Parameter and Specialized Datasets

BIDMC PPG and Respiration Dataset

The BIDMC dataset (Pimentel et al., 2017; DOI: 10.1088/1361-6579/aa670e) contains 53 8-minute recordings from critically ill patients, with PPG, ECG, impedance pneumography, and manual breath annotations. It is the standard benchmark for PPG-derived respiratory rate estimation. The best-performing algorithms achieve MAE of 1.0-2.5 breaths per minute on this dataset using PPG amplitude modulation, frequency modulation, and baseline wander features.
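The amplitude-modulation route mentioned above can be sketched in a few lines: rectify the PPG to expose its envelope, then locate the dominant spectral peak in the respiratory band. This is a minimal illustration on a synthetic signal, not a competitive RR estimator.

```python
import numpy as np

def resp_rate_am(ppg, fs, lo=0.1, hi=0.5):
    """Respiratory rate (breaths/min) from PPG amplitude modulation:
    rectify to expose the envelope, then take the dominant spectral
    peak in the respiratory band (0.1-0.5 Hz, i.e. 6-30 breaths/min)."""
    x = np.asarray(ppg, dtype=float)
    env = np.abs(x - x.mean())          # crude envelope via rectification
    env = env - env.mean()
    freqs = np.fft.rfftfreq(len(env), d=1.0 / fs)
    mag = np.abs(np.fft.rfft(env))
    band = (freqs >= lo) & (freqs <= hi)
    return 60.0 * freqs[band][np.argmax(mag[band])]

# Synthetic: 90 BPM cardiac carrier, amplitude-modulated at 0.25 Hz
# (15 breaths/min), 64-second window at 125 Hz.
fs = 125
t = np.arange(0, 64, 1 / fs)
ppg = (1 + 0.3 * np.sin(2 * np.pi * 0.25 * t)) * np.sin(2 * np.pi * 1.5 * t)
rr = resp_rate_am(ppg, fs)  # close to 15 breaths/min
```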

TROIKA Benchmark

Beyond the IEEE SP Cup 2015 dataset, the TROIKA algorithm framework (Zhang et al., 2015) established a complete benchmarking methodology for heart rate estimation: decompose the problem into signal decomposition, spectral peak tracking, and verification stages, and report per-subject MAE with LOSO cross-validation. This methodology has been adopted as the de facto standard for reporting heart rate estimation results, and new methods are expected to provide direct comparisons against TROIKA's published per-subject results.
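The verification stage of such a pipeline can be illustrated with a toy tracker that prefers spectral peaks near the previous estimate and rejects physiologically implausible jumps; the candidate peaks and the jump threshold below are hypothetical, not TROIKA's actual rules.

```python
import numpy as np

def track_hr(candidate_peaks_bpm, prev_hr, max_jump=10.0):
    """Toy verification stage: among candidate spectral peaks, keep the
    one closest to the previous HR estimate; if even that implies an
    implausible jump, hold the last estimate instead."""
    cands = np.asarray(candidate_peaks_bpm, dtype=float)
    best = float(cands[np.argmin(np.abs(cands - prev_hr))])
    if abs(best - prev_hr) > max_jump:
        return prev_hr      # no credible peak: hold the last estimate
    return best

# A motion artifact at 150 BPM (step cadence) competes with the true
# cardiac peak near 95 BPM.
hr = track_hr([95.0, 150.0], prev_hr=93.0)   # picks the 95 BPM peak
held = track_hr([150.0], prev_hr=93.0)       # jump rejected, holds 93 BPM
```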

Best Practices for Using PPG Benchmarks

Subject-Level Splitting

The most critical methodological requirement when using PPG datasets is subject-level data splitting. PPG morphology carries strong individual signatures (related to vascular anatomy, skin properties, and sensor coupling), and models trained on random sample-level splits can achieve artificially high accuracy by memorizing subject-specific patterns. Always use LOSO or grouped k-fold cross-validation with subject-level grouping.

Cheng et al. (2021) demonstrated this effect quantitatively: a CNN trained for BP estimation on MIMIC-III achieved MAE of 3.2 mmHg with random splits but 8.7 mmHg with subject-level splits (DOI: 10.1038/s41598-021-92997-0). The 2.7x performance degradation highlights the magnitude of data leakage in naive splitting approaches.
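Subject-level splitting is straightforward to implement; a minimal LOSO split generator (pure NumPy, for illustration):

```python
import numpy as np

def loso_splits(subject_ids):
    """Leave-one-subject-out splits: yields (subject, train_idx,
    test_idx) with every window from the held-out subject in the test
    fold, so no subject is ever split across train and test."""
    subject_ids = np.asarray(subject_ids)
    for subj in np.unique(subject_ids):
        test = np.where(subject_ids == subj)[0]
        train = np.where(subject_ids != subj)[0]
        yield subj, train, test

# Six windows from three subjects; each fold holds out one whole subject.
ids = ["S1", "S1", "S2", "S2", "S3", "S3"]
folds = {s: (tr.tolist(), te.tolist()) for s, tr, te in loso_splits(ids)}
```

For larger datasets the same grouping logic underlies grouped k-fold splitters such as scikit-learn's GroupKFold.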

Reporting Standards

When benchmarking against public datasets, report results that enable direct comparison. At minimum, include: the exact dataset version and any preprocessing or filtering applied; the train/validation/test split strategy with subject IDs; per-subject metrics (not just population averages) to reveal inter-subject variability; the number of excluded segments and the exclusion criteria; and the reference signal processing used to derive ground truth labels.

Cross-Dataset Generalization

The ultimate test of a PPG algorithm is cross-dataset generalization: training on one dataset and evaluating on a completely different dataset collected with different hardware, subjects, and protocols. Most published PPG algorithms show significant performance degradation under cross-dataset evaluation, revealing overfitting to dataset-specific characteristics (sensor hardware, population demographics, activity patterns). Researchers should consider including cross-dataset evaluation as a standard validation step, training on the largest available dataset and evaluating on all others without fine-tuning.

For researchers implementing these datasets in complete analysis pipelines, our guides on PPG feature extraction and machine learning pipelines for PPG provide practical implementation guidance. Additional resources on PPG signal processing algorithms and the fundamentals of PPG technology are available in our learning center.

Frequently Asked Questions

What is the best public PPG dataset for heart rate estimation during exercise?
The IEEE Signal Processing Cup 2015 dataset is the most widely used benchmark for motion-artifact-corrupted heart rate estimation. It contains 2-channel PPG and 3-axis accelerometer data from 12 subjects during physical activity (rest, walking, running, and arm movements). For a larger and more diverse dataset, PPG-DaLiA (2019) includes 15 subjects across 8 daily life activities with 4-channel PPG, accelerometer, and electrodermal activity data collected over approximately 2.5 hours per subject. Both datasets include ECG reference heart rate annotations.
Are there any public PPG datasets for blood pressure estimation?
Yes, several exist. The MIMIC-III Waveform Database provides synchronized PPG, ABP (arterial blood pressure), and ECG waveforms from ICU patients. The University of Queensland Vital Signs Dataset contains 32 surgical case recordings with PPG and invasive arterial BP. The VitalDB dataset provides over 6,000 surgical cases with aligned PPG and arterial BP. For cuffless BP research, the Sensors PPG-BP dataset by Liang et al. (2018) provides 219 subjects (657 short PPG recordings) with corresponding cuff BP measurements.
Where can I find PPG datasets for atrial fibrillation detection?
The primary public datasets are the PhysioNet AF datasets. The 2017 PhysioNet/CinC Challenge dataset contains single-lead ECG recordings (not PPG) but is commonly used for AF algorithm development. For PPG-specific AF data, the Stanford Wearable PPG AF dataset and the MIMIC-III waveform database (filtering for AF-diagnosed patients) are commonly used. The Huawei Heart Study (Guo et al., 2019) published aggregate results from 187,912 smartwatch users but individual-level data is not publicly available.
How should I split PPG datasets to avoid data leakage?
Always use a leave-subjects-out cross-validation strategy, never random sample-level splits. PPG morphology is highly individual-specific, and random splitting allows the model to memorize subject-specific patterns rather than learning generalizable features. For the MIMIC-III database, this means splitting by patient ID, not by record segment. A typical approach is leave-one-subject-out (LOSO) cross-validation for small datasets (N<20) or grouped k-fold (k=5 or 10) with subject-level grouping for larger datasets. Report both mean and per-subject metrics to reveal inter-subject performance variability.