ML Engineering PPG February 2026

PPG Datasets for Machine Learning: A Practical Comparison

You need PPG data to build your model. Here is an honest look at your options, what each one costs you in time and risk, and when synthetic data is the right call.

Why this keeps coming up

If you are building anything that touches photoplethysmography signals (heart rate detection, arrhythmia classification, stress estimation, SpO₂ inference) you will hit the same problem early on. You need data before you have data.

Getting to a working pipeline takes time regardless of which route you pick. But some routes take weeks and some take minutes. The choice matters a lot, especially in the early stages when you are just trying to find out if your approach is even sensible.

There are essentially three routes available to you. Let's go through each one honestly.

Option 1: Public datasets

This is where most people start. The obvious candidates are MIMIC-III, PPGI, the PhysioNet collections, and a handful of academic datasets that get cited repeatedly in papers.

What you actually get

On paper, these are free and reasonably large. In practice, you spend the first few days just getting access sorted. MIMIC requires a training course and institutional approval. Others have their own hoops. Then the download itself is often fragmented across multiple files with inconsistent naming conventions.

Once you have the files, the real work starts. Signal quality varies enormously across subjects and recording sessions. Artefacts from motion, poor sensor contact, and equipment differences are common. Annotation quality is inconsistent. Metadata is sometimes missing entirely or contradicts itself across versions of the same dataset.
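To make that concrete, here is the kind of minimal quality screen you end up writing for these datasets: flag windows that look flatlined (sensor off skin) or clipped (saturated ADC). This is a sketch; the window length and thresholds are illustrative assumptions, not values taken from any dataset's documentation.

```python
import numpy as np

def flag_bad_windows(sig, fs=125, win_s=4, flat_std=0.01, clip_frac=0.05):
    """Flag windows that look flatlined or clipped.

    flat_std and clip_frac are illustrative defaults (assumptions),
    not thresholds from any published dataset's documentation.
    """
    win = int(fs * win_s)
    n = len(sig) // win
    lo, hi = np.min(sig), np.max(sig)
    flags = []
    for i in range(n):
        w = sig[i * win:(i + 1) * win]
        flat = np.std(w) < flat_std  # near-constant: likely poor contact
        # large fraction of samples pinned at the signal extremes: clipping
        clipped = np.mean((w <= lo + 1e-9) | (w >= hi - 1e-9)) > clip_frac
        flags.append(flat or clipped)
    return np.array(flags)

# A clean pseudo-pulse passes; a simulated sensor dropout is flagged.
fs = 125
t = np.arange(0, 8, 1 / fs)
sig = np.sin(2 * np.pi * 1.2 * t)  # ~72 bpm sinusoid standing in for PPG
sig[4 * fs:] = 0.0                 # simulate the sensor falling off
print(flag_bad_windows(sig, fs))   # prints [False  True]
```

In practice you end up layering several of these heuristics per dataset, which is exactly where the days-to-weeks figure below comes from.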

A realistic timeline from "I need PPG data" to "I have a clean, annotated slice I can actually train on" is somewhere between three days and two weeks, depending on your experience and what you need the data to do.

The governance question

Even though these datasets are technically public, many engineering teams still need to run them past a compliance or legal function before use. If your company handles patient data in any capacity, your legal team may want to review the licence terms, the data provenance, and how you are storing it locally. That conversation adds time and sometimes kills the plan entirely.

Option 2: Real patient data from your own sources

If your company is already collecting physiological data from users or clinical partners, you might be tempted to use it for model development. In some cases this is the right call. In most early-stage scenarios, it is not.

Why it is usually the wrong move early on

The governance overhead is substantial. You need to establish lawful basis, confirm consent covers secondary use, handle data minimisation requirements, and document the processing activity. In most UK and EU contexts that means GDPR compliance work before you can touch the data for ML purposes.

Beyond the legal side, real patient data is messy in ways that are genuinely hard to work around at the prototyping stage. Signal quality depends on how the data was collected, what device was used, and the conditions at the time. You often cannot easily generate the specific physiological scenarios you want to test because you are working with what you have, not what you need.

This route makes sense for later-stage validation. It is rarely the right starting point.

Option 3: Synthetic data

Synthetic PPG data means algorithmically generated waveforms that are morphologically realistic but contain no real patient information. Nothing was recorded from a person. There is nothing to consent to, nothing to govern, and nothing to clean.

What it is actually good for

Synthetic data is well suited to anything where you need signal-shaped data quickly, without the overhead. That includes building and testing your signal processing pipeline, training early model iterations, stress-testing your preprocessing steps, and creating convincing demos and internal presentations.
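As a sketch of what that looks like in practice: the snippet below fabricates a pseudo-PPG (a sinusoid standing in for the pulse wave, plus baseline wander and sensor noise, all assumed parameters), then exercises a naive detrend-and-peak-pick pipeline and checks that the recovered heart rate is sane. The point is not the algorithm; it is that you can test the plumbing end to end before any real data arrives.

```python
import numpy as np

rng = np.random.default_rng(0)
fs, hr_hz = 125, 1.2  # 72 bpm assumed test heart rate

# Synthetic stand-in for a PPG: pulse wave + baseline wander + noise.
t = np.arange(0, 30, 1 / fs)
ppg = (np.sin(2 * np.pi * hr_hz * t)
       + 0.5 * np.sin(2 * np.pi * 0.2 * t)      # respiratory-like wander
       + 0.05 * rng.standard_normal(t.size))    # sensor noise

# Pipeline step 1: remove baseline wander with a moving-average detrend.
win = fs  # 1 s window
baseline = np.convolve(ppg, np.ones(win) / win, mode="same")
clean = ppg - baseline

# Pipeline step 2: naive peak picking with a refractory period so that
# noise on a crest cannot register as several beats.
thr, refractory = 0.5, int(0.4 * fs)
peaks, last = [], -10**9
for i in range(1, len(clean) - 1):
    if clean[i] > thr and clean[i] >= clean[i - 1] and clean[i] > clean[i + 1]:
        if i - last >= refractory:
            peaks.append(i)
            last = i

ibi = np.diff(peaks) / fs        # inter-beat intervals in seconds
bpm = 60.0 / np.mean(ibi)
print(round(bpm))                # expect roughly 72
```

Because you control the generator, you can now crank the noise amplitude or the wander frequency and watch exactly where the pipeline breaks, which is the whole point of stress-testing with synthetic input.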

You can also generate specific physiological scenarios on demand. If you need examples of atrial fibrillation with variable amplitude, or hypotension with elevated noise, you can have those in seconds rather than hunting through a public dataset hoping the examples you need are in there and sufficiently labelled.
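A toy version of that scenario control looks like the following: generate steady inter-beat intervals for a sinus-rhythm-like signal, then irregular intervals plus amplitude variability for an AF-like one. The beat shape here is a crude triangular placeholder and the interval ranges are illustrative choices, not clinically derived values; a real generator would use a proper morphological model.

```python
import numpy as np

rng = np.random.default_rng(42)
fs = 125

def beat(n):
    """One crude triangular pulse of n samples (placeholder morphology)."""
    rise = np.linspace(0, 1, n // 4, endpoint=False)
    fall = np.linspace(1, 0, n - n // 4)
    return np.concatenate([rise, fall])

def make_ppg(ibis_s, amp_jitter=0.0):
    """Concatenate beats with the given inter-beat intervals (seconds)."""
    parts = [beat(int(ibi * fs)) * (1 + amp_jitter * rng.standard_normal())
             for ibi in ibis_s]
    return np.concatenate(parts)

# "Sinus rhythm": steady 0.8 s intervals.
sinus = make_ppg(np.full(20, 0.8))

# "AF-like": irregular intervals from a wide uniform range, plus
# amplitude variability. Purely illustrative parameters.
af_like = make_ppg(rng.uniform(0.4, 1.2, size=20), amp_jitter=0.2)

print(sinus.size, af_like.size)
```

The useful property is that the labels come for free: you know every signal's rhythm class because you chose it, rather than trusting an annotation file.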

What it is not good for

Synthetic data is not a substitute for real-world validation. When you get to the stage of benchmarking your model against clinical ground truth, you need real signals from real people. Synthetic data does not capture the full complexity of human physiology, and it should not be used to make any clinical or diagnostic claims.

This is not a limitation unique to synthetic data. It is just an honest statement about what the tool is for. Use it for prototyping and iteration. Use real data for validation.

A useful mental model: synthetic data is like a flight simulator. It is excellent for building skill, testing procedures, and catching problems early. Nobody uses it to certify that a plane is airworthy. That requires real flight conditions. Both have their place.

Side-by-side comparison

Criterion                        | Public Datasets         | Real Patient Data | Synthetic Data
Time to first usable signal      | Days to weeks           | Weeks to months   | Minutes
Governance overhead              | Low to medium           | High              | None
Signal quality consistency       | Variable                | Variable          | Controlled
Scenario control                 | Limited                 | Limited           | Full
Suitable for clinical validation | Yes                     | Yes               | No
Cost                             | Free but time-expensive | High              | Low

How most teams actually use these in combination

The engineers who move fastest tend to follow a similar pattern. They start with synthetic data to get a working pipeline and a prototype model up and running. Once the architecture is stable and they have something worth validating, they move to public datasets or real-world data for benchmarking and refinement.

This is not about cutting corners. It is about sequencing the work sensibly. There is no point spending two weeks cleaning a public dataset if your model architecture turns out to be wrong. Synthetic data lets you fail fast and cheaply in the ways that do not matter, so you can invest the expensive work in the iterations that do.

A note on PPG specifically

PPG is a relatively forgiving signal to work with synthetically compared to something like ECG. The key morphological features (the systolic peak, the dicrotic notch, the diastolic component, the overall waveform envelope) can be modelled parametrically with reasonable fidelity. For the purposes of pipeline development and prototyping, a well-generated synthetic PPG is indistinguishable from a clean real-world recording in terms of what your model needs to learn from it.
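One common way to sketch that parametric modelling is a sum of Gaussian components: a large systolic component plus a smaller, delayed diastolic one, whose overlap produces a notch-like inflection between them. The timing, width, and amplitude values below are illustrative assumptions, not parameters fitted to clinical recordings.

```python
import numpy as np

def ppg_beat(fs=125, dur=1.0,
             sys_t=0.15, sys_w=0.05,          # systolic peak time / width (s)
             dia_t=0.45, dia_w=0.09,          # diastolic wave time / width (s)
             dia_amp=0.45):                   # diastolic relative amplitude
    """Single PPG beat as a sum of two Gaussians.

    The systolic and diastolic components overlap to give a
    dicrotic-notch-like inflection. All defaults are illustrative.
    """
    t = np.arange(0, dur, 1 / fs)
    systolic = np.exp(-((t - sys_t) ** 2) / (2 * sys_w ** 2))
    diastolic = dia_amp * np.exp(-((t - dia_t) ** 2) / (2 * dia_w ** 2))
    return t, systolic + diastolic

t, y = ppg_beat()
print(y.argmax() / 125)  # prints 0.152: main peak lands near sys_t
```

Varying those few parameters per beat (and adding noise and baseline wander on top) already gets you a long way towards pipeline-grade synthetic PPG.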

The caveats apply at the edges: extreme physiological states, rare arrhythmia patterns, multi-sensor fusion scenarios. If your use case sits in those areas, you will need real data earlier. For the majority of standard PPG algorithm development work, synthetic signals are entirely adequate for the prototyping phase.

Bottom line

If you are in the early stages of building a PPG-based feature or product, the practical advice is simple. Do not spend a week cleaning a public dataset to answer a question that synthetic data could answer in ten minutes. Use synthetic data to prove out your pipeline and your approach. Then bring in real data when you are ready to validate properly.

The tools to do this are available now, they are inexpensive relative to engineering time, and they carry no governance risk. There is not a strong argument for starting anywhere else.

Generate synthetic PPG waveforms in minutes

Offline desktop tool. No patient data. No governance overhead. Export to CSV or NumPy and drop straight into your pipeline.

Download Free Trial

Synthetic only · Non-clinical · Not for regulatory or diagnostic use