Keywords: Synthetic, VAE, Tabular
Need: Over the course of three internship projects, we have developed NHSSynth, a modular synthetic data pipeline capable of running multiple generation models (e.g., a Variational Autoencoder with differential privacy) for producing medium-to-high fidelity, high-privacy single-table healthcare data. It also includes an evaluation metric suite, fairness tools, and an adversarial attack framework.
However, most NHS data is multi-table, longitudinal, and often multi-modal, reflecting linked patient records over time and across systems. Extending NHSSynth to support these formats would greatly increase its value by enabling realistic synthetic datasets for AI assurance, software testing, and research.
This project will investigate methods for multi-table, longitudinal, and multi-modal synthetic data generation, or alternatively, benchmark new generation methods across different data formats using the existing NHSSynth evaluation framework.
Current Knowledge/Examples & Possible Techniques/Approaches: In terms of:
Related Previous Internship Projects: The first two projects are documented in SynthVAE, and the most recent work is in NHSSynth.
Enables Future Work: Allows NHS England to generate a wider range of synthetic data for internal and external use.
Outcome/Learning Objectives:
Datasets:
Development will use open datasets for transparency and reproducibility (e.g., MIMIC-III or MIMIC-IV). Adaptations to NHS-specific synthetic datasets could be explored under appropriate governance.
Desired skill set: When applying, please highlight any experience with synthetic data, variational autoencoders, or other generative techniques; Python coding and software development (including any coding in the open); and any other data science experience you feel is relevant.