Keywords: Synthetic, VAE, Tabular
Need: Creating high-fidelity, realistic health data is complex and raises multiple information governance considerations. A particularly promising technique for creating realistic synthetic data is the variational autoencoder (VAE). Previous intern projects created a VAE with built-in differential privacy that can generate non-Gaussian numeric and categorical datasets. They also started to investigate the need for methods to address fairness in the data. This project would continue developing this model to improve its privacy, explainability or fairness.
Current Knowledge/Examples & Possible Techniques/Approaches: See the two reports at https://github.com/nhsx/SynthVAE/tree/main/reports
Related Previous Internship Projects: SynthVAE
Enables Future Work: We aim to use this prototype to build an NHSVAE, which would be used to create synthetic datasets for national and regional use cases with high confidence in the privacy, fidelity and fairness of the data generated.
Outcome/Learning Objectives: Depending on the route taken, the learning could include the implementation of causal models inside a VAE, a comparison of a PATE implementation versus differential privacy for a VAE, or a demonstration of producing “fair” data from biased sources.
Datasets: MIMIC-III (open dataset, available once initial training has been completed)
Desired skill set: When applying, please highlight any experience with synthetic data generation (especially variational autoencoders), differential privacy, PyTorch and Python coding (including any coding in the open), along with any other data science experience you feel is relevant.