Keywords: Synthetic, BayesianNetworks, TabularData
Need: The field of synthetic data covers a wide range of applications and techniques. Our focus is on creating tools and data that can be widely used and shared. One direction for this work is probabilistic graphical models such as Bayesian networks. This project would seek to demonstrate the strengths and weaknesses of probabilistic models for synthetic data and to identify the use cases for which they are appropriate. One area we are keen to investigate is a methodology for querying the model to assess whether it is appropriate to use for synthetic data generation.
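As a rough illustration of what generating from and querying a Bayesian network can look like, here is a minimal sketch in plain Python. The two-node network (smoker -> cough), the variable names and every probability are hypothetical values invented for this example; they are not taken from any project dataset or tool.

```python
import random

# Hypothetical two-node network: smoker -> cough.
# All probabilities are invented for illustration only.
P_SMOKER = 0.3                             # P(smoker = True)
P_COUGH_GIVEN = {True: 0.6, False: 0.1}    # P(cough = True | smoker)


def sample_record(rng):
    """Ancestral sampling: draw the parent first, then the child given it."""
    smoker = rng.random() < P_SMOKER
    cough = rng.random() < P_COUGH_GIVEN[smoker]
    return {"smoker": smoker, "cough": cough}


def generate(n, seed=0):
    """Generate n synthetic records from the network."""
    rng = random.Random(seed)
    return [sample_record(rng) for _ in range(n)]


def marginal_cough():
    """Query the model directly: P(cough), summing over the parent."""
    return (P_SMOKER * P_COUGH_GIVEN[True]
            + (1 - P_SMOKER) * P_COUGH_GIVEN[False])


def posterior_smoker_given_cough():
    """Query the model directly: P(smoker | cough) via Bayes' rule."""
    return P_SMOKER * P_COUGH_GIVEN[True] / marginal_cough()


if __name__ == "__main__":
    data = generate(10_000, seed=1)
    observed = sum(r["cough"] for r in data) / len(data)
    print("P(cough) from the model:", marginal_cough())
    print("P(cough) in synthetic data:", observed)
    print("P(smoker | cough) from the model:", posterior_smoker_given_cough())
```

Comparing a quantity computed from the model alone with the same quantity measured on the generated records is one simple way to see what the model will and will not reproduce before any data is released.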
Current Knowledge/Examples & Possible Techniques/Approaches: The paper "Generating high-fidelity synthetic patient data for assessing machine learning healthcare software" is of particular interest, as it relates to the CPRD work on generating synthetic data.
Related Previous Internship Projects: N/A, as this is the first year of the scheme
Enables Future Work: Further use and development of the model
Outcome/Learning Objectives: Creation and interrogation of a probabilistic graphical model on example data. A publication discussing how much information can be extracted by interrogating the graph alone, in relation to the privacy and quality of the generated data. The intention is to make this project a public GitHub repository.
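One possible way to interrogate the graph alone, offered as a sketch and not as the project's method, is to scan a conditional probability table for near-deterministic rows. Such a row means that knowing the parent value almost fixes the child value, which may be worth reviewing from a privacy point of view. The table, the variable names and the 0.99 threshold below are all hypothetical.

```python
# Hypothetical conditional probability table: P(diagnosis | age_band).
# The values are invented for illustration only.
CPT = {
    "18-40": {"none": 0.90, "asthma": 0.10},
    "41-65": {"none": 0.70, "asthma": 0.30},
    "66+":   {"none": 0.002, "asthma": 0.998},
}


def near_deterministic_rows(cpt, threshold=0.99):
    """Return (parent_state, child_value, prob) for each row of the table
    whose most likely child value has probability >= threshold."""
    flagged = []
    for parent_state, dist in cpt.items():
        value, prob = max(dist.items(), key=lambda kv: kv[1])
        if prob >= threshold:
            flagged.append((parent_state, value, prob))
    return flagged


if __name__ == "__main__":
    for parent_state, value, prob in near_deterministic_rows(CPT):
        print(f"age_band={parent_state}: diagnosis={value} with p={prob}")
```

A check like this needs only the fitted model and no generated records, so it can run before any synthetic data is produced. The threshold is an arbitrary choice for the sketch and would need to be justified for a real dataset.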
Datasets: TBD, dependent on the ambition of the output
Desired skill set: When applying, please highlight any experience with synthetic data generation, graph structures, Bayesian networks or similar; coding experience (including any coding in the open); and any other data science experience you feel is relevant.