NHS England Data Science PhD Internships

Synthetic Text Generation in Healthcare

Keywords: NLP, SyntheticGeneration, TextData

Need: Text data in the NHS is vastly underused because of issues around appropriate access and the subtleties of analysing free text. Providing synthetic medical text in a variety of realistic formats would create more opportunities to identify potential uses and to develop innovations. This project would seek to build a methodology for creating publicly available, synthetically generated medical free text, such as notes and patient letters, for developers of healthcare software and apps.

The project would also consider how well such sources maintain appropriate privacy (where this is a key consideration) and how this balances against the utility and quality of the resulting texts. In healthcare settings it is very important to balance the creativity of modern large-scale language models, when working with and generating text, against the need to supply curated information for robustness.

This project looks to explore recent approaches to guiding text generation with modern generative language models, such as prompting, techniques that integrate factual knowledge sources, and learning from structured forms of information such as ontologies.
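As an illustration of guiding generation from structured information, the sketch below turns a small structured record into a prompt that could be passed to a generative language model. It is a minimal sketch only: the `PatientRecord` fields, the `build_prompt` helper and the prompt wording are all hypothetical, and no particular model or API is assumed.

```python
from dataclasses import dataclass, field


@dataclass
class PatientRecord:
    """Hypothetical structured record; the field names are illustrative only."""
    age: int
    sex: str
    diagnosis: str
    medications: list[str] = field(default_factory=list)


def build_prompt(record: PatientRecord, document_type: str = "discharge letter") -> str:
    """Build a text prompt that constrains generation to the supplied facts."""
    meds = ", ".join(record.medications) if record.medications else "none recorded"
    return (
        f"Write a fictional {document_type} for a synthetic patient.\n"
        f"Age: {record.age}\n"
        f"Sex: {record.sex}\n"
        f"Diagnosis: {record.diagnosis}\n"
        f"Medications: {meds}\n"
        "Use only the facts listed above and do not include real names or identifiers."
    )


# Entirely made-up example values, not drawn from any dataset.
record = PatientRecord(age=67, sex="female", diagnosis="type 2 diabetes",
                       medications=["metformin"])
prompt = build_prompt(record)
print(prompt)
```

Keeping the curated facts in a structured record and the free wording in the model is one possible way to separate robustness from creativity; the record itself could equally be sampled from an ontology or a synthetic tabular dataset.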

Current Knowledge/Examples & Possible Techniques/Approaches: Utilising and building on recent work in the area of generating synthetic medical text e.g.

Related Previous Internship Projects: Exploring large-scale language models with NHS incident data - Next Steps

Enables Future Work: Because synthetic text generation has a range of possible approaches and nuances, it is envisaged that this project will lead into a series of projects that build on one another. The work will also support the “NHS Language Corpus” and projects seeking to make useful NHS data available for public use.

Outcome/Learning Objectives: A common need is for developer usage rather than research, so the complexity of the output and the quality score can be relatively low. In such cases the output can range from a “bag of relevant words” through to realistic structures that are internally consistent with other electronic health records. The output can be either a methodology with clear utility, privacy and quality scores associated with it, or a specific product extract containing medical text examples.
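To make the idea of an associated privacy score concrete, the sketch below computes one very crude proxy: the fraction of word trigrams in a synthetic text that also appear verbatim in a source text. This is an assumption chosen for illustration, not the project's metric; the function names `ngrams` and `ngram_overlap` are hypothetical, and a real evaluation would need far more careful measures.

```python
def ngrams(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Return the set of lower-cased word n-grams in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_overlap(synthetic: str, source: str, n: int = 3) -> float:
    """Fraction of the synthetic text's n-grams that also occur in the source.

    0.0 means no shared n-grams; 1.0 means every n-gram is copied verbatim.
    Texts shorter than n words have no n-grams and score 0.0.
    """
    synthetic_ngrams = ngrams(synthetic, n)
    if not synthetic_ngrams:
        return 0.0
    shared = synthetic_ngrams & ngrams(source, n)
    return len(shared) / len(synthetic_ngrams)


# Made-up sentences for illustration only.
source_text = "patient admitted with chest pain and discharged home"
synthetic_text = "patient admitted with chest pain after a fall"
score = ngram_overlap(synthetic_text, source_text)
print(score)
```

A high overlap suggests the generator may be reproducing training text, which is a privacy concern; a low overlap on its own says nothing about utility or quality, which would need their own scores.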

Datasets: The project would work with available medical text data such as MIMIC-III (or MIMIC-IV) for training, but would look to explore the possibility of updating the training dataset to a UK-specific one in the future (requires IG training prior to project start).

Desired skill set: When applying, please highlight any experience of working with text data (and specifically medical text data), natural language processing, training and evaluating models, understanding of bias in text training data, privacy, coding (including any coding in the open), and any other data science experience you feel is relevant.

