NHS England Data Science PhD Internships

Exploring large-scale language models with NHS incident data

Keywords: NLP, Embeddings, TextData

Need: NHS England maintains a national repository of incident reporting data for England and Wales. These data are used to detect emerging patient safety incidents and drive learning that improves patient safety. Although there are some categorical data fields, the real ‘signals’ in these data are within the free-text fields. Clinically experienced teams review the most serious events and emerging themes, but the scale of the dataset means that only a few percent of reports can receive a full review.

Natural Language Processing (NLP) techniques have the potential to unlock learning from data that do not receive a full clinical review. The dataset is currently being used for topic modelling and other analyses, but a large-scale language model is not yet in use. Creating an appropriate language model would enhance the use of clustering and other methods to identify novel targets for clinical review.

This extension would seek to build on the first stages of the project, which explored language model representations of text for various model architectures, approaches to evaluating these models, and the integration of open-source frameworks for exploring representations.

Current Knowledge/Examples & Possible Techniques/Approaches: Text extraction, processing, vector representations and weighting methods. Use of modern open-source NLP tools and exposure to neural network frameworks such as TensorFlow or PyTorch.
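As an illustration of what "vector representations and weighting methods" can mean in practice, the following is a minimal sketch of TF-IDF weighting with cosine similarity, written in plain Python. It is not taken from the project itself: the function names (`tokenize`, `tfidf_vectors`, `cosine`), the smoothed IDF formula and the three example reports are all assumptions made for illustration, and the reports are invented rather than drawn from NRLS data.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF (an assumed choice): rarer terms receive larger weights,
        # and terms present in every document still keep a positive weight.
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Invented example reports, for illustration only.
reports = [
    "Patient fell from bed during the night shift",
    "Patient found on floor after fall from bed",
    "Wrong dose of insulin administered to patient",
]
vecs = tfidf_vectors(reports)
sim_falls = cosine(vecs[0], vecs[1])  # two fall-related reports
sim_mixed = cosine(vecs[0], vecs[2])  # a fall report and a medication report
print(sim_falls > sim_mixed)
```

In this toy example the two fall-related reports share more weighted terms than the fall and medication reports, so their similarity is higher. Note that "fell" and "fall" are treated as unrelated tokens here, which is one of the limitations that learned embeddings are intended to address.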

Related Previous Internship Projects: Outputs from the first stage of this project are available as ELM4PSIR, which focuses on exploring various language modelling architectures optimised to produce useful embeddings for different structures, and suitable approaches for testing these models.

Enables Future Work: This may lead to ongoing projects based on this foundation, e.g. fine-tuning the selected language model, training on more data, comparing against simpler benchmark models, developing production models, or examining clustering or anomaly detection algorithms.

Outcome/Learning Objectives: Published open-source code for training or tuning the language model, and potentially the language model itself. The first steps are to demonstrate the process on a smaller amount of data and compare it against a simpler benchmark model, before applying it to the full dataset.
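One way such a comparison against a simpler benchmark could be set up is sketched below. This is an assumed evaluation scheme, not the project's own method: it scores a representation by leave-one-out nearest-neighbour agreement with a label, where the label is imagined to come from one of the existing categorical fields. The names `bag_of_words` and `nearest_neighbour_accuracy`, the texts and the labels are all hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Simplest benchmark representation: raw term counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(
        sum(w * w for w in b.values())
    )
    return dot / norm if norm else 0.0


def nearest_neighbour_accuracy(embed, texts, labels):
    """Leave-one-out score: the fraction of texts whose most similar
    other text carries the same label."""
    vectors = [embed(t) for t in texts]
    hits = 0
    for i, v in enumerate(vectors):
        others = (j for j in range(len(vectors)) if j != i)
        best = max(others, key=lambda j: cosine(v, vectors[j]))
        hits += labels[best] == labels[i]
    return hits / len(texts)


# Invented texts and labels, for illustration only.
texts = [
    "patient fell from bed",
    "patient fall in bathroom",
    "wrong insulin dose given",
    "insulin dose given late",
]
labels = ["fall", "fall", "medication", "medication"]

benchmark_score = nearest_neighbour_accuracy(bag_of_words, texts, labels)
print(benchmark_score)
```

A candidate language model could be scored with the same function by passing a different `embed` callable, provided it returns a term-weight mapping or `cosine` is adapted for dense vectors. Running both on the same small sample gives a like-for-like comparison before scaling up to the full dataset.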

Datasets: The National Reporting and Learning System (NRLS), the national repository of NHS incident reporting data, holding over 15 years of NHS data.

Desired skill set: When applying, please highlight any experience of working with text data (specifically medical text data), natural language processing, or embeddings, along with coding experience (including any coding in the open) and any other data science experience you feel is relevant.
