Keywords: NLP, Embeddings, TextData
Need: NHS England and NHS Improvement collect a national repository of incident reporting data for England and Wales. These data are used to detect emerging patient safety incidents and drive learning that improves patient safety. Although there are some categorical data fields, the real ‘signals’ in these data are within the free-text fields. Clinically experienced teams review the most serious events and emerging themes, but the scale of the dataset means that only a few percent of reports can receive a full review. Natural Language Processing (NLP) techniques have the potential to unlock learning from data that do not receive a full clinical review. The dataset is currently being used for topic modelling and other analyses, but a large-scale language model is not yet in use. Creating an appropriate language model would allow clustering and other methods to be used to identify novel targets for clinical review. This project would seek to build the first stages of a language model by constructing different representations of the text and comparing them to find the preferred models.
Current Knowledge/Examples & Possible Techniques/Approaches: Text extraction, processing, vector representations and weighting methods. Use of modern open-source NLP tools and exposure to neural network frameworks such as TensorFlow or PyTorch.
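As an illustration of what "vector representations and weighting methods" can mean in practice, the sketch below builds TF-IDF weighted bag-of-words vectors and compares them with cosine similarity, using only the Python standard library. It is a minimal example under stated assumptions, not the project's chosen method: the three report texts are invented for illustration (they are not NRLS data), and the function names `tokenize`, `tfidf_vectors` and `cosine` are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict of term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: the number of documents each term appears in.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    # Smoothed inverse document frequency, so that rare terms get more weight
    # and terms present in every document still keep a non-zero weight.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts within this document
        vectors.append({t: count * idf[t] for t, count in tf.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Invented example texts, NOT real incident reports.
reports = [
    "Patient fell from bed during the night shift",
    "Patient found on floor after fall from bed",
    "Wrong dose of medication given to patient",
]

vectors = tfidf_vectors(reports)
sim_falls = cosine(vectors[0], vectors[1])  # two fall-related reports
sim_mixed = cosine(vectors[0], vectors[2])  # a fall report vs a medication report
print(f"fall vs fall:       {sim_falls:.3f}")
print(f"fall vs medication: {sim_mixed:.3f}")
```

A representation like this is the kind of simple benchmark that learned embeddings could be compared against: the two fall-related texts share more weighted terms, so they score as more similar than the fall and medication texts. Note that "fell" and "fall" are treated as unrelated tokens here, which is one limitation that denser, learned representations aim to address.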
Related Previous Internship Projects: Not applicable, as this is the first year of the scheme. Some architecture may already be in place.
Enables Future Work: This may lead to ongoing projects built on this foundation, e.g. fine-tuning the selected language model, training on more data, comparing against simpler benchmark models, developing production models, and examining clustering or anomaly detection algorithms.
Outcome/Learning Objectives: Open-source published code for training or tuning the language model, and potentially the language model itself. The first steps are to demonstrate the process on a smaller amount of data and compare the result against a simpler benchmark model, before applying it to the full dataset.
Datasets: The National Reporting and Learning System (NRLS) - the national repository of NHS incident reporting data, holding over 15 years of NHS data.
Desired skill set: When applying, please highlight any experience of working with text data (particularly medical text data), natural language processing, embeddings, and coding (including any coding in the open), along with any other data science experience you feel is relevant.