Keywords: NLP, Memorisation, Multi-modal
Need: Historically, work on NLP modelling techniques has also built an understanding of how much information a model encodes within its parameters during training, and whether that information can be extracted again at inference time using a variety of techniques.
Larger language models have rapidly grown in parameter count and can encode and differentiate between contexts within their training data. This has increased interest in how much these models memorise the data they are trained on, and has prompted new techniques for quantifying and mitigating that memorisation. This is particularly important in healthcare use cases, where the sensitivity of training data, even after common de-identification approaches are applied, needs to be well understood.
This project aims to understand current thinking on quantifying issues such as memorisation within modern language models, and to explore the various mitigation strategies emerging in the literature. It also aims to understand how other privacy-preserving techniques can complement and enhance these more direct approaches.
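As a purely illustrative example of what "quantifying memorisation" can mean in practice, the sketch below measures how often a model reproduces the remainder of a training record verbatim when prompted with its opening characters. Everything here is an assumption made for illustration: the function names `verbatim_extraction_rate` and `make_toy_model` are hypothetical, the "model" is a lookup table standing in for a real language model, and the comparison is done on characters, whereas a real study would more likely work on tokens.

```python
from typing import Callable, Sequence


def verbatim_extraction_rate(
    records: Sequence[str],
    generate: Callable[[str, int], str],
    prefix_len: int = 20,
    suffix_len: int = 20,
) -> float:
    """Fraction of records whose suffix is reproduced verbatim from the prefix.

    `generate(prefix, n_chars)` is a hypothetical interface: it should return
    the model's continuation of `prefix`, truncated to `n_chars` characters.
    """
    # Only records long enough to supply both a prefix and a full suffix count.
    eligible = [r for r in records if len(r) >= prefix_len + suffix_len]
    if not eligible:
        return 0.0

    hits = 0
    for record in eligible:
        prefix = record[:prefix_len]
        true_suffix = record[prefix_len:prefix_len + suffix_len]
        if generate(prefix, suffix_len) == true_suffix:
            hits += 1
    return hits / len(eligible)


def make_toy_model(memorised: Sequence[str]) -> Callable[[str, int], str]:
    """Stand-in for a language model: it can only continue text it has stored."""

    def generate(prefix: str, n_chars: int) -> str:
        for text in memorised:
            if text.startswith(prefix):
                return text[len(prefix):len(prefix) + n_chars]
        return ""  # nothing memorised for this prefix

    return generate


# Synthetic, made-up records (not real patient data).
records = [
    "Patient A was admitted with chest pain on Monday.",
    "Patient B was discharged after two days of rest.",
]
toy_model = make_toy_model(memorised=records[:1])
rate = verbatim_extraction_rate(records, toy_model)
```

With a real model, `generate` would wrap the model's decoding call, and the rate could be compared before and after applying a mitigation strategy.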
Current Knowledge/Examples & Possible Techniques/Approaches:
Related Previous Internship Projects: Exploring Large-scale Language Models with NHS Incident Data
Enables Future Work: Work that uses modern language modelling techniques and aims to share the resulting models with a strong understanding of possible leakage. This project will also feed into the development of an assessment framework or tooling for a given use case.
Outcome/Learning Objectives: A better understanding of mitigation of training data leakage from large-scale language models, with some practical examples of concerns.
Datasets: Synthetically generated and curated examples, MIMIC-III or MIMIC-IV, any open medical text datasets
Desired skill set: When applying, please highlight any experience with natural language processing, language modelling, deep learning, and coding (including any coding in the open), along with any other data science experience you feel is relevant.