Keywords: NLP, Curation, TextData
Need: There is currently no large, open, representative UK-based medical text database. NLP models trained on the available datasets (e.g. MedMentions, MIMIC-III or MIMIC-IV) can miss domain-specific terms (e.g. drug names), get sentiment wrong, fail to understand UK-specific abbreviations, or struggle with context because of regional differences. These issues may not be obvious from surface testing. This project would take the first steps towards a dataset of United Kingdom focussed medical text sources for training NLP models for the NHS.
Current Knowledge/Examples & Possible Techniques/Approaches: Text extraction, processing and curation; task definition; evaluation and benchmarking. The project could also use more recent work on synthetic medical text generation, and open-source NLP tools for metadata enrichment.
Related Previous Internship Projects: n/a, as this is the first year of the scheme. Some architecture may already be in place.
Enables Future Work: We envisage this work as the foundation for ongoing projects, e.g. adding extra metadata suitable for tasks like negation or temporality identification, automated clinical coding in various settings, synthetic text generation, and translation of text to SNOMED codes. It also enables any UK-focussed NLP health analysis.
Outcome/Learning Objectives: A curated, public-facing dataset with relevant text and clinical tagging. The first step is to demonstrate how a small amount of easily accessed data can be extracted and processed.
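As an illustration of that first step, the following is a minimal sketch of extracting visible text from a single web page and turning it into a dataset record. It is not the project's method: the function names (`html_to_record`, `deduplicate`), the record fields and the example source label are all hypothetical, and it uses only the Python standard library. It also joins text fragments with spaces, which is a simplification that can split words across inline tags.

```python
import hashlib
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping script and style content."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_record(html, source):
    """Hypothetical helper: HTML string -> record with text and basic metadata."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse all runs of whitespace so records are comparable.
    text = re.sub(r"\s+", " ", " ".join(parser.parts)).strip()
    return {
        "source": source,  # assumed free-text label for where the page came from
        "text": text,
        "n_tokens": len(text.split()),  # whitespace tokens, a rough size measure
        "sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
    }


def deduplicate(records):
    """Hypothetical helper: drop records whose cleaned text is identical."""
    seen = set()
    unique = []
    for record in records:
        if record["sha256"] not in seen:
            seen.add(record["sha256"])
            unique.append(record)
    return unique


if __name__ == "__main__":
    page = "<html><body><h1>Asthma</h1><p>Use your reliever inhaler.</p></body></html>"
    print(html_to_record(page, source="example page"))
```

A real pipeline would also need to check each source's licence and add the clinical tagging described above; the hash-based deduplication here only catches exact duplicates after whitespace normalisation.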
Datasets: No specific dataset, but the project requires a variety of open health text data to be sourced, e.g. NHS.UK, clinical articles, data dictionary and metadata descriptions.
Desired skill set: When applying, please highlight any experience of working with text data (specifically medical text data), natural language processing, tagging of text, coding (including any coding in the open), and any other data science experience you feel is relevant.