Keywords: Linkage, Evaluation, Tabular
Need: Data linkage of administrative records is frequently conducted by trusted third-party organisations, leaving the end-users of the linked data largely unaware of the intricacies of the linkage process. Understanding the quality of data linkage is crucial for users to accurately interpret the datasets they receive. The existing literature offers extensive guidelines on how to communicate data linkage quality and uncertainty; however, implementing these recommendations in a scalable and automated manner presents significant challenges.
We have developed a preliminary plan to deliver transparent and explainable data linkage products to end-users. During this internship, we seek to refine and enhance this plan by incorporating the perspectives of external users. The goal is to establish a set of metrics that can:
These metrics must adhere to the following criteria:
This project aims to investigate and identify techniques and methodologies that can be applied and automated at scale to provide users of linked data with actionable and valuable information regarding the quality of data linkage.
Current Knowledge/Examples & Possible Techniques/Approaches:
Related Previous Internship Projects: This is the first internship project in this field, but it builds on the data science team's work mentioned above.
Enables Future Work: Data linkage is a critical aspect of modelling that requires rigorous evaluation and clear communication. NHS England is committed to enhancing the transparency of its data linkage approach to ultimately deliver trusted and high-quality linked data. By undertaking this project, the intern will contribute to this goal and help shape the future operations of the data linkage service.
Outcome/Learning Objectives: Minimum: A consolidated and comprehensive plan to deliver transparent and explainable data linkage, including defined items, priorities, and dependencies, together with a proof of concept demonstrating how to automate basic transparency and explainability metrics for users of linked data (for instance, a subset of the techniques for linkage quality assessment listed in Table 1 of Quality assessment in data linkage).
Ideal: All of the above, plus an engineered pipeline capable of automatically generating metrics and linkage metadata.
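As an illustration of what an automated basic metric in the proof of concept might look like, the sketch below computes precision, recall and F-measure for a set of linked record pairs. It assumes a gold-standard set of true links is available for comparison, which this project description does not confirm, and the names `LinkageQuality` and `assess_linkage` are hypothetical rather than part of any existing NHS England tooling.

```python
from dataclasses import dataclass
from typing import Hashable, Iterable, Tuple

# A link is a pair of record identifiers, one from each dataset.
Pair = Tuple[Hashable, Hashable]


@dataclass(frozen=True)
class LinkageQuality:
    """Counts of correct, incorrect and missed links (hypothetical structure)."""

    true_links: int    # predicted links that are in the gold standard
    false_links: int   # predicted links that are not in the gold standard
    missed_links: int  # gold-standard links that were not predicted

    @property
    def precision(self) -> float:
        found = self.true_links + self.false_links
        return self.true_links / found if found else 0.0

    @property
    def recall(self) -> float:
        expected = self.true_links + self.missed_links
        return self.true_links / expected if expected else 0.0

    @property
    def f_measure(self) -> float:
        p, r = self.precision, self.recall
        return 2 * p * r / (p + r) if (p + r) else 0.0


def assess_linkage(
    predicted_pairs: Iterable[Pair], gold_pairs: Iterable[Pair]
) -> LinkageQuality:
    """Compare predicted links with an assumed gold standard of true links."""
    predicted = set(predicted_pairs)
    gold = set(gold_pairs)
    return LinkageQuality(
        true_links=len(predicted & gold),
        false_links=len(predicted - gold),
        missed_links=len(gold - predicted),
    )
```

Calling `assess_linkage(predicted, gold)` on two lists of identifier pairs returns an object whose `precision`, `recall` and `f_measure` properties could be written into linkage metadata alongside each delivered dataset. Where no gold standard exists, other assessment techniques would be needed instead.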
Datasets: The project will involve working with real data from PAVE (Participant Validation Engine) submissions. Due to the sensitivity of this data, the successful candidate will require security clearance, a process that can take up to three months. Consequently, the internship is scheduled to begin in June to allow sufficient time for this clearance to be obtained.
Desired skill set: When applying, please highlight any experience with probabilistic or deterministic linkage, any coding experience (including coding in the open), and any other data science experience you feel is relevant.