Keywords: NLP, Validation, Text
Need: Generative AI is a broad area covering many modalities (text, images, audio, etc.). It is easy to engage with, yet hard to do “well”. Effective measures of quality are needed to help us choose the right techniques, level of complexity, and so on. Although classic data science and analytical approaches to measuring quality can still be useful, new thinking is also needed. This is not straightforward, as how “good” is defined varies greatly with the problem, the modality, the budget, etc. Additionally, existing common sense and best practice can be misleading, as good performance on a simpler problem or benchmark may not be indicative of future performance on a more complex business task.
Traditional measures of quality from NLP, such as BLEU and ROUGE scores, can be used; however, there is some debate about whether scoring well on these correlates with good outcomes for the user, so their utility, and when they provide benefit, needs to be examined. More recent metrics, such as BERTScore (or other approaches that compare embeddings), potentially provide more utility, while there are also small models (such as Cappy) which use ROUGE-L to score outputs against tasks with apparent success.
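For illustration, the sketch below shows how these reference-based metrics could be computed, assuming the Hugging Face `evaluate` library (with its `rouge_score` and `bert_score` dependencies) is installed; the example texts are invented.

```python
# Minimal sketch: comparing a generated summary against a reference with
# classic n-gram metrics (BLEU, ROUGE) and an embedding-based one (BERTScore).
# Assumes: pip install evaluate rouge_score bert_score (plus a transformers backend).
import evaluate

predictions = ["The patient was discharged with a course of antibiotics."]
references = ["The patient was sent home with antibiotics."]

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

# BLEU expects, for each prediction, a list of acceptable reference strings.
print(bleu.compute(predictions=predictions, references=[references]))

# ROUGE returns rouge1/rouge2/rougeL F-measures aggregated across examples.
print(rouge.compute(predictions=predictions, references=references))

# BERTScore compares contextual embeddings rather than surface n-grams,
# so close paraphrases score higher than they would under BLEU/ROUGE.
print(bertscore.compute(predictions=predictions, references=references, lang="en"))
```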
Models are often compared in terms of how well they perform on benchmarks (such as SQuAD, HellaSwag, etc.; see aggregator sites/tools such as the Open LLM Leaderboard and AgentBench for comparisons). However, for many users of LLMs it is not clear what these benchmarks are, what they test for, and to what extent their results can be used to infer performance on business tasks. There are also tools that allow you to run models against these benchmarks (such as promptbench).
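As an illustration of what one common benchmark actually measures, the sketch below scores a single extractive question-answering prediction with the SQuAD metric (exact match and token-level F1) via the Hugging Face `evaluate` library; the IDs and answer strings are invented.

```python
# Sketch: what the SQuAD benchmark scores - exact match and token-level F1
# between a predicted answer span and the gold answers.
# Assumes the Hugging Face `evaluate` library is installed.
import evaluate

squad_metric = evaluate.load("squad")

predictions = [{"id": "q1", "prediction_text": "metformin"}]
references = [{
    "id": "q1",
    # answer_start offsets are required by the metric's schema but do not
    # affect the exact-match/F1 calculation; values here are placeholders.
    "answers": {"text": ["metformin", "oral metformin"], "answer_start": [42, 37]},
}]

# Returns e.g. {"exact_match": 100.0, "f1": 100.0} when the prediction matches.
print(squad_metric.compute(predictions=predictions, references=references))
```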
This project would seek to review available tooling and best practice in order to create a healthcare-specific evaluation suite, and to identify the best current benchmarks for healthcare, with clarity around their coverage and limitations.
Current Knowledge/Examples & Possible Techniques/Approaches: There are many existing tools, methods, and benchmarks to draw on (e.g. tooling such as LangChain, RAGAS, and LlamaIndex; methods such as LLM-as-a-Judge; and benchmarks such as SQuAD). A sketch of the LLM-as-a-Judge pattern is given below.
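As a concrete example of the LLM-as-a-Judge approach, here is a minimal sketch; the rubric, the 1–5 scale, and the `call_llm` placeholder are illustrative assumptions rather than a validated instrument or a specific API.

```python
# Minimal LLM-as-a-Judge sketch: a judge model scores another model's answer
# against a simple rubric. `call_llm` is a placeholder for whichever hosted or
# local model the project chooses; the prompt and scale are illustrative only.

JUDGE_PROMPT = """You are assessing an answer to a healthcare question.

Question: {question}
Answer: {answer}

Rate the answer from 1 (unsafe or incorrect) to 5 (accurate and clearly explained).
Reply with the number only."""


def call_llm(prompt: str) -> str:
    """Placeholder: replace with a real client call to the chosen judge model."""
    raise NotImplementedError


def judge_answer(question: str, answer: str) -> int:
    """Ask the judge model to rate an answer; returns the 1-5 score."""
    reply = call_llm(JUDGE_PROMPT.format(question=question, answer=answer))
    return int(reply.strip())
```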
See also this online guide.
Related Previous Internship Projects: P51 - Investigating Privacy Concerns and Mitigations for Language Models in Healthcare; P33 - Exploring Large-scale Language Models with NHS Incident Data; P31 - Txt-Ray Align Continued
Enables Future Work: Supporting safe and appropriate usage of LLMs in all our projects, as well as directly supporting our assurance, benchmarking, and validation research. Overall, the project would aim to put forward a specification for larger funding into an NHS-specific benchmark for a set of identified tasks.
Outcome/Learning Objectives: This work would aim to produce both a technical and an accessible report, with supporting notebook experiments. These outputs would focus on how the evaluation could feed directly into a workflow to give operational and/or clinical confidence in the outputs.
Datasets: Public data to support open working and transparency
Desired skill set: When applying please highlight experience with Natural Language Processing, LLM development and usage, benchmarking, and any other data science experience you feel relevant.