Clinical Large Language Model Evaluation by Expert Review (CLEVER): Framework Development and Validation
Kocaman V; Kaya MA; Feier AM; Talby D
JMIR AI 2025 Dec;4:e72153. PMID: 41343765
BACKGROUND: The proliferation of both general-purpose and health care-specific large language models (LLMs) has intensified the challenge of effectively evaluating and comparing them. Data contamination undermines the validity of public benchmarks, self-preference distorts LLM-as-a-judge approaches, and there is a gap between the tasks used to test models and those encountered in clinical practice.

OBJECTIVE: In response, we propose CLEVER (Clinical Large Language Model Evaluation by Expert Review), a methodology for blind, randomized, preference-based evaluation by practicing medical doctors on specific tasks.

METHODS: We demonstrate the methodology by comparing GPT-4o (OpenAI) against 2 health care-specific LLMs, with 8 billion and 70 billion parameters, over 3 tasks: clinical text summarization, clinical information extraction, and question answering on biomedical research.

RESULTS: Medical doctors prefer the small health care-specific LLM trained by John Snow Labs over GPT-4o 45% to 92% more often across the dimensions of factuality, clinical relevance, and conciseness.

CONCLUSIONS: The models show comparable performance on open-ended medical question answering; together with the preference results, this suggests that health care-specific LLMs can outperform much larger general-purpose LLMs in tasks that require understanding of clinical context. We test the validity of CLEVER evaluations through interannotator agreement, intraclass correlation, and washout period analyses.
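The abstract does not include implementation details, but the core protocol (blind, randomized, pairwise preference judgments by clinicians, followed by agreement analysis) can be illustrated with a minimal Python sketch. Everything below is an assumption-based illustration, not the authors' code: the function names, model labels, rating procedure, and annotator data are hypothetical, and Cohen's kappa is used here as one common pairwise agreement statistic alongside the interannotator agreement and intraclass correlation analyses the paper reports.

```python
# Minimal sketch (not the authors' implementation) of a blind, randomized,
# pairwise preference evaluation plus a simple inter-annotator agreement check.
import random
from collections import Counter
from itertools import combinations

from sklearn.metrics import cohen_kappa_score  # pairwise Cohen's kappa


def blind_pair(output_a: str, output_b: str, rng: random.Random):
    """Randomize left/right position and hide model identity.

    Returns (left_text, right_text, key), where `key` maps the displayed
    positions back to the hidden model labels for later unblinding.
    """
    if rng.random() < 0.5:
        return output_a, output_b, {"left": "model_A", "right": "model_B"}
    return output_b, output_a, {"left": "model_B", "right": "model_A"}


def preference_rate(votes: list[str], model: str) -> float:
    """Fraction of unblinded votes that preferred `model`."""
    counts = Counter(votes)
    return counts[model] / len(votes) if votes else 0.0


def pairwise_kappa(annotations: dict[str, list[str]]) -> dict[tuple, float]:
    """Cohen's kappa for every pair of annotators.

    `annotations` maps annotator id -> list of unblinded votes, one per item,
    in the same item order for every annotator.
    """
    return {
        (a, b): cohen_kappa_score(annotations[a], annotations[b])
        for a, b in combinations(annotations, 2)
    }


if __name__ == "__main__":
    rng = random.Random(42)

    # Hypothetical blinded presentation of a single item.
    left, right, key = blind_pair("summary from model A", "summary from model B", rng)
    print("Shown blinded:", {"left": left, "right": right})

    # Hypothetical unblinded votes from three annotators over five items.
    annotations = {
        "md_1": ["model_A", "model_A", "model_B", "model_A", "model_A"],
        "md_2": ["model_A", "model_B", "model_B", "model_A", "model_A"],
        "md_3": ["model_A", "model_A", "model_B", "model_B", "model_A"],
    }
    for annotator, votes in annotations.items():
        print(annotator, "prefers model_A at rate", preference_rate(votes, "model_A"))
    print("Pairwise kappa:", pairwise_kappa(annotations))
```

In a sketch like this, the blinding key is stored separately from what annotators see, votes are mapped back to model labels only after collection, and agreement is reported per annotator pair; the paper's washout period analysis would additionally space repeated ratings of the same items over time.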