Use my Search Websuite to scan PubMed, PubMed Central, journal hosts, and journal archives for full text.
Send your search term to multiple engines at once.
A dictionary built from aggregated review articles in nephrology, medicine, and the life sciences.
Your one-stop pathway from a search term straight to the PDF of peer-reviewed, on-topic knowledge.

Fetch abstract from NCBI
DOI: 10.2196/72153
Sci-Hub: http://scihub22266oqcxt.onion/10.2196/72153
Fetch PDF from Google Scholar
PMID: 41343765

  • Clinical Large Language Model Evaluation by Expert Review (CLEVER): Framework Development and Validation
  • Kocaman V; Kaya MA; Feier AM; Talby D
  • JMIR AI 2025 Dec; 4(?): e72153. PMID: 41343765
  • BACKGROUND: The proliferation of both general purpose and health care-specific large language models (LLMs) has intensified the challenge of effectively evaluating and comparing them. Data contamination plagues the validity of public benchmarks, self-preference distorts LLM-as-a-judge approaches, and there is a gap between the tasks used to test models and those used in clinical practice.
    OBJECTIVE: In response, we propose CLEVER (Clinical Large Language Model Evaluation by Expert Review), a methodology for blind, randomized, preference-based evaluation by practicing medical doctors on specific tasks.
    METHODS: We demonstrate the methodology by comparing GPT-4o (OpenAI) against 2 health care-specific LLMs, with 8 billion and 70 billion parameters, over 3 tasks: clinical text summarization, clinical information extraction, and question answering on biomedical research.
    RESULTS: Medical doctors preferred the small medical LLM trained by John Snow Labs over GPT-4o 45% to 92% more often on the dimensions of factuality, clinical relevance, and conciseness. The models showed comparable performance on open-ended medical question answering.
    CONCLUSIONS: These results suggest that health care-specific LLMs can outperform much larger general purpose LLMs in tasks that require understanding of clinical context. We test the validity of CLEVER evaluations through interannotator agreement, intraclass correlation, and washout period analysis. (A minimal illustrative sketch of the blinded pairwise protocol follows below.)
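The abstract's core protocol is blinding plus randomization: each reviewer sees two responses per task with model identity hidden and left/right order shuffled, choices are later unblinded, and agreement between reviewers is checked. Below is a minimal Python sketch of that idea; the function names, toy data, and the use of Cohen's kappa for interannotator agreement are illustrative assumptions, not the authors' actual code or scoring.

    # Sketch of a CLEVER-style blind, randomized, preference-based evaluation.
    # All names and toy data are assumptions; the paper defines the real protocol.
    import random
    from collections import Counter

    def make_blind_pairs(outputs_a, outputs_b, seed=0):
        """Pair model outputs per task, shuffling left/right position so the
        reviewer cannot infer which model produced which response."""
        rng = random.Random(seed)
        pairs = []
        for task_id, (a, b) in enumerate(zip(outputs_a, outputs_b)):
            if rng.random() < 0.5:
                pairs.append({"task": task_id, "left": a, "right": b, "left_is_a": True})
            else:
                pairs.append({"task": task_id, "left": b, "right": a, "left_is_a": False})
        return pairs

    def unblind(pairs, choices):
        """Map 'left'/'right' reviewer choices back to model A/B preferences."""
        prefs = []
        for pair, choice in zip(pairs, choices):
            picked_left = (choice == "left")
            prefs.append("A" if picked_left == pair["left_is_a"] else "B")
        return prefs

    def cohen_kappa(r1, r2):
        """Interannotator agreement (Cohen's kappa) between two reviewers."""
        n = len(r1)
        po = sum(x == y for x, y in zip(r1, r2)) / n          # observed agreement
        c1, c2 = Counter(r1), Counter(r2)
        pe = sum((c1[k] / n) * (c2[k] / n) for k in set(c1) | set(c2))  # chance agreement
        return (po - pe) / (1 - pe) if pe < 1 else 1.0

    # Toy usage: two reviewers rate the same blinded pairs.
    a_out = ["summary A1", "summary A2", "summary A3", "summary A4"]
    b_out = ["summary B1", "summary B2", "summary B3", "summary B4"]
    pairs = make_blind_pairs(a_out, b_out, seed=42)
    rev1 = unblind(pairs, ["left", "right", "left", "left"])
    rev2 = unblind(pairs, ["left", "right", "right", "left"])
    print("Reviewer 1 preference rate for A:", rev1.count("A") / len(rev1))
    print("Cohen's kappa:", round(cohen_kappa(rev1, rev2), 3))

Randomizing left/right position per pair is what removes position bias, and storing the blinding key ("left_is_a") separately from what reviewers see is what keeps the evaluation blind until unblinding.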


    Linkout box
  • DeepDyve
  • Pubget (overpricing)
  • Fetch abstract from NCBI