Critical Appraisal Tools for Evaluating Artificial Intelligence in Clinical Studies: Scoping Review #MMPMID41359958
Cabello JB; Ruiz Garcia V; Torralba M; Maldonado Fernandez M; Ubeda M; Ansuategui E; Ramos-Ruperto L; Emparanza JI; Urreta I; Iglesias MT; Pijoan JI; Burls A
J Med Internet Res 2025[Dec]; 27 (?): e77110 PMID41359958show ga
BACKGROUND: Health research that uses predictive and generative artificial intelligence (AI) is rapidly growing. As in traditional clinical studies, the way in which AI studies are conducted can introduce systematic errors. The translation of this AI evidence into clinical practice and research needs critical appraisal tools for clinical decision-makers and researchers. OBJECTIVE: This study aimed to identify existing tools for the critical appraisal of clinical studies that use AI and to examine the concepts and domains these tools explore. The research question was framed using the Population-Concept-Context (PCC) framework. Population (P): AI clinical studies; Concept (C): tools for critical appraisal and associated constructs such as quality, reporting, validity, risk of bias, and applicability; and context (C): clinical practice. In addition, studies on bias classification and chatbot assessment were included. METHODS: We searched medical and engineering databases (MEDLINE, Embase, CINAHL, PsycINFO, and IEEE) from inception to April 2024. We included clinical primary research with tools for critical appraisal. Classical reviews and systematic reviews were included in the first phase of screening and excluded in the secondary phase after identifying new tools by forward snowballing. We excluded nonhuman, computer, and mathematical research, and letters, opinion papers, and editorials. We used Rayyan (Qatar Computing Research Institute) for screening. Data extraction was done by two reviewers, and discrepancies were resolved through discussion. The protocol was previously registered in Open Science Framework. We adhered to the PRISMA-ScR (Preferred Reporting Items for Systematic reviews and Meta-Analyses extension for Scoping Reviews) and the PRISMA-S (PRISMA-Search) extension for reporting literature in systematic reviews. RESULTS: We retrieved 4393 unique records for screening. After excluding 3803 records, 119 were selected for full-text screening. From these, 59 were excluded. After inclusion of 10 studies via other methods, a total of 70 records were finally included. We found 46 tools (26 guides for reporting AI studies, 16 tools for critical appraisal, 2 for study quality, and 2 for risk of bias). Nine papers focused on bias classification or mitigation. We found 15 chatbot assessment studies or systematic reviews of chatbot studies (6 and 9, respectively), which are a very heterogeneous group. CONCLUSIONS: The results picture a landscape of evidence tools where reporting tools predominate, followed by critical appraisal, and a few tools for risk of bias. The mismatch of bias in AI and epidemiology should be considered for critical appraisal, especially regarding fairness and bias mitigation in AI. Finally, chatbot assessment studies represent a vast and evolving field in which progress in design, reporting, and critical appraisal is necessary and urgent.