Project profile: Evaluation of AI Tools in Healthcare
Status: Active
This project develops and applies structured frameworks to evaluate generative artificial intelligence (AI), especially large language models (LLMs), in clinical settings. We combine qualitative assessments with automated metrics to evaluate the linguistic quality, completeness, and trustworthiness of AI-generated outputs. Applications include Epic’s In-Basket messaging system and ambient digital scribing tools, both evaluated on real clinical data. The work supports responsible AI deployment by improving evaluation scalability, aligning automated metrics with clinical judgment, and informing governance standards.
Principal Investigator: Chuan Hong
Paper 1 – Published in JAMIA
Overview: This study introduces a unified evaluation framework for assessing large language models (LLMs) in healthcare, aiming to address the limitations of human-only evaluations, which are often time-consuming, inconsistent, and difficult to scale. The proposed framework integrates both qualitative assessments (e.g., linguistic quality, trustworthiness, usefulness) and quantitative metrics to enable more scalable and objective evaluation. The framework was applied to evaluate the Epic In-Basket feature, which uses LLMs to generate draft replies to patient messages. Results showed that while AI-generated responses were fluent and clear, they often lacked coherence and completeness. Importantly, clinicians’ actual use of these drafts was strongly aligned with specific quantitative metrics, supporting their utility as proxies for human judgment. The study concludes that although automated metrics cannot fully replace human evaluation, they can enhance the process and enable continuous monitoring and benchmarking of LLMs in clinical practice. This integrated approach supports more reliable, efficient, and ethical deployment of AI tools in healthcare.
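The alignment between automated metrics and clinician behavior can be sketched in a few lines of Python. The snippet below is a minimal illustration, not the study’s actual pipeline: it assumes that for each patient message one has the AI draft, the reply the clinician ultimately sent, and a flag for whether the draft was used, and it correlates a simple token-overlap score with draft usage. All field names and example records are hypothetical.

def rouge1_recall(draft: str, sent: str) -> float:
    """Fraction of sent-message words that also appear in the AI draft (a crude overlap proxy)."""
    draft_words = set(draft.lower().split())
    sent_words = sent.lower().split()
    if not sent_words:
        return 0.0
    return sum(w in draft_words for w in sent_words) / len(sent_words)

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

# Hypothetical records: AI draft, the reply the clinician actually sent,
# and whether the clinician started from the draft (1) or wrote from scratch (0).
records = [
    {"draft": "Your lab results are normal. No further action is needed.",
     "sent":  "Your lab results came back normal, so no further action is needed.",
     "used_draft": 1},
    {"draft": "Thank you for your message. Please continue your current medication.",
     "sent":  "Thanks for reaching out. Please keep taking your current medication as prescribed.",
     "used_draft": 1},
    {"draft": "Please schedule a routine follow-up visit at your convenience.",
     "sent":  "Your potassium is low. Please repeat the labs tomorrow and call the clinic.",
     "used_draft": 0},
]

scores = [rouge1_recall(r["draft"], r["sent"]) for r in records]
usage = [r["used_draft"] for r in records]
print("metric-vs-usage correlation:", round(pearson(scores, usage), 3))

A strong association of this kind is the sense in which automated scores can serve as proxies for clinician judgment at scale.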
Paper 2 – Accepted by npj Digital Medicine (in production)
Title: An Evaluation Framework for Ambient Digital Scribing Tools in Clinical Applications
Overview: Ambient digital scribing (ADS) tools alleviate clinician documentation burden, reducing burnout and enhancing efficiency. As AI-driven ADS tools are integrated into clinical workflows, robust governance is essential for their ethical and secure deployment. This study proposes a comprehensive ADS evaluation framework incorporating human evaluation, automated metrics, simulation testing, and large language models (LLMs) as evaluators. Our framework assesses transcription, diarization, and medical note generation against criteria such as fluency, completeness, and factuality. To demonstrate its effectiveness, we developed an ADS tool and applied our framework to evaluate its performance on 40 real clinical visit recordings. Our evaluation revealed strengths such as fluency and clarity, but also highlighted weaknesses in factual accuracy and in capturing new medications. These findings underscore the value of structured ADS evaluation in improving healthcare delivery while emphasizing the need for strong governance to ensure safe, ethical integration.
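For the transcription component, one standard automated check is word error rate (WER). The sketch below is a minimal, self-contained illustration using hypothetical transcripts; the framework’s actual tooling and metric implementations may differ.

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # Dynamic-programming edit distance over words (insertions, deletions, substitutions).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Hypothetical reference (human transcript) vs. ADS output; the dropped
# medication mention is exactly the kind of omission the framework flags.
reference = "patient reports starting lisinopril ten milligrams daily last week"
hypothesis = "patient reports starting medication daily last week"
print(f"WER: {wer(reference, hypothesis):.2f}")

Comparable scalar scores (for example, diarization error rate for speaker attribution, or factuality checks on the generated note) can be tracked alongside human review, which is the combination of automated and human evaluation the framework describes.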