Project profile: Understanding the Clinical Data Behind Large Language Models
Status: Active
In recent years, there has been significant excitement around applying large language models (LLMs) to a wide range of clinical tasks. Recent research has found that general-domain models can perform just as well as medically fine-tuned LLMs on standard benchmarks, despite being trained only on general online corpora. This raises major questions about where and how open-source LLMs learn clinical information, given that they are not trained on EHR text.
This project, led by Duke researchers, examines how popular open-source LLMs learn clinical information from large mined corpora through two crucial but understudied lenses: (1) their interpretation of clinical jargon, a foundational ability for understanding real-world clinical notes, and (2) their responses to unsupported medical claims. For both use cases, we investigate how frequently the relevant clinical information appears in the models' pretraining corpora, the relationship between pretraining data composition and model outputs, and the sources underlying this data. We find that jargon appearing frequently in clinical notes is often rare in pretraining corpora, revealing a mismatch between available data and real-world usage. Our analysis provides lessons for future training dataset filtering and composition.
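To make the frequency comparison concrete, here is a minimal sketch of the general idea: count how often a handful of clinical-jargon terms appear in a sample of pretraining documents versus a sample of clinical notes, and surface terms that are common in notes but rare in pretraining data. The term list, naive substring matching, and toy corpora below are illustrative placeholders, not the project's actual methodology, data, or code.

```python
from collections import Counter

# Hypothetical jargon list for illustration only.
JARGON_TERMS = ["a-fib", "nstemi", "c/o", "h/o", "sob"]


def document_frequency(docs: list[str], terms: list[str]) -> Counter:
    """Count, for each term, how many documents contain it (case-insensitive).
    Naive substring matching stands in for real tokenization/normalization."""
    counts = Counter()
    for doc in docs:
        text = doc.lower()
        for term in terms:
            if term in text:
                counts[term] += 1
    return counts


def frequency_mismatch(pretrain_docs: list[str], note_docs: list[str],
                       terms: list[str] = JARGON_TERMS) -> dict[str, tuple[float, float]]:
    """Return each term's document frequency in the pretraining sample versus
    the clinical-note sample, so terms that are common in notes but rare in
    pretraining data stand out."""
    pre = document_frequency(pretrain_docs, terms)
    notes = document_frequency(note_docs, terms)
    return {
        t: (pre[t] / max(len(pretrain_docs), 1),
            notes[t] / max(len(note_docs), 1))
        for t in terms
    }


if __name__ == "__main__":
    # Toy samples; real analyses would use large pretraining corpora and de-identified notes.
    pretrain_sample = ["An article about music and travel.",
                       "A recipe blog post about sourdough bread."]
    note_sample = ["Pt c/o SOB, h/o a-fib.",
                   "67M h/o NSTEMI presents with c/o chest pain."]
    for term, (f_pre, f_note) in frequency_mismatch(pretrain_sample, note_sample).items():
        print(f"{term}: pretraining={f_pre:.2f}, notes={f_note:.2f}")
```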
Research supported by: Duke Whitehead Scholar Award
Principal Investigator: Monica Agrawal
Related Publications:
Jia, F., Sontag, D., & Agrawal, M. Diagnosing our datasets: How does my language model learn clinical information? Conference on Health, Inference, and Learning (CHIL) 2025.
