Project profile: Interpretable Long-Horizon Diagnosis Prediction via Learned Retrieval of Predictive Descriptions in Clinical Notes

Status: Active

Clinical notes contain descriptive findings that are not captured in structured EHR fields and that are relevant to early prediction of slowly evolving conditions, including neurodevelopmental conditions and chronic diseases. However, extracting the findings relevant to a given condition of interest is challenging because they are sparse relative to the high volume of clinical notes for a typical patient.

In this project, we have developed IRIS (Interpretable Retrieval-Augmented Classification for long Interspersed Document Sequences), a novel, lightweight, and scalable method that addresses this challenge. IRIS divides clinical notes into short segments, stores an embedding of each segment in a vector database, and uses learnable query vectors to retrieve the segments most relevant to a given prediction task.
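The segment, embed, and retrieve steps described above can be illustrated with a minimal sketch. This is not the IRIS implementation: the function names (`split_into_segments`, `toy_embed`, `retrieve_top_k`), the fixed-length word windows, the hashed bag-of-words embedding, the in-memory matrix standing in for a vector database, and the example notes are all assumptions made for illustration. In IRIS the query vectors are learned; here a fixed query stands in for one.

```python
import zlib

import numpy as np


def split_into_segments(note: str, max_words: int = 8) -> list[str]:
    """Split a note into short, fixed-length word windows (assumed segmentation)."""
    words = note.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def toy_embed(segment: str, dim: int = 64) -> np.ndarray:
    """Hypothetical stand-in for a text encoder: hashed bag of words, L2-normalized."""
    vec = np.zeros(dim)
    for word in segment.lower().split():
        # crc32 is deterministic across runs, unlike Python's built-in hash()
        vec[zlib.crc32(word.encode("utf-8")) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


def retrieve_top_k(
    query: np.ndarray, segment_matrix: np.ndarray, k: int
) -> tuple[np.ndarray, np.ndarray]:
    """Return indices and scores of the k segments with the highest dot product."""
    scores = segment_matrix @ query
    k = min(k, len(scores))
    top = np.argsort(-scores)[:k]
    return top, scores[top]


# Made-up example notes, for illustration only.
notes = [
    "Parent reports delayed speech and limited eye contact at the 18 month visit.",
    "Routine follow up. Vaccines given. No concerns about growth or feeding today.",
]
segments = [seg for note in notes for seg in split_into_segments(note)]

# In-memory matrix standing in for the vector database of segment embeddings.
index = np.stack([toy_embed(seg) for seg in segments])

# A fixed vector standing in for one learnable query vector.
query = toy_embed("delayed speech eye contact")

top, scores = retrieve_top_k(query, index, k=2)
for i, score in zip(top, scores):
    print(f"{score:.2f}  {segments[i]}")
```

Because each query selects concrete text segments, the retrieved segments themselves can be read to see what the query attends to, which is the sense in which retrieval of this kind supports interpretation.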

We have conducted several case studies illustrating how these query vectors can be interpreted, for example to identify and categorize specific descriptive findings in early childhood that are predictive of a later autism diagnosis.

Research supported by:

National Institute of Mental Health (NIMH) (K01MH127309)

Dissemination:

Li F, Hill ED, Shu J, Gao J, Engelhard MM. IRIS: Interpretable Retrieval-Augmented Classification for Long Interspersed Document Sequences. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 2025 Jul (pp. 30263-30283).

Related Presentations, News, or Media:

Presented at the 2025 Annual Autism Centers of Excellence Investigators’ Meeting