AI Health Friday Roundup

The AI Health Friday Roundup highlights the week’s news and publications related to artificial intelligence, data science, public health, and clinical research.

May 16, 2025

In this week’s Duke AI Health Friday Roundup: using LLMs to map data to common data elements; new benchmark for healthcare LLMs debuts; social penalties for using AI in the workplace; SPIRIT protocol guidelines updated; worries that primary care is buckling under nonclinical workload; parsing FDA plans for deploying generative AI for product evaluation; avoiding problematic “elderspeak”; much more:

AI, STATISTICS & DATA SCIENCE

Close-up photo shows a hand placing a pin with a round red head into a fine-scale city map. Image credit: GeoJango Maps/Unsplash
  • “…to provide a practical tool and minimize human efforts during the mapping process, we developed CDEMapper, a publicly available, user-friendly, and LLM-powered mapping tool, for biomedical and clinical researchers to map their local data elements to the NIH CDEs. Our tool not only integrated state-of-the-art LLM technologies into a streamlined pipeline but also adopted a modular-based architecture to allow quick updates of software components and support future versions of NIH CDEs.” In a research article published in the Journal of the American Medical Informatics Association, Wang and colleagues present CDEMapper, an LLM-powered tool that maps local data elements to the shared nomenclature of the NIH Common Data Elements (H/T @jessiet1023.bsky.social).
  • “OpenAI’s HealthBench contains 5,000 ‘realistic health conversations,’ each with a custom rubric to grade the model’s responses to health-related questions. The questions and rubrics were curated by a group of 262 physicians who have practiced in a combined 60 countries, the company said. In total, the rubrics encompass over 57,000 unique criteria, allowing the company to measure the performance of models in many more dimensions than traditional benchmarks, said Singhal.” STAT News’ Brittany Trang has big news from OpenAI: the release of a benchmark dataset for LLMs in healthcare applications.
  • “Although there were over 2.5 million activations of AI scribes over a 1-year period, further enhancements and customization will likely continue to improve the uptake of the tool, particularly among nonusers. AI scribes are likely to have their greatest positive impact on physician burden when applied strategically to the most highly impacted areas, but also when use is customized and targeted by the physicians themselves, based on the scenarios and patients for which AI scribes can be most useful.” In a commentary article for NEJM Catalyst, Tierney and colleagues share an institutional experience with a year of using ambient AI “scribes” to capture clinical documentation.
  • “Sharing a disclaimer in an AI-assisted reply to a patient message conveys transparency in content authenticity. Lack of transparency could raise doubts in patients who have established relationships with their physicians and are familiar with their communication styles. AI-generated drafts are often longer and more effusively empathetic, potentially differing from typical physician communication — especially if not substantially revised. This could lead to patients questioning the authenticity of the replies, potentially damaging the crucial doctor–patient trust.” A perspective article in NEJM AI by Millen and colleagues presents an institutional experience with automated disclosure of the use of AI applications in patient encounters.

BASIC SCIENCE, CLINICAL RESEARCH & PUBLIC HEALTH

An empty shopping cart sits in an empty parking deck at night, lit by fluorescent lights overhead. Image credit: James Watson/Unsplash
  • “After the end of EAs, there were significant increases in food insecurity and poor physical health days among SNAP participants. No changes in poor mental health days or poor or fair health status were observed….The increase in poor physical health after the end of SNAP EAs is equivalent to approximately 1 additional day of poor physical health each month, and is similar in magnitude to declines in physical health observed nationally during the COVID-19 pandemic.” In a research letter published in JAMA, Liu and colleagues examine changes in health status among SNAP (Supplemental Nutrition Assistance Program) participants after the end of the pandemic-era “emergency allotments” of benefits.
  • “The protocol of a randomized trial is the foundation for study planning, conduct, reporting, and external review. However, trial protocols vary in their completeness and often do not address key elements of design and conduct. The SPIRIT (Standard Protocol Items: Recommendations for Interventional Trials) statement was first published in 2013 as guidance to improve the completeness of trial protocols….Herein, we systematically update the SPIRIT recommendations for minimum items to address in the protocol of a randomized trial.” In a paper co-published in JAMA (and multiple other journals), Chan and colleagues provide an updated set of the SPIRIT guidelines for developing research protocols for randomized clinical trials.
  • “In this pragmatic RCT of antihypertensive medication timing, we observed no difference in all-cause deaths or major cardiovascular events or in potential hypotensive, visual, or cognition-related adverse events between adults with hypertension in primary care randomized to medication administration at bedtime vs in the morning. In particular, there was no difference in falls or fractures, new glaucoma diagnoses, or cognitive decline. Administration time affected neither the benefits nor the risks of BP-lowering medication.” A research article published in JAMA by Garrison and colleagues examines a long-running question that previous trials have failed to conclusively settle: whether there is any benefit to taking antihypertensive medications in the morning vs in the evening.
  • “Many organizations have successfully deployed additional team members to help PCPs manage their inboxes. Similarly, use of tablets to collect information from patients while they are in waiting rooms and ambient artificial intelligence tools that assist physicians with documentation has shown promise for helping physicians manage EHR-related tasks….But the primary care system in the United States is at risk of collapsing. Though correcting the mismatch between PCPs’ work and their remuneration remains crucial, steps are required to address additional concerns.” A perspective article in the New England Journal of Medicine by Landon and colleagues paints a stark picture of pressures affecting primary care in the United States.

COMMUNICATION, HEALTH EQUITY & POLICY

A computer lab with three rows of four desks, each occupied by students working at computers. Overlaying the computer lab are red lines connecting through nodes, symbolizing the flow of communication, data exchange, and interconnected networks. Image credit: Hanna Barakat & Cambridge Diversity Fund / Better Images of AI (betterimagesofai.org) / CC-BY 4.0
  • “The Duke team conducted four experiments with over 4,400 participants to examine both anticipated and actual evaluations of AI tool users. Their findings, presented in a paper titled “Evidence of a social evaluation penalty for using AI,” reveal a consistent pattern of bias against those who receive help from AI….What made this penalty particularly concerning for the researchers was its consistency across demographics. They found that the social stigma against AI use wasn’t limited to specific groups.” Ars Technica’s Benj Edwards reports on an eye-catching PNAS paper by a group of authors from Duke’s Fuqua School of Business that finds a social penalty for using AI in the workplace.
  • “It’s no surprise that the FDA is experimenting with generative AI, which has upended medical industries since high-powered commercial large language models burst into public consciousness in 2023….But in FDA meetings, experts have cautioned against adopting the technology too quickly for clinical purposes. The first meeting of FDA’s Digital Health Advisory Committee in November cited many of generative AI’s risks — including hallucinations, output variability, and privacy concerns — that should encourage careful testing before full adoption.” STAT News’ Katie Palmer and Casey Ross examine the implications of a recent announcement by the FDA commissioner regarding the impending rollout of AI-based applications to assist in product review at the agency.
  • “Sometimes, elderspeakers employ a louder volume, shorter sentences, or simple words intoned slowly. Or they may adopt an exaggerated, singsong vocal quality more suited to preschoolers, along with words like “potty” or “jammies.”…With what are known as tag questions — It’s time for you to eat lunch now, right? — ‘You’re asking them a question but you’re not letting them respond,’ Williams explained. ‘You’re telling them how to respond.’” In an article for Kaiser Health News, Paula Span highlights the potential harms of “elderspeak,” an often well-meant but potentially grating use of baby talk aimed at older persons in care and nursing environments.