AI Health
Friday Roundup
The AI Health Friday Roundup highlights the week’s news and publications related to artificial intelligence, data science, public health, and clinical research.
March 14, 2025
In this week’s Duke AI Health Friday Roundup: study examines use of NLP for chart message routing; new nonprofit provides home for bioRxiv and medRxiv; study weighs patient preferences for AI vs human messaging; signs of AI “peer review” crop up in comments on a journal submission; the case for keeping both efficacy and safety as key criteria for FDA drug approval; “red-teaming” ChatGPT outputs in clinical settings; much more:
AI, STATISTICS & DATA SCIENCE

- “Real-time message routing by the model was associated with shorter message response and resolution times and reductions in overall message burden for health care staff compared with a cohort of unrouted messages from the same period. In a sensitivity analysis, the effects of implementing the NLP remained consistent across most staff roles. The model also demonstrated high accuracy for correctly predicting each message class compared with expert reviewers, with accuracy exceeding 98% across message classes.” A research article published in NEJM AI by Anderson and colleagues describes an evaluation of a BERT-based natural language processing algorithm designed to cut down on clinician fatigue by more efficiently routing patient portal messages (a rough sketch of this kind of message classifier appears after this list).
- “It’s true that in several prominent studies, researchers have staged ‘competitions’ in which AI technology appears to outperform humans in these very human areas. But a closer look reveals that these games are rigged against us humans. The competitions do not actually ask machines to perform human tasks; it’s more accurate to say that they ask humans to behave in machine-like ways as they perform lifeless simulacra of human tasks.” In an opinion piece for the Guardian, cognitive scientist M. J. Crockett suggests that some tests pitting human performance against AI may be stacking the deck against the humans.
- “…participants had a slight preference for clinician messages written by AI over those written by a human, and yet participants expressed higher satisfaction with messages they were told were written by their clinician over those they were told were written by AI. While statistically significant, the difference in overall patient satisfaction was small, with more than 75% of patients expressing satisfaction regardless of the author or disclosure.” In a survey study published in JAMA Network Open, Cavalier and colleagues examine how patients feel about receiving AI-authored chart messages versus those written by clinicians.
- “Wang and her colleagues created benchmarks to evaluate AI systems along two different dimensions that the team devised: difference awareness and contextual awareness. Difference awareness is measured by asking the AI descriptive questions about things like specific laws and demographics—questions that have an objectively correct answer.…Contextual awareness, a more subjective measure, tests the model’s ability to differentiate between groups within a larger context and involves value-based judgments.” MIT Technology Review’s Scott J. Mulligan reports on a pair of new assessment benchmarks that may hold promise for more accurate approaches to measuring bias in AI systems.
- “Of concern, inappropriate responses tended to be subtle and time-consuming to verify. Questions regarding “other people” who had had a similar diagnosis or requests to provide citations supporting a medical claim were likely to produce hallucination-containing answers that required manual verification. …With regards to citations, even when citation author list, article name, journal name, and publication year were all correct, the articles cited did not support the claims that the LLM reported they did, and indeed could be from completely unrelated disciplines.” A research article published in NPJ Digital Medicine by Chang and colleagues describes the results of an effort to “red-team” ChatGPT and gauge its capacity to deliver inappropriate responses in a healthcare setting.
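The Anderson et al. item above reports outcomes but not implementation details. As a rough illustration only (not the study’s actual system), a BERT-style message classifier built with the Hugging Face transformers library might look like the sketch below; the checkpoint name, message classes, and route_message helper are hypothetical placeholders.

```python
# Illustrative sketch of BERT-based portal-message routing.
# NOT the implementation from Anderson et al.; the checkpoint and
# message classes below are hypothetical placeholders.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical routing categories for patient portal messages.
MESSAGE_CLASSES = ["medication_refill", "scheduling", "clinical_question", "billing"]

# Stand-in checkpoint; a deployed router would load a model fine-tuned
# on labeled portal messages rather than the base BERT weights.
MODEL_NAME = "bert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=len(MESSAGE_CLASSES)
)
model.eval()

def route_message(text: str) -> str:
    """Predict a routing class for a single portal message."""
    inputs = tokenizer(text, truncation=True, padding=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return MESSAGE_CLASSES[int(logits.argmax(dim=-1))]

print(route_message("Could I get a refill of my lisinopril prescription?"))
```

Note that with an untuned classification head, as here, predictions are arbitrary; meaningful routing (and accuracy figures like the 98% reported in the study) would require fine-tuning on a labeled message corpus.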
BASIC SCIENCE, CLINICAL RESEARCH & PUBLIC HEALTH

- “Labrador retrievers are particularly obesity-prone and tend to be highly food motivated… We studied a population of British Labrador retrievers and performed a GWAS which revealed multiple obesity-associated loci. We developed polygenic risk scores which explain previously observed obesity variation in the breed and quantify gene-environment interaction. Comparative genomics identified that canine obesity genes were also associated with human obesity.” A paper by Wallis and colleagues, published in Science, identifies genetic loci underlying the phenomenon of Labrador retriever chonkiness (a generic form of the polygenic risk scores mentioned here is sketched after this list).
- “The Intuition brain health study was a large, virtual observational study in 23,004 US adults using direct-to-consumer App-based interactive and passive data collection from an iPhone and Apple Watch. We describe the study design, deployment and baseline study population, and provide initial proof-of-concept modeling results to support the validity of remote MCI [mild cognitive impairment] classification.” A research article published in Nature Medicine by Butler and colleagues describes an observational study that evaluated smartphone- and wearable-based detection of cognitive impairment (H/T Matthew Harker).
- “Building an evidence base and implementing evidence require time. But rapid progress can be made with innovation that supports screening patients and populations at risk for the adverse health effects of climate change and providing updates to treatment protocols that consider climate change. New therapies to treat climate-sensitive disease and injury, scenario-based planning to support health system operations, and enhanced surveillance to track health loss and damage are encouraged.” A JAMA Insights essay by Hess and Elbi advocates for the development of an evidence-based framework for understanding and responding to the effects of climate change on human health.
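For readers unfamiliar with the polygenic risk scores mentioned in the Wallis et al. item above, the standard construction is a weighted sum of risk-allele counts; the generic form below is a textbook sketch, not the paper’s specific scoring:

$$\mathrm{PRS}_i = \sum_{j=1}^{M} \hat{\beta}_j \, x_{ij}$$

where $x_{ij} \in \{0, 1, 2\}$ counts the risk alleles that individual $i$ carries at variant $j$, and $\hat{\beta}_j$ is the per-allele effect size estimated in the GWAS.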
COMMUNICATION, HEALTH EQUITY & POLICY

- “…the tone was far too uniform and generic. There was also an unexpected lack of nuance, depth or personality. And the reviewer had provided no page or line numbers and no specific examples of what needed to be improved to guide my revisions….they suggested I ‘remove redundant explanations’. However, they didn’t indicate which explanations were redundant, or even where they occurred in the manuscript. They also suggested I order my reference list in a bizarre manner which disregarded the journal requirements and followed no format that I have seen replicated in a scientific journal. They provided comments pertaining to subheadings that didn’t exist.” An essay by Timothy Hugh Barker, published at The Conversation, describes a dubious set of “peer review” comments on a submitted manuscript that raised suspicions of having been produced by an LLM.
- “NIH staff members have been instructed to identify and potentially cancel grants for projects studying transgender populations, gender identity, diversity, equity and inclusion (DEI) in the scientific workforce, environmental justice and any other research that might be perceived to discriminate on the basis of race or ethnicity, according to documents and an audio recording that Nature has obtained. Grants that allot funding to universities in China and those relating to climate change are also under scrutiny.” Nature’s Max Kozlov and Smriti Mallapaty report on the abrupt termination of active NIH research grants.
- “The approval standard in the Federal Food, Drug, and Cosmetic Act is flexible, allowing the FDA to make case-by-case determinations regarding what evidence is sufficient to support approval of a drug for a specific use. The adaptability of the approval standard, combined with existing pathways for companies to provide unproven products to patients with unmet treatment needs, suggests that proposals to lower or eliminate preapproval effectiveness requirements are, at best, premised on profound misunderstandings of the role of the FDA, and a failure to recognize the shortcomings of postmarket evidence generation.” A JAMA viewpoint article by Zettler and colleagues argues against proposed changes to the current FDA regulatory standard, which requires demonstration of both efficacy and safety as a precondition for drug approval.
- “The free preprint servers bioRxiv and medRxiv, which during the past decade have sped up scientific communication by allowing biomedical researchers to share unreviewed manuscripts, today announced they will operate under a new nonprofit. Those involved hope to grow the share of all papers first appearing as preprints, increase submissions from authors in the Global South, and expand experiments to vet preprints.” Science’s Jeffrey Brainard reports on the formation of a nonprofit organization to support the operation of the bioRxiv and medRxiv preprint servers.