AI Health

Friday Roundup

The AI Health Friday Roundup highlights the week’s news and publications related to artificial intelligence, data science, public health, and clinical research.

December 19, 2025

In this week’s Duke AI Health Friday Roundup: choosing measures for AI performance; problems with AI simulation in behavioral studies; AI ethics textbook full of bogus citations; testing different chatbots on subspecialty boards; looking back on the year in computer science; analysis flags questionable publication practices; machine learning model spots liver disease in EKG data; much more:

AI, STATISTICS & DATA SCIENCE

Close-up photograph shows a tangle of colorful measuring tapes, some marked in inches, others in centimeters. Image credit: Patricia Serna/Unsplash
  • “We evaluated 32 classic and contemporary performance measures across five performance domains (discrimination, calibration, overall performance, classification, and clinical utility) for predictive AI models intended for medical practice. When validating the performance of a prediction model, we warn against the use of measures that are improper…or that do not have a clear focus on either statistical or decision-analytical performance… Measures that conflate statistical and decision-analytical performance without properly accounting for misclassification costs are ambiguous and should be replaced with dedicated measures for clinical utility.” An article by Van Calster and colleagues, published in The Lancet Digital Health, critically evaluates predictive AI performance measures for whether they are fit for purpose in clinical settings (a brief illustrative sketch of the statistical-versus-decision-analytical distinction appears after this list).
  • “Importantly, statistical analysis revealed no significant differences in overall scores among the three AI models across the 3 years or within any single year. These findings suggest that while newer models may show incremental improvements in accuracy, the margin may not yet be wide enough to translate into consistent superiority under testing conditions. Moreover, variability within the same model across different years, such as OpenAI GPT-3.5’s fluctuations, indicates that external factors like question composition, difficulty, or thematic emphasis could influence AI performance as much as the model capability itself.” A research article by Roberts and colleagues, published in the journal JPGN Reports, evaluates the performance of three chatbots tasked with taking pediatric gastroenterology subspecialty board examinations.
  • “This scenario exemplifies what we call the ‘provenance problem’: a systematic breakdown in the chain of scholarly acknowledgement that current ethical frameworks fail to address. Unlike conventional plagiarism, this phenomenon involves neither ill-intent nor accidental carelessness with one’s sources. The problem is of a different, more intractable sort. Because LLM training distributes the influence of its source across billions of interdependent parameters, the contribution of any specific text is often untraceable.” An article by Earp and colleagues, published in Nature Machine Intelligence, argues that using large language models to author papers risks severing ideas from the context that gave rise to them and that is necessary for fully understanding them.
  • “The Frontiers’ survey found that, among the respondents who use AI in peer review, 59% use it to help write their peer-review reports. Twenty-nine per cent said they use it to summarize the manuscript, identify gaps or check references. And 28% use AI to flag potential signs of misconduct, such as plagiarism and image duplication (see ‘AI assistance’).” Nature’s Miryam Naddaf reports on a new survey by the Frontiers scientific publishing group that suggests a large proportion of peer reviewers are using AI tools to help review manuscripts – even though doing so might not be allowed.
  • “He found that the 252 choice combinations produced a wide range of different outcomes. Some settings led models to more closely match the rankings of human participants, for instance, whereas others more closely matched the correlation between the measures. But no single combination of settings worked well across the board.” An article at Science Insider by Cathleen O’Grady reports on research, recently published as a preprint on arXiv, suggesting that using AI to simulate human subjects in behavioral science research may be a fraught and risky undertaking.
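To make the distinction drawn in the Van Calster article a bit more concrete, here is a minimal Python sketch, not taken from the paper itself, that contrasts a discrimination measure (AUC), a calibration check (the calibration slope), and a decision-analytical measure (net benefit at a chosen risk threshold) using standard textbook formulas. The synthetic data, variable names, and the 10% threshold are illustrative assumptions only.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic predicted risks and outcomes; outcomes are drawn from the risks,
# so the predictions are reasonably well calibrated by construction.
rng = np.random.default_rng(0)
risk = np.clip(rng.beta(2, 8, size=2000), 0.01, 0.99)
y = rng.binomial(1, risk)

# Discrimination: how well predicted risks rank events above non-events.
auc = roc_auc_score(y, risk)

# Calibration slope: regress the outcome on the log-odds of predicted risk;
# a slope near 1 suggests risks are neither over- nor under-dispersed.
logit = np.log(risk / (1 - risk)).reshape(-1, 1)
slope = LogisticRegression(C=1e6).fit(logit, y).coef_[0][0]  # large C ~ no regularization

# Net benefit at threshold pt: a decision-analytical measure that weighs true
# positives against false positives using the odds of the threshold, which
# encodes an assumed misclassification cost ratio.
pt = 0.10
treat = risk >= pt
tp_rate = np.mean(treat & (y == 1))
fp_rate = np.mean(treat & (y == 0))
net_benefit = tp_rate - fp_rate * pt / (1 - pt)

print(f"AUC = {auc:.3f}, calibration slope = {slope:.2f}, net benefit at {pt:.0%} = {net_benefit:.3f}")

The only point of the sketch is that AUC and the calibration slope say nothing about misclassification costs, whereas net benefit builds a cost ratio in through the decision threshold; deciding which of these to report is the kind of choice the article examines.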

BASIC SCIENCE, CLINICAL RESEARCH & PUBLIC HEALTH

A green glowing electrocardiogram trace on a hospital monitor video screen shows the peaks and depressions of a heart rhythm. Image credit: Joshua Chehov/Unsplash
  • “…we conducted a pragmatic trial to assess whether an electrocardiogram (ECG)-based machine learning (ECG-ML) model enables early detection of advanced CLD [chronic liver disease]…. The intervention significantly increased new diagnoses of advanced CLD in the overall cohort (1.0% versus 0.5% in the control arm; odds ratio (OR) 2.09, 95% confidence interval (CI) 1.22–3.55, P = 0.007).” Findings from a pragmatic clinical trial, published by Simonetto and colleagues in Nature Medicine this week, demonstrate the utility of a machine learning model that detects advanced chronic liver disease from electrocardiogram (ECG) data.
  • “There were also plenty of positive developments in 2025 that offer hope for the coming years…Our recent Nature’s 10 package includes many good news stories — and there were many more. From gene-editing firsts to rapid disease containment and policy victories, Nature takes a look at some positive science stories of 2025.” I think we could all use a bit of positivity, and Katie Kavanagh’s recap of uplifting science stories from 2025 in Nature seems like just the thing.
  • “Explore the year’s most surprising computational revelations, including a new fundamental relationship between time and space, an undergraduate who overthrew a 40-year-old conjecture, and the unexpectedly effortless triggers that can turn AI evil.” It’s been an eventful year in computer science, but Quanta’s Michael Moyer is here to walk us through some of 2025’s biggest stories.

COMMUNICATIONS & POLICY

Photograph shows a balloon printed with the gritted-teeth emoji expressive of secondhand embarrassment. Image credit: Bernard Hermant/Unsplash
  • “One of the world’s largest academic publishers is selling a book on the ethics of artificial intelligence research that appears to be riddled with fake citations, including references to journals that do not exist….The Times found that a book recently published by the German-British publishing giant Springer Nature includes dozens of citations that appear to have been invented — a sign, often, of AI-generated material.” The Times’ Tilly Harris and Rhys Blakely report on an unusually on-the-nose bit of (most likely) AI-generated shenanigans in a recently published Springer Nature textbook (on AI ethics, of course).
  • “…we’re building an AI-powered tool to conduct source audits on large corpora of news articles. The goal of the project is to equip newsrooms with data that helps them identify when coverage leans too heavily on a single type of source. Thanks to recent breakthroughs in generative AI, highly accurate quote-detection technology is now within reach.” An article by Dam and colleagues at Columbia Journalism Review reports on the development of an AI-powered tool for identifying and extracting quotes from large collections of news stories to facilitate “audits” of story sourcing.
  • “Of the nearly 1,000 pairs of authors and handling editors, Mindel and Ciriello found at least half had at least one identifiable potential conflict of interest, with about 40 percent of the author-editor pairs having accepted each other’s papers. Their findings suggest editorial conflicts of interest ‘are not isolated anomalies but pervasive features of elite journal governance,’ Mindel and Ciriello write in the preprint.” Retraction Watch reports on recent research scrutinizing publication practices in academic business journals.