AI Health

Friday Roundup

The AI Health Friday Roundup highlights the week’s news and publications related to artificial intelligence, data science, public health, and clinical research.

December 13, 2024

In this week’s Duke AI Health Friday Roundup: developing benchmarks for AI performance; tapping the potential of mental health apps; how multidisciplinary is that journal?; updating AI prediction models in healthcare; weighing the benefits of sepsis prediction models; risks of drug overdose death among Medicaid recipients; creating institutional guidelines for the use of health AI; dataset encompasses retracted journal articles; much more:

AI, STATISTICS & DATA SCIENCE

A mosaic-like image of clouds, made of server and data center components, symbolizing the hidden physical infrastructure of cloud computing.
Image credit: Nadia Piet + AIxDESIGN & Archival Images of AI / Better Images of AI / Cloud Computing / CC-BY 4.0
  • “…no studies to date have assessed the quality of AI benchmarks in general in a structured manner, including both FM and non-FM benchmarks. Further, no comparative analyses have assessed the quality differences across the benchmark life cycle between widely used AI benchmarks. This leaves a significant gap for practitioners who may be relying on these benchmarks to select models for downstream tasks and policymakers who are increasingly integrating benchmarking in their AI policy apparatuses.” A recently released policy brief from Stanford University’s Human-Centered AI Institute addresses the profusion – and accompanying lack of validation and oversight – of the myriad available “benchmarks” for AI performance.
  • “By administering sentence completion prompts to 77 different LLMs (for instance, ‘We are…’), we demonstrate that nearly all base models and some instruction-tuned and preference-tuned models display clear ingroup favoritism and outgroup derogation. These biases manifest both in controlled experimental settings and in naturalistic human–LLM conversations. However, we find that careful curation of training data and specialized fine-tuning can substantially reduce bias levels.” A research article published this week by Hu and colleagues in Nature Computational Science offers evidence of social identity biases permeating generative AI.
  • “AI-based prediction models are increasingly used in healthcare, helping clinicians with diagnosing diseases, guiding treatment decisions, and informing patients. However, these prediction models do not always work well when applied to hospitals, patient populations, or times different from those used to develop the models. Developing new models for every situation is not practical nor desired, as it wastes resources, time, and existing knowledge. A more efficient approach is to adjust existing models to new contexts (‘updating’), but there is limited guidance on how to do this for AI-based clinical prediction models.” A paper by Meijerink and colleagues, accepted for publication in the Journal of Clinical Epidemiology and available ahead of print, examines multiple AI-based clinical prediction models to provide a roadmap for updating predictive models.
  • “Recognizing the rapid advancement of AI technologies and their potential to transform healthcare delivery, we propose a set of guidelines emphasizing fairness, robustness, privacy, safety, transparency, explainability, accountability, and benefit. Through a multidisciplinary collaboration, we developed and operationalized these guidelines within a healthcare system, highlighting a case study on ambient documentation to demonstrate the practical application and challenges of implementing generative AI in clinical environments. Our proposed framework ensures continuous monitoring, evaluation, and adaptation of AI technologies, addressing ethical considerations and enhancing patient care.” A perspective article published in NPJ Digital Medicine by Saenz and colleagues offers a case study for creating institutional guidelines for the use of AI in health care (h/t @smcgrath.phd).

BASIC SCIENCE, CLINICAL RESEARCH & PUBLIC HEALTH

Two ceramic-like hands grip and pull on delicate threads that emerge from a "woven circuit board." The contrast between the rigid, heavy material of the hands and the soft, fragile threads creates a visual paradox, symbolising the insertion of human touch into the mechanised world. The image evokes a sense of personified anonymity, questioning whose histories and labours are being revealed or concealed when the threads of technology are pulled.
Image credit: Hanna Barakat + AIxDESIGN & Archival Images of AI / Better Images of AI / Woven Circuit / CC-BY 4.0
  • “Leveraging the EHR to provide smart clinical decision support is arguably one of the greatest, and least realized, opportunities to improve the quality of health care delivery. Sepsis is a prime use case, but enthusiasm has outstripped the quality of the evidence base. The SCREEN trial provides the first large-scale demonstration of benefit in a robust randomized evaluation. However, it also raises questions around just how such alerts might actually work, in whom, and in what settings.” A JAMA editorial by Derek C. Angus that accompanies a research article by Arabi and colleagues examines the relative utility of automated sepsis alerts that cue off of data in patient electronic health records.
  • “The most common types of distortion included failing to start the y axis at zero, as well as mistakes that related to logarithmic axes. The former often inflates the difference between two values to make small disparities look larger, and the latter can minimize differences because our brains are prone to perceive scales as linear. Papers with multiple co-authors were most likely to include these distortions, the team found.” A technology feature article by Nature’s Amanda Heidt addresses controversy surrounding a venerable fixture of scientific communications – the humble bar graph.
  • “…before consumer DMHIs [digital mental health interventions] can transform access to effective support, they must overcome an urgent problem: Most people don’t want to use them. Our best estimate is that 96% of people who download a mental health app will have entirely stopped using it just 15 days later. The field of digital mental health has been trying to tackle this profound engagement problem for years, with little progress. As a result, the wave of pandemic-era excitement and funding for digital mental health is drying up. To advance DMHIs toward their promise of global impact, we need a revolution in these tools’ design.” A STAT News opinion article by NIMH postdoc Benjamin Kaveladze makes the case that the potential for benefit from mental health apps is being attenuated by outdated approaches.
  • “This study suggests that drug overdose deaths remain a leading cause of death in the US despite years of investment to try to address the opioid epidemic. Medicaid beneficiaries are at high risk of drug overdoses. Federal and states agencies should invest in timely and accessible linked mortality and Medicaid data to better understand and target interventions toward the populations at highest risk.” An analysis published by Mark and Huber in JAMA Health Forum examines trends in drug overdose deaths among Medicaid beneficiaries.

COMMUNICATION, HEALTH EQUITY & POLICY

Closeup photo showing a disorderly jumble of colorful pills and capsules.
Image credit: Myriam Zilles/Unsplash
  • “…we examined all new drugs approved in the US, Germany, and Switzerland between January 2011 and December 2022 that had pricing information and assessed launch prices and price developments post launch until September 2023. In the US, drug prices of cancer drugs (but not noncancer drugs) increased substantially post launch. By contrast, drug prices decreased substantially post launch in Germany and Switzerland, but drug prices for cancer drugs continued to be substantially higher than noncancer drugs.” An analysis by Laube and colleagues, appearing in JAMA Health Forum, compares the evolution of drug prices following market debut in the US, Germany, and Switzerland.
  • “We found that the representation of the main branches of knowledge in multidisciplinary journals was uneven and, in general, not proportional to the global research effort dedicated to each branch. Similarly, the distribution of publications across specific research areas was uneven, with “Biochemistry & Molecular Biology” strongly overrepresented. However, we detected a decreasing trend in the percentage of publications that multidisciplinary journals dedicate to this and other top areas, especially over the last decade. The multidisciplinary degree of multidisciplinary journals, as measured by the Gini index, was generally low but showed a gradual increase over time.” An analysis published in PLOS ONE by Redondo-Gómez and colleagues addresses the question of just how “multidisciplinary” multidisciplinary scientific journals actually are.
  • “Retractions play a vital role in maintaining scientific integrity, yet systematic studies of retractions in computer science and other STEM fields remain scarce. We present WithdrarXiv, the first large-scale dataset of withdrawn papers from arXiv, containing over 14,000 papers and their associated retraction comments spanning the repository’s entire history through September 2024. Through careful analysis of author comments, we develop a comprehensive taxonomy of retraction reasons, identifying 10 distinct categories ranging from critical errors to policy violations.” A preprint article by Rao and colleagues, available from arXiv, describes the creation of WithdrarXiv, a data repository devoted to collecting and categorizing papers that have been retracted from publication for various reasons.