AI Health

Friday Roundup

The AI Health Friday Roundup highlights the week’s news and publications related to artificial intelligence, data science, public health, and clinical research.

April 11, 2025

In this week’s Duke AI Health Friday Roundup: general LLMs still trail locally trained models – but are improving; Stanford HAI releases 2025 AI Index; rural hospitals and patients suffer from lagging infrastructure; Anthropic analysis sheds light on how students use AI; report to Congress warns that US biotechnology lead is slipping; oxygen tolerance may have evolved earlier than thought; questioning the choice of benchmarks used to evaluate health AI performance; much more:

AI, STATISTICS & DATA SCIENCE

An instructor and two students are seated around a table; a speech bubble shows biomedical data being fed into an AI algorithm that predicts long-term cardiovascular disease risk. Image credit: Yaning Wu / https://betterimagesofai.org / https://creativecommons.org/licenses/by/4.0/
  • “This investigation suggests that LLMs are not yet ready to serve as predictive, analytic models, although GPT-4 does surpass GPT-3.5 in most of our experiments, indicating that GPT is improving with new releases. In our experiments, LLMs have a significantly lower AUROC compared to traditional ML. This implies that LLMs simply do not match the discriminative capabilities of traditional ML for these tasks. …Moreover, inspecting the calibration curves and Brier Scores reveals that LLMs are poorly calibrated when compared to the traditional ML comparison.” A research article published in JAMIA in March by Brown and colleagues reports findings from a study that compared the performance of general-purpose LLMs with locally trained machine learning models across a variety of clinical prediction tasks (H/T moorejh.bsky.social). A brief sketch of the metrics cited here appears after this list.
  • “AI-related incidents are rising sharply, yet standardized RAI evaluations remain rare among major industrial model developers. However, new benchmarks like HELM Safety, AIR-Bench, and FACTS offer promising tools for assessing factuality and safety. Among companies, a gap persists between recognizing RAI risks and taking meaningful action. In contrast, governments are showing increased urgency…” Stanford’s Institute for Human-Centered Artificial Intelligence (HAI) has released its 2025 AI Index Report.
  • “While previous studies have focused on specific tasks or used smaller datasets, our analysis across multiple clinical scenarios and sociodemographic groups advances the evaluation of biases in medical LLMs. The magnitude and repetition of these differences, such as high rates of mental health referrals or escalated interventions, suggest that sociodemographic identifiers, rather than clinical need, are frequently driving model outputs.” A research article by Omar and colleagues, published in Nature Medicine, examines sociodemographic biases present in large language model AIs to characterize their magnitude, pervasiveness, and direction of effect.
  • “We understand the appeal of multiagent AI simulations, which can test on hundreds of thousands of synthetic patients, and, because the only limitation is computing power, new models can be added at will. The output of such systems is attractive leaderboards that compare models and are updated frequently. But such conveniences are outweighed by a fundamental question — what is being measured?” An editorial published in NEJM AI by Rodman and colleagues argues for careful consideration of relevant benchmarks when evaluating AI applications for use in medicine.
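
For readers curious about the metrics cited in the Brown and colleagues item above, the snippet below is a minimal, purely illustrative sketch (not the study’s code) of how discrimination (AUROC) and calibration (Brier score, calibration curve) are commonly computed with scikit-learn in Python; the simulated outcomes and predicted risks are hypothetical stand-ins for model output.

    # Minimal illustrative sketch (not the study's code): computing AUROC, Brier score,
    # and a calibration curve for two hypothetical sets of predicted risks.
    import numpy as np
    from sklearn.metrics import roc_auc_score, brier_score_loss
    from sklearn.calibration import calibration_curve

    rng = np.random.default_rng(0)
    y_true = rng.integers(0, 2, size=500)  # hypothetical observed binary outcomes

    # Hypothetical predicted risks: one well-separated model, one weakly informative one.
    p_ml = np.clip(0.2 + 0.6 * y_true + rng.normal(0, 0.15, 500), 0, 1)
    p_llm = np.clip(rng.normal(0.5, 0.15, 500), 0, 1)

    for name, p in [("traditional ML (simulated)", p_ml), ("LLM (simulated)", p_llm)]:
        auroc = roc_auc_score(y_true, p)     # discrimination: 0.5 = chance, 1.0 = perfect
        brier = brier_score_loss(y_true, p)  # mean squared error of predicted risk; lower is better
        frac_pos, mean_pred = calibration_curve(y_true, p, n_bins=10)
        print(f"{name}: AUROC = {auroc:.3f}, Brier = {brier:.3f}")
        # For a well-calibrated model, frac_pos tracks mean_pred closely across bins.

The sketch is only meant to show what each metric measures: AUROC captures how well a model ranks higher-risk patients above lower-risk ones, while the Brier score and calibration curve capture whether predicted probabilities match observed event rates.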

BASIC SCIENCE, CLINICAL RESEARCH & PUBLIC HEALTH

Stromatolites (aggregations of primitive cyanobacteria) growing in Hamelin Pool Marine Nature Reserve, Shark Bay in Western Australia. Image credit: Paul Harrison via Wikipedia (CC BY-SA 3.0)
  • “Their analysis showed that the last common ancestor of bacteria likely existed 4.4 to 3.9 billion years ago, and aerobic organisms likely emerged before the Great Oxidation Event (2.43 to 2.33 billion years ago). Oxygen tolerance may have been a prerequisite for, rather than a consequence of, the evolution of oxygenic photosynthesis.” A research article by Davín and colleagues, published in Science, suggests that life on Earth may have evolved oxygen tolerance earlier than previously thought.
  • “In this randomized crossover trial involving healthy young adults of varying weights, we show that drinks sweetened with sucralose, a non-caloric sweetener, increased hypothalamic blood flow…compared to caloric sugar (sucrose) and water. Sucrose, compared to sucralose, had a hunger-dampening effect while also raising peripheral glucose levels, which corresponded to reduced medial hypothalamic blood flow. These results support the notion…that non-caloric sweeteners may alter appetite by interfering with the conventional neural responses to sweet taste and nutrient signalling observed with caloric sugar.” A research article published by Chakravartti and colleagues in Nature Metabolism reports findings that suggest artificial sweeteners may affect appetite regulation in humans.
  • “…a less visible, more critical inequity is working against high-quality care for Walker and other patients: The hospital’s internet connection is a fraction of what experts say is sufficient. High-speed broadband is the new backbone of America’s health care system, which depends on electronic health records, high-tech wireless equipment, and telehealth access.” A Kaiser Family Foundation Health News article by Sarah Jane Tribble, Holly K. Hacker, and Caresse Jackman describes the struggle rural hospitals and the patients they serve are facing due to lagging technological infrastructure.

COMMUNICATION, HEALTH EQUITY & POLICY

A group of graduating students approaches a classical university building whose facade mixes literary text and mathematical formulas, representing the humanities and the sciences being pulled apart. Image credit: Zoya Yasmine / https://betterimagesofai.org / https://creativecommons.org/licenses/by/4.0/
  • “…nearly half (~47%) of student-AI conversations were Direct—that is, seeking answers or content with minimal engagement. Whereas many of these serve legitimate learning purposes (like asking conceptual questions or generating study guides), we did find concerning Direct conversation examples…These raise important questions about academic integrity, the development of critical thinking skills, and how to best assess student learning.” Anthropic has posted an analysis of anonymized student conversations with its Claude LLM, shedding light on how the model is actually being used in academic settings.
  • “The final report released Tuesday includes close to 50 recommendations to protect U.S. biotech intellectual property and bolster drug development, agriculture, and biological weapons defense. The commission doesn’t have the power to authorize any changes but can make recommendations and advise members of Congress….In all, the commission — which is made up of four members of Congress, former Google CEO Eric Schmidt, and MIT researcher Angela Belcher, among others — is recommending that Congress invest at least $15 billion in various projects over the next five years.” STAT News’ Allison DeAngelis covers a report from the National Security Commission on Emerging Biotechnology that warns Congress and the public that the United States’ global preeminence in biotechnology is at risk and urgently needs additional support.
  • “During an April 2023 inspection at Raptim facilities in Navi Mumbai, India, FDA inspectors found ‘objectionable conditions’ that led them to conclude the company falsified data in testing for multiple subjects and samples across multiple studies, according to a letter sent last week to the pharmaceutical companies….At the same time, the FDA sent a letter to Raptim to say its data was unreliable.” Pharmalot’s Ed Silverman reports on an unusual FDA move that requires pharmaceutical companies to redo subpar (and possibly falsified) testing performed by a contract research organization.
  • “…the Humanities were and are experiencing the impact of two blunt forces: a great forgetting about the role of higher education in driving the American economy and the expansion of opportunity since World War II, and the perilous invisibility of Humanities expertise in a period when expertise of all kinds is vulnerable beyond reasonable scrutiny.” An essay by Karin Wulf for the Scholarly Kitchen examines the current stresses being felt by the humanities in academia – and what they might augur for academia as a whole.