AI Health

Friday Roundup

The AI Health Friday Roundup highlights the week’s news and publications related to artificial intelligence, data science, public health, and clinical research.

April 11, 2025

In this week’s Duke AI Health Friday Roundup: general LLMs still trail locally trained models – but are improving; Stanford HAI releases 2025 AI Index; rural hospitals and patients suffer from lagging infrastructure; Anthropic analysis sheds light on how students use AI; report to Congress warns that US biotechnology lead is slipping; oxygen tolerance may have evolved earlier than thought; questioning the choice of benchmarks used to evaluate health AI performance; much more:

AI, STATISTICS & DATA SCIENCE

An instructor and two students are seated around a table; a speech bubble shows biomedical data being fed into an AI algorithm that predicts long-term cardiovascular disease risk. Image credit: Yaning Wu / https://betterimagesofai.org / https://creativecommons.org/licenses/by/4.0/
  • “This investigation suggests that LLMs are not yet ready to serve as predictive, analytic models, although GPT-4 does surpass GPT-3.5 in most of our experiments, indicating that GPT is improving with new releases. In our experiments, LLMs have a significantly lower AUROC compared to traditional ML. This implies that LLMs simply do not match the discriminative capabilities of traditional ML for these tasks. …Moreover, inspecting the calibration curves and Brier Scores reveals that LLMs are poorly calibrated when compared to the traditional ML comparison.” A research article published in JAMIA in March by Brown and colleagues reports findings from a study that compared the performance of general-purpose LLMs with locally trained machine learning models across a variety of clinical prediction tasks (H/T moorejh.bsky.social). A brief sketch of the metrics cited here appears after this list.
  • “AI-related incidents are rising sharply, yet standardized RAI evaluations remain rare among major industrial model developers. However, new benchmarks like HELM Safety, AIR-Bench, and FACTS offer promising tools for assessing factuality and safety. Among companies, a gap persists between recognizing RAI risks and taking meaningful action. In contrast, governments are showing increased urgency…” Stanford’s Institute for Human-Centered Artificial Intelligence (HAI) has released its 2025 AI Index Report.
  • “While previous studies have focused on specific tasks or used smaller datasets, our analysis across multiple clinical scenarios and sociodemographic groups advances the evaluation of biases in medical LLMs. The magnitude and repetition of these differences, such as high rates of mental health referrals or escalated interventions, suggest that sociodemographic identifiers, rather than clinical need, are frequently driving model outputs.” A research article by Omar and colleagues, published in Nature Medicine, examines sociodemographic biases present in large language model AIs to characterize their magnitude, pervasiveness, and direction of effect.
  • “We understand the appeal of multiagent AI simulations, which can test on hundreds of thousands of synthetic patients, and, because the only limitation is computing power, new models can be added at will. The output of such systems is attractive leaderboards that compare models and are updated frequently. But such conveniences are outweighed by a fundamental question — what is being measured?” An editorial published in NEJM AI by Rodman and colleagues argues for careful consideration of relevant benchmarks when evaluating AI applications for use in medicine.
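
For readers curious about the metrics cited in the Brown and colleagues item above, the snippet below is a minimal, purely illustrative sketch (not the study’s code) of how discrimination (AUROC) and calibration (Brier score, calibration curve) are commonly computed with scikit-learn in Python; the simulated outcomes and predicted risks are hypothetical stand-ins for model output.

    # Minimal illustrative sketch (not the study's code): computing AUROC, Brier score,
    # and a calibration curve for two hypothetical sets of predicted risks.
    import numpy as np
    from sklearn.metrics import roc_auc_score, brier_score_loss
    from sklearn.calibration import calibration_curve

    rng = np.random.default_rng(0)
    y_true = rng.integers(0, 2, size=500)  # hypothetical observed binary outcomes

    # Hypothetical predicted risks: one well-separated model, one weakly informative one.
    p_ml = np.clip(0.2 + 0.6 * y_true + rng.normal(0, 0.15, 500), 0, 1)
    p_llm = np.clip(rng.normal(0.5, 0.15, 500), 0, 1)

    for name, p in [("traditional ML (simulated)", p_ml), ("LLM (simulated)", p_llm)]:
        auroc = roc_auc_score(y_true, p)     # discrimination: 0.5 = chance, 1.0 = perfect
        brier = brier_score_loss(y_true, p)  # mean squared error of predicted risk; lower is better
        frac_pos, mean_pred = calibration_curve(y_true, p, n_bins=10)
        print(f"{name}: AUROC = {auroc:.3f}, Brier = {brier:.3f}")
        # For a well-calibrated model, frac_pos tracks mean_pred closely across bins.

The sketch is only meant to show what each metric measures: AUROC captures how well a model ranks higher-risk patients above lower-risk ones, while the Brier score and calibration curve capture whether predicted probabilities match observed event rates.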

BASIC SCIENCE, CLINICAL RESEARCH & PUBLIC HEALTH

Stromatolites (aggregations of primitive cyanobacteria) growing in Hamelin Pool Marine Nature Reserve, Shark Bay in Western Australia. Image credit: Paul Harrison via Wikipedia (CC BY-SA 3.0)
  • “Their analysis showed that the last common ancestor of bacteria likely existed 4.4 to 3.9 billion years ago, and aerobic organisms likely emerged before the Great Oxidation Event (2.43 to 2.33 billion years ago). Oxygen tolerance may have been a prerequisite for, rather than a consequence of, the evolution of oxygenic photosynthesis.” A research article by Davín and colleagues, published in Science, suggests that life on Earth may have evolved oxygen tolerance earlier than previously thought.
  • “In this randomized crossover trial involving healthy young adults of varying weights, we show that drinks sweetened with sucralose, a non-caloric sweetener, increased hypothalamic blood flow…compared to caloric sugar (sucrose) and water. Sucrose, compared to sucralose, had a hunger-dampening effect while also raising peripheral glucose levels, which corresponded to reduced medial hypothalamic blood flow. These results support the notion…that non-caloric sweeteners may alter appetite by interfering with the conventional neural responses to sweet taste and nutrient signalling observed with caloric sugar.” A research article published by Chakravartti and colleagues in Nature Metabolism reports findings that suggest artificial sweeteners may affect appetite regulation in humans.
  • “…a less visible, more critical inequity is working against high-quality care for Walker and other patients: The hospital’s internet connection is a fraction of what experts say is sufficient. High-speed broadband is the new backbone of America’s health care system, which depends on electronic health records, high-tech wireless equipment, and telehealth access.” A Kaiser Family Foundation Health News article by Sarah Jane Tribble, Holly K. Hacker, and Caresse Jackman describes the struggle rural hospitals and the patients they serve are facing due to lagging technological infrastructure.

COMMUNICATION, HEALTH EQUITY & POLICY

A group of graduating students approaches a classical university building whose facade mixes literary text and mathematical formulas, representing the humanities and the sciences being pulled apart. Image credit: Zoya Yasmine / https://betterimagesofai.org / https://creativecommons.org/licenses/by/4.0/
  • “…nearly half (~47%) of student-AI conversations were Direct—that is, seeking answers or content with minimal engagement. Whereas many of these serve legitimate learning purposes (like asking conceptual questions or generating study guides), we did find concerning Direct conversation examples…These raise important questions about academic integrity, the development of critical thinking skills, and how to best assess student learning.” Anthropic has posted an analysis of anonymized student conversations with its Claude LLM, shedding light on how the model is actually being used in academic settings.
  • “The final report released Tuesday includes close to 50 recommendations to protect U.S. biotech intellectual property and bolster drug development, agriculture, and biological weapons defense. The commission doesn’t have the power to authorize any changes but can make recommendations and advise members of Congress….In all, the commission — which is made up of four members of Congress, former Google CEO Eric Schmidt, and MIT researcher Angela Belcher, among others — is recommending that Congress invest at least $15 billion in various projects over the next five years.” STAT News’ Allison DeAngelis covers a report from the National Security Commission on Emerging Biotechnology that warns Congress and the public that the United States’ global preeminence in biotechnology is at risk and urgently needs additional support.
  • “During an April 2023 inspection at Raptim facilities in Navi Mumbai, India, FDA inspectors found ‘objectionable conditions’ that led them to conclude the company falsified data in testing for multiple subjects and samples across multiple studies, according to a letter sent last week to the pharmaceutical companies….At the same time, the FDA sent a letter to Raptim to say its data was unreliable.” Pharmalot’s Ed Silverman reports on an unusual FDA move that requires pharmaceutical companies to redo subpar (and possibly falsified) testing performed by a contract research organization.
  • “…the Humanities were and are experiencing the impact of two blunt forces: a great forgetting about the role of higher education in driving the American economy and the expansion of opportunity since World War II, and the perilous invisibility of Humanities expertise in a period when expertise of all kinds is vulnerable beyond reasonable scrutiny.” An essay by Karin Wulf for the Scholarly Kitchen examines the current stresses being felt by the humanities in academia – and what they might augur for academia as a whole.