AI Health

Friday Roundup

The AI Health Friday Roundup highlights the week’s news and publications related to artificial intelligence, data science, public health, and clinical research.

December 22, 2023

In this week’s Duke AI Health Roundup: a couple of big weeks for AI regulation; GPT-4 reveals serious biases in clinical tasks; brain organoids bridged to computer inputs; proposing a network of “assurance labs” for health AI; automated ECG-based tools for risk assessment; assessment framework for eHealth tools; Stanford AI experts look forward to 2024; diagnostic accuracy of large language models; genome of vanished “woolly dogs” decoded; surveys examine state of deep learning; more:


A small toy robot made out of wooden blocks, seated on a flat surface, with a smiling expression and a red heart shape on its chest.
Image credit: Ochir-Erdene Oyunmedeg/Unsplash
  • “An LLM was more accurate than human clinicians in determining pretest and posttest probability after a negative test result in all 5 cases. The LLM did not perform as well after positive test results. LLM estimates were worse than human estimates for the case that was framed as a urinary tract infection (UTI) in the question stem but was actually asymptomatic bacteriuria…. Other than the fifth test case, when the AI formally solved a basic statistical reasoning question, the range of the LLM’s probability outputs in response to clinical vignettes appeared to be emergent from its stochastic nature.” A research letter published this month in JAMA Network Open by Rodman and colleagues presents findings from a study that compared the diagnostic accuracy of the GPT-4 large language model with that of human clinicians.
  • “Assurance labs could serve as a shared resource for the industry to validate AI models, thus accelerating the pace of development and innovation, responsible and safe AI deployment, and successful market adoption. A network of assurance labs could comprise both private and public entities, rather than one national organization…. Such a network could fill a critical gap in an ecosystem dominated by well-meaning but often overexuberant and inexperienced developers who lack the depth of understanding of health care delivery.” A recent publication in JAMA by Shah and colleagues introduces the concept of “assurance laboratories” for assessing the safety and performance of health AI applications.
  • “Have we reached peak AI? No, say several Stanford scholars. Expect bigger and multimodal models, exciting new capabilities, and more conversations around how we want to use and regulate this technology.” Stanford’s Human-Centered Artificial Intelligence institute rounds up some AI prognostications for the coming year.
  • “With ECG-based automated risk assessment, PreOpNet not only simplifies the preoperative evaluation process but also reduces the need for extensive tests and examinations. Consequently, patients will experience a quicker preoperative assessment, with shorter hospitalisation and lower expense, which might translate into considerable health economic value. Moreover, PreOpNet’s lightweight architecture is a pivotal feature that enhances its accessibility and applicability.” A commentary by Hong and Zhao published in Lancet Digital Health reflects on an article by Ouyang and colleagues that demonstrated the use of an automated tool for ECG-based risk assessment.
  • “GPT-4 exhibits racial and gender bias across clinically relevant tasks, including the generation of cases for medical education, support for differential diagnostic reasoning, medical plan recommendation, and subjective assessments of patients. For each of these tasks, GPT-4 was found to exaggerate known disease prevalence differences between groups, over-represent stereotypes including problematic representations of minority groups, and amplify harmful societal biases. These findings are seriously concerning and are in line with previous research about bias in large-scale generative artificial intelligence models more broadly.” Also in Lancet Digital Health this month: a commentary by Janna Hastings introduces work by Zack and colleagues that reveals multiple forms of bias in the responses of the GPT-4 chatbot when it was given clinical tasks to complete.
  • “In an era where running state-of-the-art models requires a garrison of expensive GPUs, what research is left for academics, PhD students, and newcomers to NLP without such deep pockets? Should they focus on the analysis of black-box models and niche topics ignored by LLM practitioners?…In this newsletter, I first argue why the current state of research is not as bleak—rather the opposite! I will then highlight five research directions that are important for the field and do not require much compute.” A new entry at Sebastian Ruder’s Substack explores potential avenues of development for natural language processing that won’t break the bank on compute resources.
  • “The goal of this series is to chronicle opinions and issues in the field of machine learning as they stand today and as they change over time. The plan is to host this survey periodically until the AI singularity paperclip-frenzy-driven doomsday, keeping an updated list of topical questions and interviewing new community members for each edition. In this issue, we probed people’s opinions on interpretable AI, the value of benchmarking in modern NLP, the state of progress towards understanding deep learning, and the future of academia.” A preprint by Goldblum and colleagues, available from arXiv, presents the latest in an occasional series of field surveys on deep learning.
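The pretest-to-posttest probability updating at issue in the Rodman study above is standard Bayesian reasoning with likelihood ratios. As a rough illustration only (the pretest probability, sensitivity, and specificity values below are hypothetical, not figures from the study):

```python
def posttest_probability(pretest: float, sensitivity: float,
                         specificity: float, positive: bool) -> float:
    """Update a pretest disease probability after a test result via Bayes' theorem."""
    if positive:
        lr = sensitivity / (1 - specificity)      # positive likelihood ratio
    else:
        lr = (1 - sensitivity) / specificity      # negative likelihood ratio
    pretest_odds = pretest / (1 - pretest)        # convert probability to odds
    posttest_odds = pretest_odds * lr             # apply likelihood ratio
    return posttest_odds / (1 + posttest_odds)    # convert odds back to probability

# Hypothetical example: 30% pretest probability; test with 90% sensitivity, 80% specificity
print(round(posttest_probability(0.30, 0.90, 0.80, positive=True), 3))   # 0.659
print(round(posttest_probability(0.30, 0.90, 0.80, positive=False), 3))  # 0.051
```

A positive result raises the probability well above the pretest estimate, while a negative result lowers it sharply; the study compared how consistently GPT-4 and clinicians performed this kind of updating across clinical vignettes.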


Glowing, transparent plastic model of a human brain.
Image credit: Lisa Yount/Unsplash
  • “The researchers call the system Brainoware. It uses brain organoids — bundles of tissue-mimicking human cells that are used in research to model organs. Organoids are made from stem cells capable of specializing into different types of cell. In this case, they were morphed into neurons, akin to those found in our brains…. The research aims to build ‘a bridge between AI and organoids’, says study co-author Feng Guo, a bioengineer at Indiana University Bloomington.” Nature’s Lilly Tozer reports on recent research into combining brain-tissue “organoids” with computer components.
  • “For millennia before Europeans colonized what is now called the Pacific Northwest, small, fluffy, white ‘woolly dogs,’ known as sqwemá:y in one language of the Coast Salish peoples, roamed the coast. The animals were unlike any dog living today…. Only one known woolly dog pelt exists today. By analyzing its genes, scientists have now shown just how different these shaggy creatures were from the Yorkshire terriers and Newfoundland dogs that gallivant around modern neighborhoods.” An article in Scientific American by Meghan Bartels spotlights recently published research on the genome of a unique dog breed important to the Coast Salish people of the Pacific Northwest.
  • “The development and implementation of these drugs are forcing important discussions about the way obesity is considered. Old pejorative tropes about obesity being the result of low willpower were hurtful to begin with, but now there is compelling evidence that a biochemical difference, not mental weakness, is responsible for weight gain. Reduced appetite and reduced “food noise” are clear benefits of these drugs.” Science editor Holden Thorp suggests that the shower of as-yet-unanswered questions accompanying new GLP-1 agonist therapies for weight loss is the concomitant of a true breakthrough.
  • “Over the years, researchers have honed variants of LLL to make the approach more practical — but only up to a point. Now, a pair of cryptographers have built a new LLL-style algorithm with a significant boost in efficiency. The new technique, which won the Best Paper award at the 2023 International Cryptology Conference, widens the range of scenarios in which computer scientists and mathematicians can feasibly use LLL-like approaches.” Quanta’s Madison Goldberg reports on recent progress in refining the LLL (Lenstra–Lenstra–Lovász) lattice-reduction algorithm, a cryptographic workhorse, for greater efficiency.

Communication, Health Equity & Policy

Close-up photograph showing a doctor in a white lab coat and stethoscope holding a smartphone in both hands. The doctor’s head is not visible.
Image credit: National Cancer Institute
  • It has been an eventful couple of weeks in news about AI regulation and oversight: First, the Office of the National Coordinator for Health Information Technology held its annual meeting on Thursday and Friday of last week, where one of the highlights was discussion of the recently released HTI-1 Final Rule to Advance Health IT Interoperability and Algorithm Transparency. Shortly thereafter, a bipartisan group of Congresspersons sent a letter to the National Institute of Standards and Technology laying out expectations for the newly created Artificial Intelligence Safety Institute, established by executive order at the end of October 2023.
  • “The main response to the carbon emissions and environmental impacts of compute-hungry and energy-intensive machine learning (ML) has been to improve the efficiency with which ML systems operate. In this article, we present three reasons why efficiency is not enough to address the environmental impacts of ML and propose systems thinking as a way to make further progress toward this goal.” A perspective by Dustin Wright, hosted by the Montreal AI Ethics Institute, summarizes issues related to sustainable AI in the face of growing demands for energy-intensive compute resources.
  • “Effective medical communication requires a large body of information to be distilled to a simple and clear message. Incorporation of evidence-based communication strategies may enhance delivery of medical information by the clinician and improve acceptance by patients.” An article published in JAMA by Cappola and Cohen examines approaches to cultivating better communication about medical and healthcare-related information and issues.
  • “Among the thousands of eHealth tools available, the vast majority do not get past pilot phases because they cannot prove value, and only a few have been systematically assessed. Although multiple eHealth assessment frameworks have been developed, these efforts face multiple challenges. This study aimed to address some of these challenges by validating and refining an initial list of 55 assessment criteria based on previous frameworks through a two-round modified Delphi process with in-between rounds of interviews.” A research article published this month in NPJ Digital Medicine by Jacob and colleagues presents results from a Delphi-based effort to refine a framework for better evaluating patient-facing eHealth tools.