OpenAI Releases HealthBench Dataset to Test AI in Health Care

By I. Edwards HealthDay Reporter

TUESDAY, May 13, 2025 (HealthDay News) — OpenAI has unveiled a large dataset to help test how well artificial intelligence (AI) models answer health care questions.

Experts call it a major step forward, but they also say more work is needed to ensure safety.

The dataset — called HealthBench — is OpenAI’s first major independent health care project. It includes 5,000 “realistic health conversations,” each with detailed grading tools to evaluate AI responses, STAT News reported.

“Our mission as OpenAI is to ensure AGI is beneficial to humanity,” Karan Singhal, head of the San Francisco-based company’s health AI team, said. AGI is shorthand for artificial general intelligence.

“One part of that is building and deploying technology,” Singhal said. “Another part of it is ensuring that positive applications like health care have a place to flourish and that we do the right work to ensure that the models are safe and reliable in these settings.”

The dataset was created with help from 262 doctors who have worked in 60 countries. They provided more than 57,000 unique criteria to judge how well AI models answer health questions.

HealthBench aims to fix a common problem: comparing different AI models fairly.

“What OpenAI has done is they have provided this in a scalable way from a really big, reputable brand that’s going to enable people to use this very easily,” Raj Ratwani, a health AI researcher at MedStar Health, said.

The 5,000 examples in HealthBench were created from synthetic conversations designed by physicians, rather than real patient records.

“We wanted to balance the benefits of being able to release the data with, of course, the privacy constraints of using realistic data,” Singhal told STAT News.

The dataset also includes a special group of 1,000 hard examples where AI models struggled. OpenAI hopes this group “provides a worthy target for model improvements for months to come,” STAT News reported.

OpenAI also tested its own models as well as models from Google, Meta, Anthropic and xAI. OpenAI’s o3 model scored the best, especially in communication quality, STAT News reported.

But models performed poorly in areas like context awareness and completeness, experts said.

Some warned about the conflict of interest in OpenAI grading its own models.

“In sensitive contexts like health care, where we are discussing life and death, that level of opacity is unacceptable,” Hao explained.

Others noted that AI itself was used to grade some of the responses, which could result in errors being overlooked.

It “may hide errors shared by both model and grader,” Girish Nadkarni, head of artificial intelligence and human health at the Icahn School of Medicine at Mount Sinai in New York City, told STAT News.

He and others called for more reviews to ensure models work well in different countries and among different demographics.

“HealthBench improves large language model health care evaluation but still needs subgroup analysis and wider human review before it can support safety claims,” Nadkarni said.

More information

The National Institutes of Health has more on artificial intelligence in health care.

SOURCE: STAT News, May 12, 2025


Copyright © 2025 HealthDay. All rights reserved.