
In Harvard study, AI offered more accurate emergency room diagnoses than two human doctors

A new study examines how large language models perform in different medical contexts, including real emergency room cases – where at least one model appeared to be more accurate than human doctors.

The study was published this week in Science and comes from a research team led by physicians and computer scientists from Harvard Medical School and Beth Israel Deaconess Medical Center. The researchers said they conducted several experiments to measure how OpenAI’s models compared to human doctors.

In one experiment, researchers focused on 76 patients who entered Beth Israel’s emergency room, comparing the diagnoses of two internal medicine physicians with diagnoses generated by OpenAI’s o1 and 4o models. These diagnoses were then reviewed by two other physicians, who did not know which came from humans and which from AI.

“At each diagnostic touchpoint, o1 performed nominally better than or comparable to the two treating physicians and 4o,” the study said, adding that the differences were “particularly pronounced at the first diagnostic touchpoint (initial ED triage), where there is the least information available about the patient and the most urgency to make the right decision.”

In a Harvard Medical School press release about the study, the researchers emphasized that they “didn’t preprocess the data at all” – the AI models were presented with the same information that was available in the electronic health records at the time of each diagnosis.

With that information, the o1 model provided “the exact or very accurate diagnosis” in 67% of triage cases, compared with one physician who did so 55% of the time and another who did so 50% of the time.


“We tested the AI model against virtually every benchmark, and it eclipsed both previous models and our physicians’ baselines,” said Arjun Manrai, head of an AI lab at Harvard Medical School and one of the study’s lead authors, in the press release.


To be clear, the study did not claim that AI is ready to make real life-or-death decisions in the emergency room. Instead, the researchers wrote that there is an “urgent need for prospective studies to evaluate these technologies in real-world patient care practice.”

The researchers also noted that they only studied how the models performed when given text-based information, and that “existing studies suggest that current foundation models are more limited in reasoning about non-textual input.”

Adam Rodman, a Beth Israel physician who is also one of the study’s lead authors, warned the Guardian that there is “currently no formal framework for accountability” around AI diagnoses, and that patients still “want humans to guide them through life-or-death decisions [and] to guide them through challenging treatment decisions.”

In a post about the research, Kristen Panthagani, an emergency physician, said this is an “interesting AI study that has led to some very overhyped headlines,” especially because it compared AI diagnoses to those of internal medicine doctors, not ER doctors.

“If we’re going to compare AI tools to physicians’ clinical skills, we need to start comparing them to physicians who actually practice that specialty,” Panthagani said. “I wouldn’t be surprised if an LLM could beat a dermatologist on a neurosurgery exam, [but] that’s not particularly useful to know.”


She also argued, “As an ER doctor seeing a patient for the first time, my main goal is not to guess your final diagnosis. My main goal is to determine if you have a condition that could kill you.”

This post and headline have been updated to reflect the fact that the diagnoses in the study came from internal medicine physicians, and to include comments from Kristen Panthagani.


