Harvard Study 2026: OpenAI's o1-preview Outperforms ER Doctors on Diagnoses
May 3, 2026
A study published in Science on April 30, 2026, by researchers at Harvard Medical School and Beth Israel Deaconess Medical Center found that OpenAI's reasoning model o1-preview matches or beats experienced physicians on real ER diagnoses.
Reasoning AI in the Emergency Room: o1-preview in a Real-World Test
The debate about AI in medicine got a new data point on April 30, 2026. Researchers at Harvard Medical School and Beth Israel Deaconess Medical Center published results in Science in which OpenAI's first reasoning model, o1-preview, was tested against experienced emergency physicians. The work was led by Raj Manrai at Harvard Medical School, with collaborators at Stanford. The headline finding: on real diagnostic and triage tasks, the model matched or outperformed two experienced physicians.
What Was Actually Tested
Three tasks were evaluated: making diagnoses from electronic health records, recommending appropriate follow-up tests, and routine case-management decisions in the ER. Importantly, o1-preview worked from text only. Images, sounds, and the nonverbal signals doctors routinely use in real time were not available to the model. Even so, it outperformed a large baseline panel of real clinicians on the diagnostic task.
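To make the comparison concrete, here is a minimal sketch of how a text-only diagnostic evaluation might be scored against retrospective cases. The case fields, the stub model, and the top-k hit-rate rule are illustrative assumptions, not the study's actual protocol.

```python
# Illustrative sketch: scoring a text-only model on retrospective ER cases.
# The data layout and top-k scoring rule are assumptions for illustration,
# not the published study's methodology.

def score_diagnoses(cases, get_model_differential, k=5):
    """Fraction of cases where the confirmed diagnosis appears in the
    model's top-k differential, derived from the text-only record."""
    hits = 0
    for case in cases:
        differential = get_model_differential(case["ehr_text"])[:k]
        if case["final_diagnosis"].lower() in (d.lower() for d in differential):
            hits += 1
    return hits / len(cases)

# Toy usage with a stubbed "model" that returns a fixed differential list.
cases = [
    {"ehr_text": "45M, chest pain radiating to left arm, ST elevation",
     "final_diagnosis": "Myocardial infarction"},
    {"ehr_text": "22F, right lower quadrant pain, fever, elevated WBC",
     "final_diagnosis": "Appendicitis"},
]

def stub_model(text):
    return ["Myocardial infarction", "Appendicitis", "Pericarditis"]

print(score_diagnoses(cases, stub_model))  # both cases hit -> 1.0
```

The same hit-rate metric can be computed for a physician baseline panel, which is what makes a head-to-head comparison possible.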
Methodological Limits the Researchers Emphasize
The authors are clear that the study is not evidence that AI replaces doctors. Patient interaction, medical responsibility, and legal liability cannot be offloaded into a reasoning loop. The model also operated in a controlled setting with structured EHR data — not the same reality as an overloaded ER at 3 a.m. The claim is narrower: a modern reasoning LLM can, under these conditions, raise the average quality of clinical decisions.
How This Compares to Earlier Studies
Earlier comparisons of LLMs to doctors often relied on isolated textbook cases or board-exam multiple-choice questions. The Harvard study matters because it uses real, retrospectively curated ER data and is therefore much closer to clinical practice than most prior benchmarks.
Why This Matters
For hospitals across the DACH region, a result like this changes how they look at AI as decision support. The question is no longer "can an LLM make diagnoses?" but "how do we integrate it so that liability, data protection, and care quality hold up?". Topics like audit logs for AI suggestions, specialty-specific clinical validation, and integration with existing EHR systems move to the top of the roadmap. Pressure on European regulators to fit applications like this cleanly under MDR and the EU AI Act will only grow.
Practical Example
A university hospital in Zurich is piloting a triage support system in its central emergency department. At check-in, the system reads structured history, vitals, and prior diagnoses from the hospital information system and proposes a triage level and a differential cluster. Recommendations never act on their own — physicians confirm or reject every suggestion, and each decision is logged. A Harvard-style validation — retrospective, six months — would provide the proof that the AI triage agrees with or improves on physician judgement, without touching the doctor's final responsibility.
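The confirm-or-reject loop described above can be sketched in a few lines: the AI proposes a triage level, the physician decides, and every decision is written to an append-only audit log. All names, fields, and the log format here are hypothetical illustrations, not a real hospital system's API.

```python
# Hypothetical sketch of physician-in-the-loop triage with audit logging.
# Identifiers, triage scale, and log schema are illustrative assumptions.
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class TriageSuggestion:
    patient_id: str
    suggested_level: int       # e.g. 1 (immediate) .. 5 (non-urgent)
    differential: list         # AI-proposed differential cluster

def record_decision(log, suggestion, physician_id, accepted, final_level):
    """Append one immutable audit entry; the physician's decision,
    not the model's suggestion, is the authoritative value."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "suggestion": asdict(suggestion),
        "physician_id": physician_id,
        "accepted": accepted,
        "final_triage_level": final_level,
    }
    log.append(json.dumps(entry))  # serialized, append-only
    return entry

log = []
s = TriageSuggestion("P-0001", 2, ["ACS", "pulmonary embolism"])
e = record_decision(log, s, "dr-77", accepted=False, final_level=1)
print(e["final_triage_level"])  # 1 -- the physician's call overrides the model
```

Keeping the log append-only and recording both the suggestion and the final decision is what makes a Harvard-style retrospective validation possible later: agreement rates can be computed directly from the stored entries.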
💡 In plain English
Harvard researchers had a reasoning AI from OpenAI compete against real emergency doctors. The computer had to figure out from patient records what was wrong with people. In the test it was as good as, or even better than, the doctors — though it only had to read text and never had to talk to or examine patients itself.
Key Takeaways
- A Harvard Medical School and Beth Israel Deaconess study was published in Science on April 30, 2026.
- The test pitted OpenAI's reasoning model o1-preview against experienced emergency physicians.
- The model matched or surpassed human doctors on diagnoses, test recommendations, and case management.
- Key limit: o1-preview worked from text only, without images, sounds, or nonverbal cues.
- Lead author Raj Manrai stresses that the findings do not mean AI replaces doctors.
FAQ
Which AI model was tested in the study?
OpenAI's o1-preview, the company's first reasoning model with explicit step-by-step thinking.
Who ran the study?
Researchers at Harvard Medical School and Beth Israel Deaconess Medical Center, with collaborators at Stanford. The lead author is Raj Manrai.
Does this mean AI replaces doctors?
No. The authors explicitly state the result is not a replacement for medical responsibility but evidence for strong decision support.
Where was the study published?
In the journal Science, in late April 2026.
Sources & Context
- An AI model beat doctors at diagnosing patients, in a new study – NPR
- AI Outperforms Doctors in Emergency Room Tasks, New Harvard Study Shows – Harvard Magazine
- In real-world test, an AI model did better than ER doctors at diagnosing patients – WCBE/NPR
- Health AI in 2026: CU Researchers are Implementing Trustworthy Tools to Support Clinicians – CU Anschutz