Harvard Study 2026: OpenAI's o1-preview Outperforms ER Doctors on Diagnoses
May 3, 2026
A study published in Science on April 30, 2026, by researchers at Harvard Medical School and Beth Israel Deaconess Medical Center found that OpenAI's reasoning model o1-preview matches or beats experienced physicians on real ER diagnoses.
Reasoning AI in the Emergency Room: o1-preview in a Real-World Test
The debate about AI in medicine got a new data point on April 30, 2026. Researchers at Harvard Medical School and Beth Israel Deaconess Medical Center published results in Science in which OpenAI's first reasoning model, o1-preview, was tested against experienced emergency physicians. The work was led by Raj Manrai at Harvard Medical School, with collaborators at Stanford. The headline finding: on real diagnostic and triage tasks, the model matched or outperformed two experienced physicians.
What Was Actually Tested
Three tasks were evaluated: making diagnoses from electronic health records, recommending appropriate follow-up tests, and routine case-management decisions in the ER. Importantly, o1-preview worked from text only. Images, sounds, and the nonverbal signals doctors routinely use in real time were not available to the model. Even so, it outperformed a large baseline panel of real clinicians on the diagnostic task.
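To make the comparison concrete, here is a minimal sketch of how a text-only diagnostic evaluation might be scored against retrospective cases. The case fields, the stub model, and the top-k hit-rate rule are illustrative assumptions, not the study's actual protocol.

```python
# Illustrative sketch: scoring a text-only model on retrospective ER cases.
# The data layout and top-k scoring rule are assumptions for illustration,
# not the published study's methodology.

def score_diagnoses(cases, get_model_differential, k=5):
    """Fraction of cases where the confirmed diagnosis appears in the
    model's top-k differential, derived from the text-only record."""
    hits = 0
    for case in cases:
        differential = get_model_differential(case["ehr_text"])[:k]
        if case["final_diagnosis"].lower() in (d.lower() for d in differential):
            hits += 1
    return hits / len(cases)

# Toy usage with a stubbed "model" that returns a fixed differential list.
cases = [
    {"ehr_text": "45M, chest pain radiating to left arm, ST elevation",
     "final_diagnosis": "Myocardial infarction"},
    {"ehr_text": "22F, right lower quadrant pain, fever, elevated WBC",
     "final_diagnosis": "Appendicitis"},
]

def stub_model(text):
    return ["Myocardial infarction", "Appendicitis", "Pericarditis"]

print(score_diagnoses(cases, stub_model))  # both cases hit -> 1.0
```

The same hit-rate metric can be computed for a physician baseline panel, which is what makes a head-to-head comparison possible.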
Methodological Limits the Researchers Emphasize
The authors are clear that the study is not evidence that AI replaces doctors. Patient interaction, medical responsibility, and legal liability cannot be offloaded into a reasoning loop. The model also operated in a controlled setting with structured EHR data — not the same reality as an overloaded ER at 3 a.m. The claim is narrower: a modern reasoning LLM can, under these conditions, raise the average quality of clinical decisions.
How This Compares to Earlier Studies
Earlier comparisons of LLMs to doctors often relied on isolated textbook cases or board-exam multiple-choice questions. The Harvard study matters because it uses real, retrospectively curated ER data and is therefore much closer to clinical practice than most prior benchmarks.
Why This Matters
For hospitals across the DACH region, a result like this changes how they look at AI as decision support. The question is no longer "can an LLM make diagnoses?" but "how do we integrate it so that liability, data protection, and care quality hold up?". Topics like audit logs for AI suggestions, specialty-specific clinical validation, and integration with existing EHR systems move to the top of the roadmap. Pressure on European regulators to fit applications like this cleanly under MDR and the EU AI Act will only grow.
Practical Example
A university hospital in Zurich is piloting a triage support system in its central emergency department. At check-in, the system reads structured history, vitals, and prior diagnoses from the hospital information system and proposes a triage level and a differential cluster. Recommendations never act on their own — physicians confirm or reject every suggestion, and each decision is logged. A Harvard-style validation — retrospective, six months — would provide the proof that the AI triage agrees with or improves on physician judgement, without touching the doctor's final responsibility.
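The confirm-or-reject loop described above can be sketched in a few lines: the AI proposes a triage level, the physician decides, and every decision is written to an append-only audit log. All names, fields, and the log format here are hypothetical illustrations, not a real hospital system's API.

```python
# Hypothetical sketch of physician-in-the-loop triage with audit logging.
# Identifiers, triage scale, and log schema are illustrative assumptions.
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class TriageSuggestion:
    patient_id: str
    suggested_level: int       # e.g. 1 (immediate) .. 5 (non-urgent)
    differential: list         # AI-proposed differential cluster

def record_decision(log, suggestion, physician_id, accepted, final_level):
    """Append one immutable audit entry; the physician's decision,
    not the model's suggestion, is the authoritative value."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "suggestion": asdict(suggestion),
        "physician_id": physician_id,
        "accepted": accepted,
        "final_triage_level": final_level,
    }
    log.append(json.dumps(entry))  # serialized, append-only
    return entry

log = []
s = TriageSuggestion("P-0001", 2, ["ACS", "pulmonary embolism"])
e = record_decision(log, s, "dr-77", accepted=False, final_level=1)
print(e["final_triage_level"])  # 1 -- the physician's call overrides the model
```

Keeping the log append-only and recording both the suggestion and the final decision is what makes a Harvard-style retrospective validation possible later: agreement rates can be computed directly from the stored entries.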
💡 In plain English
Harvard researchers had a reasoning AI from OpenAI compete against real emergency doctors. The computer had to figure out from patient records what was wrong with people. In the test it was as good as, or even better than, the doctors — though it only had to read text and never had to talk to or examine patients itself.
Key Takeaways
- A Harvard Medical School and Beth Israel Deaconess study was published in Science on April 30, 2026.
- The test pitted OpenAI's reasoning model o1-preview against experienced emergency physicians.
- The model matched or surpassed human doctors on diagnoses, test recommendations, and case management.
- Key limit: o1-preview worked from text only, without images, sounds, or nonverbal cues.
- Lead author Raj Manrai stresses that the findings do not mean AI replaces doctors.
FAQ
Which AI model was tested in the study?
OpenAI's o1-preview, the company's first reasoning model with explicit step-by-step thinking.
Who ran the study?
Researchers at Harvard Medical School and Beth Israel Deaconess Medical Center, with collaborators at Stanford. The lead author is Raj Manrai.
Does this mean AI replaces doctors?
No. The authors explicitly state the result is not a replacement for medical responsibility but evidence for strong decision support.
Where was the study published?
In the journal Science, in late April 2026.
Sources & Context
- An AI model beat doctors at diagnosing patients, in a new study – NPR
- AI Outperforms Doctors in Emergency Room Tasks, New Harvard Study Shows – Harvard Magazine
- In real-world test, an AI model did better than ER doctors at diagnosing patients – WCBE/NPR
- Health AI in 2026: CU Researchers are Implementing Trustworthy Tools to Support Clinicians – CU Anschutz