HealthBench: OpenAI’s first major Healthcare AI initiative

Lloyd Price
May 13, 2025

HealthBench is an open-source benchmark developed by OpenAI and released on May 12, 2025, to evaluate the performance and safety of large language models (LLMs) in healthcare contexts. It was created with input from over 250 physicians across 60 countries to ensure clinical relevance. Key features include:

Dataset: 5,000 multi-turn, realistic health conversations covering 26 medical specialties (e.g., cardiology, paediatrics) and 49 languages, including underrepresented ones like Amharic and Nepali.
Evaluation: 48,562 rubric criteria, written by physicians and graded by GPT-4.1, assessing accuracy, context, completeness, and safety in tasks like triage, diagnosis, and patient communication.
Tasks: Scenarios range from handling emergencies (e.g., advising on an unresponsive 70-year-old patient) to routine care (e.g., managing diabetes follow-ups). Responses are scored on correctness and appropriateness (e.g., recommending emergency services or checking airways).
Performance: OpenAI’s o3 model led with a 60% score, followed by xAI’s Grok 3 at 54% and Google’s Gemini 2.5 Pro at 52%. Scores reflect how well each model’s responses satisfy the physician-written rubric criteria.
Availability: Accessible via OpenAI’s GitHub repository for researchers and developers to test and improve LLMs.

HealthBench aims to standardise LLM evaluation in healthcare, addressing gaps in reliability and safety by providing a transparent, physician-validated framework.
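
To make the rubric-based scoring described above more concrete, here is a minimal sketch. It assumes that each rubric criterion carries a point value (positive for desirable behaviour, negative for harmful behaviour), that a grader marks each criterion as met or not met, and that a response's score is the points earned divided by the maximum positive points available, with the benchmark-level mean clipped to the range 0 to 1. The class name, function names, and sample criteria are hypothetical illustrations and are not taken from OpenAI's repository.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One rubric criterion (hypothetical structure, for illustration only)."""
    description: str
    points: int   # positive = desirable behaviour, negative = harmful behaviour
    met: bool     # the grader's judgement for this response


def example_score(criteria: list[Criterion]) -> float:
    """Points earned on met criteria divided by the maximum positive points."""
    max_points = sum(c.points for c in criteria if c.points > 0)
    if max_points == 0:
        raise ValueError("rubric needs at least one positively weighted criterion")
    earned = sum(c.points for c in criteria if c.met)
    return earned / max_points


def benchmark_score(per_example_scores: list[float]) -> float:
    """Mean of per-example scores, clipped to [0, 1] (assumed aggregation)."""
    mean = sum(per_example_scores) / len(per_example_scores)
    return min(1.0, max(0.0, mean))


# Illustrative rubric for the unresponsive-patient scenario mentioned above.
rubric = [
    Criterion("Advises calling emergency services", points=10, met=True),
    Criterion("Mentions checking airway and breathing", points=5, met=False),
    Criterion("Suggests waiting to see if symptoms pass", points=-8, met=False),
]

score = example_score(rubric)  # 10 earned out of 15 possible
```

Under these assumptions, a response that meets a negatively weighted criterion loses points, so an individual response can score below zero; clipping only the aggregate keeps the headline figure between 0% and 100%.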

Potential for Healthcare AI

Healthcare AI, particularly LLMs, has transformative potential, but it also faces challenges. HealthBench highlights both opportunities and areas for improvement. Below is an overview of its potential:

1. Clinical Decision Support

Potential: AI can assist clinicians by providing rapid, evidence-based recommendations for diagnosis, treatment, or triage. For example, HealthBench scenarios show AI suggesting correct emergency protocols (e.g., calling 911 for an unresponsive patient).
Impact: Could reduce diagnostic errors (e.g., misdiagnosis rates, which studies estimate at 5-20% in the U.S.) and support overburdened healthcare systems, especially in low-resource settings.
Challenges: HealthBench scores (highest at 60%) indicate AI still misses nuances in complex cases, risking incorrect advice. Human oversight remains critical.

2. Patient Communication and Education

Potential: AI can deliver tailored health advice in multiple languages, as seen in HealthBench’s 49-language coverage. It could empower patients with clear explanations of conditions or treatment plans.
Impact: Improves health literacy and adherence to treatments, particularly in underserved communities with language barriers.
Challenges: AI must avoid overly technical or culturally insensitive responses, which HealthBench rubrics flag as weaknesses in some models.

3. Triage and Telemedicine

Potential: AI can prioritise urgent cases in telemedicine or emergency settings, guiding patients to appropriate care levels (e.g., ER vs. primary care). HealthBench tests this through scenarios like stroke symptom assessment.
Impact: Enhances efficiency in strained systems (e.g., reducing ER wait times, which average 2-3 hours in U.S. hospitals) and expands access in remote areas.
Challenges: Errors in triage (e.g., underestimating severity) could be life-threatening. HealthBench shows even top models score below 70% in some cases.

4. Administrative Efficiency

Potential: AI can streamline tasks like medical record summarisation, billing, or scheduling, freeing clinicians for patient care. HealthBench indirectly supports this by ensuring AI understands clinical contexts.
Impact: Could save billions annually (e.g., U.S. healthcare spends ~8% of revenue on administration) and reduce physician burnout.
Challenges: Requires integration with existing systems and compliance with regulations like HIPAA, not directly addressed by HealthBench.

5. Global Health Equity

Potential: Multilingual AI, as tested in HealthBench, can serve diverse populations, including in low-income countries with physician shortages (e.g., sub-Saharan Africa has ~0.2 doctors per 1,000 people vs. 2.6 in the U.S.).
Impact: Bridges gaps in care access, offering scalable solutions for basic health queries or preventive care.
Challenges: Limited internet access and cultural differences in healthcare expectations can hinder deployment.

6. Research and Development

Potential: HealthBench enables developers to refine LLMs for healthcare, fostering innovation in specialised models (e.g., for rare diseases or personalised medicine).
Impact: Accelerates AI-driven drug discovery or predictive analytics, potentially cutting development timelines (e.g., AI has reduced drug screening times by ~30% in some studies).
Challenges: Requires robust datasets beyond HealthBench to cover niche areas like genomics or mental health.

Limitations and Risks

Accuracy Gaps: HealthBench shows no model exceeds 60% alignment with physician standards, indicating risks of errors in high-stakes settings.
Bias and Fairness: AI may perpetuate biases in training data (e.g., underrepresenting certain demographics), which HealthBench’s diverse dataset aims to mitigate but doesn’t fully resolve.
Regulation: Healthcare AI must comply with strict standards (e.g., FDA oversight in the U.S.), and HealthBench is a research tool, not a regulatory framework.

Future Directions

Improved Benchmarks: Expanding HealthBench to include more specialties, real-world patient data, or longitudinal care scenarios could enhance its scope.
Hybrid Systems: Combining AI with human oversight (e.g., AI drafts, physicians review) could maximise safety and efficiency.
Ethical Deployment: Partnerships with global health organisations could ensure equitable AI access while addressing privacy and cultural concerns.

HealthBench Abstract

We present HealthBench, an open-source benchmark measuring the performance and safety of large language models in healthcare. HealthBench consists of 5,000 multi-turn conversations between a model and an individual user or healthcare professional.

Responses are evaluated using conversation-specific rubrics created by 262 physicians. Unlike previous multiple-choice or short-answer benchmarks, HealthBench enables realistic, open-ended evaluation through 48,562 unique rubric criteria spanning several health contexts (e.g., emergencies, transforming clinical data, global health) and behavioural dimensions (e.g., accuracy, instruction following, communication).

HealthBench performance over the last two years reflects steady initial progress (compare GPT-3.5 Turbo’s 16% to GPT-4o’s 32%) and more rapid recent improvements (o3 scores 60%). Smaller models have especially improved: GPT-4.1 nano outperforms GPT-4o and is 25 times cheaper. We additionally release two HealthBench variations: HealthBench Consensus, which includes 34 particularly important dimensions of model behaviour validated via physician consensus, and HealthBench Hard, where the current top score is 32%. We hope that HealthBench grounds progress towards model development and applications that benefit human health.

Source: https://cdn.openai.com/pdf/bd7a39d5-9e9f-47b3-903c-8b847ca650c7/healthbench_paper.pdf

Nelson Advisors > Healthcare Technology M&A

Nelson Advisors specialise in mergers, acquisitions and partnerships for Digital Health, HealthTech, Health IT, Consumer HealthTech, Healthcare Cybersecurity, Healthcare AI companies based in the UK, Europe and North America. www.nelsonadvisors.co.uk
