HealthBench: OpenAI’s first major Healthcare AI initiative

  • Writer: Lloyd Price
  • May 13
  • 5 min read


HealthBench is an open-source benchmark developed by OpenAI and released on May 12th, 2025, to evaluate the performance and safety of large language models (LLMs) in healthcare contexts. It was created with input from 262 physicians who have practised in 60 countries to ensure clinical relevance. Key features include:


  • Dataset: 5,000 multi-turn, realistic health conversations covering 26 medical specialties (e.g., cardiology, paediatrics) and 49 languages, including underrepresented ones like Amharic and Nepali.


  • Evaluation: 48,562 unique rubric criteria, written by physicians and graded by GPT-4.1, assess accuracy, context awareness, completeness, and safety in tasks like triage, diagnosis, and patient communication.


  • Tasks: Scenarios range from handling emergencies (e.g., advising on an unresponsive 70-year-old patient) to routine care (e.g., managing diabetes follow-ups). Responses are scored on correctness and appropriateness (e.g., recommending emergency services or checking airways).


  • Performance: OpenAI’s o3 model led with a 60% score, followed by xAI’s Grok 3 at 54% and Google’s Gemini 2.5 Pro at 52%. Scores reflect how fully a model’s responses satisfy the physician-written rubric criteria.


  • Availability: Accessible via OpenAI’s GitHub repository for researchers and developers to test and improve LLMs.


HealthBench aims to standardise LLM evaluation in healthcare, addressing gaps in reliability and safety by providing a transparent, physician-validated framework.
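
The rubric-based scoring described above can be illustrated with a short Python sketch. It assumes that each rubric criterion carries a point value (positive for desirable behaviour, negative for undesirable behaviour), that a grader decides whether each criterion is met, that a response's score is the points earned divided by the maximum positive points, and that the overall score is the mean across conversations clipped to the range 0 to 1. The criteria, point values, and function names below are hypothetical, and the grader's decisions are hard-coded rather than produced by a model.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    description: str  # what the physician-written rubric asks for
    points: int       # positive = desirable, negative = undesirable
    met: bool         # grader's judgement (hard-coded here, not from a model)


def example_score(criteria: list[Criterion]) -> float:
    """Points earned on met criteria divided by the maximum positive points."""
    max_points = sum(c.points for c in criteria if c.points > 0)
    if max_points == 0:
        raise ValueError("rubric needs at least one positively weighted criterion")
    earned = sum(c.points for c in criteria if c.met)
    return earned / max_points


def overall_score(example_scores: list[float]) -> float:
    """Mean of per-conversation scores, clipped to the range [0, 1]."""
    mean = sum(example_scores) / len(example_scores)
    return min(1.0, max(0.0, mean))


# Hypothetical rubric for the unresponsive-patient scenario.
emergency_rubric = [
    Criterion("Advises calling emergency services", 10, True),
    Criterion("Suggests checking airway and breathing", 7, True),
    Criterion("Asks whether a defibrillator is available", 3, False),
    Criterion("Recommends waiting to see if symptoms pass", -8, False),
]

score = example_score(emergency_rubric)
print(f"{score:.2f}")  # prints 0.85 (17 of 20 possible points)
```

In this sketch a response that triggers a negatively weighted criterion loses points, so a single conversation can score below zero; only the averaged result is clipped.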


Potential for Healthcare AI


Healthcare AI, particularly LLMs, has transformative potential, but it also faces challenges. HealthBench highlights both opportunities and areas for improvement. Below is an overview of its potential:


1. Clinical Decision Support


  • Potential: AI can assist clinicians by providing rapid, evidence-based recommendations for diagnosis, treatment, or triage. For example, HealthBench scenarios show AI suggesting correct emergency protocols (e.g., calling 911 for an unresponsive patient).


  • Impact: Could reduce diagnostic errors (e.g., misdiagnosis rates, which studies estimate at 5-20% in the U.S.) and support overburdened healthcare systems, especially in low-resource settings.


  • Challenges: HealthBench scores (highest at 60%) indicate AI still misses nuances in complex cases, risking incorrect advice. Human oversight remains critical.


2. Patient Communication and Education


  • Potential: AI can deliver tailored health advice in multiple languages, as seen in HealthBench’s 49-language coverage. It could empower patients with clear explanations of conditions or treatment plans.


  • Impact: Improves health literacy and adherence to treatments, particularly in underserved communities with language barriers.


  • Challenges: AI must avoid overly technical or culturally insensitive responses, which HealthBench rubrics flag as weaknesses in some models.


3. Triage and Telemedicine


  • Potential: AI can prioritise urgent cases in telemedicine or emergency settings, guiding patients to appropriate care levels (e.g., ER vs. primary care). HealthBench tests this through scenarios like stroke symptom assessment.


  • Impact: Enhances efficiency in strained systems (e.g., reducing ER wait times, which average 2-3 hours in U.S. hospitals) and expands access in remote areas.


  • Challenges: Errors in triage (e.g., underestimating severity) could be life-threatening. HealthBench shows even top models score below 70% in some cases.


4. Administrative Efficiency


  • Potential: AI can streamline tasks like medical record summarisation, billing, or scheduling, freeing clinicians for patient care. HealthBench indirectly supports this by testing whether AI understands clinical contexts.


  • Impact: Could save billions annually (e.g., U.S. healthcare spends ~8% of revenue on administration) and reduce physician burnout.


  • Challenges: Requires integration with existing systems and compliance with regulations like HIPAA, not directly addressed by HealthBench.


5. Global Health Equity


  • Potential: Multilingual AI, as tested in HealthBench, can serve diverse populations, including in low-income countries with physician shortages (e.g., sub-Saharan Africa has ~0.2 doctors per 1,000 people vs. 2.6 in the U.S.).


  • Impact: Bridges gaps in care access, offering scalable solutions for basic health queries or preventive care.


  • Challenges: Limited internet access and cultural differences in healthcare expectations can hinder deployment.


6. Research and Development


  • Potential: HealthBench enables developers to refine LLMs for healthcare, fostering innovation in specialised models (e.g., for rare diseases or personalised medicine).


  • Impact: Accelerates AI-driven drug discovery or predictive analytics, potentially cutting development timelines (e.g., AI has reduced drug screening times by ~30% in some studies).


  • Challenges: Requires robust datasets beyond HealthBench to cover niche areas like genomics or mental health.


Limitations and Risks


  • Accuracy Gaps: HealthBench shows no model exceeds 60% alignment with physician standards, indicating risks of errors in high-stakes settings.


  • Bias and Fairness: AI may perpetuate biases in training data (e.g., underrepresenting certain demographics), which HealthBench’s diverse dataset aims to mitigate but doesn’t fully resolve.


  • Regulation: Healthcare AI must comply with strict standards (e.g., FDA oversight in the U.S.), and HealthBench is a research tool, not a regulatory framework.


Future Directions


  • Improved Benchmarks: Expanding HealthBench to include more specialties, real-world patient data, or longitudinal care scenarios could enhance its scope.


  • Hybrid Systems: Combining AI with human oversight (e.g., AI drafts, physicians review) could maximise safety and efficiency.


  • Ethical Deployment: Partnerships with global health organisations could ensure equitable AI access while addressing privacy and cultural concerns.
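
The hybrid pattern mentioned above (AI drafts, physicians review) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Draft` fields, the `urgent` flag, and the `release` and `review_order` helpers are hypothetical, and the physician's decision is represented by a stand-in function rather than a real review step.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Draft:
    patient_id: str
    text: str       # AI-generated draft reply
    urgent: bool    # hypothetical flag, e.g. set by an upstream triage step


def review_order(drafts: list[Draft]) -> list[Draft]:
    """Put urgent drafts first; sorted() is stable, so ties keep arrival order."""
    return sorted(drafts, key=lambda d: not d.urgent)


def release(draft: Draft, physician_approves: Callable[[Draft], bool]) -> Optional[str]:
    """Return the draft text only if the reviewing physician approves it."""
    return draft.text if physician_approves(draft) else None


queue = review_order([
    Draft("p1", "Your follow-up blood test can be booked next week.", urgent=False),
    Draft("p2", "These symptoms need emergency care now.", urgent=True),
])

# Stand-in for a real review step: this reviewer approves only the urgent draft.
sent = [release(d, lambda d: d.urgent) for d in queue]
```

The design choice here is that nothing reaches a patient without passing through `release`, so the AI draft is never the final word.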



HealthBench Abstract


We present HealthBench, an open-source benchmark measuring the performance and safety of large language models in healthcare. HealthBench consists of 5,000 multi-turn conversations between a model and an individual user or healthcare professional.


Responses are evaluated using conversation-specific rubrics created by 262 physicians. Unlike previous multiple-choice or short-answer benchmarks, HealthBench enables realistic, open-ended evaluation through 48,562 unique rubric criteria spanning several health contexts (e.g., emergencies, transforming clinical data, global health) and behavioural dimensions (e.g., accuracy, instruction following, communication).


HealthBench performance over the last two years reflects steady initial progress (compare GPT-3.5 Turbo’s 16% to GPT-4o’s 32%) and more rapid recent improvements (o3 scores 60%). Smaller models have especially improved: GPT-4.1 nano outperforms GPT-4o and is 25 times cheaper. We additionally release two HealthBench variations: HealthBench Consensus, which includes 34 particularly important dimensions of model behaviour validated via physician consensus, and HealthBench Hard, where the current top score is 32%. We hope that HealthBench grounds progress towards model development and applications that benefit human health.



Nelson Advisors > Healthcare Technology M&A

Nelson Advisors specialise in mergers, acquisitions and partnerships for Digital Health, HealthTech, Health IT, Consumer HealthTech, Healthcare Cybersecurity, Healthcare AI companies based in the UK, Europe and North America. www.nelsonadvisors.co.uk

 

Nelson Advisors regularly publish Healthcare Technology thought leadership articles covering market insights, trends, analysis & predictions @ https://www.healthcare.digital 

 
