General-purpose large language models outperform specialized clinical AI tools on medical benchmarks (paper, June 12, 2026)

https://www.nature.com/articles/s41591-026-04431-5

ChatGPT (5.5, paid):

Summary

This Nature Medicine brief communication compares two specialist clinical AI tools, OpenEvidence and UpToDate Expert AI, with three frontier general-purpose LLMs: GPT-5.2, Gemini 3.1 Pro and Claude Opus 4.6. The central question is whether medical-domain AI products actually outperform the best general-purpose models in clinical tasks.

The study uses three evaluation stages:

  1. MedQA: 500 USMLE-style multiple-choice medical questions.
  2. HealthBench: 500 open-response health prompts, graded by a panel of LLM judges.
  3. Real Clinical Queries (RCQ) benchmark: 100 de-identified queries from physicians using an NYU Langone HIPAA-compliant GPT system in routine clinical care. These were scored by 12 blinded US clinicians, generating 1,800 clinician annotations.

The main result is straightforward: general-purpose frontier LLMs outperformed the specialist clinical AI tools across all three tests.

On MedQA, Gemini was highest at 97.4%, followed by GPT-5.2 at 94.2%, Claude at 90.2%, OpenEvidence at 89.6%, and UpToDate Expert AI at 88.4%.
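
To gauge how much weight the smaller MedQA gaps can bear, the reported accuracies can be converted into approximate 95% intervals. The sketch below is not from the paper: it assumes each percentage is a simple proportion of the 500 questions and uses the Wilson score interval, and the function and variable names are illustrative. Because all systems answered the same questions, a paired test such as McNemar's would be the more appropriate formal comparison, but it needs per-question results that this summary does not include.

```python
import math

N_QUESTIONS = 500  # MedQA subset size reported in the summary

# Accuracies as reported in the summary above
medqa_accuracy = {
    "Gemini 3.1 Pro": 0.974,
    "GPT-5.2": 0.942,
    "Claude Opus 4.6": 0.902,
    "OpenEvidence": 0.896,
    "UpToDate Expert AI": 0.884,
}


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half


def medqa_intervals():
    """Map each system to its (lower, upper) interval.

    Assumption: the number of correct answers is accuracy * 500, rounded.
    """
    return {
        name: wilson_interval(round(acc * N_QUESTIONS), N_QUESTIONS)
        for name, acc in medqa_accuracy.items()
    }


def overlaps(a, b):
    """True if two (lower, upper) intervals share any point."""
    return a[0] <= b[1] and b[0] <= a[1]


if __name__ == "__main__":
    for name, (lo, hi) in medqa_intervals().items():
        print(f"{name}: {lo:.3f} to {hi:.3f}")
```

Under these assumptions, the intervals for Claude and OpenEvidence overlap, whereas Gemini's interval lies above those of both clinical tools. The ordering at the top of the table is therefore better supported than the small gap between Claude and the specialist products.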

On HealthBench, GPT-5.2 was highest at 88.0, Gemini scored 79.3, Claude 77.0, while OpenEvidence and UpToDate Expert AI were substantially lower at 62.6 and 61.3.

On the real clinical query benchmark, the frontier models again formed the top tier: Gemini 3.62, GPT-5.2 3.54, Claude 3.52 on a 1–4 scale. OpenEvidence, UpToDate Expert AI and Google AI Overview formed a lower tier, scoring 3.24, 3.17 and 3.27 respectively. The notable point is that Google AI Overview performed comparably to the specialist clinical tools on these clinician-rated real-world queries.
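
The size of the gap between the two tiers can be made concrete with simple arithmetic on the reported means. This is an illustrative calculation only, not an analysis from the paper; averaging the per-system means within a tier is an assumption made here, and the variable names are hypothetical.

```python
from statistics import mean

# Mean clinician ratings (1 to 4 scale) as reported in the summary above
frontier_tier = {
    "Gemini 3.1 Pro": 3.62,
    "GPT-5.2": 3.54,
    "Claude Opus 4.6": 3.52,
}
lower_tier = {
    "OpenEvidence": 3.24,
    "UpToDate Expert AI": 3.17,
    "Google AI Overview": 3.27,
}

# Unweighted average of the per-system means in each tier (an assumption)
frontier_mean = mean(frontier_tier.values())
lower_mean = mean(lower_tier.values())

# Absolute gap, and the gap as a share of the 3-point span of the scale
tier_gap = frontier_mean - lower_mean
gap_share_of_scale = tier_gap / (4 - 1)

if __name__ == "__main__":
    print(f"frontier tier mean: {frontier_mean:.2f}")
    print(f"lower tier mean:    {lower_mean:.2f}")
    print(f"gap:                {tier_gap:.2f}")
    print(f"share of scale:     {gap_share_of_scale:.1%}")
```

On these figures the tiers sit roughly a third of a point apart on the 1 to 4 scale, while the spread within each tier is about a tenth of a point.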

The paper argues that specialist medical AI products may not currently justify claims of superior clinical performance simply because they use medical retrieval, curated sources or domain-specific interfaces. The authors suggest that scale, general reasoning ability, alignment and rapid iteration may currently matter more than medical branding or domain-specific retrieval.

Novelty

The main novelty is not that LLMs can answer medical questions. That is already well established. The novelty lies in the direct comparison between commercial clinical AI tools and frontier general-purpose models, especially using real clinician queries.

The strongest novel elements are:

First, the inclusion of proprietary clinical tools. OpenEvidence and UpToDate Expert AI are clinically marketed systems, but because their architectures and training methods are not transparent, independent evaluation is difficult. This paper directly benchmarks them against frontier models.

Second, the real clinical query benchmark is more useful than another exam-style benchmark alone. The 100 queries came from physicians using a live clinical LLM deployment, rather than from standard test sets. That makes it more relevant to how clinicians actually use these tools.

Third, the blinded clinician review is important. Twelve clinicians scored outputs without knowing which model produced them, across clinical correctness, completeness, safety/harm avoidance and clarity. This reduces some of the bias inherent in automated scoring.

Fourth, the finding that Google AI Overview matched the specialist clinical tools is provocative. It suggests that at least for these kinds of routine physician queries, some commercial clinical AI products may not be offering much beyond what general web-integrated AI already provides.

Critique

The paper is useful and timely, but its claims need careful interpretation.

The biggest strength is that the authors do not rely only on public benchmarks. MedQA and HealthBench are helpful, but both are vulnerable to contamination, benchmark overfitting and artificial task structure. The RCQ benchmark is the most persuasive part of the paper because it uses real physician queries and blinded human clinical review.

However, the study has several important limitations.

First, the sample size for real clinical queries is small. One hundred queries is enough to detect a performance signal, but not enough to conclude how these tools perform across all of medicine. The paper does not prove that frontier LLMs are superior in every specialty, rare-disease scenario, drug-interaction task, guideline-driven decision or institutional workflow.

Second, the clinical tools were queried through browser interfaces, while the frontier models were accessed via API. That may reflect real-world use, but it also introduces confounding. Browser-based tools may have hidden prompts, retrieval constraints, refusal policies, citation formatting and output-length limits. Some of the measured difference may be due to interface design rather than underlying model ability.

Third, HealthBench is an OpenAI-developed benchmark, and GPT-5.2 performed best on it. The authors acknowledge that benchmark–developer overlap could favour OpenAI models. Even with a multi-model judging panel, LLM-as-judge evaluation remains less trustworthy than blinded clinician assessment.

Fourth, the RCQ benchmark is not publicly available because it came from a clinical environment. That is understandable for privacy reasons, but it limits reproducibility. Other groups cannot directly test whether the same result holds across other hospitals, countries, specialties or query types.

Fifth, the study evaluates answer quality, but not all practical deployment issues. The authors explicitly did not assess citation quality, latency or workflow integration. For clinical use, those may matter greatly. A slightly less capable model that gives traceable, guideline-linked, auditable answers could sometimes be preferable to a stronger general model with weaker provenance.

Sixth, the safety results should not be overread. Harmful responses and hallucinations were rare and did not differ significantly between models, but the study is probably underpowered for low-frequency high-impact clinical risks. A 100-query test cannot establish that any of these tools are clinically safe.
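
The underpowering point can be illustrated with the standard "rule of three" for zero observed events. The sketch below is a hypothetical best case, not a result from the paper: it assumes a tool produced no harmful responses at all across 100 independent queries, whereas the paper reports only that such responses were rare. Function names are illustrative.

```python
import math


def rule_of_three(n):
    """Approximate 95% upper bound on an event rate after 0 events in n trials."""
    return 3 / n


def exact_upper_bound(n, confidence=0.95):
    """Exact one-sided upper bound: the rate p at which (1 - p)**n = 1 - confidence."""
    return 1 - (1 - confidence) ** (1 / n)


def queries_needed(max_rate, confidence=0.95):
    """Smallest n of event-free trials that bounds the rate below max_rate."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - max_rate))


if __name__ == "__main__":
    n = 100  # size of the real clinical query benchmark
    print(f"rule of three:  {rule_of_three(n):.2%}")
    print(f"exact bound:    {exact_upper_bound(n):.2%}")
    print(f"queries needed for a 1-in-1,000 bound: {queries_needed(0.001)}")
```

Even in this best case the bound is just under 3%, so a harm rate of a few percent could not be ruled out. Pushing the bound below 1 in 1,000 would take roughly 3,000 event-free queries per tool.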

Seventh, the paper’s conclusion may be time-sensitive. These systems change rapidly. A comparison using GPT-5.2, Gemini 3.1 Pro, Claude Opus 4.6, OpenEvidence and UpToDate Expert AI in early 2026 may not hold after model upgrades, changes in retrieval systems or new clinical fine-tuning.

Bottom line

This is a strong, useful warning against assuming that “clinical AI” branding means superior clinical performance. In this study, frontier general-purpose LLMs beat specialist clinical AI tools on medical knowledge, expert-aligned benchmarks and blinded clinician ratings of real physician queries. The most important implication concerns procurement and governance: health systems should demand independent, real-world evaluation rather than relying on vendor claims.

The main caveat is that the study is a snapshot of a fast-moving field, based on a modest private real-query benchmark. It supports the claim that current specialist clinical AI tools are not automatically better than frontier LLMs, but it does not prove that general-purpose LLMs are always the best clinical solution.