Rao and colleagues evaluated the clinical reasoning performance of 21 off-the-shelf LLMs across 29 standardized clinical ...
Models were less accurate at differential diagnosis than at diagnostic testing, and more accurate at final diagnosis and management.
The "Data Lineage for Large Language Model (LLM) Training Market Report 2026" has been added to ResearchAndMarkets.com's ...
Stanford's 2026 AI Index: frontier models fail one in three attempts, lab transparency is declining, and benchmarks are ...
New research finds that forcing Large Language Models to give shorter answers notably improves the accuracy and quality of their responses. Anyone who has tried to stop a chatbot from ‘rambling’ will ...
Background/aims Ocular surface infections remain a major cause of visual loss worldwide, yet diagnosis often relies on slow ...
The final round of AI Madness 2026 is here. We pitted ChatGPT against Claude in 7 brutal, real-world benchmarks — from senior-level Python refactoring to psychological mediation.
This study introduces MathEval, a comprehensive benchmarking framework designed to systematically evaluate the mathematical reasoning capabilities of large language models (LLMs). Addressing key ...
Researchers tested 21 frontier large language models on 29 stepwise MSD Manual clinical vignettes and found that, although many models performed well on final diagnosis, they remained much weaker at ...
They call it the "mirage effect": Frontier AI Models Are Doing Something Absolutely Bizarre When Asked to Diagnose ...