Large Language Models have shown strong performance in medical question answering, but their capabilities in complex clinical reasoning remain difficult to characterize systematically. We present MedEvalArena, a dynamic evaluation framework designed to compare medical reasoning robustness across models using a symmetric, round-robin protocol.
In MedEvalArena, each model generates medical questions intended to challenge the medical reasoning abilities of the other models. Question validity is assessed with an LLM-as-judge paradigm along two axes: logical correctness and medical accuracy. Questions judged valid by a majority of the LLM-as-judge ensemble "pass" and advance to the exam stage, in which every LLM answers all valid questions. The top 6 model families on https://artificialanalysis.ai/ (model cutoff Nov 15, 2025) serve both as question generators and as members of the LLM-as-judge ensemble. Each generator produces questions until it reaches a quota of 50 valid questions, for a total of 300 questions.
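The generation and judging stages can be viewed as a simple rejection-sampling loop. Below is a minimal sketch of that loop, assuming each generator and judge is wrapped as a Python callable; the function names and the judges' return format are illustrative assumptions, not the released MedEvalArena pipeline.

```python
# Minimal sketch of the generation/validation loop described above.
# The generator/judge callables and their return shapes are
# illustrative assumptions, not the released MedEvalArena code.
from typing import Callable, Dict, List

VALID_QUOTA = 50  # valid questions required per generator model


def passes_majority(question: str,
                    judges: List[Callable[[str], Dict[str, bool]]]) -> bool:
    """A question passes if a majority of judges accept it on both axes."""
    votes = [judge(question) for judge in judges]  # e.g. {"logical": True, "medical": True}
    accepted = sum(v["logical"] and v["medical"] for v in votes)
    return accepted > len(judges) / 2


def build_exam(generators: List[Callable[[], str]],
               judges: List[Callable[[str], Dict[str, bool]]]) -> List[str]:
    """Collect VALID_QUOTA peer-validated questions from every generator."""
    exam: List[str] = []
    for generate in generators:
        kept = 0
        while kept < VALID_QUOTA:
            question = generate()  # model drafts a challenging question
            if passes_majority(question, judges):
                exam.append(question)
                kept += 1
    # 6 generator families x 50 valid questions = 300 exam questions
    return exam
```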
MedEvalArena provides a dynamic and scalable framework for benchmarking LLM medical reasoning.
Read more analyses here: https://www.medrxiv.org/content/10.64898/2026.01.27.26344905v1
Up to the top 10 models by accuracy are shown. Each evaluation contains 300 questions (50 valid questions generated per LLM).
Default sort is by Mean Accuracy (descending). The top-3 entries are marked with 🥇🥈🥉. Validity refers to the pass rate of a model's generated questions.
| # | Model | Mean Accuracy | SEM | Validity |
|---|---|---|---|---|
Tip: click a column header to sort.
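For reference, the leaderboard statistics can be reproduced as in the sketch below, assuming per-question 0/1 correctness and a binomial standard error of the mean (SEM); the exact computation in the preprint may differ.

```python
# Sketch of the leaderboard statistics: mean accuracy over the 300
# exam questions and the standard error of the mean (SEM).
# Assumes per-question 0/1 scores; the preprint's exact SEM may differ.
import math
from typing import List


def leaderboard_row(correct: List[int]) -> dict:
    """correct: one 0/1 entry per exam question (300 in total)."""
    n = len(correct)
    mean_acc = sum(correct) / n
    sem = math.sqrt(mean_acc * (1 - mean_acc) / n)  # binomial SEM
    return {"Mean Accuracy": round(mean_acc, 3), "SEM": round(sem, 3)}


# Example: a model answering 255 of 300 questions correctly
print(leaderboard_row([1] * 255 + [0] * 45))
# {'Mean Accuracy': 0.85, 'SEM': 0.021}
```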
For questions, please open a GitHub issue on the repository.
Prem, P., Shidara, K., Kuppa, V., Wheeler, E., Liu, F., Alaa, A., & Bernardo, D. (2026). MedEvalArena: A Self-Generated, Peer-Judged Benchmark for Medical Reasoning. medRxiv. https://doi.org/10.64898/2026.01.27.26344905
@article{prem2026medevalarena,
title = {MedEvalArena: A Self-Generated, Peer-Judged Benchmark for Medical Reasoning},
author = {Prem, Preethi and Shidara, Kie and Kuppa, Vikasini and Wheeler, Esm{\'e} and Liu, Feng and Alaa, Ahmed and Bernardo, Danilo},
journal = {medRxiv},
year = {2026},
date = {2026-01-27},
doi = {10.64898/2026.01.27.26344905},
url = {https://doi.org/10.64898/2026.01.27.26344905},
note = {Preprint}
}