πŸ₯ MedEvalArena

Preethi Prem1, Kie Shidara2, Vikasini Kuppa3, Feng Liu4, Ahmed Alaa5, Danilo Bernardo2*
1Carle Illinois College of Medicine, University of Illinois Urbana-Champaign, Urbana, IL, 2Weill Institute of Neurology and Neurosciences, University of California, San Francisco, San Francisco, CA, 3University of California, Riverside, Riverside, CA, 4Department of Systems Engineering, Stevens Institute of Technology, Hoboken, NJ, 5Department of EECS, University of California Berkeley, Berkeley, CA
*Corresponding author
MedEvalArena introduction graphic

Introduction

Large language models (LLMs) have shown strong performance in medical question answering, but their capabilities in complex clinical reasoning remain difficult to characterize systematically. We present MedEvalArena, a dynamic evaluation framework designed to compare medical reasoning robustness across models using a symmetric, round-robin protocol.

In MedEvalArena, each model generates medical questions intended to challenge the medical reasoning abilities of the other models. Question validity is assessed using an LLM-as-judge paradigm along two axes: logical correctness and medical accuracy. Questions that a majority of the LLM-as-judge ensemble finds valid 'pass' and are used in the exam stage, in which every LLM answers all of the valid questions. The top six model families on https://artificialanalysis.ai/ (model cutoff Nov 15, 2025) served as both question generators and members of the LLM-as-judge ensemble. Each generator produced questions until it reached a quota of 50 valid questions, for a grand total of 300 questions.
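The protocol can be summarized with the following minimal sketch. The function names `generate_question`, `judge_question`, and `answer_question` are hypothetical placeholders for the model-calling code (not shown here), and the strict majority-vote rule is an assumption based on the description above.

```python
# Minimal sketch of the MedEvalArena round-robin protocol (illustrative only;
# generate_question, judge_question, and answer_question are hypothetical hooks).

from collections import defaultdict

GENERATORS = ["model_a", "model_b", "model_c", "model_d", "model_e", "model_f"]
JUDGES = GENERATORS          # the same six model families serve as the judge ensemble
QUOTA_PER_GENERATOR = 50     # valid questions required per generator (6 x 50 = 300)


def is_valid(question, judges, judge_question):
    """A question passes if a strict majority of judges deem it logically
    correct AND medically accurate (each vote is a single True/False)."""
    votes = [judge_question(j, question) for j in judges]
    return sum(votes) > len(judges) / 2


def build_exam(generate_question, judge_question):
    """Generation + validation stage: collect 50 valid questions per generator."""
    exam = []
    for gen in GENERATORS:
        valid = []
        while len(valid) < QUOTA_PER_GENERATOR:
            q = generate_question(gen)
            if is_valid(q, JUDGES, judge_question):
                valid.append(q)
        exam.extend(valid)
    return exam  # 300 valid questions in total


def run_exam(exam, models, answer_question):
    """Exam stage: every model answers every valid question."""
    correct = defaultdict(list)
    for q in exam:  # q is assumed to be a dict with an "answer" key
        for m in models:
            correct[m].append(answer_question(m, q) == q["answer"])
    return {m: sum(v) / len(v) for m, v in correct.items()}  # per-model accuracy
```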

MedEvalArena provides a dynamic and scalable framework for benchmarking LLM medical reasoning.

Read more analyses here: https://www.medrxiv.org/content/10.64898/2026.01.27.26344905v1

Leaderboard

Leaderboard views: question validity (pass rate; only models that served as both question generators and members of the LLM-as-judge ensemble are shown), accuracy, cost per evaluation, and accuracy vs. cost per evaluation.

Up to the top 10 models by accuracy are shown. Each evaluation contains 300 questions (50 valid questions generated per LLM).

🏟️ Results

Default sort is by Mean Accuracy (descending). Top-3 entries are marked with πŸ₯‡πŸ₯ˆπŸ₯‰. Validity refers to the question pass rate; a minimal sketch of how these metrics can be computed is given below the table.

Leaderboard table (columns: #, Model, Mean Accuracy, SEM, Validity), generated 2026-01-30T00:48:00Z.

Tip: click a column header to sort.
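For reference, a minimal sketch of how Mean Accuracy and SEM can be computed, assuming per-question correctness is stored as a list of 0/1 outcomes per model (this is an illustrative representation, not the actual MedEvalArena implementation):

```python
# Illustrative computation of the leaderboard columns from hypothetical
# per-question correctness records (0/1 per question, 300 questions per model).
import math


def mean_accuracy(correct):
    """Fraction of questions answered correctly."""
    return sum(correct) / len(correct)


def sem(correct):
    """Standard error of the mean for binary (0/1) outcomes: sqrt(p * (1 - p) / n)."""
    n = len(correct)
    p = sum(correct) / n
    return math.sqrt(p * (1 - p) / n)


# Hypothetical example: one model with 300 recorded outcomes.
results = {"example-model": [1, 0, 1, 1] * 75}
for model, correct in sorted(results.items(),
                             key=lambda kv: mean_accuracy(kv[1]),
                             reverse=True):  # descending accuracy, as on the leaderboard
    print(f"{model}: accuracy={mean_accuracy(correct):.3f}, SEM={sem(correct):.3f}")
```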

πŸ“¬ Contact

For questions, please open a GitHub issue on the repository.

Citation

Prem, P., Shidara, K., Kuppa, V., Wheeler, E., Liu, F., Alaa, A., & Bernardo, D. (2026). MedEvalArena: A Self-Generated, Peer-Judged Benchmark for Medical Reasoning. medRxiv. https://doi.org/10.64898/2026.01.27.26344905

BibTeX

@article{prem2026medevalarena,
  title   = {MedEvalArena: A Self-Generated, Peer-Judged Benchmark for Medical Reasoning},
  author  = {Prem, Preethi and Shidara, Kie and Kuppa, Vikasini and Wheeler, Esm{\'e} and Liu, Feng and Alaa, Ahmed and Bernardo, Danilo},
  journal = {medRxiv},
  year    = {2026},
  date    = {2026-01-27},
  doi     = {10.64898/2026.01.27.26344905},
  url     = {https://doi.org/10.64898/2026.01.27.26344905},
  note    = {Preprint}
}