MedHELM: A Holistic Evaluation of Large Language Models for Healthcare Applications

Suhana Bedi
2026-01-20

  • Papers with Code. Question answering on MedQA (USMLE). https://paperswithcode.com/sota/question-answering-on-medqa-usmle (2024).

  • Khosravi, M., Zare, Z., Mojtabaeian, S. M. & Izadi, R. Artificial intelligence and decision-making in healthcare: a thematic analysis of a systematic review of reviews. Health Serv. Res. Manag. Epidemiol. 11, 23333928241234863 (2024).

  • Nath, D. Artificial intelligence (AI) will transform the clinical workflow with the next-generation technology. HealthTech Magazines https://www.healthtechmagazines.com/artificial-intelligence-ai-will-transform-the-clinical-workflow-with-the-next-generation-technology/ (2024).

  • Carl, N. et al. Evaluating interactions of patients with large language models for medical information. BJU Int. 135, 1010–1017 (2025).

  • Nori, H., King, N., McKinney, S. M., Carignan, D. & Horvitz, E. Capabilities of GPT-4 on medical challenge problems. Preprint at https://arxiv.org/abs/2303.13375 (2023).

  • Raji, I. D., Daneshjou, R. & Alsentzer, E. It’s time to bench the medical exam benchmark. NEJM AI 2, AIe2401235 (2025).

  • Pal, A., Umapathi, L. K. & Sankarasubbu, M. MedMCQA: a large-scale multi-subject multi-choice dataset for medical domain question answering. In Proc. Conference on Health, Inference, and Learning 174, 248–260 (PMLR, 2022).

  • Bedi, S. et al. Testing and evaluation of health care applications of large language models: a systematic review. JAMA 333, 319–328 (2025).

  • Hager, P. et al. Evaluation and mitigation of the limitations of large language models in clinical decision-making. Nat. Med. 30, 2613–2622 (2024).

  • Arora, R. K. et al. HealthBench: evaluating large language models towards improved human health. https://cdn.openai.com/pdf/bd7a39d5-9e9f-47b3-903c-8b847ca650c7/healthbench_paper.pdf (2025).

  • Liang, P. et al. Holistic evaluation of language models. Transactions on Machine Learning Research https://openreview.net/pdf?id=iO4LZibEqW (2023).

  • Leaderboard overview. LM Arena https://lmarena.ai/leaderboard (2025).

  • Wu, J. et al. BRIDGE: benchmarking large language models for understanding real-world clinical practice text. Preprint at https://arxiv.org/abs/2504.19467 (2025).

  • Fries, J. A. et al. BigBio: a framework for data-centric biomedical natural language processing. In 36th Conference on Neural Information Processing Systems (NeurIPS 2022) Track on Datasets and Benchmarks https://openreview.net/pdf?id=8lQDn9zTQlW (2022).

  • Croxford, E. et al. Automating evaluation of AI text generation in healthcare with a large language model (LLM)-as-a-judge. Preprint at medRxiv https://doi.org/10.1101/2025.04.22.25326219 (2025).

  • Jin, D. et al. What disease does this patient have? A large-scale open domain question answering dataset from medical exams. Appl. Sci. 11, 6421 (2021).

  • Wornow, M. et al. Context clues: evaluating long context models for clinical prediction tasks on EHRs. In Proc. 13th International Conference on Learning Representations https://openreview.net/pdf?id=zg3ec1TdAP (ICLR, 2025).

  • Liu, F. et al. Large language models in the clinic: a comprehensive benchmark. Preprint at https://arxiv.org/abs/2405.00716 (2024).

  • Wu, C. et al. Towards evaluating and building versatile large language models for medicine. NPJ Digit. Med. 8, 58 (2025).

  • Ouyang, Z. et al. CliMedBench: a large-scale Chinese benchmark for evaluating medical large language models in clinical scenarios. In Proc. 2024 Conference on Empirical Methods in Natural Language Processing 8428–8438 (EMNLP, 2024).

  • Singhal, K. et al. Large language models encode clinical knowledge. Nature 620, 172–180 (2023).

  • Sandmann, S. et al. Benchmark evaluation of DeepSeek large language models in clinical decision-making. Nat. Med. 31, 2546–2549 (2025).

  • Cai, Y. et al. MedBench: a large-scale Chinese benchmark for evaluating medical large language models. In Proc. 38th AAAI Conference on Artificial Intelligence 38, 17709–17717 (AAAI, 2024).

  • Pal, A., Umapathi, L. K. & Sankarasubbu, M. Med-HALT: medical domain hallucination test for large language models. In Proc. Conference on Computational Natural Language Learning (CoNLL) 314–334 (CoNLL, 2023).

  • Han, T., Kumar, A., Agarwal, C. & Lakkaraju, H. MedSafetyBench: evaluating and improving the medical safety of large language models. In 38th Conference on Neural Information Processing Systems (NeurIPS 2024) Track on Datasets and Benchmarks https://openreview.net/pdf?id=cFyagd2Yh4 (2024).

  • Liu, F. et al. Application of large language models in medicine. Nat. Rev. Bioeng. 3, 85–104 (2025).

  • Magar, I. & Schwartz, R. Data contamination: from memorization to exploitation. In Proc. 60th Annual Meeting of the Association for Computational Linguistics (Vol. 2: Short Papers) 157–165 (ACL, 2022).

  • Deng, J. et al. ImageNet: a large-scale hierarchical image database. In 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition 248–255 (IEEE, 2009).

  • Gu, J. et al. A survey on LLM-as-a-judge. Preprint at https://arxiv.org/abs/2411.15594 (2025).

  • Madaan, L. et al. Quantifying variance in evaluation benchmarks. Preprint at https://arxiv.org/abs/2406.10229 (2024).

  • Manakul, P., Liusie, A. & Gales, M. J. F. SelfCheckGPT: zero-resource black-box hallucination detection for generative large language models. In Proc. 2023 Conference on Empirical Methods in Natural Language Processing 9004–9017 (EMNLP, 2023).

  • Guha, B. Secret ballots and costly information gathering: the jury size problem revisited. MPRA Paper no. 73048 (2016).

  • Van Veen, D. et al. Adapted large language models can outperform medical experts in clinical text summarization. Nat. Med. 30, 1134–1142 (2024).

  • Khandekar, N. et al. MedCalc-Bench: evaluating large language models for medical calculations. In 38th Conference on Neural Information Processing Systems (NeurIPS 2024) Track on Datasets and Benchmarks https://openreview.net/pdf?id=VXohja0vrQ (2024).

  • MT Samples: collection of transcribed medical transcription sample reports and examples. MT Samples https://www.mtsamples.com/ (2023).

  • Ben Abacha, A. et al. MEDEC: a benchmark for medical error detection and correction in clinical notes. In Findings of the Association for Computational Linguistics 22539–22550 (ACL, 2025).

  • Vilares, D. & Gómez-Rodríguez, C. HEAD-QA: a healthcare dataset for complex reasoning. In Proc. 57th Annual Meeting of the Association for Computational Linguistics 960–966 (ACL, 2019).

  • Chen, H., Fang, Z., Singla, Y. & Dredze, M. Benchmarking large language models on answering and explaining challenging medical questions. In Proc. 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) 3563–3599 (NAACL, 2025).

  • Yim, W.-W. et al. ACI-BENCH: a novel ambient clinical intelligence dataset for benchmarking automatic visit note generation. Sci. Data 10, 586 (2023).

  • Ben Abacha, A. et al. Bridging the gap between consumers’ medication questions and trusted answers. In MEDINFO 2019: Health and Wellbeing e-Networks for All 25–29 (IOS Press, 2019).

  • Zeng, G. et al. MedDialog: large-scale medical dialogue datasets. In Proc. 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (eds Webber, B. et al.) https://doi.org/10.18653/v1/2020.emnlp-main.743 (Association for Computational Linguistics, 2020).

  • Abacha, A. B., Shivade, C. & Demner-Fushman, D. Overview of the MEDIQA 2019 shared task on textual inference, question entailment and question answering. In Proc. 18th BioNLP Workshop and Shared Task 16–25 (ACL, 2019).

  • Jin, Q., Dhingra, B., Liu, Z., Cohen, W. & Lu, X. PubMedQA: a dataset for biomedical research question answering. In Proc. 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) 2567–2577 (EMNLP, 2019).

  • Lee, G. et al. EHRSQL: a practical text-to-SQL benchmark for electronic health records. In 36th Conference on Neural Information Processing Systems (NeurIPS 2022) Track on Datasets and Benchmarks https://openreview.net/pdf?id=B2W8Vy0rarw (NeurIPS, 2022).

  • Omiye, J. A., Lester, J. C., Spichak, S., Rotemberg, V. & Daneshjou, R. Large language models propagate race-based medicine. NPJ Digit. Med. 6, 195 (2023).

  • Pandit, S. et al. MedHallu: a comprehensive benchmark for detecting medical hallucinations in large language models. Preprint at https://arxiv.org/abs/2502.14302 (2025).

  • Wornow, M., Thapa, R., Steinberg, E., Fries, J. A. & Shah, N. H. EHRSHOT: an EHR benchmark for few-shot evaluation of foundation models. In 37th Conference on Neural Information Processing Systems (NeurIPS 2023) Track on Datasets and Benchmarks https://openreview.net/pdf?id=CsXC6IcdwI (2023).

  • Fleming, S. L. et al. MedAlign: a clinician-generated dataset for instruction following with electronic medical records. In Proc. 38th AAAI Conference on Artificial Intelligence 38, 21545–21555 (AAAI, 2024).

  • Xu, J. Discharge me: BioNLP ACL’24 shared task on streamlining discharge documentation (version 1.3). PhysioNet https://doi.org/10.13026/0zf5-fx50 (2024).

  • Chen, Z., Varma, M., Wan, X., Langlotz, C. & Delbrouck, J.-B. Toward expanding the scope of radiology report summarization to multiple anatomies and modalities. In Proc. 61st Annual Meeting of the Association for Computational Linguistics (Vol. 2: Short Papers) 469–484 (ACL, 2023).

  • Aali, A. et al. A dataset and benchmark for hospital course summarization with adapted large language models. J. Am. Med. Inform. Assoc. 32, 470–479 (2025).

  • Henry, S., Buchan, K., Filannino, M., Stubbs, A. & Uzuner, O. 2018 n2c2 shared task on adverse drug events and medication extraction in electronic health records. J. Am. Med. Inform. Assoc. 27, 3–12 (2020).

  • Edin, J. et al. Automated medical coding on MIMIC-III and MIMIC-IV: a critical review and replicability study. In Proc. 46th International ACM SIGIR Conference on Research and Development in Information Retrieval 2572–2582 (SIGIR, 2023).

  • Lopez, I. et al. Clinical entity augmented retrieval for clinical information extraction. NPJ Digit. Med. 8, 45 (2025).

  • Pillai, M., Posada, J., Gardner, R. M., Hernandez-Boussard, T. & Bannett, Y. Measuring quality-of-care in treatment of young children with attention deficit/hyperactivity disorder using pre-trained language models. J. Am. Med. Inform. Assoc. 31, 949–957 (2024).

  • Bannett, Y. et al. Applying large language models to assess quality of care: monitoring ADHD medication side effects. Pediatrics 155, e2024067223 (2025).

  • Rabbani, N. et al. Evaluation of a large language model to identify confidential content in adolescent encounter notes. JAMA Pediatr. 178, 308–310 (2024).

  • Tse, G. et al. Large language model responses to adolescent patient and proxy messages. JAMA Pediatr. 179, 93–94 (2025).
