Papers with Code. Question answering on MedQA (USMLE). https://paperswithcode.com/sota/question-answering-on-medqa-usmle (2024).
Khosravi, M., Zare, Z., Mojtabaeian, S. M. & Izadi, R. Artificial intelligence and decision-making in healthcare: a thematic analysis of a systematic review of reviews. Health Serv. Res. Manag. Epidemiol. 11, 23333928241234863 (2024).
Nath, D. Artificial intelligence (AI) will transform the clinical workflow with the next-generation technology. HealthTech Magazines https://www.healthtechmagazines.com/artificial-intelligence-ai-will-transform-the-clinical-workflow-with-the-next-generation-technology/ (2024).
Carl, N. et al. Evaluating interactions of patients with large language models for medical information. BJU Int. 135, 1010–1017 (2025).
Nori, H., King, N., McKinney, S. M., Carignan, D. & Horvitz, E. Capabilities of GPT-4 on medical challenge problems. Preprint at https://arxiv.org/abs/2303.13375 (2023).
Raji, I. D., Daneshjou, R. & Alsentzer, E. It’s time to bench the medical exam benchmark. NEJM AI 2, AIe2401235 (2025).
Pal, A., Umapathi, L. K. & Sankarasubbu, M. MedMCQA: a large-scale multi-subject multi-choice dataset for medical domain question answering. In Proc. Conference on Health, Inference, and Learning 174, 248–260 (PMLR, 2022).
Bedi, S. et al. Testing and evaluation of health care applications of large language models: a systematic review. JAMA 333, 319–328 (2025).
Hager, P. et al. Evaluation and mitigation of the limitations of large language models in clinical decision-making. Nat. Med. 30, 2613–2622 (2024).
Arora, R. K. et al. HealthBench: evaluating large language models towards improved human health. https://cdn.openai.com/pdf/bd7a39d5-9e9f-47b3-903c-8b847ca650c7/healthbench_paper.pdf (2025).
Liang, P. et al. Holistic evaluation of language models. Transactions on Machine Learning Research https://openreview.net/pdf?id=iO4LZibEqW (2023).
Leaderboard overview. LM Arena https://lmarena.ai/leaderboard (2025).
Wu, J. et al. BRIDGE: benchmarking large language models for understanding real-world clinical practice text. Preprint at https://arxiv.org/abs/2504.19467 (2025).
Fries, J. A. et al. BigBio: a framework for data-centric biomedical natural language processing. In 36th Conference on Neural Information Processing Systems (NeurIPS 2022) Track on Datasets and Benchmarks https://openreview.net/pdf?id=8lQDn9zTQlW (2022).
Croxford, E. et al. Automating evaluation of AI text generation in healthcare with a large language model (LLM)-as-a-judge. Preprint at medRxiv https://doi.org/10.1101/2025.04.22.25326219 (2025).
Jin, D. et al. What disease does this patient have? A large-scale open domain question answering dataset from medical exams. Appl. Sci. 11, 6421 (2021).
Wornow, M. et al. Context clues: evaluating long context models for clinical prediction tasks on EHRs. In Proc. 13th International Conference on Learning Representations https://openreview.net/pdf?id=zg3ec1TdAP (ICLR, 2025).
Liu, F. et al. Large language models in the clinic: a comprehensive benchmark. Preprint at https://arxiv.org/abs/2405.00716 (2024).
Wu, C. et al. Towards evaluating and building versatile large language models for medicine. NPJ Digit. Med. 8, 58 (2025).
Ouyang, Z. et al. CliMedBench: a large-scale Chinese benchmark for evaluating medical large language models in clinical scenarios. In Proc. 2024 Conference on Empirical Methods in Natural Language Processing 8428–8438 (EMNLP, 2024).
Singhal, K. et al. Large language models encode clinical knowledge. Nature 620, 172–180 (2023).
Sandmann, S. et al. Benchmark evaluation of DeepSeek large language models in clinical decision-making. Nat. Med. 31, 2546–2549 (2025).
Cai, Y. et al. MedBench: a large-scale Chinese benchmark for evaluating medical large language models. In Proc. 38th AAAI Conference on Artificial Intelligence 38, 17709–17717 (AAAI, 2024).
Pal, A., Umapathi, L. K. & Sankarasubbu, M. Med-HALT: medical domain hallucination test for large language models. In Proc. Conference on Computational Natural Language Learning 314–334 (CoNLL, 2023).
Han, T., Kumar, A., Agarwal, C. & Lakkaraju, H. MedSafetyBench: evaluating and improving the medical safety of large language models. In 38th Conference on Neural Information Processing Systems (NeurIPS 2024) Track on Datasets and Benchmarks https://openreview.net/pdf?id=cFyagd2Yh4 (2024).
Liu, F. et al. Application of large language models in medicine. Nat. Rev. Bioeng. 3, 85–104 (2025).
Magar, I. & Schwartz, R. Data contamination: from memorization to exploitation. In Proc. 60th Annual Meeting of the Association for Computational Linguistics (Vol. 2: Short Papers) 157–165 (ACL, 2022).
Deng, J. et al. ImageNet: a large-scale hierarchical image database. In 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition 248–255 (IEEE, 2009).
Gu, J. et al. A survey on LLM-as-a-judge. Preprint at https://arxiv.org/abs/2411.15594 (2025).
Madaan, L. et al. Quantifying variance in evaluation benchmarks. Preprint at https://arxiv.org/abs/2406.10229 (2024).
Manakul, P., Liusie, A. & Gales, M. J. F. SelfCheckGPT: zero-resource black-box hallucination detection for generative large language models. In Proc. 2023 Conference on Empirical Methods in Natural Language Processing 9004–9017 (EMNLP, 2023).
Guha, B. Secret ballots and costly information gathering: the jury size problem revisited. MPRA Paper no. 73048 (2016).
Van Veen, D. et al. Adapted large language models can outperform medical experts in clinical text summarization. Nat. Med. 30, 1134–1142 (2024).
Khandekar, N. et al. MedCalc-Bench: evaluating large language models for medical calculations. In 38th Conference on Neural Information Processing Systems (NeurIPS 2024) Track on Datasets and Benchmarks https://openreview.net/pdf?id=VXohja0vrQ (2024).
MT Samples: collection of transcribed medical transcription sample reports and examples. MT Samples https://www.mtsamples.com/ (2023).
Ben Abacha, A. et al. MEDEC: a benchmark for medical error detection and correction in clinical notes. In Findings of the Association for Computational Linguistics 22539–22550 (ACL, 2025).
Vilares, D. & Gómez-Rodríguez, C. HEAD-QA: a healthcare dataset for complex reasoning. In Proc. 57th Annual Meeting of the Association for Computational Linguistics 960–966 (ACL, 2019).
Chen, H., Fang, Z., Singla, Y. & Dredze, M. Benchmarking large language models on answering and explaining challenging medical questions. In Proc. 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Vol. 1: Long Papers) 3563–3599 (NAACL, 2025).
Yim, W.-W. et al. ACI-BENCH: a novel ambient clinical intelligence dataset for benchmarking automatic visit note generation. Sci. Data 10, 586 (2023).
Ben Abacha, A. et al. Bridging the gap between consumers’ medication questions and trusted answers. In MEDINFO 2019: Health and Wellbeing e-Networks for All 25–29 (IOS Press, 2019).
Zeng, G. et al. MedDialog: large-scale medical dialogue datasets. In Proc. 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) https://doi.org/10.18653/v1/2020.emnlp-main.743 (ACL, 2020).
Ben Abacha, A., Shivade, C. & Demner-Fushman, D. Overview of the MEDIQA 2019 shared task on textual inference, question entailment and question answering. In Proc. 18th BioNLP Workshop and Shared Task 16–25 (ACL, 2019).
Jin, Q., Dhingra, B., Liu, Z., Cohen, W. & Lu, X. PubMedQA: a dataset for biomedical research question answering. In Proc. 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) 2567–2577 (EMNLP, 2019).
Lee, G. et al. EHRSQL: a practical text-to-SQL benchmark for electronic health records. In 36th Conference on Neural Information Processing Systems (NeurIPS 2022) Track on Datasets and Benchmarks https://openreview.net/pdf?id=B2W8Vy0rarw (2022).
Omiye, J. A., Lester, J. C., Spichak, S., Rotemberg, V. & Daneshjou, R. Large language models propagate race-based medicine. NPJ Digit. Med. 6, 195 (2023).
Pandit, S. et al. MedHallu: a comprehensive benchmark for detecting medical hallucinations in large language models. Preprint at https://arxiv.org/abs/2502.14302 (2025).
Wornow, M., Thapa, R., Steinberg, E., Fries, J. A. & Shah, N. H. EHRSHOT: an EHR benchmark for few-shot evaluation of foundation models. In 37th Conference on Neural Information Processing Systems (NeurIPS 2023) Track on Datasets and Benchmarks https://openreview.net/pdf?id=CsXC6IcdwI (2023).
Fleming, S. L. et al. MedAlign: a clinician-generated dataset for instruction following with electronic medical records. In Proc. 38th AAAI Conference on Artificial Intelligence 38, 21545–21555 (AAAI, 2024).
Xu, J. Discharge me: BioNLP ACL’24 shared task on streamlining discharge documentation (version 1.3). PhysioNet https://doi.org/10.13026/0zf5-fx50 (2024).
Chen, Z., Varma, M., Wan, X., Langlotz, C. & Delbrouck, J.-B. Toward expanding the scope of radiology report summarization to multiple anatomies and modalities. In Proc. 61st Annual Meeting of the Association for Computational Linguistics (Vol. 2: Short Papers) 469–484 (ACL, 2023).
Aali, A. et al. A dataset and benchmark for hospital course summarization with adapted large language models. J. Am. Med. Inform. Assoc. 32, 470–479 (2025).
Henry, S., Buchan, K., Filannino, M., Stubbs, A. & Uzuner, O. 2018 n2c2 shared task on adverse drug events and medication extraction in electronic health records. J. Am. Med. Inform. Assoc. 27, 3–12 (2020).
Edin, J. et al. Automated medical coding on MIMIC-III and MIMIC-IV: a critical review and replicability study. In Proc. 46th International ACM SIGIR Conference on Research and Development in Information Retrieval 2572–2582 (SIGIR, 2023).
Lopez, I. et al. Clinical entity augmented retrieval for clinical information extraction. NPJ Digit. Med. 8, 45 (2025).
Pillai, M., Posada, J., Gardner, R. M., Hernandez-Boussard, T. & Bannett, Y. Measuring quality-of-care in treatment of young children with attention deficit/hyperactivity disorder using pre-trained language models. J. Am. Med. Inform. Assoc. 31, 949–957 (2024).
Bannett, Y. et al. Applying large language models to assess quality of care: monitoring ADHD medication side effects. Pediatrics 155, e2024067223 (2025).
Rabbani, N. et al. Evaluation of a large language model to identify confidential content in adolescent encounter notes. JAMA Pediatr. 178, 308–310 (2024).
Tse, G. et al. Large language model responses to adolescent patient and proxy messages. JAMA Pediatr. 179, 93–94 (2025).