LG, Jun 15, 2025

Domain Specific Benchmarks for Evaluating Multimodal Large Language Models

Khizar Anjum, Muhammad Arbab Arshad, Kadhim Hayawi, Efstathios Polyzos, Asadullah Tariq, Mohamed Adel Serhani, Laiba Batool, Brady Lund, Nishith Reddy Mannuru, Ravi Varma Kumar Bevara, Taslim Mahbub, Muhammad Zeeshan Akram

arXiv:2506.12958v2, 8 citations, h-index: 28

Originality: Synthesis-oriented

AI Analysis

This work addresses the need for domain-specific evaluation benchmarks for LLMs. Its contribution is incremental: it organizes existing resources rather than proposing new methods.

The paper tackles the lack of domain-specific analysis for evaluating multimodal large language models (LLMs) by introducing a taxonomy of seven key disciplines and compiling benchmarks by domain to create an accessible resource for researchers.

Large language models (LLMs) are increasingly being deployed across disciplines due to their advanced reasoning and problem-solving capabilities. To measure their effectiveness, various benchmarks have been developed that assess aspects of LLM reasoning, comprehension, and problem-solving. While several surveys address LLM evaluation and benchmarks, domain-specific analysis remains underexplored in the literature. This paper introduces a taxonomy of seven key disciplines, encompassing various domains and application areas where LLMs are extensively utilized. Additionally, we provide a comprehensive review of LLM benchmarks and survey papers within each domain, highlighting the unique capabilities of LLMs and the challenges faced in their application. Finally, we compile and categorize these benchmarks by domain to create an accessible resource for researchers, aiming to pave the way for advancements toward artificial general intelligence (AGI).
