A new AI benchmark may be a key to measuring whether an AI model is really AGI. Artificial General Intelligence (AGI) is conceptualized as AI that equals or surpasses human abilities across a wide array of tasks. A major challenge in this field has been the absence of a universally accepted, operational definition of AGI. The leveled taxonomy for AGI proposed by Morris et al. addresses this by grading systems on both the generality and the performance of their capabilities. A key focus is Level 3 AGI, or Expert AGI: AI systems that perform within the top 10% of skilled adults across a variety of tasks, a threshold with clear implications for job displacement and economic change.
Creating benchmarks for Expert AGI is challenging. Since Expert AGI is defined in relation to the performance of skilled adults, college-level exams in various disciplines are a logical starting point. Existing benchmarks like MMLU and AGIEval have contributed to this area but are limited to text-based questions. Human expertise, however, often requires understanding that spans both text and images.
To bridge this gap, MMMU (Massive Multi-discipline Multimodal Understanding) has been developed: a comprehensive benchmark designed to assess college-level multimodal understanding and reasoning across six disciplines. MMMU includes 11.5K multimodal questions spanning 30 subjects and 183 subfields, balancing breadth with depth. The benchmark introduces unique challenges, such as a wide variety of image formats and interleaved text-image inputs, demanding sophisticated understanding and complex reasoning from AI models.
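For readers who want to poke at the benchmark directly, the sketch below loads a few MMMU questions with the Hugging Face `datasets` library. The repository name (`MMMU/MMMU`), the per-subject configurations, and the field names (`question`, `options`, `answer`, `image_1`) are assumptions based on common conventions for hosted benchmarks; check the official dataset card for the authoritative schema.

```python
# Minimal sketch: inspecting MMMU examples via the Hugging Face `datasets` library.
# Assumes the benchmark is hosted under "MMMU/MMMU" with one configuration per
# subject and fields named as shown below (verify against the dataset card).
from datasets import load_dataset

subject = "Computer_Science"  # one of the 30 subjects
ds = load_dataset("MMMU/MMMU", subject, split="validation")

sample = ds[0]
print(sample["question"])   # question text, which may reference interleaved images
print(sample["options"])    # multiple-choice options
print(sample["answer"])     # gold answer letter, e.g. "B"
print(sample["image_1"])    # first associated image; questions can carry several
```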
In an evaluation of 14 open-source large multimodal models (LMMs) alongside GPT-4V, a proprietary LMM, the headline finding was a persistent gap between current models and expert-level humans: even GPT-4V reached only about 56% accuracy, with the open-source models trailing further behind, indicating substantial room for improvement.
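As an illustration of what such an evaluation entails (not the authors' official harness), the sketch below scores a model on multiple-choice questions. `query_model` is a hypothetical callable standing in for whichever LMM is being tested; it is assumed to accept a text prompt plus a list of images and return an option letter.

```python
# Illustrative multiple-choice evaluation loop, not the MMMU authors' harness.
# `query_model` is a hypothetical stand-in for the LMM under test.
from string import ascii_uppercase

def build_prompt(question: str, options: list[str]) -> str:
    """Format a question and its options into a single multiple-choice prompt."""
    lines = [question]
    lines += [f"({ascii_uppercase[i]}) {opt}" for i, opt in enumerate(options)]
    lines.append("Answer with the letter of the correct option.")
    return "\n".join(lines)

def evaluate(examples, query_model) -> float:
    """Return accuracy over an iterable of (question, options, images, answer) tuples."""
    correct, total = 0, 0
    for question, options, images, answer in examples:
        prompt = build_prompt(question, options)
        prediction = query_model(prompt, images)  # e.g. "B"
        correct += int(prediction.strip().upper().startswith(answer))
        total += 1
    return correct / max(total, 1)
```

Because MMMU answers are single option letters, simple string matching suffices for scoring in this sketch; a real harness would need more robust extraction of the chosen option from free-form model output.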
The development of MMMU involved a thorough process. Subjects were chosen based on the integral role of visual inputs, and questions were sourced from various materials, with strict adherence to copyright and licensing norms. The data underwent extensive cleaning and categorization into different difficulty levels, ensuring the benchmark’s quality and relevance.
Despite its significance, MMMU has limitations. Its focus on college-level subjects might not fully capture the breadth and depth required for Expert AGI. Plans for incorporating human evaluations in the future are aimed at providing a more rounded assessment and bridging the gap towards achieving Expert AGI.
MMMU marks a significant step forward in evaluating Expert AGI. By challenging LMMs with intricate, multimodal tasks that demand deep knowledge and reasoning, MMMU not only tests the current limits of AI capabilities but also directs future research in this domain. The ongoing refinement and expansion of benchmarks like MMMU are crucial in the pursuit of realizing the full potential of AGI.