MMMU: An AI Benchmark in the Quest for Expert AGI

December 7, 2023

A new AI benchmark may be a key step toward measuring whether an AI model is really approaching AGI. Artificial General Intelligence (AGI) is conceptualized as AI that equals or surpasses human abilities across a wide array of tasks. A major challenge in this field has been the absence of a universally accepted, operational definition of AGI. The leveled taxonomy for AGI proposed by Morris et al. addresses this, focusing on both the generality and performance of AI capabilities. A key focus is Level 3 AGI, or Expert AGI, which is characterized by AI systems performing at the level of the top 10% of skilled adults across a variety of tasks, highlighting potential implications for job displacement and economic change.

Benchmarking Expert AGI

Creating benchmarks for Expert AGI is challenging. Since Expert AGI is defined in relation to the performance of skilled adults, college-level exams in various disciplines are a logical starting point. Existing benchmarks like MMLU and AGIEval have contributed to this area but are limited to text-based questions. Human expertise, however, often requires understanding that spans both text and images.

Introducing MMMU

To bridge this gap, MMMU (Massive Multi-discipline Multimodal Understanding) has been developed. It’s a comprehensive benchmark designed to assess college-level multimodal understanding and reasoning across six disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering. MMMU includes 11.5K multimodal questions spanning 30 subjects and 183 subfields, ensuring a balance of breadth and depth. The benchmark introduces unique challenges, such as highly heterogeneous image types (charts, diagrams, maps, tables, chemical structures, and more) and interleaved text-image inputs, demanding sophisticated perception and complex reasoning from AI models.
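
To make the format concrete, here is a minimal sketch of what an MMMU-style item might look like in code. The class, field names, and the "<image 1>" placeholder convention are illustrative assumptions, not the official dataset schema.

```python
# Illustrative only: the class, field names, and "<image 1>" placeholder
# convention are assumptions, not the official MMMU schema.
from dataclasses import dataclass, field


@dataclass
class MMMUItem:
    """A hypothetical representation of one MMMU multiple-choice question."""
    subject: str                 # e.g. "Art History" (one of the 30 subjects)
    discipline: str              # one of the six disciplines
    question: str                # text with inline placeholders such as "<image 1>"
    options: list[str]           # answer choices, typically labeled A-D
    images: dict[str, bytes] = field(default_factory=dict)  # placeholder -> image bytes
    answer: str = ""             # gold option letter, e.g. "B"


def build_prompt(item: MMMUItem) -> str:
    """Render the interleaved text-image question as a text prompt.

    Real LMM APIs accept the images separately; the placeholders stay inline
    here only to show where each image belongs in the question.
    """
    letters = "ABCDEFGH"
    option_lines = [f"({letters[i]}) {opt}" for i, opt in enumerate(item.options)]
    return "\n".join([item.question, *option_lines, "Answer with the option letter."])


example = MMMUItem(
    subject="Art History",
    discipline="Art & Design",
    question="<image 1> Which artistic movement does this painting belong to?",
    options=["Impressionism", "Cubism", "Baroque", "Surrealism"],
)
print(build_prompt(example))
```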

Evaluating AI Models with MMMU

In an evaluation involving 14 open-source large multimodal models (LMMs) and GPT-4V, a proprietary LMM, several key findings emerged:

  • GPT-4V scored only 55.7% accuracy on MMMU, indicating significant potential for improvement.
  • A noticeable performance disparity exists between open-source LMMs and GPT-4V.
  • Additional features like OCR and captioning do not substantially enhance performance, highlighting the need for more advanced joint interpretation of images and text.
  • Model performance varies across disciplines, with better results in fields with less complex visuals, like Art & Design and Humanities & Social Science, than in fields with denser visual content, like Science and Tech & Engineering (a sketch of how per-discipline accuracy is aggregated follows this list).
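
For context on how numbers like the 55.7% overall accuracy and the per-discipline gaps are typically computed, here is a minimal aggregation sketch. The record format and discipline grouping are assumptions for illustration; this is not the official MMMU evaluation script.

```python
# Minimal aggregation sketch; the record format is an assumption and this is
# not the official MMMU evaluation script.
from collections import defaultdict


def score(records: list[dict]) -> dict[str, float]:
    """records: [{"discipline": ..., "prediction": ..., "answer": ...}, ...]"""
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for r in records:
        total[r["discipline"]] += 1
        correct[r["discipline"]] += int(r["prediction"] == r["answer"])
    per_discipline = {d: correct[d] / total[d] for d in total}
    overall = sum(correct.values()) / sum(total.values())  # micro-average over all questions
    return {"Overall": overall, **per_discipline}


predictions = [
    {"discipline": "Art & Design", "prediction": "B", "answer": "B"},
    {"discipline": "Science", "prediction": "C", "answer": "A"},
]
print(score(predictions))  # {'Overall': 0.5, 'Art & Design': 1.0, 'Science': 0.0}
```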

Data Collection and Quality Control for MMMU

The development of MMMU involved a thorough process. Subjects were chosen based on the integral role visual inputs play in them, and questions were sourced from college exams, quizzes, and textbooks, with strict adherence to copyright and licensing norms. The data then underwent extensive cleaning and was categorized into difficulty levels, ensuring the benchmark’s quality and relevance.

Limitations and Future Directions

Despite its significance, MMMU has limitations. Its focus on college-level subjects might not fully capture the breadth and depth required for Expert AGI. The authors plan to incorporate human expert evaluations in the future, aiming for a more rounded assessment of how far current models are from Expert AGI.

MMMU marks a significant step forward in measuring progress toward Expert AGI. By challenging LMMs with intricate, multimodal tasks that demand deep knowledge and reasoning, MMMU not only tests the current limits of AI capabilities but also directs future research in this domain. The ongoing refinement and expansion of benchmarks like MMMU are crucial in the pursuit of realizing the full potential of AGI.
