Judge Anything: MLLM as a Judge Across Any Modality

1Huazhong University of Science and Technology, 2University of Illinois Chicago
* Co-first authors § Corresponding author
† Equal contribution
Judge Anything Pipeline

Construction Pipeline The construction of TaskAnything and JudgeAnything follows a systematic four-step approach. First, we compile open-ended any-to-any instructions from existing benchmarks and datasets, followed by rigorous human annotation to ensure sample diversity and quality in TaskAnything. Next, we collect model responses and develop evaluation principles through a Human-MLLM collaborative approach, creating detailed assessment checklists for each sample. Finally, we curate instruction-response pairs to evaluate the effectiveness of MLLM-as-a-Judge on any-to-any generation tasks, benchmarking these automated assessments against expert human judgments.

Abstract

Evaluating generative foundation models on open-ended multimodal understanding (MMU) and generation (MMG) tasks across diverse modalities (e.g., images, audio, video) poses significant challenges due to the complexity of cross-modal interactions. To this end, the idea of utilizing Multimodal LLMs (MLLMs) as automated judges has emerged, with encouraging results in assessing vision-language understanding tasks. Moving further, this paper extends MLLM-as-a-Judge across modalities in a unified manner by introducing two benchmarks, TaskAnything and JudgeAnything, to respectively evaluate the overall performance and judging capabilities of MLLMs across any-to-any modality tasks. Specifically, TaskAnything evaluates the MMU and MMG capabilities across 15 any-to-any modality categories, employing 1,500 queries curated from well-established benchmarks. Furthermore, JudgeAnything evaluates the judging capabilities of 5 advanced MLLMs (e.g., GPT-4o and Gemini-1.5-Pro) from the perspectives of Pair Comparison and Score Evaluation, providing a standardized testbed that incorporates human judgments and detailed rubrics. Our extensive experiments reveal that while these MLLMs show promise in assessing MMU (i.e., achieving an average of 66.55% in the Pair Comparison setting and 42.79% in the Score Evaluation setting), they encounter significant challenges with MMG tasks (i.e., averaging only 53.37% in the Pair Comparison setting and 30.05% in the Score Evaluation setting), exposing cross-modality biases and hallucination issues. To address this, we present OmniArena, an automated platform for evaluating omni-models and multimodal reward models. Our work highlights the need for fairer evaluation protocols and stronger alignment with human preferences. The source code and dataset are publicly available at: https://mllm-judge.github.io.

TaskAnything

We collect samples from previous well-constructed and data-balanced benchmarks, as shown in Table 5, followed by manual selection to filter out similar and non-open-source samples. For MMU tasks (e.g., Video-to-Text), we further incorporate human refinements to remove predefined constraints and ensure natural, free-form outputs. For MMG tasks (e.g., text-to-video), we filter out NSFW content and low-quality queries to ensure answerability.

For emerging task categories like visual-to-audio, we employ a human-in-the-loop approach to curate diverse queries sourced from carefully filtered video datasets. These datasets are collected from publicly available video platforms, ensuring both relevance and diversity. Through this rigorous process, we have successfully constructed a comprehensive open-ended any-to-any benchmark comprising 1,500 high-quality queries, with 100 representative samples per task category.

Benchmark Overview

Task overview infographic

JudgeAnything

Settings

Pair Comparison

The Pair Comparison setting requires the judging model to choose the better of two given model responses.
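As a concrete illustration, agreement between a judge's pairwise verdicts and human verdicts can be computed both with ties counted and with human-tied items excluded; this is one plausible reading of the "w. Tie" / "w.o. Tie" columns in the results below, and the function and label names are illustrative, not the paper's exact protocol.

```python
def pair_agreement(judge, human):
    """Fraction of items where the judge's verdict ('A', 'B', or 'Tie')
    matches the human verdict, with and without human-tied items."""
    assert len(judge) == len(human)
    with_tie = sum(j == h for j, h in zip(judge, human)) / len(judge)
    # Exclude items the humans marked as ties, then compare A/B verdicts only.
    kept = [(j, h) for j, h in zip(judge, human) if h != "Tie"]
    without_tie = sum(j == h for j, h in kept) / len(kept)
    return with_tie, without_tie

# Toy example: 4 items; humans tied on the last one.
w, wo = pair_agreement(["A", "Tie", "B", "A"], ["A", "B", "B", "Tie"])
# w  = 2/4 (matches on items 1 and 3)
# wo = 2/3 (tie item dropped from the denominator)
```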

Score Evaluation

The Score Evaluation setting requires the judging model to assign an integer score according to detailed scoring rules.

Baselines

Overall

  • Single-step direct evaluation
  • Combined reasoning and judgment
  • Baseline for efficiency comparison

Rubric

  • Multi-criteria evaluation framework
  • Predefined assessment dimensions
  • Structured scoring system

Checklist

  • Human-curated verification items
  • Multi-stage validation process
  • Comprehensive coverage assurance
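To make the three baselines concrete, here is a minimal sketch of how each strategy might be assembled into a Score Evaluation prompt. All template wording, function names, and parameters are illustrative assumptions, not the paper's actual prompts.

```python
def build_judge_prompt(question, response, mode, rubric=None, checklist=None):
    """Assemble a Score Evaluation prompt for one of the three baselines.
    `rubric`: dict of criterion -> description; `checklist`: list of
    yes/no verification items. Hypothetical template, for illustration."""
    parts = [f"Question: {question}", f"Response: {response}"]
    if mode == "overall":
        # Single-step direct evaluation: one holistic score.
        parts.append("Rate the response from 1 to 5 in a single step.")
    elif mode == "rubric":
        # Multi-criteria framework with predefined dimensions.
        parts.append("Score each criterion from 1 to 5, then give an overall score:")
        parts += [f"- {name}: {desc}" for name, desc in (rubric or {}).items()]
    elif mode == "checklist":
        # Human-curated, sample-specific verification items.
        parts.append("Verify each item (yes/no), then give an overall 1-5 score:")
        parts += [f"[ ] {item}" for item in (checklist or [])]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return "\n".join(parts)


prompt = build_judge_prompt(
    "Describe the video.", "A dog runs on a beach.",
    mode="checklist", checklist=["Mentions the main subject"],
)
```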

Model Performance

Models — Multimodal Understanding (first six columns) | Multimodal Generation (last six columns)
Columns per block: Pair Comparison accuracy (w. Tie, w.o. Tie); Score Evaluation (Agreement, Pearson, Spearman, MAE)
Overall
GPT-4o 61.20 77.59 38.40 0.461 0.433 0.919 52.55 69.83 31.78 0.444 0.444 1.176
Gemini-1.5-Pro 60.50 77.14 37.45 0.456 0.420 1.022 54.70 70.74 32.00 0.327 0.338 1.268
LearnLM-1.5-Pro 58.10 74.14 33.45 0.415 0.380 1.103 52.05 67.32 33.32 0.328 0.332 1.285
Gemini-2.0-Flash 58.10 75.42 36.15 0.423 0.348 1.053 52.15 68.36 31.20 0.415 0.417 1.536
Gemini-2.0-Flash-Lite 57.50 74.52 35.75 0.429 0.385 1.052 46.95 60.93 30.03 0.421 0.407 1.482
Evaluator-Fusion 62.20 79.25 37.05 0.512 0.471 0.936 54.80 72.05 25.22 0.492 0.502 1.261
Rubrics
GPT-4o 63.38 76.68 39.98 0.576 0.568 0.935 32.27 53.03 28.95 0.383 0.392 1.365
Gemini-1.5-Pro 69.40 82.74 39.58 0.565 0.551 0.949 53.01 68.67 35.60 0.406 0.408 1.203
LearnLM-1.5-Pro 64.77 77.30 39.62 0.552 0.540 0.973 52.66 67.18 35.83 0.387 0.389 1.222
Gemini-2.0-Flash 47.75 68.71 37.53 0.491 0.473 1.124 41.89 61.00 26.87 0.350 0.353 1.706
Gemini-2.0-Flash-Lite 54.73 70.77 36.64 0.492 0.495 1.152 40.45 59.25 28.87 0.405 0.414 1.571
Evaluator-Fusion 66.73 81.08 37.17 0.618 0.627 0.989 49.26 65.89 24.42 0.502 0.522 1.349
Checklist
GPT-4o 60.77 74.75 42.03 0.623 0.608 0.844 30.27 51.90 30.63 0.343 0.340 1.295
Gemini-1.5-Pro 70.60 84.07 54.57 0.745 0.729 0.629 53.79 69.97 41.77 0.494 0.495 1.036
LearnLM-1.5-Pro 64.52 76.85 43.45 0.646 0.631 0.843 52.48 70.71 38.43 0.445 0.447 1.112
Gemini-2.0-Flash 53.93 71.87 40.31 0.554 0.543 0.979 50.23 67.74 35.16 0.476 0.482 1.282
Gemini-2.0-Flash-Lite 56.22 70.74 39.50 0.551 0.552 0.979 48.53 65.66 35.99 0.450 0.460 1.165
Evaluator-Fusion 66.55 80.68 42.79 0.687 0.687 0.816 53.37 70.71 30.05 0.562 0.572 1.069

Task-wise Performance Breakdown

Models, Setting, Overall, then per-task scores (T = Text, I = Image, V = Video, A = Audio; X→Y denotes input→output modality)
T→T I→T V→T A→T V+A→T T→I T→V T→A I→I I→V I→A V→V V→A A→V A→A
Pair
GPT-4o Overall 55.43 52.50 73.00 58.00 69.50 53.00 52.50 56.00 26.00 49.00 42.00 68.00 78.00 38.00 58.50 57.50
Rubric 42.64 52.33 68.00 53.08 77.58 65.92 17.33 25.25 46.83 69.17 43.42 15.67 3.17 38.92 48.92 14.00
Checklist 40.43 46.92 67.33 52.08 63.25 70.92 10.50 25.42 45.75 69.17 43.25 10.08 3.08 38.67 44.50 12.25
Gemini-1.5-Pro Overall 56.63 49.00 79.00 57.50 68.50 48.50 52.00 56.00 38.00 57.00 39.00 69.00 88.50 33.00 47.00 67.50
Rubric 58.47 62.33 75.67 56.08 82.42 72.17 18.42 68.50 36.50 73.25 38.08 66.25 89.33 34.00 43.25 62.50
Checklist 59.39 65.75 78.67 58.42 70.92 72.58 26.50 69.92 37.17 65.42 39.75 66.25 89.83 33.42 45.33 64.33
Gemini-2.0-Flash Overall 54.13 43.50 67.50 59.00 69.00 51.50 46.50 58.00 50.50 51.00 43.50 57.50 77.00 41.00 58.00 38.50
Rubric 43.84 36.75 50.58 42.33 56.58 67.58 33.42 55.00 37.50 48.92 44.08 31.50 57.00 41.58 45.50 24.42
Checklist 51.47 39.33 59.75 47.00 66.08 67.58 46.42 62.83 43.33 33.50 48.25 60.25 78.17 38.33 52.67 38.58
Score
GPT-4o Overall 33.98 37.25 46.50 35.75 33.25 39.25 28.25 41.50 30.25 37.50 29.25 28.25 13.75 28.25 47.00 33.00
Rubric 32.63 46.58 46.50 37.13 31.58 38.13 30.25 30.25 34.29 32.00 28.67 30.96 12.29 24.79 33.21 25.79
Checklist 34.43 48.63 48.25 35.17 34.04 44.04 34.00 30.33 34.00 41.04 30.67 25.33 21.96 27.46 35.04 25.50
Gemini-1.5-Pro Overall 33.82 40.75 47.25 38.00 34.25 27.00 25.25 17.00 29.50 39.75 26.50 32.00 34.50 35.75 50.75 29.00
Rubric 36.93 44.00 47.21 35.17 33.29 38.25 36.17 36.17 33.79 43.08 41.63 33.83 32.96 32.42 45.21 22.21
Checklist 46.04 58.88 59.54 42.83 48.13 63.46 44.50 44.50 33.71 43.71 62.38 44.58 40.25 28.46 52.58 30.83
Gemini-2.0-Flash Overall 32.85 35.00 40.75 45.00 29.75 30.25 21.50 26.75 24.00 36.50 41.00 47.00 11.25 42.75 46.50 14.75
Rubric 30.42 46.92 45.21 38.58 27.17 29.75 35.63 31.50 32.50 23.13 32.71 32.17 11.79 33.00 24.13 12.42
Checklist 36.88 47.21 47.04 40.29 36.92 35.00 36.92 35.00 34.13 37.13 37.75 41.63 19.83 38.50 54.00 16.75

OMNIARENA

Universal Benchmark Platform for Omni-Model Evaluation

Core Architecture

Dynamic Evaluation

Pairwise comparison mechanism with real-time ELO ranking
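The real-time ELO ranking can be sketched with the standard Elo update rule; the K-factor and starting rating below are conventional choices, not necessarily OmniArena's.

```python
def elo_update(r_a, r_b, outcome, k=32):
    """Standard Elo update after one pairwise comparison.
    outcome: 1.0 if model A wins, 0.0 if B wins, 0.5 for a tie."""
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    new_a = r_a + k * (outcome - expected_a)
    new_b = r_b + k * ((1 - outcome) - (1 - expected_a))
    return new_a, new_b


# Equal ratings, A wins: A gains k/2 points and B loses k/2.
a, b = elo_update(1500, 1500, 1.0)
# a = 1516.0, b = 1484.0
```

Each human (or judge) vote on a response pair triggers one such update, so the leaderboard reflects new comparisons immediately.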

Open Participation

Seamless integration for new models and custom queries

Dual Arena System

Specialized evaluation tracks for MMU and MMG tasks

Platform Raw Result

BibTeX

@article{pu2025judge,
  title={Judge Anything: MLLM as a Judge Across Any Modality},
  author={Pu, Shu and Wang, Yaochen and Chen, Dongping and Chen, Yuhang and Wang, Guohao and Qin, Qi and Zhang, Zhongyi and Zhang, Zhiyuan and Zhou, Zetong and Gong, Shuang and others},
  journal={arXiv preprint arXiv:2503.17489},
  year={2025}
}