We collect samples from previous well-constructed, data-balanced benchmarks, as shown in Table 5, and then manually filter out near-duplicate and non-open-source samples. For MMU tasks (e.g., video-to-text), we further apply human refinement to remove predefined constraints and ensure natural, free-form outputs. For MMG tasks (e.g., text-to-video), we filter out NSFW content and low-quality queries to ensure answerability.
For emerging task categories such as visual-to-audio, we employ a human-in-the-loop approach to curate diverse queries from carefully filtered video datasets, collected from publicly available video platforms to ensure both relevance and diversity. Through this process, we construct a comprehensive open-ended any-to-any benchmark comprising 1,500 high-quality queries, with 100 representative samples per task category.
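The per-category balancing step (a fixed number of samples per task category, summing to 1,500 queries overall) can be sketched as below. This is a minimal illustration, not the authors' actual pipeline: the function name, the `(category, query)` data layout, and the candidate pool are all hypothetical.

```python
import random

def balance_queries(queries, per_category=100, seed=0):
    """Group candidate queries by task category and sample a fixed
    number from each, yielding a category-balanced benchmark.

    `queries` is a list of (category, query_text) pairs that have
    already passed the manual/quality filters.
    """
    by_cat = {}
    for cat, text in queries:
        by_cat.setdefault(cat, []).append(text)

    rng = random.Random(seed)  # fixed seed for a reproducible selection
    benchmark = []
    for cat, items in sorted(by_cat.items()):
        if len(items) < per_category:
            raise ValueError(f"category {cat!r} has only {len(items)} queries")
        benchmark.extend((cat, q) for q in rng.sample(items, per_category))
    return benchmark

# Hypothetical pool: 15 task categories, 150 filtered candidates each.
pool = [(f"task_{i}", f"query {j}") for i in range(15) for j in range(150)]
bench = balance_queries(pool)
print(len(bench))  # 15 categories x 100 samples = 1500
```

Sampling with a fixed seed after filtering keeps the selection reproducible while preserving the equal per-category distribution.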