RAVENEA: A Benchmark for Multimodal Retrieval-Augmented Visual Culture Understanding

1University of Copenhagen,  2ETH Zürich,  3University of Amsterdam,
4University of Cambridge,  5Singapore University of Technology and Design,
6Massachusetts Institute of Technology,  7Singapore Management University

*Project lead   Equal contribution   Corresponding authors

tl;dr: We present RAVENEA, a large-scale benchmark with 10K+ human-ranked Wikipedia documents for culture-aware vision-language tasks. We find that culture-aware retrieval boosts lightweight VLMs, underscoring the value of cultural augmentation.

We introduce RAVENEA: a Multimodal Retrieval-Augmented Visual culturE uNdErstAnding dataset. It features culture-focused visual question answering and culture-informed image captioning tasks, with a diverse geographic and categorical distribution of cultural references. Our evaluation across 14 vision-language models demonstrates that culture-aware retrieval significantly improves performance on these challenging tasks.

Abstract

As vision-language models (VLMs) become increasingly integrated into daily life, the need for accurate visual culture understanding is becoming critical. Yet these models frequently fall short in interpreting cultural nuances effectively. Prior work has demonstrated the effectiveness of retrieval-augmented generation (RAG) in enhancing cultural understanding in text-only settings, but its application in multimodal scenarios remains underexplored. To bridge this gap, we introduce RAVENEA (Retrieval-Augmented Visual culturE uNdErstAnding), a new benchmark designed to advance visual culture understanding through retrieval, focusing on two tasks: culture-focused visual question answering (cVQA) and culture-informed image captioning (cIC). RAVENEA extends existing datasets by integrating over 10,000 Wikipedia documents curated and ranked by human annotators. With RAVENEA, we train and evaluate seven multimodal retrievers that rank these documents for each image query, and measure the downstream impact of retrieval-augmented inputs across fourteen state-of-the-art VLMs. Our results show that lightweight VLMs, when augmented with culture-aware retrieval, outperform their non-augmented counterparts (by at least 3.2% absolute on cVQA and 6.2% absolute on cIC). This highlights the value of retrieval-augmented methods and culturally inclusive benchmarks for multimodal understanding.
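
To make the evaluated pipeline concrete, the sketch below shows one way culture-aware retrieval augmentation can work in practice: an off-the-shelf CLIP retriever scores candidate Wikipedia passages against an image query, and the top-ranked passage is prepended to the task prompt before it is handed to a downstream VLM. This is a minimal illustration, not the released RAVENEA code; the CLIP checkpoint, the toy documents, the image path, and the example question are all assumptions made for the sketch.

# Minimal sketch (not the authors' released code): rank candidate Wikipedia
# passages for an image query with an off-the-shelf CLIP retriever, then build
# a retrieval-augmented prompt for a downstream VLM.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("query_image.jpg")  # illustrative path: a photo with a cultural reference
docs = [  # toy summaries stand in for full Wikipedia articles (CLIP truncates long text)
    "Songkran is the Thai New Year festival, celebrated with water fights in April.",
    "Diwali is the Hindu festival of lights, celebrated in autumn.",
    "Hanami is the Japanese custom of viewing cherry blossoms in spring.",
]

# Score every candidate document against the image query.
inputs = processor(text=docs, images=image, return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    logits_per_image = model(**inputs).logits_per_image  # shape: (1, num_docs)

# Keep the top-ranked document(s) as cultural context.
top_k = 1
top_idx = logits_per_image[0].topk(top_k).indices.tolist()
context = "\n".join(docs[i] for i in top_idx)

# The retrieved context is prepended to the task instruction; the augmented
# prompt, together with the image, would then be passed to any VLM for cVQA or cIC.
question = "Which festival is most likely taking place in this image?"
prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
print(prompt)

In the benchmark itself, the candidate documents come from the 10,000+ human-ranked Wikipedia articles, the retrievers are trained and evaluated against those human rankings, and the retrieval-augmented inputs are fed to the fourteen downstream VLMs for cVQA and cIC.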

BibTeX

@article{li2025ravenea,
  title={RAVENEA: A Benchmark for Multimodal Retrieval-Augmented Visual Culture Understanding},
  author={Li, Jiaang and Yuan, Yifei and Li, Wenyan and Aliannejadi, Mohammad and Hershcovich, Daniel and S{\o}gaard, Anders and Vuli{\'c}, Ivan and Zhang, Wenxuan and Liang, Paul Pu and Deng, Yang and others},
  journal={arXiv preprint arXiv:2505.14462},
  year={2025}
}