RAVENEA: A Benchmark for Multimodal Retrieval-Augmented Visual Culture Understanding

ICLR 2026

* Project lead | Equal contribution
1University of Copenhagen
2ETH Zürich
3University of Amsterdam
4University of Cambridge
5Massachusetts Institute of Technology
6Singapore University of Technology and Design
7Singapore Management University

TL;DR

We present RAVENEA, a new benchmark for evaluating how well VLMs use external knowledge for visual culture understanding. With 11,396 human-ranked Wikipedia documents spanning 8 countries and 11 categories, VLMs augmented with culture-aware retrieval outperform their non-augmented counterparts by an average of +6% on cVQA and +11% on cIC.

RAVENEA Overview

  • 1,868 images
  • 11,396 documents
  • 18,680 image-document pairs
  • 8 countries
  • 11 categories
  • 17 VLMs tested

Cultural Diversity

8 countries across 4 continents:

🇨🇳 China 🇮🇳 India 🇮🇩 Indonesia 🇰🇷 Korea 🇲🇽 Mexico 🇳🇬 Nigeria 🇷🇺 Russia 🇪🇸 Spain

Human-Ranked Documents

18,680 image-document pairs with human-labeled cultural relevance across 3 dimensions.

Significant Gains

Culture-aware retrieval: +6.2% on cVQA accuracy (67.1%→73.3%), +11.3% on cIC RegionScore (57.5%→68.8%).

Abstract

As vision-language models (VLMs) become increasingly integrated into daily life, the need for accurate visual culture understanding is becoming critical. Yet, these models frequently fall short in interpreting cultural nuances effectively. Prior work has demonstrated the effectiveness of retrieval-augmented generation (RAG) in enhancing cultural understanding in text-only settings, while its application in multimodal scenarios remains underexplored. To bridge this gap, we introduce RAVENEA (Retrieval-Augmented Visual culturE uNdErstAnding), a new benchmark designed to advance visual culture understanding through retrieval, focusing on two tasks: culture-centric visual question answering (cVQA) and culture-informed image captioning (cIC). RAVENEA extends existing datasets by integrating 11,396 unique Wikipedia documents curated and ranked by human annotators. Through extensive evaluation of seven multimodal retrievers and seventeen VLMs, RAVENEA reveals key findings: (i) cultural grounding annotations enhance multimodal retrieval and the corresponding downstream tasks; (ii) VLMs augmented with culture-aware retrieval generally outperform their non-augmented counterparts (by +6% on cVQA and +11% on cIC on average); (iii) the benefit of culture-aware retrieval augmentation varies widely across countries. These findings highlight critical limitations of current multimodal retrievers and VLMs, and underscore the need for stronger retrieval-augmented visual culture understanding.

RAVENEA: Dataset & Approach

Cultural Dataset

RAVENEA comprises 11,396 Wikipedia documents, each annotated by human judges along three dimensions: country association, topic alignment, and explicit visual representation.
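
For intuition, a single annotated image-document pair might look like the record below. This is an illustrative Python sketch; the field names are assumptions for exposition, not the released schema.

# Hypothetical structure of one RAVENEA image-document pair.
# Field names are illustrative assumptions, not the dataset's actual schema.
example_pair = {
    "image_id": "mx_0423",                      # one of the 1,868 images
    "country": "Mexico",                        # one of 8 countries
    "category": "food",                         # one of 11 cultural categories
    "wiki_doc": "Mole (sauce)",                 # one of 11,396 Wikipedia documents
    "annotations": {                            # the three human-judged dimensions
        "country_association": True,
        "topic_alignment": True,
        "explicit_visual_representation": False,
    },
}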

Two Core Tasks

Two tasks are introduced: culture-centric visual question answering (cVQA), with 2,331 questions, and culture-informed image captioning (cIC), with 655 captions.
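
As a rough illustration of how the two task metrics behave (a simplified sketch under our reading, not the official scorers): cVQA accuracy compares the predicted answer to the gold answer, and RegionScore checks whether the generated caption names the correct country.

# Simplified metric sketches (assumptions, not the paper's exact implementations).
def cvqa_accuracy(predictions, answers):
    """Exact-match accuracy over cVQA answers, as a percentage."""
    correct = sum(p.strip().lower() == a.strip().lower()
                  for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)

def region_score(captions, countries):
    """Percentage of captions that explicitly mention the gold country."""
    hits = sum(country.lower() in caption.lower()
               for caption, country in zip(captions, countries))
    return 100.0 * hits / len(countries)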

Comprehensive Evaluation

17 VLMs and 7 multimodal retrievers are evaluated, with two retrievers fine-tuned specifically through culture-aware contrastive learning.
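
A minimal sketch of culture-aware contrastive fine-tuning, assuming a CLIP-style dual encoder trained with a symmetric InfoNCE loss over human-annotated relevant image-document pairs; the paper's actual recipe (batching, temperature, and how the three annotation questions are used) may differ.

import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, doc_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of culturally relevant image-document pairs.

    image_emb, doc_emb: (B, D) L2-normalized embeddings, where row i of each
    tensor comes from the same human-annotated relevant pair.
    """
    logits = image_emb @ doc_emb.T / temperature          # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2d = F.cross_entropy(logits, targets)           # image -> document direction
    loss_d2i = F.cross_entropy(logits.T, targets)         # document -> image direction
    return (loss_i2d + loss_d2i) / 2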

Key Findings

Finding 1: Model Scaling Impact

Retrieval helps smaller models far more than larger ones; larger models may already internalize much of the relevant Wikipedia knowledge, making retrieved documents partly redundant.

Key Results:
  • Small model (Qwen3-VL-2B): cVQA +31.6% | cIC +8.1%
  • Large model (Qwen3-VL-32B): cVQA +2.9% | cIC no gain

Finding 2: Cultural Disparities

Cultural disparities in VLMs are evident across regions and models (a per-country breakdown sketch follows this section).

Geographic Variability Insights:
  • cVQA: Most VLMs show diminished performance on Nigeria and Indonesia, and on Spanish culture the accuracy gap between models reaches up to 50%.
  • cIC: VLMs consistently underperform on Mexico while achieving the highest RegionScores on India-related inputs.

Finding 3: Cultural Annotations Benefit Retrieval-Augmented VLMs

Annotation Ablation:
  • Using all three annotation questions (country association, topic alignment, explicit visual representation) yields the strongest performance.
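
The geographic gaps above come from breaking scores down per country. A simple breakdown sketch (the record fields are hypothetical):

from collections import defaultdict

def accuracy_by_country(records):
    """records: dicts with 'country', 'prediction', and 'answer' keys (hypothetical schema)."""
    correct, total = defaultdict(int), defaultdict(int)
    for r in records:
        total[r["country"]] += 1
        correct[r["country"]] += int(r["prediction"] == r["answer"])
    return {c: 100.0 * correct[c] / total[c] for c in total}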

Main Results

Performance with Different Retriever Models

Fine-tuned contrastive models consistently outperform their frozen counterparts. Ravenea-SigLIP and Ravenea-CLIP are SigLIP and CLIP models fine-tuned on RAVENEA, respectively; all other retrievers are frozen.

Method MRR ↑ P@1 ↑ P@3 ↑ P@5 ↑ nDCG@1 ↑ nDCG@3 ↑ nDCG@5 ↑
SigLIP 68.62 54.66 37.47 32.92 61.22 63.82 71.44
CLIP 75.44 60.87 41.41 34.41 67.75 72.31 78.09
ViB 61.33 46.71 33.14 29.42 55.76 59.42 65.92
VLT 58.33 39.86 32.58 29.35 47.74 57.29 62.12
LLaVA 58.85 37.48 30.68 28.21 48.59 51.80 60.34
Ravenea-SigLIP (ours) 70.95 57.14 40.99 33.29 63.86 68.31 73.92
Ravenea-CLIP (ours) 82.17 72.05 45.76 36.77 77.08 78.96 84.09
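
For reference, the ranking metrics reported above (as percentages) can be computed roughly as follows. This sketch assumes binary relevance labels per ranked document; RAVENEA's graded human rankings could instead feed nDCG with graded gains.

import math

def mrr(ranked_relevance):
    """Mean reciprocal rank; ranked_relevance is a list per query of 0/1 labels in ranked order."""
    total = 0.0
    for rels in ranked_relevance:
        for rank, rel in enumerate(rels, start=1):
            if rel:
                total += 1.0 / rank
                break
    return 100.0 * total / len(ranked_relevance)

def precision_at_k(ranked_relevance, k):
    """Mean fraction of relevant documents among the top k."""
    return 100.0 * sum(sum(rels[:k]) / k for rels in ranked_relevance) / len(ranked_relevance)

def ndcg_at_k(ranked_relevance, k):
    """Binary-gain nDCG@k; graded relevance would replace rel with (2**rel - 1)."""
    scores = []
    for rels in ranked_relevance:
        dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(rels[:k]))
        ideal = sorted(rels, reverse=True)
        idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal[:k]))
        scores.append(dcg / idcg if idcg > 0 else 0.0)
    return 100.0 * sum(scores) / len(scores)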

cVQA and cIC Performance w/ and w/o RAG

All retrievers except Ravenea-SigLIP and Ravenea-CLIP are frozen. VLMs augmented with a fine-tuned retriever generally perform better.

Retriever Avg DS-T DS Qw-3 Qw-7 Qw-72 IV-2 IV-8 IV-78 Gm-4 Gm-27 Phi Pix LLaVA Q3-2 Q3-8 Q3-32 GPT
(All models except GPT are open-sourced; GPT is proprietary.)
cVQA Accuracy ↑
W/O RAG 67.1 49.8 68.9 62.7 52.6 83.3 64.1 71.3 84.2 66.5 79.4 42.6 73.2 62.7 37.8 73.7 81.3 86.6
SigLIP 71.0 48.8 76.1 63.2 67.5 82.3 71.3 73.7 86.1 67.5 77.5 64.6 75.1 44.5 67.9 75.6 83.3 82.8
CLIP 71.9 50.2 77.5 63.6 66.0 82.8 73.7 78.5 84.2 69.4 76.6 67.0 76.6 44.5 68.9 76.6 83.3 83.3
ViB 67.6 48.8 71.8 61.7 64.1 78.5 66.5 67.5 82.3 63.6 75.1 61.2 68.4 37.3 60.8 73.2 83.7 85.2
VLT 66.1 46.6 72.7 58.4 65.1 78.5 65.1 67.9 80.4 58.4 75.6 58.9 62.7 35.4 60.3 75.6 78.9 83.7
LLaVA 67.9 50.1 66.0 60.3 67.0 81.8 66.5 66.0 84.2 65.1 75.1 60.8 68.9 37.3 59.8 77.0 83.7 84.7
Ravenea-SigLIP 72.6 48.8 76.1 68.4 67.9 81.3 73.2 75.6 85.6 70.3 79.4 69.4 75.6 47.8 68.9 77.0 84.2 85.6
Ravenea-CLIP 73.3 50.7 75.1 64.1 69.9 82.3 76.1 81.3 85.2 69.9 79.9 69.4 75.1 50.2 69.4 77.5 84.2 86.1
cIC RegionScore ↑
W/O RAG 57.5 26.5 63.3 53.1 69.4 71.4 38.8 49.0 65.3 73.5 75.5 28.6 53.1 49.0 59.2 61.2 73.5 67.3
SigLIP 54.8 30.6 59.2 51.0 63.3 65.3 44.9 53.1 57.1 65.3 69.4 46.9 53.1 42.9 49.0 59.2 59.2 61.2
CLIP 65.1 38.8 69.4 59.2 71.4 69.4 65.3 65.3 71.4 73.5 73.5 61.2 69.4 51.0 63.3 67.3 71.4 65.3
ViB 57.4 36.7 57.1 53.1 61.2 69.4 44.9 59.2 61.2 63.3 69.4 42.9 57.1 44.9 55.1 65.3 69.4 65.3
VLT 57.0 34.7 61.2 55.1 65.3 73.5 46.9 53.1 61.2 61.2 67.3 44.9 59.2 38.8 53.1 63.3 67.3 63.3
LLaVA 56.8 38.8 63.3 51.0 63.3 69.4 44.9 57.1 59.2 65.3 71.4 40.8 55.1 46.9 49.0 61.2 67.3 61.2
Ravenea-SigLIP 60.6 30.6 73.5 57.1 67.3 71.4 44.9 59.2 69.4 67.3 69.4 57.1 61.2 40.8 53.1 65.3 71.4 71.4
Ravenea-CLIP 68.8 46.9 73.5 61.2 77.6 75.5 69.4 67.3 73.5 75.5 75.5 63.3 71.4 55.1 67.3 73.5 73.5 69.4

Legend: DS-T=DeepSeek-Tiny, DS=DeepSeek, Qw=Qwen2.5-VL, IV=InternVL3, Gm=Gemma3, Pix=Pixtral, Q3=Qwen3-VL, GPT=GPT-4.1
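
The augmented rows follow a retrieve-then-generate pattern: embed the query image, rank candidate Wikipedia documents, and prepend the top documents to the VLM prompt. The sketch below shows that pattern over precomputed embeddings; the similarity function and prompt format are assumptions, not the paper's exact setup.

import numpy as np

def retrieve_top_k(image_emb, doc_embs, docs, k=3):
    """Rank candidate Wikipedia documents by cosine similarity to the query image."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    doc_embs = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    scores = doc_embs @ image_emb
    top = np.argsort(-scores)[:k]
    return [docs[i] for i in top]

def build_rag_prompt(question, retrieved_docs):
    """Prepend retrieved cultural context to a cVQA question (prompt format is illustrative)."""
    context = "\n\n".join(retrieved_docs)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"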

Resources

Paper

Read our paper for detailed methodology, results, and analysis.

Read Paper

Code

Access open-source code for training and evaluation.

GitHub Repository

Dataset

Download the RAVENEA benchmark dataset.

Download Data

Citation


@inproceedings{li2026ravenea,
  title={{RAVENEA}: A Benchmark for Multimodal Retrieval-Augmented Visual Culture Understanding},
  author={Jiaang Li and Yifei Yuan and Wenyan Li and Mohammad Aliannejadi and Daniel Hershcovich and Anders S{\o}gaard and Ivan Vuli{\'c} and Wenxuan Zhang and Paul Pu Liang and Yang Deng and Serge Belongie},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
  url={https://openreview.net/forum?id=4zAbkxQ23i}
}

FAQ

What is RAVENEA?
RAVENEA (Retrieval-Augmented Visual culturE uNdErstAnding) is a benchmark for multimodal retrieval-augmented visual culture understanding. It contains 1,868 images from 8 countries across 11 cultural categories, paired with 11,396 unique Wikipedia documents (18,680 image-document pairs) curated by human annotators.

What are the two tasks?
cVQA (culture-centric Visual Question Answering): 2,331 questions about cultural artifacts, with accuracy measured by exact match. cIC (culture-informed Image Captioning): 655 captions evaluated using RegionScore and CIDEr, requiring correct identification of cultural elements and their geographic/topical context.

Which models are evaluated?
We evaluate 17 state-of-the-art VLMs spanning the DeepSeek, Qwen2.5-VL, InternVL3, Gemma3, Phi, Pixtral, LLaVA, and Qwen3-VL families, plus GPT-4.1. We also benchmark 7 multimodal retrievers, including CLIP, SigLIP, and our fine-tuned Ravenea-SigLIP and Ravenea-CLIP.

Is the dataset publicly available?
Yes. The RAVENEA dataset is publicly available on Hugging Face at huggingface.co/datasets/jaagli/ravenea, and the code is available on GitHub.
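
A likely way to load the benchmark with the Hugging Face datasets library (the configuration and split names are not specified here, so check the dataset card):

from datasets import load_dataset

# Repository id as listed above; a configuration name may be required
# depending on how the dataset card organizes the cVQA and cIC subsets.
ravenea = load_dataset("jaagli/ravenea")
print(ravenea)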