Abstract
As vision-language models (VLMs) become increasingly integrated into daily life, the need for accurate visual culture understanding is becoming critical. Yet these models frequently fall short in interpreting cultural nuances effectively. Prior work has demonstrated the effectiveness of retrieval-augmented generation (RAG) in enhancing cultural understanding in text-only settings, but its application in multimodal scenarios remains underexplored. To bridge this gap, we introduce RAVENEA (Retrieval-Augmented Visual culturE uNdErstAnding), a new benchmark designed to advance visual culture understanding through retrieval, focusing on two tasks: culture-centric visual question answering (cVQA) and culture-informed image captioning (cIC). RAVENEA extends existing datasets by integrating 11,396 unique Wikipedia documents curated and ranked by human annotators. Through extensive evaluation of seven multimodal retrievers and seventeen VLMs, RAVENEA reveals key findings: (i) cultural grounding annotations enhance multimodal retrieval and the corresponding downstream tasks; (ii) VLMs augmented with culture-aware retrieval generally outperform their non-augmented counterparts (by an average of +6% on cVQA and +11% on cIC); and (iii) the performance of culture-aware retrieval augmentation varies widely across countries. These findings highlight the critical limitations of current multimodal retrievers and VLMs, and underscore the need for stronger retrieval-augmented approaches to visual culture understanding.
RAVENEA: Dataset & Approach
Cultural Dataset
The dataset comprises 11,396 unique Wikipedia documents, each annotated by human judges for country association, topic alignment, and explicit visual representation.
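For quick orientation, below is a minimal sketch of loading the dataset from the Hugging Face Hub (the dataset id comes from the FAQ below); the split and column names are assumptions and should be checked against the dataset card.

```python
# Minimal sketch: load RAVENEA from the Hugging Face Hub.
# The split and column names are assumptions; check the dataset card at
# huggingface.co/datasets/jaagli/ravenea for the actual schema.
from datasets import load_dataset

ds = load_dataset("jaagli/ravenea")

print(ds)                          # available splits and their sizes
first_split = next(iter(ds.values()))
print(first_split.column_names)    # actual field names are dataset-specific
print(first_split[0])              # one annotated image-document record
```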
Two Core Tasks
Two tasks are introduced: culture-centric Visual Question Answering (cVQA), with 2,331 questions, and culture-informed Image Captioning (cIC), with 655 captions.
Comprehensive Evaluation
17 VLMs and 7 multimodal retrievers are evaluated, with two retrievers fine-tuned specifically through culture-aware contrastive learning.
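As an illustration of the fine-tuning step, the sketch below shows standard CLIP-style contrastive training on (image, culture-relevant Wikipedia passage) pairs with the `transformers` library. It is not the exact Ravenea-SigLIP/Ravenea-CLIP recipe; the losses, negative sampling, and hyperparameters follow the paper.

```python
# Sketch of CLIP-style contrastive fine-tuning on (image, Wikipedia passage) pairs.
# Not the exact Ravenea-CLIP recipe; it only illustrates the symmetric InfoNCE
# objective that culture-aware contrastive fine-tuning builds on.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)

def training_step(images, passages):
    """One optimization step on a batch of culturally grounded image-text pairs."""
    batch = processor(text=passages, images=images, return_tensors="pt",
                      padding=True, truncation=True)
    outputs = model(**batch, return_loss=True)  # symmetric image-text InfoNCE loss
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()
```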
Key Findings
The impact of Ravenea-CLIP does not scale monotonically with model size: smaller models exhibit the most substantial gains.
Larger models may already internalize much of the relevant Wikipedia knowledge, making the retrieved context partially redundant.
Key Results (gains from Ravenea-CLIP augmentation over the no-RAG baseline):
- Small model (Qwen3-VL-2B): cVQA +31.6% | cIC +8.1%
- Large model (Qwen3-VL-32B): cVQA +2.9% | cIC no gain
VLMs show cultural disparities with and without RAG, with large inter-model variance highlighting model-specific biases.
Geographic Variability Insights:
- cVQA: Most VLMs perform worse on images from Nigeria and Indonesia, and on Spain-related inputs accuracy differs by up to 50% across models.
- cIC: VLMs consistently underperform on Mexico-related inputs while achieving the highest RegionScores on India-related inputs.
Cultural disparities in VLMs are evident across different regions and models.
Models benefit most from full culture-relevant annotations.
Annotation Ablation:
- 3 Annotation Questions: Using all three (country association, topic alignment, explicit visual representation) yields strongest performance.
Main Results
Performance with Different Retriever Models
Fine-tuned contrastive models consistently outperform their frozen counterparts (shown in gray). Ravenea-SigLIP and Ravenea-CLIP are SigLIP and CLIP models fine-tuned on RAVENEA, respectively.
| Method | MRR ↑ | P@1 ↑ | P@3 ↑ | P@5 ↑ | nDCG@1 ↑ | nDCG@3 ↑ | nDCG@5 ↑ |
|---|---|---|---|---|---|---|---|
| SigLIP | 68.62 | 54.66 | 37.47 | 32.92 | 61.22 | 63.82 | 71.44 |
| CLIP | 75.44 | 60.87 | 41.41 | 34.41 | 67.75 | 72.31 | 78.09 |
| ViB | 61.33 | 46.71 | 33.14 | 29.42 | 55.76 | 59.42 | 65.92 |
| VLT | 58.33 | 39.86 | 32.58 | 29.35 | 47.74 | 57.29 | 62.12 |
| LLaVA | 58.85 | 37.48 | 30.68 | 28.21 | 48.59 | 51.80 | 60.34 |
| Ravenea-SigLIP (ours) | 70.95 | 57.14 | 40.99 | 33.29 | 63.86 | 68.31 | 73.92 |
| Ravenea-CLIP (ours) | 82.17 | 72.05 | 45.76 | 36.77 | 77.08 | 78.96 | 84.09 |
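For reference, the sketch below shows one common way to compute the metrics reported in the table above (MRR, P@k, nDCG@k) from a ranked list of relevance labels for a single query; the paper's exact nDCG variant and relevance grading may differ, and corpus-level scores average these values over queries.

```python
# One common formulation of the retrieval metrics in the table above,
# computed per query from relevance labels in retrieved (ranked) order.
import math

def mrr(rels):
    """Reciprocal rank of the first relevant document (0 if none retrieved)."""
    return next((1.0 / i for i, r in enumerate(rels, 1) if r > 0), 0.0)

def precision_at_k(rels, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for r in rels[:k] if r > 0) / k

def ndcg_at_k(rels, k):
    """Normalized discounted cumulative gain over the top-k results."""
    dcg = lambda xs: sum(r / math.log2(i + 1) for i, r in enumerate(xs, 1))
    ideal = dcg(sorted(rels, reverse=True)[:k])
    return dcg(rels[:k]) / ideal if ideal > 0 else 0.0

# Relevance labels of the documents retrieved for one image query (1 = relevant).
ranked = [0, 1, 1, 0, 1]
print(mrr(ranked), precision_at_k(ranked, 3), ndcg_at_k(ranked, 5))
```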
cVQA and cIC Performance w/ and w/o RAG
Models in gray are frozen retrievers; VLMs augmented with the fine-tuned retrievers generally perform better. Avg is the mean over the 17 VLMs; all VLMs except GPT-4.1 (rightmost column) are open-source. Model abbreviations are expanded in the legend below the table.
| Retriever | Avg | DS-T | DS | Qw-3 | Qw-7 | Qw-72 | IV-2 | IV-8 | IV-78 | Gm-4 | Gm-27 | Phi | Pix | LLaVA | Q3-2 | Q3-8 | Q3-32 | GPT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| cVQA (Accuracy ↑) | | | | | | | | | | | | | | | | | | |
| W/O RAG | 67.1 | 49.8 | 68.9 | 62.7 | 52.6 | 83.3 | 64.1 | 71.3 | 84.2 | 66.5 | 79.4 | 42.6 | 73.2 | 62.7 | 37.8 | 73.7 | 81.3 | 86.6 |
| SigLIP | 71.0 | 48.8 | 76.1 | 63.2 | 67.5 | 82.3 | 71.3 | 73.7 | 86.1 | 67.5 | 77.5 | 64.6 | 75.1 | 44.5 | 67.9 | 75.6 | 83.3 | 82.8 |
| CLIP | 71.9 | 50.2 | 77.5 | 63.6 | 66.0 | 82.8 | 73.7 | 78.5 | 84.2 | 69.4 | 76.6 | 67.0 | 76.6 | 44.5 | 68.9 | 76.6 | 83.3 | 83.3 |
| ViB | 67.6 | 48.8 | 71.8 | 61.7 | 64.1 | 78.5 | 66.5 | 67.5 | 82.3 | 63.6 | 75.1 | 61.2 | 68.4 | 37.3 | 60.8 | 73.2 | 83.7 | 85.2 |
| VLT | 66.1 | 46.6 | 72.7 | 58.4 | 65.1 | 78.5 | 65.1 | 67.9 | 80.4 | 58.4 | 75.6 | 58.9 | 62.7 | 35.4 | 60.3 | 75.6 | 78.9 | 83.7 |
| LLaVA | 67.9 | 50.1 | 66.0 | 60.3 | 67.0 | 81.8 | 66.5 | 66.0 | 84.2 | 65.1 | 75.1 | 60.8 | 68.9 | 37.3 | 59.8 | 77.0 | 83.7 | 84.7 |
| Ravenea-SigLIP | 72.6 | 48.8 | 76.1 | 68.4 | 67.9 | 81.3 | 73.2 | 75.6 | 85.6 | 70.3 | 79.4 | 69.4 | 75.6 | 47.8 | 68.9 | 77.0 | 84.2 | 85.6 |
| Ravenea-CLIP | 73.3 | 50.7 | 75.1 | 64.1 | 69.9 | 82.3 | 76.1 | 81.3 | 85.2 | 69.9 | 79.9 | 69.4 | 75.1 | 50.2 | 69.4 | 77.5 | 84.2 | 86.1 |
| cIC (RegionScore ↑) | | | | | | | | | | | | | | | | | | |
| W/O RAG | 57.5 | 26.5 | 63.3 | 53.1 | 69.4 | 71.4 | 38.8 | 49.0 | 65.3 | 73.5 | 75.5 | 28.6 | 53.1 | 49.0 | 59.2 | 61.2 | 73.5 | 67.3 |
| SigLIP | 54.8 | 30.6 | 59.2 | 51.0 | 63.3 | 65.3 | 44.9 | 53.1 | 57.1 | 65.3 | 69.4 | 46.9 | 53.1 | 42.9 | 49.0 | 59.2 | 59.2 | 61.2 |
| CLIP | 65.1 | 38.8 | 69.4 | 59.2 | 71.4 | 69.4 | 65.3 | 65.3 | 71.4 | 73.5 | 73.5 | 61.2 | 69.4 | 51.0 | 63.3 | 67.3 | 71.4 | 65.3 |
| ViB | 57.4 | 36.7 | 57.1 | 53.1 | 61.2 | 69.4 | 44.9 | 59.2 | 61.2 | 63.3 | 69.4 | 42.9 | 57.1 | 44.9 | 55.1 | 65.3 | 69.4 | 65.3 |
| VLT | 57.0 | 34.7 | 61.2 | 55.1 | 65.3 | 73.5 | 46.9 | 53.1 | 61.2 | 61.2 | 67.3 | 44.9 | 59.2 | 38.8 | 53.1 | 63.3 | 67.3 | 63.3 |
| LLaVA | 56.8 | 38.8 | 63.3 | 51.0 | 63.3 | 69.4 | 44.9 | 57.1 | 59.2 | 65.3 | 71.4 | 40.8 | 55.1 | 46.9 | 49.0 | 61.2 | 67.3 | 61.2 |
| Ravenea-SigLIP | 60.6 | 30.6 | 73.5 | 57.1 | 67.3 | 71.4 | 44.9 | 59.2 | 69.4 | 67.3 | 69.4 | 57.1 | 61.2 | 40.8 | 53.1 | 65.3 | 71.4 | 71.4 |
| Ravenea-CLIP | 68.8 | 46.9 | 73.5 | 61.2 | 77.6 | 75.5 | 69.4 | 67.3 | 73.5 | 75.5 | 75.5 | 63.3 | 71.4 | 55.1 | 67.3 | 73.5 | 73.5 | 69.4 |
Legend: DS-T=DeepSeek-Tiny, DS=DeepSeek, Qw=Qwen2.5-VL, IV=InternVL3, Gm=Gemma3, Pix=Pixtral, Q3=Qwen3-VL, GPT=GPT-4.1
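A minimal sketch of the retrieval-augmented setup evaluated above: a CLIP-style retriever scores the candidate Wikipedia documents against the query image, and the top-k documents are prepended to the VLM prompt. The prompt wording, k, and document handling below are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of culture-aware retrieval augmentation for cVQA:
# rank candidate Wikipedia documents by image-text similarity with a
# CLIP-style retriever, then prepend the top-k documents to the VLM prompt.
import torch
from transformers import CLIPModel, CLIPProcessor

retriever = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def retrieve_top_k(image, documents, k=3):
    """Return the k candidate documents most similar to the query image."""
    # Note: CLIP's text encoder truncates long inputs (77 tokens), so in
    # practice long documents are typically chunked or summarized first.
    batch = processor(text=documents, images=image, return_tensors="pt",
                      padding=True, truncation=True)
    with torch.no_grad():
        sims = retriever(**batch).logits_per_image[0]    # (num_documents,)
    top = sims.topk(min(k, len(documents))).indices.tolist()
    return [documents[i] for i in top]

def build_cvqa_prompt(question, retrieved_docs):
    """Illustrative prompt template: retrieved cultural context, then the question."""
    context = "\n\n".join(retrieved_docs)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
```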
Resources
Citation
@inproceedings{li2026ravenea,
title={{RAVENEA}: A Benchmark for Multimodal Retrieval-Augmented Visual Culture Understanding},
author={Jiaang Li and Yifei Yuan and Wenyan Li and Mohammad Aliannejadi and Daniel Hershcovich and Anders S{\o}gaard and Ivan Vuli{\'c} and Wenxuan Zhang and Paul Pu Liang and Yang Deng and Serge Belongie},
booktitle={The Fourteenth International Conference on Learning Representations},
year={2026},
url={https://openreview.net/forum?id=4zAbkxQ23i}
}
FAQ
What is RAVENEA?
RAVENEA (Retrieval-Augmented Visual culturE uNdErstAnding) is a benchmark for multimodal retrieval-augmented visual culture understanding. It contains 1,868 images from 8 countries across 11 cultural categories, paired with 11,396 unique Wikipedia documents (18,680 image-document pairs) curated by human annotators.
What are the two tasks?
cVQA (culture-centric Visual Question Answering): 2,331 open-ended questions about cultural artifacts, with accuracy measured by exact match. cIC (culture-informed Image Captioning): 655 captions evaluated using the RegionScore and CIDEr metrics, requiring proper identification of cultural elements and their geographic/topical context.
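For the cVQA metric mentioned here, a minimal exact-match accuracy sketch is shown below; the benchmark's exact answer-normalization rules may differ, and the example answers are invented for illustration.

```python
# Sketch of cVQA scoring: exact-match accuracy after light normalization.
# The benchmark's exact normalization rules may differ.
def exact_match_accuracy(predictions, references):
    """Share of predictions that exactly match their reference answer."""
    norm = lambda s: " ".join(s.lower().strip().split())
    hits = sum(norm(p) == norm(r) for p, r in zip(predictions, references))
    return hits / len(references)

print(exact_match_accuracy(["Hanbok ", "taco"], ["hanbok", "tamale"]))  # 0.5
```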
Which models are evaluated?
We evaluate 17 state-of-the-art VLMs including Qwen3-VL (2B/8B/32B), InternVL3 (8B/38B), Gemma3 (4B/12B/27B), Llama-3.2-Vision (11B/90B), and GPT-4o. We also benchmark 7 multimodal retrievers including CLIP, SigLIP, Llama-CLIP, and our fine-tuned Ravenea-CLIP.
Is the dataset publicly available?
Yes! The RAVENEA dataset is publicly available on Hugging Face at huggingface.co/datasets/jaagli/ravenea. The code is also available on GitHub.