Jiaang Li

I'm an ELLIS PhD student at the Pioneer Centre for Artificial Intelligence. I'm honored to be advised by Prof. Serge Belongie at the University of Copenhagen and Ivan Vulić at the University of Cambridge . Previously I spent two wonderful years at CoAStaL, advised by Prof. Anders Søgaard, and received my Master's degree in Computer Science at the University of Copenhagen.

Github / BlueSky / Google Scholar / X / Email

Research

My interests revolve around the convergence of natural language processing and computer vision, with a focus on gaining insights from human cognition. I am enthusiastic about exploring language grounding within multimodal contexts and investigating the linguistic and cognitive characteristics of models.

News

Aug. 2025 - One paper is accepted to EMNLP 2025

Publications & Preprints
* denotes equal contribution

	What if Othello-Playing Language Models Could See? Xinyi Chen, Yifei Yuan, Jiaang Li, Serge Belongie, Maarten de Rijke, Anders Søgaard, EMNLP 2025 TL;DR The paper introduces VISOTHELLO, a multi-modal model that plays Othello using both move sequences and board images. Compared to text-only models, it predicts moves more accurately and learns more robust, structured representations, suggesting visual grounding helps language models build stronger world models.
	RAVENEA: A Benchmark for Multimodal Retrieval-Augmented Visual Culture Understanding Jiaang Li, Yifei Yuan, Wenyan Li, Mohammad Aliannejadi, Daniel Hershcovich, Anders Søgaard, Ivan Vulić, Wenxuan Zhang, Paul Pu Liang, Yang Deng, Serge Belongie arXiv code / data TL;DR We present RAVENEA, a large-scale benchmark with 10K+ human-ranked Wikipedia docs for culture-aware VL tasks. We find retrieval boosts lightweight VLMs, showing the power of cultural augmentation.
	ChatMotion: A Multimodal Multi-Agent for Human Motion Analysis Lei Li, Sen Jia, Jiahao Wang, Zhaochong An, Jiaang Li, Jenq-Neng Hwang, Serge Belongie arXiv TL;DR ChatMotion is introduced, a multimodal multi-agent framework for human motion analysis that dynamically interprets user intent, decomposes complex tasks into meta-tasks, and activates specialized function modules for motion comprehension.
	Do Vision and Language Models Share Concepts? A Vector Space Alignment Study Jiaang Li, Yova Kementchedjhieva, Constanza Fierro, Anders Søgaard TACL (EMNLP 2024 oral) code & data TL;DR Our experiments show that LMs partially converge towards representations isomorphic to those of vision models, subject to dispersion, polysemy, and frequency, which has important implications for both multi-modal processing and the LM understanding debate.
	FoodieQA: A Multimodal Dataset for Fine-Grained Understanding of Chinese Food Culture Wenyan Li, Xinyu Zhang, Jiaang Li, Qiwei Peng, Raphael Tang, Li Zhou, Weijia Zhang, Guimin Hu, Yifei Yuan, Anders Søgaard, Daniel Hershcovich, Desmond Elliottd EMNLP 2024 code / data TL;DR In this work, we introduce FoodieQA, a manually curated, fine-grained image-text dataset capturing the intricate features of food cultures across various regions in China, and evaluates vision-language Models (VLMs) and large language models (LLMs) on newly collected, unseen food images and corresponding questions.
	Understanding Retrieval Robustness for Retrieval-Augmented Image Captioning Wenyan Li, Jiaang Li, Rita Ramos, Raphael Tang, Desmond Elliottd ACL 2024 code TL;DR We analyze the robustness of a retrieval-augmented captioning model SmallCap and propose to train the model by sampling retrieved captions from more diverse sets, which decreases the chance that the model learns to copy majority tokens, and improves both in-domain and cross-domain performance.
	Exploring Visual Culture Awareness in GPT-4V: A Comprehensive Probing Yong Cao, Wenyan Li, Jiaang Li, Yifei Yuan, Daniel Hershcovich arXiv TL;DR We empirically show that GPT-4V excels at identifying cultural concepts but still exhibits weaker performance in low-resource languages, such as Tamil and Swahili, suggesting a promising solution for future visual cultural benchmark construction.
	Structural Similarities Between Language Models and Neural Response Measurements Jiaang Li, Antonia Karamolegkou, Yova Kementchedjhieva, Mostafa Abdou, Sune Lehmann, Anders Søgaard NeurReps @ NeurIPS 2023 code TL;DR This work shows that the larger neural language models get, the more their representations are structurally similar to neural response measurements from brain imaging.
	Copyright Violations and Large Language Models Antonia Karamolegkou, Jiaang Li, Li Zhou, Anders Søgaard EMNLP 2023 code TL;DR We explore the issue of copyright violations and large language models through the lens of verbatim memorization, focusing on possible redistribution of copyrighted text.
	PokemonChat: Auditing ChatGPT for Pokemon Universe Knowledge Laura Cabello, Jiaang Li, Ilias Chalkidis arXiv TL;DR We probe ChatGPT for its conversational understanding and introduces a conversational framework (protocol) that can be adopted in future studies to assess ChatGPT's ability to generalize, combine features, and to acquire and reason over newly introduced knowledge from human feedback.

Services

Reviewer: ACL Rolling Reviewer, NLLP workshop' 2023

This page design is based on a template from Jon Barron. Big thanks!