Publications & Preprints
* denotes equal contribution
|
Thinking in Frames: How Visual Context and Test-Time Scaling Empower Video Reasoning
Chengzu Li*,
Zanyi Wang*,
Jiaang Li*,
Yi Xu,
Han Zhou,
Huanyu Zhang,
Ruichuan An,
Dengyang Jiang,
Zhaochong An,
Ivan Vulić,
Serge Belongie,
Anna Korhonen
arXiv 2026
code / data
TL;DR
The paper demonstrates that video generation isn't just for creating media—it's a powerful engine for visual planning and reasoning.
By "thinking in frames," models can better solve spatial and temporal puzzles that traditional VLMs struggle with.
|
|
RAVENEA: A Benchmark for Multimodal Retrieval-Augmented Visual Culture Understanding
Jiaang Li*,
Yifei Yuan*,
Wenyan Li,
Mohammad Aliannejadi,
Daniel Hershcovich,
Anders Søgaard,
Ivan Vulić,
Wenxuan Zhang,
Paul Pu Liang,
Yang Deng,
Serge Belongie
ICLR 2026
code / data
TL;DR
We present RAVENEA, a large-scale benchmark with 10K+ human-ranked Wikipedia
docs for culture-aware VL tasks.
We find retrieval boosts lightweight VLMs, showing the power of cultural
augmentation.
|
|
What if Othello-Playing Language Models Could See?
Xinyi Chen*,
Yifei Yuan*,
Jiaang Li,
Serge Belongie,
Maarten de Rijke,
Anders Søgaard
EMNLP 2025
TL;DR
The paper introduces VISOTHELLO, a multi-modal model that plays Othello using
both move sequences and board images. Compared to text-only models,
it predicts moves more accurately and learns more robust, structured
representations, suggesting visual grounding helps language models build
stronger world models.
|
|
ChatMotion: A Multimodal Multi-Agent for Human Motion Analysis
Lei Li,
Sen Jia,
Jiahao Wang,
Zhaochong An,
Jiaang Li,
Jenq-Neng Hwang,
Serge Belongie
arXiv
TL;DR
The paper introduces ChatMotion, a multimodal multi-agent framework for human motion
analysis that dynamically interprets user intent,
decomposes complex tasks into meta-tasks, and activates specialized function
modules for motion comprehension.
|
|
Do Vision and Language Models Share Concepts? A Vector Space Alignment Study
Jiaang Li,
Yova Kementchedjhieva,
Constanza Fierro,
Anders Søgaard
TACL (EMNLP 2024 oral)
code & data
TL;DR
Our experiments show that LMs partially converge towards representations
isomorphic to those of vision models,
subject to dispersion, polysemy, and frequency, which has important implications
for both multi-modal processing and the LM understanding debate.
|
|
FoodieQA: A Multimodal Dataset for Fine-Grained Understanding of Chinese Food Culture
Wenyan Li,
Xinyu Zhang,
Jiaang Li,
Qiwei Peng,
Raphael Tang,
Li Zhou,
Weijia Zhang,
Guimin Hu,
Yifei Yuan,
Anders Søgaard,
Daniel Hershcovich,
Desmond Elliott
EMNLP 2024
code / data
TL;DR
In this work, we introduce FoodieQA, a manually curated, fine-grained image-text
dataset capturing the intricate features of
food cultures across various regions in China, and evaluate vision-language
models (VLMs) and large language models (LLMs)
on newly collected, unseen food images and corresponding questions.
|
|
Understanding Retrieval Robustness for Retrieval-Augmented Image Captioning
Wenyan Li,
Jiaang Li,
Rita Ramos,
Raphael Tang,
Desmond Elliott
ACL 2024
code
TL;DR
We analyze the robustness of the retrieval-augmented captioning model SmallCap and
propose to train the model by sampling retrieved
captions from more diverse sets, which decreases the chance that the model
learns to copy majority tokens, and improves both
in-domain and cross-domain performance.
|
|
Structural Similarities Between Language Models and Neural Response Measurements
Jiaang Li*,
Antonia Karamolegkou*,
Yova Kementchedjhieva,
Mostafa Abdou,
Sune Lehmann,
Anders Søgaard
NeurReps @ NeurIPS 2023
code
TL;DR
This work shows that the larger neural language models get, the more their
representations are structurally similar to
neural response measurements from brain imaging.
|
|
Copyright Violations and Large Language Models
Antonia Karamolegkou*,
Jiaang Li*,
Li Zhou,
Anders Søgaard
EMNLP 2023
code
TL;DR
We explore the issue of copyright violations and large language models through
the lens of verbatim memorization,
focusing on possible redistribution of copyrighted text.
|
Services
- Reviewer: ICML 2026, ICLR 2026, ACL Rolling Review, NLLP Workshop 2023
|
This page design is based on a template from Jon Barron. Big thanks!
© Jiaang Li 2026
|
|