AI, LG | Nov 5, 2025

To See or To Read: User Behavior Reasoning in Multimodal LLMs

arXiv:2511.03845v1 | h-index: 9
Originality: Synthesis-oriented
AI Analysis

This work addresses modality trade-offs for MLLMs in user-behavior reasoning. The contribution is incremental: it benchmarks existing methods on new data representations.

The paper asks whether textual or image representations of user-behavior data are more effective for multimodal large language models (MLLMs) in reasoning tasks. It finds that image representations improved next-purchase prediction accuracy by 87.5% compared with textual representations, without extra computational cost.

Multimodal Large Language Models (MLLMs) are reshaping how modern agentic systems reason over sequential user-behavior data. However, whether textual or image representations of user-behavior data are more effective for maximizing MLLM performance remains underexplored. We present BehaviorLens, a systematic benchmarking framework for assessing modality trade-offs in user-behavior reasoning across six MLLMs by representing transaction data as (1) a text paragraph, (2) a scatter plot, and (3) a flowchart. Using a real-world purchase-sequence dataset, we find that when data is represented as images, MLLMs' next-purchase prediction accuracy is improved by 87.5% compared with an equivalent textual representation, without any additional computational cost.
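To make the idea of "equivalent representations" concrete, the sketch below renders one purchase sequence two ways: as a text paragraph and as flowchart source in Graphviz DOT. It is only an illustration under stated assumptions, not the BehaviorLens implementation. The `Purchase` record, both function names, the paragraph wording, and the sample history are all hypothetical, and the scatter-plot representation is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Purchase:
    """One hypothetical transaction: a day index and a product category."""
    day: int
    category: str


def to_text_paragraph(purchases):
    """Render the sequence as a single plain-text paragraph."""
    if not purchases:
        return "The user made no purchases."
    steps = [f"on day {p.day} they bought {p.category}" for p in purchases]
    return ("The user made the following purchases in order: "
            + "; ".join(steps) + ".")


def to_flowchart_dot(purchases):
    """Render the same sequence as Graphviz DOT text (a left-to-right chain).

    Assumes category names contain no double quotes; a real pipeline
    would need to escape them before building node labels.
    """
    lines = ["digraph purchases {", "  rankdir=LR;", "  node [shape=box];"]
    for i, p in enumerate(purchases):
        lines.append(f'  n{i} [label="day {p.day}: {p.category}"];')
    # One edge per consecutive pair keeps the purchase order explicit.
    for i in range(len(purchases) - 1):
        lines.append(f"  n{i} -> n{i + 1};")
    lines.append("}")
    return "\n".join(lines)


# Made-up sample history, used only to exercise the two renderers.
history = [Purchase(1, "shoes"), Purchase(3, "socks"), Purchase(7, "laces")]
text_view = to_text_paragraph(history)
dot_view = to_flowchart_dot(history)
```

Both outputs carry the same information. The DOT text can be turned into an image with the Graphviz command-line tool (for example `dot -Tpng`) before being passed to a model as a picture, while `text_view` would go into the prompt directly.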

