Na Zheng

AI
h-index 35
4 papers
15 citations
Novelty 51%
AI Score 40

4 Papers

AI · Jun 13, 2025
Mitigating Hallucination Through Theory-Consistent Symmetric Multimodal Preference Optimization

Wenqi Liu, Xuemeng Song, Jiaxi Li et al.

Direct Preference Optimization (DPO) has emerged as an effective approach for mitigating hallucination in Multimodal Large Language Models (MLLMs). Existing methods have made significant progress by using vision-oriented contrastive objectives to strengthen MLLMs' attention to visual inputs and thereby reduce hallucination, but they suffer from a non-rigorous optimization objective and indirect preference supervision. To address these limitations, we propose Symmetric Multimodal Preference Optimization (SymMPO), which conducts symmetric preference learning with direct preference supervision (i.e., response pairs) to enhance visual understanding while maintaining rigorous theoretical alignment with standard DPO. In addition to conventional ordinal preference learning, SymMPO introduces a preference margin consistency loss to quantitatively regulate the preference gap between symmetric preference pairs. Comprehensive evaluation across five benchmarks demonstrates SymMPO's superior performance, validating its effectiveness in mitigating hallucination in MLLMs.
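For readers unfamiliar with the baseline this abstract builds on, the following is a minimal sketch of the standard DPO loss for a single preference pair. It is not SymMPO itself: the symmetric pairing and the margin consistency loss are not specified in the abstract and are not reproduced here. The function names, argument names, and the default `beta` value are assumptions chosen for illustration.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed value; beta scales the implicit reward
) -> float:
    """Standard DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed log-probability of a response under
    either the trainable policy or the frozen reference model.
    """
    # Implicit rewards: scaled log-ratio of policy to reference.
    chosen_reward = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (policy_logp_rejected - ref_logp_rejected)
    # Loss is small when the chosen response has the higher implicit reward.
    return -log_sigmoid(chosen_reward - rejected_reward)
```

When the policy equals the reference model, both implicit rewards are zero and the loss is log 2; it falls below that value as the policy shifts probability toward the chosen response.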

CL · Dec 15, 2023
VK-G2T: Vision and Context Knowledge enhanced Gloss2Text

Liqiang Jing, Xuemeng Song, Xinxing Zu et al.

Existing sign language translation methods follow a two-stage pipeline: first converting the sign language video to a gloss sequence (i.e., Sign2Gloss) and then translating the generated gloss sequence into a spoken language sentence (i.e., Gloss2Text). While previous studies have focused on boosting the performance of the Sign2Gloss stage, we emphasize the optimization of the Gloss2Text stage. This task is non-trivial because of two distinct features of Gloss2Text: (1) isolated gloss input and (2) low-capacity gloss vocabulary. To address these issues, we propose a vision and context knowledge enhanced Gloss2Text model, named VK-G2T, which leverages the visual content of the sign language video to learn the properties of the target sentence and exploits context knowledge to facilitate the adaptive translation of gloss words. Extensive experiments conducted on a Chinese benchmark validate the superiority of our model.

CV · Oct 9, 2025
TTOM: Test-Time Optimization and Memorization for Compositional Video Generation

Leigang Qu, Ziyang Wang, Na Zheng et al.

Video Foundation Models (VFMs) exhibit remarkable visual generation performance, but struggle in compositional scenarios (e.g., motion, numeracy, and spatial relation). In this work, we introduce Test-Time Optimization and Memorization (TTOM), a training-free framework that aligns VFM outputs with spatiotemporal layouts during inference for better text-image alignment. Rather than intervening directly on latents or attention per sample, as in existing work, we integrate and optimize new parameters guided by a general layout-attention objective. Furthermore, we formulate video generation within a streaming setting and maintain historical optimization contexts with a parametric memory mechanism that supports flexible operations such as insert, read, update, and delete. Notably, we find that TTOM disentangles compositional world knowledge, showing powerful transferability and generalization. Experimental results on the T2V-CompBench and VBench benchmarks establish TTOM as an effective, practical, scalable, and efficient framework for achieving cross-modal alignment in compositional video generation on the fly.

IR · Jan 24, 2022
Dual Preference Distribution Learning for Item Recommendation

Xue Dong, Xuemeng Song, Na Zheng et al.

Recommender systems automatically recommend items that users are likely to enjoy. Their goal is to model user-item interaction by effectively representing users and items. Existing methods have primarily learned the user's preferences and the item's features as vectorized embeddings, and modeled the user's general preference for items through the interaction of these embeddings. In fact, users also have specific preferences for item attributes, and different preferences are usually related. Therefore, exploring fine-grained preferences and modeling the relationships among a user's different preferences could improve recommendation performance. Toward this end, we propose a dual preference distribution learning framework (DUPLE), which jointly learns a general preference distribution and a specific preference distribution for a given user, where the former corresponds to the user's general preference for items and the latter to the user's specific preference for item attributes. Notably, the mean vector of each Gaussian distribution captures the user's preferences, and the covariance matrix captures their relationships. Moreover, we can summarize a preferred attribute profile for each user, depicting his or her preferred item attributes. We can then explain each recommended item by checking the overlap between its attributes and the user's preferred attribute profile. Extensive quantitative and qualitative experiments on six public datasets demonstrate the effectiveness and explainability of the DUPLE method.
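To make the role of the mean vector and covariance matrix concrete, here is a small sketch that scores an item embedding by its log-density under a user's Gaussian preference distribution. Using the density as the preference score is one plausible reading of the abstract, not a confirmed detail of DUPLE. The function name, the restriction to two dimensions, and the example numbers are all assumptions made for illustration.

```python
import math


def gaussian_log_density_2d(x, mean, cov):
    """Log-density of a 2-D point under a Gaussian N(mean, cov).

    `mean` plays the role of the user's preference vector and `cov`
    encodes how the two preference dimensions relate to each other.
    """
    (a, b), (c, d) = cov
    det = a * d - b * c
    if a <= 0 or det <= 0:
        raise ValueError("covariance must be positive definite")
    dx0, dx1 = x[0] - mean[0], x[1] - mean[1]
    # Quadratic form (x - mean)^T cov^{-1} (x - mean) via the 2x2 inverse.
    quad = (d * dx0 * dx0 - (b + c) * dx0 * dx1 + a * dx1 * dx1) / det
    # For k = 2 dimensions the normalizer is (k / 2) * log(2 * pi) = log(2 * pi).
    return -0.5 * (quad + math.log(det)) - math.log(2 * math.pi)


# Hypothetical user whose two preference dimensions are positively correlated.
user_mean = (1.0, 1.0)
user_cov = ((1.0, 0.5), (0.5, 1.0))

near_item = (1.2, 1.1)   # close to the user's preference mean
far_item = (-1.0, 2.5)   # far from the mean, against the correlation
score_near = gaussian_log_density_2d(near_item, user_mean, user_cov)
score_far = gaussian_log_density_2d(far_item, user_mean, user_cov)
```

In this sketch an item near the mean and aligned with the correlation structure receives a higher score than one far from the mean, which is the behavior the abstract attributes to the mean and covariance.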