Zexin Zheng

CV
h-index11
4papers
5citations
Novelty51%
AI Score48

4 Papers

86.5IRMay 18
TIGER-FG: Text-Guided Implicit Fine-Grained Grounding for E-commerce Retrieval

Xinyu Sun, Huangyu Dai, Lingtao Mao et al.

E-commerce image search often takes a cropped image as the query, while each candidate is represented by full item images and structured text. This image-to-multimodal retrieval setting presents two asymmetries: a modality disparity -- a visual query must match image--text items, and a granularity disparity -- a cropped query must be compared with full images containing background context and possible distractors. Detection-based pipelines handle the granularity disparity through explicit localization but incur extra cost and error propagation, whereas CLIP-style encoders avoid detection, but are vulnerable to backgrounds or irrelevant items. To address these limitations, we propose TIGER-FG, a text-guided implicit fine-grained grounding framework for image-to-multimodal e-commerce retrieval. TIGER-FG uses item text as semantic guidance to produce target-focused item representations without object detection for retrieval. We further introduce dual distillation objectives that preserve target-region spatial consistency and query--item similarity structure, yielding more stable and discriminative multimodal representations. In addition, we construct ECom-RF-IMMR, a realistic benchmark suite with a 10M-pair training set and two evaluation benchmarks covering standard and cluttered item layouts. TIGER-FG improves Recall@1 over the strongest baseline by 6.1 and 34.4 percentage points on the two evaluation benchmarks, respectively, with only 85.7M query-side parameters and 256-dim embeddings. Results on public e-commerce benchmarks further demonstrate its generalization to noisy and one-to-many retrieval scenarios. Code and data will be released.

CVNov 20, 2025Code
UniDGF: A Unified Detection-to-Generation Framework for Hierarchical Object Visual Recognition

Xinyu Nan, Lingtao Mao, Huangyu Dai et al.

Achieving visual semantic understanding requires a unified framework that simultaneously handles object detection, category prediction, and attribute recognition. However, current advanced approaches rely on global similarity and struggle to capture fine-grained category distinctions and category-specific attribute diversity, especially in large-scale e-commerce scenarios. To overcome these challenges, we introduce a detection-guided generative framework that predicts hierarchical category and attribute tokens. For each detected object, we extract refined ROI-level features and employ a BART-based generator to produce semantic tokens in a coarse-to-fine sequence covering category hierarchies and property-value pairs, with support for property-conditioned attribute recognition. Experiments on both large-scale proprietary e-commerce datasets and open-source datasets demonstrate that our approach significantly outperforms existing similarity-based pipelines and multi-stage classification systems, achieving stronger fine-grained recognition and more coherent unified inference.

CVMar 29, 2025Code
Efficient Explicit Joint-level Interaction Modeling with Mamba for Text-guided HOI Generation

Guohong Huang, Ling-An Zeng, Zexin Zheng et al.

We propose a novel approach for generating text-guided human-object interactions (HOIs) that achieves explicit joint-level interaction modeling in a computationally efficient manner. Previous methods represent the entire human body as a single token, making it difficult to capture fine-grained joint-level interactions and resulting in unrealistic HOIs. However, treating each individual joint as a token would yield over twenty times more tokens, increasing computational overhead. To address these challenges, we introduce an Efficient Explicit Joint-level Interaction Model (EJIM). EJIM features a Dual-branch HOI Mamba that separately and efficiently models spatiotemporal HOI information, as well as a Dual-branch Condition Injector for integrating text semantics and object geometry into human and object motions. Furthermore, we design a Dynamic Interaction Block and a progressive masking mechanism to iteratively filter out irrelevant joints, ensuring accurate and nuanced interaction modeling. Extensive quantitative and qualitative evaluations on public datasets demonstrate that EJIM surpasses previous works by a large margin while using only 5\% of the inference time. Code is available \href{https://github.com/Huanggh531/EJIM}{here}.

CVOct 7, 2025
OneVision: An End-to-End Generative Framework for Multi-view E-commerce Vision Search

Zexin Zheng, Huangyu Dai, Lingtao Mao et al.

Traditional vision search, similar to search and recommendation systems, follows the multi-stage cascading architecture (MCA) paradigm to balance efficiency and conversion. Specifically, the query image undergoes feature extraction, recall, pre-ranking, and ranking stages, ultimately presenting the user with semantically similar products that meet their preferences. This multi-view representation discrepancy of the same object in the query and the optimization objective collide across these stages, making it difficult to achieve Pareto optimality in both user experience and conversion. In this paper, an end-to-end generative framework, OneVision, is proposed to address these problems. OneVision builds on VRQ, a vision-aligned residual quantization encoding, which can align the vastly different representations of an object across multiple viewpoints while preserving the distinctive features of each product as much as possible. Then a multi-stage semantic alignment scheme is adopted to maintain strong visual similarity priors while effectively incorporating user-specific information for personalized preference generation. In offline evaluations, OneVision performs on par with online MCA, while improving inference efficiency by 21% through dynamic pruning. In A/B tests, it achieves significant online improvements: +2.15% item CTR, +2.27% CVR, and +3.12% order volume. These results demonstrate that a semantic ID centric, generative architecture can unify retrieval and personalization while simplifying the serving pathway.