50.3CVMar 29Code
Streamlined Open-Vocabulary Human-Object Interaction DetectionChang Sun, Dongliang Liao, Changxing Ding
Open-vocabulary human-object interaction (HOI) detection aims to localize and recognize all human-object interactions in an image, including those unseen during training. Existing approaches usually rely on the collaboration between a conventional HOI detector and a Vision-Language Model (VLM) to recognize unseen HOI categories. However, feature fusion in this paradigm is challenging due to significant gaps in cross-model representations. To address this issue, we introduce SL-HOI, a StreamLined open-vocabulary HOI detection framework based solely on the powerful DINOv3 model. Our design leverages the complementary strengths of DINOv3's components: its backbone for fine-grained localization and its text-aligned vision head for open-vocabulary interaction classification. Moreover, to facilitate smooth cross-attention between the interaction queries and the vision head's output, we propose first feeding both the interaction queries and the backbone image tokens into the vision head, effectively bridging their representation gaps. All DINOv3 parameters in our approach are frozen, with only a small number of learnable parameters added, allowing a fast adaptation to the HOI detection task. Extensive experiments show that SL-HOI achieves state-of-the-art performance on both the SWiG-HOI and HICO-DET benchmarks, demonstrating the effectiveness of our streamlined model architecture. Code is available at https://github.com/MPI-Lab/SL-HOI.
53.6IRApr 19
HeadRank: Decoding-Free Passage Reranking via Preference-Aligned Attention HeadsJuyuan Wang, Chenxing Wang, Yuchen Fang et al.
Decoding-free reranking methods that read relevance signals directly from LLM attention weights offer significant latency advantages over autoregressive approaches, yet suffer from attention score homogenization: middle-context documents receive near-identical scores, destroying the fine-grained distinctions required for ranking. We propose HeadRank, a framework that lifts preference optimization from discrete token space into the continuous attention domain through entropy-regularized head selection, hard adjacent-level preference pairs, and a distribution regularizer that jointly sharpen discriminability in the homogenized middle zone. Depth truncation at the deepest selected layer further reduces inference to $\mathcal{O}(1)$ forward passes. Across 14 benchmarks on three Qwen3 scales (0.6B--4B) using only 211 training queries, HeadRank consistently outperforms generative and decoding-free baselines with 100\% formatting success. At 4B, 57.4\% of relevant middle-zone documents reach the top quartile versus 14.2\% for irrelevant ones -- a 43-percentage-point selectivity gap that demonstrates the effectiveness of attention-space preference alignment for listwise reranking.
IRFeb 7, 2022Code
Hybrid Contrastive Quantization for Efficient Cross-View Video RetrievalJinpeng Wang, Bin Chen, Dongliang Liao et al.
With the recent boom of video-based social platforms (e.g., YouTube and TikTok), video retrieval using sentence queries has become an important demand and attracts increasing research attention. Despite the decent performance, existing text-video retrieval models in vision and language communities are impractical for large-scale Web search because they adopt brute-force search based on high-dimensional embeddings. To improve efficiency, Web search engines widely apply vector compression libraries (e.g., FAISS) to post-process the learned embeddings. Unfortunately, separate compression from feature encoding degrades the robustness of representations and incurs performance decay. To pursue a better balance between performance and efficiency, we propose the first quantized representation learning method for cross-view video retrieval, namely Hybrid Contrastive Quantization (HCQ). Specifically, HCQ learns both coarse-grained and fine-grained quantizations with transformers, which provide complementary understandings for texts and videos and preserve comprehensive semantic information. By performing Asymmetric-Quantized Contrastive Learning (AQ-CL) across views, HCQ aligns texts and videos at coarse-grained and multiple fine-grained levels. This hybrid-grained learning strategy serves as strong supervision on the cross-view video quantization model, where contrastive learning at different levels can be mutually promoted. Extensive experiments on three Web video benchmark datasets demonstrate that HCQ achieves competitive performance with state-of-the-art non-compressed retrieval methods while showing high efficiency in storage and computation. Code and configurations are available at https://github.com/gimpong/WWW22-HCQ.
CVJun 23, 2024
Multi-Scale Temporal Difference Transformer for Video-Text RetrievalNi Wang, Dongliang Liao, Xing Xu
Currently, in the field of video-text retrieval, there are many transformer-based methods. Most of them usually stack frame features and regrade frames as tokens, then use transformers for video temporal modeling. However, they commonly neglect the inferior ability of the transformer modeling local temporal information. To tackle this problem, we propose a transformer variant named Multi-Scale Temporal Difference Transformer (MSTDT). MSTDT mainly addresses the defects of the traditional transformer which has limited ability to capture local temporal information. Besides, in order to better model the detailed dynamic information, we make use of the difference feature between frames, which practically reflects the dynamic movement of a video. We extract the inter-frame difference feature and integrate the difference and frame feature by the multi-scale temporal transformer. In general, our proposed MSTDT consists of a short-term multi-scale temporal difference transformer and a long-term temporal transformer. The former focuses on modeling local temporal information, the latter aims at modeling global temporal information. At last, we propose a new loss to narrow the distance of similar samples. Extensive experiments show that backbone, such as CLIP, with MSTDT has attained a new state-of-the-art result.
CLAug 13, 2020
Cognitive Representation Learning of Self-Media Online Article QualityYiru Wang, Shen Huang, Gongfu Li et al.
The automatic quality assessment of self-media online articles is an urgent and new issue, which is of great value to the online recommendation and search. Different from traditional and well-formed articles, self-media online articles are mainly created by users, which have the appearance characteristics of different text levels and multi-modal hybrid editing, along with the potential characteristics of diverse content, different styles, large semantic spans and good interactive experience requirements. To solve these challenges, we establish a joint model CoQAN in combination with the layout organization, writing characteristics and text semantics, designing different representation learning subnetworks, especially for the feature learning process and interactive reading habits on mobile terminals. It is more consistent with the cognitive style of expressing an expert's evaluation of articles. We have also constructed a large scale real-world assessment dataset. Extensive experimental results show that the proposed framework significantly outperforms state-of-the-art methods, and effectively learns and integrates different factors of the online article quality assessment.