CVJul 30, 2023
Triple Correlations-Guided Label Supplementation for Unbiased Video Scene Graph GenerationWenqing Wang, Kaifeng Gao, Yawei Luo et al.
Video-based scene graph generation (VidSGG) is an approach that aims to represent video content in a dynamic graph by identifying visual entities and their relationships. Due to the inherently biased distribution and missing annotations in the training data, current VidSGG methods have been found to perform poorly on less-represented predicates. In this paper, we propose an explicit solution to address this under-explored issue by supplementing missing predicates that should be appear in the ground-truth annotations. Dubbed Trico, our method seeks to supplement the missing predicates by exploring three complementary spatio-temporal correlations. Guided by these correlations, the missing labels can be effectively supplemented thus achieving an unbiased predicate predictions. We validate the effectiveness of Trico on the most widely used VidSGG datasets, i.e., VidVRD and VidOR. Extensive experiments demonstrate the state-of-the-art performance achieved by Trico, particularly on those tail predicates.
CVAug 7, 2022
Label Semantic Knowledge Distillation for Unbiased Scene Graph GenerationLin Li, Long Chen, Hanrong Shi et al.
The Scene Graph Generation (SGG) task aims to detect all the objects and their pairwise visual relationships in a given image. Although SGG has achieved remarkable progress over the last few years, almost all existing SGG models follow the same training paradigm: they treat both object and predicate classification in SGG as a single-label classification problem, and the ground-truths are one-hot target labels. However, this prevalent training paradigm has overlooked two characteristics of current SGG datasets: 1) For positive samples, some specific subject-object instances may have multiple reasonable predicates. 2) For negative samples, there are numerous missing annotations. Regardless of the two characteristics, SGG models are easy to be confused and make wrong predictions. To this end, we propose a novel model-agnostic Label Semantic Knowledge Distillation (LS-KD) for unbiased SGG. Specifically, LS-KD dynamically generates a soft label for each subject-object instance by fusing a predicted Label Semantic Distribution (LSD) with its original one-hot target label. LSD reflects the correlations between this instance and multiple predicate categories. Meanwhile, we propose two different strategies to predict LSD: iterative self-KD and synchronous self-KD. Extensive ablations and results on three SGG tasks have attested to the superiority and generality of our proposed LS-KD, which can consistently achieve decent trade-off performance between different predicate categories.
CVJul 22, 2022
Rethinking the Reference-based Distinctive Image CaptioningYangjun Mao, Long Chen, Zhihong Jiang et al.
Distinctive Image Captioning (DIC) -- generating distinctive captions that describe the unique details of a target image -- has received considerable attention over the last few years. A recent DIC work proposes to generate distinctive captions by comparing the target image with a set of semantic-similar reference images, i.e., reference-based DIC (Ref-DIC). It aims to make the generated captions can tell apart the target and reference images. Unfortunately, reference images used by existing Ref-DIC works are easy to distinguish: these reference images only resemble the target image at scene-level and have few common objects, such that a Ref-DIC model can trivially generate distinctive captions even without considering the reference images. To ensure Ref-DIC models really perceive the unique objects (or attributes) in target images, we first propose two new Ref-DIC benchmarks. Specifically, we design a two-stage matching mechanism, which strictly controls the similarity between the target and reference images at object-/attribute- level (vs. scene-level). Secondly, to generate distinctive captions, we develop a strong Transformer-based Ref-DIC baseline, dubbed as TransDIC. It not only extracts visual features from the target image, but also encodes the differences between objects in the target and reference images. Finally, for more trustworthy benchmarking, we propose a new evaluation metric named DisCIDEr for Ref-DIC, which evaluates both the accuracy and distinctiveness of the generated captions. Experimental results demonstrate that our TransDIC can generate distinctive captions. Besides, it outperforms several state-of-the-art models on the two new benchmarks over different metrics.
CVAug 3, 2022
Rethinking the Evaluation of Unbiased Scene Graph GenerationXingchen Li, Long Chen, Jian Shao et al.
Current Scene Graph Generation (SGG) methods tend to predict frequent predicate categories and fail to recognize rare ones due to the severe imbalanced distribution of predicates. To improve the robustness of SGG models on different predicate categories, recent research has focused on unbiased SGG and adopted mean Recall@K (mR@K) as the main evaluation metric. However, we discovered two overlooked issues about this de facto standard metric, which makes current unbiased SGG evaluation vulnerable and unfair: 1) mR@K neglects the correlations among predicates and unintentionally breaks category independence when ranking all the triplet predictions together regardless of the predicate categories. 2) mR@K neglects the compositional diversity of different predicates and assigns excessively high weights to some oversimple category samples with limited composable relation triplet types. In addition, we investigate the under-explored correlation between objects and predicates, which can serve as a simple but strong baseline for unbiased SGG. In this paper, we refine mR@K and propose two complementary evaluation metrics for unbiased SGG: Independent Mean Recall (MR) and weighted IMR (wIMR). These two metrics are designed by considering the category independence and diversity of composable relation triplets, respectively. We compare the proposed metrics with the de facto standard metrics through extensive experiments and discuss the solutions to evaluate unbiased SGG in a more trustworthy way.
CVJul 20, 2022
Explicit Image Caption EditingZhen Wang, Long Chen, Wenbo Ma et al.
Given an image and a reference caption, the image caption editing task aims to correct the misalignment errors and generate a refined caption. However, all existing caption editing works are implicit models, ie, they directly produce the refined captions without explicit connections to the reference captions. In this paper, we introduce a new task: Explicit Caption Editing (ECE). ECE models explicitly generate a sequence of edit operations, and this edit operation sequence can translate the reference caption into a refined one. Compared to the implicit editing, ECE has multiple advantages: 1) Explainable: it can trace the whole editing path. 2) Editing Efficient: it only needs to modify a few words. 3) Human-like: it resembles the way that humans perform caption editing, and tries to keep original sentence structures. To solve this new task, we propose the first ECE model: TIger. TIger is a non-autoregressive transformer-based model, consisting of three modules: Tagger_del, Tagger_add, and Inserter. Specifically, Tagger_del decides whether each word should be preserved or not, Tagger_add decides where to add new words, and Inserter predicts the specific word for adding. To further facilitate ECE research, we propose two new ECE benchmarks by re-organizing two existing datasets, dubbed COCO-EE and Flickr30K-EE, respectively. Extensive ablations on both two benchmarks have demonstrated the effectiveness of TIger.
CVJun 25, 2023
Improving Reference-based Distinctive Image Captioning with Contrastive RewardsYangjun Mao, Jun Xiao, Dong Zhang et al.
Distinctive Image Captioning (DIC) -- generating distinctive captions that describe the unique details of a target image -- has received considerable attention over the last few years. A recent DIC method proposes to generate distinctive captions by comparing the target image with a set of semantic-similar reference images, i.e., reference-based DIC (Ref-DIC). It aims to force the generated captions to distinguish between the target image and the reference image. To ensure Ref-DIC models really perceive the unique objects (or attributes) in target images, we propose two new Ref-DIC benchmarks and develop a Transformer-based Ref-DIC baseline TransDIC. The model only extracts visual features from the target image, but also encodes the differences between objects in the target and reference images. Taking one step further, we propose a stronger TransDIC++, which consists of an extra contrastive learning module to make full use of the reference images. This new module is model-agnostic, which can be easily incorporated into various Ref-DIC architectures. Finally, for more trustworthy benchmarking, we propose a new evaluation metric named DisCIDEr for Ref-DIC, which evaluates both the accuracy and distinctiveness of the generated captions. Experimental results demonstrate that our TransDIC++ can generate distinctive captions. Besides, it outperforms several state-of-the-art models on the two new benchmarks over different metrics.
CVMar 11, 2023
Learning Combinatorial Prompts for Universal Controllable Image CaptioningZhen Wang, Jun Xiao, Yueting Zhuang et al.
Controllable Image Captioning (CIC) -- generating natural language descriptions about images under the guidance of given control signals -- is one of the most promising directions towards next-generation captioning systems. Till now, various kinds of control signals for CIC have been proposed, ranging from content-related control to structure-related control. However, due to the format and target gaps of different control signals, all existing CIC works (or architectures) only focus on one certain control signal, and overlook the human-like combinatorial ability. By ``combinatorial", we mean that our humans can easily meet multiple needs (or constraints) simultaneously when generating descriptions. To this end, we propose a novel prompt-based framework for CIC by learning Combinatorial Prompts, dubbed as ComPro. Specifically, we directly utilize a pretrained language model GPT-2 as our language model, which can help to bridge the gap between different signal-specific CIC architectures. Then, we reformulate the CIC as a prompt-guide sentence generation problem, and propose a new lightweight prompt generation network to generate the combinatorial prompts for different kinds of control signals. For different control signals, we further design a new mask attention mechanism to realize the prompt-based CIC. Due to its simplicity, our ComPro can be further extended to more kinds of combined control signals by concatenating these prompts. Extensive experiments on two prevalent CIC benchmarks have verified the effectiveness and efficiency of our ComPro on both single and combined control signals.
AIOct 2, 2022
Citation Trajectory Prediction via Publication Influence Representation Using Temporal Knowledge GraphChang Zong, Yueting Zhuang, Weiming Lu et al.
Predicting the impact of publications in science and technology has become an important research area, which is useful in various real world scenarios such as technology investment, research direction selection, and technology policymaking. Citation trajectory prediction is one of the most popular tasks in this area. Existing approaches mainly rely on mining temporal and graph data from academic articles. Some recent methods are capable of handling cold-start prediction by aggregating metadata features of new publications. However, the implicit factors causing citations and the richer information from handling temporal and attribute features still need to be explored. In this paper, we propose CTPIR, a new citation trajectory prediction framework that is able to represent the influence (the momentum of citation) of either new or existing publications using the history information of all their attributes. Our framework is composed of three modules: difference-preserved graph embedding, fine-grained influence representation, and learning-based trajectory calculation. To test the effectiveness of our framework in more situations, we collect and construct a new temporal knowledge graph dataset from the real world, named AIPatent, which stems from global patents in the field of artificial intelligence. Experiments are conducted on both the APS academic dataset and our contributed AIPatent dataset. The results demonstrate the strengths of our approach in the citation trajectory prediction task.
CLSep 3, 2024
S^3cMath: Spontaneous Step-level Self-correction Makes Large Language Models Better Mathematical ReasonersYuchen Yan, Jin Jiang, Yang Liu et al.
Self-correction is a novel method that can stimulate the potential reasoning abilities of large language models (LLMs). It involves detecting and correcting errors during the inference process when LLMs solve reasoning problems. However, recent works do not regard self-correction as a spontaneous and intrinsic capability of LLMs. Instead, such correction is achieved through post-hoc generation, external knowledge introduction, multi-model collaboration, and similar techniques. In this paper, we propose a series of mathematical LLMs called S$^3$c-Math, which are able to perform Spontaneous Step-level Self-correction for Mathematical reasoning. This capability helps LLMs to recognize whether their ongoing inference tends to contain errors and simultaneously correct these errors to produce a more reliable response. We proposed a method, which employs a step-level sampling approach to construct step-wise self-correction data for achieving such ability. Additionally, we implement a training strategy that uses above constructed data to equip LLMs with spontaneous step-level self-correction capacities. Our data and methods have been demonstrated to be effective across various foundation LLMs, consistently showing significant progress in evaluations on GSM8K, MATH, and other mathematical benchmarks. To the best of our knowledge, we are the first to introduce the spontaneous step-level self-correction ability of LLMs in mathematical reasoning.
CLFeb 6
InftyThink+: Effective and Efficient Infinite-Horizon Reasoning via Reinforcement LearningYuchen Yan, Liang Jiang, Jin Jiang et al.
Large reasoning models achieve strong performance by scaling inference-time chain-of-thought, but this paradigm suffers from quadratic cost, context length limits, and degraded reasoning due to lost-in-the-middle effects. Iterative reasoning mitigates these issues by periodically summarizing intermediate thoughts, yet existing methods rely on supervised learning or fixed heuristics and fail to optimize when to summarize, what to preserve, and how to resume reasoning. We propose InftyThink+, an end-to-end reinforcement learning framework that optimizes the entire iterative reasoning trajectory, building on model-controlled iteration boundaries and explicit summarization. InftyThink+ adopts a two-stage training scheme with supervised cold-start followed by trajectory-level reinforcement learning, enabling the model to learn strategic summarization and continuation decisions. Experiments on DeepSeek-R1-Distill-Qwen-1.5B show that InftyThink+ improves accuracy by 21% on AIME24 and outperforms conventional long chain-of-thought reinforcement learning by a clear margin, while also generalizing better to out-of-distribution benchmarks. Moreover, InftyThink+ significantly reduces inference latency and accelerates reinforcement learning training, demonstrating improved reasoning efficiency alongside stronger performance.
CVDec 8, 2021Code
Classification-Then-Grounding: Reformulating Video Scene Graphs as Temporal Bipartite GraphsKaifeng Gao, Long Chen, Yulei Niu et al.
Today's VidSGG models are all proposal-based methods, i.e., they first generate numerous paired subject-object snippets as proposals, and then conduct predicate classification for each proposal. In this paper, we argue that this prevalent proposal-based framework has three inherent drawbacks: 1) The ground-truth predicate labels for proposals are partially correct. 2) They break the high-order relations among different predicate instances of a same subject-object pair. 3) VidSGG performance is upper-bounded by the quality of the proposals. To this end, we propose a new classification-then-grounding framework for VidSGG, which can avoid all the three overlooked drawbacks. Meanwhile, under this framework, we reformulate the video scene graphs as temporal bipartite graphs, where the entities and predicates are two types of nodes with time slots, and the edges denote different semantic roles between these nodes. This formulation takes full advantage of our new framework. Accordingly, we further propose a novel BIpartite Graph based SGG model: BIG. It consists of a classification stage and a grounding stage, where the former aims to classify the categories of all the nodes and the edges, and the latter tries to localize the temporal location of each relation instance. Extensive ablations on two VidSGG datasets have attested to the effectiveness of our framework and BIG. Code is available at https://github.com/Dawn-LX/VidSGG-BIG.
CLFeb 22, 2024
Triad: A Framework Leveraging a Multi-Role LLM-based Agent to Solve Knowledge Base Question AnsweringChang Zong, Yuchen Yan, Weiming Lu et al.
Recent progress with LLM-based agents has shown promising results across various tasks. However, their use in answering questions from knowledge bases remains largely unexplored. Implementing a KBQA system using traditional methods is challenging due to the shortage of task-specific training data and the complexity of creating task-focused model structures. In this paper, we present Triad, a unified framework that utilizes an LLM-based agent with three roles for KBQA tasks. The agent is assigned three roles to tackle different KBQA subtasks: agent as a generalist for mastering various subtasks, as a decision maker for the selection of candidates, and as an advisor for answering questions with knowledge. Our KBQA framework is executed in four phases, involving the collaboration of the agent's multiple roles. We evaluated the performance of our framework using three benchmark datasets, and the results show that our framework outperforms state-of-the-art systems on the LC-QuAD and YAGO-QA benchmarks, yielding F1 scores of 11.8% and 20.7%, respectively.
CLMar 9, 2025
InftyThink: Breaking the Length Limits of Long-Context Reasoning in Large Language ModelsYuchen Yan, Yongliang Shen, Yang Liu et al.
Advanced reasoning in large language models has achieved remarkable performance on challenging tasks, but the prevailing long-context reasoning paradigm faces critical limitations: quadratic computational scaling with sequence length, reasoning constrained by maximum context boundaries, and performance degradation beyond pre-training context windows. Existing approaches primarily compress reasoning chains without addressing the fundamental scaling problem. To overcome these challenges, we introduce InftyThink, a paradigm that transforms monolithic reasoning into an iterative process with intermediate summarization. By interleaving short reasoning segments with concise progress summaries, our approach enables unbounded reasoning depth while maintaining bounded computational costs. This creates a characteristic sawtooth memory pattern that significantly reduces computational complexity compared to traditional approaches. Furthermore, we develop a methodology for reconstructing long-context reasoning datasets into our iterative format, transforming OpenR1-Math into 333K training instances. Experiments across multiple model architectures demonstrate that our approach reduces computational costs while improving performance, with Qwen2.5-Math-7B showing 3-13% improvements across MATH500, AIME24, and GPQA_diamond benchmarks. Our work challenges the assumed trade-off between reasoning depth and computational efficiency, providing a more scalable approach to complex reasoning without architectural modifications.
HCMar 27, 2025
A Survey on (M)LLM-Based GUI AgentsFei Tang, Haolei Xu, Hang Zhang et al.
Graphical User Interface (GUI) Agents have emerged as a transformative paradigm in human-computer interaction, evolving from rule-based automation scripts to sophisticated AI-driven systems capable of understanding and executing complex interface operations. This survey provides a comprehensive examination of the rapidly advancing field of LLM-based GUI Agents, systematically analyzing their architectural foundations, technical components, and evaluation methodologies. We identify and analyze four fundamental components that constitute modern GUI Agents: (1) perception systems that integrate text-based parsing with multimodal understanding for comprehensive interface comprehension; (2) exploration mechanisms that construct and maintain knowledge bases through internal modeling, historical experience, and external information retrieval; (3) planning frameworks that leverage advanced reasoning methodologies for task decomposition and execution; and (4) interaction systems that manage action generation with robust safety controls. Through rigorous analysis of these components, we reveal how recent advances in large language models and multimodal learning have revolutionized GUI automation across desktop, mobile, and web platforms. We critically examine current evaluation frameworks, highlighting methodological limitations in existing benchmarks while proposing directions for standardization. This survey also identifies key technical challenges, including accurate element localization, effective knowledge retrieval, long-horizon planning, and safety-aware execution control, while outlining promising research directions for enhancing GUI Agents' capabilities. Our systematic review provides researchers and practitioners with a thorough understanding of the field's current state and offers insights into future developments in intelligent interface automation.
CLMay 20, 2025
Let LRMs Break Free from Overthinking via Self-Braking TuningHaoran Zhao, Yuchen Yan, Yongliang Shen et al.
Large reasoning models (LRMs), such as OpenAI o1 and DeepSeek-R1, have significantly enhanced their reasoning capabilities by generating longer chains of thought, demonstrating outstanding performance across a variety of tasks. However, this performance gain comes at the cost of a substantial increase in redundant reasoning during the generation process, leading to high computational overhead and exacerbating the issue of overthinking. Although numerous existing approaches aim to address the problem of overthinking, they often rely on external interventions. In this paper, we propose a novel framework, Self-Braking Tuning (SBT), which tackles overthinking from the perspective of allowing the model to regulate its own reasoning process, thus eliminating the reliance on external control mechanisms. We construct a set of overthinking identification metrics based on standard answers and design a systematic method to detect redundant reasoning. This method accurately identifies unnecessary steps within the reasoning trajectory and generates training signals for learning self-regulation behaviors. Building on this foundation, we develop a complete strategy for constructing data with adaptive reasoning lengths and introduce an innovative braking prompt mechanism that enables the model to naturally learn when to terminate reasoning at an appropriate point. Experiments across mathematical benchmarks (AIME, AMC, MATH500, GSM8K) demonstrate that our method reduces token consumption by up to 60% while maintaining comparable accuracy to unconstrained models.
CLMay 21, 2025
VerifyBench: Benchmarking Reference-based Reward Systems for Large Language ModelsYuchen Yan, Jin Jiang, Zhenbang Ren et al.
Large reasoning models such as OpenAI o1 and DeepSeek-R1 have achieved remarkable performance in the domain of reasoning. A key component of their training is the incorporation of verifiable rewards within reinforcement learning (RL). However, existing reward benchmarks do not evaluate reference-based reward systems, leaving researchers with limited understanding of the accuracy of verifiers used in RL. In this paper, we introduce two benchmarks, VerifyBench and VerifyBench-Hard, designed to assess the performance of reference-based reward systems. These benchmarks are constructed through meticulous data collection and curation, followed by careful human annotation to ensure high quality. Current models still show considerable room for improvement on both VerifyBench and VerifyBench-Hard, especially smaller-scale models. Furthermore, we conduct a thorough and comprehensive analysis of evaluation results, offering insights for understanding and developing reference-based reward systems. Our proposed benchmarks serve as effective tools for guiding the development of verifier accuracy and the reasoning capabilities of models trained via RL in reasoning tasks.
AIJul 21, 2025
LAPO: Internalizing Reasoning Efficiency via Length-Adaptive Policy OptimizationXingyu Wu, Yuchen Yan, Shangke Lyu et al.
Large reasoning models have achieved remarkable performance through extended chain-of-thought sequences, yet this computational freedom leads to excessive token generation even for simple problems. We present Length-Adaptive Policy Optimization (LAPO), a novel framework that transforms reasoning length control from an external constraint into an intrinsic model capability. Unlike existing approaches that impose rigid limits or rely on post-hoc interventions, LAPO enables models to internalize an understanding of appropriate reasoning depth through a two-stage reinforcement learning process. In the first stage, models learn natural reasoning patterns by discovering the statistical distribution of successful solution lengths. The second stage leverages these patterns as meta-cognitive guidance, embedding them directly within the model's reasoning context to ensure inference-time flexibility. Experiments on mathematical reasoning benchmarks demonstrate that LAPO reduces token usage by up to 40.9% while improving accuracy by 2.3%. Our analysis reveals that models trained with LAPO develop emergent abilities to allocate computational resources based on problem complexity, achieving efficient reasoning without sacrificing quality.
CLFeb 17, 2025
MathFimer: Enhancing Mathematical Reasoning by Expanding Reasoning Steps through Fill-in-the-Middle TaskYuchen Yan, Yongliang Shen, Yang Liu et al.
Mathematical reasoning represents a critical frontier in advancing large language models (LLMs). While step-by-step approaches have emerged as the dominant paradigm for mathematical problem-solving in LLMs, the quality of reasoning steps in training data fundamentally constrains the performance of the models. Recent studies has demonstrated that more detailed intermediate steps can enhance model performance, yet existing methods for step expansion either require more powerful external models or incur substantial computational costs. In this paper, we introduce MathFimer, a novel framework for mathematical reasoning step expansion inspired by the "Fill-in-the-middle" task from code completion. By decomposing solution chains into prefix-suffix pairs and training models to reconstruct missing intermediate steps, we develop a specialized model, MathFimer-7B, on our carefully curated NuminaMath-FIM dataset. We then apply these models to enhance existing mathematical reasoning datasets by inserting detailed intermediate steps into their solution chains, creating MathFimer-expanded versions. Through comprehensive experiments on multiple mathematical reasoning datasets, including MathInstruct, MetaMathQA and etc., we demonstrate that models trained on MathFimer-expanded data consistently outperform their counterparts trained on original data across various benchmarks such as GSM8K and MATH. Our approach offers a practical, scalable solution for enhancing mathematical reasoning capabilities in LLMs without relying on powerful external models or expensive inference procedures.
LGMay 13, 2025
Structural-Temporal Coupling Anomaly Detection with Dynamic Graph TransformerChang Zong, Yueting Zhuang, Jian Shao et al.
Detecting anomalous edges in dynamic graphs is an important task in many applications over evolving triple-based data, such as social networks, transaction management, and epidemiology. A major challenge with this task is the absence of structural-temporal coupling information, which decreases the ability of the representation to distinguish anomalies from normal instances. Existing methods focus on handling independent structural and temporal features with embedding models, which ignore the deep interaction between these two types of information. In this paper, we propose a structural-temporal coupling anomaly detection architecture with a dynamic graph transformer model. Specifically, we introduce structural and temporal features from two integration levels to provide anomaly-aware graph evolutionary patterns. Then, a dynamic graph transformer enhanced by two-dimensional positional encoding is implemented to capture both discrimination and contextual consistency signals. Extensive experiments on six datasets demonstrate that our method outperforms current state-of-the-art models. Finally, a case study illustrates the strength of our method when applied to a real-world task.
CLMar 14, 2024
ProSwitch: Knowledge-Guided Instruction Tuning to Switch Between Professional and Non-Professional ResponsesChang Zong, Yuyan Chen, Weiming Lu et al.
Large Language Models (LLMs) have demonstrated efficacy in various linguistic applications, including question answering and controlled text generation. However, studies into their ability to switch between opposite styles of responses in professional domains remain underexplored. This study introduces a novel approach, named ProSwitch, which enables a language model to switch between professional and non-professional answers, by tuning and evaluating through the guidance of domain and style knowledge. ProSwitch unfolds in three phases: LLM-augmented preparation to collect domain knowledge and QA pairs, instruction tuning to optimize LLMs with multiple levels of knowledge, and comprehensive evaluation to assess both style discrimination and reference-based quality of the generated text. Comparative analysis of ProSwitch against general and specialized LLMs reveals that our approach outperforms baselines in switching between professional and non-professional responses.
CVMay 21, 2023
Zero-shot Visual Relation Detection via Composite Visual Cues from Large Language ModelsLin Li, Jun Xiao, Guikun Chen et al.
Pretrained vision-language models, such as CLIP, have demonstrated strong generalization capabilities, making them promising tools in the realm of zero-shot visual recognition. Visual relation detection (VRD) is a typical task that identifies relationship (or interaction) types between object pairs within an image. However, naively utilizing CLIP with prevalent class-based prompts for zero-shot VRD has several weaknesses, e.g., it struggles to distinguish between different fine-grained relation types and it neglects essential spatial information of two objects. To this end, we propose a novel method for zero-shot VRD: RECODE, which solves RElation detection via COmposite DEscription prompts. Specifically, RECODE first decomposes each predicate category into subject, object, and spatial components. Then, it leverages large language models (LLMs) to generate description-based prompts (or visual cues) for each component. Different visual cues enhance the discriminability of similar relation categories from different perspectives, which significantly boosts performance in VRD. To dynamically fuse different cues, we further introduce a chain-of-thought method that prompts LLMs to generate reasonable weights for different visual cues. Extensive experiments on four VRD benchmarks have demonstrated the effectiveness and interpretability of RECODE.
CVMay 19, 2023
TreePrompt: Learning to Compose Tree Prompts for Explainable Visual GroundingChenchi Zhang, Jun Xiao, Lei Chen et al.
Prompt tuning has achieved great success in transferring the knowledge from large pretrained vision-language models into downstream tasks, and has dominated the performance on visual grounding (VG). However, almost all existing prompt tuning paradigms suffer from poor interpretability. In this paper, we argue that their poor interpretability is attributed to the holistic prompt generation and inference process. By "holistic", we mean that they usually directly learn a set of vectors as the prompt (i.e., prompt generation), and use the learned global prompt to augment the textual input for the VG model (i.e., prompt inference). To this end, we propose a new prompt construction paradigm with explicit explainable ability, named TreePrompt. Specifically, we first deconstruct a complex sentence into a tree, that is consistent with human reasoning. Then, following the syntax tree, we compose a structured prompt in a bottom-up manner. Thanks to this step-by-step prompt construction process, each intermediate prompt (i.e., tree node) permits us to understand the reasoning process. Extensive ablations on various backbones and benchmarks consistently demonstrate the effectiveness and interpretability of our TreePrompt.
CVSep 22, 2021
Natural Language Video Localization with Learnable Moment ProposalsShaoning Xiao, Long Chen, Jian Shao et al.
Given an untrimmed video and a natural language query, Natural Language Video Localization (NLVL) aims to identify the video moment described by the query. To address this task, existing methods can be roughly grouped into two groups: 1) propose-and-rank models first define a set of hand-designed moment candidates and then find out the best-matching one. 2) proposal-free models directly predict two temporal boundaries of the referential moment from frames. Currently, almost all the propose-and-rank methods have inferior performance than proposal-free counterparts. In this paper, we argue that propose-and-rank approach is underestimated due to the predefined manners: 1) Hand-designed rules are hard to guarantee the complete coverage of targeted segments. 2) Densely sampled candidate moments cause redundant computation and degrade the performance of ranking process. To this end, we propose a novel model termed LPNet (Learnable Proposal Network for NLVL) with a fixed set of learnable moment proposals. The position and length of these proposals are dynamically adjusted during training process. Moreover, a boundary-aware loss has been proposed to leverage frame-level information and further improve the performance. Extensive ablations on two challenging NLVL benchmarks have demonstrated the effectiveness of LPNet over existing state-of-the-art methods.
LGSep 3, 2021
Instance-wise or Class-wise? A Tale of Neighbor Shapley for Concept-based ExplanationJiahui Li, Kun Kuang, Lin Li et al.
Deep neural networks have demonstrated remarkable performance in many data-driven and prediction-oriented applications, and sometimes even perform better than humans. However, their most significant drawback is the lack of interpretability, which makes them less attractive in many real-world applications. When relating to the moral problem or the environmental factors that are uncertain such as crime judgment, financial analysis, and medical diagnosis, it is essential to mine the evidence for the model's prediction (interpret model knowledge) to convince humans. Thus, investigating how to interpret model knowledge is of paramount importance for both academic research and real applications.
CLJun 21, 2021
Empower Distantly Supervised Relation Extraction with Collaborative Adversarial TrainingTao Chen, Haochen Shi, Liyuan Liu et al.
With recent advances in distantly supervised (DS) relation extraction (RE), considerable attention is attracted to leverage multi-instance learning (MIL) to distill high-quality supervision from the noisy DS. Here, we go beyond label noise and identify the key bottleneck of DS-MIL to be its low data utilization: as high-quality supervision being refined by MIL, MIL abandons a large amount of training instances, which leads to a low data utilization and hinders model training from having abundant supervision. In this paper, we propose collaborative adversarial training to improve the data utilization, which coordinates virtual adversarial training (VAT) and adversarial training (AT) at different levels. Specifically, since VAT is label-free, we employ the instance-level VAT to recycle instances abandoned by MIL. Besides, we deploy AT at the bag-level to unleash the full potential of the high-quality supervision got by MIL. Our proposed method brings consistent improvements (~ 5 absolute AUC score) to the previous state of the art, which verifies the importance of the data utilization issue and the effectiveness of our method.
CVMay 26, 2021
Deep Learning for Weakly-Supervised Object Detection and Object Localization: A SurveyFeifei Shao, Long Chen, Jian Shao et al.
Weakly-Supervised Object Detection (WSOD) and Localization (WSOL), i.e., detecting multiple and single instances with bounding boxes in an image using image-level labels, are long-standing and challenging tasks in the CV community. With the success of deep neural networks in object detection, both WSOD and WSOL have received unprecedented attention. Hundreds of WSOD and WSOL methods and numerous techniques have been proposed in the deep learning era. To this end, in this paper, we consider WSOL is a sub-task of WSOD and provide a comprehensive survey of the recent achievements of WSOD. Specifically, we firstly describe the formulation and setting of the WSOD, including the background, challenges, basic framework. Meanwhile, we summarize and analyze all advanced techniques and training tricks for improving detection performance. Then, we introduce the widely-used datasets and evaluation metrics of WSOD. Lastly, we discuss the future directions of WSOD. We believe that these summaries can help pave a way for future research on WSOD and WSOL.
CVMay 12, 2021
VL-NMS: Breaking Proposal Bottlenecks in Two-Stage Visual-Language MatchingChenchi Zhang, Wenbo Ma, Jun Xiao et al.
The prevailing framework for matching multimodal inputs is based on a two-stage process: 1) detecting proposals with an object detector and 2) matching text queries with proposals. Existing two-stage solutions mostly focus on the matching step. In this paper, we argue that these methods overlook an obvious \emph{mismatch} between the roles of proposals in the two stages: they generate proposals solely based on the detection confidence (i.e., query-agnostic), hoping that the proposals contain all instances mentioned in the text query (i.e., query-aware). Due to this mismatch, chances are that proposals relevant to the text query are suppressed during the filtering process, which in turn bounds the matching performance. To this end, we propose VL-NMS, which is the first method to yield query-aware proposals at the first stage. VL-NMS regards all mentioned instances as critical objects, and introduces a lightweight module to predict a score for aligning each proposal with a critical object. These scores can guide the NMS operation to filter out proposals irrelevant to the text query, increasing the recall of critical objects, resulting in a significantly improved matching performance. Since VL-NMS is agnostic to the matching step, it can be easily integrated into any state-of-the-art two-stage matching methods. We validate the effectiveness of VL-NMS on two multimodal matching tasks, namely referring expression grounding and image-text matching. Extensive ablation studies on several baselines and benchmarks consistently demonstrate the superiority of VL-NMS.
CVMar 15, 2021
Boundary Proposal Network for Two-Stage Natural Language Video LocalizationShaoning Xiao, Long Chen, Songyang Zhang et al.
We aim to address the problem of Natural Language Video Localization (NLVL)-localizing the video segment corresponding to a natural language description in a long and untrimmed video. State-of-the-art NLVL methods are almost in one-stage fashion, which can be typically grouped into two categories: 1) anchor-based approach: it first pre-defines a series of video segment candidates (e.g., by sliding window), and then does classification for each candidate; 2) anchor-free approach: it directly predicts the probabilities for each video frame as a boundary or intermediate frame inside the positive segment. However, both kinds of one-stage approaches have inherent drawbacks: the anchor-based approach is susceptible to the heuristic rules, further limiting the capability of handling videos with variant length. While the anchor-free approach fails to exploit the segment-level interaction thus achieving inferior results. In this paper, we propose a novel Boundary Proposal Network (BPNet), a universal two-stage framework that gets rid of the issues mentioned above. Specifically, in the first stage, BPNet utilizes an anchor-free model to generate a group of high-quality candidate video segments with their boundaries. In the second stage, a visual-language fusion layer is proposed to jointly model the multi-modal interaction between the candidate and the language query, followed by a matching score rating layer that outputs the alignment score for each candidate. We evaluate our BPNet on three challenging NLVL benchmarks (i.e., Charades-STA, TACoS and ActivityNet-Captions). Extensive experiments and ablative studies on these datasets demonstrate that the BPNet outperforms the state-of-the-art methods.
CVNov 17, 2016
SCA-CNN: Spatial and Channel-wise Attention in Convolutional Networks for Image CaptioningLong Chen, Hanwang Zhang, Jun Xiao et al.
Visual attention has been successfully applied in structural prediction tasks such as visual captioning and question answering. Existing visual attention models are generally spatial, i.e., the attention is modeled as spatial probabilities that re-weight the last conv-layer feature map of a CNN encoding an input image. However, we argue that such spatial attention does not necessarily conform to the attention mechanism --- a dynamic feature extractor that combines contextual fixations over time, as CNN features are naturally spatial, channel-wise and multi-layer. In this paper, we introduce a novel convolutional neural network dubbed SCA-CNN that incorporates Spatial and Channel-wise Attentions in a CNN. In the task of image captioning, SCA-CNN dynamically modulates the sentence generation context in multi-layer feature maps, encoding where (i.e., attentive spatial locations at multiple layers) and what (i.e., attentive channels) the visual attention is. We evaluate the proposed SCA-CNN architecture on three benchmark image captioning datasets: Flickr8K, Flickr30K, and MSCOCO. It is consistently observed that SCA-CNN significantly outperforms state-of-the-art visual attention-based image captioning methods.