CLJul 18, 2023Code
Text-guided Image Restoration and Semantic Enhancement for Text-to-Image Person RetrievalDelong Liu, Haiwen Li, Zhicheng Zhao et al.
The goal of Text-to-Image Person Retrieval (TIPR) is to retrieve specific person images according to the given textual descriptions. A primary challenge in this task is bridging the substantial representational gap between visual and textual modalities. The prevailing methods map texts and images into unified embedding space for matching, while the intricate semantic correspondences between texts and images are still not effectively constructed. To address this issue, we propose a novel TIPR framework to build fine-grained interactions and alignment between person images and the corresponding texts. Specifically, via fine-tuning the Contrastive Language-Image Pre-training (CLIP) model, a visual-textual dual encoder is firstly constructed, to preliminarily align the image and text features. Secondly, a Text-guided Image Restoration (TIR) auxiliary task is proposed to map abstract textual entities to specific image regions, improving the alignment between local textual and visual embeddings. Additionally, a cross-modal triplet loss is presented to handle hard samples, and further enhance the model's discriminability for minor differences. Moreover, a pruning-based text data augmentation approach is proposed to enhance focus on essential elements in descriptions, thereby avoiding excessive model attention to less significant information. The experimental results show our proposed method outperforms state-of-the-art methods on three popular benchmark datasets, and the code will be made publicly available at https://github.com/Delong-liu-bupt/SEN.
CVJul 19, 2023Code
Boundary-Refined Prototype Generation: A General End-to-End Paradigm for Semi-Supervised Semantic SegmentationJunhao Dong, Zhu Meng, Delong Liu et al.
Semi-supervised semantic segmentation has attracted increasing attention in computer vision, aiming to leverage unlabeled data through latent supervision. To achieve this goal, prototype-based classification has been introduced and achieved lots of success. However, the current approaches isolate prototype generation from the main training framework, presenting a non-end-to-end workflow. Furthermore, most methods directly perform the K-Means clustering on features to generate prototypes, resulting in their proximity to category semantic centers, while overlooking the clear delineation of class boundaries. To address the above problems, we propose a novel end-to-end boundary-refined prototype generation (BRPG) method. Specifically, we perform online clustering on sampled features to incorporate the prototype generation into the whole training framework. In addition, to enhance the classification boundaries, we sample and cluster high- and low-confidence features separately based on confidence estimation, facilitating the generation of prototypes closer to the class boundaries. Moreover, an adaptive prototype optimization strategy is proposed to increase the number of prototypes for categories with scattered feature distributions, which further refines the class boundaries. Extensive experiments demonstrate the remarkable robustness and scalability of our method across diverse datasets, segmentation networks, and semi-supervised frameworks, outperforming the state-of-the-art approaches on three benchmark datasets: PASCAL VOC 2012, Cityscapes and MS COCO. The code is available at https://github.com/djh-dzxw/BRPG.
CVNov 24, 2022
1st Workshop on Maritime Computer Vision (MaCVi) 2023: Challenge ResultsBenjamin Kiefer, Matej Kristan, Janez Perš et al.
The 1$^{\text{st}}$ Workshop on Maritime Computer Vision (MaCVi) 2023 focused on maritime computer vision for Unmanned Aerial Vehicles (UAV) and Unmanned Surface Vehicle (USV), and organized several subchallenges in this domain: (i) UAV-based Maritime Object Detection, (ii) UAV-based Maritime Object Tracking, (iii) USV-based Maritime Obstacle Segmentation and (iv) USV-based Maritime Obstacle Detection. The subchallenges were based on the SeaDronesSee and MODS benchmarks. This report summarizes the main findings of the individual subchallenges and introduces a new benchmark, called SeaDronesSee Object Detection v2, which extends the previous benchmark by including more classes and footage. We provide statistical and qualitative analyses, and assess trends in the best-performing methodologies of over 130 submissions. The methods are summarized in the appendix. The datasets, evaluation code and the leaderboard are publicly available at https://seadronessee.cs.uni-tuebingen.de/macvi.
CVNov 25, 2023Code
Automatic Synthetic Data and Fine-grained Adaptive Feature Alignment for Composed Person RetrievalDelong Liu, Haiwen Li, Zhaohui Hou et al.
Person retrieval has attracted rising attention. Existing methods are mainly divided into two retrieval modes, namely image-only and text-only. However, they are unable to make full use of the available information and are difficult to meet diverse application requirements. To address the above limitations, we propose a new Composed Person Retrieval (CPR) task, which combines visual and textual queries to identify individuals of interest from large-scale person image databases. Nevertheless, the foremost difficulty of the CPR task is the lack of available annotated datasets. Therefore, we first introduce a scalable automatic data synthesis pipeline, which decomposes complex multimodal data generation into the creation of textual quadruples followed by identity-consistent image synthesis using fine-tuned generative models. Meanwhile, a multimodal filtering method is designed to ensure the resulting SynCPR dataset retains 1.15 million high-quality and fully synthetic triplets. Additionally, to improve the representation of composed person queries, we propose a novel Fine-grained Adaptive Feature Alignment (FAFA) framework through fine-grained dynamic alignment and masked feature reasoning. Moreover, for objective evaluation, we manually annotate the Image-Text Composed Person Retrieval (ITCPR) test set. The extensive experiments demonstrate the effectiveness of the SynCPR dataset and the superiority of the proposed FAFA framework when compared with the state-of-the-art methods. All code and data will be provided at https://github.com/Delong-liu-bupt/Composed_Person_Retrieval.
AIApr 9, 2023Code
OpenDriver: An Open-Road Driver State Detection DatasetDelong Liu, Shichao Li, Tianyi Shi et al.
Among numerous studies for driver state detection, wearable physiological measurements offer a practical method for real-time monitoring. However, there are few driver physiological datasets in open-road scenarios, and the existing datasets suffer from issues such as poor signal quality, small sample sizes, and short data collection periods. Therefore, in this paper, a large-scale multimodal driving dataset, OpenDriver, for driver state detection is developed. The OpenDriver encompasses a total of 3,278 driving trips, with a signal collection duration spanning approximately 4,600 hours. Two modalities of driving signals are enrolled in OpenDriver: electrocardiogram (ECG) signals and six-axis motion data of the steering wheel from a motion measurement unit (IMU), which were recorded from 81 drivers and their vehicles. Furthermore, three challenging tasks are involved in our work, namely ECG signal quality assessment, individual biometric identification based on ECG signals, and physiological signal analysis in complex driving environments. To facilitate research in these tasks, corresponding benchmarks have also been introduced. First, a noisy augmentation strategy is applied to generate a larger-scale ECG signal dataset with realistic noise simulation for quality assessment. Second, an end-to-end contrastive learning framework is employed for individual biometric identification. Finally, a comprehensive analysis of drivers' HRV features under different driving conditions is conducted. Each benchmark provides evaluation metrics and reference results. The OpenDriver dataset will be publicly available at https://github.com/bdne/OpenDriver.
CVMay 24, 2024Code
MindShot: A Few-Shot Brain Decoding Framework via Transferring Cross-Subject Prior and Distilling Frequency Domain KnowledgeShuai Jiang, Zhu Meng, Haiwen Li et al.
Aiming to reconstruct visual stimuli from brain signals, brain decoding has recently made significant progress using functional magnetic resonance imaging (fMRI). However, it still has challenging issues such as substantial individual differences and high data collection costs. To simplify these problems, most methods adopt the per-subject-per-model paradigm, but this greatly limits their applications. In this paper, we design a few-shot brain decoding setting specifically for potential clinical scenarios and propose a novel two-stage decoding framework named MindShot, comprising a Multi-Subject Pretraining (MSP) stage and Fourier-based cross-subject Knowledge Distillation (FKD) stage. Firstly, a MSP framework based on multi-modal contrastive learning is constructed to mine the cross-subject prior. Secondly, the FKD is presented to decrease inter-individual differences while improving the decoding adaptability to new individuals. Our approach achieves high semantic fidelity in visual reconstruction on the largest dataset and has the potential to reduce scanning time by up to 99%. Remarkably, MindShot achieves a CLIP accuracy of 83.6% using only 1.8% of the fMRI-image pairs, surpassing the 77.4% accuracy of the method trained on the entire NSD dataset. This makes it feasible to train large-scale brain decoding frameworks that require less data, facilitating practical applications. The code is available at https://github.com/JSinBUPT/MindShot.
CVDec 12, 2024Code
UFO: Enhancing Diffusion-Based Video Generation with a Uniform Frame OrganizerDelong Liu, Zhaohui Hou, Mingjie Zhan et al.
Recently, diffusion-based video generation models have achieved significant success. However, existing models often suffer from issues like weak consistency and declining image quality over time. To overcome these challenges, inspired by aesthetic principles, we propose a non-invasive plug-in called Uniform Frame Organizer (UFO), which is compatible with any diffusion-based video generation model. The UFO comprises a series of adaptive adapters with adjustable intensities, which can significantly enhance the consistency between the foreground and background of videos and improve image quality without altering the original model parameters when integrated. The training for UFO is simple, efficient, requires minimal resources, and supports stylized training. Its modular design allows for the combination of multiple UFOs, enabling the customization of personalized video generation models. Furthermore, the UFO also supports direct transferability across different models of the same specification without the need for specific retraining. The experimental results indicate that UFO effectively enhances video generation quality and demonstrates its superiority in public video generation benchmarks. The code will be publicly available at https://github.com/Delong-liu-bupt/UFO.
CVApr 18, 2024
MLS-Track: Multilevel Semantic Interaction in RMOTZeliang Ma, Song Yang, Zhe Cui et al.
The new trend in multi-object tracking task is to track objects of interest using natural language. However, the scarcity of paired prompt-instance data hinders its progress. To address this challenge, we propose a high-quality yet low-cost data generation method base on Unreal Engine 5 and construct a brand-new benchmark dataset, named Refer-UE-City, which primarily includes scenes from intersection surveillance videos, detailing the appearance and actions of people and vehicles. Specifically, it provides 14 videos with a total of 714 expressions, and is comparable in scale to the Refer-KITTI dataset. Additionally, we propose a multi-level semantic-guided multi-object framework called MLS-Track, where the interaction between the model and text is enhanced layer by layer through the introduction of Semantic Guidance Module (SGM) and Semantic Correlation Branch (SCB). Extensive experiments on Refer-UE-City and Refer-KITTI datasets demonstrate the effectiveness of our proposed framework and it achieves state-of-the-art performance. Code and datatsets will be available.
CLSep 15, 2025
MOOM: Maintenance, Organization and Optimization of Memory in Ultra-Long Role-Playing DialoguesWeishu Chen, Jinyi Tang, Zhouhui Hou et al.
Memory extraction is crucial for maintaining coherent ultra-long dialogues in human-robot role-playing scenarios. However, existing methods often exhibit uncontrolled memory growth. To address this, we propose MOOM, the first dual-branch memory plugin that leverages literary theory by modeling plot development and character portrayal as core storytelling elements. Specifically, one branch summarizes plot conflicts across multiple time scales, while the other extracts the user's character profile. MOOM further integrates a forgetting mechanism, inspired by the ``competition-inhibition'' memory theory, to constrain memory capacity and mitigate uncontrolled growth. Furthermore, we present ZH-4O, a Chinese ultra-long dialogue dataset specifically designed for role-playing, featuring dialogues that average 600 turns and include manually annotated memory information. Experimental results demonstrate that MOOM outperforms all state-of-the-art memory extraction methods, requiring fewer large language model invocations while maintaining a controllable memory capacity.
CVJul 8, 2025
Automatic Synthesis of High-Quality Triplet Data for Composed Image RetrievalHaiwen Li, Delong Liu, Zhaohui Hou et al.
As a challenging vision-language (VL) task, Composed Image Retrieval (CIR) aims to retrieve target images using multimodal (image+text) queries. Although many existing CIR methods have attained promising performance, their reliance on costly, manually labeled triplets hinders scalability and zero-shot capability. To address this issue, we propose a scalable pipeline for automatic triplet generation, along with a fully synthetic dataset named Composed Image Retrieval on High-quality Synthetic Triplets (CIRHS). Our pipeline leverages a large language model (LLM) to generate diverse prompts, controlling a text-to-image generative model to produce image pairs with identical elements in each pair, which are then filtered and reorganized to form the CIRHS dataset. In addition, we introduce Hybrid Contextual Alignment (CoAlign), a novel CIR framework, which can accomplish global alignment and local reasoning within a broader context, enabling the model to learn more robust and informative representations. By utilizing the synthetic CIRHS dataset, CoAlign achieves outstanding zero-shot performance on three commonly used benchmarks, demonstrating for the first time the feasibility of training CIR models on a fully synthetic dataset. Furthermore, under supervised training, our method outperforms all the state-of-the-art supervised CIR approaches, validating the effectiveness of our proposed retrieval framework. The code and the CIRHS dataset will be released soon.