Towards Data-Efficient Detection TransformersWen Wang, Jing Zhang, Yang Cao et al.
Detection Transformers have achieved competitive performance on the sample-rich COCO dataset. However, we show most of them suffer from significant performance drops on small-size datasets, like Cityscapes. In other words, the detection transformers are generally data-hungry. To tackle this problem, we empirically analyze the factors that affect data efficiency, through a step-by-step transition from a data-efficient RCNN variant to the representative DETR. The empirical results suggest that sparse feature sampling from local image areas holds the key. Based on this observation, we alleviate the data-hungry issue of existing detection transformers by simply alternating how key and value sequences are constructed in the cross-attention layer, with minimum modifications to the original models. Besides, we introduce a simple yet effective label augmentation method to provide richer supervision and improve data efficiency. Experiments show that our method can be readily applied to different detection transformers and improve their performance on both small-size and sample-rich datasets. Code will be made publicly available at \url{https://github.com/encounter1997/DE-DETRs}.
31.1CVJul 23, 2024
MovieDreamer: Hierarchical Generation for Coherent Long Visual SequenceCanyu Zhao, Mingyu Liu, Wen Wang et al.
Recent advancements in video generation have primarily leveraged diffusion models for short-duration content. However, these approaches often fall short in modeling complex narratives and maintaining character consistency over extended periods, which is essential for long-form video production like movies. We propose MovieDreamer, a novel hierarchical framework that integrates the strengths of autoregressive models with diffusion-based rendering to pioneer long-duration video generation with intricate plot progressions and high visual fidelity. Our approach utilizes autoregressive models for global narrative coherence, predicting sequences of visual tokens that are subsequently transformed into high-quality video frames through diffusion rendering. This method is akin to traditional movie production processes, where complex stories are factorized down into manageable scene capturing. Further, we employ a multimodal script that enriches scene descriptions with detailed character information and visual style, enhancing continuity and character identity across scenes. We present extensive experiments across various movie genres, demonstrating that our approach not only achieves superior visual and narrative quality but also effectively extends the duration of generated content significantly beyond current capabilities. Homepage: https://aim-uofa.github.io/MovieDreamer/.
7.9CVMar 7, 2020
TTPP: Temporal Transformer with Progressive Prediction for Efficient Action AnticipationWen Wang, Xiaojiang Peng, Yanzhou Su et al.
Video action anticipation aims to predict future action categories from observed frames. Current state-of-the-art approaches mainly resort to recurrent neural networks to encode history information into hidden states, and predict future actions from the hidden representations. It is well known that the recurrent pipeline is inefficient in capturing long-term information which may limit its performance in predication task. To address this problem, this paper proposes a simple yet efficient Temporal Transformer with Progressive Prediction (TTPP) framework, which repurposes a Transformer-style architecture to aggregate observed features, and then leverages a light-weight network to progressively predict future features and actions. Specifically, predicted features along with predicted probabilities are accumulated into the inputs of subsequent prediction. We evaluate our approach on three action datasets, namely TVSeries, THUMOS-14, and TV-Human-Interaction. Additionally we also conduct a comprehensive study for several popular aggregation and prediction strategies. Extensive results show that TTPP not only outperforms the state-of-the-art methods but also more efficient.
9.6HCJan 21, 2020
A Comprehensive Study on Temporal Modeling for Online Action DetectionWen Wang, Xiaojiang Peng, Yu Qiao et al.
Online action detection (OAD) is a practical yet challenging task, which has attracted increasing attention in recent years. A typical OAD system mainly consists of three modules: a frame-level feature extractor which is usually based on pre-trained deep Convolutional Neural Networks (CNNs), a temporal modeling module, and an action classifier. Among them, the temporal modeling module is crucial which aggregates discriminative information from historical and current features. Though many temporal modeling methods have been developed for OAD and other topics, their effects are lack of investigation on OAD fairly. This paper aims to provide a comprehensive study on temporal modeling for OAD including four meta types of temporal modeling methods, \ie temporal pooling, temporal convolution, recurrent neural networks, and temporal attention, and uncover some good practices to produce a state-of-the-art OAD system. Many of them are explored in OAD for the first time, and extensively evaluated with various hyper parameters. Furthermore, based on our comprehensive study, we present several hybrid temporal modeling methods, which outperform the recent state-of-the-art methods with sizable margins on THUMOS-14 and TVSeries.
0.9CVOct 19, 2018
Temporal Action Detection by Joint Identification-VerificationWen Wang, Yongjian Wu, Haijun Liu et al.
Temporal action detection aims at not only recognizing action category but also detecting start time and end time for each action instance in an untrimmed video. The key challenge of this task is to accurately classify the action and determine the temporal boundaries of each action instance. In temporal action detection benchmark: THUMOS 2014, large variations exist in the same action category while many similarities exist in different action categories, which always limit the performance of temporal action detection. To address this problem, we propose to use joint Identification-Verification network to reduce the intra-action variations and enlarge inter-action differences. The joint Identification-Verification network is a siamese network based on 3D ConvNets, which can simultaneously predict the action categories and the similarity scores for the input pairs of video proposal segments. Extensive experimental results on the challenging THUMOS 2014 dataset demonstrate the effectiveness of our proposed method compared to the existing state-of-art methods for temporal action detection in untrimmed videos.