Jing Hao

6.8CVApr 7, 2023

Language-aware Multiple Datasets Detection Pretraining for DETRs

Jing Hao, Song Chen, Xiaodi Wang et al.

Pretraining on large-scale datasets can boost the performance of object detectors while the annotated datasets for object detection are hard to scale up due to the high labor cost. What we possess are numerous isolated filed-specific datasets, thus, it is appealing to jointly pretrain models across aggregation of datasets to enhance data volume and diversity. In this paper, we propose a strong framework for utilizing Multiple datasets to pretrain DETR-like detectors, termed METR, without the need for manual label spaces integration. It converts the typical multi-classification in object detection into binary classification by introducing a pre-trained language model. Specifically, we design a category extraction module for extracting potential categories involved in an image and assign these categories into different queries by language embeddings. Each query is only responsible for predicting a class-specific object. Besides, to adapt our novel detection paradigm, we propose a group bipartite matching strategy that limits the ground truths to match queries assigned to the same category. Extensive experiments demonstrate that METR achieves extraordinary results on either multi-task joint training or the pretrain & finetune paradigm. Notably, our pre-trained models have high flexible transferability and increase the performance upon various DETR-like detectors on COCO val2017 benchmark. Codes will be available after this paper is published.

8.8LGMay 24, 2023Code

A Joint Time-frequency Domain Transformer for Multivariate Time Series Forecasting

Yushu Chen, Shengzhuo Liu, Jinzhe Yang et al.

In order to enhance the performance of Transformer models for long-term multivariate forecasting while minimizing computational demands, this paper introduces the Joint Time-Frequency Domain Transformer (JTFT). JTFT combines time and frequency domain representations to make predictions. The frequency domain representation efficiently extracts multi-scale dependencies while maintaining sparsity by utilizing a small number of learnable frequencies. Simultaneously, the time domain (TD) representation is derived from a fixed number of the most recent data points, strengthening the modeling of local relationships and mitigating the effects of non-stationarity. Importantly, the length of the representation remains independent of the input sequence length, enabling JTFT to achieve linear computational complexity. Furthermore, a low-rank attention layer is proposed to efficiently capture cross-dimensional dependencies, thus preventing performance degradation resulting from the entanglement of temporal and channel-wise modeling. Experimental results on six real-world datasets demonstrate that JTFT outperforms state-of-the-art baselines in predictive performance.

Jing Hao

2 Papers