CVJun 9, 2023Code
SegViTv2: Exploring Efficient and Continual Semantic Segmentation with Plain Vision TransformersBowen Zhang, Liyang Liu, Minh Hieu Phan et al.
This paper investigates the capability of plain Vision Transformers (ViTs) for semantic segmentation using the encoder-decoder framework and introduces \textbf{SegViTv2}. In this study, we introduce a novel Attention-to-Mask (\atm) module to design a lightweight decoder effective for plain ViT. The proposed ATM converts the global attention map into semantic masks for high-quality segmentation results. Our decoder outperforms the popular decoder UPerNet using various ViT backbones while consuming only about $5\%$ of the computational cost. For the encoder, we address the concern of the relatively high computational cost in the ViT-based encoders and propose a \emph{Shrunk++} structure that incorporates edge-aware query-based down-sampling (EQD) and query-based upsampling (QU) modules. The Shrunk++ structure reduces the computational cost of the encoder by up to $50\%$ while maintaining competitive performance. Furthermore, we propose to adapt SegViT for continual semantic segmentation, demonstrating nearly zero forgetting of previously learned knowledge. Experiments show that our proposed SegViTv2 surpasses recent segmentation methods on three popular benchmarks including ADE20k, COCO-Stuff-10k and PASCAL-Context datasets. The code is available through the following link: \url{https://github.com/zbwxp/SegVit}.
CVFeb 5, 2023Code
FastPillars: A Deployment-friendly Pillar-based 3D DetectorSifan Zhou, Zhi Tian, Xiangxiang Chu et al.
The deployment of 3D detectors strikes one of the major challenges in real-world self-driving scenarios. Existing BEV-based (i.e., Bird Eye View) detectors favor sparse convolutions (known as SPConv) to speed up training and inference, which puts a hard barrier for deployment, especially for on-device applications. In this paper, to tackle the challenge of efficient 3D object detection from an industry perspective, we devise a deployment-friendly pillar-based 3D detector, termed FastPillars. First, we introduce a novel lightweight Max-and-Attention Pillar Encoding (MAPE) module specially for enhancing small 3D objects. Second, we propose a simple yet effective principle for designing a backbone in pillar-based 3D detection. We construct FastPillars based on these designs, achieving high performance and low latency without SPConv. Extensive experiments on two large-scale datasets demonstrate the effectiveness and efficiency of FastPillars for on-device 3D detection regarding both performance and speed. Specifically, FastPillars delivers state-of-the-art accuracy on Waymo Open Dataset with 1.8X speed up and 3.8 mAPH/L2 improvement over CenterPoint (SPConv-based). Our code is publicly available at: https://github.com/StiphyJay/FastPillars.
CVOct 12, 2022
SegViT: Semantic Segmentation with Plain Vision TransformersBowen Zhang, Zhi Tian, Quan Tang et al.
We explore the capability of plain Vision Transformers (ViTs) for semantic segmentation and propose the SegVit. Previous ViT-based segmentation networks usually learn a pixel-level representation from the output of the ViT. Differently, we make use of the fundamental component -- attention mechanism, to generate masks for semantic segmentation. Specifically, we propose the Attention-to-Mask (ATM) module, in which the similarity maps between a set of learnable class tokens and the spatial feature maps are transferred to the segmentation masks. Experiments show that our proposed SegVit using the ATM module outperforms its counterparts using the plain ViT backbone on the ADE20K dataset and achieves new state-of-the-art performance on COCO-Stuff-10K and PASCAL-Context datasets. Furthermore, to reduce the computational cost of the ViT backbone, we propose query-based down-sampling (QD) and query-based up-sampling (QU) to build a Shrunk structure. With the proposed Shrunk structure, the model can save up to $40\%$ computations while maintaining competitive performance.
CVMay 27, 2022
Fully Convolutional One-Stage 3D Object Detection on LiDAR Range ImagesZhi Tian, Xiangxiang Chu, Xiaoming Wang et al.
We present a simple yet effective fully convolutional one-stage 3D object detector for LiDAR point clouds of autonomous driving scenes, termed FCOS-LiDAR. Unlike the dominant methods that use the bird-eye view (BEV), our proposed detector detects objects from the range view (RV, a.k.a. range image) of the LiDAR points. Due to the range view's compactness and compatibility with the LiDAR sensors' sampling process on self-driving cars, the range view-based object detector can be realized by solely exploiting the vanilla 2D convolutions, departing from the BEV-based methods which often involve complicated voxelization operations and sparse convolutions. For the first time, we show that an RV-based 3D detector with standard 2D convolutions alone can achieve comparable performance to state-of-the-art BEV-based detectors while being significantly faster and simpler. More importantly, almost all previous range view-based detectors only focus on single-frame point clouds, since it is challenging to fuse multi-frame point clouds into a single range view. In this work, we tackle this challenging issue with a novel range view projection mechanism, and for the first time demonstrate the benefits of fusing multi-frame point clouds for a range-view based detector. Extensive experiments on nuScenes show the superiority of our proposed method and we believe that our work can be strong evidence that an RV-based 3D detector can compare favourably with the current mainstream BEV-based detectors.
SYApr 28, 2011
Tracking Target Signal Strengths on a Grid using SparsityShahrokh Farahmand, Georgios B. Giannakis, Geert Leus et al.
Multi-target tracking is mainly challenged by the nonlinearity present in the measurement equation, and the difficulty in fast and accurate data association. To overcome these challenges, the present paper introduces a grid-based model in which the state captures target signal strengths on a known spatial grid (TSSG). This model leads to \emph{linear} state and measurement equations, which bypass data association and can afford state estimation via sparsity-aware Kalman filtering (KF). Leveraging the grid-induced sparsity of the novel model, two types of sparsity-cognizant TSSG-KF trackers are developed: one effects sparsity through $\ell_1$-norm regularization, and the other invokes sparsity as an extra measurement. Iterative extended KF and Gauss-Newton algorithms are developed for reduced-complexity tracking, along with accurate error covariance updates for assessing performance of the resultant sparsity-aware state estimators. Based on TSSG state estimates, more informative target position and track estimates can be obtained in a follow-up step, ensuring that track association and position estimation errors do not propagate back into TSSG state estimates. The novel TSSG trackers do not require knowing the number of targets or their signal strengths, and exhibit considerably lower complexity than the benchmark hidden Markov model filter, especially for a large number of targets. Numerical simulations demonstrate that sparsity-cognizant trackers enjoy improved root mean-square error performance at reduced complexity when compared to their sparsity-agnostic counterparts.
CVAug 8, 2022
Two-Stream Networks for Object Segmentation in VideosHannan Lu, Zhi Tian, Lirong Yang et al.
Existing matching-based approaches perform video object segmentation (VOS) via retrieving support features from a pixel-level memory, while some pixels may suffer from lack of correspondence in the memory (i.e., unseen), which inevitably limits their segmentation performance. In this paper, we present a Two-Stream Network (TSN). Our TSN includes (i) a pixel stream with a conventional pixel-level memory, to segment the seen pixels based on their pixellevel memory retrieval. (ii) an instance stream for the unseen pixels, where a holistic understanding of the instance is obtained with dynamic segmentation heads conditioned on the features of the target instance. (iii) a pixel division module generating a routing map, with which output embeddings of the two streams are fused together. The compact instance stream effectively improves the segmentation accuracy of the unseen pixels, while fusing two streams with the adaptive routing map leads to an overall performance boost. Through extensive experiments, we demonstrate the effectiveness of our proposed TSN, and we also report state-of-the-art performance of 86.1% on YouTube-VOS 2018 and 87.5% on the DAVIS-2017 validation split.
LGApr 7, 2022
Adaptive-Gravity: A Defense Against Adversarial SamplesAli Mirzaeian, Zhi Tian, Sai Manoj P D et al.
This paper presents a novel model training solution, denoted as Adaptive-Gravity, for enhancing the robustness of deep neural network classifiers against adversarial examples. We conceptualize the model parameters/features associated with each class as a mass characterized by its centroid location and the spread (standard deviation of the distance) of features around the centroid. We use the centroid associated with each cluster to derive an anti-gravity force that pushes the centroids of different classes away from one another during network training. Then we customized an objective function that aims to concentrate each class's features toward their corresponding new centroid, which has been obtained by anti-gravity force. This methodology results in a larger separation between different masses and reduces the spread of features around each centroid. As a result, the samples are pushed away from the space that adversarial examples could be mapped to, effectively increasing the degree of perturbation needed for making an adversarial example. We have implemented this training solution as an iterative method consisting of four steps at each iteration: 1) centroid extraction, 2) anti-gravity force calculation, 3) centroid relocation, and 4) gravity training. Gravity's efficiency is evaluated by measuring the corresponding fooling rates against various attack models, including FGSM, MIM, BIM, and PGD using LeNet and ResNet110 networks, benchmarked against MNIST and CIFAR10 classification problems. Test results show that Gravity not only functions as a powerful instrument to robustify a model against state-of-the-art adversarial attacks but also effectively improves the model training accuracy.
LGOct 29, 2022
Robust Distributed Learning Against Both Distributional Shifts and Byzantine AttacksGuanqiang Zhou, Ping Xu, Yue Wang et al.
In distributed learning systems, robustness issues may arise from two sources. On one hand, due to distributional shifts between training data and test data, the trained model could exhibit poor out-of-sample performance. On the other hand, a portion of working nodes might be subject to byzantine attacks which could invalidate the learning result. Existing works mostly deal with these two issues separately. In this paper, we propose a new algorithm that equips distributed learning with robustness measures against both distributional shifts and byzantine attacks. Our algorithm is built on recent advances in distributionally robust optimization as well as norm-based screening (NBS), a robust aggregation scheme against byzantine attacks. We provide convergence proofs in three cases of the learning model being nonconvex, convex, and strongly convex for the proposed algorithm, shedding light on its convergence behaviors and endurability against byzantine attacks. In particular, we deduce that any algorithm employing NBS (including ours) cannot converge when the percentage of byzantine nodes is 1/3 or higher, instead of 1/2, which is the common belief in current literature. The experimental results demonstrate the effectiveness of our algorithm against both robustness issues. To the best of our knowledge, this is the first work to address distributional shifts and byzantine attacks simultaneously.
LGAug 4, 2022
QC-ODKLA: Quantized and Communication-Censored Online Decentralized Kernel Learning via Linearized ADMMPing Xu, Yue Wang, Xiang Chen et al.
This paper focuses on online kernel learning over a decentralized network. Each agent in the network receives continuous streaming data locally and works collaboratively to learn a nonlinear prediction function that is globally optimal in the reproducing kernel Hilbert space with respect to the total instantaneous costs of all agents. In order to circumvent the curse of dimensionality issue in traditional online kernel learning, we utilize random feature (RF) mapping to convert the non-parametric kernel learning problem into a fixed-length parametric one in the RF space. We then propose a novel learning framework named Online Decentralized Kernel learning via Linearized ADMM (ODKLA) to efficiently solve the online decentralized kernel learning problem. To further improve the communication efficiency, we add the quantization and censoring strategies in the communication stage and develop the Quantized and Communication-censored ODKLA (QC-ODKLA) algorithm. We theoretically prove that both ODKLA and QC-ODKLA can achieve the optimal sublinear regret $\mathcal{O}(\sqrt{T})$ over $T$ time slots. Through numerical experiments, we evaluate the learning effectiveness, communication, and computation efficiencies of the proposed methods.
LGJul 16, 2023
Byzantine-Robust Distributed Online Learning: Taming Adversarial Participants in An Adversarial EnvironmentXingrong Dong, Zhaoxian Wu, Qing Ling et al.
This paper studies distributed online learning under Byzantine attacks. The performance of an online learning algorithm is often characterized by (adversarial) regret, which evaluates the quality of one-step-ahead decision-making when an environment provides adversarial losses, and a sublinear bound is preferred. But we prove that, even with a class of state-of-the-art robust aggregation rules, in an adversarial environment and in the presence of Byzantine participants, distributed online gradient descent can only achieve a linear adversarial regret bound, which is tight. This is the inevitable consequence of Byzantine attacks, even though we can control the constant of the linear adversarial regret to a reasonable level. Interestingly, when the environment is not fully adversarial so that the losses of the honest participants are i.i.d. (independent and identically distributed), we show that sublinear stochastic regret, in contrast to the aforementioned adversarial regret, is possible. We develop a Byzantine-robust distributed online momentum algorithm to attain such a sublinear stochastic regret bound. Extensive numerical experiments corroborate our theoretical analysis.
CVApr 15, 2025Code
SimpleAR: Pushing the Frontier of Autoregressive Visual Generation through Pretraining, SFT, and RLJunke Wang, Zhi Tian, Xun Wang et al.
This work presents SimpleAR, a vanilla autoregressive visual generation framework without complex architecure modifications. Through careful exploration of training and inference optimization, we demonstrate that: 1) with only 0.5B parameters, our model can generate 1024x1024 resolution images with high fidelity, and achieve competitive results on challenging text-to-image benchmarks, e.g., 0.59 on GenEval and 79.66 on DPG; 2) both supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO) training could lead to significant improvements on generation aesthectics and prompt alignment; and 3) when optimized with inference acceleraton techniques like vLLM, the time for SimpleAR to generate an 1024x1024 image could be reduced to around 14 seconds. By sharing these findings and open-sourcing the code, we hope to reveal the potential of autoregressive visual generation and encourage more participation in this research field. Code is available at https://github.com/wdrink/SimpleAR.
CVOct 24, 2021Code
NAS-FCOS: Efficient Search for Object Detection ArchitecturesNing Wang, Yang Gao, Hao Chen et al.
Neural Architecture Search (NAS) has shown great potential in effectively reducing manual effort in network design by automatically discovering optimal architectures. What is noteworthy is that as of now, object detection is less touched by NAS algorithms despite its significant importance in computer vision. To the best of our knowledge, most of the recent NAS studies on object detection tasks fail to satisfactorily strike a balance between performance and efficiency of the resulting models, let alone the excessive amount of computational resources cost by those algorithms. Here we propose an efficient method to obtain better object detectors by searching for the feature pyramid network (FPN) as well as the prediction head of a simple anchor-free object detector, namely, FCOS [36], using a tailored reinforcement learning paradigm. With carefully designed search space, search algorithms, and strategies for evaluating network quality, we are able to find top-performing detection architectures within 4 days using 8 V100 GPUs. The discovered architectures surpass state-of-the-art object detection models (such as Faster R-CNN, Retina-Net and, FCOS) by 1.0% to 5.4% points in AP on the COCO dataset, with comparable computation complexity and memory footprint, demonstrating the efficacy of the proposed NAS method for object detection. Code is available at https://github.com/Lausannen/NAS-FCOS.
CVMay 29, 2021Code
FCPose: Fully Convolutional Multi-Person Pose Estimation with Dynamic Instance-Aware ConvolutionsWeian Mao, Zhi Tian, Xinlong Wang et al.
We propose a fully convolutional multi-person pose estimation framework using dynamic instance-aware convolutions, termed FCPose. Different from existing methods, which often require ROI (Region of Interest) operations and/or grouping post-processing, FCPose eliminates the ROIs and grouping post-processing with dynamic instance-aware keypoint estimation heads. The dynamic keypoint heads are conditioned on each instance (person), and can encode the instance concept in the dynamically-generated weights of their filters. Moreover, with the strong representation capacity of dynamic convolutions, the keypoint heads in FCPose are designed to be very compact, resulting in fast inference and making FCPose have almost constant inference time regardless of the number of persons in the image. For example, on the COCO dataset, a real-time version of FCPose using the DLA-34 backbone infers about 4.5x faster than Mask R-CNN (ResNet-101) (41.67 FPS vs. 9.26FPS) while achieving improved performance. FCPose also offers better speed/accuracy trade-off than other state-of-the-art methods. Our experiment results show that FCPose is a simple yet effective multi-person pose estimation framework. Code is available at: https://git.io/AdelaiDet
CVApr 28, 2021Code
Twins: Revisiting the Design of Spatial Attention in Vision TransformersXiangxiang Chu, Zhi Tian, Yuqing Wang et al.
Very recently, a variety of vision transformer architectures for dense prediction tasks have been proposed and they show that the design of spatial attention is critical to their success in these tasks. In this work, we revisit the design of the spatial attention and demonstrate that a carefully-devised yet simple spatial attention mechanism performs favourably against the state-of-the-art schemes. As a result, we propose two vision transformer architectures, namely, Twins-PCPVT and Twins-SVT. Our proposed architectures are highly-efficient and easy to implement, only involving matrix multiplications that are highly optimized in modern deep learning frameworks. More importantly, the proposed architectures achieve excellent performance on a wide range of visual tasks, including image level classification as well as dense detection and segmentation. The simplicity and strong performance suggest that our proposed architectures may serve as stronger backbones for many vision tasks. Our code is released at https://github.com/Meituan-AutoML/Twins .
CVFeb 22, 2021Code
Conditional Positional Encodings for Vision TransformersXiangxiang Chu, Zhi Tian, Bo Zhang et al.
We propose a conditional positional encoding (CPE) scheme for vision Transformers. Unlike previous fixed or learnable positional encodings, which are pre-defined and independent of input tokens, CPE is dynamically generated and conditioned on the local neighborhood of the input tokens. As a result, CPE can easily generalize to the input sequences that are longer than what the model has ever seen during training. Besides, CPE can keep the desired translation-invariance in the image classification task, resulting in improved performance. We implement CPE with a simple Position Encoding Generator (PEG) to get seamlessly incorporated into the current Transformer framework. Built on PEG, we present Conditional Position encoding Vision Transformer (CPVT). We demonstrate that CPVT has visually similar attention maps compared to those with learned positional encodings and delivers outperforming results. Our code is available at https://github.com/Meituan-AutoML/CPVT .
CVFeb 5, 2021Code
Instance and Panoptic Segmentation Using Conditional ConvolutionsZhi Tian, Bowen Zhang, Hao Chen et al.
We propose a simple yet effective framework for instance and panoptic segmentation, termed CondInst (conditional convolutions for instance and panoptic segmentation). In the literature, top-performing instance segmentation methods typically follow the paradigm of Mask R-CNN and rely on ROI operations (typically ROIAlign) to attend to each instance. In contrast, we propose to attend to the instances with dynamic conditional convolutions. Instead of using instance-wise ROIs as inputs to the instance mask head of fixed weights, we design dynamic instance-aware mask heads, conditioned on the instances to be predicted. CondInst enjoys three advantages: 1.) Instance and panoptic segmentation are unified into a fully convolutional network, eliminating the need for ROI cropping and feature alignment. 2.) The elimination of the ROI cropping also significantly improves the output instance mask resolution. 3.) Due to the much improved capacity of dynamically-generated conditional convolutions, the mask head can be very compact (e.g., 3 conv. layers, each having only 8 channels), leading to significantly faster inference time per instance and making the overall inference time almost constant, irrelevant to the number of instances. We demonstrate a simpler method that can achieve improved accuracy and inference speed on both instance and panoptic segmentation tasks. On the COCO dataset, we outperform a few state-of-the-art methods. We hope that CondInst can be a strong baseline for instance and panoptic segmentation. Code is available at: https://git.io/AdelaiDet
CVDec 3, 2020Code
BoxInst: High-Performance Instance Segmentation with Box AnnotationsZhi Tian, Chunhua Shen, Xinlong Wang et al.
We present a high-performance method that can achieve mask-level instance segmentation with only bounding-box annotations for training. While this setting has been studied in the literature, here we show significantly stronger performance with a simple design (e.g., dramatically improving previous best reported mask AP of 21.1% in Hsu et al. (2019) to 31.6% on the COCO dataset). Our core idea is to redesign the loss of learning masks in instance segmentation, with no modification to the segmentation network itself. The new loss functions can supervise the mask training without relying on mask annotations. This is made possible with two loss terms, namely, 1) a surrogate term that minimizes the discrepancy between the projections of the ground-truth box and the predicted mask; 2) a pairwise loss that can exploit the prior that proximal pixels with similar colors are very likely to have the same category label. Experiments demonstrate that the redesigned mask loss can yield surprisingly high-quality instance masks with only box annotations. For example, without using any mask annotations, with a ResNet-101 backbone and 3x training schedule, we achieve 33.2% mask AP on COCO test-dev split (vs. 39.1% of the fully supervised counterpart). Our excellent experiment results on COCO and Pascal VOC indicate that our method dramatically narrows the performance gap between weakly and fully supervised instance segmentation. Code is available at: https://git.io/AdelaiDet
CVNov 19, 2020Code
Unifying Instance and Panoptic Segmentation with Dynamic Rank-1 ConvolutionsHao Chen, Chunhua Shen, Zhi Tian
Recently, fully-convolutional one-stage networks have shown superior performance comparing to two-stage frameworks for instance segmentation as typically they can generate higher-quality mask predictions with less computation. In addition, their simple design opens up new opportunities for joint multi-task learning. In this paper, we demonstrate that adding a single classification layer for semantic segmentation, fully-convolutional instance segmentation networks can achieve state-of-the-art panoptic segmentation quality. This is made possible by our novel dynamic rank-1 convolution (DR1Conv), a novel dynamic module that can efficiently merge high-level context information with low-level detailed features which is beneficial for both semantic and instance segmentation. Importantly, the proposed new method, termed DR1Mask, can perform panoptic segmentation by adding a single layer. To our knowledge, DR1Mask is the first panoptic segmentation framework that exploits a shared feature map for both instance and semantic segmentation by considering both efficacy and efficiency. Not only our framework is much more efficient -- twice as fast as previous best two-branch approaches, but also the unified framework opens up opportunities for using the same context module to improve the performance for both tasks. As a byproduct, when performing instance segmentation alone, DR1Mask is 10% faster and 1 point in mAP more accurate than previous state-of-the-art instance segmentation network BlendMask. Code is available at: https://git.io/AdelaiDet
CVMar 26, 2020Code
Mask Encoding for Single Shot Instance SegmentationRufeng Zhang, Zhi Tian, Chunhua Shen et al.
To date, instance segmentation is dominated by twostage methods, as pioneered by Mask R-CNN. In contrast, one-stage alternatives cannot compete with Mask R-CNN in mask AP, mainly due to the difficulty of compactly representing masks, making the design of one-stage methods very challenging. In this work, we propose a simple singleshot instance segmentation framework, termed mask encoding based instance segmentation (MEInst). Instead of predicting the two-dimensional mask directly, MEInst distills it into a compact and fixed-dimensional representation vector, which allows the instance segmentation task to be incorporated into one-stage bounding-box detectors and results in a simple yet efficient instance segmentation framework. The proposed one-stage MEInst achieves 36.4% in mask AP with single-model (ResNeXt-101-FPN backbone) and single-scale testing on the MS-COCO benchmark. We show that the much simpler and flexible one-stage instance segmentation method, can also achieve competitive performance. This framework can be easily adapted for other instance-level recognition tasks. Code is available at: https://git.io/AdelaiDet
CVMar 12, 2020Code
Conditional Convolutions for Instance SegmentationZhi Tian, Chunhua Shen, Hao Chen
We propose a simple yet effective instance segmentation framework, termed CondInst (conditional convolutions for instance segmentation). Top-performing instance segmentation methods such as Mask R-CNN rely on ROI operations (typically ROIPool or ROIAlign) to obtain the final instance masks. In contrast, we propose to solve instance segmentation from a new perspective. Instead of using instance-wise ROIs as inputs to a network of fixed weights, we employ dynamic instance-aware networks, conditioned on instances. CondInst enjoys two advantages: 1) Instance segmentation is solved by a fully convolutional network, eliminating the need for ROI cropping and feature alignment. 2) Due to the much improved capacity of dynamically-generated conditional convolutions, the mask head can be very compact (e.g., 3 conv. layers, each having only 8 channels), leading to significantly faster inference. We demonstrate a simpler instance segmentation method that can achieve improved performance in both accuracy and inference speed. On the COCO dataset, we outperform a few recent methods including well-tuned Mask RCNN baselines, without longer training schedules needed. Code is available: https://github.com/aim-uofa/adet
CVJan 2, 2020Code
BlendMask: Top-Down Meets Bottom-Up for Instance SegmentationHao Chen, Kunyang Sun, Zhi Tian et al.
Instance segmentation is one of the fundamental vision tasks. Recently, fully convolutional instance segmentation methods have drawn much attention as they are often simpler and more efficient than two-stage approaches like Mask R-CNN. To date, almost all such approaches fall behind the two-stage Mask R-CNN method in mask precision when models have similar computation complexity, leaving great room for improvement. In this work, we achieve improved mask prediction by effectively combining instance-level information with semantic information with lower-level fine-granularity. Our main contribution is a blender module which draws inspiration from both top-down and bottom-up instance segmentation approaches. The proposed BlendMask can effectively predict dense per-pixel position-sensitive instance features with very few channels, and learn attention maps for each instance with merely one convolution layer, thus being fast in inference. BlendMask can be easily incorporated with the state-of-the-art one-stage detection frameworks and outperforms Mask R-CNN under the same training schedule while being 20% faster. A light-weight version of BlendMask achieves $ 34.2% $ mAP at 25 FPS evaluated on a single 1080Ti GPU card. Because of its simplicity and efficacy, we hope that our BlendMask could serve as a simple yet strong baseline for a wide range of instance-wise prediction tasks. Code is available at https://git.io/AdelaiDet
CVOct 17, 2019Code
Convolutional Character NetworksLinjie Xing, Zhi Tian, Weilin Huang et al.
Recent progress has been made on developing a unified framework for joint text detection and recognition in natural images, but existing joint models were mostly built on two-stage framework by involving ROI pooling, which can degrade the performance on recognition task. In this work, we propose convolutional character networks, referred as CharNet, which is an one-stage model that can process two tasks simultaneously in one pass. CharNet directly outputs bounding boxes of words and characters, with corresponding character labels. We utilize character as basic element, allowing us to overcome the main difficulty of existing approaches that attempted to optimize text detection jointly with a RNN-based recognition branch. In addition, we develop an iterative character detection approach able to transform the ability of character detection learned from synthetic data to real-world images. These technical improvements result in a simple, compact, yet powerful one-stage model that works reliably on multi-orientation and curved text. We evaluate CharNet on three standard benchmarks, where it consistently outperforms the state-of-the-art approaches [25, 24] by a large margin, e.g., with improvements of 65.33%->71.08% (with generic lexicon) on ICDAR 2015, and 54.0%->69.23% on Total-Text, on end-to-end text recognition. Code is available at: https://github.com/MalongTech/research-charnet.
CVApr 2, 2019Code
FCOS: Fully Convolutional One-Stage Object DetectionZhi Tian, Chunhua Shen, Hao Chen et al.
We propose a fully convolutional one-stage object detector (FCOS) to solve object detection in a per-pixel prediction fashion, analogue to semantic segmentation. Almost all state-of-the-art object detectors such as RetinaNet, SSD, YOLOv3, and Faster R-CNN rely on pre-defined anchor boxes. In contrast, our proposed detector FCOS is anchor box free, as well as proposal free. By eliminating the predefined set of anchor boxes, FCOS completely avoids the complicated computation related to anchor boxes such as calculating overlapping during training. More importantly, we also avoid all hyper-parameters related to anchor boxes, which are often very sensitive to the final detection performance. With the only post-processing non-maximum suppression (NMS), FCOS with ResNeXt-64x4d-101 achieves 44.7% in AP with single-model and single-scale testing, surpassing previous one-stage detectors with the advantage of being much simpler. For the first time, we demonstrate a much simpler and flexible detection framework achieving improved detection accuracy. We hope that the proposed FCOS framework can serve as a simple and strong alternative for many other instance-level tasks. Code is available at:Code is available at: https://tinyurl.com/FCOSv1
CVApr 15, 2025
Seedream 3.0 Technical ReportYu Gao, Lixue Gong, Qiushan Guo et al.
We present Seedream 3.0, a high-performance Chinese-English bilingual image generation foundation model. We develop several technical improvements to address existing challenges in Seedream 2.0, including alignment with complicated prompts, fine-grained typography generation, suboptimal visual aesthetics and fidelity, and limited image resolutions. Specifically, the advancements of Seedream 3.0 stem from improvements across the entire pipeline, from data construction to model deployment. At the data stratum, we double the dataset using a defect-aware training paradigm and a dual-axis collaborative data-sampling framework. Furthermore, we adopt several effective techniques such as mixed-resolution training, cross-modality RoPE, representation alignment loss, and resolution-aware timestep sampling in the pre-training phase. During the post-training stage, we utilize diversified aesthetic captions in SFT, and a VLM-based reward model with scaling, thereby achieving outputs that well align with human preferences. Furthermore, Seedream 3.0 pioneers a novel acceleration paradigm. By employing consistent noise expectation and importance-aware timestep sampling, we achieve a 4 to 8 times speedup while maintaining image quality. Seedream 3.0 demonstrates significant improvements over Seedream 2.0: it enhances overall capabilities, in particular for text-rendering in complicated Chinese characters which is important to professional typography generation. In addition, it provides native high-resolution output (up to 2K), allowing it to generate images with high visual quality.
CVJun 10, 2025
Seedance 1.0: Exploring the Boundaries of Video Generation ModelsYu Gao, Haoyuan Guo, Tuyen Hoang et al.
Notable breakthroughs in diffusion modeling have propelled rapid improvements in video generation, yet current foundational model still face critical challenges in simultaneously balancing prompt following, motion plausibility, and visual quality. In this report, we introduce Seedance 1.0, a high-performance and inference-efficient video foundation generation model that integrates several core technical improvements: (i) multi-source data curation augmented with precision and meaningful video captioning, enabling comprehensive learning across diverse scenarios; (ii) an efficient architecture design with proposed training paradigm, which allows for natively supporting multi-shot generation and jointly learning of both text-to-video and image-to-video tasks. (iii) carefully-optimized post-training approaches leveraging fine-grained supervised fine-tuning, and video-specific RLHF with multi-dimensional reward mechanisms for comprehensive performance improvements; (iv) excellent model acceleration achieving ~10x inference speedup through multi-stage distillation strategies and system-level optimizations. Seedance 1.0 can generate a 5-second video at 1080p resolution only with 41.4 seconds (NVIDIA-L20). Compared to state-of-the-art video generation models, Seedance 1.0 stands out with high-quality and fast video generation having superior spatiotemporal fluidity with structural stability, precise instruction adherence in complex multi-subject contexts, native multi-shot narrative coherence with consistent subject representation.
CVMay 8, 2025
Mogao: An Omni Foundation Model for Interleaved Multi-Modal GenerationChao Liao, Liyang Liu, Xun Wang et al.
Recent progress in unified models for image understanding and generation has been impressive, yet most approaches remain limited to single-modal generation conditioned on multiple modalities. In this paper, we present Mogao, a unified framework that advances this paradigm by enabling interleaved multi-modal generation through a causal approach. Mogao integrates a set of key technical improvements in architecture design, including a deep-fusion design, dual vision encoders, interleaved rotary position embeddings, and multi-modal classifier-free guidance, which allow it to harness the strengths of both autoregressive models for text generation and diffusion models for high-quality image synthesis. These practical improvements also make Mogao particularly effective to process interleaved sequences of text and images arbitrarily. To further unlock the potential of unified models, we introduce an efficient training strategy on a large-scale, in-house dataset specifically curated for joint text and image generation. Extensive experiments show that Mogao not only achieves state-of-the-art performance in multi-modal understanding and text-to-image generation, but also excels in producing high-quality, coherent interleaved outputs. Its emergent capabilities in zero-shot image editing and compositional generation highlight Mogao as a practical omni-modal foundation model, paving the way for future development and scaling the unified multi-modal systems.
CVMar 10, 2025
Seedream 2.0: A Native Chinese-English Bilingual Image Generation Foundation ModelLixue Gong, Xiaoxia Hou, Fanshi Li et al.
Rapid advancement of diffusion models has catalyzed remarkable progress in the field of image generation. However, prevalent models such as Flux, SD3.5 and Midjourney, still grapple with issues like model bias, limited text rendering capabilities, and insufficient understanding of Chinese cultural nuances. To address these limitations, we present Seedream 2.0, a native Chinese-English bilingual image generation foundation model that excels across diverse dimensions, which adeptly manages text prompt in both Chinese and English, supporting bilingual image generation and text rendering. We develop a powerful data system that facilitates knowledge integration, and a caption system that balances the accuracy and richness for image description. Particularly, Seedream is integrated with a self-developed bilingual large language model as a text encoder, allowing it to learn native knowledge directly from massive data. This enable it to generate high-fidelity images with accurate cultural nuances and aesthetic expressions described in either Chinese or English. Beside, Glyph-Aligned ByT5 is applied for flexible character-level text rendering, while a Scaled ROPE generalizes well to untrained resolutions. Multi-phase post-training optimizations, including SFT and RLHF iterations, further improve the overall capability. Through extensive experimentation, we demonstrate that Seedream 2.0 achieves state-of-the-art performance across multiple aspects, including prompt-following, aesthetics, text rendering, and structural correctness. Furthermore, Seedream 2.0 has been optimized through multiple RLHF iterations to closely align its output with human preferences, as revealed by its outstanding ELO score. In addition, it can be readily adapted to an instruction-based image editing model, such as SeedEdit, with strong editing capability that balances instruction-following and image consistency.
LGMar 11, 2024
Unraveling the Mystery of Scaling Laws: Part IHui Su, Zhi Tian, Xiaoyu Shen et al.
Scaling law principles indicate a power-law correlation between loss and variables such as model size, dataset size, and computational resources utilized during training. These principles play a vital role in optimizing various aspects of model pre-training, ultimately contributing to the success of large language models such as GPT-4, Llama and Gemini. However, the original scaling law paper by OpenAI did not disclose the complete details necessary to derive the precise scaling law formulas, and their conclusions are only based on models containing up to 1.5 billion parameters. Though some subsequent works attempt to unveil these details and scale to larger models, they often neglect the training dependency of important factors such as the learning rate, context length and batch size, leading to their failure to establish a reliable formula for predicting the test loss trajectory. In this technical report, we confirm that the scaling law formulations proposed in the original OpenAI paper remain valid when scaling the model size up to 33 billion, but the constant coefficients in these formulas vary significantly with the experiment setup. We meticulously identify influential factors and provide transparent, step-by-step instructions to estimate all constant terms in scaling-law formulas by training on models with only 1M~60M parameters. Using these estimated formulas, we showcase the capability to accurately predict various attributes for models with up to 33B parameters before their training, including (1) the minimum possible test loss; (2) the minimum required training steps and processed tokens to achieve a specific loss; (3) the critical batch size with an optimal time/computation trade-off at any loss value; and (4) the complete test loss trajectory with arbitrary batch size.
CVSep 24, 2025
Seedream 4.0: Toward Next-generation Multimodal Image GenerationTeam Seedream, Yunpeng Chen, Yu Gao et al.
We introduce Seedream 4.0, an efficient and high-performance multimodal image generation system that unifies text-to-image (T2I) synthesis, image editing, and multi-image composition within a single framework. We develop a highly efficient diffusion transformer with a powerful VAE which also can reduce the number of image tokens considerably. This allows for efficient training of our model, and enables it to fast generate native high-resolution images (e.g., 1K-4K). Seedream 4.0 is pretrained on billions of text-image pairs spanning diverse taxonomies and knowledge-centric concepts. Comprehensive data collection across hundreds of vertical scenarios, coupled with optimized strategies, ensures stable and large-scale training, with strong generalization. By incorporating a carefully fine-tuned VLM model, we perform multi-modal post-training for training both T2I and image editing tasks jointly. For inference acceleration, we integrate adversarial distillation, distribution matching, and quantization, as well as speculative decoding. It achieves an inference time of up to 1.8 seconds for generating a 2K image (without a LLM/VLM as PE model). Comprehensive evaluations reveal that Seedream 4.0 can achieve state-of-the-art results on both T2I and multimodal image editing. In particular, it demonstrates exceptional multimodal capabilities in complex tasks, including precise image editing and in-context reasoning, and also allows for multi-image reference, and can generate multiple output images. This extends traditional T2I systems into an more interactive and multidimensional creative tool, pushing the boundary of generative AI for both creativity and professional applications. Seedream 4.0 is now accessible on https://www.volcengine.com/experience/ark?launch=seedream.
CVApr 8, 2025
DDT: Decoupled Diffusion TransformerShuai Wang, Zhi Tian, Weilin Huang et al.
Diffusion transformers have demonstrated remarkable generation quality, albeit requiring longer training iterations and numerous inference steps. In each denoising step, diffusion transformers encode the noisy inputs to extract the lower-frequency semantic component and then decode the higher frequency with identical modules. This scheme creates an inherent optimization dilemma: encoding low-frequency semantics necessitates reducing high-frequency components, creating tension between semantic encoding and high-frequency decoding. To resolve this challenge, we propose a new \textbf{\color{ddt}D}ecoupled \textbf{\color{ddt}D}iffusion \textbf{\color{ddt}T}ransformer~(\textbf{\color{ddt}DDT}), with a decoupled design of a dedicated condition encoder for semantic extraction alongside a specialized velocity decoder. Our experiments reveal that a more substantial encoder yields performance improvements as model size increases. For ImageNet $256\times256$, Our DDT-XL/2 achieves a new state-of-the-art performance of {1.31 FID}~(nearly $4\times$ faster training convergence compared to previous diffusion transformers). For ImageNet $512\times512$, Our DDT-XL/2 achieves a new state-of-the-art FID of 1.28. Additionally, as a beneficial by-product, our decoupled architecture enhances inference speed by enabling the sharing self-condition between adjacent denoising steps. To minimize performance degradation, we propose a novel statistical dynamic programming approach to identify optimal sharing strategies.
NIMar 29, 2024
Distributed Swarm Learning for Edge Internet of ThingsYue Wang, Zhi Tian, FXin Fan et al.
The rapid growth of Internet of Things (IoT) has led to the widespread deployment of smart IoT devices at wireless edge for collaborative machine learning tasks, ushering in a new era of edge learning. With a huge number of hardware-constrained IoT devices operating in resource-limited wireless networks, edge learning encounters substantial challenges, including communication and computation bottlenecks, device and data heterogeneity, security risks, privacy leakages, non-convex optimization, and complex wireless environments. To address these issues, this article explores a novel framework known as distributed swarm learning (DSL), which combines artificial intelligence and biological swarm intelligence in a holistic manner. By harnessing advanced signal processing and communications, DSL provides efficient solutions and robust tools for large-scale IoT at the edge of wireless networks.
LGMay 21, 2025
Scaling Diffusion Transformers Efficiently via $μ$PChenyu Zheng, Xinyu Zhang, Rongzhen Wang et al.
Diffusion Transformers have emerged as the foundation for vision generative models, but their scalability is limited by the high cost of hyperparameter (HP) tuning at large scales. Recently, Maximal Update Parametrization ($μ$P) was proposed for vanilla Transformers, which enables stable HP transfer from small to large language models, and dramatically reduces tuning costs. However, it remains unclear whether $μ$P of vanilla Transformers extends to diffusion Transformers, which differ architecturally and objectively. In this work, we generalize standard $μ$P to diffusion Transformers and validate its effectiveness through large-scale experiments. First, we rigorously prove that $μ$P of mainstream diffusion Transformers, including U-ViT, DiT, PixArt-$α$, and MMDiT, aligns with that of the vanilla Transformer, enabling the direct application of existing $μ$P methodologies. Leveraging this result, we systematically demonstrate that DiT-$μ$P enjoys robust HP transferability. Notably, DiT-XL-2-$μ$P with transferred learning rate achieves 2.9 times faster convergence than the original DiT-XL-2. Finally, we validate the effectiveness of $μ$P on text-to-image generation by scaling PixArt-$α$ from 0.04B to 0.61B and MMDiT from 0.18B to 18B. In both cases, models under $μ$P outperform their respective baselines while requiring small tuning cost, only 5.5% of one training run for PixArt-$α$ and 3% of consumption by human experts for MMDiT-18B. These results establish $μ$P as a principled and efficient framework for scaling diffusion Transformers.
LGSep 22, 2025
Multi-Worker Selection based Distributed Swarm Learning for Edge IoT with Non-i.i.d. DataZhuoyu Yao, Yue Wang, Songyang Zhang et al.
Recent advances in distributed swarm learning (DSL) offer a promising paradigm for edge Internet of Things. Such advancements enhance data privacy, communication efficiency, energy saving, and model scalability. However, the presence of non-independent and identically distributed (non-i.i.d.) data pose a significant challenge for multi-access edge computing, degrading learning performance and diverging training behavior of vanilla DSL. Further, there still lacks theoretical guidance on how data heterogeneity affects model training accuracy, which requires thorough investigation. To fill the gap, this paper first study the data heterogeneity by measuring the impact of non-i.i.d. datasets under the DSL framework. This then motivates a new multi-worker selection design for DSL, termed M-DSL algorithm, which works effectively with distributed heterogeneous data. A new non-i.i.d. degree metric is introduced and defined in this work to formulate the statistical difference among local datasets, which builds a connection between the measure of data heterogeneity and the evaluation of DSL performance. In this way, our M-DSL guides effective selection of multiple works who make prominent contributions for global model updates. We also provide theoretical analysis on the convergence behavior of our M-DSL, followed by extensive experiments on different heterogeneous datasets and non-i.i.d. data settings. Numerical results verify performance improvement and network intelligence enhancement provided by our M-DSL beyond the benchmarks.
LGMar 5, 2025
Towards Trustworthy Federated LearningAlina Basharat, Yijun Bian, Ping Xu et al.
This paper develops a comprehensive framework to address three critical trustworthy challenges in federated learning (FL): robustness against Byzantine attacks, fairness, and privacy preservation. To improve the system's defense against Byzantine attacks that send malicious information to bias the system's performance, we develop a Two-sided Norm Based Screening (TNBS) mechanism, which allows the central server to crop the gradients that have the l lowest norms and h highest norms. TNBS functions as a screening tool to filter out potential malicious participants whose gradients are far from the honest ones. To promote egalitarian fairness, we adopt the q-fair federated learning (q-FFL). Furthermore, we adopt a differential privacy-based scheme to prevent raw data at local clients from being inferred by curious parties. Convergence guarantees are provided for the proposed framework under different scenarios. Experimental results on real datasets demonstrate that the proposed framework effectively improves robustness and fairness while managing the trade-off between privacy and accuracy. This work appears to be the first study that experimentally and theoretically addresses fairness, privacy, and robustness in trustworthy FL.
CVJan 19, 2022
Poseur: Direct Human Pose Regression with TransformersWeian Mao, Yongtao Ge, Chunhua Shen et al.
We propose a direct, regression-based approach to 2D human pose estimation from single images. We formulate the problem as a sequence prediction task, which we solve using a Transformer network. This network directly learns a regression mapping from images to the keypoint coordinates, without resorting to intermediate representations such as heatmaps. This approach avoids much of the complexity associated with heatmap-based approaches. To overcome the feature misalignment issues of previous regression-based methods, we propose an attention mechanism that adaptively attends to the features that are most relevant to the target keypoints, considerably improving the accuracy. Importantly, our framework is end-to-end differentiable, and naturally learns to exploit the dependencies between keypoints. Experiments on MS-COCO and MPII, two predominant pose-estimation datasets, demonstrate that our method significantly improves upon the state-of-the-art in regression-based pose estimation. More notably, ours is the first regression-based approach to perform favorably compared to the best heatmap-based pose estimation methods.
LGNov 28, 2021
Fed2: Feature-Aligned Federated LearningFuxun Yu, Weishan Zhang, Zhuwei Qin et al.
Federated learning learns from scattered data by fusing collaborative models from local nodes. However, the conventional coordinate-based model averaging by FedAvg ignored the random information encoded per parameter and may suffer from structural feature misalignment. In this work, we propose Fed2, a feature-aligned federated learning framework to resolve this issue by establishing a firm structure-feature alignment across the collaborative models. Fed2 is composed of two major designs: First, we design a feature-oriented model structure adaptation method to ensure explicit feature allocation in different neural network structures. Applying the structure adaptation to collaborative models, matchable structures with similar feature information can be initialized at the very early training stage. During the federated learning process, we then propose a feature paired averaging scheme to guarantee aligned feature distribution and maintain no feature fusion conflicts under either IID or non-IID scenarios. Eventually, Fed2 could effectively enhance the federated learning convergence performance under extensive homo- and heterogeneous settings, providing excellent convergence speed, accuracy, and computation/communication efficiency.
LGOct 18, 2021
BEV-SGD: Best Effort Voting SGD for Analog Aggregation Based Federated Learning against Byzantine AttackersXin Fan, Yue Wang, Yan Huo et al.
As a promising distributed learning technology, analog aggregation based federated learning over the air (FLOA) provides high communication efficiency and privacy provisioning under the edge computing paradigm. When all edge devices (workers) simultaneously upload their local updates to the parameter server (PS) through commonly shared time-frequency resources, the PS obtains the averaged update only rather than the individual local ones. While such a concurrent transmission and aggregation scheme reduces the latency and communication costs, it unfortunately renders FLOA vulnerable to Byzantine attacks. Aiming at Byzantine-resilient FLOA, this paper starts from analyzing the channel inversion (CI) mechanism that is widely used for power control in FLOA. Our theoretical analysis indicates that although CI can achieve good learning performance in the benign scenarios, it fails to work well with limited defensive capability against Byzantine attacks. Then, we propose a novel scheme called the best effort voting (BEV) power control policy that is integrated with stochastic gradient descent (SGD). Our BEV-SGD enhances the robustness of FLOA to Byzantine attacks, by allowing all the workers to send their local updates at their maximum transmit power. Under worst-case attacks, we derive the expected convergence rates of FLOA with CI and BEV power control policies, respectively. The rate comparison reveals that our BEV-SGD outperforms its counterpart with CI in terms of better convergence behavior, which is verified by experimental simulations.
CVJul 30, 2021
Dynamic Neural Representational Decoders for High-Resolution Semantic SegmentationBowen Zhang, Yifan Liu, Zhi Tian et al.
Semantic segmentation requires per-pixel prediction for a given image. Typically, the output resolution of a segmentation network is severely reduced due to the downsampling operations in the CNN backbone. Most previous methods employ upsampling decoders to recover the spatial resolution. Various decoders were designed in the literature. Here, we propose a novel decoder, termed dynamic neural representational decoder (NRD), which is simple yet significantly more efficient. As each location on the encoder's output corresponds to a local patch of the semantic labels, in this work, we represent these local patches of labels with compact neural networks. This neural representation enables our decoder to leverage the smoothness prior in the semantic label space, and thus makes our decoder more efficient. Furthermore, these neural representations are dynamically generated and conditioned on the outputs of the encoder networks. The desired semantic labels can be efficiently decoded from the neural representations, resulting in high-resolution semantic segmentation predictions. We empirically show that our proposed decoder can outperform the decoder in DeeplabV3+ with only 30% computational complexity, and achieve competitive performance with the methods using dilated encoders with only 15% computation. Experiments on the Cityscapes, ADE20K, and PASCAL Context datasets demonstrate the effectiveness and efficiency of our proposed method.
LGApr 8, 2021
Joint Optimization of Communications and Federated Learning Over the AirXin Fan, Yue Wang, Yan Huo et al.
Federated learning (FL) is an attractive paradigm for making use of rich distributed data while protecting data privacy. Nonetheless, nonideal communication links and limited transmission resources may hinder the implementation of fast and accurate FL. In this paper, we study joint optimization of communications and FL based on analog aggregation transmission in realistic wireless networks. We first derive closed-form expressions for the expected convergence rate of FL over the air, which theoretically quantify the impact of analog aggregation on FL. Based on the analytical results, we develop a joint optimization model for accurate FL implementation, which allows a parameter server to select a subset of workers and determine an appropriate power scaling factor. Since the practical setting of FL over the air encounters unobservable parameters, we reformulate the joint optimization of worker selection and power allocation using controlled approximation. Finally, we efficiently solve the resulting mixed-integer programming problem via a simple yet optimal finite-set search method by reducing the search space. Simulation results show that the proposed solutions developed for realistic wireless analog channels outperform a benchmark method, and achieve comparable performance of the ideal case where FL is implemented over noise-free wireless channels.
LGMar 30, 2021
1-Bit Compressive Sensing for Efficient Federated Learning Over the AirXin Fan, Yue Wang, Yan Huo et al.
For distributed learning among collaborative users, this paper develops and analyzes a communication-efficient scheme for federated learning (FL) over the air, which incorporates 1-bit compressive sensing (CS) into analog aggregation transmissions. To facilitate design parameter optimization, we theoretically analyze the efficacy of the proposed scheme by deriving a closed-form expression for the expected convergence rate of the FL over the air. Our theoretical results reveal the tradeoff between convergence performance and communication efficiency as a result of the aggregation errors caused by sparsification, dimension reduction, quantization, signal reconstruction and noise. Then, we formulate 1-bit CS based FL over the air as a joint optimization problem to mitigate the impact of these aggregation errors through joint optimal design of worker scheduling and power scaling policy. An enumeration-based method is proposed to solve this non-convex problem, which is optimal but becomes computationally infeasible as the number of devices increases. For scalable computing, we resort to the alternating direction method of multipliers (ADMM) technique to develop an efficient implementation that is suitable for large-scale networks. Simulation results show that our proposed 1-bit CS based FL over the air achieves comparable performance to the ideal case where conventional FL without compression and quantification is applied over error-free aggregation, at much reduced communication overhead and transmission latency.
CVMar 29, 2021
TFPose: Direct Human Pose Estimation with TransformersWeian Mao, Yongtao Ge, Chunhua Shen et al.
We propose a human pose estimation framework that solves the task in the regression-based fashion. Unlike previous regression-based methods, which often fall behind those state-of-the-art methods, we formulate the pose estimation task into a sequence prediction problem that can effectively be solved by transformers. Our framework is simple and direct, bypassing the drawbacks of the heatmap-based pose estimation. Moreover, with the attention mechanism in transformers, our proposed framework is able to adaptively attend to the features most relevant to the target keypoints, which largely overcomes the feature misalignment issue of previous regression-based methods and considerably improves the performance. Importantly, our framework can inherently take advantages of the structured relationship between keypoints. Experiments on the MS-COCO and MPII datasets demonstrate that our method can significantly improve the state-of-the-art of regression-based pose estimation and perform comparably with the best heatmap-based pose estimation methods.
ITSep 18, 2020
The Capacity of Multi-user Private Information Retrieval for Computationally Limited DatabasesWilliam Barnhart, Zhi Tian
We present a private information retrieval (PIR) scheme that allows a user to retrieve a single message from an arbitrary number of databases by colluding with other users while hiding the desired message index. This scheme is of particular significance when there is only one accessible database -- a special case that turns out to be more challenging for PIR in the multi-database case. The upper bound for privacy-preserving capacity for these scenarios is $C=(1+\frac{1}{S}+\cdots+\frac{1}{S^{K-1}})^{-1}$, where $K$ is the number of messages and $S$ represents the quantity of information sources such as $S=N+U-1$ for $U$ users and $N$ databases. We show that the proposed information retrieval scheme attains the capacity bound even when only one database is present, which differs from most existing works that hinge on the access to multiple databases in order to hide user privacy. Unlike the multi-database case, this scheme capitalizes on the inability for a database to cross-reference queries made by multiple users due to computational complexity.
LGAug 15, 2020
Heterogeneous Federated LearningFuxun Yu, Weishan Zhang, Zhuwei Qin et al.
Federated learning learns from scattered data by fusing collaborative models from local nodes. However, due to chaotic information distribution, the model fusion may suffer from structural misalignment with regard to unmatched parameters. In this work, we propose a novel federated learning framework to resolve this issue by establishing a firm structure-information alignment across collaborative models. Specifically, we design a feature-oriented regulation method ({$Ψ$-Net}) to ensure explicit feature information allocation in different neural network structures. Applying this regulating method to collaborative models, matchable structures with similar feature information can be initialized at the very early training stage. During the federated learning process under either IID or non-IID scenarios, dedicated collaboration schemes further guarantee ordered information distribution with definite structure matching, so as the comprehensive model alignment. Eventually, this framework effectively enhances the federated learning applicability to extensive heterogeneous settings, while providing excellent convergence speed, accuracy, and computation/communication efficiency.
CVJun 14, 2020
FCOS: A simple and strong anchor-free object detectorZhi Tian, Chunhua Shen, Hao Chen et al.
In computer vision, object detection is one of most important tasks, which underpins a few instance-level recognition tasks and many downstream applications. Recently one-stage methods have gained much attention over two-stage approaches due to their simpler design and competitive performance. Here we propose a fully convolutional one-stage object detector (FCOS) to solve object detection in a per-pixel prediction fashion, analogue to other dense prediction problems such as semantic segmentation. Almost all state-of-the-art object detectors such as RetinaNet, SSD, YOLOv3, and Faster R-CNN rely on pre-defined anchor boxes. In contrast, our proposed detector FCOS is anchor box free, as well as proposal free. By eliminating the pre-defined set of anchor boxes, FCOS completely avoids the complicated computation related to anchor boxes such as calculating the intersection over union (IoU) scores during training. More importantly, we also avoid all hyper-parameters related to anchor boxes, which are often sensitive to the final detection performance. With the only post-processing non-maximum suppression (NMS), we demonstrate a much simpler and flexible detection framework achieving improved detection accuracy. We hope that the proposed FCOS framework can serve as a simple and strong alternative for many other instance-level tasks. Code and pre-trained models are available at: https://git.io/AdelaiDet
CVFeb 3, 2020
DiverseDepth: Affine-invariant Depth Prediction Using Diverse DataWei Yin, Xinlong Wang, Chunhua Shen et al.
We present a method for depth estimation with monocular images, which can predict high-quality depth on diverse scenes up to an affine transformation, thus preserving accurate shapes of a scene. Previous methods that predict metric depth often work well only for a specific scene. In contrast, learning relative depth (information of being closer or further) can enjoy better generalization, with the price of failing to recover the accurate geometric shape of the scene. In this work, we propose a dataset and methods to tackle this dilemma, aiming to predict accurate depth up to an affine transformation with good generalization to diverse scenes. First we construct a large-scale and diverse dataset, termed Diverse Scene Depth dataset (DiverseDepth), which has a broad range of scenes and foreground contents. Compared with previous learning objectives, i.e., learning metric depth or relative depth, we propose to learn the affine-invariant depth using our diverse dataset to ensure both generalization and high-quality geometric shapes of scenes. Furthermore, in order to train the model on the complex dataset effectively, we propose a multi-curriculum learning method. Experiments show that our method outperforms previous methods on 8 datasets by a large margin with the zero-shot test setting, demonstrating the excellent generalization capacity of the learned model to diverse scenes. The reconstructed point clouds with the predicted depth show that our method can recover high-quality 3D shapes. Code and dataset are available at: https://tinyurl.com/DiverseDepth
LGJan 28, 2020
COKE: Communication-Censored Decentralized Kernel LearningPing Xu, Yue Wang, Xiang Chen et al.
This paper studies the decentralized optimization and learning problem where multiple interconnected agents aim to learn an optimal decision function defined over a reproducing kernel Hilbert space by jointly minimizing a global objective function, with access to their own locally observed dataset. As a non-parametric approach, kernel learning faces a major challenge in distributed implementation: the decision variables of local objective functions are data-dependent and thus cannot be optimized under the decentralized consensus framework without any raw data exchange among agents. To circumvent this major challenge, we leverage the random feature (RF) approximation approach to enable consensus on the function modeled in the RF space by data-independent parameters across different agents. We then design an iterative algorithm, termed DKLA, for fast-convergent implementation via ADMM. Based on DKLA, we further develop a communication-censored kernel learning (COKE) algorithm that reduces the communication load of DKLA by preventing an agent from transmitting at every iteration unless its local updates are deemed informative. Theoretical results in terms of linear convergence guarantee and generalization performance analysis of DKLA and COKE are provided. Comprehensive tests on both synthetic and real datasets are conducted to verify the communication efficiency and learning effectiveness of COKE.
CVJan 6, 2020
Learning and Memorizing Representative Prototypes for 3D Point Cloud Semantic and Instance SegmentationTong He, Dong Gong, Zhi Tian et al.
3D point cloud semantic and instance segmentation is crucial and fundamental for 3D scene understanding. Due to the complex structure, point sets are distributed off balance and diversely, which appears as both category imbalance and pattern imbalance. As a result, deep networks can easily forget the non-dominant cases during the learning process, resulting in unsatisfactory performance. Although re-weighting can reduce the influence of the well-classified examples, they cannot handle the non-dominant patterns during the dynamic training. In this paper, we propose a memory-augmented network to learn and memorize the representative prototypes that cover diverse samples universally. Specifically, a memory module is introduced to alleviate the forgetting issue by recording the patterns seen in mini-batch training. The learned memory items consistently reflect the interpretable and meaningful information for both dominant and non-dominant categories and cases. The distorted observations and rare cases can thus be augmented by retrieving the stored prototypes, leading to better performances and generalization. Exhaustive experiments on the benchmarks, i.e. S3DIS and ScanNetV2, reflect the superiority of our method on both effectiveness and efficiency. Not only the overall accuracy but also nondominant classes have improved substantially.
CVNov 18, 2019
DirectPose: Direct End-to-End Multi-Person Pose EstimationZhi Tian, Hao Chen, Chunhua Shen
We propose the first direct end-to-end multi-person pose estimation framework, termed DirectPose. Inspired by recent anchor-free object detectors, which directly regress the two corners of target bounding-boxes, the proposed framework directly predicts instance-aware keypoints for all the instances from a raw input image, eliminating the need for heuristic grouping in bottom-up methods or bounding-box detection and RoI operations in top-down ones. We also propose a novel Keypoint Alignment (KPAlign) mechanism, which overcomes the main difficulty: lack of the alignment between the convolutional features and predictions in this end-to-end framework. KPAlign improves the framework's performance by a large margin while still keeping the framework end-to-end trainable. With the only postprocessing non-maximum suppression (NMS), our proposed framework can detect multi-person keypoints with or without bounding-boxes in a single shot. Experiments demonstrate that the end-to-end paradigm can achieve competitive or better performance than previous strong baselines, in both bottom-up and top-down methods. We hope that our end-to-end approach can provide a new perspective for the human pose estimation task.
OCSep 15, 2019
Communication-Censored Linearized ADMM for Decentralized Consensus OptimizationWeiyu Li, Yaohua Liu, Zhi Tian et al.
In this paper, we propose a communication- and computation-efficient algorithm to solve a convex consensus optimization problem defined over a decentralized network. A remarkable existing algorithm to solve this problem is the alternating direction method of multipliers (ADMM), in which at every iteration every node updates its local variable through combining neighboring variables and solving an optimization subproblem. The proposed algorithm, called as COmmunication-censored Linearized ADMM (COLA), leverages a linearization technique to reduce the iteration-wise computation cost of ADMM and uses a communication-censoring strategy to alleviate the communication cost. To be specific, COLA introduces successive linearization approximations to the local cost functions such that the resultant computation is first-order and light-weight. Since the linearization technique slows down the convergence speed, COLA further adopts the communication-censoring strategy to avoid transmissions of less informative messages. A node is allowed to transmit only if the distance between the current local variable and its previously transmitted one is larger than a censoring threshold. COLA is proven to be convergent when the local cost functions have Lipschitz continuous gradients and the censoring threshold is summable. When the local cost functions are further strongly convex, we establish the linear (sublinear) convergence rate of COLA, given that the censoring threshold linearly (sublinearly) decays to 0. Numerical experiments corroborate with the theoretical findings and demonstrate the satisfactory communication-computation tradeoff of COLA.
CVJun 11, 2019
NAS-FCOS: Fast Neural Architecture Search for Object DetectionNing Wang, Yang Gao, Hao Chen et al.
The success of deep neural networks relies on significant architecture engineering. Recently neural architecture search (NAS) has emerged as a promise to greatly reduce manual effort in network design by automatically searching for optimal architectures, although typically such algorithms need an excessive amount of computational resources, e.g., a few thousand GPU-days. To date, on challenging vision tasks such as object detection, NAS, especially fast versions of NAS, is less studied. Here we propose to search for the decoder structure of object detectors with search efficiency being taken into consideration. To be more specific, we aim to efficiently search for the feature pyramid network (FPN) as well as the prediction head of a simple anchor-free object detector, namely FCOS, using a tailored reinforcement learning paradigm. With carefully designed search space, search algorithms and strategies for evaluating network quality, we are able to efficiently search a top-performing detection architecture within 4 days using 8 V100 GPUs. The discovered architecture surpasses state-of-the-art object detection models (such as Faster R-CNN, RetinaNet and FCOS) by 1.5 to 3.5 points in AP on the COCO dataset, with comparable computation complexity and memory footprint, demonstrating the efficacy of the proposed NAS for object detection.