Robust Structured Declarative Classifiers for 3D Point Clouds: Defending Adversarial Attacks with Implicit GradientsKaidong Li, Ziming Zhang, Cuncong Zhong et al.
Deep neural networks for 3D point cloud classification, such as PointNet, have been demonstrated to be vulnerable to adversarial attacks. Current adversarial defenders often learn to denoise the (attacked) point clouds by reconstruction, and then feed them to the classifiers as input. In contrast to the literature, we propose a family of robust structured declarative classifiers for point cloud classification, where the internal constrained optimization mechanism can effectively defend adversarial attacks through implicit gradients. Such classifiers can be formulated using a bilevel optimization framework. We further propose an effective and efficient instantiation of our approach, namely, Lattice Point Classifier (LPC), based on structured sparse coding in the permutohedral lattice and 2D convolutional neural networks (CNNs) that is end-to-end trainable. We demonstrate state-of-the-art robust point cloud classification performance on ModelNet40 and ScanNet under seven different attackers. For instance, we achieve 89.51% and 83.16% test accuracy on each dataset under the recent JGBA attacker that outperforms DUP-Net and IF-Defense with PointNet by ~70%. Demo code is available at https://zhang-vislab.github.io.
Accumulated Trivial Attention Matters in Vision Transformers on Small DatasetsXiangyu Chen, Qinghao Hu, Kaidong Li et al.
Vision Transformers has demonstrated competitive performance on computer vision tasks benefiting from their ability to capture long-range dependencies with multi-head self-attention modules and multi-layer perceptron. However, calculating global attention brings another disadvantage compared with convolutional neural networks, i.e. requiring much more data and computations to converge, which makes it difficult to generalize well on small datasets, which is common in practical applications. Previous works are either focusing on transferring knowledge from large datasets or adjusting the structure for small datasets. After carefully examining the self-attention modules, we discover that the number of trivial attention weights is far greater than the important ones and the accumulated trivial weights are dominating the attention in Vision Transformers due to their large quantity, which is not handled by the attention itself. This will cover useful non-trivial attention and harm the performance when trivial attention includes more noise, e.g. in shallow layers for some backbones. To solve this issue, we proposed to divide attention weights into trivial and non-trivial ones by thresholds, then Suppressing Accumulated Trivial Attention (SATA) weights by proposed Trivial WeIghts Suppression Transformation (TWIST) to reduce attention noise. Extensive experiments on CIFAR-100 and Tiny-ImageNet datasets show that our suppressing method boosts the accuracy of Vision Transformers by up to 2.3%. Code is available at https://github.com/xiangyu8/SATA.
Depth-Wise Convolutions in Vision Transformers for Efficient Training on Small DatasetsTianxiao Zhang, Wenju Xu, Bo Luo et al.
The Vision Transformer (ViT) leverages the Transformer's encoder to capture global information by dividing images into patches and achieves superior performance across various computer vision tasks. However, the self-attention mechanism of ViT captures the global context from the outset, overlooking the inherent relationships between neighboring pixels in images or videos. Transformers mainly focus on global information while ignoring the fine-grained local details. Consequently, ViT lacks inductive bias during image or video dataset training. In contrast, convolutional neural networks (CNNs), with their reliance on local filters, possess an inherent inductive bias, making them more efficient and quicker to converge than ViT with less data. In this paper, we present a lightweight Depth-Wise Convolution module as a shortcut in ViT models, bypassing entire Transformer blocks to ensure the models capture both local and global information with minimal overhead. Additionally, we introduce two architecture variants, allowing the Depth-Wise Convolution modules to be applied to multiple Transformer blocks for parameter savings, and incorporating independent parallel Depth-Wise Convolution modules with different kernels to enhance the acquisition of local information. The proposed approach significantly boosts the performance of ViT models on image classification, object detection, and instance segmentation by a large margin, especially on small datasets, as evaluated on CIFAR-10, CIFAR-100, Tiny-ImageNet and ImageNet for image classification, and COCO for object detection and instance segmentation. The source code can be accessed at https://github.com/ZTX-100/Efficient_ViT_with_DW.
Explicitly Increasing Input Information Density for Vision Transformers on Small DatasetsXiangyu Chen, Ying Qin, Wenju Xu et al.
Vision Transformers have attracted a lot of attention recently since the successful implementation of Vision Transformer (ViT) on vision tasks. With vision Transformers, specifically the multi-head self-attention modules, networks can capture long-term dependencies inherently. However, these attention modules normally need to be trained on large datasets, and vision Transformers show inferior performance on small datasets when training from scratch compared with widely dominant backbones like ResNets. Note that the Transformer model was first proposed for natural language processing, which carries denser information than natural images. To boost the performance of vision Transformers on small datasets, this paper proposes to explicitly increase the input information density in the frequency domain. Specifically, we introduce selecting channels by calculating the channel-wise heatmaps in the frequency domain using Discrete Cosine Transform (DCT), reducing the size of input while keeping most information and hence increasing the information density. As a result, 25% fewer channels are kept while better performance is achieved compared with previous work. Extensive experiments demonstrate the effectiveness of the proposed approach on five small-scale datasets, including CIFAR-10/100, SVHN, Flowers-102, and Tiny ImageNet. The accuracy has been boosted up to 17.05% with Swin and Focal Transformers. Codes are available at https://github.com/xiangyu8/DenseVT.
LFSRDiff: Light Field Image Super-Resolution via Diffusion ModelsWentao Chao, Fuqing Duan, Xuechun Wang et al.
Light field (LF) image super-resolution (SR) is a challenging problem due to its inherent ill-posed nature, where a single low-resolution (LR) input LF image can correspond to multiple potential super-resolved outcomes. Despite this complexity, mainstream LF image SR methods typically adopt a deterministic approach, generating only a single output supervised by pixel-wise loss functions. This tendency often results in blurry and unrealistic results. Although diffusion models can capture the distribution of potential SR results by iteratively predicting Gaussian noise during the denoising process, they are primarily designed for general images and struggle to effectively handle the unique characteristics and information present in LF images. To address these limitations, we introduce LFSRDiff, the first diffusion-based LF image SR model, by incorporating the LF disentanglement mechanism. Our novel contribution includes the introduction of a disentangled U-Net for diffusion models, enabling more effective extraction and fusion of both spatial and angular information within LF images. Through comprehensive experimental evaluations and comparisons with the state-of-the-art LF image SR methods, the proposed approach consistently produces diverse and realistic SR results. It achieves the highest perceptual metric in terms of LPIPS. It also demonstrates the ability to effectively control the trade-off between perception and distortion. The code is available at \url{https://github.com/chaowentao/LFSRDiff}.
5.9CVAug 10, 2023
Aphid Cluster Recognition and Detection in the Wild Using Deep Learning ModelsTianxiao Zhang, Kaidong Li, Xiangyu Chen et al.
Aphid infestation poses a significant threat to crop production, rural communities, and global food security. While chemical pest control is crucial for maximizing yields, applying chemicals across entire fields is both environmentally unsustainable and costly. Hence, precise localization and management of aphids are essential for targeted pesticide application. The paper primarily focuses on using deep learning models for detecting aphid clusters. We propose a novel approach for estimating infection levels by detecting aphid clusters. To facilitate this research, we have captured a large-scale dataset from sorghum fields, manually selected 5,447 images containing aphids, and annotated each individual aphid cluster within these images. To facilitate the use of machine learning models, we further process the images by cropping them into patches, resulting in a labeled dataset comprising 151,380 image patches. Then, we implemented and compared the performance of four state-of-the-art object detection models (VFNet, GFLV2, PAA, and ATSS) on the aphid dataset. Extensive experimental results show that all models yield stable similar performance in terms of average precision and recall. We then propose to merge close neighboring clusters and remove tiny clusters caused by cropping, and the performance is further boosted by around 17%. The study demonstrates the feasibility of automatically detecting and managing insects using machine learning models. The labeled dataset will be made openly available to the research community.
16.5IVJul 4, 2023
Edge-aware Multi-task Network for Integrating Quantification Segmentation and Uncertainty Prediction of Liver Tumor on Multi-modality Non-contrast MRIXiaojiao Xiao, Qinmin Hu, Guanghui Wang
Simultaneous multi-index quantification, segmentation, and uncertainty estimation of liver tumors on multi-modality non-contrast magnetic resonance imaging (NCMRI) are crucial for accurate diagnosis. However, existing methods lack an effective mechanism for multi-modality NCMRI fusion and accurate boundary information capture, making these tasks challenging. To address these issues, this paper proposes a unified framework, namely edge-aware multi-task network (EaMtNet), to associate multi-index quantification, segmentation, and uncertainty of liver tumors on the multi-modality NCMRI. The EaMtNet employs two parallel CNN encoders and the Sobel filters to extract local features and edge maps, respectively. The newly designed edge-aware feature aggregation module (EaFA) is used for feature fusion and selection, making the network edge-aware by capturing long-range dependency between feature and edge maps. Multi-tasking leverages prediction discrepancy to estimate uncertainty and improve segmentation and quantification performance. Extensive experiments are performed on multi-modality NCMRI with 250 clinical subjects. The proposed model outperforms the state-of-the-art by a large margin, achieving a dice similarity coefficient of 90.01$\pm$1.23 and a mean absolute error of 2.72$\pm$0.58 mm for MD. The results demonstrate the potential of EaMtNet as a reliable clinical-aided tool for medical image analysis.
9.8LGFeb 17, 2023
Minimizing Dynamic Regret on Geodesic Metric SpacesZihao Hu, Guanghui Wang, Jacob Abernethy
In this paper, we consider the sequential decision problem where the goal is to minimize the general dynamic regret on a complete Riemannian manifold. The task of offline optimization on such a domain, also known as a geodesic metric space, has recently received significant attention. The online setting has received significantly less attention, and it has remained an open question whether the body of results that hold in the Euclidean setting can be transplanted into the land of Riemannian manifolds where new challenges (e.g., curvature) come into play. In this paper, we show how to get optimistic regret bound on manifolds with non-positive curvature whenever improper learning is allowed and propose an array of adaptive no-regret algorithms. To the best of our knowledge, this is the first work that considers general dynamic regret and develops "optimistic" online learning algorithms which can be employed on geodesic metric spaces.
9.6LGOct 17, 2022
On Accelerated Perceptrons and BeyondGuanghui Wang, Rafael Hanashiro, Etash Guha et al.
The classical Perceptron algorithm of Rosenblatt can be used to find a linear threshold function to correctly classify $n$ linearly separable data points, assuming the classes are separated by some margin $γ> 0$. A foundational result is that Perceptron converges after $Ω(1/γ^{2})$ iterations. There have been several recent works that managed to improve this rate by a quadratic factor, to $Ω(\sqrt{\log n}/γ)$, with more sophisticated algorithms. In this paper, we unify these existing results under one framework by showing that they can all be described through the lens of solving min-max problems using modern acceleration techniques, mainly through optimistic online learning. We then show that the proposed framework also lead to improved results for a series of problems beyond the standard Perceptron setting. Specifically, a) For the margin maximization problem, we improve the state-of-the-art result from $O(\log t/t^2)$ to $O(1/t^2)$, where $t$ is the number of iterations; b) We provide the first result on identifying the implicit bias property of the classical Nesterov's accelerated gradient descent (NAG) algorithm, and show NAG can maximize the margin with an $O(1/t^2)$ rate; c) For the classical $p$-norm Perceptron problem, we provide an algorithm with $Ω(\sqrt{(p-1)\log n}/γ)$ convergence rate, while existing algorithms suffer the $Ω({(p-1)}/γ^2)$ convergence rate.
7.6CVJul 17, 2023
On the Real-Time Semantic Segmentation of Aphid Clusters in the WildRaiyan Rahman, Christopher Indris, Tianxiao Zhang et al.
Aphid infestations can cause extensive damage to wheat and sorghum fields and spread plant viruses, resulting in significant yield losses in agriculture. To address this issue, farmers often rely on chemical pesticides, which are inefficiently applied over large areas of fields. As a result, a considerable amount of pesticide is wasted on areas without pests, while inadequate amounts are applied to areas with severe infestations. The paper focuses on the urgent need for an intelligent autonomous system that can locate and spray infestations within complex crop canopies, reducing pesticide use and environmental impact. We have collected and labeled a large aphid image dataset in the field, and propose the use of real-time semantic segmentation models to segment clusters of aphids. A multiscale dataset is generated to allow for learning the clusters at different scales. We compare the segmentation speeds and accuracy of four state-of-the-art real-time semantic segmentation models on the aphid cluster dataset, benchmarking them against nonreal-time models. The study results show the effectiveness of a real-time solution, which can reduce inefficient pesticide use and increase crop yields, paving the way towards an autonomous pest detection system.
10.4LGOct 17, 2022
Adaptive Oracle-Efficient Online LearningGuanghui Wang, Zihao Hu, Vidya Muthukumar et al.
The classical algorithms for online learning and decision-making have the benefit of achieving the optimal performance guarantees, but suffer from computational complexity limitations when implemented at scale. More recent sophisticated techniques, which we refer to as oracle-efficient methods, address this problem by dispatching to an offline optimization oracle that can search through an exponentially-large (or even infinite) space of decisions and select that which performed the best on any dataset. But despite the benefits of computational feasibility, oracle-efficient algorithms exhibit one major limitation: while performing well in worst-case settings, they do not adapt well to friendly environments. In this paper we consider two such friendly scenarios, (a) "small-loss" problems and (b) IID data. We provide a new framework for designing follow-the-perturbed-leader algorithms that are oracle-efficient and adapt well to the small-loss environment, under a particular condition which we call approximability (which is spiritually related to sufficient conditions provided by Dudík et al., [2020]). We identify a series of real-world settings, including online auctions and transductive online classification, for which approximability holds. We also extend the algorithm to an IID data setting and establish a "best-of-both-worlds" bound in the oracle-efficient setting.
5.0CVJul 12, 2023
A New Dataset and Comparative Study for Aphid Cluster DetectionTianxiao Zhang, Kaidong Li, Xiangyu Chen et al.
Aphids are one of the main threats to crops, rural families, and global food security. Chemical pest control is a necessary component of crop production for maximizing yields, however, it is unnecessary to apply the chemical approaches to the entire fields in consideration of the environmental pollution and the cost. Thus, accurately localizing the aphid and estimating the infestation level is crucial to the precise local application of pesticides. Aphid detection is very challenging as each individual aphid is really small and all aphids are crowded together as clusters. In this paper, we propose to estimate the infection level by detecting aphid clusters. We have taken millions of images in the sorghum fields, manually selected 5,447 images that contain aphids, and annotated each aphid cluster in the image. To use these images for machine learning models, we crop the images into patches and created a labeled dataset with over 151,000 image patches. Then, we implement and compare the performance of four state-of-the-art object detection models.
MOFA: A Model Simplification Roadmap for Image Restoration on Mobile DevicesXiangyu Chen, Ruiwen Zhen, Shuai Li et al.
Image restoration aims to restore high-quality images from degraded counterparts and has seen significant advancements through deep learning techniques. The technique has been widely applied to mobile devices for tasks such as mobile photography. Given the resource limitations on mobile devices, such as memory constraints and runtime requirements, the efficiency of models during deployment becomes paramount. Nevertheless, most previous works have primarily concentrated on analyzing the efficiency of single modules and improving them individually. This paper examines the efficiency across different layers. We propose a roadmap that can be applied to further accelerate image restoration models prior to deployment while simultaneously increasing PSNR (Peak Signal-to-Noise Ratio) and SSIM (Structural Similarity Index). The roadmap first increases the model capacity by adding more parameters to partial convolutions on FLOPs non-sensitive layers. Then, it applies partial depthwise convolution coupled with decoupling upsampling/downsampling layers to accelerate the model speed. Extensive experiments demonstrate that our approach decreases runtime by up to 13% and reduces the number of parameters by up to 23%, while increasing PSNR and SSIM on several image restoration datasets. Source Code of our method is available at \href{https://github.com/xiangyu8/MOFA}{https://github.com/xiangyu8/MOFA}.
1.4CVApr 17, 2022
In Defense of Subspace Tracker: Orthogonal Embedding for Visual TrackingYao Sui, Guanghui Wang, Li Zhang
The paper focuses on a classical tracking model, subspace learning, grounded on the fact that the targets in successive frames are considered to reside in a low-dimensional subspace or manifold due to the similarity in their appearances. In recent years, a number of subspace trackers have been proposed and obtained impressive results. Inspired by the most recent results that the tracking performance is boosted by the subspace with discrimination capability learned over the recently localized targets and their immediately surrounding background, this work aims at solving such a problem: how to learn a robust low-dimensional subspace to accurately and discriminatively represent these target and background samples. To this end, a discriminative approach, which reliably separates the target from its surrounding background, is injected into the subspace learning by means of joint learning, achieving a dimension-adaptive subspace with superior discrimination capability. The proposed approach is extensively evaluated and compared with the state-of-the-art trackers on four popular tracking benchmarks. The experimental results demonstrate that the proposed tracker performs competitively against its counterparts. In particular, it achieves more than 9% performance increase compared with the state-of-the-art subspace trackers.
ABKD: Pursuing a Proper Allocation of the Probability Mass in Knowledge Distillation via $α$-$β$-DivergenceGuanghui Wang, Zhiyong Yang, Zitai Wang et al.
Knowledge Distillation (KD) transfers knowledge from a large teacher model to a smaller student model by minimizing the divergence between their output distributions, typically using forward Kullback-Leibler divergence (FKLD) or reverse KLD (RKLD). It has become an effective training paradigm due to the broader supervision information provided by the teacher distribution compared to one-hot labels. We identify that the core challenge in KD lies in balancing two mode-concentration effects: the \textbf{\textit{Hardness-Concentration}} effect, which refers to focusing on modes with large errors, and the \textbf{\textit{Confidence-Concentration}} effect, which refers to focusing on modes with high student confidence. Through an analysis of how probabilities are reassigned during gradient updates, we observe that these two effects are entangled in FKLD and RKLD, but in extreme forms. Specifically, both are too weak in FKLD, causing the student to fail to concentrate on the target class. In contrast, both are too strong in RKLD, causing the student to overly emphasize the target class while ignoring the broader distributional information from the teacher. To address this imbalance, we propose ABKD, a generic framework with $α$-$β$-divergence. Our theoretical results show that ABKD offers a smooth interpolation between FKLD and RKLD, achieving an effective trade-off between these effects. Extensive experiments on 17 language/vision datasets with 12 teacher-student settings confirm its efficacy. The code is available at https://github.com/ghwang-s/abkd.
Multi-Layer Dense Attention Decoder for Polyp SegmentationKrushi Patel, Fengjun Li, Guanghui Wang
Detecting and segmenting polyps is crucial for expediting the diagnosis of colon cancer. This is a challenging task due to the large variations of polyps in color, texture, and lighting conditions, along with subtle differences between the polyp and its surrounding area. Recently, vision Transformers have shown robust abilities in modeling global context for polyp segmentation. However, they face two major limitations: the inability to learn local relations among multi-level layers and inadequate feature aggregation in the decoder. To address these issues, we propose a novel decoder architecture aimed at hierarchically aggregating locally enhanced multi-level dense features. Specifically, we introduce a novel module named Dense Attention Gate (DAG), which adaptively fuses all previous layers' features to establish local feature relations among all layers. Furthermore, we propose a novel nested decoder architecture that hierarchically aggregates decoder features, thereby enhancing semantic features. We incorporate our novel dense decoder with the PVT backbone network and conduct evaluations on five polyp segmentation datasets: Kvasir, CVC-300, CVC-ColonDB, CVC-ClinicDB, and ETIS. Our experiments and comparisons with nine competing segmentation models demonstrate that the proposed architecture achieves state-of-the-art performance and outperforms the previous models on four datasets. The source code is available at: https://github.com/krushi1992/Dense-Decoder.
OccCasNet: Occlusion-aware Cascade Cost Volume for Light Field Depth EstimationWentao Chao, Fuqing Duan, Xuechun Wang et al.
Light field (LF) depth estimation is a crucial task with numerous practical applications. However, mainstream methods based on the multi-view stereo (MVS) are resource-intensive and time-consuming as they need to construct a finer cost volume. To address this issue and achieve a better trade-off between accuracy and efficiency, we propose an occlusion-aware cascade cost volume for LF depth (disparity) estimation. Our cascaded strategy reduces the sampling number while keeping the sampling interval constant during the construction of a finer cost volume. We also introduce occlusion maps to enhance accuracy in constructing the occlusion-aware cost volume. Specifically, we first obtain the coarse disparity map through the coarse disparity estimation network. Then, the sub-aperture images (SAIs) of side views are warped to the center view based on the initial disparity map. Next, we propose photo-consistency constraints between the warped SAIs and the center SAI to generate occlusion maps for each SAI. Finally, we introduce the coarse disparity map and occlusion maps to construct an occlusion-aware refined cost volume, enabling the refined disparity estimation network to yield a more precise disparity map. Extensive experiments demonstrate the effectiveness of our method. Compared with state-of-the-art methods, our method achieves a superior balance between accuracy and efficiency and ranks first in terms of MSE and Q25 metrics among published methods on the HCI 4D benchmark. The code and model of the proposed method are available at https://github.com/chaowentao/OccCasNet.
DRB-GAN: A Dynamic ResBlock Generative Adversarial Network for Artistic Style TransferWenju Xu, Chengjiang Long, Ruisheng Wang et al.
The paper proposes a Dynamic ResBlock Generative Adversarial Network (DRB-GAN) for artistic style transfer. The style code is modeled as the shared parameters for Dynamic ResBlocks connecting both the style encoding network and the style transfer network. In the style encoding network, a style class-aware attention mechanism is used to attend the style feature representation for generating the style codes. In the style transfer network, multiple Dynamic ResBlocks are designed to integrate the style code and the extracted CNN semantic feature and then feed into the spatial window Layer-Instance Normalization (SW-LIN) decoder, which enables high-quality synthetic images with artistic style transfer. Moreover, the style collection conditional discriminator is designed to equip our DRB-GAN model with abilities for both arbitrary style transfer and collection style transfer during the training stage. No matter for arbitrary style transfer or collection style transfer, extensive experiments strongly demonstrate that our proposed DRB-GAN outperforms state-of-the-art methods and exhibits its superior performance in terms of visual quality and efficiency. Our source code is available at \color{magenta}{\url{https://github.com/xuwenju123/DRB-GAN}}.
Multiple Classifiers Based Maximum Classifier Discrepancy for Unsupervised Domain AdaptationYiju Yang, Taejoon Kim, Guanghui Wang
Adversarial training based on the maximum classifier discrepancy between two classifier structures has achieved great success in unsupervised domain adaptation tasks for image classification. The approach adopts the structure of two classifiers, though simple and intuitive, the learned classification boundary may not well represent the data property in the new domain. In this paper, we propose to extend the structure to multiple classifiers to further boost its performance. To this end, we develop a very straightforward approach to adding more classifiers. We employ the principle that the classifiers are different from each other to construct a discrepancy loss function for multiple classifiers. The proposed construction method of loss function makes it possible to add any number of classifiers to the original framework. The proposed approach is validated through extensive experimental evaluations. We demonstrate that, on average, adopting the structure of three classifiers normally yields the best performance as a trade-off between accuracy and efficiency. With minimum extra computational costs, the proposed approach can significantly improve the performance of the original algorithm. The source code of the proposed approach can be downloaded from \url{https://github.com/rucv/MMCD\_DA}.
Training Deep Neural Networks via Branch-and-BoundYuanwei Wu, Ziming Zhang, Guanghui Wang
In this paper, we propose BPGrad, a novel approximate algorithm for deep nueral network training, based on adaptive estimates of feasible region via branch-and-bound. The method is based on the assumption of Lipschitz continuity in objective function, and as a result, it can adaptively determine the step size for the current gradient given the history of previous updates. We prove that, by repeating such a branch-and-pruning procedure, it can achieve the optimal solution within finite iterations. A computationally efficient solver based on BPGrad has been proposed to train the deep neural networks. Empirical results demonstrate that BPGrad solver works well in practice and compares favorably to other stochastic optimization methods in the tasks of object recognition, detection, and segmentation. The code is available at \url{https://github.com/RyanCV/BPGrad}.
16.4CVFeb 11, 2025
A Survey on Mamba Architecture for Vision ApplicationsFady Ibrahim, Guangjun Liu, Guanghui Wang
Transformers have become foundational for visual tasks such as object detection, semantic segmentation, and video understanding, but their quadratic complexity in attention mechanisms presents scalability challenges. To address these limitations, the Mamba architecture utilizes state-space models (SSMs) for linear scalability, efficient processing, and improved contextual awareness. This paper investigates Mamba architecture for visual domain applications and its recent advancements, including Vision Mamba (ViM) and VideoMamba, which introduce bidirectional scanning, selective scanning mechanisms, and spatiotemporal processing to enhance image and video understanding. Architectural innovations like position embeddings, cross-scan modules, and hierarchical designs further optimize the Mamba framework for global and local feature extraction. These advancements position Mamba as a promising architecture in computer vision research and applications.
6.5CVMay 7, 2024
A New Dataset and Comparative Study for Aphid Cluster Detection and Segmentation in Sorghum FieldsRaiyan Rahman, Christopher Indris, Goetz Bramesfeld et al.
Aphid infestations are one of the primary causes of extensive damage to wheat and sorghum fields and are one of the most common vectors for plant viruses, resulting in significant agricultural yield losses. To address this problem, farmers often employ the inefficient use of harmful chemical pesticides that have negative health and environmental impacts. As a result, a large amount of pesticide is wasted on areas without significant pest infestation. This brings to attention the urgent need for an intelligent autonomous system that can locate and spray sufficiently large infestations selectively within the complex crop canopies. We have developed a large multi-scale dataset for aphid cluster detection and segmentation, collected from actual sorghum fields and meticulously annotated to include clusters of aphids. Our dataset comprises a total of 54,742 image patches, showcasing a variety of viewpoints, diverse lighting conditions, and multiple scales, highlighting its effectiveness for real-world applications. In this study, we trained and evaluated four real-time semantic segmentation models and three object detection models specifically for aphid cluster segmentation and detection. Considering the balance between accuracy and efficiency, Fast-SCNN delivered the most effective segmentation results, achieving 80.46% mean precision, 81.21% mean recall, and 91.66 frames per second (FPS). For object detection, RT-DETR exhibited the best overall performance with a 61.63% mean average precision (mAP), 92.6% mean recall, and 72.55 on an NVIDIA V100 GPU. Our experiments further indicate that aphid cluster segmentation is more suitable for assessing aphid infestations than using detection models.
FgC2F-UDiff: Frequency-guided and Coarse-to-fine Unified Diffusion Model for Multi-modality Missing MRI SynthesisXiaojiao Xiao, Qinmin Vivian Hu, Guanghui Wang
Multi-modality magnetic resonance imaging (MRI) is essential for the diagnosis and treatment of brain tumors. However, missing modalities are commonly observed due to limitations in scan time, scan corruption, artifacts, motion, and contrast agent intolerance. Synthesis of missing MRI has been a means to address the limitations of modality insufficiency in clinical practice and research. However, there are still some challenges, such as poor generalization, inaccurate non-linear mapping, and slow processing speeds. To address the aforementioned issues, we propose a novel unified synthesis model, the Frequency-guided and Coarse-to-fine Unified Diffusion Model (FgC2F-UDiff), designed for multiple inputs and outputs. Specifically, the Coarse-to-fine Unified Network (CUN) fully exploits the iterative denoising properties of diffusion models, from global to detail, by dividing the denoising process into two stages, coarse and fine, to enhance the fidelity of synthesized images. Secondly, the Frequency-guided Collaborative Strategy (FCS) harnesses appropriate frequency information as prior knowledge to guide the learning of a unified, highly non-linear mapping. Thirdly, the Specific-acceleration Hybrid Mechanism (SHM) integrates specific mechanisms to accelerate the diffusion model and enhance the feasibility of many-to-many synthesis. Extensive experimental evaluations have demonstrated that our proposed FgC2F-UDiff model achieves superior performance on two datasets, validated through a comprehensive assessment that includes both qualitative observations and quantitative metrics, such as PSNR SSIM, LPIPS, and FID.
3.7CVOct 18, 2024
Improving Vision Transformers by Overlapping Heads in Multi-Head Self-AttentionTianxiao Zhang, Bo Luo, Guanghui Wang
Vision Transformers have made remarkable progress in recent years, achieving state-of-the-art performance in most vision tasks. A key component of this success is due to the introduction of the Multi-Head Self-Attention (MHSA) module, which enables each head to learn different representations by applying the attention mechanism independently. In this paper, we empirically demonstrate that Vision Transformers can be further enhanced by overlapping the heads in MHSA. We introduce Multi-Overlapped-Head Self-Attention (MOHSA), where heads are overlapped with their two adjacent heads for queries, keys, and values, while zero-padding is employed for the first and last heads, which have only one neighboring head. Various paradigms for overlapping ratios are proposed to fully investigate the optimal performance of our approach. The proposed approach is evaluated using five Transformer models on four benchmark datasets and yields a significant performance boost. The source code will be made publicly available upon publication.
9.4LGFeb 24, 2025
TGT: A Temporal Gating Transformer for Smartphone App Usage PredictionLonglong Li, Cunquan Qu, Guanghui Wang
Accurately predicting smartphone app usage is challenging due to the sparsity and irregularity of user behavior, especially under cold-start and low-activity conditions. Existing approaches mostly rely on static or attention-only architectures, which struggle to model fine-grained temporal dynamics. We propose TGT, a Transformer framework equipped with a temporal gating module that conditions hidden representations on the hour-of-day. Unlike conventional time embeddings, temporal gating adaptively rescales feature dimensions in a time-aware manner, working orthogonally to self-attention and strengthening temporal sensitivity. TGT further incorporates a context-aware encoder that integrates session sequences and user profiles into a unified representation. Experiments on two real-world datasets, Tsinghua App Usage and LSApp, demonstrate that TGT significantly outperforms 15 competitive baselines, achieving notable gains in HR@1 and maintaining robustness under cold-start scenarios. Beyond accuracy, analysis of gating vectors uncovers interpretable daily usage rhythms, showing that TGT learns human-consistent patterns of app behavior. These results establish TGT as both a powerful and interpretable framework for time-aware app usage prediction.
3.3GTFeb 12, 2025
Last-iterate Convergence for Symmetric, General-sum, $2 \times 2$ Games Under The Exponential Weights DynamicGuanghui Wang, Krishna Acharya, Lokranjan Lakshmikanthan et al.
We conduct a comprehensive analysis of the discrete-time exponential-weights dynamic with a constant step size on all \emph{general-sum and symmetric} $2 \times 2$ normal-form games, i.e. games with $2$ pure strategies per player, and where the ensuing payoff tuple is of the form $(A,A^\top)$ (where $A$ is the $2 \times 2$ payoff matrix corresponding to the first player). Such symmetric games commonly arise in real-world interactions between "symmetric" agents who have identically defined utility functions -- such as Bertrand competition, multi-agent performative prediction, and certain congestion games -- and display a rich multiplicity of equilibria despite the seemingly simple setting. Somewhat surprisingly, we show through a first-principles analysis that the exponential weights dynamic, which is popular in online learning, converges in the last iterate for such games regardless of initialization with an appropriately chosen step size. For certain games and/or initializations, we further show that the convergence rate is in fact exponential and holds for any step size. We illustrate our theory with extensive simulations and applications to the aforementioned game-theoretic interactions. In the case of multi-agent performative prediction, we formulate a new "mortgage competition" game between lenders (i.e. banks) who interact with a population of customers, and show that it fits into our framework.
Mix-Domain Contrastive Learning for Unpaired H&E-to-IHC Stain TranslationSong Wang, Zhong Zhang, Huan Yan et al.
H&E-to-IHC stain translation techniques offer a promising solution for precise cancer diagnosis, especially in low-resource regions where there is a shortage of health professionals and limited access to expensive equipment. Considering the pixel-level misalignment of H&E-IHC image pairs, current research explores the pathological consistency between patches from the same positions of the image pair. However, most of them overemphasize the correspondence between domains or patches, overlooking the side information provided by the non-corresponding objects. In this paper, we propose a Mix-Domain Contrastive Learning (MDCL) method to leverage the supervision information in unpaired H&E-to-IHC stain translation. Specifically, the proposed MDCL method aggregates the inter-domain and intra-domain pathology information by estimating the correlation between the anchor patch and all the patches from the matching images, encouraging the network to learn additional contrastive knowledge from mixed domains. With the mix-domain pathology information aggregation, MDCL enhances the pathological consistency between the corresponding patches and the component discrepancy of the patches from the different positions of the generated IHC image. Extensive experiments on two H&E-to-IHC stain translation datasets, namely MIST and BCI, demonstrate that the proposed method achieves state-of-the-art performance across multiple metrics.
3.6IVFeb 1, 2024
Assessing Patient Eligibility for Inspire Therapy through Machine Learning and Deep Learning ModelsMohsena Chowdhury, Tejas Vyas, Rahul Alapati et al.
Inspire therapy is an FDA-approved internal neurostimulation treatment for obstructive sleep apnea. However, not all patients respond to this therapy, posing a challenge even for experienced otolaryngologists to determine candidacy. This paper makes the first attempt to leverage both machine learning and deep learning techniques in discerning patient responsiveness to Inspire therapy using medical data and videos captured through Drug-Induced Sleep Endoscopy (DISE), an essential procedure for Inspire therapy. To achieve this, we gathered and annotated three datasets from 127 patients. Two of these datasets comprise endoscopic videos focused on the Base of the Tongue and Velopharynx. The third dataset composes the patient's clinical information. By utilizing these datasets, we benchmarked and compared the performance of six deep learning models and five classical machine learning algorithms. The results demonstrate the potential of employing machine learning and deep learning techniques to determine a patient's eligibility for Inspire therapy, paving the way for future advancements in this field.
5.2CVJan 26, 2024
Kitchen Food Waste Image Segmentation and Classification for Compost Nutrients EstimationRaiyan Rahman, Mohsena Chowdhury, Yueyang Tang et al.
The escalating global concern over extensive food wastage necessitates innovative solutions to foster a net-zero lifestyle and reduce emissions. The LILA home composter presents a convenient means of recycling kitchen scraps and daily food waste into nutrient-rich, high-quality compost. To capture the nutritional information of the produced compost, we have created and annotated a large high-resolution image dataset of kitchen food waste with segmentation masks of 19 nutrition-rich categories. Leveraging this dataset, we benchmarked four state-of-the-art semantic segmentation models on food waste segmentation, contributing to the assessment of compost quality of Nitrogen, Phosphorus, or Potassium. The experiments demonstrate promising results of using segmentation models to discern food waste produced in our daily lives. Based on the experiments, SegFormer, utilizing MIT-B5 backbone, yields the best performance with a mean Intersection over Union (mIoU) of 67.09. Class-based results are also provided to facilitate further analysis of different food waste classes.
8.5IVJan 24, 2024
Predicting Mitral Valve mTEER Surgery Outcomes Using Machine Learning and Deep Learning TechniquesTejas Vyas, Mohsena Chowdhury, Xiaojiao Xiao et al.
Mitral Transcatheter Edge-to-Edge Repair (mTEER) is a medical procedure utilized for the treatment of mitral valve disorders. However, predicting the outcome of the procedure poses a significant challenge. This paper makes the first attempt to harness classical machine learning (ML) and deep learning (DL) techniques for predicting mitral valve mTEER surgery outcomes. To achieve this, we compiled a dataset from 467 patients, encompassing labeled echocardiogram videos and patient reports containing Transesophageal Echocardiography (TEE) measurements detailing Mitral Valve Repair (MVR) treatment outcomes. Leveraging this dataset, we conducted a benchmark evaluation of six ML algorithms and two DL models. The results underscore the potential of ML and DL in predicting mTEER surgery outcomes, providing insight for future investigation and advancements in this domain.
5.9CVDec 10, 2023
Disentangled Representation Learning for Controllable Person Image GenerationWenju Xu, Chengjiang Long, Yongwei Nie et al.
In this paper, we propose a novel framework named DRL-CPG to learn disentangled latent representation for controllable person image generation, which can produce realistic person images with desired poses and human attributes (e.g., pose, head, upper clothes, and pants) provided by various source persons. Unlike the existing works leveraging the semantic masks to obtain the representation of each component, we propose to generate disentangled latent code via a novel attribute encoder with transformers trained in a manner of curriculum learning from a relatively easy step to a gradually hard one. A random component mask-agnostic strategy is introduced to randomly remove component masks from the person segmentation masks, which aims at increasing the difficulty of training and promoting the transformer encoder to recognize the underlying boundaries between each component. This enables the model to transfer both the shape and texture of the components. Furthermore, we propose a novel attribute decoder network to integrate multi-level attributes (e.g., the structure feature and the attribute representation) with well-designed Dual Adaptive Denormalization (DAD) residual blocks. Extensive experiments strongly demonstrate that the proposed approach is able to transfer both the texture and shape of different human parts and yield realistic results. To our knowledge, we are the first to learn disentangled latent representations with transformers for person image generation.
5.3LGMay 30, 2023
Riemannian Projection-free Online LearningZihao Hu, Guanghui Wang, Jacob Abernethy
The projection operation is a critical component in a wide range of optimization algorithms, such as online gradient descent (OGD), for enforcing constraints and achieving optimal regret bounds. However, it suffers from computational complexity limitations in high-dimensional settings or when dealing with ill-conditioned constraint sets. Projection-free algorithms address this issue by replacing the projection oracle with more efficient optimization subroutines. But to date, these methods have been developed primarily in the Euclidean setting, and while there has been growing interest in optimization on Riemannian manifolds, there has been essentially no work in trying to utilize projection-free tools here. An apparent issue is that non-trivial affine functions are generally non-convex in such domains. In this paper, we present methods for obtaining sub-linear regret guarantees in online geodesically convex optimization on curved spaces for two scenarios: when we have access to (a) a separation oracle or (b) a linear optimization oracle. For geodesically convex losses, and when a separation oracle is available, our algorithms achieve $O(T^{1/2}\:)$ and $O(T^{3/4}\;)$ adaptive regret guarantees in the full information setting and the bandit setting, respectively. When a linear optimization oracle is available, we obtain regret rates of $O(T^{3/4}\;)$ for geodesically convex losses and $O(T^{2/3}\; log T )$ for strongly geodesically convex losses.
3.9CVMay 26, 2023
Gender, Smoking History and Age Prediction from Laryngeal ImagesTianxiao Zhang, Andrés M. Bur, Shannon Kraft et al.
Flexible laryngoscopy is commonly performed by otolaryngologists to detect laryngeal diseases and to recognize potentially malignant lesions. Recently, researchers have introduced machine learning techniques to facilitate automated diagnosis using laryngeal images and achieved promising results. Diagnostic performance can be improved when patients' demographic information is incorporated into models. However, manual entry of patient data is time consuming for clinicians. In this study, we made the first endeavor to employ deep learning models to predict patient demographic information to improve detector model performance. The overall accuracy for gender, smoking history, and age was 85.5%, 65.2%, and 75.9%, respectively. We also created a new laryngoscopic image set for machine learning study and benchmarked the performance of 8 classical deep learning models based on CNNs and Transformers. The results can be integrated into current learning models to improve their performance by incorporating the patient's demographic information.
Dynamic Label Assignment for Object Detection by Combining Predicted IoUs and Anchor IoUsTianxiao Zhang, Bo Luo, Ajay Sharda et al.
Label assignment plays a significant role in modern object detection models. Detection models may yield totally different performances with different label assignment strategies. For anchor-based detection models, the IoU (Intersection over Union) threshold between the anchors and their corresponding ground truth bounding boxes is the key element since the positive samples and negative samples are divided by the IoU threshold. Early object detectors simply utilize the fixed threshold for all training samples, while recent detection algorithms focus on adaptive thresholds based on the distribution of the IoUs to the ground truth boxes. In this paper, we introduce a simple while effective approach to perform label assignment dynamically based on the training status with predictions. By introducing the predictions in label assignment, more high-quality samples with higher IoUs to the ground truth objects are selected as the positive samples, which could reduce the discrepancy between the classification scores and the IoU scores, and generate more high-quality boundary boxes. Our approach shows improvements in the performance of the detection models with the adaptive label assignment algorithm and lower bounding box losses for those positive samples, indicating more samples with higher-quality predicted boxes are selected as positives.
Miti-DETR: Object Detection based on Transformers with Mitigatory Self-Attention ConvergenceWenchi Ma, Tianxiao Zhang, Guanghui Wang
Object Detection with Transformers (DETR) and related works reach or even surpass the highly-optimized Faster-RCNN baseline with self-attention network architectures. Inspired by the evidence that pure self-attention possesses a strong inductive bias that leads to the transformer losing the expressive power with respect to network depth, we propose a transformer architecture with a mitigatory self-attention mechanism by applying possible direct mapping connections in the transformer architecture to mitigate the rank collapse so as to counteract feature expression loss and enhance the model performance. We apply this proposal in object detection tasks and develop a model named Miti-DETR. Miti-DETR reserves the inputs of each single attention layer to the outputs of that layer so that the "non-attention" information has participated in any attention propagation. The formed residual self-attention network addresses two critical issues: (1) stop the self-attention networks from degenerating to rank-1 to the maximized degree; and (2) further diversify the path distribution of parameter update so that easier attention learning is expected. Miti-DETR significantly enhances the average detection precision and convergence speed towards existing DETR-based models on the challenging COCO object detection dataset. Moreover, the proposed transformer with the residual self-attention network can be easily generalized or plugged in other related task models without specific customization.
9.4CVDec 25, 2021
Semantic Clustering based Deduction Learning for Image Recognition and ClassificationWenchi Ma, Xuemin Tu, Bo Luo et al.
The paper proposes a semantic clustering based deduction learning by mimicking the learning and thinking process of human brains. Human beings can make judgments based on experience and cognition, and as a result, no one would recognize an unknown animal as a car. Inspired by this observation, we propose to train deep learning models using the clustering prior that can guide the models to learn with the ability of semantic deducing and summarizing from classification attributes, such as a cat belonging to animals while a car pertaining to vehicles. %Specifically, if an image is labeled as a cat, then the model is trained to learn that "this image is totally not any random class that is the outlier of animal". The proposed approach realizes the high-level clustering in the semantic space, enabling the model to deduce the relations among various classes during the learning process. In addition, the paper introduces a semantic prior based random search for the opposite labels to ensure the smooth distribution of the clustering and the robustness of the classifiers. The proposed approach is supported theoretically and empirically through extensive experiments. We compare the performance across state-of-the-art classifiers on popular benchmarks, and the generalization ability is verified by adding noisy labeling to the datasets. Experimental results demonstrate the superiority of the proposed approach.
4.7CVDec 17, 2021
Towards More Effective PRM-based Crowd Counting via A Multi-resolution Fusion and Attention NetworkUsman Sajid, Guanghui Wang
The paper focuses on improving the recent plug-and-play patch rescaling module (PRM) based approaches for crowd counting. In order to make full use of the PRM potential and obtain more reliable and accurate results for challenging images with crowd-variation, large perspective, extreme occlusions, and cluttered background regions, we propose a new PRM based multi-resolution and multi-task crowd counting network by exploiting the PRM module with more effectiveness and potency. The proposed model consists of three deep-layered branches with each branch generating feature maps of different resolutions. These branches perform a feature-level fusion across each other to build the vital collective knowledge to be used for the final crowd estimate. Additionally, early-stage feature maps undergo visual attention to strengthen the later-stage channels understanding of the foreground regions. The integration of these deep branches with the PRM module and the early-attended blocks proves to be more effective than the original PRM based schemes through extensive numerical and visual evaluations on four benchmark datasets. The proposed approach yields a significant improvement by a margin of 12.6% in terms of the RMSE evaluation criterion. It also outperforms state-of-the-art methods in cross-dataset evaluations.
A Discriminative Channel Diversification Network for Image ClassificationKrushi Patel, Guanghui Wang
Channel attention mechanisms in convolutional neural networks have been proven to be effective in various computer vision tasks. However, the performance improvement comes with additional model complexity and computation cost. In this paper, we propose a light-weight and effective attention module, called channel diversification block, to enhance the global context by establishing the channel relationship at the global level. Unlike other channel attention mechanisms, the proposed module focuses on the most discriminative features by giving more attention to the spatially distinguishable channels while taking account of the channel activation. Different from other attention models that plugin the module in between several intermediate layers, the proposed module is embedded at the end of the backbone networks, making it easy to implement. Extensive experiments on CIFAR-10, SVHN, and Tiny-ImageNet datasets demonstrate that the proposed module improves the performance of the baseline networks by a margin of 3% on average.
8.7CVSep 4, 2021
Audio-Visual Transformer Based Crowd CountingUsman Sajid, Xiangyu Chen, Hasan Sajid et al.
Crowd estimation is a very challenging problem. The most recent study tries to exploit auditory information to aid the visual models, however, the performance is limited due to the lack of an effective approach for feature extraction and integration. The paper proposes a new audiovisual multi-task network to address the critical challenges in crowd counting by effectively utilizing both visual and audio inputs for better modalities association and productive feature extraction. The proposed network introduces the notion of auxiliary and explicit image patch-importance ranking (PIR) and patch-wise crowd estimate (PCE) information to produce a third (run-time) modality. These modalities (audio, visual, run-time) undergo a transformer-inspired cross-modality co-attention mechanism to finally output the crowd estimate. To acquire rich visual features, we propose a multi-branch structure with transformer-style fusion in-between. Extensive experimental evaluations show that the proposed scheme outperforms the state-of-the-art networks under all evaluation settings with up to 33.8% improvement. We also analyze and compare the vision-only variant of our network and empirically demonstrate its superiority over previous approaches.
8.7CVMay 17, 2021
A Fine-Grained Visual Attention Approach for Fingerspelling Recognition in the WildKamala Gajurel, Cuncong Zhong, Guanghui Wang
Fingerspelling in sign language has been the means of communicating technical terms and proper nouns when they do not have dedicated sign language gestures. Automatic recognition of fingerspelling can help resolve communication barriers when interacting with deaf people. The main challenges prevalent in fingerspelling recognition are the ambiguity in the gestures and strong articulation of the hands. The automatic recognition model should address high inter-class visual similarity and high intra-class variation in the gestures. Most of the existing research in fingerspelling recognition has focused on the dataset collected in a controlled environment. The recent collection of a large-scale annotated fingerspelling dataset in the wild, from social media and online platforms, captures the challenges in a real-world scenario. In this work, we propose a fine-grained visual attention mechanism using the Transformer model for the sequence-to-sequence prediction task in the wild dataset. The fine-grained attention is achieved by utilizing the change in motion of the video frames (optical flow) in sequential context-based attention along with a Transformer encoder model. The unsegmented continuous video dataset is jointly trained by balancing the Connectionist Temporal Classification (CTC) loss and the maximum-entropy loss. The proposed approach can capture better fine-grained attention in a single iteration. Experiment evaluations show that it outperforms the state-of-the-art approaches.
Few-Shot Learning by Integrating Spatial and Frequency RepresentationXiangyu Chen, Guanghui Wang
Human beings can recognize new objects with only a few labeled examples, however, few-shot learning remains a challenging problem for machine learning systems. Most previous algorithms in few-shot learning only utilize spatial information of the images. In this paper, we propose to integrate the frequency information into the learning model to boost the discrimination ability of the system. We employ Discrete Cosine Transformation (DCT) to generate the frequency representation, then, integrate the features from both the spatial domain and frequency domain for classification. The proposed strategy and its effectiveness are validated with different backbones, datasets, and algorithms. Extensive experiments demonstrate that the frequency information is complementary to the spatial representations in few-shot classification. The classification accuracy is boosted significantly by integrating features from both the spatial and frequency domains in different few-shot learning tasks.
Enhanced U-Net: A Feature Enhancement Network for Polyp SegmentationKrushi Patel, Andres M. Bur, Guanghui Wang
Colonoscopy is a procedure to detect colorectal polyps which are the primary cause for developing colorectal cancer. However, polyp segmentation is a challenging task due to the diverse shape, size, color, and texture of polyps, shuttle difference between polyp and its background, as well as low contrast of the colonoscopic images. To address these challenges, we propose a feature enhancement network for accurate polyp segmentation in colonoscopy images. Specifically, the proposed network enhances the semantic information using the novel Semantic Feature Enhance Module (SFEM). Furthermore, instead of directly adding encoder features to the respective decoder layer, we introduce an Adaptive Global Context Module (AGCM), which focuses only on the encoder's significant and hard fine-grained features. The integration of these two modules improves the quality of features layer by layer, which in turn enhances the final feature representation. The proposed approach is evaluated on five colonoscopy datasets and demonstrates superior performance compared to other state-of-the-art models.
SGNet: A Super-class Guided Network for Image Classification and Object DetectionKaidong Li, Nina Y. Wang, Yiju Yang et al.
Most classification models treat different object classes in parallel and the misclassifications between any two classes are treated equally. In contrast, human beings can exploit high-level information in making a prediction of an unknown object. Inspired by this observation, the paper proposes a super-class guided network (SGNet) to integrate the high-level semantic information into the network so as to increase its performance in inference. SGNet takes two-level class annotations that contain both super-class and finer class labels. The super-classes are higher-level semantic categories that consist of a certain amount of finer classes. A super-class branch (SCB), trained on super-class labels, is introduced to guide finer class prediction. At the inference time, we adopt two different strategies: Two-step inference (TSI) and direct inference (DI). TSI first predicts the super-class and then makes predictions of the corresponding finer class. On the other hand, DI directly generates predictions from the finer class branch (FCB). Extensive experiments have been performed on CIFAR-100 and MS COCO datasets. The experimental results validate the proposed approach and demonstrate its superior performance on image classification and object detection.
17.5CVApr 22, 2021
Colonoscopy Polyp Detection and Classification: Dataset Creation and Comparative EvaluationsKaidong Li, Mohammad I. Fathan, Krushi Patel et al.
Colorectal cancer (CRC) is one of the most common types of cancer with a high mortality rate. Colonoscopy is the preferred procedure for CRC screening and has proven to be effective in reducing CRC mortality. Thus, a reliable computer-aided polyp detection and classification system can significantly increase the effectiveness of colonoscopy. In this paper, we create an endoscopic dataset collected from various sources and annotate the ground truth of polyp location and classification results with the help of experienced gastroenterologists. The dataset can serve as a benchmark platform to train and evaluate the machine learning models for polyp classification. We have also compared the performance of eight state-of-the-art deep learning-based object detection models. The results demonstrate that deep CNN models are promising in CRC screening. This work can serve as a baseline for future research in polyp detection and classification.
12.5LGMar 20, 2021
Projection-free Distributed Online Learning with Sublinear Communication ComplexityYuanyu Wan, Guanghui Wang, Wei-Wei Tu et al.
To deal with complicated constraints via locally light computations in distributed online learning, a recent study has presented a projection-free algorithm called distributed online conditional gradient (D-OCG), and achieved an $O(T^{3/4})$ regret bound for convex losses, where $T$ is the number of total rounds. However, it requires $T$ communication rounds, and cannot utilize the strong convexity of losses. In this paper, we propose an improved variant of D-OCG, namely D-BOCG, which can attain the same $O(T^{3/4})$ regret bound with only $O(\sqrt{T})$ communication rounds for convex losses, and a better regret bound of $O(T^{2/3}(\log T)^{1/3})$ with fewer $O(T^{1/3}(\log T)^{2/3})$ communication rounds for strongly convex losses. The key idea is to adopt a delayed update mechanism that reduces the communication complexity, and redefine the surrogate loss function in D-OCG for exploiting the strong convexity. Furthermore, we provide lower bounds to demonstrate that the $O(\sqrt{T})$ communication rounds required by D-BOCG are optimal (in terms of $T$) for achieving the $O(T^{3/4})$ regret with convex losses, and the $O(T^{1/3}(\log T)^{2/3})$ communication rounds required by D-BOCG are near-optimal (in terms of $T$) for achieving the $O(T^{2/3}(\log T)^{1/3})$ regret with strongly convex losses up to polylogarithmic factors. Finally, to handle the more challenging bandit setting, in which only the loss value is available, we incorporate the classical one-point gradient estimator into D-BOCG, and obtain similar theoretical guarantees.
3.7CVFeb 10, 2021
Classification of Long Noncoding RNA Elements Using Deep Convolutional Neural Networks and Siamese NetworksBrian McClannahan, Cucong Zhong, Guanghui Wang
In the last decade, the discovery of noncoding RNA(ncRNA) has exploded. Classifying these ncRNA is critical todetermining their function. This thesis proposes a new methodemploying deep convolutional neural networks (CNNs) to classifyncRNA sequences. To this end, this paper first proposes anefficient approach to convert the RNA sequences into imagescharacterizing their base-pairing probability. As a result, clas-sifying RNA sequences is converted to an image classificationproblem that can be efficiently solved by available CNN-basedclassification models. This research also considers the foldingpotential of the ncRNAs in addition to their primary sequence.Based on the proposed approach, a benchmark image classifi-cation dataset is generated from the RFAM database of ncRNAsequences. In addition, three classical CNN models and threeSiamese network models have been implemented and comparedto demonstrate the superior performance and efficiency of theproposed approach. Extensive experimental results show thegreat potential of using deep learning approaches for RNAclassification.
15.9CVJan 19, 2021
SOSD-Net: Joint Semantic Object Segmentation and Depth Estimation from Monocular imagesLei He, Jiwen Lu, Guanghui Wang et al.
Depth estimation and semantic segmentation play essential roles in scene understanding. The state-of-the-art methods employ multi-task learning to simultaneously learn models for these two tasks at the pixel-wise level. They usually focus on sharing the common features or stitching feature maps from the corresponding branches. However, these methods lack in-depth consideration on the correlation of the geometric cues and the scene parsing. In this paper, we first introduce the concept of semantic objectness to exploit the geometric relationship of these two tasks through an analysis of the imaging process, then propose a Semantic Object Segmentation and Depth Estimation Network (SOSD-Net) based on the objectness assumption. To the best of our knowledge, SOSD-Net is the first network that exploits the geometry constraint for simultaneous monocular depth estimation and semantic segmentation. In addition, considering the mutual implicit relationship between these two tasks, we exploit the iterative idea from the expectation-maximization algorithm to train the proposed network more effectively. Extensive experimental results on the Cityscapes and NYU v2 dataset are presented to demonstrate the superior performance of the proposed approach.
8.7CVJan 3, 2021
Six-channel Image Representation for Cross-domain Object DetectionTianxiao Zhang, Wenchi Ma, Guanghui Wang
Most deep learning models are data-driven and the excellent performance is highly dependent on the abundant and diverse datasets. However, it is very hard to obtain and label the datasets of some specific scenes or applications. If we train the detector using the data from one domain, it cannot perform well on the data from another domain due to domain shift, which is one of the big challenges of most object detection models. To address this issue, some image-to-image translation techniques have been employed to generate some fake data of some specific scenes to train the models. With the advent of Generative Adversarial Networks (GANs), we could realize unsupervised image-to-image translation in both directions from a source to a target domain and from the target to the source domain. In this study, we report a new approach to making use of the generated images. We propose to concatenate the original 3-channel images and their corresponding GAN-generated fake images to form 6-channel representations of the dataset, hoping to address the domain shift problem while exploiting the success of available detection models. The idea of augmented data representation may inspire further study on object detection and other applications.
Efficient Golf Ball Detection and Tracking Based on Convolutional Neural Networks and Kalman FilterTianxiao Zhang, Xiaohan Zhang, Yiju Yang et al.
This paper focuses on the problem of online golf ball detection and tracking from image sequences. An efficient real-time approach is proposed by exploiting convolutional neural networks (CNN) based object detection and a Kalman filter based prediction. Five classical deep learning-based object detection networks are implemented and evaluated for ball detection, including YOLO v3 and its tiny version, YOLO v4, Faster R-CNN, SSD, and RefineDet. The detection is performed on small image patches instead of the entire image to increase the performance of small ball detection. At the tracking stage, a discrete Kalman filter is employed to predict the location of the ball and a small image patch is cropped based on the prediction. Then, the object detector is utilized to refine the location of the ball and update the parameters of Kalman filter. In order to train the detection models and test the tracking algorithm, a collection of golf ball dataset is created and annotated. Extensive comparative experiments are performed to demonstrate the effectiveness and superior tracking performance of the proposed scheme.
14.3CVNov 2, 2020
Deep Feature Augmentation for Occluded Image ClassificationFeng Cen, Xiaoyu Zhao, Wuzhuang Li et al.
Due to the difficulty in acquiring massive task-specific occluded images, the classification of occluded images with deep convolutional neural networks (CNNs) remains highly challenging. To alleviate the dependency on large-scale occluded image datasets, we propose a novel approach to improve the classification accuracy of occluded images by fine-tuning the pre-trained models with a set of augmented deep feature vectors (DFVs). The set of augmented DFVs is composed of original DFVs and pseudo-DFVs. The pseudo-DFVs are generated by randomly adding difference vectors (DVs), extracted from a small set of clean and occluded image pairs, to the real DFVs. In the fine-tuning, the back-propagation is conducted on the DFV data flow to update the network parameters. The experiments on various datasets and network structures show that the deep feature augmentation significantly improves the classification accuracy of occluded images without a noticeable influence on the performance of clean images. Specifically, on the ILSVRC2012 dataset with synthetic occluded images, the proposed approach achieves 11.21% and 9.14% average increases in classification accuracy for the ResNet50 networks fine-tuned on the occlusion-exclusive and occlusion-inclusive training sets, respectively.