Yicong Zhou

h-index68

22papers

357citations

Novelty52%

AI Score35

Ranked #107,485 of 194,257 authors (top 55%)#35,942 in CV (top 61%)

22 Papers

1.8LGOct 31, 2022

SEVGGNet-LSTM: a fused deep learning model for ECG classification

Tongyue He, Yiming Chen, Junxin Chen et al.

This paper presents a fused deep learning algorithm for ECG classification. It takes advantages of the combined convolutional and recurrent neural network for ECG classification, and the weight allocation capability of attention mechanism. The input ECG signals are firstly segmented and normalized, and then fed into the combined VGG and LSTM network for feature extraction and classification. An attention mechanism (SE block) is embedded into the core network for increasing the weight of important features. Two databases from different sources and devices are employed for performance validation, and the results well demonstrate the effectiveness and robustness of the proposed algorithm.

4.8IVSep 5, 2022

Uformer-ICS: A U-Shaped Transformer for Image Compressive Sensing Service

Kuiyuan Zhang, Zhongyun Hua, Yuanman Li et al.

Many service computing applications require real-time dataset collection from multiple devices, necessitating efficient sampling techniques to reduce bandwidth and storage pressure. Compressive sensing (CS) has found wide-ranging applications in image acquisition and reconstruction. Recently, numerous deep-learning methods have been introduced for CS tasks. However, the accurate reconstruction of images from measurements remains a significant challenge, especially at low sampling rates. In this paper, we propose Uformer-ICS as a novel U-shaped transformer for image CS tasks by introducing inner characteristics of CS into transformer architecture. To utilize the uneven sparsity distribution of image blocks, we design an adaptive sampling architecture that allocates measurement resources based on the estimated block sparsity, allowing the compressed results to retain maximum information from the original image. Additionally, we introduce a multi-channel projection (MCP) module inspired by traditional CS optimization methods. By integrating the MCP module into the transformer blocks, we construct projection-based transformer blocks, and then form a symmetrical reconstruction model using these blocks and residual convolutional blocks. Therefore, our reconstruction model can simultaneously utilize the local features and long-range dependencies of image, and the prior projection knowledge of CS theory. Experimental results demonstrate its significantly better reconstruction performance than state-of-the-art deep learning-based CS methods.

9.6CVSep 7, 2024Code

Rethinking The Training And Evaluation of Rich-Context Layout-to-Image Generation

Jiaxin Cheng, Zixu Zhao, Tong He et al.

Recent advancements in generative models have significantly enhanced their capacity for image generation, enabling a wide range of applications such as image editing, completion and video editing. A specialized area within generative modeling is layout-to-image (L2I) generation, where predefined layouts of objects guide the generative process. In this study, we introduce a novel regional cross-attention module tailored to enrich layout-to-image generation. This module notably improves the representation of layout regions, particularly in scenarios where existing methods struggle with highly complex and detailed textual descriptions. Moreover, while current open-vocabulary L2I methods are trained in an open-set setting, their evaluations often occur in closed-set environments. To bridge this gap, we propose two metrics to assess L2I performance in open-vocabulary scenarios. Additionally, we conduct a comprehensive user study to validate the consistency of these metrics with human preferences.

1.5CVOct 11, 2023

Multiview Transformer: Rethinking Spatial Information in Hyperspectral Image Classification

Jie Zhang, Yongshan Zhang, Yicong Zhou

Identifying the land cover category for each pixel in a hyperspectral image (HSI) relies on spectral and spatial information. An HSI cuboid with a specific patch size is utilized to extract spatial-spectral feature representation for the central pixel. In this article, we investigate that scene-specific but not essential correlations may be recorded in an HSI cuboid. This additional information improves the model performance on existing HSI datasets and makes it hard to properly evaluate the ability of a model. We refer to this problem as the spatial overfitting issue and utilize strict experimental settings to avoid it. We further propose a multiview transformer for HSI classification, which consists of multiview principal component analysis (MPCA), spectral encoder-decoder (SED), and spatial-pooling tokenization transformer (SPTT). MPCA performs dimension reduction on an HSI via constructing spectral multiview observations and applying PCA on each view data to extract low-dimensional view representation. The combination of view representations, named multiview representation, is the dimension reduction output of the MPCA. To aggregate the multiview information, a fully-convolutional SED with a U-shape in spectral dimension is introduced to extract a multiview feature map. SPTT transforms the multiview features into tokens using the spatial-pooling tokenization strategy and learns robust and discriminative spatial-spectral features for land cover identification. Classification is conducted with a linear classifier. Experiments on three HSI datasets with rigid settings demonstrate the superiority of the proposed multiview transformer over the state-of-the-art methods.

1.4CVMay 24, 2021Code

LineCounter: Learning Handwritten Text Line Segmentation by Counting

Deng Li, Yue Wu, Yicong Zhou

Handwritten Text Line Segmentation (HTLS) is a low-level but important task for many higher-level document processing tasks like handwritten text recognition. It is often formulated in terms of semantic segmentation or object detection in deep learning. However, both formulations have serious shortcomings. The former requires heavy post-processing of splitting/merging adjacent segments, while the latter may fail on dense or curved texts. In this paper, we propose a novel Line Counting formulation for HTLS -- that involves counting the number of text lines from the top at every pixel location. This formulation helps learn an end-to-end HTLS solution that directly predicts per-pixel line number for a given document image. Furthermore, we propose a deep neural network (DNN) model LineCounter to perform HTLS through the Line Counting formulation. Our extensive experiments on the three public datasets (ICDAR2013-HSC, HIT-MW, and VML-AHTE) demonstrate that LineCounter outperforms state-of-the-art HTLS approaches. Source code is available at https://github.com/Leedeng/Line-Counter.

4.7CVMay 12, 2021Code

SauvolaNet: Learning Adaptive Sauvola Network for Degraded Document Binarization

Deng Li, Yue Wu, Yicong Zhou

Inspired by the classic Sauvola local image thresholding approach, we systematically study it from the deep neural network (DNN) perspective and propose a new solution called SauvolaNet for degraded document binarization (DDB). It is composed of three explainable modules, namely, Multi-Window Sauvola (MWS), Pixelwise Window Attention (PWA), and Adaptive Sauolva Threshold (AST). The MWS module honestly reflects the classic Sauvola but with trainable parameters and multi-window settings. The PWA module estimates the preferred window sizes for each pixel location. The AST module further consolidates the outputs from MWS and PWA and predicts the final adaptive threshold for each pixel location. As a result, SauvolaNet becomes end-to-end trainable and significantly reduces the number of required network parameters to 40K -- it is only 1\% of MobileNetV2. In the meantime, it achieves the State-of-The-Art (SoTA) performance for the DDB task -- SauvolaNet is at least comparable to, if not better than, SoTA binarization solutions in our extensive studies on the 13 public document binarization datasets. Our source code is available at https://github.com/Leedeng/SauvolaNet.

16.6CVJul 19, 2020Code

Geometry Constrained Weakly Supervised Object Localization

Weizeng Lu, Xi Jia, Weicheng Xie et al.

We propose a geometry constrained network, termed GC-Net, for weakly supervised object localization (WSOL). GC-Net consists of three modules: a detector, a generator and a classifier. The detector predicts the object location defined by a set of coefficients describing a geometric shape (i.e. ellipse or rectangle), which is geometrically constrained by the mask produced by the generator. The classifier takes the resulting masked images as input and performs two complementary classification tasks for the object and background. To make the mask more compact and more complete, we propose a novel multi-task loss function that takes into account area of the geometric shape, the categorical cross-entropy and the negative entropy. In contrast to previous approaches, GC-Net is trained end-to-end and predict object location without any post-processing (e.g. thresholding) that may require additional tuning. Extensive experiments on the CUB-200-2011 and ILSVRC2012 datasets show that GC-Net outperforms state-of-the-art methods by a large margin. Our source code is available at https://github.com/lwzeng/GC-Net.

8.8CRApr 11, 2012Code

A Novel Latin Square Image Cipher

Yue Wu, Yicong Zhou, Joseph P. Noonan et al.

In this paper, we introduce a symmetric-key Latin square image cipher (LSIC) for grayscale and color images. Our contributions to the image encryption community include 1) we develop new Latin square image encryption primitives including Latin Square Whitening, Latin Square S-box and Latin Square P-box ; 2) we provide a new way of integrating probabilistic encryption in image encryption by embedding random noise in the least significant image bit-plane; and 3) we construct LSIC with these Latin square image encryption primitives all on one keyed Latin square in a new loom-like substitution-permutation network. Consequently, the proposed LSIC achieve many desired properties of a secure cipher including a large key space, high key sensitivities, uniformly distributed ciphertext, excellent confusion and diffusion properties, semantically secure, and robustness against channel noise. Theoretical analysis show that the LSIC has good resistance to many attack models including brute-force attacks, ciphertext-only attacks, known-plaintext attacks and chosen-plaintext attacks. Experimental analysis under extensive simulation results using the complete USC-SIPI Miscellaneous image dataset demonstrate that LSIC outperforms or reach state of the art suggested by many peer algorithms. All these analysis and results demonstrate that the LSIC is very suitable for digital image encryption. Finally, we open source the LSIC MATLAB code under webpage https://sites.google.com/site/tuftsyuewu/source-code.

7.1LGJan 21, 2025

Highly Efficient Rotation-Invariant Spectral Embedding for Scalable Incomplete Multi-View Clustering

Xinxin Wang, Yongshan Zhang, Yicong Zhou

Incomplete multi-view clustering presents significant challenges due to missing views. Although many existing graph-based methods aim to recover missing instances or complete similarity matrices with promising results, they still face several limitations: (1) Recovered data may be unsuitable for spectral clustering, as these methods often ignore guidance from spectral analysis; (2) Complex optimization processes require high computational burden, hindering scalability to large-scale problems; (3) Most methods do not address the rotational mismatch problem in spectral embeddings. To address these issues, we propose a highly efficient rotation-invariant spectral embedding (RISE) method for scalable incomplete multi-view clustering. RISE learns view-specific embeddings from incomplete bipartite graphs to capture the complementary information. Meanwhile, a complete consensus representation with second-order rotation-invariant property is recovered from these incomplete embeddings in a unified model. Moreover, we design a fast alternating optimization algorithm with linear complexity and promising convergence to solve the proposed formulation. Extensive experiments on multiple datasets demonstrate the effectiveness, scalability, and efficiency of RISE compared to the state-of-the-art methods.

3.6CVFeb 4, 2025

DCT-Mamba3D: Spectral Decorrelation and Spatial-Spectral Feature Extraction for Hyperspectral Image Classification

Weijia Cao, Xiaofei Yang, Yicong Zhou et al.

Hyperspectral image classification presents challenges due to spectral redundancy and complex spatial-spectral dependencies. This paper proposes a novel framework, DCT-Mamba3D, for hyperspectral image classification. DCT-Mamba3D incorporates: (1) a 3D spectral-spatial decorrelation module that applies 3D discrete cosine transform basis functions to reduce both spectral and spatial redundancy, enhancing feature clarity across dimensions; (2) a 3D-Mamba module that leverages a bidirectional state-space model to capture intricate spatial-spectral dependencies; and (3) a global residual enhancement module that stabilizes feature representation, improving robustness and convergence. Extensive experiments on benchmark datasets show that our DCT-Mamba3D outperforms the state-of-the-art methods in challenging scenarios such as the same object in different spectra and different objects in the same spectra.

1.4CVSep 7, 2021

CIM: Class-Irrelevant Mapping for Few-Shot Classification

Shuai Shao, Lei Xing, Yixin Chen et al.

Few-shot classification (FSC) is one of the most concerned hot issues in recent years. The general setting consists of two phases: (1) Pre-train a feature extraction model (FEM) with base data (has large amounts of labeled samples). (2) Use the FEM to extract the features of novel data (with few labeled samples and totally different categories from base data), then classify them with the to-be-designed classifier. The adaptability of pre-trained FEM to novel data determines the accuracy of novel features, thereby affecting the final classification performances. To this end, how to appraise the pre-trained FEM is the most crucial focus in the FSC community. It sounds like traditional Class Activate Mapping (CAM) based methods can achieve this by overlaying weighted feature maps. However, due to the particularity of FSC (e.g., there is no backpropagation when using the pre-trained FEM to extract novel features), we cannot activate the feature map with the novel classes. To address this challenge, we propose a simple, flexible method, dubbed as Class-Irrelevant Mapping (CIM). Specifically, first, we introduce dictionary learning theory and view the channels of the feature map as the bases in a dictionary. Then we utilize the feature map to fit the feature vector of an image to achieve the corresponding channel weights. Finally, we overlap the weighted feature map for visualization to appraise the ability of pre-trained FEM on novel data. For fair use of CIM in evaluating different models, we propose a new measurement index, called Feature Localization Accuracy (FLA). In experiments, we first compare our CIM with CAM in regular tasks and achieve outstanding performances. Next, we use our CIM to appraise several classical FSC frameworks without considering the classification results and discuss them.

3.8CRJun 27, 2021

Secure Reversible Data Hiding in Encrypted Images Using Cipher-Feedback Secret Sharing

Zhongyun Hua, Yanxiang Wang, Shuang Yi et al.

Reversible data hiding in encrypted images (RDH-EI) has attracted increasing attention, since it can protect the privacy of original images while the embedded data can be exactly extracted. Recently, some RDH-EI schemes with multiple data hiders have been proposed using secret sharing technique. However, these schemes protect the contents of the original images with lightweight security level. In this paper, we propose a high-security RDH-EI scheme with multiple data hiders. First, we introduce a cipher-feedback secret sharing (CFSS) technique. It follows the cryptography standards by introducing the cipher-feedback strategy of AES. Then, using the CFSS technique, we devise a new (r,n)-threshold (r<=n) RDH-EI scheme with multiple data hiders called CFSS-RDHEI. It can encrypt an original image into n encrypted images with reduced size using an encryption key and sends each encrypted image to one data hider. Each data hider can independently embed secret data into the encrypted image to obtain the corresponding marked encrypted image. The original image can be completely recovered from r marked encrypted images and the encryption key. Performance evaluations show that our CFSS-RDHEI scheme has high embedding rate and its generated encrypted images are much smaller, compared to existing secret sharing-based RDH-EI schemes. Security analysis demonstrates that it can achieve high security to defense some commonly used security attacks.

5.6CVJun 24, 2021

Detection of Deepfake Videos Using Long Distance Attention

Wei Lu, Lingyi Liu, Junwei Luo et al.

With the rapid progress of deepfake techniques in recent years, facial video forgery can generate highly deceptive video contents and bring severe security threats. And detection of such forgery videos is much more urgent and challenging. Most existing detection methods treat the problem as a vanilla binary classification problem. In this paper, the problem is treated as a special fine-grained classification problem since the differences between fake and real faces are very subtle. It is observed that most existing face forgery methods left some common artifacts in the spatial domain and time domain, including generative defects in the spatial domain and inter-frame inconsistencies in the time domain. And a spatial-temporal model is proposed which has two components for capturing spatial and temporal forgery traces in global perspective respectively. The two components are designed using a novel long distance attention mechanism. The one component of the spatial domain is used to capture artifacts in a single frame, and the other component of the time domain is used to capture artifacts in consecutive frames. They generate attention maps in the form of patches. The attention method has a broader vision which contributes to better assembling global information and extracting local statistic information. Finally, the attention maps are used to guide the network to focus on pivotal parts of the face, just like other fine-grained classification methods. The experimental results on different public datasets demonstrate that the proposed method achieves the state-of-the-art performance, and the proposed long distance attention method can effectively capture pivotal parts for face forgery.

1.4CVMar 8, 2021

Bridging the Distribution Gap of Visible-Infrared Person Re-identification with Modality Batch Normalization

Wenkang Li, Qi Ke, Wenbin Chen et al.

Visible-infrared cross-modality person re-identification (VI-ReID), whose aim is to match person images between visible and infrared modality, is a challenging cross-modality image retrieval task. Most existing works integrate batch normalization layers into their neural network, but we found out that batch normalization layers would lead to two types of distribution gap: 1) inter-mini-batch distribution gap -- the distribution gap of the same modality between each mini-batch; 2) intra-mini-batch modality distribution gap -- the distribution gap of different modality within the same mini-batch. To address these problems, we propose a new batch normalization layer called Modality Batch Normalization (MBN), which normalizes each modality sub-mini-batch respectively instead of the whole mini-batch, and can reduce these distribution gap significantly. Extensive experiments show that our MBN is able to boost the performance of VI-ReID models, even with different datasets, backbones and losses.

1.4CVMar 8, 2021

Unified Batch All Triplet Loss for Visible-Infrared Person Re-identification

Wenkang Li, Ke Qi, Wenbin Chen et al.

Visible-Infrared cross-modality person re-identification (VI-ReID), whose aim is to match person images between visible and infrared modality, is a challenging cross-modality image retrieval task. Batch Hard Triplet loss is widely used in person re-identification tasks, but it does not perform well in the Visible-Infrared person re-identification task. Because it only optimizes the hardest triplet for each anchor image within the mini-batch, samples in the hardest triplet may all belong to the same modality, which will lead to the imbalance problem of modality optimization. To address this problem, we adopt the batch all triplet selection strategy, which selects all the possible triplets among samples to optimize instead of the hardest triplet. Furthermore, we introduce Unified Batch All Triplet loss and Cosine Softmax loss to collaboratively optimize the cosine distance between image vectors. Similarly, we rewrite the Hetero Center Triplet loss, which is proposed for VI-ReID task, into a batch all form to improve model performance. Extensive experiments indicate the effectiveness of the proposed methods, which outperform state-of-the-art methods by a wide margin.

1.2CVAug 21, 2020

Learning Domain-invariant Graph for Adaptive Semi-supervised Domain Adaptation with Few Labeled Source Samples

Jinfeng Li, Weifeng Liu, Yicong Zhou et al.

Domain adaptation aims to generalize a model from a source domain to tackle tasks in a related but different target domain. Traditional domain adaptation algorithms assume that enough labeled data, which are treated as the prior knowledge are available in the source domain. However, these algorithms will be infeasible when only a few labeled data exist in the source domain, and thus the performance decreases significantly. To address this challenge, we propose a Domain-invariant Graph Learning (DGL) approach for domain adaptation with only a few labeled source samples. Firstly, DGL introduces the Nystrom method to construct a plastic graph that shares similar geometric property as the target domain. And then, DGL flexibly employs the Nystrom approximation error to measure the divergence between plastic graph and source graph to formalize the distribution mismatch from the geometric perspective. Through minimizing the approximation error, DGL learns a domain-invariant geometric graph to bridge source and target domains. Finally, we integrate the learned domain-invariant graph with the semi-supervised learning and further propose an adaptive semi-supervised model to handle the cross-domain problems. The results of extensive experiments on popular datasets verify the superiority of DGL, especially when only a few labeled source samples are available.

2.3MMMar 28, 2019Code

Universal chosen-ciphertext attack for a family of image encryption schemes

Junxin Chen, Lei Chen, Yicong Zhou

During the past decades, there is a great popularity employing nonlinear dynamics and permutation-substitution architecture for image encryption. There are three primary procedures in such encryption schemes, the key schedule module for producing encryption factors, permutation for image scrambling and substitution for pixel modification. Under the assumption of chosen-ciphertext attack, we evaluate the security of a class of image ciphers which adopts pixel-level permutation and modular addition for substitution. It is mathematically revealed that the mapping from differentials of ciphertexts to those of plaintexts are linear and has nothing to do with the key schedules, permutation techniques and encryption rounds. Moreover, a universal chosen-ciphertext attack is proposed and validated. Experimental results demonstrate that the plaintexts can be directly reconstructed without any security key or encryption elements. Related cryptographic discussions are also given.

0.9CVJun 21, 2018

Ensemble p-Laplacian Regularization for Remote Sensing Image Recognition

Xueqi Ma, Weifeng Liu, Dapeng Tao et al.

Recently, manifold regularized semi-supervised learning (MRSSL) received considerable attention because it successfully exploits the geometry of the intrinsic data probability distribution including both labeled and unlabeled samples to leverage the performance of a learning model. As a natural nonlinear generalization of graph Laplacian, p-Laplacian has been proved having the rich theoretical foundations to better preserve the local structure. However, it is difficult to determine the fitting graph p-Lapalcian i.e. the parameter which is a critical factor for the performance of graph p-Laplacian. Therefore, we develop an ensemble p-Laplacian regularization (EpLapR) to fully approximate the intrinsic manifold of the data distribution. EpLapR incorporates multiple graphs into a regularization term in order to sufficiently explore the complementation of graph p-Laplacian. Specifically, we construct a fused graph by introducing an optimization approach to assign suitable weights on different p-value graphs. And then, we conduct semi-supervised learning framework on the fused graph. Extensive experiments on UC-Merced data set demonstrate the effectiveness and efficiency of the proposed method.

0.9CVNov 19, 2017

Vision Recognition using Discriminant Sparse Optimization Learning

Qingxiang Feng, Yicong Zhou

To better select the correct training sample and obtain the robust representation of the query sample, this paper proposes a discriminant-based sparse optimization learning model. This learning model integrates discriminant and sparsity together. Based on this model, we then propose a classifier called locality-based discriminant sparse representation (LDSR). Because discriminant can help to increase the difference of samples in different classes and to decrease the difference of samples within the same class, LDSR can obtain better sparse coefficients and constitute a better sparse representation for classification. In order to take advantages of kernel techniques, discriminant and sparsity, we further propose a nonlinear classifier called kernel locality-based discriminant sparse representation (KLDSR). Experiments on several well-known databases prove that the performance of LDSR and KLDSR is better than that of several state-of-the-art methods including deep learning based methods.

0.9CVNov 19, 2017

Discriminant Projection Representation-based Classification for Vision Recognition

Qingxiang Feng, Yicong Zhou

Representation-based classification methods such as sparse representation-based classification (SRC) and linear regression classification (LRC) have attracted a lot of attentions. In order to obtain the better representation, a novel method called projection representation-based classification (PRC) is proposed for image recognition in this paper. PRC is based on a new mathematical model. This model denotes that the 'ideal projection' of a sample point $x$ on the hyper-space $H$ may be gained by iteratively computing the projection of $x$ on a line of hyper-space $H$ with the proper strategy. Therefore, PRC is able to iteratively approximate the 'ideal representation' of each subject for classification. Moreover, the discriminant PRC (DPRC) is further proposed, which obtains the discriminant information by maximizing the ratio of the between-class reconstruction error over the within-class reconstruction error. Experimental results on five typical databases show that the proposed PRC and DPRC are effective and outperform other state-of-the-art methods on several vision recognition tasks.

0.9CVNov 19, 2017

Color Face Recognition using High-Dimension Quaternion-based Adaptive Representation

Qingxiang Feng, Yicong Zhou

Recently, quaternion collaborative representation-based classification (QCRC) and quaternion sparse representation-based classification (QSRC) have been proposed for color face recognition. They can obtain correlation information among different color channels. However, their performance is unstable in different conditions. For example, QSRC performs better than than QCRC on some situations but worse on other situations. To benefit from quaternion-based $e_2$-norm minimization in QCRC and quaternion-based $e_1$-norm minimization in QSRC, we propose the quaternion-based adaptive representation (QAR) that uses a quaternion-based $e_p$-norm minimization ($1 \le p \le 2$) for color face recognition. To obtain the high dimension correlation information among different color channels, we further propose the high-dimension quaternion-based adaptive representation (HD-QAR). The experimental results demonstrate that the proposed QAR and HD-QAR achieve better recognition rates than QCRC, QSRC and several state-of-the-art methods.

1.2CDDec 14, 2016

Nonlinear Chaotic Processing Model

Zhongyun Hua, Yicong Zhou

Designing chaotic maps with complex dynamics is a challenging topic. This paper introduces the nonlinear chaotic processing (NCP) model, which contains six basic nonlinear operations. Each operation is a general framework that can use existing chaotic maps as seed maps to generate a huge number of new chaotic maps. The proposed NCP model can be easily extended by introducing new nonlinear operations or by arbitrarily combining existing ones. The properties and chaotic behaviors of the NCP model are investigated. To show its effectiveness and usability, as examples, we provide four new chaotic maps generated by the NCP model and evaluate their chaotic performance using Lyapunov exponent, Shannon entropy, correlation dimension and initial state sensitivity. The experimental results show that these chaotic maps have more complex chaotic behaviors than existing ones.