Zhanyi Hu

CV
h-index4
21papers
547citations
Novelty46%
AI Score31

21 Papers

LGApr 4, 2023Code
SLPerf: a Unified Framework for Benchmarking Split Learning

Tianchen Zhou, Zhanyi Hu, Bingzhe Wu et al.

Data privacy concerns has made centralized training of data, which is scattered across silos, infeasible, leading to the need for collaborative learning frameworks. To address that, two prominent frameworks emerged, i.e., federated learning (FL) and split learning (SL). While FL has established various benchmark frameworks and research libraries,SL currently lacks a unified library despite its diversity in terms of label sharing, model aggregation, and cut layer choice. This lack of standardization makes comparing SL paradigms difficult. To address this, we propose SLPerf, a unified research framework and open research library for SL, and conduct extensive experiments on four widely-used datasets under both IID and Non-IID data settings. Our contributions include a comprehensive survey of recently proposed SL paradigms, a detailed benchmark comparison of different SL paradigms in different situations, and rich engineering take-away messages and research insights for improving SL paradigms. SLPerf can facilitate SL algorithm development and fair performance comparisons. The code is available at https://github.com/Rainysponge/Split-learning-Attacks .

CVMar 17, 2022
Semantic-diversity transfer network for generalized zero-shot learning via inner disagreement based OOD detector

Bo Liu, Qiulei Dong, Zhanyi Hu

Zero-shot learning (ZSL) aims to recognize objects from unseen classes, where the kernel problem is to transfer knowledge from seen classes to unseen classes by establishing appropriate mappings between visual and semantic features. The knowledge transfer in many existing works is limited mainly due to the facts that 1) the widely used visual features are global ones but not totally consistent with semantic attributes; 2) only one mapping is learned in existing works, which is not able to effectively model diverse visual-semantic relations; 3) the bias problem in the generalized ZSL (GZSL) could not be effectively handled. In this paper, we propose two techniques to alleviate these limitations. Firstly, we propose a Semantic-diversity transfer Network (SetNet) addressing the first two limitations, where 1) a multiple-attention architecture and a diversity regularizer are proposed to learn multiple local visual features that are more consistent with semantic attributes and 2) a projector ensemble that geometrically takes diverse local features as inputs is proposed to model visual-semantic relations from diverse local perspectives. Secondly, we propose an inner disagreement based domain detection module (ID3M) for GZSL to alleviate the third limitation, which picks out unseen-class data before class-level classification. Due to the absence of unseen-class data in training stage, ID3M employs a novel self-contained training scheme and detects out unseen-class data based on a designed inner disagreement criterion. Experimental results on three public datasets demonstrate that the proposed SetNet with the explored ID3M achieves a significant improvement against $30$ state-of-the-art methods.

LGJan 22, 2025
Bad-PFL: Exploring Backdoor Attacks against Personalized Federated Learning

Mingyuan Fan, Zhanyi Hu, Fuyi Wang et al.

Data heterogeneity and backdoor attacks rank among the most significant challenges facing federated learning (FL). For data heterogeneity, personalized federated learning (PFL) enables each client to maintain a private personalized model to cater to client-specific knowledge. Meanwhile, vanilla FL has proven vulnerable to backdoor attacks. However, recent advancements in PFL community have demonstrated a potential immunity against such attacks. This paper explores this intersection further, revealing that existing federated backdoor attacks fail in PFL because backdoors about manually designed triggers struggle to survive in personalized models. To tackle this, we design Bad-PFL, which employs features from natural data as our trigger. As long as the model is trained on natural data, it inevitably embeds the backdoor associated with our trigger, ensuring its longevity in personalized models. Moreover, our trigger undergoes mutual reinforcement training with the model, further solidifying the backdoor's durability and enhancing attack effectiveness. The large-scale experiments across three benchmark datasets demonstrate the superior performance of our attack against various PFL methods, even when equipped with state-of-the-art defense mechanisms.

CVMar 30, 2022
An Iterative Co-Training Transductive Framework for Zero Shot Learning

Bo Liu, Lihua Hu, Qiulei Dong et al.

In zero-shot learning (ZSL) community, it is generally recognized that transductive learning performs better than inductive one as the unseen-class samples are also used in its training stage. How to generate pseudo labels for unseen-class samples and how to use such usually noisy pseudo labels are two critical issues in transductive learning. In this work, we introduce an iterative co-training framework which contains two different base ZSL models and an exchanging module. At each iteration, the two different ZSL models are co-trained to separately predict pseudo labels for the unseen-class samples, and the exchanging module exchanges the predicted pseudo labels, then the exchanged pseudo-labeled samples are added into the training sets for the next iteration. By such, our framework can gradually boost the ZSL performance by fully exploiting the potential complementarity of the two models' classification capabilities. In addition, our co-training framework is also applied to the generalized ZSL (GZSL), in which a semantic-guided OOD detector is proposed to pick out the most likely unseen-class samples before class-level classification to alleviate the bias problem in GZSL. Extensive experiments on three benchmarks show that our proposed methods could significantly outperform about $31$ state-of-the-art ones.

CVJan 16, 2022
Pursuing 3D Scene Structures with Optical Satellite Images from Affine Reconstruction to Euclidean Reconstruction

Pinhe Wang, Limin Shi, Bao Chen et al.

How to use multiple optical satellite images to recover the 3D scene structure is a challenging and important problem in the remote sensing field. Most existing methods in literature have been explored based on the classical RPC (rational polynomial camera) model which requires at least 39 GCPs (ground control points), however, it is not trivial to obtain such a large number of GCPs in many real scenes. Addressing this problem, we propose a hierarchical reconstruction framework based on multiple optical satellite images, which needs only 4 GCPs. The proposed framework is composed of an affine dense reconstruction stage and a followed affine-to-Euclidean upgrading stage: At the affine dense reconstruction stage, an affine dense reconstruction approach is explored for pursuing the 3D affine scene structure without any GCP from input satellite images. Then at the affine-to-Euclidean upgrading stage, the obtained 3D affine structure is upgraded to a Euclidean one with 4 GCPs. Experimental results on two public datasets demonstrate that the proposed method significantly outperforms three state-of-the-art methods in most cases.

CVJan 14, 2022
HardBoost: Boosting Zero-Shot Learning with Hard Classes

Bo Liu, Lihua Hu, Zhanyi Hu et al.

This work is a systematical analysis on the so-called hard class problem in zero-shot learning (ZSL), that is, some unseen classes disproportionally affect the ZSL performances than others, as well as how to remedy the problem by detecting and exploiting hard classes. At first, we report our empirical finding that the hard class problem is a ubiquitous phenomenon and persists regardless of used specific methods in ZSL. Then, we find that high semantic affinity among unseen classes is a plausible underlying cause of hardness and design two metrics to detect hard classes. Finally, two frameworks are proposed to remedy the problem by detecting and exploiting hard classes, one under inductive setting, the other under transductive setting. The proposed frameworks could accommodate most existing ZSL methods to further significantly boost their performances with little efforts. Extensive experiments on three popular benchmarks demonstrate the benefits by identifying and exploiting the hard classes in ZSL.

CVJul 8, 2021
Superpoint-guided Semi-supervised Semantic Segmentation of 3D Point Clouds

Shuang Deng, Qiulei Dong, Bo Liu et al.

3D point cloud semantic segmentation is a challenging topic in the computer vision field. Most of the existing methods in literature require a large amount of fully labeled training data, but it is extremely time-consuming to obtain these training data by manually labeling massive point clouds. Addressing this problem, we propose a superpoint-guided semi-supervised segmentation network for 3D point clouds, which jointly utilizes a small portion of labeled scene point clouds and a large number of unlabeled point clouds for network training. The proposed network is iteratively updated with its predicted pseudo labels, where a superpoint generation module is introduced for extracting superpoints from 3D point clouds, and a pseudo-label optimization module is explored for automatically assigning pseudo labels to the unlabeled points under the constraint of the extracted superpoints. Additionally, there are some 3D points without pseudo-label supervision. We propose an edge prediction module to constrain features of edge points. A superpoint feature aggregation module and a superpoint feature consistency loss function are introduced to smooth superpoint features. Extensive experimental results on two 3D public datasets demonstrate that our method can achieve better performance than several state-of-the-art point cloud segmentation networks and several popular semi-supervised segmentation methods with few labeled scenes.

CVJul 7, 2021
Rotation Transformation Network: Learning View-Invariant Point Cloud for Classification and Segmentation

Shuang Deng, Bo Liu, Qiulei Dong et al.

Many recent works show that a spatial manipulation module could boost the performances of deep neural networks (DNNs) for 3D point cloud analysis. In this paper, we aim to provide an insight into spatial manipulation modules. Firstly, we find that the smaller the rotational degree of freedom (RDF) of objects is, the more easily these objects are handled by these DNNs. Then, we investigate the effect of the popular T-Net module and find that it could not reduce the RDF of objects. Motivated by the above two issues, we propose a rotation transformation network for point cloud analysis, called RTN, which could reduce the RDF of input 3D objects to 0. The RTN could be seamlessly inserted into many existing DNNs for point cloud analysis. Extensive experimental results on 3D point cloud classification and segmentation tasks demonstrate that the proposed RTN could improve the performances of several state-of-the-art methods significantly.

CVJul 1, 2021
Language-Level Semantics Conditioned 3D Point Cloud Segmentation

Bo Liu, Shuang Deng, Qiulei Dong et al.

In this work, a language-level Semantics Conditioned framework for 3D Point cloud segmentation, called SeCondPoint, is proposed, where language-level semantics are introduced to condition the modeling of point feature distribution as well as the pseudo-feature generation, and a feature-geometry-based mixup approach is further proposed to facilitate the distribution learning. To our knowledge, this is the first attempt in literature to introduce language-level semantics to the 3D point cloud segmentation task. Since a large number of point features could be generated from the learned distribution thanks to the semantics conditioned modeling, any existing segmentation network could be embedded into the proposed framework to boost its performance. In addition, the proposed framework has the inherent advantage of dealing with novel classes, which seems an impossible feat for the current segmentation networks. Extensive experimental results on two public datasets demonstrate that three typical segmentation networks could achieve significant improvements over their original performances after enhancement by the proposed framework in the conventional 3D segmentation task. Two benchmarks are also introduced for a newly introduced zero-shot 3D segmentation task, and the results also validate the proposed framework.

CVJun 1, 2021
Hardness Sampling for Self-Training Based Transductive Zero-Shot Learning

Liu Bo, Qiulei Dong, Zhanyi Hu

Transductive zero-shot learning (T-ZSL) which could alleviate the domain shift problem in existing ZSL works, has received much attention recently. However, an open problem in T-ZSL: how to effectively make use of unseen-class samples for training, still remains. Addressing this problem, we first empirically analyze the roles of unseen-class samples with different degrees of hardness in the training process based on the uneven prediction phenomenon found in many ZSL methods, resulting in three observations. Then, we propose two hardness sampling approaches for selecting a subset of diverse and hard samples from a given unseen-class dataset according to these observations. The first one identifies the samples based on the class-level frequency of the model predictions while the second enhances the former by normalizing the class frequency via an approximate class prior estimated by an explored prior estimation algorithm. Finally, we design a new Self-Training framework with Hardness Sampling for T-ZSL, called STHS, where an arbitrary inductive ZSL method could be seamlessly embedded and it is iteratively trained with unseen-class samples selected by the hardness sampling approach. We introduce two typical ZSL methods into the STHS framework and extensive experiments demonstrate that the derived T-ZSL methods outperform many state-of-the-art methods on three public benchmarks. Besides, we note that the unseen-class dataset is separately used for training in some existing transductive generalized ZSL (T-GZSL) methods, which is not strict for a GZSL task. Hence, we suggest a more strict T-GZSL data setting and establish a competitive baseline on this setting by introducing the proposed STHS framework to T-GZSL.

ROOct 22, 2020
Optimization-Based Visual-Inertial SLAM Tightly Coupled with Raw GNSS Measurements

Jinxu Liu, Wei Gao, Zhanyi Hu

Unlike loose coupling approaches and the EKF-based approaches in the literature, we propose an optimization-based visual-inertial SLAM tightly coupled with raw Global Navigation Satellite System (GNSS) measurements, a first attempt of this kind in the literature to our knowledge. More specifically, reprojection error, IMU pre-integration error and raw GNSS measurement error are jointly minimized within a sliding window, in which the asynchronism between images and raw GNSS measurements is accounted for. In addition, issues such as marginalization, noisy measurements removal, as well as tackling vulnerable situations are also addressed. Experimental results on public dataset in complex urban scenes show that our proposed approach outperforms state-of-the-art visual-inertial SLAM, GNSS single point positioning, as well as a loose coupling approach, including scenes mainly containing low-rise buildings and those containing urban canyons.

CVAug 29, 2020
Zero-Shot Learning from Adversarial Feature Residual to Compact Visual Feature

Bo Liu, Qiulei Dong, Zhanyi Hu

Recently, many zero-shot learning (ZSL) methods focused on learning discriminative object features in an embedding feature space, however, the distributions of the unseen-class features learned by these methods are prone to be partly overlapped, resulting in inaccurate object recognition. Addressing this problem, we propose a novel adversarial network to synthesize compact semantic visual features for ZSL, consisting of a residual generator, a prototype predictor, and a discriminator. The residual generator is to generate the visual feature residual, which is integrated with a visual prototype predicted via the prototype predictor for synthesizing the visual feature. The discriminator is to distinguish the synthetic visual features from the real ones extracted from an existing categorization CNN. Since the generated residuals are generally numerically much smaller than the distances among all the prototypes, the distributions of the unseen-class features synthesized by the proposed network are less overlapped. In addition, considering that the visual features from categorization CNNs are generally inconsistent with their semantic features, a simple feature selection strategy is introduced for extracting more compact semantic visual features. Extensive experimental results on six benchmark datasets demonstrate that our method could achieve a significantly better performance than existing state-of-the-art methods by 1.2-13.2% in most cases.

ROFeb 1, 2020
Bidirectional Trajectory Computation for Odometer-Aided Visual-Inertial SLAM

Jinxu Liu, Wei Gao, Zhanyi Hu

Odometer-aided visual-inertial SLAM systems typically have a good performance for navigation of wheeled platforms, while they usually suffer from degenerate cases before the first turning. In this paper, firstly we perform an observability analysis w.r.t. the extrinsic parameters before the first turning, which is a complement of the existing results of observability analyses. Secondly, inspired by the above observability analyses, we propose a bidirectional trajectory computation method, by which the poses before the first turning are refined in the backward computation thread, and the real-time trajectory is adjusted accordingly. Experimental results prove that our proposed method not only solves the problem of the unobservability of accelerometer bias and extrinsic parameters before the first turning, but also results in more accurate trajectories in comparison with the state-of-the-art approaches.

LGOct 22, 2019
Face representation by deep learning: a linear encoding in a parameter space?

Qiulei Dong, Jiayin Sun, Zhanyi Hu

Recently, Convolutional Neural Networks (CNNs) have achieved tremendous performances on face recognition, and one popular perspective regarding CNNs' success is that CNNs could learn discriminative face representations from face images with complex image feature encoding. However, it is still unclear what is the intrinsic mechanism of face representation in CNNs. In this work, we investigate this problem by formulating face images as points in a shape-appearance parameter space, and our results demonstrate that: (i) The encoding and decoding of the neuron responses (representations) to face images in CNNs could be achieved under a linear model in the parameter space, in agreement with the recent discovery in primate IT face neurons, but different from the aforementioned perspective on CNNs' face representation with complex image feature encoding; (ii) The linear model for face encoding and decoding in the parameter space could achieve close or even better performances on face recognition and verification than state-of-the-art CNNs, which might provide new lights on the design strategies for face recognition systems; (iii) The neuron responses to face images in CNNs could not be adequately modelled by the axis model, a model recently proposed on face modelling in primate IT cortex. All these results might shed some lights on the often complained blackbox nature behind CNNs' tremendous performances on face recognition.

NCJun 6, 2019
Non-uniqueness phenomenon of object representation in modelling IT cortex by deep convolutional neural network (DCNN)

Qiulei Dong, Bo Liu, Zhanyi Hu

Recently DCNN (Deep Convolutional Neural Network) has been advocated as a general and promising modelling approach for neural object representation in primate inferotemporal cortex. In this work, we show that some inherent non-uniqueness problem exists in the DCNN-based modelling of image object representations. This non-uniqueness phenomenon reveals to some extent the theoretical limitation of this general modelling approach, and invites due attention to be taken in practice.

CVApr 21, 2019
Complete Scene Reconstruction by Merging Images and Laser Scans

Xiang Gao, Shuhan Shen, Lingjie Zhu et al.

Image based modeling and laser scanning are two commonly used approaches in large-scale architectural scene reconstruction nowadays. In order to generate a complete scene reconstruction, an effective way is to completely cover the scene using ground and aerial images, supplemented by laser scanning on certain regions with low texture and complicated structure. Thus, the key issue is to accurately calibrate cameras and register laser scans in a unified framework. To this end, we proposed a three-step pipeline for complete scene reconstruction by merging images and laser scans. First, images are captured around the architecture in a multi-view and multi-scale way and are feed into a structure-from-motion (SfM) pipeline to generate SfM points. Then, based on the SfM result, the laser scanning locations are automatically planned by considering textural richness, structural complexity of the scene and spatial layout of the laser scans. Finally, the images and laser scans are accurately merged in a coarse-to-fine manner. Experimental evaluations on two ancient Chinese architecture datasets demonstrate the effectiveness of our proposed complete scene reconstruction pipeline.

CVMar 27, 2018
Learning Depth from Single Images with Deep Neural Network Embedding Focal Length

Lei He, Guanghui Wang, Zhanyi Hu

Learning depth from a single image, as an important issue in scene understanding, has attracted a lot of attention in the past decade. The accuracy of the depth estimation has been improved from conditional Markov random fields, non-parametric methods, to deep convolutional neural networks most recently. However, there exist inherent ambiguities in recovering 3D from a single 2D image. In this paper, we first prove the ambiguity between the focal length and monocular depth learning, and verify the result using experiments, showing that the focal length has a great influence on accurate depth recovery. In order to learn monocular depth by embedding the focal length, we propose a method to generate synthetic varying-focal-length dataset from fixed-focal-length datasets, and a simple and effective method is implemented to fill the holes in the newly generated images. For the sake of accurate depth recovery, we propose a novel deep neural network to infer depth through effectively fusing the middle-level information on the fixed-focal-length dataset, which outperforms the state-of-the-art methods built on pre-trained VGG. Furthermore, the newly generated varying-focal-length dataset is taken as input to the proposed network in both learning and inference phases. Extensive experiments on the fixed- and varying-focal-length datasets demonstrate that the learned monocular depth with embedded focal length is significantly improved compared to that without embedding the focal length information.

CVMar 23, 2018
CSfM: Community-based Structure from Motion

Hainan Cui, Shuhan Shen, Xiang Gao et al.

Structure-from-Motion approaches could be broadly divided into two classes: incremental and global. While incremental manner is robust to outliers, it suffers from error accumulation and heavy computation load. The global manner has the advantage of simultaneously estimating all camera poses, but it is usually sensitive to epipolar geometry outliers. In this paper, we propose an adaptive community-based SfM (CSfM) method which takes both robustness and efficiency into consideration. First, the epipolar geometry graph is partitioned into separate communities. Then, the reconstruction problem is solved for each community in parallel. Finally, the reconstruction results are merged by a novel global similarity averaging method, which solves three convex $L1$ optimization problems. Experimental results show that our method performs better than many of the state-of-the-art global SfM approaches in terms of computational efficiency, while achieves similar or better reconstruction accuracy and robustness than many of the state-of-the-art incremental SfM approaches.

CVDec 12, 2016
Statistics of Visual Responses to Object Stimuli from Primate AIT Neurons to DNN Neurons

Qiulei Dong, Zhanyi Hu

Cadieu et al. (Cadieu,2014) reported that deep neural networks(DNNs) could rival the representation of primate inferotemporal cortex for object recognition. Lehky et al. (Lehky,2011) provided a statistical analysis on neural responses to object stimuli in primate AIT cortex. They found the intrinsic dimensionality of object representations in AIT cortex is around 100 (Lehky,2014). Considering the outstanding performance of DNNs in object recognition, it is worthwhile investigating whether the responses of DNN neurons have similar response statistics to those of AIT neurons. Following Lehky et al.'s works, we analyze the response statistics to image stimuli and the intrinsic dimensionality of object representations of DNN neurons. Our findings show in terms of kurtosis and Pareto tail index, the response statistics on single-neuron selectivity and population sparseness of DNN neurons are fundamentally different from those of IT neurons except some special cases. By increasing the number of neurons and stimuli, the conclusions could alter substantially. In addition, with the ascendancy of the convolutional layers of DNNs, the single-neuron selectivity and population sparseness of DNN neurons increase, indicating the last convolutional layer is to learn features for object representations, while the following fully-connected layers are to learn categorization features. It is also found that a sufficiently large number of stimuli and neurons are necessary for obtaining a stable dimensionality. To our knowledge, this is the first work to analyze the response statistics of DNN neurons comparing with AIT neurons, and our results provide not only some insights into the discrepancy of DNN neurons with respect to IT neurons in object representation, but also shed some light on possible outcomes of IT neurons when the number of recorded neurons and stimuli is beyond the level in (Lehky,2011,2014).

CVApr 26, 2016
Modern Physiognomy: An Investigation on Predicting Personality Traits and Intelligence from the Human Face

Rizhen Qin, Wei Gao, Huarong Xu et al.

The human behavior of evaluating other individuals with respect to their personality traits and intelligence by evaluating their faces plays a crucial role in human relations. These trait judgments might influence important social outcomes in our lives such as elections and court sentences. Previous studies have reported that human can make valid inferences for at least four personality traits. In addition, some studies have demonstrated that facial trait evaluation can be learned using machine learning methods accurately. In this work, we experimentally explore whether self-reported personality traits and intelligence can be predicted reliably from a facial image. More specifically, the prediction problem is separately cast in two parts: a classification task and a regression task. A facial structural feature is constructed from the relations among facial salient points, and an appearance feature is built by five texture descriptors. In addition, a minutia-based fingerprint feature from a fingerprint image is also explored. The classification results show that the personality traits "Rule-consciousness" and "Vigilance" can be predicted reliably, and that the traits of females can be predicted more accurately than those of male. However, the regression experiments show that it is difficult to predict scores for individual personality traits and intelligence. The residual plots and the correlation results indicate no evident linear correlation between the measured scores and the predicted scores. Both the classification and the regression results reveal that "Rule-consciousness" and "Tension" can be reliably predicted from the facial features, while "Social boldness" gets the worst prediction results. The experiments results show that it is difficult to predict intelligence from either the facial features or the fingerprint feature, a finding that is in agreement with previous studies.

DSDec 1, 2015
Dynamic Parallel and Distributed Graph Cuts

Miao Yu, Shuhan Shen, Zhanyi Hu

Graph-cuts are widely used in computer vision. In order to speed up the optimization process and improve the scalability for large graphs, Strandmark and Kahl introduced a splitting method to split a graph into multiple subgraphs for parallel computation in both shared and distributed memory models. However, this parallel algorithm (parallel BK-algorithm) does not have a polynomial bound on the number of iterations and is found non-convergent in some cases due to the possible multiple optimal solutions of its sub-problems. To remedy this non-convergence problem, in this work we first introduce a merging method capable of merging any number of those adjacent sub-graphs which could hardly reach an agreement on their overlapped region in the parallel BK algorithm. Based on the pseudo-boolean representations of graph cuts,our merging method is shown able to effectively reuse all the computed flows in these sub-graphs. Through both the splitting and merging, we further propose a dynamic parallel and distributed graph-cuts algorithm with guaranteed convergence to the globally optimal solutions within a predefined number of iterations. In essence, this work provides a general framework to allow more sophisticated splitting and merging strategies to be employed to further boost performance. Our dynamic parallel algorithm is validated with extensive experimental results.