Zhimin Gao

CV
20papers
1,179citations
Novelty50%
AI Score30

20 Papers

CVSep 21, 2022Code
FT-HID: A Large Scale RGB-D Dataset for First and Third Person Human Interaction Analysis

Zihui Guo, Yonghong Hou, Pichao Wang et al.

Analysis of human interaction is one important research topic of human motion analysis. It has been studied either using first person vision (FPV) or third person vision (TPV). However, the joint learning of both types of vision has so far attracted little attention. One of the reasons is the lack of suitable datasets that cover both FPV and TPV. In addition, existing benchmark datasets of either FPV or TPV have several limitations, including the limited number of samples, participant subjects, interaction categories, and modalities. In this work, we contribute a large-scale human interaction dataset, namely, FT-HID dataset. FT-HID contains pair-aligned samples of first person and third person visions. The dataset was collected from 109 distinct subjects and has more than 90K samples for three modalities. The dataset has been validated by using several existing action recognition methods. In addition, we introduce a novel multi-view interaction mechanism for skeleton sequences, and a joint learning multi-stream framework for first person and third person visions. Both methods yield promising results on the FT-HID dataset. It is expected that the introduction of this vision-aligned large-scale dataset will promote the development of both FPV and TPV, and their joint learning techniques for human action analysis. The dataset and code are available at \href{https://github.com/ENDLICHERE/FT-HID}{here}.

CVOct 6, 2022
Focal and Global Spatial-Temporal Transformer for Skeleton-based Action Recognition

Zhimin Gao, Peitao Wang, Pei Lv et al.

Despite great progress achieved by transformer in various vision tasks, it is still underexplored for skeleton-based action recognition with only a few attempts. Besides, these methods directly calculate the pair-wise global self-attention equally for all the joints in both the spatial and temporal dimensions, undervaluing the effect of discriminative local joints and the short-range temporal dynamics. In this work, we propose a novel Focal and Global Spatial-Temporal Transformer network (FG-STFormer), that is equipped with two key components: (1) FG-SFormer: focal joints and global parts coupling spatial transformer. It forces the network to focus on modelling correlations for both the learned discriminative spatial joints and human body parts respectively. The selective focal joints eliminate the negative effect of non-informative ones during accumulating the correlations. Meanwhile, the interactions between the focal joints and body parts are incorporated to enhance the spatial dependencies via mutual cross-attention. (2) FG-TFormer: focal and global temporal transformer. Dilated temporal convolution is integrated into the global self-attention mechanism to explicitly capture the local temporal motion patterns of joints or body parts, which is found to be vital important to make temporal transformer work. Extensive experimental results on three benchmarks, namely NTU-60, NTU-120 and NW-UCLA, show our FG-STFormer surpasses all existing transformer-based methods, and compares favourably with state-of-the art GCN-based methods.

CVOct 19, 2023
Human Pose-based Estimation, Tracking and Action Recognition with Deep Learning: A Survey

Lijuan Zhou, Xiang Meng, Zhihuan Liu et al.

Human pose analysis has garnered significant attention within both the research community and practical applications, owing to its expanding array of uses, including gaming, video surveillance, sports performance analysis, and human-computer interactions, among others. The advent of deep learning has significantly improved the accuracy of pose capture, making pose-based applications increasingly practical. This paper presents a comprehensive survey of pose-based applications utilizing deep learning, encompassing pose estimation, pose tracking, and action recognition.Pose estimation involves the determination of human joint positions from images or image sequences. Pose tracking is an emerging research direction aimed at generating consistent human pose trajectories over time. Action recognition, on the other hand, targets the identification of action types using pose estimation or tracking data. These three tasks are intricately interconnected, with the latter often reliant on the former. In this survey, we comprehensively review related works, spanning from single-person pose estimation to multi-person pose estimation, from 2D pose estimation to 3D pose estimation, from single image to video, from mining temporal context gradually to pose tracking, and lastly from tracking to pose-based action recognition. As a survey centered on the application of deep learning to pose analysis, we explicitly discuss both the strengths and limitations of existing techniques. Notably, we emphasize methodologies for integrating these three tasks into a unified framework within video sequences. Additionally, we explore the challenges involved and outline potential directions for future research.

CVNov 13, 2021Code
A Central Difference Graph Convolutional Operator for Skeleton-Based Action Recognition

Shuangyan Miao, Yonghong Hou, Zhimin Gao et al.

This paper proposes a new graph convolutional operator called central difference graph convolution (CDGC) for skeleton based action recognition. It is not only able to aggregate node information like a vanilla graph convolutional operation but also gradient information. Without introducing any additional parameters, CDGC can replace vanilla graph convolution in any existing Graph Convolutional Networks (GCNs). In addition, an accelerated version of the CDGC is developed which greatly improves the speed of training. Experiments on two popular large-scale datasets NTU RGB+D 60 & 120 have demonstrated the efficacy of the proposed CDGC. Code is available at https://github.com/iesymiao/CD-GCN.

CVJan 5, 2021
Trear: Transformer-based RGB-D Egocentric Action Recognition

Xiangyu Li, Yonghong Hou, Pichao Wang et al.

In this paper, we propose a \textbf{Tr}ansformer-based RGB-D \textbf{e}gocentric \textbf{a}ction \textbf{r}ecognition framework, called Trear. It consists of two modules, inter-frame attention encoder and mutual-attentional fusion block. Instead of using optical flow or recurrent units, we adopt self-attention mechanism to model the temporal structure of the data from different modalities. Input frames are cropped randomly to mitigate the effect of the data redundancy. Features from each modality are interacted through the proposed fusion block and combined through a simple yet effective fusion operation to produce a joint RGB-D representation. Empirical experiments on two large egocentric RGB-D datasets, THU-READ and FPHA, and one small dataset, WCVS, have shown that the proposed method outperforms the state-of-the-art results by a large margin.

CVDec 8, 2020
Transformer Guided Geometry Model for Flow-Based Unsupervised Visual Odometry

Xiangyu Li, Yonghong Hou, Pichao Wang et al.

Existing unsupervised visual odometry (VO) methods either match pairwise images or integrate the temporal information using recurrent neural networks over a long sequence of images. They are either not accurate, time-consuming in training or error accumulative. In this paper, we propose a method consisting of two camera pose estimators that deal with the information from pairwise images and a short sequence of images respectively. For image sequences, a Transformer-like structure is adopted to build a geometry model over a local temporal window, referred to as Transformer-based Auxiliary Pose Estimator (TAPE). Meanwhile, a Flow-to-Flow Pose Estimator (F2FPE) is proposed to exploit the relationship between pairwise images. The two estimators are constrained through a simple yet effective consistency loss in training. Empirical evaluation has shown that the proposed method outperforms the state-of-the-art unsupervised learning-based methods by a large margin and performs comparably to supervised and traditional ones on the KITTI and Malaga dataset.

CRJan 22, 2020
Nonlinear Blockchain Scalability: a Game-Theoretic Perspective

Lin Chen, Lei Xu, Zhimin Gao et al.

Recent advances in the blockchain research have been made in two important directions. One is refined resilience analysis utilizing game theory to study the consequences of selfish behaviors of users (miners), and the other is the extension from a linear (chain) structure to a non-linear (graphical) structure for performance improvements, such as IOTA and Graphcoin. The first question that comes to people's minds is what improvements that a blockchain system would see by leveraging these new advances. In this paper, we consider three major metrics for a blockchain system: full verification, scalability, and finality-duration. We { establish a formal framework and} prove that no blockchain system can achieve full verification, high scalability, and low finality-duration simultaneously. We observe that classical blockchain systems like Bitcoin achieves full verification and low finality-duration, Harmony and Ethereum 2.0 achieve low finality-duration and high scalability. As a complementary, we design a non-linear blockchain system that achieves full verification and scalability. We also establish, for the first time, the trade-off between scalability and finality-duration.

CRSep 14, 2019
Private and Atomic Exchange of Assets over Zero Knowledge Based Payment Ledger

Zhimin Gao, Lei Xu, Keshav Kasichainula et al.

Bitcoin brings a new type of digital currency that does not rely on a central system to maintain transactions. By benefiting from the concept of decentralized ledger, users who do not know or trust each other can still conduct transactions in a peer-to-peer manner. Inspired by Bitcoin, other cryptocurrencies were invented in recent years such as Ethereum, Dash, Zcash, Monero, Grin, etc. Some of these focus on enhancing privacy for instance crypto note or systems that apply the similar concept of encrypted notes used for transactions to enhance privacy (e.g., Zcash, Monero). However, there are few mechanisms to support the exchange of privacy-enhanced notes or assets on the chain, and at the same time preserving the privacy of the exchange operations. Existing approaches for fair exchanges of assets with privacy mostly rely on off-chain/side-chain, escrow or centralized services. Thus, we propose a solution that supports oblivious and privacy-protected fair exchange of crypto notes or privacy enhanced crypto assets. The technology is demonstrated by extending zero-knowledge based crypto notes. To address "privacy" and "multi-currency", we build a new zero-knowledge proving system and extend note format with new property to represent various types of tokenized assets or cryptocurrencies. By extending the payment protocol, exchange operations are realized through privacy enhanced transactions (e.g., shielded transactions). Based on the possible scenarios during the exchange operation, we add new constraints and conditions to the zero-knowledge proving system used for validating transactions publicly.

CRNov 12, 2018
Efficient Public Blockchain Client for Lightweight Users

Lei Xu, Lin Chen, Zhimin Gao et al.

Public blockchains provide a decentralized method for storing transaction data and have many applications in different sectors. In order for users to track transactions, a simple method is to let them keep a local copy of the entire public ledger. Since the size of the ledger keeps growing, this method becomes increasingly less practical, especially for lightweight users such as IoT devices and smartphones. In order to cope with the problem, several solutions have been proposed to reduce the storage burden. However, existing solutions either achieve a limited storage reduction (e.g., simple payment verification), or rely on some strong security assumption (e.g., the use of trusted server). In this paper, we propose a new approach to solving the problem. Specifically, we propose an \underline{e}fficient verification protocol for \underline{p}ublic \underline{b}lock\underline{c}hains, or EPBC for short. EPBC is particularly suitable for lightweight users, who only need to store a small amount of data that is {\it independent of} the size of the blockchain. We analyze EPBC's performance and security, and discuss its integration with existing public ledger systems. Experimental results confirm that EPBC is practical for lightweight users.

DSNov 7, 2018
Election with Bribed Voter Uncertainty: Hardness and Approximation Algorithm

Lin Chen, Lei Xu, Shouhuai Xu et al.

Bribery in election (or computational social choice in general) is an important problem that has received a considerable amount of attention. In the classic bribery problem, the briber (or attacker) bribes some voters in attempting to make the briber's designated candidate win an election. In this paper, we introduce a novel variant of the bribery problem, "Election with Bribed Voter Uncertainty" or BVU for short, accommodating the uncertainty that the vote of a bribed voter may or may not be counted. This uncertainty occurs either because a bribed voter may not cast its vote in fear of being caught, or because a bribed voter is indeed caught and therefore its vote is discarded. As a first step towards ultimately understanding and addressing this important problem, we show that it does not admit any multiplicative $O(1)$-approximation algorithm modulo standard complexity assumptions. We further show that there is an approximation algorithm that returns a solution with an additive-$ε$ error in FPT time for any fixed $ε$.

CVMay 18, 2018
MDSSD: Multi-scale Deconvolutional Single Shot Detector for Small Objects

Lisha Cui, Rui Ma, Pei Lv et al.

For most of the object detectors based on multi-scale feature maps, the shallow layers are rich in fine spatial information and thus mainly responsible for small object detection. The performance of small object detection, however, is still less than satisfactory because of the deficiency of semantic information on shallow feature maps. In this paper, we design a Multi-scale Deconvolutional Single Shot Detector (MDSSD), especially for small object detection. In MDSSD, multiple high-level feature maps at different scales are upsampled simultaneously to increase the spatial resolution. Afterwards, we implement the skip connections with low-level feature maps via Fusion Block. The fusion feature maps, named Fusion Module, are of strong feature representational power of small instances. It is noteworthy that these high-level feature maps utilized in Fusion Block preserve both strong semantic information and some fine details of small instances, rather than the top-most layer where the representation of fine details for small objects are potentially wiped out. The proposed framework achieves 77.6% mAP for small object detection on the challenging dataset TT100K with 512 x 512 input, outperforming other detectors with a large margin. Moreover, it can also achieve state-of-the-art results for general object detection on PASCAL VOC2007 test and MS COCO test-dev2015, especially achieving 2 to 5 points improvement on small object categories.

CVMar 17, 2018
Depth Pooling Based Large-scale 3D Action Recognition with Convolutional Neural Networks

Pichao Wang, Wanqing Li, Zhimin Gao et al.

This paper proposes three simple, compact yet effective representations of depth sequences, referred to respectively as Dynamic Depth Images (DDI), Dynamic Depth Normal Images (DDNI) and Dynamic Depth Motion Normal Images (DDMNI), for both isolated and continuous action recognition. These dynamic images are constructed from a segmented sequence of depth maps using hierarchical bidirectional rank pooling to effectively capture the spatial-temporal information. Specifically, DDI exploits the dynamics of postures over time and DDNI and DDMNI exploit the 3D structural information captured by depth maps. Upon the proposed representations, a ConvNet based method is developed for action recognition. The image-based representations enable us to fine-tune the existing Convolutional Neural Network (ConvNet) models trained on image data without training a large number of parameters from scratch. The proposed method achieved the state-of-art results on three large datasets, namely, the Large-scale Continuous Gesture Recognition Dataset (means Jaccard index 0.4109), the Large-scale Isolated Gesture Recognition Dataset (59.21%), and the NTU RGB+D Dataset (87.08% cross-subject and 84.22% cross-view) even though only the depth modality was used.

CLDec 23, 2017
Dual Long Short-Term Memory Networks for Sub-Character Representation Learning

Han He, Lei Wu, Xiaokun Yang et al.

Characters have commonly been regarded as the minimal processing unit in Natural Language Processing (NLP). But many non-latin languages have hieroglyphic writing systems, involving a big alphabet with thousands or millions of characters. Each character is composed of even smaller parts, which are often ignored by the previous work. In this paper, we propose a novel architecture employing two stacked Long Short-Term Memory Networks (LSTMs) to learn sub-character level representation and capture deeper level of semantic meanings. To build a concrete study and substantiate the efficiency of our neural architecture, we take Chinese Word Segmentation as a research case example. Among those languages, Chinese is a typical case, for which every character contains several components called radicals. Our networks employ a shared radical level embedding to solve both Simplified and Traditional Chinese Word Segmentation, without extra Traditional to Simplified Chinese conversion, in such a highly end-to-end way the word segmentation can be significantly simplified compared to the previous work. Radical level embeddings can also capture deeper semantic meaning below character level and improve the system performance of learning. By tying radical and character embeddings together, the parameter count is reduced whereas semantic knowledge is shared and transferred between two levels, boosting the performance largely. On 3 out of 4 Bakeoff 2005 datasets, our method surpassed state-of-the-art results by up to 0.4%. Our results are reproducible, source codes and corpora are available on GitHub.

CLDec 7, 2017
Effective Neural Solution for Multi-Criteria Word Segmentation

Han He, Lei Wu, Hua Yan et al.

We present a simple yet elegant solution to train a single joint model on multi-criteria corpora for Chinese Word Segmentation (CWS). Our novel design requires no private layers in model architecture, instead, introduces two artificial tokens at the beginning and ending of input sentence to specify the required target criteria. The rest of the model including Long Short-Term Memory (LSTM) layer and Conditional Random Fields (CRFs) layer remains unchanged and is shared across all datasets, keeping the size of parameter collection minimal and constant. On Bakeoff 2005 and Bakeoff 2008 datasets, our innovative design has surpassed both single-criterion and multi-criteria state-of-the-art learning results. To the best knowledge, our design is the first one that has achieved the latest high performance on such large scale datasets. Source codes and corpora of this paper are available on GitHub.

CVFeb 28, 2017
Scene Flow to Action Map: A New Representation for RGB-D based Action Recognition with Convolutional Neural Networks

Pichao Wang, Wanqing Li, Zhimin Gao et al.

Scene flow describes the motion of 3D objects in real world and potentially could be the basis of a good feature for 3D action recognition. However, its use for action recognition, especially in the context of convolutional neural networks (ConvNets), has not been previously studied. In this paper, we propose the extraction and use of scene flow for action recognition from RGB-D data. Previous works have considered the depth and RGB modalities as separate channels and extract features for later fusion. We take a different approach and consider the modalities as one entity, thus allowing feature extraction for action recognition at the beginning. Two key questions about the use of scene flow for action recognition are addressed: how to organize the scene flow vectors and how to represent the long term dynamics of videos based on scene flow. In order to calculate the scene flow correctly on the available datasets, we propose an effective self-calibration method to align the RGB and depth data spatially without knowledge of the camera parameters. Based on the scene flow vectors, we propose a new representation, namely, Scene Flow to Action Map (SFAM), that describes several long term spatio-temporal dynamics for action recognition. We adopt a channel transform kernel to transform the scene flow vectors to an optimal color space analogous to RGB. This transformation takes better advantage of the trained ConvNets models over ImageNet. Experimental results indicate that this new representation can surpass the performance of state-of-the-art methods on two large public datasets.

CVJan 7, 2017
Large-scale Isolated Gesture Recognition Using Convolutional Neural Networks

Pichao Wang, Wanqing Li, Song Liu et al.

This paper proposes three simple, compact yet effective representations of depth sequences, referred to respectively as Dynamic Depth Images (DDI), Dynamic Depth Normal Images (DDNI) and Dynamic Depth Motion Normal Images (DDMNI). These dynamic images are constructed from a sequence of depth maps using bidirectional rank pooling to effectively capture the spatial-temporal information. Such image-based representations enable us to fine-tune the existing ConvNets models trained on image data for classification of depth sequences, without introducing large parameters to learn. Upon the proposed representations, a convolutional Neural networks (ConvNets) based method is developed for gesture recognition and evaluated on the Large-scale Isolated Gesture Recognition at the ChaLearn Looking at People (LAP) challenge 2016. The method achieved 55.57\% classification accuracy and ranked $2^{nd}$ place in this challenge but was very close to the best performance even though we only used depth data.

CVAug 22, 2016
Large-scale Continuous Gesture Recognition Using Convolutional Neural Networks

Pichao Wang, Wanqing Li, Song Liu et al.

This paper addresses the problem of continuous gesture recognition from sequences of depth maps using convolutional neutral networks (ConvNets). The proposed method first segments individual gestures from a depth sequence based on quantity of movement (QOM). For each segmented gesture, an Improved Depth Motion Map (IDMM), which converts the depth sequence into one image, is constructed and fed to a ConvNet for recognition. The IDMM effectively encodes both spatial and temporal information and allows the fine-tuning with existing ConvNet models for classification without introducing millions of parameters to learn. The proposed method is evaluated on the Large-scale Continuous Gesture Recognition of the ChaLearn Looking at People (LAP) challenge 2016. It achieved the performance of 0.2655 (Mean Jaccard Index) and ranked $3^{rd}$ place in this challenge.

CVApr 10, 2015
HEp-2 Cell Image Classification with Deep Convolutional Neural Networks

Zhimin Gao, Lei Wang, Luping Zhou et al.

Efficient Human Epithelial-2 (HEp-2) cell image classification can facilitate the diagnosis of many autoimmune diseases. This paper presents an automatic framework for this classification task, by utilizing the deep convolutional neural networks (CNNs) which have recently attracted intensive attention in visual recognition. This paper elaborates the important components of this framework, discusses multiple key factors that impact the efficiency of training a deep CNN, and systematically compares this framework with the well-established image classification models in the literature. Experiments on benchmark datasets show that i) the proposed framework can effectively outperform existing models by properly applying data augmentation; ii) our CNN-based framework demonstrates excellent adaptability across different datasets, which is highly desirable for classification under varying laboratory settings. Our system is ranked high in the cell image classification competition hosted by ICPR 2014.

CVJan 20, 2015
Deep Convolutional Neural Networks for Action Recognition Using Depth Map Sequences

Pichao Wang, Wanqing Li, Zhimin Gao et al.

Recently, deep learning approach has achieved promising results in various fields of computer vision. In this paper, a new framework called Hierarchical Depth Motion Maps (HDMM) + 3 Channel Deep Convolutional Neural Networks (3ConvNets) is proposed for human action recognition using depth map sequences. Firstly, we rotate the original depth data in 3D pointclouds to mimic the rotation of cameras, so that our algorithms can handle view variant cases. Secondly, in order to effectively extract the body shape and motion information, we generate weighted depth motion maps (DMM) at several temporal scales, referred to as Hierarchical Depth Motion Maps (HDMM). Then, three channels of ConvNets are trained on the HDMMs from three projected orthogonal planes separately. The proposed algorithms are evaluated on MSRAction3D, MSRAction3DExt, UTKinect-Action and MSRDailyActivity3D datasets respectively. We also combine the last three datasets into a larger one (called Combined Dataset) and test the proposed method on it. The results show that our approach can achieve state-of-the-art results on the individual datasets and without dramatical performance degradation on the Combined Dataset.

CVSep 14, 2014
Mining Mid-level Features for Action Recognition Based on Effective Skeleton Representation

Pichao Wang, Wanqing Li, Philip Ogunbona et al.

Recently, mid-level features have shown promising performance in computer vision. Mid-level features learned by incorporating class-level information are potentially more discriminative than traditional low-level local features. In this paper, an effective method is proposed to extract mid-level features from Kinect skeletons for 3D human action recognition. Firstly, the orientations of limbs connected by two skeleton joints are computed and each orientation is encoded into one of the 27 states indicating the spatial relationship of the joints. Secondly, limbs are combined into parts and the limb's states are mapped into part states. Finally, frequent pattern mining is employed to mine the most frequent and relevant (discriminative, representative and non-redundant) states of parts in continuous several frames. These parts are referred to as Frequent Local Parts or FLPs. The FLPs allow us to build powerful bag-of-FLP-based action representation. This new representation yields state-of-the-art results on MSR DailyActivity3D and MSR ActionPairs3D.