Junjie Huang

h-index25

30papers

5,341citations

Novelty47%

AI Score48

Ranked #28,010 of 194,257 authors (top 14%)#10,081 in CV (top 17%)

30 Papers

4.1LGNov 13, 2025

Enhancing Kernel Power K-means: Scalable and Robust Clustering with Random Fourier Features and Possibilistic Method

Yixi Chen, Weixuan Liang, Tianrui Liu et al.

Kernel power $k$-means (KPKM) leverages a family of means to mitigate local minima issues in kernel $k$-means. However, KPKM faces two key limitations: (1) the computational burden of the full kernel matrix restricts its use on extensive data, and (2) the lack of authentic centroid-sample assignment learning reduces its noise robustness. To overcome these challenges, we propose RFF-KPKM, introducing the first approximation theory for applying random Fourier features (RFF) to KPKM. RFF-KPKM employs RFF to generate efficient, low-dimensional feature maps, bypassing the need for the whole kernel matrix. Crucially, we are the first to establish strong theoretical guarantees for this combination: (1) an excess risk bound of $\mathcal{O}(\sqrt{k^3/n})$, (2) strong consistency with membership values, and (3) a $(1+\varepsilon)$ relative error bound achievable using the RFF of dimension $\mathrm{poly}(\varepsilon^{-1}\log k)$. Furthermore, to improve robustness and the ability to learn multiple kernels, we propose IP-RFF-MKPKM, an improved possibilistic RFF-based multiple kernel power $k$-means. IP-RFF-MKPKM ensures the scalability of MKPKM via RFF and refines cluster assignments by combining the merits of the possibilistic membership and fuzzy membership. Experiments on large-scale datasets demonstrate the superior efficiency and clustering accuracy of the proposed methods compared to the state-of-the-art alternatives.

1.4CVSep 10, 2021Code

Face-NMS: A Core-set Selection Approach for Efficient Face Recognition

Yunze Chen, Junjie Huang, Jiagang Zhu et al.

Recently, face recognition in the wild has achieved remarkable success and one key engine is the increasing size of training data. For example, the largest face dataset, WebFace42M contains about 2 million identities and 42 million faces. However, a massive number of faces raise the constraints in training time, computing resources, and memory cost. The current research on this problem mainly focuses on designing an efficient Fully-connected layer (FC) to reduce GPU memory consumption caused by a large number of identities. In this work, we relax these constraints by resolving the redundancy problem of the up-to-date face datasets caused by the greedily collecting operation (i.e. the core-set selection perspective). As the first attempt in this perspective on the face recognition problem, we find that existing methods are limited in both performance and efficiency. For superior cost-efficiency, we contribute a novel filtering strategy dubbed Face-NMS. Face-NMS works on feature space and simultaneously considers the local and global sparsity in generating core sets. In practice, Face-NMS is analogous to Non-Maximum Suppression (NMS) in the object detection community. It ranks the faces by their potential contribution to the overall sparsity and filters out the superfluous face in the pairs with high similarity for local sparsity. With respect to the efficiency aspect, Face-NMS accelerates the whole pipeline by applying a smaller but sufficient proxy dataset in training the proxy model. As a result, with Face-NMS, we successfully scale down the WebFace42M dataset to 60% while retaining its performance on the main benchmarks, offering a 40% resource-saving and 1.64 times acceleration. The code is publicly available for reference at https://github.com/HuangJunJie2017/Face-NMS.

23.1CVNov 18, 2019Code

The Devil is in the Details: Delving into Unbiased Data Processing for Human Pose Estimation

Junjie Huang, Zheng Zhu, Feng Guo et al.

Being a fundamental component in training and inference, data processing has not been systematically considered in human pose estimation community, to the best of our knowledge. In this paper, we focus on this problem and find that the devil of human pose estimation evolution is in the biased data processing. Specifically, by investigating the standard data processing in state-of-the-art approaches mainly including coordinate system transformation and keypoint format transformation (i.e., encoding and decoding), we find that the results obtained by common flipping strategy are unaligned with the original ones in inference. Moreover, there is a statistical error in some keypoint format transformation methods. Two problems couple together, significantly degrade the pose estimation performance and thus lay a trap for the research community. This trap has given bone to many suboptimal remedies, which are always unreported, confusing but influential. By causing failure in reproduction and unfair in comparison, the unreported remedies seriously impedes the technological development. To tackle this dilemma from the source, we propose Unbiased Data Processing (UDP) consist of two technique aspect for the two aforementioned problems respectively (i.e., unbiased coordinate system transformation and unbiased keypoint format transformation). As a model-agnostic approach and a superior solution, UDP successfully pushes the performance boundary of human pose estimation and offers a higher and more reliable baseline for research community. Code is public available in https://github.com/HuangJunJie2017/UDP-Pose

31.2CLJul 10, 2019Code

Modeling Semantic Compositionality with Sememe Knowledge

Fanchao Qi, Junjie Huang, Chenghao Yang et al.

Semantic compositionality (SC) refers to the phenomenon that the meaning of a complex linguistic unit can be composed of the meanings of its constituents. Most related works focus on using complicated compositionality functions to model SC while few works consider external knowledge in models. In this paper, we verify the effectiveness of sememes, the minimum semantic units of human languages, in modeling SC by a confirmatory experiment. Furthermore, we make the first attempt to incorporate sememe knowledge into SC models, and employ the sememeincorporated models in learning representations of multiword expressions, a typical task of SC. In experiments, we implement our models by incorporating knowledge from a famous sememe knowledge base HowNet and perform both intrinsic and extrinsic evaluations. Experimental results show that our models achieve significant performance boost as compared to the baseline methods without considering sememe knowledge. We further conduct quantitative analysis and case studies to demonstrate the effectiveness of applying sememe knowledge in modeling SC. All the code and data of this paper can be obtained on https://github.com/thunlp/Sememe-SC.

0.3CLJun 1, 2019Code

COS960: A Chinese Word Similarity Dataset of 960 Word Pairs

Junjie Huang, Fanchao Qi, Chenghao Yang et al.

Word similarity computation is a widely recognized task in the field of lexical semantics. Most proposed tasks test on similarity of word pairs of single morpheme, while few works focus on words of two morphemes or more morphemes. In this work, we propose COS960, a benchmark dataset with 960 pairs of Chinese wOrd Similarity, where all the words have two morphemes in three Part of Speech (POS) tags with their human annotated similarity rather than relatedness. We give a detailed description of dataset construction and annotation process, and test on a range of word embedding models. The dataset of this paper can be obtained from https://github.com/thunlp/COS960.

16.7IVMar 3, 2025

A Lightweight Deep Exclusion Unfolding Network for Single Image Reflection Removal

Jun-Jie Huang, Tianrui Liu, Zihan Chen et al.

Single Image Reflection Removal (SIRR) is a canonical blind source separation problem and refers to the issue of separating a reflection-contaminated image into a transmission and a reflection image. The core challenge lies in minimizing the commonalities among different sources. Existing deep learning approaches either neglect the significance of feature interactions or rely on heuristically designed architectures. In this paper, we propose a novel Deep Exclusion unfolding Network (DExNet), a lightweight, interpretable, and effective network architecture for SIRR. DExNet is principally constructed by unfolding and parameterizing a simple iterative Sparse and Auxiliary Feature Update (i-SAFU) algorithm, which is specifically designed to solve a new model-based SIRR optimization formulation incorporating a general exclusion prior. This general exclusion prior enables the unfolded SAFU module to inherently identify and penalize commonalities between the transmission and reflection features, ensuring more accurate separation. The principled design of DExNet not only enhances its interpretability but also significantly improves its performance. Comprehensive experiments on four benchmark datasets demonstrate that DExNet achieves state-of-the-art visual and quantitative results while utilizing only approximately 8\% of the parameters required by leading methods.

9.4LGOct 10, 2025

Agentic-KGR: Co-evolutionary Knowledge Graph Construction through Multi-Agent Reinforcement Learning

Jing Li, Zhijie Sun, Zhicheng Zhou et al.

Current knowledge-enhanced large language models (LLMs) rely on static, pre-constructed knowledge bases that suffer from coverage gaps and temporal obsolescence, limiting their effectiveness in dynamic information environments. We present Agentic-KGR, a novel framework enabling co-evolution between LLMs and knowledge graphs (KGs) through multi-round reinforcement learning (RL). Our approach introduces three key innovations: (1) a dynamic schema expansion mechanism that systematically extends graph ontologies beyond pre-defined boundaries during training; (2) a retrieval-augmented memory system enabling synergistic co-evolution between model parameters and knowledge structures through continuous optimization; (3) a learnable multi-scale prompt compression approach that preserves critical information while reducing computational complexity through adaptive sequence optimization. Experimental results demonstrate substantial improvements over supervised baselines and single-round RL approaches in knowledge extraction tasks. When integrated with GraphRAG, our method achieves superior performance in downstream QA tasks, with significant gains in both accuracy and knowledge coverage compared to existing methods.

3.7CVJan 23, 2022

Mixed X-Ray Image Separation for Artworks with Concealed Designs

Wei Pu, Jun-Jie Huang, Barak Sober et al.

In this paper, we focus on X-ray images of paintings with concealed sub-surface designs (e.g., deriving from reuse of the painting support or revision of a composition by the artist), which include contributions from both the surface painting and the concealed features. In particular, we propose a self-supervised deep learning-based image separation approach that can be applied to the X-ray images from such paintings to separate them into two hypothetical X-ray images. One of these reconstructed images is related to the X-ray image of the concealed painting, while the second one contains only information related to the X-ray of the visible painting. The proposed separation network consists of two components: the analysis and the synthesis sub-networks. The analysis sub-network is based on learned coupled iterative shrinkage thresholding algorithms (LCISTA) designed using algorithm unrolling techniques, and the synthesis sub-network consists of several linear mappings. The learning algorithm operates in a totally self-supervised fashion without requiring a sample set that contains both the mixed X-ray images and the separated ones. The proposed method is demonstrated on a real painting with concealed content, Doña Isabel de Porcel by Francisco de Goya, to show its effectiveness.

21.8LGAug 30, 2021Code

Single Node Injection Attack against Graph Neural Networks

Shuchang Tao, Qi Cao, Huawei Shen et al.

Node injection attack on Graph Neural Networks (GNNs) is an emerging and practical attack scenario that the attacker injects malicious nodes rather than modifying original nodes or edges to affect the performance of GNNs. However, existing node injection attacks ignore extremely limited scenarios, namely the injected nodes might be excessive such that they may be perceptible to the target GNN. In this paper, we focus on an extremely limited scenario of single node injection evasion attack, i.e., the attacker is only allowed to inject one single node during the test phase to hurt GNN's performance. The discreteness of network structure and the coupling effect between network structure and node features bring great challenges to this extremely limited scenario. We first propose an optimization-based method to explore the performance upper bound of single node injection evasion attack. Experimental results show that 100%, 98.60%, and 94.98% nodes on three public datasets are successfully attacked even when only injecting one node with one edge, confirming the feasibility of single node injection evasion attack. However, such an optimization-based method needs to be re-optimized for each attack, which is computationally unbearable. To solve the dilemma, we further propose a Generalizable Node Injection Attack model, namely G-NIA, to improve the attack efficiency while ensuring the attack performance. Experiments are conducted across three well-known GNNs. Our proposed G-NIA significantly outperforms state-of-the-art baselines and is 500 times faster than the optimization-based method when inferring.

10.3SIAug 22, 2021Code

Signed Bipartite Graph Neural Networks

Junjie Huang, Huawei Shen, Qi Cao et al.

Signed networks are such social networks having both positive and negative links. A lot of theories and algorithms have been developed to model such networks (e.g., balance theory). However, previous work mainly focuses on the unipartite signed networks where the nodes have the same type. Signed bipartite networks are different from classical signed networks, which contain two different node sets and signed links between two node sets. Signed bipartite networks can be commonly found in many fields including business, politics, and academics, but have been less studied. In this work, we firstly define the signed relationship of the same set of nodes and provide a new perspective for analyzing signed bipartite networks. Then we do some comprehensive analysis of balance theory from two perspectives on several real-world datasets. Specifically, in the peer review dataset, we find that the ratio of balanced isomorphism in signed bipartite networks increased after rebuttal phases. Guided by these two perspectives, we propose a novel Signed Bipartite Graph Neural Networks (SBGNNs) to learn node embeddings for signed bipartite networks. SBGNNs follow most GNNs message-passing scheme, but we design new message functions, aggregation functions, and update functions for signed bipartite networks. We validate the effectiveness of our model on four real-world datasets on Link Sign Prediction task, which is the main machine learning task for signed networks. Experimental results show that our SBGNN model achieves significant improvement compared with strong baseline methods, including feature-based methods and network embedding methods.

11.6CVAug 16, 2021

Masked Face Recognition Challenge: The WebFace260M Track Report

Zheng Zhu, Guan Huang, Jiankang Deng et al.

According to WHO statistics, there are more than 204,617,027 confirmed COVID-19 cases including 4,323,247 deaths worldwide till August 12, 2021. During the coronavirus epidemic, almost everyone wears a facial mask. Traditionally, face recognition approaches process mostly non-occluded faces, which include primary facial features such as the eyes, nose, and mouth. Removing the mask for authentication in airports or laboratories will increase the risk of virus infection, posing a huge challenge to current face recognition systems. Due to the sudden outbreak of the epidemic, there are yet no publicly available real-world masked face recognition (MFR) benchmark. To cope with the above-mentioned issue, we organize the Face Bio-metrics under COVID Workshop and Masked Face Recognition Challenge in ICCV 2021. Enabled by the ultra-large-scale WebFace260M benchmark and the Face Recognition Under Inference Time conStraint (FRUITS) protocol, this challenge (WebFace260M Track) aims to push the frontiers of practical MFR. Since public evaluation sets are mostly saturated or contain noise, a new test set is gathered consisting of elaborated 2,478 celebrities and 60,926 faces. Meanwhile, we collect the world-largest real-world masked test set. In the first phase of WebFace260M Track, 69 teams (total 833 solutions) participate in the challenge and 49 teams exceed the performance of our baseline. There are second phase of the challenge till October 1, 2021 and on-going leaderboard. We will actively update this report in the future.

32.2CLMay 27, 2021Code

CoSQA: 20,000+ Web Queries for Code Search and Question Answering

Junjie Huang, Duyu Tang, Linjun Shou et al.

Finding codes given natural language query isb eneficial to the productivity of software developers. Future progress towards better semantic matching between query and code requires richer supervised training resources. To remedy this, we introduce the CoSQA dataset.It includes 20,604 labels for pairs of natural language queries and codes, each annotated by at least 3 human annotators. We further introduce a contrastive learning method dubbed CoCLR to enhance query-code matching, which works as a data augmenter to bring more artificially generated training instances. We show that evaluated on CodeXGLUE with the same CodeBERT model, training on CoSQA improves the accuracy of code question answering by 5.1%, and incorporating CoCLR brings a further improvement of 10.5%.

4.7CVApr 6, 2021

SIMPLE: SIngle-network with Mimicking and Point Learning for Bottom-up Human Pose Estimation

Jiabin Zhang, Zheng Zhu, Jiwen Lu et al.

The practical application requests both accuracy and efficiency on multi-person pose estimation algorithms. But the high accuracy and fast inference speed are dominated by top-down methods and bottom-up methods respectively. To make a better trade-off between accuracy and efficiency, we propose a novel multi-person pose estimation framework, SIngle-network with Mimicking and Point Learning for Bottom-up Human Pose Estimation (SIMPLE). Specifically, in the training process, we enable SIMPLE to mimic the pose knowledge from the high-performance top-down pipeline, which significantly promotes SIMPLE's accuracy while maintaining its high efficiency during inference. Besides, SIMPLE formulates human detection and pose estimation as a unified point learning framework to complement each other in single-network. This is quite different from previous works where the two tasks may interfere with each other. To the best of our knowledge, both mimicking strategy between different method types and unified point learning are firstly proposed in pose estimation. In experiments, our approach achieves the new state-of-the-art performance among bottom-up methods on the COCO, MPII and PoseTrack datasets. Compared with the top-down approaches, SIMPLE has comparable accuracy and faster inference speed.

31.5CLApr 5, 2021Code

WhiteningBERT: An Easy Unsupervised Sentence Embedding Approach

Junjie Huang, Duyu Tang, Wanjun Zhong et al.

Producing the embedding of a sentence in an unsupervised way is valuable to natural language matching and retrieval problems in practice. In this work, we conduct a thorough examination of pretrained model based unsupervised sentence embeddings. We study on four pretrained models and conduct massive experiments on seven datasets regarding sentence semantics. We have there main findings. First, averaging all tokens is better than only using [CLS] vector. Second, combining both top andbottom layers is better than only using top layers. Lastly, an easy whitening-based vector normalization strategy with less than 10 lines of code consistently boosts the performance.

33.1CVMar 6, 2021

WebFace260M: A Benchmark Unveiling the Power of Million-Scale Deep Face Recognition

Zheng Zhu, Guan Huang, Jiankang Deng et al.

In this paper, we contribute a new million-scale face benchmark containing noisy 4M identities/260M faces (WebFace260M) and cleaned 2M identities/42M faces (WebFace42M) training data, as well as an elaborately designed time-constrained evaluation protocol. Firstly, we collect 4M name list and download 260M faces from the Internet. Then, a Cleaning Automatically utilizing Self-Training (CAST) pipeline is devised to purify the tremendous WebFace260M, which is efficient and scalable. To the best of our knowledge, the cleaned WebFace42M is the largest public face recognition training set and we expect to close the data gap between academia and industry. Referring to practical scenarios, Face Recognition Under Inference Time conStraint (FRUITS) protocol and a test set are constructed to comprehensively evaluate face matchers. Equipped with this benchmark, we delve into million-scale face recognition problems. A distributed framework is developed to train face recognition models efficiently without tampering with the performance. Empowered by WebFace42M, we reduce relative 40% failure rate on the challenging IJB-C set, and ranks the 3rd among 430 entries on NIST-FRVT. Even 10% data (WebFace4M) shows superior performance compared with public training set. Furthermore, comprehensive baselines are established on our rich-attribute test set under FRUITS-100ms/500ms/1000ms protocol, including MobileNet, EfficientNet, AttentionNet, ResNet, SENet, ResNeXt and RegNet families. Benchmark website is https://www.face-benchmark.org.

60.2SEFeb 9, 2021Code

CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation

Shuai Lu, Daya Guo, Shuo Ren et al.

Benchmark datasets have a significant impact on accelerating research in programming language tasks. In this paper, we introduce CodeXGLUE, a benchmark dataset to foster machine learning research for program understanding and generation. CodeXGLUE includes a collection of 10 tasks across 14 datasets and a platform for model evaluation and comparison. CodeXGLUE also features three baseline systems, including the BERT-style, GPT-style, and Encoder-Decoder models, to make it easy for researchers to use the platform. The availability of such data and baselines can help the development and validation of new methods that can be applied to various program understanding and generation problems.

14.9SIJan 7, 2021Code

SDGNN: Learning Node Representation for Signed Directed Networks

Junjie Huang, Huawei Shen, Liang Hou et al.

Network embedding is aimed at mapping nodes in a network into low-dimensional vector representations. Graph Neural Networks (GNNs) have received widespread attention and lead to state-of-the-art performance in learning node representations. However, most GNNs only work in unsigned networks, where only positive links exist. It is not trivial to transfer these models to signed directed networks, which are widely observed in the real world yet less studied. In this paper, we first review two fundamental sociological theories (i.e., status theory and balance theory) and conduct empirical studies on real-world datasets to analyze the social mechanism in signed directed networks. Guided by related sociological theories, we propose a novel Signed Directed Graph Neural Networks model named SDGNN to learn node embeddings for signed directed networks. The proposed model simultaneously reconstructs link signs, link directions, and signed directed triangles. We validate our model's effectiveness on five real-world datasets, which are commonly used as the benchmark for signed network embedding. Experiments demonstrate the proposed model outperforms existing models, including feature-based methods, network embedding methods, and several GNN methods.

7.2CVAug 17, 2020Code

AID: Pushing the Performance Boundary of Human Pose Estimation with Information Dropping Augmentation

Junjie Huang, Zheng Zhu, Guan Huang et al.

Both appearance cue and constraint cue are vital for human pose estimation. However, there is a tendency in most existing works to overfitting the former and overlook the latter. In this paper, we propose Augmentation by Information Dropping (AID) to verify and tackle this dilemma. Alone with AID as a prerequisite for effectively exploiting its potential, we propose customized training schedules, which are designed by analyzing the pattern of loss and performance in training process from the perspective of information supplying. In experiments, as a model-agnostic approach, AID promotes various state-of-the-art methods in both bottom-up and top-down paradigms with different input sizes, frameworks, backbones, training and testing sets. On popular COCO human pose estimation test set, AID consistently boosts the performance of different configurations by around 0.6 AP in top-down paradigm and up to 1.5 AP in bottom-up paradigm. On more challenging CrowdPose dataset, the improvement is more than 1.5 AP. As AID successfully pushes the performance boundary of human pose estimation problem by considerable margin and sets a new state-of-the-art, we hope AID to be a regular configuration for training human pose estimators. The source code will be publicly available for further research.

11.1LGJul 27, 2020

Label-Consistency based Graph Neural Networks for Semi-supervised Node Classification

Bingbing Xu, Junjie Huang, Liang Hou et al.

Graph neural networks (GNNs) achieve remarkable success in graph-based semi-supervised node classification, leveraging the information from neighboring nodes to improve the representation learning of target node. The success of GNNs at node classification depends on the assumption that connected nodes tend to have the same label. However, such an assumption does not always work, limiting the performance of GNNs at node classification. In this paper, we propose label-consistency based graph neural network(LC-GNN), leveraging node pairs unconnected but with the same labels to enlarge the receptive field of nodes in GNNs. Experiments on benchmark datasets demonstrate the proposed LC-GNN outperforms traditional GNNs in graph-based semi-supervised node classification.We further show the superiority of LC-GNN in sparse scenarios with only a handful of labeled nodes.

1.4CLApr 6, 2020

Learning to Recover Reasoning Chains for Multi-Hop Question Answering via Cooperative Games

Yufei Feng, Mo Yu, Wenhan Xiong et al.

We propose the new problem of learning to recover reasoning chains from weakly supervised signals, i.e., the question-answer pairs. We propose a cooperative game approach to deal with this problem, in which how the evidence passages are selected and how the selected passages are connected are handled by two models that cooperate to select the most confident chains from a large set of candidates (from distant supervision). For evaluation, we created benchmarks based on two multi-hop QA datasets, HotpotQA and MedHop; and hand-labeled reasoning chains for the latter. The experimental results demonstrate the effectiveness of our proposed approach.

1.4MLJan 31, 2020

Learning Deep Analysis Dictionaries -- Part II: Convolutional Dictionaries

Jun-Jie Huang, Pier Luigi Dragotti

In this paper, we introduce a Deep Convolutional Analysis Dictionary Model (DeepCAM) by learning convolutional dictionaries instead of unstructured dictionaries as in the case of deep analysis dictionary model introduced in the companion paper. Convolutional dictionaries are more suitable for processing high-dimensional signals like for example images and have only a small number of free parameters. By exploiting the properties of a convolutional dictionary, we present an efficient convolutional analysis dictionary learning approach. A L-layer DeepCAM consists of L layers of convolutional analysis dictionary and element-wise soft-thresholding pairs and a single layer of convolutional synthesis dictionary. Similar to DeepAM, each convolutional analysis dictionary is composed of a convolutional Information Preserving Analysis Dictionary (IPAD) and a convolutional Clustering Analysis Dictionary (CAD). The IPAD and the CAD are learned using variations of the proposed learning algorithm. We demonstrate that DeepCAM is an effective multilayer convolutional model and, on single image super-resolution, achieves performance comparable with other methods while also showing good generalization capabilities.

3.8MLJan 31, 2020

Learning Deep Analysis Dictionaries for Image Super-Resolution

Jun-Jie Huang, Pier Luigi Dragotti

Inspired by the recent success of deep neural networks and the recent efforts to develop multi-layer dictionary models, we propose a Deep Analysis dictionary Model (DeepAM) which is optimized to address a specific regression task known as single image super-resolution. Contrary to other multi-layer dictionary models, our architecture contains L layers of analysis dictionary and soft-thresholding operators to gradually extract high-level features and a layer of synthesis dictionary which is designed to optimize the regression task at hand. In our approach, each analysis dictionary is partitioned into two sub-dictionaries: an Information Preserving Analysis Dictionary (IPAD) and a Clustering Analysis Dictionary (CAD). The IPAD together with the corresponding soft-thresholds is designed to pass the key information from the previous layer to the next layer, while the CAD together with the corresponding soft-thresholding operator is designed to produce a sparse feature representation of its input data that facilitates discrimination of key features. DeepAM uses both supervised and unsupervised setup. Simulation results show that the proposed deep analysis dictionary model achieves better performance compared to a deep neural network that has the same structure and is optimized using back-propagation when training datasets are small.

6.0CVDec 18, 2019

Coupled Network for Robust Pedestrian Detection with Gated Multi-Layer Feature Extraction and Deformable Occlusion Handling

Tianrui Liu, Wenhan Luo, Lin Ma et al.

Pedestrian detection methods have been significantly improved with the development of deep convolutional neural networks. Nevertheless, detecting small-scaled pedestrians and occluded pedestrians remains a challenging problem. In this paper, we propose a pedestrian detection method with a couple-network to simultaneously address these two issues. One of the sub-networks, the gated multi-layer feature extraction sub-network, aims to adaptively generate discriminative features for pedestrian candidates in order to robustly detect pedestrians with large variations on scales. The second sub-network targets in handling the occlusion problem of pedestrian detection by using deformable regional RoI-pooling. We investigate two different gate units for the gated sub-network, namely, the channel-wise gate unit and the spatio-wise gate unit, which can enhance the representation ability of the regional convolutional features among the channel dimensions or across the spatial domain, repetitively. Ablation studies have validated the effectiveness of both the proposed gated multi-layer feature extraction sub-network and the deformable occlusion handling sub-network. With the coupled framework, our proposed pedestrian detector achieves state-of-the-art results on the Caltech and the CityPersons pedestrian detection benchmarks.

2.6CVOct 25, 2019

Gated Multi-layer Convolutional Feature Extraction Network for Robust Pedestrian Detection

Tianrui Liu, Jun-Jie Huang, Tianhong Dai et al.

Pedestrian detection methods have been significantly improved with the development of deep convolutional neural networks. Nevertheless, robustly detecting pedestrians with a large variant on sizes and with occlusions remains a challenging problem. In this paper, we propose a gated multi-layer convolutional feature extraction method which can adaptively generate discriminative features for candidate pedestrian regions. The proposed gated feature extraction framework consists of squeeze units, gate units and a concatenation layer which perform feature dimension squeezing, feature elements manipulation and convolutional features combination from multiple CNN layers, respectively. We proposed two different gate models which can manipulate the regional feature maps in a channel-wise selection manner and a spatial-wise selection manner, respectively. Experiments on the challenging CityPersons dataset demonstrate the effectiveness of the proposed method, especially on detecting those small-size and occluded pedestrians.

2.6CVOct 14, 2019

Multi-Stage HRNet: Multiple Stage High-Resolution Network for Human Pose Estimation

Junjie Huang, Zheng Zhu, Guan Huang

Human pose estimation are of importance for visual understanding tasks such as action recognition and human-computer interaction. In this work, we present a Multiple Stage High-Resolution Network (Multi-Stage HRNet) to tackling the problem of multi-person pose estimation in images. Specifically, we follow the top-down pipelines and high-resolution representations are maintained during single-person pose estimation. In addition, multiple stage network and cross stage feature aggregation are adopted to further refine the keypoint position. The resulting approach achieves promising results in COCO datasets. Our single-model-single-scale test configuration obtains 77.1 AP score in test-dev using publicly available training data.

20.4SIJun 26, 2019Code

Signed Graph Attention Networks

Junjie Huang, Huawei Shen, Liang Hou et al.

Graph or network data is ubiquitous in the real world, including social networks, information networks, traffic networks, biological networks and various technical networks. The non-Euclidean nature of graph data poses the challenge for modeling and analyzing graph data. Recently, Graph Neural Network (GNNs) are proposed as a general and powerful framework to handle tasks on graph data, e.g., node embedding, link prediction and node classification. As a representative implementation of GNNs, Graph Attention Networks (GATs) are successfully applied in a variety of tasks on real datasets. However, GAT is designed to networks with only positive links and fails to handle signed networks which contain both positive and negative links. In this paper, we propose Signed Graph Attention Networks (SiGATs), generalizing GAT to signed networks. SiGAT incorporates graph motifs into GAT to capture two well-known theories in signed network research, i.e., balance theory and status theory. In SiGAT, motifs offer us the flexible structural pattern to aggregate and propagate messages on the signed network to generate node embeddings. We evaluate the proposed SiGAT method by applying it to the signed link prediction task. Experimental results on three real datasets demonstrate that SiGAT outperforms feature-based method, network embedding method and state-of-the-art GNN-based methods like signed graph convolutional network (SGCN).

9.1CVDec 14, 2018

Action Machine: Rethinking Action Recognition in Trimmed Videos

Jiagang Zhu, Wei Zou, Liang Xu et al.

Existing methods in video action recognition mostly do not distinguish human body from the environment and easily overfit the scenes and objects. In this work, we present a conceptually simple, general and high-performance framework for action recognition in trimmed videos, aiming at person-centric modeling. The method, called Action Machine, takes as inputs the videos cropped by person bounding boxes. It extends the Inflated 3D ConvNet (I3D) by adding a branch for human pose estimation and a 2D CNN for pose-based action recognition, being fast to train and test. Action Machine can benefit from the multi-task training of action recognition and pose estimation, the fusion of predictions from RGB images and poses. On NTU RGB-D, Action Machine achieves the state-of-the-art performance with top-1 accuracies of 97.2% and 94.3% on cross-view and cross-subject respectively. Action Machine also achieves competitive performance on another three smaller action recognition datasets: Northwestern UCLA Multiview Action3D, MSR Daily Activity3D and UTD-MHAD. Code will be made available.

2.5CVNov 18, 2018

Optical Flow Based Online Moving Foreground Analysis

Junjie Huang, Wei Zou, Zheng Zhu et al.

Obtained by moving object detection, the foreground mask result is unshaped and can not be directly used in most subsequent processes. In this paper, we focus on this problem and address it by constructing an optical flow based moving foreground analysis framework. During the processing procedure, the foreground masks are analyzed and segmented through two complementary clustering algorithms. As a result, we obtain the instance-level information like the number, location and size of moving objects. The experimental result show that our method adapts itself to the problem and performs well enough for practical applications.

9.9CVNov 18, 2018

An Efficient Optical Flow Based Motion Detection Method for Non-stationary Scenes

Junjie Huang, Wei Zou, Zheng Zhu et al.

Real-time motion detection in non-stationary scenes is a difficult task due to dynamic background, changing foreground appearance and limited computational resource. These challenges degrade the performance of the existing methods in practical applications. In this paper, an optical flow based framework is proposed to address this problem. By applying a novel strategy to utilize optical flow, we enable our method being free of model constructing, training or updating and can be performed efficiently. Besides, a dual judgment mechanism with adaptive intervals and adaptive thresholds is designed to heighten the system's adaptation to different situations. In experiment part, we quantitatively and qualitatively validate the effectiveness and feasibility of our method with videos in various scene conditions. The experimental results show that our method adapts itself to different situations and outperforms the state-of-the-art real-time methods, indicating the advantages of our optical flow based method.

6.8CVJul 13, 2018

Optical Flow Based Real-time Moving Object Detection in Unconstrained Scenes

Junjie Huang, Wei Zou, Jiagang Zhu et al.

Real-time moving object detection in unconstrained scenes is a difficult task due to dynamic background, changing foreground appearance and limited computational resource. In this paper, an optical flow based moving object detection framework is proposed to address this problem. We utilize homography matrixes to online construct a background model in the form of optical flow. When judging out moving foregrounds from scenes, a dual-mode judge mechanism is designed to heighten the system's adaptation to challenging situations. In experiment part, two evaluation metrics are redefined for more properly reflecting the performance of methods. We quantitatively and qualitatively validate the effectiveness and feasibility of our method with videos in various scene conditions. The experimental results show that our method adapts itself to different situations and outperforms the state-of-the-art methods, indicating the advantages of optical flow based methods.