AS Apr 7, 2020
Learning to fool the speaker recognition
Jiguo Li, Xinfeng Zhang, Jizheng Xu et al.
Due to the widespread deployment of fingerprint, face, and speaker recognition systems, attacks on deep learning based biometric systems have drawn increasing attention. Previous research has mainly studied attacks on vision-based systems, such as fingerprint and face recognition. Attacks on speaker recognition, by contrast, have not yet been investigated, even though speaker recognition is widely used in daily life. In this paper, we attempt to fool a state-of-the-art speaker recognition model and present the speaker recognition attacker, a lightweight model that fools a deep speaker recognition model by adding imperceptible perturbations to the raw speech waveform. We find that the speaker recognition system is also vulnerable to such attacks, and we achieve a high success rate on the non-targeted attack. We also present an effective method for optimizing the speaker recognition attacker to trade off the attack success rate against perceptual quality. Experiments on the TIMIT dataset show that we can achieve a sentence error rate of $99.2\%$ with an average SNR of $57.2\text{dB}$ and a PESQ of 4.2, at a speed faster than real time.
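The SNR figure quoted above measures how small the added perturbation is relative to the clean waveform. The following is a minimal sketch of that measurement, assuming the usual definition (signal power over perturbation power, in dB); the function name `snr_db` and the toy 440 Hz signal are illustrative and not taken from the paper.

```python
import math

def snr_db(clean, perturbed):
    """Return the SNR in dB of `clean` relative to the perturbation (perturbed - clean)."""
    signal_power = sum(x * x for x in clean)
    noise_power = sum((y - x) ** 2 for x, y in zip(clean, perturbed))
    if noise_power == 0:
        return math.inf  # no perturbation at all
    return 10.0 * math.log10(signal_power / noise_power)

# Toy example (assumption): one second of a 440 Hz tone sampled at 16 kHz,
# perturbed by a tiny alternating-sign offset of amplitude 0.001.
clean = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
perturbed = [x + 0.001 * (-1) ** n for n, x in enumerate(clean)]
print(round(snr_db(clean, perturbed), 1))
```

A higher value means a quieter perturbation, so an attack that keeps the SNR high while still changing the model's prediction is harder to hear.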
AS Apr 7, 2020
Universal Adversarial Perturbations Generative Network for Speaker Recognition
Jiguo Li, Xinfeng Zhang, Chuanmin Jia et al.
Attacking deep learning based biometric systems has drawn increasing attention with the wide deployment of fingerprint, face, and speaker recognition systems, given that neural networks are vulnerable to adversarial examples: inputs that have been intentionally perturbed in ways that remain almost imperceptible to humans. In this paper, we demonstrate the existence of universal adversarial perturbations (UAPs) for speaker recognition systems. We propose a generative network that learns a mapping from a low-dimensional normal distribution to the UAP subspace and then synthesizes UAPs that perturb any input signal to spoof a well-trained speaker recognition model with high probability. Experimental results on the TIMIT and LibriSpeech datasets demonstrate the effectiveness of our model.
MM Apr 7, 2020
Direct Speech-to-image Translation
Jiguo Li, Xinfeng Zhang, Chuanmin Jia et al.
Direct speech-to-image translation without text is an interesting and useful topic because of its potential applications in human-computer interaction, art creation, computer-aided design, etc. Moreover, many languages have no written form. However, as far as we know, how to translate speech signals into images directly, and how well they can be translated, has not been well studied. In this paper, we attempt to translate speech signals into image signals without a transcription stage. Specifically, a speech encoder is designed to represent the input speech signals as an embedding feature, and it is trained with a pretrained image encoder using teacher-student learning to obtain better generalization ability on new classes. Subsequently, a stacked generative adversarial network is used to synthesize high-quality images conditioned on the embedding feature. Experimental results on both synthesized and real data show that our proposed method effectively translates raw speech signals into images without an intermediate text representation. An ablation study gives more insight into our method.
CV Nov 27, 2018
Affinity Derivation and Graph Merge for Instance Segmentation
Yiding Liu, Siyu Yang, Bin Li et al.
We present an instance segmentation scheme based on pixel affinity information, that is, whether two pixels belong to the same instance. In our scheme, we use two neural networks with similar structures. One predicts pixel-level semantic scores, and the other derives pixel affinities. Regarding pixels as vertices and affinities as edges, we then propose a simple yet effective graph merge algorithm to cluster pixels into instances. Experimental results show that our scheme can generate fine-grained instance masks. With Cityscapes training data, the proposed scheme achieves 27.3 AP on the test set.
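The idea of treating pixels as vertices and affinities as edges can be illustrated with a generic clustering sketch. This is not the paper's graph merge algorithm, whose details are not given in the abstract; it is a simple stand-in that merges any two pixels whose affinity exceeds a threshold, using union-find. The function name `merge_pixels`, the edge format, and the 0.5 threshold are all assumptions.

```python
def merge_pixels(num_pixels, affinities, threshold=0.5):
    """Cluster pixels into instances.

    affinities: iterable of (pixel_a, pixel_b, score) edges.
    Returns one instance label per pixel, numbered in order of first appearance.
    """
    parent = list(range(num_pixels))

    def find(i):
        # Follow parent links to the root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b, score in affinities:
        if score >= threshold:  # strong affinity: same instance
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(num_pixels)]

# Toy example: six pixels in a row with affinities between neighbours.
edges = [(0, 1, 0.9), (1, 2, 0.8), (2, 3, 0.1), (3, 4, 0.95), (4, 5, 0.2)]
print(merge_pixels(6, edges))
```

In this toy input the weak edges between pixels 2 and 3 and between pixels 4 and 5 act as instance boundaries, so the six pixels fall into three groups.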
CV Aug 6, 2018
Weakly Supervised Bilinear Attention Network for Fine-Grained Visual Classification
Tao Hu, Jizheng Xu, Cong Huang et al.
For fine-grained visual classification, objects usually share a similar geometric structure but differ in local appearance and pose. Therefore, localizing and extracting discriminative local features plays a crucial role in accurate category prediction. Existing works either pay attention to a limited number of object parts or train isolated networks for localization and classification. In this paper, we propose the Weakly Supervised Bilinear Attention Network (WS-BAN) to solve these issues. It jointly generates a set of attention maps (region-of-interest maps) to indicate the locations of an object's parts and extracts sequential part features by Bilinear Attention Pooling (BAP). In addition, we propose attention regularization and attention dropout to weakly supervise the generation of attention maps. WS-BAN can be trained end-to-end and achieves state-of-the-art performance on multiple fine-grained classification datasets, including CUB-200-2011, Stanford Cars and FGVC-Aircraft, demonstrating its effectiveness.
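The pooling step can be sketched as follows, assuming BAP multiplies every feature map element-wise by every attention map and then pools each product to a single number, giving one feature vector per attended part. The use of average pooling, the plain nested-list layout, and the function name are assumptions made for illustration; the abstract does not specify them.

```python
def bilinear_attention_pooling(features, attentions):
    """features: C feature maps, each H x W (nested lists).
    attentions: M attention maps, each H x W.
    Returns an M x C matrix: row k is the part feature for attention map k."""
    parts = []
    for att in attentions:
        row = []
        for fmap in features:
            total, count = 0.0, 0
            for f_line, a_line in zip(fmap, att):
                for f, a in zip(f_line, a_line):
                    total += f * a  # element-wise product
                    count += 1
            row.append(total / count)  # global average pooling (assumed)
        parts.append(row)
    return parts

# Toy example: two 2x2 feature maps and two 2x2 attention maps.
features = [[[1, 2], [3, 4]], [[0, 1], [0, 1]]]
attentions = [[[1, 0], [0, 0]], [[0, 0], [1, 1]]]
print(bilinear_attention_pooling(features, attentions))
```

Each attention map selects a region, so each row of the result summarizes the feature maps inside one region.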
CV Mar 18, 2018
Facial Landmarks Detection by Self-Iterative Regression based Landmarks-Attention Network
Tao Hu, Honggang Qi, Jizheng Xu et al.
Cascaded Regression (CR) based methods have been proposed to solve the facial landmark detection problem. They learn a series of descent directions using multiple cascaded regressors that are trained separately in coarse and fine stages, and they outperform traditional gradient descent based methods in both accuracy and running speed. However, cascaded regression is not robust enough, because each regressor's training data comes from the output of the previous regressor. Moreover, training multiple regressors requires substantial computing resources, especially for deep learning based methods. In this paper, we develop a Self-Iterative Regression (SIR) framework to improve model efficiency. Only one self-iterative regressor is trained to learn the descent directions for samples from the coarse stages to the fine stages, and the parameters are iteratively updated by the same regressor. Specifically, we propose the Landmarks-Attention Network (LAN) as our regressor, which concurrently learns features around each landmark and obtains the holistic location increment. As a result, not only are the remaining regressors removed, which simplifies training, but the number of model parameters is also significantly decreased. Experiments demonstrate that, with only 3.72M model parameters, our proposed method achieves state-of-the-art performance.
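The difference from a cascade can be sketched in a few lines: instead of a list of stage-specific regressors, a single regressor is applied repeatedly to its own output. In the sketch below, `toy_regressor` is a hypothetical stand-in for LAN that simply moves halfway toward a fixed target; the real regressor would predict the increment from image features around each landmark.

```python
def self_iterative_regression(initial, regressor, num_iters=5):
    """Refine landmark coordinates by applying the same regressor repeatedly."""
    landmarks = list(initial)
    for _ in range(num_iters):
        increment = regressor(landmarks)  # predicted location increment
        landmarks = [p + d for p, d in zip(landmarks, increment)]
    return landmarks

# Hypothetical stand-in for the learned regressor (assumption):
# it returns half of the remaining offset to a known target.
TARGET = [10.0, 20.0]

def toy_regressor(landmarks):
    return [0.5 * (t - p) for t, p in zip(TARGET, landmarks)]

result = self_iterative_regression([0.0, 0.0], toy_regressor, num_iters=10)
print(result)
```

Because one function handles both large early corrections and small late ones, only one set of parameters has to be stored and trained.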
CV Sep 12, 2017
End-to-End United Video Dehazing and Detection
Boyi Li, Xiulian Peng, Zhangyang Wang et al.
The recent development of CNN-based image dehazing has revealed the effectiveness of end-to-end modeling. However, extending the idea to end-to-end video dehazing has not yet been explored. In this paper, we propose an End-to-End Video Dehazing Network (EVD-Net) to exploit the temporal consistency between consecutive video frames. A thorough study has been conducted over a number of structure options to identify the best temporal fusion strategy. Furthermore, we build an End-to-End United Video Dehazing and Detection Network (EVDD-Net), which concatenates and jointly trains EVD-Net with a video object detection model. The resulting augmented end-to-end pipeline demonstrates much more stable and accurate detection results in hazy video.
CV Jul 20, 2017
An All-in-One Network for Dehazing and Beyond
Boyi Li, Xiulian Peng, Zhangyang Wang et al.
This paper proposes an image dehazing model built with a convolutional neural network (CNN), called the All-in-One Dehazing Network (AOD-Net). It is designed based on a re-formulated atmospheric scattering model. Instead of estimating the transmission matrix and the atmospheric light separately, as most previous models did, AOD-Net directly generates the clean image through a light-weight CNN. Such a novel end-to-end design makes it easy to embed AOD-Net into other deep models, e.g., Faster R-CNN, to improve high-level task performance on hazy images. Experimental results on both synthesized and natural hazy image datasets demonstrate superior performance over the state of the art in terms of PSNR, SSIM and subjective visual quality. Furthermore, when concatenating AOD-Net with Faster R-CNN and training the joint pipeline end to end, we observe a large improvement in object detection performance on hazy images.
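The re-formulation can be checked numerically on a single pixel. The sketch below assumes the standard atmospheric scattering model I = J t + A (1 - t), where I is the hazy value, J the clean value, t the transmission and A the atmospheric light, and it assumes that the re-formulation folds t and A into one variable K so that J = K I - K + b with a constant bias b. The abstract does not spell out these formulas, so the exact form of K and the choice b = 1 should be read as assumptions.

```python
def dehaze_reference(I, t, A):
    """Invert the scattering model directly: J = (I - A) / t + A."""
    return (I - A) / t + A

def dehaze_via_k(I, t, A, b=1.0):
    """Same result through a single combined variable K (undefined at I == 1)."""
    K = ((I - A) / t + (A - b)) / (I - 1.0)
    return K * I - K + b

# Toy pixel (assumption): hazy intensity 0.6, transmission 0.5, atmospheric light 0.8.
print(dehaze_reference(0.6, 0.5, 0.8), dehaze_via_k(0.6, 0.5, 0.8))
```

Both functions return the same clean value, which illustrates why a network that estimates only K can replace separate estimates of the transmission and the atmospheric light.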
MM May 26, 2016
Efficient Multiple Line-Based Intra Prediction for HEVC
Jiahao Li, Bin Li, Jizheng Xu et al.
Traditional intra prediction usually uses the nearest reference line to generate the predicted block, given the strong spatial correlation. However, this kind of single line-based method does not always work well, for at least two reasons. One is the incoherence caused by signal noise or by the texture of another object, where this texture deviates from the inherent texture of the current block. The other is that the nearest reference line usually has worse reconstruction quality in block-based video coding. To address these two issues, this paper proposes an efficient multiple line-based intra prediction scheme to improve coding efficiency. In addition to the nearest reference line, further reference lines are also used. These further reference lines, with relatively higher quality, can potentially provide better prediction. At the same time, residue compensation is introduced to calibrate the prediction of boundary regions in a block when further reference lines are used. To speed up the encoding process, this paper designs several fast algorithms. Experimental results show that, compared with HM-16.9, the proposed fast search method achieves a 2.0% bit saving on average and up to 3.7%, while increasing the encoding time by 112%.