Tao Wan

h-index34

9papers

260citations

Novelty48%

AI Score25

Ranked #165,482 of 194,257 authors (top 85%)#53,148 in CV (top 90%)

9 Papers

5.7CVApr 10, 2022

Reasoning with Multi-Structure Commonsense Knowledge in Visual Dialog

Shunyu Zhang, Xiaoze Jiang, Zequn Yang et al.

Visual Dialog requires an agent to engage in a conversation with humans grounded in an image. Many studies on Visual Dialog focus on the understanding of the dialog history or the content of an image, while a considerable amount of commonsense-required questions are ignored. Handling these scenarios depends on logical reasoning that requires commonsense priors. How to capture relevant commonsense knowledge complementary to the history and the image remains a key challenge. In this paper, we propose a novel model by Reasoning with Multi-structure Commonsense Knowledge (RMK). In our model, the external knowledge is represented with sentence-level facts and graph-level facts, to properly suit the scenario of the composite of dialog history and image. On top of these multi-structure representations, our model can capture relevant knowledge and incorporate them into the vision and semantic features, via graph-based interaction and transformer-based fusion. Experimental results and analysis on VisDial v1.0 and VisDialCK datasets show that our proposed model effectively outperforms comparative methods.

20.8CRNov 20, 2018

FALCON: A Fourier Transform Based Approach for Fast and Secure Convolutional Neural Network Predictions

Shaohua Li, Kaiping Xue, Chenkai Ding et al.

Machine learning as a service has been widely deployed to utilize deep neural network models to provide prediction services. However, this raises privacy concerns since clients need to send sensitive information to servers. In this paper, we focus on the scenario where clients want to classify private images with a convolutional neural network model hosted in the server, while both parties keep their data private. We present FALCON, a fast and secure approach for CNN predictions based on Fourier Transform. Our solution enables linear layers of a CNN model to be evaluated simply and efficiently with fully homomorphic encryption. We also introduce the first efficient and privacy-preserving protocol for softmax function, which is an indispensable component in CNNs and has not yet been evaluated in previous works due to its high complexity. We implemented the FALCON and evaluated the performance on real-world CNN models. The experimental results show that FALCON outperforms the best known works in both computation and communication cost.

2.5CVNov 1, 2018

A sequential guiding network with attention for image captioning

Daouda Sow, Zengchang Qin, Mouhamed Niasse et al.

The recent advances of deep learning in both computer vision (CV) and natural language processing (NLP) provide us a new way of understanding semantics, by which we can deal with more challenging tasks such as automatic description generation from natural images. In this challenge, the encoder-decoder framework has achieved promising performance when a convolutional neural network (CNN) is used as image encoder and a recurrent neural network (RNN) as decoder. In this paper, we introduce a sequential guiding network that guides the decoder during word generation. The new model is an extension of the encoder-decoder framework with attention that has an additional guiding long short-term memory (LSTM) and can be trained in an end-to-end manner by using image/descriptions pairs. We validate our approach by conducting extensive experiments on a benchmark dataset, i.e., MS COCO Captions. The proposed model achieves significant improvement comparing to the other state-of-the-art deep learning models.

11.4CVNov 1, 2018

Pixel Level Data Augmentation for Semantic Image Segmentation using Generative Adversarial Networks

Shuangting Liu, Jiaqi Zhang, Yuxin Chen et al.

Semantic segmentation is one of the basic topics in computer vision, it aims to assign semantic labels to every pixel of an image. Unbalanced semantic label distribution could have a negative influence on segmentation accuracy. In this paper, we investigate using data augmentation approach to balance the semantic label distribution in order to improve segmentation performance. We propose using generative adversarial networks (GANs) to generate realistic images for improving the performance of semantic segmentation networks. Experimental results show that the proposed method can not only improve segmentation performance on those classes with low accuracy, but also obtain 1.3% to 2.1% increase in average segmentation accuracy. It shows that this augmentation method can boost accuracy and be easily applicable to any other segmentation models.

3.9CVMar 14, 2018

Context-Aware Mixed Reality: A Framework for Ubiquitous Interaction

Long Chen, Wen Tang, Nigel John et al.

Mixed Reality (MR) is a powerful interactive technology that yields new types of user experience. We present a semantic based interactive MR framework that exceeds the current geometry level approaches, a step change in generating high-level context-aware interactions. Our key insight is to build semantic understanding in MR that not only can greatly enhance user experience through object-specific behaviours, but also pave the way for solving complex interaction design challenges. The framework generates semantic properties of the real world environment through dense scene reconstruction and deep image understanding. We demonstrate our approach with a material-aware prototype system for generating context-aware physical interactions between the real and the virtual objects. Quantitative and qualitative evaluations are carried out and the results show that the framework delivers accurate and fast semantic information in interactive MR environment, providing effective semantic level interactions.

1.8CLMay 9, 2017

Logical Parsing from Natural Language Based on a Neural Translation Model

Liang Li, Pengyu Li, Yifan Liu et al.

Semantic parsing has emerged as a significant and powerful paradigm for natural language interface and question answering systems. Traditional methods of building a semantic parser rely on high-quality lexicons, hand-crafted grammars and linguistic features which are limited by applied domain or representation. In this paper, we propose a general approach to learn from denotations based on Seq2Seq model augmented with attention mechanism. We encode input sequence into vectors and use dynamic programming to infer candidate logical forms. We utilize the fact that similar utterances should have similar logical forms to help reduce the searching space. Under our learning policy, the Seq2Seq model can learn mappings gradually with noises. Curriculum learning is adopted to make the learning smoother. We test our method on the arithmetic domain which shows our model can successfully infer the correct logical forms and learn the word meanings, compositionality and operation orders simultaneously.

6.6CVMay 8, 2017

Generative Cooperative Net for Image Generation and Data Augmentation

Qiangeng Xu, Zengchang Qin, Tao Wan

How to build a good model for image generation given an abstract concept is a fundamental problem in computer vision. In this paper, we explore a generative model for the task of generating unseen images with desired features. We propose the Generative Cooperative Net (GCN) for image generation. The idea is similar to generative adversarial networks except that the generators and discriminators are trained to work accordingly. Our experiments on hand-written digit generation and facial expression generation show that GCN's two cooperative counterparts (the generator and the classifier) can work together nicely and achieve promising results. We also discovered a usage of such generative model as an data-augmentation tool. Our experiment of applying this method on a recognition task shows that it is very effective comparing to other existing methods. It is easy to set up and could help generate a very large synthesized dataset.

4.3NIMar 20, 2017

A Framework and Comparative Analysis of Control Plane Security of SDN and Conventional Networks

AbdelRahman Abdou, Paul C. van Oorschot, Tao Wan

Software defined networking implements the network control plane in an external entity, rather than in each individual device as in conventional networks. This architectural difference implies a different design for control functions necessary for essential network properties, e.g., loop prevention and link redundancy. We explore how such differences redefine the security weaknesses in the SDN control plane and provide a framework for comparative analysis which focuses on essential network properties required by typical production networks. This enables analysis of how these properties are delivered by the control planes of SDN and conventional networks, and to compare security risks and mitigations. Despite the architectural difference, we find similar, but not identical, exposures in control plane security if both network paradigms provide the same network properties and are analyzed under the same threat model. However, defenses vary; SDN cannot depend on edge based filtering to protect its control plane, while this is arguably the primary defense in conventional networks. Our concrete security analysis suggests that a distributed SDN architecture that supports fault tolerance and consistency checks is important for SDN control plane security. Our analysis methodology may be of independent interest for future security analysis of SDN and conventional networks.

2.4CVMar 1, 2017

Augmented Reality for Depth Cues in Monocular Minimally Invasive Surgery

Long Chen, Wen Tang, Nigel W. John et al.

One of the major challenges in Minimally Invasive Surgery (MIS) such as laparoscopy is the lack of depth perception. In recent years, laparoscopic scene tracking and surface reconstruction has been a focus of investigation to provide rich additional information to aid the surgical process and compensate for the depth perception issue. However, robust 3D surface reconstruction and augmented reality with depth perception on the reconstructed scene are yet to be reported. This paper presents our work in this area. First, we adopt a state-of-the-art visual simultaneous localization and mapping (SLAM) framework - ORB-SLAM - and extend the algorithm for use in MIS scenes for reliable endoscopic camera tracking and salient point mapping. We then develop a robust global 3D surface reconstruction frame- work based on the sparse point clouds extracted from the SLAM framework. Our approach is to combine an outlier removal filter within a Moving Least Squares smoothing algorithm and then employ Poisson surface reconstruction to obtain smooth surfaces from the unstructured sparse point cloud. Our proposed method has been quantitatively evaluated compared with ground-truth camera trajectories and the organ model surface we used to render the synthetic simulation videos. In vivo laparoscopic videos used in the tests have demonstrated the robustness and accuracy of our proposed framework on both camera tracking and surface reconstruction, illustrating the potential of our algorithm for depth augmentation and depth-corrected augmented reality in MIS with monocular endoscopes.