Ming-Ting Sun

16 papers

3,770 citations

Novelty 51%

AI Score 29

Ranked #150,529 of 201,326 authors (top 75%) · #46,811 in CV (top 79%)

16 Papers

CV · Jul 11, 2019 · Code

Cross-Domain Complementary Learning Using Pose for Multi-Person Part Segmentation

Kevin Lin, Lijuan Wang, Kun Luo et al.

Supervised deep learning with pixel-wise training labels has achieved great success on multi-person part segmentation. However, pixel-level data labeling is very expensive. To address this problem, researchers have explored using synthetic data to avoid manual labeling. Although it is easy to generate labels for synthetic data, the results are much worse than those obtained with real data and manual labeling. The performance degradation is mainly due to the domain gap, i.e., the discrepancy in pixel value statistics between real and synthetic data. In this paper, we observe that real and synthetic humans both have a skeleton (pose) representation. We find that the skeletons can effectively bridge the synthetic and real domains during training. Our proposed approach takes advantage of the rich and realistic variations of the real data and the easily obtainable labels of the synthetic data to learn multi-person part segmentation on real images without any human-annotated labels. Through experiments, we show that without any human labeling, our method performs comparably to several state-of-the-art approaches that require human labeling on the Pascal-Person-Parts and COCO-DensePose datasets. On the other hand, if part labels are also available for the real images during training, our method outperforms the supervised state-of-the-art methods by a large margin. We further demonstrate the generalizability of our method by predicting novel keypoints in real images where no real data labels are available for novel keypoint detection. Code and pre-trained models are available at https://github.com/kevinlin311tw/CDCL-human-part-segmentation

CL · Sep 16, 2020

Contextualized Perturbation for Textual Adversarial Attack

Dianqi Li, Yizhe Zhang, Hao Peng et al.

Adversarial examples expose the vulnerabilities of natural language processing (NLP) models, and can be used to evaluate and improve their robustness. Existing techniques of generating such examples are typically driven by local heuristic rules that are agnostic to the context, often resulting in unnatural and ungrammatical outputs. This paper presents CLARE, a ContextuaLized AdversaRial Example generation model that produces fluent and grammatical outputs through a mask-then-infill procedure. CLARE builds on a pre-trained masked language model and modifies the inputs in a context-aware manner. We propose three contextualized perturbations, Replace, Insert and Merge, allowing for generating outputs of varied lengths. With a richer range of available strategies, CLARE is able to attack a victim model more efficiently with fewer edits. Extensive experiments and human evaluation demonstrate that CLARE outperforms the baselines in terms of attack success rate, textual similarity, fluency and grammaticality.
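The three perturbations named in this abstract can be illustrated at the token level. The sketch below only builds the masked inputs; the infill step, in which a masked language model proposes a token for the mask, is omitted. The function names and the `[MASK]` string are illustrative assumptions and are not taken from the authors' code.

```python
from typing import List

MASK = "[MASK]"  # assumed mask string; real masked language models define their own


def replace_mask(tokens: List[str], i: int) -> List[str]:
    """Replace: mask the token at position i (length unchanged)."""
    return tokens[:i] + [MASK] + tokens[i + 1:]


def insert_mask(tokens: List[str], i: int) -> List[str]:
    """Insert: add a mask after position i (length grows by one)."""
    return tokens[:i + 1] + [MASK] + tokens[i + 1:]


def merge_mask(tokens: List[str], i: int) -> List[str]:
    """Merge: mask the bigram at positions i and i+1 with one mask (length shrinks by one)."""
    return tokens[:i] + [MASK] + tokens[i + 2:]


if __name__ == "__main__":
    sentence = "the movie was really good".split()
    print(replace_mask(sentence, 4))
    print(insert_mask(sentence, 2))
    print(merge_mask(sentence, 3))
```

Each helper returns a new list and leaves the input untouched, so all three candidate perturbations can be generated from the same sentence and compared.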

CV · Feb 28, 2020

Learning Nonparametric Human Mesh Reconstruction from a Single Image without Ground Truth Meshes

Kevin Lin, Lijuan Wang, Ying Jin et al.

Nonparametric approaches have shown promising results in reconstructing a 3D human mesh from a single monocular image. Unlike previous approaches that use a parametric human model such as the skinned multi-person linear model (SMPL) and attempt to regress the model parameters, nonparametric approaches relax the heavy reliance on the parametric space. However, existing nonparametric methods require ground truth meshes as the regression target for each vertex, and obtaining ground truth mesh labels is very expensive. In this paper, we propose a novel approach to learn human mesh reconstruction without any ground truth meshes. This is made possible by introducing two new terms into the loss function of a graph convolutional neural network (Graph CNN). The first term is the Laplacian prior, which acts as a regularizer on the reconstructed mesh. The second term is the part segmentation loss, which forces the projected region of the reconstructed mesh to match the part segmentation. Experimental results on multiple public datasets show that without using 3D ground truth meshes, the proposed approach outperforms the previous state-of-the-art approaches that require ground truth meshes for training.
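As a rough illustration of what a Laplacian regularizer on a mesh can look like, the sketch below penalizes the squared distance between each vertex and the centroid of its neighbors (a uniform graph Laplacian). This is a generic formulation chosen for illustration; the abstract does not specify the exact form of the prior, and the function name and data layout are assumptions.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def uniform_laplacian_loss(vertices: Sequence[Vec3],
                           edges: Sequence[Tuple[int, int]]) -> float:
    """Mean squared distance between each vertex and its neighbors' centroid.

    vertices: 3D positions; edges: undirected vertex index pairs, each listed once.
    """
    if not vertices:
        return 0.0

    # Build an adjacency list from the undirected edge list.
    neighbors: List[List[int]] = [[] for _ in vertices]
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)

    total = 0.0
    for i, v in enumerate(vertices):
        if not neighbors[i]:
            continue  # an isolated vertex contributes nothing
        n = len(neighbors[i])
        centroid = [sum(vertices[j][k] for j in neighbors[i]) / n for k in range(3)]
        total += sum((v[k] - centroid[k]) ** 2 for k in range(3))
    return total / len(vertices)


if __name__ == "__main__":
    square = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (0.0, 1.0, 0.0)]
    ring = [(0, 1), (1, 2), (2, 3), (3, 0)]
    print(uniform_laplacian_loss(square, ring))
```

A smaller value means each vertex sits closer to the average of its neighbors, which is why a term of this kind discourages spiky or folded surfaces.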

CL · Feb 16, 2020

Learning to Generate Multiple Style Transfer Outputs for an Input Sentence

Kevin Lin, Ming-Yu Liu, Ming-Ting Sun et al.

Text style transfer refers to the task of rephrasing a given text in a different style. While various methods have been proposed to advance the state of the art, they often assume the transfer output follows a delta distribution, and thus their models cannot generate different style transfer results for a given input text. To address this limitation, we propose a one-to-many text style transfer framework. In contrast to prior works that learn a one-to-one mapping that converts an input sentence to one output sentence, our approach learns a one-to-many mapping that can convert an input sentence to multiple different output sentences while preserving the input content. This is achieved by applying adversarial training with a latent decomposition scheme. Specifically, we decompose the latent representation of the input sentence into a style code that captures the language style variation and a content code that encodes the style-independent content. We then combine the content code with the style code to generate a style transfer output. By combining the same content code with a different style code, we generate a different style transfer output. Extensive experimental results, with comparisons to several text style transfer approaches on multiple public datasets using a diverse set of performance metrics, validate the effectiveness of the proposed approach.

CL · Aug 25, 2019

Domain Adaptive Text Style Transfer

Dianqi Li, Yizhe Zhang, Zhe Gan et al.

Text style transfer without parallel data has achieved some practical success. However, in the scenario where less data is available, these methods may yield poor performance. In this paper, we examine domain adaptation for text style transfer to leverage massively available data from other domains. These data may demonstrate domain shift, which impedes the benefits of utilizing such data for training. To address this challenge, we propose simple yet effective domain adaptive text style transfer models, enabling domain-adaptive information exchange. The proposed models presumably learn from the source domain to: (i) distinguish stylized information and generic content information; (ii) maximally preserve content information; and (iii) adaptively transfer the styles in a domain-aware manner. We evaluate the proposed models on two style transfer tasks (sentiment and formality) over multiple target domains where only limited non-parallel data is available. Extensive experiments demonstrate the effectiveness of the proposed model compared to the baselines.

MM · May 29, 2018

Surface Light Field Compression using a Point Cloud Codec

Xiang Zhang, Philip A. Chou, Ming-Ting Sun et al.

Light field (LF) representations aim to provide photo-realistic, free-viewpoint viewing experiences. However, the most popular LF representations are images from multiple views. Multi-view image-based representations generally need to restrict the range or degrees of freedom of the viewing experience to what can be interpolated in the image domain, essentially because they lack explicit geometry information. We present a new surface light field (SLF) representation based on explicit geometry, and a method for SLF compression. First, we map the multi-view images of a scene onto a 3D geometric point cloud. The color of each point in the point cloud is a function of viewing direction known as a view map. We represent each view map efficiently in a B-Spline wavelet basis. This representation is capable of modeling diverse surface materials and complex lighting conditions in a highly scalable and adaptive manner. The coefficients of the B-Spline wavelet representation are then compressed spatially. To increase the spatial correlation and thus improve compression efficiency, we introduce a smoothing term to make the coefficients more similar across the 3D space. We compress the coefficients spatially using existing point cloud compression (PCC) methods. On the decoder side, the scene is rendered efficiently from any viewing direction by reconstructing the view map at each point. In contrast to multi-view image-based LF approaches, our method supports photo-realistic rendering of real-world scenes from arbitrary viewpoints, i.e., with an unlimited six degrees of freedom (6DOF). In terms of rate and distortion, experimental results show that our method achieves superior performance with lighter decoder complexity compared with a reference image-plus-geometry compression (IGC) scheme, indicating its potential in practical virtual and augmented reality applications.

CV · Apr 3, 2018

Generating Diverse and Accurate Visual Captions by Comparative Adversarial Learning

Dianqi Li, Qiuyuan Huang, Xiaodong He et al.

We study how to generate captions that are not only accurate in describing an image but also discriminative across different images. The problem is both fundamental and interesting, as most machine-generated captions, despite phenomenal research progress in the past several years, are expressed in a very monotonic and featureless format. While such captions are normally accurate, they often lack important characteristics of human language: distinctiveness for each caption and diversity across different images. To address this problem, we propose a novel conditional generative adversarial network for generating diverse captions across images. Instead of estimating the quality of a caption solely on one image, the proposed comparative adversarial learning framework better assesses the quality of captions by comparing a set of captions within the image-caption joint space. By contrasting with human-written captions and image-mismatched captions, the caption generator effectively exploits the inherent characteristics of human language and generates more discriminative captions. We show that our proposed network is capable of producing accurate and diverse captions across images.

CV · Feb 8, 2018

Hole Filling with Multiple Reference Views in DIBR View Synthesis

Shuai Li, Ce Zhu, Ming-Ting Sun

Depth-image-based rendering (DIBR) oriented view synthesis has been widely employed in current depth-based 3D video systems to synthesize a virtual view from an arbitrary viewpoint. However, holes may appear in the synthesized view due to disocclusion, significantly degrading the quality. Consequently, efforts have been made to develop effective and efficient hole filling algorithms. Current hole filling techniques generally extrapolate or interpolate the hole regions from neighboring information, based on the assumption that the texture pattern in the holes is similar to that of the neighboring background. However, in many scenarios, especially those with complex texture, the assumption may not hold. In other words, hole filling techniques can only provide an estimate for a hole, which may not be good enough or may even be erroneous given the wide variety of complex scenes. In this paper, we first examine view interpolation with multiple reference views, demonstrating that the problem of emerging holes in a target virtual view can be greatly alleviated by making good use of other neighboring complementary views in addition to its two (commonly used) nearest primary views. The effects of using multiple views for view extrapolation in reducing holes are also investigated. In view of 3D Video and the ongoing free-viewpoint TV standardization, we propose a new view synthesis framework which employs multiple views to synthesize output virtual views. Furthermore, a scheme for selective warping of complementary views is developed that efficiently locates a small number of useful pixels in the complementary views for hole reduction, avoiding a full warping of the additional complementary views and thus greatly lowering the warping complexity.

CL · May 31, 2017

Adversarial Ranking for Language Generation

Kevin Lin, Dianqi Li, Xiaodong He et al.

Generative adversarial networks (GANs) have achieved great success in synthesizing data. However, existing GANs restrict the discriminator to be a binary classifier, and thus limit their learning capacity for tasks that need to synthesize output with rich structure, such as natural language descriptions. In this paper, we propose a novel generative adversarial network, RankGAN, for generating high-quality language descriptions. Rather than training the discriminator to learn and assign an absolute binary predicate to each individual data sample, the proposed RankGAN is able to analyze and rank a collection of human-written and machine-written sentences given a reference group. By viewing a set of data samples collectively and evaluating their quality through relative ranking scores, the discriminator is able to make a better assessment, which in turn helps to learn a better generator. The proposed RankGAN is optimized through the policy gradient technique. Experimental results on multiple public datasets clearly demonstrate the effectiveness of the proposed approach.
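To make the idea of a relative ranking score concrete, the sketch below scores a set of candidate sentence embeddings against one reference embedding with a softmax over scaled cosine similarity. It is one plausible form of such a score, written for illustration; the embeddings, the `gamma` scaling factor, and the function names are assumptions, and the exact formulation should be checked against the paper.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ranking_scores(reference: Sequence[float],
                   candidates: Sequence[Sequence[float]],
                   gamma: float = 1.0) -> List[float]:
    """Softmax over gamma-scaled cosine similarity to the reference.

    Each score is relative: it depends on the other candidates in the set,
    not only on the candidate itself.
    """
    logits = [gamma * cosine(c, reference) for c in candidates]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


if __name__ == "__main__":
    reference = [1.0, 0.0]
    candidates = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(ranking_scores(reference, candidates, gamma=2.0))
```

Because the scores are normalized over the comparison set, adding a stronger candidate lowers the score of every other candidate, which is the contrast with an absolute binary real-or-fake judgment.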

CV · Nov 10, 2015

Semantic Instance Annotation of Street Scenes by 3D to 2D Label Transfer

Jun Xie, Martin Kiefel, Ming-Ting Sun et al.

Semantic annotations are vital for training models for object recognition, semantic segmentation or scene understanding. Unfortunately, pixelwise annotation of images at very large scale is labor-intensive and only little labeled data is available, particularly at instance level and for street scenes. In this paper, we propose to tackle this problem by lifting the semantic instance labeling task from 2D into 3D. Given reconstructions from stereo or laser data, we annotate static 3D scene elements with rough bounding primitives and develop a model which transfers this information into the image domain. We leverage our method to obtain 2D labels for a novel suburban video dataset which we have collected, resulting in 400k semantic and instance image annotations. A comparison of our method to state-of-the-art label transfer baselines reveals that 3D information enables more efficient annotation while at the same time resulting in improved accuracy and time-coherent labels.

MM · Feb 28, 2015

Region-Based Rate-Control for H.264/AVC for Low Bit-Rate Applications

Hai-Miao Hu, Bo Li, Weiyao Lin et al.

Rate-control plays an important role in video coding. However, in conventional rate-control algorithms, the number and position of Macroblocks (MBs) inside one basic unit for rate-control are inflexible and predetermined. The different characteristics of the MBs are not fully considered. Also, there is no overall optimization of the coding of basic units. This paper proposes a new region-based rate-control scheme for H.264/AVC to improve the coding efficiency. The inter-frame information is explored to objectively divide one frame into multiple regions based on their rate-distortion behaviors. MBs with similar characteristics are classified into the same region, and the entire region, instead of a single MB or a group of contiguous MBs, is treated as a basic unit for rate-control. A linear rate-quantization stepsize model and a linear distortion-quantization stepsize model are proposed to accurately describe the rate-distortion characteristics of the region-based basic units. Moreover, based on these linear models, an overall optimization model is proposed to obtain suitable Quantization Parameters (QPs) for the region-based basic units. Experimental results demonstrate that the proposed region-based rate-control approach can achieve better subjective and objective quality than conventional rate-control approaches by adapting the rate-control to the content.

MM · Feb 28, 2015

Macroblock Classification Method for Video Applications Involving Motions

Weiyao Lin, Ming-Ting Sun, Hongxiang Li et al.

In this paper, a macroblock classification method is proposed for various video processing applications involving motion. Based on an analysis of the Motion Vector field in the compressed video, we propose to classify the Macroblocks of each video frame into different classes and use this class information to describe the frame content. We demonstrate that this low-complexity method can efficiently capture the characteristics of the frame. Based on the proposed macroblock classification, we further propose algorithms for different video processing applications, including shot change detection, motion discontinuity detection, and outlier rejection for global motion estimation. Experimental results demonstrate that the methods based on the proposed approach work effectively in these applications.

MM · Feb 28, 2015

A Fast Sub-Pixel Motion Estimation Algorithm for H.264/AVC Video Coding

Weiyao Lin, Krit Panusopone, David M. Baylon et al.

Motion Estimation (ME) is one of the most time-consuming parts of video coding. The use of multiple partition sizes in H.264/AVC makes it even more complicated than ME in conventional video coding standards. It is important to develop fast and effective sub-pixel ME algorithms since (a) the computation overhead of sub-pixel ME has become relatively significant now that the complexity of integer-pixel search has been greatly reduced by fast algorithms, and (b) reducing the number of sub-pixel search points can greatly reduce the computation for sub-pixel interpolation. In this paper, a novel fast sub-pixel ME algorithm is proposed which performs a 'rough' sub-pixel search before the partition selection, and performs a 'precise' sub-pixel search for the best partition. By reducing the search load for the large number of non-best partitions, the computation complexity of sub-pixel search can be greatly decreased. Experimental results show that our method can reduce the sub-pixel search points by more than 50% compared to existing fast sub-pixel ME methods with negligible quality degradation.

MM · Feb 28, 2015

A Computation Control Motion Estimation Method for Complexity-Scalable Video Coding

Weiyao Lin, Krit Panusopone, David M. Baylon et al.

In this paper, a new Computation-Control Motion Estimation (CCME) method is proposed which can perform Motion Estimation (ME) adaptively under different computation or power budgets while keeping high coding performance. We first propose a new class-based method to measure the Macroblock (MB) importance where MBs are classified into different classes and their importance is measured by combining their class information as well as their initial matching cost information. Based on the new MB importance measure, a complete CCME framework is then proposed to allocate computation for ME. The proposed method performs ME in a one-pass flow. Experimental results demonstrate that the proposed method can allocate computation more accurately than previous methods and thus has better performance under the same computation budget.

CV · Feb 28, 2015

Group Event Detection with a Varying Number of Group Members for Video Surveillance

Weiyao Lin, Ming-Ting Sun, Radha Poovendran et al.

This paper presents a novel approach for automatic recognition of group activities for video surveillance applications. We propose to use a group representative to handle the recognition with a varying number of group members, and use an Asynchronous Hidden Markov Model (AHMM) to model the relationship between people. Furthermore, we propose a group activity detection algorithm which can handle both symmetric and asymmetric group activities, and demonstrate that this approach enables the detection of hierarchical interactions between people. Experimental results show the effectiveness of our approach.

CV · Feb 28, 2015

Activity Recognition Using A Combination of Category Components And Local Models for Video Surveillance

Weiyao Lin, Ming-Ting Sun, Radha Poovendran et al.

This paper presents a novel approach for automatic recognition of human activities for video surveillance applications. We propose to represent an activity by a combination of category components, and demonstrate that this approach offers the flexibility to add new activities to the system and the ability to build models for activities that lack training data. To improve the recognition accuracy, a Confident-Frame-based Recognition algorithm is also proposed, where the video frames with high confidence for recognizing an activity are used as a specialized local model to help classify the remainder of the video frames. Experimental results show the effectiveness of the proposed approach.