Kun Han

CV
h-index: 14
26 papers
502 citations
Novelty: 48%
AI Score: 49

26 Papers

CV · Sep 22, 2022
Identity-Aware Hand Mesh Estimation and Personalization from RGB Images

Deying Kong, Linguang Zhang, Liangjian Chen et al. · meta-ai

Reconstructing 3D hand meshes from monocular RGB images has attracted an increasing amount of attention due to its enormous potential applications in the field of AR/VR. Most state-of-the-art methods attempt to tackle this task in an anonymous manner. Specifically, the identity of the subject is ignored even though it is practically available in real applications where the user is unchanged in a continuous recording session. In this paper, we propose an identity-aware hand mesh estimation model, which can incorporate the identity information represented by the intrinsic shape parameters of the subject. We demonstrate the importance of the identity information by comparing the proposed identity-aware model to a baseline which treats the subject anonymously. Furthermore, to handle the use case where the test subject is unseen, we propose a novel personalization pipeline to calibrate the intrinsic shape parameters using only a few unlabeled RGB images of the subject. Experiments on two large-scale public datasets validate the state-of-the-art performance of our proposed method.

CV · Apr 6, 2023
Localized Region Contrast for Enhancing Self-Supervised Learning in Medical Image Segmentation

Xiangyi Yan, Junayed Naushad, Chenyu You et al. · meta-ai

Recent advancements in self-supervised learning have demonstrated that effective visual representations can be learned from unlabeled images. This has led to increased interest in applying self-supervised learning to the medical domain, where unlabeled images are abundant and labeled images are difficult to obtain. However, most self-supervised learning approaches are modeled as image-level discriminative or generative proxy tasks, which may not capture the finer-level representations necessary for dense prediction tasks like multi-organ segmentation. In this paper, we propose a novel contrastive learning framework that integrates Localized Region Contrast (LRC) to enhance existing self-supervised pre-training methods for medical image segmentation. Our approach involves identifying superpixels with Felzenszwalb's algorithm and performing local contrastive learning using a novel contrastive sampling loss. Through extensive experiments on three multi-organ segmentation datasets, we demonstrate that integrating LRC into an existing self-supervised method in a limited annotation setting significantly improves segmentation performance. Moreover, we show that LRC can also be applied to fully-supervised pre-training methods to further boost performance.

CV · Jul 23, 2023
Hybrid-CSR: Coupling Explicit and Implicit Shape Representation for Cortical Surface Reconstruction

Shanlin Sun, Thanh-Tung Le, Chenyu You et al. · meta-ai

We present Hybrid-CSR, a geometric deep-learning model that combines explicit and implicit shape representations for cortical surface reconstruction. Specifically, Hybrid-CSR begins with explicit deformations of template meshes to obtain coarsely reconstructed cortical surfaces, based on which the oriented point clouds are estimated for the subsequent differentiable Poisson surface reconstruction. By doing so, our method unifies explicit (oriented point clouds) and implicit (indicator function) cortical surface reconstruction. Compared to explicit representation-based methods, our hybrid approach is better suited to capturing detailed structures, and compared with implicit representation-based methods, our method can be topology-aware because of end-to-end training with a mesh-based deformation module. In order to address topology defects, we propose a new topology correction pipeline that relies on optimization-based diffeomorphic surface registration. Experimental results on three brain datasets show that our approach surpasses existing implicit and explicit cortical surface reconstruction methods on numeric metrics of accuracy, regularity, and consistency.

CV · Mar 16, 2022 · Code
Topology-Preserving Shape Reconstruction and Registration via Neural Diffeomorphic Flow

Shanlin Sun, Kun Han, Deying Kong et al.

Deep Implicit Functions (DIFs) represent 3D geometry with continuous signed distance functions learned through deep neural nets. Recently, DIF-based methods have been proposed to handle shape reconstruction and dense point correspondences simultaneously, capturing semantic relationships across shapes of the same class by learning a DIF-modeled shape template. These methods provide great flexibility and accuracy in reconstructing 3D shapes and inferring correspondences. However, the point correspondences built from these methods do not intrinsically preserve the topology of the shapes, unlike mesh-based template matching methods. This limits their applications to 3D geometries where underlying topological structures exist and matter, such as anatomical structures in medical images. In this paper, we propose a new model called Neural Diffeomorphic Flow (NDF) to learn deep implicit shape templates, representing shapes as conditional diffeomorphic deformations of templates, intrinsically preserving shape topologies. The diffeomorphic deformation is realized by an auto-decoder consisting of Neural Ordinary Differential Equation (NODE) blocks that progressively map shapes to implicit templates. We conduct extensive experiments on several medical image organ segmentation datasets to evaluate the effectiveness of NDF on reconstructing and aligning shapes. NDF consistently achieves state-of-the-art organ shape reconstruction and registration results in both accuracy and quality. The source code is publicly available at https://github.com/Siwensun/Neural_Diffeomorphic_Flow--NDF.

CV · Sep 20, 2023
Light Field Diffusion for Single-View Novel View Synthesis

Yifeng Xiong, Haoyu Ma, Shanlin Sun et al. · meta-ai

Single-view novel view synthesis (NVS), the task of generating images from new viewpoints based on a single reference image, is important but challenging in computer vision. Recent advancements in NVS have leveraged Denoising Diffusion Probabilistic Models (DDPMs) for their exceptional ability to produce high-fidelity images. However, current diffusion-based methods typically utilize camera pose matrices to globally and implicitly enforce 3D constraints, which can lead to inconsistencies in images generated from varying viewpoints, particularly in regions with complex textures and structures. To address these limitations, we present Light Field Diffusion (LFD), a novel conditional diffusion-based approach that transcends the conventional reliance on camera pose matrices. Starting from the camera pose matrices, LFD transforms them into light field encoding, with the same shape as the reference image, to describe the direction of each ray. By integrating light field encoding with the reference image, our method imposes local pixel-wise constraints within the diffusion process, fostering enhanced view consistency. Our approach not only involves training image LFD on the ShapeNet Car dataset but also includes fine-tuning a pre-trained latent diffusion model on the Objaverse dataset. This enables our latent LFD model to exhibit remarkable zero-shot generalization capabilities across out-of-distribution datasets like RTMV as well as in-the-wild images. Experiments demonstrate that LFD not only produces high-fidelity images but also achieves superior 3D consistency in complex regions, outperforming existing novel view synthesis methods.
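
As a rough illustration of the per-pixel ray idea in the abstract above, the sketch below turns a camera rotation into a direction map with the same height and width as the image. The pinhole model, the `focal` parameter, the principal point at the image center, and the function name `ray_directions` are all assumptions made for illustration; the paper's actual light field encoding may differ.

```python
import math

def ray_directions(height, width, focal, rotation):
    """Return a height x width grid of unit ray directions in world space.

    rotation is a 3x3 camera-to-world rotation matrix (list of rows).
    Assumes a pinhole camera looking along +z with the principal point
    at the image center (an illustrative assumption).
    """
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0
    grid = []
    for v in range(height):
        row = []
        for u in range(width):
            # Direction of the ray through pixel (u, v) in camera coordinates.
            d_cam = ((u - cx) / focal, (v - cy) / focal, 1.0)
            # Rotate into world coordinates, then normalize to unit length.
            d_world = [sum(rotation[i][j] * d_cam[j] for j in range(3))
                       for i in range(3)]
            norm = math.sqrt(sum(c * c for c in d_world))
            row.append(tuple(c / norm for c in d_world))
        grid.append(row)
    return grid

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
rays = ray_directions(3, 3, focal=2.0, rotation=identity)
```

With the identity rotation, the center pixel's ray points straight along the optical axis, and every other pixel gets its own direction, which is what makes the constraint local rather than global.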

CV · Aug 29, 2023 · Code
On-the-Fly Guidance Training for Medical Image Registration

Yuelin Xin, Yicheng Chen, Shengxiang Ji et al.

This study introduces a novel On-the-Fly Guidance (OFG) training framework for enhancing existing learning-based image registration models, addressing the limitations of weakly-supervised and unsupervised methods. Weakly-supervised methods struggle due to the scarcity of labeled data, and unsupervised methods directly depend on image similarity metrics for accuracy. Our method trains registration models in a supervised fashion, without the need for any labeled data. OFG generates pseudo-ground truth during training by refining deformation predictions with a differentiable optimizer, enabling direct supervised learning. OFG optimizes deformation predictions efficiently, improving the performance of registration models without sacrificing inference speed. Tested across several benchmark datasets and leading models, our method significantly enhances performance, providing a plug-and-play solution for training learning-based registration models. Code is available at: https://github.com/cilix-ai/on-the-fly-guidance

CV · Nov 11, 2023
CVTHead: One-shot Controllable Head Avatar with Vertex-feature Transformer

Haoyu Ma, Tong Zhang, Shanlin Sun et al. · meta-ai

Reconstructing personalized animatable head avatars has significant implications in the fields of AR/VR. Existing methods for achieving explicit face control of 3D Morphable Models (3DMM) typically rely on multi-view images or videos of a single subject, making the reconstruction process complex. Additionally, the traditional rendering pipeline is time-consuming, limiting real-time animation possibilities. In this paper, we introduce CVTHead, a novel approach that generates controllable neural head avatars from a single reference image using point-based neural rendering. CVTHead considers the sparse vertices of the mesh as the point set and employs the proposed Vertex-feature Transformer to learn local feature descriptors for each vertex. This enables the modeling of long-range dependencies among all the vertices. Experimental results on the VoxCeleb dataset demonstrate that CVTHead achieves comparable performance to state-of-the-art graphics-based methods. Moreover, it enables efficient rendering of novel human heads with various expressions, head poses, and camera views. These attributes can be explicitly controlled using the coefficients of 3DMMs, facilitating versatile and realistic animation in real-time scenarios.

IV · Apr 8, 2023
MedGen3D: A Deep Generative Framework for Paired 3D Image and Mask Generation

Kun Han, Yifeng Xiong, Chenyu You et al.

Acquiring and annotating sufficient labeled data is crucial in developing accurate and robust learning-based models, but obtaining such data can be challenging in many medical image segmentation tasks. One promising solution is to synthesize realistic data with ground-truth mask annotations. However, no prior studies have explored generating complete 3D volumetric images with masks. In this paper, we present MedGen3D, a deep generative framework that can generate paired 3D medical images and masks. First, we represent the 3D medical data as 2D sequences and propose the Multi-Condition Diffusion Probabilistic Model (MC-DPM) to generate multi-label mask sequences adhering to anatomical geometry. Then, we use an image sequence generator and semantic diffusion refiner conditioned on the generated mask sequences to produce realistic 3D medical images that align with the generated masks. Our proposed framework guarantees accurate alignment between synthetic images and segmentation maps. Experiments on 3D thoracic CT and brain MRI datasets show that our synthetic data is both diverse and faithful to the original data, and demonstrate the benefits for downstream segmentation tasks. We anticipate that MedGen3D's ability to synthesize paired 3D medical images and masks will prove valuable in training deep learning models for medical imaging tasks.

CV · Aug 31, 2023
Self-Sampling Meta SAM: Enhancing Few-shot Medical Image Segmentation with Meta-Learning

Yiming Zhang, Tianang Leng, Kun Han et al.

While the Segment Anything Model (SAM) excels in semantic segmentation for general-purpose images, its performance significantly deteriorates when applied to medical images, primarily attributable to insufficient representation of medical images in its training dataset. Nonetheless, gathering comprehensive datasets and training models that are universally applicable is particularly challenging due to the long-tail problem common in medical images. To address this gap, here we present a Self-Sampling Meta SAM (SSM-SAM) framework for few-shot medical image segmentation. Our innovation lies in the design of three key modules: 1) an online fast gradient descent optimizer, further optimized by a meta-learner, which ensures swift and robust adaptation to new tasks; 2) a Self-Sampling module designed to provide well-aligned visual prompts for improved attention allocation; and 3) a robust attention-based decoder specifically designed for medical few-shot learning to capture the relationship between different slices. Extensive experiments on a popular abdominal CT dataset and an MRI dataset demonstrate that the proposed method achieves significant improvements over state-of-the-art methods in few-shot segmentation, with average improvements of 10.21% and 1.80% in terms of DSC, respectively. In conclusion, we present a novel approach for rapid online adaptation in interactive image segmentation, adapting to a new organ in just 0.83 minutes. Code is publicly available on GitHub upon acceptance.
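
The DSC figures quoted above refer to the Dice similarity coefficient, the standard overlap metric for segmentation. A minimal sketch of how it is computed on binary masks follows; the function name and the flat-list mask representation are illustrative choices, not taken from the paper.

```python
def dice_coefficient(pred, target):
    """Dice similarity coefficient (DSC) between two binary masks.

    Masks are flat sequences of 0/1 values of equal length.
    DSC = 2 * |A and B| / (|A| + |B|); two empty masks score 1.0
    by the convention chosen here.
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of elements")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    return 1.0 if total == 0 else 2.0 * intersection / total

pred = [1, 1, 0, 0]
target = [1, 0, 1, 0]
score = dice_coefficient(pred, target)  # one shared voxel, two per mask
```

Here the masks share one foreground voxel out of two each, so the score is 2 * 1 / (2 + 2) = 0.5.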

CV · Jul 4, 2023
Hybrid Neural Diffeomorphic Flow for Shape Representation and Generation via Triplane

Kun Han, Shanlin Sun, Xiaohui Xie

Deep Implicit Functions (DIFs) have gained popularity in 3D computer vision due to their compactness and continuous representation capabilities. However, addressing dense correspondences and semantic relationships across DIF-encoded shapes remains a critical challenge, limiting their applications in texture transfer and shape analysis. Moreover, recent endeavors in 3D shape generation using DIFs often neglect correspondence and topology preservation. This paper presents HNDF (Hybrid Neural Diffeomorphic Flow), a method that implicitly learns the underlying representation and decomposes intricate dense correspondences into explicitly axis-aligned triplane features. To avoid suboptimal representations trapped in local minima, we propose hybrid supervision that captures both local and global correspondences. Unlike conventional approaches that directly generate new 3D shapes, we further explore the idea of shape generation with a deformed template shape via diffeomorphic flows, where the deformation is encoded by the generated triplane features. Leveraging a pre-existing 2D diffusion model, we produce high-quality and diverse 3D diffeomorphic flows through generated triplane features, ensuring topological consistency with the template shape. Extensive experiments on medical image organ segmentation datasets evaluate the effectiveness of HNDF in 3D shape representation and generation.

CV · Jun 7, 2022
Medical Image Registration via Neural Fields

Shanlin Sun, Kun Han, Chenyu You et al.

Image registration is an essential step in many medical image analysis tasks. Traditional methods for image registration are primarily optimization-driven, finding the optimal deformations that maximize the similarity between two images. Recent learning-based methods, trained to directly predict transformations between two images, run much faster, but suffer from performance deficiencies due to model generalization and the inefficiency in handling individual image-specific deformations. Here we present a new neural net based image registration framework, called NIR (Neural Image Registration), which is based on optimization but utilizes deep neural nets to model deformations between image pairs. NIR represents the transformation between two images with a continuous function implemented via neural fields, receiving a 3D coordinate as input and outputting the corresponding deformation vector. NIR provides two ways of generating the deformation field: directly outputting a displacement vector field for general deformable registration, or outputting a velocity vector field and integrating it to derive the deformation field for diffeomorphic image registration. The optimal registration is discovered by updating the parameters of the neural field via stochastic gradient descent. We describe several design choices that facilitate model optimization, including coordinate encoding, sinusoidal activation, coordinate sampling, and intensity sampling. Experiments on two 3D MR brain scan datasets demonstrate that NIR yields state-of-the-art performance in terms of both registration accuracy and regularity, while running significantly faster than traditional optimization-based methods.
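
The diffeomorphic mode described above derives a deformation by integrating a velocity field. The sketch below shows that step in isolation using fixed-step Runge-Kutta integration in 2D. The hand-written `rotation_field` merely stands in for a learned neural field, and the integrator choice, step count, and all names are assumptions for illustration rather than the paper's implementation.

```python
import math

def integrate_velocity(velocity, point, total_time=1.0, steps=32):
    """Transport `point` along a stationary velocity field with RK4 steps."""
    x = list(point)
    h = total_time / steps
    for _ in range(steps):
        k1 = velocity(x)
        k2 = velocity([xi + 0.5 * h * ki for xi, ki in zip(x, k1)])
        k3 = velocity([xi + 0.5 * h * ki for xi, ki in zip(x, k2)])
        k4 = velocity([xi + h * ki for xi, ki in zip(x, k3)])
        x = [xi + h / 6.0 * (a + 2 * b + 2 * c + d)
             for xi, a, b, c, d in zip(x, k1, k2, k3, k4)]
    return x

def rotation_field(p):
    # Stand-in for a learned velocity network: rigid rotation about the origin.
    return [-p[1], p[0]]

def reversed_field(p):
    # Negating the velocity yields the inverse transformation.
    return [-c for c in rotation_field(p)]

quarter_turn = math.pi / 2
warped = integrate_velocity(rotation_field, [1.0, 0.0], total_time=quarter_turn)
restored = integrate_velocity(reversed_field, warped, total_time=quarter_turn)
```

Integrating the negated field brings the point back to where it started, which illustrates why a deformation built this way is invertible.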

CV · Aug 29, 2024
GameIR: A Large-Scale Synthesized Ground-Truth Dataset for Image Restoration over Gaming Content

Lebin Zhou, Kun Han, Nam Ling et al.

Image restoration methods like super-resolution and image synthesis have been successfully used in commercial cloud gaming products like NVIDIA's DLSS. However, restoration over gaming content is not well studied by the general public. The discrepancy is mainly caused by the lack of ground-truth gaming training data that match the test cases. Due to the unique characteristics of gaming content, the common approach of generating pseudo training data by degrading the original HR images results in inferior restoration performance. In this work, we develop GameIR, a large-scale high-quality computer-synthesized ground-truth dataset to fill in the blanks, targeting two different applications. The first is super-resolution with deferred rendering, to support the gaming solution of rendering and transferring LR images only and restoring HR images on the client side. We provide 19,200 LR-HR paired ground-truth frames coming from 640 videos rendered at 720p and 1440p for this task. The second is novel view synthesis (NVS), to support the multiview gaming solution of rendering and transferring part of the multiview frames and generating the remaining frames on the client side. This task has 57,600 HR frames from 960 videos of 160 scenes with 6 camera views. In addition to the RGB frames, the GBuffers during the deferred rendering stage are also provided, which can be used to help restoration. Furthermore, we evaluate several SOTA super-resolution algorithms and NeRF-based NVS algorithms over our dataset, which demonstrates the effectiveness of our ground-truth GameIR data in improving restoration performance for gaming content. Also, we test the method of incorporating the GBuffers as additional input information for helping super-resolution and NVS. We release our dataset and models to the general public to facilitate research on restoration methods over gaming content.

CV · Oct 30, 2025
MoME: Mixture of Visual Language Medical Experts for Medical Imaging Segmentation

Arghavan Rezvani, Xiangyi Yan, Anthony T. Wu et al.

In this study, we propose MoME, a Mixture of Visual Language Medical Experts, for Medical Image Segmentation. MoME adapts the successful Mixture of Experts (MoE) paradigm, widely used in Large Language Models (LLMs), for medical vision-language tasks. The architecture enables dynamic expert selection by effectively utilizing multi-scale visual features tailored to the intricacies of medical imagery, enriched with textual embeddings. This work explores a novel integration of vision-language models for this domain. Utilizing an assembly of 10 datasets, encompassing 3,410 CT scans, MoME demonstrates strong performance on a comprehensive medical imaging segmentation benchmark. Our approach explores the integration of foundation models for medical imaging, benefiting from the established efficacy of MoE in boosting model performance by incorporating textual information. Demonstrating competitive precision across multiple datasets, MoME explores a novel architecture for achieving robust results in medical image analysis.

CV · Apr 9
Rotation Equivariant Convolutions in Deformable Registration of Brain MRI

Arghavan Rezvani, Kun Han, Anthony T. Wu et al.

Image registration is a fundamental task that aligns anatomical structures between images. While CNNs perform well, they lack rotation equivariance - a rotated input does not produce a correspondingly rotated output. This hinders performance by failing to exploit the rotational symmetries inherent in anatomical structures, particularly in brain MRI. In this work, we integrate rotation-equivariant convolutions into deformable brain MRI registration networks. We evaluate this approach by replacing standard encoders with equivariant ones in three baseline architectures, testing on multiple public brain MRI datasets. Our experiments demonstrate that equivariant encoders have three key advantages: 1) They achieve higher registration accuracy while reducing network parameters, confirming the benefit of this anatomical inductive bias. 2) They outperform baselines on rotated input pairs, demonstrating robustness to orientation variations common in clinical practice. 3) They show improved performance with less training data, indicating greater sample efficiency. Our results demonstrate that incorporating geometric priors is a critical step toward building more robust, accurate, and efficient registration models.
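
The equivariance property mentioned above can be checked numerically on a toy case. The sketch below compares "rotate then filter" with "filter then rotate" for 90-degree rotations of a small 2D grid. It relies on plain kernel symmetry, which is a much simpler stand-in for the rotation-equivariant convolutions the paper uses; the image values, kernels, and function names are made up for illustration.

```python
def correlate2d(image, kernel):
    """'Valid' 2D cross-correlation of a square image with a square kernel."""
    n, k = len(image), len(kernel)
    out_size = n - k + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(k) for b in range(k))
             for j in range(out_size)]
            for i in range(out_size)]

def rot90(grid):
    """Rotate a square grid by 90 degrees counter-clockwise."""
    n = len(grid)
    return [[grid[j][n - 1 - i] for j in range(n)] for i in range(n)]

image = [[1, 2, 3, 4],
         [5, 6, 7, 9],
         [2, 0, 1, 3],
         [8, 4, 6, 1]]

# Kernel that looks the same after a 90-degree rotation.
symmetric = [[0, 1, 0], [1, 4, 1], [0, 1, 0]]
# Horizontal-gradient kernel: rotating it changes it.
asymmetric = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]

def commutes_with_rotation(kernel):
    """True if filtering the rotated image equals rotating the filtered image."""
    return correlate2d(rot90(image), kernel) == rot90(correlate2d(image, kernel))

symmetric_ok = commutes_with_rotation(symmetric)
asymmetric_ok = commutes_with_rotation(asymmetric)
```

The rotation-symmetric kernel commutes with the rotation while the gradient kernel does not, which is the gap that equivariant layers are designed to close for general learned filters.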

CV · Sep 9, 2025 · Code
XOCT: Enhancing OCT to OCTA Translation via Cross-Dimensional Supervised Multi-Scale Feature Learning

Pooya Khosravi, Kun Han, Anthony T. Wu et al.

Optical Coherence Tomography Angiography (OCTA) and its derived en-face projections provide high-resolution visualization of the retinal and choroidal vasculature, which is critical for the rapid and accurate diagnosis of retinal diseases. However, acquiring high-quality OCTA images is challenging due to motion sensitivity and the high costs associated with software modifications for conventional OCT devices. Moreover, current deep learning methods for OCT-to-OCTA translation often overlook the vascular differences across retinal layers and struggle to reconstruct the intricate, dense vascular details necessary for reliable diagnosis. To overcome these limitations, we propose XOCT, a novel deep learning framework that integrates Cross-Dimensional Supervision (CDS) with a Multi-Scale Feature Fusion (MSFF) network for layer-aware vascular reconstruction. Our CDS module leverages 2D layer-wise en-face projections, generated via segmentation-weighted z-axis averaging, as supervisory signals to compel the network to learn distinct representations for each retinal layer through fine-grained, targeted guidance. Meanwhile, the MSFF module enhances vessel delineation through multi-scale feature extraction combined with a channel reweighting strategy, effectively capturing vascular details at multiple spatial scales. Our experiments on the OCTA-500 dataset demonstrate XOCT's improvements, especially for the en-face projections which are significant for clinical evaluation of retinal pathologies, underscoring its potential to enhance OCTA accessibility, reliability, and diagnostic value for ophthalmic disease detection and monitoring. The code is available at https://github.com/uci-cbcl/XOCT.
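
One plausible reading of "segmentation-weighted z-axis averaging" is a per-pixel weighted mean over depth, with weights given by a layer's segmentation mask. The sketch below implements that reading on nested lists; the indexing convention `volume[z][y][x]`, the zero-weight fallback, and the function name are assumptions, and the authors' exact procedure may differ.

```python
def enface_projection(volume, layer_weights):
    """Weighted average along z for each (y, x) location.

    volume[z][y][x] holds intensities; layer_weights[z][y][x] holds the
    (soft or binary) membership of each voxel in one retinal layer.
    Pixels whose total weight is zero are set to 0.0.
    """
    depth, height, width = len(volume), len(volume[0]), len(volume[0][0])
    projection = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            weighted = sum(volume[z][y][x] * layer_weights[z][y][x]
                           for z in range(depth))
            total = sum(layer_weights[z][y][x] for z in range(depth))
            projection[y][x] = weighted / total if total > 0 else 0.0
    return projection

# Toy volume: depth 3, height 1, width 2.
volume = [[[1.0, 2.0]], [[3.0, 4.0]], [[5.0, 6.0]]]
# The layer covers z = 0..1 at x = 0 and is absent at x = 1.
weights = [[[1.0, 0.0]], [[1.0, 0.0]], [[0.0, 0.0]]]
proj = enface_projection(volume, weights)
```

At x = 0 the projection averages the two voxels inside the layer, (1 + 3) / 2 = 2.0, and at x = 1 it falls back to 0.0 because the layer is absent there.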

CL · Jan 20, 2024 · Code
Orion-14B: Open-source Multilingual Large Language Models

Du Chen, Yi Huang, Xiaopu Li et al.

In this study, we introduce Orion-14B, a collection of multilingual large language models with 14 billion parameters. We utilize a data scheduling approach to train a foundational model on a diverse corpus of 2.5 trillion tokens, sourced from texts in English, Chinese, Japanese, Korean, and other languages. Additionally, we fine-tuned a series of models tailored for conversational applications and other specific use cases. Our evaluation results demonstrate that Orion-14B achieves state-of-the-art performance across a broad spectrum of tasks. We make the Orion-14B model family and its associated code publicly accessible at https://github.com/OrionStarAI/Orion, aiming to inspire future research and practical applications in the field.

CV · May 27, 2023
Diffeomorphic Mesh Deformation via Efficient Optimal Transport for Cortical Surface Reconstruction

Tung Le, Khai Nguyen, Shanlin Sun et al.

Mesh deformation plays a pivotal role in many 3D vision tasks including dynamic simulations, rendering, and reconstruction. However, defining an efficient discrepancy between predicted and target meshes remains an open problem. A prevalent approach in current deep learning is the set-based approach which measures the discrepancy between two surfaces by comparing two randomly sampled point-clouds from the two meshes with Chamfer pseudo-distance. Nevertheless, the set-based approach still has limitations such as lacking a theoretical guarantee for choosing the number of points in sampled point-clouds, and the pseudo-metricity and the quadratic complexity of the Chamfer divergence. To address these issues, we propose a novel metric for learning mesh deformation. The metric is defined by sliced Wasserstein distance on meshes represented as probability measures that generalize the set-based approach. By leveraging probability measure space, we gain flexibility in encoding meshes using diverse forms of probability measures, such as continuous, empirical, and discrete measures via varifold representation. After having encoded probability measures, we can compare meshes by using the sliced Wasserstein distance which is an effective optimal transport distance with linear computational complexity and can provide a fast statistical rate for approximating the surface of meshes. To this end, we employ a neural ordinary differential equation (ODE) to deform the input surface into the target shape by modeling the trajectories of the points on the surface. Our experiments on cortical surface reconstruction demonstrate that our approach surpasses other competing methods in multiple datasets and metrics.
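
To make the sliced Wasserstein idea above concrete, the following sketch estimates it between two equally sized 3D point sets treated as uniform empirical measures. It covers only that simplest case, not the varifold or continuous measures the abstract mentions; the function name, the number of projections, and the fixed seed are illustrative assumptions.

```python
import math
import random

def sliced_wasserstein_sq(points_a, points_b, num_projections=64, seed=0):
    """Monte Carlo estimate of the squared sliced 2-Wasserstein distance.

    Each random direction reduces the problem to 1D, where optimal
    transport between equal-size samples is solved by sorting.
    """
    if len(points_a) != len(points_b):
        raise ValueError("this sketch assumes equally sized point sets")
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_projections):
        # Sample a direction uniformly on the unit sphere.
        d = [rng.gauss(0.0, 1.0) for _ in range(3)]
        norm = math.sqrt(sum(c * c for c in d))
        d = [c / norm for c in d]
        proj_a = sorted(sum(p[i] * d[i] for i in range(3)) for p in points_a)
        proj_b = sorted(sum(p[i] * d[i] for i in range(3)) for p in points_b)
        # Mean squared gap between matched order statistics.
        total += sum((a - b) ** 2 for a, b in zip(proj_a, proj_b)) / len(proj_a)
    return total / num_projections

cloud = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
shifted = [(x + 1.0, y, z) for x, y, z in cloud]
same = sliced_wasserstein_sq(cloud, cloud)
moved = sliced_wasserstein_sq(cloud, shifted)
```

Identical clouds give exactly zero, while a translated copy gives a positive value, and unlike the Chamfer distance no pairwise comparison of all points is needed.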

CV · Feb 25, 2022
Diffeomorphic Image Registration with Neural Velocity Field

Kun Han, Shanlin Sun, Xiangyi Yan et al.

Diffeomorphic image registration, offering smooth transformation and topology preservation, is required in many medical image analysis tasks. Traditional methods impose certain modeling constraints on the space of admissible transformations and use optimization to find the optimal transformation between two images. Specifying the right space of admissible transformations is challenging: the registration quality can be poor if the space is too restrictive, while the optimization can be hard to solve if the space is too general. Recent learning-based methods, utilizing deep neural networks to learn the transformation directly, achieve fast inference, but face challenges in accuracy due to the difficulties in capturing the small local deformations and generalization ability. Here we propose a new optimization-based method named DNVF (Diffeomorphic Image Registration with Neural Velocity Field) which utilizes a deep neural network to model the space of admissible transformations. A multilayer perceptron (MLP) with sinusoidal activation function is used to represent the continuous velocity field and assigns a velocity vector to every point in space, providing the flexibility of modeling complex deformations as well as the convenience of optimization. Moreover, we propose a cascaded image registration framework (Cas-DNVF) by combining the benefits of both optimization- and learning-based methods, where a fully convolutional neural network (FCN) is trained to predict the initial deformation, followed by DNVF for further refinement. Experiments on two large-scale 3D MR brain scan datasets demonstrate that our proposed methods significantly outperform the state-of-the-art registration methods.

CL · Feb 8, 2021
A Hybrid Task-Oriented Dialog System with Domain and Task Adaptive Pretraining

Boliang Zhang, Ying Lyu, Ning Ding et al.

This paper describes our submission for the End-to-end Multi-domain Task Completion Dialog shared task at the 9th Dialog System Technology Challenge (DSTC-9). Participants in the shared task build an end-to-end task completion dialog system which is evaluated by human evaluation and a user simulator based automatic evaluation. Different from traditional pipelined approaches where modules are optimized individually and suffer from cascading failure, we propose an end-to-end dialog system that 1) uses Generative Pretraining 2 (GPT-2) as the backbone to jointly solve Natural Language Understanding, Dialog State Tracking, and Natural Language Generation tasks, 2) adopts Domain and Task Adaptive Pretraining to tailor GPT-2 to the dialog domain before finetuning, 3) utilizes heuristic pre/post-processing rules that greatly simplify the prediction tasks and improve generalizability, and 4) equips a fault tolerance module to correct errors and inappropriate responses. Our proposed method significantly outperforms baselines and ties for first place in the official evaluation. We make our source code publicly available.

IV · Dec 16, 2020
Spatial Context-Aware Self-Attention Model For Multi-Organ Segmentation

Hao Tang, Xingwei Liu, Kun Han et al.

Multi-organ segmentation is one of the most successful applications of deep learning in medical image analysis. Deep convolutional neural nets (CNNs) have shown great promise in achieving clinically applicable image segmentation performance on CT or MRI images. State-of-the-art CNN segmentation models apply either 2D or 3D convolutions on input images, with pros and cons associated with each method: 2D convolution is fast and less memory-intensive but inadequate for extracting 3D contextual information from volumetric images, while the opposite is true for 3D convolution. To fit a 3D CNN model on CT or MRI images on commodity GPUs, one usually has to either downsample input images or use cropped local regions as inputs, which limits the utility of 3D models for multi-organ segmentation. In this work, we propose a new framework for combining 3D and 2D models, in which the segmentation is realized through high-resolution 2D convolutions, but guided by spatial contextual information extracted from a low-resolution 3D model. We implement a self-attention mechanism to control which 3D features should be used to guide 2D segmentation. Our model is light on memory usage but fully equipped to take 3D contextual information into account. Experiments on multiple organ segmentation datasets demonstrate that by taking advantage of both 2D and 3D models, our method consistently outperforms existing 2D and 3D models in organ segmentation accuracy, while being able to directly take raw whole-volume image data as inputs.

CL · Mar 25, 2020
Adversarial Multi-Binary Neural Network for Multi-class Classification

Haiyang Xu, Junwen Chen, Kun Han et al.

Multi-class text classification is one of the key problems in machine learning and natural language processing. Emerging neural networks deal with the problem using a multi-output softmax layer and achieve substantial progress, but they do not explicitly learn the correlation among classes. In this paper, we use a multi-task framework to address multi-class classification, where a multi-class classifier and multiple binary classifiers are trained together. Moreover, we employ adversarial training to distinguish the class-specific features and the class-agnostic features. The model benefits from better feature representation. We conduct experiments on two large-scale multi-class text classification tasks and demonstrate that the proposed architecture outperforms baseline approaches.

CL · Mar 25, 2020
Learning Syntactic and Dynamic Selective Encoding for Document Summarization

Haiyang Xu, Yahao He, Kun Han et al.

Text summarization aims to generate a headline or a short summary consisting of the major information of the source text. Recent studies employ the sequence-to-sequence framework to encode the input with a neural network and generate an abstractive summary. However, most studies feed the encoder with the semantic word embedding but ignore the syntactic information of the text. Further, although previous studies proposed the selective gate to control the information flow from the encoder to the decoder, it is static during the decoding and cannot differentiate the information based on the decoder states. In this paper, we propose a novel neural architecture for document summarization. Our approach has the following contributions: first, we incorporate syntactic information such as constituency parsing trees into the encoding sequence to learn both the semantic and syntactic information from the document, resulting in a more accurate summary; second, we propose a dynamic gate network to select the salient information based on the context of the decoder state, which is essential to document summarization. The proposed model has been evaluated on CNN/Daily Mail summarization datasets and the experimental results show that the proposed approach outperforms baseline approaches.

CLMar 18, 2020
Selective Attention Encoders by Syntactic Graph Convolutional Networks for Document Summarization

Haiyang Xu, Yun Wang, Kun Han et al.

Abstractive text summarization is a challenging task: one needs to design a mechanism that effectively extracts salient information from the source text and then generates a summary. Parsing the source text exposes critical syntactic or semantic structures, which are useful for generating a more accurate summary. However, modeling a parsing tree for text summarization is not trivial due to its non-linear structure, and it is harder still to deal with a document that includes multiple sentences and their parsing trees. In this paper, we propose to use a graph to connect the parsing trees from the sentences in a document and utilize stacked graph convolutional networks (GCNs) to learn the syntactic representation for a document. The selective attention mechanism is used to extract salient information in both semantic and structural aspects and generate an abstractive summary. We evaluate our approach on the CNN/Daily Mail text summarization dataset. The experimental results show that the proposed GCN-based selective attention approach outperforms the baselines and achieves state-of-the-art performance on the dataset.

CLSep 6, 2019
Learning Alignment for Multimodal Emotion Recognition from Speech

Haiyang Xu, Hui Zhang, Kun Han et al.

Speech emotion recognition is a challenging problem because humans convey emotions in subtle and complex ways. For emotion recognition on human speech, one can either extract emotion-related features from audio signals or employ speech recognition techniques to generate text from speech and then apply natural language processing to analyze the sentiment. Further, although emotion recognition benefits from audio-textual multimodal information, it is not trivial to build a system that learns from multiple modalities. One can build models for the two input sources separately and combine them at the decision level, but this method ignores the interaction between speech and text in the temporal domain. In this paper, we propose to use an attention mechanism to learn the alignment between speech frames and text words, aiming to produce more accurate multimodal feature representations. The aligned multimodal features are fed into a sequential model for emotion recognition. We evaluate the approach on the IEMOCAP dataset, and the experimental results show that the proposed approach achieves state-of-the-art performance on the dataset.
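As a rough illustration of attention-based alignment between text words and speech frames, the minimal sketch below lets each word attend over all frames and concatenates the word vector with its attended audio context. The dot-product scoring, the function names `softmax` and `align`, and the toy vectors are assumptions made for illustration; the abstract does not specify the scoring function or the feature dimensions.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def align(words, frames):
    """For each word vector, attend over all speech frames.

    Returns (aligned_features, attention_weights). Dot-product scoring is
    an assumption here, not necessarily the scoring used in the paper.
    """
    dim = len(frames[0])
    aligned, all_weights = [], []
    for w in words:
        scores = [sum(a * b for a, b in zip(w, f)) for f in frames]
        weights = softmax(scores)
        # Attention-weighted sum of frame vectors: the audio context for this word.
        context = [sum(wt * f[d] for wt, f in zip(weights, frames)) for d in range(dim)]
        aligned.append(list(w) + context)  # word features followed by audio context
        all_weights.append(weights)
    return aligned, all_weights

# Toy data (hypothetical): two word vectors and three frame vectors.
words = [[1.0, 0.0], [0.0, 1.0]]
frames = [[5.0, 0.0], [4.0, 0.0], [0.0, 5.0]]
features, weights = align(words, frames)
```

In this toy setup the first word places most of its attention on the first two frames and the second word on the last frame; the resulting per-word features could then be passed to a sequential model.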

CLAug 2, 2019
DELTA: A DEep learning based Language Technology plAtform

Kun Han, Junwen Chen, Hui Zhang et al.

In this paper we present DELTA, a deep learning based language technology platform. DELTA is an end-to-end platform designed to solve industry-level natural language and speech processing problems. It integrates the most popular neural network models for training as well as comprehensive deployment tools for production. DELTA aims to provide easy and fast experiences for using, deploying, and developing natural language processing and speech models for both academic and industry use cases. We demonstrate reliable performance with DELTA on several natural language processing and speech tasks, including text classification, named entity recognition, natural language inference, speech recognition, and speaker verification. DELTA has been used to develop several state-of-the-art algorithms for publications and to deliver production systems that serve millions of users.

IRMay 31, 2018
Collaborative Multi-modal deep learning for the personalized product retrieval in Facebook Marketplace

Lu Zheng, Zhao Tan, Kun Han et al.

Facebook Marketplace is quickly gaining momentum among consumers as a favored customer-to-customer (C2C) product trading platform. The recommendation system behind it helps to significantly improve the user experience. Building the recommendation system for Facebook Marketplace is challenging for two reasons: 1) Scalability: the number of products in Facebook Marketplace is huge. Tens of thousands of products need to be scored and recommended within a couple hundred milliseconds for millions of users every day; 2) Cold start: the life span of C2C products is very short and the user activities on the products are sparse. Thus it is difficult to accumulate enough product-level signals for recommendation, and we face a significant cold-start issue. In this paper, we propose to address both the scalability and the cold-start issues by building a collaborative multi-modal deep learning based retrieval system in which compact embeddings for the users and the products are trained with multi-modal content information. This system shows significant improvement over the benchmark in online and off-line experiments: in the online experiment, it increases the number of messages initiated by the buyer to the seller by 26.95%; in the off-line experiment, it improves the prediction accuracy by 9.58%.
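The embedding-based retrieval idea can be sketched as follows: a product embedding is built from content alone (so a new listing with no interaction history can still be scored), and candidates are ranked by similarity to a user embedding. Everything specific here is an assumption for illustration, including the helper names `product_embedding` and `retrieve`, the element-wise averaging used to fuse modalities, the dot-product score, and the toy catalog; the abstract does not describe these details.

```python
import heapq

def product_embedding(text_vec, image_vec):
    # Assumption: fuse the two content modalities by element-wise averaging.
    # Content-only embeddings need no interaction history (cold start).
    return [(t + i) / 2.0 for t, i in zip(text_vec, image_vec)]

def retrieve(user_vec, catalog, k):
    """Return the ids of the k products with the highest dot-product score."""
    def score(item):
        _, vec = item
        return sum(u * p for u, p in zip(user_vec, vec))
    return [pid for pid, _ in heapq.nlargest(k, catalog.items(), key=score)]

# Hypothetical toy catalog: product id -> content-based embedding.
catalog = {
    "bike": product_embedding([1.0, 0.0], [0.8, 0.2]),
    "sofa": product_embedding([0.0, 1.0], [0.1, 0.9]),
    "helmet": product_embedding([0.9, 0.1], [0.7, 0.1]),
}
top = retrieve([1.0, 0.0], catalog, k=2)
```

A linear scan like this would not meet the latency budget described above at Marketplace scale; compact embeddings are typically paired with an approximate nearest-neighbor index, which this sketch omits.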