Min Tang

CV
h-index: 34
34 papers
1,150 citations
Novelty: 50%
AI Score: 58

34 Papers

CL · May 29
The Flip Side of RLHF: On-Policy Feedback for Reward Model Self-Supervised Improvement

Xiaobo Wang, Tong Wu, Min Tang et al.

Building strong reward models (RMs) for language model alignment is bottlenecked by the cost and difficulty of acquiring diverse and reliable preference data from human annotation or judge models. This bottleneck grows dramatically worse as the policy evolves beyond the static RM training. Therefore, we propose SAVE (Self-supervised reward model improvement via Value-Anchored On-policy feedback), a framework that grades on-policy responses as feedback by using the value function for on-policy RM training. SAVE naturally converts the reward-graded on-policy responses into supervision with a prompt-specific value head as an adaptive anchor. It computes RM advantages and filters ambiguous samples to update the RM via a contrastive objective. The effectiveness of SAVE for enhancing RM training is strongly validated through rigorous empirical evaluation across six diverse benchmarks. It achieves superior results across all datasets while maintaining consistent improvements across three RL algorithms (GRPO, RLOO, GSPO) and different policy backbones.
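
The anchor, advantage, filtering, and pairing steps described in this abstract can be illustrated with a minimal sketch. It assumes one scalar value estimate per prompt and a fixed ambiguity threshold; the function name `build_contrastive_pairs`, the `margin` parameter, and the all-pairs rule are hypothetical and are not taken from the paper.

```python
from itertools import product


def build_contrastive_pairs(responses, rewards, value_anchor, margin=0.1):
    """Turn reward-graded on-policy responses into (chosen, rejected) pairs.

    Hypothetical sketch: `value_anchor` stands in for the prompt-specific
    value head, and `margin` is an assumed ambiguity threshold.
    """
    # Advantage of each response relative to the prompt-specific anchor.
    advantages = [r - value_anchor for r in rewards]

    # Drop ambiguous samples whose advantage is too close to the anchor.
    kept = [(resp, adv) for resp, adv in zip(responses, advantages)
            if abs(adv) >= margin]

    positives = [resp for resp, adv in kept if adv > 0]
    negatives = [resp for resp, adv in kept if adv < 0]

    # Pair every clearly-above-anchor response with every
    # clearly-below-anchor response for a contrastive update.
    return list(product(positives, negatives))


pairs = build_contrastive_pairs(
    responses=["a", "b", "c", "d"],
    rewards=[0.9, 0.52, 0.1, 0.45],
    value_anchor=0.5,
    margin=0.1,
)
print(pairs)
```

Responses "b" and "d" sit within the margin of the anchor and are filtered out, so only the unambiguous pair survives. The contrastive loss itself is not modeled here.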

AS · Aug 14, 2023
SpeechX: Neural Codec Language Model as a Versatile Speech Transformer

Xiaofei Wang, Manthan Thakker, Zhuo Chen et al.

Recent advancements in generative speech models based on audio-text prompts have enabled remarkable innovations like high-quality zero-shot text-to-speech. However, existing models still face limitations in handling diverse audio-text speech generation tasks involving transforming input speech and processing audio captured in adverse acoustic conditions. This paper introduces SpeechX, a versatile speech generation model capable of zero-shot TTS and various speech transformation tasks, dealing with both clean and noisy signals. SpeechX combines neural codec language modeling with multi-task learning using task-dependent prompting, enabling unified and extensible modeling and providing a consistent way for leveraging textual input in speech enhancement and transformation tasks. Experimental results show SpeechX's efficacy in various tasks, including zero-shot TTS, noise suppression, target speaker extraction, speech removal, and speech editing with or without background noise, achieving comparable or superior performance to specialized models across tasks. See https://aka.ms/speechx for demo samples.

GR · Jun 1
MidSurfNet: Learnable Face Pairing and Interference Implicit Fields for Generalized Mid-surface Abstraction

Li Ye, Xinhang Zhou, Xingyu Yang et al.

Mid-surface abstraction is essential for finite element analysis of thin-walled CAD models. Existing face pairing-based methods rely on handcrafted geometric heuristics, yet real-world industrial models frequently exhibit multi-wall-thickness regions, self-matching face configurations, and demand for non-center offset surfaces--scenarios where rule-based approaches consistently fail. We present MidSurfNet, a learning-augmented framework that addresses these limitations through two novel components: (1) a neural face pairing module that learns to predict face pair confidence from geometric and topological features, handling complex pairing scenarios beyond rule-based methods; and (2) an interference implicit field that represents mid-surfaces as the interference of two signed distance functions, enabling generalized offset control for flexible positioning in downstream CAE/FEA-oriented workflows. We construct a large-scale mid-surface dataset containing over 1,500 manually annotated CAD models. Experiments demonstrate that MidSurfNet achieves 87.32% face pairing accuracy and successfully handles multi-wall-thickness (61.90% completion) and self-matching (52.94% completion) scenarios that confound all existing methods. Furthermore, MidSurfNet provides a learning-based approach to generalized mid-surface abstraction with arbitrary offset control for CAE-oriented applications.

CV · Dec 12, 2025 · Code
Using GUI Agent for Electronic Design Automation

Chunyi Li, Longfei Li, Zicheng Zhang et al.

Graphical User Interface (GUI) agents adopt an end-to-end paradigm that maps a screenshot to an action sequence, thereby automating repetitive tasks in virtual environments. However, existing GUI agents are evaluated almost exclusively on commodity software such as Microsoft Word and Excel. Professional Computer-Aided Design (CAD) suites promise an order-of-magnitude higher economic return, yet remain the weakest performance domain for existing agents and are still far from replacing expert Electronic-Design-Automation (EDA) engineers. We therefore present the first systematic study that deploys GUI agents for EDA workflows. Our contributions are: (1) a large-scale dataset named GUI-EDA, including 5 CAD tools and 5 physical domains, comprising 2,000+ high-quality screenshot-answer-action pairs recorded by EDA scientists and engineers during real-world component design; (2) a comprehensive benchmark that evaluates 30+ mainstream GUI agents, demonstrating that EDA tasks constitute a major, unsolved challenge; and (3) an EDA-specialized metric named EDAgent, equipped with a reflection mechanism that achieves reliable performance on industrial CAD software and, for the first time, outperforms Ph.D. students majoring in Electrical Engineering. This work extends GUI agents from generic office automation to specialized, high-value engineering domains and offers a new avenue for advancing EDA productivity. The dataset will be released at: https://github.com/aiben-ch/GUI-EDA.

AP · Feb 2, 2018
Analysis and computation of some tumor growth models with nutrient: from cell density models to free boundary dynamics

Jian-Guo Liu, Min Tang, Li Wang et al.

In this paper, we study the tumor growth equation along with various models for the nutrient component, including the in vitro model and the in vivo model. At the cell density level, the spatial availability of the tumor density $n$ is governed by the Darcy law via the pressure $p(n)=n^\gamma$. For finite $\gamma$, we prove some a priori estimates of the tumor growth model, such as boundedness of the nutrient density, and non-negativity and growth estimate of the tumor density. As $\gamma \rightarrow \infty$, the cell density models formally converge to Hele-Shaw flow models, which determine the free boundary dynamics of the tumor tissue in the incompressible limit. We derive several analytical solutions to the Hele-Shaw flow models, which serve as benchmark solutions to the geometric motion of tumor front propagation. Finally, we apply a conservative and positivity preserving numerical scheme to the cell density models, with numerical results verifying the link between cell density models and the free boundary dynamical models.

CL · Jul 2, 2024 · Code
Efficient Sparse Attention needs Adaptive Token Release

Chaoran Zhang, Lixin Zou, Dan Luo et al.

In recent years, Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide array of text-centric tasks. However, their 'large' scale introduces significant computational and storage challenges, particularly in managing the key-value states of the transformer, which limits their wider applicability. Therefore, we propose to adaptively release resources from caches and rebuild the necessary key-value states. In particular, we accomplish this with a lightweight controller module that approximates an ideal top-$K$ sparse attention. This module retains the tokens with the highest top-$K$ attention weights and simultaneously rebuilds the discarded but necessary tokens, which may become essential for future decoding. Comprehensive experiments in natural language generation and modeling reveal that our method is not only competitive with full attention in terms of performance but also achieves a significant throughput improvement of up to 221.8%. The code for replication is available at https://github.com/WHUIR/ADORE.
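
For orientation, the ideal top-$K$ sparse attention that the controller is said to approximate can be written in a few lines. This is a plain-Python sketch of exact top-$K$ selection for a single query; the function name `topk_sparse_attention` is hypothetical, and neither the learned controller nor the rebuilding of released tokens is modeled.

```python
import math


def topk_sparse_attention(query, keys, values, k):
    """Attend only to the k keys with the largest dot-product scores."""
    scale = math.sqrt(len(query))
    scores = [sum(q * x for q, x in zip(query, key)) / scale for key in keys]

    # Indices of the k highest-scoring cached tokens; ranking by score is
    # equivalent to ranking by softmax weight.
    keep = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]

    # Softmax restricted to the retained tokens (shifted for stability).
    top = max(scores[i] for i in keep)
    weights = {i: math.exp(scores[i] - top) for i in keep}
    total = sum(weights.values())

    dim = len(values[0])
    output = [sum(weights[i] / total * values[i][d] for i in keep)
              for d in range(dim)]
    return output, sorted(keep)


output, kept = topk_sparse_attention(
    query=[1.0, 0.0],
    keys=[[10.0, 0.0], [0.0, 1.0], [9.0, 0.0], [-5.0, 0.0]],
    values=[[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [5.0, 5.0]],
    k=2,
)
print(kept, output)
```

Only tokens 0 and 2 contribute to the output; the key-value states of the other two could be released from the cache for this decoding step.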

AP · May 5, 2016
Macroscopic limits of pathway-based kinetic models for E.coli chemotaxis in large gradient environments

Weiran Sun, Min Tang

It is of great biological interest to understand the molecular origins of chemotactic behavior of E. coli by developing population-level models based on the underlying signaling pathway dynamics. We derive macroscopic models for E. coli chemotaxis that match quantitatively with the agent-based model (SPECS) for all ranges of the spatial gradient, in particular when the chemical gradient is large such that the standard Keller-Segel model is no longer valid. These equations are derived both formally and rigorously as asymptotic limits for pathway-based kinetic equations. We also present numerical results that show good agreement between the macroscopic models and SPECS. Our work provides an answer to the question of how to determine the population-level diffusion coefficient and drift velocity from the molecular mechanisms of chemotaxis, for both shallow-gradient and large-gradient environments.

AP · Mar 10, 2016
Well-balanced and asymptotic preserving schemes for kinetic models

Casimir Emako, Min Tang

In this paper, we propose a general framework for designing numerical schemes that have both well-balanced (WB) and asymptotic preserving (AP) properties, for various kinds of kinetic models. We are interested in two different parameter regimes: 1) when the ratio between the mean free path and the characteristic macroscopic length $\varepsilon$ tends to zero, the density can be described by (advection) diffusion type (linear or nonlinear) macroscopic models; 2) when $\varepsilon = O(1)$, the models behave like hyperbolic equations with source terms and we are interested in their steady states. We apply the framework to three different kinetic models: the neutron transport equation and its diffusion limit, the transport equation for chemotaxis and its Keller-Segel limit, and the grey radiative transfer equation and its nonlinear diffusion limit. Numerical examples are given to demonstrate the properties of the schemes.

MA · May 23
Adaptive Punishment for Cooperation in Mixed-Motive Games

Min Tang, Fanqi Kong, Linyuan Lü et al.

Mixed-motive scenarios are ubiquitous in real-world multi-agent interactions, where self-interested agents often defect for immediate rewards, overlooking the potential of altruistic cooperation to improve long-term gains and collective welfare. Peer punishment can deter defection, but as costly second-order altruism, its persistent imposition may undermine the punisher's interests. Existing approaches often struggle to effectively implement punishment to promote cooperation. To balance the efficacy and cost of punishment, we propose Adaptive Punishment for Cooperation (APC), a distributed method that determines punishment intensity based on both a dynamic punishment probability and the severity of defection. This dynamic probability substantially reduces costly and ineffective punishment while also promoting cooperation. To accurately assess defection and its severity, we use a defection awareness module, whose learning is guided by game reward. Theoretical analysis and empirical results show that APC performs effectively in the iterated public goods game. Empirically, APC also significantly outperforms existing baselines across sequential social dilemmas, learning rational and effective punishment policies that foster cooperation by strategically deterring defection.

MA · Nov 6, 2024 · Code
AdaSociety: An Adaptive Environment with Social Structures for Multi-Agent Decision-Making

Yizhe Huang, Xingbo Wang, Hao Liu et al.

Traditional interactive environments limit agents' intelligence growth with fixed tasks. Recently, single-agent environments address this by generating new tasks based on agent actions, enhancing task diversity. We consider the decision-making problem in multi-agent settings, where tasks are further influenced by social connections, affecting rewards and information access. However, existing multi-agent environments lack a combination of adaptive physical surroundings and social connections, hindering the learning of intelligent behaviors. To address this, we introduce AdaSociety, a customizable multi-agent environment featuring expanding state and action spaces, alongside explicit and alterable social structures. As agents progress, the environment adaptively generates new tasks with social structures for agents to undertake. In AdaSociety, we develop three mini-games showcasing distinct social structures and tasks. Initial results demonstrate that specific social structures can promote both individual and collective benefits, though current reinforcement learning and LLM-based algorithms show limited effectiveness in leveraging social structures to enhance performance. Overall, AdaSociety serves as a valuable research platform for exploring intelligence in diverse physical and social settings. The code is available at https://github.com/bigai-ai/AdaSociety.

LG · May 19
AirfoilGen: A valid-by-construction and performance-aware latent diffusion model for airfoil generation

Zhijie Yang, Min Tang, Qiang Zou

Airfoil shape design is a fundamental task in aerospace engineering, with a direct impact on flight stability and fuel consumption. Deep learning has recently emerged as a promising tool for this task, but existing deep generative approaches remain limited in both geometric validity and physical controllability. They offer little control over the generated shapes, yielding invalid geometries, and they typically do not condition effectively on aerodynamic performance. To address these issues, this paper proposes AirfoilGen, a valid-by-construction and performance-aware latent diffusion model for airfoil generation. It first introduces a novel airfoil representation scheme, the circle sweeping representation, to constrain the generative process so that output shapes respect essential airfoil characteristics. It then enables explicit control over aerodynamic performance (e.g., lift and drag coefficients) by operating in a learned latent space: a transformer model encodes airfoil shapes into vector embeddings, and a conditional diffusion model denoises Gaussian noise into these latent embeddings while incorporating target aerodynamic performance. In addition, this paper presents a new dataset of over 200,000 airfoils, which is substantially larger than the widely used UIUC airfoil dataset (1,650 airfoils) and more suitable for training modern deep generative models. Experiments demonstrate that AirfoilGen enables airfoil generation with far greater geometric validity and aerodynamic performance controllability than previously achievable, with an average performance-conditioning accuracy of 98.41%.

CV · Jan 8, 2018 · Code
End-to-end detection-segmentation network with ROI convolution

Zichen Zhang, Min Tang, Dana Cobzas et al.

We propose an end-to-end neural network that improves the segmentation accuracy of fully convolutional networks by incorporating a localization unit. This network performs object localization first, which is then used as a cue to guide the training of the segmentation network. We test the proposed method on a segmentation task of small objects on a clinical dataset of ultrasound images. We show that by jointly learning for detection and segmentation, the proposed network is able to improve the segmentation accuracy compared to only learning for segmentation. Code is publicly available at https://github.com/vincentzhang/roi-fcn.

AS · Feb 12, 2024
Making Flow-Matching-Based Zero-Shot Text-to-Speech Laugh as You Like

Naoyuki Kanda, Xiaofei Wang, Sefik Emre Eskimez et al.

Laughter is one of the most expressive and natural aspects of human speech, conveying emotions, social cues, and humor. However, most text-to-speech (TTS) systems lack the ability to produce realistic and appropriate laughter sounds, limiting their applications and user experience. While there have been prior works to generate natural laughter, they fell short in terms of controlling the timing and variety of the laughter to be generated. In this work, we propose ELaTE, a zero-shot TTS that can generate natural laughing speech of any speaker based on a short audio prompt with precise control of laughter timing and expression. Specifically, ELaTE works on the audio prompt to mimic the voice characteristic, the text prompt to indicate the contents of the generated speech, and the input to control the laughter expression, which can be either the start and end times of laughter, or the additional audio prompt that contains laughter to be mimicked. We develop our model based on the foundation of conditional flow-matching-based zero-shot TTS, and fine-tune it with frame-level representation from a laughter detector as additional conditioning. With a simple scheme to mix small-scale laughter-conditioned data with large-scale pre-training data, we demonstrate that a pre-trained zero-shot TTS model can be readily fine-tuned to generate natural laughter with precise controllability, without losing any quality of the pre-trained zero-shot TTS model. Through objective and subjective evaluations, we show that ELaTE can generate laughing speech with significantly higher quality and controllability compared to conventional models. See https://aka.ms/elate/ for demo samples.

NA · Apr 23
Fast Algorithm For Solving Time-dependent Multiscale radiative transport Equation

Qinchen Song, Lei Zhang, Min Tang

When solving the time-dependent radiative transport equation (RTE), implicit time discretization is often employed for its robustness and stability. This results in a sequence of steady-state RTEs with identical cross-sections but varying source terms, whose repeated solution is computationally costly. To address this, we first apply the adaptive tailored finite point scheme (TFPS) for spatial discretization. This scheme exploits prior knowledge of the background media's optical properties to adaptively compress the angular domain, constructing a compressed linear system. A key feature is its ability to reconstruct the layer structure after compression, faithfully capturing the variance at the layer. We then use the Recursive Skeleton Method (RSM) to obtain an explicit multilevel decomposition of the inverse discrete operator, which is reused for all steady-state solutions. Numerical experiments show that our framework achieves high accuracy and significant efficiency across diverse scenarios.
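
The structure exploited in this abstract, one fixed operator with a new right-hand side at every time step, can be seen on a toy problem. The sketch below applies backward Euler to a 2x2 linear system du/dt = -A u + q(t) and inverts the implicit operator once. The explicit 2x2 inverse is only a stand-in for a reusable decomposition; it is not the TFPS discretization or the Recursive Skeleton Method, and all names are hypothetical.

```python
def invert_2x2(m):
    """Explicit inverse of a 2x2 matrix (stand-in for a reusable factorization)."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]


def backward_euler(A, source, u0, dt, steps):
    """Solve du/dt = -A u + q(t) implicitly, reusing one precomputed inverse."""
    # The implicit operator (I + dt*A) is the same at every time step.
    op = [[(1.0 if i == j else 0.0) + dt * A[i][j] for j in range(2)]
          for i in range(2)]
    op_inv = invert_2x2(op)  # computed once, reused below

    u = list(u0)
    for n in range(1, steps + 1):
        q = source(n * dt)  # only the right-hand side changes
        rhs = [u[i] + dt * q[i] for i in range(2)]
        u = matvec(op_inv, rhs)
    return u


A = [[2.0, -1.0], [-1.0, 2.0]]
u_final = backward_euler(A, source=lambda t: [1.0, 1.0],
                         u0=[0.0, 0.0], dt=0.5, steps=200)
print(u_final)
```

With the constant source chosen here, the iteration relaxes to the steady state of A u = q, which is u = (1, 1); each step costs one matrix-vector product because the inverse is never recomputed.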

LG · Mar 24, 2025
RLCAD: Reinforcement Learning Training Gym for Revolution Involved CAD Command Sequence Generation

Xiaolong Yin, Xingyu Lu, Jiahang Shen et al.

A CAD command sequence is a typical parametric design paradigm in 3D CAD systems where a model is constructed by overlaying 2D sketches with operations such as extrusion, revolution, and Boolean operations. Although there is growing academic interest in the automatic generation of command sequences, existing methods and datasets only support operations such as 2D sketching, extrusion, and Boolean operations. This limitation makes it challenging to represent more complex geometries. In this paper, we present a reinforcement learning (RL) training environment (gym) built on a CAD geometric engine. Given an input boundary representation (B-Rep) geometry, the policy network in the RL algorithm generates an action. This action, along with previously generated actions, is processed within the gym to produce the corresponding CAD geometry, which is then fed back into the policy network. The rewards, determined by the difference between the generated and target geometries within the gym, are used to update the RL network. Our method supports operations beyond sketches, Boolean, and extrusion, including revolution operations. With this training gym, we achieve state-of-the-art (SOTA) quality in generating command sequences from B-Rep geometries.
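
The interaction loop described in this abstract (action in, replayed geometry and reward out) can be mimicked with a toy environment. In the sketch below the geometry is just a set of grid cells, the commands are "add" and "cut", and the reward is the change in intersection-over-union with the target. The class `ToyCadGym`, the cell representation, and the IoU-based reward are all assumptions made for illustration; the actual gym runs on a CAD geometric engine with B-Rep geometry.

```python
class ToyCadGym:
    """Hypothetical stand-in for a CAD gym: geometry is a set of grid cells."""

    def __init__(self, target_cells):
        self.target = frozenset(target_cells)
        self.actions = []

    def reset(self):
        """Clear the command sequence and return the empty geometry."""
        self.actions = []
        return frozenset()

    def _build(self):
        # Replay the whole command sequence to produce the current geometry.
        cells = set()
        for op, block in self.actions:
            cells = cells | block if op == "add" else cells - block
        return frozenset(cells)

    def _iou(self, cells):
        union = cells | self.target
        return len(cells & self.target) / len(union) if union else 1.0

    def step(self, action):
        """Append one command; return (geometry, reward, done)."""
        before = self._iou(self._build())
        self.actions.append(action)
        geometry = self._build()
        after = self._iou(geometry)
        # Reward is the improvement in overlap with the target geometry.
        return geometry, after - before, after == 1.0


env = ToyCadGym(target_cells={0, 1, 2, 4})
env.reset()
_, reward_add, done_add = env.step(("add", frozenset(range(5))))
_, reward_cut, done_cut = env.step(("cut", frozenset({3})))
print(reward_add, done_add, reward_cut, done_cut)
```

Here the two actions are scripted; in an RL setting a policy network would choose each action from the returned geometry and be updated with the rewards.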

LG · Aug 6, 2025
Generating Feasible and Diverse Synthetic Populations Using Diffusion Models

Min Tang, Peng Lu, Qing Feng

Population synthesis is a critical task that involves generating synthetic yet realistic representations of populations. It is a fundamental problem in agent-based modeling (ABM), which has become the standard to analyze intelligent transportation systems. The synthetic population serves as the primary input for ABM transportation simulation, with traveling agents represented by population members. However, when the number of attributes describing agents becomes large, survey data often cannot densely support the joint distribution of the attributes in the population due to the curse of dimensionality. This sparsity makes it difficult to accurately model and produce the population. Interestingly, deep generative models trained from available sample data can potentially synthesize possible attribute combinations that are present in the actual population but do not exist in the sample data (called sampling zeros). Nevertheless, this comes at the cost of falsely generating infeasible attribute combinations that do not exist in the population (called structural zeros). In this study, a novel diffusion model-based population synthesis method is proposed to estimate the underlying joint distribution of a population. This approach enables the recovery of numerous missing sampling zeros while keeping the generated structural zeros minimal. Our method is compared with other recently proposed approaches such as Variational Autoencoders (VAE) and Generative Adversarial Network (GAN) approaches, which have shown success in high dimensional tabular population synthesis. We assess the performance of the synthesized outputs using a range of metrics, including marginal distribution similarity, feasibility, and diversity. The results demonstrate that our proposed method outperforms previous approaches in achieving a better balance between the feasibility and diversity of the synthesized population.
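
The distinction between sampling zeros and structural zeros is easiest to see with sets. The sketch below assumes the full set of feasible attribute combinations is known, which holds only in controlled evaluations; the function name `classify_generated`, the returned keys, and the simple feasibility ratio are hypothetical and may differ from the metrics used in the paper.

```python
def classify_generated(population, sample, generated):
    """Split generated attribute combinations by feasibility and novelty."""
    population, sample, generated = set(population), set(sample), set(generated)

    # Sampling zeros: feasible combinations the survey happened to miss.
    sampling_zeros = population - sample
    recovered = generated & sampling_zeros

    # Structural zeros: combinations that cannot occur in the population.
    structural = generated - population

    return {
        "recovered_sampling_zeros": recovered,
        "generated_structural_zeros": structural,
        "feasibility": 1 - len(structural) / len(generated),
    }


population = {("adult", "license"), ("adult", "no_license"),
              ("child", "no_license")}
sample = {("adult", "license")}
generated = {("adult", "license"), ("child", "no_license"),
             ("child", "license")}

report = classify_generated(population, sample, generated)
print(report)
```

In this toy case the generator recovers one sampling zero (a child without a license) and produces one structural zero (a child with a license), so two of its three distinct outputs are feasible.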

AI · Aug 1, 2025
CADDesigner: Conceptual Design of CAD Models Based on General-Purpose Agent

Jingzhe Ni, Xiaolong Yin, Xingyu Lu et al.

Computer-Aided Design (CAD) plays a pivotal role in industrial manufacturing but typically requires a high level of expertise from designers. To lower the entry barrier and improve design efficiency, we present an agent for CAD conceptual design powered by large language models (LLMs). The agent accepts both abstract textual descriptions and freehand sketches as input, engaging in interactive dialogue with users to refine and clarify design requirements through comprehensive requirement analysis. Built upon a novel Context-Independent Imperative Paradigm (CIP), the agent generates high-quality CAD modeling code. During the generation process, the agent incorporates iterative visual feedback to improve model quality. Generated design cases are stored in a structured knowledge base, enabling continuous improvement of the agent's code generation capabilities. Experimental results demonstrate that our method achieves state-of-the-art performance in CAD code generation.

IR · Jul 20, 2025
Privacy Risks of LLM-Empowered Recommender Systems: An Inversion Attack Perspective

Yubo Wang, Min Tang, Nuo Shen et al.

The large language model (LLM) powered recommendation paradigm has been proposed to address the limitations of traditional recommender systems, which often struggle to handle cold start users or items with new IDs. Despite its effectiveness, this study uncovers that LLM empowered recommender systems are vulnerable to reconstruction attacks that can expose both system and user privacy. To examine this threat, we present the first systematic study on inversion attacks targeting LLM empowered recommender systems, where adversaries attempt to reconstruct original prompts that contain personal preferences, interaction histories, and demographic attributes by exploiting the output logits of recommendation models. We reproduce the vec2text framework and optimize it using our proposed method called Similarity Guided Refinement, enabling more accurate reconstruction of textual prompts from model generated logits. Extensive experiments across two domains (movies and books) and two representative LLM based recommendation models demonstrate that our method achieves high fidelity reconstructions. Specifically, we can recover nearly 65 percent of the user interacted items and correctly infer age and gender in 87 percent of the cases. The experiments also reveal that privacy leakage is largely insensitive to the victim model's performance but highly dependent on domain consistency and prompt complexity. These findings expose critical privacy vulnerabilities in LLM empowered recommender systems.

LG · May 29, 2025
DeepRTE: Pre-trained Attention-based Neural Network for Radiative Transfer

Yekun Zhu, Min Tang, Zheng Ma

In this paper, we propose a novel neural network approach, termed DeepRTE, to address the steady-state Radiative Transfer Equation (RTE). The RTE is a differential-integral equation that governs the propagation of radiation through a participating medium, with applications spanning diverse domains such as neutron transport, atmospheric radiative transfer, heat transfer, and optical imaging. Our DeepRTE framework demonstrates superior computational efficiency for solving the steady-state RTE, surpassing traditional methods and existing neural network approaches. This efficiency is achieved by embedding physical information through derivation of the RTE and mathematically-informed network architecture. Concurrently, DeepRTE achieves high accuracy with significantly fewer parameters, largely due to its incorporation of mechanisms such as multi-head attention. Furthermore, DeepRTE is a mesh-free neural operator framework with inherent zero-shot capability. This is achieved by incorporating Green's function theory and pre-training with delta-function inflow boundary conditions into both its architecture design and training data construction. The efficacy of the proposed approach is substantiated through comprehensive numerical experiments.

AS · Jun 9, 2024
An Investigation of Noise Robustness for Flow-Matching-Based Zero-Shot TTS

Xiaofei Wang, Sefik Emre Eskimez, Manthan Thakker et al.

Recently, zero-shot text-to-speech (TTS) systems, capable of synthesizing any speaker's voice from a short audio prompt, have made rapid advancements. However, the quality of the generated speech significantly deteriorates when the audio prompt contains noise, and limited research has been conducted to address this issue. In this paper, we explored various strategies to enhance the quality of audio generated from noisy audio prompts within the context of flow-matching-based zero-shot TTS. Our investigation includes comprehensive training strategies: unsupervised pre-training with masked speech denoising, multi-speaker detection and DNSMOS-based data filtering on the pre-training data, and fine-tuning with random noise mixing. The results of our experiments demonstrate significant improvements in intelligibility, speaker similarity, and overall audio quality compared to the approach of applying speech enhancement to the audio prompt.

SD · Jan 16, 2024
NOTSOFAR-1 Challenge: New Datasets, Baseline, and Tasks for Distant Meeting Transcription

Alon Vinnikov, Amir Ivry, Aviv Hurvitz et al.

We introduce the first Natural Office Talkers in Settings of Far-field Audio Recordings ("NOTSOFAR-1") Challenge alongside datasets and a baseline system. The challenge focuses on distant speaker diarization and automatic speech recognition (DASR) in far-field meeting scenarios, with single-channel and known-geometry multi-channel tracks, and serves as a launch platform for two new datasets: First, a benchmarking dataset of 315 meetings, averaging 6 minutes each, capturing a broad spectrum of real-world acoustic conditions and conversational dynamics. It is recorded across 30 conference rooms, featuring 4-8 attendees and a total of 35 unique speakers. Second, a 1000-hour simulated training dataset, synthesized with enhanced authenticity for real-world generalization, incorporating 15,000 real acoustic transfer functions. The tasks focus on single-device DASR, where multi-channel devices always share the same known geometry. This is aligned with common setups in actual conference rooms, and avoids technical complexities associated with multi-device tasks. It also allows for the development of geometry-specific solutions. The NOTSOFAR-1 Challenge aims to advance research in the field of distant conversational speech recognition, providing key resources to unlock the potential of data-driven methods, which we believe are currently constrained by the absence of comprehensive high-quality training and benchmarking datasets.

GR · May 30, 2023
CTSN: Predicting Cloth Deformation for Skeleton-based Characters with a Two-stream Skinning Network

Yudi Li, Min Tang, Yun Yang et al.

We present a novel learning method to predict the cloth deformation for skeleton-based characters with a two-stream network. The characters processed in our approach are not limited to humans, and can be other skeleton-based representations of non-human targets such as fish or pets. We use a novel network architecture which consists of skeleton-based and mesh-based residual networks to learn the coarse and wrinkle features as the overall residual from the template cloth mesh. Our network is used to predict the deformation for loose or tight-fitting clothing or dresses. We ensure that the memory footprint of our network is low, thereby reducing storage and computational requirements. In practice, our prediction for a single cloth mesh for the skeleton-based character takes about 7 milliseconds on an NVIDIA GeForce RTX 3090 GPU. Compared with prior methods, our network can generate fine deformation results with details and wrinkles.

GR · Dec 13, 2021
N-Cloth: Predicting 3D Cloth Deformation with Mesh-Based Networks

Yudi Li, Min Tang, Yun Yang et al.

We present a novel mesh-based learning approach (N-Cloth) for plausible 3D cloth deformation prediction. Our approach is general and can handle cloth or obstacles represented by triangle meshes with arbitrary topologies. We use graph convolution to transform the cloth and object meshes into a latent space to reduce the non-linearity in the mesh space. Our network can predict the target 3D cloth mesh deformation based on the initial state of the cloth mesh template and the target obstacle mesh. Our approach can handle complex cloth meshes with up to 100K triangles and scenes with various objects corresponding to SMPL humans, non-SMPL humans or rigid bodies. In practice, our approach can be used to generate plausible cloth simulation at 30-45 fps on an NVIDIA GeForce RTX 3090 GPU. We highlight its benefits over prior learning-based methods and physically-based cloth simulators.

CV · Dec 4, 2021
Sphere Face Model: A 3D Morphable Model with Hypersphere Manifold Latent Space

Diqiong Jiang, Yiwei Jin, Fanglue Zhang et al.

3D Morphable Models (3DMMs) are generative models for face shape and appearance. However, the shape parameters of traditional 3DMMs satisfy the multivariate Gaussian distribution while the identity embeddings satisfy the hypersphere distribution, and this conflict makes it challenging for face reconstruction models to preserve the faithfulness and the shape consistency simultaneously. To address this issue, we propose the Sphere Face Model (SFM), a novel 3DMM for monocular face reconstruction, which can preserve both shape fidelity and identity consistency. The core of our SFM is the basis matrix, which can be used to reconstruct 3D face shapes; the basis matrix is learned by adopting a two-stage training approach where 3D and 2D training data are used in the first and second stages, respectively. To resolve the distribution mismatch, we design a novel loss to make the shape parameters have a hyperspherical latent space. Extensive experiments show that SFM has high representation ability and high clustering performance in its shape parameter space. Moreover, it produces high-fidelity face shapes, and the shapes are consistent under challenging conditions in monocular face reconstruction.

ASOct 12, 2021
VarArray: Array-Geometry-Agnostic Continuous Speech Separation

Takuya Yoshioka, Xiaofei Wang, Dongmei Wang et al.

Continuous speech separation using a microphone array was shown to be promising in dealing with the speech overlap problem in natural conversation transcription. This paper proposes VarArray, an array-geometry-agnostic speech separation neural network model. The proposed model is applicable to any number of microphones without retraining while leveraging the nonlinear correlation between the input channels. The proposed method adapts different elements that were proposed before separately, including transform-average-concatenate, conformer speech separation, and inter-channel phase differences, and combines them in an efficient and cohesive way. Large-scale evaluation was performed with two real meeting transcription tasks by using a fully developed transcription system requiring no prior knowledge such as reference segmentations, which allowed us to measure the impact that the continuous speech separation system could have in realistic settings. The proposed model outperformed a previous approach to array-geometry-agnostic modeling for all of the geometry configurations considered, achieving asclite-based speaker-agnostic word error rates of 17.5% and 20.4% for the AMI development and evaluation sets, respectively, in the end-to-end setting using no ground-truth segmentations.

CVApr 8, 2021
Reconstructing Recognizable 3D Face Shapes based on 3D Morphable Models

Diqiong Jiang, Yiwei Jin, Fanglue Zhang et al.

Many recent works have reconstructed distinctive 3D face shapes by aggregating shape parameters of the same identity and separating those of different people based on parametric models (e.g., 3D morphable models (3DMMs)). However, despite the high accuracy in the face recognition task using these shape parameters, the visual discrimination of face shapes reconstructed from those parameters is unsatisfactory. The following research question has not been answered in previous works: Do discriminative shape parameters guarantee visual discrimination in the represented 3D face shapes? This paper analyzes the relationship between shape parameters and reconstructed shape geometry and proposes a novel shape identity-aware regularization (SIR) loss for shape parameters, aiming at increasing discriminability in both the shape parameter and shape geometry domains. Moreover, to cope with the lack of training data containing both landmark and identity annotations, we propose a network structure and an associated training strategy to leverage mixed data containing either identity or landmark labels. We compare our method with existing methods in terms of the reconstruction error, visual distinguishability, and face recognition accuracy of the shape parameters. Experimental results show that our method outperforms the state-of-the-art methods.

CVAug 14, 2018
Multispectral Pedestrian Detection via Simultaneous Detection and Segmentation

Chengyang Li, Dan Song, Ruofeng Tong et al.

Multispectral pedestrian detection has attracted increasing attention from the research community due to its crucial competence for many around-the-clock applications (e.g., video surveillance and autonomous driving), especially under insufficient illumination conditions. We create a human baseline over the KAIST dataset and reveal that there is still a large gap between current top detectors and human performance. To narrow this gap, we propose a network fusion architecture, which consists of a multispectral proposal network to generate pedestrian proposals, and a subsequent multispectral classification network to distinguish pedestrian instances from hard negatives. The unified network is learned by jointly optimizing pedestrian detection and semantic segmentation tasks. The final detections are obtained by integrating the outputs from different modalities as well as the two stages. The approach significantly outperforms state-of-the-art methods on the KAIST dataset while remaining fast. Additionally, we contribute a sanitized version of the training annotations for the KAIST dataset, and examine the effects caused by different kinds of annotation errors. Future research on this problem will benefit from the sanitized version, which eliminates the interference of annotation errors.

CVMar 14, 2018
Illumination-aware Faster R-CNN for Robust Multispectral Pedestrian Detection

Chengyang Li, Dan Song, Ruofeng Tong et al.

Multispectral images of color-thermal pairs have been shown to be more effective than a single color channel for pedestrian detection, especially under challenging illumination conditions. However, there is still a lack of studies on how to fuse the two modalities effectively. In this paper, we thoroughly compare six different convolutional network fusion architectures and analyse their adaptations, enabling a vanilla architecture to obtain detection performance comparable to the state-of-the-art results. Further, we discover that pedestrian detection confidences from color or thermal images are correlated with illumination conditions. With this in mind, we propose an Illumination-aware Faster R-CNN (IAF R-CNN). Specifically, an Illumination-aware Network is introduced to give an illumination measure of the input image. Then we adaptively merge color and thermal sub-networks via a gate function defined over the illumination value. The experimental results on the KAIST Multispectral Pedestrian Benchmark validate the effectiveness of the proposed IAF R-CNN.
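
As a rough illustration of gating two modality branches by an illumination value, the sketch below blends a color confidence and a thermal confidence with a sigmoid weight. The abstract does not specify the gate, so the sigmoid form, the `sharpness` parameter, the 0.5 midpoint, and the function names are all assumptions made for this example, not the paper's definition.

```python
import math


def illumination_gate(illum, sharpness=8.0):
    """Hypothetical gate: map an illumination value in [0, 1] to a
    weight in (0, 1) for the color branch (assumed sigmoid form)."""
    return 1.0 / (1.0 + math.exp(-sharpness * (illum - 0.5)))


def fuse_confidences(color_conf, thermal_conf, illum):
    """Blend the two branch confidences; brighter scenes lean on color,
    darker scenes lean on thermal."""
    w = illumination_gate(illum)
    return w * color_conf + (1.0 - w) * thermal_conf


if __name__ == "__main__":
    for illum in (0.1, 0.5, 0.9):
        print(illum, fuse_confidences(0.8, 0.2, illum))
```

With this form the fused score moves smoothly from the thermal confidence in dark scenes to the color confidence in bright ones.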

CVOct 31, 2017
Segmentation-by-Detection: A Cascade Network for Volumetric Medical Image Segmentation

Min Tang, Zichen Zhang, Dana Cobzas et al.

We propose an attention mechanism for 3D medical image segmentation. The method, named segmentation-by-detection, is a cascade of a detection module followed by a segmentation module. The detection module enables a region of interest to come to attention and produces a set of object region candidates which are further used as an attention model. Rather than dealing with the entire volume, the segmentation module distills the information from the potential region. This scheme is an efficient solution for volumetric data as it reduces the influence of the surrounding noise, which is especially important for medical data with a low signal-to-noise ratio. Experimental results on 3D ultrasound data of the femoral head show the superiority of the proposed method when compared with a standard fully convolutional network like the U-Net.

NAAug 28, 2017
An accurate front capturing scheme for tumor growth models with a free boundary limit

Jian-Guo Liu, Min Tang, Li Wang et al.

We consider a class of tumor growth models under the combined effects of density-dependent pressure and cell multiplication, with a free boundary model as its singular limit when the pressure-density relationship becomes highly nonlinear. In particular, the constitutive law connecting pressure $p$ and density $ρ$ is $p(ρ)=\frac{m}{m-1} ρ^{m-1}$, and when $m \gg 1$, the cell density $ρ$ may evolve its support due to a pressure-driven geometric motion with a sharp interface along the boundary of its support. The nonlinearity and degeneracy in the diffusion bring great challenges in numerical simulations, let alone the capturing of the singular free boundary limit. Prior to the present paper, there was a lack of a standard mechanism to numerically capture the front propagation speed as $m\gg 1$. In this paper, we develop a numerical scheme based on a novel prediction-correction reformulation that can accurately approximate the front propagation even when the nonlinearity is extremely strong. We show that the semi-discrete scheme naturally connects to the free boundary limit equation as $m \rightarrow \infty$, and with proper spatial discretization, the fully discrete scheme has improved stability, preserves positivity, and can be implemented without nonlinear solvers. Finally, extensive numerical examples in both one and two dimensions are provided to verify the claimed properties and showcase good performance in various applications.
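
To see why the constitutive law stiffens as $m$ grows, the short sketch below evaluates $p(ρ)=\frac{m}{m-1} ρ^{m-1}$ at a few densities. Only the formula comes from the abstract; the function name and the sample values of $ρ$ and $m$ are chosen here for illustration.

```python
def pressure(rho, m):
    """Constitutive law p(rho) = m / (m - 1) * rho**(m - 1), for m > 1."""
    if m <= 1:
        raise ValueError("the law is only defined here for m > 1")
    return m / (m - 1) * rho ** (m - 1)


if __name__ == "__main__":
    # Illustrative sample points: below, at, and above rho = 1.
    for m in (2, 10, 100):
        print(m, [pressure(rho, m) for rho in (0.9, 1.0, 1.1)])
```

For large $m$ the pressure is nearly zero wherever $ρ < 1$ and grows very rapidly once $ρ > 1$, which is consistent with the sharp-interface behaviour described above.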

CVMay 17, 2017
A deep level set method for image segmentation

Min Tang, Sepehr Valipour, Zichen Vincent Zhang et al.

This paper proposes a novel image segmentation approach that integrates fully convolutional networks (FCNs) with a level set model. Compared with a FCN, the integrated method can incorporate smoothing and prior information to achieve an accurate segmentation. Furthermore, different than using the level set model as a post-processing tool, we integrate it into the training phase to fine-tune the FCN. This allows the use of unlabeled data during training in a semi-supervised setting. Using two types of medical imaging data (liver CT and left ventricle MRI data), we show that the integrated method achieves good performance even when little training data is available, outperforming the FCN or the level set model alone.

NAOct 28, 2016
Uniform convergent scheme for strongly anisotropic diffusion equations with closed field lines

Yihong Wang, Wenjun Ying, Min Tang

In magnetized plasma, the magnetic field confines particles around field lines. The ratio between the intensity of the parallel and perpendicular viscosity or heat conduction may reach the order of $10^{12}$. When the magnetic fields have closed field lines and form a "magnetic island", the convergence order of most known schemes depends on the anisotropy strength. In this paper, by integration of the original differential equation along each closed field line, we introduce a simple but very efficient asymptotic preserving reformulation, which yields uniform convergence with respect to the anisotropy strength. Only slight modification to the original code is required and neither change of coordinates nor mesh adaptation is needed. Numerical examples demonstrating the performance of the new scheme are presented.

NAJul 29, 2016
An Asymptotic Preserving method for strongly anisotropic diffusion equations based on field line integration

Min Tang, Yihong Wang

In magnetized plasma, the magnetic field confines the particles around the field lines. The anisotropy intensity in the viscosity and heat conduction may reach the order of $10^{12}$. When the boundary conditions are periodic or Neumann, the strong diffusion leads to an ill-posed limiting problem. To remove the ill-conditioning in the highly anisotropic diffusion equations, we introduce a simple but very efficient asymptotic preserving reformulation in this paper. The key idea is that, instead of discretizing the Neumann boundary conditions locally, we replace one of the Neumann boundary conditions by the integration of the original problem along the field line. After the integration, the singular $1/ε$ terms are replaced by $O(1)$ terms, which yields a well-posed problem. Only small modifications to the original code are required, and neither a change of coordinates nor mesh adaptation is needed. Uniform convergence with respect to the anisotropy strength $1/ε$ can be observed numerically, and the condition number does not scale with the anisotropy.

GRAug 25, 2015
PolyDepth: Real-time Penetration Depth Computation using Iterative Contact-Space Projection

Changsoo Je, Min Tang, Youngeun Lee et al.

We present a real-time algorithm that finds the Penetration Depth (PD) between general polygonal models based on iterative and local optimization techniques. Given an in-collision configuration of an object in configuration space, we find an initial collision-free configuration using several methods such as centroid difference, maximally clear configuration, motion coherence, random configuration, and sampling-based search. We project this configuration onto a local contact space using a variant of a continuous collision detection algorithm and construct a linear convex cone around the projected configuration. We then formulate a new projection of the in-collision configuration onto the convex cone as a Linear Complementarity Problem (LCP), which we solve using a type of Gauss-Seidel iterative algorithm. We repeat this procedure until a locally optimal PD is obtained. Our algorithm can process complicated models consisting of tens of thousands of triangles at interactive rates.
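
For readers unfamiliar with Gauss-Seidel-type LCP solvers, here is a minimal sketch of the standard projected Gauss-Seidel iteration for the problem: find $z \ge 0$ such that $w = Mz + q \ge 0$ and $z^T w = 0$. The abstract only says "a type of Gauss-Seidel iterative algorithm", so this textbook variant, the function name, and the fixed iteration count are assumptions and may differ from what the authors use. The sketch also assumes $M$ has positive diagonal entries.

```python
def projected_gauss_seidel(M, q, iterations=100):
    """Approximately solve the LCP  w = M z + q,  z >= 0,  w >= 0,  z.w = 0
    by sweeping over the unknowns and clamping each one at zero.
    Assumes M[i][i] > 0 for every i (e.g. M symmetric positive definite)."""
    n = len(q)
    z = [0.0] * n
    for _ in range(iterations):
        for i in range(n):
            # Row residual without the diagonal contribution, using the
            # most recent values of the other unknowns.
            r = q[i] + sum(M[i][j] * z[j] for j in range(n) if j != i)
            # Solve row i for z[i], then project onto z[i] >= 0.
            z[i] = max(0.0, -r / M[i][i])
    return z


if __name__ == "__main__":
    M = [[2.0, 1.0], [1.0, 2.0]]
    q = [-1.0, -1.0]
    print(projected_gauss_seidel(M, q))
```

Each sweep reuses the latest values of the other unknowns, which is what distinguishes it from a Jacobi-style update; the clamp at zero enforces the complementarity constraint.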