Jack Xin

LG
h-index: 19
55 papers
1,128 citations
Novelty: 53%
AI Score: 57

55 Papers

NA Apr 19, 2022
Proximal Implicit ODE Solvers for Accelerating Learning Neural ODEs

Justin Baker, Hedi Xia, Yiwei Wang et al.

Learning neural ODEs often requires solving very stiff ODE systems, primarily using explicit adaptive step size ODE solvers. These solvers are computationally expensive, requiring the use of tiny step sizes for numerical stability and accuracy guarantees. This paper considers learning neural ODEs using implicit ODE solvers of different orders leveraging proximal operators. The proximal implicit solver consists of inner-outer iterations: the inner iterations approximate each implicit update step using a fast optimization algorithm, and the outer iterations solve the ODE system over time. The proximal implicit ODE solver guarantees superiority over explicit solvers in numerical stability and computational efficiency. We validate the advantages of proximal implicit solvers over existing popular neural ODE solvers on various challenging benchmark tasks, including learning continuous-depth graph neural networks and continuous normalizing flows.
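
The inner-outer structure described above can be illustrated on a toy problem. The following is a minimal sketch, not the authors' solver: it assumes a stiff linear gradient flow dx/dt = -grad F(x) with F(x) = 0.5 * sum(lam_i * x_i^2), for which one backward Euler step equals the proximal operator of h*F. The inner loop is plain gradient descent (the paper's "fast optimization algorithm" is not specified here), and all names (`proximal_backward_euler_step`, `lam`, `inner_iters`) are hypothetical.

```python
def grad_prox_objective(z, x_prev, lam, h):
    # Gradient of F(z) + ||z - x_prev||^2 / (2h), with F(z) = 0.5 * sum(lam_i * z_i^2).
    return [l * zi + (zi - xi) / h for l, zi, xi in zip(lam, z, x_prev)]

def proximal_backward_euler_step(x_prev, lam, h, inner_iters=200):
    # Inner iterations: approximate the implicit update by minimizing the
    # proximal objective with fixed-step gradient descent (assumed inner solver).
    step = 1.0 / (max(lam) + 1.0 / h)  # 1 / Lipschitz constant of the gradient
    z = list(x_prev)
    for _ in range(inner_iters):
        g = grad_prox_objective(z, x_prev, lam, h)
        z = [zi - step * gi for zi, gi in zip(z, g)]
    return z

def explicit_euler_step(x_prev, lam, h):
    # Forward Euler on dx/dt = -lam * x, for comparison.
    return [xi - h * l * xi for l, xi in zip(lam, x_prev)]

def integrate(step_fn, x0, lam, h, n_steps):
    # Outer iterations: march the ODE system forward in time.
    x = list(x0)
    for _ in range(n_steps):
        x = step_fn(x, lam, h)
    return x

lam = [1.0, 100.0]   # widely separated decay rates make the system stiff
h = 0.1              # step size too large for forward Euler on the fast mode
x0 = [1.0, 1.0]
x_implicit = integrate(proximal_backward_euler_step, x0, lam, h, 10)
x_explicit = integrate(explicit_euler_step, x0, lam, h, 10)
print("implicit:", x_implicit)
print("explicit:", x_explicit)
```

With this step size the explicit iterate grows without bound in the fast component, while the proximal implicit iterate decays as the true solution does.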

ST Apr 11, 2007
A Dynamic Algorithm for Blind Separation of Convolutive Sound Mixtures

Jie Liu, Jack Xin, Yingyong Qi

We study an efficient dynamic blind source separation algorithm of convolutive sound mixtures based on updating statistical information in the frequency domain, and minimizing the support of time domain demixing filters by a weighted least squares method. The permutation and scaling indeterminacies of separation, and concatenations of signals in adjacent time frames are resolved with optimization of $l^1 \times l^\infty$ norm on cross-correlation coefficients at multiple time lags. The algorithm is a direct method without iterations, and is adaptive to the environment. Computations on recorded and synthetic mixtures of speech and music signals show excellent performance.

NA Nov 26, 2017
Computing effective diffusivity of chaotic and stochastic flows using structure preserving schemes

Zhongjian Wang, Jack Xin, Zhiwen Zhang

In this paper we study the problem of computing the effective diffusivity for a particle moving in chaotic and stochastic flows. In addition we numerically investigate the residual diffusion phenomenon in chaotic advection. The residual diffusion refers to the non-zero effective (homogenized) diffusion in the limit of zero molecular diffusion as a result of chaotic mixing of the streamlines. In this limit traditional numerical methods typically fail since the solutions of the advection-diffusion equation develop sharp gradients. Instead of solving the Fokker-Planck equation in the Eulerian formulation, we compute the motion of particles in the Lagrangian formulation, which is modelled by stochastic differential equations (SDEs). We propose a new numerical integrator based on a stochastic splitting method to solve the corresponding SDEs in which the deterministic subproblem is symplectic-preserving while the random subproblem can be viewed as a perturbation. We provide rigorous error analysis for the new numerical integrator using the backward error analysis technique and show that our method outperforms standard Euler-based integrators. Numerical results are presented to demonstrate the accuracy and efficiency of the proposed method for several typical chaotic and stochastic flow problems of physical interest.

COMP-PH Aug 31, 2022
A DeepParticle method for learning and generating aggregation patterns in multi-dimensional Keller-Segel chemotaxis systems

Zhongjian Wang, Jack Xin, Zhiwen Zhang

We study a regularized interacting particle method for computing aggregation patterns and near singular solutions of a Keller-Segel (KS) chemotaxis system in two and three space dimensions, and then further develop the DeepParticle (DP) method to learn and generate solutions under variations of physical parameters. The KS solutions are approximated as empirical measures of particles which self-adapt to the high gradient part of solutions. We utilize the expressiveness of deep neural networks (DNNs) to represent the transform of samples from a given initial (source) distribution to a target distribution at finite time T prior to blowup without assuming invertibility of the transforms. In the training stage, we update the network weights by minimizing a discrete 2-Wasserstein distance between the input and target empirical measures. To reduce computational cost, we develop an iterative divide-and-conquer algorithm to find the optimal transition matrix in the Wasserstein distance. We present numerical results of the DP framework for successful learning and generation of KS dynamics in the presence of laminar and chaotic flows. The physical parameter in this work is either the small diffusivity of chemo-attractant or the reciprocal of the flow amplitude in the advection-dominated regime.

NA Feb 28, 2012
A Numerical Study of Turbulent Flame Speeds of Curvature and Strain G-equations in Cellular Flows

Yu-Yu Liu, Jack Xin, Yifeng Yu

We study front speeds of curvature and strain G-equations arising in turbulent combustion. These G-equations are Hamilton-Jacobi type level set partial differential equations (PDEs) with non-coercive Hamiltonians and degenerate nonlinear second order diffusion. The Hamiltonian of strain G-equation is also non-convex. Numerical computation is performed based on monotone discretization and weighted essentially nonoscillatory (WENO) approximation of transformed G-equations on a fixed periodic domain. The advection field in the computation is a two dimensional Hamiltonian flow consisting of a periodic array of counter-rotating vortices, or cellular flows. Depending on whether the evolution is predominantly in the hyperbolic or parabolic regimes, suitable explicit and semi-implicit time stepping methods are chosen. The turbulent flame speeds are computed as the linear growth rates of large time solutions. A new nonlinear parabolic PDE is proposed for the reinitialization of level set functions to prevent piling up of multiple bundles of level sets on the periodic domain. We found that the turbulent flame speed $s_T$ of the curvature G-equation is enhanced as the intensity $A$ of cellular flows increases, at a rate between those of the inviscid and viscous G-equations. The $s_T$ of the strain G-equation increases in small $A$, decreases in larger $A$, then drops down to zero at a large enough but finite value $A_{*}$. The flame front ceases to propagate at this critical intensity $A_*$, and is quenched by the cellular flow.

IV Jul 1, 2023
Weighted Anisotropic-Isotropic Total Variation for Poisson Denoising

Kevin Bui, Yifei Lou, Fredrick Park et al.

Poisson noise commonly occurs in images captured by photon-limited imaging systems such as in astronomy and medicine. As the distribution of Poisson noise depends on the pixel intensity value, noise levels vary from pixel to pixel. Hence, denoising a Poisson-corrupted image while preserving important details can be challenging. In this paper, we propose a Poisson denoising model by incorporating the weighted anisotropic-isotropic total variation (AITV) as a regularization. We then develop an alternating direction method of multipliers combined with a proximal operator for an efficient implementation. Lastly, numerical experiments demonstrate that our algorithm outperforms other Poisson denoising methods in terms of image quality and computational efficiency.
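
For orientation, the weighted anisotropic-isotropic total variation named above is commonly written as the difference of two standard TV terms. The notation below ($D_x$, $D_y$ for discrete horizontal and vertical differences, weight $\alpha$) is assumed for illustration and may differ from the paper's:

```latex
\[
\mathrm{AITV}_\alpha(u)
  = \|\nabla u\|_1 - \alpha \|\nabla u\|_{2,1}
  = \sum_{i,j} \Big( |(D_x u)_{ij}| + |(D_y u)_{ij}| \Big)
    - \alpha \sum_{i,j} \sqrt{(D_x u)_{ij}^2 + (D_y u)_{ij}^2},
  \qquad \alpha \in [0,1].
\]
```

The first sum is the anisotropic TV and the second is the isotropic TV, so the subtraction makes the regularizer nonconvex for $\alpha > 0$.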

CV Jul 2, 2023
A Proximal Algorithm for Network Slimming

Kevin Bui, Fanghui Xue, Fredrick Park et al.

As a popular channel pruning method for convolutional neural networks (CNNs), network slimming (NS) has a three-stage process: (1) it trains a CNN with $\ell_1$ regularization applied to the scaling factors of the batch normalization layers; (2) it removes channels whose scaling factors are below a chosen threshold; and (3) it retrains the pruned model to recover the original accuracy. This time-consuming, three-step process is a result of using subgradient descent to train CNNs. Because subgradient descent does not exactly train CNNs towards sparse, accurate structures, the latter two steps are necessary. Moreover, subgradient descent does not have any convergence guarantee. Therefore, we develop an alternative algorithm called proximal NS. Our proposed algorithm trains CNNs towards sparse, accurate structures, so identifying a scaling factor threshold is unnecessary and fine tuning the pruned CNNs is optional. Using Kurdyka-Łojasiewicz assumptions, we establish global convergence of proximal NS. Lastly, we validate the efficacy of the proposed algorithm on VGGNet, DenseNet and ResNet on CIFAR 10/100. Our experiments demonstrate that after one round of training, proximal NS yields a CNN with competitive accuracy and compression.
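
To see why a proximal update can produce exact zeros where a subgradient update does not, consider the following minimal sketch. It is not the proximal NS algorithm itself: it only contrasts one subgradient step with one proximal gradient step (soft-thresholding, the proximal operator of the $\ell_1$ penalty) on a hypothetical list of batch-normalization scaling factors. The names `gammas`, `grads`, `lr` and `l1_weight` are assumptions.

```python
def sign(x):
    return (x > 0) - (x < 0)

def soft_threshold(x, tau):
    # Proximal operator of tau * |x|: shrink toward zero, and clamp to exactly
    # zero when |x| <= tau.
    if x > tau:
        return x - tau
    if x < -tau:
        return x + tau
    return 0.0

def subgradient_step(gammas, grads, lr, l1_weight):
    # Subgradient descent on loss + l1_weight * |gamma|: small factors
    # overshoot zero instead of landing on it.
    return [g - lr * (dg + l1_weight * sign(g)) for g, dg in zip(gammas, grads)]

def proximal_step(gammas, grads, lr, l1_weight):
    # Gradient step on the smooth loss, then the l1 proximal operator.
    return [soft_threshold(g - lr * dg, lr * l1_weight) for g, dg in zip(gammas, grads)]

gammas = [0.5, 0.01, -0.3]   # hypothetical scaling factors
grads = [0.0, 0.0, 0.0]      # loss gradients set to zero to isolate the penalty
sub = subgradient_step(gammas, grads, lr=0.1, l1_weight=1.0)
prox = proximal_step(gammas, grads, lr=0.1, l1_weight=1.0)
print("subgradient:", sub)
print("proximal:   ", prox)
```

In this example the small factor 0.01 becomes exactly zero under the proximal step but flips sign under the subgradient step, which is the behaviour that makes a separate thresholding stage necessary in the latter case.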

LG Feb 10, 2023
Feature Affinity Assisted Knowledge Distillation and Quantization of Deep Neural Networks on Label-Free Data

Zhijian Li, Biao Yang, Penghang Yin et al.

In this paper, we propose a feature affinity (FA) assisted knowledge distillation (KD) method to improve quantization-aware training of deep neural networks (DNNs). The FA loss on intermediate feature maps of DNNs plays the role of teaching middle steps of a solution to a student instead of only giving final answers in the conventional KD where the loss acts on the network logits at the output level. Combining logit loss and FA loss, we found that the quantized student network receives stronger supervision than from the labeled ground-truth data. The resulting FAQD is capable of compressing models on label-free data, which brings immediate practical benefits as pre-trained teacher models are readily available and unlabeled data are abundant. In contrast, data labeling is often laborious and expensive. Finally, we propose a fast feature affinity (FFA) loss that accurately approximates FA loss with a lower order of computational complexity, which helps speed up training for high resolution image input.

CV Jan 6, 2023
Difference of Anisotropic and Isotropic TV for Segmentation under Blur and Poisson Noise

Kevin Bui, Yifei Lou, Fredrick Park et al.

In this paper, we aim to segment an image degraded by blur and Poisson noise. We adopt a smoothing-and-thresholding (SaT) segmentation framework that finds a piecewise-smooth solution, followed by $k$-means clustering to segment the image. Specifically for the image smoothing step, we replace the least-squares fidelity for Gaussian noise in the Mumford-Shah model with a maximum a posteriori (MAP) term to deal with Poisson noise and we incorporate the weighted difference of anisotropic and isotropic total variation (AITV) as a regularization to promote the sparsity of image gradients. For such a nonconvex model, we develop a specific splitting scheme and utilize a proximal operator to apply the alternating direction method of multipliers (ADMM). Convergence analysis is provided to validate the efficacy of the ADMM scheme. Numerical experiments on various segmentation scenarios (grayscale/color and multiphase) showcase that our proposed method outperforms a number of segmentation methods, including the original SaT.

CV Apr 16, 2022
Searching Intrinsic Dimensions of Vision Transformers

Fanghui Xue, Biao Yang, Yingyong Qi et al.

It has been shown by many researchers that transformers perform as well as convolutional neural networks in many computer vision tasks. Meanwhile, the large computational cost of their attention modules hinders further studies and applications on edge devices. Some pruning methods have been developed to construct efficient vision transformers, but most of them have considered image classification tasks only. Inspired by these results, we propose SiDT, a method for pruning vision transformer backbones on more complicated vision tasks like object detection, based on the search of transformer dimensions. Experiments on CIFAR-100 and COCO datasets show that the backbones with 20\% or 40\% dimensions/parameters pruned can have similar or even better performance than the unpruned models. Moreover, we have also provided the complexity analysis and comparisons with the previous pruning methods.

PS Oct 5, 2012
Turbulent Flame Speeds of G-equation Models in Unsteady Cellular Flows

Yu-Yu Liu, Jack Xin, Yifeng Yu

We perform a computational study of front speeds of G-equation models in time dependent cellular flows. The G-equations arise in premixed turbulent combustion, and are Hamilton-Jacobi type level set partial differential equations (PDEs). The curvature-strain G-equations are also non-convex with degenerate diffusion. The computation is based on monotone finite difference discretization and weighted essentially nonoscillatory (WENO) methods. We found that the large time front speeds lock into the frequency of time periodic cellular flows in curvature-strain G-equations similar to what occurs in the basic inviscid G-equation. However, such a frequency locking phenomenon disappears in the viscous G-equation, and in the inviscid G-equation if time periodic oscillation of the cellular flow is replaced by time stochastic oscillation.

LG Apr 9, 2022
Channel Pruning In Quantization-aware Training: An Adaptive Projection-gradient Descent-shrinkage-splitting Method

Zhijian Li, Jack Xin

We propose an adaptive projection-gradient descent-shrinkage-splitting method (APGDSSM) to integrate penalty based channel pruning into quantization-aware training (QAT). APGDSSM concurrently searches weights in both the quantized subspace and the sparse subspace. APGDSSM uses a shrinkage operator and a splitting technique to create sparse weights, as well as the Group Lasso penalty to push the weight sparsity into channel sparsity. In addition, we propose a novel complementary transformed $\ell_1$ penalty to stabilize the training for extreme compression.

NA Jan 6, 2015
Computational Modeling of Spectral Data Fitting with Nonlinear Distortions

Yuanchang Sun, Wensong Wu, Jack Xin

Substances such as chemical compounds are invisible to the human eye; they are usually detected by sensing equipment through their spectral fingerprints. Though spectra of pure chemicals can be identified by visual inspection, the spectra of their mixtures take a variety of complicated forms. Given the knowledge of spectral references of the constituent chemicals, the task of data fitting is to retrieve their weights, and this usually can be obtained by solving a least squares problem. Complications occur if the basis functions (reference spectra) cannot be used directly to best fit the data. In fact, random distortions (spectral variability) such as shifting, compression, and expansion have been observed in some source spectra when the underlying substances are mixed. In this paper, we formulate mathematical models for such nonlinear effects and build them into data fitting algorithms. If minimal knowledge of the distortions is available, a deterministic approach termed augmented least squares is developed, which fits the spectral references along with their derivatives to the mixtures. If the distribution of the distortions is known a priori, we solve the problem with maximum likelihood estimators which incorporate the shifts into the variance matrix. The proposed methods are substantiated with numerical examples including data from Raman spectroscopy (RS), nuclear magnetic resonance (NMR), and differential optical absorption spectroscopy (DOAS), and show satisfactory results.

CV Jul 16, 2024
AFIDAF: Alternating Fourier and Image Domain Adaptive Filters as an Efficient Alternative to Attention in ViTs

Yunling Zheng, Zeyi Xu, Fanghui Xue et al.

We propose and demonstrate an alternating Fourier and image domain filtering approach for feature extraction as an efficient alternative to build a vision backbone without using the computationally intensive attention. The performance among the lightweight models reaches the state-of-the-art level on ImageNet-1K classification, and improves downstream tasks on object detection and segmentation consistently as well. Our approach also serves as a new tool to compress vision transformers (ViTs).

41.7 LG May 20
On the Regularity and Generalization of One-Step Wasserstein-guided Generative Models for PDE-Induced Measures

Likun Lin, Zhongjian Wang, Jack Xin et al.

Despite the remarkable empirical success of generative models, the available theory on their statistical accuracy in scientific computing remains largely pessimistic. This paper develops a theoretical framework for understanding the regularity of transport maps and the generalization properties of one-step Wasserstein-guided generative models for PDE-induced probability measures. We consider normalized target densities associated with linear elliptic and parabolic equations on bounded domains, as well as diffusion and Fokker--Planck equations on the torus. Under standard structural assumptions, we prove that these target measures satisfy doubling conditions. By combining this fact with regularity theory for optimal transport between doubling measures, we show that the optimal transport map from a uniform source measure to the target measure is Hölder continuous. This regularity yields an approximation-theoretic justification for one-step generative models that learn PDE-induced distributions via a single pushforward map. As a representative instance, we study DeepParticle and derive excess-risk bounds characterizing the discrepancy between the learned map and the population-optimal map. We also establish a robustness estimate under target shift and illustrate the theory with experiments which support the derived rates.

94.0 NA May 19
A Novel Stochastic Particle-Field Algorithm for a Reaction-Diffusion-Advection Cancer Invasion Model

Jingyuan Hu, Zhongjian Wang, Jack Xin et al.

In this paper, we present a novel numerical framework for solving a specific biological reaction-diffusion-advection system of cancer growth in three dimensions (3D) using particles of variable mass. We adopt empirical particle measures to represent cell density and dynamically construct the concentration fields of multiple related chemical species throughout the 3D domain. Efficient interaction between the particles and the spatial grid is achieved through a Particle-in-Cell (PIC) algorithm, while diffusion in space is solved rapidly using a spectral method. We demonstrate that for this particular system, the rate of change of particle mass remains bounded over finite time intervals. Furthermore, in addition to the inherent positivity preservation of cell density guaranteed by the empirical particle measures, the concentrations constructed by the algorithm are also unconditionally positivity-preserving on the spatial grid. Moreover, we present a rigorous error analysis for the proposed method, and numerical experiments confirm the theoretical convergence rates. To the best of our knowledge, this is the first numerical work to solve this system in three dimensions, wherein a rapid spread of cells driven by haptotactic flux is observed, similar to the behavior documented in the two-dimensional case.

LG Jul 2, 2023
Fourier-Mixed Window Attention: Accelerating Informer for Long Sequence Time-Series Forecasting

Nhat Thanh Tran, Jack Xin

We study a fast local-global window-based attention method to accelerate Informer for long sequence time-series forecasting. While window attention, being local, offers considerable computational savings, it lacks the ability to capture global token information, which is compensated for by a subsequent Fourier transform block. Our method, named FWin, does not rely on the query sparsity hypothesis and the empirical approximation underlying the ProbSparse attention of Informer. Through experiments on univariate and multivariate datasets, we show that FWin transformers improve the overall prediction accuracies of Informer while accelerating its inference speeds by 1.6 to 2 times. We also provide a mathematical definition of FWin attention, and prove that it is equivalent to the canonical full attention under the block diagonal invertibility (BDI) condition of the attention matrix. The BDI is shown experimentally to hold with high probability for typical benchmark datasets.

6.1 NA Apr 16
An Efficient Particle-Field Algorithm with Neural Interpolation based on a Parabolic-Hyperbolic Chemotaxis System in 3D

Jongwon David Kim, Jack Xin

Tumor angiogenesis involves a collection of tumor cells moving towards blood vessels for nutrients to grow. Angiogenesis, and chemotaxis systems in general, have been modeled using partial differential equations (PDEs) and as such require numerical methods to approximate their solutions in 3 space dimensions (3D). This is an expensive computation when solutions develop large gradients at unknown locations, and so efficient algorithms to capture the main dynamical behavior are valuable. Here as a case study, we consider a parabolic-hyperbolic Keller-Segel (PHKS) system in the angiogenesis literature, and develop a mesh-free particle-based neural network algorithm that scales better to 3D than traditional mesh based solvers. From a regularized approximation of PHKS, we derive a neural stochastic interacting particle-field (NSIPF) algorithm where the bacterial density is represented as empirical measures of particles and the field variable (concentration of chemo-attractant) by a convolutional neural network (CNN) trained on low cost synthetic data. As a new model, NSIPF preserves total mass and nonnegativity of the density, and captures the dynamics of 3D multi-bump solutions at much faster speeds compared with classical finite difference (FD) and SIPF methods.

7.0 CV May 11
USEMA: a Scalable Efficient Mamba Like Attention for Medical Image Segmentation

Elisha Dayag, Nhat Thanh Tran, Jack Xin

Accurate medical image segmentation is an integral part of the medical image analysis pipeline that requires the ability to merge local and global information. While vision transformers are able to capture global interactions using vanilla self-attention, their quadratic computational complexity in the input size remains a struggle for medical image segmentation tasks. Motivated by the dispersion property of vanilla self-attention and the recent development of the Mamba form of attention, Scalable and Efficient Mamba like Attention (SEMA) utilizes token localization via local window attention to avoid dispersion and maintain focusing, complemented by theoretically consistent arithmetic averaging to capture the global aspect of attention. In this work, we present USEMA, a hybrid UNet architecture that merges the local feature extraction ability of convolutional neural networks (CNNs) with SEMA attention. We conduct experiments with USEMA across a variety of modalities and image sizes, demonstrating improved computational efficiency compared to transformer based models using full self-attention, and superior segmentation performance relative to purely convolution and Mamba-based models.

LG Mar 1
Evaluating AI Grading on Real-World Handwritten College Mathematics: A Large-Scale Study Toward a Benchmark

Zhiqi Yu, Xingping Liu, Haobin Mao et al.

Grading in large undergraduate STEM courses often yields minimal feedback due to heavy instructional workloads. We present a large-scale empirical study of AI grading on real, handwritten single-variable calculus work from UC Irvine. Using OCR-conditioned large language models with structured, rubric-guided prompting, our system produces scores and formative feedback for thousands of free-response quiz submissions from nearly 800 students. In a setting with no single ground-truth label, we evaluate performance against official teaching-assistant grades, student surveys, and independent human review, finding strong alignment with TA scoring and a large majority of AI-generated feedback rated as correct or acceptable across quizzes. Beyond calculus, this setting highlights core challenges in OCR-conditioned mathematical reasoning and partial-credit assessment. We analyze key failure modes, propose practical rubric- and prompt-design principles, and introduce a multi-perspective evaluation protocol for reliable, real-course deployment. Building on the dataset and evaluation framework developed here, we outline a standardized benchmark for AI grading of handwritten mathematics to support reproducible comparison and future research.

LG Mar 11, 2024
COMQ: A Backpropagation-Free Algorithm for Post-Training Quantization

Aozhong Zhang, Zi Yang, Naigang Wang et al.

Post-training quantization (PTQ) has emerged as a practical approach to compress large neural networks, making them highly efficient for deployment. However, effectively reducing these models to their low-bit counterparts without compromising the original accuracy remains a key challenge. In this paper, we propose an innovative PTQ algorithm termed COMQ, which sequentially conducts coordinate-wise minimization of the layer-wise reconstruction errors. We consider the widely used integer quantization, where every quantized weight can be decomposed into a shared floating-point scalar and an integer bit-code. Within a fixed layer, COMQ treats all the scaling factor(s) and bit-codes as the variables of the reconstruction error. Every iteration improves this error along a single coordinate while keeping all other variables constant. COMQ is easy to use and requires no hyper-parameter tuning. It instead involves only dot products and rounding operations. We update these variables in a carefully designed greedy order, significantly enhancing the accuracy. COMQ achieves remarkable results in quantizing 4-bit Vision Transformers, with a negligible loss of less than 1% in Top-1 accuracy. In 4-bit INT quantization of convolutional neural networks, COMQ maintains near-lossless accuracy with a minimal drop of merely 0.3% in Top-1 accuracy.
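
The coordinate-wise idea can be sketched in a few lines. This is a simplified illustration under stated assumptions, not COMQ itself: the scale `delta` is held fixed rather than optimized, coordinates are visited in natural order rather than the paper's greedy order, and the names `cols` (columns of a hypothetical input matrix X), `w`, `q`, `qmin` and `qmax` are invented for the example. Each update needs only dot products and a rounding, which matches the description above.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def reconstruction_error(cols, w, delta, q):
    # ||X w - X (delta * q)||^2, where cols[j] is the j-th column of X.
    n = len(cols[0])
    resid = [sum(cols[j][i] * (w[j] - delta * q[j]) for j in range(len(cols)))
             for i in range(n)]
    return dot(resid, resid)

def coordinate_sweep(cols, w, delta, q, qmin, qmax):
    # One pass of exact coordinate-wise minimization over the integer codes.
    q = list(q)
    n = len(cols[0])
    for j, xj in enumerate(cols):
        # Residual with coordinate j's quantized contribution left out.
        r = [sum(cols[k][i] * (w[k] - (delta * q[k] if k != j else 0.0))
                 for k in range(len(cols))) for i in range(n)]
        # The 1-D quadratic in q[j] is minimized over the integers by rounding
        # its real minimizer and clipping to the code range.
        target = dot(xj, r) / (delta * dot(xj, xj))
        q[j] = max(qmin, min(qmax, round(target)))
    return q

cols = [[1.0, 0.0, 1.0], [1.0, 1.0, 0.0]]   # hypothetical layer inputs (columns of X)
w = [0.26, -0.24]                           # floating-point weights
delta = 0.1                                 # fixed scale (assumption)
q_init = [round(wi / delta) for wi in w]    # round-to-nearest baseline
err_before = reconstruction_error(cols, w, delta, q_init)
q_new = coordinate_sweep(cols, w, delta, q_init, qmin=-8, qmax=7)
err_after = reconstruction_error(cols, w, delta, q_new)
print(q_init, err_before)
print(q_new, err_after)
```

Because each coordinate update is an exact minimization, the reconstruction error can never increase during a sweep.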

LG Mar 10, 2024
FWin transformer for dengue prediction under climate and ocean influence

Nhat Thanh Tran, Jack Xin, Guofa Zhou

Dengue fever is one of the most deadly mosquito-borne tropical infectious diseases. A detailed long-range forecast model is vital in controlling the spread of the disease and guiding mitigation efforts. In this study, we examine methods used to forecast dengue cases for long-range predictions. The dataset consists of local climate/weather in addition to global climate indicators of Singapore from 2000 to 2019. We utilize newly developed deep neural networks to learn the intricate relationship between the features. The baseline models in this study are in the class of recent transformers for long sequence forecasting tasks. We found that a Fourier mixed window attention (FWin) based transformer performed the best in terms of both the mean square error and the maximum absolute error on the long-range dengue forecast up to 60 weeks.

CV Jan 19
Deep Image Prior with L0 Gradient Regularizer for Image Smoothing

Nhat Thanh Tran, Kevin Bui, Jack Xin

Image smoothing is a fundamental image processing operation that preserves the underlying structure, such as strong edges and contours, and removes minor details and textures in an image. Many image smoothing algorithms rely on computing local window statistics or solving an optimization problem. Recent state-of-the-art methods leverage deep learning, but they require a carefully curated training dataset. Because constructing a proper training dataset for image smoothing is challenging, we propose DIP-$\ell_0$, a deep image prior framework that incorporates the $\ell_0$ gradient regularizer. This framework can perform high-quality image smoothing without any training data. To properly minimize the associated loss function that has the nonconvex, nonsmooth $\ell_0$ "norm", we develop an alternating direction method of multipliers algorithm that utilizes an off-the-shelf $\ell_0$ gradient minimization solver. Numerical experiments demonstrate that the proposed DIP-$\ell_0$ outperforms many image smoothing algorithms in edge-preserving image smoothing and JPEG artifact removal.

LG Oct 3, 2025
CrossLag: Predicting Major Dengue Outbreaks with a Domain Knowledge Informed Transformer

Ashwin Prabu, Nhat Thanh Tran, Guofa Zhou et al.

A variety of models have been developed to forecast dengue cases to date. However, it remains a challenge to predict major dengue outbreaks that need timely public warnings the most. In this paper, we introduce CrossLag, an environmentally informed attention that allows for the incorporation of lagging endogenous signals behind the significant events in the exogenous data into the architecture of the transformer at low parameter counts. Outbreaks typically lag behind major changes in climate and oceanic anomalies. We use TimeXer, a recent general-purpose transformer distinguishing exogenous-endogenous inputs, as the baseline for this study. Our proposed model outperforms TimeXer by a considerable margin in detecting and predicting major outbreaks in Singapore dengue data over a 24-week prediction window.

LG Aug 27, 2025
Filter then Attend: Improving attention-based Time Series Forecasting with Spectral Filtering

Elisha Dayag, Nhat Thanh Van Tran, Jack Xin

Transformer-based models are at the forefront in long time-series forecasting (LTSF). While in many cases these models are able to achieve state-of-the-art results, they suffer from a bias toward low frequencies in the data and from high computational and memory requirements. Recent work has established that learnable frequency filters can be an integral part of a deep forecasting model by enhancing the model's spectral utilization. These works choose to use a multilayer perceptron to process their filtered signals and thus do not solve the issues found with transformer-based models. In this paper, we establish that adding a filter to the beginning of transformer-based models enhances their performance in long time-series forecasting. We add learnable filters, which add only $\approx 1000$ parameters, to several transformer-based models and observe in multiple instances a 5-10\% relative improvement in forecasting performance. Additionally, we find that with filters added, we are able to decrease the embedding dimension of our models, resulting in transformer-based architectures that are both smaller and more effective than their non-filtering base models. We also conduct synthetic experiments to analyze how the filters enable transformer-based models to better utilize the full spectrum for forecasting.

CV Jun 10, 2025
SEMA: a Scalable and Efficient Mamba like Attention via Token Localization and Averaging

Nhat Thanh Tran, Fanghui Xue, Shuai Zhang et al.

Attention is the critical component of a transformer. Yet the quadratic computational complexity of vanilla full attention in the input size and the inability of its linear attention variant to focus have been challenges for computer vision tasks. We provide a mathematical definition of generalized attention and formulate both vanilla softmax attention and linear attention within the general framework. We prove that generalized attention disperses, that is, as the number of keys tends to infinity, the query assigns equal weights to all keys. Motivated by the dispersion property and the recent development of the Mamba form of attention, we design Scalable and Efficient Mamba like Attention (SEMA) which utilizes token localization to avoid dispersion and maintain focusing, complemented by theoretically consistent arithmetic averaging to capture the global aspect of attention. We support our approach on ImageNet-1K where classification results show that SEMA is a scalable and effective alternative beyond linear attention, outperforming recent vision Mamba models on increasingly larger scales of images at similar model parameter sizes.
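
A short calculation shows the flavor of dispersion for softmax attention. It assumes that the attention scores $s_i$ between a query and $n$ keys are uniformly bounded by a constant $B$; the paper's precise assumptions and statement for generalized attention may differ.

```latex
\[
w_i = \frac{e^{s_i}}{\sum_{j=1}^{n} e^{s_j}}, \qquad |s_i| \le B
\quad\Longrightarrow\quad
\frac{e^{-2B}}{n} \;\le\; w_i \;\le\; \frac{e^{2B}}{n},
\]
```

since each numerator lies in $[e^{-B}, e^{B}]$ and the denominator lies in $[n e^{-B}, n e^{B}]$. Every weight is therefore of order $1/n$, so no single key can retain a fixed share of the attention as $n$ grows.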

LG May 23, 2025
Beyond Discreteness: Finite-Sample Analysis of Straight-Through Estimator for Quantization

Halyun Jeong, Jack Xin, Penghang Yin

Training quantized neural networks requires addressing the non-differentiable and discrete nature of the underlying optimization problem. To tackle this challenge, the straight-through estimator (STE) has become the most widely adopted heuristic, allowing backpropagation through discrete operations by introducing surrogate gradients. However, its theoretical properties remain largely unexplored, with the few existing works simplifying the analysis by assuming an infinite amount of training data. In contrast, this work presents the first finite-sample analysis of STE in the context of neural network quantization. Our theoretical results highlight the critical role of sample size in the success of STE, a key insight absent from existing studies. Specifically, by analyzing the quantization-aware training of a two-layer neural network with binary weights and activations, we derive the sample complexity bound in terms of the data dimensionality that guarantees the convergence of STE-based optimization to the global minimum. Moreover, in the presence of label noise, we uncover an intriguing recurrence property of the STE-gradient method, where the iterate repeatedly escapes from and returns to the optimal binary weights. Our analysis leverages tools from compressed sensing and dynamical systems theory.

CVJun 1, 2024
An Image Segmentation Model with Transformed Total Variation

Elisha Dayag, Kevin Bui, Fredrick Park et al.

Based on transformed $\ell_1$ regularization, transformed total variation (TTV) has robust image recovery that is competitive with other nonconvex total variation (TV) regularizers, such as TV$^p$, $0<p<1$. Inspired by its performance, we propose a TTV-regularized Mumford--Shah model with fuzzy membership function for image segmentation. To solve it, we design an alternating direction method of multipliers (ADMM) algorithm that utilizes the transformed $\ell_1$ proximal operator. Numerical experiments demonstrate that using TTV is more effective than classical TV and other nonconvex TV variants in image segmentation.

SDMar 30, 2022
Enhancing Zero-Shot Many to Many Voice Conversion with Self-Attention VAE

Ziang Long, Yunling Zheng, Meng Yu et al.

Variational auto-encoder (VAE) is an effective neural network architecture to disentangle a speech utterance into speaker identity and linguistic content latent embeddings, then generate an utterance for a target speaker from that of a source speaker. This is possible by concatenating the identity embedding of the target speaker and the content embedding of the source speaker uttering a desired sentence. In this work, we propose to improve VAE models with self-attention and structural regularization (RGSM). Specifically, we found a suitable location in the VAE's decoder to add a self-attention layer for incorporating non-local information in generating a converted utterance and hiding the source speaker's identity. We applied the relaxed group-wise splitting method (RGSM) to regularize network weights and remarkably enhance generalization performance. In experiments on the zero-shot many-to-many voice conversion task on the VCTK data set, with the self-attention layer and the relaxed group-wise splitting method, our model achieves a gain in speaker classification accuracy on unseen speakers of 28.3% while slightly improving conversion voice quality in terms of MOSNet scores. Our encouraging findings point to future research on integrating a greater variety of attention structures in the VAE framework while controlling model size and overfitting for advancing zero-shot many-to-many voice conversions.

CVFeb 21, 2022
An Efficient Smoothing and Thresholding Image Segmentation Framework with Weighted Anisotropic-Isotropic Total Variation

Kevin Bui, Yifei Lou, Fredrick Park et al.

In this paper, we design an efficient, multi-stage image segmentation framework that incorporates a weighted difference of anisotropic and isotropic total variation (AITV). The segmentation framework generally consists of two stages: smoothing and thresholding, thus referred to as SaT. In the first stage, a smoothed image is obtained by an AITV-regularized Mumford-Shah (MS) model, which can be solved efficiently by the alternating direction method of multipliers (ADMM) with a closed-form solution of a proximal operator of the $\ell_1 - \alpha \ell_2$ regularizer. Convergence of the ADMM algorithm is analyzed. In the second stage, we threshold the smoothed image by $K$-means clustering to obtain the final segmentation result. Numerical experiments demonstrate that the proposed segmentation framework is versatile for both grayscale and color images, efficient in producing high-quality segmentation results within a few seconds, and robust to input images that are corrupted with noise, blur, or both. We compare the AITV method with its original convex TV and nonconvex TV$^p (0<p<1)$ counterparts, showcasing the qualitative and quantitative advantages of our proposed method.

LGJan 23, 2022
An integrated recurrent neural network and regression model with spatial and climatic couplings for vector-borne disease dynamics

Zhijian Li, Jack Xin, Guofa Zhou

We developed an integrated recurrent neural network and nonlinear regression spatio-temporal model for vector-borne disease evolution. We take into account climate data and seasonality as external factors that correlate with disease-transmitting insects (e.g. flies), as well as spill-over infections from neighboring regions surrounding a region of interest. The climate data is encoded to the model through a quadratic embedding scheme motivated by recommendation systems. The neighboring regions' influence is modeled by a long short-term memory neural network. The integrated model is trained by stochastic gradient descent and tested on leishmaniasis data in Sri Lanka from 2013 to 2018, where infection outbreaks occurred. Our model outperformed ARIMA models across a number of regions with high infections, and an associated ablation study renders support to our modeling hypothesis and ideas.

LGJan 22, 2022
GLassoformer: A Query-Sparse Transformer for Post-Fault Power Grid Voltage Prediction

Yunling Zheng, Carson Hu, Guang Lin et al.

We propose GLassoformer, a novel and efficient transformer architecture leveraging group Lasso regularization to reduce the number of queries of the standard self-attention mechanism. Due to the sparsified queries, GLassoformer is more computationally efficient than the standard transformers. On the power grid post-fault voltage prediction task, GLassoformer shows remarkably better prediction than many existing benchmark algorithms in terms of accuracy and stability.

LGNov 2, 2021
DeepParticle: learning invariant measure by a deep neural network minimizing Wasserstein distance on data generated from an interacting particle method

Zhongjian Wang, Jack Xin, Zhiwen Zhang

We introduce the so-called DeepParticle method to learn and generate invariant measures of stochastic dynamical systems with physical parameters based on data computed from an interacting particle method (IPM). We utilize the expressiveness of deep neural networks (DNNs) to represent the transform of samples from a given input (source) distribution to an arbitrary target distribution, neither assuming distribution functions in closed form nor a finite state space for the samples. In training, we update the network weights to minimize a discrete Wasserstein distance between the input and target samples. To reduce computational cost, we propose an iterative divide-and-conquer (a mini-batch interior point) algorithm, to find the optimal transition matrix in the Wasserstein distance. We present numerical results to demonstrate the performance of our method for accelerating IPM computation of invariant measures of stochastic dynamical systems arising in computing reaction-diffusion front speeds through chaotic flows. The physical parameter is a large Péclet number reflecting the advection dominated regime of our interest.

LGDec 10, 2020
Recurrence of Optimum for Training Weight and Activation Quantized Networks

Ziang Long, Penghang Yin, Jack Xin

Deep neural networks (DNNs) are quantized for efficient inference on resource-constrained platforms. However, training deep learning models with low-precision weights and activations involves a demanding optimization task, which calls for minimizing a stage-wise loss function subject to a discrete set-constraint. While numerous training methods have been proposed, existing studies for full quantization of DNNs are mostly empirical. From a theoretical point of view, we study practical techniques for overcoming the combinatorial nature of network quantization. Specifically, we investigate a simple yet powerful projected gradient-like algorithm for quantizing two-linear-layer networks, which proceeds by repeatedly moving one step at float weights in the negation of a heuristic fake gradient of the loss function (so-called coarse gradient) evaluated at quantized weights. For the first time, we prove that under mild conditions, the sequence of quantized weights recurrently visits the global optimum of the discrete minimization problem for training fully quantized networks. We also show numerical evidence of the recurrence phenomenon of weight evolution in training quantized deep networks.

LGNov 23, 2020
Learning Quantized Neural Nets by Coarse Gradient Method for Non-linear Classification

Ziang Long, Penghang Yin, Jack Xin

Quantized or low-bit neural networks are attractive due to their inference efficiency. However, training deep neural networks with quantized activations involves minimizing a discontinuous and piecewise constant loss function. Such a loss function has zero gradients almost everywhere (a.e.), which makes the conventional gradient-based algorithms inapplicable. To this end, we study a novel class of biased first-order oracle, termed coarse gradient, for overcoming the vanished gradient issue. A coarse gradient is generated by replacing the a.e. zero derivatives of quantized (i.e., stair-case) ReLU activation composited in the chain rule with some heuristic proxy derivative called straight-through estimator (STE). Although STEs have been widely used in training quantized networks empirically, fundamental questions like when and why the ad-hoc STE trick works still lack theoretical understanding. In this paper, we propose a class of STEs with certain monotonicity, and consider their applications to the training of a two-linear-layer network with quantized activation functions for non-linear multi-category classification. We establish performance guarantees for the proposed STEs by showing that the corresponding coarse gradient methods converge to the global minimum, which leads to a perfect classification. Lastly, we present experimental results on synthetic data as well as the MNIST dataset to verify our theoretical findings and demonstrate the effectiveness of our proposed STEs.

LGOct 18, 2020
A Spatial-Temporal Graph Based Hybrid Infectious Disease Model with Application to COVID-19

Yunling Zheng, Zhijian Li, Jack Xin et al.

As the COVID-19 pandemic evolves, reliable prediction plays an important role for policy making. The classical infectious disease model SEIR (susceptible-exposed-infectious-recovered) is a compact yet simplistic temporal model. The data-driven machine learning models such as RNN (recurrent neural networks) can suffer in case of limited time series data such as COVID-19. In this paper, we combine SEIR and RNN on a graph structure to develop a hybrid spatio-temporal model to achieve both accuracy and efficiency in training and forecasting. We introduce two features on the graph structure: node feature (local temporal infection trend) and edge feature (geographic neighbor effect). For node feature, we derive a discrete recursion (called I-equation) from SEIR so that the gradient descent method applies readily to its optimization. For edge feature, we design an RNN model to capture the neighboring effect and regularize the landscape of loss function so that local minima are effective and robust for prediction. The resulting hybrid model (called IeRNN) improves the prediction accuracy on state-level COVID-19 new case data from the US, out-performing standard temporal models (RNN, SEIR, and ARIMA) in 1-day and 7-day ahead forecasting. Our model accommodates various degrees of reopening and provides potential outcomes for policymakers.

CVOct 3, 2020
Improving Network Slimming with Nonconvex Regularization

Kevin Bui, Fredrick Park, Shuai Zhang et al.

Convolutional neural networks (CNNs) have developed to become powerful models for various computer vision tasks ranging from object detection to semantic segmentation. However, most of the state-of-the-art CNNs cannot be deployed directly on edge devices such as smartphones and drones, which need low latency under limited power and memory bandwidth. One popular, straightforward approach to compressing CNNs is network slimming, which imposes $\ell_1$ regularization on the channel-associated scaling factors via the batch normalization layers during training. Network slimming thereby identifies insignificant channels that can be pruned for inference. In this paper, we propose replacing the $\ell_1$ penalty with an alternative nonconvex, sparsity-inducing penalty in order to yield a more compressed and/or accurate CNN architecture. We investigate $\ell_p (0 < p < 1)$, transformed $\ell_1$ (T$\ell_1$), minimax concave penalty (MCP), and smoothly clipped absolute deviation (SCAD) due to their recent successes and popularity in solving sparse optimization problems, such as compressed sensing and variable selection. We demonstrate the effectiveness of network slimming with nonconvex penalties on three neural network architectures -- VGG-19, DenseNet-40, and ResNet-164 -- on standard image classification datasets. Based on the numerical experiments, T$\ell_1$ preserves model accuracy against channel pruning, $\ell_{1/2, 3/4}$ yield better compressed models with similar accuracies after retraining as $\ell_1$, and MCP and SCAD provide more accurate models after retraining with similar compression as $\ell_1$. Network slimming with T$\ell_1$ regularization also outperforms the latest Bayesian modification of network slimming in compressing a CNN architecture in terms of memory storage while preserving its model accuracy after channel pruning.
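
As a concrete illustration of one of the nonconvex penalties named above, the sketch below evaluates the transformed $\ell_1$ function $\rho_a(x) = (a+1)|x| / (a+|x|)$ with $a > 0$ on a list of scaling factors. This is only a minimal sketch of the penalty itself, not the paper's training code; the function names and the sample values are hypothetical.

```python
def transformed_l1(x, a=1.0):
    """Transformed l1 penalty rho_a(x) = (a + 1)|x| / (a + |x|), for a > 0.

    Small a behaves like the l0 indicator of x != 0,
    large a behaves like the l1 penalty |x|.
    """
    if a <= 0:
        raise ValueError("a must be positive")
    ax = abs(x)
    return (a + 1.0) * ax / (a + ax)

def tl1_penalty(scaling_factors, a=1.0):
    """Sum of transformed l1 penalties over (hypothetical) channel scaling factors."""
    return sum(transformed_l1(g, a) for g in scaling_factors)

gammas = [0.0, 0.05, -0.5, 1.0]  # made-up batch-normalization scaling factors
print(tl1_penalty(gammas, a=1.0))
print(transformed_l1(0.5, a=1e-9))  # close to 1: l0-like
print(transformed_l1(0.5, a=1e9))   # close to 0.5: l1-like
```

In a slimming-style setup this sum would be added, with a regularization weight, to the training loss in place of the $\ell_1$ term.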

CVAug 31, 2020
An Integrated Approach to Produce Robust Models with High Efficiency

Zhijian Li, Bao Wang, Jack Xin

Deep Neural Networks (DNNs) need to be both efficient and robust for practical uses. Quantization and structure simplification are promising ways to adapt DNNs to mobile devices, and adversarial training is the most popular method to make DNNs robust. In this work, we try to obtain both features by applying a convergent relaxation quantization algorithm, Binary-Relax (BR), to a robust adversarial-trained model, ResNets Ensemble via Feynman-Kac Formalism (EnResNet). We also discover that high precision, such as ternary (tnn) and 4-bit, quantization will produce sparse DNNs. However, this sparsity is unstructured under adversarial training. To solve the problems that adversarial training jeopardizes DNNs' accuracy on clean images and the structure of sparsity, we design a trade-off loss function that helps DNNs preserve their natural accuracy and improve the channel sparsity. With our trade-off loss function, we achieve both goals with no reduction of resistance under weak attacks and very minor reduction of resistance under strong attacks. Together with quantized EnResNet with trade-off loss function, we provide robust models that have high efficiency.

LGAug 10, 2020
RARTS: An Efficient First-Order Relaxed Architecture Search Method

Fanghui Xue, Yingyong Qi, Jack Xin

Differentiable architecture search (DARTS) is an effective method for data-driven neural network design based on solving a bilevel optimization problem. Despite its success in many architecture search tasks, there are still some concerns about the accuracy of first-order DARTS and the efficiency of the second-order DARTS. In this paper, we formulate a single level alternative and a relaxed architecture search (RARTS) method that utilizes the whole dataset in architecture learning via both data and network splitting, without involving mixed second derivatives of the corresponding loss functions like DARTS. In our formulation of network splitting, two networks with different but related weights cooperate in search of a shared architecture. The advantage of RARTS over DARTS is justified by a convergence theorem and an analytically solvable model. Moreover, RARTS outperforms DARTS and its variants in accuracy and search efficiency, as shown in adequate experimental results. For the task of searching topological architecture, i.e., the edges and the operations, RARTS obtains a higher accuracy and a 60% reduction of computational cost compared with second-order DARTS on CIFAR-10. RARTS continues to outperform DARTS upon transfer to ImageNet and is on par with recent variants of DARTS even though our innovation is purely on the training algorithm without modifying search space. For the task of searching width, i.e., the number of channels in convolutional layers, RARTS also outperforms the traditional network pruning benchmarks. Further experiments on the public architecture search benchmark like NATS-Bench also support the preeminence of RARTS.

PEJul 14, 2020
A Recurrent Neural Network and Differential Equation Based Spatiotemporal Infectious Disease Model with Application to COVID-19

Zhijian Li, Yunling Zheng, Jack Xin et al.

The outbreaks of Coronavirus Disease 2019 (COVID-19) have impacted the world significantly. Modeling the trend of infection and real-time forecasting of cases can help decision making and control of the disease spread. However, data-driven methods such as recurrent neural networks (RNN) can perform poorly due to limited daily samples in time. In this work, we develop an integrated spatiotemporal model based on the epidemic differential equations (SIR) and RNN. The former after simplification and discretization is a compact model of temporal infection trend of a region while the latter models the effect of nearest neighboring regions. The latter captures latent spatial information. We trained and tested our model on COVID-19 data in Italy, and show that it out-performs existing temporal models (fully connected NN, SIR, ARIMA) in 1-day, 3-day, and 1-week ahead forecasting especially in the regime of limited training data.

CVMay 9, 2020
A Weighted Difference of Anisotropic and Isotropic Total Variation for Relaxed Mumford-Shah Color and Multiphase Image Segmentation

Kevin Bui, Fredrick Park, Yifei Lou et al.

In a class of piecewise-constant image segmentation models, we propose to incorporate a weighted difference of anisotropic and isotropic total variation (AITV) to regularize the partition boundaries in an image. In particular, we replace the total variation regularization in the Chan-Vese segmentation model and a fuzzy region competition model by the proposed AITV. To deal with the nonconvex nature of AITV, we apply the difference-of-convex algorithm (DCA), in which the subproblems can be minimized by the primal-dual hybrid gradient method with linesearch. The convergence of the DCA scheme is analyzed. In addition, a generalization to color image segmentation is discussed. In the numerical experiments, we compare the proposed models with the classic convex approaches and the two-stage segmentation methods (smoothing and then thresholding) on various images, showing that our models are effective in image segmentation and robust with respect to impulsive noises.
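
To make the regularizer concrete, the sketch below evaluates a discrete weighted difference of anisotropic and isotropic total variation on a small grayscale image stored as a list of rows. The discretization (forward differences, set to zero past the last row and column), the weight name `alpha`, and the function name are assumptions made for illustration; the paper's own discretization may differ.

```python
import math

def aitv(image, alpha=0.5):
    """Discrete AITV: sum(|dx| + |dy|) - alpha * sum(sqrt(dx^2 + dy^2)).

    image: list of equal-length rows of floats.
    alpha: weight in [0, 1] on the isotropic term (assumed name).
    """
    rows, cols = len(image), len(image[0])
    aniso = 0.0
    iso = 0.0
    for i in range(rows):
        for j in range(cols):
            # Forward differences; zero at the right and bottom boundaries.
            dx = image[i][j + 1] - image[i][j] if j + 1 < cols else 0.0
            dy = image[i + 1][j] - image[i][j] if i + 1 < rows else 0.0
            aniso += abs(dx) + abs(dy)
            iso += math.hypot(dx, dy)
    return aniso - alpha * iso

vertical_edge = [[0.0, 1.0], [0.0, 1.0]]
corner = [[0.0, 1.0], [1.0, 1.0]]
print(aitv(vertical_edge, alpha=0.5))
print(aitv(corner, alpha=1.0))
```

Because |dx| + |dy| is never smaller than sqrt(dx^2 + dy^2), the value is nonnegative for alpha in [0, 1]. With alpha = 1 it vanishes wherever the gradient is aligned with a single axis.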

LGFeb 28, 2020
Global Convergence and Geometric Characterization of Slow to Fast Weight Evolution in Neural Network Training for Classifying Linearly Non-Separable Data

Ziang Long, Penghang Yin, Jack Xin

In this paper, we study the dynamics of gradient descent in learning neural networks for classification problems. Unlike in existing works, we consider the linearly non-separable case where the training data of different classes lie in orthogonal subspaces. We show that when the network has a sufficient (but not exceedingly large) number of neurons, (1) the corresponding minimization problem has a desirable landscape where all critical points are global minima with perfect classification; (2) gradient descent is guaranteed to converge to the global minima. Moreover, we discovered a geometric condition on the network weights so that when it is satisfied, the weight evolution transitions from a slow phase of weight direction spreading to a fast phase of weight convergence. The geometric condition says that the convex hull of the weights projected on the unit sphere contains the origin.

CVDec 17, 2019
$\ell_0$ Regularized Structured Sparsity Convolutional Neural Networks

Kevin Bui, Fredrick Park, Shuai Zhang et al.

Deepening and widening convolutional neural networks (CNNs) significantly increases the number of trainable weight parameters by adding more convolutional layers and feature maps per layer, respectively. By imposing inter- and intra-group sparsity onto the weights of the layers during the training process, a compressed network can be obtained with accuracy comparable to a dense one. In this paper, we propose a new variant of sparse group lasso that blends the $\ell_0$ norm onto the individual weight parameters and the $\ell_{2,1}$ norm onto the output channels of a layer. To address the non-differentiability of the $\ell_0$ norm, we apply variable splitting resulting in an algorithm that consists of executing stochastic gradient descent followed by hard thresholding for each iteration. Numerical experiments are demonstrated on LeNet-5 and wide-residual-networks for MNIST and CIFAR 10/100, respectively. They showcase the effectiveness of our proposed method in attaining superior test accuracy with network sparsification on par with the current state of the art.
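
The "gradient step followed by hard thresholding" pattern can be sketched on plain lists of weights. The code below is a minimal sketch, not the paper's algorithm: the parameter names `lam` and `beta`, the threshold sqrt(2 * lam / beta), and the toy gradient are assumptions. The threshold is the one that solves min_u lam * ||u||_0 + (beta / 2) * ||u - w||^2 entrywise.

```python
import math

def hard_threshold(weights, lam, beta=1.0):
    """Entrywise minimizer of lam * ||u||_0 + (beta / 2) * ||u - w||^2.

    Keeps w_i when |w_i| > sqrt(2 * lam / beta), otherwise sets it to zero.
    """
    tau = math.sqrt(2.0 * lam / beta)
    return [w if abs(w) > tau else 0.0 for w in weights]

def sgd_then_threshold(weights, grad, lr, lam, beta=1.0):
    """One (hypothetical) iteration: a gradient step, then hard thresholding."""
    stepped = [w - lr * g for w, g in zip(weights, grad)]
    return hard_threshold(stepped, lam, beta)

w = [0.1, -2.0, 0.5]
print(hard_threshold(w, lam=0.125))                            # threshold is 0.5
print(sgd_then_threshold(w, [0.0, 1.0, -1.0], lr=0.5, lam=0.125))
```

In a variable-splitting scheme the thresholded copy would typically be kept alongside the dense weights rather than overwriting them; that bookkeeping is omitted here.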

LGMar 13, 2019
Understanding Straight-Through Estimator in Training Activation Quantized Neural Nets

Penghang Yin, Jiancheng Lyu, Shuai Zhang et al.

Training activation quantized neural networks involves minimizing a piecewise constant function whose gradient vanishes almost everywhere, which is undesirable for the standard back-propagation or chain rule. An empirical way around this issue is to use a straight-through estimator (STE) (Bengio et al., 2013) in the backward pass only, so that the "gradient" through the modified chain rule becomes non-trivial. Since this unusual "gradient" is certainly not the gradient of loss function, the following question arises: why does searching in its negative direction minimize the training loss? In this paper, we provide the theoretical justification of the concept of STE by answering this question. We consider the problem of learning a two-linear-layer network with binarized ReLU activation and Gaussian input data. We shall refer to the unusual "gradient" given by the STE-modified chain rule as coarse gradient. The choice of STE is not unique. We prove that if the STE is properly chosen, the expected coarse gradient correlates positively with the population gradient (not available for the training), and its negation is a descent direction for minimizing the population loss. We further show the associated coarse gradient descent algorithm converges to a critical point of the population loss minimization problem. Moreover, we show that a poor choice of STE leads to instability of the training algorithm near certain local minima, which is verified with CIFAR-10 experiments.
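
The STE-modified chain rule can be sketched on a single unit with a binary step activation. This toy example is an assumption-laden illustration, not the two-linear-layer model analyzed in the paper: the squared loss on one sample, the function names, and the two proxy derivatives shown are chosen here for brevity.

```python
def binary_activation(x):
    """Forward pass: step function, whose true derivative is zero almost everywhere."""
    return 1.0 if x > 0 else 0.0

def relu_ste(x):
    """Proxy derivative: derivative of ReLU."""
    return 1.0 if x > 0 else 0.0

def clipped_relu_ste(x):
    """Proxy derivative: derivative of ReLU clipped at 1."""
    return 1.0 if 0.0 < x < 1.0 else 0.0

def coarse_gradient(w, z, target, ste=relu_ste):
    """Coarse gradient of 0.5 * (step(w . z) - target)^2 with respect to w.

    The backward pass replaces the step function's derivative with ste(w . z).
    """
    pre = sum(wi * zi for wi, zi in zip(w, z))
    err = binary_activation(pre) - target
    scale = err * ste(pre)
    return [scale * zi for zi in z]

print(coarse_gradient([0.5, 0.25], [1.0, 1.0], target=0.0))
print(coarse_gradient([2.0, 2.0], [1.0, 1.0], target=0.0, ste=clipped_relu_ste))
```

The second call returns a zero vector because the pre-activation lies outside (0, 1), which shows how the choice of proxy derivative changes the update direction.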

LGFeb 20, 2019
Learning Sparse Neural Networks via $\ell_0$ and T$\ell_1$ by a Relaxed Variable Splitting Method with Application to Multi-scale Curve Classification

Fanghui Xue, Jack Xin

We study sparsification of convolutional neural networks (CNN) by a relaxed variable splitting method of $\ell_0$ and transformed-$\ell_1$ (T$\ell_1$) penalties, with application to complex curves such as texts written in different fonts, and words written with trembling hands simulating those of Parkinson's disease patients. The CNN contains 3 convolutional layers, each followed by a maximum pooling, and finally a fully connected layer which contains the largest number of network weights. With $\ell_0$ penalty, we achieved over 99% test accuracy in distinguishing shaky vs. regular fonts or hand writings with above 86% of the weights in the fully connected layer being zero. Comparable sparsity and test accuracy are also reached with a proper choice of T$\ell_1$ penalty.

LGFeb 13, 2019
A Study on Graph-Structured Recurrent Neural Networks and Sparsification with Application to Epidemic Forecasting

Zhijian Li, Xiyang Luo, Bao Wang et al.

We study epidemic forecasting on real-world health data by a graph-structured recurrent neural network (GSRNN). We achieve state-of-the-art forecasting accuracy on the benchmark CDC dataset. To improve model efficiency, we sparsify the network weights via transformed-$\ell_1$ penalty and maintain prediction accuracy at the same level with 70% of the network weights being zero.

LGJan 24, 2019
AutoShuffleNet: Learning Permutation Matrices via an Exact Lipschitz Continuous Penalty in Deep Convolutional Neural Networks

Jiancheng Lyu, Shuai Zhang, Yingyong Qi et al.

ShuffleNet is a state-of-the-art lightweight convolutional neural network architecture. Its basic operations include group, channel-wise convolution and channel shuffling. However, channel shuffling is manually designed empirically. Mathematically, shuffling is a multiplication by a permutation matrix. In this paper, we propose to automate channel shuffling by learning permutation matrices in network training. We introduce an exact Lipschitz continuous non-convex penalty so that it can be incorporated in the stochastic gradient descent to approximate permutation at high precision. Exact permutations are obtained by simple rounding at the end of training and are used in inference. The resulting network, referred to as AutoShuffleNet, achieved improved classification accuracies on CIFAR-10 and ImageNet data sets. In addition, we found experimentally that the standard convex relaxation of permutation matrices into stochastic matrices leads to poor performance. We prove theoretically the exactness (error bounds) in recovering permutation matrices when our penalty function is zero (very small). We present examples of permutation optimization through graph matching and two-layer neural network models where the loss functions are calculated in closed analytical form. In the examples, convex relaxation failed to capture permutations whereas our penalty succeeded.
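
One penalty with the stated exactness property is a row- and column-wise $\ell_1 - \ell_2$ difference. For a nonnegative matrix whose rows and columns each sum to one, it is zero exactly when every row and column has a single nonzero entry, that is, when the matrix is a permutation. The sketch below assumes this form for illustration; it should not be read as the paper's exact definition, and the function names are hypothetical.

```python
import math

def l1_minus_l2(vec):
    """||v||_1 - ||v||_2, which is zero iff v has at most one nonzero entry."""
    return sum(abs(v) for v in vec) - math.sqrt(sum(v * v for v in vec))

def permutation_penalty(matrix):
    """Sum of l1 - l2 over all rows and all columns of a square matrix."""
    rows = matrix
    cols = [list(col) for col in zip(*matrix)]
    return sum(l1_minus_l2(r) for r in rows) + sum(l1_minus_l2(c) for c in cols)

def round_to_permutation(matrix):
    """Simple rounding: put a 1 at each row's largest entry (assumes no ties or clashes)."""
    out = []
    for row in matrix:
        k = row.index(max(row))
        out.append([1.0 if j == k else 0.0 for j in range(len(row))])
    return out

shuffle = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
uniform = [[0.5, 0.5], [0.5, 0.5]]
nearly = [[0.9, 0.1], [0.1, 0.9]]
print(permutation_penalty(shuffle), permutation_penalty(uniform))
print(round_to_permutation(nearly))
```

The uniform doubly stochastic matrix gets a strictly positive penalty, while the permutation matrix gets zero.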

LGAug 15, 2018
Blended Coarse Gradient Descent for Full Quantization of Deep Neural Networks

Penghang Yin, Shuai Zhang, Jiancheng Lyu et al.

Quantized deep neural networks (QDNNs) are attractive due to their much lower memory storage and faster inference speed than their regular full precision counterparts. To maintain the same performance level especially at low bit-widths, QDNNs must be retrained. Their training involves piecewise constant activation functions and discrete weights, hence mathematical challenges arise. We introduce the notion of coarse gradient and propose the blended coarse gradient descent (BCGD) algorithm, for training fully quantized neural networks. Coarse gradient is generally not a gradient of any function but an artificial ascent direction. The weight update of BCGD goes by coarse gradient correction of a weighted average of the full precision weights and their quantization (the so-called blending), which yields sufficient descent in the objective value and thus accelerates the training. Our experiments demonstrate that this simple blending technique is very effective for quantization at extremely low bit-width such as binarization. In full quantization of ResNet-18 for ImageNet classification task, BCGD gives 64.36% top-1 accuracy with binary weights across all layers and 4-bit adaptive activation. If the weights in the first and last layers are kept in full precision, this number increases to 65.46%. As theoretical justification, we show convergence analysis of coarse gradient descent for a two-linear-layer neural network model with Gaussian input data, and prove that the expected coarse gradient correlates positively with the underlying true gradient.
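
The blending step described above can be sketched as a single update on a list of float weights. This is a minimal sketch under stated assumptions: the blending weight `rho`, the learning rate `lr`, the scaled-sign binarizer, and the coarse gradient values passed in are all illustrative, and in practice the coarse gradient would be evaluated at the quantized weights by a backward pass.

```python
def binarize(weights):
    """Scaled-sign binarization: mean(|w|) * sign(w), treating sign(0) as +1 (assumed quantizer)."""
    scale = sum(abs(w) for w in weights) / len(weights)
    return [scale if w >= 0 else -scale for w in weights]

def bcgd_step(weights, coarse_grad, rho=0.5, lr=0.1):
    """Blend float weights with their quantization, then apply a coarse gradient correction.

    new_w = (1 - rho) * w + rho * Q(w) - lr * coarse_grad
    """
    quantized = binarize(weights)
    return [
        (1.0 - rho) * w + rho * q - lr * g
        for w, q, g in zip(weights, quantized, coarse_grad)
    ]

w = [1.0, -3.0]
g = [1.0, 1.0]  # made-up coarse gradient evaluated at binarize(w)
print(binarize(w))
print(bcgd_step(w, g))
```

Setting `rho` to zero recovers an unblended coarse gradient step on the float weights.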

CVJan 19, 2018
BinaryRelax: A Relaxation Approach For Training Deep Neural Networks With Quantized Weights

Penghang Yin, Shuai Zhang, Jiancheng Lyu et al.

We propose BinaryRelax, a simple two-phase algorithm, for training deep neural networks with quantized weights. The set constraint that characterizes the quantization of weights is not imposed until the late stage of training, and a sequence of pseudo quantized weights is maintained. Specifically, we relax the hard constraint into a continuous regularizer via Moreau envelope, which turns out to be the squared Euclidean distance to the set of quantized weights. The pseudo quantized weights are obtained by linearly interpolating between the float weights and their quantizations. A continuation strategy is adopted to push the weights towards the quantized state by gradually increasing the regularization parameter. In the second phase, an exact quantization scheme with a small learning rate is invoked to guarantee fully quantized weights. We test BinaryRelax on the benchmark CIFAR and ImageNet color image datasets to demonstrate the superiority of the relaxed quantization approach and the improved accuracy over the state-of-the-art training methods. Finally, we prove the convergence of BinaryRelax under an approximate orthogonality condition.
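
The linear interpolation and continuation strategy can be sketched as follows. The interpolation formula (lam * Q(w) + w) / (lam + 1), the scaled-sign binarizer, the growth factor, and the function names are assumptions chosen for illustration rather than details confirmed by the abstract.

```python
def binarize(weights):
    """Scaled-sign binarization: mean(|w|) * sign(w), treating sign(0) as +1 (assumed quantizer)."""
    scale = sum(abs(w) for w in weights) / len(weights)
    return [scale if w >= 0 else -scale for w in weights]

def pseudo_quantize(weights, lam):
    """Interpolate between float weights and their quantization.

    lam = 0 returns the float weights; lam -> infinity approaches Q(w).
    """
    quantized = binarize(weights)
    return [(lam * q + w) / (lam + 1.0) for w, q in zip(weights, quantized)]

w = [1.0, -3.0]
lam = 1.0
for epoch in range(4):
    print(epoch, lam, pseudo_quantize(w, lam))
    lam *= 10.0  # hypothetical continuation schedule: grow the regularization parameter
```

As `lam` grows, the printed weights move from the midpoint between the float and binary values toward the binary values themselves.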

LGNov 23, 2017
Deep Learning for Real-Time Crime Forecasting and its Ternarization

Bao Wang, Penghang Yin, Andrea L. Bertozzi et al.

Real-time crime forecasting is important. However, accurate prediction of when and where the next crime will happen is difficult. No known physical model provides a reasonable approximation to such a complex system. Historical crime data are sparse in both space and time and the signal of interest is weak. In this work, we first present a proper representation of crime data. We then adapt the spatial temporal residual network on the well-represented data to predict the distribution of crime in Los Angeles at the scale of hours in neighborhood-sized parcels. These experiments as well as comparisons with several existing approaches to prediction demonstrate the superiority of the proposed model in terms of accuracy. Finally, we present a ternarization technique to address the resource consumption issue for its deployment in the real world. This work is an extension of our short conference proceeding paper [Wang et al, arXiv 1707.03340].