Nhat Thanh Tran

CV
h-index19
6papers
20citations
Novelty52%
AI Score48

6 Papers

LGJul 2, 2023
Fourier-Mixed Window Attention: Accelerating Informer for Long Sequence Time-Series Forecasting

Nhat Thanh Tran, Jack Xin

We study a fast local-global window-based attention method to accelerate Informer for long sequence time-series forecasting. While window attention being local is a considerable computational saving, it lacks the ability to capture global token information which is compensated by a subsequent Fourier transform block. Our method, named FWin, does not rely on query sparsity hypothesis and an empirical approximation underlying the ProbSparse attention of Informer. Through experiments on univariate and multivariate datasets, we show that FWin transformers improve the overall prediction accuracies of Informer while accelerating its inference speeds by 1.6 to 2 times. We also provide a mathematical definition of FWin attention, and prove that it is equivalent to the canonical full attention under the block diagonal invertibility (BDI) condition of the attention matrix. The BDI is shown experimentally to hold with high probability for typical benchmark datasets.

CVMay 11
USEMA: a Scalable Efficient Mamba Like Attention for Medical Image Segmentation

Elisha Dayag, Nhat Thanh Tran, Jack Xin

Accurate medical image segmentation is an integral part of the medical image analysis pipeline that requires the ability to merge local and global information. While vision transformers are able to capture global interactions using vanilla self-attention, their quadratic computational complexity in the input size remains a struggle for medical image segmentation tasks. Motivated by the dispersion property of vanilla self-attention and recent development of Mamba form of attention, Scalable and Efficient Mamba like Attention (SEMA) utilizes token localization via local window attention to avoid dispersion and maintain focusing, complemented by theoretically consistent arithmetic averaging to capture global aspect of attention. In this work, we present USEMA, a hybrid UNet architecture that merges the local feature extraction ability of convolutional neural networks (CNNs) with SEMA attention. We conduct experiments with USEMA across a variety of modalities and image sizes, demonstrating improved computational efficiency compared to transformer based models using full self-attention, and superior segmentation performance relative to purely convolution and Mamba-based models.

LGMar 10, 2024
FWin transformer for dengue prediction under climate and ocean influence

Nhat Thanh Tran, Jack Xin, Guofa Zhou

Dengue fever is one of the most deadly mosquito-born tropical infectious diseases. Detailed long range forecast model is vital in controlling the spread of disease and making mitigation efforts. In this study, we examine methods used to forecast dengue cases for long range predictions. The dataset consists of local climate/weather in addition to global climate indicators of Singapore from 2000 to 2019. We utilize newly developed deep neural networks to learn the intricate relationship between the features. The baseline models in this study are in the class of recent transformers for long sequence forecasting tasks. We found that a Fourier mixed window attention (FWin) based transformer performed the best in terms of both the mean square error and the maximum absolute error on the long range dengue forecast up to 60 weeks.

CVJan 19
Deep Image Prior with L0 Gradient Regularizer for Image Smoothing

Nhat Thanh Tran, Kevin Bui, Jack Xin

Image smoothing is a fundamental image processing operation that preserves the underlying structure, such as strong edges and contours, and removes minor details and textures in an image. Many image smoothing algorithms rely on computing local window statistics or solving an optimization problem. Recent state-of-the-art methods leverage deep learning, but they require a carefully curated training dataset. Because constructing a proper training dataset for image smoothing is challenging, we propose DIP-$\ell_0$, a deep image prior framework that incorporates the $\ell_0$ gradient regularizer. This framework can perform high-quality image smoothing without any training data. To properly minimize the associated loss function that has the nonconvex, nonsmooth $\ell_0$ ``norm", we develop an alternating direction method of multipliers algorithm that utilizes an off-the-shelf $\ell_0$ gradient minimization solver. Numerical experiments demonstrate that the proposed DIP-$\ell_0$ outperforms many image smoothing algorithms in edge-preserving image smoothing and JPEG artifact removal.

LGOct 3, 2025
CrossLag: Predicting Major Dengue Outbreaks with a Domain Knowledge Informed Transformer

Ashwin Prabu, Nhat Thanh Tran, Guofa Zhou et al.

A variety of models have been developed to forecast dengue cases to date. However, it remains a challenge to predict major dengue outbreaks that need timely public warnings the most. In this paper, we introduce CrossLag, an environmentally informed attention that allows for the incorporation of lagging endogenous signals behind the significant events in the exogenous data into the architecture of the transformer at low parameter counts. Outbreaks typically lag behind major changes in climate and oceanic anomalies. We use TimeXer, a recent general-purpose transformer distinguishing exogenous-endogenous inputs, as the baseline for this study. Our proposed model outperforms TimeXer by a considerable margin in detecting and predicting major outbreaks in Singapore dengue data over a 24-week prediction window.

CVJun 10, 2025
SEMA: a Scalable and Efficient Mamba like Attention via Token Localization and Averaging

Nhat Thanh Tran, Fanghui Xue, Shuai Zhang et al.

Attention is the critical component of a transformer. Yet the quadratic computational complexity of vanilla full attention in the input size and the inability of its linear attention variant to focus have been challenges for computer vision tasks. We provide a mathematical definition of generalized attention and formulate both vanilla softmax attention and linear attention within the general framework. We prove that generalized attention disperses, that is, as the number of keys tends to infinity, the query assigns equal weights to all keys. Motivated by the dispersion property and recent development of Mamba form of attention, we design Scalable and Efficient Mamba like Attention (SEMA) which utilizes token localization to avoid dispersion and maintain focusing, complemented by theoretically consistent arithmetic averaging to capture global aspect of attention. We support our approach on Imagenet-1k where classification results show that SEMA is a scalable and effective alternative beyond linear attention, outperforming recent vision Mamba models on increasingly larger scales of images at similar model parameter sizes.