Tao Hong

LG
h-index: 1
9 papers
55 citations
Novelty: 51%
AI Score: 37

9 Papers

CV · Aug 7, 2022
PDO-s3DCNNs: Partial Differential Operator Based Steerable 3D CNNs

Zhengyang Shen, Tao Hong, Qi She et al.

Steerable models can provide very general and flexible equivariance by formulating equivariance requirements in the language of representation theory and feature fields, which has been recognized to be effective for many vision tasks. However, deriving steerable models for 3D rotations is much more difficult than in the 2D case, due to the more complicated mathematics of 3D rotations. In this work, we employ partial differential operators (PDOs) to model 3D filters, and derive general steerable 3D CNNs, which are called PDO-s3DCNNs. We prove that the equivariant filters are subject to linear constraints, which can be solved efficiently under various conditions. As far as we know, PDO-s3DCNNs are the most general steerable CNNs for 3D rotations, in the sense that they cover all common subgroups of $SO(3)$ and their representations, while existing methods can only be applied to specific groups and representations. Extensive experiments show that our models can preserve equivariance well in the discrete domain, and outperform previous works on SHREC'17 retrieval and ISBI 2012 segmentation tasks with a low network complexity.

OC · Oct 20, 2021
Merging Multigrid Optimization with SESOP

Tao Hong, Irad Yavneh, Michael Zibulevsky

A merger of two optimization frameworks is introduced: SEquential Subspace OPtimization (SESOP) with MultiGrid (MG) optimization. At each iteration of the algorithm, the search direction implied by the coarse-grid correction process of MG is added to the low dimensional search-space of SESOP, which includes the preconditioned gradient and search directions involving the previous iterates, called "history". Numerical experiments demonstrate the effectiveness of this approach. We then study the asymptotic convergence factor of the two-level version of SESOP-MG (dubbed SESOP-TG) for optimization of quadratic functions, and derive approximately optimal fixed parameters, which may reduce the computational overhead for such problems significantly.
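The subspace step described above can be sketched for a quadratic objective $f(x) = \tfrac{1}{2} x^T A x - b^T x$. This is only an illustration of the SESOP part, under stated assumptions: the preconditioner is the identity, the history is a single previous step, and the multigrid coarse-grid direction is omitted (in SESOP-MG it would be appended to `dirs` as one more direction). The function and variable names are hypothetical and not taken from the paper.

```python
def matvec(a, x):
    """Dense matrix-vector product for lists of lists."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def solve(m, r):
    """Solve a small dense system m z = r by Gaussian elimination with partial pivoting."""
    n = len(r)
    aug = [row[:] + [ri] for row, ri in zip(m, r)]
    for c in range(n):
        p = max(range(c, n), key=lambda i: abs(aug[i][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for i in range(c + 1, n):
            f = aug[i][c] / aug[c][c]
            for j in range(c, n + 1):
                aug[i][j] -= f * aug[c][j]
    z = [0.0] * n
    for i in reversed(range(n)):
        z[i] = (aug[i][n] - sum(aug[i][j] * z[j] for j in range(i + 1, n))) / aug[i][i]
    return z

def sesop_quadratic(a, b, x0, iters=10, tol=1e-12):
    """Minimize 0.5 x^T A x - b^T x over span{gradient, previous step} at each iteration."""
    x = x0[:]
    prev_step = None
    for _ in range(iters):
        g = [ax - bi for ax, bi in zip(matvec(a, x), b)]  # gradient A x - b
        if dot(g, g) < tol:
            break
        # Search subspace: gradient plus one history direction (assumption).
        dirs = [g] if prev_step is None else [g, prev_step]
        a_dirs = [matvec(a, d) for d in dirs]
        # Exact subspace minimization: (P^T A P) alpha = -P^T g.
        m = [[dot(di, adj) for adj in a_dirs] for di in dirs]
        r = [-dot(d, g) for d in dirs]
        alpha = solve(m, r)
        step = [sum(al * d[i] for al, d in zip(alpha, dirs)) for i in range(len(x))]
        x = [xi + si for xi, si in zip(x, step)]
        prev_step = step
    return x

A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]  # symmetric positive definite
b = [1.0, 2.0, 3.0]
x_star = sesop_quadratic(A, b, [0.0, 0.0, 0.0])
residual = [ax - bi for ax, bi in zip(matvec(A, x_star), b)]
```

With only the gradient and one previous step in the subspace, this reduces to a conjugate-gradient-like iteration on quadratics; adding further directions enlarges the small system solved for `alpha` but leaves the rest unchanged.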

CL · Nov 16, 2025
Evidence of Phase Transitions in Small Transformer-Based Language Models

Noah Hong, Tao Hong

Phase transitions have been proposed as the origin of emergent abilities in large language models (LLMs), where new capabilities appear abruptly once models surpass critical thresholds of scale. Prior work, such as that of Wei et al., demonstrated these phenomena under model and data scaling, with transitions revealed after applying a log scale to training compute. In this work, we ask three complementary questions: (1) Are phase transitions unique to large models, or can they also be observed in small transformer-based language models? (2) Can such transitions be detected directly in linear training space, rather than only after log rescaling? (3) Can these transitions emerge at early stages of training? To investigate, we train a small GPT-style transformer on a character-level corpus and analyze the evolution of vocabulary usage throughout training. We track the average word length, the number of correct versus incorrect words, and shifts in vocabulary diversity. Building on these measures, we apply Poisson and sub-Poisson statistics to quantify how words connect and reorganize. This combined analysis reveals a distinct transition point during training. Notably, these transitions are not apparent in standard loss or validation curves, but become visible through our vocabulary- and statistics-based probes. Our findings suggest that phase-transition reorganizations are a general feature of language model training, observable even in modest models, detectable directly in linear training space, and occurring surprisingly early as coherence emerges. This perspective provides new insight into the nonlinear dynamics of language model training and underscores the importance of tailored metrics for uncovering phase-transition behaviors.

LG · Jan 18, 2025
HOPS: High-order Polynomials with Self-supervised Dimension Reduction for Load Forecasting

Pengyang Song, Han Feng, Shreyashi Shukla et al.

Load forecasting is a fundamental task in smart grids. Many techniques have been applied to developing load forecasting models. Due to challenges such as the curse of dimensionality, overfitting, and limited computing resources, multivariate higher-order polynomial models have received limited attention in load forecasting, despite their desirable mathematical foundations and optimization properties. In this paper, we propose low-rank approximation and self-supervised dimension reduction to address these issues. To further improve computational efficiency, we also utilize a fast Conjugate Gradient based algorithm for the proposed polynomial models. Based on the load datasets from ISO New England, the proposed method, high-order polynomials with self-supervised dimension reduction (HOPS), demonstrates higher forecasting accuracy than several competitive models. Additionally, experimental results indicate that our approach alleviates redundant variable construction, achieving better forecasts with fewer input variables.

CL · May 8, 2023
Coherent Wave Dynamics and Language Generation of a Generative Pre-trained Transformer

Tao Hong

Large Language Models (LLMs), such as the Generative Pretrained Transformer (GPT), have achieved tremendous success in various language tasks, but their emergent abilities have also raised many questions, concerns, and challenges that need to be addressed. To gain a better understanding of the models' inner mechanisms, we analyze the hidden state and channel wave dynamics in a small GPT, focusing on the coherence of wave patterns in terms of cross-channel correlation and individual auto-correlation. Our findings suggest that wave dynamics offer consistent and repeatable intrinsic oscillation modes, along with context-aware plasticity and expressiveness in language generation. By analyzing wave patterns, coherence, and clustering, we provide a systematic way to identify and interpret the functionality of the hidden state channels, paving the way to understand and control higher-level language pattern formation. In addition, we investigate the Poisson statistics of spelling errors in text sequence generation across various levels of model training and observe a phase-transition-like process. As coherence builds up, there is a competition between the generation of correct and misspelled words. However, once the model is adequately trained and significant coherence has emerged, the coherent process becomes strong enough to effectively suppress spelling errors, preventing the cascade amplification of defects. The distribution of correct spellings transitions from Poissonian to Sub-Poissonian, while the distribution of misspellings shows the opposite trend. By leveraging concepts and techniques from quantum physics, we gain novel insights into the dynamics of the small GPT. This approach can be extended to larger language models that exhibit more complex coherent language patterns, opening up opportunities to interpret their emergent capabilities and develop more specialized models.
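The Poissonian versus sub-Poissonian distinction mentioned above can be illustrated with the Fano factor (variance divided by mean of event counts), a standard measure borrowed from photon-counting statistics. The abstract does not say which statistic the paper uses, so this is an assumed stand-in; the counts below are made-up toy data, and `fano_factor`, `classify`, and the tolerance `tol` are hypothetical names and values.

```python
from statistics import mean, pvariance

def fano_factor(counts):
    """Variance-to-mean ratio of event counts (e.g. correct words per text window)."""
    m = mean(counts)
    if m == 0:
        raise ValueError("Fano factor is undefined for a zero mean")
    return pvariance(counts) / m

def classify(counts, tol=0.1):
    """Label counts relative to a Poisson process, for which the Fano factor is 1."""
    f = fano_factor(counts)
    if f < 1 - tol:
        return "sub-Poissonian"      # more regular than Poisson
    if f > 1 + tol:
        return "super-Poissonian"    # burstier than Poisson
    return "approximately Poissonian"

# Toy data (not from the paper): counts per fixed-size window of generated text.
regular = [4, 5, 4, 5, 4, 5, 4, 5]
bursty = [0, 9, 1, 10, 0, 8, 0, 8]

regular_label = classify(regular)
bursty_label = classify(bursty)
```

Tracking this ratio for correct and misspelled words across training checkpoints is one way a transition of the kind described could show up as a change in the label.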

CV · May 6, 2018
Acceleration of RED via Vector Extrapolation

Tao Hong, Yaniv Romano, Michael Elad

Models play an important role in inverse problems, serving as the prior for representing the original signal to be recovered. REgularization by Denoising (RED) is a recently introduced general framework for constructing such priors using state-of-the-art denoising algorithms. Using RED, solving inverse problems is shown to amount to an iterated denoising process. However, as the complexity of denoising algorithms is generally high, this might lead to an overall slow algorithm. In this paper, we suggest an accelerated technique based on vector extrapolation (VE) to speed up existing RED solvers. Numerical experiments validate the gain obtained by VE, leading to substantial savings in computation compared with the original fixed-point method.

SP · Sep 19, 2017
Optimized Structured Sparse Sensing Matrices for Compressive Sensing

Tao Hong, Xiao Li, Zhihui Zhu et al.

We consider designing a robust structured sparse sensing matrix consisting of a sparse matrix with a few non-zero entries per row and a dense base matrix for capturing signals efficiently. We design the robust structured sparse sensing matrix by minimizing the distance between the Gram matrix of the equivalent dictionary and a target Gram matrix with small mutual coherence. Moreover, a regularization term is added to enforce the robustness of the optimized structured sparse sensing matrix to the sparse representation error (SRE) of the signals of interest. An alternating minimization algorithm with global sequence convergence is proposed for solving the corresponding optimization problem. Numerical experiments on synthetic data and natural images show that the obtained structured sensing matrix results in higher signal reconstruction accuracy than a random dense sensing matrix.
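The quantities named above (equivalent dictionary, its Gram matrix, and mutual coherence) can be made concrete with a small sketch. It assumes the usual compressive-sensing definitions: the equivalent dictionary is the product of the sensing matrix and the sparsifying dictionary, and mutual coherence is the largest absolute off-diagonal entry of the column-normalized Gram matrix. The matrices `phi` and `psi` are toy values, the helper names are hypothetical, and no zero columns are assumed.

```python
import math

def matmul(a, b):
    """Dense matrix product for lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def normalized_gram(d):
    """Gram matrix of d after scaling each column to unit Euclidean norm."""
    cols = list(zip(*d))
    norms = [math.sqrt(sum(x * x for x in c)) for c in cols]
    return [[sum(x * y for x, y in zip(ci, cj)) / (ni * nj)
             for cj, nj in zip(cols, norms)]
            for ci, ni in zip(cols, norms)]

def mutual_coherence(gram):
    """Largest absolute off-diagonal entry of a normalized Gram matrix."""
    n = len(gram)
    return max(abs(gram[i][j]) for i in range(n) for j in range(n) if i != j)

# Toy example: a 2x3 sensing matrix and an identity sparsifying dictionary.
phi = [[1.0, 0.0, 1.0],
       [0.0, 1.0, 1.0]]
psi = [[1.0, 0.0, 0.0],
       [0.0, 1.0, 0.0],
       [0.0, 0.0, 1.0]]

equivalent_dictionary = matmul(phi, psi)
gram = normalized_gram(equivalent_dictionary)
mu = mutual_coherence(gram)  # about 0.707, i.e. 1/sqrt(2)
```

A design procedure of the kind described would adjust `phi` so that `gram` moves closer to a target Gram matrix whose off-diagonal entries are small.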

LG · Jan 4, 2017
Online Learning Sensing Matrix and Sparsifying Dictionary Simultaneously for Compressive Sensing

Tao Hong, Zhihui Zhu

This paper considers the problem of simultaneously learning the Sensing Matrix and Sparsifying Dictionary (SMSD) on a large training dataset. To address the formulated joint learning problem, we propose an online algorithm that consists of a closed-form solution for optimizing the sensing matrix with a fixed sparsifying dictionary and a stochastic method for learning the sparsifying dictionary on a large dataset when the sensing matrix is given. Benefiting from training on a large dataset, the compressive sensing (CS) system obtained by the proposed algorithm yields much better signal recovery accuracy than existing ones. The simulation results on natural images demonstrate the effectiveness of the suggested online algorithm compared with the existing methods.

LG · Sep 27, 2016
An Efficient Method for Robust Projection Matrix Design

Tao Hong, Zhihui Zhu

Our objective is to efficiently design a robust projection matrix $\Phi$ for Compressive Sensing (CS) systems applied to signals that are not exactly sparse. The optimal projection matrix is obtained mainly by minimizing the average coherence of the equivalent dictionary. To drop the requirement of the sparse representation error (SRE) for a set of training data as in [15] [16], we introduce a novel penalty function independent of a particular SRE matrix. Without requiring training data, we can efficiently design the robust projection matrix and apply it to most CS systems, such as a CS system for image processing with a conventional wavelet dictionary, in which the SRE matrix is generally not available. Simulation results demonstrate the efficiency and effectiveness of the proposed approach compared with the state-of-the-art methods. In addition, we experimentally demonstrate with natural images that, under a similar compression rate, a CS system with a learned dictionary in high dimensions outperforms the one in low dimensions in terms of reconstruction accuracy. This, together with the fact that our proposed method can work efficiently in high dimensions, suggests that a CS system can potentially be implemented beyond the small patches used in sparsity-based image processing.