Ben Liang

h-index28

17papers

132citations

Novelty57%

AI Score54

Ranked #28,981 of 201,326 authors (top 14%)#6,703 in LG (top 16%)

17 Papers

SYJun 3, 2016

Distributed Real-Time Power Balancing in Renewable-Integrated Power Grids with Storage and Flexible Loads

Sun Sun, Min Dong, Ben Liang

The large-scale integration of renewable generation directly affects the reliability of power grids. We investigate the problem of power balancing in a general renewable-integrated power grid with storage and flexible loads. We consider a power grid that is supplied by one conventional generator (CG) and multiple renewable generators (RGs) each co-located with storage,and is connected with external markets. An aggregator operates the power grid to maintain power balance between supply and demand. Aiming at minimizing the long-term system cost, we first propose a real-time centralized power balancing solution, taking into account the uncertainty of the renewable generation, loads, and energy prices. We then provide a distributed implementation algorithm, significantly reducing both computational burden and communication overhead. We demonstrate that our proposed algorithm is asymptotically optimal as the storage capacity increases and the CG ramping constraint loosens. Moreover, the distributed implementation enjoys a fast convergence rate, and enables each RG and the aggregator to make their own decisions. Simulation shows that our proposed algorithm outperforms alternatives and can achieve near-optimal performance for a wide range of storage capacity.

CVMay 15Code

3DTMDet: A Dual-Path Synergy Network of Transformer and SSM for 3D Object Detection in Point Clouds

Bingwen Qiu, Yuan Liu, Junqi Bai et al.

A fundamental challenge in point cloud object detection lies in the conflict between the extreme sparsity of distant points and the need for remote context understanding. The existing methods typically use 1D serialization to expand the receptive field, which inevitably discards already scarce local geometric details and reduces detection of distant and small objects. To address this issue, we propose 3DTMDet, a novel detection network that synergistically combines state space models (Mamba) with Transformers. The core idea is to utilize SSM's linear complexity and advantages in long sequence modeling to effectively capture global interactions between sparse and distant points, while using Transformer modules with local attention to encode fine-grained geometric structures in local point sets, preserving accurate shape information. We propose the 3D Hybrid Mamba Transformer (3DHMT) block, which uses an SSM-Attention-SSM pipeline to balance global context understanding and local detail preservation, effectively alleviating the tension between receptive field enlargement and geometric preservation in remote detection. In addition, we introduced a voxel generation block inspired by LiDAR physics, which diffuses features along the sensor observation direction to reconstruct the complete object structure of occlusion and distant areas. Extensive experiments conducted on the KITTI and ONCE datasets have shown that 3DTMDet outperforms state-of-the-art detectors. The code is available at https://github.com/QiuBingwen/3DTMDet.

LGMar 19, 2023

Generative Adversarial Classification Network with Application to Network Traffic Classification

Rozhina Ghanavi, Ben Liang, Ali Tizghadam

Large datasets in machine learning often contain missing data, which necessitates the imputation of missing data values. In this work, we are motivated by network traffic classification, where traditional data imputation methods do not perform well. We recognize that no existing method directly accounts for classification accuracy during data imputation. Therefore, we propose a joint data imputation and data classification method, termed generative adversarial classification network (GACN), whose architecture contains a generator network, a discriminator network, and a classification network, which are iteratively optimized toward the ultimate objective of classification accuracy. For the scenario where some data samples are unlabeled, we further propose an extension termed semi-supervised GACN (SSGACN), which is able to use the partially labeled data to improve classification accuracy. We conduct experiments with real-world network traffic data traces, which demonstrate that GACN and SS-GACN can more accurately impute data features that are more important for classification, and they outperform existing methods in terms of classification accuracy.

LGSep 23, 2024

Novel Gradient Sparsification Algorithm via Bayesian Inference

Ali Bereyhi, Ben Liang, Gary Boudreau et al.

Error accumulation is an essential component of the Top-$k$ sparsification method in distributed gradient descent. It implicitly scales the learning rate and prevents the slow-down of lateral movement, but it can also deteriorate convergence. This paper proposes a novel sparsification algorithm called regularized Top-$k$ (RegTop-$k$) that controls the learning rate scaling of error accumulation. The algorithm is developed by looking at the gradient sparsification as an inference problem and determining a Bayesian optimal sparsification mask via maximum-a-posteriori estimation. It utilizes past aggregated gradients to evaluate posterior statistics, based on which it prioritizes the local gradient entries. Numerical experiments with ResNet-18 on CIFAR-10 show that at $0.1\%$ sparsification, RegTop-$k$ achieves about $8\%$ higher accuracy than standard Top-$k$.

LGOct 4, 2023

Improving Knowledge Distillation with Teacher's Explanation

Sayantan Chowdhury, Ben Liang, Ali Tizghadam et al.

Knowledge distillation (KD) improves the performance of a low-complexity student model with the help of a more powerful teacher. The teacher in KD is a black-box model, imparting knowledge to the student only through its predictions. This limits the amount of transferred knowledge. In this work, we introduce a novel Knowledge Explaining Distillation (KED) framework, which allows the student to learn not only from the teacher's predictions but also from the teacher's explanations. We propose a class of superfeature-explaining teachers that provide explanation over groups of features, along with the corresponding student model. We also present a method for constructing the superfeatures. We then extend KED to reduce complexity in convolutional neural networks, to allow augmentation with hidden-representation distillation methods, and to work with a limited amount of training data using chimeric sets. Our experiments over a variety of datasets show that KED students can substantially outperform KD students of similar complexity.

CVSep 27, 2025Code

FMC-DETR: Frequency-Decoupled Multi-Domain Coordination for Aerial-View Object Detection

Ben Liang, Yuan Liu, Bingwen Qiu et al.

Aerial-view object detection is a critical technology for real-world applications such as natural resource monitoring, traffic management, and UAV-based search and rescue. Detecting tiny objects in high-resolution aerial imagery presents a long-standing challenge due to their limited visual cues and the difficulty of modeling global context in complex scenes. Existing methods are often hampered by delayed contextual fusion and inadequate non-linear modeling, failing to effectively use global information to refine shallow features and thus encountering a performance bottleneck. To address these challenges, we propose FMC-DETR, a novel framework with frequency-decoupled fusion for aerial-view object detection. First, we introduce the Wavelet Kolmogorov-Arnold Transformer (WeKat) backbone, which applies cascaded wavelet transforms to enhance global low-frequency context perception in shallow features while preserving fine-grained details, and employs Kolmogorov-Arnold networks to achieve adaptive non-linear modeling of multi-scale dependencies. Next, a lightweight Cross-stage Partial Fusion (CPF) module reduces redundancy and improves multi-scale feature interaction. Finally, we introduce the Multi-Domain Feature Coordination (MDFC) module, which unifies spatial, frequency, and structural priors to to balance detail preservation and global enhancement. Extensive experiments on benchmark aerial-view datasets demonstrate that FMC-DETR achieves state-of-the-art performance with fewer parameters. On the challenging VisDrone dataset, our model achieves improvements of 6.5% AP and 8.2% AP50 over the baseline, highlighting its effectiveness in tiny object detection. The code can be accessed at https://github.com/bloomingvision/FMC-DETR.

LGJan 13, 2025

Universal Training of Neural Networks to Achieve Bayes Optimal Classification Accuracy

Mohammadreza Tavasoli Naeini, Ali Bereyhi, Morteza Noshad et al.

This work invokes the notion of $f$-divergence to introduce a novel upper bound on the Bayes error rate of a general classification task. We show that the proposed bound can be computed by sampling from the output of a parameterized model. Using this practical interpretation, we introduce the Bayes optimal learning threshold (BOLT) loss whose minimization enforces a classification model to achieve the Bayes error rate. We validate the proposed loss for image and text classification tasks, considering MNIST, Fashion-MNIST, CIFAR-10, and IMDb datasets. Numerical experiments demonstrate that models trained with BOLT achieve performance on par with or exceeding that of cross-entropy, particularly on challenging datasets. This highlights the potential of BOLT in improving generalization.

LGOct 29, 2025

BOLT-GAN: Bayes-Optimal Loss for Stable GAN Training

Mohammadreza Tavasoli Naeini, Ali Bereyhi, Morteza Noshad et al.

We introduce BOLT-GAN, a simple yet effective modification of the WGAN framework inspired by the Bayes Optimal Learning Threshold (BOLT). We show that with a Lipschitz continuous discriminator, BOLT-GAN implicitly minimizes a different metric distance than the Earth Mover (Wasserstein) distance and achieves better training stability. Empirical evaluations on four standard image generation benchmarks (CIFAR-10, CelebA-64, LSUN Bedroom-64, and LSUN Church-64) show that BOLT-GAN consistently outperforms WGAN, achieving 10-60% lower Frechet Inception Distance (FID). Our results suggest that BOLT is a broadly applicable principle for enhancing GAN training.

LGOct 8, 2025

Coupled Data and Measurement Space Dynamics for Enhanced Diffusion Posterior Sampling

Shayan Mohajer Hamidi, En-Hui Yang, Ben Liang

Inverse problems, where the goal is to recover an unknown signal from noisy or incomplete measurements, are central to applications in medical imaging, remote sensing, and computational biology. Diffusion models have recently emerged as powerful priors for solving such problems. However, existing methods either rely on projection-based techniques that enforce measurement consistency through heuristic updates, or they approximate the likelihood $p(\boldsymbol{y} \mid \boldsymbol{x})$, often resulting in artifacts and instability under complex or high-noise conditions. To address these limitations, we propose a novel framework called \emph{coupled data and measurement space diffusion posterior sampling} (C-DPS), which eliminates the need for constraint tuning or likelihood approximation. C-DPS introduces a forward stochastic process in the measurement space $\{\boldsymbol{y}_t\}$, evolving in parallel with the data-space diffusion $\{\boldsymbol{x}_t\}$, which enables the derivation of a closed-form posterior $p(\boldsymbol{x}_{t-1} \mid \boldsymbol{x}_t, \boldsymbol{y}_{t-1})$. This coupling allows for accurate and recursive sampling based on a well-defined posterior distribution. Empirical results demonstrate that C-DPS consistently outperforms existing baselines, both qualitatively and quantitatively, across multiple inverse problem benchmarks.

LGJan 10, 2025

Regularized Top-$k$: A Bayesian Framework for Gradient Sparsification

Ali Bereyhi, Ben Liang, Gary Boudreau et al.

Error accumulation is effective for gradient sparsification in distributed settings: initially-unselected gradient entries are eventually selected as their accumulated error exceeds a certain level. The accumulation essentially behaves as a scaling of the learning rate for the selected entries. Although this property prevents the slow-down of lateral movements in distributed gradient descent, it can deteriorate convergence in some settings. This work proposes a novel sparsification scheme that controls the learning rate scaling of error accumulation. The development of this scheme follows two major steps: first, gradient sparsification is formulated as an inverse probability (inference) problem, and the Bayesian optimal sparsification mask is derived as a maximum-a-posteriori estimator. Using the prior distribution inherited from Top-$k$, we derive a new sparsification algorithm which can be interpreted as a regularized form of Top-$k$. We call this algorithm regularized Top-$k$ (RegTop-$k$). It utilizes past aggregated gradients to evaluate posterior statistics of the next aggregation. It then prioritizes the local accumulated gradient entries based on these posterior statistics. We validate our derivation through numerical experiments. In distributed linear regression, it is observed that while Top-$k$ remains at a fixed distance from the global optimum, RegTop-$k$ converges to the global optimum at significantly higher compression ratios. We further demonstrate the generalization of this observation by employing RegTop-$k$ in distributed training of ResNet-18 on CIFAR-10, where it noticeably outperforms Top-$k$.

LGFeb 25, 2022

Dynamic Regret of Online Mirror Descent for Relatively Smooth Convex Cost Functions

Nima Eshraghi, Ben Liang

The performance of online convex optimization algorithms in a dynamic environment is often expressed in terms of the dynamic regret, which measures the decision maker's performance against a sequence of time-varying comparators. In the analysis of the dynamic regret, prior works often assume Lipschitz continuity or uniform smoothness of the cost functions. However, there are many important cost functions in practice that do not satisfy these conditions. In such cases, prior analyses are not applicable and fail to guarantee the optimization performance. In this letter, we show that it is possible to bound the dynamic regret, even when neither Lipschitz continuity nor uniform smoothness is present. We adopt the notion of relative smoothness with respect to some user-defined regularization function, which is a much milder requirement on the cost functions. We first show that under relative smoothness, the dynamic regret has an upper bound based on the path length and functional variation. We then show that with an additional condition of relatively strong convexity, the dynamic regret can be bounded by the path length and gradient variation. These regret bounds provide performance guarantees to a wide variety of online optimization problems that arise in different application domains. Finally, we present numerical experiments that demonstrate the advantage of adopting a regularization function under which the cost functions are relatively smooth.

ITDec 7, 2021

Gradient and Projection Free Distributed Online Min-Max Resource Optimization

Jingrong Wang, Ben Liang

We consider distributed online min-max resource allocation with a set of parallel agents and a parameter server. Our goal is to minimize the pointwise maximum over a set of time-varying and decreasing cost functions, without a priori information about these functions. We propose a novel online algorithm, termed Distributed Online resource Re-Allocation (DORA), where non-stragglers learn to relinquish resource and share resource with stragglers. A notable feature of DORA is that it does not require gradient calculation or projection operation, unlike most existing online optimization strategies. This allows it to substantially reduce the computation overhead in large-scale and distributed networks. We analyze the worst-case performance of DORA and derive an upper bound on its dynamic regret for non-convex functions. We further consider an application to the bandwidth allocation problem in distributed online machine learning. Our numerical study demonstrates the efficacy of the proposed solution and its performance advantage over gradient- and/or projection-based resource allocation algorithms in reducing wall-clock time.

ITMay 9, 2021

Delay-Tolerant Constrained OCO with Application to Network Resource Allocation

Juncheng Wang, Ben Liang, Min Dong et al.

We consider online convex optimization (OCO) with multi-slot feedback delay, where an agent makes a sequence of online decisions to minimize the accumulation of time-varying convex loss functions, subject to short-term and long-term constraints that are possibly time-varying. The current convex loss function and the long-term constraint function are revealed to the agent only after the decision is made, and they may be delayed for multiple time slots. Existing work on OCO under this general setting has focused on the static regret, which measures the gap of losses between the online decision sequence and an offline benchmark that is fixed over time. In this work, we consider both the static regret and the more practically meaningful dynamic regret, where the benchmark is a time-varying sequence of per-slot optimizers. We propose an efficient algorithm, termed Delay-Tolerant Constrained-OCO (DTC-OCO), which uses a novel constraint penalty with double regularization to tackle the asynchrony between information feedback and decision updates. We derive upper bounds on its dynamic regret, static regret, and constraint violation, proving them to be sublinear under mild conditions. We further apply DTC-OCO to a general network resource allocation problem, which arises in many systems such as data networks and cloud computing. Simulation results demonstrate substantial performance gain of DTC-OCO over the known best alternative.

NIApr 30, 2021

Flow-Packet Hybrid Traffic Classification for Class-Aware Network Routing

Sayantan Chowdhury, Ben Liang, Ali Tizghadam et al.

Network traffic classification using machine learning techniques has been widely studied. Most existing schemes classify entire traffic flows, but there are major limitations to their practicality. At a network router, the packets need to be processed with minimum delay, so the classifier cannot wait until the end of the flow to make a decision. Furthermore, a complicated machine learning algorithm can be too computationally expensive to implement inside the router. In this paper, we introduce flow-packet hybrid traffic classification (FPHTC), where the router makes a decision per packet based on a routing policy that is designed through transferring the learned knowledge from a flow-based classifier residing outside the router. We analyze the generalization bound of FPHTC and show its advantage over regular packet-based traffic classification. We present experimental results using a real-world traffic dataset to illustrate the classification performance of FPHTC. We show that it is robust toward traffic pattern changes and can be deployed with limited computational resource.

LGFeb 26, 2021

On the Generalization of Stochastic Gradient Descent with Momentum

Ali Ramezani-Kebrya, Ashish Khisti, Ben Liang

While momentum-based methods, in conjunction with stochastic gradient descent (SGD), are widely used when training machine learning models, there is little theoretical understanding on the generalization error of such methods. In this work, we first show that there exists a convex loss function for which algorithmic stability fails to establish generalization guarantees when SGD with standard heavy-ball momentum (SGDM) is run for multiple epochs. Then, for smooth Lipschitz loss functions, we analyze a modified momentum-based update rule, i.e., SGD with early momentum (SGDEM), and show that it admits an upper-bound on the generalization error. Thus, our results show that machine learning models can be trained for multiple epochs of SGDEM with a guarantee for generalization. Finally, for the special case of strongly convex loss functions, we find a range of momentum such that multiple epochs of standard SGDM, as a special form of SGDEM, also generalizes. Extending our results on generalization, we also develop an upper-bound on the expected true risk, in terms of the number of training steps, the size of the training set, and the momentum parameter. Experimental evaluations verify the consistency between the numerical results and our theoretical bounds and the effectiveness of SGDEM for smooth Lipschitz loss functions.

LGSep 12, 2018

On the Generalization of Stochastic Gradient Descent with Momentum

Ali Ramezani-Kebrya, Kimon Antonakopoulos, Volkan Cevher et al.

While momentum-based accelerated variants of stochastic gradient descent (SGD) are widely used when training machine learning models, there is little theoretical understanding on the generalization error of such methods. In this work, we first show that there exists a convex loss function for which the stability gap for multiple epochs of SGD with standard heavy-ball momentum (SGDM) becomes unbounded. Then, for smooth Lipschitz loss functions, we analyze a modified momentum-based update rule, i.e., SGD with early momentum (SGDEM) under a broad range of step-sizes, and show that it can train machine learning models for multiple epochs with a guarantee for generalization. Finally, for the special case of strongly convex loss functions, we find a range of momentum such that multiple epochs of standard SGDM, as a special form of SGDEM, also generalizes. Extending our results on generalization, we also develop an upper bound on the expected true risk, in terms of the number of training steps, sample size, and momentum. Our experimental evaluations verify the consistency between the numerical results and our theoretical bounds. SGDEM improves the generalization error of SGDM when training ResNet-18 on ImageNet in practical distributed settings.

MMMay 22, 2013

Optimal Frame Transmission for Scalable Video with Hierarchical Prediction Structure

Saied Mehdian, Ben Liang

An optimal frame transmission scheme is presented for streaming scalable video over a link with limited capacity. The objective is to select a transmission sequence of frames and their transmission schedule such that the overall video quality is maximized. The problem is solved for two general classes of hierarchical prediction structures, which include as a special case the popular dyadic structure. Based on a new characterization of the interdependence among frames in terms of trees, structural properties of an optimal transmission schedule are derived. These properties lead to the development of a jointly optimal frame selection and scheduling algorithm, which has computational complexity that is quadratic in the number of frames. Simulation results show that the optimal scheme substantially outperforms three existing alternatives.