Hui Guan

LG
h-index13
36papers
347citations
Novelty52%
AI Score57

36 Papers

LGMar 10, 2022Code
A Tree-Structured Multi-Task Model Recommender

Lijun Zhang, Xiao Liu, Hui Guan

Tree-structured multi-task architectures have been employed to jointly tackle multiple vision tasks in the context of multi-task learning (MTL). The major challenge is to determine where to branch out for each task given a backbone model to optimize for both task accuracy and computation efficiency. To address the challenge, this paper proposes a recommender that, given a set of tasks and a convolutional neural network-based backbone model, automatically suggests tree-structured multi-task architectures that could achieve a high task performance while meeting a user-specified computation budget without performing model training. Extensive evaluations on popular MTL benchmarks show that the recommended architectures could achieve competitive task accuracy and computation efficiency compared with state-of-the-art MTL methods. Our tree-structured multi-task model recommender is open-sourced and available at https://github.com/zhanglijun95/TreeMTL.

DCMar 24, 2023
GSplit: Scaling Graph Neural Network Training on Large Graphs via Split-Parallelism

Sandeep Polisetty, Juelin Liu, Kobi Falus et al.

Graph neural networks (GNNs), an emerging class of machine learning models for graphs, have gained popularity for their superior performance in various graph analytical tasks. Mini-batch training is commonly used to train GNNs on large graphs, and data parallelism is the standard approach to scale mini-batch training across multiple GPUs. Data parallel approaches contain redundant work as subgraphs sampled by different GPUs contain significant overlap. To address this issue, we introduce a hybrid parallel mini-batch training paradigm called split parallelism. Split parallelism avoids redundant work by splitting the sampling, loading, and training of each mini-batch across multiple GPUs. Split parallelism, however, introduces communication overheads that can be more than the savings from removing redundant work. We further present a lightweight partitioning algorithm that probabilistically minimizes these overheads. We implement split parallelism in GSplit and show that it outperforms state-of-the-art mini-batch training systems like DGL, Quiver, and $P^3$.

LGFeb 1, 2023
Automatically Marginalized MCMC in Probabilistic Programming

Jinlin Lai, Javier Burroni, Hui Guan et al.

Hamiltonian Monte Carlo (HMC) is a powerful algorithm to sample latent variables from Bayesian models. The advent of probabilistic programming languages (PPLs) frees users from writing inference algorithms and lets users focus on modeling. However, many models are difficult for HMC to solve directly, and often require tricks like model reparameterization. We are motivated by the fact that many of those models could be simplified by marginalization. We propose to use automatic marginalization as part of the sampling process using HMC in a graphical model extracted from a PPL, which substantially improves sampling from real-world hierarchical models.

LGApr 13, 2023
Structured Pruning for Multi-Task Deep Neural Networks

Siddhant Garg, Lijun Zhang, Hui Guan · amazon-science

Although multi-task deep neural network (DNN) models have computation and storage benefits over individual single-task DNN models, they can be further optimized via model compression. Numerous structured pruning methods are already developed that can readily achieve speedups in single-task models, but the pruning of multi-task networks has not yet been extensively studied. In this work, we investigate the effectiveness of structured pruning on multi-task models. We use an existing single-task filter pruning criterion and also introduce an MTL-based filter pruning criterion for estimating the filter importance scores. We prune the model using an iterative pruning strategy with both pruning methods. We show that, with careful hyper-parameter tuning, architectures obtained from different pruning methods do not have significant differences in their performances across tasks when the number of parameters is similar. We also show that iterative structure pruning may not be the best way to achieve a well-performing pruned model because, at extreme pruning levels, there is a high drop in performance across all tasks. But when the same models are randomly initialized and re-trained, they show better results.

LGNov 28, 2022
Flow: Per-Instance Personalized Federated Learning Through Dynamic Routing

Kunjal Panchal, Sunav Choudhary, Nisarg Parikh et al.

Personalization in Federated Learning (FL) aims to modify a collaboratively trained global model according to each client. Current approaches to personalization in FL are at a coarse granularity, i.e. all the input instances of a client use the same personalized model. This ignores the fact that some instances are more accurately handled by the global model due to better generalizability. To address this challenge, this work proposes Flow, a fine-grained stateless personalized FL approach. Flow creates dynamic personalized models by learning a routing mechanism that determines whether an input instance prefers the local parameters or its global counterpart. Thus, Flow introduces per-instance routing in addition to leveraging per-client personalization to improve accuracies at each client. Further, Flow is stateless which makes it unnecessary for a client to retain its personalized state across FL rounds. This makes Flow practical for large-scale FL settings and friendly to newly joined clients. Evaluations on Stackoverflow, Reddit, and EMNIST datasets demonstrate the superiority in prediction accuracy of Flow over state-of-the-art non-personalized and only per-client personalized approaches to FL.

LGOct 8, 2022
Uplifting Message Passing Neural Network with Graph Original Information

Xiao Liu, Lijun Zhang, Hui Guan

Message passing neural networks (MPNNs) learn the representation of graph-structured data based on graph original information, including node features and graph structures, and have shown astonishing improvement in node classification tasks. However, the expressive power of MPNNs is upper bounded by the first-order Weisfeiler-Leman test and its accuracy still has room for improvement. This work studies how to improve MPNNs' expressiveness and generalizability by fully exploiting graph original information both theoretically and empirically. It further proposes a new GNN model called INGNN (INformation-enhanced Graph Neural Network) that leverages the insights to improve node classification performance. Extensive experiments on both synthetic and real datasets demonstrate the superiority (average rank 1.78) of our INGNN compared with state-of-the-art methods.

LGMay 25, 2022
Improving Subgraph Representation Learning via Multi-View Augmentation

Yili Shen, Xiao Liu, Cheng-Wei Ju et al.

Subgraph representation learning based on Graph Neural Network (GNN) has exhibited broad applications in scientific advancements, such as predictions of molecular structure-property relationships and collective cellular function. In particular, graph augmentation techniques have shown promising results in improving graph-based and node-based classification tasks. Still, they have rarely been explored in the existing GNN-based subgraph representation learning studies. In this study, we develop a novel multi-view augmentation mechanism to improve subgraph representation learning models and thus the accuracy of downstream prediction tasks. Our augmentation technique creates multiple variants of subgraphs and embeds these variants into the original graph to achieve highly improved training efficiency, scalability, and accuracy. Benchmark experiments on several real-world biological and physiological datasets demonstrate the superiority of our proposed multi-view augmentation techniques in subgraph representation learning.

LGMay 1
EventADL: Open-Box Anomaly Detection and Localization Framework for Events in Cloud-Based Service Systems

Luan Pham, Victor Nicolet, Joey Dodds et al.

Anomaly detection and localization (ADL) is critical for maintaining reliability and availability in cloud systems. Recent ADL developments focus on metric and log data, leaving event data unexplored. To address this gap, we propose EventADL, the first open-box event-based ADL framework for cloud-based service systems. To motivate the design of our framework, we conduct a systematic analysis on 520 real-world incidents, and provide insights into how anomalies and their root causes manifest through event data. EventADL has three phases: offline training, online anomaly detection, and root cause localization. During the training phase, EventADL first learns Event Semantic Patterns (ESPs), which capture normal interactions between system entities using historical event data, and then learns Event Frequency Patterns (EFPs), which capture the normal frequency of known ESPs. In the online anomaly detection phase, any data in the event stream that deviates significantly from either pattern is identified as anomalous. For localization, EventADL constructs an Intervention Graph that models the relationships between recent system interactions and the detected anomalies for automatic root cause localization. The framework is designed to operate efficiently with unlabeled data and to produce interpretable anomalies with their corresponding root causes. Our evaluation on three real cloud service systems and two real-world incidents demonstrates that EventADL outperforms existing methods, achieving F1-scores of at least 90% for anomaly detection and 100% top-3 accuracy in root cause localization.

CVJan 8, 2024Code
Attack-Resilient Image Watermarking Using Stable Diffusion

Lijun Zhang, Xiao Liu, Antoni Viros Martin et al.

Watermarking images is critical for tracking image provenance and proving ownership. With the advent of generative models, such as stable diffusion, that can create fake but realistic images, watermarking has become particularly important to make human-created images reliably identifiable. Unfortunately, the very same stable diffusion technology can remove watermarks injected using existing methods. To address this problem, we present ZoDiac, which uses a pre-trained stable diffusion model to inject a watermark into the trainable latent space, resulting in watermarks that can be reliably detected in the latent vector even when attacked. We evaluate ZoDiac on three benchmarks, MS-COCO, DiffusionDB, and WikiArt, and find that ZoDiac is robust against state-of-the-art watermark attacks, with a watermark detection rate above 98% and a false positive rate below 6.4%, outperforming state-of-the-art watermarking methods. We hypothesize that the reciprocating denoising process in diffusion models may inherently enhance the robustness of the watermark when faced with strong attacks and validate the hypothesis. Our research demonstrates that stable diffusion is a promising approach to robust watermarking, able to withstand even stable-diffusion--based attack methods. ZoDiac is open-sourced and available at https://github.com/zhanglijun95/ZoDiac.

LGMay 24, 2024Code
Thinking Forward: Memory-Efficient Federated Finetuning of Language Models

Kunjal Panchal, Nisarg Parikh, Sunav Choudhary et al.

Finetuning large language models (LLMs) in federated learning (FL) settings has become increasingly important as it allows resource-constrained devices to finetune a model using private data. However, finetuning LLMs using backpropagation requires excessive memory (especially from intermediate activations) for resource-constrained devices. While Forward-mode Auto-Differentiation (AD) can significantly reduce memory footprint from activations, we observe that directly applying it to LLM finetuning results in slow convergence and poor accuracy. In this paper, we introduce Spry, an FL algorithm that splits trainable weights of an LLM among participating clients, such that each client computes gradients using forward-mode AD that are closer estimations of the true gradients. Spry achieves a low memory footprint, high accuracy, and fast convergence. We formally prove that the global gradients in Spry are unbiased estimators of true global gradients for homogeneous data distributions across clients, while heterogeneity increases bias of the estimates. We also derive Spry's convergence rate, showing that the gradients decrease inversely proportional to the number of FL rounds, indicating the convergence up to the limits of heterogeneity. Empirically, Spry reduces the memory footprint during training by 1.4-7.1x in contrast to backpropagation, while reaching comparable accuracy, across a wide range of language tasks, models, and FL settings. Spry reduces the convergence time by 1.2-20.3x and achieves 5.2-13.5% higher accuracy against zero-order methods. When finetuning Llama2-7B with LoRA, compared to the peak memory consumption of 33.9GB of backpropagation, Spry only consumes 6.2GB of peak memory. For OPT13B, the reduction is from 76.5GB to 10.8GB. Spry makes feasible previously impossible FL deployments on commodity edge devices. Our source code is available at https://github.com/Astuary/Spry.

LGDec 18, 2025
Atom: Efficient On-Device Video-Language Pipelines Through Modular Reuse

Kunjal Panchal, Saayan Mitra, Somdeb Sarkhel et al.

Recent advances in video-language models have enabled powerful applications like video retrieval, captioning, and assembly. However, executing such multi-stage pipelines efficiently on mobile devices remains challenging due to redundant model loads and fragmented execution. We introduce Atom, an on-device system that restructures video-language pipelines for fast and efficient execution. Atom decomposes a billion-parameter model into reusable modules, such as the visual encoder and language decoder, and reuses them across subtasks like captioning, reasoning, and indexing. This reuse-centric design eliminates repeated model loading and enables parallel execution, reducing end-to-end latency without sacrificing performance. On commodity smartphones, Atom achieves 27--33% faster execution compared to non-reuse baselines, with only marginal performance drop ($\leq$ 2.3 Recall@1 in retrieval, $\leq$ 1.5 CIDEr in captioning). These results position Atom as a practical, scalable approach for efficient video-language understanding on edge devices.

CVSep 12, 2024
WildFit: Autonomous In-situ Model Adaptation for Resource-Constrained IoT Systems

Mohammad Mehdi Rastikerdar, Jin Huang, Hui Guan et al.

Resource-constrained IoT devices increasingly rely on deep learning models, however, these models experience significant accuracy drops due to domain shifts when encountering variations in lighting, weather, and seasonal conditions. While cloud-based retraining can address this issue, many IoT deployments operate with limited connectivity and energy constraints, making traditional fine-tuning approaches impractical. We explore this challenge through the lens of wildlife ecology, where camera traps must maintain accurate species classification across changing seasons, weather, and habitats without reliable connectivity. We introduce WildFit, an autonomous in-situ adaptation framework that leverages the key insight that background scenes change more frequently than the visual characteristics of monitored species. WildFit combines background-aware synthesis to generate training samples on-device with drift-aware fine-tuning that triggers model updates only when necessary to conserve resources. Our background-aware synthesis surpasses efficient baselines by 7.3% and diffusion models by 3.0% while being orders of magnitude faster, our drift-aware fine-tuning achieves Pareto optimality with 50% fewer updates and 1.5% higher accuracy, and the end-to-end system outperforms domain adaptation approaches by 20-35% while consuming only 11.2 Wh over 37 days-enabling battery-powered deployment.

SEApr 1
Ambig-IaC: Multi-level Disambiguation for Interactive Cloud Infrastructure-as-Code Synthesis

Zhenning Yang, Kaden Gruizenga, Tongyuan Miao et al.

The scale and complexity of modern cloud infrastructure have made Infrastructure-as-Code (IaC) essential for managing deployments. While large Language models (LLMs) are increasingly being used to generate IaC configurations from natural language, user requests are often underspecified. Unlike traditional code generation, IaC configurations cannot be executed cheaply or iteratively repaired, forcing the LLMs into an almost one-shot regime. We observe that ambiguity in IaC exhibits a tractable compositional structure: configurations decompose into three hierarchical axes (resources, topology, attributes) where higher-level decisions constrain lower-level ones. We propose a training-free, disagreement-driven framework that generates diverse candidate specifications, identifies structural disagreements across these axes, ranks them by informativeness, and produces targeted clarification questions that progressively narrow the configuration space. We introduce \textsc{Ambig-IaC}, a benchmark of 300 validated IaC tasks with ambiguous prompts, and an evaluation framework based on graph edit distance and embedding similarity. Our method outperforms the strongest baseline, achieving relative improvements of +18.4\% and +25.4\% on structure and attribute evaluations, respectively.

CVAug 7, 2024
AdapMTL: Adaptive Pruning Framework for Multitask Learning Model

Mingcan Xiang, Steven Jiaxun Tang, Qizheng Yang et al.

In the domain of multimedia and multimodal processing, the efficient handling of diverse data streams such as images, video, and sensor data is paramount. Model compression and multitask learning (MTL) are crucial in this field, offering the potential to address the resource-intensive demands of processing and interpreting multiple forms of media simultaneously. However, effectively compressing a multitask model presents significant challenges due to the complexities of balancing sparsity allocation and accuracy performance across multiple tasks. To tackle these challenges, we propose AdapMTL, an adaptive pruning framework for MTL models. AdapMTL leverages multiple learnable soft thresholds independently assigned to the shared backbone and the task-specific heads to capture the nuances in different components' sensitivity to pruning. During training, it co-optimizes the soft thresholds and MTL model weights to automatically determine the suitable sparsity level at each component to achieve both high task accuracy and high overall sparsity. It further incorporates an adaptive weighting mechanism that dynamically adjusts the importance of task-specific losses based on each task's robustness to pruning. We demonstrate the effectiveness of AdapMTL through comprehensive experiments on popular multitask datasets, namely NYU-v2 and Tiny-Taskonomy, with different architectures, showcasing superior performance compared to state-of-the-art pruning methods.

CVOct 29, 2023
Efficient IoT Inference via Context-Awareness

Mohammad Mehdi Rastikerdar, Jin Huang, Shiwei Fang et al.

While existing strategies to execute deep learning-based classification on low-power platforms assume the models are trained on all classes of interest, this paper posits that adopting context-awareness i.e. narrowing down a classification task to the current deployment context consisting of only recent inference queries can substantially enhance performance in resource-constrained environments. We propose a new paradigm, CACTUS, for scalable and efficient context-aware classification where a micro-classifier recognizes a small set of classes relevant to the current context and, when context change happens (e.g., a new class comes into the scene), rapidly switches to another suitable micro-classifier. CACTUS features several innovations, including optimizing the training cost of context-aware classifiers, enabling on-the-fly context-aware switching between classifiers, and balancing context switching costs and performance gains via simple yet effective switching policies. We show that CACTUS achieves significant benefits in accuracy, latency, and compute budget across a range of datasets and IoT platforms.

LGJun 27, 2025Code
The Cost of Avoiding Backpropagation

Kunjal Panchal, Sunav Choudhary, Yuriy Brun et al.

Forward-mode automatic differentiation (FmAD) and zero-order (ZO) optimization have been proposed as memory-efficient alternatives to backpropagation (BP) for gradient computation, especially in low-resource settings. However, their practical benefits remain unclear due to two key gaps: a lack of comparison against memory-efficient BP variants, such as activation checkpointing, and a lack of a unified theoretical analysis. This work presents a comprehensive theoretical and empirical comparison of BP, FmAD, and ZO methods. Our theoretical analysis shows that while FmAD, and ZO can reduce memory usage, they incur significant costs in accuracy, convergence speed, and computation compared to BP with checkpointing. These drawbacks worsen with larger models or constrained perturbation budgets. Empirical experiments on large language and vision-language models show that BP with checkpointing outperforms FmAD and ZO variants, including those enhanced with variance reduction, achieving up to 31.1% higher accuracy, 34.8% faster convergence, and 3.8x fewer computations at comparable memory usage. Our results highlight fundamental limitations of FmAD and ZO, and reaffirm BP with checkpointing as the most effective strategy for model training under memory-constrained settings. Our code is available at https://github.com/Astuary/The_Cost_of_Avoiding_Backpropagation.

LGMay 25, 2025Code
Communication-Efficient Multi-Device Inference Acceleration for Transformer Models

Xiao Liu, Lijun Zhang, Deepak Ganesan et al.

Transformer models power many AI applications but suffer from high inference latency, limiting their use in real-time settings. Multi-device inference can reduce latency by parallelizing computation. Yet, existing methods require high inter-device bandwidth, making them impractical for bandwidth-constrained environments. We propose ASTRA, a communication-efficient framework that accelerates Transformer inference through a novel integration of sequence parallelism and a Mixed-Precision Attention mechanism designed to minimize inter-device communication. ASTRA compresses non-local token embeddings via vector quantization and preserves task accuracy through two optimizations, Noise-Augmented Quantization and Distributed Class Tokens. Experiments on ViT and GPT2 across vision and NLP tasks show that ASTRA achieves up to 2.64X speedups over single-device inference and up to 15.25X speedups over state-of-the-art multi-device inferences, while operating under bandwidths as low as 10 Mbps. ASTRA is open-sourced at https://github.com/xl1990/Astra.

LGOct 25, 2021Code
AutoMTL: A Programming Framework for Automating Efficient Multi-Task Learning

Lijun Zhang, Xiao Liu, Hui Guan

Multi-task learning (MTL) jointly learns a set of tasks by sharing parameters among tasks. It is a promising approach for reducing storage costs while improving task accuracy for many computer vision tasks. The effective adoption of MTL faces two main challenges. The first challenge is to determine what parameters to share across tasks to optimize for both memory efficiency and task accuracy. The second challenge is to automatically apply MTL algorithms to an arbitrary CNN backbone without requiring time-consuming manual re-implementation and significant domain expertise. This paper addresses the challenges by developing the first programming framework AutoMTL that automates efficient MTL model development for vision tasks. AutoMTL takes as inputs an arbitrary backbone convolutional neural network (CNN) and a set of tasks to learn, and automatically produces a multi-task model that achieves high accuracy and small memory footprint simultaneously. Experiments on three popular MTL benchmarks (CityScapes, NYUv2, Tiny-Taskonomy) demonstrate the effectiveness of AutoMTL over state-of-the-art approaches as well as the generalizability of AutoMTL across CNNs. AutoMTL is open-sourced and available at https://github.com/zhanglijun95/AutoMTL.

CVDec 15, 2024
AI-Driven Innovations in Volumetric Video Streaming: A Review

Erfan Entezami, Hui Guan

Recent efforts to enhance immersive and interactive user experiences have driven the development of volumetric video, a form of 3D content that enables 6 DoF. Unlike traditional 2D content, volumetric content can be represented in various ways, such as point clouds, meshes, or neural representations. However, due to its complex structure and large amounts of data size, deploying this new form of 3D data presents significant challenges in transmission and rendering. These challenges have hindered the widespread adoption of volumetric video in daily applications. In recent years, researchers have proposed various AI-driven techniques to address these challenges and improve the efficiency and quality of volumetric content streaming. This paper provides a comprehensive overview of recent advances in AI-driven approaches to facilitate volumetric content streaming. Through this review, we aim to offer insights into the current state-of-the-art and suggest potential future directions for advancing the deployment of volumetric video streaming in real-world applications.

CHEM-PHApr 10
Transferable FB-GNN-MBE Framework for Potential Energy Surfaces: Data-Adaptive Transfer Learning in Deep Learned Many-Body Expansion Theory

Siqi Chen, Zhiqiang Wang, Yili Shen et al.

Mechanistic understanding and rational design of complex chemical systems depend on fast and accurate predictions of electronic structures beyond individual building blocks. However, if the system exceeds hundreds of atoms, first-principles quantum mechanical (QM) modeling becomes impractical. In this study, we developed FB-GNN-MBE by integrating a fragment-based graph neural network (FB-GNN) into the many-body expansion (MBE) theory and demonstrated its capacity to reproduce first-principles potential energy surfaces (PES) for hierarchically structured systems with manageable accuracy, complexity, and interpretability. Specifically, we divided the entire system into basic building blocks (fragments), evaluated their one-fragment energies using a QM model, and addressed many-fragment interactions using the structure-property relationships trained by FB-GNNs. Our investigation shows that FB-GNN-MBE achieves chemical accuracy in predicting two-body (2B) and three-body (3B) energies across water, phenol, and mixture benchmarks, as well as the one-dimensional dissociation curves of water and phenol dimers. To transfer the success of FB-GNN-MBE across various systems with minimal computational costs and data demands, we developed and validated a teacher-student learning protocol. A heavy-weight FB-GNN trained on a mixed-density water cluster ensemble (teacher) distills its learned knowledge and passes it to a light-weight GNN (student), which is later fine-tuned on a uniform-density (H2O)21 cluster ensemble. This transfer learning strategy resulted in efficient and accurate prediction of 2B and 3B energies for variously sized water clusters without retraining. Our transferable FB-GNN-MBE framework outperformed conventional non-FB-GNN-based models and showed high practicality for large-scale molecular simulations.

LGJun 21, 2025
Reimagining Parameter Space Exploration with Diffusion Models

Lijun Zhang, Xiao Liu, Hui Guan

Adapting neural networks to new tasks typically requires task-specific fine-tuning, which is time-consuming and reliant on labeled data. We explore a generative alternative that produces task-specific parameters directly from task identity, eliminating the need for task-specific training. To this end, we propose using diffusion models to learn the underlying structure of effective task-specific parameter space and synthesize parameters on demand. Once trained, the task-conditioned diffusion model can generate specialized weights directly from task identifiers. We evaluate this approach across three scenarios: generating parameters for a single seen task, for multiple seen tasks, and for entirely unseen tasks. Experiments show that diffusion models can generate accurate task-specific parameters and support multi-task interpolation when parameter subspaces are well-structured, but fail to generalize to unseen tasks, highlighting both the potential and limitations of this generative solution.

SEOct 23, 2025
Automated Cloud Infrastructure-as-Code Reconciliation with AI Agents

Zhenning Yang, Hui Guan, Victor Nicolet et al.

Cloud infrastructure is managed through a mix of interfaces -- traditionally, cloud consoles, command-line interfaces (CLI), and SDKs are the tools of choice. Recently, Infrastructure-as-Code/IaC frameworks (e.g., Terraform) have quickly gained popularity. Unlike conventional tools, IaC~frameworks encode the infrastructure in a "source-of-truth" configuration. They are capable of automatically carrying out modifications to the cloud -- deploying, updating, or destroying resources -- to bring the actual infrastructure into alignment with the IaC configuration. However, when IaC is used alongside consoles, CLIs, or SDKs, it loses visibility into external changes, causing infrastructure drift, where the configuration becomes outdated, and later IaC operations may undo valid updates or trigger errors. We present NSync, an automated system for IaC reconciliation that propagates out-of-band changes back into the IaC program. Our key insight is that infrastructure changes eventually all occur via cloud API invocations -- the lowest layer for cloud management operations. NSync gleans insights from API traces to detect drift (i.e., non-IaC changes) and reconcile it (i.e., update the IaC configuration to capture the changes). It employs an agentic architecture that leverages LLMs to infer high-level intents from noisy API sequences, synthesize targeted IaC updates using specialized tools, and continually improve through a self-evolving knowledge base of past reconciliations. We further introduce a novel evaluation pipeline for injecting realistic drifts into cloud infrastructure and assessing reconciliation performance. Experiments across five real-world Terraform projects and 372 drift scenarios show that NSync outperforms the baseline both in terms of accuracy (from 0.71 to 0.97 pass@3) and token efficiency (1.47$\times$ improvement).

NEAug 11, 2025
Adaptive Spiking with Plasticity for Energy Aware Neuromorphic Systems

Eduardo Calle-Ortiz, Hui Guan, Deepak Ganesan et al.

This paper presents ASPEN, a novel energy-aware technique for neuromorphic systems that could unleash the future of intelligent, always-on, ultra-low-power, and low-burden wearables. Our main research objectives are to explore the feasibility of neuromorphic computing for wearables, identify open research directions, and demonstrate the feasibility of developing an adaptive spiking technique for energy-aware computation, which can be game-changing for resource-constrained devices in always-on applications. As neuromorphic computing systems operate based on spike events, their energy consumption is closely related to spiking activity, i.e., each spike incurs computational and power costs; consequently, minimizing the number of spikes is a critical strategy for operating under constrained energy budgets. To support this goal, ASPEN utilizes stochastic perturbations to the neuronal threshold during training to not only enhance the network's robustness across varying thresholds, which can be controlled at inference time, but also act as a regularizer that improves generalization, reduces spiking activity, and enables energy control without the need for complex retraining or pruning. More specifically, ASPEN adaptively adjusts intrinsic neuronal parameters as a lightweight and scalable technique for dynamic energy control without reconfiguring the entire model. Our evaluation on neuromorphic emulator and hardware shows that ASPEN significantly reduces spike counts and energy consumption while maintaining accuracy comparable to state-of-the-art methods.

CVNov 8, 2024
Aligned Vector Quantization for Edge-Cloud Collabrative Vision-Language Models

Xiao Liu, Lijun Zhang, Deepak Ganesan et al.

Vision Language Models (VLMs) are central to Visual Question Answering (VQA) systems and are typically deployed in the cloud due to their high computational demands. However, this cloud-only approach underutilizes edge computational resources and requires significant bandwidth for transmitting raw images. In this paper, we introduce an edge-cloud collaborative VQA system, called LLaVA-AlignedVQ, which features a novel Aligned Vector Quantization algorithm (AlignedVQ) that efficiently compress intermediate features without compromising accuracy to support partitioned execution. Our experiments demonstrate that LLaVA-AlignedVQ achieves approximately 1365x compression rate of intermediate features, reducing data transmission overhead by 96.8% compared to transmitting JPEG90-compressed images to the cloud. LLaVA-AlignedVQ achieves an inference speedup of 2-15x while maintaining high accuracy, remaining within -2.23% to +1.6% of the original model's accuracy performance across eight VQA datasets, compared to the cloud-only solution.

LGOct 21, 2024
Understanding and Alleviating Memory Consumption in RLHF for LLMs

Jin Zhou, Hanmei Yang, Steven et al.

Fine-tuning with Reinforcement Learning with Human Feedback (RLHF) is essential for aligning large language models (LLMs). However, RLHF often encounters significant memory challenges. This study is the first to examine memory usage in the RLHF context, exploring various memory management strategies and unveiling the reasons behind excessive memory consumption. Additionally, we introduce a simple yet effective approach that substantially reduces the memory required for RLHF fine-tuning.

DCJun 12, 2024
ProTrain: Efficient LLM Training via Memory-Aware Techniques

Hanmei Yang, Jin Zhou, Yao Fu et al.

It is extremely memory-hungry to train Large Language Models (LLM). To solve this problem, existing work exploits the combination of CPU and GPU for the training process, such as ZeRO-Offload. Such a technique largely democratizes billion-scale model training, making it possible to train with few consumer graphics cards. However, based on our observation, existing frameworks often provide coarse-grained memory management and require experienced experts in configuration tuning, leading to suboptimal hardware utilization and performance. This paper proposes ProTrain, a novel training system that intelligently balances memory usage and performance by coordinating memory, computation, and IO. ProTrain achieves adaptive memory management through Chunk-Based Model State Management and Block-Wise Activation Management, guided by a Memory-Aware Runtime Profiler without user intervention. ProTrain does not change the training algorithm and thus does not compromise accuracy. Experiments show that ProTrain improves training throughput by 1.43$\times$ to 2.71$\times$ compared to the SOTA training systems.

LGJun 1, 2024
Graph Neural Network Training Systems: A Performance Comparison of Full-Graph and Mini-Batch

Saurabh Bajaj, Hojae Son, Juelin Liu et al.

Graph Neural Networks (GNNs) have gained significant attention in recent years due to their ability to learn representations of graph-structured data. Two common methods for training GNNs are mini-batch training and full-graph training. Since these two methods require different training pipelines and systems optimizations, two separate classes of GNN training systems emerged, each tailored for one method. Works that introduce systems belonging to a particular category predominantly compare them with other systems within the same category, offering limited or no comparison with systems from the other category. Some prior work also justifies its focus on one specific training method by arguing that it achieves higher accuracy than the alternative. The literature, however, has incomplete and contradictory evidence in this regard. In this paper, we provide a comprehensive empirical comparison of representative full-graph and mini-batch GNN training systems. We find that the mini-batch training systems consistently converge faster than the full-graph training ones across multiple datasets, GNN models, and system configurations. We also find that mini-batch training techniques converge to similar to or often higher accuracy values than full-graph training ones, showing that mini-batch sampling is not necessarily detrimental to accuracy. Our work highlights the importance of comparing systems across different classes, using time-to-accuracy rather than epoch time for performance comparison, and selecting appropriate hyperparameters for each training method separately.

LGMay 20, 2023
Attacking All Tasks at Once Using Adversarial Examples in Multi-Task Learning

Lijun Zhang, Xiao Liu, Kaleel Mahmood et al.

Visual content understanding frequently relies on multi-task models to extract robust representations of a single visual input for multiple downstream tasks. However, in comparison to extensively studied single-task models, the adversarial robustness of multi-task models has received significantly less attention and many questions remain unclear: 1) How robust are multi-task models to single task adversarial attacks, 2) Can adversarial attacks be designed to simultaneously attack all tasks in a multi-task model, and 3) How does parameter sharing across tasks affect multi-task model robustness to adversarial attacks? This paper aims to answer these questions through careful analysis and rigorous experimentation. First, we analyze the inherent drawbacks of two commonly-used adaptations of single-task white-box attacks in attacking multi-task models. We then propose a novel attack framework, Dynamic Gradient Balancing Attack (DGBA). Our framework poses the problem of attacking all tasks in a multi-task model as an optimization problem that can be efficiently solved through integer linear programming. Extensive evaluation on two popular MTL benchmarks, NYUv2 and Tiny-Taxonomy, demonstrates the effectiveness of DGBA compared to baselines in attacking both clean and adversarially trained multi-task models. Our results also reveal a fundamental trade-off between improving task accuracy via parameter sharing across tasks and undermining model robustness due to increased attack transferability from parameter sharing.

CVNov 19, 2021
Toward Compact Parameter Representations for Architecture-Agnostic Neural Network Compression

Yuezhou Sun, Wenlong Zhao, Lijun Zhang et al.

This paper investigates deep neural network (DNN) compression from the perspective of compactly representing and storing trained parameters. We explore the previously overlooked opportunity of cross-layer architecture-agnostic representation sharing for DNN parameters. To do this, we decouple feedforward parameters from DNN architectures and leverage additive quantization, an extreme lossy compression method invented for image descriptors, to compactly represent the parameters. The representations are then finetuned on task objectives to improve task accuracy. We conduct extensive experiments on MobileNet-v2, VGG-11, ResNet-50, Feature Pyramid Networks, and pruned DNNs trained for classification, detection, and segmentation tasks. The conceptually simple scheme consistently outperforms iterative unstructured pruning. Applied to ResNet-50 with 76.1% top-1 accuracy on the ILSVRC12 classification challenge, it achieves a $7.2\times$ compression ratio with no accuracy loss and a $15.3\times$ compression ratio at 74.79% accuracy. Further analyses suggest that representation sharing can frequently happen across network layers and that learning shared representations for an entire DNN can achieve better accuracy at the same compression ratio than compressing the model as multiple separate parts. We release PyTorch code to facilitate DNN deployment on resource-constrained devices and spur future research on efficient representations and storage of DNN parameters.

AINov 18, 2021
COMET: A Novel Memory-Efficient Deep Learning Training Framework by Using Error-Bounded Lossy Compression

Sian Jin, Chengming Zhang, Xintong Jiang et al.

Training wide and deep neural networks (DNNs) require large amounts of storage resources such as memory because the intermediate activation data must be saved in the memory during forward propagation and then restored for backward propagation. However, state-of-the-art accelerators such as GPUs are only equipped with very limited memory capacities due to hardware design constraints, which significantly limits the maximum batch size and hence performance speedup when training large-scale DNNs. Traditional memory saving techniques either suffer from performance overhead or are constrained by limited interconnect bandwidth or specific interconnect technology. In this paper, we propose a novel memory-efficient CNN training framework (called COMET) that leverages error-bounded lossy compression to significantly reduce the memory requirement for training, to allow training larger models or to accelerate training. Different from the state-of-the-art solutions that adopt image-based lossy compressors (such as JPEG) to compress the activation data, our framework purposely adopts error-bounded lossy compression with a strict error-controlling mechanism. Specifically, we perform a theoretical analysis on the compression error propagation from the altered activation data to the gradients, and empirically investigate the impact of altered gradients over the training process. Based on these analyses, we optimize the error-bounded lossy compression and propose an adaptive error-bound control scheme for activation data compression. We evaluate our design against state-of-the-art solutions with five widely-adopted CNNs and ImageNet dataset. Experiments demonstrate that our proposed framework can significantly reduce the training memory consumption by up to 13.5X over the baseline training and 1.8X over another state-of-the-art compression-based framework, respectively, with little or no accuracy loss.

LGJul 23, 2021
Rethinking Hard-Parameter Sharing in Multi-Domain Learning

Lijun Zhang, Qizheng Yang, Xiao Liu et al.

Hard parameter sharing in multi-domain learning (MDL) allows domains to share some of the model parameters to reduce storage cost while improving prediction accuracy. One common sharing practice is to share the bottom layers of a deep neural network among domains while using separate top layers for each domain. In this work, we revisit this common practice via an empirical study on image classification tasks from a diverse set of visual domains and make two surprising observations. (1) Using separate bottom-layer parameters could achieve significantly better performance than the common practice and this phenomenon holds with different experimental settings. (2) A multi-domain model with a small proportion of domain-specific parameters from bottom layers can achieve competitive performance with independent models trained on each domain separately. Our observations suggest that people adopt the new strategy of using separate bottom-layer parameters as a stronger baseline for model design in MDL.

LGMay 5, 2021
Scalable Graph Neural Network Training: The Case for Sampling

Marco Serafini, Hui Guan

Graph Neural Networks (GNNs) are a new and increasingly popular family of deep neural network architectures to perform learning on graphs. Training them efficiently is challenging due to the irregular nature of graph data. The problem becomes even more challenging when scaling to large graphs that exceed the capacity of single devices. Standard approaches to distributed DNN training, such as data and model parallelism, do not directly apply to GNNs. Instead, two different approaches have emerged in the literature: whole-graph and sample-based training. In this paper, we review and compare the two approaches. Scalability is challenging with both approaches, but we make a case that research should focus on sample-based training since it is a more promising approach. Finally, we review recent systems supporting sample-based training.

CVDec 16, 2020
SID-NISM: A Self-supervised Low-light Image Enhancement Framework

Lijun Zhang, Xiao Liu, Erik Learned-Miller et al.

When capturing images in low-light conditions, the images often suffer from low visibility, which not only degrades the visual aesthetics of images, but also significantly degenerates the performance of many computer vision algorithms. In this paper, we propose a self-supervised low-light image enhancement framework (SID-NISM), which consists of two components, a Self-supervised Image Decomposition Network (SID-Net) and a Nonlinear Illumination Saturation Mapping function (NISM). As a self-supervised network, SID-Net could decompose the given low-light image into its reflectance, illumination and noise directly without any prior training or reference image, which distinguishes it from existing supervised-learning methods greatly. Then, the decomposed illumination map will be enhanced by NISM. Having the restored illumination map, the enhancement can be achieved accordingly. Experiments on several public challenging low-light image datasets reveal that the images enhanced by SID-NISM are more natural and have less unexpected artifacts.

LGNov 5, 2019
Post-Training 4-bit Quantization on Embedding Tables

Hui Guan, Andrey Malevich, Jiyan Yang et al.

Continuous representations have been widely adopted in recommender systems where a large number of entities are represented using embedding vectors. As the cardinality of the entities increases, the embedding components can easily contain millions of parameters and become the bottleneck in both storage and inference due to large memory consumption. This work focuses on post-training 4-bit quantization on the continuous embeddings. We propose row-wise uniform quantization with greedy search and codebook-based quantization that consistently outperforms state-of-the-art quantization approaches on reducing accuracy degradation. We deploy our uniform quantization technique on a production model in Facebook and demonstrate that it can reduce the model size to only 13.89% of the single-precision version while the model quality stays neutral.

LGOct 31, 2019
In-Place Zero-Space Memory Protection for CNN

Hui Guan, Lin Ning, Zhen Lin et al.

Convolutional Neural Networks (CNN) are being actively explored for safety-critical applications such as autonomous vehicles and aerospace, where it is essential to ensure the reliability of inference results in the presence of possible memory faults. Traditional methods such as error correction codes (ECC) and Triple Modular Redundancy (TMR) are CNN-oblivious and incur substantial memory overhead and energy cost. This paper introduces in-place zero-space ECC assisted with a new training scheme weight distribution-oriented training. The new method provides the first known zero space cost memory protection for CNNs without compromising the reliability offered by traditional ECC.

IRJan 18, 2017
First Study on Data Readiness Level

Hui Guan, Thanos Gentimis, Hamid Krim et al.

We introduce the idea of Data Readiness Level (DRL) to measure the relative richness of data to answer specific questions often encountered by data scientists. We first approach the problem in its full generality explaining its desired mathematical properties and applications and then we propose and study two DRL metrics. Specifically, we define DRL as a function of at least four properties of data: Noisiness, Believability, Relevance, and Coherence. The information-theoretic based metrics, Cosine Similarity and Document Disparity, are proposed as indicators of Relevance and Coherence for a piece of data. The proposed metrics are validated through a text-based experiment using Twitter data.