LG Feb 16, 2023
Hardware-aware training for large-scale and diverse deep learning inference workloads using in-memory computing-based accelerators
Malte J. Rasch, Charles Mackin, Manuel Le Gallo et al.
Analog in-memory computing (AIMC) -- a promising approach for energy-efficient acceleration of deep learning workloads -- computes matrix-vector multiplications (MVMs) but only approximately, due to nonidealities that often are non-deterministic or nonlinear. This can adversely impact the achievable deep neural network (DNN) inference accuracy as compared to a conventional floating point (FP) implementation. While retraining has previously been suggested to improve robustness, prior work has explored only a few DNN topologies, using disparate and overly simplified AIMC hardware models. Here, we use hardware-aware (HWA) training to systematically examine the accuracy of AIMC for multiple common artificial intelligence (AI) workloads across multiple DNN topologies, and investigate sensitivity and robustness to a broad set of nonidealities. By introducing a new and highly realistic AIMC crossbar-model, we improve significantly on earlier retraining approaches. We show that many large-scale DNNs of various topologies, including convolutional neural networks (CNNs), recurrent neural networks (RNNs), and transformers, can in fact be successfully retrained to show iso-accuracy on AIMC. Our results further suggest that AIMC nonidealities that add noise to the inputs or outputs, not the weights, have the largest impact on DNN accuracy, and that RNNs are particularly robust to all nonidealities.
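To make the distinction between input, output, and weight noise concrete, here is a minimal sketch of an approximate matrix-vector multiplication. It is not the paper's crossbar model: the three Gaussian noise levels (`in_noise`, `out_noise`, `w_noise`) and the function name `noisy_mvm` are hypothetical knobs chosen for illustration. Hardware-aware training commonly exposes the network to perturbations of this kind in the forward pass.

```python
import random

def noisy_mvm(weights, x, in_noise=0.0, out_noise=0.0, w_noise=0.0, rng=None):
    """Approximate y = W x with additive Gaussian noise.

    in_noise, out_noise, and w_noise are standard deviations of noise added
    to the inputs, outputs, and weights (all hypothetical parameters).
    """
    if rng is None:
        rng = random.Random(0)
    # Perturb the input vector once per MVM.
    noisy_x = [xi + rng.gauss(0.0, in_noise) for xi in x]
    y = []
    for row in weights:
        # Perturb each weight, accumulate, then perturb the output.
        acc = sum((w + rng.gauss(0.0, w_noise)) * xi for w, xi in zip(row, noisy_x))
        y.append(acc + rng.gauss(0.0, out_noise))
    return y

W = [[1.0, 2.0], [3.0, 4.0]]
x = [1.0, 1.0]
exact = noisy_mvm(W, x)                      # all noise levels zero
perturbed = noisy_mvm(W, x, out_noise=0.1)   # output noise only
```

Setting one noise level at a time and comparing against `exact` gives a simple way to probe which noise source a downstream computation is most sensitive to.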
LG Mar 10, 2023
Zone-based Federated Learning for Mobile Sensing Data
Xiaopeng Jiang, Thinh On, NhatHai Phan et al.
Mobile apps, such as mHealth and wellness applications, can benefit from deep learning (DL) models trained with mobile sensing data collected by smart phones or wearable devices. However, there is currently no mobile sensing DL system that simultaneously achieves good model accuracy while adapting to user mobility behavior, scales well as the number of users increases, and protects user data privacy. We propose Zone-based Federated Learning (ZoneFL) to address these requirements. ZoneFL divides the physical space into geographical zones mapped to a mobile-edge-cloud system architecture for good model accuracy and scalability. Each zone has a federated training model, called a zone model, which adapts well to the data and behaviors of users in that zone. Thanks to the FL design, user data privacy is protected during ZoneFL training. We propose two novel zone-based federated training algorithms to optimize zone models to user mobility behavior: Zone Merge and Split (ZMS) and Zone Gradient Diffusion (ZGD). ZMS optimizes zone models by adapting the zone geographical partitions through merging of neighboring zones or splitting of large zones into smaller ones. Unlike ZMS, ZGD maintains fixed zones and optimizes a zone model by incorporating the gradients derived from neighboring zones' data. ZGD uses a self-attention mechanism to dynamically control the impact of one zone on its neighbors. Extensive analysis and experimental results demonstrate that ZoneFL significantly outperforms traditional FL in two models for heart rate prediction and human activity recognition. In addition, we developed a ZoneFL system using Android phones and AWS cloud. The system was used in a heart rate prediction field study with 63 users for 4 months, and we demonstrated the feasibility of ZoneFL in real life.
SY May 12, 2018
Visual Path Tracking Control for Park Scene
Linjiong Zhu, Wenfu Wang, Weijie Yang et al.
Autonomous driving applications are increasingly targeting specific scenes. Park scenes feature low speeds, fixed routes, short connections, and less complex traffic, and are hence well suited to bringing autonomous driving technology into reality. This paper targets the park scene and proposes a visual path tracking lateral control method that uses only one webcam. First, we calculate the distance error and the angle error from camera images, and then use fuzzy logic to fuzzify them into a combined error degree. The PID control algorithm takes this combined error degree as input and outputs a steering wheel angle control command. Fuzzification can tolerate the error introduced by image transformation and lane detection, making PID control more stable. Our experiments in both virtual and real scenes show that our method can follow the path accurately and robustly, even at night. Compared with pure pursuit, our method can make 5-meter turns.
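The pipeline described here (two errors, fuzzified into one combined error degree, fed to a PID controller) can be sketched as follows. The abstract does not give the membership functions, rule table, or gains, so everything below is an assumption: `memberships`, `RULES`, and the gain values are hypothetical placeholders, and both errors are assumed to be normalized to [-1, 1].

```python
def memberships(e):
    """Degrees to which a normalized error e is (negative, zero, positive)."""
    e = max(-1.0, min(1.0, e))
    return (max(0.0, -e), 1.0 - abs(e), max(0.0, e))

# Hypothetical rule table: RULES[i][j] is the combined error degree when the
# distance error falls in fuzzy set i and the angle error in fuzzy set j.
RULES = [[-1.0, -0.5, 0.0],
         [-0.5, 0.0, 0.5],
         [0.0, 0.5, 1.0]]

def combined_error(dist_err, angle_err):
    """Fuzzify both errors and blend the rule outputs by rule strength."""
    md, ma = memberships(dist_err), memberships(angle_err)
    # The three memberships of each input sum to 1, so the rule strengths
    # md[i] * ma[j] already sum to 1 and no extra normalization is needed.
    return sum(md[i] * ma[j] * RULES[i][j] for i in range(3) for j in range(3))

class PID:
    """Textbook discrete PID controller."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, error, dt):
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

pid = PID(kp=0.8, ki=0.05, kd=0.1)  # placeholder gains, not tuned values
steering_command = pid.step(combined_error(0.3, -0.1), dt=0.05)
```

Because the controller sees a single blended degree rather than two raw measurements, small errors in either measurement shift the input only gradually, which is one plausible reading of why fuzzification stabilizes the PID loop.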
LG May 14, 2025 (code available)
Analog Foundation Models
Julian Büchel, Iason Chalas, Giovanni Acampa et al.
Analog in-memory computing (AIMC) is a promising compute paradigm to improve speed and power efficiency of neural network inference beyond the limits of conventional von Neumann-based architectures. However, AIMC introduces fundamental challenges such as noisy computations and strict constraints on input and output quantization. Because of these constraints and imprecisions, off-the-shelf LLMs are not able to achieve 4-bit-level performance when deployed on AIMC-based hardware. While researchers previously investigated recovering this accuracy gap on small, mostly vision-based models, a generic method applicable to LLMs pre-trained on trillions of tokens does not yet exist. In this work, we introduce a general and scalable method to robustly adapt LLMs for execution on noisy, low-precision analog hardware. Our approach enables state-of-the-art models, including Phi-3-mini-4k-instruct and Llama-3.2-1B-Instruct, to retain performance comparable to 4-bit weight, 8-bit activation baselines, despite the presence of analog noise and quantization constraints. Additionally, we show that as a byproduct of our training methodology, analog foundation models can be quantized for inference on low-precision digital hardware. Finally, we show that our models also benefit from test-time compute scaling, showing better scaling behavior than models trained with 4-bit weight and 8-bit static input quantization. Our work bridges the gap between high-capacity LLMs and efficient analog hardware, offering a path toward energy-efficient foundation models. Code is available at https://github.com/IBM/analog-foundation-models.
LG Jun 10, 2025 (code available)
Unifying Block-wise PTQ and Distillation-based QAT for Progressive Quantization toward 2-bit Instruction-Tuned LLMs
Jung Hyun Lee, Seungjae Shin, Vinnam Kim et al.
As the rapid scaling of large language models (LLMs) poses significant challenges for deployment on resource-constrained devices, there is growing interest in extremely low-bit quantization, such as 2-bit. Although prior works have shown that 2-bit large models are Pareto-optimal over their 4-bit smaller counterparts in both accuracy and latency, these advancements have been limited to pre-trained LLMs and have not yet been extended to instruction-tuned models. To bridge this gap, we propose Unified Progressive Quantization (UPQ), a novel progressive quantization framework (FP16$\rightarrow$INT4$\rightarrow$INT2) that unifies block-wise post-training quantization (PTQ) with distillation-based quantization-aware training (Distill-QAT) for INT2 instruction-tuned LLM quantization. UPQ first quantizes FP16 instruction-tuned models to INT4 using block-wise PTQ to significantly reduce the quantization error introduced by subsequent INT2 quantization. Next, UPQ applies Distill-QAT to enable INT2 instruction-tuned LLMs to generate responses consistent with their original FP16 counterparts by minimizing the generalized Jensen-Shannon divergence (JSD) between the two. To the best of our knowledge, we are the first to demonstrate that UPQ can quantize open-source instruction-tuned LLMs to INT2 without relying on proprietary post-training data, while achieving state-of-the-art performance on MMLU and IFEval, two of the most representative benchmarks for evaluating instruction-tuned LLMs.
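For readers unfamiliar with the distillation objective, the sketch below computes a generalized Jensen-Shannon divergence between two discrete distributions. It uses one common definition, JSD_beta(P, Q) = beta * KL(P || M) + (1 - beta) * KL(Q || M) with M = beta * P + (1 - beta) * Q; whether UPQ uses exactly this form, and which distribution plays the role of P, is an assumption not confirmed by the abstract.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def generalized_jsd(p, q, beta=0.5):
    """Generalized JSD with mixing weight beta, which must lie strictly in (0, 1)."""
    if not 0.0 < beta < 1.0:
        raise ValueError("beta must be strictly between 0 and 1")
    # The mixture is positive wherever p or q is, so both KL terms are finite.
    m = [beta * pi + (1.0 - beta) * qi for pi, qi in zip(p, q)]
    return beta * kl(p, m) + (1.0 - beta) * kl(q, m)

teacher = [0.7, 0.2, 0.1]   # e.g. next-token probabilities of the FP16 model
student = [0.5, 0.3, 0.2]   # e.g. next-token probabilities of the INT2 model
loss = generalized_jsd(teacher, student, beta=0.5)
```

Unlike plain KL divergence, this quantity stays bounded even when the two distributions put mass on different tokens, and with `beta=0.5` it is symmetric in its arguments.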
MTRL-SCI Mar 2, 2024
Knowledge-Reuse Transfer Learning Methods in Molecular and Material Science
An Chen, Zhilong Wang, Karl Luigi Loza Vidaurre et al.
Molecules and materials are the foundation for the development of modern advanced industries such as energy storage systems and semiconductor devices. However, traditional trial-and-error methods and theoretical calculations are highly resource-intensive, and extremely long research and development (R&D) periods cannot meet the urgent need for molecules/materials in industrial development. Machine learning (ML) methods based on big data are expected to break this dilemma. However, the difficulty of constructing large-scale datasets of new molecules/materials, due to the high cost of data acquisition and annotation, limits the development of machine learning. Transfer learning lowers the data requirements for model training, which makes it stand out in research addressing data quality issues. In this review, we summarize recent advances in transfer learning related to molecular and materials science. We focus on the application of transfer learning methods to the discovery of advanced molecules/materials, in particular the construction of transfer learning frameworks for different systems and how transfer learning can enhance model performance. In addition, the challenges of transfer learning are discussed.
LG Apr 17, 2025
FedX: Adaptive Model Decomposition and Quantization for IoT Federated Learning
Phung Lai, Xiaopeng Jiang, Hai Phan et al.
Federated Learning (FL) allows collaborative training among multiple devices without data sharing, thus enabling privacy-sensitive applications on mobile or Internet of Things (IoT) devices, such as mobile health and asset tracking. However, designing an FL system with good model utility that works with low computation/communication overhead on heterogeneous, resource-constrained mobile/IoT devices is challenging. To address this problem, this paper proposes FedX, a novel adaptive model decomposition and quantization FL system for IoT. To balance utility with resource constraints on IoT devices, FedX decomposes a global FL model into different sub-networks with adaptive numbers of quantized bits for different devices. The key idea is that a device with fewer resources receives a smaller sub-network for lower overhead but utilizes a larger number of quantized bits for higher model utility, and vice versa. The quantization operations in FedX are done at the server to reduce the computational load on devices. FedX iteratively minimizes the losses in the devices' local data and in the server's public data using quantized sub-networks under a regularization term, and thus it maximizes the benefits of combining FL with model quantization through knowledge sharing among the server and devices in a cost-effective training process. Extensive experiments show that FedX significantly improves quantization times by up to 8.43X, on-device computation time by 1.5X, and total end-to-end training time by 1.36X, compared with baseline FL systems. We guarantee the global model convergence theoretically and validate local model convergence empirically, highlighting FedX's optimization efficiency.
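The trade-off behind assigning different numbers of quantized bits can be illustrated with a plain uniform quantizer. This is a generic sketch, not FedX's quantization scheme: `quantize_uniform` is a hypothetical helper, and how FedX maps devices to sub-network sizes and bit widths is not shown here.

```python
def quantize_uniform(weights, bits):
    """Snap each weight to one of 2**bits evenly spaced levels in [min, max]."""
    if bits < 1:
        raise ValueError("bits must be at least 1")
    lo, hi = min(weights), max(weights)
    if hi == lo:
        return list(weights)  # constant vector: nothing to quantize
    step = (hi - lo) / (2 ** bits - 1)
    # Round to the nearest level index, then map back to a real value.
    return [lo + round((w - lo) / step) * step for w in weights]

def max_error(original, quantized):
    return max(abs(a - b) for a, b in zip(original, quantized))

w = [-1.0, -0.3, 0.2, 0.9]
coarse = quantize_uniform(w, bits=2)   # 4 levels
fine = quantize_uniform(w, bits=8)     # 256 levels
```

Comparing `max_error(w, coarse)` with `max_error(w, fine)` shows how a larger bit budget reduces quantization error, which is the lever the abstract describes pairing with a smaller sub-network on weaker devices.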
LG Nov 17, 2021
FLSys: Toward an Open Ecosystem for Federated Learning Mobile Apps
Xiaopeng Jiang, Han Hu, Vijaya Datta Mayyuri et al.
This article presents the design, implementation, and evaluation of FLSys, a mobile-cloud federated learning (FL) system, which can be a key component for an open ecosystem of FL models and apps. FLSys is designed to work on smart phones with mobile sensing data. It balances model performance with resource consumption, tolerates communication failures, and achieves scalability. In FLSys, different DL models with different FL aggregation methods can be trained and accessed concurrently by different apps. Furthermore, FLSys provides advanced privacy preserving mechanisms and a common API for third-party app developers to access FL models. FLSys adopts a modular design and is implemented in Android and AWS cloud. We co-designed FLSys with a human activity recognition (HAR) model. HAR sensing data was collected in the wild from 100+ college students during a 4-month period. We implemented HAR-Wild, a CNN model tailored to mobile devices, with a data augmentation mechanism to mitigate the problem of non-Independent and Identically Distributed data. A sentiment analysis model is also used to demonstrate that FLSys effectively supports concurrent models. This article reports our experience and lessons learned from conducting extensive experiments using simulations, Android/Linux emulations, and Android phones that demonstrate FLSys achieves good model utility and practical system performance.
NE Mar 1, 2021
Enhancing hierarchical surrogate-assisted evolutionary algorithm for high-dimensional expensive optimization via random projection
Xiaodong Ren, Daofu Guo, Zhigang Ren et al.
By remarkably reducing real fitness evaluations, surrogate-assisted evolutionary algorithms (SAEAs), especially hierarchical SAEAs, have been shown to be effective in solving computationally expensive optimization problems. The success of hierarchical SAEAs stems mainly from the potential benefit of their global surrogate models, known as the "blessing of uncertainty", and the high accuracy of their local models. However, their performance leaves room for improvement on high-dimensional problems, since it remains challenging to build sufficiently accurate local models in the huge solution space. To address this issue, this study proposes a new hierarchical SAEA that trains local surrogate models with the help of the random projection technique. Instead of training in the original high-dimensional solution space, the new algorithm first randomly projects training samples onto a set of low-dimensional subspaces, then trains a surrogate model in each subspace, and finally evaluates candidate solutions by averaging the resulting models. Experimental results on six benchmark functions of 100 and 200 dimensions demonstrate that random projection can significantly improve the accuracy of local surrogate models and that the newly proposed hierarchical SAEA has a clear edge over state-of-the-art SAEAs.
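The three steps named above (project, train one surrogate per subspace, average) can be sketched in a few lines. The surrogate used here is a deliberately simple nearest-neighbor predictor standing in for whatever model the paper trains; the Gaussian projection matrices, the function names, and the parameters `k` and `n_subspaces` are likewise assumptions made for illustration.

```python
import math
import random

def random_projection(d, k, rng):
    """A k x d Gaussian random matrix, scaled by 1/sqrt(k)."""
    return [[rng.gauss(0.0, 1.0) / math.sqrt(k) for _ in range(d)] for _ in range(k)]

def project(P, x):
    """Map a d-dimensional point x into the k-dimensional subspace."""
    return [sum(p * xi for p, xi in zip(row, x)) for row in P]

def nearest_neighbor_model(Z, y):
    """Stand-in surrogate: predict the fitness of the closest training sample."""
    def predict(z):
        best = min(range(len(Z)), key=lambda i: math.dist(Z[i], z))
        return y[best]
    return predict

def fit_ensemble(X, y, k, n_subspaces, seed=0):
    """Train one surrogate per random subspace; predict by averaging them."""
    rng = random.Random(seed)
    d = len(X[0])
    members = []
    for _ in range(n_subspaces):
        P = random_projection(d, k, rng)
        Z = [project(P, x) for x in X]
        members.append((P, nearest_neighbor_model(Z, y)))

    def predict(x):
        return sum(model(project(P, x)) for P, model in members) / len(members)
    return predict

# Toy usage: 10-dimensional samples projected onto five 3-dimensional subspaces.
X = [[0.0] * 10, [1.0] * 10, [2.0] * 10]
y = [0.0, 10.0, 40.0]
predict = fit_ensemble(X, y, k=3, n_subspaces=5)
estimate = predict([0.9] * 10)
```

Each surrogate only ever sees `k`-dimensional inputs, so it is trained in a much smaller space than the original problem, and averaging over several independent projections reduces the dependence on any single one.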
NE Jan 19, 2021
A Surrogate-Assisted Variable Grouping Algorithm for General Large Scale Global Optimization Problems
An Chen, Zhigang Ren, Muyi Wang et al.
Problem decomposition plays a vital role when applying cooperative coevolution (CC) to large scale global optimization problems. However, most learning-based decomposition algorithms either apply only to additively separable problems or face the issue of false separability detections. To address these limitations, this study proposes a novel decomposition algorithm called surrogate-assisted variable grouping (SVG). SVG first designs a general-separability-oriented detection criterion based on whether the optimum of a variable changes with other variables. This criterion is consistent with the separability definition and thus endows SVG with broad applicability and high accuracy. To reduce the fitness evaluation requirement, SVG seeks the optimum of a variable with the help of a surrogate model rather than the original expensive high-dimensional model. Moreover, it converts the variable grouping process into a dynamic-binary-tree search, which facilitates reusing historical separability detection information and thus reducing detection times. To evaluate the performance of SVG, a suite of benchmark functions with up to 2000 dimensions, including additively and non-additively separable ones, was designed. Experimental results on these functions indicate that, compared with six state-of-the-art decomposition algorithms, SVG possesses broader applicability and competitive efficiency. Furthermore, it can significantly enhance the optimization performance of CC.
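The detection criterion (does the optimum of one variable move when another variable changes?) can be illustrated with a small sketch. It replaces SVG's surrogate-based search with a brute-force grid search over the variable, and the names `argmin_1d` and `interacts`, the grid, and the shift size are all hypothetical.

```python
def argmin_1d(f, x, i, grid):
    """Best value of variable i over the grid, with all other variables fixed."""
    best_value, best_fitness = None, float("inf")
    for v in grid:
        trial = list(x)
        trial[i] = v
        fitness = f(trial)
        if fitness < best_fitness:
            best_value, best_fitness = v, fitness
    return best_value

def interacts(f, base, i, j, grid, shift=1.0, tol=1e-9):
    """True if the optimum of variable i moves when variable j is shifted."""
    opt_before = argmin_1d(f, base, i, grid)
    moved = list(base)
    moved[j] += shift
    opt_after = argmin_1d(f, moved, i, grid)
    return abs(opt_before - opt_after) > tol

grid = [-3.0 + 0.5 * k for k in range(13)]  # -3.0, -2.5, ..., 3.0

def additive(x):
    return (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2   # additively separable

def multiplicative(x):
    return (x[0] ** 2 + 1.0) * (x[1] ** 2 + 1.0)   # separable, but not additively

def coupled(x):
    return (x[0] - x[1]) ** 2                      # nonseparable

results = [interacts(f, [0.0, 0.0], 0, 1, grid)
           for f in (additive, multiplicative, coupled)]
```

The `multiplicative` case is the interesting one: the variables are not additively separable, yet the best value of `x[0]` is the same for every `x[1]`, so a criterion based on the optimum treats them as separable.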
NE Apr 5, 2020
An Eigenspace Divide-and-Conquer Approach for Large-Scale Optimization
Zhigang Ren, Yongsheng Liang, Muyi Wang et al.
Divide-and-conquer-based (DC-based) evolutionary algorithms (EAs) have achieved notable success in dealing with large-scale optimization problems (LSOPs). However, the appealing performance of this type of algorithm generally requires a high-precision decomposition of the optimization problem, which is still a challenging task for existing decomposition methods. This study attempts to address the above issue from a different perspective and proposes an eigenspace divide-and-conquer (EDC) approach. Unlike existing DC-based algorithms that perform decomposition and optimization in the original decision space, EDC first establishes an eigenspace by conducting singular value decomposition on a set of high-quality solutions selected from recent generations. Then it transforms the optimization problem into the eigenspace, and thus significantly weakens the dependencies among the corresponding eigenvariables. Accordingly, these eigenvariables can be efficiently grouped by a simple random strategy, and each of the resulting subproblems can be addressed more easily by a traditional EA. To verify the efficiency of EDC, comprehensive experimental studies were conducted on two sets of benchmark functions. Experimental results indicate that EDC is robust to its parameters and scales well with the problem dimension. The comparison with several state-of-the-art algorithms further confirms that EDC is highly competitive and performs better on complicated LSOPs.
CY Nov 28, 2018
A Scoring Method for Driving Safety Credit Using Trajectory Data
Wenfu Wang, Weijie Yang, An Chen et al.
Urban traffic systems worldwide suffer from severe traffic safety problems. Traffic safety is affected by many complex factors and is closely related to the behavior of all drivers in the traffic system. Drivers with aggressive driving behaviors increase the risk of traffic accidents. To manage the safety level of the traffic system, we propose Driving Safety Credit, inspired by credit scores in the financial security field, and design a scoring method using trajectory data and violation records. First, we extract driving habits, aggressive driving behaviors, and traffic violation behaviors from drivers' trajectories and traffic violation records. Next, we train a classification model to filter out irrelevant features. Finally, we score each driver with the selected features. We verify our proposed scoring method using 40 days of traffic simulation and demonstrate its effectiveness.
NE Mar 1, 2018
Niching an Archive-based Gaussian Estimation of Distribution Algorithm via Adaptive Clustering
Yongsheng Liang, Zhigang Ren, Bei Pang et al.
As a model-based evolutionary algorithm, the estimation of distribution algorithm (EDA) possesses unique characteristics and has been widely applied to global optimization. However, the traditional Gaussian EDA (GEDA) may suffer from premature convergence and has a high risk of falling into local optima when dealing with multimodal problems. In this paper, we first attempt to improve the performance of GEDA by utilizing historical solutions and develop a novel archive-based EDA variant. The use of historical solutions not only enhances the search efficiency of EDA to a large extent, but also significantly reduces the population size so that faster convergence can be achieved. Then, the archive-based EDA is further integrated with a novel adaptive clustering strategy for solving multimodal optimization problems. By combining the ability of the clustering strategy to locate different promising areas with the powerful exploitation ability of the archive-based EDA, the resultant algorithm is endowed with a strong capability for finding multiple optima. To verify the efficiency of the proposed algorithm, we tested it on a set of well-known niching benchmark problems and compared it with several state-of-the-art niching algorithms. The experimental results indicate that the proposed algorithm is competitive.
NE Mar 1, 2018
A Global Information Based Adaptive Threshold for Grouping Large Scale Global Optimization Problems
An Chen, Yipeng Zhang, Zhigang Ren et al.
Based on the idea of divide-and-conquer, cooperative coevolution (CC) provides a powerful architecture for large scale global optimization (LSGO) problems, but its efficiency relies heavily on the decomposition strategy. It has been shown that differential grouping (DG) performs well in decomposing LSGO problems by effectively detecting the interactions among decision variables. However, its decomposition accuracy depends heavily on the threshold. To improve the decomposition accuracy of DG, a global information based adaptive threshold setting algorithm (GIAT) is proposed in this paper. On the one hand, by reducing the sensitivity of the DG indicator to roundoff error and to the magnitude of the contribution weights of subcomponents, we propose a new indicator for two variables that is much more sensitive to their interaction. On the other hand, instead of setting the threshold based on only one pair of variables, the threshold is generated from the interaction information of all pairs of variables. Experiments on two sets of LSGO benchmark functions verify the correctness and robustness of the new indicator and GIAT.
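As background for the threshold problem, the sketch below shows the pairwise check that differential grouping is usually described with: perturb variable i with and without a perturbation of variable j, and compare the two fitness changes. This is the baseline DG idea, not GIAT's new indicator or its adaptive threshold; the fixed `threshold`, the step `delta`, and the function names are hypothetical.

```python
def dg_indicator(f, x, i, j, delta=1.0):
    """|delta1 - delta2|: how much the effect of moving x[i] depends on x[j]."""
    a = list(x)
    b = list(x)
    b[i] += delta
    c = list(x)
    c[j] += delta
    d = list(x)
    d[i] += delta
    d[j] += delta
    delta1 = f(b) - f(a)   # effect of moving x[i] at the original x[j]
    delta2 = f(d) - f(c)   # effect of moving x[i] after x[j] has moved
    return abs(delta1 - delta2)

def pair_interacts(f, x, i, j, threshold=1e-6, delta=1.0):
    """Declare an interaction when the indicator exceeds a fixed threshold."""
    return dg_indicator(f, x, i, j, delta) > threshold

def separable(x):
    return x[0] ** 2 + x[1] ** 2

def coupled(x):
    return x[0] * x[1]

origin = [0.0, 0.0]
flags = (pair_interacts(separable, origin, 0, 1),
         pair_interacts(coupled, origin, 0, 1))
```

In exact arithmetic the indicator is zero for additively separable pairs, but with floating-point roundoff and terms of very different magnitude it can be small and nonzero, which is why the choice of threshold matters so much.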
NE Mar 1, 2018
Enhancing Cooperative Coevolution for Large Scale Optimization by Adaptively Constructing Surrogate Models
Bei Pang, Zhigang Ren, Yongsheng Liang et al.
It has been shown that cooperative coevolution (CC) can effectively deal with large scale optimization problems (LSOPs) through a divide-and-conquer strategy. However, its performance is severely restricted by the current context-vector-based sub-solution evaluation method, since this method needs to access the original high-dimensional simulation model when evaluating each sub-solution and thus requires substantial computation resources. To alleviate this issue, this study proposes an adaptive surrogate model assisted CC framework. This framework adaptively constructs surrogate models for different sub-problems by fully considering their characteristics. For the single-dimensional sub-problems obtained through decomposition, sufficiently accurate surrogate models can be obtained and used to find the optimal solutions of the corresponding sub-problems directly. For the nonseparable sub-problems, the surrogate models are employed to evaluate the corresponding sub-solutions, and the original simulation model is only used to reevaluate some good sub-solutions selected by the surrogate models. By these means, the computation cost can be greatly reduced without significantly sacrificing evaluation quality. Empirical studies on IEEE CEC 2010 benchmark functions show that the concrete algorithm based on this framework is able to find much better solutions than conventional CC algorithms and a non-CC algorithm, even with far fewer computation resources.
NE Feb 27, 2018
Surrogate Model Assisted Cooperative Coevolution for Large Scale Optimization
Zhigang Ren, Bei Pang, Yongsheng Liang et al.
It has been shown that cooperative coevolution (CC) can effectively deal with large scale optimization problems (LSOPs) through a divide-and-conquer strategy. However, its performance is severely restricted by the current context-vector-based sub-solution evaluation method, since this method needs to access the original high-dimensional simulation model when evaluating each sub-solution and thus requires substantial computation resources. To alleviate this issue, this study proposes a novel surrogate model assisted cooperative coevolution (SACC) framework. SACC constructs a surrogate model for each sub-problem obtained via decomposition and employs it to evaluate the corresponding sub-solutions. The original simulation model is only used to reevaluate some good sub-solutions selected by the surrogate models, and these truly evaluated sub-solutions are in turn employed to update the surrogate models. By this means, the computation cost can be greatly reduced without significantly sacrificing evaluation quality. To show the efficiency of SACC, this study uses radial basis function (RBF) and success-history based adaptive differential evolution (SHADE) as the surrogate model and optimizer, respectively. RBF and SHADE have been proven effective on small and medium scale problems. This study is the first to scale them up to LSOPs of 1000 dimensions under the SACC framework, where they are tailored to a certain extent to the characteristics of LSOPs and SACC. Empirical studies on IEEE CEC 2010 benchmark functions demonstrate that SACC significantly enhances the evaluation efficiency on sub-solutions, and that even with far fewer computation resources, the resultant RBF-SHADE-SACC algorithm is able to find much better solutions than traditional CC algorithms.
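The evaluation loop described here (screen sub-solutions with a cheap surrogate, spend real evaluations only on the most promising ones, and keep those real evaluations for refitting) can be sketched as follows. The surrogate is passed in as an arbitrary callable rather than an RBF model, refitting is left out, and the function names and the `top_k` parameter are assumptions made for illustration.

```python
def evaluate_in_context(f, context, indices, sub_solution):
    """Real evaluation: substitute the sub-solution into the context vector."""
    full = list(context)
    for idx, value in zip(indices, sub_solution):
        full[idx] = value
    return f(full)

def prescreen_and_reevaluate(candidates, surrogate, f, context, indices, top_k, archive):
    """Rank candidates by the surrogate, then truly evaluate only the top_k."""
    ranked = sorted(candidates, key=surrogate)   # lower predicted fitness first
    promising = ranked[:top_k]
    evaluated = [(c, evaluate_in_context(f, context, indices, c)) for c in promising]
    # Real evaluations are stored so the surrogate can later be refit on them.
    archive.extend(evaluated)
    return min(evaluated, key=lambda pair: pair[1])

def expensive(x):
    """Stand-in for the expensive full-dimensional simulation model."""
    return sum((v - 1.0) ** 2 for v in x)

def cheap_guess(sub):
    """Stand-in surrogate for the sub-problem (deliberately imperfect)."""
    return abs(sub[0])

archive = []
best = prescreen_and_reevaluate(
    candidates=[[-2.0], [0.5], [3.0], [1.0]],
    surrogate=cheap_guess,
    f=expensive,
    context=[0.0, 1.0, 1.0],   # current best full solution
    indices=[0],               # this sub-problem owns variable 0
    top_k=2,
    archive=archive,
)
```

Only `top_k` of the four candidates reach the expensive function, so the number of real evaluations per cycle is set by `top_k` rather than by the population size.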
NE Feb 25, 2018
Enhancing Gaussian Estimation of Distribution Algorithm by Exploiting Evolution Direction with Archive
Yongsheng Liang, Zhigang Ren, Xianghua Yao et al.
As a typical model-based evolutionary algorithm (EA), the estimation of distribution algorithm (EDA) possesses unique characteristics and has been widely applied to global optimization. However, the commonly used Gaussian EDA (GEDA) usually suffers from premature convergence, which severely limits its search efficiency. This study first systematically analyzes the reasons for the deficiency of the traditional GEDA, then tries to enhance its performance by exploiting its evolution direction, and finally develops a new GEDA variant named EDA2. Instead of only utilizing some good solutions produced in the current generation when estimating the Gaussian model, EDA2 preserves a certain number of high-quality solutions generated in previous generations in an archive and takes advantage of these historical solutions to assist in estimating the covariance matrix of the Gaussian model. By this means, the evolution direction information hidden in the archive is naturally integrated into the estimated model, which in turn can guide EDA2 towards more promising solution regions. Moreover, the new estimation method significantly reduces the population size of EDA2, since it needs fewer individuals in the current population for model estimation. As a result, fast convergence can be achieved. To verify the efficiency of EDA2, we tested it on a variety of benchmark functions and compared it with several state-of-the-art EAs, including IPOP-CMAES, AMaLGaM, three high-powered DE algorithms, and a new PSO algorithm. The experimental results demonstrate that EDA2 is efficient and competitive.