Yuanlong Li

h-index 29

5 papers

227 citations

Novelty 49%

AI Score 37

Ranked #89,606 of 194,257 authors (top 46%); #5,514 in AI (top 44%)

5 Papers

2.5 | AI | Nov 9, 2022

Deep Explainable Learning with Graph Based Data Assessing and Rule Reasoning

Yuanlong Li, Gaopan Huang, Min Zhou et al.

Learning an explainable classifier often results in a low-accuracy model or a huge rule set, while a deep model is usually better at handling noisy data at scale, but at the cost of results that are hard to explain and weak generalization. To bridge this gap, we propose an end-to-end deep explainable learning approach that combines the noise-handling advantage of deep models with the interpretability of expert rules. Specifically, we propose to learn a deep data assessing model that represents the data as a graph capturing the correlations among different observations, and whose output is used to extract key data features. The key features are then fed into a rule network constructed from predefined noisy expert rules with trainable parameters. As these models are correlated, we propose an end-to-end training framework that uses the rule classification loss to optimize the rule learning model and the data assessing model at the same time. As the rule-based computation is non-differentiable, we propose a gradient linking search module to carry the gradient information from the rule learning model to the data assessing model. The proposed method is tested in an industry production system, where it shows comparable prediction accuracy, much higher generalization stability, and better interpretability than a decent deep ensemble baseline, and much better fitting power than a pure rule-based approach.

5.9 | AR | Jul 15, 2025 | Code

SystolicAttention: Fusing FlashAttention within a Single Systolic Array

Jiawei Lin, Guokai Chen, Yuanlong Li et al.

Transformer models rely heavily on scaled dot-product attention (SDPA), typically implemented using the FlashAttention algorithm. However, current systolic-array-based accelerators face significant challenges when executing FlashAttention. Systolic arrays achieve high utilization primarily for consecutive and large matrix multiplications, whereas FlashAttention requires frequent interleaving of matrix multiplications and softmax operations. The frequent data swaps between matrix multiplications on the systolic array and softmax operations on external units result in low array utilization. Moreover, when these computations run concurrently, the softmax stage contends with matrix multiplication for register file and SRAM ports, further degrading performance. To overcome these limitations, we propose FSA, an enhanced systolic array architecture that enables the FlashAttention algorithm to run entirely within a single systolic array, eliminating the need for external vector units. At the core of FSA is SystolicAttention, a novel scheduling algorithm that maps FlashAttention operations onto systolic arrays with fine-grained, element-wise overlap. This approach significantly improves array utilization while preserving the original floating-point operation order to maintain numerical stability. We implement FSA in synthesizable RTL and evaluate its performance against state-of-the-art commercial accelerators. Our results show that FSA achieves 1.77 and 4.83 times higher attention FLOPs/s utilization compared to AWS Neuron v2 and Google TPUv5e, respectively, with only 12% area overhead.
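The interleaving of matrix multiplications and softmax updates mentioned in this abstract can be illustrated with a minimal pure-Python sketch of blockwise attention with a running (online) softmax, the general idea behind FlashAttention. This is not the FSA design or the SystolicAttention schedule; the function names, the single-query simplification, and the block size are assumptions made for illustration only.

```python
import math


def naive_attention(q, keys, values):
    """Reference scaled dot-product attention for a single query vector."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim_v = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim_v)]


def tiled_attention(q, keys, values, block=2):
    """Same result, computed one key/value block at a time.

    Each iteration alternates a matmul-like step (scores), a softmax-like
    step (running max and sum), and another matmul-like step (probs @ V).
    """
    d = len(q)
    dim_v = len(values[0])
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * dim_v
    for start in range(0, len(keys), block):
        k_blk = keys[start:start + block]
        v_blk = values[start:start + block]
        # Matmul step: scores of the query against this block of keys.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in k_blk]
        # Softmax step: update the running max and rescale earlier partial sums.
        new_max = max(running_max, max(scores))
        scale = math.exp(running_max - new_max)  # 0.0 on the first block
        probs = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * scale + sum(probs)
        # Matmul step: accumulate the unnormalized output for this block.
        acc = [a * scale + sum(p * v[j] for p, v in zip(probs, v_blk))
               for j, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]
```

Calling `tiled_attention` and `naive_attention` on the same inputs should agree up to floating-point rounding. The per-block alternation between score computation and softmax bookkeeping is the pattern the abstract describes as hard to keep on a conventional systolic array.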

1.5 | LG | May 24, 2018

Intelligent Trainer for Model-Based Reinforcement Learning

Yuanlong Li, Linsen Dong, Xin Zhou et al.

Model-based reinforcement learning (MBRL) has been proposed as a promising alternative for tackling the high sampling cost of canonical reinforcement learning (RL), by leveraging a learned model to generate synthesized data for policy training. The MBRL framework, nevertheless, is inherently limited by the convoluted process of jointly learning the control policy and configuring hyper-parameters (e.g., global/local models, real and synthesized data, etc.). The training process can be tedious and prohibitively costly. In this research, we propose a "reinforcement on reinforcement" (RoR) architecture that decomposes the convoluted tasks into two layers of reinforcement learning. The inner layer is the canonical model-based RL training process environment (TPE), which learns the control policy for the underlying system and exposes interfaces to access states, actions and rewards. The outer layer is an RL agent, called the AI trainer, that learns an optimal hyper-parameter configuration for the inner TPE. This decomposition, which we call "train the trainer", provides the flexibility to implement different trainer designs. In our research, we propose and optimize two alternative trainer designs: 1) a uni-head trainer and 2) a multi-head trainer. Our proposed RoR framework is evaluated on five tasks in the OpenAI gym (i.e., Pendulum, Mountain Car, Reacher, Half Cheetah and Swimmer). Compared to three other baseline algorithms, our proposed Train-the-Trainer algorithm has competitive auto-tuning performance, with up to 56% expected sampling cost saving without knowing the best parameter setting in advance. The proposed trainer framework can be easily extended to other cases in which hyper-parameter tuning is costly.

20.6 | AI | Sep 15, 2017

Transforming Cooling Optimization for Green Data Center via Deep Reinforcement Learning

Yuanlong Li, Yonggang Wen, Kyle Guan et al.

The cooling system plays a critical role in a modern data center (DC). Developing an optimal control policy for a DC cooling system is a challenging task. The prevailing approaches often rely on approximate system models built upon knowledge of mechanical cooling and of electrical and thermal management, which are difficult to design and may lead to sub-optimal or unstable performance. In this paper, we propose utilizing the large amount of monitoring data in the DC to optimize the control policy. To do so, we cast the cooling control policy design as an energy cost minimization problem with temperature constraints, and solve it within the emerging deep reinforcement learning (DRL) framework. Specifically, we propose an end-to-end cooling control algorithm (CCA) that is based on the actor-critic framework and an off-policy offline version of the deep deterministic policy gradient (DDPG) algorithm. In the proposed CCA, an evaluation network is trained to predict an energy cost counter penalized by the cooling status of the DC room, and a policy network is trained to predict optimized control settings given the current load and weather information. The proposed algorithm is evaluated on the EnergyPlus simulation platform and on a real data trace collected from the National Super Computing Centre (NSCC) of Singapore. Our results show that the proposed CCA can achieve about 11% cooling cost saving on the simulation platform compared with a manually configured baseline control algorithm. In the trace-based study, we propose a de-underestimation (DUE) validation mechanism, as we cannot directly test the algorithm on a real DC. Even though the results with DUE are conservative, we can still achieve about 15% cooling energy saving on the NSCC data trace if we set the inlet temperature threshold at 26.6 degrees Celsius.
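As a rough illustration of an energy cost penalized by a temperature constraint, the following Python sketch adds a penalty whenever the inlet temperature exceeds a threshold. The abstract does not give the actual cost definition, so the linear penalty form, the function name, and the `penalty_weight` value are all hypothetical; only the 26.6 degree threshold comes from the text.

```python
def penalized_cooling_cost(energy_cost, inlet_temp_c,
                           threshold_c=26.6, penalty_weight=10.0):
    """Hypothetical penalized cost: energy plus a linear overheating penalty.

    energy_cost    : cooling energy cost for one control step (any unit)
    inlet_temp_c   : observed inlet temperature in degrees Celsius
    threshold_c    : temperature limit (26.6 is the value quoted in the abstract)
    penalty_weight : assumed trade-off factor, not taken from the paper
    """
    violation = max(0.0, inlet_temp_c - threshold_c)
    return energy_cost + penalty_weight * violation
```

Under these assumptions, a control setting that saves energy but lets the inlet temperature rise above the threshold is scored worse than its raw energy cost alone would suggest.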

1.5 | NE | Aug 15, 2016

Power Data Classification: A Hybrid of a Novel Local Time Warping and LSTM

Yuanlong Li, Han Hu, Yonggang Wen et al.

In this paper, for the purpose of data centre energy consumption monitoring and analysis, we propose to detect the programs running on a server by classifying the observed power consumption series. The time series classification problem has been extensively studied, with various distance measurements developed; recently, deep learning based sequence models have also proven promising. In this paper, we propose a novel distance measurement and build a time series classification algorithm that hybridizes nearest neighbour and long short-term memory (LSTM) neural networks. More specifically, we first propose a new distance measurement termed Local Time Warping (LTW), which utilizes a user-specified set for local warping and is designed to be non-commutative and not based on dynamic programming. Second, we hybridize 1NN-LTW and LSTM. In particular, we combine the prediction probability vectors of 1NN-LTW and LSTM to determine the labels of the test cases. Finally, using power consumption data from a real data center, we show that the proposed LTW can improve the classification accuracy of DTW from about 84% to 90%. Our experimental results show that the proposed LTW is competitive on our data set compared with existing DTW variants, and that its non-commutative feature is indeed beneficial. We also test a linear version of LTW, which significantly outperforms existing linear-runtime lower bound methods such as LB_Keogh. Furthermore, with the hybrid algorithm, we achieve an accuracy of up to about 93% on the power series classification task. Our research can inspire more studies on time series distance measurement and on hybrids of deep learning models with other traditional models.
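For context, the DTW baseline and the idea of combining two probability vectors can be sketched in a few lines of Python. This is classic dynamic time warping with 1-nearest-neighbour classification, not the paper's LTW (the abstract does not specify it), and the weighted average in `combine_probabilities` is an assumed combination rule; all function names are illustrative.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping distance with absolute-difference cost."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed warping moves.
            cost[i][j] = step + min(cost[i - 1][j],
                                    cost[i][j - 1],
                                    cost[i - 1][j - 1])
    return cost[n][m]


def nearest_neighbor_label(query, train):
    """1NN classification: train is a list of (series, label) pairs."""
    return min(train, key=lambda item: dtw_distance(query, item[0]))[1]


def combine_probabilities(p_a, p_b, weight=0.5):
    """Assumed rule: weighted average of two class-probability vectors.

    Returns the index of the class with the highest combined probability.
    """
    mixed = [weight * x + (1.0 - weight) * y for x, y in zip(p_a, p_b)]
    return max(range(len(mixed)), key=mixed.__getitem__)
```

For example, `dtw_distance([1, 2, 3], [1, 2, 2, 3])` is zero because warping lets the repeated 2 align with a single 2, which is the kind of temporal flexibility that makes DTW-style distances a common baseline for series such as power traces.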