LGSep 29, 2025Code
EOE: Evolutionary Optimization of Experts for Training Language ModelsYingshi Chen
This paper presents an evolutionary framework for the training of large language models(LLM). The models are divided into several experts(sub-networks), which have the same structure but different parameter values. Only one expert is trained at each step. After the classical AdamW optimization, some evolutionary operators(crossover, PSO, and mutation) act on the tensor weights between the current expert and the best expert. So current expert would learn the experience of best expert. The direction of best expert would help current expert's loss decrease faster. Finally, only save the weight of the best expert. Experiments show that best expert would achieve nearly the same accuracy as the full model. This would greatly reduce the size of the model for inference. Since only one expert is trained at each step, the training needs much less memory and has much higher throughput. Experiments show that the throughput would accelerate more than ten times! Our source code is available. It's a pure c++/cu framework, which is suitable for easy deployment on PCs and edge computing devices.
LGOct 2, 2020Code
Attention augmented differentiable forest for tabular dataYingshi Chen
Differentiable forest is an ensemble of decision trees with full differentiability. Its simple tree structure is easy to use and explain. With full differentiability, it would be trained in the end-to-end learning framework with gradient-based optimization method. In this paper, we propose tree attention block(TAB) in the framework of differentiable forest. TAB block has two operations, squeeze and regulate. The squeeze operation would extract the characteristic of each tree. The regulate operation would learn nonlinear relations between these trees. So TAB block would learn the importance of each tree and adjust its weight to improve accuracy. Our experiment on large tabular dataset shows attention augmented differentiable forest would get comparable accuracy with gradient boosted decision trees(GBDT), which is the state-of-the-art algorithm for tabular datasets. And on some datasets, our model has higher accuracy than best GBDT libs (LightGBM, Catboost, and XGBoost). Differentiable forest model supports batch training and batch size is much smaller than the size of training set. So on larger data sets, its memory usage is much lower than GBDT model. The source codes are available at https://github.com/closest-git/QuantumForest.
LGFeb 29, 2020Code
Deep differentiable forest with sparse attention for the tabular dataYingshi Chen
We present a general architecture of deep differentiable forest and its sparse attention mechanism. The differentiable forest has the advantages of both trees and neural networks. Its structure is a simple binary tree, easy to use and understand. It has full differentiability and all variables are learnable parameters. We would train it by the gradient-based optimization method, which shows great power in the training of deep CNN. We find and analyze the attention mechanism in the differentiable forest. That is, each decision depends on only a few important features, and others are irrelevant. The attention is always sparse. Based on this observation, we improve its sparsity by data-aware initialization. We use the attribute importance to initialize the attention weight. Then the learned weight is much sparse than that from random initialization. Our experiment on some large tabular dataset shows differentiable forest has higher accuracy than GBDT, which is the state of art algorithm for tabular datasets. The source codes are available at https://github.com/closest-git/QuantumForest
LGJan 26, 2020Code
LiteMORT: A memory efficient gradient boosting tree system on adaptive compact distributionsYingshi Chen
Gradient boosted decision trees (GBDT) is the leading algorithm for many commercial and academic data applications. We give a deep analysis of this algorithm, especially the histogram technique, which is a basis for the regulized distribution with compact support. We present three new modifications. 1) Share memory technique to reduce memory usage. In many cases, it only need the data source itself and no extra memory. 2) Implicit merging for "merge overflow problem"."merge overflow" means that merge some small datasets to huge datasets, which are too huge to be solved. By implicit merging, we just need the original small datasets to train the GBDT model. 3) Adaptive resize algorithm of histogram bins to improve accuracy. Experiments on two large Kaggle competitions verified our methods. They use much less memory than LightGBM and have higher accuracy. We have implemented these algorithms in an open-source package LiteMORT. The source codes are available at https://github.com/closest-git/LiteMORT
LGJan 6, 2020Code
Express Wavenet -- a low parameter optical neural network with random shift wavelet patternYingshi Chen
Express Wavenet is an improved optical diffractive neural network. At each layer, it uses wavelet-like pattern to modulate the phase of optical waves. For input image with n2 pixels, express wavenet reduce parameter number from O(n2) to O(n). Only need one percent of the parameters, and the accuracy is still very high. In the MNIST dataset, it only needs 1229 parameters to get accuracy of 92%, while the standard optical network needs 125440 parameters. The random shift wavelets show the characteristics of optical network more vividly. Especially the vanishing gradient phenomenon in the training process. We present a modified expressway structure for this problem. Experiments verified the effect of random shift wavelet and expressway structure. Our work shows optical diffractive network would use much fewer parameters than other neural networks. The source codes are available at https://github.com/closest-git/ONNet.
LGDec 23, 2019Code
An optical diffractive deep neural network with multiple frequency-channelsYingshi Chen, Jinfeng Zhu
Diffractive deep neural network (DNNet) is a novel machine learning framework on the modulation of optical transmission. Diffractive network would get predictions at the speed of light. It's pure passive architecture, no additional power consumption. We improved the accuracy of diffractive network with optical waves at different frequency. Each layers have multiple frequency-channels (optical distributions at different frequency). These channels are merged at the output plane to get final output. The experiment in the fasion-MNIST and EMNIST datasets showed multiple frequency-channels would increase the accuracy a lot. We also give detailed analysis to show the difference between DNNet and MLP. The modulation process in DNNet is actually optical activation function. We develop an open source package ONNet. The source codes are available at https://github.com/closest-git/ONNet.
LGDec 7, 2019Code
A novel guided deep learning algorithm to design low-cost SPP filmsYingshi Chen, Jinfeng Zhu
The design of surface plasmon polaritons (SPP) films is an ill-posed inverse problem. There are many-to-one correspondence between the structures and user needs. We present a novel guided deep learning algorithm to find optimal solutions (with both high accuracy and low cost). To achieve this goal, we use low cost sample replacement algorithm in training process. The deep CNN would gradually learn better model from samples with lower cost. We have successfully applied this algorithm to the design of low-cost SPP films. Our model learned to replace precious metals with ordinary metals to reduce cost. So the the cost of predicted structure is much lower than standard deep CNN. And the average relative error of spectrum is less than 10%. The source codes are available at https://github.com/closest-git/MetaLab.
LGOct 12, 2021
Fast Block Linear System Solver Using Q-Learning Schduling for Unified Dynamic Power System SimulationsYingshi Chen, Xinli Song, HanYang Dai et al.
We present a fast block direct solver for the unified dynamic simulations of power systems. This solver uses a novel Q-learning based method for task scheduling. Unified dynamic simulations of power systems represent a method in which the electric-mechanical transient, medium-term and long-term dynamic phenomena are organically united. Due to the high rank and large numbers in solving, fast solution of these equations is the key to speeding up the simulation. The sparse systems of simulation contain complex nested block structure, which could be used by the solver to speed up. For the scheduling of blocks and frontals in the solver, we use a learning based task-tree scheduling technique in the framework of Markov Decision Process. That is, we could learn optimal scheduling strategies by offline training on many sample matrices. Then for any systems, the solver would get optimal task partition and scheduling on the learned model. Our learning-based algorithm could help improve the performance of sparse solver, which has been verified in some numerical experiments. The simulation on some large power systems shows that our solver is 2-6 times faster than KLU, which is the state-of-the-art sparse solver for circuit simulation problems.
NASep 30, 2021
Learning the Markov Decision Process in the Sparse Gaussian EliminationYingshi Chen
We propose a learning-based approach for the sparse Gaussian Elimination. There are many hard combinatorial optimization problems in modern sparse solver. These NP-hard problems could be handled in the framework of Markov Decision Process, especially the Q-Learning technique. We proposed some Q-Learning algorithms for the main modules of sparse solver: minimum degree ordering, task scheduling and adaptive pivoting. Finally, we recast the sparse solver into the framework of Q-Learning. Our study is the first step to connect these two classical mathematical models: Gaussian Elimination and Markov Decision Process. Our learning-based algorithm could help improve the performance of sparse solver, which has been verified in some numerical experiments.
LGJul 12, 2021
The Brownian motion in the transformer modelYingshi Chen
Transformer is the state of the art model for many language and visual tasks. In this paper, we give a deep analysis of its multi-head self-attention (MHSA) module and find that: 1) Each token is a random variable in high dimensional feature space. 2) After layer normalization, these variables are mapped to points on the hyper-sphere. 3) The update of these tokens is a Brownian motion. The Brownian motion has special properties, its second order item should not be ignored. So we present a new second-order optimizer(an iterative K-FAC algorithm) for the MHSA module. In some short words: All tokens are mapped to high dimension hyper-sphere. The Scaled Dot-Product Attention $softmax(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d}})$ is just the Markov transition matrix for the random walking on the sphere. And the deep learning process would learn proper kernel function to get proper positions of these tokens. The training process in the MHSA module corresponds to a Brownian motion worthy of further study.
LGJan 1, 2021
An iterative K-FAC algorithm for Deep LearningYingshi Chen
Kronecker-factored Approximate Curvature (K-FAC) method is a high efficiency second order optimizer for the deep learning. Its training time is less than SGD(or other first-order method) with same accuracy in many large-scale problems. The key of K-FAC is to approximates Fisher information matrix (FIM) as a block-diagonal matrix where each block is an inverse of tiny Kronecker factors. In this short note, we present CG-FAC -- an new iterative K-FAC algorithm. It uses conjugate gradient method to approximate the nature gradient. This CG-FAC method is matrix-free, that is, no need to generate the FIM matrix, also no need to generate the Kronecker factors A and G. We prove that the time and memory complexity of iterative CG-FAC is much less than that of standard K-FAC algorithm.
LGOct 27, 2020
A short note on the decision tree based neural turing machineYingshi Chen
Turing machine and decision tree have developed independently for a long time. With the recent development of differentiable models, there is an intersection between them. Neural turing machine(NTM) opens door for the memory network. It use differentiable attention mechanism to read/write external memory bank. Differentiable forest brings differentiable properties to classical decision tree. In this short note, we show the deep connection between these two models. That is: differentiable forest is a special case of NTM. Differentiable forest is actually decision tree based neural turing machine. Based on this deep connection, we propose a response augmented differential forest (RaDF). The controller of RaDF is differentiable forest, the external memory of RaDF are response vectors which would be read/write by leaf nodes.
LGApr 7, 2020
Learning Unsplit-field-based PML for the FDTD Method by Deep Differentiable ForestYingshi Chen, Naixing Feng
Alternative unsplit-filed-based absorbing boundary condition (ABC) computation approach for the finite-difference time-domain (FDTD) is efficiently proposed based on the deep differentiable forest. The deep differentiable forest (DDF) model is introduced to replace the conventional perfectly matched layer (PML) ABC during the computation process of FDTD. The field component data on the interface of traditional PML are adopted to train the DDF-based PML model. DDF has the advantages of both trees and neural networks. Its tree structure is easy to use and explain for the numerical PML data. It has full differentiability like neural networks. DDF could be trained by powerful techniques from deep learning. So compared to the traditional PML implementation, the proposed method can greatly reduce the size of FDTD physical domain and the calculation complexity of FDTD due to the novel model which only involves the one-cell thickness of boundary layer. Numerical simulations have been carried out to benchmark the performance of the proposed approach. Numerical results illustrate that the proposed method can not only easily replace the traditional PML, but also be integrated into the FDTD computation process with satisfactory numerical accuracy and compatibility to the FDTD.