Fei Dai

CL
h-index7
6papers
42citations
Novelty56%
AI Score41

6 Papers

DCJul 22, 2022
Efficient All-reduce for Distributed DNN Training in Optical Interconnect System

Fei Dai, Yawen Chen, Zhiyi Huang et al.

Communication efficiency plays an important role in accelerating the distributed training of Deep Neural Networks (DNN). All-reduce is the crucial communication primitive to reduce model parameters in distributed DNN training. Most existing all-reduce algorithms are designed for traditional electrical interconnect systems, which cannot meet the communication requirements for distributed training of large DNNs due to the low data bandwidth of the electrical interconnect systems. One of the promising alternatives for electrical interconnect is optical interconnect, which can provide high bandwidth, low transmission delay, and low power cost. We propose an efficient scheme called WRHT (Wavelength Reused Hierarchical Tree) for implementing all-reduce operation in optical interconnect systems. WRHT can take advantage of WDM (Wavelength Division Multiplexing) to reduce the communication time of distributed data-parallel DNN training. We further derive the required number of wavelengths, the minimum number of communication steps, and the communication time for the all-reduce operation on optical interconnect. The constraint of insertion loss is also considered in our analysis. Simulation results show that the communication time of all-reduce by WRHT is reduced by 80.81%, 64.36%, and 82.12%, respectively, compared with three traditional all-reduce algorithms according to our simulation results of an optical interconnect system. Our results also show that WRHT can reduce the communication time of all-reduce operation by 92.42% and 91.31% compared to two existing all-reduce algorithms running in the electrical interconnect system.

CLJun 17, 2025Code
CausalDiffTab: Mixed-Type Causal-Aware Diffusion for Tabular Data Generation

Jia-Chen Zhang, Zheng Zhou, Yu-Jie Xiong et al.

Training data has been proven to be one of the most critical components in training generative AI. However, obtaining high-quality data remains challenging, with data privacy issues presenting a significant hurdle. To address the need for high-quality data. Synthesize data has emerged as a mainstream solution, demonstrating impressive performance in areas such as images, audio, and video. Generating mixed-type data, especially high-quality tabular data, still faces significant challenges. These primarily include its inherent heterogeneous data types, complex inter-variable relationships, and intricate column-wise distributions. In this paper, we introduce CausalDiffTab, a diffusion model-based generative model specifically designed to handle mixed tabular data containing both numerical and categorical features, while being more flexible in capturing complex interactions among variables. We further propose a hybrid adaptive causal regularization method based on the principle of Hierarchical Prior Fusion. This approach adaptively controls the weight of causal regularization, enhancing the model's performance without compromising its generative capabilities. Comprehensive experiments conducted on seven datasets demonstrate that CausalDiffTab outperforms baseline methods across all metrics. Our code is publicly available at: https://github.com/Godz-z/CausalDiffTab.

CLMar 30, 2025Code
Mixture of Routers

Jia-Chen Zhang, Yu-Jie Xiong, Xi-He Qiu et al.

Supervised fine-tuning (SFT) is a milestone in aligning large language models with human instructions and adapting them to downstream tasks. In particular, Low-Rank Adaptation (LoRA) has gained widespread attention due to its parameter efficiency. However, its impact on improving the performance of large models remains limited. Recent studies suggest that combining LoRA with Mixture-of-Experts (MoE) can significantly enhance fine-tuning performance. MoE adapts to the diversity and complexity of datasets by dynamically selecting the most suitable experts, thereby improving task accuracy and efficiency. Despite impressive results, recent studies reveal issues in the MoE routing mechanism, such as incorrect assignments and imbalanced expert allocation. Inspired by the principles of Redundancy and Fault Tolerance Theory. We innovatively integrate the concept of Mixture of Experts into the routing mechanism and propose an efficient fine-tuning method called Mixture of Routers (MoR). It employs multiple sub-routers for joint selection and uses a learnable main router to determine the weights of the sub-routers. The results show that MoR outperforms baseline models on most tasks, achieving an average performance improvement of 1%. MoR can serve as a plug-and-play, parameter-efficient fine-tuning method suitable for a wide range of applications. Our code is available here: https://anonymous.4open.science/r/MoR-DFC6.

CVNov 26, 2024
I2VControl: Disentangled and Unified Video Motion Synthesis Control

Wanquan Feng, Tianhao Qi, Jiawei Liu et al.

Motion controllability is crucial in video synthesis. However, most previous methods are limited to single control types, and combining them often results in logical conflicts. In this paper, we propose a disentangled and unified framework, namely I2VControl, to overcome the logical conflicts. We rethink camera control, object dragging, and motion brush, reformulating all tasks into a consistent representation based on point trajectories, each managed by a dedicated formulation. Accordingly, we propose a spatial partitioning strategy, where each unit is assigned to a concomitant control category, enabling diverse control types to be dynamically orchestrated within a single synthesis pipeline without conflicts. Furthermore, we design an adapter structure that functions as a plug-in for pre-trained models and is agnostic to specific model architectures. We conduct extensive experiments, achieving excellent performance on various control tasks, and our method further facilitates user-driven creative combinations, enhancing innovation and creativity. Project page: https://wanquanf.github.io/I2VControl .

CVMar 21, 2025
HyperLoRA: Parameter-Efficient Adaptive Generation for Portrait Synthesis

Mengtian Li, Jinshu Chen, Wanquan Feng et al.

Personalized portrait synthesis, essential in domains like social entertainment, has recently made significant progress. Person-wise fine-tuning based methods, such as LoRA and DreamBooth, can produce photorealistic outputs but need training on individual samples, consuming time and resources and posing an unstable risk. Adapter based techniques such as IP-Adapter freeze the foundational model parameters and employ a plug-in architecture to enable zero-shot inference, but they often exhibit a lack of naturalness and authenticity, which are not to be overlooked in portrait synthesis tasks. In this paper, we introduce a parameter-efficient adaptive generation method, namely HyperLoRA, that uses an adaptive plug-in network to generate LoRA weights, merging the superior performance of LoRA with the zero-shot capability of adapter scheme. Through our carefully designed network structure and training strategy, we achieve zero-shot personalized portrait generation (supporting both single and multiple image inputs) with high photorealism, fidelity, and editability.

DCSep 30, 2021
Accelerating Fully Connected Neural Network on Optical Network-on-Chip (ONoC)

Fei Dai, Yawen Chen, Haibo Zhang et al.

Fully Connected Neural Network (FCNN) is a class of Artificial Neural Networks widely used in computer science and engineering, whereas the training process can take a long time with large datasets in existing many-core systems. Optical Network-on-Chip (ONoC), an emerging chip-scale optical interconnection technology, has great potential to accelerate the training of FCNN with low transmission delay, low power consumption, and high throughput. However, existing methods based on Electrical Network-on-Chip (ENoC) cannot fit in ONoC because of the unique properties of ONoC. In this paper, we propose a fine-grained parallel computing model for accelerating FCNN training on ONoC and derive the optimal number of cores for each execution stage with the objective of minimizing the total amount of time to complete one epoch of FCNN training. To allocate the optimal number of cores for each execution stage, we present three mapping strategies and compare their advantages and disadvantages in terms of hotspot level, memory requirement, and state transitions. Simulation results show that the average prediction error for the optimal number of cores in NN benchmarks is within 2.3%. We further carry out extensive simulations which demonstrate that FCNN training time can be reduced by 22.28% and 4.91% on average using our proposed scheme, compared with traditional parallel computing methods that either allocate a fixed number of cores or allocate as many cores as possible, respectively. Compared with ENoC, simulation results show that under batch sizes of 64 and 128, on average ONoC can achieve 21.02% and 12.95% on reducing training time with 47.85% and 39.27% on saving energy, respectively.