Ao Zhou

LG
h-index47
24papers
351citations
Novelty53%
AI Score51

24 Papers

LGAug 28, 2023Code
EdgeMoE: Empowering Sparse Large Language Models on Mobile Devices

Rongjie Yi, Liwei Guo, Shiyun Wei et al.

Large language models (LLMs) such as GPTs and Mixtral-8x7B have revolutionized machine intelligence due to their exceptional abilities in generic ML tasks. Transiting LLMs from datacenters to edge devices brings benefits like better privacy and availability, but is challenged by their massive parameter size and thus unbearable runtime costs. To this end, we present EdgeMoE, an on-device inference engine for mixture-of-expert (MoE) LLMs -- a popular form of sparse LLM that scales its parameter size with almost constant computing complexity. EdgeMoE achieves both memory- and compute-efficiency by partitioning the model into the storage hierarchy: non-expert weights are held in device memory; while expert weights are held on external storage and fetched to memory only when activated. This design is motivated by a key observation that expert weights are bulky but infrequently used due to sparse activation. To further reduce the expert I/O swapping overhead, EdgeMoE incorporates two novel techniques: (1) expert-wise bitwidth adaptation that reduces the expert sizes with tolerable accuracy loss; (2) expert preloading that predicts the activated experts ahead of time and preloads it with the compute-I/O pipeline. On popular MoE LLMs and edge devices, EdgeMoE showcase significant memory savings and speedup over competitive baselines. The code is available at https://github.com/UbiquitousLearning/mllm.

LGAug 21, 2024
FedMoE: Personalized Federated Learning via Heterogeneous Mixture of Experts

Hanzi Mei, Dongqi Cai, Ao Zhou et al. · cambridge

As Large Language Models (LLMs) push the boundaries of AI capabilities, their demand for data is growing. Much of this data is private and distributed across edge devices, making Federated Learning (FL) a de-facto alternative for fine-tuning (i.e., FedLLM). However, it faces significant challenges due to the inherent heterogeneity among clients, including varying data distributions and diverse task types. Towards a versatile FedLLM, we replace traditional dense model with a sparsely-activated Mixture-of-Experts (MoE) architecture, whose parallel feed-forward networks enable greater flexibility. To make it more practical in resource-constrained environments, we present FedMoE, the efficient personalized FL framework to address data heterogeneity, constructing an optimal sub-MoE for each client and bringing the knowledge back to global MoE. FedMoE is composed of two fine-tuning stages. In the first stage, FedMoE simplifies the problem by conducting a heuristic search based on observed activation patterns, which identifies a suboptimal submodel for each client. In the second stage, these submodels are distributed to clients for further training and returned for server aggregating through a novel modular aggregation strategy. Meanwhile, FedMoE progressively adjusts the submodels to optimal through global expert recommendation. Experimental results demonstrate the superiority of our method over previous personalized FL methods.

LGJun 15, 2022
Boosting DNN Cold Inference on Edge Devices

Rongjie Yi, Ting Cao, Ao Zhou et al.

DNNs are ubiquitous on edge devices nowadays. With its increasing importance and use cases, it's not likely to pack all DNNs into device memory and expect that each inference has been warmed up. Therefore, cold inference, the process to read, initialize, and execute a DNN model, is becoming commonplace and its performance is urgently demanded to be optimized. To this end, we present NNV12, the first on-device inference engine that optimizes for cold inference NNV12 is built atop 3 novel optimization knobs: selecting a proper kernel (implementation) for each DNN operator, bypassing the weights transformation process by caching the post-transformed weights on disk, and pipelined execution of many kernels on asymmetric processors. To tackle with the huge search space, NNV12 employs a heuristic-based scheme to obtain a near-optimal kernel scheduling plan. We fully implement a prototype of NNV12 and evaluate its performance across extensive experiments. It shows that NNV12 achieves up to 15.2x and 401.5x compared to the state-of-the-art DNN engines on edge CPUs and GPUs, respectively.

LGMar 20, 2023
Hardware-Aware Graph Neural Network Automated Design for Edge Computing Platforms

Ao Zhou, Jianlei Yang, Yingjie Qi et al.

Graph neural networks (GNNs) have emerged as a popular strategy for handling non-Euclidean data due to their state-of-the-art performance. However, most of the current GNN model designs mainly focus on task accuracy, lacking in considering hardware resources limitation and real-time requirements of edge application scenarios. Comprehensive profiling of typical GNN models indicates that their execution characteristics are significantly affected across different computing platforms, which demands hardware awareness for efficient GNN designs. In this work, HGNAS is proposed as the first Hardware-aware Graph Neural Architecture Search framework targeting resource constraint edge devices. By decoupling the GNN paradigm, HGNAS constructs a fine-grained design space and leverages an efficient multi-stage search strategy to explore optimal architectures within a few GPU hours. Moreover, HGNAS achieves hardware awareness during the GNN architecture design by leveraging a hardware performance predictor, which could balance the GNN model accuracy and efficiency corresponding to the characteristics of targeted devices. Experimental results show that HGNAS can achieve about $10.6\times$ speedup and $88.2\%$ peak memory reduction with a negligible accuracy loss compared to DGCNN on various edge devices, including Nvidia RTX3080, Jetson TX2, Intel i7-8700K and Raspberry Pi 3B+.

CLFeb 2Code
Focus-dLLM: Accelerating Long-Context Diffusion LLM Inference via Confidence-Guided Context Focusing

Lingkun Long, Yushi Huang, Shihao Bai et al.

Diffusion Large Language Models (dLLMs) deliver strong long-context processing capability in a non-autoregressive decoding paradigm. However, the considerable computational cost of bidirectional full attention limits the inference efficiency. Although sparse attention is promising, existing methods remain ineffective. This stems from the need to estimate attention importance for tokens yet to be decoded, while the unmasked token positions are unknown during diffusion. In this paper, we present Focus-dLLM, a novel training-free attention sparsification framework tailored for accurate and efficient long-context dLLM inference. Based on the finding that token confidence strongly correlates across adjacent steps, we first design a past confidence-guided indicator to predict unmasked regions. Built upon this, we propose a sink-aware pruning strategy to accurately estimate and remove redundant attention computation, while preserving highly influential attention sinks. To further reduce overhead, this strategy reuses identified sink locations across layers, leveraging the observed cross-layer consistency. Experimental results show that our method offers more than $29\times$ lossless speedup under $32K$ context length. The code is publicly available at: https://github.com/Longxmas/Focus-dLLM

LGOct 18, 2023
Architectural Implications of GNN Aggregation Programming Abstractions

Yingjie Qi, Jianlei Yang, Ao Zhou et al.

Graph neural networks (GNNs) have gained significant popularity due to the powerful capability to extract useful representations from graph data. As the need for efficient GNN computation intensifies, a variety of programming abstractions designed for optimizing GNN Aggregation have emerged to facilitate acceleration. However, there is no comprehensive evaluation and analysis upon existing abstractions, thus no clear consensus on which approach is better. In this letter, we classify existing programming abstractions for GNN Aggregation by the dimension of data organization and propagation method. By constructing these abstractions on a state-of-the-art GNN library, we perform a thorough and detailed characterization study to compare their performance and efficiency, and provide several insights on future GNN acceleration based on our analysis.

LGAug 23, 2024
HGNAS: Hardware-Aware Graph Neural Architecture Search for Edge Devices

Ao Zhou, Jianlei Yang, Yingjie Qi et al.

Graph Neural Networks (GNNs) are becoming increasingly popular for graph-based learning tasks such as point cloud processing due to their state-of-the-art (SOTA) performance. Nevertheless, the research community has primarily focused on improving model expressiveness, lacking consideration of how to design efficient GNN models for edge scenarios with real-time requirements and limited resources. Examining existing GNN models reveals varied execution across platforms and frequent Out-Of-Memory (OOM) problems, highlighting the need for hardware-aware GNN design. To address this challenge, this work proposes a novel hardware-aware graph neural architecture search framework tailored for resource constraint edge devices, namely HGNAS. To achieve hardware awareness, HGNAS integrates an efficient GNN hardware performance predictor that evaluates the latency and peak memory usage of GNNs in milliseconds. Meanwhile, we study GNN memory usage during inference and offer a peak memory estimation method, enhancing the robustness of architecture evaluations when combined with predictor outcomes. Furthermore, HGNAS constructs a fine-grained design space to enable the exploration of extreme performance architectures by decoupling the GNN paradigm. In addition, the multi-stage hierarchical search strategy is leveraged to facilitate the navigation of huge candidates, which can reduce the single search time to a few GPU hours. To the best of our knowledge, HGNAS is the first automated GNN design framework for edge devices, and also the first work to achieve hardware awareness of GNNs across different platforms. Extensive experiments across various applications and edge devices have proven the superiority of HGNAS. It can achieve up to a 10.6x speedup and an 82.5% peak memory reduction with negligible accuracy loss compared to DGCNN on ModelNet40.

CLNov 7, 2024Code
PhoneLM:an Efficient and Capable Small Language Model Family through Principled Pre-training

Rongjie Yi, Xiang Li, Weikai Xie et al.

The interest in developing small language models (SLM) for on-device deployment is fast growing. However, the existing SLM design hardly considers the device hardware characteristics. Instead, this work presents a simple yet effective principle for SLM design: architecture searching for (near-)optimal runtime efficiency before pre-training. Guided by this principle, we develop PhoneLM SLM family (currently with 0.5B and 1.5B versions), that acheive the state-of-the-art capability-efficiency tradeoff among those with similar parameter size. We fully open-source the code, weights, and training datasets of PhoneLM for reproducibility and transparency, including both base and instructed versions. We also release a finetuned version of PhoneLM capable of accurate Android Intent invocation, and an end-to-end Android demo. All materials are available at https://github.com/UbiquitousLearning/PhoneLM.

CLAug 8, 2025Code
SlimInfer: Accelerating Long-Context LLM Inference via Dynamic Token Pruning

Lingkun Long, Rubing Yang, Yushi Huang et al.

Long-context inference for Large Language Models (LLMs) is heavily limited by high computational demands. While several existing methods optimize attention computation, they still process the full set of hidden states at each layer, limiting overall efficiency. In this work, we propose SlimInfer, an innovative framework that aims to accelerate inference by directly pruning less critical prompt tokens during the forward pass. Our key insight is an information diffusion phenomenon: As information from critical tokens propagates through layers, it becomes distributed across the entire sequence. This diffusion process suggests that LLMs can maintain their semantic integrity when excessive tokens, even including these critical ones, are pruned in hidden states. Motivated by this, SlimInfer introduces a dynamic fine-grained pruning mechanism that accurately removes redundant tokens of hidden state at intermediate layers. This layer-wise pruning naturally enables an asynchronous KV cache manager that prefetches required token blocks without complex predictors, reducing both memory usage and I/O costs. Extensive experiments show that SlimInfer can achieve up to $\mathbf{2.53\times}$ time-to-first-token (TTFT) speedup and $\mathbf{1.88\times}$ end-to-end latency reduction for LLaMA3.1-8B-Instruct on a single RTX 4090, without sacrificing performance on LongBench. Our code will be released upon acceptance.

LGAug 23, 2024
AEMLO: AutoEncoder-Guided Multi-Label Oversampling

Ao Zhou, Bin Liu, Jin Wang et al.

Class imbalance significantly impacts the performance of multi-label classifiers. Oversampling is one of the most popular approaches, as it augments instances associated with less frequent labels to balance the class distribution. Existing oversampling methods generate feature vectors of synthetic samples through replication or linear interpolation and assign labels through neighborhood information. Linear interpolation typically generates new samples between existing data points, which may result in insufficient diversity of synthesized samples and further lead to the overfitting issue. Deep learning-based methods, such as AutoEncoders, have been proposed to generate more diverse and complex synthetic samples, achieving excellent performance on imbalanced binary or multi-class datasets. In this study, we introduce AEMLO, an AutoEncoder-guided Oversampling technique specifically designed for tackling imbalanced multi-label data. AEMLO is built upon two fundamental components. The first is an encoder-decoder architecture that enables the model to encode input data into a low-dimensional feature space, learn its latent representations, and then reconstruct it back to its original dimension, thus applying to the generation of new data. The second is an objective function tailored to optimize the sampling task for multi-label scenarios. We show that AEMLO outperforms the existing state-of-the-art methods with extensive empirical studies.

LGMar 1, 2024
FedRDMA: Communication-Efficient Cross-Silo Federated LLM via Chunked RDMA Transmission

Zeling Zhang, Dongqi Cai, Yiran Zhang et al. · cambridge

Communication overhead is a significant bottleneck in federated learning (FL), which has been exaggerated with the increasing size of AI models. In this paper, we propose FedRDMA, a communication-efficient cross-silo FL system that integrates RDMA into the FL communication protocol. To overcome the limitations of RDMA in wide-area networks (WANs), FedRDMA divides the updated model into chunks and designs a series of optimization techniques to improve the efficiency and robustness of RDMA-based communication. We implement FedRDMA atop the industrial federated learning framework and evaluate it on a real-world cross-silo FL scenario. The experimental results show that \sys can achieve up to 3.8$\times$ speedup in communication efficiency compared to traditional TCP/IP-based FL systems.

LGApr 8, 2024
Graph Neural Networks Automated Design and Deployment on Device-Edge Co-Inference Systems

Ao Zhou, Jianlei Yang, Tong Qiao et al.

The key to device-edge co-inference paradigm is to partition models into computation-friendly and computation-intensive parts across the device and the edge, respectively. However, for Graph Neural Networks (GNNs), we find that simply partitioning without altering their structures can hardly achieve the full potential of the co-inference paradigm due to various computational-communication overheads of GNN operations over heterogeneous devices. We present GCoDE, the first automatic framework for GNN that innovatively Co-designs the architecture search and the mapping of each operation on Device-Edge hierarchies. GCoDE abstracts the device communication process into an explicit operation and fuses the search of architecture and the operations mapping in a unified space for joint-optimization. Also, the performance-awareness approach, utilized in the constraint-based search process of GCoDE, enables effective evaluation of architecture efficiency in diverse heterogeneous systems. We implement the co-inference engine and runtime dispatcher in GCoDE to enhance the deployment efficiency. Experimental results show that GCoDE can achieve up to $44.9\times$ speedup and $98.2\%$ energy reduction compared to existing approaches across various applications and system configurations.

LGFeb 5, 2025
RLOMM: An Efficient and Robust Online Map Matching Framework with Reinforcement Learning

Minxiao Chen, Haitao Yuan, Nan Jiang et al.

Online map matching is a fundamental problem in location-based services, aiming to incrementally match trajectory data step-by-step onto a road network. However, existing methods fail to meet the needs for efficiency, robustness, and accuracy required by large-scale online applications, making this task still challenging. This paper introduces a novel framework that achieves high accuracy and efficient matching while ensuring robustness in handling diverse scenarios. To improve efficiency, we begin by modeling the online map matching problem as an Online Markov Decision Process (OMDP) based on its inherent characteristics. This approach helps efficiently merge historical and real-time data, reducing unnecessary calculations. Next, to enhance robustness, we design a reinforcement learning method, enabling robust handling of real-time data from dynamically changing environments. In particular, we propose a novel model learning process and a comprehensive reward function, allowing the model to make reasonable current matches from a future-oriented perspective, and to continuously update and optimize during the decision-making process based on feedback. Lastly, to address the heterogeneity between trajectories and roads, we design distinct graph structures, facilitating efficient representation learning through graph and recurrent neural networks. To further align trajectory and road data, we introduce contrastive learning to decrease their distance in the latent space, thereby promoting effective integration of the two. Extensive evaluations on three real-world datasets confirm that our method significantly outperforms existing state-of-the-art solutions in terms of accuracy, efficiency and robustness.

LGMar 27, 2024
Multi-Label Adaptive Batch Selection by Highlighting Hard and Imbalanced Samples

Ao Zhou, Bin Liu, Jin Wang et al.

Deep neural network models have demonstrated their effectiveness in classifying multi-label data from various domains. Typically, they employ a training mode that combines mini-batches with optimizers, where each sample is randomly selected with equal probability when constructing mini-batches. However, the intrinsic class imbalance in multi-label data may bias the model towards majority labels, since samples relevant to minority labels may be underrepresented in each mini-batch. Meanwhile, during the training process, we observe that instances associated with minority labels tend to induce greater losses. Existing heuristic batch selection methods, such as priority selection of samples with high contribution to the objective function, i.e., samples with high loss, have been proven to accelerate convergence while reducing the loss and test error in single-label data. However, batch selection methods have not yet been applied and validated in multi-label data. In this study, we introduce a simple yet effective adaptive batch selection algorithm tailored to multi-label deep learning models. It adaptively selects each batch by prioritizing hard samples related to minority labels. A variant of our method also takes informative label correlations into consideration. Comprehensive experiments combining five multi-label deep learning models on thirteen benchmark datasets show that our method converges faster and performs better than random batch selection.

LGApr 15, 2024
GNNavigator: Towards Adaptive Training of Graph Neural Networks via Automatic Guideline Exploration

Tong Qiao, Jianlei Yang, Yingjie Qi et al.

Graph Neural Networks (GNNs) succeed significantly in many applications recently. However, balancing GNNs training runtime cost, memory consumption, and attainable accuracy for various applications is non-trivial. Previous training methodologies suffer from inferior adaptability and lack a unified training optimization solution. To address the problem, this work proposes GNNavigator, an adaptive GNN training configuration optimization framework. GNNavigator meets diverse GNN application requirements due to our unified software-hardware co-abstraction, proposed GNNs training performance model, and practical design space exploration solution. Experimental results show that GNNavigator can achieve up to 3.1x speedup and 44.9% peak memory reduction with comparable accuracy to state-of-the-art approaches.

IRAug 22, 2025
Hierarchical Vision-Language Reasoning for Multimodal Multiple-Choice Question Answering

Ao Zhou, Zebo Gu, Tenghao Sun et al.

Multimodal Large Language Models (MLLMs) have demonstrated remarkable multimodal understanding capabilities in Visual Question Answering (VQA) tasks by integrating visual and textual features. However, under the challenging ten-choice question evaluation paradigm, existing methods still exhibit significant limitations when processing PDF documents with complex layouts and lengthy content. Notably, current mainstream models suffer from a strong bias toward English training data, resulting in suboptimal performance for Japanese and other language scenarios. To address these challenges, this paper proposes a novel Japanese PDF document understanding framework that combines multimodal hierarchical reasoning mechanisms with Colqwen-optimized retrieval methods, while innovatively introducing a semantic verification strategy through sub-question decomposition. Experimental results demonstrate that our framework not only significantly enhances the model's deep semantic parsing capability for complex documents, but also exhibits superior robustness in practical application scenarios.

LGApr 16, 2024
HiGraphDTI: Hierarchical Graph Representation Learning for Drug-Target Interaction Prediction

Bin Liu, Siqi Wu, Jin Wang et al.

The discovery of drug-target interactions (DTIs) plays a crucial role in pharmaceutical development. The deep learning model achieves more accurate results in DTI prediction due to its ability to extract robust and expressive features from drug and target chemical structures. However, existing deep learning methods typically generate drug features via aggregating molecular atom representations, ignoring the chemical properties carried by motifs, i.e., substructures of the molecular graph. The atom-drug double-level molecular representation learning can not fully exploit structure information and fails to interpret the DTI mechanism from the motif perspective. In addition, sequential model-based target feature extraction either fuses limited contextual information or requires expensive computational resources. To tackle the above issues, we propose a hierarchical graph representation learning-based DTI prediction method (HiGraphDTI). Specifically, HiGraphDTI learns hierarchical drug representations from triple-level molecular graphs to thoroughly exploit chemical information embedded in atoms, motifs, and molecules. Then, an attentional feature fusion module incorporates information from different receptive fields to extract expressive target features.Last, the hierarchical attention mechanism identifies crucial molecular segments, which offers complementary views for interpreting interaction mechanisms. The experiment results not only demonstrate the superiority of HiGraphDTI to the state-of-the-art methods, but also confirm the practical ability of our model in interaction interpretation and new DTI discovery.

LGDec 5, 2025
GCoDE: Efficient Device-Edge Co-Inference for GNNs via Architecture-Mapping Co-Search

Ao Zhou, Jianlei Yang, Tong Qiao et al.

Graph Neural Networks (GNNs) have emerged as the state-of-the-art graph learning method. However, achieving efficient GNN inference on edge devices poses significant challenges, limiting their application in real-world edge scenarios. This is due to the high computational cost of GNNs and limited hardware resources on edge devices, which prevent GNN inference from meeting real-time and energy requirements. As an emerging paradigm, device-edge co-inference shows potential for improving inference efficiency and reducing energy consumption on edge devices. Despite its potential, research on GNN device-edge co-inference remains scarce, and our findings show that traditional model partitioning methods are ineffective for GNNs. To address this, we propose GCoDE, the first automatic framework for GNN architecture-mapping Co-design and deployment on Device-Edge hierarchies. By abstracting the device communication process into an explicit operation, GCoDE fuses the architecture and mapping scheme in a unified design space for joint optimization. Additionally, GCoDE's system performance awareness enables effective evaluation of architecture efficiency across diverse heterogeneous systems. By analyzing the energy consumption of various GNN operations, GCoDE introduces an energy prediction method that improves energy assessment accuracy and identifies energy-efficient solutions. Using a constraint-based random search strategy, GCoDE identifies the optimal solution in 1.5 hours, balancing accuracy and efficiency. Moreover, the integrated co-inference engine in GCoDE enables efficient deployment and execution of GNN co-inference. Experimental results show that GCoDE can achieve up to 44.9x speedup and 98.2% energy reduction compared to existing approaches across diverse applications and system configurations.

LGDec 21, 2024
Batch Selection for Multi-Label Classification Guided by Uncertainty and Dynamic Label Correlations

Ao Zhou, Bin Liu, Jin Wang et al.

The accuracy of deep neural networks is significantly influenced by the effectiveness of mini-batch construction during training. In single-label scenarios, such as binary and multi-class classification tasks, it has been demonstrated that batch selection algorithms preferring samples with higher uncertainty achieve better performance than difficulty-based methods. Although there are two batch selection methods tailored for multi-label data, none of them leverage important uncertainty information. Adapting the concept of uncertainty to multi-label data is not a trivial task, since there are two issues that should be tackled. First, traditional variance or entropy-based uncertainty measures ignore fluctuations of predictions within sliding windows and the importance of the current model state. Second, existing multi-label methods do not explicitly exploit the label correlations, particularly the uncertainty-based label correlations that evolve during the training process. In this paper, we propose an uncertainty-based multi-label batch selection algorithm. It assesses uncertainty for each label by considering differences between successive predictions and the confidence of current outputs, and further leverages dynamic uncertainty-based label correlations to emphasize instances whose uncertainty is synergistically expressed across multiple labels. Empirical studies demonstrate the effectiveness of our method in improving the performance and accelerating the convergence of various multi-label deep learning models.

LGFeb 14, 2022
Benchmarking of DL Libraries and Models on Mobile Devices

Qiyang Zhang, Xiang Li, Xiangying Che et al.

Deploying deep learning (DL) on mobile devices has been a notable trend in recent years. To support fast inference of on-device DL, DL libraries play a critical role as algorithms and hardware do. Unfortunately, no prior work ever dives deep into the ecosystem of modern DL libs and provides quantitative results on their performance. In this paper, we first build a comprehensive benchmark that includes 6 representative DL libs and 15 diversified DL models. We then perform extensive experiments on 10 mobile devices, which help reveal a complete landscape of the current mobile DL libs ecosystem. For example, we find that the best-performing DL lib is severely fragmented across different models and hardware, and the gap between those DL libs can be rather huge. In fact, the impacts of DL libs can overwhelm the optimizations from algorithms or hardware, e.g., model quantization and GPU/DSP-based heterogeneous computing. Finally, atop the observations, we summarize practical implications to different roles in the DL lib ecosystem.

LGApr 7, 2021
Optimizing Memory Efficiency of Graph Neural Networks on Edge Computing Platforms

Ao Zhou, Jianlei Yang, Yeqi Gao et al.

Graph neural networks (GNN) have achieved state-of-the-art performance on various industrial tasks. However, the poor efficiency of GNN inference and frequent Out-Of-Memory (OOM) problem limit the successful application of GNN on edge computing platforms. To tackle these problems, a feature decomposition approach is proposed for memory efficiency optimization of GNN inference. The proposed approach could achieve outstanding optimization on various GNN models, covering a wide range of datasets, which speeds up the inference by up to 3x. Furthermore, the proposed feature decomposition could significantly reduce the peak memory usage (up to 5x in memory efficiency improvement) and mitigate OOM problems during GNN inference.

LGOct 22, 2020
Hierarchical Federated Learning through LAN-WAN Orchestration

Jinliang Yuan, Mengwei Xu, Xiao Ma et al.

Federated learning (FL) was designed to enable mobile phones to collaboratively learn a global model without uploading their private data to a cloud server. However, exiting FL protocols has a critical communication bottleneck in a federated network coupled with privacy concerns, usually powered by a wide-area network (WAN). Such a WAN-driven FL design leads to significantly high cost and much slower model convergence. In this work, we propose an efficient FL protocol, which involves a hierarchical aggregation mechanism in the local-area network (LAN) due to its abundant bandwidth and almost negligible monetary cost than WAN. Our proposed FL can accelerate the learning process and reduce the monetary cost with frequent local aggregation in the same LAN and infrequent global aggregation on a cloud across WAN. We further design a concrete FL platform, namely LanFL, that incorporates several key techniques to handle those challenges introduced by LAN: cloud-device aggregation architecture, intra-LAN peer-to-peer (p2p) topology generation, inter-LAN bandwidth capacity heterogeneity. We evaluate LanFL on 2 typical Non-IID datasets, which reveals that LanFL can significantly accelerate FL training (1.5x-6.0x), save WAN traffic (18.3x-75.6x), and reduce monetary cost (3.8x-27.2x) while preserving the model accuracy.

LGMar 21, 2020
DP-Net: Dynamic Programming Guided Deep Neural Network Compression

Dingcheng Yang, Wenjian Yu, Ao Zhou et al.

In this work, we propose an effective scheme (called DP-Net) for compressing the deep neural networks (DNNs). It includes a novel dynamic programming (DP) based algorithm to obtain the optimal solution of weight quantization and an optimization process to train a clustering-friendly DNN. Experiments showed that the DP-Net allows larger compression than the state-of-the-art counterparts while preserving accuracy. The largest 77X compression ratio on Wide ResNet is achieved by combining DP-Net with other compression techniques. Furthermore, the DP-Net is extended for compressing a robust DNN model with negligible accuracy loss. At last, a custom accelerator is designed on FPGA to speed up the inference computation with DP-Net.

CVFeb 20, 2018
Fast and robust misalignment correction of Fourier ptychographic microscopy

Ao Zhou, Wei Wang, Ni Chen et al.

Fourier ptychographi cmicroscopy(FPM) is a newly developed computational imaging technique that can provide gigapixel images with both high resolution (HR) and wide field of view (FOV). However, the positional misalignment of the LED array induces a degradation of the reconstruction, especially in the regions away from the optical axis. In this paper, we propose a robust and fast method to correct the LED misalignment of FPM, termed as misalignment correction for FPM (mcFPM). Although different regions in the FOV have different sensitivity to the LED misalignment, the experimental results show that mcFPM is robust to eliminate the degradation in each region. Compared with the state-of-the-art methods, mcFPM is much faster.