Chao Jin

LG
h-index28
22papers
279citations
Novelty47%
AI Score55

22 Papers

DCJun 2
UltraEP: Unleash MoE Training and Inference on Rack-Scale Nodes with Near-Optimal Load Balancing

Xinming Wei, Chao Jin, Tuo Dai et al.

Large-scale expert parallelism (EP) is becoming pivotal for training and serving frontier MoE models, but it also amplifies device-level expert load imbalance into compute stragglers, token all-to-all bottlenecks, and activation-memory spikes. Existing balancers redistribute experts periodically based on historical load, which becomes unreliable for production deployments with non-stationary load patterns. We present UltraEP, the first exact-load, real-time balancer for large-EP MoE training and serving prefill on rack-scale nodes (RSNs). Built upon the extended scale-up connectivity of RSNs, UltraEP rebalances every microbatch and layer on critical paths, which requires nontrivial co-design of plan solving and expert replication communication to minimize exposed overhead. To this end, UltraEP eagerly reacts to post-gating load with efficient quota-driven planning, and executes the resulting irregular expert-state transfers with RSN-native persistent tile streaming and relay-based fan-out mitigation. Averaged across MoE models from 106B to 671B parameters in training and prefill, UltraEP achieves 94.3% of the force-balanced ideal throughput, delivering 1.49$\times$ improvement over non-balancing, while reducing the final inter-rank imbalance from 1.30$-$4.01 to 1.01$-$1.04. Additionally, we validate UltraEP's scalability and robustness in production MoE training with 2560 GPUs.

IVSep 7, 2023Code
Punctate White Matter Lesion Segmentation in Preterm Infants Powered by Counterfactually Generative Learning

Zehua Ren, Yongheng Sun, Miaomiao Wang et al.

Accurate segmentation of punctate white matter lesions (PWMLs) are fundamental for the timely diagnosis and treatment of related developmental disorders. Automated PWMLs segmentation from infant brain MR images is challenging, considering that the lesions are typically small and low-contrast, and the number of lesions may dramatically change across subjects. Existing learning-based methods directly apply general network architectures to this challenging task, which may fail to capture detailed positional information of PWMLs, potentially leading to severe under-segmentations. In this paper, we propose to leverage the idea of counterfactual reasoning coupled with the auxiliary task of brain tissue segmentation to learn fine-grained positional and morphological representations of PWMLs for accurate localization and segmentation. A simple and easy-to-implement deep-learning framework (i.e., DeepPWML) is accordingly designed. It combines the lesion counterfactual map with the tissue probability map to train a lightweight PWML segmentation network, demonstrating state-of-the-art performance on a real-clinical dataset of infant T1w MR images. The code is available at \href{https://github.com/ladderlab-xjtu/DeepPWML}{https://github.com/ladderlab-xjtu/DeepPWML}.

LGMay 25
BigMac: Breaking the Pareto Frontier of Compute and Memory in Multimodal LLM Training

Zili Zhang, Chengxu Yang, Shenglong Zhang et al.

Training multimodal large language models (MLLMs) is challenged by both model and data heterogeneity. Existing systems redesign the training pipeline to address these challenges, but remain bound by a Pareto frontier between compute and memory efficiency, improving one only at the expense of the other. We present BigMac, a new training pipeline for multimodal LLMs. The core idea of BigMac is to elegantly nest the encoder and generator computation into the original LLM pipeline, forming a dependency-safe nested pipeline structure. With this design, BigMac reduces the activation memory complexity of the encoder and generator to O(1) while keeping the activation memory complexity of the LLM unchanged. At the same time, it achieves the same computational efficiency as the idealized setting with unlimited memory. As a result, BigMac breaks the Pareto frontier between computational efficiency and memory usage, enabling simultaneous optimization of both computation and memory in MLLM training. We evaluate BigMac on multiple MLLMs and training workloads. Experimental results show that BigMac achieves a 1.08$\times$-1.9$\times$ training speedup over baseline systems while maintaining stable memory usage as batch size increases.

CRApr 11
EncFormer: Secure and Efficient Transformer Inference over Encrypted Data

Yufan Zhu, Chao Jin, Khin Mi Mi Aung et al.

Transformer inference in machine-learning-as-a-service (MLaaS) raises privacy concerns for sensitive user inputs. Prior secure solutions that combine fully homomorphic encryption (FHE) and secure multiparty computation (MPC) are bottlenecked by inefficient FHE kernels, communication-heavy MPC protocols, and expensive FHE-MPC conversions. We present EncFormer, a two-party private Transformer inference framework that introduces Stage Compatible Patterns so that FHE kernels compose efficiently, reducing repacking and conversions. EncFormer also provides a cost analysis model built around a minimal-conversion baseline, enabling principled selection of FHE-MPC boundaries. To further reduce communication, EncFormer proposes a secure complex CKKS-MPC conversion protocol and designs communication-efficient MPC protocols for nonlinearities. With GPU optimizations, evaluations on GPT- and BERT-style models show that EncFormer achieves 1.4x-30.4x lower online MPC communication and 1.3x-9.8x lower end-to-end latency against prior hybrid FHE-MPC systems, and 1.9x-3.5x lower end-to-end latency on BERT-base than FHE-only pipelines under a matched backend, while maintaining near-plaintext accuracy on selected GLUE tasks.

LGMar 8, 2022
LSTMSPLIT: Effective SPLIT Learning based LSTM on Sequential Time-Series Data

Lianlian Jiang, Yuexuan Wang, Wenyi Zheng et al.

Federated learning (FL) and split learning (SL) are the two popular distributed machine learning (ML) approaches that provide some data privacy protection mechanisms. In the time-series classification problem, many researchers typically use 1D convolutional neural networks (1DCNNs) based on the SL approach with a single client to reduce the computational overhead at the client-side while still preserving data privacy. Another method, recurrent neural network (RNN), is utilized on sequentially partitioned data where segments of multiple-segment sequential data are distributed across various clients. However, to the best of our knowledge, it is still not much work done in SL with long short-term memory (LSTM) network, even the LSTM network is practically effective in processing time-series data. In this work, we propose a new approach, LSTMSPLIT, that uses SL architecture with an LSTM network to classify time-series data with multiple clients. The differential privacy (DP) is applied to solve the data privacy leakage. The proposed method, LSTMSPLIT, has achieved better or reasonable accuracy compared to the Split-1DCNN method using the electrocardiogram dataset and the human activity recognition dataset. Furthermore, the proposed method, LSTMSPLIT, can also achieve good accuracy after applying differential privacy to preserve the user privacy of the cut layer of the LSTMSPLIT.

LGMar 30
Heddle: A Distributed Orchestration System for Agentic RL Rollout

Zili Zhang, Yinmin Zhong, Chengxu Yang et al.

Agentic Reinforcement Learning (RL) enables LLMs to solve complex tasks by alternating between a data-collection rollout phase and a policy training phase. During rollout, the agent generates trajectories, i.e., multi-step interactions between LLMs and external tools. Yet, frequent tool calls induce long-tailed trajectory generation that bottlenecks rollouts. This stems from step-centric designs that ignore trajectory context, triggering three system problems for long-tail trajectory generation: queueing delays, interference overhead, and inflated per-token time. We propose Heddle, a trajectory-centric system to optimize the when, where, and how of agentic rollout execution. Heddle integrates three core mechanisms: trajectory-level scheduling using runtime prediction and progressive priority to minimize cumulative queueing; trajectory-aware placement via presorted dynamic programming and opportunistic migration during idle tool call intervals to minimize interference; and trajectory-adaptive resource manager that dynamically tunes model parallelism to accelerate the per-token time of long-tail trajectories while maintaining high throughput for short trajectories. Evaluations across diverse agentic RL workloads demonstrate that Heddle effectively neutralizes the long-tail bottleneck, achieving up to 2.5$\times$ higher end-to-end rollout throughput compared to state-of-the-art baselines.

AIApr 19
HalluClear: Diagnosing, Evaluating and Mitigating Hallucinations in GUI Agents

Chao Jin, Wenkui Yang, Hao Sun et al.

While progress in GUI agents has been largely driven by industrial-scale training, ungrounded hallucinations often trigger cascading failures in real-world deployments.Unlike general VLM domains, the GUI agent field lacks a hallucination-focused suite for fine-grained diagnosis, reliable evaluation, and targeted mitigation.To bridge this gap, we introduce HalluClear, a comprehensive suite for hallucination mitigation in GUI agents as a complement to computation-intensive scaling. HalluClear comprises: (1) a GUI-specific hallucination taxonomy derived from empirical failure analysis; (2) a calibrated three-stage evaluation workflow which enhances VLM-as-a-judge reliability via expert-annotated benchmarking and ensemble credibility estimation; and (3) a mitigation scheme based on closed-loop structured reasoning, enabling lightweight continual post-training with cold-start initialization for both generalist and GUI-specialist agents. Experiments across representative agents and public benchmarks demonstrate that post-training on only 9K samples within our suite can significantly reduce hallucinations, thereby improving grounding and action fidelity, offering a compute-efficient pathway to robust GUI automation.

CRApr 9
Are GUI Agents Focused Enough? Automated Distraction via Semantic-level UI Element Injection

Wenkui Yang, Chao Jin, Haisu Zhu et al.

Existing red-teaming studies on GUI agents have important limitations. Adversarial perturbations typically require white-box access, which is unavailable for commercial systems, while prompt injection is increasingly mitigated by stronger safety alignment. To study robustness under a more practical threat model, we propose Semantic-level UI Element Injection, a red-teaming setting that overlays safety-aligned and harmless UI elements onto screenshots to misdirect the agent's visual grounding. Our method uses a modular Editor-Overlapper-Victim pipeline and an iterative search procedure that samples multiple candidate edits, keeps the best cumulative overlay, and adapts future prompt strategies based on previous failures. Across five victim models, our optimized attacks improve attack success rate by up to 4.4x over random injection on the strongest victims. Moreover, elements optimized on one source model transfer effectively to other target models, indicating model-agnostic vulnerabilities. After the first successful attack, the victim still clicks the attacker-controlled element in more than 15% of later independent trials, versus below 1% for random injection, showing that the injected element acts as a persistent attractor rather than simple visual clutter.

CVDec 3, 2023Code
Portrait Diffusion: Training-free Face Stylization with Chain-of-Painting

Jin Liu, Huaibo Huang, Chao Jin et al.

Face stylization refers to the transformation of a face into a specific portrait style. However, current methods require the use of example-based adaptation approaches to fine-tune pre-trained generative models so that they demand lots of time and storage space and fail to achieve detailed style transformation. This paper proposes a training-free face stylization framework, named Portrait Diffusion. This framework leverages off-the-shelf text-to-image diffusion models, eliminating the need for fine-tuning specific examples. Specifically, the content and style images are first inverted into latent codes. Then, during image reconstruction using the corresponding latent code, the content and style features in the attention space are delicately blended through a modified self-attention operation called Style Attention Control. Additionally, a Chain-of-Painting method is proposed for the gradual redrawing of unsatisfactory areas from rough adjustments to fine-tuning. Extensive experiments validate the effectiveness of our Portrait Diffusion method and demonstrate the superiority of Chain-of-Painting in achieving precise face stylization. Code will be released at \url{https://github.com/liujin112/PortraitDiffusion}.

LGMay 9
ReLibra: Routing-Replay-Guided Load Balancing for MoE Training in Reinforcement Learning

Chao Jin, Xinming Wei, Yinmin Zhong et al.

Load imbalance is a long-standing challenge in Mixture-of-Experts (MoE) training and is exacerbated in reinforcement learning (RL) for LLMs, where hot experts can shift frequently across micro-batches. Existing MoE training systems rely on historical loads to predict future expert demand, making them less effective under sharp fluctuations. We propose ReLibra, an MoE RL training system that exploits a unique opportunity in RL's rollout-training workflow, routing replay, to enable fine-grained load balancing at micro-batch granularity. Because rollout and training process the same tokens with the same MoE parameters, the token-to-expert routing decisions are known before training starts. Leveraging this information, ReLibra places two MoE load-balancing mechanisms at inter- and intra-batch timescales, matching their communication patterns to hierarchical network bandwidths. At the inter-batch timescale, ReLibra performs expert reordering to redistribute experts for batch-level cross-node balancing; at the intra-batch timescale, it dynamically performs expert replication within a node to absorb micro-batch-level load fluctuations. Experiments on diverse MoE LLMs and RL workloads show that ReLibra improves training throughput by up to 1.6$\times$ over Megatron-LM and by up to 1.2$\times$ over EPLB, even when EPLB is given oracle loads. Moreover, ReLibra remains within 6%-10% of the throughput of an idealized balanced baseline.

DCApr 18, 2024
RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation

Chao Jin, Zili Zhang, Xuanlin Jiang et al.

Retrieval-Augmented Generation (RAG) has shown significant improvements in various natural language processing tasks by integrating the strengths of large language models (LLMs) and external knowledge databases. However, RAG introduces long sequence generation and leads to high computation and memory costs. We propose RAGCache, a novel multilevel dynamic caching system tailored for RAG. Our analysis benchmarks current RAG systems, pinpointing the performance bottleneck (i.e., long sequence due to knowledge injection) and optimization opportunities (i.e., caching knowledge's intermediate states). Based on these insights, we design RAGCache, which organizes the intermediate states of retrieved knowledge in a knowledge tree and caches them in the GPU and host memory hierarchy. RAGCache proposes a replacement policy that is aware of LLM inference characteristics and RAG retrieval patterns. It also dynamically overlaps the retrieval and inference steps to minimize the end-to-end latency. We implement RAGCache and evaluate it on vLLM, a state-of-the-art LLM inference system and Faiss, a state-of-the-art vector database. The experimental results show that RAGCache reduces the time to first token (TTFT) by up to 4x and improves the throughput by up to 2.1x compared to vLLM integrated with Faiss.

LGApr 22, 2025
StreamRL: Scalable, Heterogeneous, and Elastic RL for LLMs with Disaggregated Stream Generation

Yinmin Zhong, Zili Zhang, Xiaoniu Song et al.

Reinforcement learning (RL) has become the core post-training technique for large language models (LLMs). RL for LLMs involves two stages: generation and training. The LLM first generates samples online, which are then used to derive rewards for training. The conventional view holds that the colocated architecture, where the two stages share resources via temporal multiplexing, outperforms the disaggregated architecture, in which dedicated resources are assigned to each stage. However, in real-world deployments, we observe that the colocated architecture suffers from resource coupling, where the two stages are constrained to use the same resources. This coupling compromises the scalability and cost-efficiency of colocated RL in large-scale training. In contrast, the disaggregated architecture allows for flexible resource allocation, supports heterogeneous training setups, and facilitates cross-datacenter deployment. StreamRL is designed with disaggregation from first principles and fully unlocks its potential by addressing two types of performance bottlenecks in existing disaggregated RL frameworks: pipeline bubbles, caused by stage dependencies, and skewness bubbles, resulting from long-tail output length distributions. To address pipeline bubbles, StreamRL breaks the traditional stage boundary in synchronous RL algorithms through stream generation and achieves full overlapping in asynchronous RL. To address skewness bubbles, StreamRL employs an output-length ranker model to identify long-tail samples and reduces generation time via skewness-aware dispatching and scheduling. Experiments show that StreamRL improves throughput by up to 2.66x compared to existing state-of-the-art systems, and improves cost-effectiveness by up to 1.33x in a heterogeneous, cross-datacenter setting.

DCApr 3, 2025
MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism

Ruidong Zhu, Ziheng Jiang, Chao Jin et al.

Mixture-of-Experts (MoE) showcases tremendous potential to scale large language models (LLMs) with enhanced performance and reduced computational complexity. However, its sparsely activated architecture shifts feed-forward networks (FFNs) from being compute-intensive to memory-intensive during inference, leading to substantially lower GPU utilization and increased operational costs. We present MegaScale-Infer, an efficient and cost-effective system for serving large-scale MoE models. MegaScale-Infer disaggregates attention and FFN modules within each model layer, enabling independent scaling, tailored parallelism strategies, and heterogeneous deployment for both modules. To fully exploit disaggregation in the presence of MoE's sparsity, MegaScale-Infer introduces ping-pong pipeline parallelism, which partitions a request batch into micro-batches and shuttles them between attention and FFNs for inference. Combined with distinct model parallelism for each module, MegaScale-Infer effectively hides communication overhead and maximizes GPU utilization. To adapt to disaggregated attention and FFN modules and minimize data transmission overhead (e.g., token dispatch), MegaScale-Infer provides a high-performance M2N communication library that eliminates unnecessary GPU-to-CPU data copies, group initialization overhead, and GPU synchronization. Experimental results indicate that MegaScale-Infer achieves up to 1.90x higher per-GPU throughput than state-of-the-art solutions.

AIApr 18, 2025
Task Assignment and Exploration Optimization for Low Altitude UAV Rescue via Generative AI Enhanced Multi-agent Reinforcement Learning

Xin Tang, Qian Chen, Wenjie Weng et al.

The integration of emerging uncrewed aerial vehicles (UAVs) with artificial intelligence (AI) and ground-embedded robots (GERs) has transformed emergency rescue operations in unknown environments. However, the high computational demands often exceed a single UAV's capacity, making it difficult to continuously provide stable high-level services. To address this, this paper proposes a cooperation framework involving UAVs, GERs, and airships. The framework enables resource pooling through UAV-to-GER (U2G) and UAV-to-airship (U2A) links, offering computing services for offloaded tasks. Specifically, we formulate the multi-objective problem of task assignment and exploration as a dynamic long-term optimization problem aiming to minimize task completion time and energy use while ensuring stability. Using Lyapunov optimization, we transform it into a per-slot deterministic problem and propose HG-MADDPG, which combines the Hungarian algorithm with a GDM-based multi-agent deep deterministic policy gradient. Simulations demonstrate significant improvements in offloading efficiency, latency, and system stability over baselines.

IVFeb 1, 2025
Patch Triplet Similarity Purification for Guided Real-World Low-Dose CT Image Denoising

Junhao Long, Fengwei Yang, Juncheng Yan et al.

Image denoising of low-dose computed tomography (LDCT) is an important problem for clinical diagnosis with reduced radiation exposure. Previous methods are mostly trained with pairs of synthetic or misaligned LDCT and normal-dose CT (NDCT) images. However, trained with synthetic noise or misaligned LDCT/NDCT image pairs, the denoising networks would suffer from blurry structure or motion artifacts. Since non-contrast CT (NCCT) images share the content characteristics to the corresponding NDCT images in a three-phase scan, they can potentially provide useful information for real-world LDCT image denoising. To exploit this aspect, in this paper, we propose to incorporate clean NCCT images as useful guidance for the learning of real-world LDCT image denoising networks. To alleviate the issue of spatial misalignment in training data, we design a new Patch Triplet Similarity Purification (PTSP) strategy to select highly similar patch (instead of image) triplets of LDCT, NDCT, and NCCT images for network training. Furthermore, we modify two image denoising transformers of SwinIR and HAT to accommodate the NCCT image guidance, by replacing vanilla self-attention with cross-attention. On our collected clinical dataset, the modified transformers trained with the data selected by our PTSP strategy show better performance than 15 comparison methods on real-world LDCT image denoising. Ablation studies validate the effectiveness of our NCCT image guidance and PTSP strategy. We will publicly release our data and code.

LGMay 9, 2024
From Algorithm to Hardware: A Survey on Efficient and Safe Deployment of Deep Neural Networks

Xue Geng, Zhe Wang, Chunyun Chen et al.

Deep neural networks (DNNs) have been widely used in many artificial intelligence (AI) tasks. However, deploying them brings significant challenges due to the huge cost of memory, energy, and computation. To address these challenges, researchers have developed various model compression techniques such as model quantization and model pruning. Recently, there has been a surge in research of compression methods to achieve model efficiency while retaining the performance. Furthermore, more and more works focus on customizing the DNN hardware accelerators to better leverage the model compression techniques. In addition to efficiency, preserving security and privacy is critical for deploying DNNs. However, the vast and diverse body of related works can be overwhelming. This inspires us to conduct a comprehensive survey on recent research toward the goal of high-performance, cost-efficient, and safe deployment of DNNs. Our survey first covers the mainstream model compression techniques such as model quantization, model pruning, knowledge distillation, and optimizations of non-linear operations. We then introduce recent advances in designing hardware accelerators that can adapt to efficient model compression approaches. Additionally, we discuss how homomorphic encryption can be integrated to secure DNN deployment. Finally, we discuss several issues, such as hardware evaluation, generalization, and integration of various compression approaches. Overall, we aim to provide a big picture of efficient DNNs, from algorithm to hardware accelerators and security perspectives.

ACC-PHMay 23, 2023
Trend-Based SAC Beam Control Method with Zero-Shot in Superconducting Linear Accelerator

Xiaolong Chen, Xin Qi, Chunguang Su et al.

The superconducting linear accelerator is a highly flexiable facility for modern scientific discoveries, necessitating weekly reconfiguration and tuning. Accordingly, minimizing setup time proves essential in affording users with ample experimental time. We propose a trend-based soft actor-critic(TBSAC) beam control method with strong robustness, allowing the agents to be trained in a simulated environment and applied to the real accelerator directly with zero-shot. To validate the effectiveness of our method, two different typical beam control tasks were performed on China Accelerator Facility for Superheavy Elements (CAFe II) and a light particle injector(LPI) respectively. The orbit correction tasks were performed in three cryomodules in CAFe II seperately, the time required for tuning has been reduced to one-tenth of that needed by human experts, and the RMS values of the corrected orbit were all less than 1mm. The other transmission efficiency optimization task was conducted in the LPI, our agent successfully optimized the transmission efficiency of radio-frequency quadrupole(RFQ) to over $85\%$ within 2 minutes. The outcomes of these two experiments offer substantiation that our proposed TBSAC approach can efficiently and effectively accomplish beam commissioning tasks while upholding the same standard as skilled human experts. As such, our method exhibits potential for future applications in other accelerator commissioning fields.

ASJan 10, 2022
Cross-Modal ASR Post-Processing System for Error Correction and Utterance Rejection

Jing Du, Shiliang Pu, Qinbo Dong et al.

Although modern automatic speech recognition (ASR) systems can achieve high performance, they may produce errors that weaken readers' experience and do harm to downstream tasks. To improve the accuracy and reliability of ASR hypotheses, we propose a cross-modal post-processing system for speech recognizers, which 1) fuses acoustic features and textual features from different modalities, 2) joints a confidence estimator and an error corrector in multi-task learning fashion and 3) unifies error correction and utterance rejection modules. Compared with single-modal or single-task models, our proposed system is proved to be more effective and efficient. Experiment result shows that our post-processing system leads to more than 10% relative reduction of character error rate (CER) for both single-speaker and multi-speaker speech on our industrial ASR system, with about 1.7ms latency for each token, which ensures that extra latency introduced by post-processing is acceptable in streaming speech recognition.

CRJul 5, 2021
Popcorn: Paillier Meets Compression For Efficient Oblivious Neural Network Inference

Jun Wang, Chao Jin, Souhail Meftah et al.

Oblivious inference enables the cloud to provide neural network inference-as-a-service (NN-IaaS), whilst neither disclosing the client data nor revealing the server's model. However, the privacy guarantee under oblivious inference usually comes with a heavy cost of efficiency and accuracy. We propose Popcorn, a concise oblivious inference framework entirely built on the Paillier homomorphic encryption scheme. We design a suite of novel protocols to compute non-linear activation and max-pooling layers. We leverage neural network compression techniques (i.e., neural weights pruning and quantization) to accelerate the inference computation. To implement the Popcorn framework, we only need to replace algebraic operations of existing networks with their corresponding Paillier homomorphic operations, which is extremely friendly for engineering development. We first conduct the performance evaluation and comparison based on the MNIST and CIFAR-10 classification tasks. Compared with existing solutions, Popcorn brings a significant communication overhead deduction, with a moderate runtime increase. Then, we benchmark the performance of oblivious inference on ImageNet. To our best knowledge, this is the first report based on a commercial-level dataset, taking a step towards the deployment to production.

CRFeb 6, 2021
FFConv: Fast Factorized Convolutional Neural Network Inference on Encrypted Data

Yuxiao Lu, Jie Lin, Chao Jin et al.

Homomorphic Encryption (HE), allowing computations on encrypted data (ciphertext) without decrypting it first, enables secure but prohibitively slow Convolutional Neural Network (CNN) inference for privacy-preserving applications in clouds. To reduce the inference latency, one approach is to pack multiple messages into a single ciphertext in order to reduce the number of ciphertexts and support massive parallelism of Homomorphic Multiply-Accumulate (HMA) operations between ciphertexts. Despite the faster HECNN inference, the mainstream packing schemes Dense Packing (DensePack) and Convolution Packing (ConvPack) introduce expensive rotation overhead, which prolongs the inference latency of HECNN for deeper and wider CNN architectures. In this paper, we propose a low-rank factorization method named FFConv dedicated to efficient ciphertext packing for reducing both the rotation overhead and HMA operations. FFConv approximates a d x d convolution layer with low-rank factorized convolutions, in which a d x d low-rank convolution with fewer channels is followed by a 1 x 1 convolution to restore the channels. The d x d low-rank convolution with DensePack leads to significantly reduced rotation operations, while the rotation overhead of 1 x 1 convolution with ConvPack is close to zero. To our knowledge, FFConv is the first work that is capable of reducing the rotation overhead incurred by DensePack and ConvPack simultaneously, without introducing additional special blocks into the HECNN inference pipeline. Compared to prior art LoLa and Falcon, our method reduces the inference latency by up to 88% and 21%, respectively, with comparable accuracy on MNIST and CIFAR-10.

CVOct 25, 2018
Investigating the Automatic Classification of Algae Using Fusion of Spectral and Morphological Characteristics of Algae via Deep Residual Learning

Jason L. Deglint, Chao Jin, Alexander Wong

Under the impact of global climate changes and human activities, harmful algae blooms in surface waters have become a growing concern due to negative impacts on water related industries. Therefore, reliable and cost effective methods of quantifying the type and concentration of threshold levels of algae cells has become critical for ensuring successful water management. In this work, we present SAMSON, an innovative system to automatically classify multiple types of algae from different phyla groups by combining standard morphological features with their multi-wavelength signals. Two phyla with focused investigation in this study are the Cyanophyta phylum (blue-green algae), and the Chlorophyta phylum (green algae). We use a custom-designed microscopy imaging system which is configured to image water samples at two fluorescent wavelengths and seven absorption wavelengths using discrete-wavelength high-powered light emitting diodes (LEDs). Powered by computer vision and machine learning, we investigate the possibility and effectiveness of automatic classification using a deep residual convolutional neural network. More specifically, a classification accuracy of 96% was achieved in an experiment conducted with six different algae types. This high level of accuracy was achieved using a deep residual convolutional neural network that learns the optimal combination of spectral and morphological features. These findings elude to the possibility of leveraging a unique fingerprint of algae cell (i.e. spectral wavelengths and morphological features) to automatically distinguish different algae types. Our work herein demonstrates that, when coupled with multi-band fluorescence microscopy, machine learning algorithms can potentially be used as a robust and cost-effective tool for identifying and enumerating algae cells.

CVMay 3, 2018
The feasibility of automated identification of six algae types using neural networks and fluorescence-based spectral-morphological features

Jason L. Deglint, Chao Jin, Angela Chao et al.

Harmful algae blooms (HABs), which produce lethal toxins, are a growing global concern since they negatively affect the quality of drinking water and have major negative impact on wildlife, the fishing industry, as well as tourism and recreational water use. In this study, we investigate the feasibility of leveraging machine learning and fluorescence-based spectral-morphological features to enable the identification of six different algae types in an automated fashion. More specifically, a custom multi-band fluorescence imaging microscope is used to capture fluorescence imaging data of a water sample at six different excitation wavelengths ranging from 405 nm - 530 nm. A number of morphological and spectral fluorescence features are then extracted from the isolated micro-organism imaging data, and used to train neural network classification models designed for the purpose of identification of the six algae types given an isolated micro-organism. Experimental results using three different neural network classification models showed that the use of either fluorescence-based spectral features or fluorescence-based spectral-morphological features to train neural network classification models led to statistically significant improvements in identification accuracy when compared to the use of morphological features (with average identification accuracies of 95.7%+/-3.5% and 96.1%+/-1.5%, respectively). These preliminary results are quite promising, given that the identification accuracy of human taxonomists are typically between the range of 67% and 83%, and thus illustrates the feasibility of leveraging machine learning and fluorescence-based spectral-morphological features as a viable method for automated identification of different algae types.