Yijing Zeng

LG
h-index25
3papers
28citations
Novelty73%
AI Score41

3 Papers

DCSep 25, 2025
Tiny but Mighty: A Software-Hardware Co-Design Approach for Efficient Multimodal Inference on Battery-Powered Small Devices

Yilong Li, Shuai Zhang, Yijing Zeng et al. · amazon-science

Large Multimodal Models (LMMs) are inherently modular, consisting of vision and audio encoders, projectors, and large language models. Yet, they are almost always executed monolithically, which underutilizes the heterogeneous accelerators (NPUs, GPUs, DSPs) in modern SoCs and leads to high end-to-end latency. In this paper, we present NANOMIND, a hardware--software co-design inference framework for Large Multimodal Models (LMMs) that breaks large models into modular ``bricks'' (vision, language, audio, etc.) and maps each to its ideal accelerator. The key insight is that large models can be broken into modular components and scheduled to run on the most appropriate compute units. It performs module-level dynamic offloading across accelerators on unified-memory SoCs. By combining customized hardware design, system-level scheduling, and optimized low-bit computation kernels, we demonstrate our framework with a compact, battery-powered device capable of running LMMs entirely on device. This prototype functions as a self-contained intelligent assistant that requires no network connectivity, while achieving higher throughput and superior power efficiency under strict resource constraints. The design further bypasses CPU bottlenecks and reduces redundant memory usage through token-aware buffer management and module-level coordination. Our system outperforms existing implementations in resource efficiency, cutting energy consumption by 42.3\% and GPU memory usage by 11.2\%. This enables a battery-powered device to run LLaVA-OneVision with a camera for nearly half a day and LLaMA-3-8B for voice interactions up to almost 20.8 hours.

LGAug 2, 2021
Few-Shot Domain Adaptation For End-to-End Communication

Jayaram Raghuram, Yijing Zeng, Dolores García Martí et al.

The problem of end-to-end learning of a communication system using an autoencoder -- consisting of an encoder, channel, and decoder modeled using neural networks -- has recently been shown to be a promising approach. A challenge faced in the practical adoption of this learning approach is that under changing channel conditions (e.g. a wireless link), it requires frequent retraining of the autoencoder in order to maintain a low decoding error rate. Since retraining is both time consuming and requires a large number of samples, it becomes impractical when the channel distribution is changing quickly. We propose to address this problem using a fast and sample-efficient (few-shot) domain adaptation method that does not change the encoder and decoder networks. Different from conventional training-time unsupervised or semi-supervised domain adaptation, here we have a trained autoencoder from a source distribution, that we want to adapt (at test time) to a target distribution using only a small labeled dataset and no unlabeled data. Our method focuses on a Gaussian mixture density network based channel model, and formulates its adaptation based on class and component-conditional affine transformations. The learned affine transformations are used to design an optimal input transformation at the decoder to compensate for the distribution shift, and effectively present to the decoder inputs close to the source distribution. Experiments on a real mmWave FPGA setup as well as a number of simulated distribution changes common to the wireless setting demonstrate the effectiveness of our method at adaptation using very small number of target domain samples.

LGFeb 22, 2017
Tuple-oriented Compression for Large-scale Mini-batch Stochastic Gradient Descent

Fengan Li, Lingjiao Chen, Yijing Zeng et al.

Data compression is a popular technique for improving the efficiency of data processing workloads such as SQL queries and more recently, machine learning (ML) with classical batch gradient methods. But the efficacy of such ideas for mini-batch stochastic gradient descent (MGD), arguably the workhorse algorithm of modern ML, is an open question. MGD's unique data access pattern renders prior art, including those designed for batch gradient methods, less effective. We fill this crucial research gap by proposing a new lossless compression scheme we call tuple-oriented compression (TOC) that is inspired by an unlikely source, the string/text compression scheme Lempel-Ziv-Welch, but tailored to MGD in a way that preserves tuple boundaries within mini-batches. We then present a suite of novel compressed matrix operation execution techniques tailored to the TOC compression scheme that operate directly over the compressed data representation and avoid decompression overheads. An extensive empirical evaluation with real-world datasets shows that TOC consistently achieves substantial compression ratios by up to 51x and reduces runtimes for MGD workloads by up to 10.2x in popular ML systems.