CVOct 30, 2025Code
LeMiCa: Lexicographic Minimax Path Caching for Efficient Diffusion-Based Video GenerationHuanlin Gao, Ping Chen, Fuyuan Shi et al.
We present LeMiCa, a training-free and efficient acceleration framework for diffusion-based video generation. While existing caching strategies primarily focus on reducing local heuristic errors, they often overlook the accumulation of global errors, leading to noticeable content degradation between accelerated and original videos. To address this issue, we formulate cache scheduling as a directed graph with error-weighted edges and introduce a Lexicographic Minimax Path Optimization strategy that explicitly bounds the worst-case path error. This approach substantially improves the consistency of global content and style across generated frames. Extensive experiments on multiple text-to-video benchmarks demonstrate that LeMiCa delivers dual improvements in both inference speed and generation quality. Notably, our method achieves a 2.9x speedup on the Latte model and reaches an LPIPS score of 0.05 on Open-Sora, outperforming prior caching techniques. Importantly, these gains come with minimal perceptual quality degradation, making LeMiCa a robust and generalizable paradigm for accelerating diffusion-based video generation. We believe this approach can serve as a strong foundation for future research on efficient and reliable video synthesis. Our code is available at :https://github.com/UnicomAI/LeMiCa
AIMay 14
MediaClaw: Multimodal Intelligent-Agent Platform Technical ReportShaoan Zhao, Huanlin Gao, Qiang Hui et al.
MediaClaw is a multimodal agent platform built on the OpenClaw ecosystem. Its core design follows a three-layer architecture of unified abstraction, pluginized extension, and workflow orchestration. The system is intended to address practical deployment pain points in AIGC adoption, including fragmented capabilities, heterogeneous interfaces, disconnected production processes, and limited reuse of high-quality production workflows. \system{} abstracts full-category AIGC capabilities into a unified invocation model, uses plugins to support hot-pluggable capability expansion, and uses task-oriented Skills to turn complex production processes into reusable workflow assets. This report focuses on the architectural design philosophy of MediaClaw, the design logic of its core capability model, and the key engineering trade-offs in implementation. It aims to provide reusable practical reference for building multimodal capability platforms.
LGJan 27
MeanCache: From Instantaneous to Average Velocity for Accelerating Flow Matching InferenceHuanlin Gao, Ping Chen, Fuyuan Shi et al.
We present MeanCache, a training-free caching framework for efficient Flow Matching inference. Existing caching methods reduce redundant computation but typically rely on instantaneous velocity information (e.g., feature caching), which often leads to severe trajectory deviations and error accumulation under high acceleration ratios. MeanCache introduces an average-velocity perspective: by leveraging cached Jacobian--vector products (JVP) to construct interval average velocities from instantaneous velocities, it effectively mitigates local error accumulation. To further improve cache timing and JVP reuse stability, we develop a trajectory-stability scheduling strategy as a practical tool, employing a Peak-Suppressed Shortest Path under budget constraints to determine the schedule. Experiments on FLUX.1, Qwen-Image, and HunyuanVideo demonstrate that MeanCache achieves 4.12X and 4.56X and 3.59X acceleration, respectively, while consistently outperforming state-of-the-art caching baselines in generation quality. We believe this simple yet effective approach provides a new perspective for Flow Matching inference and will inspire further exploration of stability-driven acceleration in commercial-scale generative models.
CVAug 21, 2019
A Realistic Face-to-Face Conversation System based on Deep Neural NetworksZezhou Chen, Zhaoxiang Liu, Huan Hu et al.
To improve the experiences of face-to-face conversation with avatar, this paper presents a novel conversation system. It is composed of two sequence-to-sequence models respectively for listening and speaking and a Generative Adversarial Network (GAN) based realistic avatar synthesizer. The models exploit the facial action and head pose to learn natural human reactions. Based on the models' output, the synthesizer uses the Pixel2Pixel model to generate realistic facial images. To show the improvement of our system, we use a 3D model based avatar driving scheme as a reference. We train and evaluate our neural networks with the data from ESPN shows. Experimental results show that our conversation system can generate natural facial reactions and realistic facial images.
CVMay 5, 2019
Towards More Realistic Human-Robot Conversation: A Seq2Seq-based Body Gesture Interaction SystemMinjie Hua, Fuyuan Shi, Yibing Nan et al.
This paper presents a novel system that enables intelligent robots to exhibit realistic body gestures while communicating with humans. The proposed system consists of a listening model and a speaking model used in corresponding conversational phases. Both models are adapted from the sequence-to-sequence (seq2seq) architecture to synthesize body gestures represented by the movements of twelve upper-body keypoints. All the extracted 2D keypoints are firstly 3D-transformed, then rotated and normalized to discard irrelevant information. Substantial videos of human conversations from Youtube are collected and preprocessed to train the listening and speaking models separately, after which the two models are evaluated using metrics of mean squared error (MSE) and cosine similarity on the test dataset. The tuned system is implemented to drive a virtual avatar as well as Pepper, a physical humanoid robot, to demonstrate the improvement on conversational interaction abilities of our method in practice.
CVApr 28, 2019
Synthetic Data Generation and Adaption for Object Detection in Smart Vending MachinesKai Wang, Fuyuan Shi, Wenqi Wang et al.
This paper presents an improved scheme for the generation and adaption of synthetic images for the training of deep Convolutional Neural Networks(CNNs) to perform the object detection task in smart vending machines. While generating synthetic data has proved to be effective for complementing the training data in supervised learning methods, challenges still exist for generating virtual images which are similar to those of the complex real scenes and minimizing redundant training data. To solve these problems, we consider the simulation of cluttered objects placed in a virtual scene and the wide-angle camera with distortions used to capture the whole scene in the data generation process, and post-processed the generated images with a elaborately-designed generative network to make them more similar to the real images. Various experiments have been conducted to prove the efficiency of using the generated virtual images to enhance the detection precision on existing datasets with limited real training data and the generalization ability of applying the trained network to datasets collected in new environment.