h-index49
42papers
526citations
Novelty53%
AI Score58

42 Papers

AIAug 29, 2023
AutoDroid: LLM-powered Task Automation in Android

Hao Wen, Yuanchun Li, Guohong Liu et al.

Mobile task automation is an attractive technique that aims to enable voice-based hands-free user interaction with smartphones. However, existing approaches suffer from poor scalability due to the limited language understanding ability and the non-trivial manual efforts required from developers or end-users. The recent advance of large language models (LLMs) in language understanding and reasoning inspires us to rethink the problem from a model-centric perspective, where task preparation, comprehension, and execution are handled by a unified language model. In this work, we introduce AutoDroid, a mobile task automation system capable of handling arbitrary tasks on any Android application without manual efforts. The key insight is to combine the commonsense knowledge of LLMs and domain-specific knowledge of apps through automated dynamic analysis. The main components include a functionality-aware UI representation method that bridges the UI with the LLM, exploration-based memory injection techniques that augment the app-specific domain knowledge of LLM, and a multi-granularity query optimization module that reduces the cost of model inference. We integrate AutoDroid with off-the-shelf LLMs including online GPT-4/GPT-3.5 and on-device Vicuna, and evaluate its performance on a new benchmark for memory-augmented Android task automation with 158 common tasks. The results demonstrated that AutoDroid is able to precisely generate actions with an accuracy of 90.9%, and complete tasks with a success rate of 71.3%, outperforming the GPT-4-powered baselines by 36.4% and 39.7%. The demo, benchmark suites, and source code of AutoDroid will be released at url{https://autodroid-sys.github.io/}.

CVFeb 9Code
MOVA: Towards Scalable and Synchronized Video-Audio Generation

SII-OpenMOSS Team, Donghua Yu, Mingshu Chen et al.

Audio is indispensable for real-world video, yet generation models have largely overlooked audio components. Current approaches to producing audio-visual content often rely on cascaded pipelines, which increase cost, accumulate errors, and degrade overall quality. While systems such as Veo 3 and Sora 2 emphasize the value of simultaneous generation, joint multimodal modeling introduces unique challenges in architecture, data, and training. Moreover, the closed-source nature of existing systems limits progress in the field. In this work, we introduce MOVA (MOSS Video and Audio), an open-source model capable of generating high-quality, synchronized audio-visual content, including realistic lip-synced speech, environment-aware sound effects, and content-aligned music. MOVA employs a Mixture-of-Experts (MoE) architecture, with a total of 32B parameters, of which 18B are active during inference. It supports IT2VA (Image-Text to Video-Audio) generation task. By releasing the model weights and code, we aim to advance research and foster a vibrant community of creators. The released codebase features comprehensive support for efficient inference, LoRA fine-tuning, and prompt enhancement.

IVJul 11, 2024Code
OMR-NET: a two-stage octave multi-scale residual network for screen content image compression

Shiqi Jiang, Ting Ren, Congrui Fu et al.

Screen content (SC) differs from natural scene (NS) with unique characteristics such as noise-free, repetitive patterns, and high contrast. Aiming at addressing the inadequacies of current learned image compression (LIC) methods for SC, we propose an improved two-stage octave convolutional residual blocks (IToRB) for high and low-frequency feature extraction and a cascaded two-stage multi-scale residual blocks (CTMSRB) for improved multi-scale learning and nonlinearity in SC. Additionally, we employ a window-based attention module (WAM) to capture pixel correlations, especially for high contrast regions in the image. We also construct a diverse SC image compression dataset (SDU-SCICD2K) for training, including text, charts, graphics, animation, movie, game and mixture of SC images and NS images. Experimental results show our method, more suited for SC than NS data, outperforms existing LIC methods in rate-distortion performance on SC images. The code is publicly available at https://github.com/SunshineSki/OMR Net.git.

CVJun 29, 2022
Turbo: Opportunistic Enhancement for Edge Video Analytics

Yan Lu, Shiqi Jiang, Ting Cao et al.

Edge computing is being widely used for video analytics. To alleviate the inherent tension between accuracy and cost, various video analytics pipelines have been proposed to optimize the usage of GPU on edge nodes. Nonetheless, we find that GPU compute resources provisioned for edge nodes are commonly under-utilized due to video content variations, subsampling and filtering at different places of a pipeline. As opposed to model and pipeline optimization, in this work, we study the problem of opportunistic data enhancement using the non-deterministic and fragmented idle GPU resources. In specific, we propose a task-specific discrimination and enhancement module and a model-aware adversarial training mechanism, providing a way to identify and transform low-quality images that are specific to a video pipeline in an accurate and efficient manner. A multi-exit model structure and a resource-aware scheduler is further developed to make online enhancement decisions and fine-grained inference execution under latency and GPU resource constraints. Experiments across multiple video analytics pipelines and datasets reveal that by judiciously allocating a small amount of idle resources on frames that tend to yield greater marginal benefits from enhancement, our system boosts DNN object detection accuracy by $7.3-11.3\%$ without incurring any latency costs.

IVMay 18Code
Inter-LPCM: Learning-based Inter-Frame Predictive Coding for LiDAR Point Cloud Compression

Chang Sun, Hui Yuan, Shiqi Jiang et al.

Because LiDAR sensors acquire point clouds with a fixed angular resolution, the resulting data can be systematically parameterized and efficiently compressed in the spherical coordinate system. Traditional spherical coordinate-based point cloud compression methods have demonstrated strong rate-distortion (RD) performance, with the predictive geometry coding (PredGeom) method in the geometry-based point cloud compression (G-PCC) standard being a prominent example. Although PredGeom includes an inter-frame prediction mode, it relies on a simple linear model, which limits its ability to capture complex motion patterns and structural dependencies. Meanwhile, existing learning-based compression methods in the spherical domain do not exploit inter-frame correlations to reduce geometry redundancy. To address these limitations, we propose a learning-based inter-frame predictive coding method, termed Inter-LPCM. For azimuth prediction, we employ a delta coding strategy based on the predefined angular resolution. To improve radius compression, we introduce an inter-frame radius predictive (Inter-RP) model that estimates the current point's radius using neighboring points from both the current frame and the registered reference frame. In addition, we design a lightweight attention-based prediction (LAEP) model to predict elevation angles by capturing long-range geometric correlations across different coordinates. For quantization, we propose an RD-optimized method to select quantization steps in the spherical coordinate system. For entropy coding, we design distinct models for each spherical coordinate component. These models are adapted to the statistical priors of each coordinate, enabling more accurate probability estimation. Our source code is publicly available at https://github.com/SDUChangSun/Inter-LPCM

LGMar 13, 2023
AdaptiveNet: Post-deployment Neural Architecture Adaptation for Diverse Edge Environments

Hao Wen, Yuanchun Li, Zunshuai Zhang et al.

Deep learning models are increasingly deployed to edge devices for real-time applications. To ensure stable service quality across diverse edge environments, it is highly desirable to generate tailored model architectures for different conditions. However, conventional pre-deployment model generation approaches are not satisfactory due to the difficulty of handling the diversity of edge environments and the demand for edge information. In this paper, we propose to adapt the model architecture after deployment in the target environment, where the model quality can be precisely measured and private edge data can be retained. To achieve efficient and effective edge model generation, we introduce a pretraining-assisted on-cloud model elastification method and an edge-friendly on-device architecture search method. Model elastification generates a high-quality search space of model architectures with the guidance of a developer-specified oracle model. Each subnet in the space is a valid model with different environment affinity, and each device efficiently finds and maintains the most suitable subnet based on a series of edge-tailored optimizations. Extensive experiments on various edge devices demonstrate that our approach is able to achieve significantly better accuracy-latency tradeoffs (e.g. 46.74\% higher on average accuracy with a 60\% latency budget) than strong baselines with minimal overhead (13 GPU hours in the cloud and 2 minutes on the edge server).

IVSep 16, 2024
SPAC: Sampling-based Progressive Attribute Compression for Dense Point Clouds

Xiaolong Mao, Hui Yuan, Tian Guo et al.

We propose an end-to-end attribute compression method for dense point clouds. The proposed method combines a frequency sampling module, an adaptive scale feature extraction module with geometry assistance, and a global hyperprior entropy model. The frequency sampling module uses a Hamming window and the Fast Fourier Transform to extract high-frequency components of the point cloud. The difference between the original point cloud and the sampled point cloud is divided into multiple sub-point clouds. These sub-point clouds are then partitioned using an octree, providing a structured input for feature extraction. The feature extraction module integrates adaptive convolutional layers and uses offset-attention to capture both local and global features. Then, a geometry-assisted attribute feature refinement module is used to refine the extracted attribute features. Finally, a global hyperprior model is introduced for entropy encoding. This model propagates hyperprior parameters from the deepest (base) layer to the other layers, further enhancing the encoding efficiency. At the decoder, a mirrored network is used to progressively restore features and reconstruct the color attribute through transposed convolutional layers. The proposed method encodes base layer information at a low bitrate and progressively adds enhancement layer information to improve reconstruction accuracy. Compared to the latest G-PCC test model (TMC13v23) under the MPEG common test conditions (CTCs), the proposed method achieved an average Bjontegaard delta bitrate reduction of 24.58% for the Y component (21.23% for YUV combined) on the MPEG Category Solid dataset and 22.48% for the Y component (17.19% for YUV combined) on the MPEG Category Dense dataset. This is the first instance of a learning-based codec outperforming the G-PCC standard on these datasets under the MPEG CTCs.

IVJul 11, 2024
Global Spatial-Temporal Information-based Residual ConvLSTM for Video Space-Time Super-Resolution

Congrui Fu, Hui Yuan, Shiqi Jiang et al.

By converting low-frame-rate, low-resolution videos into high-frame-rate, high-resolution ones, space-time video super-resolution techniques can enhance visual experiences and facilitate more efficient information dissemination. We propose a convolutional neural network (CNN) for space-time video super-resolution, namely GIRNet. To generate highly accurate features and thus improve performance, the proposed network integrates a feature-level temporal interpolation module with deformable convolutions and a global spatial-temporal information-based residual convolutional long short-term memory (convLSTM) module. In the feature-level temporal interpolation module, we leverage deformable convolution, which adapts to deformations and scale variations of objects across different scene locations. This presents a more efficient solution than conventional convolution for extracting features from moving objects. Our network effectively uses forward and backward feature information to determine inter-frame offsets, leading to the direct generation of interpolated frame features. In the global spatial-temporal information-based residual convLSTM module, the first convLSTM is used to derive global spatial-temporal information from the input features, and the second convLSTM uses the previously computed global spatial-temporal information feature as its initial cell state. This second convLSTM adopts residual connections to preserve spatial information, thereby enhancing the output features. Experiments on the Vimeo90K dataset show that the proposed method outperforms state-of-the-art techniques in peak signal-to-noise-ratio (by 1.45 dB, 1.14 dB, and 0.02 dB over STARnet, TMNet, and 3DAttGAN, respectively), structural similarity index(by 0.027, 0.023, and 0.006 over STARnet, TMNet, and 3DAttGAN, respectively), and visually.

AISep 16, 2023
Empowering In-Browser Deep Learning Inference on Edge Devices with Just-in-Time Kernel Optimizations

Fucheng Jia, Shiqi Jiang, Ting Cao et al.

Web is increasingly becoming the primary platform to deliver AI services onto edge devices, making in-browser deep learning (DL) inference more prominent. Nevertheless, the heterogeneity of edge devices, combined with the underdeveloped state of Web hardware acceleration practices, hinders current in-browser inference from achieving its full performance potential on target devices. To address this issue, this paper presents the pioneering inbrowser inference system, nnJIT, which enables just-in-time (JIT) auto-generation of optimized computing kernels for edge devices. nnJIT is built upon two novel techniques that significantly reduce kernel search and compilation overhead while improving performance firmly: Tensor-Web Compiling Co-Design lowers compiling costs by around 100X through eliminating redundant and ineffective compiling passes; Web-Specific Lite Kernel Optimization Space reduces kernel tuning costs by focusing on Web programming requirements and efficient device resource utilization, pruning the optimization space from millions to only dozens. nnJIT is evaluated for modern models, e.g., BART, T5, and Llama 2, on a range of edge devices including laptops and smartphones using different browsers and hardware from ARM, Intel, AMD and Nvidia. The results show that nnJIT can achieve up to 8.2X faster within 30 seconds compared to the existing baselines.

ROMay 8Code
MemCompiler: Compile, Don't Inject -- State-Conditioned Memory for Embodied Agents

Xin Ding, Xinrui Wang, Yifan Yang et al.

Existing memory systems for embodied agents typically inject retrieved memory as static context at episode start, a paradigm we term Ahead-of-time Monolithic Memory Injection (AMMI). However, this static design quickly becomes misaligned with the agent's evolving state and may degrade lightweight executors below the no-memory baseline. To address this, we propose MemCompiler, which reframes memory utilization as State-Conditioned Memory Compilation. A learned Memory Compiler reads a structured Brief State capturing the agent's current execution state and dynamically selects and compiles only relevant memory into executable guidance. This guidance is delivered through a text channel and a latent Soft-Mem channel that preserves perceptual information not expressible in text. Across Alf World, EmbodiedBench, and ScienceWorld, MemCompiler consistently improves over no-memory across open-source backbones (up to +129%), matches or approaches frontier closed-source systems, and reduces per-step latency by 60%, demonstrating that state-aware memory compilation improves both effectiveness and efficiency.

AIMar 20, 2025Code
Advancing Mobile GUI Agents: A Verifier-Driven Approach to Practical Deployment

Gaole Dai, Shiqi Jiang, Ting Cao et al.

We propose V-Droid, a mobile GUI task automation agent. Unlike previous mobile agents that utilize Large Language Models (LLMs) as generators to directly generate actions at each step, V-Droid employs LLMs as verifiers to evaluate candidate actions before making final decisions. To realize this novel paradigm, we introduce a comprehensive framework for constructing verifier-driven mobile agents: the discretized action space construction coupled with the prefilling-only workflow to accelerate the verification process, the pair-wise progress preference training to significantly enhance the verifier's decision-making capabilities, and the scalable human-agent joint annotation scheme to efficiently collect the necessary data at scale. V-Droid obtains a substantial task success rate across several public mobile task automation benchmarks: 59.5% on AndroidWorld, 38.3% on AndroidLab, and 49% on MobileAgentBench, surpassing existing agents by 5.2%, 2.1%, and 9%, respectively. Furthermore, V-Droid achieves a remarkably low latency of 4.3s per step, which is 6.1x faster compared with existing mobile agents. The source code is available at https://github.com/V-Droid-Agent/V-Droid.

CVNov 3, 2025
OmniVLA: Physically-Grounded Multimodal VLA with Unified Multi-Sensor Perception for Robotic Manipulation

Heyu Guo, Shanmu Wang, Ruichun Ma et al.

Vision-language-action (VLA) models have shown strong generalization for robotic action prediction through large-scale vision-language pretraining. However, most existing models rely solely on RGB cameras, limiting their perception and, consequently, manipulation capabilities. We present OmniVLA, an omni-modality VLA model that integrates novel sensing modalities for physically-grounded spatial intelligence beyond RGB perception. The core of our approach is the sensor-masked image, a unified representation that overlays spatially grounded and physically meaningful masks onto the RGB images, derived from sensors including an infrared camera, a mmWave radar, and a microphone array. This image-native unification keeps sensor input close to RGB statistics to facilitate training, provides a uniform interface across sensor hardware, and enables data-efficient learning with lightweight per-sensor projectors. Built on this, we present a multisensory vision-language-action model architecture and train the model based on an RGB-pretrained VLA backbone. We evaluate OmniVLA on challenging real-world tasks where sensor-modality perception guides the robotic manipulation. OmniVLA achieves an average task success rate of 84%, significantly outperforms both RGB-only and raw-sensor-input baseline models by 59% and 28% respectively, meanwhile showing higher learning efficiency and stronger generalization capability.

CVNov 10, 2025
MPJudge: Towards Perceptual Assessment of Music-Induced Paintings

Shiqi Jiang, Tianyi Liang, Changbo Wang et al.

Music induced painting is a unique artistic practice, where visual artworks are created under the influence of music. Evaluating whether a painting faithfully reflects the music that inspired it poses a challenging perceptual assessment task. Existing methods primarily rely on emotion recognition models to assess the similarity between music and painting, but such models introduce considerable noise and overlook broader perceptual cues beyond emotion. To address these limitations, we propose a novel framework for music induced painting assessment that directly models perceptual coherence between music and visual art. We introduce MPD, the first large scale dataset of music painting pairs annotated by domain experts based on perceptual coherence. To better handle ambiguous cases, we further collect pairwise preference annotations. Building on this dataset, we present MPJudge, a model that integrates music features into a visual encoder via a modulation based fusion mechanism. To effectively learn from ambiguous cases, we adopt Direct Preference Optimization for training. Extensive experiments demonstrate that our method outperforms existing approaches. Qualitative results further show that our model more accurately identifies music relevant regions in paintings.

AIJul 25, 2024
Babel: A Scalable Pre-trained Model for Multi-Modal Sensing via Expandable Modality Alignment

Shenghong Dai, Shiqi Jiang, Yifan Yang et al.

This paper presents Babel, the expandable modality alignment model, specially designed for multi-modal sensing. While there has been considerable work on multi-modality alignment, they all struggle to effectively incorporate multiple sensing modalities due to the data scarcity constraints. How to utilize multi-modal data with partial pairings in sensing remains an unresolved challenge. Babel tackles this challenge by introducing the concept of expandable modality alignment. The key idea involves transforming the N-modality alignment into a series of binary-modality alignments. Novel techniques are also proposed to further mitigate data scarcity issue and balance the contribution of the newly incorporated modality with the previously established modality alignment during the expandable alignment process. We provide the comprehensive implementation. In the pre-training phase, Babel currently aligns 6 sensing modalities, namely Wi-Fi, mmWave, IMU, LiDAR, video, and depth. For the deployment phase, as a foundation model, any single or combination of aligned modalities could be selected from Babel and applied to downstream tasks. Evaluation demonstrates Babel's outstanding performance on eight human activity recognition datasets, compared to a broad range of baselines e.g., the SOTA single-modal sensing networks, multi-modal sensing framework, and multi-modal large language models. Babel not only improves the performance of individual modality sensing (12% averaged accuracy improvement), but also effectively fuses multiple available modalities (up to 22% accuracy increase). Case studies also highlight emerging application scenarios empowered by Babel, including cross-modality retrieval (i.e., sensing imaging), and bridging LLM for sensing comprehension.

CVMay 1, 2025Code
AVA: Towards Agentic Video Analytics with Vision Language Models

Yuxuan Yan, Shiqi Jiang, Ting Cao et al.

AI-driven video analytics has become increasingly important across diverse domains. However, existing systems are often constrained to specific, predefined tasks, limiting their adaptability in open-ended analytical scenarios. The recent emergence of Vision Language Models (VLMs) as transformative technologies offers significant potential for enabling open-ended video understanding, reasoning, and analytics. Nevertheless, their limited context windows present challenges when processing ultra-long video content, which is prevalent in real-world applications. To address this, we introduce AVA, a VLM-powered system designed for open-ended, advanced video analytics. AVA incorporates two key innovations: (1) the near real-time construction of Event Knowledge Graphs (EKGs) for efficient indexing of long or continuous video streams, and (2) an agentic retrieval-generation mechanism that leverages EKGs to handle complex and diverse queries. Comprehensive evaluations on public benchmarks, LVBench and VideoMME-Long, demonstrate that AVA achieves state-of-the-art performance, attaining 62.3% and 64.1% accuracy, respectively-significantly surpassing existing VLM and video Retrieval-Augmented Generation (RAG) systems. Furthermore, to evaluate video analytics in ultra-long and open-world video scenarios, we introduce a new benchmark, AVA-100. This benchmark comprises 8 videos, each exceeding 10 hours in duration, along with 120 manually annotated, diverse, and complex question-answer pairs. On AVA-100, AVA achieves top-tier performance with an accuracy of 75.8%. The source code of AVA is available at https://github.com/I-ESC/Project-Ava. The AVA-100 benchmark can be accessed at https://huggingface.co/datasets/iesc/Ava-100.

AIMay 11
EmbodiSkill: Skill-Aware Reflection for Self-Evolving Embodied Agents

Ruofei Ju, Xinrui Wang, Xin Ding et al.

Embodied agents can benefit from skills that guide object search, action execution, and state changes across diverse environments. Since embodied environments vary across layouts, object states, and other execution factors, these skills must self-evolve from trajectories generated during task execution. However, existing skill self-evolution methods are mainly developed in digital environments and often convert trajectories into coarse skill updates. Directly applying this paradigm to embodied settings is problematic, because a failed task execution may reflect not only incorrect skill content, but also an execution lapse in which the agent fails to follow valid guidance. We propose EmbodiSkill, a training-free framework for embodied skill self-evolution through skill-aware reflection and targeted revision. EmbodiSkill interprets each trajectory with respect to the current skill, uses skill-changing evidence to update the skill body, and uses execution-lapse evidence to preserve and emphasize valid guidance. Experiments on ALFWorld and EmbodiedBench show that EmbodiSkill consistently improves embodied task success. On ALFWorld, EmbodiSkill enables a frozen Qwen3.5-27B executor to reach 93.28% task success, outperforming GPT-5.2 used as a direct agent without skills by 31.58%. These results show that skill-aware self-evolution helps embodied agents accumulate reusable procedural knowledge from their own trajectories.

LGFeb 13
Concept Heterogeneity-aware Representation Steering

Laziz U. Abdullaev, Noelle Y. L. Wong, Ryan T. Z. Lee et al.

Representation steering offers a lightweight mechanism for controlling the behavior of large language models (LLMs) by intervening on internal activations at inference time. Most existing methods rely on a single global steering direction, typically obtained via difference-in-means over contrastive datasets. This approach implicitly assumes that the target concept is homogeneously represented across the embedding space. In practice, however, LLM representations can be highly non-homogeneous, exhibiting clustered, context-dependent structure, which renders global steering directions brittle. In this work, we view representation steering through the lens of optimal transport (OT), noting that standard difference-in-means steering implicitly corresponds to the OT map between two unimodal Gaussian distributions with identical covariance, yielding a global translation. To relax this restrictive assumption, we theoretically model source and target representations as Gaussian mixture models and formulate steering as a discrete OT problem between semantic latent clusters. From the resulting transport plan, we derive an explicit, input-dependent steering map via barycentric projection, producing a smooth, kernel-weighted combination of cluster-level shifts. We term this method Concept Heterogeneity-aware Representation Steering (CHaRS). Through numerous experimental settings, we show that CHaRS yields more effective behavioral control than global steering.

CVMar 19
Em-Garde: A Propose-Match Framework for Proactive Streaming Video Understanding

Yikai Zheng, Xin Ding, Yifan Yang et al.

Recent advances in Streaming Video Understanding has enabled a new interaction paradigm where models respond proactively to user queries. Current proactive VideoLLMs rely on per-frame triggering decision making, which suffers from an efficiency-accuracy dilemma. We propose Em-Garde, a novel framework that decouples semantic understanding from streaming perception. At query time, the Instruction-Guided Proposal Parser transforms user queries into structured, perceptually grounded visual proposals; during streaming, a Lightweight Proposal Matching Module performs efficient embedding-based matching to trigger responses. Experiments on StreamingBench and OVO-Bench demonstrate consistent improvements over prior models in proactive response accuracy and efficiency, validating an effective solution for proactive video understanding under strict computational constraints.

CVOct 2, 2023
Color and Texture Dual Pipeline Lightweight Style Transfer

ShiQi Jiang

Style transfer methods typically generate a single stylized output of color and texture coupling for reference styles, and color transfer schemes may introduce distortion or artifacts when processing reference images with duplicate textures. To solve the problem, we propose a Color and Texture Dual Pipeline Lightweight Style Transfer CTDP method, which employs a dual pipeline method to simultaneously output the results of color and texture transfer. Furthermore, we designed a masked total variation loss to suppress artifacts and small texture representations in color transfer results without affecting the semantic part of the content. More importantly, we are able to add texture structures with controllable intensity to color transfer results for the first time. Finally, we conducted feature visualization analysis on the texture generation mechanism of the framework and found that smoothing the input image can almost completely eliminate this texture structure. In comparative experiments, the color and texture transfer results generated by CTDP both achieve state-of-the-art performance. Additionally, the weight of the color transfer branch model size is as low as 20k, which is 100-1500 times smaller than that of other state-of-the-art models.

CVOct 24, 2023
Region-controlled Style Transfer

Junjie Kang, Jinsong Wu, Shiqi Jiang

Image style transfer is a challenging task in computational vision. Existing algorithms transfer the color and texture of style images by controlling the neural network's feature layers. However, they fail to control the strength of textures in different regions of the content image. To address this issue, we propose a training method that uses a loss function to constrain the style intensity in different regions. This method guides the transfer strength of style features in different regions based on the gradient relationship between style and content images. Additionally, we introduce a novel feature fusion method that linearly transforms content features to resemble style features while preserving their semantic relationships. Extensive experiments have demonstrated the effectiveness of our proposed approach.

CVJun 29, 2023
Lightweight texture transfer based on texture feature preset

ShiQi Jiang

In the task of texture transfer, reference texture images typically exhibit highly repetitive texture features, and the texture transfer results from different content images under the same style also share remarkably similar texture patterns. Encoding such highly similar texture features often requires deep layers and a large number of channels, making it is also the main source of the entire model's parameter count and computational load, and inference time. We propose a lightweight texture transfer based on texture feature preset (TFP). TFP takes full advantage of the high repetitiveness of texture features by providing preset universal texture feature maps for a given style. These preset feature maps can be fused and decoded directly with shallow color transfer feature maps of any content to generate texture transfer results, thereby avoiding redundant texture information from being encoded repeatedly. The texture feature map we preset is encoded through noise input images with consistent distribution (standard normal distribution). This consistent input distribution can completely avoid the problem of texture transfer differentiation, and by randomly sampling different noise inputs, we can obtain different texture features and texture transfer results under the same reference style. Compared to state-of-the-art techniques, our TFP not only produces visually superior results but also reduces the model size by 3.2-3538 times and speeds up the process by 1.8-5.6 times.

CVNov 9, 2023
Dual Pipeline Style Transfer with Input Distribution Differentiation

ShiQi Jiang, JunJie Kang, YuJian Li

The color and texture dual pipeline architecture (CTDP) suppresses texture representation and artifacts through masked total variation loss (Mtv), and further experiments have shown that smooth input can almost completely eliminate texture representation. We have demonstrated through experiments that smooth input is not the key reason for removing texture representations, but rather the distribution differentiation of the training dataset. Based on this, we propose an input distribution differentiation training strategy (IDD), which forces the generation of textures to be completely dependent on the noise distribution, while the smooth distribution will not produce textures at all. Overall, our proposed distribution differentiation training strategy allows for two pre-defined input distributions to be responsible for two generation tasks, with noise distribution responsible for texture generation and smooth distribution responsible for color smooth transfer. Finally, we choose a smooth distribution as the input for the forward inference stage to completely eliminate texture representations and artifacts in color transfer tasks.

CVMar 8, 2025
StreamMind: Unlocking Full Frame Rate Streaming Video Dialogue through Event-Gated Cognition

Xin Ding, Hao Wu, Yifan Yang et al.

With the rise of real-world human-AI interaction applications, such as AI assistants, the need for Streaming Video Dialogue is critical. To address this need, we introduce StreamMind, a video LLM framework that achieves ultra-FPS streaming video processing (100 fps on a single A100) and enables proactive, always-on responses in real time, without explicit user intervention. To solve the key challenge of the contradiction between linear video streaming speed and quadratic transformer computation cost, we propose a novel perception-cognition interleaving paradigm named ''event-gated LLM invocation'', in contrast to the existing per-time-step LLM invocation. By introducing a Cognition Gate network between the video encoder and the LLM, LLM is only invoked when relevant events occur. To realize the event feature extraction with constant cost, we propose Event-Preserving Feature Extractor (EPFE) based on state-space method, generating a single perception token for spatiotemporal features. These techniques enable the video LLM with full-FPS perception and real-time cognition response. Experiments on Ego4D and SoccerNet streaming tasks, as well as standard offline benchmarks, demonstrate state-of-the-art performance in both model capability and real-time efficiency, paving the way for ultra-high-FPS applications, such as Game AI and interactive media. The code and data is available at https://aka.ms/StreamMind.

CVAug 25, 2025
VQualA 2025 Challenge on Face Image Quality Assessment: Methods and Results

Sizhuo Ma, Wei-Ting Chen, Qiang Gao et al.

Face images play a crucial role in numerous applications; however, real-world conditions frequently introduce degradations such as noise, blur, and compression artifacts, affecting overall image quality and hindering subsequent tasks. To address this challenge, we organized the VQualA 2025 Challenge on Face Image Quality Assessment (FIQA) as part of the ICCV 2025 Workshops. Participants created lightweight and efficient models (limited to 0.5 GFLOPs and 5 million parameters) for the prediction of Mean Opinion Scores (MOS) on face images with arbitrary resolutions and realistic degradations. Submissions underwent comprehensive evaluations through correlation metrics on a dataset of in-the-wild face images. This challenge attracted 127 participants, with 1519 final submissions. This report summarizes the methodologies and findings for advancing the development of practical FIQA approaches.

AIOct 25, 2024
The Potential and Value of AI Chatbot in Personalized Cognitive Training

Zilong Wang, Nan Chen, Luna K. Qiu et al.

In recent years, the rapid aging of the global population has led to an increase in cognitive disorders, such as Alzheimer's disease, presenting significant public health challenges. Although no effective treatments currently exist to reverse Alzheimer's, prevention and early intervention, including cognitive training, are critical. This report explores the potential of AI chatbots in enhancing personalized cognitive training. We introduce ReMe, a web-based framework designed to create AI chatbots that facilitate cognitive training research, specifically targeting episodic memory tasks derived from personal life logs. By leveraging large language models, ReMe provides enhanced user-friendly, interactive, and personalized training experiences. Case studies demonstrate ReMe's effectiveness in engaging users through life recall and open-ended language puzzles, highlighting its potential to improve cognitive training design. Despite promising results, further research is needed to validate training effectiveness through large-scale studies that include cognitive ability evaluations. Overall, ReMe offers a promising approach to personalized cognitive training, utilizing AI capabilities to meet the growing demand for non-pharmacological interventions in cognitive health, with future research aiming to expand its applications and efficacy.

LGFeb 8, 2024
Anatomizing Deep Learning Inference in Web Browsers

Qipeng Wang, Shiqi Jiang, Zhenpeng Chen et al.

Web applications have increasingly adopted Deep Learning (DL) through in-browser inference, wherein DL inference performs directly within Web browsers. The actual performance of in-browser inference and its impacts on the quality of experience (QoE) remain unexplored, and urgently require new QoE measurements beyond traditional ones, e.g., mainly focusing on page load time. To bridge this gap, we make the first comprehensive performance measurement of in-browser inference to date. Our approach proposes new metrics to measure in-browser inference: responsiveness, smoothness, and inference accuracy. Our extensive analysis involves 9 representative DL models across Web browsers of 50 popular PC devices and 20 mobile devices. The results reveal that in-browser inference exhibits a substantial latency gap, averaging 16.9 times slower on CPU and 4.9 times slower on GPU compared to native inference on PC devices. The gap on mobile CPU and mobile GPU is 15.8 times and 7.8 times, respectively. Furthermore, we identify contributing factors to such latency gap, including underutilized hardware instruction sets, inherent overhead in the runtime environment, resource contention within the browser, and inefficiencies in software libraries and GPU abstractions. Additionally, in-browser inference imposes significant memory demands, at times exceeding 334.6 times the size of the DL models themselves, partly attributable to suboptimal memory management. We also observe that in-browser inference leads to a significant 67.2% increase in the time it takes for GUI components to render within Web browsers, significantly affecting the overall user QoE of Web applications reliant on this technology

LGApr 11, 2025
Scaling Up On-Device LLMs via Active-Weight Swapping Between DRAM and Flash

Fucheng Jia, Zewen Wu, Shiqi Jiang et al. · microsoft-research

Large language models (LLMs) are increasingly being deployed on mobile devices, but the limited DRAM capacity constrains the deployable model size. This paper introduces ActiveFlow, the first LLM inference framework that can achieve adaptive DRAM usage for modern LLMs (not ReLU-based), enabling the scaling up of deployable model sizes. The framework is based on the novel concept of active weight DRAM-flash swapping and incorporates three novel techniques: (1) Cross-layer active weights preloading. It uses the activations from the current layer to predict the active weights of several subsequent layers, enabling computation and data loading to overlap, as well as facilitating large I/O transfers. (2) Sparsity-aware self-distillation. It adjusts the active weights to align with the dense-model output distribution, compensating for approximations introduced by contextual sparsity. (3) Active weight DRAM-flash swapping pipeline. It orchestrates the DRAM space allocation among the hot weight cache, preloaded active weights, and computation-involved weights based on available memory. Results show ActiveFlow achieves the performance-cost Pareto frontier compared to existing efficiency optimization methods.

CVMar 12, 2024
AACP: Aesthetics assessment of children's paintings based on self-supervised learning

Shiqi Jiang, Ning Li, Chen Shi et al.

The Aesthetics Assessment of Children's Paintings (AACP) is an important branch of the image aesthetics assessment (IAA), playing a significant role in children's education. This task presents unique challenges, such as limited available data and the requirement for evaluation metrics from multiple perspectives. However, previous approaches have relied on training large datasets and subsequently providing an aesthetics score to the image, which is not applicable to AACP. To solve this problem, we construct an aesthetics assessment dataset of children's paintings and a model based on self-supervised learning. 1) We build a novel dataset composed of two parts: the first part contains more than 20k unlabeled images of children's paintings; the second part contains 1.2k images of children's paintings, and each image contains eight attributes labeled by multiple design experts. 2) We design a pipeline that includes a feature extraction module, perception modules and a disentangled evaluation module. 3) We conduct both qualitative and quantitative experiments to compare our model's performance with five other methods using the AACP dataset. Our experiments reveal that our method can accurately capture aesthetic features and achieve state-of-the-art performance.

CVJul 12, 2025
PPJudge: Towards Human-Aligned Assessment of Artistic Painting Process

Shiqi Jiang, Xinpeng Li, Xi Mao et al.

Artistic image assessment has become a prominent research area in computer vision. In recent years, the field has witnessed a proliferation of datasets and methods designed to evaluate the aesthetic quality of paintings. However, most existing approaches focus solely on static final images, overlooking the dynamic and multi-stage nature of the artistic painting process. To address this gap, we propose a novel framework for human-aligned assessment of painting processes. Specifically, we introduce the Painting Process Assessment Dataset (PPAD), the first large-scale dataset comprising real and synthetic painting process images, annotated by domain experts across eight detailed attributes. Furthermore, we present PPJudge (Painting Process Judge), a Transformer-based model enhanced with temporally-aware positional encoding and a heterogeneous mixture-of-experts architecture, enabling effective assessment of the painting process. Experimental results demonstrate that our method outperforms existing baselines in accuracy, robustness, and alignment with human judgment, offering new insights into computational creativity and art education.

CVApr 30, 2025
Zoomer: Adaptive Image Focus Optimization for Black-box MLLM

Jiaxu Qian, Chendong Wang, Yifan Yang et al. · microsoft-research

Recent advancements in multimodal large language models (MLLMs) have broadened the scope of vision-language tasks, excelling in applications like image captioning and interactive question-answering. However, these models struggle with accurately processing visual data, particularly in tasks requiring precise object recognition and fine visual details. Stringent token limits often result in the omission of critical information, hampering performance. To address these limitations, we introduce \SysName, a novel visual prompting mechanism designed to enhance MLLM performance while preserving essential visual details within token limits. \SysName features three key innovations: a prompt-aware strategy that dynamically highlights relevant image regions, a spatial-preserving orchestration schema that maintains object integrity, and a budget-aware prompting method that balances global context with crucial visual details. Comprehensive evaluations across multiple datasets demonstrate that \SysName consistently outperforms baseline methods, achieving up to a $26.9\%$ improvement in accuracy while significantly reducing token consumption.

CVApr 18, 2024
TextCenGen: Attention-Guided Text-Centric Background Adaptation for Text-to-Image Generation

Tianyi Liang, Jiangqi Liu, Yifei Huang et al.

Text-to-image (T2I) generation has made remarkable progress in producing high-quality images, but a fundamental challenge remains: creating backgrounds that naturally accommodate text placement without compromising image quality. This capability is non-trivial for real-world applications like graphic design, where clear visual hierarchy between content and text is essential. Prior work has primarily focused on arranging layouts within existing static images, leaving unexplored the potential of T2I models for generating text-friendly backgrounds. We present TextCenGen, a training-free dynamic background adaptation in the blank region for text-friendly image generation. Instead of directly reducing attention in text areas, which degrades image quality, we relocate conflicting objects before background optimization. Our method analyzes cross-attention maps to identify conflicting objects overlapping with text regions and uses a force-directed graph approach to guide their relocation, followed by attention excluding constraints to ensure smooth backgrounds. Our method is plug-and-play, requiring no additional training while well balancing both semantic fidelity and visual quality. Evaluated on our proposed text-friendly T2I benchmark of 27,000 images across four seed datasets, TextCenGen outperforms existing methods by achieving 23% lower saliency overlap in text regions while maintaining 98% of the semantic fidelity measured by CLIP score and our proposed Visual-Textual Concordance Metric (VTCM).

LGOct 17, 2025
Expert Merging in Sparse Mixture of Experts with Nash Bargaining

Dung V. Nguyen, Anh T. Nguyen, Minh H. Nguyen et al.

Existing expert merging strategies for Sparse Mixture of Experts (SMoE) typically rely on input-dependent or input-independent averaging of expert parameters, but often lack a principled weighting mechanism. In this work, we reinterpret expert merging through the lens of game theory, revealing cooperative and competitive dynamics among experts. Based on this perspective, we introduce Nash Merging of Experts (NAMEx), a novel framework that incorporates Nash Bargaining into the merging process, enabling more balanced and efficient collaboration among experts. Additionally, we incorporate complex momentum into NAMEx to accelerate expert propagation with theoretical guarantees for convergence. Extensive experiments across language modelling, text classification, image classification, and zero-shot robustness under data corruption show that NAMEx consistently outperforms competing methods while integrating seamlessly with popular MoE architectures. Finally, we demonstrate NAMEx's scalability by applying it to large-scale systems, including Qwen1.5-MoE (14B) and DeepSeek-MoE (16B), where it proves effective in both zero-shot and fine-tuning settings.

CVOct 5, 2025
Video-in-the-Loop: Span-Grounded Long Video QA with Interleaved Reasoning

Chendong Wang, Donglin Bai, Yifan Yang et al.

We present \emph{Video-in-the-Loop} (ViTL), a two-stage long-video QA framework that preserves a fixed token budget by first \emph{localizing} question-relevant interval(s) with a low-fps skim and then \emph{answering} via span-aware reallocation of visual tokens at higher effective frame rate, emitting an interleaved output with both spans and the final option for direct attribution. We also introduce \dataname{}, which converts description based event graphs into \emph{span-grounded} multiple-choice QA by pairing each question with \emph{ground-truth} time span(s) and related reasoning. ViTL is trained end-to-end with an interleaved group-relative objective that couples temporal IoU for localization with answer correctness, allowing credit to flow from answers back to spans without increasing compute. Under fixed token budgets, ViTL attains up to 8.6% with 50% less frame input on long-video QA and temporal grounding (e.g., Charades-STA, ActivityNet-Captions) and ablations show that span-aware token reallocation consistently surpasses uniform sampling. Together, \dataname{} and ViTL provide an interpretable, compute-efficient recipe for scalable long-video QA.

LGOct 2, 2025
Diffusion^2: Turning 3D Environments into Radio Frequency Heatmaps

Kyoungjun Park, Yifan Yang, Changhan Ge et al.

Modeling radio frequency (RF) signal propagation is essential for understanding the environment, as RF signals offer valuable insights beyond the capabilities of RGB cameras, which are limited by the visible-light spectrum, lens coverage, and occlusions. It is also useful for supporting wireless diagnosis, deployment, and optimization. However, accurately predicting RF signals in complex environments remains a challenge due to interactions with obstacles such as absorption and reflection. We introduce Diffusion^2, a diffusion-based approach that uses 3D point clouds to model the propagation of RF signals across a wide range of frequencies, from Wi-Fi to millimeter waves. To effectively capture RF-related features from 3D data, we present the RF-3D Encoder, which encapsulates the complexities of 3D geometry along with signal-specific details. These features undergo multi-scale embedding to simulate the actual RF signal dissemination process. Our evaluation, based on synthetic and real-world measurements, demonstrates that Diffusion^2 accurately estimates the behavior of RF signals in various frequency bands and environmental conditions, with an error margin of just 1.9 dB and 27x faster than existing methods, marking a significant advancement in the field. Refer to https://rfvision-project.github.io/ for more information.

DCSep 27, 2025
Scaling LLM Test-Time Compute with Mobile NPU on Smartphones

Zixu Hao, Jianyu Wei, Tuowei Wang et al. · microsoft-research

Deploying Large Language Models (LLMs) on mobile devices faces the challenge of insufficient performance in smaller models and excessive resource consumption in larger ones. This paper highlights that mobile Neural Processing Units (NPUs) have underutilized computational resources, particularly their matrix multiplication units, during typical LLM inference. To leverage this wasted compute capacity, we propose applying parallel test-time scaling techniques on mobile NPUs to enhance the performance of smaller LLMs. However, this approach confronts inherent NPU challenges, including inadequate hardware support for fine-grained quantization and low efficiency in general-purpose computations. To overcome these, we introduce two key techniques: a hardware-aware tile quantization scheme that aligns group quantization with NPU memory access patterns, and efficient LUT-based replacements for complex operations such as Softmax and dequantization. We design and implement an end-to-end inference system that leverages the NPU's compute capability to support test-time scaling on Qualcomm Snapdragon platforms. Experiments show our approach brings significant speedups: up to 19.0 for mixed-precision GEMM and 2.2 for Softmax. More importantly, we demonstrate that smaller models using test-time scaling can match or exceed the accuracy of larger models, achieving a new performance-cost Pareto frontier.

AISep 26, 2025
ProRe: A Proactive Reward System for GUI Agents via Reasoner-Actor Collaboration

Gaole Dai, Shiqi Jiang, Ting Cao et al.

Reward is critical to the evaluation and training of large language models (LLMs). However, existing rule-based or model-based reward methods struggle to generalize to GUI agents, where access to ground-truth trajectories or application databases is often unavailable, and static trajectory-based LLM-as-a-Judge approaches suffer from limited accuracy. To address these challenges, we propose ProRe, a proactive reward system that leverages a general-purpose reasoner and domain-specific evaluator agents (actors). The reasoner schedules targeted state probing tasks, which the evaluator agents then execute by actively interacting with the environment to collect additional observations. This enables the reasoner to assign more accurate and verifiable rewards to GUI agents. Empirical results on over 3K trajectories demonstrate that ProRe improves reward accuracy and F1 score by up to 5.3% and 19.4%, respectively. Furthermore, integrating ProRe with state-of-the-art policy agents yields a success rate improvement of up to 22.4%.

CVJul 23, 2025
STQE: Spatial-Temporal Attribute Quality Enhancement for G-PCC Compressed Dynamic Point Clouds

Tian Guo, Hui Yuan, Xiaolong Mao et al.

Very few studies have addressed quality enhancement for compressed dynamic point clouds. In particular, the effective exploitation of spatial-temporal correlations between point cloud frames remains largely unexplored. Addressing this gap, we propose a spatial-temporal attribute quality enhancement (STQE) network that exploits both spatial and temporal correlations to improve the visual quality of G-PCC compressed dynamic point clouds. Our contributions include a recoloring-based motion compensation module that remaps reference attribute information to the current frame geometry to achieve precise inter-frame geometric alignment, a channel-aware temporal attention module that dynamically highlights relevant regions across bidirectional reference frames, a Gaussian-guided neighborhood feature aggregation module that efficiently captures spatial dependencies between geometry and color attributes, and a joint loss function based on the Pearson correlation coefficient, designed to alleviate over-smoothing effects typical of point-wise mean squared error optimization. When applied to the latest G-PCC test model, STQE achieved improvements of 0.855 dB, 0.682 dB, and 0.828 dB in delta PSNR, with Bjøntegaard Delta rate (BD-rate) reductions of -25.2%, -31.6%, and -32.5% for the Luma, Cb, and Cr components, respectively.

SEJul 1, 2025
iPanda: An LLM-based Agent for Automated Conformance Testing of Communication Protocols

Xikai Sun, Fan Dang, Shiqi Jiang et al.

Conformance testing is essential for ensuring that protocol implementations comply with their specifications. However, traditional testing approaches involve manually creating numerous test cases and scripts, making the process labor-intensive and inefficient. Recently, Large Language Models (LLMs) have demonstrated impressive text comprehension and code generation abilities, providing promising opportunities for automation. In this paper, we propose iPanda, the first framework that leverages LLMs to automate protocol conformance testing. Given a protocol specification document and its implementation, iPanda first employs a keyword-based method to automatically generate comprehensive test cases. Then, it utilizes retrieval-augmented generation and customized CoT strategy to effectively interpret the implementation and produce executable test programs. To further enhance programs' quality, iPanda incorporates an iterative optimization mechanism to refine generated test scripts interactively. Finally, by executing and analyzing the generated tests, iPanda systematically verifies compliance between implementations and protocol specifications. Comprehensive experiments on various protocols show that iPanda significantly outperforms pure LLM-based approaches, improving the success rate (Pass@1) of test-program generation by factors ranging from 4.675 times to 10.751 times.

IVFeb 21, 2025
FD-LSCIC: Frequency Decomposition-based Learned Screen Content Image Compression

Shiqi Jiang, Hui Yuan, Shuai Li et al.

The learned image compression (LIC) methods have already surpassed traditional techniques in compressing natural scene (NS) images. However, directly applying these methods to screen content (SC) images, which possess distinct characteristics such as sharp edges, repetitive patterns, embedded text and graphics, yields suboptimal results. This paper addresses three key challenges in SC image compression: learning compact latent features, adapting quantization step sizes, and the lack of large SC datasets. To overcome these challenges, we propose a novel compression method that employs a multi-frequency two-stage octave residual block (MToRB) for feature extraction, a cascaded triple-scale feature fusion residual block (CTSFRB) for multi-scale feature integration and a multi-frequency context interaction module (MFCIM) to reduce inter-frequency correlations. Additionally, we introduce an adaptive quantization module that learns scaled uniform noise for each frequency component, enabling flexible control over quantization granularity. Furthermore, we construct a large SC image compression dataset (SDU-SCICD10K), which includes over 10,000 images spanning basic SC images, computer-rendered images, and mixed NS and SC images from both PC and mobile platforms. Experimental results demonstrate that our approach significantly improves SC image compression performance, outperforming traditional standards and state-of-the-art learning-based methods in terms of peak signal-to-noise ratio (PSNR) and multi-scale structural similarity (MS-SSIM).

IVFeb 21, 2025
Interleaved Block-based Learned Image Compression with Feature Enhancement and Quantization Error Compensation

Shiqi Jiang, Hui Yuan, Shuai Li et al.

In recent years, learned image compression (LIC) methods have achieved significant performance improvements. However, obtaining a more compact latent representation and reducing the impact of quantization errors remain key challenges in the field of LIC. To address these challenges, we propose a feature extraction module, a feature refinement module, and a feature enhancement module. Our feature extraction module shuffles the pixels in the image, splits the resulting image into sub-images, and extracts coarse features from the sub-images. Our feature refinement module stacks the coarse features and uses an attention refinement block composed of concatenated three-dimensional convolution residual blocks to learn more compact latent features by exploiting correlations across channels, within sub-images (intra-sub-image correlations), and across sub-images (inter-sub-image correlations). Our feature enhancement module reduces information loss in the decoded features following quantization. We also propose a quantization error compensation module that mitigates the quantization mismatch between training and testing. Our four modules can be readily integrated into state-of-the-art LIC methods. Experiments show that combining our modules with Tiny-LIC outperforms existing LIC methods and image compression standards in terms of peak signal-to-noise ratio (PSNR) and multi-scale structural similarity (MS-SSIM) on the Kodak dataset and the CLIC dataset.

CVOct 19, 2024
Making Every Frame Matter: Continuous Activity Recognition in Streaming Video via Adaptive Video Context Modeling

Hao Wu, Donglin Bai, Shiqi Jiang et al.

Video activity recognition has become increasingly important in robots and embodied AI. Recognizing continuous video activities poses considerable challenges due to the fast expansion of streaming video, which contains multi-scale and untrimmed activities. We introduce a novel system, CARS, to overcome these issues through adaptive video context modeling. Adaptive video context modeling refers to selectively maintaining activity-related features in temporal and spatial dimensions. CARS has two key designs. The first is an activity spatial feature extraction by eliminating irrelevant visual features while maintaining recognition accuracy. The second is an activity-aware state update introducing dynamic adaptability to better preserve the video context for multi-scale activity recognition. Our CARS runs at speeds $>$30 FPS on typical edge devices and outperforms all baselines by 1.2\% to 79.7\% in accuracy. Moreover, we explore applying CARS to a large video model as a video encoder. Experimental results show that our CARS can result in a 0.46-point enhancement (on a 5-point scale) on the in-distribution video activity dataset, and an improvement ranging from 1.19\% to 4\% on zero-shot video activity datasets.

CVDec 3, 2021
Multi-modal application: Image Memes Generation

Zhiyuan Liu, Chuanzheng Sun, Yuxin Jiang et al.

Meme is an interesting word. Internet memes offer unique insights into the changes in our perception of the world, the media and our own lives. If you surf the Internet for long enough, you will see it somewhere on the Internet. With the rise of social media platforms and convenient image dissemination, Image Meme has gained fame. Image memes have become a kind of pop culture and they play an important role in communication over social media, blogs, and open messages. With the development of artificial intelligence and the widespread use of deep learning, Natural Language Processing (NLP) and Computer Vision (CV) can also be used to solve more problems in life, including meme generation. An Internet meme commonly takes the form of an image and is created by combining a meme template (image) and a caption (natural language sentence). In our project, we propose an end-to-end encoder-decoder architecture meme generator. For a given input sentence, we use the Meme template selection model to determine the emotion it expresses and select the image template. Then generate captions and memes through to the meme caption generator. Code and models are available at github