CVJul 17, 2023
Extreme Image Compression using Fine-tuned VQGANsQi Mao, Tinghan Yang, Yinuo Zhang et al.
Recent advances in generative compression methods have demonstrated remarkable progress in enhancing the perceptual quality of compressed data, especially in scenarios with low bitrates. However, their efficacy and applicability to achieve extreme compression ratios ($<0.05$ bpp) remain constrained. In this work, we propose a simple yet effective coding framework by introducing vector quantization (VQ)--based generative models into the image compression domain. The main insight is that the codebook learned by the VQGAN model yields a strong expressive capacity, facilitating efficient compression of continuous information in the latent space while maintaining reconstruction quality. Specifically, an image can be represented as VQ-indices by finding the nearest codeword, which can be encoded using lossless compression methods into bitstreams. We propose clustering a pre-trained large-scale codebook into smaller codebooks through the K-means algorithm, yielding variable bitrates and different levels of reconstruction quality within the coding framework. Furthermore, we introduce a transformer to predict lost indices and restore images in unstable environments. Extensive qualitative and quantitative experiments on various benchmark datasets demonstrate that the proposed framework outperforms state-of-the-art codecs in terms of perceptual quality-oriented metrics and human perception at extremely low bitrates ($\le 0.04$ bpp). Remarkably, even with the loss of up to $20\%$ of indices, the images can be effectively restored with minimal perceptual loss.
CLFeb 17, 2024Code
CliqueParcel: An Approach For Batching LLM Prompts That Jointly Optimizes Efficiency And FaithfulnessJiayi Liu, Tinghan Yang, Jennifer Neville
Large language models (LLMs) have become pivotal in recent research. However, during the inference process, LLMs still require substantial resources. In this paper, we propose CliqueParcel, a method designed to improve the efficiency of LLMs via prompt batching. Existing strategies to optimize inference efficiency often compromise on output quality, leading to a discounted output problem. This issue might result in reduced accuracy or outputs that are less detailed. CliqueParcel is our answer to this challenge. While ensuring accuracy and minimizing deviations from the original outputs (i.e., faithfulness), our method significantly improves efficiency during inference. To lay the groundwork, we first redefine efficiency measurements by excluding the reduction in running time due to shorter lengths. Then, we provide a comprehensive trade-off between efficiency and faithfulness to clarify the nature of the 'discounted output' problem. Within the CliqueParcel framework, we suggest multiple batching sub-methods and discuss the specific scenarios in which they can be applied. During evaluation, CliqueParcel is tested on eight widely recognized datasets, which can be classified into three types: reading comprehension, open-source question-answering, and reasoning. Our experiments explore the performance of CliqueParcel, including efficiency, faithfulness, and the trade-off between them. This work provides novel insights into inference efficiency and demonstrates promising performance.
CVDec 4, 2025
Generative Neural Video Compression via Video Diffusion PriorQi Mao, Hao Cheng, Tinghan Yang et al.
We present GNVC-VD, the first DiT-based generative neural video compression framework built upon an advanced video generation foundation model, where spatio-temporal latent compression and sequence-level generative refinement are unified within a single codec. Existing perceptual codecs primarily rely on pre-trained image generative priors to restore high-frequency details, but their frame-wise nature lacks temporal modeling and inevitably leads to perceptual flickering. To address this, GNVC-VD introduces a unified flow-matching latent refinement module that leverages a video diffusion transformer to jointly enhance intra- and inter-frame latents through sequence-level denoising, ensuring consistent spatio-temporal details. Instead of denoising from pure Gaussian noise as in video generation, GNVC-VD initializes refinement from decoded spatio-temporal latents and learns a correction term that adapts the diffusion prior to compression-induced degradation. A conditioning adaptor further injects compression-aware cues into intermediate DiT layers, enabling effective artifact removal while maintaining temporal coherence under extreme bitrate constraints. Extensive experiments show that GNVC-VD surpasses both traditional and learned codecs in perceptual quality and significantly reduces the flickering artifacts that persist in prior generative approaches, even below 0.01 bpp, highlighting the promise of integrating video-native generative priors into neural codecs for next-generation perceptual video compression.
CVAug 19, 2025Code
CLIPSym: Delving into Symmetry Detection with CLIPTinghan Yang, Md Ashiqur Rahman, Raymond A. Yeh
Symmetry is one of the most fundamental geometric cues in computer vision, and detecting it has been an ongoing challenge. With the recent advances in vision-language models,~i.e., CLIP, we investigate whether a pre-trained CLIP model can aid symmetry detection by leveraging the additional symmetry cues found in the natural image descriptions. We propose CLIPSym, which leverages CLIP's image and language encoders and a rotation-equivariant decoder based on a hybrid of Transformer and $G$-Convolution to detect rotation and reflection symmetries. To fully utilize CLIP's language encoder, we have developed a novel prompting technique called Semantic-Aware Prompt Grouping (SAPG), which aggregates a diverse set of frequent object-based prompts to better integrate the semantic cues for symmetry detection. Empirically, we show that CLIPSym outperforms the current state-of-the-art on three standard symmetry detection datasets (DENDI, SDRW, and LDRS). Finally, we conduct detailed ablations verifying the benefits of CLIP's pre-training, the proposed equivariant decoder, and the SAPG technique. The code is available at https://github.com/timyoung2333/CLIPSym.
AISep 26, 2025
UniMIC: Token-Based Multimodal Interactive Coding for Human-AI CollaborationQi Mao, Tinghan Yang, Jiahao Li et al.
The rapid progress of Large Multimodal Models (LMMs) and cloud-based AI agents is transforming human-AI collaboration into bidirectional, multimodal interaction. However, existing codecs remain optimized for unimodal, one-way communication, resulting in repeated degradation under conventional compress-transmit-reconstruct pipelines. To address this limitation, we propose UniMIC, a Unified token-based Multimodal Interactive Coding framework that bridges edge devices and cloud AI agents. Instead of transmitting raw pixels or plain text, UniMIC employs compact tokenized representations as the communication medium, enabling efficient low-bitrate transmission while maintaining compatibility with LMMs. To further enhance compression, lightweight Transformer-based entropy models with scenario-specific designs-generic, masked, and text-conditioned-effectively minimize inter-token redundancy. Extensive experiments on text-to-image generation, text-guided inpainting, outpainting, and visual question answering show that UniMIC achieves substantial bitrate savings and remains robust even at ultra-low bitrates (<0.05bpp), without compromising downstream task performance. These results establish UniMIC as a practical and forward-looking paradigm for next-generation multimodal interactive communication.
CRMar 10, 2021
S3: Side-Channel Attack on Stylus Pencil through SensorsHabiba Farrukh, Tinghan Yang, Hanwen Xu et al.
With smart devices being an essential part of our everyday lives, unsupervised access to the mobile sensors' data can result in a multitude of side-channel attacks. In this paper, we study potential data leaks from Apple Pencil (2nd generation) supported by the Apple iPad Pro, the latest stylus pen which attaches to the iPad body magnetically for charging. We observe that the Pencil's body affects the magnetic readings sensed by the iPad's magnetometer when a user is using the Pencil. Therefore, we ask: Can we infer what a user is writing on the iPad screen with the Apple Pencil, given access to only the iPad's motion sensors' data? To answer this question, we present Side-channel attack on Stylus pencil through Sensors (S3), a system that identifies what a user is writing from motion sensor readings. We first use the sharp fluctuations in the motion sensors' data to determine when a user is writing on the iPad. We then introduce a high-dimensional particle filter to track the location and orientation of the Pencil during usage. Lastly, to guide particles, we build the Pencil's magnetic map serving as a bridge between the measured magnetic data and the Pencil location and orientation. We evaluate S3 with 10 subjects and demonstrate that we correctly identify 93.9%, 96%, 97.9%, and 93.33% of the letters, numbers, shapes, and words by only having access to the motion sensors' data.