Yu Ning

h-index25

8papers

192citations

Novelty57%

AI Score50

Ranked #20,868 of 194,257 authors (top 11%)#7,540 in CV (top 13%)

8 Papers

7.6CVFeb 1, 2023Code

Learning Prototype Classifiers for Long-Tailed Recognition

Saurabh Sharma, Yongqin Xian, Ning Yu et al.

The problem of long-tailed recognition (LTR) has received attention in recent years due to the fundamental power-law distribution of objects in the real-world. Most recent works in LTR use softmax classifiers that are biased in that they correlate classifier norm with the amount of training data for a given class. In this work, we show that learning prototype classifiers addresses the biased softmax problem in LTR. Prototype classifiers can deliver promising results simply using Nearest-Class- Mean (NCM), a special case where prototypes are empirical centroids. We go one step further and propose to jointly learn prototypes by using distances to prototypes in representation space as the logit scores for classification. Further, we theoretically analyze the properties of Euclidean distance based prototype classifiers that lead to stable gradient-based optimization which is robust to outliers. To enable independent distance scales along each channel, we enhance Prototype classifiers by learning channel-dependent temperature parameters. Our analysis shows that prototypes learned by Prototype classifiers are better separated than empirical centroids. Results on four LTR benchmarks show that Prototype classifier outperforms or is comparable to state-of-the-art methods. Our code is made available at https://github.com/saurabhsharma1993/prototype-classifier-ltr.

31.8CROct 3, 2022

Membership Inference Attacks Against Text-to-image Generation Models

Yixin Wu, Ning Yu, Zheng Li et al.

Text-to-image generation models have recently attracted unprecedented attention as they unlatch imaginative applications in all areas of life. However, developing such models requires huge amounts of data that might contain privacy-sensitive information, e.g., face identity. While privacy risks have been extensively demonstrated in the image classification and GAN generation domains, privacy risks in the text-to-image generation domain are largely unexplored. In this paper, we perform the first privacy analysis of text-to-image generation models through the lens of membership inference. Specifically, we propose three key intuitions about membership information and design four attack methodologies accordingly. We conduct comprehensive evaluations on two mainstream text-to-image generation models including sequence-to-sequence modeling and diffusion-based modeling. The empirical results show that all of the proposed attacks can achieve significant performance, in some cases even close to an accuracy of 1, and thus the corresponding risk is much more severe than that shown by existing membership inference attacks. We further conduct an extensive ablation study to analyze the factors that may affect the attack performance, which can guide developers and researchers to be alert to vulnerabilities in text-to-image generation models. All these findings indicate that our proposed attacks pose a realistic privacy threat to the text-to-image generation models.

25.4CRAug 23, 2022

Auditing Membership Leakages of Multi-Exit Networks

Zheng Li, Yiyong Liu, Xinlei He et al.

Relying on the fact that not all inputs require the same amount of computation to yield a confident prediction, multi-exit networks are gaining attention as a prominent approach for pushing the limits of efficient deployment. Multi-exit networks endow a backbone model with early exits, allowing to obtain predictions at intermediate layers of the model and thus save computation time and/or energy. However, current various designs of multi-exit networks are only considered to achieve the best trade-off between resource usage efficiency and prediction accuracy, the privacy risks stemming from them have never been explored. This prompts the need for a comprehensive investigation of privacy risks in multi-exit networks. In this paper, we perform the first privacy analysis of multi-exit networks through the lens of membership leakages. In particular, we first leverage the existing attack methodologies to quantify the multi-exit networks' vulnerability to membership leakages. Our experimental results show that multi-exit networks are less vulnerable to membership leakages and the exit (number and depth) attached to the backbone model is highly correlated with the attack performance. Furthermore, we propose a hybrid attack that exploits the exit information to improve the performance of existing attacks. We evaluate membership leakage threat caused by our hybrid attack under three different adversarial setups, ultimately arriving at a model-free and data-free adversary. These results clearly demonstrate that our hybrid attacks are very broadly applicable, thereby the corresponding risks are much more severe than shown by existing membership inference attacks. We further present a defense mechanism called TimeGuard specifically for multi-exit networks and show that TimeGuard mitigates the newly proposed attacks perfectly.

19.0CVApr 9, 2025Code

FlashDepth: Real-time Streaming Video Depth Estimation at 2K Resolution

Gene Chou, Wenqi Xian, Guandao Yang et al. · deepmind

A versatile video depth estimation model should (1) be accurate and consistent across frames, (2) produce high-resolution depth maps, and (3) support real-time streaming. We propose FlashDepth, a method that satisfies all three requirements, performing depth estimation on a 2044x1148 streaming video at 24 FPS. We show that, with careful modifications to pretrained single-image depth models, these capabilities are enabled with relatively little data and training. We evaluate our approach across multiple unseen datasets against state-of-the-art depth models, and find that ours outperforms them in terms of boundary sharpness and speed by a significant margin, while maintaining competitive accuracy. We hope our model will enable various applications that require high-resolution depth, such as video editing, and online decision-making, such as robotics. We release all code and model weights at https://github.com/Eyeline-Research/FlashDepth

17.4CVAug 21, 2025Code

CineScale: Free Lunch in High-Resolution Cinematic Visual Generation

Haonan Qiu, Ning Yu, Ziqi Huang et al.

Visual diffusion models achieve remarkable progress, yet they are typically trained at limited resolutions due to the lack of high-resolution data and constrained computation resources, hampering their ability to generate high-fidelity images or videos at higher resolutions. Recent efforts have explored tuning-free strategies to exhibit the untapped potential higher-resolution visual generation of pre-trained models. However, these methods are still prone to producing low-quality visual content with repetitive patterns. The key obstacle lies in the inevitable increase in high-frequency information when the model generates visual content exceeding its training resolution, leading to undesirable repetitive patterns deriving from the accumulated errors. In this work, we propose CineScale, a novel inference paradigm to enable higher-resolution visual generation. To tackle the various issues introduced by the two types of video generation architectures, we propose dedicated variants tailored to each. Unlike existing baseline methods that are confined to high-resolution T2I and T2V generation, CineScale broadens the scope by enabling high-resolution I2V and V2V synthesis, built atop state-of-the-art open-source video generation frameworks. Extensive experiments validate the superiority of our paradigm in extending the capabilities of higher-resolution visual generation for both image and video models. Remarkably, our approach enables 8k image generation without any fine-tuning, and achieves 4k video generation with only minimal LoRA fine-tuning. Generated video samples are available at our website: https://eyeline-labs.github.io/CineScale/.

5.0CVDec 15, 2023Code

Continual Adversarial Defense

Qian Wang, Hefei Ling, Yingwei Li et al.

In response to the rapidly evolving nature of adversarial attacks against visual classifiers, numerous defenses have been proposed to generalize against as many known attacks as possible. However, designing a defense method that generalizes to all types of attacks is unrealistic, as the environment in which the defense system operates is dynamic. Over time, new attacks inevitably emerge that exploit the vulnerabilities of existing defenses and bypass them. Therefore, we propose a continual defense strategy under a practical threat model and, for the first time, introduce the Continual Adversarial Defense (CAD) framework. CAD continuously collects adversarial data online and adapts to evolving attack sequences, while adhering to four practical principles: (1) continual adaptation to new attacks without catastrophic forgetting, (2) few-shot adaptation, (3) memory-efficient adaptation, and (4) high classification accuracy on both clean and adversarial data. We explore and integrate cutting-edge techniques from continual learning, few-shot learning, and ensemble learning to fulfill the principles. Extensive experiments validate the effectiveness of our approach against multi-stage adversarial attacks and demonstrate significant improvements over a wide range of baseline methods. We further observe that CAD's defense performance tends to saturate as the number of attacks increases, indicating its potential as a persistent defense once adapted to a sufficiently diverse set of attacks. Our research sheds light on a brand-new paradigm for continual defense adaptation against dynamic and evolving attacks.

9.6AIOct 17, 2025

AUGUSTUS: An LLM-Driven Multimodal Agent System with Contextualized User Memory

Jitesh Jain, Shubham Maheshwari, Ning Yu et al. · gatech

Riding on the success of LLMs with retrieval-augmented generation (RAG), there has been a growing interest in augmenting agent systems with external memory databases. However, the existing systems focus on storing text information in their memory, ignoring the importance of multimodal signals. Motivated by the multimodal nature of human memory, we present AUGUSTUS, a multimodal agent system aligned with the ideas of human memory in cognitive science. Technically, our system consists of 4 stages connected in a loop: (i) encode: understanding the inputs; (ii) store in memory: saving important information; (iii) retrieve: searching for relevant context from memory; and (iv) act: perform the task. Unlike existing systems that use vector databases, we propose conceptualizing information into semantic tags and associating the tags with their context to store them in a graph-structured multimodal contextual memory for efficient concept-driven retrieval. Our system outperforms the traditional multimodal RAG approach while being 3.5 times faster for ImageNet classification and outperforming MemGPT on the MSC benchmark.

19.7CVAug 11, 2025

MAViS: A Multi-Agent Framework for Long-Sequence Video Storytelling

Qian Wang, Ziqi Huang, Ruoxi Jia et al.

Despite recent advances, long-sequence video generation frameworks still suffer from significant limitations: poor assistive capability, suboptimal visual quality, and limited expressiveness. To mitigate these limitations, we propose MAViS, a multi-agent collaborative framework designed to assist in long-sequence video storytelling by efficiently translating ideas into visual narratives. MAViS orchestrates specialized agents across multiple stages, including script writing, shot designing, character modeling, keyframe generation, video animation, and audio generation. In each stage, agents operate under the 3E Principle -- Explore, Examine, and Enhance -- to ensure the completeness of intermediate outputs. Considering the capability limitations of current generative models, we propose the Script Writing Guidelines to optimize compatibility between scripts and generative tools. Experimental results demonstrate that MAViS achieves state-of-the-art performance in assistive capability, visual quality, and video expressiveness. Its modular framework further enables scalability with diverse generative models and tools. With just a brief idea description, MAViS enables users to rapidly explore diverse visual storytelling and creative directions for sequential video generation by efficiently producing high-quality, complete long-sequence videos. To the best of our knowledge, MAViS is the only framework that provides multimodal design output -- videos with narratives and background music.