CVAug 9, 2023
View while Moving: Efficient Video Recognition in Long-untrimmed VideosYe Tian, Mengyu Yang, Lanshan Zhang et al.
Recent adaptive methods for efficient video recognition mostly follow the two-stage paradigm of "preview-then-recognition" and have achieved great success on multiple video benchmarks. However, this two-stage paradigm involves two visits of raw frames from coarse-grained to fine-grained during inference (cannot be parallelized), and the captured spatiotemporal features cannot be reused in the second stage (due to varying granularity), being not friendly to efficiency and computation optimization. To this end, inspired by human cognition, we propose a novel recognition paradigm of "View while Moving" for efficient long-untrimmed video recognition. In contrast to the two-stage paradigm, our paradigm only needs to access the raw frame once. The two phases of coarse-grained sampling and fine-grained recognition are combined into unified spatiotemporal modeling, showing great performance. Moreover, we investigate the properties of semantic units in video and propose a hierarchical mechanism to efficiently capture and reason about the unit-level and video-level temporal semantics in long-untrimmed videos respectively. Extensive experiments on both long-untrimmed and short-trimmed videos demonstrate that our approach outperforms state-of-the-art methods in terms of accuracy as well as efficiency, yielding new efficiency and accuracy trade-offs for video spatiotemporal modeling.
NADec 10, 2018
Coherence-Based Performance Guarantee of Regularized $\ell_{1}$-Norm Minimization and BeyondWendong Wang, Feng Zhang, Zhi Wang et al.
In this paper, we consider recovering the signal $\bm{x}\in\mathbb{R}^{n}$ from its few noisy measurements $\bm{b}=A\bm{x}+\bm{z}$, where $A\in\mathbb{R}^{m\times n}$ with $m\ll n$ is the measurement matrix, and $\bm{z}\in\mathbb{R}^{m}$ is the measurement noise/error. We first establish a coherence-based performance guarantee for a regularized $\ell_{1}$-norm minimization model to recover such signals $\bm{x}$ in the presence of the $\ell_{2}$-norm bounded noise, i.e., $\|\bm{z}\|_{2}\leqε$, and then extend these theoretical results to guarantee the robust recovery of the signals corrupted with the Dantzig Selector (DS) type noise, i.e., $\|A^{T}\bm{z}\|_{\infty}\leqε$, and the structured block-sparse signal recovery in the presence of the bounded noise. To the best of our knowledge, we first extend nontrivially the sharp uniform recovery condition derived by Cai, Wang and Xu (2010) for the constrained $\ell_{1}$-norm minimization model, which takes the form of \begin{align*} μ<\frac{1}{2k-1}, \end{align*} where $μ$ is defined as the (mutual) coherence of $A$, to two unconstrained regularized $\ell_{1}$-norm minimization models to guarantee the robust recovery of any signals (not necessary to be $k$-sparse) under the $\ell_{2}$-norm bounded noise and the DS type noise settings, respectively. Besides, a uniform recovery condition and its two resulting error estimates are also established for the first time to our knowledge, for the robust block-sparse signal recovery using a regularized mixed $\ell_{2}/\ell_{1}$-norm minimization model, and these results well complement the existing theoretical investigation on this model which focuses on the non-uniform recovery conditions and/or the robust signal recovery in presence of the random noise.
48.3AIMay 7
Large Vision-Language Models Get Lost in AttentionGongli Xi, Ye Tian, Mengyu Yang et al.
Despite the rapid evolution of training paradigms, the decoder backbone of large vision--language models (LVLMs) remains fundamentally rooted in the residual-connection Transformer architecture. Therefore, deciphering the distinct roles of internal modules is critical for understanding model mechanics and guiding architectural optimization. While prior statistical approaches have provided valuable attribution-based insights, they often lack a unified theoretical basis. To bridge this gap, we propose a unified framework grounded in information theory and geometry to quantify the geometric and entropic nature of residual updates. Applying this unified framework reveals a fundamental functional decoupling: Attention acts as a subspace-preserving operator focused on reconfiguration, whereas FFNs serve as subspace-expanding operators driving semantic innovation. Strikingly, further experiments demonstrate that replacing learned attention weights with predefined values (e.g., Gaussian noise) yields comparable or even superior performance across a majority of datasets relative to vanilla models. These results expose severe misallocation and redundancy in current mechanisms, suggesting that state-of-the-art LVLMs effectively ``get lost in attention'' rather than efficiently leveraging visual context.
CVFeb 2
ClueTracer: Question-to-Vision Clue Tracing for Training-Free Hallucination Suppression in Multimodal ReasoningGongli Xi, Kun Wang, Zeming Gao et al.
Large multimodal reasoning models solve challenging visual problems via explicit long-chain inference: they gather visual clues from images and decode clues into textual tokens. Yet this capability also increases hallucinations, where the model generates content that is not supported by the input image or the question. To understand this failure mode, we identify \emph{reasoning drift}: during clue gathering, the model over-focuses on question-irrelevant entities, diluting focus on task-relevant cues and gradually decoupling the reasoning trace from visual grounding. As a consequence, many inference-time localization or intervention methods developed for non-reasoning models fail to pinpoint the true clues in reasoning settings. Motivated by these insights, we introduce ClueRecall, a metric for assessing visual clue retrieval, and present ClueTracer, a training-free, parameter-free, and architecture-agnostic plugin for hallucination suppression. ClueTracer starts from the question and traces how key clues propagate along the model's reasoning pathway (question $\rightarrow$ outputs $\rightarrow$ visual tokens), thereby localizing task-relevant patches while suppressing spurious attention to irrelevant regions. Remarkably, \textbf{without any additional training}, ClueTracer improves all \textbf{reasoning} architectures (including \texttt{R1-OneVision}, \texttt{Ocean-R1}, \texttt{MM-Eureka}, \emph{etc}.) by $\mathbf{1.21\times}$ on reasoning benchmarks. When transferred to \textbf{non-reasoning} settings, it yields a $\mathbf{1.14\times}$ gain.
CVMar 20, 2024
AdaViPro: Region-based Adaptive Visual Prompt for Large-Scale Models AdaptingMengyu Yang, Ye Tian, Lanshan Zhang et al.
Recently, prompt-based methods have emerged as a new alternative `parameter-efficient fine-tuning' paradigm, which only fine-tunes a small number of additional parameters while keeping the original model frozen. However, despite achieving notable results, existing prompt methods mainly focus on `what to add', while overlooking the equally important aspect of `where to add', typically relying on the manually crafted placement. To this end, we propose a region-based Adaptive Visual Prompt, named AdaViPro, which integrates the `where to add' optimization of the prompt into the learning process. Specifically, we reconceptualize the `where to add' optimization as a problem of regional decision-making. During inference, AdaViPro generates a regionalized mask map for the whole image, which is composed of 0 and 1, to designate whether to apply or discard the prompt in each specific area. Therefore, we employ Gumbel-Softmax sampling to enable AdaViPro's end-to-end learning through standard back-propagation. Extensive experiments demonstrate that our AdaViPro yields new efficiency and accuracy trade-offs for adapting pre-trained models.
81.1NIApr 9
LCMP: Distributed Long-Haul Cost-Aware Multi-Path Routing for Inter-Datacenter RDMA NetworksDong-Yang Yu, Yuchao Zhang, Xiaodi Wang et al.
RDMA-empowered cloud services are gradually deployed across datacenters (DCs) with multiple paths, which exhibit new properties of path asymmetry, delayed congestion signals, and simultaneous flow routing collisions, and further fail existing routing methods. We present LCMP, a distributed long-haul cost-aware multi-path routing framework that aims to place RDMA flows on multiple inter-DC paths, achieving low-cost, low-latency, and congestion-responsive transmission. LCMP combines a control-plane path-quality score with compact on-switch congestion signals, where the former unifies quality assessment for asymmetric paths and the latter enables responsive reaction to path congestion. LCMP further resolves the simultaneous flow decision collision problem by filtering high-cost candidates, and performing a diversity-preserving hash inside the reduced set. On an 8-DC testbed, LCMP reduces median and tail FCT slowdown by up to 76% and 64%, respectively compared to state-of-the-art (SOTA) DCN routing strategies. And large-scale NS-3 simulations under the 2000 km inter-DC scenario confirm similar improvements.
LGDec 17, 2025
CoPHo: Classifier-guided Conditional Topology Generation with Persistent HomologyGongli Xi, Ye Tian, Mengyu Yang et al.
The structure of topology underpins much of the research on performance and robustness, yet available topology data are typically scarce, necessitating the generation of synthetic graphs with desired properties for testing or release. Prior diffusion-based approaches either embed conditions into the diffusion model, requiring retraining for each attribute and hindering real-time applicability, or use classifier-based guidance post-training, which does not account for topology scale and practical constraints. In this paper, we show from a discrete perspective that gradients from a pre-trained graph-level classifier can be incorporated into the discrete reverse diffusion posterior to steer generation toward specified structural properties. Based on this insight, we propose Classifier-guided Conditional Topology Generation with Persistent Homology (CoPHo), which builds a persistent homology filtration over intermediate graphs and interprets features as guidance signals that steer generation toward the desired properties at each denoising step. Experiments on four generic/network datasets demonstrate that CoPHo outperforms existing methods at matching target metrics, and we further validate its transferability on the QM9 molecular dataset.
CVSep 4, 2025
SAC-MIL: Spatial-Aware Correlated Multiple Instance Learning for Histopathology Whole Slide Image ClassificationYu Bai, Zitong Yu, Haowen Tian et al.
We propose Spatial-Aware Correlated Multiple Instance Learning (SAC-MIL) for performing WSI classification. SAC-MIL consists of a positional encoding module to encode position information and a SAC block to perform full instance correlations. The positional encoding module utilizes the instance coordinates within the slide to encode the spatial relationships instead of the instance index in the input WSI sequence. The positional encoding module can also handle the length extrapolation issue where the training and testing sequences have different lengths. The SAC block is an MLP-based method that performs full instance correlation in linear time complexity with respect to the sequence length. Due to the simple structure of MLP, it is easy to deploy since it does not require custom CUDA kernels, compared to Transformer-based methods for WSI classification. SAC-MIL has achieved state-of-the-art performance on the CAMELYON-16, TCGA-LUNG, and TCGA-BRAC datasets. The code will be released upon acceptance.
CVAug 5, 2025
Contrastive Cross-Bag Augmentation for Multiple Instance Learning-based Whole Slide Image ClassificationBo Zhang, Xu Xinan, Shuo Yan et al.
Recent pseudo-bag augmentation methods for Multiple Instance Learning (MIL)-based Whole Slide Image (WSI) classification sample instances from a limited number of bags, resulting in constrained diversity. To address this issue, we propose Contrastive Cross-Bag Augmentation ($C^2Aug$) to sample instances from all bags with the same class to increase the diversity of pseudo-bags. However, introducing new instances into the pseudo-bag increases the number of critical instances (e.g., tumor instances). This increase results in a reduced occurrence of pseudo-bags containing few critical instances, thereby limiting model performance, particularly on test slides with small tumor areas. To address this, we introduce a bag-level and group-level contrastive learning framework to enhance the discrimination of features with distinct semantic meanings, thereby improving model performance. Experimental results demonstrate that $C^2Aug$ consistently outperforms state-of-the-art approaches across multiple evaluation metrics.
IVAug 5, 2025
Nexus-INR: Diverse Knowledge-guided Arbitrary-Scale Multimodal Medical Image Super-ResolutionBo Zhang, JianFei Huo, Zheng Zhang et al.
Arbitrary-resolution super-resolution (ARSR) provides crucial flexibility for medical image analysis by adapting to diverse spatial resolutions. However, traditional CNN-based methods are inherently ill-suited for ARSR, as they are typically designed for fixed upsampling factors. While INR-based methods overcome this limitation, they still struggle to effectively process and leverage multi-modal images with varying resolutions and details. In this paper, we propose Nexus-INR, a Diverse Knowledge-guided ARSR framework, which employs varied information and downstream tasks to achieve high-quality, adaptive-resolution medical image super-resolution. Specifically, Nexus-INR contains three key components. A dual-branch encoder with an auxiliary classification task to effectively disentangle shared anatomical structures and modality-specific features; a knowledge distillation module using cross-modal attention that guides low-resolution modality reconstruction with high-resolution reference, enhanced by self-supervised consistency loss; an integrated segmentation module that embeds anatomical semantics to improve both reconstruction quality and downstream segmentation performance. Experiments on the BraTS2020 dataset for both super-resolution and downstream segmentation demonstrate that Nexus-INR outperforms state-of-the-art methods across various metrics.
CVAug 5, 2025
SSFMamba: Symmetry-driven Spatial-Frequency Feature Fusion for 3D Medical Image SegmentationBo Zhang, Yifan Zhang, Shuo Yan et al.
In light of the spatial domain's limited capacity for modeling global context in 3D medical image segmentation, emerging approaches have begun to incorporate frequency domain representations. However, straightforward feature extraction strategies often overlook the unique properties of frequency domain information, such as conjugate symmetry. They also fail to account for the fundamental differences in data distribution between the spatial and frequency domains, which can ultimately dilute or obscure the complementary strengths that frequency-based representations offer. In this paper, we propose SSFMamba, a Mamba based Symmetry-driven Spatial-Frequency feature fusion network for 3D medical image segmentation. SSFMamba employs a complementary dual-branch architecture that extracts features from both the spatial and frequency domains, and leverages a Mamba block to fuse these heterogeneous features to preserve global context while reinforcing local details. In the frequency domain branch, we harness Mamba's exceptional capability to extract global contextual information in conjunction with the synergistic effect of frequency domain features to further enhance global modeling. Moreover, we design a 3D multi-directional scanning mechanism to strengthen the fusion of local and global cues. Extensive experiments on the BraTS2020 and BraTS2023 datasets demonstrate that our approach consistently outperforms state-of-the-art methods across various evaluation metrics.
CVDec 1, 2024
DIVD: Deblurring with Improved Video Diffusion ModelHaoyang Long, Yan Wang, Wendong Wang
Video deblurring presents a considerable challenge owing to the complexity of blur, which frequently results from a combination of camera shakes, and object motions. In the field of video deblurring, many previous works have primarily concentrated on distortion-based metrics, such as PSNR. However, this approach often results in a weak correlation with human perception and yields reconstructions that lack realism. Diffusion models and video diffusion models have respectively excelled in the fields of image and video generation, particularly achieving remarkable results in terms of image authenticity and realistic perception. However, due to the computational complexity and challenges inherent in adapting diffusion models, there is still uncertainty regarding the potential of video diffusion models in video deblurring tasks. To explore the viability of video diffusion models in the task of video deblurring, we introduce a diffusion model specifically for this purpose. In this field, leveraging highly correlated information between adjacent frames and addressing the challenge of temporal misalignment are crucial research directions. To tackle these challenges, many improvements based on the video diffusion model are introduced in this work. As a result, our model outperforms existing models and achieves state-of-the-art results on a range of perceptual metrics. Our model preserves a significant amount of detail in the images while maintaining competitive distortion metrics. Furthermore, to the best of our knowledge, this is the first time the diffusion model has been applied in video deblurring to overcome the limitations mentioned above.
ITSep 17, 2019
RIP-based performance guarantee for low-tubal-rank tensor recoveryFeng Zhang, Wendong Wang, Jianwen Huang et al.
The essential task of tensor data analysis focuses on the tensor decomposition and the corresponding notion of rank. In this paper, by introducing the notion of tensor Singular Value Decomposition (t-SVD), we establish a Regularized Tensor Nuclear Norm Minimization (RTNNM) model for low-tubal-rank tensor recovery. As we know that many variants of the Restricted Isometry Property (RIP) have proven to be crucial analysis tools for sparse recovery. In the t-SVD framework, we initiatively define a novel tensor Restricted Isometry Property (t-RIP). Furthermore, we show that any third-order tensor $\boldsymbol{\mathcal{X}}$ can stably be recovered from few linear noise measurements under some certain t-RIP conditions via the RTNNM model. We note that, as far as the authors are aware, such kind of result has not previously been reported in the literature.
MLJun 4, 2019
Tensor Restricted Isometry Property Analysis For a Large Class of Random Measurement EnsemblesFeng Zhang, Wendong Wang, Jingyao Hou et al.
In previous work, theoretical analysis based on the tensor Restricted Isometry Property (t-RIP) established the robust recovery guarantees of a low-tubal-rank tensor. The obtained sufficient conditions depend strongly on the assumption that the linear measurement maps satisfy the t-RIP. In this paper, by exploiting the probabilistic arguments, we prove that such linear measurement maps exist under suitable conditions on the number of measurements in terms of the tubal rank r and the size of third-order tensor n1, n2, n3. And the obtained minimal possible number of linear measurements is nearly optimal compared with the degrees of freedom of a tensor with tubal rank r. Specially, we consider a random sub-Gaussian distribution that includes Gaussian, Bernoulli and all bounded distributions and construct a large class of linear maps that satisfy a t-RIP with high probability. Moreover, the validity of the required number of measurements is verified by numerical experiments.