CLApr 15, 2022Code
Text Revision by On-the-Fly Representation OptimizationJingjing Li, Zichao Li, Tao Ge et al.
Text revision refers to a family of natural language generation tasks, where the source and target sequences share moderate resemblance in surface form but differentiate in attributes, such as text formality and simplicity. Current state-of-the-art methods formulate these tasks as sequence-to-sequence learning problems, which rely on large-scale parallel training corpus. In this paper, we present an iterative in-place editing approach for text revision, which requires no parallel data. In this approach, we simply fine-tune a pre-trained Transformer with masked language modeling and attribute classification. During inference, the editing at each iteration is realized by two-step span replacement. At the first step, the distributed representation of the text optimizes on the fly towards an attribute function. At the second step, a text span is masked and another new one is proposed conditioned on the optimized representation. The empirical experiments on two typical and important text revision tasks, text formalization and text simplification, show the effectiveness of our approach. It achieves competitive and even better performance than state-of-the-art supervised methods on text simplification, and gains better performance than strong unsupervised methods on text formalization \footnote{Code and model are available at \url{https://github.com/jingjingli01/OREO}}.
CVAug 1, 2023Code
Zero-Shot Learning by Harnessing Adversarial SamplesZhi Chen, Pengfei Zhang, Jingjing Li et al.
Zero-Shot Learning (ZSL) aims to recognize unseen classes by generalizing the knowledge, i.e., visual and semantic relationships, obtained from seen classes, where image augmentation techniques are commonly applied to improve the generalization ability of a model. However, this approach can also cause adverse effects on ZSL since the conventional augmentation techniques that solely depend on single-label supervision is not able to maintain semantic information and result in the semantic distortion issue consequently. In other words, image argumentation may falsify the semantic (e.g., attribute) information of an image. To take the advantage of image augmentations while mitigating the semantic distortion issue, we propose a novel ZSL approach by Harnessing Adversarial Samples (HAS). HAS advances ZSL through adversarial training which takes into account three crucial aspects: (1) robust generation by enforcing augmentations to be similar to negative classes, while maintaining correct labels, (2) reliable generation by introducing a latent space constraint to avert significant deviations from the original data manifold, and (3) diverse generation by incorporating attribute-based perturbation by adjusting images according to each semantic attribute's localization. Through comprehensive experiments on three prominent zero-shot benchmark datasets, we demonstrate the effectiveness of our adversarial samples approach in both ZSL and Generalized Zero-Shot Learning (GZSL) scenarios. Our source code is available at https://github.com/uqzhichen/HASZSL.
LGAug 17, 2023Code
Beyond Sharing: Conflict-Aware Multivariate Time Series Anomaly DetectionHaotian Si, Changhua Pei, Zhihan Li et al.
Massive key performance indicators (KPIs) are monitored as multivariate time series data (MTS) to ensure the reliability of the software applications and service system. Accurately detecting the abnormality of MTS is very critical for subsequent fault elimination. The scarcity of anomalies and manual labeling has led to the development of various self-supervised MTS anomaly detection (AD) methods, which optimize an overall objective/loss encompassing all metrics' regression objectives/losses. However, our empirical study uncovers the prevalence of conflicts among metrics' regression objectives, causing MTS models to grapple with different losses. This critical aspect significantly impacts detection performance but has been overlooked in existing approaches. To address this problem, by mimicking the design of multi-gate mixture-of-experts (MMoE), we introduce CAD, a Conflict-aware multivariate KPI Anomaly Detection algorithm. CAD offers an exclusive structure for each metric to mitigate potential conflicts while fostering inter-metric promotions. Upon thorough investigation, we find that the poor performance of vanilla MMoE mainly comes from the input-output misalignment settings of MTS formulation and convergence issues arising from expansive tasks. To address these challenges, we propose a straightforward yet effective task-oriented metric selection and p&s (personalized and shared) gating mechanism, which establishes CAD as the first practicable multi-task learning (MTL) based MTS AD model. Evaluations on multiple public datasets reveal that CAD obtains an average F1-score of 0.943 across three public datasets, notably outperforming state-of-the-art methods. Our code is accessible at https://github.com/dawnvince/MTS_CAD.
CVMar 8, 2023Code
Imbalanced Open Set Domain Adaptation via Moving-threshold Estimation and Gradual AlignmentJinghan Ru, Jun Tian, Zhekai Du et al.
Multimedia applications are often associated with cross-domain knowledge transfer, where Unsupervised Domain Adaptation (UDA) can be used to reduce the domain shifts. Open Set Domain Adaptation (OSDA) aims to transfer knowledge from a well-labeled source domain to an unlabeled target domain under the assumption that the target domain contains unknown classes. Existing OSDA methods consistently lay stress on the covariate shift, ignoring the potential label shift problem. The performance of OSDA methods degrades drastically under intra-domain class imbalance and inter-domain label shift. However, little attention has been paid to this issue in the community. In this paper, the Imbalanced Open Set Domain Adaptation (IOSDA) is explored where the covariate shift, label shift and category mismatch exist simultaneously. To alleviate the negative effects raised by label shift in OSDA, we propose Open-set Moving-threshold Estimation and Gradual Alignment (OMEGA) - a novel architecture that improves existing OSDA methods on class-imbalanced data. Specifically, a novel unknown-aware target clustering scheme is proposed to form tight clusters in the target domain to reduce the negative effects of label shift and intra-domain class imbalance. Furthermore, moving-threshold estimation is designed to generate specific thresholds for each target sample rather than using one for all. Extensive experiments on IOSDA, OSDA and OPDA benchmarks demonstrate that our method could significantly outperform existing state-of-the-arts. Code and data are available at https://github.com/mendicant04/OMEGA.
92.1SEJun 3
UModel: An Agent-Ready Observability Data Modeling Method at ScaleChanghua Pei, Zheyuan Li, Zexin Wang et al.
When networked system failures occur, automatically performing Root Cause Analysis (RCA) using observability data is critical for ensuring networked system reliability. Recently, LLM-based agents have shown promise for automating this diagnosis process through advanced reasoning and autonomous exploration. However, existing observability frameworks remain archaic, characterized by fragmented data silos, incompatible schemas, and insufficient semantic metadata, preventing agents from establishing the complex relationships required for effective RCA. To address these challenges, we present UModel, a unified ontological framework that shifts observability from data-centric to object-centric modeling. UModel constructs a virtual ontological layer where heterogeneous telemetry, entities, and expert knowledge are standardized as objects and interconnected via semantic graphs. In addition, we introduce U-SPL, a pipeline-based query interface that enables agents to autonomously explore system topologies and correlate multimodal data. By re-modeling the "AIOps 2025 Challenge" dataset using UModel, the precision of root cause localization improved by 8%, demonstrating that enhanced data organization can significantly increase the accuracy of downstream tasks. UModel provides a scalable modeling framework that, in its deployment at Alibaba Cloud for more than one year, has served tens of thousands of users, sustained millions of operations per second, and delivered sub-second query latency.
LGFeb 23, 2023
A Comprehensive Survey on Source-free Domain AdaptationZhiqi Yu, Jingjing Li, Zhekai Du et al.
Over the past decade, domain adaptation has become a widely studied branch of transfer learning that aims to improve performance on target domains by leveraging knowledge from the source domain. Conventional domain adaptation methods often assume access to both source and target domain data simultaneously, which may not be feasible in real-world scenarios due to privacy and confidentiality concerns. As a result, the research of Source-Free Domain Adaptation (SFDA) has drawn growing attention in recent years, which only utilizes the source-trained model and unlabeled target data to adapt to the target domain. Despite the rapid explosion of SFDA work, yet there has no timely and comprehensive survey in the field. To fill this gap, we provide a comprehensive survey of recent advances in SFDA and organize them into a unified categorization scheme based on the framework of transfer learning. Instead of presenting each approach independently, we modularize several components of each method to more clearly illustrate their relationships and mechanics in light of the composite properties of each method. Furthermore, we compare the results of more than 30 representative SFDA methods on three popular classification benchmarks, namely Office-31, Office-home, and VisDA, to explore the effectiveness of various technical routes and the combination effects among them. Additionally, we briefly introduce the applications of SFDA and related fields. Drawing from our analysis of the challenges facing SFDA, we offer some insights into future research directions and potential settings.
LGOct 19, 2022
Variational Model Perturbation for Source-Free Domain AdaptationMengmeng Jing, Xiantong Zhen, Jingjing Li et al.
We aim for source-free domain adaptation, where the task is to deploy a model pre-trained on source domains to target domains. The challenges stem from the distribution shift from the source to the target domain, coupled with the unavailability of any source data and labeled target data for optimization. Rather than fine-tuning the model by updating the parameters, we propose to perturb the source model to achieve adaptation to target domains. We introduce perturbations into the model parameters by variational Bayesian inference in a probabilistic framework. By doing so, we can effectively adapt the model to the target domain while largely preserving the discriminative ability. Importantly, we demonstrate the theoretical connection to learning Bayesian neural networks, which proves the generalizability of the perturbed model to target domains. To enable more efficient optimization, we further employ a parameter sharing strategy, which substantially reduces the learnable parameters compared to a fully Bayesian neural network. Our model perturbation provides a new probabilistic way for domain adaptation which enables efficient adaptation to target domains while maximally preserving knowledge in source models. Experiments on several source-free benchmarks under three different evaluation settings verify the effectiveness of the proposed variational model perturbation for source-free domain adaptation.
90.2CVJun 1
Explainable Forensics of Manipulated Segments in Untrimmed Long VideosYue Feng, Jingjing Li, Qijia Lu et al.
The rapid advancement of AI-driven video generation has transformed content creation, while simultaneously increasing the risk of misinformation through localized manipulations in long-form videos. Existing video forensic methods predominantly operate on short, independent clips, and thus fail to capture realistic scenarios where AI-generated content is sparsely embedded within otherwise authentic footage. To bridge this gap, we formulate the task of Temporal AI-Generated Segment Localization and Explanation, which targets authenticity detection, temporal localization, and interpretable analysis of manipulated segments in untrimmed long videos. We further introduce TASLE, a large-scale benchmark comprising 12,472 untrimmed videos with diverse manipulation patterns and rich annotation signals, including temporal boundaries, authenticity labels, and segment-level rationales. In addition, we propose MSLoc, a coarse-to-fine forensic baseline that combines a boundary-sensitive proposal generation module for efficient long-video scanning with an MLLM-based refinement module for precise boundary localization and interpretable reasoning. Experiments validate the effectiveness of the proposed baseline, highlighting the importance of segment-level explainable forensics for long-form AI-generated video analysis. Our dataset and code are publicly available at https://debby-0527.github.io/TASLE.
82.7MAJun 1
Agent System Operations: Categorization, Challenges, and Future DirectionsZexin Wang, Changhua Pei, Yuanhao Liu et al.
As the reasoning capabilities of Large Language Models (LLMs) continue to advance, LLM-based agent systems offer advantages in flexibility and interpretability over traditional systems, garnering increasing attention. However, despite the widespread research interest and industrial application of agent systems, these systems, like their traditional counterparts, frequently encounter anomalies. These anomalies lead to instability and insecurity, hindering their further development. Therefore, a comprehensive and systematic approach to the operation and maintenance of agent systems is urgently needed. Unfortunately, current research on the operations of agent systems is sparse. To address this gap, we have undertaken a survey on agent system operations with the aim of establishing a clear framework for the field, defining the challenges, and facilitating further development. Specifically, this paper begins by systematically defining anomalies within agent systems, categorizing them into intra-agent anomalies and inter-agent anomalies. Next, we introduce a novel and comprehensive operational framework for agent systems, dubbed Agent System Operations (AgentOps). We provide detailed definitions and explanations of its four key stages: monitoring, anomaly detection, root cause localization, and resolution.
CVApr 12, 2023
Segment Anything Is Not Always Perfect: An Investigation of SAM on Different Real-world ApplicationsWei Ji, Jingjing Li, Qi Bi et al.
Recently, Meta AI Research approaches a general, promptable Segment Anything Model (SAM) pre-trained on an unprecedentedly large segmentation dataset (SA-1B). Without a doubt, the emergence of SAM will yield significant benefits for a wide array of practical image segmentation applications. In this study, we conduct a series of intriguing investigations into the performance of SAM across various applications, particularly in the fields of natural images, agriculture, manufacturing, remote sensing, and healthcare. We analyze and discuss the benefits and limitations of SAM, while also presenting an outlook on its future development in segmentation tasks. By doing so, we aim to give a comprehensive understanding of SAM's practical applications. This work is expected to provide insights that facilitate future research activities toward generic segmentation. Source code is publicly available.
CVSep 23, 2023
Order-preserving Consistency Regularization for Domain Adaptation and GeneralizationMengmeng Jing, Xiantong Zhen, Jingjing Li et al.
Deep learning models fail on cross-domain challenges if the model is oversensitive to domain-specific attributes, e.g., lightning, background, camera angle, etc. To alleviate this problem, data augmentation coupled with consistency regularization are commonly adopted to make the model less sensitive to domain-specific attributes. Consistency regularization enforces the model to output the same representation or prediction for two views of one image. These constraints, however, are either too strict or not order-preserving for the classification probabilities. In this work, we propose the Order-preserving Consistency Regularization (OCR) for cross-domain tasks. The order-preserving property for the prediction makes the model robust to task-irrelevant transformations. As a result, the model becomes less sensitive to the domain-specific attributes. The comprehensive experiments show that our method achieves clear advantages on five different cross-domain tasks.
CVDec 4, 2025Code
SAM3-I: Segment Anything with InstructionsJingjing Li, Yue Feng, Yuchen Guo et al.
Segment Anything Model 3 (SAM3) has advanced open-vocabulary segmentation through promptable concept segmentation, allowing users to segment all instances corresponding to a given concept, typically specified with short noun-phrase (NP) prompts. While this marks the first integration of language-level concepts within the SAM family, real-world usage typically requires far richer expressions that include attributes, spatial relations, functionalities, actions, states, and even implicit reasoning over instances. Currently, SAM3 relies on external multi-modal agents to convert complex instructions into NPs and then conduct iterative mask filtering. However, these NP-level concepts remain overly coarse, often failing to precisely represent a specific instance. In this work, we present SAM3-I, an enhanced framework that unifies concept-level understanding and instruction-level reasoning within the SAM family. SAM3-I introduces an instruction-aware cascaded adaptation mechanism that progressively aligns expressive instruction semantics with SAM3's existing vision-language representations, enabling direct instruction-following segmentation without sacrificing its original concept-driven capabilities. Furthermore, we design a structured instruction taxonomy spanning concept, simple, and complex levels, and develop a scalable data engine to construct a dataset with diverse instruction-mask pairs. Experiments show that SAM3-I delivers appealing performance, demonstrating that SAM3 can be effectively extended to follow natural-language instructions while preserving its strong concept grounding. We open-source SAM3-I and provide practical fine-tuning workflows, enabling researchers to adapt it to domain-specific applications. The source code is available here.
CVDec 18, 2022
Minimizing Maximum Model Discrepancy for Transferable Black-box Targeted AttacksAnqi Zhao, Tong Chu, Yahao Liu et al.
In this work, we study the black-box targeted attack problem from the model discrepancy perspective. On the theoretical side, we present a generalization error bound for black-box targeted attacks, which gives a rigorous theoretical analysis for guaranteeing the success of the attack. We reveal that the attack error on a target model mainly depends on empirical attack error on the substitute model and the maximum model discrepancy among substitute models. On the algorithmic side, we derive a new algorithm for black-box targeted attacks based on our theoretical analysis, in which we additionally minimize the maximum model discrepancy(M3D) of the substitute models when training the generator to generate adversarial examples. In this way, our model is capable of crafting highly transferable adversarial examples that are robust to the model variation, thus improving the success rate for attacking the black-box model. We conduct extensive experiments on the ImageNet dataset with different classification models, and our proposed approach outperforms existing state-of-the-art methods by a significant margin. Our codes will be released.
67.0CVMar 15Code
Selective Noise Suppression and Discriminative Mutual Interaction for Robust Audio-Visual SegmentationKai Peng, Yunzhe Shen, Miao Zhang et al.
The ability to capture and segment sounding objects in dynamic visual scenes is crucial for the development of Audio-Visual Segmentation (AVS) tasks. While significant progress has been made in this area, the interaction between audio and visual modalities still requires further exploration. In this work, we aim to answer the following questions: How can a model effectively suppress audio noise while enhancing relevant audio information? How can we achieve discriminative interaction between the audio and visual modalities? To this end, we propose SDAVS, equipped with the Selective Noise-Resilient Processor (SNRP) module and the Discriminative Audio-Visual Mutual Fusion (DAMF) strategy. The proposed SNRP mitigates audio noise interference by selectively emphasizing relevant auditory cues, while DAMF ensures more consistent audio-visual representations. Experimental results demonstrate that our proposed method achieves state-of-the-art performance on benchmark AVS datasets, especially in multi-source and complex scenes. \textit{The code and model are available at https://github.com/happylife-pk/SDAVS}.
ROFeb 25Code
Self-Correcting VLA: Online Action Refinement via Sparse World ImaginationChenyv Liu, Wentao Tan, Lei Zhu et al.
Standard vision-language-action (VLA) models rely on fitting statistical data priors, limiting their robust understanding of underlying physical dynamics. Reinforcement learning enhances physical grounding through exploration yet typically relies on external reward signals that remain isolated from the agent's internal states. World action models have emerged as a promising paradigm that integrates imagination and control to enable predictive planning. However, they rely on implicit context modeling, lacking explicit mechanisms for self-improvement. To solve these problems, we propose Self-Correcting VLA (SC-VLA), which achieve self-improvement by intrinsically guiding action refinement through sparse imagination. We first design sparse world imagination by integrating auxiliary predictive heads to forecast current task progress and future trajectory trends, thereby constraining the policy to encode short-term physical evolution. Then we introduce the online action refinement module to reshape progress-dependent dense rewards, adjusting trajectory orientation based on the predicted sparse future states. Evaluations on challenging robot manipulation tasks from simulation benchmarks and real-world settings demonstrate that SC-VLA achieve state-of-the-art performance, yielding the highest task throughput with 16% fewer steps and a 9% higher success rate than the best-performing baselines, alongside a 14% gain in real-world experiments. Code is available at https://github.com/Kisaragi0/SC-VLA.
CVMay 15, 2022
Promoting Saliency From Depth: Deep Unsupervised RGB-D Saliency DetectionWei Ji, Jingjing Li, Qi Bi et al.
Growing interests in RGB-D salient object detection (RGB-D SOD) have been witnessed in recent years, owing partly to the popularity of depth sensors and the rapid progress of deep learning techniques. Unfortunately, existing RGB-D SOD methods typically demand large quantity of training images being thoroughly annotated at pixel-level. The laborious and time-consuming manual annotation has become a real bottleneck in various practical scenarios. On the other hand, current unsupervised RGB-D SOD methods still heavily rely on handcrafted feature representations. This inspires us to propose in this paper a deep unsupervised RGB-D saliency detection approach, which requires no manual pixel-level annotation during training. It is realized by two key ingredients in our training pipeline. First, a depth-disentangled saliency update (DSU) framework is designed to automatically produce pseudo-labels with iterative follow-up refinements, which provides more trustworthy supervision signals for training the saliency network. Second, an attentive training strategy is introduced to tackle the issue of noisy pseudo-labels, by properly re-weighting to highlight the more reliable pseudo-labels. Extensive experiments demonstrate the superior efficiency and effectiveness of our approach in tackling the challenging unsupervised RGB-D SOD scenarios. Moreover, our approach can also be adapted to work in fully-supervised situation. Empirical studies show the incorporation of our approach gives rise to notably performance improvement in existing supervised RGB-D SOD models.
CVSep 5, 2022
Federated Zero-Shot Learning for Visual RecognitionZhi Chen, Yadan Luo, Sen Wang et al.
Zero-shot learning is a learning regime that recognizes unseen classes by generalizing the visual-semantic relationship learned from the seen classes. To obtain an effective ZSL model, one may resort to curating training samples from multiple sources, which may inevitably raise the privacy concerns about data sharing across different organizations. In this paper, we propose a novel Federated Zero-Shot Learning FedZSL framework, which learns a central model from the decentralized data residing on edge devices. To better generalize to previously unseen classes, FedZSL allows the training data on each device sampled from the non-overlapping classes, which are far from the i.i.d. that traditional federated learning commonly assumes. We identify two key challenges in our FedZSL protocol: 1) the trained models are prone to be biased to the locally observed classes, thus failing to generalize to the unseen classes and/or seen classes appeared on other devices; 2) as each category in the training data comes from a single source, the central model is highly vulnerable to model replacement (backdoor) attacks. To address these issues, we propose three local objectives for visual-semantic alignment and cross-device alignment through relation distillation, which leverages the normalized class-wise covariance to regularize the consistency of the prediction logits across devices. To defend against the backdoor attacks, a feature magnitude defending technique is proposed. As malicious samples are less correlated to the given semantic attributes, the visual features of low magnitude will be discarded to stabilize model updates. The effectiveness and robustness of FedZSL are demonstrated by extensive experiments conducted on three zero-shot benchmark datasets.
CLJun 25, 2022
Graph Component Contrastive Learning for Concept Relatedness EstimationYueen Ma, Zixing Song, Xuming Hu et al.
Concept relatedness estimation (CRE) aims to determine whether two given concepts are related. Existing methods only consider the pairwise relationship between concepts, while overlooking the higher-order relationship that could be encoded in a concept-level graph structure. We discover that this underlying graph satisfies a set of intrinsic properties of CRE, including reflexivity, commutativity, and transitivity. In this paper, we formalize the CRE properties and introduce a graph structure named ConcreteGraph. To address the data scarcity issue in CRE, we introduce a novel data augmentation approach to sample new concept pairs from the graph. As it is intractable for data augmentation to fully capture the structural information of the ConcreteGraph due to a large amount of potential concept pairs, we further introduce a novel Graph Component Contrastive Learning framework to implicitly learn the complete structure of the ConcreteGraph. Empirical results on three datasets show significant improvement over the state-of-the-art model. Detailed ablation studies demonstrate that our proposed approach can effectively capture the high-order relationship among concepts.
CVJul 5, 2022
GSMFlow: Generation Shifts Mitigating Flow for Generalized Zero-Shot LearningZhi Chen, Yadan Luo, Sen Wang et al.
Generalized Zero-Shot Learning (GZSL) aims to recognize images from both the seen and unseen classes by transferring semantic knowledge from seen to unseen classes. It is a promising solution to take the advantage of generative models to hallucinate realistic unseen samples based on the knowledge learned from the seen classes. However, due to the generation shifts, the synthesized samples by most existing methods may drift from the real distribution of the unseen data. To address this issue, we propose a novel flow-based generative framework that consists of multiple conditional affine coupling layers for learning unseen data generation. Specifically, we discover and address three potential problems that trigger the generation shifts, i.e., semantic inconsistency, variance collapse, and structure disorder. First, to enhance the reflection of the semantic information in the generated samples, we explicitly embed the semantic information into the transformation in each conditional affine coupling layer. Second, to recover the intrinsic variance of the real unseen features, we introduce a boundary sample mining strategy with entropy maximization to discover more difficult visual variants of semantic prototypes and hereby adjust the decision boundary of the classifiers. Third, a relative positioning strategy is proposed to revise the attribute embeddings, guiding them to fully preserve the inter-class geometric structure and further avoid structure disorder in the semantic space. Extensive experimental results on four GZSL benchmark datasets demonstrate that GSMFlow achieves the state-of-the-art performance on GZSL.
CVDec 26, 2022
Semantic Enhanced Knowledge Graph for Large-Scale Zero-Shot LearningJiwei Wei, Yang Yang, Zeyu Ma et al.
Zero-Shot Learning has been a highlighted research topic in both vision and language areas. Recently, most existing methods adopt structured knowledge information to model explicit correlations among categories and use deep graph convolutional network to propagate information between different categories. However, it is difficult to add new categories to existing structured knowledge graph, and deep graph convolutional network suffers from over-smoothing problem. In this paper, we provide a new semantic enhanced knowledge graph that contains both expert knowledge and categories semantic correlation. Our semantic enhanced knowledge graph can further enhance the correlations among categories and make it easy to absorb new categories. To propagate information on the knowledge graph, we propose a novel Residual Graph Convolutional Network (ResGCN), which can effectively alleviate the problem of over-smoothing. Experiments conducted on the widely used large-scale ImageNet-21K dataset and AWA2 dataset show the effectiveness of our method, and establish a new state-of-the-art on zero-shot learning. Moreover, our results on the large-scale ImageNet-21K with various feature extraction networks show that our method has better generalization and robustness.
ROMar 2Code
Non-Markovian Long-Horizon Robot Manipulation via Keyframe ChainingYipeng Chen, Wentao Tan, Lei Zhu et al.
Existing Vision-Language-Action (VLA) models often struggle to generalize to long-horizon tasks due to their heavy reliance on immediate observations. While recent studies incorporate retrieval mechanisms or extend context windows to handle procedural tasks, they often struggle to capture Non-Markovian dependencies, where optimal actions rely solely on specific past states rather than the current observation. To address this, we introduce Keyframe-Chaining VLA, a framework that extracts and links key historical frames to model long-horizon dependencies. Specifically, we propose an automatic keyframe selector that learns a discriminative embedding space, effectively identifying distinct state transitions. To capture task-critical information, we design a progress-aware query mechanism that dynamically retrieves historical frames based on their temporal relevance to the current execution phase. These selected keyframes are integrated into the VLA as interleaved visual tokens, explicitly grounding the policy in the long-horizon temporal context. Finally, we introduce a suite of four Non-Markovian manipulation tasks built upon the ManiSkill simulator to measure task success rates. Experimental results demonstrate that our method achieves superior performance, effectively tackling robot manipulation tasks characterized by long-horizon temporal dependencies. Code is available at https://github.com/cytoplastm/KC-VLA.
CLJul 3, 2023
VOLTA: Improving Generative Diversity by Variational Mutual Information Maximizing AutoencoderYueen Ma, Dafeng Chi, Jingjing Li et al.
The natural language generation domain has witnessed great success thanks to Transformer models. Although they have achieved state-of-the-art generative quality, they often neglect generative diversity. Prior attempts to tackle this issue suffer from either low model capacity or over-complicated architectures. Some recent methods employ the VAE framework to enhance diversity, but their latent variables fully depend on the input context, restricting exploration of the latent space. In this paper, we introduce VOLTA, a framework that elevates generative diversity by bridging Transformer with VAE via a more effective cross-attention-based connection, departing from conventional embedding concatenation or summation. Additionally, we propose integrating InfoGAN-style latent codes to enable input-independent variability, further diversifying the generation. Moreover, our framework accommodates discrete inputs alongside its existing support for continuous inputs. We perform comprehensive experiments with two types of Transformers on six datasets from three different NLG tasks to show that our approach can significantly improve generative diversity while maintaining generative quality.
CLMar 20, 2024Code
An Entropy-based Text Watermarking Detection MethodYijian Lu, Aiwei Liu, Dianzhi Yu et al. · tsinghua
Text watermarking algorithms for large language models (LLMs) can effectively identify machine-generated texts by embedding and detecting hidden features in the text. Although the current text watermarking algorithms perform well in most high-entropy scenarios, its performance in low-entropy scenarios still needs to be improved. In this work, we opine that the influence of token entropy should be fully considered in the watermark detection process, $i.e.$, the weight of each token during watermark detection should be customized according to its entropy, rather than setting the weights of all tokens to the same value as in previous methods. Specifically, we propose \textbf{E}ntropy-based Text \textbf{W}atermarking \textbf{D}etection (\textbf{EWD}) that gives higher-entropy tokens higher influence weights during watermark detection, so as to better reflect the degree of watermarking. Furthermore, the proposed detection process is training-free and fully automated. From the experiments, we demonstrate that our EWD can achieve better detection performance in low-entropy scenarios, and our method is also general and can be applied to texts with different entropy distributions. Our code and data is available\footnote{\url{https://github.com/luyijian3/EWD}}. Additionally, our algorithm could be accessed through MarkLLM \cite{pan2024markllm}\footnote{\url{https://github.com/THU-BPM/MarkLLM}}.
CLApr 22, 2024Code
How Well Can LLMs Echo Us? Evaluating AI Chatbots' Role-Play Ability with ECHOMan Tik Ng, Hui Tung Tse, Jen-tse Huang et al. · pku, tencent-ai
The role-play ability of Large Language Models (LLMs) has emerged as a popular research direction. However, existing studies focus on imitating well-known public figures or fictional characters, overlooking the potential for simulating ordinary individuals. Such an oversight limits the potential for advancements in digital human clones and non-player characters in video games. To bridge this gap, we introduce ECHO, an evaluative framework inspired by the Turing test. This framework engages the acquaintances of the target individuals to distinguish between human and machine-generated responses. Notably, our framework focuses on emulating average individuals rather than historical or fictional figures, presenting a unique advantage to apply the Turing Test. We evaluated three role-playing LLMs using ECHO, with GPT-3.5 and GPT-4 serving as foundational models, alongside the online application GPTs from OpenAI. Our results demonstrate that GPT-4 more effectively deceives human evaluators, and GPTs achieves a leading success rate of 48.3%. Furthermore, we investigated whether LLMs could discern between human-generated and machine-generated texts. While GPT-4 can identify differences, it could not determine which texts were human-produced. Our code and results of reproducing the role-playing LLMs are made publicly available via https://github.com/CUHK-ARISE/ECHO.
CVMar 8, 2024Code
Agile Multi-Source-Free Domain AdaptationXinyao Li, Jingjing Li, Fengling Li et al.
Efficiently utilizing rich knowledge in pretrained models has become a critical topic in the era of large models. This work focuses on adaptively utilizing knowledge from multiple source-pretrained models to an unlabeled target domain without accessing the source data. Despite being a practically useful setting, existing methods require extensive parameter tuning over each source model, which is computationally expensive when facing abundant source domains or larger source models. To address this challenge, we propose a novel approach which is free of the parameter tuning over source backbones. Our technical contribution lies in the Bi-level ATtention ENsemble (Bi-ATEN) module, which learns both intra-domain weights and inter-domain ensemble weights to achieve a fine balance between instance specificity and domain consistency. By slightly tuning source bottlenecks, we achieve comparable or even superior performance on a challenging benchmark DomainNet with less than 3% trained parameters and 8 times of throughput compared with SOTA method. Furthermore, with minor modifications, the proposed module can be easily equipped to existing methods and gain more than 4% performance boost. Code is available at https://github.com/TL-UESTC/Bi-ATEN.
IRJun 19, 2023
Att-KGCN: Tourist Attractions Recommendation System by using Attention mechanism and Knowledge Graph Convolution NetworkAhmad A. Mubarak, JingJing Li, Han Cao
The recommendation algorithm based on knowledge graphs is at a relatively mature stage. However, there are still some problems in the recommendation of specific areas. For example, in the tourism field, selecting suitable tourist attraction attributes process is complicated as the recommendation basis for tourist attractions. In this paper, we propose the improved Attention Knowledge Graph Convolution Network model, named ($Att-KGCN$), which automatically discovers the neighboring entities of the target scenic spot semantically. The attention layer aggregates relatively similar locations and represents them with an adjacent vector. Then, according to the tourist's preferred choices, the model predicts the probability of similar spots as a recommendation system. A knowledge graph dataset of tourist attractions used based on tourism data on Socotra Island-Yemen. Through experiments, it is verified that the Attention Knowledge Graph Convolution Network has a good effect on the recommendation of tourist attractions and can make more recommendations for tourists' choices.
CVMar 11, 2024Code
Split to Merge: Unifying Separated Modalities for Unsupervised Domain AdaptationXinyao Li, Yuke Li, Zhekai Du et al.
Large vision-language models (VLMs) like CLIP have demonstrated good zero-shot learning performance in the unsupervised domain adaptation task. Yet, most transfer approaches for VLMs focus on either the language or visual branches, overlooking the nuanced interplay between both modalities. In this work, we introduce a Unified Modality Separation (UniMoS) framework for unsupervised domain adaptation. Leveraging insights from modality gap studies, we craft a nimble modality separation network that distinctly disentangles CLIP's features into language-associated and vision-associated components. Our proposed Modality-Ensemble Training (MET) method fosters the exchange of modality-agnostic information while maintaining modality-specific nuances. We align features across domains using a modality discriminator. Comprehensive evaluations on three benchmarks reveal our approach sets a new state-of-the-art with minimal computational costs. Code: https://github.com/TL-UESTC/UniMoS
CLMar 6, 2024Code
CLongEval: A Chinese Benchmark for Evaluating Long-Context Large Language ModelsZexuan Qiu, Jingjing Li, Shijue Huang et al.
Developing Large Language Models (LLMs) with robust long-context capabilities has been the recent research focus, resulting in the emergence of long-context LLMs proficient in Chinese. However, the evaluation of these models remains underdeveloped due to a lack of benchmarks. To address this gap, we present CLongEval, a comprehensive Chinese benchmark for evaluating long-context LLMs. CLongEval is characterized by three key features: (1) Sufficient data volume, comprising 7 distinct tasks and 7,267 examples; (2) Broad applicability, accommodating to models with context windows size from 1K to 100K; (3) High quality, with over 2,000 manually annotated question-answer pairs in addition to the automatically constructed labels. With CLongEval, we undertake a comprehensive assessment of 6 open-source long-context LLMs and 2 leading commercial counterparts that feature both long-context abilities and proficiency in Chinese. We also provide in-depth analysis based on the empirical results, trying to shed light on the critical capabilities that present challenges in long-context settings. The dataset, evaluation scripts, and model outputs are released.
CVDec 16, 2025
The Devil is in Attention Sharing: Improving Complex Non-rigid Image Editing Faithfulness via Attention SynergyZhuo Chen, Fanyue Wei, Runze Xu et al.
Training-free image editing with large diffusion models has become practical, yet faithfully performing complex non-rigid edits (e.g., pose or shape changes) remains highly challenging. We identify a key underlying cause: attention collapse in existing attention sharing mechanisms, where either positional embeddings or semantic features dominate visual content retrieval, leading to over-editing or under-editing. To address this issue, we introduce SynPS, a method that Synergistically leverages Positional embeddings and Semantic information for faithful non-rigid image editing. We first propose an editing measurement that quantifies the required editing magnitude at each denoising step. Based on this measurement, we design an attention synergy pipeline that dynamically modulates the influence of positional embeddings, enabling SynPS to balance semantic modifications and fidelity preservation. By adaptively integrating positional and semantic cues, SynPS effectively avoids both over- and under-editing. Extensive experiments on public and newly curated benchmarks demonstrate the superior performance and faithfulness of our approach.
CVNov 11, 2025
Distributed Zero-Shot Learning for Visual RecognitionZhi Chen, Yadan Luo, Zi Huang et al.
In this paper, we propose a Distributed Zero-Shot Learning (DistZSL) framework that can fully exploit decentralized data to learn an effective model for unseen classes. Considering the data heterogeneity issues across distributed nodes, we introduce two key components to ensure the effective learning of DistZSL: a cross-node attribute regularizer and a global attribute-to-visual consensus. Our proposed cross-node attribute regularizer enforces the distances between attribute features to be similar across different nodes. In this manner, the overall attribute feature space would be stable during learning, and thus facilitate the establishment of visual-to-attribute(V2A) relationships. Then, we introduce the global attribute-tovisual consensus to mitigate biased V2A mappings learned from individual nodes. Specifically, we enforce the bilateral mapping between the attribute and visual feature distributions to be consistent across different nodes. Thus, the learned consistent V2A mapping can significantly enhance zero-shot learning across different nodes. Extensive experiments demonstrate that DistZSL achieves superior performance to the state-of-the-art in learning from distributed data.
CVDec 2, 2025
Generalizing Vision-Language Models with Dedicated Prompt GuidanceXinyao Li, Yinjie Min, Hongbo Chen et al.
Fine-tuning large pretrained vision-language models (VLMs) has emerged as a prevalent paradigm for downstream adaptation, yet it faces a critical trade-off between domain specificity and domain generalization (DG) ability. Current methods typically fine-tune a universal model on the entire dataset, which potentially compromises the ability to generalize to unseen domains. To fill this gap, we provide a theoretical understanding of the generalization ability for VLM fine-tuning, which reveals that training multiple parameter-efficient expert models on partitioned source domains leads to better generalization than fine-tuning a universal model. Inspired by this finding, we propose a two-step domain-expert-Guided DG (GuiDG) framework. GuiDG first employs prompt tuning to obtain source domain experts, then introduces a Cross-Modal Attention module to guide the fine-tuning of the vision encoder via adaptive expert integration. To better evaluate few-shot DG, we construct ImageNet-DG from ImageNet and its variants. Extensive experiments on standard DG benchmarks and ImageNet-DG demonstrate that GuiDG improves upon state-of-the-art fine-tuning methods while maintaining efficiency.
ARApr 20, 2025Code
ReasoningV: Efficient Verilog Code Generation with Adaptive Hybrid Reasoning ModelHaiyan Qin, Zhiwei Xie, Jingjing Li et al.
Large Language Models (LLMs) have advanced Verilog code generation significantly, yet face challenges in data quality, reasoning capabilities, and computational efficiency. This paper presents ReasoningV, a novel model employing a hybrid reasoning strategy that integrates trained intrinsic capabilities with dynamic inference adaptation for Verilog code generation. Our framework introduces three complementary innovations: (1) ReasoningV-5K, a high-quality dataset of 5,000 functionally verified instances with reasoning paths created through multi-dimensional filtering of PyraNet samples; (2) a two-stage training approach combining parameter-efficient fine-tuning for foundational knowledge with full-parameter optimization for enhanced reasoning; and (3) an adaptive reasoning mechanism that dynamically adjusts reasoning depth based on problem complexity, reducing token consumption by up to 75\% while preserving performance. Experimental results demonstrate ReasoningV's effectiveness with a pass@1 accuracy of 57.8\% on VerilogEval-human, achieving performance competitive with leading commercial models like Gemini-2.0-flash (59.5\%) and exceeding the previous best open-source model by 10.4 percentage points. ReasoningV offers a more reliable and accessible pathway for advancing AI-driven hardware design automation, with our model, data, and code available at https://github.com/BUAA-CLab/ReasoningV.
58.1HCMay 16
WhiteTesseract: Reframing the Interpretation of Cultural Heritage through XR and Conversational AIJingjing Li, Zhi Liu, Xiyao Jin et al.
Cultural heritage exhibitions often struggle to sustain attention and support reflective engagement. Physical exhibitions rely on fixed interpretive aids that lack adaptability to individual backgrounds or curiosity, and their effectiveness depends heavily on a visitor's Personal Context, prior knowledge, and cultural literacy. Meanwhile, digital exhibitions prioritize convenience and accessibility but risk weakening the Physical and Social Contexts that define embodied cultural experience. WhiteTesseract addresses this gap by enabling in-situ interpretation through high-resolution XR and conversational AI. The system integrates spatial intelligence via artwork recognition to allow visitors to selectively reduce environmental distractions (via diminished reality) and engage in context-aware dialogue (via large language models). The goal is to preserve the richness of the physical and social environment while providing a flexible space for personal reflection, enhancing Personal Context without compromising physical authenticity. We deployed the system in a Claude Monet exhibition and conducted a controlled user study with 26 participants. Quantitative results showed that WhiteTesseract modulation significantly increased average viewing duration from 35.3 to 98.3 seconds (p < 0.001). Analysis of 529 visitor-AI interactions revealed that 60% extended beyond factual queries to include analytical, emotional, and comparative inquiries. These findings demonstrate how XR and AI can enrich the physical exhibition experience by supporting deeper, more personalized engagement without displacing the embodied value of cultural heritage. We discuss technical and social constraints for real-world deployment and limitations of our controlled setting.
CVMar 13, 2025Code
SVIP: Semantically Contextualized Visual Patches for Zero-Shot LearningZhi Chen, Zecheng Zhao, Jingcai Guo et al.
Zero-shot learning (ZSL) aims to recognize unseen classes without labeled training examples by leveraging class-level semantic descriptors such as attributes. A fundamental challenge in ZSL is semantic misalignment, where semantic-unrelated information involved in visual features introduce ambiguity to visual-semantic interaction. Unlike existing methods that suppress semantic-unrelated information post hoc either in the feature space or the model space, we propose addressing this issue at the input stage, preventing semantic-unrelated patches from propagating through the network. To this end, we introduce Semantically contextualized VIsual Patches (SVIP) for ZSL, a transformer-based framework designed to enhance visual-semantic alignment. Specifically, we propose a self-supervised patch selection mechanism that preemptively learns to identify semantic-unrelated patches in the input space. This is trained with the supervision from aggregated attention scores across all transformer layers, which estimate each patch's semantic score. As removing semantic-unrelated patches from the input sequence may disrupt object structure, we replace them with learnable patch embeddings. With initialization from word embeddings, we can ensure they remain semantically meaningful throughout feature extraction. Extensive experiments on ZSL benchmarks demonstrate that SVIP achieves state-of-the-art performance results while providing more interpretable and semantically rich feature representations. Code is available at https://github.com/uqzhichen/SVIP.
CVMay 15, 2025Code
SpikeVideoFormer: An Efficient Spike-Driven Video Transformer with Hamming Attention and $\mathcal{O}(T)$ ComplexityShihao Zou, Qingfeng Li, Wei Ji et al.
Spiking Neural Networks (SNNs) have shown competitive performance to Artificial Neural Networks (ANNs) in various vision tasks, while offering superior energy efficiency. However, existing SNN-based Transformers primarily focus on single-image tasks, emphasizing spatial features while not effectively leveraging SNNs' efficiency in video-based vision tasks. In this paper, we introduce SpikeVideoFormer, an efficient spike-driven video Transformer, featuring linear temporal complexity $\mathcal{O}(T)$. Specifically, we design a spike-driven Hamming attention (SDHA) which provides a theoretically guided adaptation from traditional real-valued attention to spike-driven attention. Building on SDHA, we further analyze various spike-driven space-time attention designs and identify an optimal scheme that delivers appealing performance for video tasks, while maintaining only linear temporal complexity. The generalization ability and efficiency of our model are demonstrated across diverse downstream video tasks, including classification, human pose tracking, and semantic segmentation. Empirical results show our method achieves state-of-the-art (SOTA) performance compared to existing SNN approaches, with over 15\% improvement on the latter two tasks. Additionally, it matches the performance of recent ANN-based methods while offering significant efficiency gains, achieving $\times 16$, $\times 10$ and $\times 5$ improvements on the three tasks. https://github.com/JimmyZou/SpikeVideoFormer
55.6CVMay 13
Towards Unified Surgical Scene Understanding:Bridging Reasoning and Grounding via MLLMsJincai Huang, Shihao Zou, Yuchen Guo et al.
Surgical scene understanding is a cornerstone of computer-assisted intervention. While recent advances, particularly in surgical image segmentation, have driven progress, real-world clinical applications require a more holistic understanding that jointly captures procedural context, semantic reasoning, and precise visual grounding. However, existing approaches typically address these components in isolation, leading to fragmented representations and limited semantic consistency. To address this limitation, we propose SurgMLLM, a unified surgical scene understanding framework that bridges high-level reasoning and low-level visual grounding within a single model. Given surgical videos, SurgMLLM fine-tunes a multimodal large language model (MLLM) to support structured interpretability reasoning, which is used to jointly model phases, instrument-verb-target (IVT) triplets, and triplet-entity segmentation tokens. These tokens are then temporally aggregated and serve as prompts for a segmentation network, enabling accurate pixel-wise grounding of triplet instruments and targets. The entire framework is trained end-to-end with a unified objective that couples language-based reasoning supervision with visual grounding losses, promoting coherent cross-task learning and clinically consistent scene representations. To facilitate unified evaluation, we introduce CholecT45-Scene, extending CholecT45 dataset with 64,299 frames of pixel-level mask annotations for instruments and targets, aligned with existing triplet labels. Extensive experiments show that SurgMLLM significantly advances surgical scene understanding, improving the primary triplet recognition metric AP_IVT from 40.7% to 46.0% and consistently outperforming prior methods in phase recognition and segmentation. These results highlight the effectiveness of unified reasoning-and-grounding for reliable, context-aware surgical assistance.
LGJan 23
Learning ORDER-Aware Multimodal Representations for Composite Materials DesignXinyao Li, Hangwei Qian, Jingjing Li et al.
Artificial intelligence (AI) has shown remarkable success in materials discovery and property prediction, particularly for crystalline and polymer systems where material properties and structures are dominated by discrete graph representations. Such graph-central paradigm breaks down on composite materials, which possess continuous and nonlinear design spaces that lack well-defined graph structures. General composite descriptors, e.g., fiber volume and misalignment angle, cannot fully capture the fiber distributions that fundamentally determine microstructural characteristics, necessitating the integration of heterogeneous data sources through multimodal learning. Existing alignment-oriented multimodal frameworks have proven effective on abundant crystal or polymer data under discrete, unique graph-property mapping assumptions, but fail to address the highly continuous composite design space under extreme data scarcity. In this work, we introduce ORDinal-aware imagE-tabulaR alignment (ORDER), a multimodal pretraining framework that establishes ordinality as a core principle for composite material representations. ORDER ensures that materials with similar target properties occupy nearby regions in the latent space, which effectively preserves the continuous nature of composite properties and enables meaningful interpolation between sparsely observed designs. We evaluate ORDER on a public Nanofiber-enforced composite dataset and an internally curated dataset that simulates the construction of carbon fiber T700 with diverse fiber distributions. ORDER achieves consistent improvements over state-of-the-art multimodal baselines across property prediction, cross-modal retrieval, and microstructure generation tasks.
18.2INS-DETMay 12
Machine Learning for neutron source distributionsJose Ignacio Robledo, Norberto Schmidt, Klaus Lieutenant et al.
In light of the recent advancements in machine learning, we propose a novel approach to neutron source distribution estimation through the utilisation of probabilistic generative models. The estimation is based on a Monte Carlo particle list, which is only required during the training stage of the machine learning model. Once the source distribution has been learned, the model is independent of the original particle list, allowing for further sampling in an efficient, rapid, and memory-costless manner. The performance of various generative models is evaluated, including a variational autoencoder, a normalizing flow, a generative adversarial network, and a denoising diffusion model. These approaches are then compared to existing source distribution estimations, and the advantages and disadvantages of each approach are discussed. The results demonstrate that source distributions can be modeled through the use of probabilistic generative models, which paves the way for further advancements in this field.
CLDec 13, 2023
A Survey of Text Watermarking in the Era of Large Language ModelsAiwei Liu, Leyi Pan, Yijian Lu et al. · tsinghua
Text watermarking algorithms are crucial for protecting the copyright of textual content. Historically, their capabilities and application scenarios were limited. However, recent advancements in large language models (LLMs) have revolutionized these techniques. LLMs not only enhance text watermarking algorithms with their advanced abilities but also create a need for employing these algorithms to protect their own copyrights or prevent potential misuse. This paper conducts a comprehensive survey of the current state of text watermarking technology, covering four main aspects: (1) an overview and comparison of different text watermarking techniques; (2) evaluation methods for text watermarking algorithms, including their detectability, impact on text or LLM quality, robustness under target or untargeted attacks; (3) potential application scenarios for text watermarking technology; (4) current challenges and future directions for text watermarking. This survey aims to provide researchers with a thorough understanding of text watermarking technology in the era of LLM, thereby promoting its further advancement.
CVOct 22, 2025Code
FootFormer: Estimating Stability from Visual InputKeaton Kraiger, Jingjing Li, Skanda Bharadwaj et al.
We propose FootFormer, a cross-modality approach for jointly predicting human motion dynamics directly from visual input. On multiple datasets, FootFormer achieves statistically significantly better or equivalent estimates of foot pressure distributions, foot contact maps, and center of mass (CoM), as compared with existing methods that generate one or two of those measures. Furthermore, FootFormer achieves SOTA performance in estimating stability-predictive components (CoP, CoM, BoS) used in classic kinesiology metrics. Code and data are available at https://github.com/keatonkraiger/Vision-to-Stability.git.
CVSep 23, 2025Code
Frequency-Domain Decomposition and Recomposition for Robust Audio-Visual SegmentationYunzhe Shen, Kai Peng, Leiye Liu et al.
Audio-visual segmentation (AVS) plays a critical role in multimodal machine learning by effectively integrating audio and visual cues to precisely segment objects or regions within visual scenes. Recent AVS methods have demonstrated significant improvements. However, they overlook the inherent frequency-domain contradictions between audio and visual modalities--the pervasively interfering noise in audio high-frequency signals vs. the structurally rich details in visual high-frequency signals. Ignoring these differences can result in suboptimal performance. In this paper, we rethink the AVS task from a deeper perspective by reformulating AVS task as a frequency-domain decomposition and recomposition problem. To this end, we introduce a novel Frequency-Aware Audio-Visual Segmentation (FAVS) framework consisting of two key modules: Frequency-Domain Enhanced Decomposer (FDED) module and Synergistic Cross-Modal Consistency (SCMC) module. FDED module employs a residual-based iterative frequency decomposition to discriminate modality-specific semantics and structural features, and SCMC module leverages a mixture-of-experts architecture to reinforce semantic consistency and modality-specific feature preservation through dynamic expert routing. Extensive experiments demonstrate that our FAVS framework achieves state-of-the-art performance on three benchmark datasets, and abundant qualitative visualizations further verify the effectiveness of the proposed FDED and SCMC modules. The code will be released as open source upon acceptance of the paper.
CVOct 13, 2021Code
Domain Adaptive Semantic Segmentation without Source DataFuming You, Jingjing Li, Lei Zhu et al.
Domain adaptive semantic segmentation is recognized as a promising technique to alleviate the domain shift between the labeled source domain and the unlabeled target domain in many real-world applications, such as automatic pilot. However, large amounts of source domain data often introduce significant costs in storage and training, and sometimes the source data is inaccessible due to privacy policies. To address these problems, we investigate domain adaptive semantic segmentation without source data, which assumes that the model is pre-trained on the source domain, and then adapting to the target domain without accessing source data anymore. Since there is no supervision from the source domain data, many self-training methods tend to fall into the ``winner-takes-all'' dilemma, where the {\it majority} classes totally dominate the segmentation networks and the networks fail to classify the {\it minority} classes. Consequently, we propose an effective framework for this challenging problem with two components: positive learning and negative learning. In positive learning, we select the class-balanced pseudo-labeled pixels with intra-class threshold, while in negative learning, for each pixel, we investigate which category the pixel does not belong to with the proposed heuristic complementary label selection. Notably, our framework can be easily implemented and incorporated with other methods to further enhance the performance. Extensive experiments on two widely-used synthetic-to-real benchmarks demonstrate our claims and the effectiveness of our framework, which outperforms the baseline with a large margin. Code is available at \url{https://github.com/fumyou13/LDBE}.
MLJul 14, 2021Code
Spectrum Gaussian Processes Based On Tunable Basis FunctionsWenqi Fang, Guanlin Wu, Jingjing Li et al.
Spectral approximation and variational inducing learning for the Gaussian process are two popular methods to reduce computational complexity. However, in previous research, those methods always tend to adopt the orthonormal basis functions, such as eigenvectors in the Hilbert space, in the spectrum method, or decoupled orthogonal components in the variational framework. In this paper, inspired by quantum physics, we introduce a novel basis function, which is tunable, local and bounded, to approximate the kernel function in the Gaussian process. There are two adjustable parameters in these functions, which control their orthogonality to each other and limit their boundedness. And we conduct extensive experiments on open-source datasets to testify its performance. Compared to several state-of-the-art methods, it turns out that the proposed method can obtain satisfactory or even better results, especially with poorly chosen kernel functions.
CVJul 7, 2021Code
Mitigating Generation Shifts for Generalized Zero-Shot LearningZhi Chen, Yadan Luo, Sen Wang et al.
Generalized Zero-Shot Learning (GZSL) is the task of leveraging semantic information (e.g., attributes) to recognize the seen and unseen samples, where unseen classes are not observable during training. It is natural to derive generative models and hallucinate training samples for unseen classes based on the knowledge learned from the seen samples. However, most of these models suffer from the `generation shifts', where the synthesized samples may drift from the real distribution of unseen data. In this paper, we conduct an in-depth analysis on this issue and propose a novel Generation Shifts Mitigating Flow (GSMFlow) framework, which is comprised of multiple conditional affine coupling layers for learning unseen data synthesis efficiently and effectively. In particular, we identify three potential problems that trigger the generation shifts, i.e., semantic inconsistency, variance decay, and structural permutation and address them respectively. First, to reinforce the correlations between the generated samples and the respective attributes, we explicitly embed the semantic information into the transformations in each of the coupling layers. Second, to recover the intrinsic variance of the synthesized unseen features, we introduce a visual perturbation strategy to diversify the intra-class variance of generated data and hereby help adjust the decision boundary of the classifier. Third, to avoid structural permutation in the semantic space, we propose a relative positioning strategy to manipulate the attribute embeddings, guiding which to fully preserve the inter-class geometric structure. Experimental results demonstrate that GSMFlow achieves state-of-the-art recognition performance in both conventional and generalized zero-shot settings. Our code is available at: https://github.com/uqzhichen/GSMFlow
CVJun 8, 2021Code
Cross-Domain Gradient Discrepancy Minimization for Unsupervised Domain AdaptationZhekai Du, Jingjing Li, Hongzu Su et al.
Unsupervised Domain Adaptation (UDA) aims to generalize the knowledge learned from a well-labeled source domain to an unlabeled target domain. Recently, adversarial domain adaptation with two distinct classifiers (bi-classifier) has been introduced into UDA which is effective to align distributions between different domains. Previous bi-classifier adversarial learning methods only focus on the similarity between the outputs of two distinct classifiers. However, the similarity of the outputs cannot guarantee the accuracy of target samples, i.e., target samples may match to wrong categories even if the discrepancy between two classifiers is small. To challenge this issue, in this paper, we propose a cross-domain gradient discrepancy minimization (CGDM) method which explicitly minimizes the discrepancy of gradients generated by source samples and target samples. Specifically, the gradient gives a cue for the semantic information of target samples so it can be used as a good supervision to improve the accuracy of target samples. In order to compute the gradient signal of target samples, we further obtain target pseudo labels through a clustering-based self-supervised learning. Extensive experiments on three widely used UDA datasets show that our method surpasses many previous state-of-the-arts. Codes are available at https://github.com/lijin118/CGDM.
CVJan 20, 2021Code
Semantics Disentangling for Generalized Zero-Shot LearningZhi Chen, Yadan Luo, Ruihong Qiu et al.
Generalized zero-shot learning (GZSL) aims to classify samples under the assumption that some classes are not observable during training. To bridge the gap between the seen and unseen classes, most GZSL methods attempt to associate the visual features of seen classes with attributes or to generate unseen samples directly. Nevertheless, the visual features used in the prior approaches do not necessarily encode semantically related information that the shared attributes refer to, which degrades the model generalization to unseen classes. To address this issue, in this paper, we propose a novel semantics disentangling framework for the generalized zero-shot learning task (SDGZSL), where the visual features of unseen classes are firstly estimated by a conditional VAE and then factorized into semantic-consistent and semantic-unrelated latent vectors. In particular, a total correlation penalty is applied to guarantee the independence between the two factorized representations, and the semantic consistency of which is measured by the derived relation network. Extensive experiments conducted on four GZSL benchmark datasets have evidenced that the semantic-consistent features disentangled by the proposed SDGZSL are more generalizable in tasks of canonical and generalized zero-shot learning. Our source code is available at https://github.com/uqzhichen/SDGZSL.
CLOct 5, 2020Code
Discern: Discourse-Aware Entailment Reasoning Network for Conversational Machine ReadingYifan Gao, Chien-Sheng Wu, Jingjing Li et al.
Document interpretation and dialog understanding are the two major challenges for conversational machine reading. In this work, we propose Discern, a discourse-aware entailment reasoning network to strengthen the connection and enhance the understanding for both document and dialog. Specifically, we split the document into clause-like elementary discourse units (EDU) using a pre-trained discourse segmentation model, and we train our model in a weakly-supervised manner to predict whether each EDU is entailed by the user feedback in a conversation. Based on the learned EDU and entailment representations, we either reply to the user our final decision "yes/no/irrelevant" of the initial question, or generate a follow-up question to inquiry more information. Our experiments on the ShARC benchmark (blind, held-out test set) show that Discern achieves state-of-the-art results of 78.3% macro-averaged accuracy on decision making and 64.0 BLEU1 on follow-up question generation. Code and models are released at https://github.com/Yifan-Gao/Discern.
IRJun 10, 2020Code
Dual-level Semantic Transfer Deep Hashing for Efficient Social Image RetrievalLei Zhu, Hui Cui, Zhiyong Cheng et al.
Social network stores and disseminates a tremendous amount of user shared images. Deep hashing is an efficient indexing technique to support large-scale social image retrieval, due to its deep representation capability, fast retrieval speed and low storage cost. Particularly, unsupervised deep hashing has well scalability as it does not require any manually labelled data for training. However, owing to the lacking of label guidance, existing methods suffer from severe semantic shortage when optimizing a large amount of deep neural network parameters. Differently, in this paper, we propose a Dual-level Semantic Transfer Deep Hashing (DSTDH) method to alleviate this problem with a unified deep hash learning framework. Our model targets at learning the semantically enhanced deep hash codes by specially exploiting the user-generated tags associated with the social images. Specifically, we design a complementary dual-level semantic transfer mechanism to efficiently discover the potential semantics of tags and seamlessly transfer them into binary hash codes. On the one hand, instance-level semantics are directly preserved into hash codes from the associated tags with adverse noise removing. Besides, an image-concept hypergraph is constructed for indirectly transferring the latent high-order semantic correlations of images and tags into hash codes. Moreover, the hash codes are obtained simultaneously with the deep representation learning by the discrete hash optimization strategy. Extensive experiments on two public social image retrieval datasets validate the superior performance of our method compared with state-of-the-art hashing methods. The source codes of our method can be obtained at https://github.com/research2020-1/DSTDH
AIMar 5, 2024
Domain-Agnostic Mutual Prompting for Unsupervised Domain AdaptationZhekai Du, Xinyao Li, Fengling Li et al.
Conventional Unsupervised Domain Adaptation (UDA) strives to minimize distribution discrepancy between domains, which neglects to harness rich semantics from data and struggles to handle complex domain shifts. A promising technique is to leverage the knowledge of large-scale pre-trained vision-language models for more guided adaptation. Despite some endeavors, current methods often learn textual prompts to embed domain semantics for source and target domains separately and perform classification within each domain, limiting cross-domain knowledge transfer. Moreover, prompting only the language branch lacks flexibility to adapt both modalities dynamically. To bridge this gap, we propose Domain-Agnostic Mutual Prompting (DAMP) to exploit domain-invariant semantics by mutually aligning visual and textual embeddings. Specifically, the image contextual information is utilized to prompt the language branch in a domain-agnostic and instance-conditioned way. Meanwhile, visual prompts are imposed based on the domain-agnostic textual prompt to elicit domain-invariant visual embeddings. These two branches of prompts are learned mutually with a cross-attention module and regularized with a semantic-consistency loss and an instance-discrimination contrastive loss. Experiments on three UDA benchmarks demonstrate the superiority of DAMP over state-of-the-art approaches.
CVDec 24, 2025
Surgical Scene Segmentation using a Spike-Driven Video Transformer with Real-Time PotentialShihao Zou, Jingjing Li, Wei Ji et al.
Modern surgical systems increasingly rely on intelligent scene understanding to provide timely situational awareness for enhanced intra-operative safety. Within this pipeline, surgical scene segmentation plays a central role in accurately perceiving operative events. Although recent deep learning models, particularly large-scale foundation models, achieve remarkable segmentation accuracy, their substantial computational demands and power consumption hinder real-time deployment in resource-constrained surgical environments. To address this limitation, we explore the emerging SNN as a promising paradigm for highly efficient surgical intelligence. However, their performance is still constrained by the scarcity of labeled surgical data and the inherently sparse nature of surgical video representations. To this end, we propose \textit{SpikeSurgSeg}, the first spike-driven video Transformer framework tailored for surgical scene segmentation with real-time potential on non-GPU platforms. To address the limited availability of surgical annotations, we introduce a surgical-scene masked autoencoding pretraining strategy for SNNs that enables robust spatiotemporal representation learning via layer-wise tube masking. Building on this pretrained backbone, we further adopt a lightweight spike-driven segmentation head that produces temporally consistent predictions while preserving the low-latency characteristics of SNNs. Extensive experiments on EndoVis18 and our in-house SurgBleed dataset demonstrate that SpikeSurgSeg achieves mIoU comparable to SOTA ANN-based models while reducing inference latency by at least $8\times$. Notably, it delivers over $20\times$ acceleration relative to most foundation-model baselines, underscoring its potential for time-critical surgical scene segmentation.