CLJun 14, 2023Code
WizardCoder: Empowering Code Large Language Models with Evol-InstructZiyang Luo, Can Xu, Pu Zhao et al. · microsoft-research
Code Large Language Models (Code LLMs), such as StarCoder, have demonstrated exceptional performance in code-related tasks. However, most existing models are solely pre-trained on extensive raw code data without instruction fine-tuning. In this paper, we introduce WizardCoder, which empowers Code LLMs with complex instruction fine-tuning, by adapting the Evol-Instruct method to the domain of code. Through comprehensive experiments on four prominent code generation benchmarks, namely HumanEval, HumanEval+, MBPP, and DS-1000, we unveil the exceptional capabilities of our model. It surpasses all other open-source Code LLMs by a substantial margin. Moreover, our model even outperforms the largest closed LLMs, Anthropic's Claude and Google's Bard, on HumanEval and HumanEval+. Our code, model weights, and data are public at https://github.com/nlpxucan/WizardLM
IRAug 18, 2023Code
Graph-based Alignment and Uniformity for RecommendationLiangwei Yang, Zhiwei Liu, Chen Wang et al. · salesforce
Collaborative filtering-based recommender systems (RecSys) rely on learning representations for users and items to predict preferences accurately. Representation learning on the hypersphere is a promising approach due to its desirable properties, such as alignment and uniformity. However, the sparsity issue arises when it encounters RecSys. To address this issue, we propose a novel approach, graph-based alignment and uniformity (GraphAU), that explicitly considers high-order connectivities in the user-item bipartite graph. GraphAU aligns the user/item embedding to the dense vector representations of high-order neighbors using a neighborhood aggregator, eliminating the need to compute the burdensome alignment to high-order neighborhoods individually. To address the discrepancy in alignment losses, GraphAU includes a layer-wise alignment pooling module to integrate alignment losses layer-wise. Experiments on four datasets show that GraphAU significantly alleviates the sparsity issue and achieves state-of-the-art performance. We open-source GraphAU at https://github.com/YangLiangwei/GraphAU.
LGNov 25, 2022Code
Interpreting Unfairness in Graph Neural Networks via Training Node AttributionYushun Dong, Song Wang, Jing Ma et al.
Graph Neural Networks (GNNs) have emerged as the leading paradigm for solving graph analytical problems in various real-world applications. Nevertheless, GNNs could potentially render biased predictions towards certain demographic subgroups. Understanding how the bias in predictions arises is critical, as it guides the design of GNN debiasing mechanisms. However, most existing works overwhelmingly focus on GNN debiasing, but fall short on explaining how such bias is induced. In this paper, we study a novel problem of interpreting GNN unfairness through attributing it to the influence of training nodes. Specifically, we propose a novel strategy named Probabilistic Distribution Disparity (PDD) to measure the bias exhibited in GNNs, and develop an algorithm to efficiently estimate the influence of each training node on such bias. We verify the validity of PDD and the effectiveness of influence estimation through experiments on real-world datasets. Finally, we also demonstrate how the proposed framework could be used for debiasing GNNs. Open-source code can be found at https://github.com/yushundong/BIND.
CLFeb 16, 2023Code
NUAA-QMUL-AIIT at Memotion 3: Multi-modal Fusion with Squeeze-and-Excitation for Internet Meme Emotion AnalysisXiaoyu Guo, Jing Ma, Arkaitz Zubiaga
This paper describes the participation of our NUAA-QMUL-AIIT team in the Memotion 3 shared task on meme emotion analysis. We propose a novel multi-modal fusion method, Squeeze-and-Excitation Fusion (SEFusion), and embed it into our system for emotion classification in memes. SEFusion is a simple fusion method that employs fully connected layers, reshaping, and matrix multiplication. SEFusion learns a weight for each modality and then applies it to its own modality feature. We evaluate the performance of our system on the three Memotion 3 sub-tasks. Among all participating systems in this Memotion 3 shared task, our system ranked first on task A, fifth on task B, and second on task C. Our proposed SEFusion provides the flexibility to fuse any features from different modalities. The source code for our method is published on https://github.com/xxxxxxxxy/memotion3-SEFusion.
SEAug 20, 2024Code
CodeJudge-Eval: Can Large Language Models be Good Judges in Code Understanding?Yuwei Zhao, Ziyang Luo, Yuchen Tian et al.
Recent advancements in large language models (LLMs) have showcased impressive code generation capabilities, primarily evaluated through language-to-code benchmarks. However, these benchmarks may not fully capture a model's code understanding abilities. We introduce CodeJudge-Eval (CJ-Eval), a novel benchmark designed to assess LLMs' code understanding abilities from the perspective of code judging rather than code generation. CJ-Eval challenges models to determine the correctness of provided code solutions, encompassing various error types and compilation issues. By leveraging a diverse set of problems and a fine-grained judging system, CJ-Eval addresses the limitations of traditional benchmarks, including the potential memorization of solutions. Evaluation of 12 well-known LLMs on CJ-Eval reveals that even state-of-the-art models struggle, highlighting the benchmark's ability to probe deeper into models' code understanding abilities. Our codes and benchmark are available at \url{https://github.com/CodeLLM-Research/CodeJudge-Eval}.
CVFeb 6, 2023
LexLIP: Lexicon-Bottlenecked Language-Image Pre-Training for Large-Scale Image-Text RetrievalZiyang luo, Pu Zhao, Can Xu et al. · microsoft-research
Image-text retrieval (ITR) is a task to retrieve the relevant images/texts, given the query from another modality. The conventional dense retrieval paradigm relies on encoding images and texts into dense representations using dual-stream encoders, however, it faces challenges with low retrieval speed in large-scale retrieval scenarios. In this work, we propose the lexicon-weighting paradigm, where sparse representations in vocabulary space are learned for images and texts to take advantage of the bag-of-words models and efficient inverted indexes, resulting in significantly reduced retrieval latency. A crucial gap arises from the continuous nature of image data, and the requirement for a sparse vocabulary space representation. To bridge this gap, we introduce a novel pre-training framework, Lexicon-Bottlenecked Language-Image Pre-Training (LexLIP), that learns importance-aware lexicon representations. This framework features lexicon-bottlenecked modules between the dual-stream encoders and weakened text decoders, allowing for constructing continuous bag-of-words bottlenecks to learn lexicon-importance distributions. Upon pre-training with same-scale data, our LexLIP achieves state-of-the-art performance on two benchmark ITR datasets, MSCOCO and Flickr30k. Furthermore, in large-scale retrieval scenarios, LexLIP outperforms CLIP with a 5.5 ~ 221.3X faster retrieval speed and 13.2 ~ 48.8X less index storage memory.
LGJul 7, 2022
Learning Causal Effects on HypergraphsJing Ma, Mengting Wan, Longqi Yang et al.
Hypergraphs provide an effective abstraction for modeling multi-way group interactions among nodes, where each hyperedge can connect any number of nodes. Different from most existing studies which leverage statistical dependencies, we study hypergraphs from the perspective of causality. Specifically, in this paper, we focus on the problem of individual treatment effect (ITE) estimation on hypergraphs, aiming to estimate how much an intervention (e.g., wearing face covering) would causally affect an outcome (e.g., COVID-19 infection) of each individual node. Existing works on ITE estimation either assume that the outcome on one individual should not be influenced by the treatment assignments on other individuals (i.e., no interference), or assume the interference only exists between pairs of connected individuals in an ordinary graph. We argue that these assumptions can be unrealistic on real-world hypergraphs, where higher-order interference can affect the ultimate ITE estimations due to the presence of group interactions. In this work, we investigate high-order interference modeling, and propose a new causality learning framework powered by hypergraph neural networks. Extensive experiments on real-world hypergraphs verify the superiority of our framework over existing baselines.
68.3AIMay 31
Diagnosing LLM Arbitration Behavior over Pre-evidence Epistemic States in RAG-based Fact-CheckingYuxi Sun, Wenbo Shang, Wei Gao et al.
In RAG-based fact-checking, LLMs are increasingly used as verifiers to check given claims against retrieved evidence. Their parametric knowledge can induce pre-evidence tendencies that may conflict with the retrieved context, yet existing evaluation frameworks do not characterize such prior-context discrepancy or measure how verifiers arbitrate between parametric and contextual signals. We introduce \textsc{PAVE} (\emph{Prior-Aware Verifier Evaluation}), a diagnostic testbed that stratifies an LLM verifier into four epistemic states based on the correctness and confidence of its pre-evidence prior and evaluates its arbitration behavior on this new benchmark, i.e., whether it persists in correct prior under misleading evidence, and whether it corrects wrong prior when accurate evidence is provided. Experiments across seven LLMs reveal unreliable and highly model-dependent prior-context arbitration, highlighting the importance of verifier selection for real-world RAG-based fact-checking applications. Based on these findings, we propose a lightweight JSD-based test-time arbitration method that improves factual reliability without modifying the underlying model, achieving competitive performance across diverse LLM families.
CLApr 19, 2022
DecBERT: Enhancing the Language Understanding of BERT with Causal Attention MasksZiyang Luo, Yadong Xi, Jing Ma et al. · bytedance
Since 2017, the Transformer-based models play critical roles in various downstream Natural Language Processing tasks. However, a common limitation of the attention mechanism utilized in Transformer Encoder is that it cannot automatically capture the information of word order, so explicit position embeddings are generally required to be fed into the target model. In contrast, Transformer Decoder with the causal attention masks is naturally sensitive to the word order. In this work, we focus on improving the position encoding ability of BERT with the causal attention masks. Furthermore, we propose a new pre-trained language model DecBERT and evaluate it on the GLUE benchmark. Experimental results show that (1) the causal attention mask is effective for BERT on the language understanding tasks; (2) our DecBERT model without position embeddings achieve comparable performance on the GLUE benchmark; and (3) our modification accelerates the pre-training process and DecBERT w/ PE achieves better overall performance than the baseline systems when pre-training with the same amount of computational resources.
AIJun 7, 2023
A Review on Knowledge Graphs for Healthcare: Resources, Applications, and PromisesHejie Cui, Jiaying Lu, Ran Xu et al.
This comprehensive review aims to provide an overview of the current state of Healthcare Knowledge Graphs (HKGs), including their construction, utilization models, and applications across various healthcare and biomedical research domains. We thoroughly analyzed existing literature on HKGs, covering their construction methodologies, utilization techniques, and applications in basic science research, pharmaceutical research and development, clinical decision support, and public health. The review encompasses both model-free and model-based utilization approaches and the integration of HKGs with large language models (LLMs). We searched Google Scholar for relevant papers on HKGs and classified them into the following topics: HKG construction, HKG utilization, and their downstream applications in various domains. We also discussed their special challenges and the promise for future work. The review highlights the potential of HKGs to significantly impact biomedical research and clinical practice by integrating vast amounts of biomedical knowledge from multiple domains. The synergy between HKGs and LLMs offers promising opportunities for constructing more comprehensive knowledge graphs and improving the accuracy of healthcare applications. HKGs have emerged as a powerful tool for structuring medical knowledge, with broad applications across biomedical research, clinical decision-making, and public health. This survey serves as a roadmap for future research and development in the field of HKGs, highlighting the potential of combining knowledge graphs with advanced machine learning models for healthcare transformation.
LGApr 21, 2022
Fairness in Graph Mining: A SurveyYushun Dong, Jing Ma, Song Wang et al.
Graph mining algorithms have been playing a significant role in myriad fields over the years. However, despite their promising performance on various graph analytical tasks, most of these algorithms lack fairness considerations. As a consequence, they could lead to discrimination towards certain populations when exploited in human-centered applications. Recently, algorithmic fairness has been extensively studied in graph-based applications. In contrast to algorithmic fairness on independent and identically distributed (i.i.d.) data, fairness in graph mining has exclusive backgrounds, taxonomies, and fulfilling techniques. In this survey, we provide a comprehensive and up-to-date introduction of existing literature under the context of fair graph mining. Specifically, we propose a novel taxonomy of fairness notions on graphs, which sheds light on their connections and differences. We further present an organized summary of existing techniques that promote fairness in graph mining. Finally, we summarize the widely used datasets in this emerging research field and provide insights on current research challenges and open questions, aiming at encouraging cross-breeding ideas and further advances.
73.5CLApr 15Code
Cracking the Code of Juxtaposition: Can AI Models Understand the Humorous ContradictionsZhe Hu, Tuo Liang, Jing Li et al.
Recent advancements in large multimodal language models have demonstrated remarkable proficiency across a wide range of tasks. Yet, these models still struggle with understanding the nuances of human humor through juxtaposition, particularly when it involves nonlinear narratives that underpin many jokes and humor cues. This paper investigates this challenge by focusing on comics with contradictory narratives, where each comic consists of two panels that create a humorous contradiction. We introduce the YesBut benchmark, which comprises tasks of varying difficulty aimed at assessing AI's capabilities in recognizing and interpreting these comics, ranging from literal content comprehension to deep narrative reasoning. Through extensive experimentation and analysis of recent commercial or open-sourced large (vision) language models, we assess their capability to comprehend the complex interplay of the narrative humor inherent in these comics. Our results show that even state-of-the-art models still lag behind human performance on this task. Our findings offer insights into the current limitations and potential improvements for AI in understanding human creative expressions.
LGOct 16, 2022
CLEAR: Generative Counterfactual Explanations on GraphsJing Ma, Ruocheng Guo, Saumitra Mishra et al.
Counterfactual explanations promote explainability in machine learning models by answering the question "how should an input instance be perturbed to obtain a desired predicted label?". The comparison of this instance before and after perturbation can enhance human interpretation. Most existing studies on counterfactual explanations are limited in tabular data or image data. In this work, we study the problem of counterfactual explanation generation on graphs. A few studies have explored counterfactual explanations on graphs, but many challenges of this problem are still not well-addressed: 1) optimizing in the discrete and disorganized space of graphs; 2) generalizing on unseen graphs; and 3) maintaining the causality in the generated counterfactuals without prior knowledge of the causal model. To tackle these challenges, we propose a novel framework CLEAR which aims to generate counterfactual explanations on graphs for graph-level prediction models. Specifically, CLEAR leverages a graph variational autoencoder based mechanism to facilitate its optimization and generalization, and promotes causality by leveraging an auxiliary variable to better identify the underlying causal model. Extensive experiments on both synthetic and real-world graphs validate the superiority of CLEAR over the state-of-the-art methods in different aspects.
LGOct 10, 2022
DPAR: Decoupled Graph Neural Networks with Node-Level Differential PrivacyQiuchen Zhang, Hong kyu Lee, Jing Ma et al.
Graph Neural Networks (GNNs) have achieved great success in learning with graph-structured data. Privacy concerns have also been raised for the trained models which could expose the sensitive information of graphs including both node features and the structure information. In this paper, we aim to achieve node-level differential privacy (DP) for training GNNs so that a node and its edges are protected. Node DP is inherently difficult for GNNs because all direct and multi-hop neighbors participate in the calculation of gradients for each node via layer-wise message passing and there is no bound on how many direct and multi-hop neighbors a node can have, so existing DP methods will result in high privacy cost or poor utility due to high node sensitivity. We propose a Decoupled GNN with Differentially Private Approximate Personalized PageRank (DPAR) for training GNNs with an enhanced privacy-utility tradeoff. The key idea is to decouple the feature projection and message passing via a DP PageRank algorithm which learns the structure information and uses the top-$K$ neighbors determined by the PageRank for feature aggregation. By capturing the most important neighbors for each node and avoiding the layer-wise message passing, it bounds the node sensitivity and achieves improved privacy-utility tradeoff compared to layer-wise perturbation based methods. We theoretically analyze the node DP guarantee for the two processes combined together and empirically demonstrate better utilities of DPAR with the same level of node DP compared with state-of-the-art methods.
CLDec 2, 2022
Zero-Shot Rumor Detection with Propagation Structure via Prompt LearningHongzhan Lin, Pengyao Yi, Jing Ma et al.
The spread of rumors along with breaking events seriously hinders the truth in the era of social media. Previous studies reveal that due to the lack of annotated resources, rumors presented in minority languages are hard to be detected. Furthermore, the unforeseen breaking events not involved in yesterday's news exacerbate the scarcity of data resources. In this work, we propose a novel zero-shot framework based on prompt learning to detect rumors falling in different domains or presented in different languages. More specifically, we firstly represent rumor circulated on social media as diverse propagation threads, then design a hierarchical prompt encoding mechanism to learn language-agnostic contextual representations for both prompts and rumor data. To further enhance domain adaptation, we model the domain-invariant structural features from the propagation threads, to incorporate structural position representations of influential community response. In addition, a new virtual response augmentation method is used to improve model training. Extensive experiments conducted on three real-world datasets demonstrate that our proposed model achieves much better performance than state-of-the-art methods and exhibits a superior capacity for detecting rumors at early stages.
CLApr 6, 2022
A Weakly Supervised Propagation Model for Rumor Verification and Stance Detection with Multiple Instance LearningRuichao Yang, Jing Ma, Hongzhan Lin et al.
The diffusion of rumors on microblogs generally follows a propagation tree structure, that provides valuable clues on how an original message is transmitted and responded by users over time. Recent studies reveal that rumor detection and stance detection are two different but relevant tasks which can jointly enhance each other, e.g., rumors can be debunked by cross-checking the stances conveyed by their relevant microblog posts, and stances are also conditioned on the nature of the rumor. However, most stance detection methods require enormous post-level stance labels for training, which are labor-intensive given a large number of posts. Enlightened by Multiple Instance Learning (MIL) scheme, we first represent the diffusion of claims with bottom-up and top-down trees, then propose two tree-structured weakly supervised frameworks to jointly classify rumors and stances, where only the bag-level labels concerning claim's veracity are needed. Specifically, we convert the multi-class problem into a multiple MIL-based binary classification problem where each binary model focuses on differentiating a target stance or rumor type and other types. Finally, we propose a hierarchical attention mechanism to aggregate the binary predictions, including (1) a bottom-up or top-down tree attention layer to aggregate binary stances into binary veracity; and (2) a discriminative attention layer to aggregate the binary class into finer-grained classes. Extensive experiments conducted on three Twitter-based datasets demonstrate promising performance of our model on both claim-level rumor detection and post-level stance classification compared with state-of-the-art methods.
CLNov 12, 2025Code
MM-CRITIC: A Holistic Evaluation of Large Multimodal Models as Multimodal CritiqueGailun Zeng, Ziyang Luo, Hongzhan Lin et al.
The ability of critique is vital for models to self-improve and serve as reliable AI assistants. While extensively studied in language-only settings, multimodal critique of Large Multimodal Models (LMMs) remains underexplored despite their growing capabilities in tasks like captioning and visual reasoning. In this work, we introduce MM-CRITIC, a holistic benchmark for evaluating the critique ability of LMMs across multiple dimensions: basic, correction, and comparison. Covering 8 main task types and over 500 tasks, MM-CRITIC collects responses from various LMMs with different model sizes and is composed of 4471 samples. To enhance the evaluation reliability, we integrate expert-informed ground answers into scoring rubrics that guide GPT-4o in annotating responses and generating reference critiques, which serve as anchors for trustworthy judgments. Extensive experiments validate the effectiveness of MM-CRITIC and provide a comprehensive assessment of leading LMMs' critique capabilities under multiple dimensions. Further analysis reveals some key insights, including the correlation between response quality and critique, and varying critique difficulty across evaluation dimensions. Our code is available at https://github.com/MichealZeng0420/MM-Critic.
CLOct 31, 2025Code
MemeArena: Automating Context-Aware Unbiased Evaluation of Harmfulness Understanding for Multimodal Large Language ModelsZixin Chen, Hongzhan Lin, Kaixin Li et al.
The proliferation of memes on social media necessitates the capabilities of multimodal Large Language Models (mLLMs) to effectively understand multimodal harmfulness. Existing evaluation approaches predominantly focus on mLLMs' detection accuracy for binary classification tasks, which often fail to reflect the in-depth interpretive nuance of harmfulness across diverse contexts. In this paper, we propose MemeArena, an agent-based arena-style evaluation framework that provides a context-aware and unbiased assessment for mLLMs' understanding of multimodal harmfulness. Specifically, MemeArena simulates diverse interpretive contexts to formulate evaluation tasks that elicit perspective-specific analyses from mLLMs. By integrating varied viewpoints and reaching consensus among evaluators, it enables fair and unbiased comparisons of mLLMs' abilities to interpret multimodal harmfulness. Extensive experiments demonstrate that our framework effectively reduces the evaluation biases of judge agents, with judgment results closely aligning with human preferences, offering valuable insights into reliable and comprehensive mLLM evaluations in multimodal harmfulness understanding. Our code and data are publicly available at https://github.com/Lbotirx/MemeArena.
IRApr 24, 2022
Empowering Next POI Recommendation with Multi-Relational ModelingZheng Huang, Jing Ma, Yushun Dong et al.
With the wide adoption of mobile devices and web applications, location-based social networks (LBSNs) offer large-scale individual-level location-related activities and experiences. Next point-of-interest (POI) recommendation is one of the most important tasks in LBSNs, aiming to make personalized recommendations of next suitable locations to users by discovering preferences from users' historical activities. Noticeably, LBSNs have offered unparalleled access to abundant heterogeneous relational information about users and POIs (including user-user social relations, such as families or colleagues; and user-POI visiting relations). Such relational information holds great potential to facilitate the next POI recommendation. However, most existing methods either focus on merely the user-POI visits, or handle different relations based on over-simplified assumptions while neglecting relational heterogeneities. To fill these critical voids, we propose a novel framework, MEMO, which effectively utilizes the heterogeneous relations with a multi-network representation learning module, and explicitly incorporates the inter-temporal user-POI mutual influence with the coupled recurrent neural networks. Extensive experiments on real-world LBSN data validate the superiority of our framework over the state-of-the-art next POI recommendation methods.
CLApr 18, 2022
Detect Rumors in Microblog Posts for Low-Resource Domains via Adversarial Contrastive LearningHongzhan Lin, Jing Ma, Liangliang Chen et al.
Massive false rumors emerging along with breaking news or trending topics severely hinder the truth. Existing rumor detection approaches achieve promising performance on the yesterday's news, since there is enough corpus collected from the same domain for model training. However, they are poor at detecting rumors about unforeseen events especially those propagated in different languages due to the lack of training data and prior knowledge (i.e., low-resource regimes). In this paper, we propose an adversarial contrastive learning framework to detect rumors by adapting the features learned from well-resourced rumor data to that of the low-resourced. Our model explicitly overcomes the restriction of domain and/or language usage via language alignment and a novel supervised contrastive training paradigm. Moreover, we develop an adversarial augmentation mechanism to further enhance the robustness of low-resource rumor representation. Extensive experiments conducted on two low-resource datasets collected from real-world microblog platforms demonstrate that our framework achieves much better performance than state-of-the-art methods and exhibits a superior capacity for detecting rumors at early stages.
LGJul 17, 2023
Learning for Counterfactual Fairness from Observational DataJing Ma, Ruocheng Guo, Aidong Zhang et al.
Fairness-aware machine learning has attracted a surge of attention in many domains, such as online advertising, personalized recommendation, and social media analysis in web applications. Fairness-aware machine learning aims to eliminate biases of learning models against certain subgroups described by certain protected (sensitive) attributes such as race, gender, and age. Among many existing fairness notions, counterfactual fairness is a popular notion defined from a causal perspective. It measures the fairness of a predictor by comparing the prediction of each individual in the original world and that in the counterfactual worlds in which the value of the sensitive attribute is modified. A prerequisite for existing methods to achieve counterfactual fairness is the prior human knowledge of the causal model for the data. However, in real-world scenarios, the underlying causal model is often unknown, and acquiring such human knowledge could be very difficult. In these scenarios, it is risky to directly trust the causal models obtained from information sources with unknown reliability and even causal discovery methods, as incorrect causal models can consequently bring biases to the predictor and lead to unfair predictions. In this work, we address the problem of counterfactually fair prediction from observational data without given causal models by proposing a novel framework CLAIRE. Specifically, under certain general assumptions, CLAIRE effectively mitigates the biases from the sensitive attribute with a representation learning framework based on counterfactual data augmentation and an invariant penalty. Experiments conducted on both synthetic and real-world datasets validate the superiority of CLAIRE in both counterfactual fairness and prediction performance.
85.8CVMar 24
AdvSplat: Adversarial Attacks on Feed-Forward Gaussian Splatting ModelsYiran Qiao, Yiren Lu, Yunlai Zhou et al.
3D Gaussian Splatting (3DGS) is increasingly recognized as a powerful paradigm for real-time, high-fidelity 3D reconstruction. However, its per-scene optimization pipeline limits scalability and generalization, and prevents efficient inference. Recently emerged feed-forward 3DGS models address these limitations by enabling fast reconstruction from a few input views after large-scale pretraining, without scene-specific optimization. Despite their advantages and strong potential for commercial deployment, the use of neural networks as the backbone also amplifies the risk of adversarial manipulation. In this paper, we introduce AdvSplat, the first systematic study of adversarial attacks on feed-forward 3DGS. We first employ white-box attacks to reveal fundamental vulnerabilities of this model family. We then develop two improved, practically relevant, query-efficient black-box algorithms that optimize pixel-space perturbations via a frequency-domain parameterization: one based on gradient estimation and the other gradient-free, without requiring any access to model internals. Extensive experiments across multiple datasets demonstrate that AdvSplat can significantly disrupt reconstruction results by injecting imperceptible perturbations into the input images. Our findings surface an overlooked yet urgent problem in this domain, and we hope to draw the community's attention to this emerging security and robustness challenge.
LGJul 17, 2023
A Look into Causal Effects under Entangled Treatment in Graphs: Investigating the Impact of Contact on MRSA InfectionJing Ma, Chen Chen, Anil Vullikanti et al.
Methicillin-resistant Staphylococcus aureus (MRSA) is a type of bacteria resistant to certain antibiotics, making it difficult to prevent MRSA infections. Among decades of efforts to conquer infectious diseases caused by MRSA, many studies have been proposed to estimate the causal effects of close contact (treatment) on MRSA infection (outcome) from observational data. In this problem, the treatment assignment mechanism plays a key role as it determines the patterns of missing counterfactuals -- the fundamental challenge of causal effect estimation. Most existing observational studies for causal effect learning assume that the treatment is assigned individually for each unit. However, on many occasions, the treatments are pairwisely assigned for units that are connected in graphs, i.e., the treatments of different units are entangled. Neglecting the entangled treatments can impede the causal effect estimation. In this paper, we study the problem of causal effect estimation with treatment entangled in a graph. Despite a few explorations for entangled treatments, this problem still remains challenging due to the following challenges: (1) the entanglement brings difficulties in modeling and leveraging the unknown treatment assignment mechanism; (2) there may exist hidden confounders which lead to confounding biases in causal effect estimation; (3) the observational data is often time-varying. To tackle these challenges, we propose a novel method NEAT, which explicitly leverages the graph structure to model the treatment assignment mechanism, and mitigates confounding biases based on the treatment assignment modeling. We also extend our method into a dynamic setting to handle time-varying observational data. Experiments on both synthetic datasets and a real-world MRSA dataset validate the effectiveness of the proposed method, and provide insights for future applications.
CVFeb 16, 2023
Cluster-based Deep Ensemble Learning for Emotion Classification in Internet MemesXiaoyu Guo, Jing Ma, Arkaitz Zubiaga
Memes have gained popularity as a means to share visual ideas through the Internet and social media by mixing text, images and videos, often for humorous purposes. Research enabling automated analysis of memes has gained attention in recent years, including among others the task of classifying the emotion expressed in memes. In this paper, we propose a novel model, cluster-based deep ensemble learning (CDEL), for emotion classification in memes. CDEL is a hybrid model that leverages the benefits of a deep learning model in combination with a clustering algorithm, which enhances the model with additional information after clustering memes with similar facial features. We evaluate the performance of CDEL on a benchmark dataset for emotion classification, proving its effectiveness by outperforming a wide range of baseline models and achieving state-of-the-art performance. Further evaluation through ablated models demonstrates the effectiveness of the different components of CDEL.
CLSep 29, 2022
A Coarse-to-fine Cascaded Evidence-Distillation Neural Network for Explainable Fake News DetectionZhiwei Yang, Jing Ma, Hechang Chen et al.
Existing fake news detection methods aim to classify a piece of news as true or false and provide veracity explanations, achieving remarkable performances. However, they often tailor automated solutions on manual fact-checked reports, suffering from limited news coverage and debunking delays. When a piece of news has not yet been fact-checked or debunked, certain amounts of relevant raw reports are usually disseminated on various media outlets, containing the wisdom of crowds to verify the news claim and explain its verdict. In this paper, we propose a novel Coarse-to-fine Cascaded Evidence-Distillation (CofCED) neural network for explainable fake news detection based on such raw reports, alleviating the dependency on fact-checked ones. Specifically, we first utilize a hierarchical encoder for web text representation, and then develop two cascaded selectors to select the most explainable sentences for verdicts on top of the selected top-K reports in a coarse-to-fine manner. Besides, we construct two explainable fake news datasets, which are publicly available. Experimental results demonstrate that our model significantly outperforms state-of-the-art baselines and generates high-quality explanations from diverse evaluation perspectives.
LGAug 28, 2023
Fair Few-shot Learning with Auxiliary SetsSong Wang, Jing Ma, Lu Cheng et al.
Recently, there has been a growing interest in developing machine learning (ML) models that can promote fairness, i.e., eliminating biased predictions towards certain populations (e.g., individuals from a specific demographic group). Most existing works learn such models based on well-designed fairness constraints in optimization. Nevertheless, in many practical ML tasks, only very few labeled data samples can be collected, which can lead to inferior fairness performance. This is because existing fairness constraints are designed to restrict the prediction disparity among different sensitive groups, but with few samples, it becomes difficult to accurately measure the disparity, thus rendering ineffective fairness optimization. In this paper, we define the fairness-aware learning task with limited training samples as the \emph{fair few-shot learning} problem. To deal with this problem, we devise a novel framework that accumulates fairness-aware knowledge across different meta-training tasks and then generalizes the learned knowledge to meta-test tasks. To compensate for insufficient training samples, we propose an essential strategy to select and leverage an auxiliary set for each meta-test task. These auxiliary sets contain several labeled training samples that can enhance the model performance regarding fairness in meta-test tasks, thereby allowing for the transfer of learned useful fairness-oriented knowledge to meta-test tasks. Furthermore, we conduct extensive experiments on three real-world datasets to validate the superiority of our framework against the state-of-the-art baselines.
88.3LGMar 11Code
Ranking Reasoning LLMs under Test-Time ScalingMohsen Hariri, Michael Hinczewski, Jing Ma et al.
Test-time scaling evaluates reasoning LLMs by sampling multiple outputs per prompt, but ranking models in this regime remains underexplored. We formalize dense benchmark ranking under test-time scaling and introduce Scorio, a library that implements statistical ranking methods such as paired-comparison models, item response theory (IRT) models, voting rules, and graph- and spectral-based methods. Across $20$ reasoning models on four Olympiad-style math benchmarks (AIME'24, AIME'25, HMMT'25, and BrUMO'25; up to $N=80$ trials), most full-trial rankings agree closely with the Bayesian gold standard $\mathrm{Bayes}_{\mathcal{U}}@80$ (mean Kendall's $τ_b = 0.93$--$0.95$), and $19$--$34$ methods recover exactly the same ordering. In the single-trial regime, the best methods reach $τ_b \approx 0.86$. Using greedy decoding as an empirical prior ($\mathrm{Bayes}_{\mathbf{R}_0}@N$) reduces variance at $N=1$ by $16$--$52\%$, but can bias rankings when greedy and stochastic sampling disagree. These results identify reliable ranking methods for both high- and low-budget test-time scaling. We release Scorio as an open-source library at https://github.com/mohsenhariri/scorio.
CLOct 25, 2023
WSDMS: Debunk Fake News via Weakly Supervised Detection of Misinforming Sentences with Contextualized Social WisdomRuichao Yang, Wei Gao, Jing Ma et al.
In recent years, we witness the explosion of false and unconfirmed information (i.e., rumors) that went viral on social media and shocked the public. Rumors can trigger versatile, mostly controversial stance expressions among social media users. Rumor verification and stance detection are different yet relevant tasks. Fake news debunking primarily focuses on determining the truthfulness of news articles, which oversimplifies the issue as fake news often combines elements of both truth and falsehood. Thus, it becomes crucial to identify specific instances of misinformation within the articles. In this research, we investigate a novel task in the field of fake news debunking, which involves detecting sentence-level misinformation. One of the major challenges in this task is the absence of a training dataset with sentence-level annotations regarding veracity. Inspired by the Multiple Instance Learning (MIL) approach, we propose a model called Weakly Supervised Detection of Misinforming Sentences (WSDMS). This model only requires bag-level labels for training but is capable of inferring both sentence-level misinformation and article-level veracity, aided by relevant social media conversations that are attentively contextualized with news sentences. We evaluate WSDMS on three real-world benchmarks and demonstrate that it outperforms existing state-of-the-art baselines in debunking fake news at both the sentence and article levels.
LGAug 28, 2024Code
Certified Causal Defense with Generalizable RobustnessYiran Qiao, Yu Yin, Chen Chen et al.
While machine learning models have proven effective across various scenarios, it is widely acknowledged that many models are vulnerable to adversarial attacks. Recently, there have emerged numerous efforts in adversarial defense. Among them, certified defense is well known for its theoretical guarantees against arbitrary adversarial perturbations on input within a certain range (e.g., $l_2$ ball). However, most existing works in this line struggle to generalize their certified robustness in other data domains with distribution shifts. This issue is rooted in the difficulty of eliminating the negative impact of spurious correlations on robustness in different domains. To address this problem, in this work, we propose a novel certified defense framework GLEAN, which incorporates a causal perspective into the generalization problem in certified defense. More specifically, our framework integrates a certifiable causal factor learning component to disentangle the causal relations and spurious correlations between input and label, and thereby exclude the negative effect of spurious correlations on defense. On top of that, we design a causally certified defense strategy to handle adversarial attacks on latent causal factors. In this way, our framework is not only robust against malicious noises on data in the training distribution but also can generalize its robustness across domains with distribution shifts. Extensive experiments on benchmark datasets validate the superiority of our framework in certified robustness generalization in different data domains. Code is available in the supplementary materials.
CLMay 2, 2024Code
A Survey on Large Language Models for Critical Societal Domains: Finance, Healthcare, and LawZhiyu Zoey Chen, Jing Ma, Xinlu Zhang et al.
In the fast-evolving domain of artificial intelligence, large language models (LLMs) such as GPT-3 and GPT-4 are revolutionizing the landscapes of finance, healthcare, and law: domains characterized by their reliance on professional expertise, challenging data acquisition, high-stakes, and stringent regulatory compliance. This survey offers a detailed exploration of the methodologies, applications, challenges, and forward-looking opportunities of LLMs within these high-stakes sectors. We highlight the instrumental role of LLMs in enhancing diagnostic and treatment methodologies in healthcare, innovating financial analytics, and refining legal interpretation and compliance strategies. Moreover, we critically examine the ethics for LLM applications in these fields, pointing out the existing ethical concerns and the need for transparent, fair, and robust AI systems that respect regulatory norms. By presenting a thorough review of current literature and practical applications, we showcase the transformative impact of LLMs, and outline the imperative for interdisciplinary cooperation, methodological advancements, and ethical vigilance. Through this lens, we aim to spark dialogue and inspire future research dedicated to maximizing the benefits of LLMs while mitigating their risks in these precision-dependent sectors. To facilitate future research on LLMs in these critical societal domains, we also initiate a reading list that tracks the latest advancements under this topic, which will be continually updated: \url{https://github.com/czyssrs/LLM_X_papers}.
LGJul 16, 2024
A Benchmark for Fairness-Aware Graph LearningYushun Dong, Song Wang, Zhenyu Lei et al.
Fairness-aware graph learning has gained increasing attention in recent years. Nevertheless, there lacks a comprehensive benchmark to evaluate and compare different fairness-aware graph learning methods, which blocks practitioners from choosing appropriate ones for broader real-world applications. In this paper, we present an extensive benchmark on ten representative fairness-aware graph learning methods. Specifically, we design a systematic evaluation protocol and conduct experiments on seven real-world datasets to evaluate these methods from multiple perspectives, including group fairness, individual fairness, the balance between different fairness criteria, and computational efficiency. Our in-depth analysis reveals key insights into the strengths and limitations of existing methods. Additionally, we provide practical guidance for applying fairness-aware graph learning methods in applications. To the best of our knowledge, this work serves as an initial step towards comprehensively understanding representative fairness-aware graph learning methods to facilitate future advancements in this area.
LGNov 3, 2022
Private Semi-supervised Knowledge Transfer for Deep Learning from Noisy LabelsQiuchen Zhang, Jing Ma, Jian Lou et al.
Deep learning models trained on large-scale data have achieved encouraging performance in many real-world tasks. Meanwhile, publishing those models trained on sensitive datasets, such as medical records, could pose serious privacy concerns. To counter these issues, one of the current state-of-the-art approaches is the Private Aggregation of Teacher Ensembles, or PATE, which achieved promising results in preserving the utility of the model while providing a strong privacy guarantee. PATE combines an ensemble of "teacher models" trained on sensitive data and transfers the knowledge to a "student" model through the noisy aggregation of teachers' votes for labeling unlabeled public data which the student model will be trained on. However, the knowledge or voted labels learned by the student are noisy due to private aggregation. Learning directly from noisy labels can significantly impact the accuracy of the student model. In this paper, we propose the PATE++ mechanism, which combines the current advanced noisy label training mechanisms with the original PATE framework to enhance its accuracy. A novel structure of Generative Adversarial Nets (GANs) is developed in order to integrate them effectively. In addition, we develop a novel noisy label detection mechanism for semi-supervised model training to further improve student model performance when training with noisy labels. We evaluate our method on Fashion-MNIST and SVHN to show the improvements on the original PATE on all measures.
CLSep 15, 2024
Causal Inference with Large Language Model: A SurveyJing Ma
Causal inference has been a pivotal challenge across diverse domains such as medicine and economics, demanding a complicated integration of human knowledge, mathematical reasoning, and data mining capabilities. Recent advancements in natural language processing (NLP), particularly with the advent of large language models (LLMs), have introduced promising opportunities for traditional causal inference tasks. This paper reviews recent progress in applying LLMs to causal inference, encompassing various tasks spanning different levels of causation. We summarize the main causal problems and approaches, and present a comparison of their evaluation results in different causal scenarios. Furthermore, we discuss key findings and outline directions for future research, underscoring the potential implications of integrating LLMs in advancing causal inference methodologies.
LGMay 21, 2022
Aligning Logits Generatively for Principled Black-Box Knowledge DistillationJing Ma, Xiang Xiang, Ke Wang et al.
Black-Box Knowledge Distillation (B2KD) is a formulated problem for cloud-to-edge model compression with invisible data and models hosted on the server. B2KD faces challenges such as limited Internet exchange and edge-cloud disparity of data distributions. In this paper, we formalize a two-step workflow consisting of deprivatization and distillation, and theoretically provide a new optimization direction from logits to cell boundary different from direct logits alignment. With its guidance, we propose a new method Mapping-Emulation KD (MEKD) that distills a black-box cumbersome model into a lightweight one. Our method does not differentiate between treating soft or hard responses, and consists of: 1) deprivatization: emulating the inverse mapping of the teacher function with a generator, and 2) distillation: aligning low-dimensional logits of the teacher and student models by reducing the distance of high-dimensional image points. For different teacher-student pairs, our method yields inspiring distillation performance on various benchmarks, and outperforms the previous state-of-the-art approaches.
64.7LGMay 21
CausalGuard: Conformal Inference under Graph UncertaintyVikash Singh, Weicong Chen, Debargha Ganguly et al.
Estimating treatment effects from observational data requires choosing an adjustment set, but valid adjustment depends on an unknown causal graph. Graph misspecification can cause under-coverage, while graph-agnostic conformal wrappers may regain nominal coverage only through large padding. We introduce CausalGuard, a structure-weighted conformal framework that calibrates after aggregating graph-conditional doubly robust pseudo-outcomes. Candidate DAGs are proposed from an LLM-derived edge prior, pruned by conditional-independence tests, and reweighted by Bayesian Information Criterion. A composite nonconformity score then calibrates the posterior-weighted pseudo-outcome. CausalGuard provides distribution-free finite-sample marginal coverage for this aggregated pseudo-outcome; under causal identification, overlap, conditional-mean nuisance stability, and concentration on target-aligned valid adjustment strategies, its conditional mean converges to the true Conditional Average Treatment Effect. Across five benchmarks, CausalGuard attains mean coverage above the nominal 90% level for the directly evaluable target and reduces width when graph-agnostic conformal baselines require large padding. Stress tests show that CausalGuard suppresses invalid collider adjustment and remains stable under misspecified priors when the retained candidate set is data-supported.
CLApr 4, 2023
A Unified Contrastive Transfer Framework with Propagation Structure for Boosting Low-Resource Rumor DetectionHongzhan Lin, Jing Ma, Ruichao Yang et al.
The truth is significantly hampered by massive rumors that spread along with breaking news or popular topics. Since there is sufficient corpus gathered from the same domain for model training, existing rumor detection algorithms show promising performance on yesterday's news. However, due to a lack of substantial training data and prior expert knowledge, they are poor at spotting rumors concerning unforeseen events, especially those propagated in different languages (i.e., low-resource regimes). In this paper, we propose a unified contrastive transfer framework to detect rumors by adapting the features learned from well-resourced rumor data to that of the low-resourced with only few-shot annotations. More specifically, we first represent rumor circulated on social media as an undirected topology for enhancing the interaction of user opinions, and then train a Multi-scale Graph Convolutional Network via a unified contrastive paradigm to mine effective clues simultaneously from post semantics and propagation structure. Our model explicitly breaks the barriers of the domain and/or language issues, via language alignment and a novel domain-adaptive contrastive learning mechanism. To well-generalize the representation learning using a small set of annotated target events, we reveal that rumor-indicative signal is closely correlated with the uniformity of the distribution of these events. We design a target-wise contrastive training mechanism with three event-level data augmentation strategies, capable of unifying the representations by distinguishing target events. Extensive experiments conducted on four low-resource datasets collected from real-world microblog platforms demonstrate that our framework achieves much better performance than state-of-the-art methods and exhibits a superior capacity for detecting rumors at early stages.
95.8CVMar 26
Probabilistic Concept Graph Reasoning for Multimodal Misinformation DetectionRuichao Yang, Wei Gao, Xiaobin Zhu et al.
Multimodal misinformation poses an escalating challenge that often evades traditional detectors, which are opaque black boxes and fragile against new manipulation tactics. We present Probabilistic Concept Graph Reasoning (PCGR), an interpretable and evolvable framework that reframes multimodal misinformation detection (MMD) as structured and concept-based reasoning. PCGR follows a build-then-infer paradigm, which first constructs a graph of human-understandable concept nodes, including novel high-level concepts automatically discovered and validated by multimodal large language models (MLLMs), and then applies hierarchical attention over this concept graph to infer claim veracity. This design produces interpretable reasoning chains linking evidence to conclusions. Experiments demonstrate that PCGR achieves state-of-the-art MMD accuracy and robustness to emerging manipulation types, outperforming prior methods in both coarse detection and fine-grained manipulation recognition.
82.2CVMay 6
Structured 3D Latents Are Surprisingly Powerful: Unleashing Generalizable Style with 2D DiffusionYiran Qiao, Yiren Lu, Yunlai Zhou et al.
3D asset generation plays a pivotal role in fields such as gaming and virtual reality, enabling the rapid synthesis of high-fidelity 3D objects from a single or multiple images. Building on this capability, enabling style-controllable generation naturally emerges as an important and desirable direction. However, existing approaches typically rely on style images that lie within or are similar to the training distribution of 3D generation models. When presented with out-of-distribution (OOD) styles, their performance degrades significantly or even fails. To address this limitation, we introduce $\textbf{DiLAST}$: 2D Diffusion-based Latent Awakening for 3D Style Transfer. Specifically, we leverage a pretrained 2D diffusion model as a teacher to provide rich and generalizable style priors. By aligning rendered views with the target style under diffusion-based guidance, our method optimizes the structured 3D latent representation for stylization. We observe that this limitation stems not from insufficient model capacity, but from the underutilization of structured 3D latents, which are inherently expressive. Despite being trained on comparatively limited data, 3D generation models can leverage 2D diffusion guidance to steer denoising toward specific directions in latent space, thereby producing diverse, OOD styles. Extensive experiments across diverse data and multiple 3D generation backbones demonstrate the effectiveness and plug-and-play nature of our approach.
CLApr 15, 2024Code
MMCode: Benchmarking Multimodal Large Language Models for Code Generation with Visually Rich Programming ProblemsKaixin Li, Yuchen Tian, Qisheng Hu et al.
Programming often involves converting detailed and complex specifications into code, a process during which developers typically utilize visual aids to more effectively convey concepts. While recent developments in Large Multimodal Models have demonstrated remarkable abilities in visual reasoning and mathematical tasks, there is little work on investigating whether these models can effectively interpret visual elements for code generation. To this end, we present MMCode, the first multi-modal coding dataset for evaluating algorithmic problem-solving skills in visually rich contexts. MMCode contains 3,548 questions and 6,620 images collected from real-world programming challenges harvested from 10 code competition websites, presenting significant challenges due to the extreme demand for reasoning abilities. Our experiment results show that current state-of-the-art models struggle to solve these problems. The results highlight the lack of powerful vision-code models, and we hope MMCode can serve as an inspiration for future works in this domain. The data and code are publicly available at https://github.com/likaixin2000/MMCode.
CLJan 7
DiffCoT: Diffusion-styled Chain-of-Thought Reasoning in LLMsShidong Cao, Hongzhan Lin, Yuxuan Gu et al.
Chain-of-Thought (CoT) reasoning improves multi-step mathematical problem solving in large language models but remains vulnerable to exposure bias and error accumulation, as early mistakes propagate irreversibly through autoregressive decoding. In this work, we propose DiffCoT, a diffusion-styled CoT framework that reformulates CoT reasoning as an iterative denoising process. DiffCoT integrates diffusion principles at the reasoning-step level via a sliding-window mechanism, enabling unified generation and retrospective correction of intermediate steps while preserving token-level autoregression. To maintain causal consistency, we further introduce a causal diffusion noise schedule that respects the temporal structure of reasoning chains. Extensive experiments on three multi-step CoT reasoning benchmarks across diverse model backbones demonstrate that DiffCoT consistently outperforms existing CoT preference optimization methods, yielding improved robustness and error-correction capability in CoT reasoning.
LGNov 12, 2025
GeoGNN: Quantifying and Mitigating Semantic Drift in Text-Attributed GraphsLiangwei Yang, Jing Ma, Jianguo Zhang et al.
Graph neural networks (GNNs) on text--attributed graphs (TAGs) typically encode node texts using pretrained language models (PLMs) and propagate these embeddings through linear neighborhood aggregation. However, the representation spaces of modern PLMs are highly non--linear and geometrically structured, where textual embeddings reside on curved semantic manifolds rather than flat Euclidean spaces. Linear aggregation on such manifolds inevitably distorts geometry and causes semantic drift--a phenomenon where aggregated representations deviate from the intrinsic manifold, losing semantic fidelity and expressive power. To quantitatively investigate this problem, this work introduces a local PCA--based metric that measures the degree of semantic drift and provides the first quantitative framework to analyze how different aggregation mechanisms affect manifold structure. Building upon these insights, we propose Geodesic Aggregation, a manifold--aware mechanism that aggregates neighbor information along geodesics via log--exp mappings on the unit sphere, ensuring that representations remain faithful to the semantic manifold during message passing. We further develop GeoGNN, a practical instantiation that integrates spherical attention with manifold interpolation. Extensive experiments across four benchmark datasets and multiple text encoders show that GeoGNN substantially mitigates semantic drift and consistently outperforms strong baselines, establishing the importance of manifold--aware aggregation in text--attributed graph learning.
CVAug 4, 2024
View-consistent Object Removal in Radiance FieldsYiren Lu, Jing Ma, Yu Yin
Radiance Fields (RFs) have emerged as a crucial technology for 3D scene representation, enabling the synthesis of novel views with remarkable realism. However, as RFs become more widely used, the need for effective editing techniques that maintain coherence across different perspectives becomes evident. Current methods primarily depend on per-frame 2D image inpainting, which often fails to maintain consistency across views, thus compromising the realism of edited RF scenes. In this work, we introduce a novel RF editing pipeline that significantly enhances consistency by requiring the inpainting of only a single reference image. This image is then projected across multiple views using a depth-based approach, effectively reducing the inconsistencies observed with per-frame inpainting. However, projections typically assume photometric consistency across views, which is often impractical in real-world settings. To accommodate realistic variations in lighting and viewpoint, our pipeline adjusts the appearance of the projected views by generating multiple directional variants of the inpainted image, thereby adapting to different photometric conditions. Additionally, we present an effective and robust multi-view object segmentation approach as a valuable byproduct of our pipeline. Extensive experiments demonstrate that our method significantly surpasses existing frameworks in maintaining content consistency across views and enhancing visual quality. More results are available at https://vulab-ai.github.io/View-consistent_Object_Removal_in_Radiance_Fields.
LGSep 15, 2024
A Survey of Out-of-distribution Generalization for Graph Machine Learning from a Causal ViewJing Ma
Graph machine learning (GML) has been successfully applied across a wide range of tasks. Nonetheless, GML faces significant challenges in generalizing over out-of-distribution (OOD) data, which raises concerns about its wider applicability. Recent advancements have underscored the crucial role of causality-driven approaches in overcoming these generalization challenges. Distinct from traditional GML methods that primarily rely on statistical dependencies, causality-focused strategies delve into the underlying causal mechanisms of data generation and model prediction, thus significantly improving the generalization of GML across different environments. This paper offers a thorough review of recent progress in causality-involved GML generalization. We elucidate the fundamental concepts of employing causality to enhance graph model generalization and categorize the various approaches, providing detailed descriptions of their methodologies and the connections among them. Furthermore, we explore the incorporation of causality in other related important areas of trustworthy GML, such as explanation, fairness, and robustness. Concluding with a discussion on potential future research directions, this review seeks to articulate the continuing development and future potential of causality in enhancing the trustworthiness of graph machine learning.
77.0IRMay 17
Text-Guided Visual Representation Learning for Robust Multimodal E-Commerce RecommendationYufei Guo, Jing Ma, Tianlu Zhang et al.
Multimodal item embeddings are crucial for e-commerce item-to-item (I2I) retrieval, yet real-world product images often contain promotional overlays and background clutter that inject spurious visual cues and degrade retrieval robustness. This issue is particularly pronounced in MLRM-style pipelines, where a frozen vision encoder is connected to an LLM through a lightweight connector that must selectively aggregate visual tokens. We propose Text-Guided Q-Former (TGQ-Former), a text-guided visual representation learning framework that leverages structured metadata as semantic guidance for visual token extraction while preserving complementary visual evidence. Concretely, TGQ-Former employs a hybrid-query connector to disentangle metadata-anchored and exploratory visual streams, and introduces a lightweight reliability-aware dual-gated vector modulation module to adaptively calibrate their contributions under noisy inputs. Experiments on large-scale, real-world e-commerce datasets with full-pool retrieval show that TGQ-Former consistently outperforms strong connector baselines and end-to-end MLLMs. On average, it improves Hit Rate@100 (H@100) by 6.04%, demonstrating the effectiveness of text-guided visual encoding for robust multimodal retrieval.
CLFeb 16, 2025Code
CounterBench: A Benchmark for Counterfactuals Reasoning in Large Language ModelsYuefei Chen, Vivek K. Singh, Jing Ma et al.
Counterfactual reasoning is widely recognized as one of the most challenging and intricate aspects of causality in artificial intelligence. In this paper, we evaluate the performance of large language models (LLMs) in counterfactual reasoning. In contrast to previous studies that primarily focus on commonsense causal reasoning, where LLMs often rely on prior knowledge for inference, we specifically assess their ability to perform counterfactual inference using a set of formal rules. To support this evaluation, we introduce a new benchmark dataset, CounterBench, comprising 1K counterfactual reasoning questions. The dataset is designed with varying levels of difficulty, diverse causal graph structures, distinct types of counterfactual questions, and multiple nonsensical name variants. Our experiments demonstrate that counterfactual reasoning poses a significant challenge for LLMs, with most models performing at levels comparable to random guessing. To enhance LLM's counterfactual reasoning ability, we propose a novel reasoning paradigm, CoIn, which guides LLMs through iterative reasoning and backtracking to systematically explore counterfactual solutions. Experimental results show that our method significantly improves LLM performance on counterfactual reasoning tasks and consistently enhances performance across different LLMs.Our dataset is available at https://huggingface.co/datasets/CounterBench/CounterBench.
CLJul 2, 2025Code
AdamMeme: Adaptively Probe the Reasoning Capacity of Multimodal Large Language Models on HarmfulnessZixin Chen, Hongzhan Lin, Kaixin Li et al.
The proliferation of multimodal memes in the social media era demands that multimodal Large Language Models (mLLMs) effectively understand meme harmfulness. Existing benchmarks for assessing mLLMs on harmful meme understanding rely on accuracy-based, model-agnostic evaluations using static datasets. These benchmarks are limited in their ability to provide up-to-date and thorough assessments, as online memes evolve dynamically. To address this, we propose AdamMeme, a flexible, agent-based evaluation framework that adaptively probes the reasoning capabilities of mLLMs in deciphering meme harmfulness. Through multi-agent collaboration, AdamMeme provides comprehensive evaluations by iteratively updating the meme data with challenging samples, thereby exposing specific limitations in how mLLMs interpret harmfulness. Extensive experiments show that our framework systematically reveals the varying performance of different target mLLMs, offering in-depth, fine-grained analyses of model-specific weaknesses. Our code is available at https://github.com/Lbotirx/AdamMeme.
CRJan 5, 2025Code
Backdoor Token Unlearning: Exposing and Defending Backdoors in Pretrained Language ModelsPeihai Jiang, Xixiang Lyu, Yige Li et al.
Supervised fine-tuning has become the predominant method for adapting large pretrained models to downstream tasks. However, recent studies have revealed that these models are vulnerable to backdoor attacks, where even a small number of malicious samples can successfully embed backdoor triggers into the model. While most existing defense methods focus on post-training backdoor defense, efficiently defending against backdoor attacks during training phase remains largely unexplored. To address this gap, we propose a novel defense method called Backdoor Token Unlearning (BTU), which proactively detects and neutralizes trigger tokens during the training stage. Our work is based on two key findings: 1) backdoor learning causes distinctive differences between backdoor token parameters and clean token parameters in word embedding layers, and 2) the success of backdoor attacks heavily depends on backdoor token parameters. The BTU defense leverages these properties to identify aberrant embedding parameters and subsequently removes backdoor behaviors using a fine-grained unlearning technique. Extensive evaluations across three datasets and four types of backdoor attacks demonstrate that BTU effectively defends against these threats while preserving the model's performance on primary tasks. Our code is available at https://github.com/XDJPH/BTU.
CVNov 19, 2024Code
Invariant Shape Representation Learning For Image ClassificationTonmoy Hossain, Jing Ma, Jundong Li et al.
Geometric shape features have been widely used as strong predictors for image classification. Nevertheless, most existing classifiers such as deep neural networks (DNNs) directly leverage the statistical correlations between these shape features and target variables. However, these correlations can often be spurious and unstable across different environments (e.g., in different age groups, certain types of brain changes have unstable relations with neurodegenerative disease); hence leading to biased or inaccurate predictions. In this paper, we introduce a novel framework that for the first time develops invariant shape representation learning (ISRL) to further strengthen the robustness of image classifiers. In contrast to existing approaches that mainly derive features in the image space, our model ISRL is designed to jointly capture invariant features in latent shape spaces parameterized by deformable transformations. To achieve this goal, we develop a new learning paradigm based on invariant risk minimization (IRM) to learn invariant representations of image and shape features across multiple training distributions/environments. By embedding the features that are invariant with regard to target variables in different environments, our model consistently offers more accurate predictions. We validate our method by performing classification tasks on both simulated 2D images, real 3D brain and cine cardiovascular magnetic resonance images (MRIs). Our code is publicly available at https://github.com/tonmoy-hossain/ISRL.
LGOct 25, 2024Code
Global Graph Counterfactual Explanation: A Subgraph Mapping ApproachYinhan He, Wendy Zheng, Yaochen Zhu et al.
Graph Neural Networks (GNNs) have been widely deployed in various real-world applications. However, most GNNs are black-box models that lack explanations. One strategy to explain GNNs is through counterfactual explanation, which aims to find minimum perturbations on input graphs that change the GNN predictions. Existing works on GNN counterfactual explanations primarily concentrate on the local-level perspective (i.e., generating counterfactuals for each individual graph), which suffers from information overload and lacks insights into the broader cross-graph relationships. To address such issues, we propose GlobalGCE, a novel global-level graph counterfactual explanation method. GlobalGCE aims to identify a collection of subgraph mapping rules as counterfactual explanations for the target GNN. According to these rules, substituting certain significant subgraphs with their counterfactual subgraphs will change the GNN prediction to the desired class for most graphs (i.e., maximum coverage). Methodologically, we design a significant subgraph generator and a counterfactual subgraph autoencoder in our GlobalGCE, where the subgraphs and the rules can be effectively generated. Extensive experiments demonstrate the superiority of our GlobalGCE compared to existing baselines. Our code can be found at https://anonymous.4open.science/r/GlobalGCE-92E8.
80.2AIApr 20
FACT-E: Causality-Inspired Evaluation for Trustworthy Chain-of-Thought ReasoningYuxi Sun, Aoqi Zuo, Haotian Xie et al.
Chain-of-Thought (CoT) prompting has improved LLM reasoning, but models often generate explanations that appear coherent while containing unfaithful intermediate steps. Existing self-evaluation approaches are prone to inherent biases: the model may confidently endorse coherence even when the step-to-step implication is not valid, leading to unreliable faithfulness evaluation. We propose FACT-E, a causality-inspired framework for evaluating CoT quality. FACT-E uses controlled perturbations as an instrumental signal to separate genuine step-to-step dependence from bias-driven artifacts, producing more reliable faithfulness estimates (\textit{intra-chain faithfulness}). To select trustworthy trajectories, FACT-E jointly considers \textit{intra-chain faithfulness} and \textit{CoT-to-answer consistency}, ensuring that selected chains are both faithful internally and supportive of the correct final answer. Experiments on GSM8K, MATH, and CommonsenseQA show that FACT-E improves reasoning-trajectory selection and yields stronger in-context learning exemplars. FACT-E also reliably detects flawed reasoning under noisy conditions, providing a robust metric for trustworthy LLM reasoning.