Wei Ai

CL
h-index: 36
58 papers
1,057 citations
Novelty: 43%
AI Score: 55

58 Papers

88.1 | CL | Jun 2
Conditional Hypothesis Generation for LLM-Based Text Analysis with Researcher-Specified Covariates

Paiheng Xu, Jing Liu, Wei Ai

A core goal of computational social science is to discover interpretable differences in how language varies across outcomes of interest, such as political affiliation or instructional quality. Recent LLM-based hypothesis generation methods describe such differences in natural language, but select for globally discriminative patterns without accounting for covariates that shape the data based on researchers' domain knowledge. When covariates are ignored, selected patterns can reflect confounds rather than differences of substantive interest. We introduce conditional hypothesis generation, a framework that incorporates researcher-specified covariates to steer hypothesis discovery toward differences that hold within relevant subgroups. Two challenges arise: the target subgroup may be underrepresented (stratum imbalance), and the direction of a difference may reverse across subgroups (sign reversal). We propose two econometrics-inspired methods: one introduces feature-covariate interactions to detect sign reversals, and the other applies within-stratum demeaning and inverse-frequency reweighting to equalize underrepresented strata. Synthetic experiments show each method outperforms global baselines in its targeted setting, and expert evaluation on two real-world datasets confirms that covariate-aware generation surfaces more useful hypotheses within relevant subgroups.
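As a rough illustration of the second method mentioned in the abstract, the sketch below applies within-stratum demeaning and inverse-frequency reweighting to toy data. The function names, the toy values, and the choice to give every stratum a total weight of 1 are assumptions made for illustration; they are not taken from the paper.

```python
from collections import Counter, defaultdict


def demean_within_strata(values, strata):
    """Subtract each stratum's mean from the values that belong to it."""
    counts = Counter(strata)
    sums = defaultdict(float)
    for value, stratum in zip(values, strata):
        sums[stratum] += value
    means = {s: sums[s] / counts[s] for s in counts}
    return [value - means[stratum] for value, stratum in zip(values, strata)]


def inverse_frequency_weights(strata):
    """Weight each row by 1 / (size of its stratum), so each stratum sums to 1."""
    counts = Counter(strata)
    return [1.0 / counts[stratum] for stratum in strata]


# Hypothetical toy data: stratum "b" is underrepresented.
x = [1.0, 2.0, 3.0, 10.0]
strata = ["a", "a", "a", "b"]

x_centered = demean_within_strata(x, strata)   # differences within each stratum
weights = inverse_frequency_weights(strata)    # small strata get larger weights
```

After demeaning, a feature can only score well if it separates outcomes inside a stratum, and the weights keep a large stratum from dominating a small one.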

79.7 | AS | Mar 27 | Code
Dual-branch Graph Domain Adaptation for Cross-scenario Multi-modal Emotion Recognition

Yuntao Shou, Jun Zhou, Tao Meng et al.

Multimodal Emotion Recognition in Conversations (MERC) aims to predict speakers' emotional states in multi-turn dialogues through text, audio, and visual cues. In real-world settings, conversation scenarios differ significantly in speakers, topics, styles, and noise levels. Existing MERC methods generally neglect these cross-scenario variations, limiting their ability to transfer models trained on a source domain to unseen target domains. To address this issue, we propose a Dual-branch Graph Domain Adaptation framework (DGDA) for multimodal emotion recognition under cross-scenario conditions. We first construct an emotion interaction graph to characterize complex emotional dependencies among utterances. A dual-branch encoder, consisting of a hypergraph neural network (HGNN) and a path neural network (PathNN), is then designed to explicitly model multivariate relationships and implicitly capture global dependencies. To enable out-of-domain generalization, a domain adversarial discriminator is introduced to learn invariant representations across domains. Furthermore, a regularization loss is incorporated to suppress the negative influence of noisy labels. To the best of our knowledge, DGDA is the first MERC framework that jointly addresses domain shift and label noise. Theoretical analysis provides tighter generalization bounds, and extensive experiments on IEMOCAP and MELD demonstrate that DGDA consistently outperforms strong baselines and better adapts to cross-scenario conversations. Our code is available at https://github.com/Xudmm1239439/DGDA-Net.

CY | Aug 30, 2023
Emoji Promotes Developer Participation and Issue Resolution on GitHub

Yuhang Zhou, Xuan Lu, Ge Gao et al.

Although remote work has been increasingly adopted during the pandemic, many are concerned about its low efficiency. Text-based communication lacks non-verbal cues such as facial expressions and body language, which hinders effective communication and negatively impacts work outcomes. Prevalent on social media platforms, emojis, as alternative non-verbal cues, are gaining popularity in virtual workspaces as well. In this paper, we study how emoji usage influences developer participation and issue resolution in virtual workspaces. To this end, we collect GitHub issues for a one-year period and apply causal inference techniques to measure the causal effect of emojis on the outcome of issues, controlling for confounders such as issue content, repository, and author information. We find that emojis can significantly reduce the resolution time of issues and attract more user participation. We also compare the heterogeneous effect on different types of issues. These findings deepen our understanding of the developer communities, and they provide design implications on how to facilitate interactions and broaden developer participation.

LG | Jun 1, 2023
Pitfalls in Link Prediction with Graph Neural Networks: Understanding the Impact of Target-link Inclusion & Better Practices

Jing Zhu, Yuhang Zhou, Vassilis N. Ioannidis et al.

While Graph Neural Networks (GNNs) are remarkably successful in a variety of high-impact applications, we demonstrate that, in link prediction, the common practices of including the edges being predicted in the graph at training and/or test have outsized impact on the performance of low-degree nodes. We theoretically and empirically investigate how these practices impact node-level performance across different degrees. Specifically, we explore three issues that arise: (I1) overfitting; (I2) distribution shift; and (I3) implicit test leakage. The former two issues lead to poor generalizability to the test data, while the latter leads to overestimation of the model's performance and directly impacts the deployment of GNNs. To address these issues in a systematic way, we introduce an effective and efficient GNN training framework, SpotTarget, which leverages our insight on low-degree nodes: (1) at training time, it excludes a (training) edge to be predicted if it is incident to at least one low-degree node; and (2) at test time, it excludes all test edges to be predicted (thus, mimicking real scenarios of using GNNs, where the test data is not included in the graph). SpotTarget helps researchers and practitioners adhere to best practices for learning from graph data, which are frequently overlooked even by the most widely-used frameworks. Our experiments on various real-world datasets show that SpotTarget makes GNNs up to 15x more accurate in sparse graphs, and significantly improves their performance for low-degree nodes in dense graphs.
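The two exclusion rules stated in the abstract can be sketched as follows. This is a minimal sketch, not the SpotTarget implementation: the function names, the undirected edge-list representation, and the `degree_threshold` parameter are assumptions introduced here.

```python
from collections import Counter


def exclude_low_degree_targets(graph_edges, target_edges, degree_threshold):
    """Training-time rule: drop a target edge from the message-passing graph
    if at least one endpoint has degree below degree_threshold."""
    degree = Counter()
    for u, v in graph_edges:
        degree[u] += 1
        degree[v] += 1
    targets = {frozenset(edge) for edge in target_edges}
    kept = []
    for u, v in graph_edges:
        is_target = frozenset((u, v)) in targets
        touches_low_degree = min(degree[u], degree[v]) < degree_threshold
        if is_target and touches_low_degree:
            continue  # excluded: a target edge incident to a low-degree node
        kept.append((u, v))
    return kept


def exclude_all_test_targets(graph_edges, test_edges):
    """Test-time rule: remove every test edge from the message-passing graph."""
    tests = {frozenset(edge) for edge in test_edges}
    return [(u, v) for u, v in graph_edges if frozenset((u, v)) not in tests]


# Hypothetical toy graph: node 4 has degree 1, so target edge (3, 4) is dropped,
# while target edge (0, 1) stays because both endpoints have degree >= 2.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (3, 4)]
kept = exclude_low_degree_targets(edges, [(0, 1), (3, 4)], degree_threshold=2)
```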

AI | Jan 16 | Code
The Paradigm Shift: A Comprehensive Survey on Large Vision Language Models for Multimodal Fake News Detection

Wei Ai, Yilong Tan, Yuntao Shou et al.

In recent years, the rapid evolution of large vision-language models (LVLMs) has driven a paradigm shift in multimodal fake news detection (MFND), transforming it from traditional feature-engineering approaches to unified, end-to-end multimodal reasoning frameworks. Early methods primarily relied on shallow fusion techniques to capture correlations between text and images, but they struggled with high-level semantic understanding and complex cross-modal interactions. The emergence of LVLMs has fundamentally changed this landscape by enabling joint modeling of vision and language with powerful representation learning, thereby enhancing the ability to detect misinformation that leverages both textual narratives and visual content. Despite these advances, the field lacks a systematic survey that traces this transition and consolidates recent developments. To address this gap, this paper provides a comprehensive review of MFND through the lens of LVLMs. We first present a historical perspective, mapping the evolution from conventional multimodal detection pipelines to foundation model-driven paradigms. Next, we establish a structured taxonomy covering model architectures, datasets, and performance benchmarks. Furthermore, we analyze the remaining technical challenges, including interpretability, temporal reasoning, and domain generalization. Finally, we outline future research directions to guide the next stage of this paradigm shift. To the best of our knowledge, this is the first comprehensive survey to systematically document and analyze the transformative role of LVLMs in combating multimodal fake news. A summary of the existing methods is available on our GitHub: https://github.com/Tan-YiLong/Overview-of-Fake-News-Detection.

LG | Jan 29, 2023
Team Resilience under Shock: An Empirical Analysis of GitHub Repositories during Early COVID-19 Pandemic

Xuan Lu, Wei Ai, Yixin Wang et al.

While many organizations have shifted to working remotely during the COVID-19 pandemic, how the remote workforce and remote teams are influenced by, and would respond to, this and future shocks remains largely unknown. Software developers have relied on remote collaborations long before the pandemic, working in virtual teams (GitHub repositories). The dynamics of these repositories through the pandemic provide a unique opportunity to understand how remote teams react under shock. This work presents a systematic analysis. We measure the overall effect of the early pandemic on public GitHub repositories by comparing their sizes and productivity with the counterfactual outcomes forecasted as if there were no pandemic. We find that the productivity level and the number of active members of these teams vary significantly during different periods of the pandemic. We then conduct a finer-grained investigation and study the heterogeneous effects of the shock on individual teams. We find that the resilience of a team is highly correlated with certain properties of the team before the pandemic. Through a bootstrapped regression analysis, we reveal which types of teams are robust or fragile to the shock.

LG | Jul 23, 2024
Masked Graph Learning with Recurrent Alignment for Multimodal Emotion Recognition in Conversation

Tao Meng, Fuchen Zhang, Yuntao Shou et al.

Since Multimodal Emotion Recognition in Conversation (MERC) can be applied to public opinion monitoring, intelligent dialogue robots, and other fields, it has received extensive research attention in recent years. Unlike traditional unimodal emotion recognition, MERC can fuse complementary semantic information between multiple modalities (e.g., text, audio, and vision) to improve emotion recognition. However, previous work ignores the inter-modal alignment process and intra-modal noise before multimodal fusion and instead fuses multimodal features directly, which hinders representation learning. In this study, we develop a novel approach called Masked Graph Learning with Recurrent Alignment (MGLRA) to tackle this problem, which uses a recurrent iterative module with memory to align multimodal features and then uses a masked GCN for multimodal feature fusion. First, we employ an LSTM to capture contextual information and use a graph attention-filtering mechanism to eliminate noise within each modality. Second, we build a recurrent iteration module with a memory function, which can use communication between different modalities to eliminate the gap between modalities and achieve the preliminary alignment of features between modalities. Then, a cross-modal multi-head attention mechanism is introduced to achieve feature alignment between modalities, and a masked GCN is constructed for multimodal feature fusion, which performs random mask reconstruction on the nodes in the graph to obtain better node feature representations. Finally, we utilize a multilayer perceptron (MLP) for emotion recognition. Extensive experiments on two benchmark datasets (i.e., IEMOCAP and MELD) demonstrate that MGLRA outperforms state-of-the-art methods.

AS | Sep 12, 2023
Kid-Whisper: Towards Bridging the Performance Gap in Automatic Speech Recognition for Children VS. Adults

Ahmed Adel Attia, Jing Liu, Wei Ai et al.

Recent advancements in Automatic Speech Recognition (ASR) systems, exemplified by Whisper, have demonstrated the potential of these systems to approach human-level performance given sufficient data. However, this progress doesn't readily extend to ASR for children due to the limited availability of suitable child-specific databases and the distinct characteristics of children's speech. A recent study investigated leveraging the My Science Tutor (MyST) children's speech corpus to enhance Whisper's performance in recognizing children's speech and demonstrated some improvement on a limited test set. This paper builds on these findings by enhancing the utility of the MyST dataset through more efficient data preprocessing. We reduce the Word Error Rate (WER) on the MyST test set from 13.93% to 9.11% with Whisper-Small and from 13.23% to 8.61% with Whisper-Medium and show that this improvement can be generalized to unseen datasets. We also highlight important challenges towards improving children's ASR performance. The results showcase the viable and efficient integration of Whisper for effective children's speech recognition.
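For readers unfamiliar with the metric reported above, WER is the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and a hypothesis, divided by the number of reference words. The sketch below is a generic implementation of that standard definition; the function name and the example sentences are illustrative and not taken from the paper, whose text normalization may differ.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```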

CL | Nov 15, 2023
Explore Spurious Correlations at the Concept Level in Language Models for Text Classification

Yuhang Zhou, Paiheng Xu, Xiaoyu Liu et al.

Language models (LMs) have achieved notable success in numerous NLP tasks, employing both fine-tuning and in-context learning (ICL) methods. While language models demonstrate exceptional performance, they face robustness challenges due to spurious correlations arising from imbalanced label distributions in training data or ICL exemplars. Previous research has primarily concentrated on word, phrase, and syntax features, neglecting the concept level, often due to the absence of concept labels and difficulty in identifying conceptual content in input texts. This paper introduces two main contributions. First, we employ ChatGPT to assign concept labels to texts, assessing concept bias in models during fine-tuning or ICL on test data. We find that LMs, when encountering spurious correlations between a concept and a label in training or prompts, resort to shortcuts for predictions. Second, we introduce a data rebalancing technique that incorporates ChatGPT-generated counterfactual data, thereby balancing label distribution and mitigating spurious correlations. Our method's efficacy, surpassing traditional token removal approaches, is validated through extensive testing.

CV | Jul 23, 2024
A Multi-view Mask Contrastive Learning Graph Convolutional Neural Network for Age Estimation

Yiping Zhang, Yuntao Shou, Tao Meng et al.

The age estimation task aims to use facial features to predict the age of people and is widely used in public security, marketing, identification, and other fields. However, the features are mainly concentrated in facial keypoints, and existing CNN and Transformer-based methods are inflexible and redundant when modeling complex irregular structures. Therefore, this paper proposes a Multi-view Mask Contrastive Learning Graph Convolutional Neural Network (MMCL-GCN) for age estimation. Specifically, the overall structure of the MMCL-GCN network contains a feature extraction stage and an age estimation stage. In the feature extraction stage, we introduce a graph structure to construct face images as input and then design a Multi-view Mask Contrastive Learning (MMCL) mechanism to learn complex structural and semantic information about face images. The learning mechanism employs an asymmetric siamese network architecture, which utilizes an online encoder-decoder structure to reconstruct the missing information from the original graph and utilizes the target encoder to learn latent representations for contrastive learning. Furthermore, to make the two learning mechanisms more compatible and complementary, we adopt two augmentation strategies and optimize the joint losses. In the age estimation stage, we design a Multi-layer Extreme Learning Machine with identity mapping (ML-IELM) to fully use the features extracted by the online encoder. Then, a classifier and a regressor are constructed based on ML-IELM to identify the age grouping interval and accurately estimate the final age. Extensive experiments show that MMCL-GCN can effectively reduce the error of age estimation on benchmark datasets such as Adience, MORPH-II, and LAP-2016.

39.1 | SD | Apr 3
Disentangled Dual-Branch Graph Learning for Conversational Emotion Recognition

Chengling Guo, Yuntao Shou, Tao Meng et al.

Multimodal emotion recognition in conversations aims to infer utterance-level emotions by jointly modeling textual, acoustic, and visual cues within context. Despite recent progress, key challenges remain, including redundant cross-modal information, imperfect semantic alignment, and insufficient modeling of high-order speaker interactions. To address these issues, we propose a framework that combines dual-space feature disentanglement with dual-branch graph learning. A shared encoder and modality-specific encoders are used to separate modality-invariant and modality-specific representations. The invariant features are modeled by a Fourier graph neural network to capture global consistency and complementary patterns, with a frequency-domain contrastive objective to enhance discriminability. In parallel, a speaker-aware hypergraph is constructed over modality-specific features to model high-order interactions, along with a speaker-consistency constraint to maintain coherent semantics. Finally, the two branches are fused for utterance-level emotion prediction. Experiments on IEMOCAP and MELD demonstrate that the proposed method achieves superior performance over strong baselines, validating its effectiveness.

75.8 | CL | Mar 22
Relational graph-driven differential denoising and diffusion attention fusion for multimodal conversation emotion recognition

Ying Liu, Yuntao Shou, Wei Ai et al.

In real-world scenarios, audio and video signals are often subject to environmental noise and limited acquisition conditions, resulting in extracted features containing excessive noise. Furthermore, there is an imbalance in data quality and information carrying capacity between different modalities. These two issues together lead to information distortion and weight bias during the fusion phase, impairing overall recognition performance. Most existing methods neglect the impact of noisy modalities and rely on implicit weighting to model modality importance, thereby failing to explicitly account for the predominant contribution of the textual modality in emotion understanding. To address these issues, we propose a relation-aware denoising and diffusion attention fusion model for multimodal conversation emotion recognition (MCER). Specifically, we first design a differential Transformer that explicitly computes the differences between two attention maps, thereby enhancing temporally consistent information while suppressing time-irrelevant noise, which leads to effective denoising in both audio and video modalities. Second, we construct modality-specific and cross-modality relation subgraphs to capture speaker-dependent emotional dependencies, enabling fine-grained modeling of intra- and inter-modal relationships. Finally, we introduce a text-guided cross-modal diffusion mechanism that leverages self-attention to model intra-modal dependencies and adaptively diffuses audiovisual information into the textual stream, ensuring more robust and semantically aligned multimodal fusion.
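The idea of taking the difference between two attention maps can be sketched as below. This is a simplified single-head illustration under stated assumptions: the function names, the fixed mixing coefficient `lam`, and the use of pre-computed query/key lists are hypothetical and not details from the paper, which may learn these quantities and add further components.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_map(queries, keys):
    """Scaled dot-product attention weights, one softmax row per query."""
    scale = math.sqrt(len(keys[0]))
    return [
        softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        for q in queries
    ]


def differential_attention(q1, k1, q2, k2, values, lam=0.5):
    """Apply (map_1 - lam * map_2) to the values; lam is an assumed constant."""
    map_1 = attention_map(q1, k1)
    map_2 = attention_map(q2, k2)
    diff = [[x - lam * y for x, y in zip(r1, r2)] for r1, r2 in zip(map_1, map_2)]
    dim = len(values[0])
    return [
        [sum(w * v[d] for w, v in zip(row, values)) for d in range(dim)]
        for row in diff
    ]


# Toy check: if both maps are identical and lam = 1, everything cancels.
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
cancelled = differential_attention(q, k, q, k, v, lam=1.0)
```

Attention mass that both maps assign to the same positions is subtracted out, which is the intuition behind using the difference to suppress shared noise.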

49.5 | AI | Mar 22
Dynamic Fusion-Aware Graph Convolutional Neural Network for Multimodal Emotion Recognition in Conversations

Tao Meng, Weilun Tang, Yuntao Shou et al.

Multimodal emotion recognition in conversations (MERC) aims to identify and understand the emotions expressed by speakers during utterance interaction from multiple modalities (e.g., text, audio, images, etc.). Existing studies have shown that GCN can improve the performance of MERC by modeling dependencies between speakers. However, existing methods usually use fixed parameters to process multimodal features for different emotion types, ignoring the dynamics of fusion between different modalities, which forces the model to balance performance between multiple emotion categories, thus limiting the model's performance on some specific emotions. To this end, we propose a dynamic fusion-aware graph convolutional neural network (DF-GCN) for robust recognition of multimodal emotion features in conversations. Specifically, DF-GCN integrates ordinary differential equations into graph convolutional networks (GCNs) to capture the dynamic nature of emotional dependencies within utterance interaction networks and leverages the prompts generated by the global information vector (GIV) of the utterance to guide the dynamic fusion of multimodal features. This allows our model to dynamically change parameters when processing each utterance feature, so that different network parameters can be equipped for different emotion categories in the inference stage, thereby achieving more flexible emotion classification and enhancing the generalization ability of the model. Comprehensive experiments conducted on two public multimodal conversational datasets confirm that the proposed DF-GCN model delivers superior performance, benefiting significantly from the dynamic fusion mechanism introduced.

LG | Jan 8
TimeGNN-Augmented Hybrid-Action MARL for Fine-Grained Task Partitioning and Energy-Aware Offloading in MEC

Wei Ai, Yun Peng, Yuntao Shou et al.

With the rapid growth of IoT devices and latency-sensitive applications, the demand for both real-time and energy-efficient computing has surged, placing significant pressure on traditional cloud computing architectures. Mobile edge computing (MEC), an emerging paradigm, effectively alleviates the load on cloud centers and improves service quality by offloading computing tasks to edge servers closer to end users. However, the limited computing resources, non-continuous power provisioning (e.g., battery-powered nodes), and highly dynamic systems of edge servers complicate efficient task scheduling and resource allocation. To address these challenges, this paper proposes a multi-agent deep reinforcement learning algorithm, TG-DCMADDPG, and constructs a collaborative computing framework for multiple edge servers, aiming to achieve joint optimization of fine-grained task partitioning and offloading. This approach incorporates a temporal graph neural network (TimeGNN) to model and predict time series of multi-dimensional server state information, thereby reducing the frequency of online interactions and improving policy predictability. Furthermore, a multi-agent deterministic policy gradient algorithm (DC-MADDPG) in a discrete-continuous hybrid action space is introduced to collaboratively optimize task partitioning ratios, transmission power, and priority scheduling strategies. Extensive simulation experiments confirm that TG-DCMADDPG achieves markedly faster policy convergence, superior energy-latency optimization, and higher task completion rates compared with existing state-of-the-art methods, underscoring its robust scalability and practical effectiveness in dynamic and constrained MEC scenarios.

CL | Dec 28, 2023 | Code
Adversarial Representation with Intra-Modal and Inter-Modal Graph Contrastive Learning for Multimodal Emotion Recognition

Yuntao Shou, Tao Meng, Wei Ai et al.

With the release of a growing number of open-source emotion recognition datasets on social media platforms and the rapid development of computing resources, multimodal emotion recognition (MER) tasks have begun to receive widespread research attention. The MER task extracts and fuses complementary semantic information from different modalities to classify the speaker's emotions. However, existing feature fusion methods usually map the features of different modalities into the same feature space for information fusion, which cannot eliminate the heterogeneity between different modalities. This makes the subsequent learning of emotion class boundaries challenging. To tackle the above problems, we propose a novel Adversarial Representation with Intra-Modal and Inter-Modal Graph Contrastive Learning for Multimodal Emotion Recognition (AR-IIGCN) method. Firstly, we input video, audio, and text features into a multi-layer perceptron (MLP) to map them into separate feature spaces. Secondly, we build a generator and a discriminator for the three modal features through adversarial representation, which can achieve information interaction between modalities and eliminate heterogeneity among modalities. Thirdly, we introduce contrastive graph representation learning to capture intra-modal and inter-modal complementary semantic information and learn intra-class and inter-class boundary information of emotion categories. Specifically, we construct a graph structure for three modal features and perform contrastive representation learning on nodes with different emotions in the same modality and the same emotion in different modalities, which can improve the feature representation ability of nodes. Extensive experiments show that the AR-IIGCN method can significantly improve emotion recognition accuracy on the IEMOCAP and MELD datasets.

IR | Aug 23, 2024
CSRec: Rethinking Sequential Recommendation from A Causal Perspective

Xiaoyu Liu, Jiaxin Yuan, Yuhang Zhou et al.

The essence of sequential recommender systems (RecSys) lies in understanding how users make decisions. Most existing approaches frame the task as sequential prediction based on users' historical purchase records. While effective in capturing users' natural preferences, this formulation falls short in accurately modeling actual recommendation scenarios, particularly in accounting for how unsuccessful recommendations influence future purchases. Furthermore, the impact of the RecSys itself on users' decisions has not been appropriately isolated and quantitatively analyzed. To address these challenges, we propose a novel formulation of sequential recommendation, termed Causal Sequential Recommendation (CSRec). Instead of predicting the next item in the sequence, CSRec aims to predict the probability of a recommended item's acceptance within a sequential context and backtrack how current decisions are made. Critically, CSRec facilitates the isolation of various factors that affect users' final decisions, especially the influence of the recommender system itself, thereby opening new avenues for the design of recommender systems. CSRec can be seamlessly integrated into existing methodologies. Experimental evaluations on both synthetic and real-world datasets demonstrate that the proposed implementation significantly improves upon state-of-the-art baselines.

CL | Sep 29, 2025 | Code
Multimodal Large Language Models Meet Multimodal Emotion Recognition and Reasoning: A Survey

Yuntao Shou, Tao Meng, Wei Ai et al.

In recent years, large language models (LLMs) have driven major advances in language understanding, marking a significant step toward artificial general intelligence (AGI). With increasing demands for higher-level semantics and cross-modal fusion, multimodal large language models (MLLMs) have emerged, integrating diverse information sources (e.g., text, vision, and audio) to enhance modeling and reasoning in complex scenarios. In AI for Science, multimodal emotion recognition and reasoning has become a rapidly growing frontier. While LLMs and MLLMs have achieved notable progress in this area, the field still lacks a systematic review that consolidates recent developments. To address this gap, this paper provides a comprehensive survey of LLMs and MLLMs for emotion recognition and reasoning, covering model architectures, datasets, and performance benchmarks. We further highlight key challenges and outline future research directions, aiming to offer researchers both an authoritative reference and practical insights for advancing this domain. To the best of our knowledge, this paper is the first attempt to comprehensively survey the intersection of MLLMs with multimodal emotion recognition and reasoning. A summary of the existing methods is available on our GitHub: https://github.com/yuntaoshou/Awesome-Emotion-Reasoning.

CL | May 21, 2025 | Code
DISCO Balances the Scales: Adaptive Domain- and Difficulty-Aware Reinforcement Learning on Imbalanced Data

Yuhang Zhou, Jing Zhu, Shengyi Qian et al.

Large Language Models (LLMs) are increasingly aligned with human preferences through Reinforcement Learning from Human Feedback (RLHF). Among RLHF methods, Group Relative Policy Optimization (GRPO) has gained attention for its simplicity and strong performance, notably eliminating the need for a learned value function. However, GRPO implicitly assumes a balanced domain distribution and uniform semantic alignment across groups, assumptions that rarely hold in real-world datasets. When applied to multi-domain, imbalanced data, GRPO disproportionately optimizes for dominant domains, neglecting underrepresented ones and resulting in poor generalization and fairness. We propose Domain-Informed Self-Consistency Policy Optimization (DISCO), a principled extension to GRPO that addresses inter-group imbalance with two key innovations. Domain-aware reward scaling counteracts frequency bias by reweighting optimization based on domain prevalence. Difficulty-aware reward scaling leverages prompt-level self-consistency to identify and prioritize uncertain prompts that offer greater learning value. Together, these strategies promote more equitable and effective policy learning across domains. Extensive experiments across multiple LLMs and skewed training distributions show that DISCO improves generalization, outperforms existing GRPO variants by 5% on Qwen3 models, and sets new state-of-the-art results on multi-domain alignment benchmarks. Our code and data are available at https://github.com/Tonyzhou98/disco_grpo.
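To make the idea of domain-aware reward scaling more concrete, the sketch below combines the standard group-relative advantage used by GRPO (reward minus group mean, divided by group standard deviation) with a per-domain weight that grows as a domain becomes rarer. The weighting formula, function names, and toy data are assumptions chosen for illustration; they are not DISCO's actual scaling rule, and the difficulty-aware term is omitted.

```python
from collections import Counter
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one group of rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def domain_weights(prompt_domains):
    """Assumed inverse-frequency weights; equals 1.0 when domains are balanced."""
    counts = Counter(prompt_domains)
    total = len(prompt_domains)
    return {d: total / (len(counts) * c) for d, c in counts.items()}


# Hypothetical training set: "code" prompts are underrepresented.
prompt_domains = ["math", "math", "math", "code"]
weights = domain_weights(prompt_domains)

# Rewards for several rollouts of a single "code" prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
scaled = [weights["code"] * a for a in advantages]
```

With this weighting, rollouts from the rare domain contribute larger-magnitude advantages than rollouts from the dominant one, which counteracts frequency bias in the gradient.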

SD | Dec 11, 2023
Deep Imbalanced Learning for Multimodal Emotion Recognition in Conversations

Tao Meng, Yuntao Shou, Wei Ai et al.

The main task of Multimodal Emotion Recognition in Conversations (MERC) is to identify the emotions in modalities, e.g., text, audio, image and video, which is a significant development direction for realizing machine intelligence. However, much of the data in MERC naturally exhibits an imbalanced distribution of emotion categories, and researchers ignore the negative impact of imbalanced data on emotion recognition. To tackle this problem, we systematically analyze it from three aspects: data augmentation, loss sensitivity, and sampling strategy, and propose the Class Boundary Enhanced Representation Learning (CBERL) model. Concretely, we first design a multimodal generative adversarial network to address the imbalanced distribution of emotion categories in raw data. Secondly, a deep joint variational autoencoder is proposed to fuse complementary semantic information across modalities and obtain discriminative feature representations. Finally, we implement a multi-task graph neural network with mask reconstruction and classification optimization to solve the problem of overfitting and underfitting in class boundary learning, and achieve cross-modal emotion recognition. We have conducted extensive experiments on the IEMOCAP and MELD benchmark datasets, and the results show that CBERL improves emotion recognition performance. In particular, on the minority-class fear and disgust emotion labels, our model improves accuracy and F1 score by 10% to 20%.

CL | Apr 27, 2024
Revisiting Multimodal Emotion Recognition in Conversation from the Perspective of Graph Spectrum

Tao Meng, Fuchen Zhang, Yuntao Shou et al.

Efficiently capturing consistent and complementary semantic features in a multimodal conversation context is crucial for Multimodal Emotion Recognition in Conversation (MERC). Existing methods mainly use graph structures to model dialogue context semantic dependencies and employ Graph Neural Networks (GNN) to capture multimodal semantic features for emotion recognition. However, these methods are limited by some inherent characteristics of GNN, such as over-smoothing and low-pass filtering, resulting in the inability to learn long-distance consistency information and complementary information efficiently. Since consistency and complementarity information correspond to low-frequency and high-frequency information, respectively, this paper revisits the problem of multimodal emotion recognition in conversation from the perspective of the graph spectrum. Specifically, we propose a Graph-Spectrum-based Multimodal Consistency and Complementary collaborative learning framework GS-MCC. First, GS-MCC uses a sliding window to construct a multimodal interaction graph to model conversational relationships and uses efficient Fourier graph operators to extract long-distance high-frequency and low-frequency information, respectively. Then, GS-MCC uses contrastive learning to construct self-supervised signals that reflect complementarity and consistent semantic collaboration with high and low-frequency signals, thereby improving the ability of high and low-frequency information to reflect real emotions. Finally, GS-MCC inputs the collaborative high and low-frequency information into the MLP network and softmax function for emotion prediction. Extensive experiments have proven the superiority of the GS-MCC architecture proposed in this paper on two benchmark data sets.
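The correspondence between consistency and low-frequency signals, and between complementarity and high-frequency signals, can be illustrated with a very simple graph filter. The sketch below is not the Fourier graph operator used by GS-MCC: it uses a neighbor average (with a self-loop) as an assumed low-pass filter and the residual as the high-pass part, and all names and data are hypothetical.

```python
def neighbor_mean(signal, adjacency):
    """Low-pass filter: average each node's value with its neighbors' values."""
    smoothed = []
    for node, neighbors in enumerate(adjacency):
        group = [node] + list(neighbors)  # include the node itself (self-loop)
        smoothed.append(sum(signal[n] for n in group) / len(group))
    return smoothed


def split_frequencies(signal, adjacency):
    """Return (low, high): the smoothed signal and what smoothing removed."""
    low = neighbor_mean(signal, adjacency)
    high = [x - l for x, l in zip(signal, low)]
    return low, high


# Path graph 0 - 1 - 2, given as adjacency lists.
adjacency = [[1], [0, 2], [1]]

# A constant signal is purely low-frequency: nothing is left in the high part.
low_const, high_const = split_frequencies([1.0, 1.0, 1.0], adjacency)

# An alternating signal differs from its neighborhood average at every node.
low_alt, high_alt = split_frequencies([1.0, -1.0, 1.0], adjacency)
```

The low part captures what neighboring utterances agree on, while the high part keeps where a node departs from its neighborhood, which stacked low-pass GNN layers tend to smooth away.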

CVDec 5, 2023
Graph Information Bottleneck for Remote Sensing Segmentation

Yuntao Shou, Wei Ai, Tao Meng et al.

Remote sensing segmentation has a wide range of applications, including environmental protection and urban change detection. Despite the success of deep learning-based remote sensing segmentation methods (e.g., CNN and Transformer), they are not flexible enough to model irregular objects. In addition, existing graph contrastive learning methods usually maximize mutual information to keep the node representations consistent between different graph views, which may cause the model to learn task-independent redundant information. To tackle the above problems, this paper treats images as graph structures and introduces a simple contrastive vision GNN (SC-ViG) architecture for remote sensing segmentation. Specifically, we construct a node-masked and edge-masked graph view to obtain an optimal graph structure representation, which can adaptively learn whether to mask nodes and edges. Furthermore, this paper introduces information bottleneck theory into graph contrastive learning to maximize task-related information while minimizing task-independent redundant information. Finally, we replace the convolutional module in UNet with the SC-ViG module to complete the segmentation and classification tasks of remote sensing images. Extensive experiments on publicly available real datasets demonstrate that our method outperforms state-of-the-art remote sensing image segmentation methods.

CLDec 17, 2023
DER-GCN: Dialogue and Event Relation-Aware Graph Convolutional Neural Network for Multimodal Dialogue Emotion Recognition

Wei Ai, Yuntao Shou, Tao Meng et al.

With the continuous development of deep learning (DL), multimodal dialogue emotion recognition (MDER), an essential branch of DL, has recently received extensive research attention. MDER aims to identify the emotional information contained in different modalities, e.g., text, video, and audio, in different dialogue scenes. However, existing research has focused on modeling contextual semantic information and dialogue relations between speakers while ignoring the impact of event relations on emotion. To tackle the above issues, we propose a novel Dialogue and Event Relation-Aware Graph Convolutional Neural Network for Multimodal Emotion Recognition (DER-GCN) method. It models dialogue relations between speakers and captures latent event relations information. Specifically, we construct a weighted multi-relationship graph to simultaneously capture the dependencies between speakers and event relations in a dialogue. Moreover, we introduce a Self-Supervised Masked Graph Autoencoder (SMGAE) to improve the fusion representation ability of features and structures. Next, we design a new Multiple Information Transformer (MIT) to capture the correlation between different relations, which better fuses the multivariate information between relations. Finally, we propose a loss optimization strategy based on contrastive learning to enhance the representation learning ability of minority class features. We conduct extensive experiments on the IEMOCAP and MELD benchmark datasets, which verify the effectiveness of the DER-GCN model. The results demonstrate that our model significantly improves both the average accuracy and the F1 score of emotion recognition.

49.7SYMay 1
Economic Valuation and Optimal Deployment of Static Synchronous Series Compensators for U.S. Power System Expansion

Wei Ai, Vladimir Dvorkin, Michael T. Craig

Flexible AC Transmission Systems (FACTS), particularly Static Synchronous Series Compensators (SSSC), can improve network transfer capability and complement restricted transmission expansion. Evaluations of FACTS within large-scale, real-world power system planning are currently lacking. This paper develops a capacity expansion model for the contiguous U.S. power system toward 2050, incorporating SSSC-modified linear power flow equations and accounting for impedance feedback in transmission expansion. Cost-optimal system expansion leverages widespread nationwide SSSC deployment on small-to-medium capacity lines and reduces the number of corridors to be reinforced. Overall, SSSCs reduce annualized system costs by $1.9 billion or decrease transmission expansion requirements by 20%. The most advantageous deployments, which achieve benefit-cost ratios of 59, are concentrated in the Midwest, facilitating the delivery of central U.S. wind power to eastern load centers. The value proposition of SSSCs is robust to cost sensitivities and potential competition from HVDC network expansion, and increases under higher demand growth and more stringent decarbonization policies. These findings provide a blueprint for leveraging SSSC deployment in the U.S. power system.

CLNov 29, 2024
SDR-GNN: Spectral Domain Reconstruction Graph Neural Network for Incomplete Multimodal Learning in Conversational Emotion Recognition

Fangze Fu, Wei Ai, Fan Yang et al.

Multimodal Emotion Recognition in Conversations (MERC) aims to classify utterance emotions using textual, auditory, and visual modal features. Most existing MERC methods assume each utterance has complete modalities, overlooking the common issue of incomplete modalities in real-world scenarios. Recently, graph neural networks (GNNs) have achieved notable results in Incomplete Multimodal Emotion Recognition in Conversations (IMERC). However, traditional GNNs focus on binary relationships between nodes, limiting their ability to capture more complex, higher-order information. Moreover, repeated message passing can cause over-smoothing, reducing their capacity to preserve essential high-frequency details. To address these issues, we propose a Spectral Domain Reconstruction Graph Neural Network (SDR-GNN) for incomplete multimodal learning in conversational emotion recognition. SDR-GNN constructs an utterance semantic interaction graph using a sliding window based on both speaker and context relationships to model emotional dependencies. To capture higher-order and high-frequency information, SDR-GNN utilizes weighted relationship aggregation, ensuring consistent semantic feature extraction across utterances. Additionally, it performs multi-frequency aggregation in the spectral domain, enabling efficient recovery of incomplete modalities by extracting both high- and low-frequency information. Finally, multi-head attention is applied to fuse and optimize features for emotion recognition. Extensive experiments on various real-world datasets demonstrate that our approach is effective in incomplete multimodal learning and outperforms current state-of-the-art methods.

CLJan 3, 2024
A Two-Stage Multimodal Emotion Recognition Model Based on Graph Contrastive Learning

Wei Ai, FuChen Zhang, Tao Meng et al.

In human-computer interaction, it is becoming increasingly important to correctly understand the user's emotional state in a conversation, so the task of multimodal emotion recognition (MER) has started to receive more attention. However, existing emotion classification methods usually perform classification only once, and sentences are likely to be misclassified in a single round of classification. Previous work also usually ignores the similarities and differences between the features of different modalities in the fusion process. To address the above issues, we propose a two-stage emotion recognition model based on graph contrastive learning (TS-GCL). First, we encode the original dataset with different preprocessing modalities. Second, a graph contrastive learning (GCL) strategy is introduced for these three modal data with other structures to learn similarities and differences within and between modalities. Finally, we use MLP twice to achieve the final emotion classification. This staged classification method can help the model to better focus on different levels of emotional information, thereby improving the performance of the model. Extensive experiments show that TS-GCL has superior performance on IEMOCAP and MELD datasets compared with previous methods.

CLApr 3, 2024
The Promises and Pitfalls of Using Language Models to Measure Instruction Quality in Education

Paiheng Xu, Jing Liu, Nathan Jones et al.

Assessing instruction quality is a fundamental component of any improvement efforts in the education system. However, traditional manual assessments are expensive, subjective, and heavily dependent on observers' expertise and idiosyncratic factors, preventing teachers from getting timely and frequent feedback. Different from prior research that mostly focuses on low-inference instructional practices on a singular basis, this paper presents the first study that leverages Natural Language Processing (NLP) techniques to assess multiple high-inference instructional practices in two distinct educational settings: in-person K-12 classrooms and simulated performance tasks for pre-service teachers. This is also the first study that applies NLP to measure a teaching practice that is widely acknowledged to be particularly effective for students with special needs. We confront two challenges inherent in NLP-based instructional analysis, including noisy and long input data and highly skewed distributions of human ratings. Our results suggest that pretrained Language Models (PLMs) demonstrate performances comparable to the agreement level of human raters for variables that are more discrete and require lower inference, but their efficacy diminishes with more complex teaching practices. Interestingly, using only teachers' utterances as input yields strong results for student-centered variables, alleviating common concerns over the difficulty of collecting and transcribing high-quality student speech data in in-person teaching settings. Our findings highlight both the potential and the limitations of current NLP techniques in the education domain, opening avenues for further exploration.

LGOct 18, 2024
Graph Contrastive Learning via Cluster-refined Negative Sampling for Semi-supervised Text Classification

Wei Ai, Jianbin Li, Ze Wang et al.

Graph contrastive learning (GCL) has been widely applied to text classification tasks due to its ability to generate self-supervised signals from unlabeled data, thus facilitating model training. However, existing GCL-based text classification methods often suffer from negative sampling bias, where similar nodes are incorrectly paired as negative pairs. This can lead to over-clustering, where instances of the same class are divided into different clusters. To address the over-clustering issue, we propose an innovative GCL-based method of graph contrastive learning via cluster-refined negative sampling for semi-supervised text classification, namely ClusterText. Firstly, we combine the pre-trained model BERT with graph neural networks to learn text representations. Secondly, we introduce a clustering refinement strategy, which clusters the learned text representations to obtain pseudo labels. For each text node, its negative sample set is drawn from different clusters. Additionally, we propose a self-correction mechanism to mitigate the loss of true negative samples caused by clustering inconsistency. By calculating the Euclidean distance between each text node and other nodes within the same cluster, distant nodes are still selected as negative samples. Our proposed ClusterText scales well, as it can effectively extract important information from a large amount of data. Experimental results demonstrate the superiority of ClusterText in text classification tasks.
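The negative-selection rule described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `select_negatives`, the fixed `distance_threshold` cutoff for "distant" same-cluster nodes, and the toy embeddings are all assumptions made for illustration.

```python
import math

def select_negatives(embeddings, cluster_ids, anchor, distance_threshold):
    """Return the indices treated as negatives for the node `anchor`.

    Nodes in a different cluster are always negatives. Nodes in the
    anchor's own cluster are kept as negatives only when their Euclidean
    distance to the anchor exceeds `distance_threshold` (a hypothetical
    cutoff standing in for the self-correction step).
    """
    negatives = []
    for j, (vec, cid) in enumerate(zip(embeddings, cluster_ids)):
        if j == anchor:
            continue
        if cid != cluster_ids[anchor]:
            negatives.append(j)  # other cluster: ordinary negative
        elif math.dist(embeddings[anchor], vec) > distance_threshold:
            negatives.append(j)  # same cluster but far away: recovered negative
    return negatives

# Toy data: three nodes share pseudo label 0, one node has pseudo label 1.
embeddings = [(0.0, 0.0), (0.1, 0.0), (3.0, 0.0), (5.0, 5.0)]
cluster_ids = [0, 0, 0, 1]
print(select_negatives(embeddings, cluster_ids, anchor=0, distance_threshold=1.0))
# prints [2, 3]
```

Node 1 is close to the anchor and shares its pseudo label, so it is excluded; node 2 shares the label but is far away, so it is still used as a negative.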

CVDec 16, 2024
GroupFace: Imbalanced Age Estimation Based on Multi-hop Attention Graph Convolutional Network and Group-aware Margin Optimization

Yiping Zhang, Yuntao Shou, Wei Ai et al.

With the recent advances in computer vision, age estimation has significantly improved in overall accuracy. However, because most common methods do not take into account the class imbalance problem in age estimation datasets, they suffer from a large bias in recognizing long-tailed groups. To achieve high-quality imbalanced learning in long-tailed groups, the dominant solution is for the feature extractor to learn the discriminative features of different groups and for the classifier to provide appropriate and unbiased margins for different groups based on those discriminative features. Therefore, in this paper, we propose an innovative collaborative learning framework (GroupFace) that integrates a multi-hop attention graph convolutional network and a dynamic group-aware margin strategy based on reinforcement learning. Specifically, to extract the discriminative features of different groups, we design an enhanced multi-hop attention graph convolutional network. This network is capable of capturing the interactions of neighboring nodes at different distances, fusing local and global information to model facial deep aging, and exploring diverse representations of different groups. In addition, to further address the class imbalance problem, we design a dynamic group-aware margin strategy based on reinforcement learning to provide appropriate and unbiased margins for different groups. The strategy divides the samples into four age groups and identifies the optimal margins for the various age groups by employing a Markov decision process. Under the guidance of the agent, the feature representation bias and the classification margin deviation between different groups can be reduced simultaneously, balancing inter-class separability and intra-class proximity. After joint optimization, our architecture achieves excellent performance on several age estimation benchmark datasets.

AIOct 18, 2024
MCSFF: Multi-modal Consistency and Specificity Fusion Framework for Entity Alignment

Wei Ai, Wen Deng, Hongyi Chen et al.

Multi-modal entity alignment (MMEA) is essential for enhancing knowledge graphs and improving information retrieval and question-answering systems. Existing methods often focus on integrating modalities through their complementarity but overlook the specificity of each modality, which can obscure crucial features and reduce alignment accuracy. To solve this, we propose the Multi-modal Consistency and Specificity Fusion Framework (MCSFF), which innovatively integrates both complementary and specific aspects of modalities. We utilize Scale Computing's hyper-converged infrastructure to optimize IT management and resource allocation in large-scale data processing. Our framework first computes similarity matrices for each modality using modality embeddings to preserve their unique characteristics. Then, an iterative update method denoises and enhances modality features to fully express critical information. Finally, we integrate the updated information from all modalities to create enriched and precise entity representations. Experiments show our method outperforms current state-of-the-art MMEA baselines on the MMKG dataset, demonstrating its effectiveness and practical potential.

CLJan 22, 2024
Emojis Decoded: Leveraging ChatGPT for Enhanced Understanding in Social Media Communications

Yuhang Zhou, Paiheng Xu, Xiyao Wang et al.

Emojis, which encapsulate semantics beyond mere words or phrases, have become prevalent in social network communications. This has spurred increasing scholarly interest in exploring their attributes and functionalities. However, emoji-related research and application face two primary challenges. First, researchers typically rely on crowd-sourcing to annotate emojis in order to understand their sentiments, usage intentions, and semantic meanings. Second, subjective interpretations by users can often lead to misunderstandings of emojis and create communication barriers. Large Language Models (LLMs) have achieved significant success in various annotation tasks, with ChatGPT demonstrating expertise across multiple domains. In our study, we assess ChatGPT's effectiveness in handling previously annotated and downstream tasks. Our objective is to validate the hypothesis that ChatGPT can serve as a viable alternative to human annotators in emoji research and that its ability to explain emoji meanings can enhance clarity and transparency in online communications. Our findings indicate that ChatGPT has extensive knowledge of emojis. It is adept at elucidating the meaning of emojis across various application scenarios and demonstrates the potential to replace human annotators in a range of tasks.

CVDec 4, 2023
CILF-CIAE: CLIP-driven Image-Language Fusion for Correcting Inverse Age Estimation

Yuntao Shou, Wei Ai, Tao Meng et al.

The age estimation task aims to predict the age of an individual by analyzing facial features in an image. The development of age estimation can improve the efficiency and accuracy of various applications (e.g., age verification and secure access control). In recent years, contrastive language-image pre-training (CLIP) has been widely used in various multimodal tasks and has made some progress in the field of age estimation. However, existing CLIP-based age estimation methods require high memory usage (quadratic complexity) when globally modeling images, and lack an error feedback mechanism to prompt the model about the quality of age prediction results. To tackle the above issues, we propose a novel CLIP-driven Image-Language Fusion for Correcting Inverse Age Estimation (CILF-CIAE). Specifically, we first introduce the CLIP model to extract image features and text semantic information respectively, and map them into a highly semantically aligned high-dimensional feature space. Next, we design a new Transformer architecture (i.e., FourierFormer) to achieve channel evolution and spatial interaction of images, and to fuse image and text semantic information. Compared with the quadratic complexity of the attention mechanism, the proposed FourierFormer has log-linear complexity. To further narrow the semantic gap between image and text features, we utilize an efficient contrastive multimodal learning module that supervises the multimodal fusion process of FourierFormer through contrastive loss for image-text matching, thereby improving the interaction effect between different modalities. Finally, we introduce reversible age estimation, which uses end-to-end error feedback to reduce the error rate of age predictions. In extensive experiments on multiple datasets, CILF-CIAE achieves better age prediction results.

CLFeb 3, 2025
MergeME: Model Merging Techniques for Homogeneous and Heterogeneous MoEs

Yuhang Zhou, Giannis Karamanolakis, Victor Soto et al.

The recent success of specialized Large Language Models (LLMs) in domains such as mathematical reasoning and coding has led to growing interest in methods for merging these expert LLMs into a unified Mixture-of-Experts (MoE) model, with the goal of enhancing performance in each domain while retaining effectiveness on general tasks. However, the effective merging of expert models remains an open challenge, especially for models with highly divergent weight parameters or different architectures. State-of-the-art MoE merging methods only work with homogeneous model architectures and rely on simple unweighted averaging to merge expert layers, which does not address parameter interference and requires extensive fine-tuning of the merged MoE to restore performance. To address these limitations, this paper introduces new MoE merging techniques, including strategies to mitigate parameter interference, routing heuristics to reduce the need for MoE fine-tuning, and a novel method for merging experts with different architectures. Extensive experiments across multiple domains demonstrate the effectiveness of our proposed methods, reducing fine-tuning costs, improving performance over state-of-the-art methods, and expanding the applicability of MoE merging.

CLNov 25, 2024
Contrastive Multi-graph Learning with Neighbor Hierarchical Sifting for Semi-supervised Text Classification

Wei Ai, Jianbin Li, Ze Wang et al.

Graph contrastive learning has been successfully applied in text classification due to its remarkable ability for self-supervised node representation learning. However, explicit graph augmentations may lead to a loss of semantics in the contrastive views. In addition, existing methods tend to overlook edge features and the varying significance of node features during multi-graph learning. Moreover, the contrastive loss suffers from false negatives. To address these limitations, we propose a novel method of contrastive multi-graph learning with neighbor hierarchical sifting for semi-supervised text classification, namely ConNHS. Specifically, we exploit core features to form a multi-relational text graph, enhancing semantic connections among texts. By separating text graphs, we provide diverse views for contrastive learning. Our approach ensures optimal preservation of the graph information, minimizing data loss and distortion. Then, we separately execute relation-aware propagation and cross-graph attention propagation, which effectively leverages the varying correlations between nodes and edge features while harmonising the information fusion across graphs. Subsequently, we present the neighbor hierarchical sifting loss (NHS) to refine the negative selection. On the one hand, following the homophily assumption, NHS masks first-order neighbors of the anchor and positives from being negatives. On the other hand, NHS excludes the high-order neighbors analogous to the anchor based on their similarities. Consequently, it effectively reduces the occurrence of false negatives, preventing the expansion of the distance between similar samples in the embedding space. Our experiments on the ThuCNews, SogouNews, 20 Newsgroups, and Ohsumed datasets achieve 95.86%, 97.52%, 87.43%, and 70.65%, respectively, demonstrating competitive results in semi-supervised text classification.
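The two sifting rules in this abstract can be sketched as a boolean mask over candidate negatives. This is an illustrative reading, not the ConNHS code: the function name `nhs_negative_mask`, the fixed `sim_threshold`, and the interpretation that the anchor, its positives, and the anchor's first-order neighbors are all blocked are assumptions.

```python
def nhs_negative_mask(adjacency, anchor_similarity, anchor, positives, sim_threshold):
    """Return mask[j] == True when node j may serve as a negative for `anchor`.

    Rule 1 (homophily): the anchor, its positives, and the anchor's
    first-order neighbors are never negatives.
    Rule 2 (similarity): any remaining node whose similarity to the anchor
    reaches `sim_threshold` (a hypothetical cutoff) is excluded as well.
    """
    n = len(adjacency)
    blocked = {anchor} | set(positives)
    blocked.update(j for j in range(n) if adjacency[anchor][j])
    return [
        j not in blocked and anchor_similarity[j] < sim_threshold
        for j in range(n)
    ]

# Toy graph: edges 0-1 and 1-2; nodes 3 and 4 are isolated.
adjacency = [
    [0, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
anchor_similarity = [1.0, 0.8, 0.7, 0.9, 0.1]  # similarity of each node to node 0
mask = nhs_negative_mask(adjacency, anchor_similarity, anchor=0,
                         positives=[2], sim_threshold=0.85)
print(mask)  # prints [False, False, False, False, True]
```

Node 3 is not adjacent to the anchor but is highly similar to it, so the second rule removes it from the negatives; only node 4 remains.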

LGDec 26, 2024
Large Language Models Meet Graph Neural Networks: A Perspective of Graph Mining

Yuxin You, Zhen Liu, Xiangchao Wen et al.

Graph mining is an important area in data mining and machine learning that involves extracting valuable information from graph-structured data. In recent years, significant progress has been made in this field through the development of graph neural networks (GNNs). However, GNNs are still deficient in generalizing to diverse graph data. To address this issue, Large Language Models (LLMs) could provide new solutions for graph mining tasks with their superior semantic understanding. In this survey, we systematically review the combination and application techniques of LLMs and GNNs and present a novel taxonomy for research in this interdisciplinary field, which involves three main categories: GNN-driving-LLM, LLM-driving-GNN, and GNN-LLM-co-driving. Within this framework, we reveal the capabilities of LLMs in enhancing graph feature extraction as well as improving the effectiveness of downstream tasks such as node classification, link prediction, and community detection. Although LLMs have demonstrated their great potential in handling graph-structured data, their high computational requirements and complexity remain challenges. Future research needs to continue to explore how to efficiently fuse LLMs and GNNs to achieve more powerful graph learning and reasoning capabilities and provide new impetus for the development of graph mining techniques.

CLOct 28, 2024
SEG: Seeds-Enhanced Iterative Refinement Graph Neural Network for Entity Alignment

Wei Ai, Yinghui Gao, Jianbin Li et al.

Entity alignment is crucial for merging knowledge across knowledge graphs, as it matches entities with identical semantics. The standard method matches these entities based on their embedding similarities using semi-supervised learning. However, diverse data sources lead to non-isomorphic neighborhood structures for aligned entities, complicating alignment, especially for less common and sparsely connected entities. This paper presents a soft label propagation framework that integrates multi-source data and iterative seed enhancement, addressing scalability challenges in handling extensive datasets where scale computing excels. The framework uses seeds for anchoring and selects optimal relationship pairs to create soft labels rich in neighborhood features and semantic relationship data. A bidirectional weighted joint loss function is implemented, which reduces the distance between positive samples and differentially processes negative samples, taking into account the non-isomorphic neighborhood structures. Our method outperforms existing semi-supervised approaches, as evidenced by superior results on multiple datasets, significantly improving the quality of entity alignment.

CVFeb 8, 2025
LRA-GNN: Latent Relation-Aware Graph Neural Network with Initial and Dynamic Residual for Facial Age Estimation

Yiping Zhang, Yuntao Shou, Wei Ai et al.

Face information is mainly concentrated among facial key points, and frontier research has begun to use graph neural networks to segment faces into patches as nodes to model complex face representations. However, these methods construct node-to-node relations based on similarity thresholds, so some latent relations are missed. These latent relations are crucial for deep semantic representation of face aging. In this paper, we propose a new Latent Relation-Aware Graph Neural Network with Initial and Dynamic Residual (LRA-GNN) to achieve robust and comprehensive facial representation. Specifically, we first construct an initial graph utilizing facial key points as prior knowledge, and then a random walk strategy is applied to the initial graph to obtain the global structure, both of which together guide the subsequent effective exploration and comprehensive representation. Then LRA-GNN leverages the multi-attention mechanism to capture the latent relations and generates a set of fully connected graphs containing rich facial information and complete structure based on the aforementioned guidance. To avoid over-smoothing issues for deep feature extraction on the fully connected graphs, the deep residual graph convolutional networks are carefully designed, which fuse adaptive initial residuals and dynamic developmental residuals to ensure the consistency and diversity of information. Finally, to improve the estimation accuracy and generalization ability, progressive reinforcement learning is proposed to optimize the ensemble classification regressor. Our proposed framework surpasses the state-of-the-art baselines on several age estimation benchmarks, demonstrating its strength and effectiveness.

CLDec 4, 2024
Dynamic Graph Neural ODE Network for Multi-modal Emotion Recognition in Conversation

Yuntao Shou, Tao Meng, Wei Ai et al.

Multimodal emotion recognition in conversation (MERC) refers to identifying and classifying human emotional states by combining data from multiple different modalities (e.g., audio, images, text, and video). Most existing multimodal emotion recognition methods use GCNs to improve performance, but existing GCN methods are prone to overfitting and cannot capture the temporal dependency of the speaker's emotions. To address the above problems, we propose a Dynamic Graph Neural Ordinary Differential Equation Network (DGODE) for MERC, which combines the dynamic changes of emotions to capture the temporal dependency of speakers' emotions, and effectively alleviates the overfitting problem of GCNs. Technically, the key idea of DGODE is to utilize an adaptive mixhop mechanism to improve the generalization ability of GCNs and use the graph ODE evolution network to characterize the continuous dynamics of node representations over time and capture temporal dependencies. Extensive experiments on two publicly available multimodal emotion recognition datasets demonstrate that the proposed DGODE model has superior performance compared to various baselines. Furthermore, the proposed DGODE can also alleviate the over-smoothing problem, thereby enabling the construction of a deep GCN network.
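To make the idea of continuous dynamics of node representations concrete, the sketch below integrates a simple graph diffusion ODE, dH/dt = (Â − I)H, with forward Euler steps. The specific ODE, the normalized adjacency `a_hat`, and the helper names are assumptions chosen for illustration; the abstract does not state the exact form DGODE uses.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def euler_graph_ode(a_hat, h0, dt, steps):
    """Integrate dH/dt = (A_hat - I) H with forward Euler.

    Each step moves every node representation a fraction `dt` of the way
    toward the aggregate of its neighbors, so depth becomes integration time.
    """
    h = [row[:] for row in h0]
    for _ in range(steps):
        ah = matmul(a_hat, h)
        h = [
            [h[i][j] + dt * (ah[i][j] - h[i][j]) for j in range(len(h[0]))]
            for i in range(len(h))
        ]
    return h

# Two connected nodes with a row-normalized adjacency (self-loops included).
a_hat = [[0.5, 0.5], [0.5, 0.5]]
h0 = [[1.0], [0.0]]
one_step = euler_graph_ode(a_hat, h0, dt=0.1, steps=1)    # approximately [[0.95], [0.05]]
long_run = euler_graph_ode(a_hat, h0, dt=0.1, steps=200)  # both nodes approach 0.5
```

With a small step size the representations drift gradually toward the neighborhood average instead of jumping there in a single layer, which is one intuition for why an ODE view can help with deep stacks of graph convolutions.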

AIApr 29, 2025
Skill Discovery for Software Scripting Automation via Offline Simulations with LLMs

Paiheng Xu, Gang Wu, Xiang Chen et al.

Scripting interfaces enable users to automate tasks and customize software workflows, but creating scripts traditionally requires programming expertise and familiarity with specific APIs, posing barriers for many users. While Large Language Models (LLMs) can generate code from natural language queries, runtime code generation is severely limited due to unverified code, security risks, longer response times, and higher computational costs. To bridge the gap, we propose an offline simulation framework to curate a software-specific skillset, a collection of verified scripts, by exploiting LLMs and publicly available scripting guides. Our framework comprises two components: (1) task creation, using top-down functionality guidance and bottom-up API synergy exploration to generate helpful tasks; and (2) skill generation with trials, refining and validating scripts based on execution feedback. To efficiently navigate the extensive API landscape, we introduce a Graph Neural Network (GNN)-based link prediction model to capture API synergy, enabling the generation of skills involving underutilized APIs and expanding the skillset's diversity. Experiments with Adobe Illustrator demonstrate that our framework significantly improves automation success rates, reduces response time, and saves runtime token costs compared to traditional runtime code generation. This is the first attempt to use software scripting interfaces as a testbed for LLM-based systems, highlighting the advantages of leveraging execution feedback in a controlled environment and offering valuable insights into aligning AI capabilities with user needs in specialized software domains.

GNApr 24, 2024
Using Artificial Intelligence to Unlock Crowdfunding Success for Small Businesses

Teng Ye, Jingnan Zheng, Junhui Jin et al.

While small businesses are increasingly turning to online crowdfunding platforms for essential funding, over 40% of these campaigns may fail to raise any money, especially those from low socio-economic areas. We utilize the latest advancements in AI technology to identify crucial factors that influence the success of crowdfunding campaigns and to improve their fundraising outcomes by strategically optimizing these factors. Our best-performing machine learning model accurately predicts the fundraising outcomes of 81.0% of campaigns, primarily based on their textual descriptions. Interpreting the machine learning model allows us to provide actionable suggestions on improving the textual description before launching a campaign. We demonstrate that by augmenting just three aspects of the narrative using a large language model, a campaign becomes preferable to 83% of human evaluators, and its likelihood of securing financial support increases by 11.9%. Our research uncovers effective strategies for crafting descriptions for small business fundraising campaigns and opens up a new realm in integrating large language models into crowdfunding methodologies.

CYFeb 22, 2024
From Adoption to Adaption: Tracing the Diffusion of New Emojis on Twitter

Yuhang Zhou, Xuan Lu, Wei Ai

In the rapidly evolving landscape of social media, the introduction of new emojis in Unicode release versions presents a structured opportunity to explore digital language evolution. Analyzing a large dataset of sampled English tweets, we examine how newly released emojis gain traction and evolve in meaning. We find that community size of early adopters and emoji semantics are crucial in determining their popularity. Certain emojis experienced notable shifts in the meanings and sentiment associations during the diffusion process. Additionally, we propose a novel framework utilizing language models to extract words and pre-existing emojis with semantically similar contexts, which enhances interpretation of new emojis. The framework demonstrates its effectiveness in improving sentiment classification performance by substituting unknown new emojis with familiar ones. This study offers a new perspective in understanding how new language units are adopted, adapted, and integrated into the fabric of online communication.

CLMar 24, 2025
SE-GNN: Seed Expanded-Aware Graph Neural Network with Iterative Optimization for Semi-supervised Entity Alignment

Tao Meng, Shuo Shan, Hongen Shao et al.

Entity alignment aims to use pre-aligned seed pairs to find other equivalent entities from different knowledge graphs (KGs) and is widely used in graph fusion-related fields. However, as the scale of KGs increases, manually annotating pre-aligned seed pairs becomes difficult. Existing research utilizes entity embeddings obtained by aggregating a single type of structural information to identify potential seed pairs, thus reducing the reliance on pre-aligned seed pairs. However, due to the structural heterogeneity of KGs, the quality of potential seed pairs obtained using only a single type of structural information is not ideal. In addition, although existing research improves the quality of potential seed pairs through semi-supervised iteration, it underestimates the impact of embedding distortion produced by noisy seed pairs on the alignment effect. To solve these problems, we propose a seed expanded-aware graph neural network with iterative optimization for semi-supervised entity alignment, named SE-GNN. First, we utilize the semantic attributes and structural features of entities, combined with a conditional filtering mechanism, to obtain high-quality initial potential seed pairs. Next, we design a local and global awareness mechanism. It introduces initial potential seed pairs and combines local and global information to obtain a more comprehensive entity embedding representation, which alleviates the impact of KG structural heterogeneity and lays the foundation for the optimization of initial potential seed pairs. Then, we design a threshold nearest neighbor embedding correction strategy. It combines a similarity threshold and the bidirectional nearest neighbor method as a filtering mechanism to select iterative potential seed pairs and also uses an embedding correction strategy to eliminate the embedding distortion.
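
The filtering step described above, a similarity threshold combined with a bidirectional nearest neighbor check, can be illustrated with a minimal sketch. This is not the SE-GNN implementation: the cosine similarity measure, the function names, and the default threshold of 0.9 are all assumptions chosen for illustration, and the embedding correction step is not modeled.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mutual_nearest_pairs(source, target, threshold=0.9):
    """Return (i, j, similarity) for pairs that are each other's nearest
    neighbor and whose similarity reaches the (hypothetical) threshold."""
    if not source or not target:
        return []
    sims = [[cosine(s, t) for t in target] for s in source]
    pairs = []
    for i, row in enumerate(sims):
        # Nearest target entity for source entity i.
        j = max(range(len(target)), key=lambda k: row[k])
        # Nearest source entity for that target entity.
        i_back = max(range(len(source)), key=lambda k: sims[k][j])
        if i_back == i and row[j] >= threshold:
            pairs.append((i, j, row[j]))
    return pairs
```

Requiring agreement in both directions discards one-sided matches, and the threshold additionally drops mutual matches that are only weakly similar.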

CLDec 16, 2024
SE-GCL: An Event-Based Simple and Effective Graph Contrastive Learning for Text Representation

Tao Meng, Wei Ai, Jianbin Li et al.

Text representation learning is significant as the cornerstone of natural language processing. In recent years, graph contrastive learning (GCL) has been widely used in text representation learning due to its ability to represent and capture complex text information in a self-supervised setting. However, current mainstream graph contrastive learning methods often require the incorporation of domain knowledge or cumbersome computations to guide the data augmentation process, which significantly limits the application efficiency and scope of GCL. Additionally, many methods learn text representations only by constructing word-document relationships, which overlooks the rich contextual semantic information in the text. To address these issues and exploit representative textual semantics, we present an event-based, simple, and effective graph contrastive learning (SE-GCL) for text representation. Precisely, we extract event blocks from text and construct internal relation graphs to represent inter-semantic interconnections, which can ensure that the most critical semantic information is preserved. Then, we devise a streamlined, unsupervised graph contrastive learning framework to leverage the complementary nature of the event semantic and structural information for intricate feature data capture. In particular, we introduce the concept of an event skeleton for core representation semantics and simplify the typically complex data augmentation techniques found in existing graph contrastive learning to boost algorithmic efficiency. We employ multiple loss functions to prompt diverse embeddings to converge or diverge within a confined distance in the vector space, ultimately achieving a harmonious equilibrium. We conducted experiments on the proposed SE-GCL on four standard data sets (AG News, 20NG, SougouNews, and THUCNews) to verify its effectiveness in text representation learning.

CVOct 22, 2025
A Flow Model with Low-Rank Transformers for Incomplete Multimodal Survival Analysis

Yi Yin, Yuntao Shou, Zao Dai et al.

In recent years, multimodal medical data-based survival analysis has attracted much attention. However, real-world datasets often suffer from the problem of incomplete modality, where some patient modality information is missing due to acquisition limitations or system failures. Existing methods typically infer missing modalities directly from observed ones using deep neural networks, but they often ignore the distributional discrepancy across modalities, resulting in inconsistent and unreliable modality reconstruction. To address these challenges, we propose a novel framework that combines a low-rank Transformer with a flow-based generative model for robust and flexible multimodal survival prediction. Specifically, we first formulate the concerned problem as incomplete multimodal survival analysis using the multi-instance representation of whole slide images (WSIs) and genomic profiles. To realize incomplete multimodal survival analysis, we propose a class-specific flow for cross-modal distribution alignment. Under the condition of class labels, we model and transform the cross-modal distribution. By virtue of the reversible structure and accurate density modeling capabilities of the normalizing flow model, the model can effectively construct a distribution-consistent latent space of the missing modality, thereby improving the consistency between the reconstructed data and the true distribution. Finally, we design a lightweight Transformer architecture to model intra-modal dependencies while alleviating the overfitting problem in high-dimensional modality fusion by virtue of the low-rank Transformer. Extensive experiments have demonstrated that our method not only achieves state-of-the-art performance under complete modality settings, but also maintains robust and superior accuracy under the incomplete modalities scenario.

SDJun 14, 2025
GSDNet: Revisiting Incomplete Multimodal-Diffusion from Graph Spectrum Perspective for Conversation Emotion Recognition

Yuntao Shou, Jun Yao, Tao Meng et al.

Multimodal emotion recognition in conversations (MERC) aims to infer the speaker's emotional state by analyzing utterance information from multiple sources (i.e., video, audio, and text). Compared with unimodality, a more robust utterance representation can be obtained by fusing complementary semantic information from different modalities. However, the modality missing problem severely limits the performance of MERC in practical scenarios. Recent work has achieved impressive performance on modality completion using graph neural networks and diffusion models, respectively. This inspires us to combine these two dimensions through the graph diffusion model to obtain more powerful modal recovery capabilities. Unfortunately, existing graph diffusion models may destroy the connectivity and local structure of the graph by directly adding Gaussian noise to the adjacency matrix, resulting in the generated graph data being unable to retain the semantic and topological information of the original graph. To this end, we propose a novel Graph Spectral Diffusion Network (GSDNet), which maps Gaussian noise to the graph spectral space of missing modalities and recovers the missing data according to its original distribution. Compared with previous graph diffusion methods, GSDNet only affects the eigenvalues of the adjacency matrix instead of destroying the adjacency matrix directly, which can maintain the global topological information and important spectral features during the diffusion process. Extensive experiments have demonstrated that GSDNet achieves state-of-the-art emotion recognition performance in various modality loss scenarios.

CLMay 30, 2025
VietMix: A Naturally Occurring Vietnamese-English Code-Mixed Corpus with Iterative Augmentation for Machine Translation

Hieu Tran, Phuong-Anh Nguyen-Le, Huy Nghiem et al.

Machine translation systems fail when processing code-mixed inputs for low-resource languages. We address this challenge by curating VietMix, a parallel corpus of naturally occurring code-mixed Vietnamese text paired with expert English translations. Augmenting this resource, we developed a complementary synthetic data generation pipeline. This pipeline incorporates filtering mechanisms to ensure syntactic plausibility and pragmatic appropriateness in code-mixing patterns. Experimental validation shows our naturalistic and complementary synthetic data boost models' performance, measured by translation quality estimation scores, of up to 71.84 on COMETkiwi and 81.77 on XCOMET. Triangulating positive results with LLM-based assessments, augmented models are favored over seed fine-tuned counterparts in approximately 49% of judgments (54-56% excluding ties). VietMix and our augmentation methodology advance ecological validity in neural MT evaluations and establish a framework for addressing code-mixed translation challenges across other low-resource pairs.

CLMay 31, 2025
The Hidden Language of Harm: Examining the Role of Emojis in Harmful Online Communication and Content Moderation

Yuhang Zhou, Yimin Xiao, Wei Ai et al.

Social media platforms have become central to modern communication, yet they also harbor offensive content that challenges platform safety and inclusivity. While prior research has primarily focused on textual indicators of offense, the role of emojis, ubiquitous visual elements in online discourse, remains underexplored. Emojis, despite being rarely offensive in isolation, can acquire harmful meanings through symbolic associations, sarcasm, and contextual misuse. In this work, we systematically examine emoji contributions to offensive Twitter messages, analyzing their distribution across offense categories and how users exploit emoji ambiguity. To address this, we propose an LLM-powered, multi-step moderation pipeline that selectively replaces harmful emojis while preserving the tweet's semantic intent. Human evaluations confirm our approach effectively reduces perceived offensiveness without sacrificing meaning. Our analysis also reveals heterogeneous effects across offense types, offering nuanced insights for online communication and emoji moderation.

IRMay 28, 2025
Extracting Research Instruments from Educational Literature Using LLMs

Jiseung Yoo, Curran Mahowald, Meiyu Li et al.

Large Language Models (LLMs) are transforming information extraction from academic literature, offering new possibilities for knowledge management. This study presents an LLM-based system designed to extract detailed information about research instruments used in the education field, including their names, types, target respondents, measured constructs, and outcomes. Using multi-step prompting and a domain-specific data schema, it generates structured outputs optimized for educational research. Our evaluation shows that this system significantly outperforms other approaches, particularly in identifying instrument names and detailed information. This demonstrates the potential of LLM-powered information extraction in educational contexts, offering a systematic way to organize research instrument information. The ability to aggregate such information at scale enhances accessibility for researchers and education leaders, facilitating informed decision-making in educational research and policy.

LGJun 27, 2024
Efficient Long-distance Latent Relation-aware Graph Neural Network for Multi-modal Emotion Recognition in Conversations

Yuntao Shou, Wei Ai, Jiayi Du et al.

The task of multi-modal emotion recognition in conversation (MERC) aims to analyze the genuine emotional state of each utterance based on the multi-modal information in the conversation, which is crucial for conversation understanding. Existing methods focus on using graph neural networks (GNN) to model conversational relationships and capture contextual latent semantic relationships. However, due to the complexity of GNN, existing methods cannot efficiently capture the potential dependencies between long-distance utterances, which limits the performance of MERC. In this paper, we propose an Efficient Long-distance Latent Relation-aware Graph Neural Network (ELR-GNN) for multi-modal emotion recognition in conversations. Specifically, we first use pre-extracted text, video and audio features as input to Bi-LSTM to capture contextual semantic information and obtain low-level utterance features. Then, we use low-level utterance features to construct a conversational emotion interaction graph. To efficiently capture the potential dependencies between long-distance utterances, we use the dilated generalized forward push algorithm to precompute the emotional propagation between global utterances and design an emotional relation-aware operator to capture the potential semantic associations between different utterances. Furthermore, we combine early fusion and adaptive late fusion mechanisms to fuse latent dependency information between speaker relationship information and context. Finally, we obtain high-level discourse features and feed them into MLP for emotion prediction. Extensive experimental results show that ELR-GNN achieves state-of-the-art performance on the benchmark datasets IEMOCAP and MELD, with running times reduced by 52% and 35%, respectively.

CLJun 19, 2024
Multi-Stage Balanced Distillation: Addressing Long-Tail Challenges in Sequence-Level Knowledge Distillation

Yuhang Zhou, Jing Zhu, Paiheng Xu et al.

Large language models (LLMs) have significantly advanced various natural language processing tasks, but deploying them remains computationally expensive. Knowledge distillation (KD) is a promising solution, enabling the transfer of capabilities from larger teacher LLMs to more compact student models. In particular, sequence-level KD, which distills rationale-based reasoning processes instead of merely final outcomes, shows great potential in enhancing students' reasoning capabilities. However, current methods struggle with sequence-level KD under long-tailed data distributions, adversely affecting generalization on sparsely represented domains. We introduce the Multi-Stage Balanced Distillation (BalDistill) framework, which iteratively balances training data within a fixed computational budget. By dynamically selecting representative head domain examples and synthesizing tail domain examples, BalDistill achieves state-of-the-art performance across diverse long-tailed datasets, enhancing both the efficiency and efficacy of the distilled models.
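
The idea of balancing a long-tailed dataset under a fixed budget can be made concrete with a small sketch. The even split across domains used here is only one possible balancing policy and is an assumption, not the BalDistill procedure; the function name `balanced_quota` and the example domain names are likewise hypothetical.

```python
def balanced_quota(domain_counts, budget):
    """Split a fixed example budget evenly across domains.

    Returns {domain: (n_selected, n_to_synthesize)}: head domains are
    subsampled down to their quota, while tail domains use every available
    example and report the shortfall that would need to be synthesized.
    """
    if not domain_counts:
        return {}
    base, extra = divmod(budget, len(domain_counts))
    plan = {}
    # Sort by name so the leftover budget is assigned deterministically.
    for idx, (domain, available) in enumerate(sorted(domain_counts.items())):
        quota = base + (1 if idx < extra else 0)
        selected = min(available, quota)
        plan[domain] = (selected, quota - selected)
    return plan


# Hypothetical long-tailed dataset: one head domain and two tail domains.
plan = balanced_quota({"math": 900, "law": 80, "bio": 20}, budget=300)
```

In this sketch the head domain contributes 100 of its 900 examples, while the two tail domains fall short of the quota by 20 and 80 examples respectively.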

CLJun 8, 2024
Teaching-Assistant-in-the-Loop: Improving Knowledge Distillation from Imperfect Teacher Models in Low-Budget Scenarios

Yuhang Zhou, Wei Ai

There is increasing interest in distilling task-specific knowledge from large language models (LLMs) to smaller student models. Nonetheless, LLM distillation presents a dual challenge: 1) there is a high cost associated with querying the teacher LLM, such as GPT-4, for gathering an ample number of demonstrations; 2) the teacher LLM might provide imperfect outputs with a negative impact on the student's learning process. To enhance sample efficiency within resource-constrained, imperfect teacher scenarios, we propose a three-component framework leveraging three signal types. The first signal is the student's self-consistency (consistency across the student's multiple outputs), which is a proxy of the student's confidence. Specifically, we introduce a "teaching assistant" (TA) model to assess the uncertainty of both the student's and the teacher's outputs via confidence scoring, which serve as the other two signals for student training. Furthermore, we propose a two-stage training schema to first warm up the student with a small proportion of data to better utilize the student's signal. Experiments have shown the superiority of our proposed framework for four complex reasoning tasks. On average, our proposed two-stage framework brings a relative improvement of up to 20.79% compared to fine-tuning without any signals across datasets.
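
Self-consistency as a confidence proxy can be sketched as the share of sampled student answers that agree with the most common answer. The abstract does not give the exact formula, so this majority-agreement definition and the function name `self_consistency` are assumptions made for illustration.

```python
from collections import Counter


def self_consistency(answers):
    """Return (majority_answer, agreement), where agreement is the fraction
    of sampled answers equal to the most common one.

    When several answers tie, the one sampled first is returned.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    answer, freq = counts.most_common(1)[0]
    return answer, freq / len(answers)


# Four hypothetical samples from a student model for the same question.
majority, agreement = self_consistency(["42", "42", "41", "42"])
```

A low agreement value would mark an example where the student is unsure, which is the kind of case where spending budget on a teacher query is most likely to help.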