CLDec 21, 2022Code
Language Models as Inductive ReasonersZonglin Yang, Li Dong, Xinya Du et al. · microsoft-research
Inductive reasoning is a core component of human intelligence. In the past research of inductive reasoning within computer science, formal language is used as representations of knowledge (facts and rules, more specifically). However, formal language can cause systematic problems for inductive reasoning such as disability of handling raw input such as natural language, sensitiveness to mislabeled data, and incapacity to handle ambiguous input. To this end, we propose a new paradigm (task) for inductive reasoning, which is to induce natural language rules from natural language facts, and create a dataset termed DEER containing 1.2k rule-fact pairs for the task, where rules and facts are written in natural language. New automatic metrics are also proposed and analysed for the evaluation of this task. With DEER, we investigate a modern approach for inductive reasoning where we use natural language as representation for knowledge instead of formal language and use pretrained language models as ''reasoners''. Moreover, we provide the first and comprehensive analysis of how well pretrained language models can induce natural language rules from natural language facts. We also propose a new framework drawing insights from philosophy literature for this task, which we show in the experiment section that surpasses baselines in both automatic and human evaluations. We discuss about our future perspectives for inductive reasoning in Section 7. Dataset and code are available at https://github.com/ZonglinY/Inductive_Reasoning.
CLOct 9, 2023Code
A Survey of Large Language Models for Healthcare: from Data, Technology, and Applications to Accountability and EthicsKai He, Rui Mao, Qika Lin et al.
The utilization of large language models (LLMs) in the Healthcare domain has generated both excitement and concern due to their ability to effectively respond to freetext queries with certain professional knowledge. This survey outlines the capabilities of the currently developed LLMs for Healthcare and explicates their development process, with the aim of providing an overview of the development roadmap from traditional Pretrained Language Models (PLMs) to LLMs. Specifically, we first explore the potential of LLMs to enhance the efficiency and effectiveness of various Healthcare applications highlighting both the strengths and limitations. Secondly, we conduct a comparison between the previous PLMs and the latest LLMs, as well as comparing various LLMs with each other. Then we summarize related Healthcare training data, training methods, optimization strategies, and usage. Finally, the unique concerns associated with deploying LLMs in Healthcare settings are investigated, particularly regarding fairness, accountability, transparency and ethics. Our survey provide a comprehensive investigation from perspectives of both computer science and Healthcare specialty. Besides the discussion about Healthcare concerns, we supports the computer science community by compiling a collection of open source resources, such as accessible datasets, the latest methodologies, code implementations, and evaluation benchmarks in the Github. Summarily, we contend that a significant paradigm shift is underway, transitioning from PLMs to LLMs. This shift encompasses a move from discriminative AI approaches to generative AI approaches, as well as a shift from model-centered methodologies to data-centered methodologies. Also, we determine that the biggest obstacle of using LLMs in Healthcare are fairness, accountability, transparency and ethics.
AIAug 24, 2023
GPTEval: A Survey on Assessments of ChatGPT and GPT-4Rui Mao, Guanyi Chen, Xulang Zhang et al.
The emergence of ChatGPT has generated much speculation in the press about its potential to disrupt social and economic systems. Its astonishing language ability has aroused strong curiosity among scholars about its performance in different domains. There have been many studies evaluating the ability of ChatGPT and GPT-4 in different tasks and disciplines. However, a comprehensive review summarizing the collective assessment findings is lacking. The objective of this survey is to thoroughly analyze prior assessments of ChatGPT and GPT-4, focusing on its language and reasoning abilities, scientific knowledge, and ethical considerations. Furthermore, an examination of the existing evaluation methods is conducted, offering several recommendations for future research in evaluating large language models.
CLSep 6, 2023
Large Language Models for Automated Open-domain Scientific Hypotheses DiscoveryZonglin Yang, Xinya Du, Junxian Li et al.
Hypothetical induction is recognized as the main reasoning type when scientists make observations about the world and try to propose hypotheses to explain those observations. Past research on hypothetical induction is under a constrained setting: (1) the observation annotations in the dataset are carefully manually handpicked sentences (resulting in a close-domain setting); and (2) the ground truth hypotheses are mostly commonsense knowledge, making the task less challenging. In this work, we tackle these problems by proposing the first dataset for social science academic hypotheses discovery, with the final goal to create systems that automatically generate valid, novel, and helpful scientific hypotheses, given only a pile of raw web corpus. Unlike previous settings, the new dataset requires (1) using open-domain data (raw web corpus) as observations; and (2) proposing hypotheses even new to humanity. A multi-module framework is developed for the task, including three different feedback mechanisms to boost performance, which exhibits superior performance in terms of both GPT-4 based and expert-based evaluation. To the best of our knowledge, this is the first work showing that LLMs are able to generate novel (''not existing in literature'') and valid (''reflecting reality'') scientific hypotheses.
CLJun 16, 2023
Are Large Language Models Really Good Logical Reasoners? A Comprehensive Evaluation and BeyondFangzhi Xu, Qika Lin, Jiawei Han et al.
Logical reasoning consistently plays a fundamental and significant role in the domains of knowledge engineering and artificial intelligence. Recently, Large Language Models (LLMs) have emerged as a noteworthy innovation in natural language processing (NLP). However, the question of whether LLMs can effectively address the task of logical reasoning, which requires gradual cognitive inference similar to human intelligence, remains unanswered. To this end, we aim to bridge this gap and provide comprehensive evaluations in this paper. Firstly, to offer systematic evaluations, we select fifteen typical logical reasoning datasets and organize them into deductive, inductive, abductive and mixed-form reasoning settings. Considering the comprehensiveness of evaluations, we include 3 early-era representative LLMs and 4 trending LLMs. Secondly, different from previous evaluations relying only on simple metrics (e.g., \emph{accuracy}), we propose fine-level evaluations in objective and subjective manners, covering both answers and explanations, including \emph{answer correctness}, \emph{explain correctness}, \emph{explain completeness} and \emph{explain redundancy}. Additionally, to uncover the logical flaws of LLMs, problematic cases will be attributed to five error types from two dimensions, i.e., \emph{evidence selection process} and \emph{reasoning process}. Thirdly, to avoid the influences of knowledge bias and concentrate purely on benchmarking the logical reasoning capability of LLMs, we propose a new dataset with neutral content. Based on the in-depth evaluations, this paper finally forms a general evaluation scheme of logical reasoning capability from six dimensions (i.e., \emph{Correct}, \emph{Rigorous}, \emph{Self-aware}, \emph{Active}, \emph{Oriented} and \emph{No hallucination}). It reflects the pros and cons of LLMs and gives guiding directions for future works.
CLApr 18, 2023
MER 2023: Multi-label Learning, Modality Robustness, and Semi-Supervised LearningZheng Lian, Haiyang Sun, Licai Sun et al.
The first Multimodal Emotion Recognition Challenge (MER 2023) was successfully held at ACM Multimedia. The challenge focuses on system robustness and consists of three distinct tracks: (1) MER-MULTI, where participants are required to recognize both discrete and dimensional emotions; (2) MER-NOISE, in which noise is added to test videos for modality robustness evaluation; (3) MER-SEMI, which provides a large amount of unlabeled samples for semi-supervised learning. In this paper, we introduce the motivation behind this challenge, describe the benchmark dataset, and provide some statistics about participants. To continue using this dataset after MER 2023, please sign a new End User License Agreement and send it to our official email address merchallenge.contact@gmail.com. We believe this high-quality dataset can become a new benchmark in multimodal emotion recognition, especially for the Chinese research community.
CLMar 3, 2023
Will Affective Computing Emerge from Foundation Models and General AI? A First Evaluation on ChatGPTMostafa M. Amin, Erik Cambria, Björn W. Schuller
ChatGPT has shown the potential of emerging general artificial intelligence capabilities, as it has demonstrated competent performance across many natural language processing tasks. In this work, we evaluate the capabilities of ChatGPT to perform text classification on three affective computing problems, namely, big-five personality prediction, sentiment analysis, and suicide tendency detection. We utilise three baselines, a robust language model (RoBERTa-base), a legacy word model with pretrained embeddings (Word2Vec), and a simple bag-of-words baseline (BoW). Results show that the RoBERTa trained for a specific downstream task generally has a superior performance. On the other hand, ChatGPT provides decent results, and is relatively comparable to the Word2Vec and BoW baselines. ChatGPT further shows robustness against noisy data, where Word2Vec models achieve worse results due to noise. Results indicate that ChatGPT is a good generalist model that is capable of achieving good results across various problems without any specialised training, however, it is not as good as a specialised model for a downstream task.
CLSep 15, 2022
Hierarchical Attention Network for Explainable Depression Detection on Twitter Aided by Metaphor Concept MappingsSooji Han, Rui Mao, Erik Cambria
Automatic depression detection on Twitter can help individuals privately and conveniently understand their mental health status in the early stages before seeing mental health professionals. Most existing black-box-like deep learning methods for depression detection largely focused on improving classification performance. However, explaining model decisions is imperative in health research because decision-making can often be high-stakes and life-and-death. Reliable automatic diagnosis of mental health problems including depression should be supported by credible explanations justifying models' predictions. In this work, we propose a novel explainable model for depression detection on Twitter. It comprises a novel encoder combining hierarchical attention mechanisms and feed-forward neural networks. To support psycholinguistic studies, our model leverages metaphorical concept mappings as input. Thus, it not only detects depressed individuals, but also identifies features of such users' tweets and associated metaphor concept mappings.
AISep 21, 2023
A Comprehensive Review on Financial Explainable AIWei Jie Yeo, Wihan van der Heever, Rui Mao et al.
The success of artificial intelligence (AI), and deep learning models in particular, has led to their widespread adoption across various industries due to their ability to process huge amounts of data and learn complex patterns. However, due to their lack of explainability, there are significant concerns regarding their use in critical sectors, such as finance and healthcare, where decision-making transparency is of paramount importance. In this paper, we provide a comparative survey of methods that aim to improve the explainability of deep learning models within the context of finance. We categorize the collection of explainable AI methods according to their corresponding characteristics, and we review the concerns and challenges of adopting explainable AI methods, together with future directions we deemed appropriate and important.
LGJun 23, 2022
The MuSe 2022 Multimodal Sentiment Analysis Challenge: Humor, Emotional Reactions, and StressLukas Christ, Shahin Amiriparian, Alice Baird et al.
The Multimodal Sentiment Analysis Challenge (MuSe) 2022 is dedicated to multimodal sentiment and emotion recognition. For this year's challenge, we feature three datasets: (i) the Passau Spontaneous Football Coach Humor (Passau-SFCH) dataset that contains audio-visual recordings of German football coaches, labelled for the presence of humour; (ii) the Hume-Reaction dataset in which reactions of individuals to emotional stimuli have been annotated with respect to seven emotional expression intensities, and (iii) the Ulm-Trier Social Stress Test (Ulm-TSST) dataset comprising of audio-visual data labelled with continuous emotion values (arousal and valence) of people in stressful dispositions. Using the introduced datasets, MuSe 2022 2022 addresses three contemporary affective computing problems: in the Humor Detection Sub-Challenge (MuSe-Humor), spontaneous humour has to be recognised; in the Emotional Reactions Sub-Challenge (MuSe-Reaction), seven fine-grained `in-the-wild' emotions have to be predicted; and in the Emotional Stress Sub-Challenge (MuSe-Stress), a continuous prediction of stressed emotion values is featured. The challenge is designed to attract different research communities, encouraging a fusion of their disciplines. Mainly, MuSe 2022 targets the communities of audio-visual emotion recognition, health informatics, and symbolic sentiment analysis. This baseline paper describes the datasets as well as the feature sets extracted from them. A recurrent neural network with LSTM cells is used to set competitive baseline results on the test partitions for each sub-challenge. We report an Area Under the Curve (AUC) of .8480 for MuSe-Humor; .2801 mean (from 7-classes) Pearson's Correlations Coefficient for MuSe-Reaction, as well as .4931 Concordance Correlation Coefficient (CCC) and .4761 for valence and arousal in MuSe-Stress, respectively.
AIAug 26, 2023
A Wide Evaluation of ChatGPT on Affective Computing TasksMostafa M. Amin, Rui Mao, Erik Cambria et al.
With the rise of foundation models, a new artificial intelligence paradigm has emerged, by simply using general purpose foundation models with prompting to solve problems instead of training a separate machine learning model for each problem. Such models have been shown to have emergent properties of solving problems that they were not initially trained on. The studies for the effectiveness of such models are still quite limited. In this work, we widely study the capabilities of the ChatGPT models, namely GPT-4 and GPT-3.5, on 13 affective computing problems, namely aspect extraction, aspect polarity classification, opinion extraction, sentiment analysis, sentiment intensity ranking, emotions intensity ranking, suicide tendency detection, toxicity detection, well-being assessment, engagement measurement, personality assessment, sarcasm detection, and subjectivity detection. We introduce a framework to evaluate the ChatGPT models on regression-based problems, such as intensity ranking problems, by modelling them as pairwise ranking classification. We compare ChatGPT against more traditional NLP methods, such as end-to-end recurrent neural networks and transformers. The results demonstrate the emergent abilities of the ChatGPT models on a wide range of affective computing problems, where GPT-3.5 and especially GPT-4 have shown strong performance on many problems, particularly the ones related to sentiment, emotions, or toxicity. The ChatGPT models fell short for problems with implicit signals, such as engagement measurement and subjectivity detection.
CVApr 22, 2023
A Review of Deep Learning for Video CaptioningMoloud Abdar, Meenakshi Kollati, Swaraja Kuraparthi et al.
Video captioning (VC) is a fast-moving, cross-disciplinary area of research that bridges work in the fields of computer vision, natural language processing (NLP), linguistics, and human-computer interaction. In essence, VC involves understanding a video and describing it with language. Captioning is used in a host of applications from creating more accessible interfaces (e.g., low-vision navigation) to video question answering (V-QA), video retrieval and content generation. This survey covers deep learning-based VC, including but, not limited to, attention-based architectures, graph networks, reinforcement learning, adversarial networks, dense video captioning (DVC), and more. We discuss the datasets and evaluation metrics used in the field, and limitations, applications, challenges, and future directions for VC.
CLOct 22, 2023
A Survey on Semantic Processing TechniquesRui Mao, Kai He, Xulang Zhang et al.
Semantic processing is a fundamental research domain in computational linguistics. In the era of powerful pre-trained language models and large language models, the advancement of research in this domain appears to be decelerating. However, the study of semantics is multi-dimensional in linguistics. The research depth and breadth of computational semantic processing can be largely improved with new technologies. In this survey, we analyzed five semantic processing tasks, e.g., word sense disambiguation, anaphora resolution, named entity recognition, concept extraction, and subjectivity detection. We study relevant theoretical research in these fields, advanced methods, and downstream applications. We connect the surveyed tasks with downstream applications because this may inspire future scholars to fuse these low-level semantic processing tasks with high-level natural language processing tasks. The review of theoretical research may also inspire new tasks and technologies in the semantic processing domain. Finally, we compare the different semantic processing techniques and summarize their technical trends, application trends, and future directions.
CLApr 20, 2023
Domain-specific Continued Pretraining of Language Models for Capturing Long Context in Mental HealthShaoxiong Ji, Tianlin Zhang, Kailai Yang et al.
Pretrained language models have been used in various natural language processing applications. In the mental health domain, domain-specific language models are pretrained and released, which facilitates the early detection of mental health conditions. Social posts, e.g., on Reddit, are usually long documents. However, there are no domain-specific pretrained models for long-sequence modeling in the mental health domain. This paper conducts domain-specific continued pretraining to capture the long context for mental health. Specifically, we train and release MentalXLNet and MentalLongformer based on XLNet and Longformer. We evaluate the mental health classification performance and the long-range ability of these two domain-specific pretrained models. Our models are released in HuggingFace.
CLMar 21, 2023
Logical Reasoning over Natural Language as Knowledge Representation: A SurveyZonglin Yang, Xinya Du, Rui Mao et al.
Logical reasoning is central to human cognition and intelligence. It includes deductive, inductive, and abductive reasoning. Past research of logical reasoning within AI uses formal language as knowledge representation and symbolic reasoners. However, reasoning with formal language has proved challenging (e.g., brittleness and knowledge-acquisition bottleneck). This paper provides a comprehensive overview on a new paradigm of logical reasoning, which uses natural language as knowledge representation and pretrained language models as reasoners, including philosophical definition and categorization of logical reasoning, advantages of the new paradigm, benchmarks and methods, challenges of the new paradigm, possible future directions, and relation to related NLP fields. This new paradigm is promising since it not only alleviates many challenges of formal representation but also has advantages over end-to-end neural methods. This survey focus on transformer-based LLMs explicitly working on deductive, inductive, and abductive reasoning over English representation.
CLApr 14
Better and Worse with Scale: How Contextual Entrainment Diverges with Model SizeDikshant Kukreja, Kshitij Sah, Gautam Gupta et al. · deepmind, utoronto
Larger language models become simultaneously better and worse at handling contextual information -- better at ignoring false claims, worse at ignoring irrelevant tokens. We formalize this apparent paradox through the first scaling laws for contextual entrainment, the tendency of models to favor tokens that appeared in context regardless of relevance. Analyzing the Cerebras-GPT (111M-13B) and Pythia (410M-12B) model families, we find entrainment follows predictable power-law scaling, but with opposite trends depending on context type: semantic contexts show decreasing entrainment with scale, while non-semantic contexts show increasing entrainment. Concretely, the largest models are four times more resistant to counterfactual misinformation than the smallest, yet simultaneously twice as prone to copying arbitrary tokens. These diverging trends, which replicate across model families, suggest that semantic filtering and mechanical copying are functionally distinct behaviors that scale in opposition -- scaling alone does not resolve context sensitivity, it reshapes it.
ASMay 3, 2022
The ICML 2022 Expressive Vocalizations Workshop and Competition: Recognizing, Generating, and Personalizing Vocal BurstsAlice Baird, Panagiotis Tzirakis, Gauthier Gidel et al.
The ICML Expressive Vocalization (ExVo) Competition is focused on understanding and generating vocal bursts: laughs, gasps, cries, and other non-verbal vocalizations that are central to emotional expression and communication. ExVo 2022, includes three competition tracks using a large-scale dataset of 59,201 vocalizations from 1,702 speakers. The first, ExVo-MultiTask, requires participants to train a multi-task model to recognize expressed emotions and demographic traits from vocal bursts. The second, ExVo-Generate, requires participants to train a generative model that produces vocal bursts conveying ten different emotions. The third, ExVo-FewShot, requires participants to leverage few-shot learning incorporating speaker identity to train a model for the recognition of 10 emotions conveyed by vocal bursts. This paper describes the three tracks and provides performance measures for baseline models using state-of-the-art machine learning strategies. The baseline for each track is as follows, for ExVo-MultiTask, a combined score, computing the harmonic mean of Concordance Correlation Coefficient (CCC), Unweighted Average Recall (UAR), and inverted Mean Absolute Error (MAE) ($S_{MTL}$) is at best, 0.335 $S_{MTL}$; for ExVo-Generate, we report Fréchet inception distance (FID) scores ranging from 4.81 to 8.27 (depending on the emotion) between the training set and generated samples. We then combine the inverted FID with perceptual ratings of the generated samples ($S_{Gen}$) and obtain 0.174 $S_{Gen}$; and for ExVo-FewShot, a mean CCC of 0.444 is obtained.
CLNov 17, 2022
ConNER: Consistency Training for Cross-lingual Named Entity RecognitionRan Zhou, Xin Li, Lidong Bing et al.
Cross-lingual named entity recognition (NER) suffers from data scarcity in the target languages, especially under zero-shot settings. Existing translate-train or knowledge distillation methods attempt to bridge the language gap, but often introduce a high level of noise. To solve this problem, consistency training methods regularize the model to be robust towards perturbations on data or hidden states. However, such methods are likely to violate the consistency hypothesis, or mainly focus on coarse-grain consistency. We propose ConNER as a novel consistency training framework for cross-lingual NER, which comprises of: (1) translation-based consistency training on unlabeled target-language data, and (2) dropoutbased consistency training on labeled source-language data. ConNER effectively leverages unlabeled target-language data and alleviates overfitting on the source language to enhance the cross-lingual adaptability. Experimental results show our ConNER achieves consistent improvement over various baseline methods.
LGMay 2, 2022
Deep-Attack over the Deep Reinforcement LearningYang Li, Quan Pan, Erik Cambria
Recent adversarial attack developments have made reinforcement learning more vulnerable, and different approaches exist to deploy attacks against it, where the key is how to choose the right timing of the attack. Some work tries to design an attack evaluation function to select critical points that will be attacked if the value is greater than a certain threshold. This approach makes it difficult to find the right place to deploy an attack without considering the long-term impact. In addition, there is a lack of appropriate indicators of assessment during attacks. To make the attacks more intelligent as well as to remedy the existing problems, we propose the reinforcement learning-based attacking framework by considering the effectiveness and stealthy spontaneously, while we also propose a new metric to evaluate the performance of the attack model in these two aspects. Experimental results show the effectiveness of our proposed model and the goodness of our proposed evaluation metric. Furthermore, we validate the transferability of the model, and also its robustness under the adversarial training.
IRJun 22, 2023
Recent Developments in Recommender Systems: A SurveyYang Li, Kangbo Liu, Ranjan Satapathy et al.
In this technical survey, we comprehensively summarize the latest advancements in the field of recommender systems. The objective of this study is to provide an overview of the current state-of-the-art in the field and highlight the latest trends in the development of recommender systems. The study starts with a comprehensive summary of the main taxonomy of recommender systems, including personalized and group recommender systems, and then delves into the category of knowledge-based recommender systems. In addition, the survey analyzes the robustness, data bias, and fairness issues in recommender systems, summarizing the evaluation metrics used to assess the performance of these systems. Finally, the study provides insights into the latest trends in the development of recommender systems and highlights the new directions for future research in the field.
CLJul 6, 2023
Can ChatGPT's Responses Boost Traditional Natural Language Processing?Mostafa M. Amin, Erik Cambria, Björn W. Schuller
The employment of foundation models is steadily expanding, especially with the launch of ChatGPT and the release of other foundation models. These models have shown the potential of emerging capabilities to solve problems, without being particularly trained to solve. A previous work demonstrated these emerging capabilities in affective computing tasks; the performance quality was similar to traditional Natural Language Processing (NLP) techniques, but falling short of specialised trained models, like fine-tuning of the RoBERTa language model. In this work, we extend this by exploring if ChatGPT has novel knowledge that would enhance existing specialised models when they are fused together. We achieve this by investigating the utility of verbose responses from ChatGPT about solving a downstream task, in addition to studying the utility of fusing that with existing NLP methods. The study is conducted on three affective computing problems, namely sentiment analysis, suicide tendency detection, and big-five personality assessment. The results conclude that ChatGPT has indeed novel knowledge that can improve existing NLP techniques by way of fusion, be it early or late fusion.
CLAug 18, 2024
PanoSent: A Panoptic Sextuple Extraction Benchmark for Multimodal Conversational Aspect-based Sentiment AnalysisMeng Luo, Hao Fei, Bobo Li et al.
While existing Aspect-based Sentiment Analysis (ABSA) has received extensive effort and advancement, there are still gaps in defining a more holistic research target seamlessly integrating multimodality, conversation context, fine-granularity, and also covering the changing sentiment dynamics as well as cognitive causal rationales. This paper bridges the gaps by introducing a multimodal conversational ABSA, where two novel subtasks are proposed: 1) Panoptic Sentiment Sextuple Extraction, panoramically recognizing holder, target, aspect, opinion, sentiment, rationale from multi-turn multi-party multimodal dialogue. 2) Sentiment Flipping Analysis, detecting the dynamic sentiment transformation throughout the conversation with the causal reasons. To benchmark the tasks, we construct PanoSent, a dataset annotated both manually and automatically, featuring high quality, large scale, multimodality, multilingualism, multi-scenarios, and covering both implicit and explicit sentiment elements. To effectively address the tasks, we devise a novel Chain-of-Sentiment reasoning framework, together with a novel multimodal large language model (namely Sentica) and a paraphrase-based verification mechanism. Extensive evaluations demonstrate the superiority of our methods over strong baselines, validating the efficacy of all our proposed methods. The work is expected to open up a new era for the ABSA community, and thus all our codes and data are open at https://PanoSent.github.io/
CLAug 27, 2024
Negation Blindness in Large Language Models: Unveiling the NO Syndrome in Image GenerationMohammad Nadeem, Shahab Saquib Sohail, Erik Cambria et al.
Foundational Large Language Models (LLMs) have changed the way we perceive technology. They have been shown to excel in tasks ranging from poem writing and coding to essay generation and puzzle solving. With the incorporation of image generation capability, they have become more comprehensive and versatile AI tools. At the same time, researchers are striving to identify the limitations of these tools to improve them further. Currently identified flaws include hallucination, biases, and bypassing restricted commands to generate harmful content. In the present work, we have identified a fundamental limitation related to the image generation ability of LLMs, and termed it The NO Syndrome. This negation blindness refers to LLMs inability to correctly comprehend NO related natural language prompts to generate the desired images. Interestingly, all tested LLMs including GPT-4, Gemini, and Copilot were found to be suffering from this syndrome. To demonstrate the generalization of this limitation, we carried out simulation experiments and conducted entropy-based and benchmark statistical analysis tests on various LLMs in multiple languages, including English, Hindi, and French. We conclude that the NO syndrome is a significant flaw in current LLMs that needs to be addressed. A related finding of this study showed a consistent discrepancy between image and textual responses as a result of this NO syndrome. We posit that the introduction of a negation context-aware reinforcement learning based feedback loop between the LLMs textual response and generated image could help ensure the generated text is based on both the LLMs correct contextual understanding of the negation query and the generated visual output.
CLMar 5, 2023
FinXABSA: Explainable Finance through Aspect-Based Sentiment AnalysisKeane Ong, Wihan van der Heever, Ranjan Satapathy et al.
This paper presents a novel approach for explainability in financial analysis by deriving financially-explainable statistical relationships through aspect-based sentiment analysis, Pearson correlation, Granger causality & uncertainty coefficient. The proposed methodology involves constructing an aspect list from financial literature and applying aspect-based sentiment analysis on social media text to compute sentiment scores for each aspect. Pearson correlation is then applied to uncover financially explainable relationships between aspect sentiment scores and stock prices. Findings for derived relationships are made robust by applying Granger causality to determine the forecasting ability of each aspect sentiment score for stock prices. Finally, an added layer of interpretability is added by evaluating uncertainty coefficient scores between aspect sentiment scores and stock prices. This allows us to determine the aspects whose sentiment scores are most statistically significant for stock prices. Relative to other methods, our approach provides a more informative and accurate understanding of the relationship between sentiment analysis and stock prices. Specifically, this methodology enables an interpretation of the statistical relationship between aspect-based sentiment scores and stock prices, which offers explainability to AI-driven financial decision-making.
CVJul 19, 2023
TbExplain: A Text-based Explanation Method for Scene Classification Models with the Statistical Prediction CorrectionAmirhossein Aminimehr, Pouya Khani, Amirali Molaei et al.
The field of Explainable Artificial Intelligence (XAI) aims to improve the interpretability of black-box machine learning models. Building a heatmap based on the importance value of input features is a popular method for explaining the underlying functions of such models in producing their predictions. Heatmaps are almost understandable to humans, yet they are not without flaws. Non-expert users, for example, may not fully understand the logic of heatmaps (the logic in which relevant pixels to the model's prediction are highlighted with different intensities or colors). Additionally, objects and regions of the input image that are relevant to the model prediction are frequently not entirely differentiated by heatmaps. In this paper, we propose a framework called TbExplain that employs XAI techniques and a pre-trained object detector to present text-based explanations of scene classification models. Moreover, TbExplain incorporates a novel method to correct predictions and textually explain them based on the statistics of objects in the input image when the initial prediction is unreliable. To assess the trustworthiness and validity of the text-based explanations, we conducted a qualitative experiment, and the findings indicated that these explanations are sufficiently reliable. Furthermore, our quantitative and qualitative experiments on TbExplain with scene classification datasets reveal an improvement in classification accuracy over ResNet variants.
CLMar 7, 2023
Adaptive Knowledge Distillation between Text and Speech Pre-trained ModelsJinjie Ni, Yukun Ma, Wen Wang et al.
Learning on a massive amount of speech corpus leads to the recent success of many self-supervised speech models. With knowledge distillation, these models may also benefit from the knowledge encoded by language models that are pre-trained on rich sources of texts. The distillation process, however, is challenging due to the modal disparity between textual and speech embedding spaces. This paper studies metric-based distillation to align the embedding space of text and speech with only a small amount of data without modifying the model structure. Since the semantic and granularity gap between text and speech has been omitted in literature, which impairs the distillation, we propose the Prior-informed Adaptive knowledge Distillation (PAD) that adaptively leverages text/speech units of variable granularity and prior distributions to achieve better global and local alignments between text and speech pre-trained models. We evaluate on three spoken language understanding benchmarks to show that PAD is more effective in transferring linguistic knowledge than other metric-based distillation approaches.
AISep 26, 2022
Knowledge Representation for Conceptual, Motivational, and Affective Processes in Natural Language CommunicationSeng-Beng Ho, Zhaoxia Wang, Boon-Kiat Quek et al.
Natural language communication is an intricate and complex process. The speaker usually begins with an intention and motivation of what is to be communicated, and what effects are expected from the communication, while taking into consideration the listener's mental model to concoct an appropriate sentence. The listener likewise has to interpret what the speaker means, and respond accordingly, also with the speaker's mental state in mind. To do this successfully, conceptual, motivational, and affective processes have to be represented appropriately to drive the language generation and understanding processes. Language processing has succeeded well with the big data approach in applications such as chatbots and machine translation. However, in human-robot collaborative social communication and in using natural language for delivering precise instructions to robots, a deeper representation of the conceptual, motivational, and affective processes is needed. This paper capitalizes on the UGALRS (Unified General Autonomous and Language Reasoning System) framework and the CD+ (Conceptual Representation Plus) representational scheme to illustrate how social communication through language is supported by a knowledge representational scheme that handles conceptual, motivational, and affective processes in a deep and general way. Though a small set of concepts, motivations, and emotions is treated in this paper, its main contribution is in articulating a general framework of knowledge representation and processing to link these aspects together in serving the purpose of natural language communication for an intelligent system.
CVJul 23, 2023
EnTri: Ensemble Learning with Tri-level Representations for Explainable Scene RecognitionAmirhossein Aminimehr, Amirali Molaei, Erik Cambria
Scene recognition based on deep-learning has made significant progress, but there are still limitations in its performance due to challenges posed by inter-class similarities and intra-class dissimilarities. Furthermore, prior research has primarily focused on improving classification accuracy, yet it has given less attention to achieving interpretable, precise scene classification. Therefore, we are motivated to propose EnTri, an ensemble scene recognition framework that employs ensemble learning using a hierarchy of visual features. EnTri represents features at three distinct levels of detail: pixel-level, semantic segmentation-level, and object class and frequency level. By incorporating distinct feature encoding schemes of differing complexity and leveraging ensemble strategies, our approach aims to improve classification accuracy while enhancing transparency and interpretability via visual and textual explanations. To achieve interpretability, we devised an extension algorithm that generates both visual and textual explanations highlighting various properties of a given scene that contribute to the final prediction of its category. This includes information about objects, statistics, spatial layout, and textural details. Through experiments on benchmark scene classification datasets, EnTri has demonstrated superiority in terms of recognition accuracy, achieving competitive performance compared to state-of-the-art approaches, with an accuracy of 87.69%, 75.56%, and 99.17% on the MIT67, SUN397, and UIUC8 datasets, respectively.
CLDec 4, 2025Code
LexGenius: An Expert-Level Benchmark for Large Language Models in Legal General IntelligenceWenjin Liu, Haoran Luo, Xin Feng et al.
Legal general intelligence (GI) refers to artificial intelligence (AI) that encompasses legal understanding, reasoning, and decision-making, simulating the expertise of legal experts across domains. However, existing benchmarks are result-oriented and fail to systematically evaluate the legal intelligence of large language models (LLMs), hindering the development of legal GI. To address this, we propose LexGenius, an expert-level Chinese legal benchmark for evaluating legal GI in LLMs. It follows a Dimension-Task-Ability framework, covering seven dimensions, eleven tasks, and twenty abilities. We use the recent legal cases and exam questions to create multiple-choice questions with a combination of manual and LLM reviews to reduce data leakage risks, ensuring accuracy and reliability through multiple rounds of checks. We evaluate 12 state-of-the-art LLMs using LexGenius and conduct an in-depth analysis. We find significant disparities across legal intelligence abilities for LLMs, with even the best LLMs lagging behind human legal professionals. We believe LexGenius can assess the legal intelligence abilities of LLMs and enhance legal GI development. Our project is available at https://github.com/QwenQKing/LexGenius.
CLJul 21, 2024
XAI meets LLMs: A Survey of the Relation between Explainable AI and Large Language ModelsErik Cambria, Lorenzo Malandri, Fabio Mercorio et al.
In this survey, we address the key challenges in Large Language Models (LLM) research, focusing on the importance of interpretability. Driven by increasing interest from AI and business sectors, we highlight the need for transparency in LLMs. We examine the dual paths in current LLM research and eXplainable Artificial Intelligence (XAI): enhancing performance through XAI and the emerging focus on model interpretability. Our paper advocates for a balanced approach that values interpretability equally with functional advancements. Recognizing the rapid development in LLM research, our survey includes both peer-reviewed and preprint (arXiv) papers, offering a comprehensive overview of XAI's role in LLM research. We conclude by urging the research community to advance both LLM and XAI fields together.
CLNov 19, 2023
Rethinking Large Language Models in Mental Health ApplicationsShaoxiong Ji, Tianlin Zhang, Kailai Yang et al.
Large Language Models (LLMs) have become valuable assets in mental health, showing promise in both classification tasks and counseling applications. This paper offers a perspective on using LLMs in mental health applications. It discusses the instability of generative models for prediction and the potential for generating hallucinatory outputs, underscoring the need for ongoing audits and evaluations to maintain their reliability and dependability. The paper also distinguishes between the often interchangeable terms ``explainability'' and ``interpretability'', advocating for developing inherently interpretable methods instead of relying on potentially hallucinated self-explanations generated by LLMs. Despite the advancements in LLMs, human counselors' empathetic understanding, nuanced interpretation, and contextual awareness remain irreplaceable in the sensitive and complex realm of mental health counseling. The use of LLMs should be approached with a judicious and considerate mindset, viewing them as tools that complement human expertise rather than seeking to replace it.
AIAug 23, 2024
Has Multimodal Learning Delivered Universal Intelligence in Healthcare? A Comprehensive SurveyQika Lin, Yifan Zhu, Xin Mei et al.
The rapid development of artificial intelligence has constantly reshaped the field of intelligent healthcare and medicine. As a vital technology, multimodal learning has increasingly garnered interest due to data complementarity, comprehensive modeling form, and great application potential. Currently, numerous researchers are dedicating their attention to this field, conducting extensive studies and constructing abundant intelligent systems. Naturally, an open question arises that has multimodal learning delivered universal intelligence in healthcare? To answer the question, we adopt three unique viewpoints for a holistic analysis. Firstly, we conduct a comprehensive survey of the current progress of medical multimodal learning from the perspectives of datasets, task-oriented methods, and universal foundation models. Based on them, we further discuss the proposed question from five issues to explore the real impacts of advanced techniques in healthcare, from data and technologies to performance and ethics. The answer is that current technologies have NOT achieved universal intelligence and there remains a significant journey to undertake. Finally, in light of the above reviews and discussions, we point out ten potential directions for exploration towards the goal of universal intelligence in healthcare.
CYJul 3, 2024
Explainable Natural Language Processing for Corporate Sustainability AnalysisKeane Ong, Rui Mao, Ranjan Satapathy et al.
Sustainability commonly refers to entities, such as individuals, companies, and institutions, having a non-detrimental (or even positive) impact on the environment, society, and the economy. With sustainability becoming a synonym of acceptable and legitimate behaviour, it is being increasingly demanded and regulated. Several frameworks and standards have been proposed to measure the sustainability impact of corporations, including United Nations' sustainable development goals and the recently introduced global sustainability reporting framework, amongst others. However, the concept of corporate sustainability is complex due to the diverse and intricate nature of firm operations (i.e. geography, size, business activities, interlinks with other stakeholders). As a result, corporate sustainability assessments are plagued by subjectivity both within data that reflect corporate sustainability efforts (i.e. corporate sustainability disclosures) and the analysts evaluating them. This subjectivity can be distilled into distinct challenges, such as incompleteness, ambiguity, unreliability and sophistication on the data dimension, as well as limited resources and potential bias on the analyst dimension. Put together, subjectivity hinders effective cost attribution to entities non-compliant with prevailing sustainability expectations, potentially rendering sustainability efforts and its associated regulations futile. To this end, we argue that Explainable Natural Language Processing (XNLP) can significantly enhance corporate sustainability analysis. Specifically, linguistic understanding algorithms (lexical, semantic, syntactic), integrated with XAI capabilities (interpretability, explainability, faithfulness), can bridge gaps in analyst resources and mitigate subjectivity problems within data.
AIMay 13Code
SkillFlow: Flow-Driven Recursive Skill Evolution for Agentic OrchestrationMingda Zhang, Tiesunlong Shen, Haoran Luo et al.
In recent years, a variety of powerful LLM-based agentic systems have been applied to automate complex tasks through task orchestration. However, existing orchestration methods still face key challenges, including strategy collapse under reward maximization, high gradient variance with opaque credit assignment, and unguided skill evolution whose decisions are typically made by directly prompting an LLM to judge rather than derived from principled training signals. To address these challenges, we propose SkillFlow, a flow-based framework that takes a trainable Supervisor as the agent and a structured environment with dynamic skill library and frozen executor, automating task orchestration through multi-turn interaction. SkillFlow employs Tempered Trajectory Balance (TTB), a regression-based flow-matching loss that samples trajectories proportional to reward, preserving diverse orchestration strategies rather than collapsing to a single mode. The same flow objective yields a jointly learned backward policy that provides transparent per-step credit assignment at zero additional inference cost. Building on these flow diagnostics, a recursive skill evolution mechanism determines when to evolve, what skills to create or prune, and where decision gaps lie -- closing the loop from training signal to autonomous capability growth. Experimental results on 14 datasets show that SkillFlow significantly outperforms baselines across question answering, mathematical reasoning, code generation, and real-world interactive decision making tasks. Our code is available at https://anonymous.4open.science/r/SkillFlow-E850.
CLJul 19, 2024Code
Prompted Aspect Key Point Analysis for Quantitative Review SummarizationAn Quang Tang, Xiuzhen Zhang, Minh Ngoc Dinh et al.
Key Point Analysis (KPA) aims for quantitative summarization that provides key points (KPs) as succinct textual summaries and quantities measuring their prevalence. KPA studies for arguments and reviews have been reported in the literature. A majority of KPA studies for reviews adopt supervised learning to extract short sentences as KPs before matching KPs to review comments for quantification of KP prevalence. Recent abstractive approaches still generate KPs based on sentences, often leading to KPs with overlapping and hallucinated opinions, and inaccurate quantification. In this paper, we propose Prompted Aspect Key Point Analysis (PAKPA) for quantitative review summarization. PAKPA employs aspect sentiment analysis and prompted in-context learning with Large Language Models (LLMs) to generate and quantify KPs grounded in aspects for business entities, which achieves faithful KPs with accurate quantification, and removes the need for large amounts of annotated data for supervised training. Experiments on the popular review dataset Yelp and the aspect-oriented review summarization dataset SPACE show that our framework achieves state-of-the-art performance. Source code and data are available at: https://github.com/antangrocket1312/PAKPA
CYMar 21
The data heat island effect: quantifying the impact of AI data centers in a warming worldAndrea Marinoni, Pietro Lio', Erik Cambria et al.
The strong and continuous increase of AI-based services leads to the steady proliferation of AI data centres worldwide with the unavoidable escalation of their power consumption. It is unknown how this energy demand for computational purposes will impact the surrounding environment. Here, we focus our attention on the heat dissipation of AI hyperscalers. Taking advantage of land surface temperature measurements acquired by remote sensing platforms over the last decades, we are able to obtain a robust assessment of the temperature increase recorded in the areas surrounding AI data centres globally. We estimate that the land surface temperature increases by 2°C on average after the start of operations of an AI data centre, inducing local microclimate zones, which we call the data heat island effect. We assess the impact on the communities, quantifying that more than 340 million people could be affected by this temperature increase. Our results show that the data heat island effect could have a remarkable influence on communities and regional welfare in the future, hence becoming part of the conversation around environmentally sustainable AI worldwide.
CLFeb 9Code
Affective Flow Language Model for Emotional Support ConversationChenghui Zou, Ning Wang, Tiesunlong Shen et al.
Large language models (LLMs) have been widely applied to emotional support conversation (ESC). However, complex multi-turn support remains challenging.This is because existing alignment schemes rely on sparse outcome-level signals, thus offering limited supervision for intermediate strategy decisions. To fill this gap, this paper proposes affective flow language model for emotional support conversation (AFlow), a framework that introduces fine-grained supervision on dialogue prefixes by modeling a continuous affective flow along multi-turn trajectories. AFlow can estimate intermediate utility over searched trajectories and learn preference-consistent strategy transitions. To improve strategy coherence and empathetic response quality, a subpath-level flow-balance objective is presented to propagate preference signals to intermediate states. Experiment results show consistent and significant improvements over competitive baselines in diverse emotional contexts. Remarkably, AFlow with a compact open-source backbone outperforms proprietary LMMs such as GPT-4o and Claude-3.5 on major ESC metrics. Our code is available at https://github.com/chzou25-lgtm/AffectiveFlow.
CLNov 2, 2025Code
Prompt-R1: Collaborative Automatic Prompting Framework via End-to-end Reinforcement LearningWenjin Liu, Haoran Luo, Xueyuan Lin et al.
Recently, advanced large language models (LLMs) have emerged at an increasingly rapid pace. However, when faced with complex problems, most users are often unable to provide accurate and effective prompts to interact with LLMs, thus limiting the performance of LLMs. To address this challenge, we propose Prompt-R1, an end-to-end reinforcement learning framework that uses a small-scale LLM to collaborate with large-scale LLMs, replacing user interaction to solve problems better. This collaboration is cast as a multi-turn prompt interaction, where the small-scale LLM thinks and generates prompts, and the large-scale LLM performs complex reasoning. A dual-constrained reward is designed to optimize for correctness, generation quality, and reasoning accuracy. Prompt-R1 provides a plug-and-play framework that supports both inference and training with various large-scale LLMs. Experiments on multiple public datasets show that Prompt-R1 significantly outperforms baseline models across tasks. Our code is publicly available at https://github.com/QwenQKing/Prompt-R1.
HCMay 2
Intelligent Agents with Emotional Intelligence: Current Trends, Challenges, and Future ProspectsRaziyeh Zall, Alireza Kheyrkhah, Erik Cambria et al.
The development of agents with emotional intelligence is becoming increasingly vital due to their significant role in human-computer interaction and the growing integration of computer systems across various sectors of society. Affective computing aims to design intelligent systems that can recognize, evoke, and express human emotions, thereby emulating human emotional intelligence. While previous reviews have focused on specific aspects of this field, there has been limited comprehensive research that encompasses emotion understanding, elicitation, and expression, along with the related challenges. This survey addresses this gap by providing a holistic overview of core components of artificial emotion intelligence. It covers emotion understanding through multimodal data processing, as well as affective cognition, which includes cognitive appraisal, emotion mapping, and adaptive modulation in decision-making, learning, and reasoning. Additionally, it addresses the synthesis of emotional expression across text, speech, and facial modalities to enhance human-agent interaction. This paper identifies and analyzes the key challenges and issues encountered in the development of affective systems, covering state-of-the-art methodologies designed to address them. Finally, we highlight promising future directions, with particular emphasis on the potential of generative technologies to advance affective computing.
CVFeb 13
IndicFairFace: Balanced Indian Face Dataset for Auditing and Mitigating Geographical Bias in Vision-Language ModelsAarish Shah Mohsin, Mohammed Tayyab Ilyas Khan, Mohammad Nadeem et al.
Vision-Language Models (VLMs) are known to inherit and amplify societal biases from their web-scale training data with Indian being particularly misrepresented. Existing fairness-aware datasets have significantly improved demographic balance across global race and gender groups, yet they continue to treat Indian as a single monolithic category. The oversimplification ignores the vast intra-national diversity across 28 states and 8 Union Territories of India and leads to representational and geographical bias. To address the limitation, we present IndicFairFace, a novel and balanced face dataset comprising 14,400 images representing geographical diversity of India. Images were sourced ethically from Wikimedia Commons and open-license web repositories and uniformly balanced across states and gender. Using IndicFairFace, we quantify intra-national geographical bias in prominent CLIP-based VLMs and reduce it using post-hoc Iterative Nullspace Projection debiasing approach. We also show that the adopted debiasing approach does not adversely impact the existing embedding space as the average drop in retrieval accuracy on benchmark datasets is less than 1.5 percent. Our work establishes IndicFairFace as the first benchmark to study geographical bias in VLMs for the Indian context.
AIJan 15
LLMdoctor: Token-Level Flow-Guided Preference Optimization for Efficient Test-Time Alignment of Large Language ModelsTiesunlong Shen, Rui Mao, Jin Wang et al.
Aligning Large Language Models (LLMs) with human preferences is critical, yet traditional fine-tuning methods are computationally expensive and inflexible. While test-time alignment offers a promising alternative, existing approaches often rely on distorted trajectory-level signals or inefficient sampling, fundamentally capping performance and failing to preserve the generative diversity of the base model. This paper introduces LLMdoctor, a novel framework for efficient test-time alignment that operates via a patient-doctor paradigm. It integrates token-level reward acquisition with token-level flow-guided preference optimization (TFPO) to steer a large, frozen patient LLM with a smaller, specialized doctor model. Unlike conventional methods that rely on trajectory-level rewards, LLMdoctor first extracts fine-grained, token-level preference signals from the patient model's behavioral variations. These signals then guide the training of the doctor model via TFPO, which establishes flow consistency across all subtrajectories, enabling precise token-by-token alignment while inherently preserving generation diversity. Extensive experiments demonstrate that LLMdoctor significantly outperforms existing test-time alignment methods and even surpasses the performance of full fine-tuning approaches like DPO.
CLFeb 19, 2024Code
How Interpretable are Reasoning Explanations from Prompting Large Language Models?Wei Jie Yeo, Ranjan Satapathy, Rick Siow Mong Goh et al.
Prompt Engineering has garnered significant attention for enhancing the performance of large language models across a multitude of tasks. Techniques such as the Chain-of-Thought not only bolster task performance but also delineate a clear trajectory of reasoning steps, offering a tangible form of explanation for the audience. Prior works on interpretability assess the reasoning chains yielded by Chain-of-Thought solely along a singular axis, namely faithfulness. We present a comprehensive and multifaceted evaluation of interpretability, examining not only faithfulness but also robustness and utility across multiple commonsense reasoning benchmarks. Likewise, our investigation is not confined to a single prompting technique; it expansively covers a multitude of prevalent prompting techniques employed in large language models, thereby ensuring a wide-ranging and exhaustive evaluation. In addition, we introduce a simple interpretability alignment technique, termed Self-Entailment-Alignment Chain-of-thought, that yields more than 70\% improvements across multiple dimensions of interpretability. Code is available at https://github.com/SenticNet/CoT_interpretability
CLSep 11, 2024
Recent Trends of Multimodal Affective Computing: A Survey from NLP PerspectiveGuimin Hu, Yi Xin, Weimin Lyu et al.
Multimodal affective computing (MAC) has garnered increasing attention due to its broad applications in analyzing human behaviors and intentions, especially in text-dominated multimodal affective computing field. This survey presents the recent trends of multimodal affective computing from NLP perspective through four hot tasks: multimodal sentiment analysis, multimodal emotion recognition in conversation, multimodal aspect-based sentiment analysis and multimodal multi-label emotion recognition. The goal of this survey is to explore the current landscape of multimodal affective research, identify development trends, and highlight the similarities and differences across various tasks, offering a comprehensive report on the recent progress in multimodal affective computing from an NLP perspective. This survey covers the formalization of tasks, provides an overview of relevant works, describes benchmark datasets, and details the evaluation metrics for each task. Additionally, it briefly discusses research in multimodal affective computing involving facial expressions, acoustic signals, physiological signals, and emotion causes. Additionally, we discuss the technical approaches, challenges, and future directions in multimodal affective computing. To support further research, we released a repository that compiles related works in multimodal affective computing, providing detailed resources and references for the community.
LGApr 26, 2024Code
MER 2024: Semi-Supervised Learning, Noise Robustness, and Open-Vocabulary Multimodal Emotion RecognitionZheng Lian, Haiyang Sun, Licai Sun et al.
Multimodal emotion recognition is an important research topic in artificial intelligence. Over the past few decades, researchers have made remarkable progress by increasing the dataset size and building more effective algorithms. However, due to problems such as complex environments and inaccurate annotations, current systems are hard to meet the demands of practical applications. Therefore, we organize the MER series of competitions to promote the development of this field. Last year, we launched MER2023, focusing on three interesting topics: multi-label learning, noise robustness, and semi-supervised learning. In this year's MER2024, besides expanding the dataset size, we further introduce a new track around open-vocabulary emotion recognition. The main purpose of this track is that existing datasets usually fix the label space and use majority voting to enhance the annotator consistency. However, this process may lead to inaccurate annotations, such as ignoring non-majority or non-candidate labels. In this track, we encourage participants to generate any number of labels in any category, aiming to describe emotional states as accurately as possible. Our baseline code relies on MERTools and is available at: https://github.com/zeroQiaoba/MERTools/tree/master/MER2024.
CLFeb 29, 2024Code
SemEval 2024 -- Task 10: Emotion Discovery and Reasoning its Flip in Conversation (EDiReF)Shivani Kumar, Md Shad Akhtar, Erik Cambria et al.
We present SemEval-2024 Task 10, a shared task centred on identifying emotions and finding the rationale behind their flips within monolingual English and Hindi-English code-mixed dialogues. This task comprises three distinct subtasks - emotion recognition in conversation for code-mixed dialogues, emotion flip reasoning for code-mixed dialogues, and emotion flip reasoning for English dialogues. Participating systems were tasked to automatically execute one or more of these subtasks. The datasets for these tasks comprise manually annotated conversations focusing on emotions and triggers for emotion shifts (The task data is available at https://github.com/LCS2-IIITD/EDiReF-SemEval2024.git). A total of 84 participants engaged in this task, with the most adept systems attaining F1-scores of 0.70, 0.79, and 0.76 for the respective subtasks. This paper summarises the results and findings from 24 teams alongside their system descriptions.
AIFeb 11
OmniSapiens: A Foundation Model for Social Behavior Processing via Heterogeneity-Aware Relative Policy OptimizationKeane Ong, Sabri Boughorbel, Luwei Xiao et al.
To develop socially intelligent AI, existing approaches typically model human behavioral dimensions (e.g., affective, cognitive, or social attributes) in isolation. Although useful, task-specific modeling often increases training costs and limits generalization across behavioral settings. Recent reasoning RL methods facilitate training a single unified model across multiple behavioral tasks, but do not explicitly address learning across different heterogeneous behavioral data. To address this gap, we introduce Heterogeneity-Aware Relative Policy Optimization (HARPO), an RL method that balances leaning across heterogeneous tasks and samples. This is achieved by modulating advantages to ensure that no single task or sample carries disproportionate influence during policy optimization. Using HARPO, we develop and release Omnisapiens-7B 2.0, a foundation model for social behavior processing. Relative to existing behavioral foundation models, Omnisapiens-7B 2.0 achieves the strongest performance across behavioral tasks, with gains of up to +16.85% and +9.37% on multitask and held-out settings respectively, while producing more explicit and robust reasoning traces. We also validate HARPO against recent RL methods, where it achieves the most consistently strong performance across behavioral tasks.
CLJan 29
Enhancing Language Models for Robust Greenwashing DetectionNeil Heinrich Braun, Keane Ong, Rui Mao et al.
Sustainability reports are critical for ESG assessment, yet greenwashing and vague claims often undermine their reliability. Existing NLP models lack robustness to these practices, typically relying on surface-level patterns that generalize poorly. We propose a parameter-efficient framework that structures LLM latent spaces by combining contrastive learning with an ordinal ranking objective to capture graded distinctions between concrete actions and ambiguous claims. Our approach incorporates gated feature modulation to filter disclosure noise and utilizes MetaGradNorm to stabilize multi-objective optimization. Experiments in cross-category settings demonstrate superior robustness over standard baselines while revealing a trade-off between representational rigidity and generalization.
CLApr 8, 2025Code
SEA-LION: Southeast Asian Languages in One NetworkRaymond Ng, Thanh Ngan Nguyen, Yuli Huang et al. · meta-ai
Recently, Large Language Models (LLMs) have dominated much of the artificial intelligence scene with their ability to process and generate natural languages. However, the majority of LLM research and development remains English-centric, leaving low-resource languages such as those in the Southeast Asian (SEA) region under-represented. To address this representation gap, we introduce Llama-SEA-LION-v3-8B-IT and Gemma-SEA-LION-v3-9B-IT, two cutting-edge multilingual LLMs designed for SEA languages. The SEA-LION family of LLMs supports 11 SEA languages, namely English, Chinese, Indonesian, Vietnamese, Malay, Thai, Burmese, Lao, Filipino, Tamil, and Khmer. Our work leverages large-scale multilingual continued pre-training with a comprehensive post-training regime involving multiple stages of instruction fine-tuning, alignment, and model merging. Evaluation results on multilingual benchmarks indicate that our models achieve state-of-the-art performance across LLMs supporting SEA languages. We open-source the models to benefit the wider SEA community.
AINov 3, 2025
Robust Multimodal Sentiment Analysis via Double Information BottleneckHuiting Huang, Tieliang Gong, Kai He et al.
Multimodal sentiment analysis has received significant attention across diverse research domains. Despite advancements in algorithm design, existing approaches suffer from two critical limitations: insufficient learning of noise-contaminated unimodal data, leading to corrupted cross-modal interactions, and inadequate fusion of multimodal representations, resulting in discarding discriminative unimodal information while retaining multimodal redundant information. To address these challenges, this paper proposes a Double Information Bottleneck (DIB) strategy to obtain a powerful, unified compact multimodal representation. Implemented within the framework of low-rank Renyi's entropy functional, DIB offers enhanced robustness against diverse noise sources and computational tractability for high-dimensional data, as compared to the conventional Shannon entropy-based methods. The DIB comprises two key modules: 1) learning a sufficient and compressed representation of individual unimodal data by maximizing the task-relevant information and discarding the superfluous information, and 2) ensuring the discriminative ability of multimodal representation through a novel attention bottleneck fusion mechanism. Consequently, DIB yields a multimodal representation that effectively filters out noisy information from unimodal data while capturing inter-modal complementarity. Extensive experiments on CMU-MOSI, CMU-MOSEI, CH-SIMS, and MVSA-Single validate the effectiveness of our method. The model achieves 47.4% accuracy under the Acc-7 metric on CMU-MOSI and 81.63% F1-score on CH-SIMS, outperforming the second-best baseline by 1.19%. Under noise, it shows only 0.36% and 0.29% performance degradation on CMU-MOSI and CMU-MOSEI respectively.
AIMar 23
Beyond Correlation: Refutation-Validated Aspect-Based Sentiment Analysis for Explainable Energy Market ReturnsWihan van der Heever, Keane Ong, Ranjan Satapathy et al.
This paper proposes a refutation-validated framework for aspect-based sentiment analysis in financial markets, addressing the limitations of correlational studies that cannot distinguish genuine associations from spurious ones. Using X data for the energy sector, we test whether aspect-level sentiment signals show robust, refutation-validated relationships with equity returns. Our pipeline combines net-ratio scoring with z-normalization, OLS with Newey West HAC errors, and refutation tests including placebo, random common cause, subset stability, and bootstrap. Across six energy tickers, only a few associations survive all checks, while renewables show aspect and horizon specific responses. While not establishing causality, the framework provides statistically robust, directionally interpretable signals, with limited sample size (six stocks, one quarter) constraining generalizability and framing this work as a methodological proof of concept.