Soyeon Caren Han

CL
h-index26
60papers
4,529citations
Novelty40%
AI Score56

60 Papers

CLJun 1, 2022Code
InducT-GCN: Inductive Graph Convolutional Networks for Text Classification

Kunze Wang, Soyeon Caren Han, Josiah Poon

Text classification aims to assign labels to textual units by making use of global information. Recent studies have applied graph neural network (GNN) to capture the global word co-occurrence in a corpus. Existing approaches require that all the nodes (training and test) in a graph are present during training, which are transductive and do not naturally generalise to unseen nodes. To make those models inductive, they use extra resources, like pretrained word embedding. However, high-quality resource is not always available and hard to train. Under the extreme settings with no extra resource and limited amount of training set, can we still learn an inductive graph-based text classification model? In this paper, we introduce a novel inductive graph-based text classification framework, InducT-GCN (InducTive Graph Convolutional Networks for Text classification). Compared to transductive models that require test documents in training, we construct a graph based on the statistics of training documents only and represent document vectors with a weighted sum of word vectors. We then conduct one-directional GCN propagation during testing. Across five text classification benchmarks, our InducT-GCN outperformed state-of-the-art methods that are either transductive in nature or pre-trained additional resources. We also conducted scalability testing by gradually increasing the data size and revealed that our InducT-GCN can reduce the time and space complexity. The code is available on: https://github.com/usydnlp/InductTGCN.

CVAug 17, 2022Code
Understanding Attention for Vision-and-Language Tasks

Feiqi Cao, Soyeon Caren Han, Siqu Long et al.

Attention mechanism has been used as an important component across Vision-and-Language(VL) tasks in order to bridge the semantic gap between visual and textual features. While attention has been widely used in VL tasks, it has not been examined the capability of different attention alignment calculation in bridging the semantic gap between visual and textual clues. In this research, we conduct a comprehensive analysis on understanding the role of attention alignment by looking into the attention score calculation methods and check how it actually represents the visual region's and textual token's significance for the global assessment. We also analyse the conditions which attention score calculation mechanism would be more (or less) interpretable, and which may impact the model performance on three different VL tasks, including visual question answering, text-to-image generation, text-and-image matching (both sentence and image retrieval). Our analysis is the first of its kind and provides useful insights of the importance of each attention alignment score calculation when applied at the training phase of VL tasks, commonly ignored in attention-based cross modal models, and/or pretrained models. Our code is available at: https://github.com/adlnlp/Attention_VL

CLNov 11, 2022Code
In-game Toxic Language Detection: Shared Task and Attention Residuals

Yuanzhe Jia, Weixuan Wu, Feiqi Cao et al.

In-game toxic language becomes the hot potato in the gaming industry and community. There have been several online game toxicity analysis frameworks and models proposed. However, it is still challenging to detect toxicity due to the nature of in-game chat, which has extremely short length. In this paper, we describe how the in-game toxic language shared task has been established using the real-world in-game chat data. In addition, we propose and introduce the model/framework for toxic language token tagging (slot filling) from the in-game chat. The relevant code is publicly available on GitHub: https://github.com/Yuanzhe-Jia/In-Game-Toxic-Detection

CLAug 15, 2024Code
MIDAS: Multi-level Intent, Domain, And Slot Knowledge Distillation for Multi-turn NLU

Yan Li, So-Eon Kim, Seong-Bae Park et al.

Although Large Language Models (LLMs) can generate coherent text, they often struggle to recognise user intent behind queries. In contrast, Natural Language Understanding (NLU) models interpret the purpose and key information of user input for responsive interactions. Existing NLU models typically map utterances to a dual-level semantic frame, involving sentence-level intent (SI) and word-level slot (WS) labels. However, real-life conversations primarily consist of multi-turn dialogues, requiring the interpretation of complex and extended exchanges. Researchers encounter challenges in addressing all facets of multi-turn dialogue using a unified NLU model. This paper introduces MIDAS, a novel approach leveraging multi-level intent, domain, and slot knowledge distillation for multi-turn NLU. We construct distinct teachers for SI detection, WS filling, and conversation-level domain (CD) classification, each fine-tuned for specific knowledge. A multi-teacher loss is proposed to facilitate the integration of these teachers, guiding a student model in multi-turn dialogue tasks. Results demonstrate the efficacy of our model in improving multi-turn conversation understanding, showcasing the potential for advancements in NLU through multi-level dialogue knowledge distillation. Our implementation is open-sourced on https://github.com/adlnlp/Midas.

CVNov 29, 2022Code
PiggyBack: Pretrained Visual Question Answering Environment for Backing up Non-deep Learning Professionals

Zhihao Zhang, Siwen Luo, Junyi Chen et al.

We propose a PiggyBack, a Visual Question Answering platform that allows users to apply the state-of-the-art visual-language pretrained models easily. The PiggyBack supports the full stack of visual question answering tasks, specifically data processing, model fine-tuning, and result visualisation. We integrate visual-language models, pretrained by HuggingFace, an open-source API platform of deep learning technologies; however, it cannot be runnable without programming skills or deep learning understanding. Hence, our PiggyBack supports an easy-to-use browser-based user interface with several deep learning visual language pretrained models for general users and domain experts. The PiggyBack includes the following benefits: Free availability under the MIT License, Portability due to web-based and thus runs on almost any platform, A comprehensive data creation and processing technique, and ease of use on deep learning-based visual language pretrained models. The demo video is available on YouTube and can be found at https://youtu.be/iz44RZ1lF4s.

CVApr 15, 2022
Vision-and-Language Pretrained Models: A Survey

Siqu Long, Feiqi Cao, Soyeon Caren Han et al.

Pretrained models have produced great success in both Computer Vision (CV) and Natural Language Processing (NLP). This progress leads to learning joint representations of vision and language pretraining by feeding visual and linguistic contents into a multi-layer transformer, Visual-Language Pretrained Models (VLPMs). In this paper, we present an overview of the major advances achieved in VLPMs for producing joint representations of vision and language. As the preliminaries, we briefly describe the general task definition and genetic architecture of VLPMs. We first discuss the language and vision data encoding methods and then present the mainstream VLPM structure as the core content. We further summarise several essential pretraining and fine-tuning strategies. Finally, we highlight three future directions for both CV and NLP researchers to provide insightful guidance.

CLApr 23, 2023
Graph Neural Networks for Text Classification: A Survey

Kunze Wang, Yihao Ding, Soyeon Caren Han

Text Classification is the most essential and fundamental problem in Natural Language Processing. While numerous recent text classification models applied the sequential deep learning technique, graph neural network-based models can directly deal with complex structured text data and exploit global information. Many real text classification applications can be naturally cast into a graph, which captures words, documents, and corpus global features. In this survey, we bring the coverage of methods up to 2023, including corpus-level and document-level graph neural networks. We discuss each of these methods in detail, dealing with the graph construction mechanisms and the graph-based learning process. As well as the technological survey, we look at issues behind and future directions addressed in text classification using graph neural networks. We also cover datasets, evaluation metrics, and experiment design and present a summary of published performance on the publicly available benchmarks. Note that we present a comprehensive comparison between different techniques and identify the pros and cons of various evaluation metrics in this survey.

CLApr 10, 2022
ME-GCN: Multi-dimensional Edge-Embedded Graph Convolutional Networks for Semi-supervised Text Classification

Kunze Wang, Soyeon Caren Han, Siqu Long et al.

Compared to sequential learning models, graph-based neural networks exhibit excellent ability in capturing global information and have been used for semi-supervised learning tasks. Most Graph Convolutional Networks are designed with the single-dimensional edge feature and failed to utilise the rich edge information about graphs. This paper introduces the ME-GCN (Multi-dimensional Edge-enhanced Graph Convolutional Networks) for semi-supervised text classification. A text graph for an entire corpus is firstly constructed to describe the undirected and multi-dimensional relationship of word-to-word, document-document, and word-to-document. The graph is initialised with corpus-trained multi-dimensional word and document node representation, and the relations are represented according to the distance of those words/documents nodes. Then, the generated graph is trained with ME-GCN, which considers the edge features as multi-stream signals, and each stream performs a separate graph convolutional operation. Our ME-GCN can integrate a rich source of graph edge information of the entire text corpus. The results have demonstrated that our proposed model has significantly outperformed the state-of-the-art methods across eight benchmark datasets.

CVAug 22, 2022
Doc-GCN: Heterogeneous Graph Convolutional Networks for Document Layout Analysis

Siwen Luo, Yihao Ding, Siqu Long et al.

Recognizing the layout of unstructured digital documents is crucial when parsing the documents into the structured, machine-readable format for downstream applications. Recent studies in Document Layout Analysis usually rely on computer vision models to understand documents while ignoring other information, such as context information or relation of document components, which are vital to capture. Our Doc-GCN presents an effective way to harmonize and integrate heterogeneous aspects for Document Layout Analysis. We first construct graphs to explicitly describe four main aspects, including syntactic, semantic, density, and appearance/visual information. Then, we apply graph convolutional networks for representing each aspect of information and use pooling to integrate them. Finally, we aggregate each aspect and feed them into 2-layer MLPs for document layout component classification. Our Doc-GCN achieves new state-of-the-art results in three widely used DLA datasets.

CLAug 23, 2022
K-MHaS: A Multi-label Hate Speech Detection Dataset in Korean Online News Comment

Jean Lee, Taejun Lim, Heejun Lee et al.

Online hate speech detection has become an important issue due to the growth of online content, but resources in languages other than English are extremely limited. We introduce K-MHaS, a new multi-label dataset for hate speech detection that effectively handles Korean language patterns. The dataset consists of 109k utterances from news comments and provides a multi-label classification using 1 to 4 labels, and handles subjectivity and intersectionality. We evaluate strong baseline experiments on K-MHaS using Korean-BERT-based language models with six different metrics. KR-BERT with a sub-character tokenizer outperforms others, recognizing decomposed characters in each hate speech class.

CLJan 23, 2023
StockEmotions: Discover Investor Emotions for Financial Sentiment Analysis and Multivariate Time Series

Jean Lee, Hoyoul Luis Youn, Josiah Poon et al.

There has been growing interest in applying NLP techniques in the financial domain, however, resources are extremely limited. This paper introduces StockEmotions, a new dataset for detecting emotions in the stock market that consists of 10,000 English comments collected from StockTwits, a financial social media platform. Inspired by behavioral finance, it proposes 12 fine-grained emotion classes that span the roller coaster of investor emotion. Unlike existing financial sentiment datasets, StockEmotions presents granular features such as investor sentiment classes, fine-grained emotions, emojis, and time series data. To demonstrate the usability of the dataset, we perform a dataset analysis and conduct experimental downstream tasks. For financial sentiment/emotion classification tasks, DistilBERT outperforms other baselines, and for multivariate time series forecasting, a Temporal Attention LSTM model combining price index, text, and emotion features achieves the best performance than using a single feature.

CVApr 13, 2023
PDFVQA: A New Dataset for Real-World VQA on PDF Documents

Yihao Ding, Siwen Luo, Hyunsuk Chung et al.

Document-based Visual Question Answering examines the document understanding of document images in conditions of natural language questions. We proposed a new document-based VQA dataset, PDF-VQA, to comprehensively examine the document understanding from various aspects, including document element recognition, document layout structural understanding as well as contextual understanding and key information extraction. Our PDF-VQA dataset extends the current scale of document understanding that limits on the single document page to the new scale that asks questions over the full document of multiple pages. We also propose a new graph-based VQA model that explicitly integrates the spatial and hierarchically structural relationships between different document elements to boost the document structural understanding. The performances are compared with several baselines over different question types and tasks\footnote{The full dataset will be released after paper acceptance.

AIMay 27, 2022
V-Doc : Visual questions answers with Documents

Yihao Ding, Zhe Huang, Runlin Wang et al.

We propose V-Doc, a question-answering tool using document images and PDF, mainly for researchers and general non-deep learning experts looking to generate, process, and understand the document visual question answering tasks. The V-Doc supports generating and using both extractive and abstractive question-answer pairs using documents images. The extractive QA selects a subset of tokens or phrases from the document contents to predict the answers, while the abstractive QA recognises the language in the content and generates the answer based on the trained model. Both aspects are crucial to understanding the documents, especially in an image format. We include a detailed scenario of question generation for the abstractive QA task. V-Doc supports a wide range of datasets and models, and is highly extensible through a declarative, framework-agnostic platform.

CLAug 2, 2024
Deep Learning based Visually Rich Document Content Understanding: A Survey

Yihao Ding, Soyeon Caren Han, Jean Lee et al.

Visually Rich Documents (VRDs) play a vital role in domains such as academia, finance, healthcare, and marketing, as they convey information through a combination of text, layout, and visual elements. Traditional approaches to extracting information from VRDs rely heavily on expert knowledge and manual annotation, making them labor-intensive and inefficient. Recent advances in deep learning have transformed this landscape by enabling multimodal models that integrate vision, language, and layout features through pretraining, significantly improving information extraction performance. This survey presents a comprehensive overview of deep learning-based frameworks for VRD Content Understanding (VRD-CU). We categorize existing methods based on their modeling strategies and downstream tasks, and provide a comparative analysis of key components, including feature representation, fusion techniques, model architectures, and pretraining objectives. Additionally, we highlight the strengths and limitations of each approach and discuss their suitability for different applications. The paper concludes with a discussion of current challenges and emerging trends, offering guidance for future research and practical deployment in real-world scenarios.

CLOct 24, 2023
Re-Temp: Relation-Aware Temporal Representation Learning for Temporal Knowledge Graph Completion

Kunze Wang, Soyeon Caren Han, Josiah Poon

Temporal Knowledge Graph Completion (TKGC) under the extrapolation setting aims to predict the missing entity from a fact in the future, posing a challenge that aligns more closely with real-world prediction problems. Existing research mostly encodes entities and relations using sequential graph neural networks applied to recent snapshots. However, these approaches tend to overlook the ability to skip irrelevant snapshots according to entity-related relations in the query and disregard the importance of explicit temporal information. To address this, we propose our model, Re-Temp (Relation-Aware Temporal Representation Learning), which leverages explicit temporal embedding as input and incorporates skip information flow after each timestamp to skip unnecessary information for prediction. Additionally, we introduce a two-phase forward propagation method to prevent information leakage. Through the evaluation on six TKGC (extrapolation) datasets, we demonstrate that our model outperforms all eight recent state-of-the-art models by a significant margin.

84.3CLMar 15
Inclusion-of-Thoughts: Mitigating Preference Instability via Purifying the Decision Space

Mohammad Reza Ghasemi Madani, Soyeon Caren Han, Shuo Yang et al.

Multiple-choice questions (MCQs) are widely used to evaluate large language models (LLMs). However, LLMs remain vulnerable to the presence of plausible distractors. This often diverts attention toward irrelevant choices, resulting in unstable oscillation between correct and incorrect answers. In this paper, we propose Inclusion-of-Thoughts (IoT), a progressive self-filtering strategy that is designed to mitigate this cognitive load (i.e., instability of model preferences under the presence of distractors) and enable the model to focus more effectively on plausible answers. Our method operates to reconstruct the MCQ using only plausible option choices, providing a controlled setting for examining comparative judgements and therefore the stability of the model's internal reasoning under perturbation. By explicitly documenting this filtering process, IoT also enhances the transparency and interpretability of the model's decision-making. Extensive empirical evaluation demonstrates that IoT substantially boosts chain-of-thought performance across a range of arithmetic, commonsense reasoning, and educational benchmarks with minimal computational overhead.

CLJun 17, 2022
A Quantitative and Qualitative Analysis of Suicide Ideation Detection using Deep Learning

Siqu Long, Rina Cabral, Josiah Poon et al.

For preventing youth suicide, social media platforms have received much attention from researchers. A few researches apply machine learning, or deep learning-based text classification approaches to classify social media posts containing suicidality risk. This paper replicated competitive social media-based suicidality detection/prediction models. We evaluated the feasibility of detecting suicidal ideation using multiple datasets and different state-of-the-art deep learning models, RNN-, CNN-, and Attention-based models. Using two suicidality evaluation datasets, we evaluated 28 combinations of 7 input embeddings with 4 commonly used deep learning models and 5 pretrained language models in quantitative and qualitative ways. Our replication study confirms that deep learning works well for social media-based suicidality detection in general, but it highly depends on the dataset's quality.

CLJul 12, 2024
3M-Health: Multimodal Multi-Teacher Knowledge Distillation for Mental Health Detection

Rina Carines Cabral, Siwen Luo, Josiah Poon et al.

The significance of mental health classification is paramount in contemporary society, where digital platforms serve as crucial sources for monitoring individuals' well-being. However, existing social media mental health datasets primarily consist of text-only samples, potentially limiting the efficacy of models trained on such data. Recognising that humans utilise cross-modal information to comprehend complex situations or issues, we present a novel approach to address the limitations of current methodologies. In this work, we introduce a Multimodal and Multi-Teacher Knowledge Distillation model for Mental Health Classification, leveraging insights from cross-modal human understanding. Unlike conventional approaches that often rely on simple concatenation to integrate diverse features, our model addresses the challenge of appropriately representing inputs of varying natures (e.g., texts and sounds). To mitigate the computational complexity associated with integrating all features into a single model, we employ a multimodal and multi-teacher architecture. By distributing the learning process across multiple teachers, each specialising in a particular feature extraction aspect, we enhance the overall mental health classification performance. Through experimental validation, we demonstrate the efficacy of our model in achieving improved performance.

CLFeb 4, 2024Code
A Survey of Large Language Models in Finance (FinLLMs)

Jean Lee, Nicholas Stevens, Soyeon Caren Han et al.

Large Language Models (LLMs) have shown remarkable capabilities across a wide variety of Natural Language Processing (NLP) tasks and have attracted attention from multiple domains, including financial services. Despite the extensive research into general-domain LLMs, and their immense potential in finance, Financial LLM (FinLLM) research remains limited. This survey provides a comprehensive overview of FinLLMs, including their history, techniques, performance, and opportunities and challenges. Firstly, we present a chronological overview of general-domain Pre-trained Language Models (PLMs) through to current FinLLMs, including the GPT-series, selected open-source LLMs, and financial LMs. Secondly, we compare five techniques used across financial PLMs and FinLLMs, including training methods, training data, and fine-tuning methods. Thirdly, we summarize the performance evaluations of six benchmark tasks and datasets. In addition, we provide eight advanced financial NLP tasks and datasets for developing more sophisticated FinLLMs. Finally, we discuss the opportunities and the challenges facing FinLLMs, such as hallucination, privacy, and efficiency. To support AI research in finance, we compile a collection of accessible datasets and evaluation benchmarks on GitHub.

CLAug 9, 2024
MSG-Chart: Multimodal Scene Graph for ChartQA

Yue Dai, Soyeon Caren Han, Wei Liu

Automatic Chart Question Answering (ChartQA) is challenging due to the complex distribution of chart elements with patterns of the underlying data not explicitly displayed in charts. To address this challenge, we design a joint multimodal scene graph for charts to explicitly represent the relationships between chart elements and their patterns. Our proposed multimodal scene graph includes a visual graph and a textual graph to jointly capture the structural and semantical knowledge from the chart. This graph module can be easily integrated with different vision transformers as inductive bias. Our experiments demonstrate that incorporating the proposed graph module enhances the understanding of charts' elements' structure and semantics, thereby improving performance on publicly available benchmarks, ChartQA and OpenCQA.

CLAug 12, 2023
MC-DRE: Multi-Aspect Cross Integration for Drug Event/Entity Extraction

Jie Yang, Soyeon Caren Han, Siqu Long et al.

Extracting meaningful drug-related information chunks, such as adverse drug events (ADE), is crucial for preventing morbidity and saving many lives. Most ADEs are reported via an unstructured conversation with the medical context, so applying a general entity recognition approach is not sufficient enough. In this paper, we propose a new multi-aspect cross-integration framework for drug entity/event detection by capturing and aligning different context/language/knowledge properties from drug-related documents. We first construct multi-aspect encoders to describe semantic, syntactic, and medical document contextual information by conducting those slot tagging tasks, main drug entity/event detection, part-of-speech tagging, and general medical named entity recognition. Then, each encoder conducts cross-integration with other contextual information in three ways: the key-value cross, attention cross, and feedforward cross, so the multi-encoders are integrated in depth. Our model outperforms all SOTA on two widely used tasks, flat entity detection and discontinuous event extraction.

IRJul 31, 2023
Workshop on Document Intelligence Understanding

Soyeon Caren Han, Yihao Ding, Siwen Luo et al.

Document understanding and information extraction include different tasks to understand a document and extract valuable information automatically. Recently, there has been a rising demand for developing document understanding among different domains, including business, law, and medicine, to boost the efficiency of work that is associated with a large number of documents. This workshop aims to bring together researchers and industry developers in the field of document intelligence and understanding diverse document types to boost automatic document processing and understanding techniques. We also released a data challenge on the recently introduced document-level VQA dataset, PDFVQA. The PDFVQA challenge examines the structural and contextual understandings of proposed models on the natural full document level of multiple consecutive document pages by including questions with a sequence of answers extracted from multi-pages of the full document. This task helps to boost the document understanding step from the single-page level to the full document level understanding.

CLDec 21, 2022
Spoken Language Understanding for Conversational AI: Recent Advances and Future Direction

Soyeon Caren Han, Siqu Long, Henry Weld et al.

When a human communicates with a machine using natural language on the web and online, how can it understand the human's intention and semantic context of their talk? This is an important AI task as it enables the machine to construct a sensible answer or perform a useful action for the human. Meaning is represented at the sentence level, identification of which is known as intent detection, and at the word level, a labelling task called slot filling. This dual-level joint task requires innovative thinking about natural language and deep learning network design, and as a result, many approaches and models have been proposed and applied. This tutorial will discuss how the joint task is set up and introduce Spoken Language Understanding/Natural Language Understanding (SLU/NLU) with Deep Learning techniques. We will cover the datasets, experiments and metrics used in the field. We will describe how the machine uses the latest NLP and Deep Learning techniques to address the joint task, including recurrent and attention-based Transformer networks and pre-trained models (e.g. BERT). We will then look in detail at a network that allows the two levels of the task, intent classification and slot filling, to interact to boost performance explicitly. We will do a code demonstration of a Python notebook for this model and attendees will have an opportunity to watch coding demo tasks on this joint NLU to further their understanding.

CVNov 9, 2022
SG-Shuffle: Multi-aspect Shuffle Transformer for Scene Graph Generation

Anh Duc Bui, Soyeon Caren Han, Josiah Poon

Scene Graph Generation (SGG) serves a comprehensive representation of the images for human understanding as well as visual understanding tasks. Due to the long tail bias problem of the object and predicate labels in the available annotated data, the scene graph generated from current methodologies can be biased toward common, non-informative relationship labels. Relationship can sometimes be non-mutually exclusive, which can be described from multiple perspectives like geometrical relationships or semantic relationships, making it even more challenging to predict the most suitable relationship label. In this work, we proposed the SG-Shuffle pipeline for scene graph generation with 3 components: 1) Parallel Transformer Encoder, which learns to predict object relationships in a more exclusive manner by grouping relationship labels into groups of similar purpose; 2) Shuffle Transformer, which learns to select the final relationship labels from the category-specific feature generated in the previous step; and 3) Weighted CE loss, used to alleviate the training bias caused by the imbalanced dataset.

IRSep 9, 2022
SUPER-Rec: SUrrounding Position-Enhanced Representation for Recommendation

Taejun Lim, Siqu Long, Josiah Poon et al.

Collaborative filtering problems are commonly solved based on matrix completion techniques which recover the missing values of user-item interaction matrices. In a matrix, the rating position specifically represents the user given and the item rated. Previous matrix completion techniques tend to neglect the position of each element (user, item and ratings) in the matrix but mainly focus on semantic similarity between users and items to predict the missing value in a matrix. This paper proposes a novel position-enhanced user/item representation training model for recommendation, SUPER-Rec. We first capture the rating position in the matrix using the relative positional rating encoding and store the position-enhanced rating information and its user-item relationship to the fixed dimension of embedding that is not affected by the matrix size. Then, we apply the trained position-enhanced user and item representations to the simplest traditional machine learning models to highlight the pure novelty of our representation learning model. We contribute the first formal introduction and quantitative analysis of position-enhanced item representation in the recommendation domain and produce a principled discussion about our SUPER-Rec to the outperformed performance of typical collaborative filtering recommendation tasks with both explicit and implicit feedback.

AIFeb 24
Physics-based phenomenological characterization of cross-modal bias in multimodal models

Hyeongmo Kim, Sohyun Kang, Yerin Choi et al.

The term 'algorithmic fairness' is used to evaluate whether AI models operate fairly in both comparative (where fairness is understood as formal equality, such as "treat like cases as like") and non-comparative (where unfairness arises from the model's inaccuracy, arbitrariness, or inscrutability) contexts. Recent advances in multimodal large language models (MLLMs) are breaking new ground in multimodal understanding, reasoning, and generation; however, we argue that inconspicuous distortions arising from complex multimodal interaction dynamics can lead to systematic bias. The purpose of this position paper is twofold: first, it is intended to acquaint AI researchers with phenomenological explainable approaches that rely on the physical entities that the machine experiences during training/inference, as opposed to the traditional cognitivist symbolic account or metaphysical approaches; second, it is to state that this phenomenological doctrine will be practically useful for tackling algorithmic fairness issues in MLLMs. We develop a surrogate physics-based model that describes transformer dynamics (i.e., semantic network structure and self-/cross-attention) to analyze the dynamics of cross-modal bias in MLLM, which are not fully captured by conventional embedding- or representation-level analyses. We support this position through multi-input diagnostic experiments: 1) perturbation-based analyses of emotion classification using Qwen2.5-Omni and Gemma 3n, and 2) dynamical analysis of Lorenz chaotic time-series prediction through the physical surrogate. Across two architecturally distinct MLLMs, we show that multimodal inputs can reinforce modality dominance rather than mitigate it, as revealed by structured error-attractor patterns under systematic label perturbation, complemented by dynamical analysis.

CLMar 19, 2025Code
SPADE: Structured Prompting Augmentation for Dialogue Enhancement in Machine-Generated Text Detection

Haoyi Li, Angela Yifei Yuan, Soyeon Caren Han et al.

The increasing capability of large language models (LLMs) to generate synthetic content has heightened concerns about their misuse, driving the development of Machine-Generated Text (MGT) detection models. However, these detectors face significant challenges due to the lack of high-quality synthetic datasets for training. To address this issue, we propose SPADE, a structured framework for detecting synthetic dialogues using prompt-based positive and negative samples. Our proposed methods yield 14 new dialogue datasets, which we benchmark against eight MGT detection models. The results demonstrate improved generalization performance when utilizing a mixed dataset produced by proposed augmentation frameworks, offering a practical approach to enhancing LLM application security. Considering that real-world agents lack knowledge of future opponent utterances, we simulate online dialogue detection and examine the relationship between chat history length and detection accuracy. Our open-source datasets, code and prompts can be downloaded from https://github.com/AngieYYF/SPADE-customer-service-dialogue.

CLAug 26, 2025Code
EMMM, Explain Me My Model! Explainable Machine Generated Text Detection in Dialogues

Angela Yifei Yuan, Haoyi Li, Soyeon Caren Han et al.

The rapid adoption of large language models (LLMs) in customer service introduces new risks, as malicious actors can exploit them to conduct large-scale user impersonation through machine-generated text (MGT). Current MGT detection methods often struggle in online conversational settings, reducing the reliability and interpretability essential for trustworthy AI deployment. In customer service scenarios where operators are typically non-expert users, explanation become crucial for trustworthy MGT detection. In this paper, we propose EMMM, an explanation-then-detection framework that balances latency, accuracy, and non-expert-oriented interpretability. Experimental results demonstrate that EMMM provides explanations accessible to non-expert users, with 70\% of human evaluators preferring its outputs, while achieving competitive accuracy compared to state-of-the-art models and maintaining low latency, generating outputs within 1 second. Our code and dataset are open-sourced at https://github.com/AngieYYF/EMMM-explainable-chatbot-detection.

CLFeb 4, 2025Code
A Training-Free Length Extrapolation Approach for LLMs: Greedy Attention Logit Interpolation (GALI)

Yan Li, Tianyi Zhang, Zechuan Li et al.

Transformer-based Large Language Models (LLMs) struggle with inputs exceeding their training context window due to positional out-of-distribution (O.O.D.) issues that disrupt attention. Existing solutions, including fine-tuning and training-free methods, face challenges like inefficiency, redundant interpolation, logit outliers, or loss of local positional information. We propose Greedy Attention Logit Interpolation (GALI), a training-free method that improves length extrapolation by greedily reusing pretrained positional intervals and interpolating attention logit to eliminate outliers. GALI achieves stable and superior performance across a wide range of long-context tasks without requiring input-length-specific tuning. Our analysis further reveals that LLMs interpret positional intervals unevenly and that restricting interpolation to narrower ranges improves performance, even on short-context tasks. GALI represents a step toward more robust and generalizable long-text processing in LLMs. Our implementation of GALI, along with the experiments from our paper, is open-sourced at https://github.com/adlnlp/Gali.

CLOct 14, 2024Code
ChuLo: Chunk-Level Key Information Representation for Long Document Understanding

Yan Li, Soyeon Caren Han, Yue Dai et al.

Transformer-based models have achieved remarkable success in various Natural Language Processing (NLP) tasks, yet their ability to handle long documents is constrained by computational limitations. Traditional approaches, such as truncating inputs, sparse self-attention, and chunking, attempt to mitigate these issues, but they often lead to information loss and hinder the model's ability to capture long-range dependencies. In this paper, we introduce ChuLo, a novel chunk representation method for long document understanding that addresses these limitations. Our ChuLo groups input tokens using unsupervised keyphrase extraction, emphasizing semantically important keyphrase based chunks to retain core document content while reducing input length. This approach minimizes information loss and improves the efficiency of Transformer-based models. Preserving all tokens in long document understanding, especially token classification tasks, is important to ensure that fine-grained annotations, which depend on the entire sequence context, are not lost. We evaluate our method on multiple long document classification tasks and long document token classification tasks, demonstrating its effectiveness through comprehensive qualitative and quantitative analysis. Our implementation is open-sourced on https://github.com/adlnlp/Chulo.

CLFeb 25
When More Is Less: A Systematic Analysis of Spatial and Commonsense Information for Visual Spatial Reasoning

Muku Akasaka, Soyeon Caren Han

Visual spatial reasoning (VSR) remains challenging for modern vision-language models (VLMs), despite advances in multimodal architectures. A common strategy is to inject additional information at inference time, such as explicit spatial cues, external commonsense knowledge, or chain-of-thought (CoT) reasoning instructions. However, it remains unclear when such information genuinely improves reasoning and when it introduces noise. In this paper, we conduct a hypothesis-driven analysis of information injection for VSR across three representative VLMs and two public benchmarks. We examine (i) the type and number of spatial contexts, (ii) the amount and relevance of injected commonsense knowledge, and (iii) the interaction between spatial grounding and CoT prompting. Our results reveal a consistent pattern: more information does not necessarily yield better reasoning. Targeted single spatial cues outperform multi-context aggregation, excessive or weakly relevant commonsense knowledge degrades performance, and CoT prompting improves accuracy only when spatial grounding is sufficiently precise. These findings highlight the importance of selective, task-aligned information injection and provide practical guidance for designing reliable multimodal reasoning pipelines.

AIMar 5
EvoTool: Self-Evolving Tool-Use Policy Optimization in LLM Agents via Blame-Aware Mutation and Diversity-Aware Selection

Shuo Yang, Soyeon Caren Han, Xueqi Ma et al.

LLM-based agents depend on effective tool-use policies to solve complex tasks, yet optimizing these policies remains challenging due to delayed supervision and the difficulty of credit assignment in long-horizon trajectories. Existing optimization approaches tend to be either monolithic, which are prone to entangling behaviors, or single-aspect, which ignore cross-module error propagation. To address these limitations, we propose EvoTool, a self-evolving framework that optimizes a modular tool-use policy via a gradient-free evolutionary paradigm. EvoTool decomposes agent's tool-use policy into four modules, including Planner, Selector, Caller, and Synthesizer, and iteratively improves them in a self-improving loop through three novel mechanisms. Trajectory-Grounded Blame Attribution uses diagnostic traces to localize failures to a specific module. Feedback-Guided Targeted Mutation then edits only that module via natural-language critique. Diversity-Aware Population Selection preserves complementary candidates to ensure solution diversity. Across four benchmarks, EvoTool outperforms strong baselines by over 5 points on both GPT-4.1 and Qwen3-8B, while achieving superior efficiency and transferability. The code will be released once paper is accepted.

CLMar 24, 2025
MAGIC-VQA: Multimodal And Grounded Inference with Commonsense Knowledge for Visual Question Answering

Shuo Yang, Siwen Luo, Soyeon Caren Han et al.

Visual Question Answering (VQA) requires reasoning across visual and textual modalities, yet Large Vision-Language Models (LVLMs) often lack integrated commonsense knowledge, limiting their robustness in real-world scenarios. To address this, we introduce MAGIC-VQA, a novel framework that enhances VQA by systematically integrating commonsense knowledge with LVLMs. MAGIC-VQA employs a three-stage process: (1) Explicit Knowledge Integration from external sources, (2) By-Type Post-Processing for contextual refinement, and (3) Implicit Knowledge Augmentation using a Graph Neural Network (GNN) for structured reasoning. While GNNs bring greater depth to structured inference, they enable superior relational inference beyond LVLMs. MAGIC-VQA bridges a key gap by unifying commonsensse knowledge with LVLM-driven reasoning, eliminating the need for extensive pre-training or complex prompt tuning. Our framework achieves state-of-the-art performance on benchmark datasets, significantly improving commonsense reasoning in VQA.

66.7AIMar 13
ToolTree: Efficient LLM Agent Tool Planning via Dual-Feedback Monte Carlo Tree Search and Bidirectional Pruning

Shuo Yang, Soyeon Caren Han, Yihao Ding et al.

Large Language Model (LLM) agents are increasingly applied to complex, multi-step tasks that require interaction with diverse external tools across various domains. However, current LLM agent tool planning methods typically rely on greedy, reactive tool selection strategies that lack foresight and fail to account for inter-tool dependencies. In this paper, we present ToolTree, a novel Monte Carlo tree search-inspired planning paradigm for tool planning. ToolTree explores possible tool usage trajectories using a dual-stage LLM evaluation and bidirectional pruning mechanism that enables the agent to make informed, adaptive decisions over extended tool-use sequences while pruning less promising branches before and after the tool execution. Empirical evaluations across both open-set and closed-set tool planning tasks on 4 benchmarks demonstrate that ToolTree consistently improves performance while keeping the highest efficiency, achieving an average gain of around 10\% compared to the state-of-the-art planning paradigm.

LGFeb 2
When Is Rank-1 Enough? Geometry-Guided Initialization for Parameter-Efficient Fine-Tuning

Haoran Zhao, Soyeon Caren Han, Eduard Hovy

Parameter-efficient fine-tuning (PEFT) is a standard way to adapt multimodal large language models, yet extremely low-rank settings -- especially rank-1 LoRA -- are often unstable. We show that this instability is not solely due to limited capacity: in the rank-1 regime, optimization is highly sensitive to the update direction. Concretely, pretrained vision and text features form mismatched anisotropic regions, yielding a dominant "gap" direction that acts like a translation component and disproportionately steers early gradients under rank-1 constraints. Analyzing pretrained representations, we identify a modality-gap axis that dominates early gradient flow, while a random rank-1 initialization is unlikely to align with it, leading to weak gradients and training collapse. We propose Gap-Init, a geometry-aware initialization that aligns the rank-1 LoRA direction with an estimated modality-gap vector from a small calibration set, while keeping the initial LoRA update zero. Across multiple vision-language tasks and backbones, Gap-Init consistently stabilizes rank-1 training and can match or outperform strong rank-8 baselines. Our results suggest that at the extreme low-rank limit, initial alignment can matter as much as rank itself.

AIFeb 24
Diagnosing Causal Reasoning in Vision-Language Models via Structured Relevance Graphs

Dhita Putri Pratama, Soyeon Caren Han, Yihao Ding

Large Vision-Language Models (LVLMs) achieve strong performance on visual question answering benchmarks, yet often rely on spurious correlations rather than genuine causal reasoning. Existing evaluations primarily assess the correctness of the answers, making it unclear whether failures arise from limited reasoning capability or from misidentifying causally relevant information. We introduce Vision-Language Causal Graphs (VLCGs), a structured, query-conditioned representation that explicitly encodes causally relevant objects, attributes, relations, and scene-grounded assumptions. Building on this representation, we present ViLCaR, a diagnostic benchmark comprising tasks for Causal Attribution, Causal Inference, and Question Answering, along with graph-aligned evaluation metrics that assess relevance identification beyond final answer accuracy. Experiments in state-of-the-art LVLMs show that injecting structured relevance information significantly improves attribution and inference consistency compared to zero-shot and standard in-context learning. These findings suggest that current limitations in LVLM causal reasoning stem primarily from insufficient structural guidance rather than a lack of reasoning capacity.

CVFeb 1, 2024
SCO-VIST: Social Interaction Commonsense Knowledge-based Visual Storytelling

Eileen Wang, Soyeon Caren Han, Josiah Poon

Visual storytelling aims to automatically generate a coherent story based on a given image sequence. Unlike tasks like image captioning, visual stories should contain factual descriptions, worldviews, and human social commonsense to put disjointed elements together to form a coherent and engaging human-writeable story. However, most models mainly focus on applying factual information and using taxonomic/lexical external knowledge when attempting to create stories. This paper introduces SCO-VIST, a framework representing the image sequence as a graph with objects and relations that includes human action motivation and its social interaction commonsense knowledge. SCO-VIST then takes this graph representing plot points and creates bridges between plot points with semantic and occurrence-based edge weights. This weighted story graph produces the storyline in a sequence of events using Floyd-Warshall's algorithm. Our proposed framework produces stories superior across multiple metrics in terms of visual grounding, coherence, diversity, and humanness, per both automatic and human evaluations.

CVApr 19, 2024
PDF-MVQA: A Dataset for Multimodal Information Retrieval in PDF-based Visual Question Answering

Yihao Ding, Kaixuan Ren, Jiabin Huang et al.

Document Question Answering (QA) presents a challenge in understanding visually-rich documents (VRD), particularly those dominated by lengthy textual content like research journal articles. Existing studies primarily focus on real-world documents with sparse text, while challenges persist in comprehending the hierarchical semantic relations among multiple pages to locate multimodal components. To address this gap, we propose PDF-MVQA, which is tailored for research journal articles, encompassing multiple pages and multimodal information retrieval. Unlike traditional machine reading comprehension (MRC) tasks, our approach aims to retrieve entire paragraphs containing answers or visually rich document entities like tables and figures. Our contributions include the introduction of a comprehensive PDF Document VQA dataset, allowing the examination of semantically hierarchical layout structures in text-dominant documents. We also present new VRD-QA frameworks designed to grasp textual contents and relations among document layouts simultaneously, extending page-level understanding to the entire multi-page document. Through this work, we aim to enhance the capabilities of existing vision-and-language models in handling challenges posed by text-dominant documents in VRD-QA.

CVJun 2, 2025
VRD-IU: Lessons from Visually Rich Document Intelligence and Understanding

Yihao Ding, Soyeon Caren Han, Yan Li et al.

Visually Rich Document Understanding (VRDU) has emerged as a critical field in document intelligence, enabling automated extraction of key information from complex documents across domains such as medical, financial, and educational applications. However, form-like documents pose unique challenges due to their complex layouts, multi-stakeholder involvement, and high structural variability. Addressing these issues, the VRD-IU Competition was introduced, focusing on extracting and localizing key information from multi-format forms within the Form-NLU dataset, which includes digital, printed, and handwritten documents. This paper presents insights from the competition, which featured two tracks: Track A, emphasizing entity-based key information retrieval, and Track B, targeting end-to-end key information localization from raw document images. With over 20 participating teams, the competition showcased various state-of-the-art methodologies, including hierarchical decomposition, transformer-based retrieval, multimodal feature fusion, and advanced object detection techniques. The top-performing models set new benchmarks in VRDU, providing valuable insights into document intelligence.

CLJan 8, 2025
Graph-Based Multimodal Contrastive Learning for Chart Question Answering

Yue Dai, Soyeon Caren Han, Wei Liu

Chart question answering (ChartQA) is challenged by the heterogeneous composition of chart elements and the subtle data patterns they encode. This work introduces a novel joint multimodal scene graph framework that explicitly models the relationships among chart components and their underlying structures. The framework integrates both visual and textual graphs to capture structural and semantic characteristics, while a graph contrastive learning strategy aligns node representations across modalities enabling their seamless incorporation into a transformer decoder as soft prompts. Moreover, a set of tailored Chain of Thought (CoT) prompts is proposed to enhance multimodal large language models (MLLMs) in zero-s ot scenarios by mitigating hallucinations. Extensive evaluations on benchmarks including ChartQA, OpenCQA, and ChartX demonstrate significant performance improvements and validate the efficacy of the proposed approach.

CLMar 9
BRIDGE: Benchmark for multi-hop Reasoning In long multimodal Documents with Grounded Evidence

Biao Xiang, Soyeon Caren Han, Yihao Ding

Multi-hop question answering (QA) is widely used to evaluate the reasoning capabilities of large language models, yet most benchmarks focus on final answer correctness and overlook intermediate reasoning, especially in long multimodal documents. We introduce BRIDGE, a benchmark for multi-hop reasoning over long scientific papers that require integrating evidence across text, tables, and figures. The dataset supports both chain-like and fan-out structures and provides explicit multi-hop reasoning annotations for step-level evaluation beyond answer accuracy. Experiments with state-of-the-art LLMs and multimodal retrieval-augmented generation (RAG) systems reveal systematic deficiencies in evidence aggregation and grounding that remain hidden under conventional answer-only evaluation. BRIDGE provides a targeted testbed for diagnosing reasoning failures in long multimodal documents.

CLNov 5, 2024
Multimodal Commonsense Knowledge Distillation for Visual Question Answering

Shuo Yang, Siwen Luo, Soyeon Caren Han

Existing Multimodal Large Language Models (MLLMs) and Visual Language Pretrained Models (VLPMs) have shown remarkable performances in the general Visual Question Answering (VQA). However, these models struggle with VQA questions that require external commonsense knowledge due to the challenges in generating high-quality prompts and the high computational costs of fine-tuning. In this work, we propose a novel graph-based multimodal commonsense knowledge distillation framework that constructs a unified relational graph over commonsense knowledge, visual objects and questions through a Graph Convolutional Network (GCN) following a teacher-student environment. This proposed framework is flexible with any type of teacher and student models without further fine-tuning, and has achieved competitive performances on the ScienceQA dataset.

CLNov 4, 2024
TriG-NER: Triplet-Grid Framework for Discontinuous Named Entity Recognition

Rina Carines Cabral, Soyeon Caren Han, Areej Alhassan et al.

Discontinuous Named Entity Recognition (DNER) presents a challenging problem where entities may be scattered across multiple non-adjacent tokens, making traditional sequence labelling approaches inadequate. Existing methods predominantly rely on custom tagging schemes to handle these discontinuous entities, resulting in models tightly coupled to specific tagging strategies and lacking generalisability across diverse datasets. To address these challenges, we propose TriG-NER, a novel Triplet-Grid Framework that introduces a generalisable approach to learning robust token-level representations for discontinuous entity extraction. Our framework applies triplet loss at the token level, where similarity is defined by word pairs existing within the same entity, effectively pulling together similar and pushing apart dissimilar ones. This approach enhances entity boundary detection and reduces the dependency on specific tagging schemes by focusing on word-pair relationships within a flexible grid structure. We evaluate TriG-NER on three benchmark DNER datasets and demonstrate significant improvements over existing grid-based architectures. These results underscore our framework's effectiveness in capturing complex entity structures and its adaptability to various tagging schemes, setting a new benchmark for discontinuous entity extraction.

SDFeb 1
HierCon: Hierarchical Contrastive Attention for Audio Deepfake Detection

Zhili Nicholas Liang, Soyeon Caren Han, Qizhou Wang et al.

Audio deepfakes generated by modern TTS and voice conversion systems are increasingly difficult to distinguish from real speech, raising serious risks for security and online trust. While state-of-the-art self-supervised models provide rich multi-layer representations, existing detectors treat layers independently and overlook temporal and hierarchical dependencies critical for identifying synthetic artefacts. We propose HierCon, a hierarchical layer attention framework combined with margin-based contrastive learning that models dependencies across temporal frames, neighbouring layers, and layer groups, while encouraging domain-invariant embeddings. Evaluated on ASVspoof 2021 DF and In-the-Wild datasets, our method achieves state-of-the-art performance (1.93% and 6.87% EER), improving over independent layer weighting by 36.6% and 22.5% respectively. The results and attention visualisations confirm that hierarchical modelling enhances generalisation to cross-domain generation techniques and recording conditions.

AINov 20, 2025
MUSEKG: A Knowledge Graph Over Museum Collections

Jinhao Li, Jianzhong Qi, Soyeon Caren Han et al.

Digital transformation in the cultural heritage sector has produced vast yet fragmented collections of artefact data. Existing frameworks for museum information systems struggle to integrate heterogeneous metadata, unstructured documents, and multimodal artefacts into a coherent and queryable form. We present MuseKG, an end-to-end knowledge-graph framework that unifies structured and unstructured museum data through symbolic-neural integration. MuseKG constructs a typed property graph linking objects, people, organisations, and visual or textual labels, and supports natural language queries. Evaluations on real museum collections demonstrate robust performance across queries over attributes, relations, and related entities, surpassing large-language-model zero-shot, few-shot and SPARQL prompt baselines. The results highlight the importance of symbolic grounding for interpretable and scalable cultural heritage reasoning, and pave the way for web-scale integration of digital heritage knowledge.

CLOct 27, 2025
MAP4TS: A Multi-Aspect Prompting Framework for Time-Series Forecasting with Large Language Models

Suchan Lee, Jihoon Choi, Sohyeon Lee et al.

Recent advances have investigated the use of pretrained large language models (LLMs) for time-series forecasting by aligning numerical inputs with LLM embedding spaces. However, existing multimodal approaches often overlook the distinct statistical properties and temporal dependencies that are fundamental to time-series data. To bridge this gap, we propose MAP4TS, a novel Multi-Aspect Prompting Framework that explicitly incorporates classical time-series analysis into the prompt design. Our framework introduces four specialized prompt components: a Global Domain Prompt that conveys dataset-level context, a Local Domain Prompt that encodes recent trends and series-specific behaviors, and a pair of Statistical and Temporal Prompts that embed handcrafted insights derived from autocorrelation (ACF), partial autocorrelation (PACF), and Fourier analysis. Multi-Aspect Prompts are combined with raw time-series embeddings and passed through a cross-modality alignment module to produce unified representations, which are then processed by an LLM and projected for final forecasting. Extensive experiments across eight diverse datasets show that MAP4TS consistently outperforms state-of-the-art LLM-based methods. Our ablation studies further reveal that prompt-aware designs significantly enhance performance stability and that GPT-2 backbones, when paired with structured prompts, outperform larger models like LLaMA in long-term forecasting tasks.

CVSep 27, 2025
SynDoc: A Hybrid Discriminative-Generative Framework for Enhancing Synthetic Domain-Adaptive Document Key Information Extraction

Yihao Ding, Soyeon Caren Han, Yanbei Jiang et al.

Domain-specific Visually Rich Document Understanding (VRDU) presents significant challenges due to the complexity and sensitivity of documents in fields such as medicine, finance, and material science. Existing Large (Multimodal) Language Models (LLMs/MLLMs) achieve promising results but face limitations such as hallucinations, inadequate domain adaptation, and reliance on extensive fine-tuning datasets. This paper introduces SynDoc, a novel framework that combines discriminative and generative models to address these challenges. SynDoc employs a robust synthetic data generation workflow, using structural information extraction and domain-specific query generation to produce high-quality annotations. Through adaptive instruction tuning, SynDoc improves the discriminative model's ability to extract domain-specific knowledge. At the same time, a recursive inferencing mechanism iteratively refines the output of both models for stable and accurate predictions. This framework demonstrates scalable, efficient, and precise document understanding and bridges the gap between domain-specific adaptation and general world knowledge for document key information extraction tasks.

CLAug 20, 2025
DocHop-QA: Towards Multi-Hop Reasoning over Multimodal Document Collections

Jiwon Park, Seohyun Pyeon, Jinwoo Kim et al.

Despite recent advances in large language models (LLMs), most QA benchmarks are still confined to single-paragraph or single-document settings, failing to capture the complexity of real-world information-seeking tasks. Practical QA often requires multi-hop reasoning over information distributed across multiple documents, modalities, and structural formats. Although prior datasets made progress in this area, they rely heavily on Wikipedia-based content and unimodal plain text, with shallow reasoning paths that typically produce brief phrase-level or single-sentence answers, thus limiting their realism and generalizability. We propose DocHop-QA, a large-scale benchmark comprising 11,379 QA instances for multimodal, multi-document, multi-hop question answering. Constructed from publicly available scientific documents sourced from PubMed, DocHop-QA is domain-agnostic and incorporates diverse information formats, including textual passages, tables, and structural layout cues. Unlike existing datasets, DocHop-QA does not rely on explicitly hyperlinked documents; instead, it supports open-ended reasoning through semantic similarity and layout-aware evidence synthesis. To scale realistic QA construction, we designed an LLM-driven pipeline grounded in 11 high-frequency scientific question concepts. We evaluated DocHop-QA through four tasks spanning structured index prediction, generative answering, and multimodal integration, reflecting both discriminative and generative paradigms. These tasks demonstrate DocHop-QA's capacity to support complex, multimodal reasoning across multiple documents.

AIAug 16, 2025
UniCast: A Unified Multimodal Prompting Framework for Time Series Forecasting

Sehyuk Park, Soyeon Caren Han, Eduard Hovy

Time series forecasting is a foundational task across domains, such as finance, healthcare, and environmental monitoring. While recent advances in Time Series Foundation Models (TSFMs) have demonstrated strong generalisation through large-scale pretraining, existing models operate predominantly in a unimodal setting, ignoring the rich multimodal context, such as visual and textual signals, that often accompanies time series data in real-world scenarios. This paper introduces a novel parameter-efficient multimodal framework, UniCast, that extends TSFMs to jointly leverage time series, vision, and text modalities for enhanced forecasting performance. Our method integrates modality-specific embeddings from pretrained Vision and Text Encoders with a frozen TSFM via soft prompt tuning, enabling efficient adaptation with minimal parameter updates. This design not only preserves the generalisation strength of the foundation model but also enables effective cross-modal interaction. Extensive experiments across diverse time-series forecasting benchmarks demonstrate that UniCast consistently and significantly outperforms all existing TSFM baselines. The findings highlight the critical role of multimodal context in advancing the next generation of general-purpose time series forecasters.

CLJun 13, 2024
3M: Multi-modal Multi-task Multi-teacher Learning for Game Event Detection

Thye Shan Ng, Feiqi Cao, Soyeon Caren Han

Esports has rapidly emerged as a global phenomenon with an ever-expanding audience via platforms, like YouTube. Due to the inherent complexity nature of the game, it is challenging for newcomers to comprehend what the event entails. The chaotic nature of online chat, the fast-paced speech of the game commentator, and the game-specific user interface further compound the difficulty for users in comprehending the gameplay. To overcome these challenges, it is crucial to integrate the Multi-Modal (MM) information from the platform and understand the event. The paper introduces a new MM multi-teacher-based game event detection framework, with the ultimate goal of constructing a comprehensive framework that enhances the comprehension of the ongoing game situation. While conventional MM models typically prioritise aligning MM data through concurrent training towards a unified objective, our framework leverages multiple teachers trained independently on different tasks to accomplish the Game Event Detection. The experiment clearly shows the effectiveness of the proposed MM multi-teacher framework.