LGSep 3, 2022Code
Disconnected Emerging Knowledge Graph Oriented Inductive Link PredictionYufeng Zhang, Weiqing Wang, Hongzhi Yin et al.
Inductive link prediction (ILP) is to predict links for unseen entities in emerging knowledge graphs (KGs), considering the evolving nature of KGs. A more challenging scenario is that emerging KGs consist of only unseen entities, called as disconnected emerging KGs (DEKGs). Existing studies for DEKGs only focus on predicting enclosing links, i.e., predicting links inside the emerging KG. The bridging links, which carry the evolutionary information from the original KG to DEKG, have not been investigated by previous work so far. To fill in the gap, we propose a novel model entitled DEKG-ILP (Disconnected Emerging Knowledge Graph Oriented Inductive Link Prediction) that consists of the following two components. (1) The module CLRM (Contrastive Learning-based Relation-specific Feature Modeling) is developed to extract global relation-based semantic features that are shared between original KGs and DEKGs with a novel sampling strategy. (2) The module GSM (GNN-based Subgraph Modeling) is proposed to extract the local subgraph topological information around each link in KGs. The extensive experiments conducted on several benchmark datasets demonstrate that DEKG-ILP has obvious performance improvements compared with state-of-the-art methods for both enclosing and bridging link prediction. The source code is available online.
CLJun 10, 2022Code
Unsupervised Sentence Simplification via Dependency ParsingVy Vo, Weiqing Wang, Wray Buntine
Text simplification is the task of rewriting a text so that it is readable and easily understood. In this paper, we propose a simple yet novel unsupervised sentence simplification system that harnesses parsing structures together with sentence embeddings to produce linguistically effective simplifications. This means our model is capable of introducing substantial modifications to simplify a sentence while maintaining its original semantics and adequate fluency. We establish the unsupervised state-of-the-art at 39.13 SARI on TurkCorpus set and perform competitively against supervised baselines on various quality metrics. Furthermore, we demonstrate our framework's extensibility to other languages via a proof-of-concept on Vietnamese data. Code for reproduction is published at \url{https://github.com/isVy08/USDP}.
AIDec 23, 2022
HiTSKT: A Hierarchical Transformer Model for Session-Aware Knowledge TracingFucai Ke, Weiqing Wang, Weicong Tan et al.
Knowledge tracing (KT) aims to leverage students' learning histories to estimate their mastery levels on a set of pre-defined skills, based on which the corresponding future performance can be accurately predicted. As an important way of providing personalized experience for online education, KT has gained increased attention in recent years. In practice, a student's learning history comprises answers to sets of massed questions, each known as a session, rather than merely being a sequence of independent answers. Theoretically, within and across these sessions, students' learning dynamics can be very different. Therefore, how to effectively model the dynamics of students' knowledge states within and across the sessions is crucial for handling the KT problem. Most existing KT models treat student's learning records as a single continuing sequence, without capturing the sessional shift of students' knowledge state. To address the above issue, we propose a novel hierarchical transformer model, named HiTSKT, comprises an interaction(-level) encoder to capture the knowledge a student acquires within a session, and a session(-level) encoder to summarise acquired knowledge across the past sessions. To predict an interaction in the current session, a knowledge retriever integrates the summarised past-session knowledge with the previous interactions' information into proper knowledge representations. These representations are then used to compute the student's current knowledge state. Additionally, to model the student's long-term forgetting behaviour across the sessions, a power-law-decay attention mechanism is designed and deployed in the session encoder, allowing it to emphasize more on the recent sessions. Extensive experiments on three public datasets demonstrate that HiTSKT achieves new state-of-the-art performance on all the datasets compared with six state-of-the-art KT models.
AISep 3, 2022
MMKGR: Multi-hop Multi-modal Knowledge Graph ReasoningShangfei Zheng, Weiqing Wang, Jianfeng Qu et al.
Multi-modal knowledge graphs (MKGs) include not only the relation triplets, but also related multi-modal auxiliary data (i.e., texts and images), which enhance the diversity of knowledge. However, the natural incompleteness has significantly hindered the applications of MKGs. To tackle the problem, existing studies employ the embedding-based reasoning models to infer the missing knowledge after fusing the multi-modal features. However, the reasoning performance of these methods is limited due to the following problems: (1) ineffective fusion of multi-modal auxiliary features; (2) lack of complex reasoning ability as well as inability to conduct the multi-hop reasoning which is able to infer more missing knowledge. To overcome these problems, we propose a novel model entitled MMKGR (Multi-hop Multi-modal Knowledge Graph Reasoning). Specifically, the model contains the following two components: (1) a unified gate-attention network which is designed to generate effective multi-modal complementary features through sufficient attention interaction and noise reduction; (2) a complementary feature-aware reinforcement learning method which is proposed to predict missing elements by performing the multi-hop reasoning process, based on the features obtained in component (1). The experimental results demonstrate that MMKGR outperforms the state-of-the-art approaches in the MKG reasoning task.
LGAug 8, 2024Code
Scalable Transformer for High Dimensional Multivariate Time Series ForecastingXin Zhou, Weiqing Wang, Wray Buntine et al.
Deep models for Multivariate Time Series (MTS) forecasting have recently demonstrated significant success. Channel-dependent models capture complex dependencies that channel-independent models cannot capture. However, the number of channels in real-world applications outpaces the capabilities of existing channel-dependent models, and contrary to common expectations, some models underperform the channel-independent models in handling high-dimensional data, which raises questions about the performance of channel-dependent models. To address this, our study first investigates the reasons behind the suboptimal performance of these channel-dependent models on high-dimensional MTS data. Our analysis reveals that two primary issues lie in the introduced noise from unrelated series that increases the difficulty of capturing the crucial inter-channel dependencies, and challenges in training strategies due to high-dimensional data. To address these issues, we propose STHD, the Scalable Transformer for High-Dimensional Multivariate Time Series Forecasting. STHD has three components: a) Relation Matrix Sparsity that limits the noise introduced and alleviates the memory issue; b) ReIndex applied as a training strategy to enable a more flexible batch size setting and increase the diversity of training data; and c) Transformer that handles 2-D inputs and captures channel dependencies. These components jointly enable STHD to manage the high-dimensional MTS while maintaining computational feasibility. Furthermore, experimental results show STHD's considerable improvement on three high-dimensional datasets: Crime-Chicago, Wiki-People, and Traffic. The source code and dataset are publicly available https://github.com/xinzzzhou/ScalableTransformer4HighDimensionMTSF.git.
CLJan 30, 2023
On Robustness of Prompt-based Semantic Parsing with Large Pre-trained Language Model: An Empirical Study on CodexTerry Yue Zhuo, Zhuang Li, Yujin Huang et al.
Semantic parsing is a technique aimed at constructing a structured representation of the meaning of a natural-language question. Recent advancements in few-shot language models trained on code have demonstrated superior performance in generating these representations compared to traditional unimodal language models, which are trained on downstream tasks. Despite these advancements, existing fine-tuned neural semantic parsers are susceptible to adversarial attacks on natural-language inputs. While it has been established that the robustness of smaller semantic parsers can be enhanced through adversarial training, this approach is not feasible for large language models in real-world scenarios, as it requires both substantial computational resources and expensive human annotation on in-domain semantic parsing data. This paper presents the first empirical study on the adversarial robustness of a large prompt-based language model of code, \codex. Our results demonstrate that the state-of-the-art (SOTA) code-language models are vulnerable to carefully crafted adversarial examples. To address this challenge, we propose methods for improving robustness without the need for significant amounts of labeled data or heavy computational resources.
CLMar 21, 2022
Paraphrasing Techniques for Maritime QA systemFatemeh Shiri, Terry Yue Zhuo, Zhuang Li et al.
There has been an increasing interest in incorporating Artificial Intelligence (AI) into Defence and military systems to complement and augment human intelligence and capabilities. However, much work still needs to be done toward achieving an effective human-machine partnership. This work is aimed at enhancing human-machine communications by developing a capability for automatically translating human natural language into a machine-understandable language (e.g., SQL queries). Techniques toward achieving this goal typically involve building a semantic parser trained on a very large amount of high-quality manually-annotated data. However, in many real-world Defence scenarios, it is not feasible to obtain such a large amount of training data. To the best of our knowledge, there are few works trying to explore the possibility of training a semantic parser with limited manually-paraphrased data, in other words, zero-shot. In this paper, we investigate how to exploit paraphrasing methods for the automated generation of large-scale training datasets (in the form of paraphrased utterances and their corresponding logical forms in SQL format) and present our experimental results using real-world data in the maritime domain.
ASAug 20, 2023
The DKU-DUKEECE System for the Manipulation Region Location Task of ADD 2023Zexin Cai, Weiqing Wang, Yikang Wang et al.
This paper introduces our system designed for Track 2, which focuses on locating manipulated regions, in the second Audio Deepfake Detection Challenge (ADD 2023). Our approach involves the utilization of multiple detection systems to identify splicing regions and determine their authenticity. Specifically, we train and integrate two frame-level systems: one for boundary detection and the other for deepfake detection. Additionally, we employ a third VAE model trained exclusively on genuine data to determine the authenticity of a given audio clip. Through the fusion of these three systems, our top-performing solution for the ADD challenge achieves an impressive 82.23% sentence accuracy and an F1 score of 60.66%. This results in a final ADD score of 0.6713, securing the first rank in Track 2 of ADD 2023.
CLAug 12, 2023
Generating Faithful Text From a Knowledge Graph with Noisy Reference TextTahsina Hashem, Weiqing Wang, Derry Tanti Wijaya et al.
Knowledge Graph (KG)-to-Text generation aims at generating fluent natural-language text that accurately represents the information of a given knowledge graph. While significant progress has been made in this task by exploiting the power of pre-trained language models (PLMs) with appropriate graph structure-aware modules, existing models still fall short of generating faithful text, especially when the ground-truth natural-language text contains additional information that is not present in the graph. In this paper, we develop a KG-to-text generation model that can generate faithful natural-language text from a given graph, in the presence of noisy reference text. Our framework incorporates two core ideas: Firstly, we utilize contrastive learning to enhance the model's ability to differentiate between faithful and hallucinated information in the text, thereby encouraging the decoder to generate text that aligns with the input graph. Secondly, we empower the decoder to control the level of hallucination in the generated text by employing a controllable text generation technique. We evaluate our model's performance through the standard quantitative metrics as well as a ChatGPT-based quantitative and qualitative analysis. Our evaluation demonstrates the superior performance of our model over state-of-the-art KG-to-text models on faithfulness.
ASSep 10, 2024
Sortformer: A Novel Approach for Permutation-Resolved Speaker Supervision in Speech-to-Text SystemsTaejin Park, Ivan Medennikov, Kunal Dhawan et al.
Sortformer is an encoder-based speaker diarization model designed for supervising speaker tagging in speech-to-text models. Instead of relying solely on permutation invariant loss (PIL), Sortformer introduces Sort Loss to resolve the permutation problem, either independently or in tandem with PIL. In addition, we propose a streamlined multi-speaker speech-to-text architecture that leverages Sortformer for speaker supervision, embedding speaker labels into the encoder using sinusoidal kernel functions. This design addresses the speaker permutation problem through sorted objectives, effectively bridging timestamps and tokens to supervise speaker labels in the output transcriptions. Experiments demonstrate that Sort Loss can boost speaker diarization performance, and incorporating the speaker supervision from Sortformer improves multi-speaker transcription accuracy. We anticipate that the proposed Sortformer and multi-speaker architecture will enable the seamless integration of speaker tagging capabilities into foundational speech-to-text systems and multimodal large language models (LLMs), offering an easily adoptable and user-friendly mechanism to enhance their versatility and performance in speaker-aware tasks. The code and trained models are made publicly available through the NVIDIA NeMo Framework.
96.6CVMar 18
VIEW2SPACE: Studying Multi-View Visual Reasoning from Sparse ObservationsFucai Ke, Zhixi Cai, Boying Li et al.
Multi-view visual reasoning is essential for intelligent systems that must understand complex environments from sparse and discrete viewpoints, yet existing research has largely focused on single-image or temporally dense video settings. In real-world scenarios, reasoning across views requires integrating partial observations without explicit guidance, while collecting large-scale multi-view data with accurate geometric and semantic annotations remains challenging. To address this gap, we leverage physically grounded simulation to construct diverse, high-fidelity 3D scenes with precise per-view metadata, enabling scalable data generation that remains transferable to real-world settings. Based on this engine, we introduce VIEW2SPACE, a multi-dimensional benchmark for sparse multi-view reasoning, together with a scalable, disjoint training split supporting millions of grounded question-answer pairs. Using this benchmark, a comprehensive evaluation of state-of-the-art vision-language and spatial models reveals that multi-view reasoning remains largely unsolved, with most models performing only marginally above random guessing. We further investigate whether training can bridge this gap. Our proposed Grounded Chain-of-Thought with Visual Evidence substantially improves performance under moderate difficulty, and generalizes to real-world data, outperforming existing approaches in cross-dataset evaluation. We further conduct difficulty-aware scaling analyses across model size, data scale, reasoning depth, and visibility constraints, indicating that while geometric perception can benefit from scaling under sufficient visibility, deep compositional reasoning across sparse views remains a fundamental challenge.
26.5CLMay 7
How Does A Text Preprocessing Pipeline Affect Ontology Matching?Zhangcheng Qiang, Kerry Taylor, Weiqing Wang
The classical text preprocessing pipeline, comprising Tokenisation, Normalisation, Stop Words Removal, and Stemming/Lemmatisation, has been implemented in many systems for ontology matching (OM). However, the lack of standardisation in text preprocessing creates diversity in the mapping results. In this paper, we investigate the effect of the text preprocessing pipeline on 8 Ontology Alignment Evaluation Initiative (OAEI) tracks with 49 distinct alignments. We find that Tokenisation and Normalisation (categorised as Phase 1 text preprocessing) are more effective than Stop Words Removal and Stemming/Lemmatisation (categorised as Phase 2 text preprocessing). We propose two novel approaches to repair unwanted false mappings that occur in Phase 2 text preprocessing. One is a pre hoc logic-based repair approach used before text preprocessing, employing an ontology-specific check to find common words that cause false mappings. The other repair approach is the post hoc large language model (LLM)-based approach, used after text preprocessing, which utilises the strong background knowledge provided by LLMs to repair non-existent and counter-intuitive false mappings. The experimental results indicate that these two approaches can significantly improve the matching correctness and the overall matching performance.
CVSep 6, 2024
Generating Faithful and Salient Text from Multimodal DataTahsina Hashem, Weiqing Wang, Derry Tanti Wijaya et al.
While large multimodal models (LMMs) have obtained strong performance on many multimodal tasks, they may still hallucinate while generating text. Their performance on detecting salient features from visual data is also unclear. In this paper, we develop a framework to generate faithful and salient text from mixed-modal data, which includes images and structured data ( represented in knowledge graphs or tables). Specifically, we train a small vision critic model to identify hallucinated and non-salient features from the image modality. The critic model also generates a list of salient image features. This information is used in the post editing step to improve the generation quality. Experiments on two datasets show that our framework improves LMMs' generation quality on both faithfulness and saliency, outperforming recent techniques aimed at reducing hallucination.
AISep 21, 2024
OAEI-LLM: A Benchmark Dataset for Understanding Large Language Model Hallucinations in Ontology MatchingZhangcheng Qiang, Kerry Taylor, Weiqing Wang et al.
Hallucinations of large language models (LLMs) commonly occur in domain-specific downstream tasks, with no exception in ontology matching (OM). The prevalence of using LLMs for OM raises the need for benchmarks to better understand LLM hallucinations. The OAEI-LLM dataset is an extended version of the Ontology Alignment Evaluation Initiative (OAEI) datasets that evaluate LLM-specific hallucinations in OM tasks. We outline the methodology used in dataset construction and schema extension, and provide examples of potential use cases.
AIFeb 22, 2023
DNG: Taxonomy Expansion by Exploring the Intrinsic Directed Structure on Non-gaussian SpaceSonglin Zhai, Weiqing Wang, Yuanfang Li et al.
Taxonomy expansion is the process of incorporating a large number of additional nodes (i.e., "queries") into an existing taxonomy (i.e., "seed"), with the most important step being the selection of appropriate positions for each query. Enormous efforts have been made by exploring the seed's structure. However, existing approaches are deficient in their mining of structural information in two ways: poor modeling of the hierarchical semantics and failure to capture directionality of is-a relation. This paper seeks to address these issues by explicitly denoting each node as the combination of inherited feature (i.e., structural part) and incremental feature (i.e., supplementary part). Specifically, the inherited feature originates from "parent" nodes and is weighted by an inheritance factor. With this node representation, the hierarchy of semantics in taxonomies (i.e., the inheritance and accumulation of features from "parent" to "child") could be embodied. Additionally, based on this representation, the directionality of is-a relation could be easily translated into the irreversible inheritance of features. Inspired by the Darmois-Skitovich Theorem, we implement this irreversibility by a non-Gaussian constraint on the supplementary feature. A log-likelihood learning objective is further utilized to optimize the proposed model (dubbed DNG), whereby the required non-Gaussianity is also theoretically ensured. Extensive experimental results on two real-world datasets verify the superiority of DNG relative to several strong baselines.
43.6AIApr 14
OM4OV: Leveraging Ontology Matching for Ontology VersioningZhangcheng Qiang, Kerry Taylor, Weiqing Wang
Due to the dynamic nature of the Semantic Web, version control is necessary to manage changes in widely used ontologies. Despite the long-standing recognition of ontology versioning (OV) as a crucial component of efficient ontology management, many approaches treat OV as similar to ontology matching (OM) and directly reuse OM systems for OV tasks. In this study, we systematically analyse similarities and differences between OM and OV and formalise an OM4OV pipeline to offer more advanced OV support. The pipeline is implemented and evaluated in the state-of-the-art OM system Agent-OM. The experimental results indicate that OM systems can be effectively reused for OV tasks, but without necessary extensions, can produce skewed measurements, poor performance in detecting update entities, and limited explanation of false mappings. To tackle these issues, we propose an optimisation method called the cross-reference (CR) mechanism, which builds on existing OM alignments to reduce the number of matching candidates and to improve overall OV performance.
AISep 30, 2024
OM4OV: Leveraging Ontology Matching for Ontology VersioningZhangcheng Qiang, Kerry Taylor, Weiqing Wang
Due to the dynamic nature of the Semantic Web, version control is necessary to manage changes in widely used ontologies. Despite the long-standing recognition of ontology versioning (OV) as a crucial component of efficient ontology management, many approaches treat OV as similar to ontology matching (OM) and directly reuse OM systems for OV tasks. In this study, we systematically analyse similarities and differences between OM and OV and formalise an OM4OV pipeline to offer more advanced OV support. The pipeline is implemented and evaluated in the state-of-the-art OM system Agent-OM. The experimental results indicate that OM systems can be effectively reused for OV tasks, but without necessary extensions, can produce skewed measurements, poor performance in detecting update entities, and limited explanation of false mappings. To tackle these issues, we propose an optimisation method called the cross-reference (CR) mechanism, which builds on existing OM alignments to reduce the number of matching candidates and to improve overall OV performance.
CVMar 25, 2025
DWIM: Towards Tool-aware Visual Reasoning via Discrepancy-aware Workflow Generation & Instruct-Masking TuningFucai Ke, Vijay Kumar B G, Xingjian Leng et al.
Visual reasoning (VR), which is crucial in many fields for enabling human-like visual understanding, remains highly challenging. Recently, compositional visual reasoning approaches, which leverage the reasoning abilities of large language models (LLMs) with integrated tools to solve problems, have shown promise as more effective strategies than end-to-end VR methods. However, these approaches face limitations, as frozen LLMs lack tool awareness in VR, leading to performance bottlenecks. While leveraging LLMs for reasoning is widely used in other domains, they are not directly applicable to VR due to limited training data, imperfect tools that introduce errors and reduce data collection efficiency in VR, and challenging in fine-tuning on noisy workflows. To address these challenges, we propose DWIM: i) Discrepancy-aware training Workflow generation, which assesses tool usage and extracts more viable workflows for training; and ii) Instruct-Masking fine-tuning, which guides the model to only clone effective actions, enabling the generation of more practical solutions. Our experiments demonstrate that DWIM achieves state-of-the-art performance across various VR tasks, exhibiting strong generalization on multiple widely-used datasets.
CVAug 24, 2025
Explain Before You Answer: A Survey on Compositional Visual ReasoningFucai Ke, Joy Hsu, Zhixi Cai et al.
Compositional visual reasoning has emerged as a key research frontier in multimodal AI, aiming to endow machines with the human-like ability to decompose visual scenes, ground intermediate concepts, and perform multi-step logical inference. While early surveys focus on monolithic vision-language models or general multimodal reasoning, a dedicated synthesis of the rapidly expanding compositional visual reasoning literature is still missing. We fill this gap with a comprehensive survey spanning 2023 to 2025 that systematically reviews 260+ papers from top venues (CVPR, ICCV, NeurIPS, ICML, ACL, etc.). We first formalize core definitions and describe why compositional approaches offer advantages in cognitive alignment, semantic fidelity, robustness, interpretability, and data efficiency. Next, we trace a five-stage paradigm shift: from prompt-enhanced language-centric pipelines, through tool-enhanced LLMs and tool-enhanced VLMs, to recently minted chain-of-thought reasoning and unified agentic VLMs, highlighting their architectural designs, strengths, and limitations. We then catalog 60+ benchmarks and corresponding metrics that probe compositional visual reasoning along dimensions such as grounding accuracy, chain-of-thought faithfulness, and high-resolution perception. Drawing on these analyses, we distill key insights, identify open challenges (e.g., limitations of LLM-based reasoning, hallucination, a bias toward deductive reasoning, scalable supervision, tool integration, and benchmark limitations), and outline future directions, including world-model integration, human-AI collaborative reasoning, and richer evaluation protocols. By offering a unified taxonomy, historical roadmap, and critical outlook, this survey aims to serve as a foundational reference and inspire the next generation of compositional visual reasoning research.
AIJan 13, 2025
Unveiling the Potential of Text in High-Dimensional Time Series ForecastingXin Zhou, Weiqing Wang, Shilin Qu et al.
Time series forecasting has traditionally focused on univariate and multivariate numerical data, often overlooking the benefits of incorporating multimodal information, particularly textual data. In this paper, we propose a novel framework that integrates time series models with Large Language Models to improve high-dimensional time series forecasting. Inspired by multimodal models, our method combines time series and textual data in the dual-tower structure. This fusion of information creates a comprehensive representation, which is then processed through a linear layer to generate the final forecast. Extensive experiments demonstrate that incorporating text enhances high-dimensional time series forecasting performance. This work paves the way for further research in multimodal time series forecasting.
LGMay 21, 2025
MoTime: A Dataset Suite for Multimodal Time Series ForecastingXin Zhou, Weiqing Wang, Francisco J. Baldán et al.
While multimodal data sources are increasingly available from real-world forecasting, most existing research remains on unimodal time series. In this work, we present MoTime, a suite of multimodal time series forecasting datasets that pair temporal signals with external modalities such as text, metadata, and images. Covering diverse domains, MoTime supports structured evaluation of modality utility under two scenarios: 1) the common forecasting task, where varying-length history is available, and 2) cold-start forecasting, where no historical data is available. Experiments show that external modalities can improve forecasting performance in both scenarios, with particularly strong benefits for short series in some datasets, though the impact varies depending on data characteristics. By making datasets and findings publicly available, we aim to support more comprehensive and realistic benchmarks in future multimodal time series forecasting research.
LGJan 29, 2024
OntoMedRec: Logically-Pretrained Model-Agnostic Ontology Encoders for Medication RecommendationWeicong Tan, Weiqing Wang, Xin Zhou et al.
Most existing medication recommendation models learn representations for medical concepts based on electronic health records (EHRs) and make recommendations with learnt representations. However, most medications appear in the dataset for limited times, resulting in insufficient learning of their representations. Medical ontologies are the hierarchical classification systems for medical terms where similar terms are in the same class on a certain level. In this paper, we propose OntoMedRec, the logically-pretrained and model-agnostic medical Ontology Encoders for Medication Recommendation that addresses data sparsity problem with medical ontologies. We conduct comprehensive experiments on benchmark datasets to evaluate the effectiveness of OntoMedRec, and the result shows the integration of OntoMedRec improves the performance of various models in both the entire EHR datasets and the admissions with few-shot medications. We provide the GitHub repository for the source code on https://anonymous.4open.science/r/OntoMedRec-D123
42.8SYMar 24
Safe Decentralized Operation of EV Virtual Power Plant with Limited Network Visibility via Multi-Agent Reinforcement LearningChenghao Huang, Jiarong Fan, Weiqing Wang et al.
As power systems advance toward net-zero targets, behind-the-meter renewables are driving rapid growth in distributed energy resources (DERs). Virtual power plants (VPPs) increasingly coordinate these resources to support power distribution network (PDN) operation, with EV charging stations (EVCSs) emerging as a key asset due to their strong impact on local voltages. However, in practice, VPPs must make operational decisions with only partial visibility of PDN states, relying on limited, aggregated information shared by the distribution system operator. This work proposes a safety-enhanced VPP framework for coordinating multiple EVCSs under such realistic information constraints to ensure voltage security while maintaining economic operation. We develop Transformer-assisted Lagrangian Multi-Agent Proximal Policy Optimization (TL-MAPPO), in which EVCS agents learn decentralized charging policies via centralized training with Lagrangian regularization to enforce voltage and demand-satisfaction constraints. A transformer-based embedding layer deployed on each EVCS agent captures temporal correlations among prices, loads, and charging demand to improve decision quality. Experiments on a realistic 33-bus PDN show that the proposed framework reduces voltage violations by approximately 45% and operational costs by approximately 10% compared to representative multi-agent DRL baselines, highlighting its potential for practical VPP deployment.
SDAug 7, 2025
SPGISpeech 2.0: Transcribed multi-speaker financial audio for speaker-tagged transcriptionRaymond Grossman, Taejin Park, Kunal Dhawan et al.
We introduce SPGISpeech 2.0, a dataset suitable for speaker-tagged transcription in the financial domain. SPGISpeech 2.0 improves the diversity of applicable modeling tasks while maintaining the core characteristic of the original SPGISpeech dataset: audio snippets and their corresponding fully formatted text transcriptions, usable for end-to-end automatic speech recognition (ASR). SPGISpeech 2.0 consists of 3,780 additional hours of professionally transcribed earnings calls. Furthermore, the dataset contains call and speaker information for each audio snippet facilitating multi-talker ASR. We validate the utility of SPGISpeech 2.0 through improvements in speaker-tagged ASR performance of popular speech recognition models after fine-tuning on SPGISpeech 2.0. Released free for non-commercial use, we expect SPGISpeech 2.0 to foster advancements in speech recognition technologies and inspire a wide range of research applications.
IRJul 20, 2025
Privacy Risks of LLM-Empowered Recommender Systems: An Inversion Attack PerspectiveYubo Wang, Min Tang, Nuo Shen et al.
The large language model (LLM) powered recommendation paradigm has been proposed to address the limitations of traditional recommender systems, which often struggle to handle cold start users or items with new IDs. Despite its effectiveness, this study uncovers that LLM empowered recommender systems are vulnerable to reconstruction attacks that can expose both system and user privacy. To examine this threat, we present the first systematic study on inversion attacks targeting LLM empowered recommender systems, where adversaries attempt to reconstruct original prompts that contain personal preferences, interaction histories, and demographic attributes by exploiting the output logits of recommendation models. We reproduce the vec2text framework and optimize it using our proposed method called Similarity Guided Refinement, enabling more accurate reconstruction of textual prompts from model generated logits. Extensive experiments across two domains (movies and books) and two representative LLM based recommendation models demonstrate that our method achieves high fidelity reconstructions. Specifically, we can recover nearly 65 percent of the user interacted items and correctly infer age and gender in 87 percent of the cases. The experiments also reveal that privacy leakage is largely insensitive to the victim model's performance but highly dependent on domain consistency and prompt complexity. These findings expose critical privacy vulnerabilities in LLM empowered recommender systems.
IRJul 14, 2025
MixLoRA-DSI: Dynamically Expandable Mixture-of-LoRA Experts for Rehearsal-Free Generative Retrieval over Dynamic CorporaTuan-Luc Huynh, Thuy-Trang Vu, Weiqing Wang et al.
Continually updating model-based indexes in generative retrieval with new documents remains challenging, as full retraining is computationally expensive and impractical under resource constraints. We propose MixLoRA-DSI, a novel framework that combines an expandable mixture of Low-Rank Adaptation experts with a layer-wise out-of-distribution (OOD)-driven expansion strategy. Instead of allocating new experts for each new corpus, our proposed expansion strategy enables sublinear parameter growth by selectively introducing new experts only when significant number of OOD documents are detected. Experiments on NQ320k and MS MARCO Passage demonstrate that MixLoRA-DSI outperforms full-model update baselines, with minimal parameter overhead and substantially lower training costs.
CLMar 25, 2025
OAEI-LLM-T: A TBox Benchmark Dataset for Understanding Large Language Model Hallucinations in Ontology MatchingZhangcheng Qiang, Kerry Taylor, Weiqing Wang et al.
Hallucinations are often inevitable in downstream tasks using large language models (LLMs). To tackle the substantial challenge of addressing hallucinations for LLM-based ontology matching (OM) systems, we introduce a new benchmark dataset OAEI-LLM-T. The dataset evolves from seven TBox datasets in the Ontology Alignment Evaluation Initiative (OAEI), capturing hallucinations of ten different LLMs performing OM tasks. These OM-specific hallucinations are organised into two primary categories and six sub-categories. We showcase the usefulness of the dataset in constructing an LLM leaderboard for OM tasks and for fine-tuning LLMs used in OM tasks.
CLNov 6, 2024
How Does A Text Preprocessing Pipeline Affect Ontology Matching?Zhangcheng Qiang, Kerry Taylor, Weiqing Wang
The classical text preprocessing pipeline, comprising Tokenisation, Normalisation, Stop Words Removal, and Stemming/Lemmatisation, has been implemented in many systems for ontology matching (OM). However, the lack of standardisation in text preprocessing creates diversity in the mapping results. In this paper, we investigate the effect of the text preprocessing pipeline on 8 Ontology Alignment Evaluation Initiative (OAEI) tracks with 49 distinct alignments. We find that Tokenisation and Normalisation (categorised as Phase 1 text preprocessing) are more effective than Stop Words Removal and Stemming/Lemmatisation (categorised as Phase 2 text preprocessing). We propose two novel approaches to repair unwanted false mappings that occur in Phase 2 text preprocessing. One is an ad hoc logic-based repair approach used before text preprocessing, employing an ontology-specific check to find common words that cause false mappings. The other repair approach is the post hoc large language model (LLM)-based approach, used after text preprocessing, which utilises the strong background knowledge provided by LLMs to repair non-existent and counter-intuitive false mappings. The experimental results indicate that these two approaches can significantly improve the matching correctness and the overall matching performance.
LGOct 15, 2024
WPFed: Web-based Personalized Federation for Decentralized SystemsGuanhua Ye, Jifeng He, Weiqing Wang et al.
Decentralized learning has become crucial for collaborative model training in environments where data privacy and trust are paramount. In web-based applications, clients are liberated from traditional fixed network topologies, enabling the establishment of arbitrary peer-to-peer (P2P) connections. While this flexibility is highly promising, it introduces a fundamental challenge: the optimal selection of neighbors to ensure effective collaboration. To address this, we introduce WPFed, a fully decentralized, web-based learning framework designed to enable globally optimal neighbor selection. WPFed employs a dynamic communication graph and a weighted neighbor selection mechanism. By assessing inter-client similarity through Locality-Sensitive Hashing (LSH) and evaluating model quality based on peer rankings, WPFed enables clients to identify personalized optimal neighbors on a global scale while preserving data privacy. To enhance security and deter malicious behavior, WPFed integrates verification mechanisms for both LSH codes and performance rankings, leveraging blockchain-driven announcements to ensure transparency and verifiability. Through extensive experiments on multiple real-world datasets, we demonstrate that WPFed significantly improves learning outcomes and system robustness compared to traditional federated learning methods. Our findings highlight WPFed's potential to facilitate effective and secure decentralized collaborative learning across diverse and interconnected web environments.
IRJun 18, 2024
PromptDSI: Prompt-based Rehearsal-free Continual Learning for Document RetrievalTuan-Luc Huynh, Thuy-Trang Vu, Weiqing Wang et al.
Differentiable Search Index (DSI) utilizes pre-trained language models to perform indexing and document retrieval via end-to-end learning without relying on external indexes. However, DSI requires full re-training to index new documents, causing significant computational inefficiencies. Continual learning (CL) offers a solution by enabling the model to incrementally update without full re-training. Existing CL solutions in document retrieval rely on memory buffers or generative models for rehearsal, which is infeasible when accessing previous training data is restricted due to privacy concerns. To this end, we introduce PromptDSI, a prompt-based, rehearsal-free continual learning approach for document retrieval. PromptDSI follows the Prompt-based Continual Learning (PCL) framework, using learnable prompts to efficiently index new documents without accessing previous documents or queries. To improve retrieval latency, we remove the initial forward pass of PCL, which otherwise greatly increases training and inference time, with a negligible trade-off in performance. Additionally, we introduce a novel topic-aware prompt pool that employs neural topic embeddings as fixed keys, eliminating the instability of prompt key optimization while maintaining competitive performance with existing PCL prompt pools. In a challenging rehearsal-free continual learning setup, we demonstrate that PromptDSI variants outperform rehearsal-based baselines, match the strong cache-based baseline in mitigating forgetting, and significantly improving retrieval performance on new corpora.
CVMar 19, 2024
HYDRA: A Hyper Agent for Dynamic Compositional Visual ReasoningFucai Ke, Zhixi Cai, Simindokht Jahangard et al.
Recent advances in visual reasoning (VR), particularly with the aid of Large Vision-Language Models (VLMs), show promise but require access to large-scale datasets and face challenges such as high computational costs and limited generalization capabilities. Compositional visual reasoning approaches have emerged as effective strategies; however, they heavily rely on the commonsense knowledge encoded in Large Language Models (LLMs) to perform planning, reasoning, or both, without considering the effect of their decisions on the visual reasoning process, which can lead to errors or failed procedures. To address these challenges, we introduce HYDRA, a multi-stage dynamic compositional visual reasoning framework designed for reliable and incrementally progressive general reasoning. HYDRA integrates three essential modules: a planner, a Reinforcement Learning (RL) agent serving as a cognitive controller, and a reasoner. The planner and reasoner modules utilize an LLM to generate instruction samples and executable code from the selected instruction, respectively, while the RL agent dynamically interacts with these modules, making high-level decisions on selection of the best instruction sample given information from the historical state stored through a feedback loop. This adaptable design enables HYDRA to adjust its actions based on previous feedback received during the reasoning process, leading to more reliable reasoning outputs and ultimately enhancing its overall effectiveness. Our framework demonstrates state-of-the-art performance in various VR tasks on four different widely-used datasets.
LGMay 24, 2023
Newton-Cotes Graph Neural Networks: On the Time Evolution of Dynamic SystemsLingbing Guo, Weiqing Wang, Zhuo Chen et al.
Reasoning system dynamics is one of the most important analytical approaches for many scientific studies. With the initial state of a system as input, the recent graph neural networks (GNNs)-based methods are capable of predicting the future state distant in time with high accuracy. Although these methods have diverse designs in modeling the coordinates and interacting forces of the system, we show that they actually share a common paradigm that learns the integration of the velocity over the interval between the initial and terminal coordinates. However, their integrand is constant w.r.t. time. Inspired by this observation, we propose a new approach to predict the integration based on several velocity estimations with Newton-Cotes formulas and prove its effectiveness theoretically. Extensive experiments on several benchmarks empirically demonstrate consistent and significant improvement compared with the state-of-the-art methods.
ASFeb 6, 2022
Cross-Channel Attention-Based Target Speaker Voice Activity Detection: Experimental Results for M2MeT ChallengeWeiqing Wang, Xiaoyi Qin, Ming Li
In this paper, we present the speaker diarization system for the Multi-channel Multi-party Meeting Transcription Challenge (M2MeT) from team DKU_DukeECE. As the highly overlapped speech exists in the dataset, we employ an x-vector-based target-speaker voice activity detection (TS-VAD) to find the overlap between speakers. For the single-channel scenario, we separately train a model for each of the 8 channels and fuse the results. We also employ the cross-channel self-attention to further improve the performance, where the non-linear spatial correlations between different channels are learned and fused. Experimental results on the evaluation set show that the single-channel TS-VAD reduces the DER by over 75% from 12.68\% to 3.14%. The multi-channel TS-VAD further reduces the DER by 28% and achieves a DER of 2.26%. Our final submitted system achieves a DER of 2.98% on the AliMeeting test set, which ranks 1st in the M2MET challenge.
ASNov 27, 2021
Low-Latency Online Speaker Diarization with Graph-Based Label GenerationYucong Zhang, Qinjian Lin, Weiqing Wang et al.
This paper introduces an online speaker diarization system that can handle long-time audio with low latency. We enable Agglomerative Hierarchy Clustering (AHC) to work in an online fashion by introducing a label matching algorithm. This algorithm solves the inconsistency between output labels and hidden labels that are generated each turn. To ensure the low latency in the online setting, we introduce a variant of AHC, namely chkpt-AHC, to cluster the speakers. In addition, we propose a speaker embedding graph to exploit a graph-based re-clustering method, further improving the performance. In the experiment, we evaluate our systems on both DIHARD3 and VoxConverse datasets. The experimental results show that our proposed online systems have better performance than our baseline online system and have comparable performance to our offline systems. We find out that the framework combining the chkpt-AHC method and the label matching algorithm works well in the online setting. Moreover, the chkpt-AHC method greatly reduces the time cost, while the graph-based re-clustering method helps improve the performance.
ASSep 5, 2021
The DKU-DukeECE-Lenovo System for the Diarization Task of the 2021 VoxCeleb Speaker Recognition ChallengeWeiqing Wang, Danwei Cai, Qingjian Lin et al.
This report describes the submission of the DKU-DukeECE-Lenovo team to the VoxCeleb Speaker Recognition Challenge (VoxSRC) 2021 track 4. Our system including a voice activity detection (VAD) model, a speaker embedding model, two clustering-based speaker diarization systems with different similarity measurements, two different overlapped speech detection (OSD) models, and a target-speaker voice activity detection (TS-VAD) model. Our final submission, consisting of 5 independent systems, achieves a DER of 5.07% on the challenge test set.
LGAug 16, 2021
Task-wise Split Gradient Boosting Trees for Multi-center Diabetes PredictionMingcheng Chen, Zhenghui Wang, Zhiyun Zhao et al.
Diabetes prediction is an important data science application in the social healthcare domain. There exist two main challenges in the diabetes prediction task: data heterogeneity since demographic and metabolic data are of different types, data insufficiency since the number of diabetes cases in a single medical center is usually limited. To tackle the above challenges, we employ gradient boosting decision trees (GBDT) to handle data heterogeneity and introduce multi-task learning (MTL) to solve data insufficiency. To this end, Task-wise Split Gradient Boosting Trees (TSGB) is proposed for the multi-center diabetes prediction task. Specifically, we firstly introduce task gain to evaluate each task separately during tree construction, with a theoretical analysis of GBDT's learning objective. Secondly, we reveal a problem when directly applying GBDT in MTL, i.e., the negative task gain problem. Finally, we propose a novel split method for GBDT in MTL based on the task gain statistics, named task-wise split, as an alternative to standard feature-wise split to overcome the mentioned negative task gain problem. Extensive experiments on a large-scale real-world diabetes dataset and a commonly used benchmark dataset demonstrate TSGB achieves superior performance against several state-of-the-art methods. Detailed case studies further support our analysis of negative task gain problems and provide insightful findings. The proposed TSGB method has been deployed as an online diabetes risk assessment software for early diagnosis.
ASFeb 6, 2021
The DKU-Duke-Lenovo System Description for the Third DIHARD Speech Diarization ChallengeWeiqing Wang, Qingjian Lin, Danwei Cai et al.
In this paper, we present the submitted system for the third DIHARD Speech Diarization Challenge from the DKU-Duke-Lenovo team. Our system consists of several modules: voice activity detection (VAD), segmentation, speaker embedding extraction, attentive similarity scoring, agglomerative hierarchical clustering. In addition, the target speaker VAD (TSVAD) is used for the phone call data to further improve the performance. Our final submitted system achieves a DER of 15.43% for the core evaluation set and 13.39% for the full evaluation set on task 1, and we also get a DER of 21.63% for core evaluation set and 18.90% for full evaluation set on task 2.
LGFeb 4, 2021
Temporal Cascade and Structural Modelling of EHRs for Granular Readmission PredictionBhagya Hettige, Weiqing Wang, Yuan-Fang Li et al.
Predicting (1) when the next hospital admission occurs and (2) what will happen in the next admission about a patient by mining electronic health record (EHR) data can provide granular readmission predictions to assist clinical decision making. Recurrent neural network (RNN) and point process models are usually employed in modelling temporal sequential data. Simple RNN models assume that sequences of hospital visits follow strict causal dependencies between consecutive visits. However, in the real-world, a patient may have multiple co-existing chronic medical conditions, i.e., multimorbidity, which results in a cascade of visits where a non-immediate historical visit can be most influential to the next visit. Although a point process (e.g., Hawkes process) is able to model a cascade temporal relationship, it strongly relies on a prior generative process assumption. We propose a novel model, MEDCAS, to address these challenges. MEDCAS combines the strengths of RNN-based models and point processes by integrating point processes in modelling visit types and time gaps into an attention-based sequence-to-sequence learning model, which is able to capture the temporal cascade relationships. To supplement the patients with short visit sequences, a structural modelling technique with graph-based methods is used to construct the markers of the point process in MEDCAS. Extensive experiments on three real-world EHR datasets have been performed and the results demonstrate that \texttt{MEDCAS} outperforms state-of-the-art models in both tasks.
ASOct 25, 2020
An iterative framework for self-supervised deep speaker representation learningDanwei Cai, Weiqing Wang, Ming Li
In this paper, we propose an iterative framework for self-supervised speaker representation learning based on a deep neural network (DNN). The framework starts with training a self-supervision speaker embedding network by maximizing agreement between different segments within an utterance via a contrastive loss. Taking advantage of DNN's ability to learn from data with label noise, we propose to cluster the speaker embedding obtained from the previous speaker network and use the subsequent class assignments as pseudo labels to train a new DNN. Moreover, we iteratively train the speaker network with pseudo labels generated from the previous step to bootstrap the discriminative power of a DNN. Speaker verification experiments are conducted on the VoxCeleb dataset. The results show that our proposed iterative self-supervised learning framework outperformed previous works using self-supervision. The speaker network after 5 iterations obtains a 61% performance gain over the speaker embedding model trained with contrastive loss.
LGOct 22, 2020
Learning Transferrable Parameters for Long-tailed Sequential User Behavior ModelingJianwen Yin, Chenghao Liu, Weiqing Wang et al.
Sequential user behavior modeling plays a crucial role in online user-oriented services, such as product purchasing, news feed consumption, and online advertising. The performance of sequential modeling heavily depends on the scale and quality of historical behaviors. However, the number of user behaviors inherently follows a long-tailed distribution, which has been seldom explored. In this work, we argue that focusing on tail users could bring more benefits and address the long tails issue by learning transferrable parameters from both optimization and feature perspectives. Specifically, we propose a gradient alignment optimizer and adopt an adversarial training scheme to facilitate knowledge transfer from the head to the tail. Such methods can also deal with the cold-start problem of new users. Moreover, it could be directly adaptive to various well-established sequential models. Extensive experiments on four real-world datasets verify the superiority of our framework compared with the state-of-the-art baselines.
LGDec 8, 2019
$\mathtt{MedGraph:}$ Structural and Temporal Representation Learning of Electronic Medical RecordsBhagya Hettige, Yuan-Fang Li, Weiqing Wang et al.
Electronic medical record (EMR) data contains historical sequences of visits of patients, and each visit contains rich information, such as patient demographics, hospital utilisation and medical codes, including diagnosis, procedure and medication codes. Most existing EMR embedding methods capture visit-code associations by constructing input visit representations as binary vectors with a static vocabulary of medical codes. With this limited representation, they fail in encapsulating rich attribute information of visits (demographics and utilisation information) and/or codes (e.g., medical code descriptions). Furthermore, current work considers visits of the same patient as discrete-time events and ignores time gaps between them. However, the time gaps between visits depict dynamics of the patient's medical history inducing varying influences on future visits. To address these limitations, we present $\mathtt{MedGraph}$, a supervised EMR embedding method that captures two types of information: (1) the visit-code associations in an attributed bipartite graph, and (2) the temporal sequencing of visits through a point process. $\mathtt{MedGraph}$ produces Gaussian embeddings for visits and codes to model the uncertainty. We evaluate the performance of $\mathtt{MedGraph}$ through an extensive experimental study and show that $\mathtt{MedGraph}$ outperforms state-of-the-art EMR embedding methods in several medical risk prediction tasks.
LGDec 2, 2019
Gaussian Embedding of Large-scale Attributed GraphsBhagya Hettige, Yuan-Fang Li, Weiqing Wang et al.
Graph embedding methods transform high-dimensional and complex graph contents into low-dimensional representations. They are useful for a wide range of graph analysis tasks including link prediction, node classification, recommendation and visualization. Most existing approaches represent graph nodes as point vectors in a low-dimensional embedding space, ignoring the uncertainty present in the real-world graphs. Furthermore, many real-world graphs are large-scale and rich in content (e.g. node attributes). In this work, we propose GLACE, a novel, scalable graph embedding method that preserves both graph structure and node attributes effectively and efficiently in an end-to-end manner. GLACE effectively models uncertainty through Gaussian embeddings, and supports inductive inference of new nodes based on their attributes. In our comprehensive experiments, we evaluate GLACE on real-world graphs, and the results demonstrate that GLACE significantly outperforms state-of-the-art embedding methods on multiple graph analysis tasks.
IRMar 12, 2015
Geo-SAGE: A Geographical Sparse Additive Generative Model for Spatial Item RecommendationWeiqing Wang, Hongzhi Yin, Ling Chen et al.
With the rapid development of location-based social networks (LBSNs), spatial item recommendation has become an important means to help people discover attractive and interesting venues and events, especially when users travel out of town. However, this recommendation is very challenging compared to the traditional recommender systems. A user can visit only a limited number of spatial items, leading to a very sparse user-item matrix. Most of the items visited by a user are located within a short distance from where he/she lives, which makes it hard to recommend items when the user travels to a far away place. Moreover, user interests and behavior patterns may vary dramatically across different geographical regions. In light of this, we propose Geo-SAGE, a geographical sparse additive generative model for spatial item recommendation in this paper. Geo-SAGE considers both user personal interests and the preference of the crowd in the target region, by exploiting both the co-occurrence pattern of spatial items and the content of spatial items. To further alleviate the data sparsity issue, Geo-SAGE exploits the geographical correlation by smoothing the crowd's preferences over a well-designed spatial index structure called spatial pyramid. We conduct extensive experiments to evaluate the performance of our Geo-SAGE model on two real large-scale datasets. The experimental results clearly demonstrate our Geo-SAGE model outperforms the state-of-the-art in the two tasks of both out-of-town and home-town recommendations.