Yasuko Matsubara

h-index15

30papers

213citations

Novelty56%

AI Score62

Ranked #1,423 of 194,257 authors (top 1%)#389 in LG (top 1%)

30 Papers

16.8AIMay 10Code

EpiGraph: A Knowledge Graph and Benchmark for Evidence-Intensive Reasoning in Epilepsy

Yuyang Dai, Zheng Chen, Jathurshan Pradeepkumar et al.

Epilepsy diagnosis and treatment require evidence-intensive reasoning across heterogeneous clinical knowledge, including biosignal patterns, genetic mechanisms, pharmacogenomics, treatment strategies, and patient outcomes. In this work, we present \textsc{EpiGraph}, a large-scale epilepsy knowledge graph and benchmark for evaluating knowledge-augmented clinical reasoning. \textsc{EpiGraph} integrates 48,166 peer-reviewed papers and seven clinical resources into a heterogeneous graph containing 24,324 entities and 32,009 evidence-grounded triplets across five clinical layers. Built upon this graph, \textsc{EpiBench} defines five clinically motivated tasks spanning clinical decision-making, EEG report generation, pharmacogenomic precision medicine, treatment recommendation, and deep research planning. We evaluate six LLMs under both standard and Graph-RAG settings. Results show that integrating \textsc{EpiGraph} consistently improves performance across all tasks, with the largest gains observed in pharmacogenomic reasoning (+30--41\%). Our findings demonstrate that structured epilepsy knowledge substantially enhances evidence-grounded clinical reasoning and provides a practical benchmark framework for evaluating knowledge-augmented LLMs in real-world neurological settings. Our code is available at: https://github.com/LabRAI/EEG-KG.

1.4LGFeb 9Code

Interpretable Dynamic Network Modeling of Tensor Time Series via Kronecker Time-Varying Graphical Lasso

Shingo Higashiguchi, Koki Kawabata, Yasuko Matsubara et al.

With the rapid development of web services, large amounts of time series data are generated and accumulated across various domains such as finance, healthcare, and online platforms. As such data often co-evolves with multiple variables interacting with each other, estimating the time-varying dependencies between variables (i.e., the dynamic network structure) has become crucial for accurate modeling. However, real-world data is often represented as tensor time series with multiple modes, resulting in large, entangled networks that are hard to interpret and computationally intensive to estimate. In this paper, we propose Kronecker Time-Varying Graphical Lasso (KTVGL), a method designed for modeling tensor time series. Our approach estimates mode-specific dynamic networks in a Kronecker product form, thereby avoiding overly complex entangled structures and producing interpretable modeling results. Moreover, the partitioned network structure prevents the exponential growth of computational time with data dimension. In addition, our method can be extended to stream algorithms, making the computational time independent of the sequence length. Experiments on synthetic data show that the proposed method achieves higher edge estimation accuracy than existing methods while requiring less computation time. To further demonstrate its practical value, we also present a case study using real-world data. Our source code and datasets are available at https://github.com/Higashiguchi-Shingo/KTVGL.

2.3AIAug 18, 2024Code

BernGraph: Probabilistic Graph Neural Networks for EHR-based Medication Recommendations

Xihao Piao, Pei Gao, Zheng Chen et al.

The medical community believes binary medical event outcomes in EHR data contain sufficient information for making a sensible recommendation. However, there are two challenges to effectively utilizing such data: (1) modeling the relationship between massive 0,1 event outcomes is difficult, even with expert knowledge; (2) in practice, learning can be stalled by the binary values since the equally important 0 entries propagate no learning signals. Currently, there is a large gap between the assumed sufficient information and the reality that no promising results have been shown by utilizing solely the binary data: visiting or secondary information is often necessary to reach acceptable performance. In this paper, we attempt to build the first successful binary EHR data-oriented drug recommendation system by tackling the two difficulties, making sensible drug recommendations solely using the binary EHR medical records. To this end, we take a statistical perspective to view the EHR data as a sample from its cohorts and transform them into continuous Bernoulli probabilities. The transformed entries not only model a deterministic binary event with a distribution but also allow reflecting \emph{event-event} relationship by conditional probability. A graph neural network is learned on top of the transformation. It captures event-event correlations while emphasizing \emph{event-to-patient} features. Extensive results demonstrate that the proposed method achieves state-of-the-art performance on large-scale databases, outperforming baseline methods that use secondary information by a large margin. The source code is available at \url{https://github.com/chenzRG/BEHRMecom}

10.7LGMar 7, 2023Code

Fast and Multi-aspect Mining of Complex Time-stamped Event Streams

Kota Nakamura, Yasuko Matsubara, Koki Kawabata et al.

Given a huge, online stream of time-evolving events with multiple attributes, such as online shopping logs: (item, price, brand, time), and local mobility activities: (pick-up and drop-off locations, time), how can we summarize large, dynamic high-order tensor streams? How can we see any hidden patterns, rules, and anomalies? Our answer is to focus on two types of patterns, i.e., ''regimes'' and ''components'', for which we present CubeScope, an efficient and effective method over high-order tensor streams. Specifically, it identifies any sudden discontinuity and recognizes distinct dynamical patterns, ''regimes'' (e.g., weekday/weekend/holiday patterns). In each regime, it also performs multi-way summarization for all attributes (e.g., item, price, brand, and time) and discovers hidden ''components'' representing latent groups (e.g., item/brand groups) and their relationship. Thanks to its concise but effective summarization, CubeScope can also detect the sudden appearance of anomalies and identify the types of anomalies that occur in practice. Our proposed method has the following properties: (a) Effective: it captures dynamical multi-aspect patterns, i.e., regimes and components, and statistically summarizes all the events; (b) General: it is practical for successful application to data compression, pattern discovery, and anomaly detection on various types of tensor streams; (c) Scalable: our algorithm does not depend on the length of the data stream and its dimensionality. Extensive experiments on real datasets demonstrate that CubeScope finds meaningful patterns and anomalies correctly, and consistently outperforms the state-of-the-art methods as regards accuracy and execution speed.

7.3GNSep 2, 2024Code

MLOmics: Cancer Multi-Omics Database for Machine Learning

Ziwei Yang, Rikuto Kotoge, Xihao Piao et al.

Framing the investigation of diverse cancers as a machine learning problem has recently shown significant potential in multi-omics analysis and cancer research. Empowering these successful machine learning models are the high-quality training datasets with sufficient data volume and adequate preprocessing. However, while there exist several public data portals, including The Cancer Genome Atlas (TCGA) multi-omics initiative or open-bases such as the LinkedOmics, these databases are not off-the-shelf for existing machine learning models. In this paper, we introduce MLOmics, an open cancer multi-omics database aiming at serving better the development and evaluation of bioinformatics and machine learning models. MLOmics contains 8,314 patient samples covering all 32 cancer types with four omics types, stratified features, and extensive baselines. Complementary support for downstream analysis and bio-knowledge linking are also included to support interdisciplinary analysis.

9.2LGSep 16, 2024Code

Mining of Switching Sparse Networks for Missing Value Imputation in Multivariate Time Series

Kohei Obata, Koki Kawabata, Yasuko Matsubara et al.

Multivariate time series data suffer from the problem of missing values, which hinders the application of many analytical methods. To achieve the accurate imputation of these missing values, exploiting inter-correlation by employing the relationships between sequences (i.e., a network) is as important as the use of temporal dependency, since a sequence normally correlates with other sequences. Moreover, exploiting an adequate network depending on time is also necessary since the network varies over time. However, in real-world scenarios, we normally know neither the network structure nor when the network changes beforehand. Here, we propose a missing value imputation method for multivariate time series, namely MissNet, that is designed to exploit temporal dependency with a state-space model and inter-correlation by switching sparse networks. The network encodes conditional independence between features, which helps us understand the important relationships for imputation visually. Our algorithm, which scales linearly with reference to the length of the data, alternatively infers networks and fills in missing values using the networks while discovering the switching of the networks. Extensive experiments demonstrate that MissNet outperforms the state-of-the-art algorithms for multivariate time series imputation and provides interpretable results.

4.3SIAug 7, 2022

Mining Reaction and Diffusion Dynamics in Social Activities

Taichi Murayama, Yasuko Matsubara, Yasushi Sakurai

Large quantifies of online user activity data, such as weekly web search volumes, which co-evolve with the mutual influence of several queries and locations, serve as an important social sensor. It is an important task to accurately forecast the future activity by discovering latent interactions from such data, i.e., the ecosystems between each query and the flow of influences between each area. However, this is a difficult problem in terms of data quantity and complex patterns covering the dynamics. To tackle the problem, we propose FluxCube, which is an effective mining method that forecasts large collections of co-evolving online user activity and provides good interpretability. Our model is the expansion of a combination of two mathematical models: a reaction-diffusion system provides a framework for modeling the flow of influences between local area groups and an ecological system models the latent interactions between each query. Also, by leveraging the concept of physics-informed neural networks, FluxCube achieves high interpretability obtained from the parameters and high forecasting performance, together. Extensive experiments on real datasets showed that FluxCube outperforms comparable models in terms of the forecasting accuracy, and each component in FluxCube contributes to the enhanced performance. We then show some case studies that FluxCube can extract useful latent interactions between queries and area groups.

2.3GNAug 17, 2023

MoCLIM: Towards Accurate Cancer Subtyping via Multi-Omics Contrastive Learning with Omics-Inference Modeling

Ziwei Yang, Zheng Chen, Yasuko Matsubara et al.

Precision medicine fundamentally aims to establish causality between dysregulated biochemical mechanisms and cancer subtypes. Omics-based cancer subtyping has emerged as a revolutionary approach, as different level of omics records the biochemical products of multistep processes in cancers. This paper focuses on fully exploiting the potential of multi-omics data to improve cancer subtyping outcomes, and hence developed MoCLIM, a representation learning framework. MoCLIM independently extracts the informative features from distinct omics modalities. Using a unified representation informed by contrastive learning of different omics modalities, we can well-cluster the subtypes, given cancer, into a lower latent space. This contrast can be interpreted as a projection of inter-omics inference observed in biological networks. Experimental results on six cancer datasets demonstrate that our approach significantly improves data fit and subtyping performance in fewer high-dimensional cancer instances. Moreover, our framework incorporates various medical evaluations as the final component, providing high interpretability in medical analysis.

7.8LGJun 3

AdaKoop: Efficient Modeling of Nonlinear Dynamics from Nonstationary Data Streams with Koopman Operator Regression

Naoki Chihara, Ren Fujiwara, Yasuko Matsubara et al.

Real-time data analysis requires the ability to accurately and adaptively address nonlinear dynamics in a nonstationary data stream while preserving computational efficiency. However, nonlinear dynamics are so complex that capturing dynamically changing nonlinear patterns and utilizing them for downstream tasks under strict time constraints is nontrivial. To bridge the gap between nonlinear complexity and computational tractability, this study applies Koopman operator theory, which states that nonlinear dynamics can be represented as linear transitions in an infinite-dimensional space. Building upon finite-dimensional approximations of this operator, we present AdaKoop, an efficient streaming algorithm for modeling nonlinear dynamics over nonstationary data streams. Our approach utilizes a probabilistic framework grounded in Koopman operator theory, treating both raw observations and reproducing kernel Hilbert space (RKHS) features as emissions from latent vectors. This dual-view formulation allows nonlinear dynamics to be expressed as a tractable linear system. Therefore, AdaKoop enables the efficient and stable modeling of nonlinear dynamics in a streaming fashion, avoiding the prohibitive computational costs of iterative nonlinear optimization. Furthermore, to address nonstationarity in data streams, AdaKoop adaptively detects the switching of patterns via statistical hypothesis testing for abrupt pattern shifts and incrementally updates model parameters to handle continuous changes. Extensive experiments on a total of 71 practical benchmark datasets across various domains demonstrate that AdaKoop outperforms state-of-the-art methods in terms of real-time forecasting accuracy and computational efficiency.

6.3LGMay 25

Modeling Dynamic Mixtures of Time-Delay Systems from Streaming Time Series

Ren Fujiwara, Yasuko Matsubara, Yasushi Sakurai

This research addresses the problem of adaptive modeling in time-series data streams with clear input-output relationships. This problem is challenging because rapid system changes (regime shifts) caused by environmental factors or input delay changes degrade model performance, and the trade-off among accuracy, robustness, and memory usage arises when using multiple small models for each time-series pattern. To address these issues, this paper presents an online framework/method that treats streaming time series as dynamic mixtures of time-delay systems. This framework maintains robustness of model tracking and reduces memory usage by summarizing past regimes using a fixed-length representation that captures both the system dynamics and input-output delays. Concretely, this approach constructs a summary system tensor using the system's Markov parameter series, capturing both dynamic behavior and delay characteristics. If necessary, a tensor decomposition algorithm extracts relevant past models from the tensor and helps select the system that best fits the current regime. This method enables rapid adaptation to environmental changes and is computationally efficient. Tests on real datasets show that DelayMix consistently outperforms other methods, achieving superior forecast accuracy and faster adaptation to delays, especially for highly non-stationary data.

6.0AIFeb 26

ODEBrain: Continuous-Time EEG Graph for Modeling Dynamic Brain Networks

Haohui Jia, Zheng Chen, Lingwei Zhu et al.

Modeling neural population dynamics is crucial for foundational neuroscientific research and various clinical applications. Conventional latent variable methods typically model continuous brain dynamics through discretizing time with recurrent architecture, which necessarily results in compounded cumulative prediction errors and failure of capturing instantaneous, nonlinear characteristics of EEGs. We propose ODEBRAIN, a Neural ODE latent dynamic forecasting framework to overcome these challenges by integrating spatio-temporal-frequency features into spectral graph nodes, followed by a Neural ODE modeling the continuous latent dynamics. Our design ensures that latent representations can capture stochastic variations of complex brain states at any given time point. Extensive experiments verify that ODEBRAIN can improve significantly over existing methods in forecasting EEG dynamics with enhanced robustness and generalization capabilities.

6.5LGMay 20

When to Retrain after Drift: A Data-Only Test of Post-Drift Data Size Sufficiency

Ren Fujiwara, Yasuko Matsubara, Yasushi Sakurai

Sudden concept drift makes previously trained predictors unreliable, yet deciding when to retrain and what post-drift data size is sufficient is rarely addressed. We propose CALIPER - a detector- and model-agnostic, data-only test that estimates the post-drift data size required for stable retraining. CALIPER exploits state dependence in streams generated by dynamical systems: we run a single-pass weighted local regression over the post-drift window and track a one-step proxy error as a function of a locality parameter $θ$. When an effective sample size gate is satisfied, a monotonically non-increasing trend in this error with increasing a locality parameter indicates that the data size is sufficiently informative for retraining. We also provide a theoretical analysis of our method, and we show that the algorithm has a low per-update time and memory. Across datasets from four heterogeneous domains, three learner families, and two detectors, CALIPER consistently matches or exceeds the best fixed data size for retraining while incurring negligible overhead and often outperforming incremental updates. CALIPER closes the gap between drift detection and data-sufficient adaptation in streaming learning.

2.7LGJan 8

Fast Mining and Dynamic Time-to-Event Prediction over Multi-sensor Data Streams

Kota Nakamura, Koki Kawabata, Yasuko Matsubara et al.

Given real-time sensor data streams obtained from machines, how can we continuously predict when a machine failure will occur? This work aims to continuously forecast the timing of future events by analyzing multi-sensor data streams. A key characteristic of real-world data streams is their dynamic nature, where the underlying patterns evolve over time. To address this, we present TimeCast, a dynamic prediction framework designed to adapt to these changes and provide accurate, real-time predictions of future event time. Our proposed method has the following properties: (a) Dynamic: it identifies the distinct time-evolving patterns (i.e., stages) and learns individual models for each, enabling us to make adaptive predictions based on pattern shifts. (b) Practical: it finds meaningful stages that capture time-varying interdependencies between multiple sensors and improve prediction performance; (c) Scalable: our algorithm scales linearly with the input size and enables online model updates on data streams. Extensive experiments on real datasets demonstrate that TimeCast provides higher prediction accuracy than state-of-the-art methods while finding dynamic changes in data streams with a great reduction in computational time.

1.4LGFeb 19

TIFO: Time-Invariant Frequency Operator for Stationarity-Aware Representation Learning in Time Series

Xihao Piao, Zheng Chen, Lingwei Zhu et al.

Nonstationary time series forecasting suffers from the distribution shift issue due to the different distributions that produce the training and test data. Existing methods attempt to alleviate the dependence by, e.g., removing low-order moments from each individual sample. These solutions fail to capture the underlying time-evolving structure across samples and do not model the complex time structure. In this paper, we aim to address the distribution shift in the frequency space by considering all possible time structures. To this end, we propose a Time-Invariant Frequency Operator (TIFO), which learns stationarity-aware weights over the frequency spectrum across the entire dataset. The weight representation highlights stationary frequency components while suppressing non-stationary ones, thereby mitigating the distribution shift issue in time series. To justify our method, we show that the Fourier transform of time series data implicitly induces eigen-decomposition in the frequency space. TIFO is a plug-and-play approach that can be seamlessly integrated into various forecasting models. Experiments demonstrate our method achieves 18 top-1 and 6 top-2 results out of 28 forecasting settings. Notably, it yields 33.3% and 55.3% improvements in average MSE on the ETTm2 dataset. In addition, TIFO reduces computational costs by 60% -70% compared to baseline methods, demonstrating strong scalability across diverse forecasting models.

2.4AIFeb 26

RepSPD: Enhancing SPD Manifold Representation in EEGs via Dynamic Graphs

Haohui Jia, Zheng Chen, Lingwei Zhu et al.

Decoding brain activity from electroencephalography (EEG) is crucial for neuroscience and clinical applications. Among recent advances in deep learning for EEG, geometric learning stands out as its theoretical underpinnings on symmetric positive definite (SPD) allows revealing structural connectivity analysis in a physics-grounded manner. However, current SPD-based methods focus predominantly on statistical aggregation of EEGs, with frequency-specific synchronization and local topological structures of brain regions neglected. Given this, we propose RepSPD, a novel geometric deep learning (GDL)-based model. RepSPD implements a cross-attention mechanism on the Riemannian manifold to modulate the geometric attributes of SPD with graph-derived functional connectivity features. On top of this, we introduce a global bidirectional alignment strategy to reshape tangent-space embeddings, mitigating geometric distortions caused by curvature and thereby enhancing geometric consistency. Extensive experiments demonstrate that our proposed framework significantly outperforms existing EEG representation methods, exhibiting superior robustness and generalization capabilities.

1.4LGFeb 27Code

Disentangled Mode-Specific Representations for Tensor Time Series via Contrastive Learning

Kohei Obata, Taichi Murayama, Zheng Chen et al.

Multi-mode tensor time series (TTS) can be found in many domains, such as search engines and environmental monitoring systems. Learning representations of a TTS benefits various applications, but it is also challenging since the complexities inherent in the tensor hinder the realization of rich representations. In this paper, we propose a novel representation learning method designed specifically for TTS, namely MoST. Specifically, MoST uses a tensor slicing approach to reduce the complexity of the TTS structure and learns representations that can be disentangled into individual non-temporal modes. Each representation captures mode-specific features, which are the relationship between variables within the same mode, and mode-invariant features, which are in common in representations of different modes. We employ a contrastive learning framework to learn parameters; the loss function comprises two parts intended to learn representation in a mode-specific way and mode-invariant way, effectively exploiting disentangled representations as augmentations. Extensive experiments on real-world datasets show that MoST consistently outperforms the state-of-the-art methods in terms of classification and forecasting accuracy. Code is available at https://github.com/KoheiObata/MoST.

4.3SIMay 1, 2025Code

D-Tracker: Modeling Interest Diffusion in Social Activity Tensor Data Streams

Shingo Higashiguchi, Yasuko Matsubara, Koki Kawabata et al.

Large quantities of social activity data, such as weekly web search volumes and the number of new infections with infectious diseases, reflect peoples' interests and activities. It is important to discover temporal patterns from such data and to forecast future activities accurately. However, modeling and forecasting social activity data streams is difficult because they are high-dimensional and composed of multiple time-varying dynamics such as trends, seasonality, and interest diffusion. In this paper, we propose D-Tracker, a method for continuously capturing time-varying temporal patterns within social activity tensor data streams and forecasting future activities. Our proposed method has the following properties: (a) Interpretable: it incorporates the partial differential equation into a tensor decomposition framework and captures time-varying temporal patterns such as trends, seasonality, and interest diffusion between locations in an interpretable manner; (b) Automatic: it has no hyperparameters and continuously models tensor data streams fully automatically; (c) Scalable: the computation time of D-Tracker is independent of the time series length. Experiments using web search volume data obtained from GoogleTrends, and COVID-19 infection data obtained from COVID-19 Open Data Repository show that our method can achieve higher forecasting accuracy in less computation time than existing methods while extracting the interest diffusion between locations. Our source code and datasets are available at {https://github.com/Higashiguchi-Shingo/D-Tracker.

25.1LGJun 13, 2024Code

Fredformer: Frequency Debiased Transformer for Time Series Forecasting

Xihao Piao, Zheng Chen, Taichi Murayama et al.

The Transformer model has shown leading performance in time series forecasting. Nevertheless, in some complex scenarios, it tends to learn low-frequency features in the data and overlook high-frequency features, showing a frequency bias. This bias prevents the model from accurately capturing important high-frequency data features. In this paper, we undertook empirical analyses to understand this bias and discovered that frequency bias results from the model disproportionately focusing on frequency features with higher energy. Based on our analysis, we formulate this bias and propose Fredformer, a Transformer-based framework designed to mitigate frequency bias by learning features equally across different frequency bands. This approach prevents the model from overlooking lower amplitude features important for accurate forecasting. Extensive experiments show the effectiveness of our proposed approach, which can outperform other baselines in different real-world time-series datasets. Furthermore, we introduce a lightweight variant of the Fredformer with an attention matrix approximation, which achieves comparable performance but with much fewer parameters and lower computation costs. The code is available at: https://github.com/chenzRG/Fredformer

14.2LGFeb 19, 2024Code

Dynamic Multi-Network Mining of Tensor Time Series

Kohei Obata, Koki Kawabata, Yasuko Matsubara et al.

Subsequence clustering of time series is an essential task in data mining, and interpreting the resulting clusters is also crucial since we generally do not have prior knowledge of the data. Thus, given a large collection of tensor time series consisting of multiple modes, including timestamps, how can we achieve subsequence clustering for tensor time series and provide interpretable insights? In this paper, we propose a new method, Dynamic Multi-network Mining (DMM), that converts a tensor time series into a set of segment groups of various lengths (i.e., clusters) characterized by a dependency network constrained with l1-norm. Our method has the following properties. (a) Interpretable: it characterizes the cluster with multiple networks, each of which is a sparse dependency network of a corresponding non-temporal mode, and thus provides visible and interpretable insights into the key relationships. (b) Accurate: it discovers the clusters with distinct networks from tensor time series according to the minimum description length (MDL). (c) Scalable: it scales linearly in terms of the input data size when solving a non-convex problem to optimize the number of segments and clusters, and thus it is applicable to long-range and high-dimensional tensors. Extensive experiments with synthetic datasets confirm that our method outperforms the state-of-the-art methods in terms of clustering accuracy. We then use real datasets to demonstrate that DMM is useful for providing interpretable insights from tensor time series.

12.5LGDec 20, 2024

Long-Term EEG Partitioning for Seizure Onset Detection

Zheng Chen, Yasuko Matsubara, Yasushi Sakurai et al.

Deep learning models have recently shown great success in classifying epileptic patients using EEG recordings. Unfortunately, classification-based methods lack a sound mechanism to detect the onset of seizure events. In this work, we propose a two-stage framework, SODor, that explicitly models seizure onset through a novel task formulation of subsequence clustering. Given an EEG sequence, the framework first learns a set of second-level embeddings with label supervision. It then employs model-based clustering to explicitly capture long-term temporal dependencies in EEG sequences and identify meaningful subsequences. Epochs within a subsequence share a common cluster assignment (normal or seizure), with cluster or state transitions representing successful onset detections. Extensive experiments on three datasets demonstrate that our method can correct misclassifications, achieving 5\%-11\% classification improvements over other baselines and accurately detecting seizure onsets.

14.4LGFeb 13, 2025

Modeling Time-evolving Causality over Data Streams

Naoki Chihara, Yasuko Matsubara, Ren Fujiwara et al.

Given an extensive, semi-infinite collection of multivariate coevolving data sequences (e.g., sensor/web activity streams) whose observations influence each other, how can we discover the time-changing cause-and-effect relationships in co-evolving data streams? How efficiently can we reveal dynamical patterns that allow us to forecast future values? In this paper, we present a novel streaming method, ModePlait, which is designed for modeling such causal relationships (i.e., time-evolving causality) in multivariate co-evolving data streams and forecasting their future values. The solution relies on characteristics of the causal relationships that evolve over time in accordance with the dynamic changes of exogenous variables. ModePlait has the following properties: (a) Effective: it discovers the time-evolving causality in multivariate co-evolving data streams by detecting the transitions of distinct dynamical patterns adaptively. (b) Accurate: it enables both the discovery of time-evolving causality and the forecasting of future values in a streaming fashion. (c) Scalable: our algorithm does not depend on data stream length and thus is applicable to very large sequences. Extensive experiments on both synthetic and real-world datasets demonstrate that our proposed model outperforms state-of-the-art methods in terms of discovering the time-evolving causality as well as forecasting.

9.2LGOct 15, 2024

SplitSEE: A Splittable Self-supervised Framework for Single-Channel EEG Representation Learning

Rikuto Kotoge, Zheng Chen, Tasuku Kimura et al.

While end-to-end multi-channel electroencephalography (EEG) learning approaches have shown significant promise, their applicability is often constrained in neurological diagnostics, such as intracranial EEG resources. When provided with a single-channel EEG, how can we learn representations that are robust to multi-channels and scalable across varied tasks, such as seizure prediction? In this paper, we present SplitSEE, a structurally splittable framework designed for effective temporal-frequency representation learning in single-channel EEG. The key concept of SplitSEE is a self-supervised framework incorporating a deep clustering task. Given an EEG, we argue that the time and frequency domains are two distinct perspectives, and hence, learned representations should share the same cluster assignment. To this end, we first propose two domain-specific modules that independently learn domain-specific representation and address the temporal-frequency tradeoff issue in conventional spectrogram-based methods. Then, we introduce a novel clustering loss to measure the information similarity. This encourages representations from both domains to coherently describe the same input by assigning them a consistent cluster. SplitSEE leverages a pre-training-to-fine-tuning framework within a splittable architecture and has following properties: (a) Effectiveness: it learns representations solely from single-channel EEG but has even outperformed multi-channel baselines. (b) Robustness: it shows the capacity to adapt across different channels with low performance variance. Superior performance is also achieved with our collected clinical dataset. (c) Scalability: With just one fine-tuning epoch, SplitSEE achieves high and stable performance using partial model layers.

13.0LGMay 27, 2025

Robust and Explainable Detector of Time Series Anomaly via Augmenting Multiclass Pseudo-Anomalies

Kohei Obata, Yasuko Matsubara, Yasushi Sakurai

Unsupervised anomaly detection in time series has been a pivotal research area for decades. Current mainstream approaches focus on learning normality, on the assumption that all or most of the samples in the training set are normal. However, anomalies in the training set (i.e., anomaly contamination) can be misleading. Recent studies employ data augmentation to generate pseudo-anomalies and learn the boundary separating the training samples from the augmented samples. Although this approach mitigates anomaly contamination if augmented samples mimic unseen real anomalies, it suffers from several limitations. (1) Covering a wide range of time series anomalies is challenging. (2) It disregards augmented samples that resemble normal samples (i.e., false anomalies). (3) It places too much trust in the labels of training and augmented samples. In response, we propose RedLamp, which employs diverse data augmentations to generate multiclass pseudo-anomalies and learns the multiclass boundary. Such multiclass pseudo-anomalies cover a wide variety of time series anomalies. We conduct multiclass classification using soft labels, which prevents the model from being overconfident and ensures its robustness against contaminated/false anomalies. The learned latent space is inherently explainable as it is trained to separate pseudo-anomalies into multiclasses. Extensive experiments demonstrate the effectiveness of RedLamp in anomaly detection and its robustness against anomaly contamination.

7.9LGDec 11, 2024Code

Modeling Latent Non-Linear Dynamical System over Time Series

Ren Fujiwara, Yasuko Matsubara, Yasushi Sakurai

We study the problem of modeling a non-linear dynamical system when given a time series by deriving equations directly from the data. Despite the fact that time series data are given as input, models for dynamics and estimation algorithms that incorporate long-term temporal dependencies are largely absent from existing studies. In this paper, we introduce a latent state to allow time-dependent modeling and formulate this problem as a dynamics estimation problem in latent states. We face multiple technical challenges, including (1) modeling latent non-linear dynamics and (2) solving circular dependencies caused by the presence of latent states. To tackle these challenging problems, we propose a new method, Latent Non-Linear equation modeling (LaNoLem), that can model a latent non-linear dynamical system and a novel alternating minimization algorithm for effectively estimating latent states and model parameters. In addition, we introduce criteria to control model complexity without human intervention. Compared with the state-of-the-art model, LaNoLem achieves competitive performance for estimating dynamics while outperforming other methods in prediction.

4.6LGOct 17, 2024

GeSubNet: Gene Interaction Inference for Disease Subtype Network Generation

Ziwei Yang, Zheng Chen, Xin Liu et al.

Retrieving gene functional networks from knowledge databases presents a challenge due to the mismatch between disease networks and subtype-specific variations. Current solutions, including statistical and deep learning methods, often fail to effectively integrate gene interaction knowledge from databases or explicitly learn subtype-specific interactions. To address this mismatch, we propose GeSubNet, which learns a unified representation capable of predicting gene interactions while distinguishing between different disease subtypes. Graphs generated by such representations can be considered subtype-specific networks. GeSubNet is a multi-step representation learning framework with three modules: First, a deep generative model learns distinct disease subtypes from patient gene expression profiles. Second, a graph neural network captures representations of prior gene networks from knowledge databases, ensuring accurate physical gene interactions. Finally, we integrate these two representations using an inference loss that leverages graph generation capabilities, conditioned on the patient separation loss, to refine subtype-specific information in the learned representation. GeSubNet consistently outperforms traditional methods, with average improvements of 30.6%, 21.0%, 20.1%, and 56.6% across four graph evaluation metrics, averaged over four cancer datasets. Particularly, we conduct a biological simulation experiment to assess how the behavior of selected genes from over 11,000 candidates affects subtypes or patient distributions. The results show that the generated network has the potential to identify subtype-specific genes with an 83% likelihood of impacting patient distribution shifts.

1.4LGFeb 4

Multi-Aspect Mining and Anomaly Detection for Heterogeneous Tensor Streams

Soshi Kakio, Yasuko Matsubara, Ren Fujiwara et al.

Analysis and anomaly detection in event tensor streams consisting of timestamps and multiple attributes - such as communication logs(time, IP address, packet length)- are essential tasks in data mining. While existing tensor decomposition and anomaly detection methods provide useful insights, they face the following two limitations. (i) They cannot handle heterogeneous tensor streams, which comprises both categorical attributes(e.g., IP address) and continuous attributes(e.g., packet length). They typically require either discretizing continuous attributes or treating categorical attributes as continuous, both of which distort the underlying statistical properties of the data.Furthermore, incorrect assumptions about the distribution family of continuous attributes often degrade the model's performance. (ii) They discretize timestamps, failing to track the temporal dynamics of streams(e.g., trends, abnormal events), which makes them ineffective for detecting anomalies at the group level, referred to as 'group anomalies' (e.g, DoS attacks). To address these challenges, we propose HeteroComp, a method for continuously summarizing heterogeneous tensor streams into 'components' representing latent groups in each attribute and their temporal dynamics, and detecting group anomalies. Our method employs Gaussian process priors to model unknown distributions of continuous attributes, and temporal dynamics, which directly estimate probability densities from data. Extracted components give concise but effective summarization, enabling accurate group anomaly detection. Extensive experiments on real datasets demonstrate that HeteroComp outperforms the state-of-the-art algorithms for group anomaly detection accuracy, and its computational time does not depend on the data stream length.

1.4LGFeb 27

Selective Denoising Diffusion Model for Time Series Anomaly Detection

Kohei Obata, Zheng Chen, Yasuko Matsubara et al.

Time series anomaly detection (TSAD) has been an important area of research for decades, with reconstruction-based methods, mostly based on generative models, gaining popularity and demonstrating success. Diffusion models have recently attracted attention due to their advanced generative capabilities. Existing diffusion-based methods for TSAD rely on a conditional strategy, which reconstructs input instances from white noise with the aid of the conditioner. However, this poses challenges in accurately reconstructing the normal parts, resulting in suboptimal detection performance. In response, we propose a novel diffusion-based method, named AnomalyFilter, which acts as a selective filter that only denoises anomaly parts in the instance while retaining normal parts. To build such a filter, we mask Gaussian noise during the training phase and conduct the denoising process without adding noise to the instances. The synergy of the two simple components greatly enhances the performance of naive diffusion models. Extensive experiments on five datasets demonstrate that AnomalyFilter achieves notably low reconstruction error on normal parts, providing empirical support for its effectiveness in anomaly detection. AnomalyFilter represents a pioneering approach that focuses on the noise design of diffusion models specifically tailored for TSAD.

16.9LGSep 19, 2025

EvoBrain: Dynamic Multi-Channel EEG Graph Modeling for Time-Evolving Brain Networks

Rikuto Kotoge, Zheng Chen, Tasuku Kimura et al.

Dynamic GNNs, which integrate temporal and spatial features in Electroencephalography (EEG) data, have shown great potential in automating seizure detection. However, fully capturing the underlying dynamics necessary to represent brain states, such as seizure and non-seizure, remains a non-trivial task and presents two fundamental challenges. First, most existing dynamic GNN methods are built on temporally fixed static graphs, which fail to reflect the evolving nature of brain connectivity during seizure progression. Second, current efforts to jointly model temporal signals and graph structures and, more importantly, their interactions remain nascent, often resulting in inconsistent performance. To address these challenges, we present the first theoretical analysis of these two problems, demonstrating the effectiveness and necessity of explicit dynamic modeling and time-then-graph dynamic GNN method. Building on these insights, we propose EvoBrain, a novel seizure detection model that integrates a two-stream Mamba architecture with a GCN enhanced by Laplacian Positional Encoding, following neurological insights. Moreover, EvoBrain incorporates explicitly dynamic graph structures, allowing both nodes and edges to evolve over time. Our contributions include (a) a theoretical analysis proving the expressivity advantage of explicit dynamic modeling and time-then-graph over other approaches, (b) a novel and efficient model that significantly improves AUROC by 23% and F1 score by 30%, compared with the dynamic GNN baseline, and (c) broad evaluations of our method on the challenging early seizure prediction tasks.

9.4LGMar 2, 2025

CyberCScope: Mining Skewed Tensor Streams and Online Anomaly Detection in Cybersecurity Systems

Kota Nakamura, Koki Kawabata, Shungo Tanaka et al.

Cybersecurity systems are continuously producing a huge number of time-stamped events in the form of high-order tensors, such as {count; time, port, flow duration, packet size, . . . }, and so how can we detect anomalies/intrusions in real time? How can we identify multiple types of intrusions and capture their characteristic behaviors? The tensor data consists of categorical and continuous attributes and the data distributions of continuous attributes typically exhibit skew. These data properties require handling skewed infinite and finite dimensional spaces simultaneously. In this paper, we propose a novel streaming method, namely CyberCScope. The method effectively decomposes incoming tensors into major trends while explicitly distinguishing between categorical and skewed continuous attributes. To our knowledge, it is the first to compute hybrid skewed infinite and finite dimensional decomposition. Based on this decomposition, it streamingly finds distinct time-evolving patterns, enabling the detection of multiple types of anomalies. Extensive experiments on large-scale real datasets demonstrate that CyberCScope detects various intrusions with higher accuracy than state-of-the-art baselines while providing meaningful summaries for the intrusions that occur in practice.

4.1LGFeb 25, 2025

ExPath: Targeted Pathway Inference for Biological Knowledge Bases via Graph Learning and Explanation

Rikuto Kotoge, Ziwei Yang, Zheng Chen et al.

Retrieving targeted pathways in biological knowledge bases, particularly when incorporating wet-lab experimental data, remains a challenging task and often requires downstream analyses and specialized expertise. In this paper, we frame this challenge as a solvable graph learning and explaining task and propose a novel subgraph inference framework, ExPAth, that explicitly integrates experimental data to classify various graphs (bio-networks) in biological databases. The links (representing pathways) that contribute more to classification can be considered as targeted pathways. Our framework can seamlessly integrate biological foundation models to encode the experimental molecular data. We propose ML-oriented biological evaluations and a new metric. The experiments involving 301 bio-networks evaluations demonstrate that pathways inferred by ExPath are biologically meaningful, achieving up to 4.5x higher Fidelity+ (necessity) and 14x lower Fidelity- (sufficiency) than explainer baselines, while preserving signaling chains up to 4x longer.