SYDec 21, 2018
Offline timed pattern matching under uncertaintyÉtienne André, Ichiro Hasuo, Masaki Waga
Given a log and a specification, timed pattern matching aims at exhibiting for which start and end dates a specification holds on that log. For example, "a given action is always followed by another action before a given deadline". This problem has strong connections with monitoring real-time systems. We address here timed pattern matching in presence of an uncertain specification, i.e., that may contain timing parameters (e.g., the deadline can be uncertain or unknown). That is, we want to know for which start and end dates, and for what values of the deadline, this property holds. Or what is the minimum or maximum deadline (together with the corresponding start and end dates) for which this property holds. We propose here a framework for timed pattern matching based on parametric timed model checking. In contrast to most parametric timed problems, the solution is effectively computable, and we perform experiments using IMITATOR to show the applicability of our approach.
LGJul 27, 2022
Dynamic Shielding for Reinforcement Learning in Black-Box EnvironmentsMasaki Waga, Ezequiel Castellano, Sasinee Pruekprasert et al.
It is challenging to use reinforcement learning (RL) in cyber-physical systems due to the lack of safety guarantees during learning. Although there have been various proposals to reduce undesired behaviors during learning, most of these techniques require prior system knowledge, and their applicability is limited. This paper aims to reduce undesired behaviors during learning without requiring any prior system knowledge. We propose dynamic shielding: an extension of a model-based safe RL technique called shielding using automata learning. The dynamic shielding technique constructs an approximate system model in parallel with RL using a variant of the RPNI algorithm and suppresses undesired explorations due to the shield constructed from the learned model. Through this combination, potentially unsafe actions can be foreseen before the agent experiences them. Experiments show that our dynamic shield significantly decreases the number of undesired events during training.
FLJul 13, 2024
Learning Weighted Finite Automata over the Max-Plus Semiring and its TerminationTakamasa Okudono, Masaki Waga, Taro Sekiyama et al.
Active learning of finite automata has been vigorously pursued for the purposes of analysis and explanation of black-box systems. In this paper, we study an L*-style learning algorithm for weighted automata over the max-plus semiring. The max-plus setting exposes a "consistency" issue in the previously studied semiring-generic extension of L*: we show that it can fail to maintain consistency of tables, and can thus make equivalence queries on obviously wrong hypothesis automata. We present a theoretical fix by a mathematically clean notion of column-closedness. We also present a nontrivial and reasonably broad class of weighted languages over the max-plus semiring in which our algorithm terminates.
SEJul 15, 2023
Probabilistic Black-Box Checking via Active MDP LearningJunya Shijubo, Masaki Waga, Kohei Suenaga
We introduce a novel methodology for testing stochastic black-box systems, frequently encountered in embedded systems. Our approach enhances the established black-box checking (BBC) technique to address stochastic behavior. Traditional BBC primarily involves iteratively identifying an input that breaches the system's specifications by executing the following three phases: the learning phase to construct an automaton approximating the black box's behavior, the synthesis phase to identify a candidate counterexample from the learned automaton, and the validation phase to validate the obtained candidate counterexample and the learned automaton against the original black-box system. Our method, ProbBBC, refines the conventional BBC approach by (1) employing an active Markov Decision Process (MDP) learning method during the learning phase, (2) incorporating probabilistic model checking in the synthesis phase, and (3) applying statistical hypothesis testing in the validation phase. ProbBBC uniquely integrates these techniques rather than merely substituting each method in the traditional BBC; for instance, the statistical hypothesis testing and the MDP learning procedure exchange information regarding the black-box system's observation with one another. The experiment results suggest that ProbBBC outperforms an existing method, especially for systems with limited observation.
CVOct 31, 2022
BOREx: Bayesian-Optimization--Based Refinement of Saliency Map for Image- and Video-Classification ModelsAtsushi Kikuchi, Kotaro Uchida, Masaki Waga et al.
Explaining a classification result produced by an image- and video-classification model is one of the important but challenging issues in computer vision. Many methods have been proposed for producing heat-map--based explanations for this purpose, including ones based on the white-box approach that uses the internal information of a model (e.g., LRP, Grad-CAM, and Grad-CAM++) and ones based on the black-box approach that does not use any internal information (e.g., LIME, SHAP, and RISE). We propose a new black-box method BOREx (Bayesian Optimization for Refinement of visual model Explanation) to refine a heat map produced by any method. Our observation is that a heat-map--based explanation can be seen as a prior for an explanation method based on Bayesian optimization. Based on this observation, BOREx conducts Gaussian process regression (GPR) to estimate the saliency of each pixel in a given image starting from the one produced by another explanation method. Our experiments statistically demonstrate that the refinement by BOREx improves low-quality heat maps for image- and video-classification results.
CLFeb 11
SoftMatcha 2: A Fast and Soft Pattern Matcher for Trillion-Scale CorporaMasataka Yoneda, Yusuke Matsushita, Go Kamoda et al.
We present an ultra-fast and flexible search algorithm that enables search over trillion-scale natural language corpora in under 0.3 seconds while handling semantic variations (substitution, insertion, and deletion). Our approach employs string matching based on suffix arrays that scales well with corpus size. To mitigate the combinatorial explosion induced by the semantic relaxation of queries, our method is built on two key algorithmic ideas: fast exact lookup enabled by a disk-aware design, and dynamic corpus-aware pruning. We theoretically show that the proposed method suppresses exponential growth in the search space with respect to query length by leveraging statistical properties of natural language. In experiments on FineWeb-Edu (Lozhkov et al., 2024) (1.4T tokens), we show that our method achieves significantly lower search latency than existing methods: infini-gram (Liu et al., 2024), infini-gram mini (Xu et al., 2025), and SoftMatcha (Deguchi et al., 2025). As a practical application, we demonstrate that our method identifies benchmark contamination in training corpora, unidentified by existing approaches. We also provide an online demo of fast, soft search across corpora in seven languages.
CLMar 5, 2025
SoftMatcha: A Soft and Fast Pattern Matcher for Billion-Scale Corpus SearchesHiroyuki Deguchi, Go Kamoda, Yusuke Matsushita et al.
Researchers and practitioners in natural language processing and computational linguistics frequently observe and analyze the real language usage in large-scale corpora. For that purpose, they often employ off-the-shelf pattern-matching tools, such as grep, and keyword-in-context concordancers, which is widely used in corpus linguistics for gathering examples. Nonetheless, these existing techniques rely on surface-level string matching, and thus they suffer from the major limitation of not being able to handle orthographic variations and paraphrasing -- notable and common phenomena in any natural language. In addition, existing continuous approaches such as dense vector search tend to be overly coarse, often retrieving texts that are unrelated but share similar topics. Given these challenges, we propose a novel algorithm that achieves \emph{soft} (or semantic) yet efficient pattern matching by relaxing a surface-level matching with word embeddings. Our algorithm is highly scalable with respect to the size of the corpus text utilizing inverted indexes. We have prepared an efficient implementation, and we provide an accessible web tool. Our experiments demonstrate that the proposed method (i) can execute searches on billion-scale corpora in less than a second, which is comparable in speed to surface-level string matching and dense vector search; (ii) can extract harmful instances that semantically match queries from a large set of English and Japanese Wikipedia articles; and (iii) can be effectively applied to corpus-linguistic analyses of Latin, a language with highly diverse inflections.
NEApr 11, 2020
Genetic Algorithm for the Weight Maximization Problem on Weighted AutomataElena Gutiérrez, Takamasa Okudono, Masaki Waga et al.
The weight maximization problem (WMP) is the problem of finding the word of highest weight on a weighted finite state automaton (WFA). It is an essential question that emerges in many optimization problems in automata theory. Unfortunately, the general problem can be shown to be undecidable, whereas its bounded decisional version is NP-complete. Designing efficient algorithms that produce approximate solutions to the WMP in reasonable time is an appealing research direction that can lead to several new applications including formal verification of systems abstracted as WFAs. In particular, in combination with a recent procedure that translates a recurrent neural network into a weighted automaton, an algorithm for the WMP can be used to analyze and verify the network by exploiting the simpler and more compact automata model. In this work, we propose, implement and evaluate a metaheuristic based on genetic algorithms to approximate solutions to the WMP. We experimentally evaluate its performance on examples from the literature and show its potential on different applications.
LGApr 5, 2019
Weighted Automata Extraction from Recurrent Neural Networks via Regression on State SpacesTakamasa Okudono, Masaki Waga, Taro Sekiyama et al.
We present a method to extract a weighted finite automaton (WFA) from a recurrent neural network (RNN). Our algorithm is based on the WFA learning algorithm by Balle and Mohri, which is in turn an extension of Angluin's classic \lstar algorithm. Our technical novelty is in the use of \emph{regression} methods for the so-called equivalence queries, thus exploiting the internal state space of an RNN to prioritize counterexample candidates. This way we achieve a quantitative/weighted extension of the recent work by Weiss, Goldberg and Yahav that extracts DFAs. We experimentally evaluate the accuracy, expressivity and efficiency of the extracted WFAs.
FLMay 11, 2019
Symbolic Monitoring against Specifications Parametric in Time and DataMasaki Waga, Étienne André, Ichiro Hasuo
Monitoring consists in deciding whether a log meets a given specification. In this work, we propose an automata-based formalism to monitor logs in the form of actions associated with time stamps and arbitrarily data values over infinite domains. Our formalism uses both timing parameters and data parameters, and is able to output answers symbolic in these parameters and in the log segments where the property is satisfied or violated. We implemented our approach in an ad-hoc prototype SyMon, and experiments show that its high expressive power still allows for efficient online monitoring.