Makoto Onizuka

h-index 18

18 papers

120 citations

Novelty 48%

AI Score 50

Ranked #21,619 of 194,257 authors (top 11%); #5,272 in LG (top 13%)

18 Papers

17.7 | LG | Jun 18, 2022 | Code

Beyond Real-world Benchmark Datasets: An Empirical Study of Node Classification with GNNs

Seiji Maekawa, Koki Noda, Yuya Sasaki et al.

Graph Neural Networks (GNNs) have achieved great success on a node classification task. Despite the broad interest in developing and evaluating GNNs, they have been assessed with limited benchmark datasets. As a result, the existing evaluation of GNNs lacks fine-grained analysis from various characteristics of graphs. Motivated by this, we conduct extensive experiments with a synthetic graph generator that can generate graphs having controlled characteristics for fine-grained analysis. Our empirical studies clarify the strengths and weaknesses of GNNs from four major characteristics of real-world graphs with class labels of nodes, i.e., 1) class size distributions (balanced vs. imbalanced), 2) edge connection proportions between classes (homophilic vs. heterophilic), 3) attribute values (biased vs. random), and 4) graph sizes (small vs. large). In addition, to foster future research on GNNs, we publicly release our codebase that allows users to evaluate various GNNs with various graphs. We hope this work offers interesting insights for future research.
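To make the homophilic vs. heterophilic axis concrete, one commonly used measure is the edge homophily ratio: the fraction of edges whose two endpoints carry the same class label. The sketch below is only an illustration and is not taken from the released codebase; the function name and the toy graph are assumptions.

```python
def edge_homophily(edges, labels):
    """Fraction of edges whose two endpoints share a class label."""
    if not edges:
        raise ValueError("graph has no edges")
    same = sum(1 for u, v in edges if labels[u] == labels[v])
    return same / len(edges)


# Hypothetical toy graph: 4 nodes, two classes, 4 undirected edges.
toy_edges = [(0, 1), (1, 2), (2, 3), (0, 3)]
toy_labels = {0: "a", 1: "a", 2: "b", 3: "b"}
ratio = edge_homophily(toy_edges, toy_labels)
```

Values near 1 indicate a homophilic graph and values near 0 a heterophilic one; here two of the four edges connect same-class nodes.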

10.4 | LG | Jun 27, 2022 | Code

An Empirical Study of Personalized Federated Learning

Koji Matsuda, Yuya Sasaki, Chuan Xiao et al.

Federated learning is a distributed machine learning approach in which a single server and multiple clients collaboratively build machine learning models without sharing datasets on clients. A challenging issue of federated learning is data heterogeneity (i.e., data distributions may differ across clients). To cope with this issue, numerous federated learning methods aim at personalized federated learning and build optimized models for clients. Whereas existing studies empirically evaluated their own methods, the experimental settings (e.g., comparison methods, datasets, and client settings) in these studies differ from each other, and it is unclear which personalized federated learning method achieves the best performance and how much progress can be made by using these methods instead of standard (i.e., non-personalized) federated learning. In this paper, we benchmark the performance of existing personalized federated learning methods through comprehensive experiments to evaluate the characteristics of each method. Our experimental study shows that (1) there are no champion methods, (2) large data heterogeneity often leads to highly accurate predictions, and (3) standard federated learning methods (e.g., FedAvg) with fine-tuning often outperform personalized federated learning methods. We open our benchmark tool FedBench for researchers to conduct experimental studies with various experimental settings.
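For readers unfamiliar with the FedAvg baseline mentioned above, its server-side step is usually described as an average of client parameters weighted by local dataset size. The following is a minimal sketch under that description, not code from FedBench; the function name and the flat-list parameter format are assumptions.

```python
def fedavg(client_weights, client_sizes):
    """Average client parameter vectors, weighted by local dataset size."""
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need at least one client and one size per client")
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]


# Hypothetical example: two clients holding 1 and 3 local samples.
global_model = fedavg([[1.0, 2.0], [3.0, 4.0]], [1, 3])
```

The fine-tuning variant referred to in finding (3) would continue training this global model on each client's own data; that step is not shown here.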

1.2 | DB | Jun 8, 2023 | Code

Learned spatial data partitioning

Keizo Hori, Yuya Sasaki, Daichi Amagata et al.

Due to the significant increase in the size of spatial data, it is essential to use distributed parallel processing systems to efficiently analyze spatial data. In this paper, we first study learned spatial data partitioning, which effectively assigns groups of big spatial data to computers based on locations of data by using machine learning techniques. We formalize spatial data partitioning in the context of reinforcement learning and develop a novel deep reinforcement learning algorithm. Our learning algorithm leverages features of spatial data partitioning and prunes ineffective learning processes to find optimal partitions efficiently. Our experimental study, which uses Apache Sedona and real-world spatial data, demonstrates that our method efficiently finds partitions for accelerating distance join queries and reduces the workload run time by up to 59.4%.

1.8 | LG | Jun 21, 2022 | Code

Predicting Parking Lot Availability by Graph-to-Sequence Model: A Case Study with SmartSantander

Yuya Sasaki, Junya Takayama, Juan Ramón Santana et al.

To improve services and the livability of urban areas, multiple smart city initiatives are being carried out throughout the world. SmartSantander is a smart city project in Santander, Spain, which has relied on wireless sensor network technologies to deploy heterogeneous sensors within the city to measure multiple parameters, including outdoor parking information. In this paper, we study the prediction of parking lot availability using historical data from more than 300 outdoor parking sensors with SmartSantander. We design a graph-to-sequence model to capture the periodical fluctuation and geographical proximity of parking lots. For developing and evaluating our model, we use a 3-year dataset of parking lot availability in the city of Santander. Our model achieves higher accuracy than existing sequence-to-sequence models and is accurate enough to provide a parking information service in the city. We apply our model to a smartphone application to be widely used by citizens and tourists.

5.4 | AI | Nov 11, 2023 | Code

BClean: A Bayesian Data Cleaning System

Jianbin Qin, Sifan Huang, Yaoshu Wang et al.

There is a considerable body of work on data cleaning which employs various principles to rectify erroneous data and transform a dirty dataset into a cleaner one. One prevalent approach is the use of probabilistic methods, including Bayesian methods. However, existing probabilistic methods often assume a simplistic distribution (e.g., Gaussian distribution), which frequently underfits in practice, or they require experts to provide a complex prior distribution (e.g., via a programming language). This requirement is both labor-intensive and costly, rendering these methods less suitable for real-world applications. In this paper, we propose BClean, a Bayesian Cleaning system that features automatic Bayesian network construction and user interaction. We recast the data cleaning problem as a Bayesian inference that fully exploits the relationships between attributes in the observed dataset and any prior information provided by users. To this end, we present an automatic Bayesian network construction method that extends a structure learning-based functional dependency discovery method with similarity functions to capture the relationships between attributes. Furthermore, our system allows users to modify the generated Bayesian network in order to specify prior information or correct inaccuracies identified by the automatic generation process. We also design an effective scoring model (called the compensative scoring model) necessary for the Bayesian inference. To enhance the efficiency of data cleaning, we propose several approximation strategies for the Bayesian inference, including graph partitioning, domain pruning, and pre-detection. By evaluating on both real-world and synthetic datasets, we demonstrate that BClean is capable of achieving an F-measure of up to 0.9 in data cleaning, outperforming existing Bayesian methods by 2% and other data cleaning methods by 15%.

4.6 | LG | Jul 25, 2022 | Code

GNN Transformation Framework for Improving Efficiency and Scalability

Seiji Maekawa, Yuya Sasaki, George Fletcher et al.

We propose a framework that automatically transforms non-scalable GNNs into precomputation-based GNNs which are efficient and scalable for large-scale graphs. The advantages of our framework are two-fold: 1) it transforms various non-scalable GNNs to scale well to large-scale graphs by separating local feature aggregation from weight learning in their graph convolution, and 2) it efficiently executes precomputation on GPU for large-scale graphs by decomposing their edges into small, disjoint, and balanced sets. Through extensive experiments with large-scale graphs, we demonstrate that the transformed GNNs train faster than existing GNNs while achieving accuracy competitive with state-of-the-art GNNs. Consequently, our transformation framework provides simple and efficient baselines for future research on scalable GNNs.
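The idea of separating feature aggregation from weight learning can be illustrated with a generic precomputation step: neighbor features are aggregated for a fixed number of hops once, before training, and a model is later trained on the stored results. The sketch below uses plain mean aggregation with self-loops as an assumed aggregator; it is not the paper's transformation procedure, and the GPU edge decomposition is not shown. All names are hypothetical.

```python
def aggregate_once(adj, feats):
    """One hop of mean aggregation over each node's neighbors plus itself."""
    out = []
    for node, neighbors in enumerate(adj):
        group = [node] + list(neighbors)
        dim = len(feats[node])
        out.append([sum(feats[j][d] for j in group) / len(group) for d in range(dim)])
    return out


def precompute_hops(adj, feats, num_hops):
    """Return features aggregated over 0..num_hops hops; no weights involved."""
    hops = [feats]
    for _ in range(num_hops):
        hops.append(aggregate_once(adj, hops[-1]))
    return hops


# Hypothetical toy graph: path 0 - 1 - 2 with one feature per node.
adj = [[1], [0, 2], [1]]
feats = [[0.0], [3.0], [6.0]]
hops = precompute_hops(adj, feats, 2)
```

Because the aggregation has no learnable parameters, it runs once per graph, and training afterwards only touches per-node feature rows, which is what makes mini-batching over nodes straightforward.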

4.6 | LG | Jul 6, 2022

Scaling Private Deep Learning with Low-Rank and Sparse Gradients

Ryuichi Ito, Seng Pei Liew, Tsubasa Takahashi et al.

Applying Differentially Private Stochastic Gradient Descent (DPSGD) to training modern, large-scale neural networks such as transformer-based models is a challenging task, as the magnitude of noise added to the gradients at each iteration scales with model dimension, hindering the learning capability significantly. We propose a unified framework, $\textsf{LSG}$, that fully exploits the low-rank and sparse structure of neural networks to reduce the dimension of gradient updates, and hence alleviate the negative impacts of DPSGD. The gradient updates are first approximated with a pair of low-rank matrices. Then, a novel strategy is utilized to sparsify the gradients, resulting in low-dimensional, less noisy updates that are yet capable of retaining the performance of neural networks. Empirical evaluation on natural language processing and computer vision tasks shows that our method outperforms other state-of-the-art baselines.
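To show where the dimension-dependent noise comes from, here is a minimal sketch of the standard DPSGD gradient step: each per-example gradient is clipped to a maximum L2 norm, the clipped gradients are summed, and independent Gaussian noise is added to every coordinate. It does not implement the low-rank or sparsification parts of the proposed framework, and the function and parameter names are assumptions.

```python
import math
import random


def clip(grad, max_norm):
    """Scale a gradient down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dpsgd_noisy_mean(per_example_grads, max_norm, noise_multiplier, rng):
    """Clip each gradient, sum, add per-coordinate Gaussian noise, average."""
    clipped = [clip(g, max_norm) for g in per_example_grads]
    dim = len(clipped[0])
    sigma = noise_multiplier * max_norm
    noisy_sum = [
        sum(g[i] for g in clipped) + rng.gauss(0.0, sigma) for i in range(dim)
    ]
    return [s / len(clipped) for s in noisy_sum]


# Hypothetical batch of two 2-dimensional per-example gradients.
grads = [[3.0, 4.0], [0.0, 0.5]]
noiseless = dpsgd_noisy_mean(grads, 1.0, 0.0, random.Random(0))
noisy = dpsgd_noisy_mean(grads, 1.0, 1.0, random.Random(0))
```

Since one noise draw is added per coordinate, the expected squared norm of the noise vector grows linearly with the number of coordinates, which is why reducing the dimension of the update helps.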

5.3 | LG | Jun 14, 2023 | Code

A Simple and Scalable Graph Neural Network for Large Directed Graphs

Seiji Maekawa, Yuya Sasaki, Makoto Onizuka

Node classification is one of the hottest tasks in graph analysis. Though existing studies have explored various node representations in directed and undirected graphs, they have overlooked the differences in their capabilities to capture the information of graphs. To tackle this limitation, we investigate various combinations of node representations (aggregated features vs. adjacency lists) and edge direction awareness within an input graph (directed vs. undirected). We conduct the first empirical study to benchmark the performance of various GNNs that use either combination of node representations and edge direction awareness. Our experiments demonstrate that no single combination stably achieves state-of-the-art results across datasets, which indicates that we need to select appropriate combinations depending on the dataset characteristics. In response, we propose a simple yet holistic classification method A2DUG which leverages all combinations of node representations in directed and undirected graphs. We demonstrate that A2DUG stably performs well on various datasets and improves the accuracy by up to 11.29 compared with the state-of-the-art methods. To spur the development of new methods, we publicly release our complete codebase under the MIT license.

1.2 | DB | Mar 31, 2023 | Code

Scardina: Scalable Join Cardinality Estimation by Multiple Density Estimators

Ryuichi Ito, Yuya Sasaki, Chuan Xiao et al.

In recent years, machine learning-based cardinality estimation methods have been replacing traditional methods. This change is expected to contribute to one of the most important applications of cardinality estimation, the query optimizer, to speed up query processing. However, none of the existing methods precisely estimate cardinalities when relational schemas consist of many tables with strong correlations between tables/attributes. This paper shows that multiple density estimators can be combined to effectively target the cardinality estimation of data with large and complex schemas having strong correlations. We propose Scardina, a new join cardinality estimation method using multiple partitioned models based on the schema structure.

8.7 | DB | Jun 2

Workload acceleration by optimizing materialized view selection using local search

Kaina Anderson, Yohanes Yohanie Fridelin Panduman, Yuya Sasaki et al.

The growing size of database workloads has made view selection a key performance challenge. Materializing frequent sub-queries in workloads improves query efficiency, but it incurs significant view maintenance costs due to updates. Although existing methods such as BIGSUBS address this trade-off between the benefit of using materialized views and the overhead of view maintenance, they have two drawbacks: insufficient maintenance cost modeling and ineffective view selection due to probabilistic techniques. We propose a novel view selection method that incorporates incremental view maintenance cost directly into the optimization objective of an integer linear program and applies local search to efficiently explore the solution space. In order to apply local search to the view selection problem, we develop neighboring solutions using sub-query containment, and select initial solutions based on sub-query frequency, utility, or utility per storage unit. Experiments using Redbench, a benchmark simulating real-world query workloads on Amazon Redshift, show that our approach outperforms BIGSUBS in both optimization utility and the quality of selected views.

1.6 | CL | Jan 12

Adaptive Layer Selection for Layer-Wise Token Pruning in LLM Inference

Rei Taniguchi, Yuyang Dong, Makoto Onizuka et al.

Due to the prevalence of large language models (LLMs), key-value (KV) cache reduction for LLM inference has received remarkable attention. Among numerous works that have been proposed in recent years, layer-wise token pruning approaches, which select a subset of tokens at particular layers to retain in KV cache and prune others, are one of the most popular schemes. They primarily adopt a set of pre-defined layers, at which tokens are selected. Such design is inflexible in the sense that the accuracy significantly varies across tasks and deteriorates in harder tasks such as KV retrieval. In this paper, we propose ASL, a training-free method that adaptively chooses the selection layer for KV cache reduction, exploiting the variance of token ranks ordered by attention score. The proposed method balances the performance across different tasks while meeting the user-specified KV budget requirement. ASL operates during the prefilling stage and can be jointly used with existing KV cache reduction methods such as SnapKV to optimize the decoding stage. By evaluations on the InfiniteBench, RULER, and NIAH benchmarks, we show that equipped with one-shot token selection, where tokens are selected at a layer and propagated to deeper layers, ASL outperforms state-of-the-art layer-wise token selection methods in accuracy while maintaining decoding speed and KV cache reduction.

18.5 | CR | May 28, 2025

Seven Security Challenges That Must be Solved in Cross-domain Multi-agent LLM Systems

Ronny Ko, Jiseong Jeong, Shuyuan Zheng et al.

Large language models (LLMs) are rapidly evolving into autonomous agents that cooperate across organizational boundaries, enabling joint disaster response, supply-chain optimization, and other tasks that demand decentralized expertise without surrendering data ownership. Yet, cross-domain collaboration shatters the unified trust assumptions behind current alignment and containment techniques. An agent benign in isolation may, when receiving messages from an untrusted peer, leak secrets or violate policy, producing risks driven by emergent multi-agent dynamics rather than classical software bugs. This position paper maps the security agenda for cross-domain multi-agent LLM systems. We introduce seven categories of novel security challenges, for each of which we also present plausible attacks, security evaluation metrics, and future research guidelines.

4.1 | LG | Apr 8, 2025

CKGAN: Training Generative Adversarial Networks Using Characteristic Kernel Integral Probability Metrics

Kuntian Zhang, Simin Yu, Yaoshu Wang et al.

In this paper, we propose CKGAN, a novel generative adversarial network (GAN) variant based on an integral probability metrics framework with characteristic kernel (CKIPM). CKIPM, as a distance between two probability distributions, is designed to optimize the lower bound of the maximum mean discrepancy (MMD) in a reproducing kernel Hilbert space, and thus can be used to train GANs. CKGAN mitigates the notorious problem of mode collapse by mapping the generated images back to random noise. To save the effort of selecting the kernel function manually, we propose a soft selection method to automatically learn a characteristic kernel function. The experimental evaluation conducted on a set of synthetic and real image benchmarks (MNIST, CelebA, etc.) demonstrates that CKGAN generally outperforms other MMD-based GANs. The results also show that, at the cost of moderately more training time, the automatically selected kernel function delivers performance very close to that of the best manually fine-tuned one on real image benchmarks and is able to improve the performance of other MMD-based GANs.
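As background for the MMD quantity mentioned above, the sketch below computes the standard biased (V-statistic) estimate of squared MMD between two samples using a Gaussian kernel, which is a characteristic kernel. It is not CKIPM itself and does not include the lower-bound objective or the learned kernel selection; the fixed bandwidth and all names are assumptions.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two vectors given as sequences."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of squared MMD between samples xs and ys."""
    def mean_kernel(a, b):
        total = sum(gaussian_kernel(p, q, bandwidth) for p in a for q in b)
        return total / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Hypothetical 1-D samples: one pair identical, one pair far apart.
same = mmd_squared([[0.0], [1.0]], [[0.0], [1.0]])
far = mmd_squared([[0.0], [1.0]], [[5.0], [6.0]])
```

The estimate is zero when the two samples coincide and grows as they separate, which is the property an MMD-based GAN objective relies on when comparing generated and real data.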

2.0 | IR | Jan 30, 2022

Misato Horiuchi, Yuya Sasaki, Chuan Xiao et al.

Computational notebook software such as Jupyter Notebook is popular for data science tasks. Numerous computational notebooks are available on the Web and reusable; however, searching for computational notebooks manually is a tedious task, and so far, there are no tools to search for computational notebooks effectively and efficiently. In this paper, we propose a similarity search on computational notebooks and develop a new framework for the similarity search. Given contents (i.e., source code, tabular data, libraries, and output formats) in computational notebooks as a query, the similarity search problem aims to find the top-k computational notebooks with the most similar contents. We define two similarity measures: set-based and graph-based similarities. Set-based similarity handles each content independently, while graph-based similarity captures the relationships between contents. Our framework can effectively prune the candidates of computational notebooks that should not be in the top-k results. Furthermore, we develop optimization techniques such as caching and indexing to accelerate the search. Experiments using Kaggle notebooks show that our method, in particular graph-based similarity, can achieve high accuracy and high efficiency.
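A brute-force version of set-based top-k search could look like the sketch below, which scores each content type independently and averages the scores. The choice of Jaccard similarity, the unweighted average, and the dictionary layout are all assumptions made for illustration; the paper's actual similarity definitions, pruning, caching, and indexing are not reproduced here.

```python
import heapq


def jaccard(a, b):
    """Jaccard similarity of two collections treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def set_similarity(query, notebook):
    """Average Jaccard similarity over the content types in the query."""
    return sum(jaccard(query[k], notebook.get(k, ())) for k in query) / len(query)


def top_k(query, notebooks, k):
    """Names of the k notebooks most similar to the query (no pruning)."""
    return heapq.nlargest(
        k, notebooks, key=lambda name: set_similarity(query, notebooks[name])
    )


# Hypothetical corpus of three notebooks described by two content types.
query = {"libraries": {"pandas", "numpy"}, "outputs": {"table"}}
notebooks = {
    "nb1": {"libraries": {"pandas", "numpy"}, "outputs": {"table"}},
    "nb2": {"libraries": {"pandas"}, "outputs": {"plot"}},
    "nb3": {"libraries": {"torch"}, "outputs": {"plot"}},
}
best = top_k(query, notebooks, 2)
```

This scans every notebook, so it only serves as a reference point for what a pruning or indexing scheme would need to reproduce.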

11.3 | LG | Oct 15, 2021

FedMe: Federated Learning via Model Exchange

Koji Matsuda, Yuya Sasaki, Chuan Xiao et al.

Federated learning is a distributed machine learning method in which a single server and multiple clients collaboratively build machine learning models without sharing datasets on clients. Numerous methods have been proposed to cope with the data heterogeneity issue in federated learning. Existing solutions require a model architecture tuned by the central server, yet a major technical challenge is that it is difficult to tune the model architecture due to the absence of local data on the central server. In this paper, we propose Federated learning via Model exchange (FedMe), which personalizes models with automatic model architecture tuning during the learning process. The novelty of FedMe lies in its learning process: clients exchange their models for model architecture tuning and model training. First, to optimize the model architectures for local data, clients tune their own personalized models by comparing to exchanged models and picking the one that yields the best performance. Second, clients train both personalized models and exchanged models by using deep mutual learning, in spite of different model architectures across the clients. We perform experiments on three real datasets and show that FedMe outperforms state-of-the-art federated learning methods while tuning model architectures automatically.

1.6 | LG | Aug 16, 2021

AIREX: Neural Network-based Approach for Air Quality Inference in Unmonitored Cities

Yuya Sasaki, Kei Harada, Shohei Yamasaki et al.

Urban air pollution is a major environmental problem affecting human health and quality of life. Monitoring stations have been established to continuously obtain air quality information, but they do not cover all areas. Thus, there are numerous methods for spatially fine-grained air quality inference. Since existing methods aim to infer air quality of locations only in monitored cities, they do not consider inferring air quality in unmonitored cities. In this paper, we are the first to study air quality inference in unmonitored cities. To accurately infer air quality in unmonitored cities, we propose a neural network-based approach, AIREX. The novelty of AIREX is employing a mixture-of-experts approach, which is a machine learning technique based on the divide-and-conquer principle, to learn correlations of air quality between multiple cities. To further boost the performance, it employs attention mechanisms to compute impacts of air quality inference from the monitored cities to the locations in the unmonitored city. We show, through experiments on a real-world air quality dataset, that AIREX achieves higher accuracy than state-of-the-art methods.

7.3 | DB | May 20, 2020 | Code

Consistent and Flexible Selectivity Estimation for High-Dimensional Data

Yaoshu Wang, Chuan Xiao, Jianbin Qin et al.

Selectivity estimation aims at estimating the number of database objects that satisfy a selection criterion. Answering this problem accurately and efficiently is essential to many applications, such as density estimation, outlier detection, query optimization, and data integration. The estimation problem is especially challenging for large-scale high-dimensional data due to the curse of dimensionality, the large variance of selectivity across different queries, and the need to make the estimator consistent (i.e., the selectivity is non-decreasing in the threshold). We propose a new deep learning-based model that learns a query-dependent piecewise linear function as selectivity estimator, which is flexible to fit the selectivity curve of any distance function and query object, while guaranteeing that the output is non-decreasing in the threshold. To improve the accuracy for large datasets, we propose to partition the dataset into multiple disjoint subsets and build a local model on each of them. We perform experiments on real datasets and show that the proposed model consistently outperforms state-of-the-art models in accuracy in an efficient way and is useful for real applications.
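The consistency requirement can be illustrated with a piecewise linear function whose segment increments are constrained to be non-negative, which makes the estimate non-decreasing in the threshold by construction. In the sketch below the knots and increments are hard-coded, whereas the abstract describes a model that learns a query-dependent function; all names and values are assumptions.

```python
from bisect import bisect_right


def make_estimator(knots, increments):
    """Build a non-decreasing piecewise linear function of the threshold."""
    if len(knots) != len(increments) + 1:
        raise ValueError("need exactly one increment per segment")
    if any(b <= a for a, b in zip(knots, knots[1:])):
        raise ValueError("knots must be strictly increasing")
    if any(d < 0 for d in increments):
        raise ValueError("increments must be non-negative")

    # Cumulative selectivity at each knot; starts at 0 at the first knot.
    values = [0.0]
    for d in increments:
        values.append(values[-1] + d)

    def estimate(threshold):
        if threshold <= knots[0]:
            return values[0]
        if threshold >= knots[-1]:
            return values[-1]
        i = bisect_right(knots, threshold) - 1
        frac = (threshold - knots[i]) / (knots[i + 1] - knots[i])
        return values[i] + frac * (values[i + 1] - values[i])

    return estimate


# Hypothetical estimator: 10 objects within distance 1, 40 within distance 2.
estimate = make_estimator([0.0, 1.0, 2.0], [10.0, 30.0])
samples = [estimate(t / 10.0) for t in range(0, 31)]
```

Constraining the increments rather than the outputs is the design point: any non-negative values a model produces for the increments yield a valid monotone curve.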

1.5 | LG | Sep 21, 2018 | Code

Non-linear Attributed Graph Clustering by Symmetric NMF with PU Learning

Seiji Maekawa, Koh Takeuch, Makoto Onizuka

We consider the clustering problem of attributed graphs. Our challenge is how to design an effective and efficient clustering method that precisely captures the hidden relationship between the topology and the attributes in real-world graphs. We propose Non-linear Attributed Graph Clustering by Symmetric Non-negative Matrix Factorization with Positive Unlabeled Learning. The features of our method are threefold: 1) it learns a non-linear projection function between the different cluster assignments of the topology and the attributes of graphs so as to capture the complicated relationship between the topology and the attributes in real-world graphs, 2) it leverages positive unlabeled learning to take the effect of partially observed positive edges into the cluster assignment, and 3) it achieves efficient computational complexity, $O((n^2+mn)kt)$, where $n$ is the vertex size, $m$ is the attribute size, $k$ is the number of clusters, and $t$ is the number of iterations for learning the cluster assignment. We conducted extensive experiments with various clustering methods on various real datasets to validate that our method outperforms existing clustering methods in clustering quality.