DBJun 2
Workload acceleration by optimizing materialized view selection using local searchKaina Anderson, Yohanes Yohanie Fridelin Panduman, Yuya Sasaki et al.
The growing size of database workloads has made view selection a key performance challenge. Materializing frequent sub-queries in workloads improves query efficiency, but it incurs significant view maintenance costs due to updates. Although existing methods such as BIGSUBS address this trade-off between the benefit of using materialized views and the overhead of view maintenance, they have two drawbacks: insufficient maintenance cost modeling and ineffective view selection due to probabilistic techniques. We propose a novel view selection method that incorporates incremental view maintenance cost directly into the optimization objective of an integer linear program and applies local search to efficiently explore the solution space. In order to apply local search to the view selection problem, we develop neighboring solutions using sub-query containment, and select initial solutions based on sub-query frequency, utility, or utility per storage unit. Experiments using Redbench, a benchmark simulating real-world query workloads on Amazon Redshift, show that our approach outperforms BIGSUBS in both optimization utility and the quality of selected views.
LGJun 18, 2022
Beyond Real-world Benchmark Datasets: An Empirical Study of Node Classification with GNNsSeiji Maekawa, Koki Noda, Yuya Sasaki et al.
Graph Neural Networks (GNNs) have achieved great success on a node classification task. Despite the broad interest in developing and evaluating GNNs, they have been assessed with limited benchmark datasets. As a result, the existing evaluation of GNNs lacks fine-grained analysis from various characteristics of graphs. Motivated by this, we conduct extensive experiments with a synthetic graph generator that can generate graphs having controlled characteristics for fine-grained analysis. Our empirical studies clarify the strengths and weaknesses of GNNs from four major characteristics of real-world graphs with class labels of nodes, i.e., 1) class size distributions (balanced vs. imbalanced), 2) edge connection proportions between classes (homophilic vs. heterophilic), 3) attribute values (biased vs. random), and 4) graph sizes (small vs. large). In addition, to foster future research on GNNs, we publicly release our codebase that allows users to evaluate various GNNs with various graphs. We hope this work offers interesting insights for future research.
LGJun 27, 2022
An Empirical Study of Personalized Federated LearningKoji Matsuda, Yuya Sasaki, Chuan Xiao et al.
Federated learning is a distributed machine learning approach in which a single server and multiple clients collaboratively build machine learning models without sharing datasets on clients. A challenging issue of federated learning is data heterogeneity (i.e., data distributions may differ across clients). To cope with this issue, numerous federated learning methods aim at personalized federated learning and build optimized models for clients. Whereas existing studies empirically evaluated their own methods, the experimental settings (e.g., comparison methods, datasets, and client setting) in these studies differ from each other, and it is unclear which personalized federate learning method achieves the best performance and how much progress can be made by using these methods instead of standard (i.e., non-personalized) federated learning. In this paper, we benchmark the performance of existing personalized federated learning through comprehensive experiments to evaluate the characteristics of each method. Our experimental study shows that (1) there are no champion methods, (2) large data heterogeneity often leads to high accurate predictions, and (3) standard federated learning methods (e.g. FedAvg) with fine-tuning often outperform personalized federated learning methods. We open our benchmark tool FedBench for researchers to conduct experimental studies with various experimental settings.
DBJun 8, 2023
Learned spatial data partitioningKeizo Hori, Yuya Sasaki, Daichi Amagata et al.
Due to the significant increase in the size of spatial data, it is essential to use distributed parallel processing systems to efficiently analyze spatial data. In this paper, we first study learned spatial data partitioning, which effectively assigns groups of big spatial data to computers based on locations of data by using machine learning techniques. We formalize spatial data partitioning in the context of reinforcement learning and develop a novel deep reinforcement learning algorithm. Our learning algorithm leverages features of spatial data partitioning and prunes ineffective learning processes to find optimal partitions efficiently. Our experimental study, which uses Apache Sedona and real-world spatial data, demonstrates that our method efficiently finds partitions for accelerating distance join queries and reduces the workload run time by up to 59.4%.
LGJun 21, 2022
Predicting Parking Lot Availability by Graph-to-Sequence Model: A Case Study with SmartSantanderYuya Sasaki, Junya Takayama, Juan Ramón Santana et al.
Nowadays, so as to improve services and urban areas livability, multiple smart city initiatives are being carried out throughout the world. SmartSantander is a smart city project in Santander, Spain, which has relied on wireless sensor network technologies to deploy heterogeneous sensors within the city to measure multiple parameters, including outdoor parking information. In this paper, we study the prediction of parking lot availability using historical data from more than 300 outdoor parking sensors with SmartSantander. We design a graph-to-sequence model to capture the periodical fluctuation and geographical proximity of parking lots. For developing and evaluating our model, we use a 3-year dataset of parking lot availability in the city of Santander. Our model achieves a high accuracy compared with existing sequence-to-sequence models, which is accurate enough to provide a parking information service in the city. We apply our model to a smartphone application to be widely used by citizens and tourists.
LGJul 25, 2022
GNN Transformation Framework for Improving Efficiency and ScalabilitySeiji Maekawa, Yuya Sasaki, George Fletcher et al.
We propose a framework that automatically transforms non-scalable GNNs into precomputation-based GNNs which are efficient and scalable for large-scale graphs. The advantages of our framework are two-fold; 1) it transforms various non-scalable GNNs to scale well to large-scale graphs by separating local feature aggregation from weight learning in their graph convolution, 2) it efficiently executes precomputation on GPU for large-scale graphs by decomposing their edges into small disjoint and balanced sets. Through extensive experiments with large-scale graphs, we demonstrate that the transformed GNNs run faster in training time than existing GNNs while achieving competitive accuracy to the state-of-the-art GNNs. Consequently, our transformation framework provides simple and efficient baselines for future research on scalable GNNs.
LGJul 6, 2022
Scaling Private Deep Learning with Low-Rank and Sparse GradientsRyuichi Ito, Seng Pei Liew, Tsubasa Takahashi et al.
Applying Differentially Private Stochastic Gradient Descent (DPSGD) to training modern, large-scale neural networks such as transformer-based models is a challenging task, as the magnitude of noise added to the gradients at each iteration scales with model dimension, hindering the learning capability significantly. We propose a unified framework, $\textsf{LSG}$, that fully exploits the low-rank and sparse structure of neural networks to reduce the dimension of gradient updates, and hence alleviate the negative impacts of DPSGD. The gradient updates are first approximated with a pair of low-rank matrices. Then, a novel strategy is utilized to sparsify the gradients, resulting in low-dimensional, less noisy updates that are yet capable of retaining the performance of neural networks. Empirical evaluation on natural language processing and computer vision tasks shows that our method outperforms other state-of-the-art baselines.
LGJun 14, 2023
A Simple and Scalable Graph Neural Network for Large Directed GraphsSeiji Maekawa, Yuya Sasaki, Makoto Onizuka
Node classification is one of the hottest tasks in graph analysis. Though existing studies have explored various node representations in directed and undirected graphs, they have overlooked the distinctions of their capabilities to capture the information of graphs. To tackle the limitation, we investigate various combinations of node representations (aggregated features vs. adjacency lists) and edge direction awareness within an input graph (directed vs. undirected). We address the first empirical study to benchmark the performance of various GNNs that use either combination of node representations and edge direction awareness. Our experiments demonstrate that no single combination stably achieves state-of-the-art results across datasets, which indicates that we need to select appropriate combinations depending on the dataset characteristics. In response, we propose a simple yet holistic classification method A2DUG which leverages all combinations of node representations in directed and undirected graphs. We demonstrate that A2DUG stably performs well on various datasets and improves the accuracy up to 11.29 compared with the state-of-the-art methods. To spur the development of new methods, we publicly release our complete codebase under the MIT license.
DBMar 31, 2023
Scardina: Scalable Join Cardinality Estimation by Multiple Density EstimatorsRyuichi Ito, Yuya Sasaki, Chuan Xiao et al.
In recent years, machine learning-based cardinality estimation methods are replacing traditional methods. This change is expected to contribute to one of the most important applications of cardinality estimation, the query optimizer, to speed up query processing. However, none of the existing methods do not precisely estimate cardinalities when relational schemas consist of many tables with strong correlations between tables/attributes. This paper describes that multiple density estimators can be combined to effectively target the cardinality estimation of data with large and complex schemas having strong correlations. We propose Scardina, a new join cardinality estimation method using multiple partitioned models based on the schema structure.
LGAug 30, 2023
Explainable Graph Neural Architecture Search via Monte-Carlo Tree Search (Full version)Yuya Sasaki
The number of graph neural network (GNN) architectures has increased rapidly due to the growing adoption of graph analysis. Although we use GNNs in wide application scenarios, it is a laborious task to design/select optimal GNN architectures in diverse graphs. To reduce human efforts, graph neural architecture search (Graph NAS) has been used to search for a sub-optimal GNN architecture that combines existing components. However, existing Graph NAS methods lack explainability to understand the reasons why the model architecture is selected because they use complex search space and neural models to select architecture. Therefore, we propose an explainable Graph NAS method, called ExGNAS, which consists of (i) a simple search space that can adapt to various graphs and (ii) a search algorithm with Monte-Carlo tree that makes the decision process explainable. The combination of our search space and algorithm achieves finding accurate GNN models and the important functions within the search space. We comprehensively evaluate ExGNAS compared with four state-of-the-art Graph NAS methods in twelve graphs. Our experimental results show that ExGNAS achieves high average accuracy and efficiency; improving accuracy up to 26.1% and reducing run time up to 88%. Furthermore, we show the effectiveness of explainability by questionnaire-based user study and architecture analysis.
DBAug 4, 2024
Mining Path Association Rules in Large Property Graphs (with Appendix)Yuya Sasaki, Panagiotis Karras
How can we mine frequent path regularities from a graph with edge labels and vertex attributes? The task of association rule mining successfully discovers regular patterns in item sets and substructures. Still, to our best knowledge, this concept has not yet been extended to path patterns in large property graphs. In this paper, we introduce the problem of path association rule mining (PARM). Applied to any \emph{reachability path} between two vertices within a large graph, PARM discovers regular ways in which path patterns, identified by vertex attributes and edge labels, co-occur with each other. We develop an efficient and scalable algorithm PIONEER that exploits an anti-monotonicity property to effectively prune the search space. Further, we devise approximation techniques and employ parallelization to achieve scalable path association rule mining. Our experimental study using real-world graph data verifies the significance of path association rules and the efficiency of our solutions.
DLMar 24
Systemic Gendered Citation Imbalance in Computer Science: Evidence from Conferences and JournalsKazuki Nakajima, Yuya Sasaki, Sohei Tokuno et al.
Gender imbalance persists across science, technology, engineering, and mathematics (STEM) fields, including computer science, where it appears in researcher demographics, productivity, recognition, hiring, and career progression. Given computer science's rapid expansion and global influence, addressing this imbalance is essential for broadening participation and fueling innovation. Although journal-oriented disciplines exhibit consistent gender imbalances in citation practices, it remains unclear whether similar patterns arise in the conference-centric culture of computer science. Here, we systematically investigate gender imbalance in citations of conference and journal papers in computer science. We find that papers for which a woman is listed as either first or last author receive fewer citations than expected, partly because of homophilic citation tendencies (i.e., authors tend to cite papers that share specific attributes). This imbalance is especially pronounced for conference papers--particularly those published at top-tier venues--relative to journals. Moreover, we find that the prominence of the first or last author and the structure of their local co-authorship networks are potential drivers of these imbalances. By exploring how conference-centric publishing practices can amplify systemic imbalances in computer science, our study offers insights that may inform efforts to foster more equitable representation in academia.
SINov 26, 2025
Learning Multi-Order Block Structure in Higher-Order NetworksKazuki Nakajima, Yuya Sasaki, Takeaki Uno et al.
Higher-order networks, naturally described as hypergraphs, are essential for modeling real-world systems involving interactions among three or more entities. Stochastic block models offer a principled framework for characterizing mesoscale organization, yet their extension to hypergraphs involves a trade-off between expressive power and computational complexity. A recent simplification, a single-order model, mitigates this complexity by assuming a single affinity pattern governs interactions of all orders. This universal assumption, however, may overlook order-dependent structural details. Here, we propose a framework that relaxes this assumption by introducing a multi-order block structure, in which different affinity patterns govern distinct subsets of interaction orders. Our framework is based on a multi-order stochastic block model and searches for the optimal partition of the set of interaction orders that maximizes out-of-sample hyperlink prediction performance. Analyzing a diverse range of real-world networks, we find that multi-order block structures are prevalent. Accounting for them not only yields better predictive performance over the single-order model but also uncovers sharper, more interpretable mesoscale organization. Our findings reveal that order-dependent mechanisms are a key feature of the mesoscale organization of real-world higher-order networks.
LGOct 21, 2025
Benchmarking Fairness-aware Graph Neural Networks in Knowledge GraphsYuya Sasaki
Graph neural networks (GNNs) are powerful tools for learning from graph-structured data but often produce biased predictions with respect to sensitive attributes. Fairness-aware GNNs have been actively studied for mitigating biased predictions. However, no prior studies have evaluated fairness-aware GNNs on knowledge graphs, which are one of the most important graphs in many applications, such as recommender systems. Therefore, we introduce a benchmarking study on knowledge graphs. We generate new graphs from three knowledge graphs, YAGO, DBpedia, and Wikidata, that are significantly larger than the existing graph datasets used in fairness studies. We benchmark inprocessing and preprocessing methods in different GNN backbones and early stopping conditions. We find several key insights: (i) knowledge graphs show different trends from existing datasets; clearer trade-offs between prediction accuracy and fairness metrics than other graphs in fairness-aware GNNs, (ii) the performance is largely affected by not only fairness-aware GNN methods but also GNN backbones and early stopping conditions, and (iii) preprocessing methods often improve fairness metrics, while inprocessing methods improve prediction accuracy.
LGJun 14, 2025
Algorithmic Fairness: Not a Purely Technical but Socio-Technical PropertyYijun Bian, Lei You, Yuya Sasaki et al.
The rapid trend of deploying artificial intelligence (AI) and machine learning (ML) systems in socially consequential domains has raised growing concerns about their trustworthiness, including potential discriminatory behaviours. Research in algorithmic fairness has generated a proliferation of mathematical definitions and metrics, yet persistent misconceptions and limitations -- both within and beyond the fairness community -- limit their effectiveness, such as an unreached consensus on its understanding, prevailing measures primarily tailored to binary group settings, and superficial handling for intersectional contexts. Here we critically remark on these misconceptions and argue that fairness cannot be reduced to purely technical constraints on models; we also examine the limitations of existing fairness measures through conceptual analysis and empirical illustrations, showing their limited applicability in the face of complex real-world scenarios, challenging prevailing views on the incompatibility between accuracy and fairness as well as that among fairness measures themselves, and outlining three worth-considering principles in the design of fairness measures. We believe these findings will help bridge the gap between technical formalisation and social realities and meet the challenges of real-world AI/ML deployment.
AIMar 24, 2024
Public Perceptions of Fairness Metrics Across BordersYuya Sasaki, Sohei Tokuno, Haruka Maeda et al.
Which fairness metrics are appropriately applicable in your contexts? There may be instances of discordance regarding the perception of fairness, even when the outcomes comply with established fairness metrics. Several questionnaire-based surveys have been conducted to evaluate fairness metrics with human perceptions of fairness. However, these surveys were limited in scope, including only a few hundred participants within a single country. In this study, we conduct an international survey to evaluate public perceptions of various fairness metrics in decision-making scenarios. We collected responses from 1,000 participants in each of China, France, Japan, and the United States, amassing a total of 4,000 participants, to analyze the preferences of fairness metrics. Our survey consists of three distinct scenarios paired with four fairness metrics. This investigation explores the relationship between personal attributes and the choice of fairness metrics, uncovering a significant influence of national context on these preferences.
MLMar 2, 2024
High-Dimensional Tail Index Regression: with An Application to Text Analyses of Viral Posts in Social MediaYuya Sasaki, Jing Tao, Yulong Wang
Motivated by the empirical observation of power-law distributions in the credits (e.g., "likes") of viral social media posts, we introduce a high-dimensional tail index regression model and propose methods for estimation and inference of its parameters. First, we present a regularized estimator, establish its consistency, and derive its convergence rate. Second, we introduce a debiasing technique for the regularized estimator to facilitate inference and prove its asymptotic normality. Third, we extend our approach to handle large-scale online streaming data using stochastic gradient descent. Simulation studies corroborate our theoretical findings. We apply these methods to the text analysis of viral posts on X (formerly Twitter) related to LGBTQ+ topics.
IRJan 30, 2022
Similarity Search on Computational NotebooksMisato Horiuchi, Yuya Sasaki, Chuan Xiao et al.
Computational notebook software such as Jupyter Notebook is popular for data science tasks. Numerous computational notebooks are available on the Web and reusable; however, searching for computational notebooks manually is a tedious task, and so far, there are no tools to search for computational notebooks effectively and efficiently. In this paper, we propose a similarity search on computational notebooks and develop a new framework for the similarity search. Given contents (i.e., source codes, tabular data, libraries, and outputs formats) in computational notebooks as a query, the similarity search problem aims to find top-k computational notebooks with the most similar contents. We define two similarity measures; set-based and graph-based similarities. Set-based similarity handles each content independently, while graph-based similarity captures the relationships between contents. Our framework can effectively prune the candidates of computational notebooks that should not be in the top-k results. Furthermore, we develop optimization techniques such as caching and indexing to accelerate the search. Experiments using Kaggle notebooks show that our method, in particular graph-based similarity, can achieve high accuracy and high efficiency.
LGOct 15, 2021
FedMe: Federated Learning via Model ExchangeKoji Matsuda, Yuya Sasaki, Chuan Xiao et al.
Federated learning is a distributed machine learning method in which a single server and multiple clients collaboratively build machine learning models without sharing datasets on clients. Numerous methods have been proposed to cope with the data heterogeneity issue in federated learning. Existing solutions require a model architecture tuned by the central server, yet a major technical challenge is that it is difficult to tune the model architecture due to the absence of local data on the central server. In this paper, we propose Federated learning via Model exchange (FedMe), which personalizes models with automatic model architecture tuning during the learning process. The novelty of FedMe lies in its learning process: clients exchange their models for model architecture tuning and model training. First, to optimize the model architectures for local data, clients tune their own personalized models by comparing to exchanged models and picking the one that yields the best performance. Second, clients train both personalized models and exchanged models by using deep mutual learning, in spite of different model architectures across the clients. We perform experiments on three real datasets and show that FedMe outperforms state-of-the-art federated learning methods while tuning model architectures automatically.
LGAug 16, 2021
AIREX: Neural Network-based Approach for Air Quality Inference in Unmonitored CitiesYuya Sasaki, Kei Harada, Shohei Yamasaki et al.
Urban air pollution is a major environmental problem affecting human health and quality of life. Monitoring stations have been established to continuously obtain air quality information, but they do not cover all areas. Thus, there are numerous methods for spatially fine-grained air quality inference. Since existing methods aim to infer air quality of locations only in monitored cities, they do not assume inferring air quality in unmonitored cities. In this paper, we first study the air quality inference in unmonitored cities. To accurately infer air quality in unmonitored cities, we propose a neural network-based approach AIREX. The novelty of AIREX is employing a mixture-of-experts approach, which is a machine learning technique based on the divide-and-conquer principle, to learn correlations of air quality between multiple cities. To further boost the performance, it employs attention mechanisms to compute impacts of air quality inference from the monitored cities to the locations in the unmonitored city. We show, through experiments on a real-world air quality dataset, that AIREX achieves higher accuracy than state-of-the-art methods.