Tatsuya Akutsu

LG
h-index18
16papers
33citations
Novelty47%
AI Score50

16 Papers

QMMay 28
Mixing Vector Model for Copolymer Inference via Mixed Integer Linear Programming

Jianshen Zhu, Raveena Rai, Taiyo Sohkawa et al.

A novel two-phase molecule inference framework, mol-infer, has recently been developed to infer chemical graphs with prescribed abstract structures and desired property values through mixed integer linear programming (MILP) under the two-layered model, with guaranteed optimality and exactness relative to the given learned prediction function and structural constraints. In this study, we extend this framework to copolymers by introducing a simple feature representation, called the mixing vector (MV) model. In the proposed model, a copolymer feature vector is represented as a convex combination of MILP-tractable monomer descriptors weighted by the mixing ratio of the constituent monomers. This representation does not require explicit sequence-class information and is therefore naturally compatible with MILP-based inverse design. Under this model, we construct prediction functions for several copolymer property datasets using artificial neural networks, reduced quadratic multiple linear regression, and random forests. The proposed representation achieves practically useful predictive performance across multiple physicochemical property datasets; in particular, the best test R^2 score exceeds 0.7 for nine of the ten datasets and exceeds 0.9 for six datasets. We also formulate a multi-monomer inverse-design problem under the MV representation with a prescribed mixing ratio and show that the resulting MILP instances remain tractable, even for three-monomer settings. Finally, we perform an external consistency check by re-evaluating the inferred candidates and comparing the re-computed property values with those predicted by the learned model. Overall, the proposed framework gives a tractable first step toward model-level exact inverse design of copolymers under the two-layered model.

LGSep 6, 2024Code
A Unified Approach to Inferring Chemical Compounds with the Desired Aqueous Solubility

Muniba Batool, Naveed Ahmed Azam, Jianshen Zhu et al.

Aqueous solubility (AS) is a key physiochemical property that plays a crucial role in drug discovery and material design. We report a novel unified approach to predict and infer chemical compounds with the desired AS based on simple deterministic graph-theoretic descriptors, multiple linear regression (MLR) and mixed integer linear programming (MILP). Selected descriptors based on a forward stepwise procedure enabled the simplest regression model, MLR, to achieve significantly good prediction accuracy compared to the existing approaches, achieving the accuracy in the range [0.7191, 0.9377] for 29 diverse datasets. By simulating these descriptors and learning models as MILPs, we inferred mathematically exact and optimal compounds with the desired AS, prescribed structures, and up to 50 non-hydrogen atoms in a reasonable time range [6, 1204] seconds. These findings indicate a strong correlation between the simple graph-theoretic descriptors and the AS of compounds, potentially leading to a deeper understanding of their AS without relying on widely used complicated chemical descriptors and complex machine learning models that are computationally expensive, and therefore difficult to use for inference. An implementation of the proposed approach is available at https://github.com/ku-dml/mol-infer/tree/master/AqSol.

BMSep 13, 2022
Molecular Design Based on Integer Programming and Quadratic Descriptors in a Two-layered Model

Jianshen Zhu, Naveed Ahmed Azam, Shengjuan Cao et al.

A novel framework has recently been proposed for designing the molecular structure of chemical compounds with a desired chemical property, where design of novel drugs is an important topic in bioinformatics and chemo-informatics. The framework infers a desired chemical graph by solving a mixed integer linear program (MILP) that simulates the computation process of a feature function defined by a two-layered model on chemical graphs and a prediction function constructed by a machine learning method. A set of graph theoretical descriptors in the feature function plays a key role to derive a compact formulation of such an MILP. To improve the learning performance of prediction functions in the framework maintaining the compactness of the MILP, this paper utilizes the product of two of those descriptors as a new descriptor and then designs a method of reducing the number of descriptors. The results of our computational experiments suggest that the proposed method improved the learning performance for many chemical properties and can infer a chemical structure with up to 50 non-hydrogen atoms.

CEApr 27, 2023
Molecular Design Based on Integer Programming and Splitting Data Sets by Hyperplanes

Jianshen Zhu, Naveed Ahmed Azam, Kazuya Haraguchi et al.

A novel framework for designing the molecular structure of chemical compounds with a desired chemical property has recently been proposed. The framework infers a desired chemical graph by solving a mixed integer linear program (MILP) that simulates the computation process of a feature function defined by a two-layered model on chemical graphs and a prediction function constructed by a machine learning method. To improve the learning performance of prediction functions in the framework, we design a method that splits a given data set $\mathcal{C}$ into two subsets $\mathcal{C}^{(i)},i=1,2$ by a hyperplane in a chemical space so that most compounds in the first (resp., second) subset have observed values lower (resp., higher) than a threshold $θ$. We construct a prediction function $ψ$ to the data set $\mathcal{C}$ by combining prediction functions $ψ_i,i=1,2$ each of which is constructed on $\mathcal{C}^{(i)}$ independently. The results of our computational experiments suggest that the proposed method improved the learning performance for several chemical properties to which a good prediction function has been difficult to construct.

LGAug 9, 2024
Cycle-Configuration: A Novel Graph-theoretic Descriptor Set for Molecular Inference

Bowen Song, Jianshen Zhu, Naveed Ahmed Azam et al.

In this paper, we propose a novel family of descriptors of chemical graphs, named cycle-configuration (CC), that can be used in the standard "two-layered (2L) model" of mol-infer, a molecular inference framework based on mixed integer linear programming (MILP) and machine learning (ML). Proposed descriptors capture the notion of ortho/meta/para patterns that appear in aromatic rings, which has been impossible in the framework so far. Computational experiments show that, when the new descriptors are supplied, we can construct prediction functions of similar or better performance for all of the 27 tested chemical properties. We also provide an MILP formulation that asks for a chemical graph with desired properties under the 2L model with CC descriptors (2L+CC model). We show that a chemical graph with up to 50 non-hydrogen vertices can be inferred in a practical time.

LGOct 12, 2025Code
Designing ReLU Generative Networks to Enumerate Trees with a Given Tree Edit Distance

Mamoona Ghafoor, Tatsuya Akutsu

The generation of trees with a specified tree edit distance has significant applications across various fields, including computational biology, structured data analysis, and image processing. Recently, generative networks have been increasingly employed to synthesize new data that closely resembles the original datasets. However, the appropriate size and depth of generative networks required to generate data with a specified tree edit distance remain unclear. In this paper, we theoretically establish the existence and construction of generative networks capable of producing trees similar to a given tree with respect to the tree edit distance. Specifically, for a given rooted, ordered, and vertex-labeled tree T of size n + 1 with labels from an alphabet Σ, and a non-negative integer d, we prove that all rooted, ordered, and vertex-labeled trees over Σwith tree edit distance at most d from T can be generated using a ReLU-based generative network with size O(n^3 ) and constant depth. The proposed networks were implemented and evaluated for generating trees with up to 21 nodes. Due to their deterministic architecture, the networks successfully generated all valid trees within the specified tree edit distance. In contrast, state-of-the-art graph generative models GraphRNN and GraphGDP, which rely on non-deterministic mechanisms, produced significantly fewer valid trees, achieving validation rates of only up to 35% and 48%, respectively. These findings provide a theoretical foundation towards construction of compact generative models and open new directions for exact and valid tree-structured data generation. An implementation of the proposed networks is available at https://github.com/MGANN-KU/TreeGen_ReLUNetworks.

LGApr 7
ReLU Networks for Exact Generation of Similar Graphs

Mamoona Ghafoor, Tatsuya Akutsu

Generation of graphs constrained by a specified graph edit distance from a source graph is important in applications such as cheminformatics, network anomaly synthesis, and structured data augmentation. Despite the growing demand for such constrained generative models in areas including molecule design and network perturbation analysis, the neural architectures required to provably generate graphs within a bounded graph edit distance remain largely unexplored. In addition, existing graph generative models are predominantly data-driven and depend heavily on the availability and quality of training data, which may result in generated graphs that do not satisfy the desired edit distance constraints. In this paper, we address these challenges by theoretically characterizing ReLU neural networks capable of generating graphs within a prescribed graph edit distance from a given graph. In particular, we show the existence of constant depth and O(n^2 d) size ReLU networks that deterministically generate graphs within edit distance d from a given input graph with n vertices, eliminating reliance on training data while guaranteeing validity of the generated graphs. Experimental evaluations demonstrate that the proposed network successfully generates valid graphs for instances with up to 1400 vertices and edit distance bounds up to 140, whereas baseline generative models fail to generate graphs with the desired edit distance. These results provide a theoretical foundation for constructing compact generative models with guaranteed validity.

LGJul 5, 2025
Combining Graph Neural Networks and Mixed Integer Linear Programming for Molecular Inference under the Two-Layered Model

Jianshen Zhu, Naveed Ahmed Azam, Kazuya Haraguchi et al.

Recently, a novel two-phase framework named mol-infer for inference of chemical compounds with prescribed abstract structures and desired property values has been proposed. The framework mol-infer is primarily based on using mixed integer linear programming (MILP) to simulate the computational process of machine learning methods and describe the necessary and sufficient conditions to ensure such a chemical graph exists. The existing approaches usually first convert the chemical compounds into handcrafted feature vectors to construct prediction functions, but because of the limit on the kinds of descriptors originated from the need for tractability in the MILP formulation, the learning performances on datasets of some properties are not good enough. A lack of good learning performance can greatly lower the quality of the inferred chemical graphs, and thus improving learning performance is of great importance. On the other hand, graph neural networks (GNN) offer a promising machine learning method to directly utilize the chemical graphs as the input, and many existing GNN-based approaches to the molecular property prediction problem have shown that they can enjoy better learning performances compared to the traditional approaches that are based on feature vectors. In this study, we develop a molecular inference framework based on mol-infer, namely mol-infer-GNN, that utilizes GNN as the learning method while keeping the great flexibility originated from the two-layered model on the abstract structure of the chemical graph to be inferred. We conducted computational experiments on the QM9 dataset to show that our proposed GNN model can obtain satisfying learning performances for some properties despite its simple structure, and can infer small chemical graphs comprising up to 20 non-hydrogen atoms within reasonable computational time.

CHEM-PHFeb 17, 2025
Towards Environment-Sensitive Molecular Inference via Mixed Integer Linear Programming

Jianshen Zhu, Mao Takekida, Naveed Ahmed Azam et al.

Traditional QSAR/QSPR and inverse QSAR/QSPR methods often assume that chemical properties are dictated by single molecules, overlooking the influence of molecular interactions and environmental factors. In this paper, we introduce a novel QSAR/QSPR framework that can capture the combined effects of multiple molecules (e.g., small molecules or polymers) and experimental conditions on property values. We design a feature function to integrate the information of multiple molecules and the environment. Specifically, for the property Flory-Huggins $χ$-parameter, which characterizes the thermodynamic properties between the solute and the solvent, and varies in temperatures, we demonstrate through computational experimental results that our approach can achieve a competitively high learning performance compared to existing works on predicting $χ$-parameter values, while inferring the solute polymers with up to 50 non-hydrogen atoms in their monomer forms in a relatively short time. A comparison study with the simulation software J-OCTA demonstrates that the polymers inferred by our methods are of high quality.

LGDec 16, 2023
On the Trade-off between the Number of Nodes and the Number of Trees in a Random Forest

Tatsuya Akutsu, Avraham A. Melkman, Atsuhiro Takasu

In this paper, we focus on the prediction phase of a random forest and study the problem of representing a bag of decision trees using a smaller bag of decision trees, where we only consider binary decision problems on the binary domain and simple decision trees in which an internal node is limited to querying the Boolean value of a single variable. As a main result, we show that the majority function of $n$ variables can be represented by a bag of $T$ ($< n$) decision trees each with polynomial size if $n-T$ is a constant, where $n$ and $T$ must be odd (in order to avoid the tie break). We also show that a bag of $n$ decision trees can be represented by a bag of $T$ decision trees each with polynomial size if $n-T$ is a constant and a small classification error is allowed. A related result on the $k$-out-of-$n$ functions is presented too.

NEDec 21, 2021
On the Size and Width of the Decoder of a Boolean Threshold Autoencoder

Tatsuya Akutsu, Avraham A. Melkman

In this paper, we study the size and width of autoencoders consisting of Boolean threshold functions, where an autoencoder is a layered neural network whose structure can be viewed as consisting of an encoder, which compresses an input vector to a lower dimensional vector, and a decoder which transforms the low-dimensional vector back to the original input vector exactly (or approximately). We focus on the decoder part, and show that $Ω(\sqrt{Dn/d})$ and $O(\sqrt{Dn})$ nodes are required to transform $n$ vectors in $d$-dimensional binary space to $D$-dimensional binary space. We also show that the width can be reduced if we allow small errors, where the error is defined as the average of the Hamming distance between each vector input to the encoder part and the resulting vector output by the decoder.

LGAug 24, 2021
A Method for Inferring Polymers Based on Linear Regression and Integer Programming

Ryota Ido, Shengjuan Cao, Jianshen Zhu et al.

A novel framework has recently been proposed for designing the molecular structure of chemical compounds with a desired chemical property using both artificial neural networks and mixed integer linear programming. In this paper, we design a new method for inferring a polymer based on the framework. For this, we introduce a new way of representing a polymer as a form of monomer and define new descriptors that feature the structure of polymers. We also use linear regression as a building block of constructing a prediction function in the framework. The results of our computational experiments reveal a set of chemical properties on polymers to which a prediction function constructed with linear regression performs well. We also observe that the proposed method can infer polymers with up to 50 non-hydrogen atoms in a monomer form.

LGAug 23, 2021
Molecular Design Based on Artificial Neural Networks, Integer Programming and Grid Neighbor Search

Naveed Ahmed Azam, Jianshen Zhu, Kazuya Haraguchi et al.

A novel framework has recently been proposed for designing the molecular structure of chemical compounds with a desired chemical property using both artificial neural networks and mixed integer linear programming. In the framework, a chemical graph with a target chemical value is inferred as a feasible solution of a mixed integer linear program that represents a prediction function and other requirements on the structure of graphs. In this paper, we propose a procedure for generating other feasible solutions of the mixed integer linear program by searching the neighbor of output chemical graph in a search space. The procedure is combined in the framework as a new building block. The results of our computational experiments suggest that the proposed method can generate an additional number of new chemical graphs with up to 50 non-hydrogen atoms.

LGJul 6, 2021
An Inverse QSAR Method Based on Linear Regression and Integer Programming

Jianshen Zhu, Naveed Ahmed Azam, Kazuya Haraguchi et al.

Recently a novel framework has been proposed for designing the molecular structure of chemical compounds using both artificial neural networks (ANNs) and mixed integer linear programming (MILP). In the framework, we first define a feature vector $f(C)$ of a chemical graph $C$ and construct an ANN that maps $x=f(C)$ to a predicted value $η(x)$ of a chemical property $π$ to $C$. After this, we formulate an MILP that simulates the computation process of $f(C)$ from $C$ and that of $η(x)$ from $x$. Given a target value $y^*$ of the chemical property $π$, we infer a chemical graph $C^\dagger$ such that $η(f(C^\dagger))=y^*$ by solving the MILP. In this paper, we use linear regression to construct a prediction function $η$ instead of ANNs. For this, we derive an MILP formulation that simulates the computation process of a prediction function by linear regression. The results of computational experiments suggest our method can infer chemical graphs with around up to 50 non-hydrogen atoms.

LGApr 21, 2020
On the Compressive Power of Boolean Threshold Autoencoders

Avraham A. Melkman, Sini Guo, Wai-Ki Ching et al.

An autoencoder is a layered neural network whose structure can be viewed as consisting of an encoder, which compresses an input vector of dimension $D$ to a vector of low dimension $d$, and a decoder which transforms the low-dimensional vector back to the original input vector (or one that is very similar). In this paper we explore the compressive power of autoencoders that are Boolean threshold networks by studying the numbers of nodes and layers that are required to ensure that the numbers of nodes and layers that are required to ensure that each vector in a given set of distinct input binary vectors is transformed back to its original. We show that for any set of $n$ distinct vectors there exists a seven-layer autoencoder with the smallest possible middle layer, (i.e., its size is logarithmic in $n$), but that there is a set of $n$ vectors for which there is no three-layer autoencoder with a middle layer of the same size. In addition we present a kind of trade-off: if a considerably larger middle layer is permissible then a five-layer autoencoder does exist. We also study encoding by itself. The results we obtain suggest that it is the decoding that constitutes the bottleneck of autoencoding. For example, there always is a three-layer Boolean threshold encoder that compresses $n$ vectors into a dimension that is reduced to twice the logarithm of $n$.

MLJun 3, 2014
Maximum margin classifier working in a set of strings

Hitoshi Koyano, Morihiro Hayashida, Tatsuya Akutsu

Numbers and numerical vectors account for a large portion of data. However, recently the amount of string data generated has increased dramatically. Consequently, classifying string data is a common problem in many fields. The most widely used approach to this problem is to convert strings into numerical vectors using string kernels and subsequently apply a support vector machine that works in a numerical vector space. However, this non-one-to-one conversion involves a loss of information and makes it impossible to evaluate, using probability theory, the generalization error of a learning machine, considering that the given data to train and test the machine are strings generated according to probability laws. In this study, we approach this classification problem by constructing a classifier that works in a set of strings. To evaluate the generalization error of such a classifier theoretically, probability theory for strings is required. Therefore, we first extend a limit theorem on the asymptotic behavior of a consensus sequence of strings, which is the counterpart of the mean of numerical vectors, as demonstrated in the probability theory on a metric space of strings developed by one of the authors and his colleague in a previous study [18]. Using the obtained result, we then demonstrate that our learning machine classifies strings in an asymptotically optimal manner. Furthermore, we demonstrate the usefulness of our machine in practical data analysis by applying it to predicting protein--protein interactions using amino acid sequences.