AI Sep 23, 2022
Predicting the Future of AI with AI: High-quality link prediction in an exponentially growing knowledge network
Mario Krenn, Lorenzo Buffoni, Bruno Coutinho et al.
A tool that could suggest new personalized research directions and ideas by taking insights from the scientific literature could significantly accelerate the progress of science. A field that might benefit from such an approach is artificial intelligence (AI) research, where the number of scientific publications has been growing exponentially in recent years, making it challenging for human researchers to keep track of the progress. Here, we use AI techniques to predict the future research directions of AI itself. We develop a new graph-based benchmark based on real-world data, the Science4Cast benchmark, which aims to predict the future state of an evolving semantic network of AI. For that, we use more than 100,000 research papers and build up a knowledge network with more than 64,000 concept nodes. We then present ten diverse methods to tackle this task, ranging from purely statistical to purely learning-based methods. Surprisingly, the most powerful methods use a carefully curated set of network features, rather than an end-to-end AI approach. This suggests that purely ML approaches, which use no human knowledge, still have substantial untapped potential. Ultimately, better predictions of new future research directions will be a crucial component of more advanced research suggestion tools.
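To make the idea of link prediction from hand-picked network features concrete, here is a minimal sketch. It is not the benchmark's or any team's method: the three features (two degrees and the number of common neighbors), the scoring rule, and the toy concept names are all assumptions chosen for illustration.

```python
from itertools import combinations

def build_adjacency(edges):
    """Undirected adjacency sets from a list of (concept, concept) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def pair_features(adj, u, v):
    """Hypothetical hand-picked features for an unconnected pair of concepts."""
    return {
        "degree_u": len(adj[u]),
        "degree_v": len(adj[v]),
        "common_neighbors": len(adj[u] & adj[v]),
    }

def rank_candidate_links(edges):
    """Rank unconnected pairs, most likely future link first (illustrative rule)."""
    adj = build_adjacency(edges)
    candidates = [
        (u, v) for u, v in combinations(sorted(adj), 2) if v not in adj[u]
    ]

    def score(pair):
        f = pair_features(adj, *pair)
        # Assumed rule: common neighbors first, then the product of degrees.
        return (f["common_neighbors"], f["degree_u"] * f["degree_v"])

    return sorted(candidates, key=score, reverse=True)

# Toy snapshot of a concept network (made-up data).
toy_edges = [
    ("neural network", "optimization"),
    ("neural network", "graph"),
    ("optimization", "graph"),
    ("graph", "link prediction"),
    ("optimization", "kernel"),
]
ranked_pairs = rank_candidate_links(toy_edges)
```

In a real setting the features would be computed on a snapshot of the network up to some cutoff date and fed to a trained classifier, with links formed after the cutoff serving as labels.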
GT Oct 24, 2017
Product-Mix Auctions and Tropical Geometry
Ngoc Mai Tran, Josephine Yu
In recent and ongoing work, Baldwin and Klemperer explored a connection between tropical geometry and economics. They gave a sufficient condition for the existence of competitive equilibrium in product-mix auctions of indivisible goods. This result, which we call the Unimodularity Theorem, can also be traced back to the work of Danilov, Koshevoy, and Murota in discrete convex analysis. We give a new proof of the Unimodularity Theorem via the classical unimodularity theorem in integer programming. We give a unified treatment of these results via tropical geometry and formulate a new sufficient condition for competitive equilibrium when there are only two types of product. Generalizations of our theorem in higher dimensions are equivalent to various forms of the Oda conjecture in algebraic geometry.
ME Mar 6, 2011
Pairwise ranking: choice of method can produce arbitrarily different rank order
Ngoc Mai Tran
We examine three methods for ranking by pairwise comparison: Principal Eigenvector, HodgeRank and Tropical Eigenvector. It is shown that the choice of method can produce arbitrarily different rank order. To be precise, for any two of the three methods, and for any pair of rankings of at least four items, there exists a comparison matrix for the items such that the rankings found by the two methods are the prescribed ones. We discuss the implications of this result in practice, study the geometry of the methods, and state some open problems.
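The three methods can be sketched on a small comparison matrix as follows. This is a sketch under several assumptions, not the paper's code: entry `a[i][j]` is taken to measure how strongly item `i` is preferred to item `j`, the matrix is positive and reciprocal, HodgeRank is computed as the row geometric mean, and the tropical (max-times) eigenvector is obtained in log coordinates from a critical column of the Kleene plus matrix. The matrix `comparisons` is made-up data.

```python
import math

def principal_eigenvector(a, iterations=1000):
    """Perron vector of a positive matrix by power iteration (entries sum to 1)."""
    n = len(a)
    v = [1.0 / n] * n
    for _ in range(iterations):
        w = [sum(a[i][j] * v[j] for j in range(n)) for i in range(n)]
        total = sum(w)
        v = [x / total for x in w]
    return v

def hodge_rank(a):
    """Row geometric means of the comparison matrix."""
    return [math.exp(sum(math.log(x) for x in row) / len(row)) for row in a]

def maxplus_matmul(x, y):
    """Matrix product in the max-plus semiring."""
    n = len(x)
    return [[max(x[i][k] + y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def tropical_eigenpair(a):
    """Max-times eigenvalue and one eigenvector of a positive matrix."""
    n = len(a)
    log_a = [[math.log(x) for x in row] for row in a]
    # Eigenvalue = maximum cycle mean: best diagonal entry of the k-th
    # max-plus power divided by k, over k = 1..n.
    power, lam = log_a, float("-inf")
    for k in range(1, n + 1):
        lam = max(lam, max(power[i][i] for i in range(n)) / k)
        power = maxplus_matmul(power, log_a)
    b = [[x - lam for x in row] for row in log_a]
    # Kleene plus: entrywise maximum of the max-plus powers B, B^2, ..., B^n.
    plus, power = b, b
    for _ in range(n - 1):
        power = maxplus_matmul(power, b)
        plus = [[max(plus[i][j], power[i][j]) for j in range(n)]
                for i in range(n)]
    # A critical node has diagonal entry 0 (the maximum); its column is an eigenvector.
    critical = max(range(n), key=lambda i: plus[i][i])
    return math.exp(lam), [math.exp(plus[i][critical]) for i in range(n)]

def ranking(scores):
    """Item indices ordered from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Made-up reciprocal comparison matrix on four items.
comparisons = [
    [1.0, 2.0, 5.0, 1 / 3],
    [1 / 2, 1.0, 3.0, 4.0],
    [1 / 5, 1 / 3, 1.0, 2.0],
    [3.0, 1 / 4, 1 / 2, 1.0],
]
rankings = {
    "principal": ranking(principal_eigenvector(comparisons)),
    "hodge": ranking(hodge_rank(comparisons)),
    "tropical": ranking(tropical_eigenpair(comparisons)[1]),
}
```

On a perfectly consistent matrix, where `a[i][j] = s[i] / s[j]` for some score vector `s`, all three methods recover the order of `s`; disagreement can only arise from inconsistency in the comparisons.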
ME Jan 23, 2012
HodgeRank is the limit of Perron Rank
Ngoc Mai Tran
We study the map which takes an elementwise positive matrix to the $k$-th root of the principal eigenvector of its $k$-th Hadamard power. We show that as $k$ tends to 0 one recovers the row geometric mean vector and discuss the geometric significance of this convergence. In the context of pairwise comparison ranking, our result states that HodgeRank is the limit of Perron Rank, thereby providing a novel mathematical link between two important pairwise ranking methods.
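The convergence can be checked numerically. The sketch below is an illustration under stated assumptions, not the paper's code: the principal eigenvector is approximated by power iteration, the root is taken in log coordinates to avoid underflow for small $k$, both vectors are rescaled so that their entries multiply to 1, and the matrix `comparisons` is made-up data.

```python
import math

def principal_eigenvector(matrix, iterations=2000):
    """Perron vector of a positive matrix by power iteration (entries sum to 1)."""
    n = len(matrix)
    v = [1.0 / n] * n
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        total = sum(w)
        v = [x / total for x in w]
    return v

def center_and_exp(logs):
    """Exponentiate after subtracting the mean, so the entries multiply to 1."""
    mean = sum(logs) / len(logs)
    return [math.exp(x - mean) for x in logs]

def perron_rank(matrix, k):
    """k-th root of the principal eigenvector of the k-th Hadamard power."""
    powered = [[a ** k for a in row] for row in matrix]
    v = principal_eigenvector(powered)
    # log(v) / k is the log of the k-th root; staying in logs avoids underflow.
    return center_and_exp([math.log(x) / k for x in v])

def row_geometric_mean(matrix):
    """Row geometric mean vector, rescaled so the entries multiply to 1."""
    return center_and_exp(
        [sum(math.log(a) for a in row) / len(row) for row in matrix]
    )

def max_gap(u, v):
    """Largest entrywise difference between two vectors."""
    return max(abs(x - y) for x, y in zip(u, v))

# Made-up, deliberately inconsistent comparison matrix.
comparisons = [
    [1.0, 2.0, 5.0, 1 / 3],
    [1 / 2, 1.0, 3.0, 4.0],
    [1 / 5, 1 / 3, 1.0, 2.0],
    [3.0, 1 / 4, 1 / 2, 1.0],
]
target = row_geometric_mean(comparisons)
gaps = {k: max_gap(perron_rank(comparisons, k), target) for k in (1.0, 0.1, 0.001)}
```

Inspecting `gaps` shows how far Perron Rank at each $k$ is from the row geometric mean; the gap should shrink as $k$ decreases.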
SI Nov 29, 2021
Improving random walk rankings with feature selection and imputation
Ngoc Mai Tran, Yangxinyu Xie
The Science4Cast Competition consists of predicting new links in a semantic network, with each node representing a concept and each edge representing a link proposed by a paper relating two concepts. This network contains information from 1994-2017, discretized by day (the publication date of the underlying papers). Team Hash Brown's final submission, ee5a, achieved a score of 0.92738 on the test set. Our team's score ranks second, 0.01 below the winner's score. This paper details our model, its intuition, and the performance of its variations on the test set.
ST Sep 22, 2021
Minimax Rates for High-Dimensional Random Tessellation Forests
Eliza O'Reilly, Ngoc Mai Tran
Random forests are a popular class of algorithms used for regression and classification. The algorithm introduced by Breiman in 2001 and many of its variants are ensembles of randomized decision trees built from axis-aligned partitions of the feature space. One such variant, called Mondrian forests, was proposed to handle the online setting and is the first class of random forests for which minimax rates were obtained in arbitrary dimension. However, the restriction to axis-aligned splits fails to capture dependencies between features, and random forests that use oblique splits have shown improved empirical performance for many tasks. In this work, we show that a large class of random forests with general split directions also achieve minimax optimal convergence rates in arbitrary dimension. This class includes STIT forests, a generalization of Mondrian forests to arbitrary split directions, as well as random forests derived from Poisson hyperplane tessellations. These are the first results showing that random forest variants with oblique splits can obtain minimax optimality in arbitrary dimension. Our proof technique relies on the novel application of the theory of stationary random tessellations in stochastic geometry to statistical learning theory.
ML Feb 11, 2021
Estimating a Directed Tree for Extremes
Ngoc Mai Tran, Johannes Buck, Claudia Klüppelberg
We propose a new method to estimate a root-directed spanning tree from extreme data. A prominent example is a river network, to be discovered from extreme flow measured at a set of stations. Our new algorithm utilizes qualitative aspects of a max-linear Bayesian network, which has been designed for modelling causality in extremes. The algorithm estimates bivariate scores and returns a root-directed spanning tree. It performs extremely well on benchmark data and new data. We prove that the new estimator is consistent under a max-linear Bayesian network model with noise. We also assess its strengths and limitations in a small simulation study.
ML Mar 19, 2020
Clustering with Fast, Automated and Reproducible assessment applied to longitudinal neural tracking
Hanlin Zhu, Xue Li, Liuyang Sun et al.
Across many areas, from neural tracking to database entity resolution, manual assessment of clusters by human experts presents a bottleneck in rapid development of scalable and specialized clustering methods. To solve this problem we develop C-FAR, a novel method for Fast, Automated and Reproducible assessment of multiple hierarchical clustering algorithms simultaneously. Our algorithm takes any number of hierarchical clustering trees as input, then strategically queries pairs for human feedback, and outputs an optimal clustering among those nominated by these trees. While it is applicable to large datasets in any domain that utilizes pairwise comparisons for assessment, our flagship application is the cluster aggregation step in spike-sorting, the task of assigning waveforms (spikes) in recordings to neurons. On simulated data of 96 neurons under adverse conditions, including drifting and 25% blackout, our algorithm produces near-perfect tracking relative to the ground truth. Our runtime scales linearly in the number of input trees, making it a competitive computational tool. These results indicate that C-FAR is highly suitable as a model selection and assessment tool in clustering tasks.
LG Oct 24, 2017
Classification on Large Networks: A Quantitative Bound via Motifs and Graphons
Andreas Haupt, Mohammad Khatami, Thomas Schultz et al.
When each data point is a large graph, graph statistics such as densities of certain subgraphs (motifs) can be used as feature vectors for machine learning. While intuitive, motif counts are expensive to compute and difficult to work with theoretically. Via graphon theory, we give an explicit quantitative bound for the ability of motif homomorphisms to distinguish large networks under both generative and sampling noise. Furthermore, we give similar bounds for the graph spectrum and connect it to homomorphism densities of cycles. This results in an easily computable classifier on graph data with a theoretical performance guarantee. Our method yields competitive results on classification tasks for the autoimmune disease Lupus Erythematosus.
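A standard identity links cycle homomorphism densities to the spectrum: the number of homomorphisms from the cycle $C_k$ into a graph $G$ on $n$ vertices is $\operatorname{tr}(A^k)$, the sum of the $k$-th powers of the eigenvalues of the adjacency matrix $A$, so $t(C_k, G) = \operatorname{tr}(A^k)/n^k$. The sketch below computes this density exactly. Whether the paper uses exactly this normalization is an assumption, and the helper names and example graphs are illustrative.

```python
from fractions import Fraction

def matmul(x, y):
    """Product of two square integer matrices."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def cycle_homomorphism_density(adjacency, k):
    """t(C_k, G) = trace(A^k) / n^k for a simple graph with adjacency matrix A."""
    n = len(adjacency)
    power = adjacency
    for _ in range(k - 1):
        power = matmul(power, adjacency)
    # trace(A^k) counts closed walks of length k, i.e. homomorphisms from C_k.
    closed_walks = sum(power[i][i] for i in range(n))
    return Fraction(closed_walks, n ** k)

def complete_graph(n):
    """Adjacency matrix of the complete graph on n vertices."""
    return [[int(i != j) for j in range(n)] for i in range(n)]

def cycle_graph(n):
    """Adjacency matrix of the cycle on n >= 3 vertices."""
    return [[int((i - j) % n in (1, n - 1)) for j in range(n)] for i in range(n)]

# A small feature vector of cycle densities for one graph.
features = [cycle_homomorphism_density(complete_graph(4), k) for k in (3, 4, 5)]
```

Because the computation needs only matrix powers (or, equivalently, eigenvalues), such features are much cheaper than counting general motifs.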