SIJun 1, 2023
When Does Bottom-up Beat Top-down in Hierarchical Community Detection?Maximilien Dreveton, Daichi Kuroda, Matthias Grossglauser et al.
Hierarchical clustering of networks consists in finding a tree of communities, such that lower levels of the hierarchy reveal finer-grained community structures. There are two main classes of algorithms tackling this problem. Divisive (top-down) algorithms recursively partition the nodes into two communities, until a stopping rule indicates that no further split is needed. In contrast, agglomerative (bottom-up) algorithms first identify the smallest community structure and then repeatedly merge the communities using a linkage method. In this article, we establish theoretical guarantees for the recovery of the hierarchical tree and community structure of a Hierarchical Stochastic Block Model by a bottom-up algorithm. We also establish that this bottom-up algorithm attains the information-theoretic threshold for exact recovery at intermediate levels of the hierarchy. Notably, these recovery conditions are less restrictive compared to those existing for top-down algorithms. This shows that bottom-up algorithms extend the feasible region for achieving exact recovery at intermediate levels. Numerical experiments on both synthetic and real data sets confirm the superiority of bottom-up algorithms over top-down algorithms. We also observe that top-down algorithms can produce dendrograms with inversions. These findings contribute to a better understanding of hierarchical clustering techniques and their applications in network analysis.
CLJul 16, 2023
It's All Relative: Interpretable Models for Scoring Bias in DocumentsAswin Suresh, Chi-Hsuan Wu, Matthias Grossglauser
We propose an interpretable model to score the bias present in web documents, based only on their textual content. Our model incorporates assumptions reminiscent of the Bradley-Terry axioms and is trained on pairs of revisions of the same Wikipedia article, where one version is more biased than the other. While prior approaches based on absolute bias classification have struggled to obtain a high accuracy for the task, we are able to develop a useful model for scoring bias by learning to perform pairwise comparisons of bias accurately. We show that we can interpret the parameters of the trained model to discover the words most indicative of bias. We also apply our model in three different settings - studying the temporal evolution of bias in Wikipedia articles, comparing news sources based on bias, and scoring bias in law amendments. In each case, we demonstrate that the outputs of the model can be explained and validated, even for the two domains that are outside the training-data domain. We also use the model to compare the general level of bias between domains, where we see that legal texts are the least biased and news media are the most biased, with Wikipedia articles in between. Given its high performance, simplicity, interpretability, and wide applicability, we hope the model will be useful for a large community, including Wikipedia and news editors, political and social scientists, and the general public.
LGNov 15, 2023
Efficiently Escaping Saddle Points for Policy OptimizationSadegh Khorasani, Saber Salehkaleybar, Negar Kiyavash et al.
Policy gradient (PG) is widely used in reinforcement learning due to its scalability and good performance. In recent years, several variance-reduced PG methods have been proposed with a theoretical guarantee of converging to an approximate first-order stationary point (FOSP) with the sample complexity of $O(ε^{-3})$. However, FOSPs could be bad local optima or saddle points. Moreover, these algorithms often use importance sampling (IS) weights which could impair the statistical effectiveness of variance reduction. In this paper, we propose a variance-reduced second-order method that uses second-order information in the form of Hessian vector products (HVP) and converges to an approximate second-order stationary point (SOSP) with sample complexity of $\tilde{O}(ε^{-3})$. This rate improves the best-known sample complexity for achieving approximate SOSPs by a factor of $O(ε^{-0.5})$. Moreover, the proposed variance reduction technique bypasses IS weights by using HVP terms. Our experimental results show that the proposed algorithm outperforms the state of the art and is more robust to changes in random seeds.
IRJun 2, 2023
Fast Interactive Search with a Scale-Free Comparison OracleDaniyar Chumbalov, Lars Klein, Lucas Maystre et al.
A comparison-based search algorithm lets a user find a target item $t$ in a database by answering queries of the form, ``Which of items $i$ and $j$ is closer to $t$?'' Instead of formulating an explicit query (such as one or several keywords), the user navigates towards the target via a sequence of such (typically noisy) queries. We propose a scale-free probabilistic oracle model called $γ$-CKL for such similarity triplets $(i,j;t)$, which generalizes the CKL triplet model proposed in the literature. The generalization affords independent control over the discriminating power of the oracle and the dimension of the feature space containing the items. We develop a search algorithm with provably exponential rate of convergence under the $γ$-CKL oracle, thanks to a backtracking strategy that deals with the unavoidable errors in updating the belief region around the target. We evaluate the performance of the algorithm both over the posited oracle and over several real-world triplet datasets. We also report on a comprehensive user study, where human subjects navigate a database of face portraits.
CLSep 20, 2023
Studying Lobby Influence in the European ParliamentAswin Suresh, Lazar Radojevic, Francesco Salvi et al.
We present a method based on natural language processing (NLP), for studying the influence of interest groups (lobbies) in the law-making process in the European Parliament (EP). We collect and analyze novel datasets of lobbies' position papers and speeches made by members of the EP (MEPs). By comparing these texts on the basis of semantic similarity and entailment, we are able to discover interpretable links between MEPs and lobbies. In the absence of a ground-truth dataset of such links, we perform an indirect validation by comparing the discovered links with a dataset, which we curate, of retweet links between MEPs and lobbies, and with the publicly disclosed meetings of MEPs. Our best method achieves an AUC score of 0.77 and performs significantly better than several baselines. Moreover, an aggregate analysis of the discovered links, between groups of related lobbies and political groups of MEPs, correspond to the expectations from the ideology of the groups (e.g., center-left groups are associated with social causes). We believe that this work, which encompasses the methodology, datasets, and results, is a step towards enhancing the transparency of the intricate decision-making processes within democratic institutions.
AIMay 8
Inference Time Causal Probing in LLMsSadegh Khorasani, Saber Salehkaleybar, Negar Kiyavash et al.
Causal probing methods aim to test and control how internal representations influence the behavior of generative models. In causal probing, an intervention modifies hidden states so that a property takes on a different value. Most existing approaches define such interventions by training an auxiliary probe classifier, which ties the method to a specific task or model and risks misalignment with the model's predictive geometry. We propose Hidden-state Driven Margin Intervention (HDMI), a probe-free, gradient-based technique that directly steers hidden states using the model's native output. HDMI applies a margin objective that increases the probability of a target continuation while decreasing that of the source, without relying on probe classifiers. We further introduce a lookahead variant (LA-HDMI) for text editing that backpropagates through the softmax embeddings, modifying the current hidden state so that the likelihood of user-specified tokens increases in next token generations while preserving fluency. To evaluate interventions, we measure completeness (whether the targeted property changes as intended) and selectivity (whether unrelated properties are preserved), and report their harmonic mean as an overall measure of reliability. HDMI consistently achieves higher reliability than prior methods on the LGD agreement corpus and the CausalGym benchmark, across Meta-Llama-3-8B-Instruct, and Pythia-70M.
STFeb 23, 2024
Universal Lower Bounds and Optimal Rates: Achieving Minimax Clustering Error in Sub-Exponential Mixture ModelsMaximilien Dreveton, Alperen Gözeten, Matthias Grossglauser et al.
Clustering is a pivotal challenge in unsupervised machine learning and is often investigated through the lens of mixture models. The optimal error rate for recovering cluster labels in Gaussian and sub-Gaussian mixture models involves ad hoc signal-to-noise ratios. Simple iterative algorithms, such as Lloyd's algorithm, attain this optimal error rate. In this paper, we first establish a universal lower bound for the error rate in clustering any mixture model, expressed through a Chernoff divergence, a more versatile measure of model information than signal-to-noise ratios. We then demonstrate that iterative algorithms attain this lower bound in mixture models with sub-exponential tails, notably emphasizing location-scale mixtures featuring Laplace-distributed errors. Additionally, for datasets better modelled by Poisson or Negative Binomial mixtures, we study mixture models whose distributions belong to an exponential family. In such mixtures, we establish that Bregman hard clustering, a variant of Lloyd's algorithm employing a Bregman divergence, is rate optimal.
LGAug 15, 2025
Fusing Rewards and Preferences in Reinforcement LearningSadegh Khorasani, Saber Salehkaleybar, Negar Kiyavash et al.
We present Dual-Feedback Actor (DFA), a reinforcement learning algorithm that fuses both individual rewards and pairwise preferences (if available) into a single update rule. DFA uses the policy's log-probabilities directly to model the preference probability, avoiding a separate reward-modeling step. Preferences can be provided by human-annotators (at state-level or trajectory-level) or be synthesized online from Q-values stored in an off-policy replay buffer. Under a Bradley-Terry model, we prove that minimizing DFA's preference loss recovers the entropy-regularized Soft Actor-Critic (SAC) policy. Our simulation results show that DFA trained on generated preferences matches or exceeds SAC on six control environments and demonstrates a more stable training process. With only a semi-synthetic preference dataset under Bradley-Terry model, our algorithm outperforms reward-modeling reinforcement learning from human feedback (RLHF) baselines in a stochastic GridWorld and approaches the performance of an oracle with true rewards.
LGMay 23, 2024
Causal Effect Identification in a Sub-Population with Latent VariablesAmir Mohammad Abouei, Ehsan Mokhtarian, Negar Kiyavash et al.
The s-ID problem seeks to compute a causal effect in a specific sub-population from the observational data pertaining to the same sub population (Abouei et al., 2023). This problem has been addressed when all the variables in the system are observable. In this paper, we consider an extension of the s-ID problem that allows for the presence of latent variables. To tackle the challenges induced by the presence of latent variables in a sub-population, we first extend the classical relevant graphical definitions, such as c-components and Hedges, initially defined for the so-called ID problem (Pearl, 1995; Tian & Pearl, 2002), to their new counterparts. Subsequently, we propose a sound algorithm for the s-ID problem with latent variables.
LGFeb 27, 2025
Recommendations with Sparse Comparison Data: Provably Fast Convergence for Nonconvex Matrix FactorizationSuryanarayana Sankagiri, Jalal Etesami, Matthias Grossglauser
This paper provides a theoretical analysis of a new learning problem for recommender systems where users provide feedback by comparing pairs of items instead of rating them individually. We assume that comparisons stem from latent user and item features, which reduces the task of predicting preferences to learning these features from comparison data. Similar to the classical matrix factorization problem, the main challenge in this learning task is that the resulting loss function is nonconvex. Our analysis shows that the loss function exhibits (restricted) strong convexity near the true solution, which ensures gradient-based methods converge exponentially, given an appropriate warm start. Importantly, this result holds in a sparse data regime, where each user compares only a few pairs of items. Our main technical contribution is to extend certain concentration inequalities commonly used in matrix completion to our model. Our work demonstrates that learning personalized recommendations from comparison data is computationally and statistically efficient.
LGNov 22, 2025
Hierarchical Linkage Clustering Beyond Binary Trees and UltrametricsMaximilien Dreveton, Matthias Grossglauser, Daichi Kuroda et al.
Hierarchical clustering seeks to uncover nested structures in data by constructing a tree of clusters, where deeper levels reveal finer-grained relationships. Traditional methods, including linkage approaches, face three major limitations: (i) they always return a hierarchy, even if none exists, (ii) they are restricted to binary trees, even if the true hierarchy is non-binary, and (iii) they are highly sensitive to the choice of linkage function. In this paper, we address these issues by introducing the notion of a valid hierarchy and defining a partial order over the set of valid hierarchies. We prove the existence of a finest valid hierarchy, that is, the hierarchy that encodes the maximum information consistent with the similarity structure of the data set. In particular, the finest valid hierarchy is not constrained to binary structures and, when no hierarchical relationships exist, collapses to a star tree. We propose a simple two-step algorithm that first constructs a binary tree via a linkage method and then prunes it to enforce validity. We establish necessary and sufficient conditions on the linkage function under which this procedure exactly recovers the finest valid hierarchy, and we show that all linkage functions satisfying these conditions yield the same hierarchy after pruning. Notably, classical linkage rules such as single, complete, and average satisfy these conditions, whereas Ward's linkage fails to do so.
LGOct 24, 2025
Optimal Graph Clustering without Edge Density SignalsMaximilien Dreveton, Elaine Siyu Liu, Matthias Grossglauser et al.
This paper establishes the theoretical limits of graph clustering under the Popularity-Adjusted Block Model (PABM), addressing limitations of existing models. In contrast to the Stochastic Block Model (SBM), which assumes uniform vertex degrees, and to the Degree-Corrected Block Model (DCBM), which applies uniform degree corrections across clusters, PABM introduces separate popularity parameters for intra- and inter-cluster connections. Our main contribution is the characterization of the optimal error rate for clustering under PABM, which provides novel insights on clustering hardness: we demonstrate that unlike SBM and DCBM, cluster recovery remains possible in PABM even when traditional edge-density signals vanish, provided intra- and inter-cluster popularity coefficients differ. This highlights a dimension of degree heterogeneity captured by PABM but overlooked by DCBM: local differences in connectivity patterns can enhance cluster separability independently of global edge densities. Finally, because PABM exhibits a richer structure, its expected adjacency matrix has rank between $k$ and $k^2$, where $k$ is the number of clusters. As a result, spectral embeddings based on the top $k$ eigenvectors may fail to capture important structural information. Our numerical experiments on both synthetic and real datasets confirm that spectral clustering algorithms incorporating $k^2$ eigenvectors outperform traditional spectral approaches.
IROct 2, 2025
Ranking Items from Discrete Ratings: The Cost of Unknown User ThresholdsOscar Villemaud, Suryanarayana Sankagiri, Matthias Grossglauser
Ranking items is a central task in many information retrieval and recommender systems. User input for the ranking task often comes in the form of ratings on a coarse discrete scale. We ask whether it is possible to recover a fine-grained item ranking from such coarse-grained ratings. We model items as having scores and users as having thresholds; a user rates an item positively if the item's score exceeds the user's threshold. Although all users agree on the total item order, estimating that order is challenging when both the scores and the thresholds are latent. Under our model, any ranking method naturally partitions the $n$ items into bins; the bins are ordered, but the items inside each bin are still unordered. Users arrive sequentially, and every new user can be queried to refine the current ranking. We prove that achieving a near-perfect ranking, measured by Spearman distance, requires $Θ(n^2)$ users (and therefore $Ω(n^2)$ queries). This is significantly worse than the $O(n\log n)$ queries needed to rank from comparisons; the gap reflects the additional queries needed to identify the users who have the appropriate thresholds. Our bound also quantifies the impact of a mismatch between score and threshold distributions via a quadratic divergence factor. To show the tightness of our results, we provide a ranking algorithm whose query complexity matches our bound up to a logarithmic factor. Our work reveals a tension in online ranking: diversity in thresholds is necessary to merge coarse ratings from many users into a fine-grained ranking, but this diversity has a cost if the thresholds are a priori unknown.
LGAug 26, 2025
Recycling History: Efficient Recommendations from Contextual Dueling BanditsSuryanarayana Sankagiri, Jalal Etesami, Pouria Fatemi et al.
The contextual duelling bandit problem models adaptive recommender systems, where the algorithm presents a set of items to the user, and the user's choice reveals their preference. This setup is well suited for implicit choices users make when navigating a content platform, but does not capture other possible comparison queries. Motivated by the fact that users provide more reliable feedback after consuming items, we propose a new bandit model that can be described as follows. The algorithm recommends one item per time step; after consuming that item, the user is asked to compare it with another item chosen from the user's consumption history. Importantly, in our model, this comparison item can be chosen without incurring any additional regret, potentially leading to better performance. However, the regret analysis is challenging because of the temporal dependency in the user's history. To overcome this challenge, we first show that the algorithm can construct informative queries provided the history is rich, i.e., satisfies a certain diversity condition. We then show that a short initial random exploration phase is sufficient for the algorithm to accumulate a rich history with high probability. This result, proven via matrix concentration bounds, yields $O(\sqrt{T})$ regret guarantees. Additionally, our simulations show that reusing past items for comparisons can lead to significantly lower regret than only comparing between simultaneously recommended items.
LGAug 20, 2025
Measuring IIA Violations in Similarity Choices with Bayesian ModelsHugo Sales Corrêa, Suryanarayana Sankagiri, Daniel Ratton Figueiredo et al.
Similarity choice data occur when humans make choices among alternatives based on their similarity to a target, e.g., in the context of information retrieval and in embedding learning settings. Classical metric-based models of similarity choice assume independence of irrelevant alternatives (IIA), a property that allows for a simpler formulation. While IIA violations have been detected in many discrete choice settings, the similarity choice setting has received scant attention. This is because the target-dependent nature of the choice complicates IIA testing. We propose two statistical methods to test for IIA: a classical goodness-of-fit test and a Bayesian counterpart based on the framework of Posterior Predictive Checks (PPC). This Bayesian approach, our main technical contribution, quantifies the degree of IIA violation beyond its mere significance. We curate two datasets: one with choice sets designed to elicit IIA violations, and another with randomly generated choice sets from the same item universe. Our tests confirmed significant IIA violations on both datasets, and notably, we find a comparable degree of violation between them. Further, we devise a new PPC test for population homogeneity. Results show that the population is indeed homogenous, suggesting that the IIA violations are driven by context effects -- specifically, interactions within the choice sets. These results highlight the need for new similarity choice models that account for such context effects.
LGJul 6, 2025
Hierarchical Reinforcement Learning with Targeted Causal InterventionsSadegh Khorasani, Saber Salehkaleybar, Negar Kiyavash et al.
Hierarchical reinforcement learning (HRL) improves the efficiency of long-horizon reinforcement-learning tasks with sparse rewards by decomposing the task into a hierarchy of subgoals. The main challenge of HRL is efficient discovery of the hierarchical structure among subgoals and utilizing this structure to achieve the final goal. We address this challenge by modeling the subgoal structure as a causal graph and propose a causal discovery algorithm to learn it. Additionally, rather than intervening on the subgoals at random during exploration, we harness the discovered causal model to prioritize subgoal interventions based on their importance in attaining the final goal. These targeted interventions result in a significantly more efficient policy in terms of the training cost. Unlike previous work on causal HRL, which lacked theoretical analysis, we provide a formal analysis of the problem. Specifically, for tree structures and, for a variant of Erdős-Rényi random graphs, our approach results in remarkable improvements. Our experimental results on HRL tasks also illustrate that our proposed framework outperforms existing work in terms of training cost.
SIJun 6, 2024
Why the Metric Backbone Preserves Community StructureMaximilien Dreveton, Charbel Chucri, Matthias Grossglauser et al.
The metric backbone of a weighted graph is the union of all-pairs shortest paths. It is obtained by removing all edges $(u,v)$ that are not the shortest path between $u$ and $v$. In networks with well-separated communities, the metric backbone tends to preserve many inter-community edges, because these edges serve as bridges connecting two communities, but tends to delete many intra-community edges because the communities are dense. This suggests that the metric backbone would dilute or destroy the community structure of the network. However, this is not borne out by prior empirical work, which instead showed that the metric backbone of real networks preserves the community structure of the original network well. In this work, we analyze the metric backbone of a broad class of weighted random graphs with communities, and we formally prove the robustness of the community structure with respect to the deletion of all the edges that are not in the metric backbone. An empirical comparison of several graph sparsification techniques confirms our theoretical finding and shows that the metric backbone is an efficient sparsifier in the presence of communities.
LGJun 19, 2020
Self-Supervised Prototypical Transfer Learning for Few-Shot ClassificationCarlos Medina, Arnout Devos, Matthias Grossglauser
Most approaches in few-shot learning rely on costly annotated data related to the goal task domain during (pre-)training. Recently, unsupervised meta-learning methods have exchanged the annotation requirement for a reduction in few-shot classification performance. Simultaneously, in settings with realistic domain shift, common transfer learning has been shown to outperform supervised meta-learning. Building on these insights and on advances in self-supervised learning, we propose a transfer learning approach which constructs a metric embedding that clusters unlabeled prototypical samples and their augmentations closely together. This pre-trained embedding is a starting point for few-shot classification by summarizing class clusters and fine-tuning. We demonstrate that our self-supervised prototypical transfer learning approach ProtoTransfer outperforms state-of-the-art unsupervised meta-learning methods on few-shot tasks from the mini-ImageNet dataset. In few-shot experiments with domain shift, our approach even has comparable performance to supervised methods, but requires orders of magnitude fewer labels.
MLNov 26, 2019
A User Study of Perceived Carbon FootprintVictor Kristof, Valentin Quelquejay-Leclère, Robin Zbinden et al.
We propose a statistical model to understand people's perception of their carbon footprint. Driven by the observation that few people think of CO2 impact in absolute terms, we design a system to probe people's perception from simple pairwise comparisons of the relative carbon footprint of their actions. The formulation of the model enables us to take an active-learning approach to selecting the pairs of actions that are maximally informative about the model parameters. We define a set of 18 actions and collect a dataset of 2183 comparisons from 176 users on a university campus. The early results reveal promising directions to improve climate communication and enhance climate mitigation.
LGNov 1, 2019
Learning Hawkes Processes from a Handful of EventsFarnood Salehi, William Trouleau, Matthias Grossglauser et al.
Learning the causal-interaction network of multivariate Hawkes processes is a useful task in many applications. Maximum-likelihood estimation is the most common approach to solve the problem in the presence of long observation sequences. However, when only short sequences are available, the lack of data amplifies the risk of overfitting and regularization becomes critical. Due to the challenges of hyper-parameter tuning, state-of-the-art methods only parameterize regularizers by a single shared hyper-parameter, hence limiting the power of representation of the model. To solve both issues, we develop in this work an efficient algorithm based on variational expectation-maximization. Our approach is able to optimize over an extended set of hyper-parameters. It is also able to take into account the uncertainty in the model parameters by learning a posterior distribution over them. Experimental results on both synthetic and real datasets show that our approach significantly outperforms state-of-the-art methods under short observation sequences.
LGMay 31, 2019
Regression Networks for Meta-Learning Few-Shot ClassificationArnout Devos, Matthias Grossglauser
We propose regression networks for the problem of few-shot classification, where a classifier must generalize to new classes not seen in the training set, given only a small number of examples of each class. In high dimensional embedding spaces the direction of data generally contains richer information than magnitude. Next to this, state-of-the-art few-shot metric methods that compare distances with aggregated class representations, have shown superior performance. Combining these two insights, we propose to meta-learn classification of embedded points by regressing the closest approximation in every class subspace while using the regression error as a distance metric. Similarly to recent approaches for few-shot learning, regression networks reflect a simple inductive bias that is beneficial in this limited-data regime and they achieve excellent results, especially when more aggregate class representations can be formed with multiple shots.
MLMay 13, 2019
Scalable and Efficient Comparison-based Search without FeaturesDaniyar Chumbalov, Lucas Maystre, Matthias Grossglauser
We consider the problem of finding a target object $t$ using pairwise comparisons, by asking an oracle questions of the form \emph{"Which object from the pair $(i,j)$ is more similar to $t$?"}. Objects live in a space of latent features, from which the oracle generates noisy answers. First, we consider the {\em non-blind} setting where these features are accessible. We propose a new Bayesian comparison-based search algorithm with noisy answers; it has low computational complexity yet is efficient in the number of queries. We provide theoretical guarantees, deriving the form of the optimal query and proving almost sure convergence to the target $t$. Second, we consider the \emph{blind} setting, where the object features are hidden from the search algorithm. In this setting, we combine our search method and a new distributional triplet embedding algorithm into one scalable learning framework called \textsc{Learn2Search}. We show that the query complexity of our approach on two real-world datasets is on par with the non-blind setting, which is not achievable using any of the current state-of-the-art embedding methods. Finally, we demonstrate the efficacy of our framework by conducting an experiment with users searching for movie actors.
MLMar 18, 2019
Pairwise Comparisons with Flexible Time-DynamicsLucas Maystre, Victor Kristof, Matthias Grossglauser
Inspired by applications in sports where the skill of players or teams competing against each other varies over time, we propose a probabilistic model of pairwise-comparison outcomes that can capture a wide range of time dynamics. We achieve this by replacing the static parameters of a class of popular pairwise-comparison models by continuous-time Gaussian processes; the covariance function of these processes enables expressive dynamics. We develop an efficient inference algorithm that computes an approximate Bayesian posterior distribution. Despite the flexbility of our model, our inference algorithm requires only a few linear-time iterations over the data and can take advantage of modern multiprocessor computer architectures. We apply our model to several historical databases of sports outcomes and find that our approach outperforms competing approaches in terms of predictive performance, scales to millions of observations, and generates compelling visualizations that help in understanding and interpreting the data.
DSApr 25, 2018
Analysis of a Canonical Labeling Algorithm for the Alignment of Correlated Erdős-Rényi GraphsOsman Emre Dai, Daniel Cullina, Negar Kiyavash et al.
Graph alignment in two correlated random graphs refers to the task of identifying the correspondence between vertex sets of the graphs. Recent results have characterized the exact information-theoretic threshold for graph alignment in correlated Erdős-Rényi graphs. However, very little is known about the existence of efficient algorithms to achieve graph alignment without seeds. In this work we identify a region in which a straightforward $O(n^{11/5} \log n )$-time canonical labeling algorithm, initially introduced in the context of graph isomorphism, succeeds in aligning correlated Erdős-Rényi graphs. The algorithm has two steps. In the first step, all vertices are labeled by their degrees and a trivial minimum distance alignment (i.e., sorting vertices according to their degrees) matches a fixed number of highest degree vertices in the two graphs. Having identified this subset of vertices, the remaining vertices are matched using a alignment algorithm for bipartite graphs.
APJan 12, 2018
Can Who-Edits-What Predict Edit Survival?Ali Batuhan Yardım, Victor Kristof, Lucas Maystre et al.
As the number of contributors to online peer-production systems grows, it becomes increasingly important to predict whether the edits that users make will eventually be beneficial to the project. Existing solutions either rely on a user reputation system or consist of a highly specialized predictor that is tailored to a specific peer-production system. In this work, we explore a different point in the solution space that goes beyond user reputation but does not involve any content-based feature of the edits. We view each edit as a game between the editor and the component of the project. We posit that the probability that an edit is accepted is a function of the editor's skill, of the difficulty of editing the component and of a user-component interaction term. Our model is broadly applicable, as it only requires observing data about who makes an edit, what the edit affects and whether the edit survives or not. We apply our model on Wikipedia and the Linux kernel, two examples of large-scale peer-production systems, and we seek to understand whether it can effectively predict edit survival: in both cases, we provide a positive answer. Our approach significantly outperforms those based solely on user reputation and bridges the gap with specialized predictors that use content-based features. It is simple to implement, computationally inexpensive, and in addition it enables us to discover interesting structure in the data.
MLOct 20, 2016
ChoiceRank: Identifying Preferences from Node Traffic in NetworksLucas Maystre, Matthias Grossglauser
Understanding how users navigate in a network is of high interest in many applications. We consider a setting where only aggregate node-level traffic is observed and tackle the task of learning edge transition probabilities. We cast it as a preference learning problem, and we study a model where choices follow Luce's axiom. In this case, the $O(n)$ marginal counts of node visits are a sufficient statistic for the $O(n^2)$ transition probabilities. We show how to make the inference problem well-posed regardless of the network's structure, and we present ChoiceRank, an iterative algorithm that scales to networks that contains billions of nodes and edges. We apply the model to two clickstream datasets and show that it successfully recovers the transition probabilities using only the network structure and marginal (node-level) traffic data. Finally, we also consider an application to mobility networks and apply the model to one year of rides on New York City's bicycle-sharing system.
LGSep 5, 2016
The Player Kernel: Learning Team Strengths Based on Implicit Player ContributionsLucas Maystre, Victor Kristof, Antonio J. González Ferrer et al.
In this work, we draw attention to a connection between skill-based models of game outcomes and Gaussian process classification models. The Gaussian process perspective enables a) a principled way of dealing with uncertainty and b) rich models, specified through kernel functions. Using this connection, we tackle the problem of predicting outcomes of football matches between national teams. We develop a player kernel that relates any two football matches through the players lined up on the field. This makes it possible to share knowledge gained from observing matches between clubs (available in large quantities) and matches between national teams (available only in limited quantities). We evaluate our approach on the Euro 2008, 2012 and 2016 final tournaments.
MLFeb 19, 2015
Just Sort It! A Simple and Effective Approach to Active Preference LearningLucas Maystre, Matthias Grossglauser
We address the problem of learning a ranking by using adaptively chosen pairwise comparisons. Our goal is to recover the ranking accurately but to sample the comparisons sparingly. If all comparison outcomes are consistent with the ranking, the optimal solution is to use an efficient sorting algorithm, such as Quicksort. But how do sorting algorithms behave if some comparison outcomes are inconsistent with the ranking? We give favorable guarantees for Quicksort for the popular Bradley-Terry model, under natural assumptions on the parameters. Furthermore, we empirically demonstrate that sorting algorithms lead to a very simple and effective active learning strategy: repeatedly sort the items. This strategy performs as well as state-of-the-art methods (and much better than random sampling) at a minuscule fraction of the computational cost.