CLNov 1, 2019
Sequence Modeling with Unconstrained Generation OrderDmitrii Emelianenko, Elena Voita, Pavel Serdyukov
The dominant approach to sequence generation is to produce a sequence in some predefined order, e.g. left to right. In contrast, we propose a more general model that can generate the output sequence by inserting tokens in any arbitrary order. Our model learns decoding order as a result of its training procedure. Our experiments show that this model is superior to fixed order models on a number of sequence generation tasks, such as Machine Translation, Image-to-LaTeX and Image Captioning.
HCJun 20, 2019
Latent Distribution Assumption for Unbiased and Consistent Consensus ModellingValentina Fedorova, Gleb Gusev, Pavel Serdyukov
We study the problem of aggregation noisy labels. Usually, it is solved by proposing a stochastic model for the process of generating noisy labels and then estimating the model parameters using the observed noisy labels. A traditional assumption underlying previously introduced generative models is that each object has one latent true label. In contrast, we introduce a novel latent distribution assumption, implying that a unique true label for an object might not exist, but rather each object might have a specific distribution generating a latent subjective label each time the object is observed. Our experiments showed that the novel assumption is more suitable for difficult tasks, when there is an ambiguity in choosing a "true" label for certain objects.
HCSep 3, 2018
Online Evaluation for Effective Web Service DevelopmentRoman Budylin, Alexey Drutsa, Gleb Gusev et al.
Development of the majority of the leading web services and software products today is generally guided by data-driven decisions based on evaluation that ensures a steady stream of updates, both in terms of quality and quantity. Large internet companies use online evaluation on a day-to-day basis and at a large scale. The number of smaller companies using A/B testing in their development cycle is also growing. Web development across the board strongly depends on quality of experimentation platforms. In this tutorial, we overview state-of-the-art methods underlying everyday evaluation pipelines at some of the leading Internet companies. Software engineers, designers, analysts, service or product managers --- beginners, advanced specialists, and researchers --- can learn how to make web service development data-driven and do it effectively.
CLMay 25, 2018
Context-Aware Neural Machine Translation Learns Anaphora ResolutionElena Voita, Pavel Serdyukov, Rico Sennrich et al.
Standard machine translation systems process sentences in isolation and hence ignore extra-sentential information, even though extended context can both prevent mistakes in ambiguous cases and improve translation coherence. We introduce a context-aware neural machine translation model designed in such way that the flow of information from the extended context to the translation model can be controlled and analyzed. We experiment with an English-Russian subtitles dataset, and observe that much of what is captured by our model deals with improving pronoun translation. We measure correspondences between induced attention distributions and coreference relations and observe that the model implicitly captures anaphora. It is consistent with gains for sentences where pronouns need to be gendered in translation. Beside improvements in anaphoric cases, the model also improves in overall BLEU, both over its context-agnostic version (+0.7) and over simple concatenation of the context and source sentences (+0.6).
LGFeb 19, 2018
Finding Influential Training Samples for Gradient Boosted Decision TreesBoris Sharchilev, Yury Ustinovsky, Pavel Serdyukov et al.
We address the problem of finding influential training samples for a particular case of tree ensemble-based models, e.g., Random Forest (RF) or Gradient Boosted Decision Trees (GBDT). A natural way of formalizing this problem is studying how the model's predictions change upon leave-one-out retraining, leaving out each individual training sample. Recent work has shown that, for parametric models, this analysis can be conducted in a computationally efficient way. We propose several ways of extending this framework to non-parametric GBDT ensembles under the assumption that tree structures remain fixed. Furthermore, we introduce a general scheme of obtaining further approximations to our method that balance the trade-off between performance and computational complexity. We evaluate our approaches on various experimental setups and use-case scenarios and demonstrate both the quality of our approach to finding influential training samples in comparison to the baselines and its computational efficiency.
CLApr 26, 2017
Riemannian Optimization for Skip-Gram Negative SamplingAlexander Fonarev, Oleksii Hrinchuk, Gleb Gusev et al.
Skip-Gram Negative Sampling (SGNS) word embedding model, well known by its implementation in "word2vec" software, is usually optimized by stochastic gradient descent. However, the optimization of SGNS objective can be viewed as a problem of searching for a good matrix with the low-rank constraint. The most standard way to solve this type of problems is to apply Riemannian optimization framework to optimize the SGNS objective over the manifold of required low-rank matrices. In this paper, we propose an algorithm that optimizes SGNS objective using Riemannian optimization and demonstrates its superiority over popular competitors, such as the original method to train SGNS and SVD over SPPMI matrix.
IRDec 13, 2016
User Model-Based Intent-Aware Metrics for Multilingual Search EvaluationAlexey Drutsa, Andrey Shutovich, Philipp Pushnyakov et al.
Despite the growing importance of multilingual aspect of web search, no appropriate offline metrics to evaluate its quality are proposed so far. At the same time, personal language preferences can be regarded as intents of a query. This approach translates the multilingual search problem into a particular task of search diversification. Furthermore, the standard intent-aware approach could be adopted to build a diversified metric for multilingual search on the basis of a classical IR metric such as ERR. The intent-aware approach estimates user satisfaction under a user behavior model. We show however that the underlying user behavior models is not realistic in the multilingual case, and the produced intent-aware metric do not appropriately estimate the user satisfaction. We develop a novel approach to build intent-aware user behavior models, which overcome these limitations and convert to quality metrics that better correlate with standard online metrics of user satisfaction.
IRNov 28, 2016
Prediction of Video Popularity in the Absence of Reliable Data from Video Hosting Services: Utility of Traces Left by Users on the WebAlexey Drutsa, Gleb Gusev, Pavel Serdyukov
With the growth of user-generated content, we observe the constant rise of the number of companies, such as search engines, content aggregators, etc., that operate with tremendous amounts of web content not being the services hosting it. Thus, aiming to locate the most important content and promote it to the users, they face the need of estimating the current and predicting the future content popularity. In this paper, we approach the problem of video popularity prediction not from the side of a video hosting service, as done in all previous studies, but from the side of an operating company, which provides a popular video search service that aggregates content from different video hosting websites. We investigate video popularity prediction based on features from three primary sources available for a typical operating company: first, the content hosting provider may deliver its data via its API, second, the operating company makes use of its own search and browsing logs, third, the company crawls information about embeds of a video and links to a video page from publicly available resources on the Web. We show that video popularity prediction based on the embed and link data coupled with the internal search and browsing data significantly improves video popularity prediction based only on the data provided by the video hosting and can even adequately replace the API data in the cases when it is partly or completely unavailable.
IROct 16, 2016
Efficient Rectangular Maximal-Volume Algorithm for Rating Elicitation in Collaborative FilteringAlexander Fonarev, Alexander Mikhalev, Pavel Serdyukov et al.
Cold start problem in Collaborative Filtering can be solved by asking new users to rate a small seed set of representative items or by asking representative users to rate a new item. The question is how to build a seed set that can give enough preference information for making good recommendations. One of the most successful approaches, called Representative Based Matrix Factorization, is based on Maxvol algorithm. Unfortunately, this approach has one important limitation --- a seed set of a particular size requires a rating matrix factorization of fixed rank that should coincide with that size. This is not necessarily optimal in the general case. In the current paper, we introduce a fast algorithm for an analytical generalization of this approach that we call Rectangular Maxvol. It allows the rank of factorization to be lower than the required size of the seed set. Moreover, the paper includes the theoretical analysis of the method's error, the complexity analysis of the existing methods and the comparison to the state-of-the-art approaches.
IRDec 5, 2013
Intent Models for Contextualising and Diversifying Query SuggestionsEugene Kharitonov, Craig Macdonald, Pavel Serdyukov et al.
The query suggestion or auto-completion mechanisms help users to type less while interacting with a search engine. A basic approach that ranks suggestions according to their frequency in the query logs is suboptimal. Firstly, many candidate queries with the same prefix can be removed as redundant. Secondly, the suggestions can also be personalised based on the user's context. These two directions to improve the aforementioned mechanisms' quality can be in opposition: while the latter aims to promote suggestions that address search intents that a user is likely to have, the former aims to diversify the suggestions to cover as many intents as possible. We introduce a contextualisation framework that utilises a short-term context using the user's behaviour within the current search session, such as the previous query, the documents examined, and the candidate query suggestions that the user has discarded. This short-term context is used to contextualise and diversify the ranking of query suggestions, by modelling the user's information need as a mixture of intent-specific user models. The evaluation is performed offline on a set of approximately 1.0M test user sessions. Our results suggest that the proposed approach significantly improves query suggestions compared to the baseline approach.
IRJul 23, 2013
Timely crawling of high-quality ephemeral new contentDamien Lefortier, Liudmila Ostroumova, Egor Samosvat et al.
Nowadays, more and more people use the Web as their primary source of up-to-date information. In this context, fast crawling and indexing of newly created Web pages has become crucial for search engines, especially because user traffic to a significant fraction of these new pages (like news, blog and forum posts) grows really quickly right after they appear, but lasts only for several days. In this paper, we study the problem of timely finding and crawling of such ephemeral new pages (in terms of user interest). Traditional crawling policies do not give any particular priority to such pages and may thus crawl them not quickly enough, and even crawl already obsolete content. We thus propose a new metric, well thought out for this task, which takes into account the decrease of user interest for ephemeral pages over time. We show that most ephemeral new pages can be found at a relatively small set of content sources and present a procedure for finding such a set. Our idea is to periodically recrawl content sources and crawl newly created pages linked from them, focusing on high-quality (in terms of user interest) content. One of the main difficulties here is to divide resources between these two activities in an efficient way. We find the adaptive balance between crawls and recrawls by maximizing the proposed metric. Further, we incorporate search engine click logs to give our crawler an insight about the current user demands. Efficiency of our approach is finally demonstrated experimentally on real-world data.
SIAug 11, 2012
Empirical Validation of the Buckley--Osthus Model for the Web Host Graph: Degree and Edge DistributionsMaxim Zhukovskiy, Dmitry Vinogradov, Yuri Pritykin et al.
There has been a lot of research on random graph models for large real-world networks such as those formed by hyperlinks between web pages in the world wide web. Though largely successful qualitatively in capturing their key properties, such models may lack important quantitative characteristics of Internet graphs. While preferential attachment random graph models were shown to be capable of reflecting the degree distribution of the webgraph, their ability to reflect certain aspects of the edge distribution was not yet well studied. In this paper, we consider the Buckley--Osthus implementation of preferential attachment and its ability to model the web host graph in two aspects. One is the degree distribution that we observe to follow the power law, as often being the case for real-world graphs. Another one is the two-dimensional edge distribution, the number of edges between vertices of given degrees. We fit a single "initial attractiveness" parameter $a$ of the model, first with respect to the degree distribution of the web host graph, and then, absolutely independently, with respect to the edge distribution. Surprisingly, the values of $a$ we obtain turn out to be nearly the same. Therefore the same model with the same value of the parameter $a$ fits very well the two independent and basic aspects of the web host graph. In addition, we demonstrate that other models completely lack the asymptotic behavior of the edge distribution of the web host graph, even when accurately capturing the degree distribution. To the best of our knowledge, this is the first attempt for a real graph of Internet to describe the distribution of edges between vertices with respect to their degrees.