IR Dec 19, 2017
Large-Scale Vandalism Detection with Linear Classifiers - The Conkerberry Vandalism Detector at WSDM Cup 2017
Alexey Grigorev
Many artificial intelligence systems rely on knowledge bases to enrich the information they process. Such knowledge bases are usually difficult to obtain, so they are crowdsourced: anyone on the internet can suggest edits and add new information. Unfortunately, they are sometimes targeted by vandals who insert inaccurate or offensive information. This is especially harmful for the systems that use these knowledge bases, because they need reliable information to make correct inferences. One such knowledge base is Wikidata, and to fight vandalism the organizers of WSDM Cup 2017 challenged participants to build a model for detecting untrustworthy edits. In this paper we present the second-place solution to the cup: we show that it is possible to achieve competitive performance with simple linear classification. Our approach achieves an area under the ROC curve (ROC AUC) of 0.938 on the test data. Additionally, it is significantly faster than other approaches. The solution is available on GitHub.
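To make the reported metric concrete, here is a minimal sketch of how ROC AUC can be computed from its ranking definition: the probability that a randomly chosen positive example (a vandalism edit) receives a higher score than a randomly chosen negative one, with ties counted as half. The function name `roc_auc` and the toy labels and scores are illustrative assumptions and are not taken from the paper or its GitHub repository.

```python
def roc_auc(labels, scores):
    """ROC AUC via pairwise comparison of positive and negative scores.

    labels: iterable of 0/1 ground-truth values (1 = vandalism, by assumption).
    scores: iterable of classifier scores, higher means "more likely positive".
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC AUC needs at least one positive and one negative")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data: two vandalism edits, three regular edits.
toy_labels = [1, 0, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.35, 0.2, 0.1]
toy_auc = roc_auc(toy_labels, toy_scores)
```

On this toy data, five of the six positive-negative pairs are ordered correctly, so `toy_auc` equals 5/6. The quadratic pairwise loop is chosen for readability; on large data a sort-based computation would be the usual choice.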
IR Oct 1, 2017
Identifying Clickbait Posts on Social Media with an Ensemble of Linear Models
Alexey Grigorev
The purpose of clickbait is to make a link so appealing that people click on it. However, the content of such articles is often unrelated to the title, is of poor quality, and in the end leaves the reader unsatisfied. To help readers, the organizers of the clickbait challenge (http://www.clickbait-challenge.org/) asked participants to build a machine learning model that scores articles by their "clickbaitness". In this paper we propose to solve the clickbait problem with an ensemble of linear SVM models. The approach was tested successfully in the challenge: it achieved an MSE of 0.036 and ranked 3rd among all solutions submitted to the contest.
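As a rough illustration of the two ideas named above, the sketch below combines the scores of several models by simple averaging and evaluates the result with mean squared error (MSE). Averaging is an assumption made for this example; the abstract does not say how the ensemble members are combined. The model outputs and target scores are invented toy values, not results from the challenge.

```python
def mean_squared_error(y_true, y_pred):
    """Average of squared differences between targets and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("inputs must have the same length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def average_ensemble(predictions_per_model):
    """Combine models by averaging their scores for each example.

    predictions_per_model: list of equally long score lists, one per model.
    """
    n_models = len(predictions_per_model)
    return [sum(scores) / n_models for scores in zip(*predictions_per_model)]


# Hypothetical "clickbaitness" scores in [0, 1] for three posts.
truth = [0.2, 0.7, 0.1]
model_a = [0.1, 0.9, 0.3]   # stands in for one linear model's output
model_b = [0.3, 0.5, 0.1]   # stands in for another linear model's output

ensemble = average_ensemble([model_a, model_b])
mse_a = mean_squared_error(truth, model_a)
mse_b = mean_squared_error(truth, model_b)
mse_ensemble = mean_squared_error(truth, ensemble)
```

In this toy case the two models err in opposite directions, so the averaged scores have a lower MSE than either model alone. That outcome depends on the chosen numbers and is not guaranteed in general.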
IR Jan 13, 2016
Identifier Namespaces in Mathematical Notation
Alexey Grigorev
In this thesis, we look at the problem of assigning each identifier of a document to a namespace. At the moment, no dataset exists in which all identifiers are grouped into namespaces, and therefore we need to create such a dataset ourselves. To do that, we need to find groups of documents that use identifiers in the same way. This can be done with cluster analysis methods. We argue that documents can be represented by the identifiers they contain, and this approach is similar to representing textual information in the Vector Space Model. Because of this, we can apply traditional document clustering techniques for namespace discovery. Because the problem is new, there is no gold standard dataset, and it is hard to evaluate the performance of our method. To overcome this, we first use Java source code as a dataset for our experiments, since it contains the namespace information. We verify that our method can partially recover namespaces from source code using only information about identifiers. The algorithms are then evaluated on the English Wikipedia, where the proposed method extracts namespaces on a variety of topics. After extraction, the namespaces are organized into a hierarchical structure by using existing classification schemes such as MSC, PACS and ACM. We also apply the method to the Russian Wikipedia, and the results are consistent across the two languages. To our knowledge, the problem of introducing namespaces to mathematics has not been studied before, and prior to our work there has been no dataset where identifiers are grouped into namespaces. Thus, our result is not only a good start, but also a good indicator that automatic namespace discovery is possible.
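The idea of representing documents by their identifiers can be sketched as follows: each document becomes a bag of identifier counts, and cosine similarity compares two such vectors, as in the Vector Space Model for text. This is a minimal sketch under stated assumptions: raw counts are used as weights, and the document names and identifier lists are hypothetical. The thesis may use a different weighting scheme and clustering algorithm.

```python
import math
from collections import Counter


def identifier_vector(identifiers):
    """Bag-of-identifiers vector: identifier -> number of occurrences."""
    return Counter(identifiers)


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical documents, each reduced to the identifiers it uses.
docs = {
    "mechanics_1": ["E", "m", "c", "m", "v"],
    "mechanics_2": ["E", "m", "v", "p"],
    "statistics_1": ["mu", "sigma", "X", "E"],
}
vectors = {name: identifier_vector(ids) for name, ids in docs.items()}

sim_same_topic = cosine_similarity(vectors["mechanics_1"], vectors["mechanics_2"])
sim_other_topic = cosine_similarity(vectors["mechanics_1"], vectors["statistics_1"])
```

Here the two mechanics documents share most of their identifiers and score higher than the mechanics and statistics pair, which share only `E`. A clustering algorithm run on such similarities would tend to group the first two together, which is the intuition behind treating clusters as candidate namespaces.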