DL · Jan 28, 2017
How to Search the Internet Archive Without Indexing It
Nattiya Kanhabua, Philipp Kemkes, Wolfgang Nejdl et al.
Significant parts of our cultural heritage have been produced on the web during the last decades. While easy accessibility to the current web is a good baseline, optimal access to the past web faces several challenges. These include dealing with large-scale web archive collections and the lack of usage logs containing the implicit human feedback that is most relevant for today's web search. In this paper, we propose an entity-oriented search system to support retrieval and analytics on the Internet Archive. We use Bing to retrieve a ranked list of results from the current web. In addition, we link the retrieved results to the Wayback Machine, thus allowing keyword search on the Internet Archive without processing and indexing its raw archived content. Our search system complements existing web archive search tools through a user-friendly interface that comes close to the functionality of modern web search engines (e.g., keyword search, query auto-completion, and related query suggestion), and it has the added benefit of taking user feedback on the current web into account for web archive search as well. Through extensive experiments, we conduct quantitative and qualitative analyses in order to provide insights that enable further research on and practical applications of web archives.
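The linking step described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the publicly visible Wayback Machine URL pattern `https://web.archive.org/web/<timestamp>/<url>`, and the names `wayback_url` and `link_results` are hypothetical. The live-web result list is passed in as plain data, so no search API is called.

```python
WAYBACK_PREFIX = "https://web.archive.org/web"  # assumed public URL pattern


def wayback_url(live_url: str, timestamp: str) -> str:
    """Build a Wayback Machine URL for a live-web URL.

    `timestamp` is a numeric prefix of YYYYMMDDhhmmss, e.g. "2012" or
    "20120615". The archive is expected to resolve it to a nearby capture.
    """
    if not (timestamp.isdigit() and 4 <= len(timestamp) <= 14):
        raise ValueError("timestamp must be 4 to 14 digits (YYYY[MMDDhhmmss])")
    return f"{WAYBACK_PREFIX}/{timestamp}/{live_url}"


def link_results(ranked_urls: list[str], timestamp: str) -> list[dict]:
    """Attach an archive link to each result, preserving the live-web ranking."""
    return [
        {"rank": i + 1, "url": url, "archived": wayback_url(url, timestamp)}
        for i, url in enumerate(ranked_urls)
    ]


if __name__ == "__main__":
    # Hypothetical ranked list, standing in for live-web search results.
    for row in link_results(["http://example.com/", "http://example.org/a"], "2012"):
        print(row["rank"], row["archived"])
```

The point of the design is that ranking comes entirely from the live-web search engine; the archive is only consulted by URL, which is why no index over archived content is needed.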
IR · Jan 14, 2017
Semantic Annotation for Microblog Topics Using Wikipedia Temporal Information
Tuan Tran, Nam Khanh Tran, Teka Hadgu Asmelash et al.
Trending topics in microblogs such as Twitter are valuable resources for understanding social aspects of real-world events. To enable deep analyses of such trends, semantic annotation is an effective approach; yet the problem of annotating microblog trending topics is largely unexplored by the research community. In this work, we tackle the problem of mapping trending Twitter topics to entities from Wikipedia. We propose a novel model that complements traditional text-based approaches by rewarding entities that exhibit a high temporal correlation with topics during their burst time period. By exploiting temporal information from the Wikipedia edit history and page view logs, we improve annotation performance by 17-28% compared to competitive baselines.
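One way to picture "rewarding entities that exhibit a high temporal correlation" is the sketch below. The abstract does not specify the model, so everything here is an assumption made for illustration: Pearson correlation as the temporal signal, a linear mix with a text-based score, and the names `pearson`, `rank_entities`, and `weight`.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def rank_entities(topic_series, entity_series, text_scores, weight=0.5):
    """Rank candidate entities by a mix of text score and temporal correlation.

    topic_series:  topic volume per time step within the burst window
    entity_series: entity -> page views (or edits) over the same time steps
    text_scores:   entity -> score from a text-based matcher, in [0, 1]
    weight:        share given to the temporal signal (illustrative parameter)
    """
    combined = {}
    for entity, series in entity_series.items():
        temporal = max(0.0, pearson(topic_series, series))  # reward only positive correlation
        combined[entity] = (1 - weight) * text_scores.get(entity, 0.0) + weight * temporal
    return sorted(combined, key=combined.get, reverse=True)


if __name__ == "__main__":
    topic = [1, 5, 20, 8, 2]  # made-up tweet counts around a burst
    views = {"A": [10, 50, 200, 80, 20], "B": [200, 80, 10, 50, 20]}
    print(rank_entities(topic, views, {"A": 0.4, "B": 0.5}))
```

In this toy input, entity B has the higher text score, but A's page views track the topic's burst, so the temporal term moves A to the top.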
IR · Dec 20, 2016
Classification and Learning-to-rank Approaches for Cross-Device Matching at CIKM Cup 2016
Nam Khanh Tran
In this paper, we propose two methods for tackling the problem of cross-device matching for online advertising at CIKM Cup 2016. The first method treats the matching problem as a binary classification task and solves it using ensemble learning techniques. The second method defines the matching problem as a ranking task and solves it using learning-to-rank algorithms. The results show that both methods are promising, with the ranking-based method outperforming the classification-based method on this task.
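The difference between the two formulations can be illustrated at the decision stage. The sketch below is not the paper's pipeline: it assumes some model has already scored candidate device pairs, and the names `match_by_threshold`, `match_by_ranking`, `threshold`, and `top_k` are hypothetical. The classification view accepts each pair independently; the ranking view compares candidates for the same device against each other.

```python
from collections import defaultdict


def match_by_threshold(pair_scores, threshold=0.5):
    """Classification view: accept every pair whose score clears a fixed threshold."""
    return [pair for pair, score in pair_scores.items() if score >= threshold]


def match_by_ranking(pair_scores, top_k=1):
    """Ranking view: for each device, keep its top_k highest-scoring candidates."""
    by_device = defaultdict(list)
    for (device, candidate), score in pair_scores.items():
        by_device[device].append((score, candidate))
    matches = []
    for device, scored in by_device.items():
        scored.sort(reverse=True)  # highest score first
        matches.extend((device, candidate) for _, candidate in scored[:top_k])
    return matches


if __name__ == "__main__":
    # Made-up scores: device "a" has two weak candidates, device "d" one strong one.
    scores = {("a", "b"): 0.4, ("a", "c"): 0.3, ("d", "e"): 0.9}
    print(match_by_threshold(scores))
    print(match_by_ranking(scores))
```

With these scores, the threshold rule returns nothing for device "a", while the ranking rule still proposes its best candidate. Whether that behavior helps depends on the evaluation metric.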