Kirill Solovev

2papers

2 Papers

61.1CLMay 18
Infini-News: Efficiently Queryable Access to 1.3 Billion Processed Common Crawl News Articles

Ruggero Marino Lazzaroni, Jana Lasser, Kirill Solovev

Large-scale news corpora support a wide range of research in Computational Social Science and NLP, yet access remains constrained: commercial archives impose prohibitive costs and licensing restrictions, while open alternatives like Common Crawl's CC-News require terabyte-scale storage and computationally intensive processing. We present Infini-News, a retrieval toolkit and index for the entire CC-News archive from August 2016 to the latest available snapshot. Our contributions are threefold. First, we extract, clean the text, and parse the structured metadata of over 1.35B articles. Second, we enrich the corpus with language detection using three frontier language classifiers (GlotLID, lingua, and CommonLingua), and with multi-source geographic attribution that resolves a country of origin for 83.4% of articles across 222 countries. Third, we construct Infini-gram indexes: suffix-array structures that let researchers search the full archive for arbitrary text patterns in sub-second time. Together, these resources lower the barrier to longitudinal, cross-national media research.

LGFeb 16, 2021
Integrating Floor Plans into Hedonic Models for Rent Price Appraisal

Kirill Solovev, Nicolas Pröllochs

Online real estate platforms have become significant marketplaces facilitating users' search for an apartment or a house. Yet it remains challenging to accurately appraise a property's value. Prior works have primarily studied real estate valuation based on hedonic price models that take structured data into account while accompanying unstructured data is typically ignored. In this study, we investigate to what extent an automated visual analysis of apartment floor plans on online real estate platforms can enhance hedonic rent price appraisal. We propose a tailored two-staged deep learning approach to learn price-relevant designs of floor plans from historical price data. Subsequently, we integrate the floor plan predictions into hedonic rent price models that account for both structural and locational characteristics of an apartment. Our empirical analysis based on a unique dataset of 9174 real estate listings suggests that current hedonic models underutilize the available data. We find that (1) the visual design of floor plans has significant explanatory power regarding rent prices - even after controlling for structural and locational apartment characteristics, and (2) harnessing floor plans results in an up to 10.56% lower out-of-sample prediction error. We further find that floor plans yield a particularly high gain in prediction performance for older and smaller apartments. Altogether, our empirical findings contribute to the existing research body by establishing the link between the visual design of floor plans and real estate prices. Moreover, our approach has important implications for online real estate platforms, which can use our findings to enhance user experience in their real estate listings.