Evaluating Web Content Quality via Multi-scale Features
This work addresses web content quality evaluation for web processing applications, but it appears incremental because it builds on existing feature types and datasets.
The paper tackles web content quality measurement by developing automatic statistical methods over multi-scale features (statistical content, link, and TFIDF features). It shows that the algorithm is effective on a multi-language dataset and that the feature types provide complementary identification ability.
Web content quality measurement is crucial to various web content processing applications. This paper explores multi-scale features that may affect the quality of a host and develops automatic statistical methods to evaluate Web content quality. The extracted features include statistical content features, page- and host-level link features, and TFIDF features. Experiments on the ECML/PKDD 2010 Discovery Challenge data set show that the algorithm is effective and feasible for quality tasks in multiple languages, and that the multi-scale features differ in identification ability and complement each other well for most tasks.
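
To make the idea of multi-scale features concrete, the following is a minimal sketch of how the three feature families named above could be concatenated into one vector per host. It is not the paper's implementation: the specific content statistics (token count, mean token length), the link features (in-degree, out-degree), the TFIDF weighting variant, and all names and toy data (`build_features`, `a.example`, and so on) are assumptions chosen for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, one TFIDF vector per document) for tokenized docs."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for doc in docs for tok in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1
        # Relative term frequency times log inverse document frequency.
        vectors.append(
            [(counts[t] / total) * math.log(n_docs / df[t]) for t in vocab]
        )
    return vocab, vectors


def content_stats(doc):
    """Assumed statistical content features: token count, mean token length."""
    if not doc:
        return [0.0, 0.0]
    return [float(len(doc)), sum(len(t) for t in doc) / len(doc)]


def link_stats(host, links):
    """Assumed host-level link features: in-degree and out-degree."""
    in_deg = sum(1 for _, dst in links if dst == host)
    out_deg = sum(1 for src, _ in links if src == host)
    return [float(in_deg), float(out_deg)]


def build_features(hosts, links):
    """Concatenate content, link, and TFIDF features for every host."""
    names = sorted(hosts)
    docs = [hosts[name].lower().split() for name in names]
    vocab, tfidf = tfidf_vectors(docs)
    features = {
        name: content_stats(doc) + link_stats(name, links) + vec
        for name, doc, vec in zip(names, docs, tfidf)
    }
    return features, vocab


# Hypothetical toy data: host name -> page text, plus (source, target) links.
hosts = {
    "a.example": "quality news report news",
    "b.example": "cheap pills cheap cheap",
    "c.example": "news archive report",
}
links = [
    ("a.example", "c.example"),
    ("c.example", "a.example"),
    ("b.example", "a.example"),
]

features, vocab = build_features(hosts, links)
```

Each resulting vector holds two content statistics, two link statistics, and one TFIDF weight per vocabulary term. Vectors of this form could then be passed to any standard classifier or regressor; the choice of learner is left open here because the abstract does not name one.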