CL · May 9, 2023

PLM-GNN: A Webpage Classification Method based on Joint Pre-trained Language Model and Graph Neural Network

arXiv:2305.05378v1 · 4 citations
Originality: Incremental advance
AI Analysis

This paper addresses efficient webpage classification for web information mining and offers an automated alternative to manual feature engineering, though it appears incremental because it combines existing techniques.

The paper tackles webpage classification by proposing PLM-GNN, a method that jointly encodes text and HTML DOM trees using pre-trained language models and graph neural networks, achieving strong performance on the KI-04 and SWDE datasets and on AHS, a practical dataset of scholars' homepages.

The number of web pages is growing at an exponential rate, accumulating massive amounts of data on the web. Webpage classification is one of the key processes in web information mining. Some classical methods manually build features of web pages and then train classifiers with machine learning or deep learning. However, building features manually requires specific domain knowledge, and validating those features usually takes a long time. Considering that webpages are generated by combining text with HTML Document Object Model (DOM) trees, we propose a representation and classification method based on a pre-trained language model and a graph neural network, named PLM-GNN. It is based on the joint encoding of the text and the HTML DOM trees in web pages. It performs well on the KI-04 and SWDE datasets and on AHS, a practical dataset built for a scholar's homepage crawling project.
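
The abstract gives no implementation details, so the following is only a minimal sketch of what "joint encoding of text and HTML DOM trees" could look like, not the authors' method. It assumes that each DOM element becomes a graph node, that parent-child links become edges, and that node text is embedded before one round of neighbour aggregation. A hashed bag-of-words stands in for the pre-trained language model, and a plain mean over neighbours stands in for a graph neural network layer. All names (DomGraphBuilder, embed_text, propagate, encode_page, DIM) are hypothetical.

```python
from html.parser import HTMLParser
import zlib

DIM = 8  # toy embedding size (assumption, chosen only for illustration)
VOID = {"br", "hr", "img", "input", "link", "meta"}  # elements with no closing tag


class DomGraphBuilder(HTMLParser):
    """Collects DOM element nodes, their direct text, and parent-child edges."""

    def __init__(self):
        super().__init__()
        self.tags = []    # tag name per node
        self.texts = []   # text found directly inside each node
        self.edges = []   # (parent_index, child_index) pairs
        self._stack = []  # indices of currently open elements

    def handle_starttag(self, tag, attrs):
        idx = len(self.tags)
        self.tags.append(tag)
        self.texts.append("")
        if self._stack:
            self.edges.append((self._stack[-1], idx))
        if tag not in VOID:
            self._stack.append(idx)

    def handle_endtag(self, tag):
        # Close the nearest matching open element, tolerating sloppy HTML.
        for i in range(len(self._stack) - 1, -1, -1):
            if self.tags[self._stack[i]] == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        if self._stack and data.strip():
            self.texts[self._stack[-1]] += " " + data.strip()


def embed_text(text):
    """Stand-in for a PLM encoder: hashed bag-of-words, not a real language model."""
    vec = [0.0] * DIM
    for tok in text.lower().split():
        vec[zlib.crc32(tok.encode("utf-8")) % DIM] += 1.0
    return vec


def propagate(features, edges):
    """One mean-aggregation step over undirected DOM edges, self included."""
    neighbours = [[i] for i in range(len(features))]
    for parent, child in edges:
        neighbours[parent].append(child)
        neighbours[child].append(parent)
    return [
        [sum(features[j][d] for j in nbrs) / len(nbrs) for d in range(DIM)]
        for nbrs in neighbours
    ]


def encode_page(html):
    """Returns the DOM graph builder and a mean-pooled page vector."""
    builder = DomGraphBuilder()
    builder.feed(html)
    builder.close()
    feats = [embed_text(f"{tag} {text}")
             for tag, text in zip(builder.tags, builder.texts)]
    if not feats:
        return builder, [0.0] * DIM
    feats = propagate(feats, builder.edges)
    page_vec = [sum(f[d] for f in feats) / len(feats) for d in range(DIM)]
    return builder, page_vec


SAMPLE = "<html><body><h1>Jane Doe</h1><p>Professor of physics</p></body></html>"
builder, page_vec = encode_page(SAMPLE)
print(builder.tags, builder.edges, len(page_vec))
```

In a real system the page vector would feed a trained classifier, and both stand-ins would be replaced by learned models. The sketch only shows how text content and DOM structure can end up in a single representation.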

