A Deep Learning-Based Approach for Measuring the Domain Similarity of Persian Texts
This work addresses a domain-specific problem in Persian text analysis for e-commerce, but its contribution is incremental, since it applies existing methods to a new language and dataset.
The paper tackles the problem of measuring domain similarity between Persian advertisement texts by proposing a deep learning approach that uses word embeddings and neural networks, achieving an F1 score of 0.9865.
In this paper, we propose a novel approach for measuring the degree of similarity between the categories of two pieces of Persian text, each published as the description of a separate advertisement. We built a dataset for this work from advertisements posted on an e-commerce website. We generated a large number of text pairs from this data and assigned each pair a score from 0 to 3, which indicates the degree of similarity between the domains of the two texts. We represent words with word embedding vectors derived from word2vec, and then use deep neural network models to represent the texts. Finally, we concatenate the absolute difference and the element-wise product of the two text representations and feed the result to a fully-connected neural network, which produces a probability distribution over the scores of the pair. We trained our model on a GPU using supervised learning, and our best model achieved an F1 score of 0.9865.
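The comparison step in the abstract can be illustrated with a minimal sketch. It assumes the multiplication is an element-wise product of the two text vectors, and it stands in a single dense layer with random, untrained weights for the fully-connected network. The function names (`pair_features`, `score_distribution`), the vector size, and the weights are hypothetical choices for illustration, not the authors' implementation.

```python
import math
import random


def pair_features(u, v):
    """Concatenate the absolute difference and the element-wise product of two vectors."""
    abs_diff = [abs(a - b) for a, b in zip(u, v)]
    product = [a * b for a, b in zip(u, v)]
    return abs_diff + product


def softmax(logits):
    """Turn raw scores into a probability distribution (numerically stable form)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def score_distribution(u, v, weights, bias):
    """Apply one dense layer plus softmax to the pair features.

    Assumption: a single layer replaces the paper's fully-connected network.
    """
    feats = pair_features(u, v)
    logits = [sum(w * f for w, f in zip(row, feats)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)


# Hypothetical setup: 8-dimensional text vectors and 4 score classes (0 to 3).
random.seed(0)
dim, num_scores = 8, 4
u = [random.gauss(0, 1) for _ in range(dim)]
v = [random.gauss(0, 1) for _ in range(dim)]
weights = [[random.gauss(0, 0.1) for _ in range(2 * dim)] for _ in range(num_scores)]
bias = [0.0] * num_scores

probs = score_distribution(u, v, weights, bias)
predicted_score = max(range(num_scores), key=lambda k: probs[k])
```

In a trained model, `u` and `v` would come from the text encoder and the weights would be learned from the labeled pairs. Both feature parts are symmetric in `u` and `v`, so the predicted distribution does not depend on the order of the two texts.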