CL AI LG Jun 4, 2020

Experiments on Paraphrase Identification Using Quora Question Pairs Dataset

arXiv:2006.02648v2 16 citations

Originality Synthesis-oriented

AI Analysis

This work addresses paraphrase identification for question pairs, but it is incremental as it applies existing methods to a specific dataset.

The paper tackled the problem of identifying similar questions using the Quora question pairs dataset, achieving up to 97% accuracy through experiments with various feature extraction methods and algorithms.

We modeled the Quora question pairs dataset to identify similar questions. The dataset is provided by Quora. The task is binary classification. We tried several methods and algorithms, taking a different approach from previous works. For feature extraction, we used Bag of Words, including Count Vectorizer and Term Frequency-Inverse Document Frequency (TF-IDF) with unigrams, for XGBoost and CatBoost. Furthermore, we also experimented with a WordPiece tokenizer, which improved the model performance significantly. We achieved up to 97 percent accuracy. Code and Dataset.
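To make the feature-extraction step concrete, here is a minimal standard-library sketch of unigram count and TF-IDF vectors for a question pair. It is an illustration, not the paper's code: the example questions, the punctuation handling, the smoothed IDF formula, and the helper names (`tokenize`, `count_vector`, `idf_weights`, `pair_features`) are all assumptions, and concatenating the two question vectors is only one plausible way to represent a pair before passing it to a classifier such as XGBoost or CatBoost.

```python
import math
from collections import Counter

PUNCT = "?.,!\"'()"


def tokenize(text):
    # Lowercase, split on whitespace, strip surrounding punctuation.
    # This preprocessing is an assumption; the abstract does not specify it.
    stripped = (tok.strip(PUNCT) for tok in text.lower().split())
    return [tok for tok in stripped if tok]


def count_vector(tokens, vocab):
    # Unigram bag-of-words: one raw count per vocabulary term.
    counts = Counter(tokens)
    return [counts.get(term, 0) for term in vocab]


def idf_weights(docs_tokens, vocab):
    # Smoothed inverse document frequency: ln((1 + N) / (1 + df)) + 1.
    n_docs = len(docs_tokens)
    doc_freq = Counter(term for toks in docs_tokens for term in set(toks))
    return [math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0 for term in vocab]


def tfidf_vector(tokens, vocab, idf):
    # Term count scaled by the term's IDF weight.
    return [c * w for c, w in zip(count_vector(tokens, vocab), idf)]


def pair_features(tokens_a, tokens_b, vocab, idf):
    # Hypothetical pair representation: both TF-IDF vectors side by side.
    return tfidf_vector(tokens_a, vocab, idf) + tfidf_vector(tokens_b, vocab, idf)


# Made-up example questions, not taken from the Quora dataset.
questions = [
    "How do I learn Python?",
    "What is the best way to learn Python?",
    "How do I cook rice?",
]
tokens = [tokenize(q) for q in questions]
vocab = sorted({tok for toks in tokens for tok in toks})
idf = idf_weights(tokens, vocab)

# Feature row for the candidate pair (question 0, question 1).
features = pair_features(tokens[0], tokens[1], vocab, idf)
```

Each row produced this way would be paired with a binary label (duplicate or not) to train the classifier. Terms that occur in fewer questions, such as "rice" here, receive a larger IDF weight than terms shared across questions, such as "python".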
