CLIR · Dec 23, 2024

Comparative Analysis of Document-Level Embedding Methods for Similarity Scoring on Shakespeare Sonnets and Taylor Swift Lyrics

arXiv:2412.17552v1 · 1 citation · h-index: 1
Originality: Synthesis-oriented
AI Analysis

This work provides insights into embedding methods for similarity scoring in contrasting textual domains, but it is incremental as it applies existing techniques to new data without major innovations.

This study compared TF-IDF, Word2Vec, and BERT embeddings for document similarity scoring on Shakespeare sonnets and Taylor Swift lyrics, finding that Word2Vec showed superior semantic generalization in cross-domain comparisons while BERT underperformed due to lack of domain-specific fine-tuning.

This study evaluates the performance of TF-IDF weighting, averaged Word2Vec embeddings, and BERT embeddings for document similarity scoring across two contrasting textual domains. Analysing cosine similarity scores highlights each method's strengths and limitations. The findings underscore TF-IDF's reliance on lexical overlap and Word2Vec's superior semantic generalisation, particularly in cross-domain comparisons. BERT demonstrates lower performance in challenging domains, likely due to insufficient domain-specific fine-tuning.
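To make the contrast between lexical and embedding-based scoring concrete, the following is a minimal sketch in plain Python, not the paper's implementation. It scores one pair of invented phrases with TF-IDF cosine similarity and with cosine similarity over averaged word vectors. The example texts, the smoothed IDF variant, and the two-dimensional `toy_embeddings` table are all assumptions made for illustration; a real experiment would use trained Word2Vec vectors and the actual sonnets and lyrics.

```python
import math
from collections import Counter

def tokenize(text):
    # Lowercase, split on whitespace, and strip surrounding punctuation.
    tokens = (w.strip(".,;:!?'\"").lower() for w in text.split())
    return [t for t in tokens if t]

def tfidf_vectors(docs):
    # One sparse {term: weight} dict per document.
    # Smoothed IDF is an assumed variant, not taken from the paper.
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in doc_freq.items()}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append({t: (c / len(toks)) * idf[t] for t, c in counts.items()})
    return vectors

def cosine_sparse(a, b):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def mean_vector(tokens, embeddings):
    # Average the vectors of in-vocabulary tokens; None if none are known.
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]

def cosine_dense(a, b):
    # Cosine similarity between two dense vectors stored as lists.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Two invented phrases that share no words (hypothetical data).
doc_a = "thy sweet love"
doc_b = "my darling romance"

# Hand-made 2-d vectors standing in for trained Word2Vec embeddings.
toy_embeddings = {
    "love": [0.9, 0.1],
    "romance": [0.8, 0.2],
    "sweet": [0.7, 0.3],
    "darling": [0.75, 0.25],
}

vec_a, vec_b = tfidf_vectors([doc_a, doc_b])
tfidf_sim = cosine_sparse(vec_a, vec_b)

dense_sim = cosine_dense(
    mean_vector(tokenize(doc_a), toy_embeddings),
    mean_vector(tokenize(doc_b), toy_embeddings),
)

print("TF-IDF cosine:", tfidf_sim)
print("Averaged-embedding cosine:", dense_sim)
```

Because the two phrases share no vocabulary, the TF-IDF score is zero, while the averaged vectors can still register similarity if related words lie close together in the embedding space. This is the behaviour the abstract attributes to lexical overlap versus semantic generalisation, shown here only on toy inputs.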

