Ivan P. Yamshchikov

CL
h-index22
46papers
4,629citations
Novelty28%
AI Score52

46 Papers

CLDec 3, 2025Code
Teaching Old Tokenizers New Words: Efficient Tokenizer Adaptation for Pre-trained Models

Taido Purason, Pavel Chizhov, Ivan P. Yamshchikov et al.

Tokenizer adaptation plays an important role in transferring pre-trained language models to new domains or languages. In this work, we address two complementary aspects of this process: vocabulary extension and pruning. The common approach to extension trains a new tokenizer on domain-specific text and appends the tokens that do not overlap with the existing vocabulary, which often results in many tokens that are unreachable or never used. We propose continued BPE training, which adapts a pre-trained tokenizer by continuing the BPE merge learning process on new data. Experiments across multiple languages and model families show that this approach improves tokenization efficiency and leads to better utilization of added vocabulary. We also introduce leaf-based vocabulary pruning, which removes redundant tokens while preserving model quality. Together, these methods provide practical tools for controlled vocabulary modification, which we release as an open-source package.

CLNov 10, 2022
BERT in Plutarch's Shadows

Ivan P. Yamshchikov, Alexey Tikhonov, Yorgos Pantis et al.

The extensive surviving corpus of the ancient scholar Plutarch of Chaeronea (ca. 45-120 CE) also contains several texts which, according to current scholarly opinion, did not originate with him and are therefore attributed to an anonymous author Pseudo-Plutarch. These include, in particular, the work Placita Philosophorum (Quotations and Opinions of the Ancient Philosophers), which is extremely important for the history of ancient philosophy. Little is known about the identity of that anonymous author and its relation to other authors from the same period. This paper presents a BERT language model for Ancient Greek. The model discovers previously unknown statistical properties relevant to these literary, philosophical, and historical problems and can shed new light on this authorship question. In particular, the Placita Philosophorum, together with one of the other Pseudo-Plutarch texts, shows similarities with the texts written by authors from an Alexandrian context (2nd/3rd century CE).

CLNov 9, 2022
What is Wrong with Language Models that Can Not Tell a Story?

Ivan P. Yamshchikov, Alexey Tikhonov

This paper argues that a deeper understanding of narrative and the successful generation of longer subjectively interesting texts is a vital bottleneck that hinders the progress in modern Natural Language Processing (NLP) and may even be in the whole field of Artificial Intelligence. We demonstrate that there are no adequate datasets, evaluation methods, and even operational concepts that could be used to start working on narrative processing.

CLNov 3, 2023
Post Turing: Mapping the landscape of LLM Evaluation

Alexey Tikhonov, Ivan P. Yamshchikov

In the rapidly evolving landscape of Large Language Models (LLMs), introduction of well-defined and standardized evaluation methodologies remains a crucial challenge. This paper traces the historical trajectory of LLM evaluations, from the foundational questions posed by Alan Turing to the modern era of AI research. We categorize the evolution of LLMs into distinct periods, each characterized by its unique benchmarks and evaluation criteria. As LLMs increasingly mimic human-like behaviors, traditional evaluation proxies, such as the Turing test, have become less reliable. We emphasize the pressing need for a unified evaluation system, given the broader societal implications of these models. Through an analysis of common evaluation methodologies, we advocate for a qualitative shift in assessment approaches, underscoring the importance of standardization and objective criteria. This work serves as a call for the AI community to collaboratively address the challenges of LLM evaluation, ensuring their reliability, fairness, and societal benefit.

CLSep 6, 2024
BPE Gets Picky: Efficient Vocabulary Refinement During Tokenizer Training

Pavel Chizhov, Catherine Arnett, Elizaveta Korotkova et al.

Language models can largely benefit from efficient tokenization. However, they still mostly utilize the classical BPE algorithm, a simple and reliable method. This has been shown to cause such issues as under-trained tokens and sub-optimal compression that may affect the downstream performance. We introduce Picky BPE, a modified BPE algorithm that carries out vocabulary refinement during tokenizer training. Our method improves vocabulary efficiency, eliminates under-trained tokens, and does not compromise text compression. Our experiments show that our method does not reduce the downstream performance, and in several cases improves it.

54.6CLApr 15
From Where Words Come: Efficient Regularization of Code Tokenizers Through Source Attribution

Pavel Chizhov, Egor Bogomolov, Ivan P. Yamshchikov

Efficiency and safety of Large Language Models (LLMs), among other factors, rely on the quality of tokenization. A good tokenizer not only improves inference speed and language understanding but also provides extra defense against jailbreak attacks and lowers the risk of hallucinations. In this work, we investigate the efficiency of code tokenization, in particular from the perspective of data source diversity. We demonstrate that code tokenizers are prone to producing unused, and thus under-trained, tokens due to the imbalance in repository and language diversity in the training data, as well as the dominance of source-specific, repetitive tokens that are often unusable in future inference. By modifying the BPE objective and introducing merge skipping, we implement different techniques under the name Source-Attributed BPE (SA-BPE) to regularize BPE training and minimize overfitting, thereby substantially reducing the number of under-trained tokens while maintaining the same inference procedure as with regular BPE. This provides an effective tool suitable for production use.

49.5CLApr 20
Model in Distress: Sentiment Analysis on French Synthetic Social Media

Pierre-Carl Langlais, Pavel Chizhov, Yannick Detrois et al.

Automated analysis of customer feedback on social media is hindered by three challenges: the high cost of annotated training data, the scarcity of evaluation sets, especially in multilingual settings, and privacy concerns that prevent data sharing and reproducibility. We address these issues by developing a generalizable synthetic data generation pipeline applied to a case study on customer distress detection in French public transportation. Our approach utilizes backtranslation with fine-tuned models to generate 1.7 million synthetic tweets from a small seed corpus, complemented by synthetic reasoning traces. We train 600M-parameter reasoners with English and French reasoning that achieve 77-79% accuracy on human-annotated evaluation data, matching or exceeding SOTA proprietary LLMs and specialized encoders. Beyond reducing annotation costs, our pipeline preserves privacy by eliminating the exposure of sensitive user data. Our methodology can be adopted for other use cases and languages.

CLOct 29, 2024Code
Toxicity of the Commons: Curating Open-Source Pre-Training Data

Catherine Arnett, Eliot Jones, Ivan P. Yamshchikov et al.

Open-source large language models are becoming increasingly available and popular among researchers and practitioners. While significant progress has been made on open-weight models, open training data is a practice yet to be adopted by the leading open-weight models creators. At the same time, there researchers are working to make language models safer. We propose a data curation pipeline to reduce harmful outputs by models trained on public domain data. There are unique challenges to working with public domain data, as these sources differ from web text in both form and content. Many sources are historical documents and are the result of Optical Character Recognition (OCR). Consequently, current state-of-the-art approaches to toxicity filtering are often infeasible or inappropriate for open data models. In this paper, we introduce a new fully open-source pipeline for open-data toxicity filtering. Our contributions are threefold. We create a custom training dataset, ToxicCommons, which is composed of texts which have been classified across five different dimensions (racial/origin-based, gender/sex-based, religious, ability-based discrimination, and violence). We use this dataset to train a custom classifier, Celadon, that can be used to detect toxic content in open data more efficiently at a larger scale. Finally, we describe the balanced approach to content filtration that optimizes safety filtering with respect to the filtered data available for training.

CLNov 20, 2022
Pragmatic Constraint on Distributional Semantics

Elizaveta Zhemchuzhina, Nikolai Filippov, Ivan P. Yamshchikov

This paper studies the limits of language models' statistical learning in the context of Zipf's law. First, we demonstrate that Zipf-law token distribution emerges irrespective of the chosen tokenization. Second, we show that Zipf distribution is characterized by two distinct groups of tokens that differ both in terms of their frequency and their semantics. Namely, the tokens that have a one-to-one correspondence with one semantic concept have different statistical properties than those with semantic ambiguity. Finally, we demonstrate how these properties interfere with statistical learning procedures motivated by distributional semantics.

CLAug 4, 2022
Vocabulary Transfer for Biomedical Texts: Add Tokens if You Can Not Add Data

Priyanka Singh, Vladislav D. Mosin, Ivan P. Yamshchikov

Working within specific NLP subdomains presents significant challenges, primarily due to a persistent deficit of data. Stringent privacy concerns and limited data accessibility often drive this shortage. Additionally, the medical domain demands high accuracy, where even marginal improvements in model performance can have profound impacts. In this study, we investigate the potential of vocabulary transfer to enhance model performance in biomedical NLP tasks. Specifically, we focus on vocabulary extension, a technique that involves expanding the target vocabulary to incorporate domain-specific biomedical terms. Our findings demonstrate that vocabulary extension, leads to measurable improvements in both downstream model performance and inference time.

CLSep 27, 2024
Individuation in Neural Models with and without Visual Grounding

Alexey Tikhonov, Lisa Bylinina, Ivan P. Yamshchikov

We show differences between a language-and-vision model CLIP and two text-only models - FastText and SBERT - when it comes to the encoding of individuation information. We study latent representations that CLIP provides for substrates, granular aggregates, and various numbers of objects. We demonstrate that CLIP embeddings capture quantitative differences in individuation better than models trained on text-only data. Moreover, the individuation hierarchy we deduce from the CLIP embeddings agrees with the hierarchies proposed in linguistics and cognitive science.

CLMar 4
The Company You Keep: How LLMs Respond to Dark Triad Traits

Zeyi Lu, Angelica Henestrosa, Pavel Chizhov et al.

Large Language Models (LLMs) often exhibit highly agreeable and reinforcing conversational styles, also known as AI-sycophancy. Although this behavior is encouraged, it may become problematic when interacting with user prompts that reflect negative social tendencies. Such responses risk amplifying harmful behavior rather than mitigating it. In this study, we examine how LLMs respond to user prompts expressing varying degrees of Dark Triad traits (Machiavellianism, Narcissism, and Psychopathy) using a curated dataset. Our analysis reveals differences across models, whereby all models predominantly exhibit corrective behavior, while showing reinforcing output in certain cases. Model behavior also depends on the severity level and differs in the sentiment of the response. Our findings raise implications for designing safer conversational systems that can detect and respond appropriately when users escalate from benign to harmful requests.

LGFeb 9, 2023
Rehabilitating Homeless: Dataset and Key Insights

Anna Bykova, Nikolay Filippov, Ivan P. Yamshchikov

This paper presents a large anonymized dataset of homelessness alongside insights into the data-driven rehabilitation of homeless people. The dataset was gathered by a large nonprofit organization working on rehabilitating the homeless for twenty years. This is the first dataset that we know of that contains rich information on thousands of homeless individuals seeking rehabilitation. We show how data analysis can help to make the rehabilitation of homeless people more effective and successful. Thus, we hope this paper alerts the data science community to the problem of homelessness.

CLApr 25, 2025Code
Even Small Reasoners Should Quote Their Sources: Introducing the Pleias-RAG Model Family

Pierre-Carl Langlais, Pavel Chizhov, Mattia Nee et al.

We introduce a new generation of small reasoning models for RAG, search, and source summarization. Pleias-RAG-350m and Pleias-RAG-1B are mid-trained on a large synthetic dataset emulating the retrieval of a wide variety of multilingual open sources from the Common Corpus. They provide native support for citation and grounding with literal quotes and reintegrate multiple features associated with RAG workflows, such as query routing, query reformulation, and source reranking. Pleias-RAG-350m and Pleias-RAG-1B outperform SLMs below 4 billion parameters on standardized RAG benchmarks (HotPotQA, 2wiki) and are competitive with popular larger models, including Qwen-2.5-7B, Llama-3.1-8B, and Gemma-3-4B. They are the only SLMs to date maintaining consistent RAG performance across leading European languages and ensuring systematic reference grounding for statements. Due to their size and ease of deployment on constrained infrastructure and higher factuality by design, the models unlock a range of new use cases for generative AI.

CLJan 31, 2024
LLMs Simulate Big Five Personality Traits: Further Evidence

Aleksandra Sorokovikova, Natalia Fedorova, Sharwin Rezagholi et al.

An empirical investigation into the simulation of the Big Five personality traits by large language models (LLMs), namely Llama2, GPT4, and Mixtral, is presented. We analyze the personality traits simulated by these models and their stability. This contributes to the broader understanding of the capabilities of LLMs to simulate personality traits and the respective implications for personalized human-computer interaction.

CLOct 11, 2024
Sui Generis: Large Language Models for Authorship Attribution and Verification in Latin

Gleb Schmidt, Svetlana Gorovaia, Ivan P. Yamshchikov

This paper evaluates the performance of Large Language Models (LLMs) in authorship attribution and authorship verification tasks for Latin texts of the Patristic Era. The study showcases that LLMs can be robust in zero-shot authorship verification even on short texts without sophisticated feature engineering. Yet, the models can also be easily "mislead" by semantics. The experiments also demonstrate that steering the model's authorship analysis and decision-making is challenging, unlike what is reported in the studies dealing with high-resource modern languages. Although LLMs prove to be able to beat, under certain circumstances, the traditional baselines, obtaining a nuanced and truly explainable decision requires at best a lot of experimentation.

CLJun 12, 2025
Surface Fairness, Deep Bias: A Comparative Study of Bias in Language Models

Aleksandra Sorokovikova, Pavel Chizhov, Iuliia Eremenko et al.

Modern language models are trained on large amounts of data. These data inevitably include controversial and stereotypical content, which contains all sorts of biases related to gender, origin, age, etc. As a result, the models express biased points of view or produce different results based on the assigned personality or the personality of the user. In this paper, we investigate various proxy measures of bias in large language models (LLMs). We find that evaluating models with pre-prompted personae on a multi-subject benchmark (MMLU) leads to negligible and mostly random differences in scores. However, if we reformulate the task and ask a model to grade the user's answer, this shows more significant signs of bias. Finally, if we ask the model for salary negotiation advice, we see pronounced bias in the answers. With the recent trend for LLM assistant memory and personalization, these problems open up from a different angle: modern LLM users do not need to pre-prompt the description of their persona since the model already knows their socio-demographics.

SEJun 2, 2025
Greening AI-enabled Systems with Software Engineering: A Research Agenda for Environmentally Sustainable AI Practices

Luís Cruz, João Paulo Fernandes, Maja H. Kirkeby et al.

The environmental impact of Artificial Intelligence (AI)-enabled systems is increasing rapidly, and software engineering plays a critical role in developing sustainable solutions. The "Greening AI with Software Engineering" CECAM-Lorentz workshop (no. 1358, 2025) funded by the Centre Européen de Calcul Atomique et Moléculaire and the Lorentz Center, provided an interdisciplinary forum for 29 participants, from practitioners to academics, to share knowledge, ideas, practices, and current results dedicated to advancing green software and AI research. The workshop was held February 3-7, 2025, in Lausanne, Switzerland. Through keynotes, flash talks, and collaborative discussions, participants identified and prioritized key challenges for the field. These included energy assessment and standardization, benchmarking practices, sustainability-aware architectures, runtime adaptation, empirical methodologies, and education. This report presents a research agenda emerging from the workshop, outlining open research directions and practical recommendations to guide the development of environmentally sustainable AI-enabled systems rooted in software engineering principles.

CLJun 2, 2025
Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training

Pierre-Carl Langlais, Carlos Rosas Hinostroza, Mattia Nee et al.

Large Language Models (LLMs) are pre-trained on large amounts of data from different sources and domains. These data most often contain trillions of tokens with large portions of copyrighted or proprietary content, which hinders the usage of such models under AI legislation. This raises the need for truly open pre-training data that is compliant with the data security regulations. In this paper, we introduce Common Corpus, the largest open dataset for language model pre-training. The data assembled in Common Corpus are either uncopyrighted or under permissible licenses and amount to about two trillion tokens. The dataset contains a wide variety of languages, ranging from the main European languages to low-resource ones rarely present in pre-training datasets; in addition, it includes a large portion of code data. The diversity of data sources in terms of covered domains and time periods opens up the paths for both research and entrepreneurial needs in diverse areas of knowledge. In this technical report, we present the detailed provenance of data assembling and the details of dataset filtering and curation. Being already used by such industry leaders as Anthropic and multiple LLM training projects, we believe that Common Corpus will become a critical infrastructure for open science research in LLMs.

CLApr 10, 2025
What the HellaSwag? On the Validity of Common-Sense Reasoning Benchmarks

Pavel Chizhov, Mattia Nee, Pierre-Carl Langlais et al.

Common-sense reasoning is a key language model capability because it encapsulates not just specific factual knowledge but rather general language and world understanding. Measuring common-sense reasoning, therefore, is crucial for language models of different sizes and applications. One of the most widely used benchmarks for evaluating such capabilities is HellaSwag; however, in this paper, we show that it has severe construct validity issues. These issues range from basic ungrammaticality and numerous typos to misleading prompts or equally correct options. Furthermore, we show that if models are evaluated only on answer texts, or with "Lorem ipsum dolor..." instead of the question, more than 65% of model predictions remain the same, and this cannot be attributed merely to contamination. Since benchmark scores are an essential part of model selection in both research and commercial applications, these validity issues can have severe consequences. In particular, knowing that taking benchmark scores at face value is ubiquitous, inadequate evaluation leads to ill-informed decisions about models. In this paper, we thoroughly investigate critical validity issues posed by HellaSwag and illustrate them with various evaluations using generative language models of different sizes. We argue that this benchmark does not accurately measure common-sense reasoning and, therefore, should not be used for evaluation in its current state. Based on the results of our study, we propose requirements that should be met by future common-sense reasoning benchmarks. In addition, we release GoldenSwag, a corrected subset of HellaSwag, which, to our belief, facilitates acceptable common-sense reasoning evaluation.

CLDec 12, 2024
CleanComedy: Creating Friendly Humor through Generative Techniques

Dmitry Vikhorev, Daria Galimzianova, Svetlana Gorovaia et al.

Humor generation is a challenging task in natural language processing due to limited resources and the quality of existing datasets. Available humor language resources often suffer from toxicity and duplication, limiting their effectiveness for training robust models. This paper proposes CleanComedy, a specialized, partially annotated toxicity-filtered corpus of English and Russian jokes collected from various sources. We study the effectiveness of our data filtering approach through a survey on humor and toxicity levels in various joke groups. In addition, we study advances in computer humor generation by comparing jokes written by humans with various groups of generative jokes, including our baseline models trained on the CleanComedy datasets.

CLAug 1, 2025
Better Call Claude: Can LLMs Detect Changes of Writing Style?

Johannes Römisch, Svetlana Gorovaia, Mariia Halchynska et al.

This article explores the zero-shot performance of state-of-the-art large language models (LLMs) on one of the most challenging tasks in authorship analysis: sentence-level style change detection. Benchmarking four LLMs on the official PAN~2024 and 2025 "Multi-Author Writing Style Analysis" datasets, we present several observations. First, state-of-the-art generative models are sensitive to variations in writing style - even at the granular level of individual sentences. Second, their accuracy establishes a challenging baseline for the task, outperforming suggested baselines of the PAN competition. Finally, we explore the influence of semantics on model predictions and present evidence suggesting that the latest generation of LLMs may be more sensitive to content-independent and purely stylistic signals than previously reported.

CLFeb 22, 2024
Vygotsky Distance: Measure for Benchmark Task Similarity

Maxim K. Surkov, Ivan P. Yamshchikov

Evaluation plays a significant role in modern natural language processing. Most modern NLP benchmarks consist of arbitrary sets of tasks that neither guarantee any generalization potential for the model once applied outside the test set nor try to minimize the resource consumption needed for model evaluation. This paper presents a theoretical instrument and a practical algorithm to calculate similarity between benchmark tasks, we call this similarity measure "Vygotsky distance". The core idea of this similarity measure is that it is based on relative performance of the "students" on a given task, rather that on the properties of the task itself. If two tasks are close to each other in terms of Vygotsky distance the models tend to have similar relative performance on them. Thus knowing Vygotsky distance between tasks one can significantly reduce the number of evaluation tasks while maintaining a high validation quality. Experiments on various benchmarks, including GLUE, SuperGLUE, CLUE, and RussianSuperGLUE, demonstrate that a vast majority of NLP benchmarks could be at least 40% smaller in terms of the tasks included. Most importantly, Vygotsky distance could also be used for the validation of new tasks thus increasing the generalization potential of the future NLP models.

CLAug 22, 2025
ComicScene154: A Scene Dataset for Comic Analysis

Sandro Paval, Ivan P. Yamshchikov, Pascal Meißner

Comics offer a compelling yet under-explored domain for computational narrative analysis, combining text and imagery in ways distinct from purely textual or audiovisual media. We introduce ComicScene154, a manually annotated dataset of scene-level narrative arcs derived from public-domain comic books spanning diverse genres. By conceptualizing comics as an abstraction for narrative-driven, multimodal data, we highlight their potential to inform broader research on multi-modal storytelling. To demonstrate the utility of ComicScene154, we present a baseline scene segmentation pipeline, providing an initial benchmark that future studies can build upon. Our results indicate that ComicScene154 constitutes a valuable resource for advancing computational methods in multimodal narrative understanding and expanding the scope of comic analysis within the Natural Language Processing community.

CLAug 1, 2025
Team "better_call_claude": Style Change Detection using a Sequential Sentence Pair Classifier

Gleb Schmidt, Johannes Römisch, Mariia Halchynska et al.

Style change detection - identifying the points in a document where writing style shifts - remains one of the most important and challenging problems in computational authorship analysis. At PAN 2025, the shared task challenges participants to detect style switches at the most fine-grained level: individual sentences. The task spans three datasets, each designed with controlled and increasing thematic variety within documents. We propose to address this problem by modeling the content of each problem instance - that is, a series of sentences - as a whole, using a Sequential Sentence Pair Classifier (SSPC). The architecture leverages a pre-trained language model (PLM) to obtain representations of individual sentences, which are then fed into a bidirectional LSTM (BiLSTM) to contextualize them within the document. The BiLSTM-produced vectors of adjacent sentences are concatenated and passed to a multi-layer perceptron for prediction per adjacency. Building on the work of previous PAN participants classical text segmentation, the approach is relatively conservative and lightweight. Nevertheless, it proves effective in leveraging contextual information and addressing what is arguably the most challenging aspect of this year's shared task: the notorious problem of "stylistically shallow", short sentences that are prevalent in the proposed benchmark data. Evaluated on the official PAN-2025 test datasets, the model achieves strong macro-F1 scores of 0.923, 0.828, and 0.724 on the EASY, MEDIUM, and HARD data, respectively, outperforming not only the official random baselines but also a much more challenging one: claude-3.7-sonnet's zero-shot performance.

CLMay 31, 2025
Smotrom tvoja pa ander drogoj verden! Resurrecting Dead Pidgin with Generative Models: Russenorsk Case Study

Alexey Tikhonov, Sergei Shteiner, Anna Bykova et al.

Russenorsk, a pidgin language historically used in trade interactions between Russian and Norwegian speakers, represents a unique linguistic phenomenon. In this paper, we attempt to analyze its lexicon using modern large language models (LLMs), based on surviving literary sources. We construct a structured dictionary of the language, grouped by synonyms and word origins. Subsequently, we use this dictionary to formulate hypotheses about the core principles of word formation and grammatical structure in Russenorsk and show which hypotheses generated by large language models correspond to the hypotheses previously proposed ones in the academic literature. We also develop a "reconstruction" translation agent that generates hypothetical Russenorsk renderings of contemporary Russian and Norwegian texts.

CLApr 4, 2024
Knowledge Graph Representation for Political Information Sources

Tinatin Osmonova, Alexey Tikhonov, Ivan P. Yamshchikov

With the rise of computational social science, many scholars utilize data analysis and natural language processing tools to analyze social media, news articles, and other accessible data sources for examining political and social discourse. Particularly, the study of the emergence of echo-chambers due to the dissemination of specific information has become a topic of interest in mixed methods research areas. In this paper, we analyze data collected from two news portals, Breitbart News (BN) and New York Times (NYT) to prove the hypothesis that the formation of echo-chambers can be partially explained on the level of an individual information consumption rather than a collective topology of individuals' social networks. Our research findings are presented through knowledge graphs, utilizing a dataset spanning 11.5 years gathered from BN and NYT media portals. We demonstrate that the application of knowledge representation techniques to the aforementioned news streams highlights, contrary to common assumptions, shows relative "internal" neutrality of both sources and polarizing attitude towards a small fraction of entities. Additionally, we argue that such characteristics in information sources lead to fundamental disparities in audience worldviews, potentially acting as a catalyst for the formation of echo-chambers.

SIMar 28, 2024
Echo-chambers and Idea Labs: Communication Styles on Twitter

Aleksandra Sorokovikova, Michael Becker, Ivan P. Yamshchikov

This paper investigates the communication styles and structures of Twitter (X) communities within the vaccination context. While mainstream research primarily focuses on the echo-chamber phenomenon, wherein certain ideas are reinforced and participants are isolated from opposing opinions, this study reveals the presence of diverse communication styles across various communities. In addition to the communities exhibiting echo-chamber behavior, this research uncovers communities with distinct communication patterns. By shedding light on the nuanced nature of communication within social networks, this study emphasizes the significance of understanding the diversity of perspectives within online communities.

CLJan 31, 2024
Neural Machine Translation for Malayalam Paraphrase Generation

Christeena Varghese, Sergey Koshelev, Ivan P. Yamshchikov

This study explores four methods of generating paraphrases in Malayalam, utilizing resources available for English paraphrasing and pre-trained Neural Machine Translation (NMT) models. We evaluate the resulting paraphrases using both automated metrics, such as BLEU, METEOR, and cosine similarity, as well as human annotation. Our findings suggest that automated evaluation measures may not be fully appropriate for Malayalam, as they do not consistently align with human judgment. This discrepancy underscores the need for more nuanced paraphrase evaluation approaches especially for highly agglutinative languages.

CLFeb 7, 2022
Moving Other Way: Exploring Word Mover Distance Extensions

Ilya Smirnov, Ivan P. Yamshchikov

The word mover's distance (WMD) is a popular semantic similarity metric for two texts. This position paper studies several possible extensions of WMD. We experiment with the frequency of words in the corpus as a weighting factor and the geometry of the word vector space. We validate possible extensions of WMD on six document classification datasets. Some proposed extensions show better results in terms of the k-nearest neighbor classification error than WMD.

CLDec 29, 2021
Fine-Tuning Transformers: Vocabulary Transfer

Vladislav Mosin, Igor Samenko, Alexey Tikhonov et al.

Transformers are responsible for the vast majority of recent advances in natural language processing. The majority of practical natural language processing applications of these models are typically enabled through transfer learning. This paper studies if corpus-specific tokenization used for fine-tuning improves the resulting performance of the model. Through a series of experiments, we demonstrate that such tokenization combined with the initialization and fine-tuning strategy for the vocabulary tokens speeds up the transfer and boosts the performance of the fine-tuned model. We call this aspect of transfer facilitation vocabulary transfer.

CLDec 13, 2021
Do Data-based Curricula Work?

Maxim K. Surkov, Vladislav D. Mosin, Ivan P. Yamshchikov

Current state-of-the-art NLP systems use large neural networks that require lots of computational resources for training. Inspired by human knowledge acquisition, researchers have proposed curriculum learning, - sequencing of tasks (task-based curricula) or ordering and sampling of the datasets (data-based curricula) that facilitate training. This work investigates the benefits of data-based curriculum learning for large modern language models such as BERT and T5. We experiment with various curricula based on a range of complexity measures and different sampling strategies. Extensive experiments on different NLP tasks show that curricula based on various complexity measures rarely has any benefits while random sampling performs either as well or better than curricula.

CLSep 29, 2021
StoryDB: Broad Multi-language Narrative Dataset

Alexey Tikhonov, Igor Samenko, Ivan P. Yamshchikov

This paper presents StoryDB - a broad multi-language dataset of narratives. StoryDB is a corpus of texts that includes stories in 42 different languages. Every language includes 500+ stories. Some of the languages include more than 20 000 stories. Every story is indexed across languages and labeled with tags such as a genre or a topic. The corpus shows rich topical and language variation and can serve as a resource for the study of the role of narrative in natural language processing across various languages including low resource ones. We also demonstrate how the dataset could be used to benchmark three modern multilanguage models, namely, mDistillBERT, mBERT, and XLM-RoBERTa.

CLSep 28, 2021
Actionable Entities Recognition Benchmark for Interactive Fiction

Alexey Tikhonov, Ivan P. Yamshchikov

This paper presents a new natural language processing task - Actionable Entities Recognition (AER) - recognition of entities that protagonists could interact with for further plot development. Though similar to classical Named Entity Recognition (NER), it has profound differences. In particular, it is crucial for interactive fiction, where the agent needs to detect entities that might be useful in the future. We also discuss if AER might be further helpful for the systems dealing with narrative processing since actionable entities profoundly impact the causal relationship in a story. We validate the proposed task on two previously available datasets and present a new benchmark dataset for the AER task that includes 5550 descriptions with one or more actionable entities.

CLSep 24, 2021
Rethinking Crowd Sourcing for Semantic Similarity

Shaul Solomon, Adam Cohn, Hernan Rosenblum et al.

Estimation of semantic similarity is crucial for a variety of natural language processing (NLP) tasks. In the absence of a general theory of semantic information, many papers rely on human annotators as the source of ground truth for semantic similarity estimation. This paper investigates the ambiguities inherent in crowd-sourced semantic labeling. It shows that annotators that treat semantic similarity as a binary category (two sentences are either similar or not similar and there is no middle ground) play the most important role in the labeling. The paper offers heuristics to filter out unreliable annotators and stimulates further discussions on human perception of semantic similarity.

CLJul 26, 2021
DYPLODOC: Dynamic Plots for Document Classification

Anastasia Malysheva, Alexey Tikhonov, Ivan P. Yamshchikov

Narrative generation and analysis are still on the fringe of modern natural language processing yet are crucial in a variety of applications. This paper proposes a feature extraction method for plot dynamics. We present a dataset that consists of the plot descriptions for thirteen thousand TV shows alongside meta-information on their genres and dynamic plots extracted from them. We validate the proposed tool for plot dynamics extraction and discuss possible applications of this method to the tasks of narrative analysis and generation.

CLJul 13, 2020
Paranoid Transformer: Reading Narrative of Madness as Computational Approach to Creativity

Yana Agafonova, Alexey Tikhonov, Ivan P. Yamshchikov

This papers revisits the receptive theory in context of computational creativity. It presents a case study of a Paranoid Transformer - a fully autonomous text generation engine with raw output that could be read as the narrative of a mad digital persona without any additional human post-filtering. We describe technical details of the generative system, provide examples of output and discuss the impact of receptive theory, chance discovery and simulation of fringe mental state on the understanding of computational creativity.

ASJul 13, 2020
Artificial Neural Networks Jamming on the Beat

Alexey Tikhonov, Ivan P. Yamshchikov

This paper addresses the issue of long-scale correlations that is characteristic for symbolic music and is a challenge for modern generative algorithms. It suggests a very simple workaround for this challenge, namely, generation of a drum pattern that could be further used as a foundation for melody generation. The paper presents a large dataset of drum patterns alongside with corresponding melodies. It explores two possible methods for drum pattern generation. Exploring a latent space of drum patterns one could generate new drum patterns with a given music style. Finally, the paper demonstrates that a simple artificial neural network could be trained to generate melodies corresponding with these drum patters used as inputs. Resulting system could be used for end-to-end generation of symbolic music with song-like structure and higher long-scale correlations between the notes.

CLApr 27, 2020
Intuitive Contrasting Map for Antonym Embeddings

Igor Samenko, Alexey Tikhonov, Ivan P. Yamshchikov

This paper shows that, modern word embeddings contain information that distinguishes synonyms and antonyms despite small cosine similarities between corresponding vectors. This information is encoded in the geometry of the embeddings and could be extracted with a straight-forward and intuitive manifold learning procedure or a contrasting map. Such a map is trained on a small labeled subset of the data and can produce new embeddings that explicitly highlight specific semantic attributes of the word. The new embeddings produced by the map are shown to improve the performance on downstream tasks.

CLApr 10, 2020
Style-transfer and Paraphrase: Looking for a Sensible Semantic Similarity Metric

Ivan P. Yamshchikov, Viacheslav Shibaev, Nikolay Khlebnikov et al.

The rapid development of such natural language processing tasks as style transfer, paraphrase, and machine translation often calls for the use of semantic similarity metrics. In recent years a lot of methods to measure the semantic similarity of two short texts were developed. This paper provides a comprehensive analysis for more than a dozen of such methods. Using a new dataset of fourteen thousand sentence pairs human-labeled according to their semantic similarity, we demonstrate that none of the metrics widely used in the literature is close enough to human judgment in these tasks. A number of recently proposed metrics provide comparable results, yet Word Mover Distance is shown to be the most reasonable solution to measure semantic similarity in reformulated texts at the moment.

CLMar 12, 2020
It Means More if It Sounds Good: Yet Another Hypothesis Concerning the Evolution of Polysemous Words

Ivan P. Yamshchikov, Cyrille Merleau Nono Saha, Igor Samenko et al.

This position paper looks into the formation of language and shows ties between structural properties of the words in the English language and their polysemy. Using Ollivier-Ricci curvature over a large graph of synonyms to estimate polysemy it shows empirically that the words that arguably are easier to pronounce also tend to have multiple meanings.

CLSep 26, 2019
Decomposing Textual Information For Style Transfer

Ivan P. Yamshchikov, Viacheslav Shibaev, Aleksander Nagaev et al.

This paper focuses on latent representations that could effectively decompose different aspects of textual information. Using a framework of style transfer for texts, we propose several empirical methods to assess information decomposition quality. We validate these methods with several state-of-the-art textual style transfer methods. Higher quality of information decomposition corresponds to higher performance in terms of bilingual evaluation understudy (BLEU) between output and human-written reformulations.

CLAug 19, 2019
Style Transfer for Texts: Retrain, Report Errors, Compare with Rewrites

Alexey Tikhonov, Viacheslav Shibaev, Aleksander Nagaev et al.

This paper shows that standard assessment methodology for style transfer has several significant problems. First, the standard metrics for style accuracy and semantics preservation vary significantly on different re-runs. Therefore one has to report error margins for the obtained results. Second, starting with certain values of bilingual evaluation understudy (BLEU) between input and output and accuracy of the sentiment transfer the optimization of these two standard metrics diverge from the intuitive goal of the style transfer task. Finally, due to the nature of the task itself, there is a specific dependence between these two metrics that could be easily manipulated. Under these circumstances, we suggest taking BLEU between input and human-written reformulations into consideration for benchmarks. We also propose three new architectures that outperform state of the art in terms of this metric.

CLAug 13, 2018
What is wrong with style transfer for texts?

Alexey Tikhonov, Ivan P. Yamshchikov

A number of recent machine learning papers work with an automated style transfer for texts and, counter to intuition, demonstrate that there is no consensus formulation of this NLP task. Different researchers propose different algorithms, datasets and target metrics to address it. This short opinion paper aims to discuss possible formalization of this NLP task in anticipation of a further growing interest to it.

CLJul 17, 2018
Guess who? Multilingual approach for the automated generation of author-stylized poetry

Alexey Tikhonov, Ivan P. Yamshchikov

This paper addresses the problem of stylized text generation in a multilingual setup. A version of a language model based on a long short-term memory (LSTM) artificial neural network with extended phonetic and semantic embeddings is used for stylized poetry generation. The quality of the resulting poems generated by the network is estimated through bilingual evaluation understudy (BLEU), a survey and a new cross-entropy based metric that is suggested for the problems of such type. The experiments show that the proposed model consistently outperforms random sample and vanilla-LSTM baselines, humans also tend to associate machine generated texts with the target author.

SDMay 15, 2017
Music generation with variational recurrent autoencoder supported by history

Ivan P. Yamshchikov, Alexey Tikhonov

A new architecture of an artificial neural network that helps to generate longer melodic patterns is introduced alongside with methods for post-generation filtering. The proposed approach called variational autoencoder supported by history is based on a recurrent highway gated network combined with a variational autoencoder. Combination of this architecture with filtering heuristics allows generating pseudo-live acoustically pleasing and melodically diverse music.