Chen Shani

CL
h-index26
16papers
2,068citations
Novelty48%
AI Score55

16 Papers

CLNov 3, 2023
Towards Concept-Aware Large Language Models

Chen Shani, Jilles Vreeken, Dafna Shahaf

Concepts play a pivotal role in various human cognitive functions, including learning, reasoning and communication. However, there is very little work on endowing machines with the ability to form and reason with concepts. In particular, state-of-the-art large language models (LLMs) work at the level of tokens, not concepts. In this work, we analyze how well contemporary LLMs capture human concepts and their structure. We then discuss ways to develop concept-aware LLMs, taking place at different stages of the pipeline. We sketch a method for pretraining LLMs using concepts, and also explore the simpler approach that uses the output of existing LLMs. Despite its simplicity, our proof-of-concept is shown to better match human intuition, as well as improve the robustness of predictions. These preliminary results underscore the promise of concept-aware LLMs.

LGNov 15, 2022
The Lean Data Scientist: Recent Advances towards Overcoming the Data Bottleneck

Chen Shani, Jonathan Zarecki, Dafna Shahaf

Machine learning (ML) is revolutionizing the world, affecting almost every field of science and industry. Recent algorithms (in particular, deep networks) are increasingly data-hungry, requiring large datasets for training. Thus, the dominant paradigm in ML today involves constructing large, task-specific datasets. However, obtaining quality datasets of such magnitude proves to be a difficult challenge. A variety of methods have been proposed to address this data bottleneck problem, but they are scattered across different areas, and it is hard for a practitioner to keep up with the latest developments. In this work, we propose a taxonomy of these methods. Our goal is twofold: (1) We wish to raise the community's awareness of the methods that already exist and encourage more efficient use of resources, and (2) we hope that such a taxonomy will contribute to our understanding of the problem, inspiring novel ideas and strategies to replace current annotation-heavy approaches.

CLNov 3, 2023
FAME: Flexible, Scalable Analogy Mappings Engine

Shahar Jacob, Chen Shani, Dafna Shahaf

Analogy is one of the core capacities of human cognition; when faced with new situations, we often transfer prior experience from other domains. Most work on computational analogy relies heavily on complex, manually crafted input. In this work, we relax the input requirements, requiring only names of entities to be mapped. We automatically extract commonsense representations and use them to identify a mapping between the entities. Unlike previous works, our framework can handle partial analogies and suggest new entities to be added. Moreover, our method's output is easily interpretable, allowing for users to understand why a specific mapping was chosen. Experiments show that our model correctly maps 81.2% of classical 2x2 analogy problems (guess level=50%). On larger problems, it achieves 77.8% accuracy (mean guess level=13.1%). In another experiment, we show our algorithm outperforms human performance, and the automatic suggestions of new entities resemble those suggested by humans. We hope this work will advance computational analogy by paving the way to more flexible, realistic input requirements, with broader applicability.

CLJan 16
Beyond Tokens: Concept-Level Training Objectives for LLMs

Laya Iyer, Pranav Somani, Alice Guo et al.

The next-token prediction (NTP) objective has been foundational in the development of modern large language models (LLMs), driving advances in fluency and generalization. However, NTP operates at the \textit{token} level, treating deviations from a single reference continuation as errors even when alternative continuations are equally plausible or semantically equivalent (e.g., ``mom'' vs. ``mother''). As a result, token-level loss can penalize valid abstractions, paraphrases, or conceptually correct reasoning paths, biasing models toward surface form rather than underlying meaning. This mismatch between the training signal and semantic correctness motivates learning objectives that operate over higher-level representations. We propose a shift from token-level to concept-level prediction, where concepts group multiple surface forms of the same idea (e.g., ``mom,'' ``mommy,'' ``mother'' $\rightarrow$ \textit{MOTHER}). We introduce various methods for integrating conceptual supervision into LLM training and show that concept-aware models achieve lower perplexity, improved robustness under domain shift, and stronger performance than NTP-based models on diverse NLP benchmarks. This suggests \textit{concept-level supervision} as an improved training signal that better aligns LLMs with human semantic abstractions.

CLJan 12
The Roots of Performance Disparity in Multilingual Language Models: Intrinsic Modeling Difficulty or Design Choices?

Chen Shani, Yuval Reif, Nathan Roll et al.

Multilingual language models (LMs) promise broader NLP access, yet current systems deliver uneven performance across the world's languages. This survey examines why these gaps persist and whether they reflect intrinsic linguistic difficulty or modeling artifacts. We organize the literature around two questions: do linguistic disparities arise from representation and allocation choices (e.g., tokenization, encoding, data exposure, parameter sharing) rather than inherent complexity; and which design choices mitigate inequities across typologically diverse languages. We review linguistic features, such as orthography, morphology, lexical diversity, syntax, information density, and typological distance, linking each to concrete modeling mechanisms. Gaps often shrink when segmentation, encoding, and data exposure are normalized, suggesting much apparent difficulty stems from current modeling choices. We synthesize these insights into design recommendations for tokenization, sampling, architectures, and evaluation to support more balanced multilingual LMs.

8.6CLMar 31
Concept Training for Human-Aligned Language Models

Christine Zhang, Dan Jurafsky, Chen Shani

The next-token prediction (NTP) objective trains language models to predict a single continuation token at each step. In natural language, however, a prefix can be continued in many valid ways, and even similar meanings may differ in surface form. For example, the sentence ``this website is safe to \underline{browse}'' could plausibly continue with words such as browse, search, visit, surf, or navigate. While standard NTP training treats these alternatives as mutually exclusive targets, we explore a framework that instead predicts concepts, approximated as sets of semantically related tokens. We show that models trained with concept supervision exhibit stronger alignment with human semantic similarity judgments on multiple lexical benchmarks. These gains are accompanied by lower perplexity on semantically meaningful words (definition in Section 3.1), and a modest increase in global token-level perplexity, reflecting a tradeoff between standard NTP optimization and concept-level supervision. Our results suggest that concept-level objectives can improve semantic alignment while maintaining competitive language modeling performance.

CLMay 21, 2025
From Tokens to Thoughts: How LLMs and Humans Trade Compression for Meaning

Chen Shani, Liron Soffer, Dan Jurafsky et al.

Humans organize knowledge into compact categories that balance compression with semantic meaning preservation. Large Language Models (LLMs) demonstrate striking linguistic abilities, yet whether they achieve this same balance remains unclear. We apply the Information Bottleneck principle to quantitatively compare how LLMs and humans navigate this compression-meaning trade-off. Analyzing embeddings from 40+ LLMs against classic human categorization benchmarks, we uncover three key findings. First, LLMs broadly align with human categories but miss fine-grained semantic distinctions crucial for human understanding. Second, LLMs demonstrate aggressive statistical compression, achieving ``optimal'' information-theoretic efficiency, while humans prioritize contextual richness and adaptive flexibility. Third, encoder models surprisingly outperform decoder models in human alignment, suggesting that generation and understanding rely on distinct mechanisms in current architectures. In addition, training dynamics analysis reveals that conceptual structure develops in distinct phases: rapid initial formation followed by architectural reorganization, with semantic processing migrating from deeper to mid-network layers as models discover more efficient encoding. These divergent strategies, where LLMs optimize for compression and humans for adaptive utility, reveal fundamental differences between artificial and biological intelligence, guiding development toward more human-aligned AI.

CYApr 14, 2025
Labeling Messages as AI-Generated Does Not Reduce Their Persuasive Effects

Isabel O. Gallegos, Chen Shani, Weiyan Shi et al.

As generative artificial intelligence (AI) enables the creation and dissemination of information at massive scale and speed, it is increasingly important to understand how people perceive AI-generated content. One prominent policy proposal requires explicitly labeling AI-generated content to increase transparency and encourage critical thinking about the information, but prior research has not yet tested the effects of such labels. To address this gap, we conducted a survey experiment (N=1601) on a diverse sample of Americans, presenting participants with an AI-generated message about several public policies (e.g., allowing colleges to pay student-athletes), randomly assigning whether participants were told the message was generated by (a) an expert AI model, (b) a human policy expert, or (c) no label. We found that messages were generally persuasive, influencing participants' views of the policies by 9.74 percentage points on average. However, while 94.6% of participants assigned to the AI and human label conditions believed the authorship labels, labels had no significant effects on participants' attitude change toward the policies, judgments of message accuracy, nor intentions to share the message with others. These patterns were robust across a variety of participant characteristics, including prior knowledge of the policy, prior experience with AI, political party, education level, or age. Taken together, these results imply that, while authorship labels would likely enhance transparency, they are unlikely to substantially affect the persuasiveness of the labeled content, highlighting the need for alternative strategies to address challenges posed by AI-generated information.

HCApr 4, 2025
Measuring Mental Health Variables in Computational Research: Toward Validated, Dimensional, and Transdiagnostic Approaches

Chen Shani, Elizabeth C. Stade

Computational mental health research develops models to predict and understand psychological phenomena, but often relies on inappropriate measures of psychopathology constructs, undermining validity. We identify three key issues: (1) reliance on unvalidated measures (e.g., self-declared diagnosis) over validated ones (e.g., diagnosis by clinician); (2) treating mental health constructs as categorical rather than dimensional; and (3) focusing on disorder-specific constructs instead of transdiagnostic ones. We outline the benefits of using validated, dimensional, and transdiagnostic measures and offer practical recommendations for practitioners. Using valid measures that reflect the nature and structure of psychopathology is essential for computational mental health research.

LGOct 8, 2025
Bridged Clustering for Representation Learning: Semi-Supervised Sparse Bridging

Patrick Peixuan Ye, Chen Shani, Ellen Vitercik

We introduce Bridged Clustering, a semi-supervised framework to learn predictors from any unpaired input $X$ and output $Y$ dataset. Our method first clusters $X$ and $Y$ independently, then learns a sparse, interpretable bridge between clusters using only a few paired examples. At inference, a new input $x$ is assigned to its nearest input cluster, and the centroid of the linked output cluster is returned as the prediction $\hat{y}$. Unlike traditional SSL, Bridged Clustering explicitly leverages output-only data, and unlike dense transport-based methods, it maintains a sparse and interpretable alignment. Through theoretical analysis, we show that with bounded mis-clustering and mis-bridging rates, our algorithm becomes an effective and efficient predictor. Empirically, our method is competitive with SOTA methods while remaining simple, model-agnostic, and highly label-efficient in low-supervision settings.

CLAug 26, 2025
One Joke to Rule them All? On the (Im)possibility of Generalizing Humor

Mor Turgeman, Chen Shani, Dafna Shahaf

Humor is a broad and complex form of communication that remains challenging for machines. Despite its broadness, most existing research on computational humor traditionally focused on modeling a specific type of humor. In this work, we wish to understand whether competence on one or more specific humor tasks confers any ability to transfer to novel, unseen types; in other words, is this fragmentation inevitable? This question is especially timely as new humor types continuously emerge in online and social media contexts (e.g., memes, anti-humor, AI fails). If Large Language Models (LLMs) are to keep up with this evolving landscape, they must be able to generalize across humor types by capturing deeper, transferable mechanisms. To investigate this, we conduct a series of transfer learning experiments across four datasets, representing different humor tasks. We train LLMs under varied diversity settings (1-3 datasets in training, testing on a novel task). Experiments reveal that models are capable of some transfer, and can reach up to 75% accuracy on unseen datasets; training on diverse sources improves transferability (1.88-4.05%) with minimal-to-no drop in in-domain performance. Further analysis suggests relations between humor types, with Dad Jokes surprisingly emerging as the best enabler of transfer (but is difficult to transfer to). We release data and code.

CLApr 29, 2025
Cooking Up Creativity: Enhancing LLM Creativity through Structured Recombination

Moran Mizrahi, Chen Shani, Gabriel Stanovsky et al.

Large Language Models (LLMs) excel at many tasks, yet they struggle to produce truly creative, diverse ideas. In this paper, we introduce a novel approach that enhances LLM creativity. We apply LLMs for translating between natural language and structured representations, and perform the core creative leap via cognitively inspired manipulations on these representations. Our notion of creativity goes beyond superficial token-level variations; rather, we recombine structured representations of existing ideas, enabling our system to effectively explore a more abstract landscape of ideas. We demonstrate our approach in the culinary domain with DishCOVER, a model that generates creative recipes. Experiments and domain-expert evaluations reveal that our outputs, which are mostly coherent and feasible, significantly surpass GPT-4o in terms of novelty and diversity, thus outperforming it in creative generation. We hope our work inspires further research into structured creativity in AI.

CLFeb 8, 2025
Rethinking Word Similarity: Semantic Similarity through Classification Confusion

Kaitlyn Zhou, Haishan Gao, Sarah Chen et al.

Word similarity has many applications to social science and cultural analytics tasks like measuring meaning change over time and making sense of contested terms. Yet traditional similarity methods based on cosine similarity between word embeddings cannot capture the context-dependent, asymmetrical, polysemous nature of semantic similarity. We propose a new measure of similarity, Word Confusion, that reframes semantic similarity in terms of feature-based classification confusion. Word Confusion is inspired by Tversky's suggestion that similarity features be chosen dynamically. Here we train a classifier to map contextual embeddings to word identities and use the classifier confusion (the probability of choosing a confounding word c instead of the correct target word t) as a measure of the similarity of c and t. The set of potential confounding words acts as the chosen features. Our method is comparable to cosine similarity in matching human similarity judgments across several datasets (MEN, WirdSim353, and SimLex), and can measure similarity using predetermined features of interest. We demonstrate our model's ability to make use of dynamic features by applying it to test a hypothesis about changes in the 18th C. meaning of the French word "revolution" from popular to state action during the French Revolution. We hope this reimagining of semantic similarity will inspire the development of new tools that better capture the multi-faceted and dynamic nature of language, advancing the fields of computational social science and cultural analytics and beyond.

CLJun 6, 2021
How Did This Get Funded?! Automatically Identifying Quirky Scientific Achievements

Chen Shani, Nadav Borenstein, Dafna Shahaf

Humor is an important social phenomenon, serving complex social and psychological functions. However, despite being studied for millennia humor is computationally not well understood, often considered an AI-complete problem. In this work, we introduce a novel setting in humor mining: automatically detecting funny and unusual scientific papers. We are inspired by the Ig Nobel prize, a satirical prize awarded annually to celebrate funny scientific achievements (example past winner: "Are cows more likely to lie down the longer they stand?"). This challenging task has unique characteristics that make it particularly suitable for automatic learning. We construct a dataset containing thousands of funny papers and use it to learn classifiers, combining findings from psychology and linguistics with recent advances in NLP. We use our models to identify potentially funny papers in a large dataset of over 630,000 articles. The results demonstrate the potential of our methods, and more broadly the utility of integrating state-of-the-art NLP methods with insights from more traditional disciplines.

HCMay 12, 2021
"Alexa, what do you do for fun?" Characterizing playful requests with virtual assistants

Chen Shani, Alexander Libov, Sofia Tolmach et al.

Virtual assistants such as Amazon's Alexa, Apple's Siri, Google Home, and Microsoft's Cortana, are becoming ubiquitous in our daily lives and successfully help users in various daily tasks, such as making phone calls or playing music. Yet, they still struggle with playful utterances, which are not meant to be interpreted literally. Examples include jokes or absurd requests or questions such as, "Are you afraid of the dark?", "Who let the dogs out?", or "Order a zillion gummy bears". Today, virtual assistants often return irrelevant answers to such utterances, except for hard-coded ones addressed by canned replies. To address the challenge of automatically detecting playful utterances, we first characterize the different types of playful human-virtual assistant interaction. We introduce a taxonomy of playful requests rooted in theories of humor and refined by analyzing real-world traffic from Alexa. We then focus on one node, personification, where users refer to the virtual assistant as a person ("What do you do for fun?"). Our conjecture is that understanding such utterances will improve user experience with virtual assistants. We conducted a Wizard-of-Oz user study and showed that endowing virtual assistant s with the ability to identify humorous opportunities indeed has the potential to increase user satisfaction. We hope this work will contribute to the understanding of the landscape of the problem and inspire novel ideas and techniques towards the vision of giving virtual assistants a sense of humor.

CLMay 1, 2020
Language (Re)modelling: Towards Embodied Language Understanding

Ronen Tamari, Chen Shani, Tom Hope et al.

While natural language understanding (NLU) is advancing rapidly, today's technology differs from human-like language understanding in fundamental ways, notably in its inferior efficiency, interpretability, and generalization. This work proposes an approach to representation and learning based on the tenets of embodied cognitive linguistics (ECL). According to ECL, natural language is inherently executable (like programming languages), driven by mental simulation and metaphoric mappings over hierarchical compositions of structures and schemata learned through embodied interaction. This position paper argues that the use of grounding by metaphoric inference and simulation will greatly benefit NLU systems, and proposes a system architecture along with a roadmap towards realizing this vision.