HebID: Detecting Social Identities in Hebrew-language Political Text
HebID provides a resource for studying social identities in Hebrew political contexts, with potential applications to other non-English languages. The contribution is incremental in that it adapts existing methods to a new language and domain.
The authors address the lack of non-English datasets for social identity detection by introducing HebID, a multilabel Hebrew corpus of 5,536 sentences from Israeli politicians' Facebook posts. Hebrew-tuned LLMs give the best results on it, with a macro-F1 score of 0.74. The authors then apply the classifier to analyze identity expression in political discourse and compare it with public survey data.
Political language is deeply intertwined with social identities. While social identities are often shaped by specific cultural contexts and expressed through particular uses of language, existing datasets for group and identity detection are predominantly English-centric, single-label, and focused on coarse identity categories. We introduce HebID, the first multilabel Hebrew corpus for social identity detection: 5,536 sentences from Israeli politicians' Facebook posts (Dec 2018-Apr 2021), manually annotated for twelve nuanced social identities (e.g. Rightist, Ultra-Orthodox, Socially-oriented) grounded in survey data. We benchmark multilabel and single-label encoders alongside 2B-9B-parameter generative LLMs, finding that Hebrew-tuned LLMs provide the best results (macro-$F_1$ = 0.74). We apply our classifier to politicians' Facebook posts and parliamentary speeches, evaluating differences in popularity, temporal trends, clustering patterns, and gender-related variations in identity expression. We also use identity choices from a national public survey to compare the identities portrayed in elite discourse with the public's identity priorities. HebID provides a comprehensive foundation for studying social identities in Hebrew and can serve as a model for similar research in other non-English political contexts.
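To make the reported metric concrete, the following is a minimal sketch of how a macro-$F_1$ score is typically computed for multilabel predictions: an $F_1$ score is calculated separately for each identity label and the per-label scores are averaged with equal weight. This is not the authors' evaluation code. The function name `macro_f1`, the toy sentences, and the convention of scoring a label as 0.0 when it has no gold or predicted instances are all assumptions made for illustration; only the three label names come from the abstract.

```python
from typing import List, Set


def macro_f1(gold: List[Set[str]], pred: List[Set[str]], labels: List[str]) -> float:
    """Unweighted mean of per-label F1 scores over a multilabel dataset."""
    scores = []
    for label in labels:
        # Count true positives, false positives, and false negatives for this label.
        tp = sum(1 for g, p in zip(gold, pred) if label in g and label in p)
        fp = sum(1 for g, p in zip(gold, pred) if label not in g and label in p)
        fn = sum(1 for g, p in zip(gold, pred) if label in g and label not in p)
        denom = 2 * tp + fp + fn
        # Assumed convention: a label with no gold or predicted instances scores 0.0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data: each set holds the identities attached to one sentence.
labels = ["Rightist", "Ultra-Orthodox", "Socially-oriented"]
gold = [{"Rightist"}, {"Ultra-Orthodox", "Socially-oriented"}, set(), {"Rightist", "Socially-oriented"}]
pred = [{"Rightist"}, {"Ultra-Orthodox"}, {"Socially-oriented"}, {"Rightist", "Socially-oriented"}]

score = macro_f1(gold, pred, labels)
print(f"macro-F1 = {score:.3f}")
```

Because every label contributes equally to the average, rare identities weigh as much as frequent ones, which is why macro-averaging is a common choice when label frequencies are imbalanced.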