CLDec 2, 2024Code
A Comprehensive Evaluation of Semantic Relation Knowledge of Pretrained Language Models and HumansZhihan Cao, Hiroaki Yamada, Simone Teufel et al.
Recently, much work has concerned itself with the enigma of what exactly pretrained language models~(PLMs) learn about different aspects of language, and how they learn it. One stream of this type of research investigates the knowledge that PLMs have about semantic relations. However, many aspects of semantic relations were left unexplored. Generally, only one relation has been considered, namely hypernymy. Furthermore, previous work did not measure humans' performance on the same task as that performed by the PLMs. This means that at this point in time, there is only an incomplete view of the extent of these models' semantic relation knowledge. To address this gap, we introduce a comprehensive evaluation framework covering five relations beyond hypernymy, namely hyponymy, holonymy, meronymy, antonymy, and synonymy. We use five metrics (two newly introduced here) for recently untreated aspects of semantic relation knowledge, namely soundness, completeness, symmetry, prototypicality, and distinguishability. Using these, we can fairly compare humans and models on the same task. Our extensive experiments involve six PLMs, four masked and two causal language models. The results reveal a significant knowledge gap between humans and models for all semantic relations. In general, causal language models, despite their wide use, do not always perform significantly better than masked language models. Antonymy is the outlier relation where all models perform reasonably well. The evaluation materials can be found at https://github.com/hancules/ProbeResponses.
CLDec 3, 2024
Misalignment of Semantic Relation Knowledge between WordNet and Human IntuitionZhihan Cao, Hiroaki Yamada, Simone Teufel et al.
WordNet provides a carefully constructed repository of semantic relations, created by specialists. But there is another source of information on semantic relations, the intuition of language users. We present the first systematic study of the degree to which these two sources are aligned. Investigating the cases of misalignment could make proper use of WordNet and facilitate its improvement. Our analysis which uses templates to elicit responses from human participants, reveals a general misalignment of semantic relation knowledge between WordNet and human intuition. Further analyses find a systematic pattern of mismatch among synonymy and taxonomic relations~(hypernymy and hyponymy), together with the fact that WordNet path length does not serve as a reliable indicator of human intuition regarding hypernymy or hyponymy relations.
CRMar 6
Alkaid: Resilience to Edit Errors in Provably Secure Steganography via Distance-Constrained EncodingZhihan Cao, Gaolei Li, Jun Wu et al.
While provably secure steganography provides strong concealment by ensuring stego carriers are indistinguishable from natural samples, such systems remain vulnerable to real-world edit errors (e.g., insertions, deletions, substitutions) because their decoding depends on perfect synchronization and lacks error-correcting capability. To bridge this gap, we propose Alkaid, a provably secure steganographic scheme resilient to edit errors via distance-constrained encoding. The key innovation integrates the minimum distance decoding principle directly into the encoding process by enforcing a strict lower bound on the edit distance between codewords of different messages. Specifically, if two candidate codewords violate this bound, they are merged to represent the same message, thereby guaranteeing reliable recovery. While maintaining provable security, we theoretically prove that Alkaid offers deterministic robustness against bounded errors. To implement this scheme efficiently, we adopt block-wise and batch processing. Extensive experiments demonstrate that Alkaid achieves decoding success rates of 99\% to 100\% across diverse error channels, delivers a payload of 0.2 bits per token for high embedding capacity, and maintains an encoding speed of 6.72 bits per second, significantly surpassing state-of-the-art (SOTA) methods in robustness, capacity, and efficiency.
CLNov 28, 2025
JBE-QA: Japanese Bar Exam QA Dataset for Assessing Legal Domain KnowledgeZhihan Cao, Fumihito Nishino, Hiroaki Yamada et al.
We introduce JBE-QA, a Japanese Bar Exam Question-Answering dataset to evaluate large language models' legal knowledge. Derived from the multiple-choice (tanto-shiki) section of the Japanese bar exam (2015-2024), JBE-QA provides the first comprehensive benchmark for Japanese legal-domain evaluation of LLMs. It covers the Civil Code, the Penal Code, and the Constitution, extending beyond the Civil Code focus of prior Japanese resources. Each question is decomposed into independent true/false judgments with structured contextual fields. The dataset contains 3,464 items with balanced labels. We evaluate 26 LLMs, including proprietary, open-weight, Japanese-specialised, and reasoning models. Our results show that proprietary models with reasoning enabled perform best, and the Constitution questions are generally easier than the Civil Code or the Penal Code questions.
CLSep 15, 2025
On the Distinctive Co-occurrence Characteristics of AntonymyZhihan Cao, Hiroaki Yamada, Takenobu Tokunaga
Antonymy has long received particular attention in lexical semantics. Previous studies have shown that antonym pairs frequently co-occur in text, across genres and parts of speech, more often than would be expected by chance. However, whether this co-occurrence pattern is distinctive of antonymy remains unclear, due to a lack of comparison with other semantic relations. This work fills the gap by comparing antonymy with three other relations across parts of speech using robust co-occurrence metrics. We find that antonymy is distinctive in three respects: antonym pairs co-occur with high strength, in a preferred linear order, and within short spans. All results are available online.