CLJun 3
ComplexityMT: Benchmarking the Interaction Between Text Complexity and Machine TranslationJoseph Marvin Imperial, Junhong Liang, Belal Shoer et al.
When a text is translated, does the translation retain the complexity of the original? We introduce ComplexityMT, a new challenge for assessing how text complexity and machine translation interact with and influence each other, using the Common European Framework of Reference for Languages (CEFR) levels as the measure of text complexity. Across six languages, including Arabic, Dutch, English, French, Hindi, and Russian, we evaluate three open-weight models, one closed model, and a commercial machine translation system on two tasks: i) correlation of CEFR with translation difficulty, and ii) shifts in CEFR levels of the source texts. Our experiments show that higher CEFR levels make texts more difficult to translate, and that machine translation shifts the CEFR level of the target text compared to the original source, for most languages. These findings provide new insights for researchers and practitioners working on multilingual pedagogical content generation and machine translation difficulty estimation.
CLApr 30
Cultural Benchmarking of LLMs in Standard and Dialectal Arabic DialoguesMuhammad Dehan Al Kautsar, Saeed Almheiri, Momina Ahsan et al.
There is a significant gap in evaluating cultural reasoning in LLMs using conversational datasets that capture culturally rich and dialectal contexts. Most Arabic benchmarks focus on short text snippets in Modern Standard Arabic (MSA), overlooking the cultural nuances that naturally arise in dialogues. To address this gap, we introduce ArabCulture-Dialogue, a culturally grounded conversational dataset covering 13 Arabic-speaking countries, in both MSA and each country's respective dialect, spanning 12 daily-life topics and 54 fine-grained subtopics. We utilize the dataset to form three benchmarking tasks: (i) multiple-choice cultural reasoning, (ii) machine translation between MSA and dialects, and (iii) dialect-steering generation. Our experiments indicate that the performance gap between MSA and Arabic dialects still exists, whereby the models perform worse on all three tasks in the dialectal setup, compared to the MSA one.
CLMar 28Code
Listen, Correct, and Feed Back: Spoken Pedagogical Feedback GenerationJunhong Liang, Yifan Lu, Ekaterina Kochmar et al.
Grammatical error correction (GEC) and explanation (GEE) have made rapid progress, but real teaching scenarios also require \emph{learner-friendly pedagogical feedback} that is actionable, level-appropriate, and encouraging. We introduce \textbf{SPFG} (\textbf{S}poken \textbf{P}edagogical \textbf{F}eedback \textbf{G}eneration), a dataset built based on the Speak \& Improve Challenge 2025 corpus, pairing fluency-oriented transcriptions with GEC targets and \emph{human-verified} teacher-style feedback, including preferred/rejected feedback pairs for preference learning. We study a transcript-based Spoken Grammatical Error Correction (SGEC) setting and evaluate three instruction-tuned LLMs (Qwen2.5, Llama-3.1, and GLM-4), comparing supervised fine-tuning (SFT) with preference-based alignment (using DPO and KTO) for jointly generating corrections and feedback. Results show that SFT provides the most consistent improvements, while DPO/KTO yield smaller or mixed gains, and that correction quality and feedback quality are weakly coupled. Our implementation is available at https://github.com/Skywalker-Harrison/spfg.
CVApr 27Code
ServImage: An Image Generation and Editing Benchmark from Real-world Commercial Imaging ServicesFengxian Ji, Jingpu Yang, Zirui Song et al.
Recent image generation and editing models demonstrate robust adherence to instructions and high visual quality on academic benchmarks. However, their performance on paid, real-world design projects remains uncertain. We introduce \textbf{ServImage}, a benchmark that explicitly correlates model outputs with economic value in commercial design projects. ServImage consists of (i) \textbf{\textit{ServImageBench}}: a dataset of 1.07k paid commercial design tasks and 2.05k designer deliverables totaling over \$295k, covering portrait, product, and digital content, along with 33k candidate images and 33k human annotations. (ii) \textbf{\textit{ServImageScore}}: an integrated scoring system that combines three quality dimensions: baseline requirements fulfilment, visual execution quality, and commercial necessity satisfaction. These three dimensions are designed to characterize the factors that drive human payment decisions and indicate whether an image is commercially acceptable. (iii) \textbf{\textit{ServImageModel}}: under this scoring system, we propose a payment prediction model trained on the human-annotated candidate images, achieving 82.00\% accuracy in predicting human payment decisions and producing calibrated payment probabilities. ServImage provides a comprehensive foundation for assessing the commercial viability of image generation models and offers a scalable resource for future research on economically grounded vision systems \href{https://github.com/FengxianJi/ServImage}{Github.}
CLSep 22, 2025Code
Vision Language Models Are Not (Yet) Spelling CorrectorsJunhong Liang, Bojun Zhang
Spelling correction from visual input poses unique challenges for vision language models (VLMs), as it requires not only detecting but also correcting textual errors directly within images. We present ReViCo (Real Visual Correction), the first benchmark that systematically evaluates VLMs on real-world visual spelling correction across Chinese and English. ReViCo contains naturally occurring errors collected from real-world image data and supports fine-grained evaluation at both image and token levels. Through comprehensive experiments on representative cascaded (Qwen) and native (InternVL) open-source models, as well as closed-source systems (GPT-4o, Claude), we show that current VLMs fall significantly short of human performance, particularly in correction. To address these limitations, we explore two solution paradigms: a Joint OCR-Correction pipeline and a Background Information enhanced approach, both of which yield consistent performance gains. Our analysis highlights fundamental limitations of existing architectures and provides actionable insights for advancing multimodal spelling correction.
CLApr 26, 2025
RAIR: Retrieval-Augmented Iterative Refinement for Chinese Spelling CorrectionJunhong Liang, Yu Zhou
Chinese Spelling Correction (CSC) aims to detect and correct erroneous tokens in sentences. Traditional CSC focuses on equal length correction and uses pretrained language models (PLMs). While Large Language Models (LLMs) have shown remarkable success in identifying and rectifying potential errors, they often struggle with adapting to domain-specific corrections, especially when encountering terminologies in specialized domains. To address domain adaptation, we propose a \textbf{R}etrieval-\textbf{A}ugmented \textbf{I}terative \textbf{R}efinement (RAIR) framework. Our approach constructs a retrieval corpus adaptively from domain-specific training data and dictionaries, employing a fine-tuned retriever to ensure that the retriever catches the error correction pattern. We also extend equal-length into variable-length correction scenarios. Extensive experiments demonstrate that our framework outperforms current approaches in domain spelling correction and significantly improves the performance of LLMs in variable-length scenarios.
CLDec 4, 2024
PERL: Pinyin Enhanced Rephrasing Language Model for Chinese ASR N-best Error CorrectionJunhong Liang, Bojun Zhang
Existing Chinese ASR correction methods have not effectively utilized Pinyin information, a unique feature of the Chinese language. In this study, we address this gap by proposing a \textbf{P}inyin \textbf{E}nhanced \textbf{R}ephrasing \textbf{L}anguage model (PERL) pipeline, designed explicitly for N-best correction scenarios. We conduct experiments on the Aishell-1 dataset and our newly proposed DoAD dataset. The results show that our approach outperforms baseline methods, achieving a 29.11\% reduction in Character Error Rate on Aishell-1 and around 70\% CER reduction on domain-specific datasets. PERL predicts the correct length of the output, leveraging the Pinyin information, which is embedded with a semantic model to perform phonetically similar corrections. Extensive experiments demonstrate the effectiveness of correcting wrong characters using N-best output and the low latency of our model.
CLFeb 15, 2024
An Analysis of Language Frequency and Error Correction for EsperantoJunhong Liang
Current Grammar Error Correction (GEC) initiatives tend to focus on major languages, with less attention given to low-resource languages like Esperanto. In this article, we begin to bridge this gap by first conducting a comprehensive frequency analysis using the Eo-GP dataset, created explicitly for this purpose. We then introduce the Eo-GEC dataset, derived from authentic user cases and annotated with fine-grained linguistic details for error identification. Leveraging GPT-3.5 and GPT-4, our experiments show that GPT-4 outperforms GPT-3.5 in both automated and human evaluations, highlighting its efficacy in addressing Esperanto's grammatical peculiarities and illustrating the potential of advanced language models to enhance GEC strategies for less commonly studied languages.