Jingyao Liu

CL
h-index9
4papers
24citations
Novelty60%
AI Score48

4 Papers

SEOct 16, 2025Code
E2Edev: Benchmarking Large Language Models in End-to-End Software Development Task

Jingyao Liu, Chen Huang, Zhizhao Guan et al.

The rapid advancement in large language models (LLMs) has demonstrated significant potential in End-to-End Software Development (E2ESD). However, existing E2ESD benchmarks are limited by coarse-grained requirement specifications and unreliable evaluation protocols, hindering a true understanding of current framework capabilities. To address these limitations, we present E2EDev, a novel benchmark grounded in the principles of Behavior-Driven Development (BDD), which evaluates the capabilities of E2ESD frameworks by assessing whether the generated software meets user needs through mimicking real user interactions (Figure 1). E2EDev comprises (i) a fine-grained set of user requirements, (ii) multiple BDD test scenarios with corresponding Python step implementations for each requirement, and (iii) a fully automated testing pipeline built on the Behave framework. To ensure its quality while reducing the annotation effort, E2EDev leverages our proposed Human-in-the-Loop Multi-Agent Annotation Framework (HITL-MAA). By evaluating various E2ESD frameworks and LLM backbones with E2EDev, our analysis reveals a persistent struggle to effectively solve these tasks, underscoring the critical need for more effective and cost-efficient E2ESD solutions. Our codebase and benchmark are publicly available at https://github.com/SCUNLP/E2EDev.

CLFeb 17, 2025
Investigating Inference-time Scaling for Chain of Multi-modal Thought: A Preliminary Study

Yujie Lin, Ante Wang, Moye Chen et al.

Recently, inference-time scaling of chain-of-thought (CoT) has been demonstrated as a promising approach for addressing multi-modal reasoning tasks. While existing studies have predominantly centered on text-based thinking, the integration of both visual and textual modalities within the reasoning process remains unexplored. In this study, we pioneer the exploration of inference-time scaling with multi-modal thought, aiming to bridge this gap. To provide a comprehensive analysis, we systematically investigate popular sampling-based and tree search-based inference-time scaling methods on 10 challenging tasks spanning various domains. Besides, we uniformly adopt a consistency-enhanced verifier to ensure effective guidance for both methods across different thought paradigms. Results show that multi-modal thought promotes better performance against conventional text-only thought, and blending the two types of thought fosters more diverse thinking. Despite these advantages, multi-modal thoughts necessitate higher token consumption for processing richer visual inputs, which raises concerns in practical applications. We hope that our findings on the merits and drawbacks of this research line will inspire future works in the field.

CLDec 23, 2024
A Dual-Perspective Metaphor Detection Framework Using Large Language Models

Yujie Lin, Jingyao Liu, Yan Gao et al.

Metaphor detection, a critical task in natural language processing, involves identifying whether a particular word in a sentence is used metaphorically. Traditional approaches often rely on supervised learning models that implicitly encode semantic relationships based on metaphor theories. However, these methods often suffer from a lack of transparency in their decision-making processes, which undermines the reliability of their predictions. Recent research indicates that LLMs (large language models) exhibit significant potential in metaphor detection. Nevertheless, their reasoning capabilities are constrained by predefined knowledge graphs. To overcome these limitations, we propose DMD, a novel dual-perspective framework that harnesses both implicit and explicit applications of metaphor theories to guide LLMs in metaphor detection and adopts a self-judgment mechanism to validate the responses from the aforementioned forms of guidance. In comparison to previous methods, our framework offers more transparent reasoning processes and delivers more reliable predictions. Experimental results prove the effectiveness of DMD, demonstrating state-of-the-art performance across widely-used datasets.

CLJul 31, 2025
Beyond Passive Critical Thinking: Fostering Proactive Questioning to Enhance Human-AI Collaboration

Ante Wang, Yujie Lin, Jingyao Liu et al.

Critical thinking is essential for building robust AI systems, preventing them from blindly accepting flawed data or biased reasoning. However, prior work has primarily focused on passive critical thinking, where models simply reject problematic queries without taking constructive steps to address user requests. In this work, we introduce proactive critical thinking, a paradigm where models actively seek missing or clarifying information from users to resolve their queries better. To evaluate this capability, we present GSM-MC and GSM-MCE, two novel benchmarks based on GSM8K for assessing mathematical reasoning under incomplete or misleading conditions. GSM-MC contains 1,368 math problems with a key variable deliberately removed, requiring models to identify and request the missing information. GSM-MCE further increases the difficulty by introducing irrelevant details to test robustness against distractions. Experiments on Qwen3 and Llama series models show that, while these models excel in traditional reasoning tasks due to extensive post-training and inference-time scaling, they struggle with proactive critical thinking, especially smaller ones. However, we demonstrate that reinforcement learning (RL) can significantly improve this ability. Using our enhanced RL algorithm, we achieve substantial gains, boosting the Qwen3-1.7B's accuracy from 0.15% to 73.98% on GSM-MC. We hope this work advances models that collaborate more effectively with users in problem-solving through proactive critical thinking.