Chenyu Hou

AI
h-index7
4papers
29citations
Novelty54%
AI Score45

4 Papers

AIJan 22
Benchmarking Text-to-Python against Text-to-SQL: The Impact of Explicit Logic and Ambiguity

Hangle Hu, Chenyu Hou, Bin Cao et al.

While Text-to-SQL remains the dominant approach for database interaction, real-world analytics increasingly require the flexibility of general-purpose programming languages such as Python or Pandas to manage file-based data and complex analytical workflows. Despite this growing need, the reliability of Text-to-Python in core data retrieval remains underexplored relative to the mature SQL ecosystem. To address this gap, we introduce BIRD-Python, a benchmark designed for cross-paradigm evaluation. We systematically refined the original dataset to reduce annotation noise and align execution semantics, thereby establishing a consistent and standardized baseline for comparison. Our analysis reveals a fundamental paradigmatic divergence: whereas SQL leverages implicit DBMS behaviors through its declarative structure, Python requires explicit procedural logic, making it highly sensitive to underspecified user intent. To mitigate this challenge, we propose the Logic Completion Framework (LCF), which resolves ambiguity by incorporating latent domain knowledge into the generation process. Experimental results show that (1) performance differences primarily stem from missing domain context rather than inherent limitations in code generation, and (2) when these gaps are addressed, Text-to-Python achieves performance parity with Text-to-SQL. These findings establish Python as a viable foundation for analytical agents-provided that systems effectively ground ambiguous natural language inputs in executable logical specifications. Resources are available at https://anonymous.4open.science/r/Bird-Python-43B7/.

46.1IRMar 13
FGTR: Fine-Grained Multi-Table Retrieval via Hierarchical LLM Reasoning

Chaojie Sun, Bin Cao, Tiantian Li et al.

With the rapid advancement of large language models (LLMs), growing efforts have been made on LLM-based table retrieval. However, existing studies typically focus on single-table query, and implement it by similarity matching after encoding the entire table. These methods usually result in low accuracy due to their coarse-grained encoding which incorporates much query-irrelated data, and are also inefficient when dealing with large tables, failing to fully utilize the reasoning capabilities of LLM. Further, multi-table query is under-explored in retrieval tasks. To this end, we propose a hierarchical multi-table query method based on LLM: Fine-Grained Multi-Table Retrieval FGTR, a new retrieval paradigm that employs a human-like reasoning strategy. Through hierarchical reasoning, FGTR first identifies relevant schema elements and then retrieves the corresponding cell contents, ultimately constructing a concise and accurate sub-table that aligns with the given query. To comprehensively evaluate the performance of FGTR, we construct two new benchmark datasets based on Spider and BIRD . Experimental results show that FGTR outperforms previous state-of-the-art methods, improving the F_2 metric by 18% on Spider and 21% on BIRD, demonstrating its effectiveness in enhancing fine-grained retrieval and its potential to improve end-to-end performance on table-based downstream tasks.

HCSep 24, 2025
MazeMate: An LLM-Powered Chatbot to Support Computational Thinking in Gamified Programming Learning

Chenyu Hou, Hua Yu, Gaoxia Zhu et al.

Computational Thinking (CT) is a foundational problem-solving skill, and gamified programming environments are a widely adopted approach to cultivating it. While large language models (LLMs) provide on-demand programming support, current applications rarely foster CT development. We present MazeMate, an LLM-powered chatbot embedded in a 3D Maze programming game, designed to deliver adaptive, context-sensitive scaffolds aligned with CT processes in maze solving and maze design. We report on the first classroom implementation with 247 undergraduates. Students rated MazeMate as moderately helpful, with higher perceived usefulness for maze solving than for maze design. Thematic analysis confirmed support for CT processes such as decomposition, abstraction, and algorithmic thinking, while also revealing limitations in supporting maze design, including mismatched suggestions and fabricated algorithmic solutions. These findings demonstrate the potential of LLM-based scaffolding to support CT and underscore directions for design refinement to enhance MazeMate usability in authentic classrooms.

CYMay 23, 2023
Embrace Opportunities and Face Challenges: Using ChatGPT in Undergraduate Students' Collaborative Interdisciplinary Learning

Gaoxia Zhu, Xiuyi Fan, Chenyu Hou et al.

ChatGPT, launched in November 2022, has gained widespread attention from students and educators globally, with an online report by Hu (2023) stating it as the fastest-growing consumer application in history. While discussions on the use of ChatGPT in higher education are abundant, empirical studies on its impact on collaborative interdisciplinary learning are rare. To investigate its potential, we conducted a quasi-experimental study with 130 undergraduate students (STEM and non-STEM) learning digital literacy with or without ChatGPT over two weeks. Weekly surveys were conducted on collaborative interdisciplinary problem-solving, physical and cognitive engagement, and individual reflections on ChatGPT use. Analysis of survey responses showed significant main effects of topics on collaborative interdisciplinary problem-solving and physical and cognitive engagement, a marginal interaction effect between disciplinary backgrounds and ChatGPT conditions for cognitive engagement, and a significant interaction effect for physical engagement. Sentiment analysis of student reflections suggested no significant difference between STEM and non-STEM students' opinions towards ChatGPT. Qualitative analysis of reflections generated eight positive themes, including efficiency, addressing knowledge gaps, and generating human-like responses, and eight negative themes, including generic responses, lack of innovation, and counterproductive to self-discipline and thinking. Our findings suggest that ChatGPT use needs to be optimized by considering the topics being taught and the disciplinary backgrounds of students rather than applying it uniformly. These findings have implications for both pedagogical research and practices.