Xinyi Hou

SE
h-index13
12papers
1,419citations
Novelty33%
AI Score53

12 Papers

SEAug 21, 2023Code
Large Language Models for Software Engineering: A Systematic Literature Review

Xinyi Hou, Yanjie Zhao, Yue Liu et al.

Large Language Models (LLMs) have significantly impacted numerous domains, including Software Engineering (SE). Many recent publications have explored LLMs applied to various SE tasks. Nevertheless, a comprehensive understanding of the application, effects, and possible limitations of LLMs on SE is still in its early stages. To bridge this gap, we conducted a systematic literature review (SLR) on LLM4SE, with a particular focus on understanding how LLMs can be exploited to optimize processes and outcomes. We select and analyze 395 research papers from January 2017 to January 2024 to answer four key research questions (RQs). In RQ1, we categorize different LLMs that have been employed in SE tasks, characterizing their distinctive features and uses. In RQ2, we analyze the methods used in data collection, preprocessing, and application, highlighting the role of well-curated datasets for successful LLM for SE implementation. RQ3 investigates the strategies employed to optimize and evaluate the performance of LLMs in SE. Finally, RQ4 examines the specific SE tasks where LLMs have shown success to date, illustrating their practical contributions to the field. From the answers to these RQs, we discuss the current state-of-the-art and trends, identifying gaps in existing research, and flagging promising areas for future study. Our artifacts are publicly available at https://github.com/xinyi-hou/LLM4SE_SLR.

CRJul 11, 2024
On the (In)Security of LLM App Stores

Xinyi Hou, Yanjie Zhao, Haoyu Wang

LLM app stores have seen rapid growth, leading to the proliferation of numerous custom LLM apps. However, this expansion raises security concerns. In this study, we propose a three-layer concern framework to identify the potential security risks of LLM apps, i.e., LLM apps with abusive potential, LLM apps with malicious intent, and LLM apps with exploitable vulnerabilities. Over five months, we collected 786,036 LLM apps from six major app stores: GPT Store, FlowGPT, Poe, Coze, Cici, and Character.AI. Our research integrates static and dynamic analysis, the development of a large-scale toxic word dictionary (i.e., ToxicDict) comprising over 31,783 entries, and automated monitoring tools to identify and mitigate threats. We uncovered that 15,146 apps had misleading descriptions, 1,366 collected sensitive personal information against their privacy policies, and 15,996 generated harmful content such as hate speech, self-harm, extremism, etc. Additionally, we evaluated the potential for LLM apps to facilitate malicious activities, finding that 616 apps could be used for malware generation, phishing, etc. Our findings highlight the urgent need for robust regulatory frameworks and enhanced enforcement mechanisms.

58.4SEMay 4Code
CommitSuite: A Comprehensive Benchmark for Commit Classification and Message Generation

Zirui Wan, Zhaonan Wu, Xinyi Hou et al.

High-quality commit messages are critical for maintaining software projects, yet ensuring their consistency and informativeness remains a practical challenge. While the Conventional Commits Specification (CCS) provides a structured format for commit messages, research on CCS-based commit classification and commit message generation (CMG) is limited by the absence of large-scale benchmarks, semantic annotations, and reliable evaluation methods. In this paper, we introduce CommitSuite, a benchmark comprising 63,533 CCS-compliant commits from 243 open-source repositories across seven programming languages. Each commit is labeled with its CCS type and enriched with AST-level code changes, along with LLM-assisted semantic annotations that capture the "what" and "why" behind the change. To evaluate CMG systems, we propose a reference-free framework based on five binary metrics: rationality, comprehensiveness, non-redundancy, authenticity, and logicality, enabling semantic-level assessment without relying on human-written references. Our experiments show that LLMs can effectively support both generation and evaluation, with evaluation achieving 0.849 Cohen's Kappa agreement against human judgments. CommitSuite offers a unified resource for structured commit understanding and facilitates reproducible research on commit classification and generation.

CRMay 5, 2025Code
Unveiling the Landscape of LLM Deployment in the Wild: An Empirical Study

Xinyi Hou, Jiahao Han, Yanjie Zhao et al.

Large language models (LLMs) are increasingly deployed through open-source and commercial frameworks, enabling individuals and organizations to self-host advanced LLM capabilities. As LLM deployments become prevalent, particularly in industry, ensuring their secure and reliable operation has become a critical issue. However, insecure defaults and misconfigurations often expose LLM services to the public internet, posing serious security and system engineering risks. This study conducted a large-scale empirical investigation of public-facing LLM deployments, focusing on the prevalence of services, exposure characteristics, systemic vulnerabilities, and associated risks. Through internet-wide measurements, we identified 320,102 public-facing LLM services across 15 frameworks and extracted 158 unique API endpoints, categorized into 12 functional groups based on functionality and security risk. Our analysis found that over 40% of endpoints used plain HTTP, and over 210,000 endpoints lacked valid TLS metadata. API exposure was highly inconsistent: some frameworks, such as Ollama, responded to over 35% of unauthenticated API requests, with about 15% leaking model or system information, while other frameworks implemented stricter controls. We observed widespread use of insecure protocols, poor TLS configurations, and unauthenticated access to critical operations. These security risks, such as model leakage, system compromise, and unauthorized access, are pervasive and highlight the need for a secure-by-default framework and stronger deployment practices.

CRMar 30, 2025
Model Context Protocol (MCP): Landscape, Security Threats, and Future Research Directions

Xinyi Hou, Yanjie Zhao, Shenao Wang et al.

The Model Context Protocol (MCP) is an emerging open standard that defines a unified, bi-directional communication and dynamic discovery protocol between AI models and external tools or resources, aiming to enhance interoperability and reduce fragmentation across diverse systems. This paper presents a systematic study of MCP from both architectural and security perspectives. We first define the full lifecycle of an MCP server, comprising four phases (creation, deployment, operation, and maintenance), further decomposed into 16 key activities that capture its functional evolution. Building on this lifecycle analysis, we construct a comprehensive threat taxonomy that categorizes security and privacy risks across four major attacker types: malicious developers, external attackers, malicious users, and security flaws, encompassing 16 distinct threat scenarios. To validate these risks, we develop and analyze real-world case studies that demonstrate concrete attack surfaces and vulnerability manifestations within MCP implementations. Based on these findings, the paper proposes a set of fine-grained, actionable security safeguards tailored to each lifecycle phase and threat category, offering practical guidance for secure MCP adoption. We also analyze the current MCP landscape, covering industry adoption, integration patterns, and supporting tools, to identify its technological strengths as well as existing limitations that constrain broader deployment. Finally, we outline future research and development directions aimed at strengthening MCP's standardization, trust boundaries, and sustainable growth within the evolving ecosystem of tool-augmented AI systems.

93.3CRMay 8
Demystifying and Detecting Agentic Workflow Injection Vulnerabilities in GitHub Actions

Shenao Wang, Xinyi Hou, Zhao Liu et al.

GitHub Actions is increasingly used to deploy LLM-based agents for repository-centric tasks such as issue triage, pull-request review, code modification, and release assistance. These agentic workflows extend traditional CI/CD automation with agentic capabilities but also create a new injection surface. In this paper, we introduce Agentic Workflow Injection (AWI), a workflow-level injection flaw where untrusted GitHub event context, such as issue bodies, pull-request descriptions, or comments, is incorporated into agent prompts or agent-consumed inputs and converted into attacker-influenced behavior through agent tools or downstream workflow logic. We identify two core AWI patterns: Prompt-to-Agent (P2A), where untrusted content reaches an agent prompt boundary, and Prompt-to-Script (P2S), where attacker influence propagates through model- or agent-derived outputs into later scripts. We present the first systematic study of AWI in GitHub Actions. We characterize 1,033 real-world AI-assisted actions and extract AWI-specific taint specifications, including prompt boundaries, derived outputs, agentic capabilities, and access-control interfaces. Based on these specifications, we design TaintAWI, a taint-analysis tool that tracks flows from untrusted event context to agent prompt inputs and security-sensitive workflow sinks. Applying TaintAWI to 13,392 real-world agentic workflows from 10,792 repositories, we report 519 potential AWI vulnerabilities, of which 496 are confirmed exploitable under our threat model, yielding a precision of 95.6%. Among them, 343 are previously unknown zero-day vulnerabilities. We prioritized disclosure for 187 zero-day cases, received 26 maintainer responses, and 24 cases have been accepted or fixed at the time of writing.

83.5SEMay 8
Unsafe by Flow: Uncovering Bidirectional Data-Flow Risks in MCP Ecosystem

Xinyi Hou, Yanjie Zhao, Haoyu Wang

Model Context Protocol (MCP) have quickly become the interface layer between LLM agents and external tools, yet they also introduce unsafe data flows that existing analyzers handle poorly. Vulnerabilities manifest in two directions: requester-controlled arguments may propagate to sensitive operations, while untrusted external or sensitive internal data may surface through MCP-visible outputs and subsequently influence host or model behavior. Accurate detection is complicated by the heterogeneous registration and dispatch patterns MCP servers employ, the need for MCP-specific taint semantics, and the fact that bugs often only materialize along complete tool-scoped execution paths. We present MCP-BiFlow, a bidirectional static analysis framework built around MCP-aware entrypoint recovery, protocol-specific taint modeling, and interprocedural propagation analysis. Against a benchmark of 32 confirmed MCP vulnerability cases, MCP-BiFlow identifies 30 (93.8% recall), substantially outperforming CodeQL, Semgrep, Snyk Code, and MCPScan. Across 15,452 real-world MCP server repositories, MCP-BiFlow surfaces 549 overlap-compressed candidate clusters; manual review confirms 118 vulnerability paths in 87 servers, establishing unsafe propagation as a recurring failure mode that resists detection without protocol-aware recovery of both request-side and return-side flows.

LGMay 16, 2024
GPT Store Mining and Analysis

Dongxun Su, Yanjie Zhao, Xinyi Hou et al.

As a pivotal extension of the renowned ChatGPT, the GPT Store serves as a dynamic marketplace for various Generative Pre-trained Transformer (GPT) models, shaping the frontier of conversational AI. This paper presents an in-depth measurement study of the GPT Store, with a focus on the categorization of GPTs by topic, factors influencing GPT popularity, and the potential security risks. Our investigation starts with assessing the categorization of GPTs in the GPT Store, analyzing how they are organized by topics, and evaluating the effectiveness of the classification system. We then examine the factors that affect the popularity of specific GPTs, looking into user preferences, algorithmic influences, and market trends. Finally, the study delves into the security risks of the GPT Store, identifying potential threats and evaluating the robustness of existing security measures. This study offers a detailed overview of the GPT Store's current state, shedding light on its operational dynamics and user interaction patterns. Our findings aim to enhance understanding of the GPT ecosystem, providing valuable insights for future research, development, and policy-making in generative AI.

SEJun 30, 2025
Software Engineering for Large Language Models: Research Status, Challenges and the Road Ahead

Hongzhou Rao, Yanjie Zhao, Xinyi Hou et al.

The rapid advancement of large language models (LLMs) has redefined artificial intelligence (AI), pushing the boundaries of AI research and enabling unbounded possibilities for both academia and the industry. However, LLM development faces increasingly complex challenges throughout its lifecycle, yet no existing research systematically explores these challenges and solutions from the perspective of software engineering (SE) approaches. To fill the gap, we systematically analyze research status throughout the LLM development lifecycle, divided into six phases: requirements engineering, dataset construction, model development and enhancement, testing and evaluation, deployment and operations, and maintenance and evolution. We then conclude by identifying the key challenges for each phase and presenting potential research directions to address these challenges. In general, we provide valuable insights from an SE perspective to facilitate future advances in LLM development.

SEMar 6, 2025
LLM Applications: Current Paradigms and the Next Frontier

Xinyi Hou, Yanjie Zhao, Haoyu Wang

The development of large language models (LLMs) has given rise to four major application paradigms: LLM app stores, LLM agents, self-hosted LLM services, and LLM-powered devices. Each has its advantages but also shares common challenges. LLM app stores lower the barrier to development but lead to platform lock-in; LLM agents provide autonomy but lack a unified communication mechanism; self-hosted LLM services enhance control but increase deployment complexity; and LLM-powered devices improve privacy and real-time performance but are limited by hardware. This paper reviews and analyzes these paradigms, covering architecture design, application ecosystem, research progress, as well as the challenges and open problems they face. Based on this, we outline the next frontier of LLM applications, characterizing them through three interconnected layers: infrastructure, protocol, and application. We describe their responsibilities and roles of each layer and demonstrate how to mitigate existing fragmentation limitations and improve security and scalability. Finally, we discuss key future challenges, identify opportunities such as protocol-driven cross-platform collaboration and device integration, and propose a research roadmap for openness, security, and sustainability.

SEAug 26, 2025
LaQual: A Novel Framework for Automated Evaluation of LLM App Quality

Yan Wang, Xinyi Hou, Yanjie Zhao et al.

LLM app stores are quickly emerging as platforms that gather a wide range of intelligent applications based on LLMs, giving users many choices for content creation, coding support, education, and more. However, the current methods for ranking and recommending apps in these stores mostly rely on static metrics like user activity and favorites, which makes it hard for users to efficiently find high-quality apps. To address these challenges, we propose LaQual, an automated framework for evaluating the quality of LLM apps. LaQual consists of three main stages: first, it labels and classifies LLM apps in a hierarchical way to accurately match them to different scenarios; second, it uses static indicators, such as time-weighted user engagement and functional capability metrics, to filter out low-quality apps; and third, it conducts a dynamic, scenario-adaptive evaluation, where the LLM itself generates scenario-specific evaluation metrics, scoring rules, and tasks for a thorough quality assessment. Experiments on a popular LLM app store show that LaQual is effective. Its automated scores are highly consistent with human judgments (with Spearman's rho of 0.62 and p=0.006 in legal consulting, and rho of 0.60 and p=0.009 in travel planning). By effectively screening, LaQual can reduce the pool of candidate LLM apps by 66.7% to 81.3%. User studies further confirm that LaQual significantly outperforms baseline systems in decision confidence, comparison efficiency (with average scores of 5.45 compared to 3.30), and the perceived value of its evaluation reports (4.75 versus 2.25). Overall, these results demonstrate that LaQual offers a scalable, objective, and user-centered solution for finding and recommending high-quality LLM apps in real-world use cases.

AINov 12, 2024
LLM App Squatting and Cloning

Yinglin Xie, Xinyi Hou, Yanjie Zhao et al.

Impersonation tactics, such as app squatting and app cloning, have posed longstanding challenges in mobile app stores, where malicious actors exploit the names and reputations of popular apps to deceive users. With the rapid growth of Large Language Model (LLM) stores like GPT Store and FlowGPT, these issues have similarly surfaced, threatening the integrity of the LLM app ecosystem. In this study, we present the first large-scale analysis of LLM app squatting and cloning using our custom-built tool, LLMappCrazy. LLMappCrazy covers 14 squatting generation techniques and integrates Levenshtein distance and BERT-based semantic analysis to detect cloning by analyzing app functional similarities. Using this tool, we generated variations of the top 1000 app names and found over 5,000 squatting apps in the dataset. Additionally, we observed 3,509 squatting apps and 9,575 cloning cases across six major platforms. After sampling, we find that 18.7% of the squatting apps and 4.9% of the cloning apps exhibited malicious behavior, including phishing, malware distribution, fake content dissemination, and aggressive ad injection.