Lifei Liu

h-index: 23

3 papers

2,878 citations

3 Papers

10.2 SE May 30

When Safe Skills Collide: Measuring Compositional Risk in Agent Skill Ecosystems

Su Wang, Pin Qian, Yihang Chen et al.

LLM agents increasingly rely on community-contributed skills that expand an agent's operational capability set. We study a core safety problem in agentic AI systems: whether individually safe skills can compose into unsafe installed skill sets. We present SkillReact, a compositional security measurement framework with three components: a deterministic static-composition benchmark, a two-rater LLM-assisted human-adjudication pipeline, and an action-based exploitability harness. On 1,520 ClawHub skills, 651 pass individual inspection and form 211,575 pairs; the benchmark flags 22.25% of these as structural candidates. We treat this raw rate as a recall-oriented scanner ceiling and calibrate it against human judgment: in a pattern-stratified audit, roughly one in five flagged pair-pattern hits survives as a real compositional risk (population-weighted validity 18.2%, our headline result), implying about 14K genuine risk memberships in a single registry that per-skill scanning misses by construction, since every pair is individually safe. An action-based harness then probes when these candidates become model-issued tool calls, and finds realization gated by host-model disposition: on an anchor-conditioned dropper subset, Haiku-4-5 issues the dropper-stage tool call on all 39 direct-prompt trials (36 of them the full download-then-execute chain, 3 download-only), Opus-4-7 stops at the download, and Sonnet-4-6 refuses outright. A control that holds the request fixed and varies only the installed skills finds compliance highest with no skills installed: a composition fixes which capabilities are reachable, while the host model decides whether to use them. Together these motivate install-time compositional checks and capability isolation as complements to per-skill scanning.
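The pair count quoted in the abstract can be checked with a few lines of arithmetic. The sketch below assumes that "pairs" means unordered pairs of distinct skills. The flagged-pair count is only an approximation, since it is derived from the rounded 22.25% rate. The variable names are illustrative, not taken from the paper. The "about 14K" figure refers to pair-pattern hits rather than pairs, so it is not reconstructed here.

```python
from math import comb

safe_skills = 651   # skills that pass individual inspection (from the abstract)
flag_rate = 0.2225  # share of pairs flagged as structural candidates (rounded)

# Unordered pairs of distinct skills: n * (n - 1) / 2
pairs = comb(safe_skills, 2)

# Approximate number of flagged pairs implied by the rounded rate
flagged_pairs = round(pairs * flag_rate)

print(pairs)
print(flagged_pairs)
```

The pair count matches the 211,575 stated in the abstract, which is consistent with the unordered-pair reading.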

4.8 SE Jun 21 Code

Habituation at the Gate: Rising Approval and Declining Scrutiny in Human Review of AI Agent Code

Haoran Yu, Lifei Liu, Xiaochong Jiang et al.

As AI coding agents (e.g., GitHub Copilot, Devin, OpenAI Codex, Cursor) submit pull requests to open-source repositories at scale, a key question arises: do human reviewers gradually lower their scrutiny for AI-generated code over time? We conduct a longitudinal within-reviewer analysis using the AIDev dataset, studying 400 repeat reviewers who collectively submitted 11,429 reviews over a seven-month observation period. Comparing each reviewer's early and late review episodes, we observe a population-level shift in approval rate from 30.1% to 36.8% (Wilcoxon signed-rank p < 10^{-6} on paired shifts). Pooled by within-reviewer experience decile, the cumulative gap reaches +14.5 pp from first to tenth decile. This shift is experience-driven (persists after controlling for calendar time), agent-specific (human PR approval rates decline over the same period), and not explained by PR difficulty (median PR size is flat). However, review latency increases rather than decreases (+3.5x), while inline comment volume decreases (-22%, p=0.0014), suggesting reviewers spend more time in queue but less time actively inspecting code. The combination of rising approval, declining comment effort, and increasing queue time is most consistent with reflexive habituation under growing workload rather than rational trust calibration alone.
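A within-reviewer early-versus-late comparison of the kind described above can be sketched as follows. The data are hypothetical, and the split rule (first half versus last half of each reviewer's chronologically ordered reviews) is an assumption. The abstract does not say how early and late episodes are defined, and the paper's significance test is not reproduced here.

```python
# Hypothetical review logs, NOT the AIDev data:
# reviewer -> chronologically ordered decisions (True = approved).
reviews = {
    "r1": [False, False, True, False, True, True],
    "r2": [False, True, False, True],
    "r3": [True, False, False, False, True, True, True, False],
}

def approval_rate(decisions):
    """Fraction of approved reviews in a list of boolean decisions."""
    return sum(decisions) / len(decisions)

def early_late_shift(decisions):
    """Late-half approval rate minus early-half approval rate.

    Assumed split: first half vs. last half; with an odd count the
    middle review is dropped so both halves have equal size.
    """
    half = len(decisions) // 2
    return approval_rate(decisions[-half:]) - approval_rate(decisions[:half])

# One paired shift per reviewer; a paired test would operate on these values.
shifts = {reviewer: early_late_shift(d) for reviewer, d in reviews.items()}
print(shifts)
```

Computing one shift per reviewer keeps the comparison paired, so differences in baseline strictness between reviewers cancel out.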

5.1 SE Jun 21

Beyond Simpson's Paradox: A Cascade of Confounders in AI Agent Pull-Request Co-Authorship

Haoran Yu, Xiaochong Jiang, Lifei Liu et al.

Pooled across five AI coding agents, pull requests (PRs) with a human Co-Authored-By trailer merge less often than purely-autonomous ones (53.8% vs. 79.8%) -- yet this aggregate finding is a textbook Simpson's Paradox. Stratifying 33,596 PRs from the AIDev dataset by agent identity reverses the conclusion: Copilot and Devin show large positive within-agent gaps (+41.2 and +33.5 pp, both p<0.001), while Cursor, Claude Code, and Codex show small effects whose cross-sectional 95% CIs span zero. The paradox is driven entirely by agent composition: Codex, which dominates 64.9% of the dataset, achieves high merge rates while rarely using co-authorship. But Simpson's Paradox is only the first layer of a cascade of confounders: within-repo controls eliminate Devin's gap (+33.5 to +1.6 pp, p=0.73); a commit-count control further halves Copilot's within-repo gap (+36.2 to +24.4 pp); restricted to multi-commit PRs, the Copilot within-repo effect dissolves to +4.8 pp (p=0.59). No agent retains a clear co-authorship effect once both repository selection and PR structure are controlled. Our findings caution against reporting agent-pooled statistics without stratification and demonstrate that cross-sectional co-authorship associations are largely selection and PR-structure artefacts rather than evidence of a causal benefit.
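The reversal described above can be reproduced with a small toy example. The counts below are hypothetical and chosen only to show the mechanism; they are not the AIDev figures, and the two agents are placeholders. The high-volume agent merges often and rarely uses co-authorship, so it dominates the pooled autonomous group.

```python
# Hypothetical counts, NOT the AIDev data: agent -> group -> (merged, total).
counts = {
    "agent_a": {"coauthored": (85, 100), "autonomous": (4920, 6000)},
    "agent_b": {"coauthored": (450, 900), "autonomous": (300, 1000)},
}

def rate(merged, total):
    """Merge rate as a fraction."""
    return merged / total

def pooled_rate(group):
    """Merge rate for one group after pooling all agents together."""
    merged = sum(c[group][0] for c in counts.values())
    total = sum(c[group][1] for c in counts.values())
    return rate(merged, total)

# Within-agent gap: co-authored minus autonomous merge rate, per agent.
within_gaps = {
    agent: rate(*c["coauthored"]) - rate(*c["autonomous"])
    for agent, c in counts.items()
}

# Pooled gap: the same difference after ignoring agent identity.
pooled_gap = pooled_rate("coauthored") - pooled_rate("autonomous")

print(within_gaps)
print(pooled_gap)
```

In this toy data both within-agent gaps are positive while the pooled gap is negative, which is the sign flip that stratifying by agent identity exposes.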