Songwu Lu

h-index62

5papers

46citations

Novelty52%

AI Score46

Ranked #35,335 of 194,257 authors (top 18%)#7,228 in CL (top 23%)

5 Papers

9.2DCNov 10, 2023Code

CloudEval-YAML: A Practical Benchmark for Cloud Configuration Generation

Yifei Xu, Yuning Chen, Xumiao Zhang et al.

Among the thriving ecosystem of cloud computing and the proliferation of Large Language Model (LLM)-based code generation tools, there is a lack of benchmarking for code generation in cloud-native applications. In response to this need, we present CloudEval-YAML, a practical benchmark for cloud configuration generation. CloudEval-YAML tackles the diversity challenge by focusing on YAML, the de facto standard of numerous cloud-native tools. We develop the CloudEval-YAML benchmark with practicality in mind: the dataset consists of hand-written problems with unit tests targeting practical scenarios. We further enhanced the dataset to meet practical needs by rephrasing questions in a concise, abbreviated, and bilingual manner. The dataset consists of 1011 problems that take more than 1200 human hours to complete. To improve practicality during evaluation, we build a scalable evaluation platform for CloudEval-YAML that achieves a 20 times speedup over a single machine. To the best of our knowledge, the CloudEval-YAML dataset is the first hand-written dataset targeting cloud-native applications. We present an in-depth evaluation of 12 LLMs, leading to a deeper understanding of the problems and LLMs, as well as effective methods to improve task performance and reduce cost.

2.1CLFeb 24

SibylSense: Adaptive Rubric Learning via Memory Tuning and Adversarial Probing

Yifei Xu, Guilherme Potje, Shivam Shandilya et al.

Designing aligned and robust rewards for open-ended generation remains a key barrier to RL post-training. Rubrics provide structured, interpretable supervision, but scaling rubric construction is difficult: expert rubrics are costly, prompted rubrics are often superficial or inconsistent, and fixed-pool discriminative rubrics can saturate and drift, enabling reward hacking. We present SibylSense, an inference-time learning approach that adapts a frozen rubric generator through a tunable memory bank of validated rubric items. Memory is updated via verifier-based item rewards measured by reference-candidate answer discriminative gaps from a handful of examples. SibylSense alternates memory tuning with a rubric-adversarial policy update that produces rubric-satisfying candidate answers, shrinking discriminative gaps and driving the rubric generator to capture new quality dimensions. Experiments on two open-ended tasks show that SibylSense yields more discriminative rubrics and improves downstream RL performance over static and non-adaptive baselines.

17.0CLJun 16, 2025

Direct Reasoning Optimization: LLMs Can Reward And Refine Their Own Reasoning for Open-Ended Tasks

Yifei Xu, Tusher Chakraborty, Srinagesh Sharma et al.

Recent advances in Large Language Models (LLMs) have showcased impressive reasoning abilities in structured tasks like mathematics and programming, largely driven by Reinforcement Learning with Verifiable Rewards (RLVR), which uses outcome-based signals that are scalable, effective, and robust against reward hacking. However, applying similar techniques to open-ended long-form reasoning tasks remains challenging due to the absence of generic, verifiable reward signals. To address this, we propose Direct Reasoning Optimization (DRO), a reinforcement learning framework for fine-tuning LLMs on open-ended, particularly long-form, reasoning tasks, guided by a new reward signal: the Reasoning Reflection Reward (R3). At its core, R3 selectively identifies and emphasizes key tokens in the reference outcome that reflect the influence of the model's preceding chain-of-thought reasoning, thereby capturing the consistency between reasoning and reference outcome at a fine-grained level. Crucially, R3 is computed internally using the same model being optimized, enabling a fully self-contained training setup. Additionally, we introduce a dynamic data filtering strategy based on R3 for open-ended reasoning tasks, reducing cost while improving downstream performance. We evaluate DRO on two diverse datasets -- ParaRev, a long-form paragraph revision task, and FinQA, a math-oriented QA benchmark -- and show that it consistently outperforms strong baselines while remaining broadly applicable across both open-ended and structured domains.

13.9CLFeb 19, 2025

RLTHF: Targeted Human Feedback for LLM Alignment

Yifei Xu, Tusher Chakraborty, Emre Kıcıman et al.

Fine-tuning large language models (LLMs) to align with user preferences is challenging due to the high cost of quality human annotations in Reinforcement Learning from Human Feedback (RLHF) and the generalizability limitations of AI Feedback. To address these challenges, we propose RLTHF, a human-AI hybrid framework that combines LLM-based initial alignment with selective human annotations to achieve full-human annotation alignment with minimal effort. RLTHF identifies hard-to-annotate samples mislabeled by LLMs using a reward model's reward distribution and iteratively enhances alignment by integrating strategic human corrections while leveraging LLM's correctly labeled samples. Evaluations on HH-RLHF and TL;DR datasets show that RLTHF reaches full-human annotation-level alignment with only 6-7% of the human annotation effort. Furthermore, models trained on RLTHF's curated datasets for downstream tasks outperform those trained on fully human-annotated datasets, underscoring the effectiveness of RLTHF.

3.2CROct 29, 2015

New Threats to SMS-Assisted Mobile Internet Services from 4G LTE: Lessons Learnt from Distributed Mobile-Initiated Attacks towards Facebook and Other Services

Guan-Hua Tu, Yuanjie Li, Chunyi Peng et al.

Mobile Internet is becoming the norm. With more personalized mobile devices in hand, many services choose to offer alternative, usually more convenient, approaches to authenticating and delivering the content between mobile users and service providers. One main option is to use SMS (i.e., short messaging service). Such carrier-grade text service has been widely used to assist versatile mobile services, including social networking, banking, to name a few. Though the text service can be spoofed via certain Internet text service providers which cooperated with carriers, such attacks haven well studied and defended by industry due to the efforts of research community. However, as cellular network technology advances to the latest IP-based 4G LTE, we find that these mobile services are somehow exposed to new threats raised by this change, particularly on 4G LTE Text service (via brand-new distributed Mobile-Initiated Spoofed SMS attack which is not available in legacy 2G/3G systems). The reason is that messaging service over LTE shifts from the circuit-switched (CS) design to the packet-switched (PS) paradigm as 4G LTE supports PS only. Due to this change, 4G LTE Text Service becomes open to access. However, its shields to messaging integrity and user authentication are not in place. As a consequence, such weaknesses can be exploited to launch attacks (e.g., hijack Facebook accounts) against a targeted individual, a large scale of mobile users and even service providers, from mobile devices. Current defenses for Internet-Initiated Spoofed SMS attacks cannot defend the unprecedented attack. Our study shows that 53 of 64 mobile services over 27 industries are vulnerable to at least one threat. We validate these proof-of-concept attacks in one major US carrier which supports more than 100 million users. We finally propose quick fixes and discuss security insights and lessons we have learnt.