Tuo Zhou

CR
h-index: 29
Papers: 5
Citations: 234
Novelty: 39%
AI Score: 52

5 Papers

CR · Feb 17 · Code
SecCodeBench-V2 Technical Report

Longfei Chen, Ji Zhao, Lanxiao Cui et al.

We introduce SecCodeBench-V2, a publicly released benchmark for evaluating Large Language Model (LLM) copilots' capabilities of generating secure code. SecCodeBench-V2 comprises 98 generation and fix scenarios derived from Alibaba Group's industrial productions, where the underlying security issues span 22 common CWE (Common Weakness Enumeration) categories across five programming languages: Java, C, Python, Go, and JavaScript. SecCodeBench-V2 adopts a function-level task formulation: each scenario provides a complete project scaffold and requires the model to implement or patch a designated target function under fixed interfaces and dependencies. For each scenario, SecCodeBench-V2 provides executable proof-of-concept (PoC) test cases for both functional validation and security verification. All test cases are authored and double-reviewed by security experts, ensuring high fidelity, broad coverage, and reliable ground truth. Beyond the benchmark itself, we build a unified evaluation pipeline that assesses models primarily via dynamic execution. For most scenarios, we compile and run model-generated artifacts in isolated environments and execute PoC test cases to validate both functional correctness and security properties. For scenarios where security issues cannot be adjudicated with deterministic test cases, we additionally employ an LLM-as-a-judge oracle. To summarize performance across heterogeneous scenarios and difficulty levels, we design a Pass@K-based scoring protocol with principled aggregation over scenarios and severity, enabling holistic and comparable evaluation across models. Overall, SecCodeBench-V2 provides a rigorous and reproducible foundation for assessing the security posture of AI coding assistants, with results and artifacts released at https://alibaba.github.io/sec-code-bench. The benchmark is publicly available at https://github.com/alibaba/sec-code-bench.
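The Pass@K-based scoring mentioned in the abstract can be illustrated with a minimal sketch. The `pass_at_k` function below is the widely used unbiased estimator (one minus the probability that k samples drawn without replacement from n generations all fail). The severity weights and the `weighted_score` aggregation are assumptions made purely for illustration; the abstract does not state the report's actual weights or aggregation rule.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples, drawn
    without replacement from n generations of which c pass, is a pass."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples exist, so any draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical severity weights (an assumption, not taken from the report).
SEVERITY_WEIGHT = {"medium": 1.0, "high": 2.0, "critical": 4.0}


def weighted_score(results, k: int) -> float:
    """Severity-weighted mean of per-scenario pass@k.

    results: iterable of (severity, n, c) tuples, one per scenario.
    """
    results = list(results)
    total_weight = sum(SEVERITY_WEIGHT[sev] for sev, _, _ in results)
    weighted_sum = sum(
        SEVERITY_WEIGHT[sev] * pass_at_k(n, c, k) for sev, n, c in results
    )
    return weighted_sum / total_weight
```

For example, `pass_at_k(10, 3, 1)` is the plain pass rate 3/10, and under the assumed weights a failed critical scenario lowers the aggregate more than a failed medium one.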

AI · Feb 28, 2024 · Code
Data Interpreter: An LLM Agent For Data Science

Sirui Hong, Yizhang Lin, Bang Liu et al. · tencent-ai, tsinghua

Large Language Model (LLM)-based agents have shown effectiveness across many applications. However, their use in data science scenarios requiring solving long-term interconnected tasks, dynamic data adjustments and domain expertise remains challenging. Previous approaches primarily focus on individual tasks, making it difficult to assess the complete data science workflow. Moreover, they struggle to handle real-time changes in intermediate data and fail to adapt dynamically to evolving task dependencies inherent to data science problems. In this paper, we present Data Interpreter, an LLM-based agent designed to automatically solve various data science problems end-to-end. Our Data Interpreter incorporates two key modules: 1) Hierarchical Graph Modeling, which breaks down complex problems into manageable subproblems, enabling dynamic node generation and graph optimization; and 2) Programmable Node Generation, a technique that refines and verifies each subproblem to iteratively improve code generation results and robustness. Extensive experiments consistently demonstrate the superiority of Data Interpreter. On InfiAgent-DABench, it achieves a 25% performance boost, raising accuracy from 75.9% to 94.9%. For machine learning and open-ended tasks, it improves performance from 88% to 95%, and from 60% to 97%, respectively. Moreover, on the MATH dataset, Data Interpreter achieves remarkable performance with a 26% improvement compared to state-of-the-art baselines. The code is available at https://github.com/geekan/MetaGPT.
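The idea of breaking a data science problem into a graph of dependent subproblems can be sketched with a small task graph executed in dependency order. This is a static illustration only, using the standard-library `graphlib` module; it does not reproduce Data Interpreter's dynamic node generation or graph optimization, and the task names and the `run_task_graph` helper are hypothetical.

```python
from graphlib import TopologicalSorter


def run_task_graph(deps, actions):
    """Run each task after its prerequisites.

    deps: task name -> set of prerequisite task names.
    actions: task name -> callable taking the results so far.
    """
    results = {}
    for task in TopologicalSorter(deps).static_order():
        results[task] = actions[task](results)
    return results


# Hypothetical four-step pipeline: load -> clean -> features -> train.
deps = {
    "load": set(),
    "clean": {"load"},
    "features": {"clean"},
    "train": {"features"},
}
actions = {
    "load": lambda r: [3, None, 5, 8],
    "clean": lambda r: [x for x in r["load"] if x is not None],
    "features": lambda r: [x * 2 for x in r["clean"]],
    "train": lambda r: sum(r["features"]) / len(r["features"]),
}

results = run_task_graph(deps, actions)
```

Because every node reads only from `results`, a failed or revised node could be rerun together with its downstream tasks without repeating the whole pipeline, which is the property a graph formulation is meant to provide.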

40.0 · DC · Apr 12
CIR: Lightweight Container Image for Cross-Platform Deployment

Fengzhi Li, Xiaohui Peng, Qingru Xu et al.

In modern cloud and heterogeneous distributed infrastructures, container images are widely used as the deployment unit for machine learning applications. An image bundles the application with its entire platform-specific execution environment and can be directly launched into a container instance. However, this approach forces developers to build and maintain separate images for each target deployment platform. This limitation is particularly evident for widely used interpreted languages such as Python and R in data analytics and machine learning, where application code is inherently cross-platform, yet the runtime dependencies are highly platform-specific. With emerging computing paradigms such as sky computing and edge computing, which demand seamless workload migration and cross-platform deployment, traditional images not only introduce inefficiencies in storage and network usage, but also impose substantial burdens on developers, who must repeatedly craft and manage platform-specific builds. To address these challenges, we propose a lazy-build approach that defers platform-specific construction to the deployment stage, thus keeping the image itself cross-platform. To enable this, we introduce a new image format, CIR (Container Intermediate Representation), together with its pre-builder and lazy-builder. CIR targets interpreted-language applications and only stores the identifiers of the application's direct dependencies, leaving platform adaptation to the lazy-builder, which at deployment time assembles the actual dependencies into runnable containers. A single CIR can therefore be deployed across heterogeneous platforms while reducing image size by 95% compared to conventional images that bundle all dependencies. In our evaluation, CIR reduces deployment time by 40-60% compared with pre-built images, outperforming state-of-the-art systems such as Docker, Buildah, and Apptainer.
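The lazy-build idea (store only dependency identifiers, resolve platform-specific artifacts at deployment time) can be sketched as follows. Everything here is an illustrative assumption: the `Manifest` type, the `ARTIFACT_INDEX` table, the artifact file names, and the `lazy_build` function are hypothetical and do not describe the actual CIR format or its lazy-builder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Manifest:
    """Platform-neutral description: application name plus identifiers of
    its direct dependencies (no binaries bundled)."""
    app: str
    direct_deps: tuple


# Hypothetical index mapping (dependency id, platform) to a concrete artifact.
# The file names are placeholders, not real package files.
ARTIFACT_INDEX = {
    ("numpy==1.26.4", "linux/amd64"): "numpy-amd64.pkg",
    ("numpy==1.26.4", "linux/arm64"): "numpy-arm64.pkg",
    ("pandas==2.2.0", "linux/amd64"): "pandas-amd64.pkg",
    ("pandas==2.2.0", "linux/arm64"): "pandas-arm64.pkg",
}


def lazy_build(manifest: Manifest, platform: str) -> dict:
    """Resolve every dependency identifier for the target platform at
    deployment time and return a description of the runnable bundle."""
    artifacts = []
    for dep in manifest.direct_deps:
        key = (dep, platform)
        if key not in ARTIFACT_INDEX:
            raise LookupError(f"no artifact for {dep} on {platform}")
        artifacts.append(ARTIFACT_INDEX[key])
    return {"app": manifest.app, "platform": platform, "artifacts": artifacts}
```

The same `Manifest` can be passed to `lazy_build` with different `platform` values, which mirrors how a single cross-platform image is specialized only when it is deployed.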

EM · Jun 1, 2025 · Code
Can AI Master Econometrics? Evidence from Econometrics AI Agent on Expert-Level Tasks

Qiang Chen, Tianyang Han, Jin Li et al.

Can AI effectively perform complex econometric analysis traditionally requiring human expertise? This paper evaluates AI agents' capability to master econometrics, focusing on empirical analysis performance. We develop an "Econometrics AI Agent" built on the open-source MetaGPT framework. This agent exhibits outstanding performance in: (1) planning econometric tasks strategically, (2) generating and executing code, (3) employing error-based reflection for improved robustness, and (4) allowing iterative refinement through multi-round conversations. We construct two datasets from academic coursework materials and published research papers to evaluate performance against real-world challenges. Comparative testing shows our domain-specialized AI agent significantly outperforms both benchmark large language models (LLMs) and general-purpose AI agents. This work establishes a testbed for exploring AI's impact on social science research and enables cost-effective integration of domain expertise, making advanced econometric methods accessible to users with minimal coding skills. Furthermore, our AI agent enhances research reproducibility and offers promising pedagogical applications for econometrics teaching.

CR · Dec 19, 2024
AIArena: A Blockchain-Based Decentralized AI Training Platform

Zhipeng Wang, Rui Sun, Elizabeth Lui et al.

The rapid advancement of AI has underscored critical challenges in its development and implementation, largely due to centralized control by a few major corporations. This concentration of power intensifies biases within AI models, resulting from inadequate governance and oversight mechanisms. Additionally, it limits public involvement and heightens concerns about the integrity of model generation. Such monopolistic control over data and AI outputs threatens both innovation and fair data usage, as users inadvertently contribute data that primarily benefits these corporations. In this work, we propose AIArena, a blockchain-based decentralized AI training platform designed to democratize AI development and alignment through on-chain incentive mechanisms. AIArena fosters an open and collaborative environment where participants can contribute models and computing resources. Its on-chain consensus mechanism ensures fair rewards for participants based on their contributions. We instantiate and implement AIArena on the public Base blockchain Sepolia testnet, and the evaluation results demonstrate the feasibility of AIArena in real-world applications.