AI · Jun 3
Agents' Last Exam
Yiyou Sun, Xinyang Han, Weichen Zhang et al.
Recent AI systems have achieved strong results on a wide range of benchmarks, yet these gains have not translated into economically meaningful deployment across many professional domains. We argue that this gap is largely an evaluation problem: widely used benchmarks lack sustained performance measurement on real and economically valuable workflows. This paper introduces Agents' Last Exam (ALE), a benchmark designed to evaluate AI agents on long-horizon, economically valuable, real-world tasks with verifiable outcomes. Developed in collaboration with 250+ industry experts, ALE covers non-physical industries defined with reference to O*NET / SOC 2018 (the U.S. federal occupational taxonomy). It is organized around a task taxonomy with 55 subfields grouped into 13 industry clusters covering 1K+ tasks. Current results show that the hardest tier remains far from saturated: across mainstream harness and backbone configurations, the average full pass rate is 2.6%. ALE is designed as a living benchmark: its task pool grows continuously as new workflows and industries are onboarded. More broadly, ALE is intended not merely as another leaderboard, but as an instrument for closing the gap between benchmark success and GDP-relevant impact.
LG · Oct 10, 2023
Ensemble Active Learning by Contextual Bandits for AI Incubation in Manufacturing
Yingyan Zeng, Xiaoyu Chen, Ran Jin
It is challenging but important to save annotation effort in streaming data acquisition while maintaining data quality for supervised base learners. We propose an ensemble active learning method that uses contextual bandits to actively acquire samples for annotation, which enforces the exploration-exploitation balance and leads to improved AI modeling performance.
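As a rough illustration of how a contextual bandit can balance exploration and exploitation when choosing how to acquire samples, the sketch below implements a generic disjoint LinUCB policy on synthetic data. It is not the method of the paper: the two arms (standing in for two hypothetical acquisition strategies), the linear reward model, and the simulated rewards are all assumptions made for illustration.

```python
import numpy as np


class LinUCB:
    """Disjoint LinUCB: one ridge-regression reward model per arm."""

    def __init__(self, n_arms, dim, alpha=1.0):
        self.alpha = alpha  # scales the confidence bonus (exploration strength)
        self.A = [np.eye(dim) for _ in range(n_arms)]    # per-arm regularized Gram matrix
        self.b = [np.zeros(dim) for _ in range(n_arms)]  # per-arm reward-weighted context sum

    def select(self, x):
        """Return the arm with the highest upper confidence bound for context x."""
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b                              # ridge estimate of arm weights
            bonus = self.alpha * np.sqrt(x @ A_inv @ x)    # uncertainty of the estimate at x
            scores.append(theta @ x + bonus)
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        """Fold the observed reward for the chosen arm into its model."""
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x


def simulate(n_rounds=1000, seed=0):
    """Toy stream: arm 0 pays off along feature 0, arm 1 along feature 1 (assumed)."""
    rng = np.random.default_rng(seed)
    true_theta = np.array([[1.0, 0.0], [0.0, 1.0]])  # hypothetical ground truth
    bandit = LinUCB(n_arms=2, dim=2, alpha=1.0)
    for _ in range(n_rounds):
        x = rng.uniform(0.0, 1.0, size=2)            # context of the incoming sample
        arm = bandit.select(x)                       # which acquisition strategy to trust
        reward = true_theta[arm] @ x + rng.normal(scale=0.05)
        bandit.update(arm, x, reward)
    return bandit


bandit = simulate()
```

In this toy setup the confidence bonus shrinks as an arm accumulates observations, so the policy gradually shifts from trying both strategies to picking whichever one its model favors for the current context.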
CR · Apr 7
Understanding User Privacy Perceptions of GenAI Smartphones
Ran Jin, Liu Wang, Shidong Pan et al.
GenAI smartphones, which natively embed generative AI at the system level, are transforming mobile interactions by automating a wide range of tasks and executing UI actions on behalf of users. Their superior capabilities rely on continuous access to sensitive and context-rich data, raising privacy concerns that surpass those of traditional mobile devices. Yet little is known about how users perceive the privacy implications of such devices or what safeguards they expect, which is especially critical at this early stage of GenAI smartphone adoption. To address this gap, we conduct 22 semi-structured interviews with everyday mobile users to explore their usage of GenAI smartphones, privacy concerns, and privacy design expectations. Our findings show that users engage with GenAI smartphones with limited understanding of how these systems operate to deliver functions, but show heightened privacy concerns once exposed to the technical details. Participants' concerns span the entire data lifecycle, including nontransparent collection, insecure storage, and weak data control. In a follow-up focus group, participants discuss a range of privacy-enhancing suggestions that call for coordinated changes across system-level controls, data management practices, and user-facing transparency. Their concerns and suggestions provide user-centered guidance for designing GenAI smartphones that balance functionality with privacy protection, with valuable takeaways for system designers and regulators.
AI · Mar 3, 2025
FAIR: Facilitating Artificial Intelligence Resilience in Manufacturing Industrial Internet
Yingyan Zeng, Ismini Lourentzou, Xinwei Deng et al.
Artificial intelligence (AI) systems have been increasingly adopted in the Manufacturing Industrial Internet (MII). Investigating and enabling AI resilience is essential to alleviating the profound impact of AI system failures in manufacturing and Industrial Internet of Things (IIoT) operations, which lead to critical decision making. However, there is a wide knowledge gap in defining the resilience of AI systems and in analyzing potential root causes and corresponding mitigation strategies. In this work, we propose a novel framework for investigating the resilience of AI performance over time under hazard factors in data quality, AI pipelines, and the cyber-physical layer. The proposed method can facilitate effective diagnosis and mitigation strategies to recover AI performance based on a multimodal multi-head self latent attention model. The merits of the proposed method are elaborated using an MII testbed of connected Aerosol Jet Printing (AJP) machines, fog nodes, and Cloud with inference tasks via AI pipelines.
LG · Dec 11, 2025
Hierarchical Dataset Selection for High-Quality Data Sharing
Xiaona Zhou, Yingyan Zeng, Ran Jin et al.
The success of modern machine learning hinges on access to high-quality training data. In many real-world scenarios, such as acquiring data from public repositories or sharing across institutions, data is naturally organized into discrete datasets that vary in relevance, quality, and utility. Selecting which repositories or institutions to search for useful datasets, and which datasets to incorporate into model training, are therefore critical decisions. Yet most existing methods select individual samples and treat all data as equally relevant, ignoring differences between datasets and their sources. In this work, we formalize the task of dataset selection: selecting entire datasets from a large, heterogeneous pool to improve downstream performance under resource constraints. We propose Dataset Selection via Hierarchies (DaSH), a dataset selection method that models utility at both dataset and group (e.g., collections, institutions) levels, enabling efficient generalization from limited observations. Across two public benchmarks (Digit-Five and DomainNet), DaSH outperforms state-of-the-art data selection baselines by up to 26.2% in accuracy, while requiring significantly fewer exploration steps. Ablations show DaSH is robust to low-resource settings and lack of relevant datasets, making it suitable for scalable and adaptive dataset selection in practical multi-source learning workflows.
AP · Sep 7, 2025
Robust Analysis for Resilient AI System
Yu Wang, Ran Jin, Lulu Kang
Operational hazards in Manufacturing Industrial Internet (MII) systems generate severe data outliers that cripple traditional statistical analysis. This paper proposes a novel robust regression method, DPD-Lasso, which integrates Density Power Divergence with Lasso regularization to analyze contaminated data from AI resilience experiments. We develop an efficient iterative algorithm to overcome previous computational bottlenecks. Applied to an MII testbed for Aerosol Jet Printing, DPD-Lasso provides reliable, stable performance on both clean and outlier-contaminated data, accurately quantifying hazard impacts. This work establishes robust regression as an essential tool for developing and validating resilient industrial AI systems.
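For orientation, one standard way to combine Density Power Divergence with a Lasso penalty is sketched below for a Gaussian linear model $y_i = x_i^\top \beta + \epsilon_i$, $\epsilon_i \sim N(0, \sigma^2)$. This is the textbook DPD loss with an added $\ell_1$ term; the symbols $\alpha$ (robustness tuning parameter), $\lambda$ (penalty weight), and the Gaussian error assumption are introduced here for illustration, and the paper's exact formulation may differ.

$$
\min_{\beta,\,\sigma}\;
\frac{1}{(2\pi)^{\alpha/2}\,\sigma^{\alpha}\sqrt{1+\alpha}}
\;-\;
\frac{1+\alpha}{\alpha}\cdot\frac{1}{n}\sum_{i=1}^{n}
\frac{1}{(2\pi)^{\alpha/2}\,\sigma^{\alpha}}
\exp\!\left(-\frac{\alpha\,(y_i - x_i^\top \beta)^2}{2\sigma^2}\right)
\;+\;
\lambda\,\lVert \beta \rVert_1,
\qquad \alpha > 0 .
$$

The exponential factor downweights observations with large residuals, which is the source of robustness to outliers; as $\alpha \to 0$ the loss approaches the negative Gaussian log-likelihood, so the estimator approaches the ordinary Lasso fit.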
LG · Nov 24, 2021
ModelPred: A Framework for Predicting Trained Model from Training Data
Yingyan Zeng, Jiachen T. Wang, Si Chen et al.
In this work, we propose ModelPred, a framework that helps to understand the impact of changes in training data on a trained model. This is critical for building trust in various stages of a machine learning pipeline: from cleaning poor-quality samples and tracking important ones to be collected during data preparation, to calibrating uncertainty of model prediction, to interpreting why certain behaviors of a model emerge during deployment. Specifically, ModelPred learns a parameterized function that takes a dataset $S$ as the input and predicts the model obtained by training on $S$. Our work differs from the recent work of Datamodels [1] in that we aim to predict the trained model parameters directly instead of the trained model behaviors. We demonstrate that a neural network-based set function class is capable of learning the complex relationships between the training data and model parameters. We introduce novel global and local regularization techniques to prevent overfitting, and we rigorously characterize the expressive power of neural networks (NN) in approximating the end-to-end training process. Through extensive empirical investigations, we show that ModelPred enables a variety of applications that boost the interpretability and accountability of machine learning (ML), such as data valuation, data selection, memorization quantification, and model calibration.
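To make the idea of a set function from a dataset $S$ to model parameters concrete, here is a minimal sketch in the DeepSets style: each sample is embedded, the embeddings are mean-pooled, and the pooled vector is mapped to a parameter vector. The architecture, layer sizes, and function names are assumptions for illustration only and are not taken from the paper; the weights are random and untrained.

```python
import numpy as np


def init_set_function(in_dim, hidden_dim, out_dim, seed=0):
    """Random (untrained) weights for a two-stage set function; sizes are hypothetical."""
    rng = np.random.default_rng(seed)
    return {
        "W_phi": rng.normal(scale=0.1, size=(in_dim, hidden_dim)),   # per-sample encoder
        "W_rho": rng.normal(scale=0.1, size=(hidden_dim, out_dim)),  # pooled -> parameters
    }


def predict_params(weights, X, y):
    """Map a dataset (X, y) to a predicted parameter vector.

    Mean pooling over samples makes the output independent of sample order,
    which is the defining property of a set function.
    """
    Z = np.column_stack([X, y])            # one row per (x_i, y_i) pair
    H = np.tanh(Z @ weights["W_phi"])      # embed each sample independently
    pooled = H.mean(axis=0)                # order-invariant aggregation
    return pooled @ weights["W_rho"]       # predicted model parameters


# Toy dataset: 5 samples with 3 features each, plus a scalar label.
rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))
y = rng.normal(size=5)

weights = init_set_function(in_dim=4, hidden_dim=8, out_dim=3)
theta_hat = predict_params(weights, X, y)

# Shuffling the samples leaves the prediction unchanged (up to float rounding).
perm = rng.permutation(5)
theta_hat_shuffled = predict_params(weights, X[perm], y[perm])
```

In a trained version of such a sketch, the weights would be fit so that the output approximates the parameters obtained by actually training on the input dataset; removing or adding rows of `X` then gives a cheap estimate of how the trained model would change.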
LG · Jul 23, 2021
Adaptively Weighted Top-N Recommendation for Organ Matching
Parshin Shojaee, Xiaoyu Chen, Ran Jin
Reducing the shortage of organ donations to meet the demands of patients on the waiting list has been a major challenge in organ transplantation. Because of the shortage, the organ matching decision is the most critical decision in assigning the limited viable organs to the most suitable patients. Currently, organ matching decisions are made only from matching scores calculated via scoring models, which are built from first principles. However, these models may disagree with the actual post-transplantation matching performance (e.g., patient's post-transplant quality of life (QoL) or graft failure measurements). In this paper, we formulate organ matching decision-making as a top-N recommendation problem and propose an Adaptively Weighted Top-N Recommendation (AWTR) method. AWTR improves the performance of the current scoring models by using the limited actual matching performance in historical data sets as well as the covariates collected from organ donors and patients. AWTR sacrifices overall recommendation accuracy in order to emphasize recommendation and ranking accuracy for the top-N matched patients. The proposed method is validated in a simulation study, where KAS [60] is used to simulate the organ-patient recommendation response. The results show that our proposed method outperforms seven state-of-the-art top-N recommendation benchmark methods.
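The trade-off described above, giving up some overall accuracy to be more accurate on the top-N candidates, can be illustrated with a simple weighted loss. The sketch below is a hypothetical fixed weighting scheme, not AWTR itself (which adapts its weights): the function names, the `boost` parameter, and the toy scores are assumptions for illustration.

```python
import numpy as np


def top_n_weights(responses, n, boost=4.0):
    """Weight 1 for every candidate, plus `boost` for the n highest observed responses."""
    responses = np.asarray(responses, dtype=float)
    weights = np.ones_like(responses)
    top_idx = np.argsort(-responses)[:n]   # indices of the n best-matched candidates
    weights[top_idx] += boost
    return weights


def weighted_squared_error(pred, responses, weights):
    """Weighted mean squared error; errors on heavily weighted candidates cost more."""
    pred = np.asarray(pred, dtype=float)
    responses = np.asarray(responses, dtype=float)
    return float(np.sum(weights * (pred - responses) ** 2) / np.sum(weights))


# Toy matching responses for four candidate patients (made-up numbers).
responses = [0.9, 0.1, 0.5, 0.7]
weights = top_n_weights(responses, n=2)

# Same-size error placed on a top-2 candidate versus a low-ranked candidate.
pred_miss_top = [0.7, 0.1, 0.5, 0.7]
pred_miss_low = [0.9, 0.3, 0.5, 0.7]
loss_miss_top = weighted_squared_error(pred_miss_top, responses, weights)
loss_miss_low = weighted_squared_error(pred_miss_low, responses, weights)
```

Under this weighting, a model fit to minimize the loss is pushed to get the best-matched candidates right first, even if its predictions for low-ranked candidates become less accurate.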