Wangyang Ying

LG
h-index22
24papers
565citations
Novelty54%
AI Score57

24 Papers

LGSep 8, 2023Code
Self-optimizing Feature Generation via Categorical Hashing Representation and Hierarchical Reinforcement Crossing

Wangyang Ying, Dongjie Wang, Kunpeng Liu et al.

Feature generation aims to generate new and meaningful features to create a discriminative representation space.A generated feature is meaningful when the generated feature is from a feature pair with inherent feature interaction. In the real world, experienced data scientists can identify potentially useful feature-feature interactions, and generate meaningful dimensions from an exponentially large search space, in an optimal crossing form over an optimal generation path. But, machines have limited human-like abilities.We generalize such learning tasks as self-optimizing feature generation. Self-optimizing feature generation imposes several under-addressed challenges on existing systems: meaningful, robust, and efficient generation. To tackle these challenges, we propose a principled and generic representation-crossing framework to solve self-optimizing feature generation.To achieve hashing representation, we propose a three-step approach: feature discretization, feature hashing, and descriptive summarization. To achieve reinforcement crossing, we develop a hierarchical reinforcement feature crossing approach.We present extensive experimental results to demonstrate the effectiveness and efficiency of the proposed method. The code is available at https://github.com/yingwangyang/HRC_feature_cross.git.

91.0AIMay 24Code
FrontierOR: Benchmarking LLMs' Capacity for Efficient Algorithm Design in Large-Scale Optimization

Minwei Kong, Chonghe Jiang, Ao Qu et al.

Large language models (LLMs) are increasingly used for optimization modeling and solver-code generation, yet practical operations research and optimization problems often require a harder capability: designing scalable algorithms that exploit problem structure and outperform direct formulation-and-solve baselines. Existing benchmarks are limited to small or simplified examples far below real-world scale and complexity. We introduce FrontierOR, among the first benchmarks to systematically evaluate LLM-based efficient algorithm design for realistic large-scale optimization problems. FrontierOR includes 180 tasks derived from methodologically diverse papers published in top-tier operations research venues, each with standardized instances and a hidden, expert-verified evaluation suite. We evaluate seven LLMs spanning frontier, cost-effective, and open-source models both in one-shot and test-time evolution settings. The results reveal that frontier models still struggle to move from executable formulations to efficient optimization algorithms: the strongest one-shot model outperforms Gurobi in only 31% of cases in both solution quality and computational efficiency, and even strong coding agents with test-time evolution achieve only 50% on selected hard tasks. FrontierOR establishes a practical evaluation platform for LLM-based optimization algorithm design, which enables future LLMs and agents to be systematically tested on whether they can move beyond correct formulation toward a feasible, high-quality, and efficient algorithm. Our FrontierOR Benchmark is available at https://anonymous.4open.science/r/efficient-opt-bench-F03D.

LGSep 23, 2024
Revolutionizing Biomarker Discovery: Leveraging Generative AI for Bio-Knowledge-Embedded Continuous Space Exploration

Wangyang Ying, Dongjie Wang, Xuanming Hu et al.

Biomarker discovery is vital in advancing personalized medicine, offering insights into disease diagnosis, prognosis, and therapeutic efficacy. Traditionally, the identification and validation of biomarkers heavily depend on extensive experiments and statistical analyses. These approaches are time-consuming, demand extensive domain expertise, and are constrained by the complexity of biological systems. These limitations motivate us to ask: Can we automatically identify the effective biomarker subset without substantial human efforts? Inspired by the success of generative AI, we think that the intricate knowledge of biomarker identification can be compressed into a continuous embedding space, thus enhancing the search for better biomarkers. Thus, we propose a new biomarker identification framework with two important modules:1) training data preparation and 2) embedding-optimization-generation. The first module uses a multi-agent system to automatically collect pairs of biomarker subsets and their corresponding prediction accuracy as training data. These data establish a strong knowledge base for biomarker identification. The second module employs an encoder-evaluator-decoder learning paradigm to compress the knowledge of the collected data into a continuous space. Then, it utilizes gradient-based search techniques and autoregressive-based reconstruction to efficiently identify the optimal subset of biomarkers. Finally, we conduct extensive experiments on three real-world datasets to show the efficiency, robustness, and effectiveness of our method.

LGSep 23, 2024
Reinforcement Feature Transformation for Polymer Property Performance Prediction

Xuanming Hu, Dongjie Wang, Wangyang Ying et al.

Polymer property performance prediction aims to forecast specific features or attributes of polymers, which has become an efficient approach to measuring their performance. However, existing machine learning models face challenges in effectively learning polymer representations due to low-quality polymer datasets, which consequently impact their overall performance. This study focuses on improving polymer property performance prediction tasks by reconstructing an optimal and explainable descriptor representation space. Nevertheless, prior research such as feature engineering and representation learning can only partially solve this task since they are either labor-incentive or unexplainable. This raises two issues: 1) automatic transformation and 2) explainable enhancement. To tackle these issues, we propose our unique Traceable Group-wise Reinforcement Generation Perspective. Specifically, we redefine the reconstruction of the representation space into an interactive process, combining nested generation and selection. Generation creates meaningful descriptors, and selection eliminates redundancies to control descriptor sizes. Our approach employs cascading reinforcement learning with three Markov Decision Processes, automating descriptor and operation selection, and descriptor crossing. We utilize a group-wise generation strategy to explore and enhance reward signals for cascading agents. Ultimately, we conduct experiments to indicate the effectiveness of our proposed framework.

CLNov 2, 2022
Title2Event: Benchmarking Open Event Extraction with a Large-scale Chinese Title Dataset

Haolin Deng, Yanan Zhang, Yangfan Zhang et al.

Event extraction (EE) is crucial to downstream tasks such as new aggregation and event knowledge graph construction. Most existing EE datasets manually define fixed event types and design specific schema for each of them, failing to cover diverse events emerging from the online text. Moreover, news titles, an important source of event mentions, have not gained enough attention in current EE research. In this paper, We present Title2Event, a large-scale sentence-level dataset benchmarking Open Event Extraction without restricting event types. Title2Event contains more than 42,000 news titles in 34 topics collected from Chinese web pages. To the best of our knowledge, it is currently the largest manually-annotated Chinese dataset for open event extraction. We further conduct experiments on Title2Event with different models and show that the characteristics of titles make it challenging for event extraction, addressing the significance of advanced study on this problem. The dataset and baseline codes are available at https://open-event-hub.github.io/title2event.

82.9DBApr 13
Ozone: A Unified Platform for Transportation Research

Ou Zheng, Ruyi Feng, Yufeng Yang et al.

Intelligent Transportation Systems increasingly depend on heterogeneous data from roadside cameras, UAV imagery, LiDAR, and in-vehicle sensors, yet the lack of unified data standards, model interfaces, and evaluation protocols across these sources hampers reproducibility, cross-dataset benchmarking, and cross-region transferability of research findings. Existing trajectory datasets follow incompatible conventions for coordinate systems, object representations, and metadata fields, forcing researchers to build custom preprocessing pipelines for each dataset and simulator combination. To address these challenges, we propose Ozone, a unified platform for transportation research organized around five interconnected layers -- Hardware, Data, Model, Evaluation, and Prototype -- each with standardized schemas, automated conversion pipelines, and interoperable interfaces. In the first release, the data schema unifies four trajectory datasets -- NGSIM, highD, CitySim, and UTE -- into a canonical format with oriented bounding boxes, kinematic variables, and pre-computed surrogate safety measures. Digital-twin maps in CARLA and calibrated traffic models provide integrated benchmarking environments. Case studies in human-factor research, traffic scene generation, and safety-critical modeling demonstrate that Ozone reduces experiment setup time by 85%, achieves 91% cross-city transfer efficiency for safety models, and improves cross-dataset reproducibility to within 3% variance. The source code and datasets are publicly available.

AIJan 27
Multi-Agent Procedural Graph Extraction with Structural and Logical Refinement

Wangyang Ying, Yanchi Liu, Xujiang Zhao et al.

Automatically extracting workflows as procedural graphs from natural language is promising yet underexplored, demanding both structural validity and logical alignment. While recent large language models (LLMs) show potential for procedural graph extraction, they often produce ill-formed structures or misinterpret logical flows. We present \model{}, a multi-agent framework that formulates procedural graph extraction as a multi-round reasoning process with dedicated structural and logical refinement. The framework iterates through three stages: (1) a graph extraction phase with the graph builder agent, (2) a structural feedback phase in which a simulation agent diagnoses and explains structural defects, and (3) a logical feedback phase in which a semantic agent aligns semantics between flow logic and linguistic cues in the source text. Important feedback is prioritized and expressed in natural language, which is injected into subsequent prompts, enabling interpretable and controllable refinement. This modular design allows agents to target distinct error types without supervision or parameter updates. Experiments demonstrate that \model{} achieves substantial improvements in both structural correctness and logical consistency over strong baselines.

AIFeb 18
Causally-Guided Automated Feature Engineering with Multi-Agent Reinforcement Learning

Arun Vignesh Malarkkan, Wangyang Ying, Yanjie Fu

Automated feature engineering (AFE) enables AI systems to autonomously construct high-utility representations from raw tabular data. However, existing AFE methods rely on statistical heuristics, yielding brittle features that fail under distribution shift. We introduce CAFE, a framework that reformulates AFE as a causally-guided sequential decision process, bridging causal discovery with reinforcement learning-driven feature construction. Phase I learns a sparse directed acyclic graph over features and the target to obtain soft causal priors, grouping features as direct, indirect, or other based on their causal influence with respect to the target. Phase II uses a cascading multi-agent deep Q-learning architecture to select causal groups and transformation operators, with hierarchical reward shaping and causal group-level exploration strategies that favor causally plausible transformations while controlling feature complexity. Across 15 public benchmarks (classification with macro-F1; regression with inverse relative absolute error), CAFE achieves up to 7% improvement over strong AFE baselines, reduces episodes-to-convergence, and delivers competitive time-to-target. Under controlled covariate shifts, CAFE reduces performance drop by ~4x relative to a non-causal multi-agent baseline, and produces more compact feature sets with more stable post-hoc attributions. These findings underscore that causal structure, used as a soft inductive prior rather than a rigid constraint, can substantially improve the robustness and efficiency of automated feature engineering.

LGMar 6, 2024
Feature Selection as Deep Sequential Generative Learning

Wangyang Ying, Dongjie Wang, Haifeng Chen et al.

Feature selection aims to identify the most pattern-discriminative feature subset. In prior literature, filter (e.g., backward elimination) and embedded (e.g., Lasso) methods have hyperparameters (e.g., top-K, score thresholding) and tie to specific models, thus, hard to generalize; wrapper methods search a feature subset in a huge discrete space and is computationally costly. To transform the way of feature selection, we regard a selected feature subset as a selection decision token sequence and reformulate feature selection as a deep sequential generative learning task that distills feature knowledge and generates decision sequences. Our method includes three steps: (1) We develop a deep variational transformer model over a joint of sequential reconstruction, variational, and performance evaluator losses. Our model can distill feature selection knowledge and learn a continuous embedding space to map feature selection decision sequences into embedding vectors associated with utility scores. (2) We leverage the trained feature subset utility evaluator as a gradient provider to guide the identification of the optimal feature subset embedding;(3) We decode the optimal feature subset embedding to autoregressively generate the best feature selection decision sequence with autostop. Extensive experimental results show this generative perspective is effective and generic, without large discrete search space and expert-specific hyperparameters.

LGApr 26, 2024
Neuro-Symbolic Embedding for Short and Effective Feature Selection via Autoregressive Generation

Nanxu Gong, Wangyang Ying, Dongjie Wang et al.

Feature selection aims to identify the optimal feature subset for enhancing downstream models. Effective feature selection can remove redundant features, save computational resources, accelerate the model learning process, and improve the model overall performance. However, existing works are often time-intensive to identify the effective feature subset within high-dimensional feature spaces. Meanwhile, these methods mainly utilize a single downstream task performance as the selection criterion, leading to the selected subsets that are not only redundant but also lack generalizability. To bridge these gaps, we reformulate feature selection through a neuro-symbolic lens and introduce a novel generative framework aimed at identifying short and effective feature subsets. More specifically, we found that feature ID tokens of the selected subset can be formulated as symbols to reflect the intricate correlations among features. Thus, in this framework, we first create a data collector to automatically collect numerous feature selection samples consisting of feature ID tokens, model performance, and the measurement of feature subset redundancy. Building on the collected data, an encoder-decoder-evaluator learning paradigm is developed to preserve the intelligence of feature selection into a continuous embedding space for efficient search. Within the learned embedding space, we leverage a multi-gradient search algorithm to find more robust and generalized embeddings with the objective of improving model performance and reducing feature subset redundancy. These embeddings are then utilized to reconstruct the feature ID tokens for executing the final feature selection. Ultimately, comprehensive experiments and case studies are conducted to validate the effectiveness of the proposed framework.

LGMar 6, 2024
Knockoff-Guided Feature Selection via A Single Pre-trained Reinforced Agent

Xinyuan Wang, Dongjie Wang, Wangyang Ying et al.

Feature selection prepares the AI-readiness of data by eliminating redundant features. Prior research falls into two primary categories: i) Supervised Feature Selection, which identifies the optimal feature subset based on their relevance to the target variable; ii) Unsupervised Feature Selection, which reduces the feature space dimensionality by capturing the essential information within the feature set instead of using target variable. However, SFS approaches suffer from time-consuming processes and limited generalizability due to the dependence on the target variable and downstream ML tasks. UFS methods are constrained by the deducted feature space is latent and untraceable. To address these challenges, we introduce an innovative framework for feature selection, which is guided by knockoff features and optimized through reinforcement learning, to identify the optimal and effective feature subset. In detail, our method involves generating "knockoff" features that replicate the distribution and characteristics of the original features but are independent of the target variable. Each feature is then assigned a pseudo label based on its correlation with all the knockoff features, serving as a novel metric for feature evaluation. Our approach utilizes these pseudo labels to guide the feature selection process in 3 novel ways, optimized by a single reinforced agent: 1). A deep Q-network, pre-trained with the original features and their corresponding pseudo labels, is employed to improve the efficacy of the exploration process in feature selection. 2). We introduce unsupervised rewards to evaluate the feature subset quality based on the pseudo labels and the feature space reconstruction loss to reduce dependencies on the target variable. 3). A new ε-greedy strategy is used, incorporating insights from the pseudo labels to make the feature selection process more effective.

LGJan 17, 2025
Towards Data-Centric AI: A Comprehensive Survey of Traditional, Reinforcement, and Generative Approaches for Tabular Data Transformation

Dongjie Wang, Yanyong Huang, Wangyang Ying et al.

Tabular data is one of the most widely used formats across industries, driving critical applications in areas such as finance, healthcare, and marketing. In the era of data-centric AI, improving data quality and representation has become essential for enhancing model performance, particularly in applications centered around tabular data. This survey examines the key aspects of tabular data-centric AI, emphasizing feature selection and feature generation as essential techniques for data space refinement. We provide a systematic review of feature selection methods, which identify and retain the most relevant data attributes, and feature generation approaches, which create new features to simplify the capture of complex data patterns. This survey offers a comprehensive overview of current methodologies through an analysis of recent advancements, practical applications, and the strengths and limitations of these techniques. Finally, we outline open challenges and suggest future perspectives to inspire continued innovation in this field.

LGNov 8, 2024
Topology-aware Reinforcement Feature Space Reconstruction for Graph Data

Wangyang Ying, Haoyue Bai, Kunpeng Liu et al.

Feature space is an environment where data points are vectorized to represent the original dataset. Reconstructing a good feature space is essential to augment the AI power of data, improve model generalization, and increase the availability of downstream ML models. Existing literature, such as feature transformation and feature selection, is labor-intensive (e.g., heavy reliance on empirical experience) and mostly designed for tabular data. Moreover, these methods regard data samples as independent, which ignores the unique topological structure when applied to graph data, thus resulting in a suboptimal reconstruction feature space. Can we consider the topological information to automatically reconstruct feature space for graph data without heavy experiential knowledge? To fill this gap, we leverage topology-aware reinforcement learning to automate and optimize feature space reconstruction for graph data. Our approach combines the extraction of core subgraphs to capture essential structural information with a graph neural network (GNN) to encode topological features and reduce computing complexity. Then we introduce three reinforcement agents within a hierarchical structure to systematically generate meaningful features through an iterative process, effectively reconstructing the feature space. This framework provides a principled solution for attributed graph feature space reconstruction. The extensive experiments demonstrate the effectiveness and efficiency of including topological awareness.

LGFeb 12, 2025
A Survey on Data-Centric AI: Tabular Learning from Reinforcement Learning and Generative AI Perspective

Wangyang Ying, Cong Wei, Nanxu Gong et al.

Tabular data is one of the most widely used data formats across various domains such as bioinformatics, healthcare, and marketing. As artificial intelligence moves towards a data-centric perspective, improving data quality is essential for enhancing model performance in tabular data-driven applications. This survey focuses on data-driven tabular data optimization, specifically exploring reinforcement learning (RL) and generative approaches for feature selection and feature generation as fundamental techniques for refining data spaces. Feature selection aims to identify and retain the most informative attributes, while feature generation constructs new features to better capture complex data patterns. We systematically review existing generative methods for tabular data engineering, analyzing their latest advancements, real-world applications, and respective strengths and limitations. This survey emphasizes how RL-based and generative techniques contribute to the automation and intelligence of feature engineering. Finally, we summarize the existing challenges and discuss future research directions, aiming to provide insights that drive continued innovation in this field.

LGApr 30, 2025
Unsupervised Feature Transformation via In-context Generation, Generator-critic LLM Agents, and Duet-play Teaming

Nanxu Gong, Xinyuan Wang, Wangyang Ying et al.

Feature transformation involves generating a new set of features from the original dataset to enhance the data's utility. In certain domains like material performance screening, dimensionality is large and collecting labels is expensive and lengthy. It highly necessitates transforming feature spaces efficiently and without supervision to enhance data readiness and AI utility. However, existing methods fall short in efficient navigation of a vast space of feature combinations, and are mostly designed for supervised settings. To fill this gap, our unique perspective is to leverage a generator-critic duet-play teaming framework using LLM agents and in-context learning to derive pseudo-supervision from unsupervised data. The framework consists of three interconnected steps: (1) Critic agent diagnoses data to generate actionable advice, (2) Generator agent produces tokenized feature transformations guided by the critic's advice, and (3) Iterative refinement ensures continuous improvement through feedback between agents. The generator-critic framework can be generalized to human-agent collaborative generation, by replacing the critic agent with human experts. Extensive experiments demonstrate that the proposed framework outperforms even supervised baselines in feature transformation efficiency, robustness, and practical applicability across diverse datasets.

LGMay 21, 2025
Agentic Feature Augmentation: Unifying Selection and Generation with Teaming, Planning, and Memories

Nanxu Gong, Sixun Dong, Haoyue Bai et al.

As a widely-used and practical tool, feature engineering transforms raw data into discriminative features to advance AI model performance. However, existing methods usually apply feature selection and generation separately, failing to strive a balance between reducing redundancy and adding meaningful dimensions. To fill this gap, we propose an agentic feature augmentation concept, where the unification of feature generation and selection is modeled as agentic teaming and planning. Specifically, we develop a Multi-Agent System with Long and Short-Term Memory (MAGS), comprising a selector agent to eliminate redundant features, a generator agent to produce informative new dimensions, and a router agent that strategically coordinates their actions. We leverage in-context learning with short-term memory for immediate feedback refinement and long-term memory for globally optimal guidance. Additionally, we employ offline Proximal Policy Optimization (PPO) reinforcement fine-tuning to train the router agent for effective decision-making to navigate a vast discrete feature space. Extensive experiments demonstrate that this unified agentic framework consistently achieves superior task performance by intelligently orchestrating feature selection and generation.

LGMay 21, 2025
Sculpting Features from Noise: Reward-Guided Hierarchical Diffusion for Task-Optimal Feature Transformation

Nanxu Gong, Zijun Li, Sixun Dong et al.

Feature Transformation (FT) crafts new features from original ones via mathematical operations to enhance dataset expressiveness for downstream models. However, existing FT methods exhibit critical limitations: discrete search struggles with enormous combinatorial spaces, impeding practical use; and continuous search, being highly sensitive to initialization and step sizes, often becomes trapped in local optima, restricting global exploration. To overcome these limitations, DIFFT redefines FT as a reward-guided generative task. It first learns a compact and expressive latent space for feature sets using a Variational Auto-Encoder (VAE). A Latent Diffusion Model (LDM) then navigates this space to generate high-quality feature embeddings, its trajectory guided by a performance evaluator towards task-specific optima. This synthesis of global distribution learning (from LDM) and targeted optimization (reward guidance) produces potent embeddings, which a novel semi-autoregressive decoder efficiently converts into structured, discrete features, preserving intra-feature dependencies while allowing parallel inter-feature generation. Extensive experiments on 14 benchmark datasets show DIFFT consistently outperforms state-of-the-art baselines in predictive accuracy and robustness, with significantly lower training and inference times.

CLJun 10, 2025
Efficient Post-Training Refinement of Latent Reasoning in Large Language Models

Xinyuan Wang, Dongjie Wang, Wangyang Ying et al.

Reasoning is a key component of language understanding in Large Language Models. While Chain-of-Thought prompting enhances performance via explicit intermediate steps, it suffers from sufficient token overhead and a fixed reasoning trajectory, preventing step-wise refinement. Recent advances in latent reasoning address these limitations by refining internal reasoning processes directly in the model's latent space, without producing explicit outputs. However, a key challenge remains: how to effectively update reasoning embeddings during post-training to guide the model toward more accurate solutions. To overcome this challenge, we propose a lightweight post-training framework that refines latent reasoning trajectories using two novel strategies: 1) Contrastive reasoning feedback, which compares reasoning embeddings against strong and weak baselines to infer effective update directions via embedding enhancement; 2) Residual embedding refinement, which stabilizes updates by progressively integrating current and historical gradients, enabling fast yet controlled convergence. Extensive experiments and case studies are conducted on five reasoning benchmarks to demonstrate the effectiveness of the proposed framework. Notably, a 5\% accuracy gain on MathQA without additional training.

LGMay 21, 2025
Bridging the Domain Gap in Equation Distillation with Reinforcement Feedback

Wangyang Ying, Haoyue Bai, Nanxu Gong et al.

The data-to-equation (Data2Eqn) task aims to discover interpretable mathematical equations that map observed values to labels, offering physical insights and broad applicability across academic and industrial domains. Genetic programming and traditional deep learning-based approaches suffer from search inefficiency and poor generalization on small task-specific datasets. Foundation models showed promise in this area, but existing approaches suffer from: 1) They are pretrained on general-purpose data distributions, making them less effective for domain-specific tasks; and 2) their training objectives focus on token-level alignment, overlooking mathematical semantics, which can lead to inaccurate equations. To address these issues, we aim to enhance the domain adaptability of foundation models for Data2Eqn tasks. In this work, we propose a reinforcement learning-based finetuning framework that directly optimizes the generation policy of a pretrained model through reward signals derived from downstream numerical fitness. Our method allows the model to adapt to specific and complex data distributions and generate mathematically meaningful equations. Extensive experiments demonstrate that our approach improves both the accuracy and robustness of equation generation under complex distributions.

LGJun 10, 2025
LLM-ML Teaming: Integrated Symbolic Decoding and Gradient Search for Valid and Stable Generative Feature Transformation

Xinyuan Wang, Haoyue Bai, Nanxu Gong et al.

Feature transformation enhances data representation by deriving new features from the original data. Generative AI offers potential for this task, but faces challenges in stable generation (consistent outputs) and valid generation (error-free sequences). Existing methods--traditional MLs' low validity and LLMs' instability--fail to resolve both. We find that LLMs ensure valid syntax, while ML's gradient-steered search stabilizes performance. To bridge this gap, we propose a teaming framework combining LLMs' symbolic generation with ML's gradient optimization. This framework includes four steps: (1) golden examples generation, aiming to prepare high-quality samples with the ground knowledge of the teacher LLM; (2) feature transformation sequence embedding and search, intending to uncover potentially superior embeddings within the latent space; (3) student LLM feature transformation, aiming to distill knowledge from the teacher LLM; (4) LLM-ML decoder teaming, dedicating to combine ML and the student LLM probabilities for valid and stable generation. The experiments on various datasets show that the teaming policy can achieve 5\% improvement in downstream performance while reducing nearly half of the error cases. The results also demonstrate the efficiency and robustness of the teaming policy. Additionally, we also have exciting findings on LLMs' capacity to understand the original data.

AISep 23, 2025
Autonomous Data Agents: A New Opportunity for Smart Data

Yanjie Fu, Dongjie Wang, Wangyang Ying et al.

As data continues to grow in scale and complexity, preparing, transforming, and analyzing it remains labor-intensive, repetitive, and difficult to scale. Since data contains knowledge and AI learns knowledge from it, the alignment between AI and data is essential. However, data is often not structured in ways that are optimal for AI utilization. Moreover, an important question arises: how much knowledge can we pack into data through intensive data operations? Autonomous data agents (DataAgents), which integrate LLM reasoning with task decomposition, action reasoning and grounding, and tool calling, can autonomously interpret data task descriptions, decompose tasks into subtasks, reason over actions, ground actions into python code or tool calling, and execute operations. Unlike traditional data management and engineering tools, DataAgents dynamically plan workflows, call powerful tools, and adapt to diverse data tasks at scale. This report argues that DataAgents represent a paradigm shift toward autonomous data-to-knowledge systems. DataAgents are capable of handling collection, integration, preprocessing, selection, transformation, reweighing, augmentation, reprogramming, repairs, and retrieval. Through these capabilities, DataAgents transform complex and unstructured data into coherent and actionable knowledge. We first examine why the convergence of agentic AI and data-to-knowledge systems has emerged as a critical trend. We then define the concept of DataAgents and discuss their architectural design, training strategies, as well as the new skills and capabilities they enable. Finally, we call for concerted efforts to advance action workflow optimization, establish open datasets and benchmark ecosystems, safeguard privacy, balance efficiency with scalability, and develop trustworthy DataAgent guardrails to prevent malicious actions.

LGAug 27, 2025
Distribution Shift Aware Neural Tabular Learning

Wangyang Ying, Nanxu Gong, Dongjie Wang et al.

Tabular learning transforms raw features into optimized spaces for downstream tasks, but its effectiveness deteriorates under distribution shifts between training and testing data. We formalize this challenge as the Distribution Shift Tabular Learning (DSTL) problem and propose a novel Shift-Aware Feature Transformation (SAFT) framework to address it. SAFT reframes tabular learning from a discrete search task into a continuous representation-generation paradigm, enabling differentiable optimization over transformed feature sets. SAFT integrates three mechanisms to ensure robustness: (i) shift-resistant representation via embedding decorrelation and sample reweighting, (ii) flatness-aware generation through suboptimal embedding averaging, and (iii) normalization-based alignment between training and test distributions. Extensive experiments show that SAFT consistently outperforms prior tabular learning methods in terms of robustness, effectiveness, and generalization ability under diverse real-world distribution shifts.

LGAug 27, 2025
Data-Efficient Symbolic Regression via Foundation Model Distillation

Wangyang Ying, Jinghan Zhang, Haoyue Bai et al.

Discovering interpretable mathematical equations from observed data (a.k.a. equation discovery or symbolic regression) is a cornerstone of scientific discovery, enabling transparent modeling of physical, biological, and economic systems. While foundation models pre-trained on large-scale equation datasets offer a promising starting point, they often suffer from negative transfer and poor generalization when applied to small, domain-specific datasets. In this paper, we introduce EQUATE (Equation Generation via QUality-Aligned Transfer Embeddings), a data-efficient fine-tuning framework that adapts foundation models for symbolic equation discovery in low-data regimes via distillation. EQUATE combines symbolic-numeric alignment with evaluator-guided embedding optimization, enabling a principled embedding-search-generation paradigm. Our approach reformulates discrete equation search as a continuous optimization task in a shared embedding space, guided by data-equation fitness and simplicity. Experiments across three standard public benchmarks (Feynman, Strogatz, and black-box datasets) demonstrate that EQUATE consistently outperforms state-of-the-art baselines in both accuracy and robustness, while preserving low complexity and fast inference. These results highlight EQUATE as a practical and generalizable solution for data-efficient symbolic regression in foundation model distillation settings.

AIJul 10, 2025
Supply Chain Optimization via Generative Simulation and Iterative Decision Policies

Haoyue Bai, Haoyu Wang, Nanxu Gong et al.

High responsiveness and economic efficiency are critical objectives in supply chain transportation, both of which are influenced by strategic decisions on shipping mode. An integrated framework combining an efficient simulator with an intelligent decision-making algorithm can provide an observable, low-risk environment for transportation strategy design. An ideal simulation-decision framework must (1) generalize effectively across various settings, (2) reflect fine-grained transportation dynamics, (3) integrate historical experience with predictive insights, and (4) maintain tight integration between simulation feedback and policy refinement. We propose Sim-to-Dec framework to satisfy these requirements. Specifically, Sim-to-Dec consists of a generative simulation module, which leverages autoregressive modeling to simulate continuous state changes, reducing dependence on handcrafted domain-specific rules and enhancing robustness against data fluctuations; and a history-future dual-aware decision model, refined iteratively through end-to-end optimization with simulator interactions. Extensive experiments conducted on three real-world datasets demonstrate that Sim-to-Dec significantly improves timely delivery rates and profit.