45.4AIJun 2
LAP: An Agent-to-Instrument Protocol for Autonomous ScienceLinwu Zhu, Liqiang Gao, Yan Chen et al.
Autonomous science is moving from demonstration to infrastructure. Large language model agents now plan experiments, and self-driving laboratories execute them. Yet every such system rebuilds the link between the reasoning agent and the physical instrument from scratch, against fragmented vendor SDKs and standards built for deterministic software clients rather than probabilistic, goal-directed agents. Recent agent-interoperability protocols clarify two of the three edges of an agentic ecosystem (Anthropic's Model Context Protocol (MCP) standardizes the agent-to-tool edge, and Google's Agent2Agent (A2A) the agent-to-agent edge), but neither models the agent-to-instrument edge, where operations are stateful, safety-critical, exclusively owned, physically embodied, and produce measurements with units, calibration, and uncertainty. We present the Lab Agent Protocol (LAP), a protocol design that fills this gap. LAP retains A2A's peer-to-peer, discovery-first, task-lifecycle structure and adds four physical-world primitives: (i) the InstrumentCard, a signed capability and physical-limit description; (ii) first-class reservation for exclusive instrument and sample locking; (iii) a safety-fence handshake with operator-confirmation tokens cryptographically bound to a specific task and its parameters, gating hazardous and irreversible operations; and (iv) a MeasurementResult schema that makes every result physically typed (QUDT/UCUM), calibration-anchored, uncertainty-bearing, and reproducible by construction. We specify roles, a six-layer architecture, the JSON-RPC method set, the task and safety state machines, the error model, and cross-laboratory federation, and walk a closed-loop autonomous campaign through the protocol end-to-end. LAP is transport-compatible with the A2A/MCP ecosystem and encapsulates rather than replaces existing device standards such as SiLA 2 and OPC-UA.
AIJul 24, 2024Code
LAMBDA: A Large Model Based Data AgentMaojun Sun, Ruijian Han, Binyan Jiang et al.
We introduce LArge Model Based Data Agent (LAMBDA), a novel open-source, code-free multi-agent data analysis system that leverages the power of large language models. LAMBDA is designed to address data analysis challenges in data-driven applications through innovatively designed data agents using natural language. At the core of LAMBDA are two key agent roles: the programmer and the inspector, which are engineered to work together seamlessly. Specifically, the programmer generates code based on the user's instructions and domain-specific knowledge, while the inspector debugs the code when necessary. To ensure robustness and handle adverse scenarios, LAMBDA features a user interface that allows direct user intervention. Moreover, LAMBDA can flexibly integrate external models and algorithms through our proposed Knowledge Integration Mechanism, catering to the needs of customized data analysis. LAMBDA has demonstrated strong performance on various data analysis tasks. It has the potential to enhance data analysis paradigms by seamlessly integrating human and artificial intelligence, making it more accessible, effective, and efficient for users from diverse backgrounds. The strong performance of LAMBDA in solving data analysis problems is demonstrated using real-world data examples. The code for LAMBDA is available at https://github.com/AMA-CMFAI/LAMBDA and videos of three case studies can be viewed at https://www.polyu.edu.hk/ama/cmfai/lambda.html.
AROct 13, 2023Code
G10: Enabling An Efficient Unified GPU Memory and Storage Architecture with Smart Tensor MigrationsHaoyang Zhang, Yirui Eric Zhou, Yuqi Xue et al.
To break the GPU memory wall for scaling deep learning workloads, a variety of architecture and system techniques have been proposed recently. Their typical approaches include memory extension with flash memory and direct storage access. However, these techniques still suffer from suboptimal performance and introduce complexity to the GPU memory management, making them hard to meet the scalability requirement of deep learning workloads today. In this paper, we present a unified GPU memory and storage architecture named G10 driven by the fact that the tensor behaviors of deep learning workloads are highly predictable. G10 integrates the host memory, GPU memory, and flash memory into a unified memory space, to scale the GPU memory capacity while enabling transparent data migrations. Based on this unified GPU memory and storage architecture, G10 utilizes compiler techniques to characterize the tensor behaviors in deep learning workloads. Therefore, it can schedule data migrations in advance by considering the available bandwidth of flash memory and host memory. The cooperative mechanism between deep learning compilers and the unified memory architecture enables G10 to hide data transfer overheads in a transparent manner. We implement G10 based on an open-source GPU simulator. Our experiments demonstrate that G10 outperforms state-of-the-art GPU memory solutions by up to 1.75$\times$, without code modifications to deep learning workloads. With the smart data migration mechanism, G10 can reach 90.3\% of the performance of the ideal case assuming unlimited GPU memory.
NAApr 26, 2017
Multigrid Methods for A Mixed Finite Element Method of The Darcy-Forchheimer ModelJian Huang, Long Chen, Hongxing Rui
An efficient nonlinear multigrid method for a mixed finite element method of the Darcy-Forchheimer model is constructed in this paper. A Peaceman-Rachford type iteration is used as a smoother to decouple the nonlinearity from the divergence constraint. The nonlinear equation can be solved element-wise with a closed formulae. The linear saddle point system for the constraint is reduced into a symmetric positive definite system of Poisson type. Furthermore an empirical choice of the parameter used in the splitting is proposed and the resulting multigrid method is robust to the so-called Forchheimer number which controls the strength of the nonlinearity. By comparing the number of iterations and CPU time of different solvers in several numerical experiments, our multigrid method is shown to convergent with a rate independent of the mesh size and the Forchheimer number and with a nearly linear computational cost.
MLJul 21, 2022
Estimation of Non-Crossing Quantile Regression Process with Deep ReQU Neural NetworksGuohao Shen, Yuling Jiao, Yuanyuan Lin et al.
We propose a penalized nonparametric approach to estimating the quantile regression process (QRP) in a nonseparable model using rectifier quadratic unit (ReQU) activated deep neural networks and introduce a novel penalty function to enforce non-crossing of quantile regression curves. We establish the non-asymptotic excess risk bounds for the estimated QRP and derive the mean integrated squared error for the estimated QRP under mild smoothness and regularity conditions. To establish these non-asymptotic risk and estimation error bounds, we also develop a new error bound for approximating $C^s$ smooth functions with $s >0$ and their derivatives using ReQU activated neural networks. This is a new approximation result for ReQU networks and is of independent interest and may be useful in other problems. Our numerical experiments demonstrate that the proposed method is competitive with or outperforms two existing methods, including methods using reproducing kernels and random forests, for nonparametric quantile regression.
IVNov 1, 2023Code
DEFN: Dual-Encoder Fourier Group Harmonics Network for Three-Dimensional Indistinct-Boundary Object SegmentationXiaohua Jiang, Yihao Guo, Jian Huang et al.
The precise spatial and quantitative delineation of indistinct-boundary medical objects is paramount for the accuracy of diagnostic protocols, efficacy of surgical interventions, and reliability of postoperative assessments. Despite their significance, the effective segmentation and instantaneous three-dimensional reconstruction are significantly impeded by the paucity of representative samples in available datasets and noise artifacts. To surmount these challenges, we introduced Stochastic Defect Injection (SDi) to augment the representational diversity of challenging indistinct-boundary objects within training corpora. Consequently, we propose the Dual-Encoder Fourier Group Harmonics Network (DEFN) to tailor noise filtration, amplify detailed feature recognition, and bolster representation across diverse medical imaging scenarios. By incorporating Dynamic Weight Composing (DWC) loss dynamically adjusts model's focus based on training progression, DEFN achieves SOTA performance on the OIMHS public dataset, showcasing effectiveness in indistinct boundary contexts. Source code for DEFN is available at: https://github.com/IMOP-lab/DEFN-pytorch.
MLNov 20, 2023
Gaussian Interpolation FlowsYuan Gao, Jian Huang, Yuling Jiao
Gaussian denoising has emerged as a powerful method for constructing simulation-free continuous normalizing flows for generative modeling. Despite their empirical successes, theoretical properties of these flows and the regularizing effect of Gaussian denoising have remained largely unexplored. In this work, we aim to address this gap by investigating the well-posedness of simulation-free continuous normalizing flows built on Gaussian denoising. Through a unified framework termed Gaussian interpolation flow, we establish the Lipschitz regularity of the flow velocity field, the existence and uniqueness of the flow, and the Lipschitz continuity of the flow map and the time-reversed flow map for several rich classes of target distributions. This analysis also sheds light on the auto-encoding and cycle consistency properties of Gaussian interpolation flows. Additionally, we study the stability of these flows in source distributions and perturbations of the velocity field, using the quadratic Wasserstein distance as a metric. Our findings offer valuable insights into the learning techniques employed in Gaussian interpolation flows for generative modeling, providing a solid theoretical foundation for end-to-end error analyses of learning Gaussian interpolation flows with empirical observations.
NANov 3, 2017
Robust Decoding from 1-Bit Compressive Sampling with Least SquaresJian Huang, Yuling Jiao, Xiliang Lu et al.
In 1-bit compressive sensing (1-bit CS) where target signal is coded into a binary measurement, one goal is to recover the signal from noisy and quantized samples. Mathematically, the 1-bit CS model reads: $y = η\odot\textrm{sign} (Ψx^* + ε)$, where $x^{*}\in \mathcal{R}^{n}, y\in \mathcal{R}^{m}$, $Ψ\in \mathcal{R}^{m\times n}$, and $ε$ is the random error before quantization and $η\in \mathcal{R}^{n}$ is a random vector modeling the sign flips. Due to the presence of nonlinearity, noise and sign flips, it is quite challenging to decode from the 1-bit CS. In this paper, we consider least squares approach under the over-determined and under-determined settings. For $m>n$, we show that, up to a constant $c$, with high probability, the least squares solution $x_{\textrm{ls}}$ approximates $ x^*$ with precision $δ$ as long as $m \geq\widetilde{\mathcal{O}}(\frac{n}{δ^2})$. For $m< n$, we prove that, up to a constant $c$, with high probability, the $\ell_1$-regularized least-squares solution $x_{\ell_1}$ lies in the ball with center $x^*$ and radius $δ$ provided that $m \geq \mathcal{O}( \frac{s\log n}{δ^2})$ and $\|x^*\|_0 := s < m$. We introduce a Newton type method, the so-called primal and dual active set (PDAS) algorithm, to solve the nonsmooth optimization problem. The PDAS possesses the property of one-step convergence. It only requires to solve a small least squares problem on the active set. Therefore, the PDAS is extremely efficient for recovering sparse signals through continuation. We propose a novel regularization parameter selection rule which does not introduce any extra computational overhead. Extensive numerical experiments are presented to illustrate the robustness of our proposed model and the efficiency of our algorithm.
90.3PFMar 11Code
RAGPerf: An End-to-End Benchmarking Framework for Retrieval-Augmented Generation SystemsShaobo Li, Yirui Zhou, Yuan Xu et al.
We present the design and implementation of a RAG-based AI system benchmarking (RAGPerf) framework for characterizing the system behaviors of RAG pipelines. To facilitate detailed profiling and fine-grained performance analysis, RAGPerf decouples the RAG workflow into several modular components - embedding, indexing, retrieval, reranking, and generation. RAGPerf offers the flexibility for users to configure the core parameters of each component and examine their impact on the end-to-end query performance and quality. RAGPerf has a workload generator to model real-world scenarios by supporting diverse datasets (e.g., text, pdf, code, and audio), different retrieval and update ratios, and query distributions. RAGPerf also supports different embedding models, major vector databases such as LanceDB, Milvus, Qdrant, Chroma, and Elasticsearch, as well as different LLMs for content generation. It automates the collection of performance metrics (i.e., end-to-end query throughput, host/GPU memory footprint, and CPU/GPU utilization) and accuracy metrics (i.e., context recall, query accuracy, and factual consistency). We demonstrate the capabilities of RAGPerf through a comprehensive set of experiments and open source its codebase at GitHub. Our evaluation shows that RAGPerf incurs negligible performance overhead.
MLOct 18, 2022
Nonparametric Quantile Regression: Non-Crossing Constraints and Conformal PredictionWenlu Tang, Guohao Shen, Yuanyuan Lin et al.
We propose a nonparametric quantile regression method using deep neural networks with a rectified linear unit penalty function to avoid quantile crossing. This penalty function is computationally feasible for enforcing non-crossing constraints in multi-dimensional nonparametric quantile regression. We establish non-asymptotic upper bounds for the excess risk of the proposed nonparametric quantile regression function estimators. Our error bounds achieve optimal minimax rate of convergence for the Holder class, and the prefactors of the error bounds depend polynomially on the dimension of the predictor, instead of exponentially. Based on the proposed non-crossing penalized deep quantile regression, we construct conformal prediction intervals that are fully adaptive to heterogeneity. The proposed prediction interval is shown to have good properties in terms of validity and accuracy under reasonable conditions. We also derive non-asymptotic upper bounds for the difference of the lengths between the proposed non-crossing conformal prediction interval and the theoretically oracle prediction interval. Numerical experiments including simulation studies and a real data example are conducted to demonstrate the effectiveness of the proposed method.
MLJun 27, 2023
Wasserstein Generative RegressionShanshan Song, Tong Wang, Guohao Shen et al.
In this paper, we propose a new and unified approach for nonparametric regression and conditional distribution learning. Our approach simultaneously estimates a regression function and a conditional generator using a generative learning framework, where a conditional generator is a function that can generate samples from a conditional distribution. The main idea is to estimate a conditional generator that satisfies the constraint that it produces a good regression function estimator. We use deep neural networks to model the conditional generator. Our approach can handle problems with multivariate outcomes and covariates, and can be used to construct prediction intervals. We provide theoretical guarantees by deriving non-asymptotic error bounds and the distributional consistency of our approach under suitable assumptions. We also perform numerical experiments with simulated and real data to demonstrate the effectiveness and superiority of our approach over some existing approaches in various scenarios.
DCAug 9, 2024
Scaling Deep Learning Computation over the Inter-Core Connected Intelligence Processor with T10Yiqi Liu, Yuqi Xue, Yu Cheng et al.
As AI chips incorporate numerous parallelized cores to scale deep learning (DL) computing, inter-core communication is enabled recently by employing high-bandwidth and low-latency interconnect links on the chip (e.g., Graphcore IPU). It allows each core to directly access the fast scratchpad memory in other cores, which enables new parallel computing paradigms. However, without proper support for the scalable inter-core connections in current DL compilers, it is hard for developers to exploit the benefits of this new architecture. We present T10, the first DL compiler to exploit the inter-core communication bandwidth and distributed on-chip memory on AI chips. To formulate the computation and communication patterns of tensor operators in this new architecture, T10 introduces a distributed tensor abstraction rTensor. T10 maps a DNN model to execution plans with a generalized compute-shift pattern, by partitioning DNN computation into sub-operators and mapping them to cores, so that the cores can exchange data following predictable patterns. T10 makes globally optimized trade-offs between on-chip memory consumption and inter-core communication overhead, selects the best execution plan from a vast optimization space, and alleviates unnecessary inter-core communications. Our evaluation with a real inter-core connected AI chip, the Graphcore IPU, shows up to 3.3$\times$ performance improvement, and scalability support for larger models, compared to state-of-the-art DL compilers and vendor libraries.
IRFeb 10Code
AtomicRAG: Atom-Entity Graphs for Retrieval-Augmented GenerationYanning Hou, Duanyang Yuan, Sihang Zhou et al.
Recent GraphRAG methods integrate graph structures into text indexing and retrieval, using knowledge graph triples to connect text chunks, thereby improving retrieval coverage and precision. However, we observe that treating text chunks as the basic unit of knowledge representation rigidly groups multiple atomic facts together, limiting the flexibility and adaptability needed to support diverse retrieval scenarios. Additionally, triple-based entity linking is sensitive to relation-extraction errors, which can lead to missing or incorrect reasoning paths and ultimately hurt retrieval accuracy. To address these issues, we propose the Atom-Entity Graph, a more precise and reliable architecture for knowledge representation and indexing. In our approach, knowledge is stored as knowledge atoms, namely individual, self-contained units of factual information, rather than coarse-grained text chunks. This allows knowledge elements to be flexibly reassembled without mutual interference, thereby enabling seamless alignment with diverse query perspectives. Edges between entities simply indicate whether a relationship exists. By combining personalized PageRank with relevance-based filtering, we maintain accurate entity connections and improve the reliability of reasoning. Theoretical analysis and experiments on five public benchmarks show that the proposed AtomicRAG algorithm outperforms strong RAG baselines in retrieval accuracy and reasoning robustness. Code: https://github.com/7HHHHH/AtomicRAG.
MLJul 21, 2022
Deep Sufficient Representation Learning via Mutual InformationSiming Zheng, Yuanyuan Lin, Jian Huang
We propose a mutual information-based sufficient representation learning (MSRL) approach, which uses the variational formulation of the mutual information and leverages the approximation power of deep neural networks. MSRL learns a sufficient representation with the maximum mutual information with the response and a user-selected distribution. It can easily handle multi-dimensional continuous or categorical response variables. MSRL is shown to be consistent in the sense that the conditional probability density function of the response variable given the learned representation converges to the conditional probability density function of the response variable given the predictor. Non-asymptotic error bounds for MSRL are also established under suitable conditions. To establish the error bounds, we derive a generalized Dudley's inequality for an order-two U-process indexed by deep neural networks, which may be of independent interest. We discuss how to determine the intrinsic dimension of the underlying data distribution. Moreover, we evaluate the performance of MSRL via extensive numerical experiments and real data analysis and demonstrate that MSRL outperforms some existing nonlinear sufficient dimension reduction methods.
AIJan 20
DSAEval: Evaluating Data Science Agents on a Wide Range of Real-World Data Science ProblemsMaojun Sun, Yifei Xie, Yue Wu et al.
Recent LLM-based data agents aim to automate data science tasks ranging from data analysis to deep learning. However, the open-ended nature of real-world data science problems, which often span multiple taxonomies and lack standard answers, poses a significant challenge for evaluation. To address this, we introduce DSAEval, a benchmark comprising 641 real-world data science problems grounded in 285 diverse datasets, covering both structured and unstructured data (e.g., vision and text). DSAEval incorporates three distinctive features: (1) Multimodal Environment Perception, which enables agents to interpret observations from multiple modalities including text and vision; (2) Multi-Query Interactions, which mirror the iterative and cumulative nature of real-world data science projects; and (3) Multi-Dimensional Evaluation, which provides a holistic assessment across reasoning, code, and results. We systematically evaluate 11 advanced agentic LLMs using DSAEval. Our results show that Claude-Sonnet-4.5 achieves the strongest overall performance, GPT-5.2 is the most efficient, and MiMo-V2-Flash is the most cost-effective. We further demonstrate that multimodal perception consistently improves performance on vision-related tasks, with gains ranging from 2.04% to 11.30%. Overall, while current data science agents perform well on structured data and routine data anlysis workflows, substantial challenges remain in unstructured domains. Finally, we offer critical insights and outline future research directions to advance the development of data science agents.
73.0ARApr 29Code
Exploring the Efficiency of 3D-Stacked AI Chip Architecture for LLM Inference with VoxelYiqi Liu, Noelle Crawford, Michael Wang et al.
To overcome the well-known memory bottleneck of AI chips, 3D stacked architectures that employ advanced packaging technology with high-density through-silicon vias (TSVs) pins have proven to be a promising solution. The 3D-stacked AI chip enables ultra-high memory bandwidth between compute and memory by stacking numerous DRAM banks atop many AI cores in a distributed manner. However, it is not easy to explore the efficiency of the 3D-stacked AI chip, due to its unique distributed nature. And we need to carefully consider multiple intertwined factors that range from upper-level computing paradigm to machine learning (ML) compiler optimizations, and to the underlying hardware architecture. In this paper, we develop Voxel, a fast and compiler-aware end-to-end simulation framework to facilitate exploring the efficiency of 3D-stacked AI chips for large language model (LLM) inference. Voxel enables the software/hardware co-exploration by employing a programming interface that allows ML compilers to customize the model execution plans. After validating the results of Voxel with an emulator on real silicon, we thoroughly examine the impact and correlation of different aspects of 3D-stacked AI chips, including state-of-the-art compute paradigms, tile-to-core mapping, tensor-to-bank mapping, NoC topologies and link bandwidth, DRAM bank bandwidth, per-core SRAM capacity, and energy/thermal constraints. Our findings disclose that the end-to-end efficiency of a 3D stacked AI chip not only is determined by the cooperative function of these factors, but also significantly depends on the mappings from tiles to AI core and DRAM banks. We report our findings throughout the paper, with the expectation that they will shed light on the development of the 3D-stacked AI chip ecosystem. We will open source Voxel and our study results for public research.
ARAug 7, 2024
Hardware-Assisted Virtualization of Neural Processing Units for Cloud PlatformsYuqi Xue, Yiqi Liu, Lifeng Nai et al.
Cloud platforms today have been deploying hardware accelerators like neural processing units (NPUs) for powering machine learning (ML) inference services. To maximize the resource utilization while ensuring reasonable quality of service, a natural approach is to virtualize NPUs for efficient resource sharing for multi-tenant ML services. However, virtualizing NPUs for modern cloud platforms is not easy. This is not only due to the lack of system abstraction support for NPU hardware, but also due to the lack of architectural and ISA support for enabling fine-grained dynamic operator scheduling for virtualized NPUs. We present Neu10, a holistic NPU virtualization framework. We investigate virtualization techniques for NPUs across the entire software and hardware stack. Neu10 consists of (1) a flexible NPU abstraction called vNPU, which enables fine-grained virtualization of the heterogeneous compute units in a physical NPU (pNPU); (2) a vNPU resource allocator that enables pay-as-you-go computing model and flexible vNPU-to-pNPU mappings for improved resource utilization and cost-effectiveness; (3) an ISA extension of modern NPU architecture for facilitating fine-grained tensor operator scheduling for multiple vNPUs. We implement Neu10 based on a production-level NPU simulator. Our experiments show that Neu10 improves the throughput of ML inference services by up to 1.4$\times$ and reduces the tail latency by up to 4.6$\times$, while improving the NPU utilization by 1.2$\times$ on average, compared to state-of-the-art NPU sharing approaches.
LGNov 4, 2024Code
OwMatch: Conditional Self-Labeling with Consistency for Open-World Semi-Supervised LearningShengjie Niu, Lifan Lin, Jian Huang et al.
Semi-supervised learning (SSL) offers a robust framework for harnessing the potential of unannotated data. Traditionally, SSL mandates that all classes possess labeled instances. However, the emergence of open-world SSL (OwSSL) introduces a more practical challenge, wherein unlabeled data may encompass samples from unseen classes. This scenario leads to misclassification of unseen classes as known ones, consequently undermining classification accuracy. To overcome this challenge, this study revisits two methodologies from self-supervised and semi-supervised learning, self-labeling and consistency, tailoring them to address the OwSSL problem. Specifically, we propose an effective framework called OwMatch, combining conditional self-labeling and open-world hierarchical thresholding. Theoretically, we analyze the estimation of class distribution on unlabeled data through rigorous statistical analysis, thus demonstrating that OwMatch can ensure the unbiasedness of the self-label assignment estimator with reliability. Comprehensive empirical analyses demonstrate that our method yields substantial performance enhancements across both known and unknown classes in comparison to previous studies. Code is available at https://github.com/niusj03/OwMatch.
69.8GTMay 16
A Truthful Multiunit Profit-Optimal Mechanism for Synthesizing Social LawsJun Wu, Jian Huang, Chongjun Wang
This paper studies Social Law Synthesis (SLS) in strategic multi-agent environments as a new multi-unit mechanism design problem. We model SLS as a Bayesian single-parameter procurement auction based on Alternating-time Temporal Logic (ATL) and aim to design a truthful, individually rational, and profit-optimal mechanism. We first prove a representation lemma showing that any valuation respecting alternating bisimulation can be compactly expressed as a feature set of ATL formulae. We then reduce payment determination to allocation determination in polynomial time, resolving the irregular payment issue inherent in multi-unit settings. We further show that allocation determination is \(FP^{NP}\)-complete and encode ATL semantics into integer linear programming (ILP) constraints to make the problem tractable with standard solvers. Based on these results, we present the $\mathcal{PO\text{-}ASL}$ mechanism, which is incentive-compatible, individually rational, and maximizes expected profit. Theoretical guarantees and examples confirm that our approach provides an effective and computationally feasible solution for synthesizing optimal social laws under strategic agent behavior.
40.0MLMay 8
CONTRA: Conformal Prediction Region via Normalizing Flow TransformationZhenhan Fang, Aixin Tan, Jian Huang
Density estimation and reliable prediction regions for outputs are crucial in supervised and unsupervised learning. While conformal prediction effectively generates coverage-guaranteed regions, it struggles with multi-dimensional outputs due to reliance on one-dimensional nonconformity scores. To address this, we introduce CONTRA: CONformal prediction region via normalizing flow TRAnsformation. CONTRA utilizes the latent spaces of normalizing flows to define nonconformity scores based on distances from the center. This allows for the mapping of high-density regions in latent space to sharp prediction regions in the output space, surpassing traditional hyperrectangular or elliptical conformal regions. Further, for scenarios where other predictive models are favored over flow-based models, we extend CONTRA to enhance any such model with a reliable prediction region by training a simple normalizing flow on the residuals. We demonstrate that both CONTRA and its extension maintain guaranteed coverage probability and outperform existing methods in generating accurate prediction regions across various datasets. We conclude that CONTRA is an effective tool for (conditional) density estimation, addressing the under-explored challenge of delivering multi-dimensional prediction regions.
IRMar 5Code
DARE: Aligning LLM Agents with the R Statistical Ecosystem via Distribution-Aware RetrievalMaojun Sun, Yue Wu, Yifei Xie et al.
Large Language Model (LLM) agents can automate data-science workflows, but many rigorous statistical methods implemented in R remain underused because LLMs struggle with statistical knowledge and tool retrieval. Existing retrieval-augmented approaches focus on function-level semantics and ignore data distribution, producing suboptimal matches. We propose DARE (Distribution-Aware Retrieval Embedding), a lightweight, plug-and-play retrieval model that incorporates data distribution information into function representations for R package retrieval. Our main contributions are: (i) RPKB, a curated R Package Knowledge Base derived from 8,191 high-quality CRAN packages; (ii) DARE, an embedding model that fuses distributional features with function metadata to improve retrieval relevance; and (iii) RCodingAgent, an R-oriented LLM agent for reliable R code generation and a suite of statistical analysis tasks for systematically evaluating LLM agents in realistic analytical scenarios. Empirically, DARE achieves an NDCG at 10 of 93.47%, outperforming state-of-the-art open-source embedding models by up to 17% on package retrieval while using substantially fewer parameters. Integrating DARE into RCodingAgent yields significant gains on downstream analysis tasks. This work helps narrow the gap between LLM automation and the mature R statistical ecosystem.
17.7CLApr 18
Stability-Weighted Decoding for Diffusion Language ModelsYue Wu, Jian Huang
Diffusion large language models (dLLMs) enable parallel text generation by iteratively denoising a fully masked sequence, unmasking a subset of masked tokens at each step. Existing decoding strategies rely on static confidence metrics computed at a single denoising step, ignoring temporal history and often leading to premature unmasking of unstable tokens. In this work, we theoretically establish that a token's temporal instability, quantified by the KL divergence between consecutive prediction distributions, provides a strict lower bound on its mutual information with the remaining masked context, indicating that temporally unstable tokens are inherently unsafe to unmask. Based on this insight, we propose Stability-Weighted Decoding (SWD), a training-free, plug-and-play strategy that incorporates temporal stability into token scoring and acts as a universal modulator for arbitrary score-based decoding policies. Experiments on code generation and mathematical reasoning benchmarks demonstrate that SWD consistently improves generation accuracy across representative scoring metrics and selection policies, and exhibits exceptional robustness, maintaining a significant performance lead over standard baselines across varying acceleration ratios.
89.5LGApr 23
Decoupled Travel Planning with Behavior ForestDuanyang Yuan, Sihang Zhou, Yanning Hou et al.
Behavior sequences, composed of executable steps, serve as the operational foundation for multi-constraint planning problems such as travel planning. In such tasks, each planning step is not only constrained locally but also influenced by global constraints spanning multiple subtasks, leading to a tightly coupled and complex decision process. Existing travel planning methods typically rely on a single decision space that entangles all subtasks and constraints, failing to distinguish between locally acting constraints within a subtask and global constraints that span multiple subtasks. Consequently, the model is forced to jointly reason over local and global constraints at each decision step, increasing the reasoning burden and reducing planning efficiency. To address this problem, we propose the Behavior Forest method. Specifically, our approach structures the decision-making process into a forest of parallel behavior trees, where each behavior tree is responsible for a subtask. A global coordination mechanism is introduced to orchestrate the interactions among these trees, enabling modular and coherent travel planning. Within this framework, large language models are embedded as decision engines within behavior tree nodes, performing localized reasoning conditioned on task-specific constraints to generate candidate subplans and adapt decisions based on coordination feedback. The behavior trees, in turn, provide an explicit control structure that guides LLM generation. This design decouples complex tasks and constraints into manageable subspaces, enabling task-specific reasoning and reducing the cognitive load of LLM. Experimental results show that our method outperforms state-of-the-art methods by 6.67% on the TravelPlanner and by 11.82% on the ChinaTravel benchmarks, demonstrating its effectiveness in increasing LLM performance for complex multi-constraint travel planning.
LGFeb 6
SpecAttn: Co-Designing Sparse Attention with Self-Speculative DecodingYikang Yue, Yuqi Xue, Jian Huang
Long-context large language model (LLM) inference has become the norm for today's AI applications. However, it is severely bottlenecked by the increasing memory demands of its KV cache. Previous works have shown that self-speculative decoding with sparse attention, where tokens are drafted using a subset of the KV cache and verified in parallel with full KV cache, speeds up inference in a lossless way. However, this approach relies on standalone KV selection algorithms to select the KV entries used for drafting and overlooks that the criticality of each KV entry is inherently computed during verification. In this paper, we propose SpecAttn, a self-speculative decoding method with verification-guided sparse attention. SpecAttn identifies critical KV entries as a byproduct of verification and only loads these entries when drafting subsequent tokens. This not only improves draft token acceptance rate but also incurs low KV selection overhead, thereby improving decoding throughput. SpecAttn achieves 2.81$\times$ higher throughput over vanilla auto-regressive decoding and 1.29$\times$ improvement over state-of-the-art sparsity-based self-speculative decoding methods.
LGSep 14, 2025Code
MatQnA: A Benchmark Dataset for Multi-modal Large Language Models in Materials Characterization and AnalysisYonghao Weng, Liqiang Gao, Linwu Zhu et al.
Recently, large language models (LLMs) have achieved remarkable breakthroughs in general domains such as programming and writing, and have demonstrated strong potential in various scientific research scenarios. However, the capabilities of AI models in the highly specialized field of materials characterization and analysis have not yet been systematically or sufficiently validated. To address this gap, we present MatQnA, the first multi-modal benchmark dataset specifically designed for material characterization techniques. MatQnA includes ten mainstream characterization methods, such as X-ray Photoelectron Spectroscopy (XPS), X-ray Diffraction (XRD), Scanning Electron Microscopy (SEM), Transmission Electron Microscopy (TEM), etc. We employ a hybrid approach combining LLMs with human-in-the-loop validation to construct high-quality question-answer pairs, integrating both multiple-choice and subjective questions. Our preliminary evaluation results show that the most advanced multi-modal AI models (e.g., GPT-4.1, Claude 4, Gemini 2.5, and Doubao Vision Pro 32K) have already achieved nearly 90% accuracy on objective questions in materials data interpretation and analysis tasks, demonstrating strong potential for applications in materials characterization and analysis. The MatQnA dataset is publicly available at https://huggingface.co/datasets/richardhzgg/matQnA.
MTRL-SCIJun 10, 2025Code
Mic-hackathon 2024: Hackathon on Machine Learning for Electron and Scanning Probe MicroscopyUtkarsh Pratiush, Austin Houston, Kamyar Barakati et al.
Microscopy is a primary source of information on materials structure and functionality at nanometer and atomic scales. The data generated is often well-structured, enriched with metadata and sample histories, though not always consistent in detail or format. The adoption of Data Management Plans (DMPs) by major funding agencies promotes preservation and access. However, deriving insights remains difficult due to the lack of standardized code ecosystems, benchmarks, and integration strategies. As a result, data usage is inefficient and analysis time is extensive. In addition to post-acquisition analysis, new APIs from major microscope manufacturers enable real-time, ML-based analytics for automated decision-making and ML-agent-controlled microscope operation. Yet, a gap remains between the ML and microscopy communities, limiting the impact of these methods on physics, materials discovery, and optimization. Hackathons help bridge this divide by fostering collaboration between ML researchers and microscopy experts. They encourage the development of novel solutions that apply ML to microscopy, while preparing a future workforce for instrumentation, materials science, and applied ML. This hackathon produced benchmark datasets and digital twins of microscopes to support community growth and standardized workflows. All related code is available at GitHub: https://github.com/KalininGroup/Mic-hackathon-2024-codes-publication/tree/1.0.0.1
LGMar 16, 2025Code
Real-Time Cell Sorting with Scalable In Situ FPGA-Accelerated Deep LearningKhayrul Islam, Ryan F. Forelli, Jianzhong Han et al.
Precise cell classification is essential in biomedical diagnostics and therapeutic monitoring, particularly for identifying diverse cell types involved in various diseases. Traditional cell classification methods such as flow cytometry depend on molecular labeling which is often costly, time-intensive, and can alter cell integrity. To overcome these limitations, we present a label-free machine learning framework for cell classification, designed for real-time sorting applications using bright-field microscopy images. This approach leverages a teacher-student model architecture enhanced by knowledge distillation, achieving high efficiency and scalability across different cell types. Demonstrated through a use case of classifying lymphocyte subsets, our framework accurately classifies T4, T8, and B cell types with a dataset of 80,000 preprocessed images, accessible via an open-source Python package for easy adaptation. Our teacher model attained 98\% accuracy in differentiating T4 cells from B cells and 93\% accuracy in zero-shot classification between T8 and B cells. Remarkably, our student model operates with only 0.02\% of the teacher model's parameters, enabling field-programmable gate array (FPGA) deployment. Our FPGA-accelerated student model achieves an ultra-low inference latency of just 14.5~$μ$s and a complete cell detection-to-sorting trigger time of 24.7~$μ$s, delivering 12x and 40x improvements over the previous state-of-the-art real-time cell analysis algorithm in inference and total latency, respectively, while preserving accuracy comparable to the teacher model. This framework provides a scalable, cost-effective solution for lymphocyte classification, as well as a new SOTA real-time cell sorting implementation for rapid identification of subsets using in situ deep learning on off-the-shelf computing hardware.
ROFeb 14, 2024Code
DisGNet: A Distance Graph Neural Network for Forward Kinematics Learning of Gough-Stewart PlatformHuizhi Zhu, Wenxia Xu, Jian Huang et al.
In this paper, we propose a graph neural network, DisGNet, for learning the graph distance matrix to address the forward kinematics problem of the Gough-Stewart platform. DisGNet employs the k-FWL algorithm for message-passing, providing high expressiveness with a small parameter count, making it suitable for practical deployment. Additionally, we introduce the GPU-friendly Newton-Raphson method, an efficient parallelized optimization method executed on the GPU to refine DisGNet's output poses, achieving ultra-high-precision pose. This novel two-stage approach delivers ultra-high precision output while meeting real-time requirements. Our results indicate that on our dataset, DisGNet can achieves error accuracys below 1mm and 1deg at 79.8\% and 98.2\%, respectively. As executed on a GPU, our two-stage method can ensure the requirement for real-time computation. Codes are released at https://github.com/FLAMEZZ5201/DisGNet.
44.0MLMay 8
TRACE: Transport Alignment Conformal Prediction via Diffusion and Flow Matching ModelsZhenhan Fang, Aixin Tan, Jian Huang
Constructing valid and informative conformal prediction regions for multi-dimensional outputs remains a fundamental challenge. While conformal prediction provides finite-sample, distribution-free coverage guarantees, its practical performance critically depends on the choice of nonconformity score. Existing approaches often rely on restrictive geometric assumptions or require explicit likelihood evaluation and invertible transformations, limiting their applicability in complex generative settings. In this work, we introduce TRACE (TRansport Alignment Conformal Estimation), a conformal prediction framework that defines nonconformity through transport alignment in diffusion and flow matching models. Rather than evaluating likelihoods, we measure how well a candidate output aligns with the learned generative dynamics by averaging denoising or velocity-matching errors along stochastic transport trajectories. The resulting transport-based scores are scalar-valued and can be calibrated using split conformal prediction, yielding valid marginal coverage under exchangeability. We further analyze the statistical properties of the proposed scores and their sensitivity to computational budget. Experiments on synthetic and real datasets demonstrate valid coverage and show that the resulting regions adapt naturally to multimodal and non-convex conditional distributions.
43.1LGMay 7
RepFlow: Representation Enhanced Flow Matching for Causal Effect EstimationYifei Xie, Jian Huang
Estimating causal effects from observational data has become increasingly critical in diverse fields including healthcare, economics, and social policy. The fundamental challenge in causal inference arises from the missing counterfactuals and the selection bias. Existing methods are largely limited to point estimates and lack the capacity for distribution modeling. In this work, we propose RepFlow, a novel framework that formulates causal effect estimation as a joint optimization problem integrating representation learning with Conditional Flow Matching (CFM). RepFlow mitigates selection bias by minimizing the entropically regularized Wasserstein distance between treated and control representations. To enhance numerical stability, we further introduce an $L_2$ normalization constraint on latent representations. This balanced representation enables the flow model to accurately capture the distribution of potential outcomes. Extensive experiments across a wide range of benchmarks demonstrate that RepFlow consistently outperforms existing methods in both point and distributional causal effect estimation.
5.2LGApr 10
On-Meter Graph Machine Learning: A Case Study of PV Power Forecasting for Grid Edge IntelligenceJian Huang, Zixiang Ming, Yongli Zhu et al.
This paper presents a detailed study of how graph neural networks can be used on edge intelligent meters in a microgrid to forecast photovoltaic power generation. The problem background and the adopted technologies are introduced, including ONNX and ONNX Runtime. The hardware and software specifications of the smart meter are also briefly described. Then, the paper focuses on the training and deployment of two graph machine learning models, GCN and GraphSAGE, with particular emphasis on developing and deploying a customized ONNX operator for GCN. Finally, a case study is conducted using real datasets from a village microgrid. The performance of the two models is compared on both the PC and the smart meter, exhibiting successful deployments and executions on the smart meter.
98.4NAApr 7
Higher-Order Multiscale Computational Method for Multi-Continuum Problems in Highly Heterogeneous MediaHao Dong, Jiayuan Peng, Jian Huang
This paper presents a high-accuracy higher-order multiscale method for solving multi-continuum problems in in highly heterogeneous media. First, microscopic unit cell functions are defined, leading to the derivation of macroscopic homogenized equations and formulas for calculating effective parameters, which yield a higher-order multi-scale (HOMS) asymptotic solution. Subsequently, the pointwise approximation properties of this solution to the original equations are analyzed, and its convergence rate in the integral norm is rigorously established under certain assumptions. Furthermore, a multiscale numerical algorithm is developed by integrating the finite element method (FEM), finite difference method, and interpolation technique. Finally, numerical experiments demonstrate the high accuracy, efficiency, and stability of the proposed HOMS numerical algorithm.
7.8CLMar 16
DOS: Dependency-Oriented Sampler for Masked Diffusion Language ModelsXueyu Zhou, Yangrong Hu, Jian Huang
Masked diffusion language models (MDLMs) have recently emerged as a new paradigm in language modeling, offering flexible generation dynamics and enabling efficient parallel decoding. However, existing decoding strategies for pre-trained MDLMs predominantly rely on token-level uncertainty criteria, while largely overlooking sequence-level information and inter-token dependencies. To address this limitation, we propose Dependency-Oriented Sampler (DOS), a training-free decoding strategy that leverages inter-token dependencies to inform token updates during generation. Specifically, DOS exploits attention matrices from transformer blocks to approximate inter-token dependencies, emphasizing information from unmasked tokens when updating masked positions. Empirical results demonstrate that DOS consistently achieves superior performance on both code generation and mathematical reasoning tasks. Moreover, DOS can be seamlessly integrated with existing parallel sampling methods, leading to improved generation efficiency without sacrificing generation quality.
50.9CVMar 16
Riemannian Motion Generation: A Unified Framework for Human Motion Representation and Generation via Riemannian Flow MatchingFangran Miao, Jian Huang, Ting Li
Human motion generation is often learned in Euclidean spaces, although valid motions follow structured non-Euclidean geometry. We present Riemannian Motion Generation (RMG), a unified framework that represents motion on a product manifold and learns dynamics via Riemannian flow matching. RMG factorizes motion into several manifold factors, yielding a scale-free representation with intrinsic normalization, and uses geodesic interpolation, tangent-space supervision, and manifold-preserving ODE integration for training and sampling. On HumanML3D, RMG achieves state-of-the-art FID in the HumanML3D format (0.043) and ranks first on all reported metrics under the MotionStreamer format. On MotionMillion, it also surpasses strong baselines (FID 5.6, R@1 0.86). Ablations show that the compact $\mathscr{T}+\mathscr{R}$ (translation + rotations) representation is the most stable and effective, highlighting geometry-aware modeling as a practical and scalable route to high-fidelity motion generation.
MLDec 22, 2025
On Conditional Stochastic Interpolation for Generative Nonlinear Sufficient Dimension ReductionShuntuo Xu, Zhou Yu, Jian Huang
Identifying low-dimensional sufficient structures in nonlinear sufficient dimension reduction (SDR) has long been a fundamental yet challenging problem. Most existing methods lack theoretical guarantees of exhaustiveness in identifying lower dimensional structures, either at the population level or at the sample level. We tackle this issue by proposing a new method, generative sufficient dimension reduction (GenSDR), which leverages modern generative models. We show that GenSDR is able to fully recover the information contained in the central $σ$-field at both the population and sample levels. In particular, at the sample level, we establish a consistency property for the GenSDR estimator from the perspective of conditional distributions, capitalizing on the distributional learning capabilities of deep generative models. Moreover, by incorporating an ensemble technique, we extend GenSDR to accommodate scenarios with non-Euclidean responses, thereby substantially broadening its applicability. Extensive numerical results demonstrate the outstanding empirical performance of GenSDR and highlight its strong potential for addressing a wide range of complex, real-world tasks.
MLMar 31, 2024
Convergence of Continuous Normalizing Flows for Learning Probability DistributionsYuan Gao, Jian Huang, Yuling Jiao et al.
Continuous normalizing flows (CNFs) are a generative method for learning probability distributions, which is based on ordinary differential equations. This method has shown remarkable empirical success across various applications, including large-scale image synthesis, protein structure prediction, and molecule generation. In this work, we study the theoretical properties of CNFs with linear interpolation in learning probability distributions from a finite random sample, using a flow matching objective function. We establish non-asymptotic error bounds for the distribution estimator based on CNFs, in terms of the Wasserstein-2 distance. The key assumption in our analysis is that the target distribution satisfies one of the following three conditions: it either has a bounded support, is strongly log-concave, or is a finite or infinite mixture of Gaussian distributions. We present a convergence analysis framework that encompasses the error due to velocity estimation, the discretization error, and the early stopping error. A key step in our analysis involves establishing the regularity properties of the velocity field and its estimator for CNFs constructed with linear interpolation. This necessitates the development of uniform error bounds with Lipschitz regularity control of deep ReLU networks that approximate the Lipschitz function class, which could be of independent interest. Our nonparametric convergence analysis offers theoretical guarantees for using CNFs to learn probability distributions from a finite random sample.
AIDec 18, 2024
A Survey on Large Language Model-based Agents for Statistics and Data ScienceMaojun Sun, Ruijian Han, Binyan Jiang et al.
In recent years, data science agents powered by Large Language Models (LLMs), known as "data agents," have shown significant potential to transform the traditional data analysis paradigm. This survey provides an overview of the evolution, capabilities, and applications of LLM-based data agents, highlighting their role in simplifying complex data tasks and lowering the entry barrier for users without related expertise. We explore current trends in the design of LLM-based frameworks, detailing essential features such as planning, reasoning, reflection, multi-agent collaboration, user interface, knowledge integration, and system design, which enable agents to address data-centric problems with minimal human intervention. Furthermore, we analyze several case studies to demonstrate the practical applications of various data agents in real-world scenarios. Finally, we identify key challenges and propose future research directions to advance the development of data agents into intelligent statistical analysis software.
70.7AIApr 28
DRIVE: Modeling Skills at the Reasoning and Interaction Levels for Web Agents under Continual LearningXirui Liu, Sihang Zhou, Yanning Hou et al.
Web agents require both high-level reasoning (for task decomposition) and low-level interactions (for page elements manipulation) to conduct different tasks. However, these knowledge types differ fundamentally: reasoning knowledge (e.g., booking a flight requires first searching for routes) is abstract and transferable across websites, while interaction knowledge (e.g., clicking the Search button at a specific coordinate on Site A) depends heavily on page-specific contexts. Existing methods store experiences uniformly. This creates a dilemma: abstract representations lose executability on concrete pages, while concrete representations fail to generalize across domains. This entanglement limits capability accumulation: on new websites, agents either fail to recognize reusable task logic due to surface-level differences or attempt infeasible actions from outdated page structures. To disentangle them, we propose DRIVE, a dual-level skill modeling framework separating historical experience into natural language reasoning skills, which capture transferable task logic, and programmatic interaction skills, grounding abstract actions to executable operations. A scene-aware coordination mechanism adaptively retrieves and invokes these dual-level skills based on task semantics. DRIVE also uses skill-level reflection to identify hierarchy-specific failure modes, enabling targeted skill library expansion and refinement. Experiments across five WebArena domains show DRIVE attains an average task success rate of 52.8%, exceeding the skill-free baseline by 7.3 percentage points. Further ablations show reasoning and interaction skills provide distinct, complementary benefits, supporting separation of transferable task logic from executable page-level operations.
DCNov 20, 2024
Transforming the Hybrid Cloud for Emerging AI WorkloadsDeming Chen, Alaa Youssef, Ruchi Pendse et al.
This white paper, developed through close collaboration between IBM Research and UIUC researchers within the IIDAI Institute, envisions transforming hybrid cloud systems to meet the growing complexity of AI workloads through innovative, full-stack co-design approaches, emphasizing usability, manageability, affordability, adaptability, efficiency, and scalability. By integrating cutting-edge technologies such as generative and agentic AI, cross-layer automation and optimization, unified control plane, and composable and adaptive system architecture, the proposed framework addresses critical challenges in energy efficiency, performance, and cost-effectiveness. Incorporating quantum computing as it matures will enable quantum-accelerated simulations for materials science, climate modeling, and other high-impact domains. Collaborative efforts between academia and industry are central to this vision, driving advancements in foundation models for material design and climate solutions, scalable multimodal data processing, and enhanced physics-based AI emulators for applications like weather forecasting and carbon sequestration. Research priorities include advancing AI agentic systems, LLM as an Abstraction (LLMaaA), AI model optimization and unified abstractions across heterogeneous infrastructure, end-to-end edge-cloud transformation, efficient programming model, middleware and platform, secure infrastructure, application-adaptive cloud systems, and new quantum-classical collaborative workflows. These ideas and solutions encompass both theoretical and practical research questions, requiring coordinated input and support from the research community. This joint initiative aims to establish hybrid clouds as secure, efficient, and sustainable platforms, fostering breakthroughs in AI-driven applications and scientific discovery across academia, industry, and society.
MLDec 9, 2023
Conditional Stochastic Interpolation for Generative LearningDing Huang, Jian Huang, Ting Li et al.
We propose a conditional stochastic interpolation (CSI) method for learning conditional distributions. CSI is based on estimating probability flow equations or stochastic differential equations that transport a reference distribution to the target conditional distribution. This is achieved by first learning the conditional drift and score functions based on CSI, which are then used to construct a deterministic process governed by an ordinary differential equation or a diffusion process for conditional sampling. In our proposed approach, we incorporate an adaptive diffusion term to address the instability issues arising in the diffusion process. We derive explicit expressions of the conditional drift and score functions in terms of conditional expectations, which naturally lead to an nonparametric regression approach to estimating these functions. Furthermore, we establish nonasymptotic error bounds for learning the target conditional distribution. We illustrate the application of CSI on image generation using a benchmark image dataset.
MLMar 29, 2025
Estimating Unbounded Density Ratios: Applications in Error Control under Covariate ShiftShuntuo Xu, Zhou Yu, Jian Huang
The density ratio is an important metric for evaluating the relative likelihood of two probability distributions, with extensive applications in statistics and machine learning. However, existing estimation theories for density ratios often depend on stringent regularity conditions, mainly focusing on density ratio functions with bounded domains and ranges. In this paper, we study density ratio estimators using loss functions based on least squares and logistic regression. We establish upper bounds on estimation errors with standard minimax optimal rates, up to logarithmic factors. Our results accommodate density ratio functions with unbounded domains and ranges. We apply our results to nonparametric regression and conditional flow models under covariate shift and identify the tail properties of the density ratio as crucial for error control across domains affected by covariate shift. We provide sufficient conditions under which loss correction is unnecessary and demonstrate effective generalization capabilities of a source estimator to any suitable target domain. Our simulation experiments support these theoretical findings, indicating that the source estimator can outperform those derived from loss correction methods, even when the true density ratio is known.
ARJul 15, 2025
ELK: Exploring the Efficiency of Inter-core Connected AI Chips with Deep Learning Compiler TechniquesYiqi Liu, Yuqi Xue, Noelle Crawford et al.
To meet the increasing demand of deep learning (DL) models, AI chips are employing both off-chip memory (e.g., HBM) and high-bandwidth low-latency interconnect for direct inter-core data exchange. However, it is not easy to explore the efficiency of these inter-core connected AI (ICCA) chips, due to a fundamental tussle among compute (per-core execution), communication (inter-core data exchange), and I/O (off-chip data access). In this paper, we develop Elk, a DL compiler framework to maximize the efficiency of ICCA chips by jointly trading off all the three performance factors discussed above. Elk structures these performance factors into configurable parameters and forms a global trade-off space in the DL compiler. To systematically explore this space and maximize overall efficiency, Elk employs a new inductive operator scheduling policy and a cost-aware on-chip memory allocation algorithm. It generates globally optimized execution plans that best overlap off-chip data loading and on-chip execution. To examine the efficiency of Elk, we build a full-fledged emulator based on a real ICCA chip IPU-POD4, and an ICCA chip simulator for sensitivity analysis with different interconnect network topologies. Elk achieves 94% of the ideal roofline performance of ICCA chips on average, showing the benefits of supporting large DL models on ICCA chips. We also show Elk's capability of enabling architecture design space exploration for new ICCA chip development.
MLMay 12, 2025
Wasserstein Distributionally Robust Nonparametric RegressionChangyu Liu, Yuling Jiao, Junhui Wang et al.
Wasserstein distributionally robust optimization (WDRO) strengthens statistical learning under model uncertainty by minimizing the local worst-case risk within a prescribed ambiguity set. Although WDRO has been extensively studied in parametric settings, its theoretical properties in nonparametric frameworks remain underexplored. This paper investigates WDRO for nonparametric regression. We first establish a structural distinction based on the order $k$ of the Wasserstein distance, showing that $k=1$ induces Lipschitz-type regularization, whereas $k > 1$ corresponds to gradient-norm regularization. To address model misspecification, we analyze the excess local worst-case risk, deriving non-asymptotic error bounds for estimators constructed using norm-constrained feedforward neural networks. This analysis is supported by new covering number and approximation bounds that simultaneously control both the function and its gradient. The proposed estimator achieves a convergence rate of $n^{-2β/(d+2β)}$ up to logarithmic factors, where $β$ depends on the target's smoothness and network parameters. This rate is shown to be minimax optimal under conditions commonly satisfied in high-dimensional settings. Moreover, these bounds on the excess local worst-case risk imply guarantees on the excess natural risk, ensuring robustness against any distribution within the ambiguity set. We show the framework's generality across regression and classification problems. Simulation studies and an application to the MNIST dataset further illustrate the estimator's robustness.
SYDec 2, 2024
Embedded Machine Learning for Solar PV Power Regulation in a Remote MicrogridYongli Zhu, Linna Xu, Jian Huang
This paper presents a machine-learning study for solar inverter power regulation in a remote microgrid. Machine learning models for active and reactive power control are respectively trained using an ensemble learning method. Then, unlike conventional schemes that make inferences on a central server in the far-end control center, the proposed scheme deploys the trained models on an embedded edge-computing device near the inverter to reduce the communication delay. Experiments on a real embedded device achieve matched results as on the desktop PC, with about 0.1ms time cost for each inference input.
LGJul 9, 2025
On-Device Training of PV Power Forecasting Models in a Smart Meter for Grid Edge IntelligenceJian Huang, Yongli Zhu, Linna Xu et al.
In this paper, an edge-side model training study is conducted on a resource-limited smart meter. The motivation of grid-edge intelligence and the concept of on-device training are introduced. Then, the technical preparation steps for on-device training are described. A case study on the task of photovoltaic power forecasting is presented, where two representative machine learning models are investigated: a gradient boosting tree model and a recurrent neural network model. To adapt to the resource-limited situation in the smart meter, "mixed"- and "reduced"-precision training schemes are also devised. Experiment results demonstrate the feasibility of economically achieving grid-edge intelligence via the existing advanced metering infrastructures.
MLMay 8, 2025
Boosting Statistic Learning with Synthetic Data from Pretrained Large ModelsJialong Jiang, Wenkang Hu, Jian Huang et al.
The rapid advancement of generative models, such as Stable Diffusion, raises a key question: how can synthetic data from these models enhance predictive modeling? While they can generate vast amounts of datasets, only a subset meaningfully improves performance. We propose a novel end-to-end framework that generates and systematically filters synthetic data through domain-specific statistical methods, selectively integrating high-quality samples for effective augmentation. Our experiments demonstrate consistent improvements in predictive performance across various settings, highlighting the potential of our framework while underscoring the inherent limitations of generative models for data augmentation. Despite the ability to produce large volumes of synthetic data, the proportion that effectively improves model performance is limited.
ROApr 20, 2025
K2MUSE: A human lower limb multimodal dataset under diverse conditions for facilitating rehabilitation roboticsJiwei Li, Bi Zhang, Xiaowei Tan et al.
The natural interaction and control performance of lower limb rehabilitation robots are closely linked to biomechanical information from various human locomotion activities. Multidimensional human motion data significantly deepen the understanding of the complex mechanisms governing neuromuscular alterations, thereby facilitating the development and application of rehabilitation robots in multifaceted real-world environments. However, currently available lower limb datasets are inadequate for supplying the essential multimodal data and large-scale gait samples necessary for effective data-driven approaches, and they neglect the significant effects of acquisition interference in real applications.To fill this gap, we present the K2MUSE dataset, which includes a comprehensive collection of multimodal data, comprising kinematic, kinetic, amplitude-mode ultrasound (AUS), and surface electromyography (sEMG) measurements. The proposed dataset includes lower limb multimodal data from 30 able-bodied participants walking under different inclines (0$^\circ$, $\pm$5$^\circ$, and $\pm$10$^\circ$), various speeds (0.5 m/s, 1.0 m/s, and 1.5 m/s), and different nonideal acquisition conditions (muscle fatigue, electrode shifts, and inter-day differences). The kinematic and ground reaction force data were collected via a Vicon motion capture system and an instrumented treadmill with embedded force plates, whereas the sEMG and AUS data were synchronously recorded for thirteen muscles on the bilateral lower limbs. This dataset offers a new resource for designing control frameworks for rehabilitation robots and conducting biomechanical analyses of lower limb locomotion. The dataset is available at https://k2muse.github.io/.
MLApr 13, 2025
From Conditional to Unconditional Independence: Testing Conditional Independence via Transport MapsChenxuan He, Yuan Gao, Liping Zhu et al.
Testing conditional independence between two random vectors given a third is a fundamental and challenging problem in statistics, particularly in multivariate nonparametric settings due to the complexity of conditional structures. We propose a novel method for testing conditional independence by transforming it to an unconditional independence test problem. We achieve this by constructing two transport maps that transform conditional independence into unconditional independence, this substantially simplifies the problem. These transport maps are estimated from data using conditional continuous normalizing flow models. Within this framework, we derive a test statistic and prove its asymptotic validity under both the null and alternative hypotheses. A permutation-based procedure is employed to evaluate the significance of the test. We validate the proposed method through extensive simulations and real-data analysis. Our numerical studies demonstrate the practical effectiveness of the proposed method for conditional independence
MLMar 29, 2025
Fair Sufficient Representation LearningXueyu Zhou, Chun Yin IP, Jian Huang
The main objective of fair statistical modeling and machine learning is to minimize or eliminate biases that may arise from the data or the model itself, ensuring that predictions and decisions are not unjustly influenced by sensitive attributes such as race, gender, age, or other protected characteristics. In this paper, we introduce a Fair Sufficient Representation Learning (FSRL) method that balances sufficiency and fairness. Sufficiency ensures that the representation should capture all necessary information about the target variables, while fairness requires that the learned representation remains independent of sensitive attributes. FSRL is based on a convex combination of an objective function for learning a sufficient representation and an objective function that ensures fairness. Our approach manages fairness and sufficiency at the representation level, offering a novel perspective on fair representation learning. We implement this method using distance covariance, which is effective for characterizing independence between random variables. We establish the convergence properties of the learned representations. Experiments conducted on healthcase and text datasets with diverse structures demonstrate that FSRL achieves a superior trade-off between fairness and accuracy compared to existing approaches.
LGMar 3, 2025
DeepSuM: Deep Sufficient Modality Learning FrameworkZhe Gao, Jian Huang, Ting Li et al.
Multimodal learning has become a pivotal approach in developing robust learning models with applications spanning multimedia, robotics, large language models, and healthcare. The efficiency of multimodal systems is a critical concern, given the varying costs and resource demands of different modalities. This underscores the necessity for effective modality selection to balance performance gains against resource expenditures. In this study, we propose a novel framework for modality selection that independently learns the representation of each modality. This approach allows for the assessment of each modality's significance within its unique representation space, enabling the development of tailored encoders and facilitating the joint analysis of modalities with distinct characteristics. Our framework aims to enhance the efficiency and effectiveness of multimodal learning by optimizing modality integration and selection.