CV · Aug 8, 2023 · Code
LEFormer: A Hybrid CNN-Transformer Architecture for Accurate Lake Extraction from Remote Sensing Imagery
Ben Chen, Xuechao Zou, Yu Zhang et al.
Lake extraction from remote sensing images is challenging due to complex lake shapes and inherent data noise. Existing methods suffer from blurred segmentation boundaries and poor foreground modeling. This paper proposes a hybrid CNN-Transformer architecture, called LEFormer, for accurate lake extraction. LEFormer contains three main modules: a CNN encoder, a Transformer encoder, and cross-encoder fusion. The CNN encoder effectively recovers local spatial information and improves fine-scale details. Simultaneously, the Transformer encoder captures long-range dependencies between sequences of any length, allowing it to obtain global features and context information. The cross-encoder fusion module integrates the local and global features to improve mask prediction. Experimental results show that LEFormer consistently achieves state-of-the-art performance and efficiency on the Surface Water and the Qinghai-Tibet Plateau Lake datasets. Specifically, LEFormer achieves 90.86% and 97.42% mIoU on the two datasets, respectively, with a parameter count of 3.61M, while being 20× smaller than the previous best lake extraction method. The source code is available at https://github.com/BastianChen/LEFormer.
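For readers unfamiliar with the mIoU figures quoted above, the following is a minimal sketch of mean intersection-over-union for a two-class (lake vs. background) mask. The function name and the flat-list input format are illustrative assumptions; the abstract does not describe the authors' exact evaluation protocol (for example, whether IoU is accumulated per image or over the whole dataset).

```python
def mean_iou(pred, target, num_classes=2):
    """Mean IoU over classes for two flat label sequences of equal length.

    Hypothetical helper for illustration; not taken from the LEFormer code.
    """
    ious = []
    for cls in range(num_classes):
        # Pixels where both prediction and ground truth carry this class.
        inter = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
        # Pixels where either prediction or ground truth carries this class.
        union = sum(1 for p, t in zip(pred, target) if p == cls or t == cls)
        if union > 0:  # skip classes absent from both masks
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy 4-pixel mask: 1 = lake, 0 = background.
score = mean_iou([1, 1, 0, 0], [1, 0, 0, 0])
```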
LG · Oct 23, 2023
HetGPT: Harnessing the Power of Prompt Tuning in Pre-Trained Heterogeneous Graph Neural Networks
Yihong Ma, Ning Yan, Jiayu Li et al.
Graphs have emerged as a natural choice to represent and analyze the intricate patterns and rich information of the Web, enabling applications such as online page classification and social recommendation. The prevailing "pre-train, fine-tune" paradigm has been widely adopted in graph machine learning tasks, particularly in scenarios with limited labeled nodes. However, this approach often exhibits a misalignment between the training objectives of pretext tasks and those of downstream tasks. This gap can result in the "negative transfer" problem, wherein the knowledge gained from pre-training adversely affects performance in the downstream tasks. The surge in prompt-based learning within Natural Language Processing (NLP) suggests the potential of adapting a "pre-train, prompt" paradigm to graphs as an alternative. However, existing graph prompting techniques are tailored to homogeneous graphs, neglecting the inherent heterogeneity of Web graphs. To bridge this gap, we propose HetGPT, a general post-training prompting framework to improve the predictive performance of pre-trained heterogeneous graph neural networks (HGNNs). The key is the design of a novel prompting function that integrates a virtual class prompt and a heterogeneous feature prompt, with the aim to reformulate downstream tasks to mirror pretext tasks. Moreover, HetGPT introduces a multi-view neighborhood aggregation mechanism, capturing the complex neighborhood structure in heterogeneous graphs. Extensive experiments on three benchmark datasets demonstrate HetGPT's capability to enhance the performance of state-of-the-art HGNNs on semi-supervised node classification.
RO · Apr 28 · Code
Genie Sim 3.0: A High-Fidelity Comprehensive Simulation Platform for Humanoid Robot
Chenghao Yin, Da Huang, Di Yang et al.
The development of robust and generalizable robot learning models is critically contingent upon the availability of large-scale, diverse training data and reliable evaluation benchmarks. Collecting data in the physical world poses prohibitive costs and scalability challenges, and prevailing simulation benchmarks frequently suffer from fragmentation, narrow scope, or insufficient fidelity to enable effective sim-to-real transfer. To address these challenges, we introduce Genie Sim 3.0, a unified simulation platform for robotic manipulation. We present Genie Sim Generator, a large language model (LLM)-powered tool that constructs high-fidelity scenes from natural language instructions. Its principal strength resides in rapid and multi-dimensional generalization, facilitating the synthesis of diverse environments to support scalable data collection and robust policy evaluation. We introduce the first benchmark that pioneers the application of LLMs for automated evaluation. It leverages an LLM to mass-generate evaluation scenarios and employs a Vision-Language Model (VLM) to establish an automated assessment pipeline. We also release an open-source dataset comprising more than 10,000 hours of synthetic data across over 200 tasks. Through systematic experimentation, we validate the robust zero-shot sim-to-real transfer capability of our open-source dataset, demonstrating that synthetic data can serve as an effective substitute for real-world data under controlled conditions for scalable policy training. For code and dataset details, please refer to: https://github.com/AgibotTech/genie_sim.
CL · Jan 7, 2023
Linguistic-style-aware Neural Networks for Fake News Detection
Xinyi Zhou, Jiayu Li, Qinzhou Li et al. · uw
We propose the hierarchical recursive neural network (HERO) to predict fake news by learning its linguistic style, which is distinguishable from the truth, as psychological theories reveal. We first generate the hierarchical linguistic tree of news documents; by doing so, we translate each news document's linguistic style into its writer's usage of words and how these words are recursively structured as phrases, sentences, paragraphs, and, ultimately, the document. By integrating the hierarchical linguistic tree with the neural network, the proposed method learns and classifies the representation of news documents by capturing their locally sequential and globally recursive structures that are linguistically meaningful. To the best of our knowledge, it is the first work to offer the hierarchical linguistic tree and a neural network that preserves the tree information. Experimental results based on public real-world datasets demonstrate the proposed method's effectiveness, which can outperform state-of-the-art techniques in classifying short and long news documents. We also examine the differential linguistic style of fake news and the truth and observe some patterns of fake news. The code and data are publicly available.
MED-PH · Jun 16, 2023
Magnetic Resonance Spectroscopy Quantification Aided by Deep Estimations of Imperfection Factors and Macromolecular Signal
Dicheng Chen, Meijin Lin, Huiting Liu et al.
Objective: Magnetic Resonance Spectroscopy (MRS) is an important technique for biomedical detection. However, it is challenging to accurately quantify metabolites with proton MRS due to serious overlaps of metabolite signals, imperfections caused by non-ideal acquisition conditions, and interference from strong background signals, mainly from macromolecules. The most popular method, LCModel, adopts complicated non-linear least squares to quantify metabolites and addresses these problems by designing empirical priors such as basis sets and imperfection factors. However, when the signal-to-noise ratio of the MRS signal is low, the solution may deviate substantially. Methods: Linear Least Squares (LLS) is integrated with deep learning to reduce the complexity of solving this overall quantification. First, a neural network is designed to explicitly predict the imperfection factors and the overall signal from macromolecules. Then, metabolite quantification is solved analytically with the introduced LLS. In our Quantification Network (QNet), LLS takes part in the backpropagation of network training, which allows the quantification error to be fed back into metabolite spectrum estimation. This scheme greatly improves generalization to metabolite concentrations unseen in training compared to the end-to-end deep learning method. Results: Experiments show that, compared with LCModel, the proposed QNet has smaller quantification errors for simulated data and presents more stable quantification for in vivo data from 20 healthy subjects over a wide range of signal-to-noise ratios. QNet also outperforms other end-to-end deep learning methods. Conclusion: This study provides intelligent, reliable, and robust MRS quantification. Significance: QNet is the first LLS quantification aided by deep learning.
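The analytic LLS step mentioned in the Methods can be illustrated with a toy example. The sketch below solves the normal equations for two metabolite basis spectra and returns their concentrations. It is an illustration under strong assumptions: real-valued vectors, exactly two basis spectra, and no imperfection factors or macromolecular background (which, per the abstract, QNet predicts with a network before this step). The function name and inputs are hypothetical.

```python
def lls_two_metabolites(basis_a, basis_b, spectrum):
    """Least-squares concentrations (c_a, c_b) minimizing
    ||c_a * basis_a + c_b * basis_b - spectrum||^2 via the normal equations.

    Toy illustration only; not the QNet implementation.
    """
    aa = sum(x * x for x in basis_a)
    bb = sum(y * y for y in basis_b)
    ab = sum(x * y for x, y in zip(basis_a, basis_b))
    a_s = sum(x * s for x, s in zip(basis_a, spectrum))
    b_s = sum(y * s for y, s in zip(basis_b, spectrum))
    det = aa * bb - ab * ab
    if abs(det) < 1e-12:
        raise ValueError("basis spectra are (nearly) collinear")
    # Cramer's rule on [[aa, ab], [ab, bb]] @ [c_a, c_b] = [a_s, b_s].
    c_a = (a_s * bb - ab * b_s) / det
    c_b = (aa * b_s - ab * a_s) / det
    return c_a, c_b


# Overlapping toy basis spectra; the mixture is 2 * basis_a + 3 * basis_b.
c_a, c_b = lls_two_metabolites([1, 0, 1, 0], [0, 1, 1, 0], [2, 3, 5, 0])
```

Because the solution is a closed-form expression of the inputs, it is differentiable, which is consistent with the abstract's statement that LLS takes part in backpropagation.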
CL · May 13 · Code
CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence
Dongsheng Ma, Jiayu Li, Zhengren Wang et al.
Multimodal Large Language Models (MLLMs) have significantly advanced document understanding, yet current Doc-VQA evaluations score only the final answer and leave the supporting evidence unchecked. This answer-only approach masks a critical failure mode: a model can land on the correct answer while grounding it in the wrong passage -- a critical risk in high-stakes domains like law, finance, and medicine, where every conclusion must be traceable to a specific source region. To address this, we introduce CiteVQA, a benchmark that requires models to return element-level bounding-box citations alongside each answer, evaluating both jointly. CiteVQA comprises 1,897 questions across 711 PDFs spanning seven domains and two languages, averaging 40.6 pages per document. To ensure fidelity and scalability, the ground-truth citations are generated by an automated pipeline, which identifies crucial evidence via masking ablation, and are subsequently validated through expert review. At the core of our evaluation is Strict Attributed Accuracy (SAA), which credits a prediction only when the answer and the cited region are both correct. Auditing 20 MLLMs reveals a pervasive Attribution Hallucination: models frequently produce the right answer while citing the wrong region. The strongest system (Gemini-3.1-Pro-Preview) achieves an SAA of only 76.0, and the strongest open-source MLLM reaches just 22.5. Ultimately, towards trustworthy document intelligence, CiteVQA exposes a reliability gap that answer-only evaluations overlook, providing the instrumentation needed to close it. Our repository is available at https://github.com/opendatalab/CiteVQA.
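The joint scoring rule behind SAA can be sketched as follows. The record layout, the one-box-per-question simplification, and the 0.5 IoU threshold for a "correct" cited region are all assumptions made for illustration; the abstract does not specify how the benchmark matches predicted and ground-truth regions.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def strict_attributed_accuracy(records, iou_threshold=0.5):
    """Percentage of records whose answer AND cited box are both correct.

    `records` is a hypothetical list of dicts with keys
    'answer_correct' (bool), 'pred_box', and 'gold_box'.
    """
    if not records:
        return 0.0
    hits = sum(
        1
        for r in records
        if r["answer_correct"]
        and box_iou(r["pred_box"], r["gold_box"]) >= iou_threshold
    )
    return 100.0 * hits / len(records)


gold = (0, 0, 10, 10)
records = [
    {"answer_correct": True, "pred_box": (0, 0, 10, 10), "gold_box": gold},
    {"answer_correct": True, "pred_box": (50, 50, 60, 60), "gold_box": gold},  # right answer, wrong region
    {"answer_correct": False, "pred_box": (0, 0, 10, 10), "gold_box": gold},
    {"answer_correct": True, "pred_box": (0, 0, 10, 10), "gold_box": gold},
]
saa = strict_attributed_accuracy(records)
```

In this toy set, answer-only accuracy would count three of four predictions as correct, while the joint rule credits only two, which is the kind of gap the benchmark is designed to surface.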
LG · Dec 3, 2025
Towards Irreversible Machine Unlearning for Diffusion Models
Xun Yuan, Zilong Zhao, Jiayu Li et al.
Diffusion models are renowned for their state-of-the-art performance in generating synthetic images. However, concerns related to safety, privacy, and copyright highlight the need for machine unlearning, which can make diffusion models forget specific training data and prevent the generation of sensitive or unwanted content. Current machine unlearning methods for diffusion models are primarily designed for conditional diffusion models and focus on unlearning specific data classes or features. Among these methods, finetuning-based machine unlearning methods are recognized for their efficiency and effectiveness; they update the parameters of pre-trained diffusion models by minimizing carefully designed loss functions. However, in this paper, we propose a novel attack named Diffusion Model Relearning Attack (DiMRA), which can reverse the finetuning-based machine unlearning methods, exposing a significant vulnerability of this kind of technique. Without prior knowledge of the unlearning elements, DiMRA optimizes the unlearned diffusion model on an auxiliary dataset to reverse the unlearning, enabling the model to regenerate previously unlearned elements. To mitigate this vulnerability, we propose a novel machine unlearning method for diffusion models, termed Diffusion Model Unlearning by Memorization (DiMUM). Unlike traditional methods that focus on forgetting, DiMUM memorizes alternative data or features to replace targeted unlearning data or features in order to prevent generating such elements. In our experiments, we demonstrate the effectiveness of DiMRA in reversing state-of-the-art finetuning-based machine unlearning methods for diffusion models, highlighting the need for more robust solutions. We extensively evaluate DiMUM, demonstrating its superior ability to preserve the generative performance of diffusion models while enhancing robustness against DiMRA.
RO · Apr 14
HazardArena: Evaluating Semantic Safety in Vision-Language-Action Models
Zixing Chen, Yifeng Gao, Li Wang et al.
Vision-Language-Action (VLA) models inherit rich world knowledge from vision-language backbones and acquire executable skills via action demonstrations. However, existing evaluations largely focus on action execution success, leaving action policies loosely coupled with visual-linguistic semantics. This decoupling exposes a systematic vulnerability whereby correct action execution may induce unsafe outcomes under semantic risk. To expose this vulnerability, we introduce HazardArena, a benchmark designed to evaluate semantic safety in VLAs under controlled yet risk-bearing contexts. HazardArena is constructed from safe/unsafe twin scenarios that share matched objects, layouts, and action requirements, differing only in the semantic context that determines whether an action is unsafe. We find that VLA models trained exclusively on safe scenarios often fail to behave safely when evaluated in their corresponding unsafe counterparts. HazardArena includes over 2,000 assets and 40 risk-sensitive tasks spanning 7 real-world risk categories grounded in established robotic safety standards. To mitigate this vulnerability, we propose a training-free Safety Option Layer that constrains action execution using semantic attributes or a vision-language judge, substantially reducing unsafe behaviors with minimal impact on task performance. We hope that HazardArena highlights the need to rethink how semantic safety is evaluated and enforced in VLAs as they scale toward real-world deployment.
CV · Mar 29, 2024 · Code
VHM: Versatile and Honest Vision Language Model for Remote Sensing Image Analysis
Chao Pang, Xingxing Weng, Jiang Wu et al.
This paper develops a Versatile and Honest vision language Model (VHM) for remote sensing image analysis. VHM is built on a large-scale remote sensing image-text dataset with rich-content captions (VersaD), and an honest instruction dataset comprising both factual and deceptive questions (HnstD). Unlike prevailing remote sensing image-text datasets, in which image captions focus on a few prominent objects and their relationships, VersaD captions provide detailed information about image properties, object attributes, and the overall scene. This comprehensive captioning enables VHM to thoroughly understand remote sensing images and perform diverse remote sensing tasks. Moreover, different from existing remote sensing instruction datasets that only include factual questions, HnstD contains additional deceptive questions stemming from the non-existence of objects. This feature prevents VHM from producing affirmative answers to nonsense queries, thereby ensuring its honesty. In our experiments, VHM significantly outperforms various vision language models on common tasks of scene classification, visual question answering, and visual grounding. Additionally, VHM achieves competent performance on several unexplored tasks, such as building vectorizing, multi-label classification and honest question answering. We will release the code, data and model weights at https://github.com/opendatalab/VHM.
LG · Sep 25, 2024
CombU: A Combined Unit Activation for Fitting Mathematical Expressions with Neural Networks
Jiayu Li, Zilong Zhao, Kevin Yee et al.
Activation functions are fundamental to neural networks as they introduce non-linearity into data relationships, thereby enabling deep networks to approximate complex data relations. Existing efforts to enhance neural network performance have predominantly focused on developing new mathematical functions. However, we find that a well-designed combination of existing activation functions within a neural network can also achieve this objective. In this paper, we introduce the Combined Units activation (CombU), which employs different activation functions at various dimensions across different layers. This approach can be theoretically proven to fit most mathematical expressions accurately. The experiments conducted on four mathematical expression datasets, compared against six state-of-the-art (SOTA) activation function algorithms, demonstrate that CombU outperforms all SOTA algorithms in 10 out of 16 metrics and ranks in the top three for the remaining six metrics.
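The idea of applying different activation functions to different dimensions of the same layer output can be sketched in a few lines. The particular activations (ReLU, ELU, tanh) and the equal contiguous split are assumptions for illustration only; the abstract does not state which functions CombU combines or how dimensions are assigned per layer.

```python
import math


def relu(x):
    return max(0.0, x)


def elu(x, alpha=1.0):
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def combined_activation(vec, activations=(relu, elu, math.tanh)):
    """Apply a different activation to each contiguous chunk of `vec`.

    Hypothetical sketch: dimension i is routed to activation number
    floor(i * k / n), where k is the number of activations and n = len(vec).
    """
    n, k = len(vec), len(activations)
    return [activations[i * k // n](x) for i, x in enumerate(vec)]


# Six dimensions: the first two use ReLU, the middle two ELU, the last two tanh.
out = combined_activation([-1.0, 2.0, -1.0, 2.0, -1.0, 2.0])
```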
AI · May 18
Learning to Hand Off: Provably Convergent Workflow Learning under Interface Constraints
Jiayu Li, Enpei Zhang, Dawei Zhou et al.
We study workflow learning in a setting where specialized agents hand off control through a shared artifact, each agent observes only a local function of that artifact and its own private state, and no centralized learner accesses joint trajectories -- the operating regime of multi-agent LLM pipelines that span organizational, vendor, or trust boundaries. We formalize this regime as an interface-constrained semi-Markov decision process (IC-SMDP), whose decision epochs occur at handoff times, and design IC-$Q$, an asynchronous decentralized $Q$-learning algorithm in which cross-agent coordination at every handoff is exactly one scalar. Our main result is a finite-sample bound for neural IC-$Q$ that decomposes into three independently controllable error sources: neural function-approximation error, interface representation gap, and a mixing-time residual, under the random option-duration discount. Establishing this bound requires lifting the approximate information state (AIS) framework from single-agent primitive-step MDPs to multi-agent SMDPs and controlling Markovian noise under random duration, neither of which has been done in prior work. To our knowledge this is the first finite-sample guarantee for neural $Q$-learning under decentralized partial observability. Four experiments (a controlled synthetic IC-SMDP that validates the bound term-by-term, multi-LLM mathematical reasoning, multi-agent routing, and multi-agent CPU programming) show that IC-$Q$ matches a centralized oracle without any agent observing joint trajectories, with each of the three error sources scaling along its corresponding axis as the bound predicts.
ML · May 18
Dual-Channel Tensor Neural Networks: Finite-Sample Theory and Conformal Structure Selection
Elynn Chen, Jiayu Li, Zheshi Zheng et al.
Tensor-valued data arise naturally in neuroimaging, genomics, climate science, and spatiotemporal networks, where multilinear dependencies across modes carry information that is destroyed under vectorization. Existing approaches either impose a single low-rank structure, which can miss localized signal, or treat the tensor as a long vector, which discards its multiway geometry. We propose a *Dual-Channel Tensor Neural Network* (DC-TNN) that decomposes each tensor input into a low-rank core and a sparse refinement, and processes the two components through coupled neural channels. The framework is structure-agnostic and accommodates CP, Tucker, and tensor-train cores within a single architecture. For estimation, we establish non-asymptotic risk bounds for the DC-TNN estimator that decompose into network approximation, core estimation, and refinement-selection terms, and show that the effective dimension is determined jointly by the core rank and refinement sparsity rather than by the ambient tensor size. For inference, we develop a *structure-aware conformal ROC* procedure that calibrates within the core-refinement latent space and produces ROC and AUC confidence bands with finite-sample, distribution-free coverage. Building on this, we propose a *conformal structure selector* that, to our knowledge, is the *first distribution-free procedure* for choosing among candidate tensor decompositions with finite-sample validity. Simulations and an analysis of a protein dataset demonstrate competitive predictive accuracy, reliable uncertainty quantification, and consistent recovery of the tensor structure.
CR · Nov 15, 2025
AttackVLA: Benchmarking Adversarial and Backdoor Attacks on Vision-Language-Action Models
Jiayu Li, Yunhan Zhao, Xiang Zheng et al.
Vision-Language-Action (VLA) models enable robots to interpret natural-language instructions and perform diverse tasks, yet their integration of perception, language, and control introduces new safety vulnerabilities. Despite growing interest in attacking such models, the effectiveness of existing techniques remains unclear due to the absence of a unified evaluation framework. One major issue is that differences in action tokenizers across VLA architectures hinder reproducibility and fair comparison. More importantly, most existing attacks have not been validated in real-world scenarios. To address these challenges, we propose AttackVLA, a unified framework that aligns with the VLA development lifecycle, covering data construction, model training, and inference. Within this framework, we implement a broad suite of attacks, including all existing attacks targeting VLAs and multiple adapted attacks originally developed for vision-language models, and evaluate them in both simulation and real-world settings. Our analysis of existing attacks reveals a critical gap: current methods tend to induce untargeted failures or static action states, leaving targeted attacks that drive VLAs to perform precise long-horizon action sequences largely unexplored. To fill this gap, we introduce BackdoorVLA, a targeted backdoor attack that compels a VLA to execute an attacker-specified long-horizon action sequence whenever a trigger is present. We evaluate BackdoorVLA in both simulated benchmarks and real-world robotic settings, achieving an average targeted success rate of 58.4% and reaching 100% on selected tasks. Our work provides a standardized framework for evaluating VLA vulnerabilities and demonstrates the potential for precise adversarial manipulation, motivating further research on securing VLA-based embodied systems.
CV · May 1
Intrinsic Gradient Suppression for Label-Noise Prompt Tuning in Vision-Language Models
Jiayu Li, Jiaxin Qi, Sheng Zhou et al.
Contrastive vision-language models like CLIP exhibit remarkable zero-shot generalization. However, prompt tuning remains highly sensitive to label noise, as mislabeled samples generate disproportionately large gradients that can overwhelm pre-trained priors. We argue that because CLIP already provides a near-optimal initialization, adaptation should be inherently conservative, particularly against the extreme gradient updates common in noisy settings. To this end, we propose Double-Softmax Prompt Tuning (DSPT), a hyperparameter-free method for intrinsic gradient suppression. By applying a sequential probabilistic normalization, DSPT induces a self-adaptive saturation zone that suppresses gradients from high-error noisy samples while maintaining informative updates. We also provide both theoretical analysis and empirical evidence about how this mechanism achieves adaptive suppression. This design transforms "gradient vanishing", traditionally a training bottleneck, into a principled noise-filtering shield for label-noise prompt tuning. Extensive experiments confirm that this simple, drop-in design achieves state-of-the-art robustness across various noisy benchmarks, outperforming methods with complex architectures and handcrafted hyperparameters.
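The saturation effect described above can be checked numerically on a toy example. The sketch below assumes that "double softmax" means taking cross-entropy on softmax(softmax(z)) for logits z; this reading, the absence of any temperature or logit scaling, and all function names are assumptions, since the abstract does not give the exact formulation. It compares the logit-gradient norm of ordinary softmax cross-entropy with that of the doubled version.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def grad_single(logits, label):
    """Gradient of -log softmax(z)[label] w.r.t. z, i.e. p - onehot(label)."""
    p = softmax(logits)
    return [pi - (1.0 if i == label else 0.0) for i, pi in enumerate(p)]


def grad_double(logits, label):
    """Gradient of -log softmax(softmax(z))[label] w.r.t. z (assumed form)."""
    p = softmax(logits)
    q = softmax(p)
    # Gradient w.r.t. the inner probabilities p (they act as outer logits).
    g_p = [qi - (1.0 if i == label else 0.0) for i, qi in enumerate(q)]
    # Chain rule through the inner softmax Jacobian: p_i * (g_i - sum_j p_j g_j).
    dot = sum(pi * gi for pi, gi in zip(p, g_p))
    return [pi * (gi - dot) for pi, gi in zip(p, g_p)]


def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))


# A confidently "wrong" sample: the model favors class 0, the (noisy) label is 1.
noisy_single = l2_norm(grad_single([10.0, 0.0, 0.0], 1))
noisy_double = l2_norm(grad_double([10.0, 0.0, 0.0], 1))
# An uncertain, correctly labeled sample still receives a usable gradient.
clean_double = l2_norm(grad_double([1.0, 0.0, 0.0], 0))
```

Under this assumed form, the inner softmax is saturated for the high-error sample, so its Jacobian is close to zero and the gradient is suppressed, while the uncertain sample keeps a non-negligible gradient.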
CV · Mar 29, 2024 · Code
A Parallel Attention Network for Cattle Face Recognition
Jiayu Li, Xuechao Zou, Shiying Wang et al.
Cattle face recognition holds paramount significance in domains such as animal husbandry and behavioral research. Despite significant progress in confined environments, applying these accomplishments in wild settings remains challenging. Thus, we create the first large-scale cattle face recognition dataset, ICRWE, for wild environments. It encompasses 483 cattle and 9,816 high-resolution image samples. Each sample is annotated for face features, light conditions, and face orientation. Furthermore, we introduce a novel parallel attention network, PANet. PANet comprises several cascaded Transformer modules, each incorporating two parallel Position Attention Modules (PAM) and Feature Mapping Modules (FMM). PAM focuses on local and global features at each image position through parallel channel attention, and FMM captures intricate feature patterns through non-linear mappings. Experimental results indicate that PANet achieves a recognition accuracy of 88.03% on the ICRWE dataset, establishing itself as the current state-of-the-art approach. The source code is available in the supplementary materials.
AI · Jan 12
AscendKernelGen: A Systematic Study of LLM-Based Kernel Generation for Neural Processing Units
Xinzi Cao, Jianyang Zhai, Pengfei Li et al.
To meet the ever-increasing demand for computational efficiency, Neural Processing Units (NPUs) have become critical in modern AI infrastructure. However, unlocking their full potential requires developing high-performance compute kernels using vendor-specific Domain-Specific Languages (DSLs), a task that demands deep hardware expertise and is labor-intensive. While Large Language Models (LLMs) have shown promise in general code generation, they struggle with the strict constraints and scarcity of training data in the NPU domain. Our preliminary study reveals that state-of-the-art general-purpose LLMs fail to generate functional complex kernels for Ascend NPUs, yielding a near-zero success rate. To address these challenges, we propose AscendKernelGen, a generation-evaluation integrated framework for NPU kernel development. We introduce Ascend-CoT, a high-quality dataset incorporating chain-of-thought reasoning derived from real-world kernel implementations, and KernelGen-LM, a domain-adaptive model trained via supervised fine-tuning and reinforcement learning with execution feedback. Furthermore, we design NPUKernelBench, a comprehensive benchmark for assessing compilation, correctness, and performance across varying complexity levels. Experimental results demonstrate that our approach significantly bridges the gap between general LLMs and hardware-specific coding. Specifically, the compilation success rate on complex Level-2 kernels improves from 0% to 95.5% (Pass@10), while functional correctness achieves 64.3% compared to the baseline's complete failure. These results highlight the critical role of domain-specific reasoning and rigorous evaluation in automating accelerator-aware code generation.
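For context on the Pass@10 figure, the sketch below implements the unbiased pass@k estimator commonly used in code-generation evaluation: given n sampled generations of which c pass, it estimates the probability that at least one of k samples passes. Whether the authors use this estimator or a direct "any of 10 samples succeeds" count is not stated in the abstract, so treat this as an assumption.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: total samples generated per task, c: samples that pass, k: budget.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples exist, so any k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# With 20 samples and a single passing one, a budget of 10 succeeds half the time.
estimate = pass_at_k(20, 1, 10)
```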
CY · Jan 8
LLM Agents in Law: Taxonomy, Applications, and Challenges
Shuang Liu, Ruijia Zhang, Ruoyun Ma et al.
Large language models (LLMs) have precipitated a dramatic improvement in the legal domain, yet the deployment of standalone models faces significant limitations regarding hallucination, outdated information, and verifiability. Recently, LLM agents have attracted significant attention as a solution to these challenges, utilizing advanced capabilities such as planning, memory, and tool usage to meet the rigorous standards of legal practice. In this paper, we present a comprehensive survey of LLM agents for legal tasks, analyzing how these architectures bridge the gap between technical capabilities and domain-specific needs. Our major contributions include: (1) systematically analyzing the technical transition from standard legal LLMs to legal agents; (2) presenting a structured taxonomy of current agent applications across distinct legal practice areas; (3) discussing evaluation methodologies specifically for agentic performance in law; and (4) identifying open challenges and outlining future directions for developing robust and autonomous legal assistants.
CL · Dec 9, 2024 · Code
AIDE: Attribute-Guided MultI-Hop Data Expansion for Data Scarcity in Task-Specific Fine-tuning
Jiayu Li, Xuan Zhu, Fang Liu et al.
Fine-tuning large language models (LLMs) for specific tasks requires diverse, high-quality training data. However, obtaining sufficient relevant data remains a significant challenge. Existing data synthesis methods either depend on extensive seed datasets or struggle to balance task relevance and data diversity. To address these challenges, we propose Attribute-guided multI-hop Data Expansion (AIDE), a novel data synthesis framework that uses a multi-hop process to expand very few seed data points while ensuring data diversity and task relevance. AIDE extracts the main topic and key knowledge attributes from the seeds to guide the synthesis steps. The process repeats for K hops, using the generated data as seeds. To prevent irrelevant data generation as the hop depth increases, AIDE incorporates a residual connection mechanism. Our empirical results show that AIDE enables fine-tuning of Mistral-7B, Llama-3.1-8B and Llama-3.2-3B from 10 seeds, surpassing the models fine-tuned on human-curated data. Furthermore, AIDE outperforms state-of-the-art data synthesis methods, such as Evol-Instruct, by over 30% in task-specific fine-tuning. Code is available at https://github.com/Code4Graph/AIDE.
DB · Dec 23, 2023 · Code
IRG: Generating Synthetic Relational Databases using Deep Learning with Insightful Relational Understanding
Jiayu Li, Zilong Zhao, Vikram Chundawat et al.
Synthetic data has numerous applications, including but not limited to software testing at scale, privacy-preserving data sharing to enable smoother collaboration between stakeholders, and data augmentation for analytical and machine learning tasks. Relational databases, which are commonly used by corporations, governments, and financial institutions, present unique challenges for synthetic data generation due to their complex structures. Existing synthetic relational database generation approaches often assume idealized scenarios, such as every table having a perfect primary key column without composite and potentially overlapping primary or foreign key constraints, and fail to account for the sequential nature of certain tables. In this paper, we propose the incremental relational generator (IRG), which successfully handles these ubiquitous real-life situations. IRG ensures the preservation of relational schema integrity, offers a deep contextual understanding of relationships beyond direct ancestors and descendants, leverages the power of newly designed deep neural networks, and scales efficiently to handle larger datasets--a combination never achieved in previous works. Experiments on three open-source real-life relational datasets in different fields at different scales demonstrate IRG's advantage in maintaining the synthetic data's relational schema validity and data fidelity and utility.
CR · Jul 26, 2021 · Code
HySec-Flow: Privacy-Preserving Genomic Computing with SGX-based Big-Data Analytics Framework
Chathura Widanage, Weijie Liu, Jiayu Li et al.
Trusted execution environments (TEEs) such as Intel's Software Guard Extensions (SGX) have been widely studied to boost security and privacy protection for the computation of sensitive data such as human genomics. However, SGX often introduces a performance hurdle, especially because of the small enclave memory. In this paper, we propose a new Hybrid Secured Flow framework (called "HySec-Flow") for large-scale genomic data analysis using SGX platforms. Here, the data-intensive computing tasks can be partitioned into independent subtasks to be deployed into distinct secured and non-secured containers, thereby allowing for parallel execution while alleviating the limited size of Enclave Page Cache (EPC) memory in each enclave. We illustrate our contributions using a workflow supporting indexing, alignment, dispatching, and merging the execution of SGX-enabled containers. We provide details regarding the architecture of the trusted and untrusted components and the underlying SCONE and Graphene support as generic shielding execution frameworks to port legacy code. We thoroughly evaluate the performance of our privacy-preserving reads mapping algorithm using real human genome sequencing data. The results demonstrate that the performance is enhanced by partitioning the time-consuming genomic computation into subtasks compared to the conventional execution of the data-intensive reads mapping algorithm in an enclave. The proposed HySec-Flow framework is available as open source and can be adapted to the data-parallel computation of other large-scale genomic tasks requiring security and scalable computational resources.
CVDec 25, 2025
Hierarchy-Aware Fine-Tuning of Vision-Language ModelsJiayu Li, Rajesh Gangireddy, Samet Akcay et al.
Vision-Language Models (VLMs) learn powerful multimodal representations through large-scale image-text pretraining, but adapting them to hierarchical classification is underexplored. Standard approaches treat labels as flat categories and require full fine-tuning, which is expensive and produces inconsistent predictions across taxonomy levels. We propose an efficient hierarchy-aware fine-tuning framework that updates a few parameters while enforcing structural consistency. We combine two objectives: Tree-Path KL Divergence (TP-KL) aligns predictions along the ground-truth label path for vertical coherence, while Hierarchy-Sibling Smoothed Cross-Entropy (HiSCE) encourages consistent predictions among sibling classes. Both losses work in the VLM's shared embedding space and integrate with lightweight LoRA adaptation. Experiments across multiple benchmarks show consistent improvements in Full-Path Accuracy and Tree-based Inconsistency Error with minimal parameter overhead. Our approach provides an efficient strategy for adapting VLMs to structured taxonomies.
IRFeb 12
AttentionRetriever: Attention Layers are Secretly Long Document RetrieversDavid Jiahao Fu, Lam Thanh Do, Jiayu Li et al.
Retrieval augmented generation (RAG) has been widely adopted to help Large Language Models (LLMs) process tasks involving long documents. However, existing retrieval models are not designed for long document retrieval and fail to address several of its key challenges, including context-awareness, causal dependence, and scope of retrieval. In this paper, we propose AttentionRetriever, a novel long document retrieval model that leverages the attention mechanism and entity-based retrieval to build context-aware embeddings for long documents and determine the scope of retrieval. In extensive experiments, we find that AttentionRetriever outperforms existing retrieval models on long document retrieval datasets by a large margin while remaining as efficient as dense retrieval models.
MLOct 18, 2024
High-Dimensional Tensor Discriminant Analysis with Incomplete TensorsElynn Chen, Yuefeng Han, Jiayu Li
Tensor classification is gaining importance across fields, yet handling partially observed data remains challenging. In this paper, we introduce a novel approach to tensor classification with incomplete data, framed within high-dimensional tensor linear discriminant analysis. Specifically, we consider a high-dimensional tensor predictor with missing observations under the Missing Completely at Random (MCR) assumption and employ the Tensor Gaussian Mixture Model (TGMM) to capture the relationship between the tensor predictor and class label. We propose a Tensor Linear Discriminant Analysis with Missing Data (Tensor LDA-MD) algorithm, which manages high-dimensional tensor predictors with missing entries by leveraging the decomposable low-rank structure of the discriminant tensor. Our work establishes convergence rates for the estimation error of the discriminant tensor with incomplete data and minimax optimal bounds for the misclassification rate, addressing key gaps in the literature. Additionally, we derive large deviation bounds for the generalized mode-wise sample covariance matrix and its inverse, which are crucial tools in our analysis and hold independent interest. Our method demonstrates excellent performance in simulations and real data analysis, even with significant proportions of missing data.
APAug 3, 2025
ACT-Tensor: Tensor Completion Framework for Financial Dataset ImputationJunyi Mo, Jiayu Li, Duo Zhang et al.
Missing data in financial panels presents a critical obstacle, undermining asset-pricing models and reducing the effectiveness of investment strategies. Such panels are often inherently multi-dimensional, spanning firms, time, and financial variables, which adds complexity to the imputation task. Conventional imputation methods often fail by flattening the data's multidimensional structure, struggling with heterogeneous missingness patterns, or overfitting in the face of extreme data sparsity. To address these limitations, we introduce an Adaptive, Cluster-based Temporal smoothing tensor completion framework (ACT-Tensor) tailored for severely and heterogeneously missing multi-dimensional financial data panels. ACT-Tensor incorporates two key innovations: a cluster-based completion module that captures cross-sectional heterogeneity by learning group-specific latent structures; and a temporal smoothing module that proactively removes short-lived noise while preserving slow-moving fundamental trends. Extensive experiments show that ACT-Tensor consistently outperforms state-of-the-art benchmarks in terms of imputation accuracy across a range of missing data regimes, including extreme sparsity scenarios. To assess its practical financial utility, we evaluate the imputed data with an asset-pricing pipeline tailored for tensor-structured financial data. Results show that ACT-Tensor not only reduces pricing errors but also significantly improves risk-adjusted returns of the constructed portfolio. These findings confirm that our method delivers highly accurate and informative imputations, offering substantial value for financial decision-making.
AIFeb 26, 2025
Evaluating Intelligence via Trial and ErrorJingtao Zhan, Jiahao Zhao, Jiayu Li et al. · tsinghua
Intelligence is a crucial trait for species to find solutions within a limited number of trial-and-error attempts. Building on this idea, we introduce Survival Game as a framework to evaluate intelligence based on the number of failed attempts in a trial-and-error process. Fewer failures indicate higher intelligence. When the expectation and variance of failure counts are both finite, it signals the ability to consistently find solutions to new challenges, which we define as the Autonomous Level of intelligence. Using Survival Game, we comprehensively evaluate existing AI systems. Our results show that while AI systems achieve the Autonomous Level in simple tasks, they are still far from it in more complex tasks, such as vision, search, recommendation, and language. While scaling current AI technologies might help, this would come at an astronomical cost. Projections suggest that achieving the Autonomous Level for general tasks would require $10^{26}$ parameters. To put this into perspective, loading such a massive model requires so many H100 GPUs that their total value is $10^{7}$ times that of Apple Inc.'s market value. Even with Moore's Law, supporting such a parameter scale would take $70$ years. This staggering cost highlights the complexity of human tasks and the inadequacies of current AI technologies. To further investigate this phenomenon, we conduct a theoretical analysis of Survival Game and its experimental results. Our findings suggest that human tasks possess a criticality property. As a result, Autonomous Level requires a deep understanding of the task's underlying mechanisms. Current AI systems, however, do not fully grasp these mechanisms and instead rely on superficial mimicry, making it difficult for them to reach an autonomous level. We believe Survival Game can not only guide the future development of AI but also offer profound insights into human intelligence.
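To make the "finite expectation and variance of failure counts" condition concrete, here is a minimal sketch under a strong simplifying assumption that is not taken from the paper: each attempt succeeds independently with a fixed probability `p`. In that case the failure count is geometric, with mean (1 - p)/p and variance (1 - p)/p^2, both finite. The function name and the value of `p` are hypothetical.

```python
import numpy as np

def failure_counts(p_success, n_tasks, rng):
    """Number of failed attempts before the first success, one value per task."""
    # rng.geometric counts attempts up to and including the first success,
    # so subtracting 1 leaves only the failures.
    return rng.geometric(p_success, size=n_tasks) - 1

rng = np.random.default_rng(0)
p = 0.2  # assumed per-attempt success probability (illustrative only)
counts = failure_counts(p, 200_000, rng)

theory_mean = (1 - p) / p       # finite mean of the geometric failure count
theory_var = (1 - p) / p ** 2   # finite variance of the geometric failure count
print("sample mean:", counts.mean(), "theory:", theory_mean)
print("sample var: ", counts.var(), "theory:", theory_var)
```

With a heavy-tailed failure distribution the sample mean or variance would not settle as the number of tasks grows, which is the situation the abstract describes for complex tasks.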
LGJan 2, 2025
TabTreeFormer: Tabular Data Generation Using Hybrid Tree-TransformerJiayu Li, Bingyin Zhao, Zilong Zhao et al.
Transformers have shown impressive results in tabular data generation. However, they lack domain-specific inductive biases which are critical for preserving the intrinsic characteristics of tabular data. They also suffer from poor scalability and efficiency due to quadratic computational complexity. In this paper, we propose TabTreeFormer, a hybrid transformer architecture that integrates inductive biases of tree-based models (i.e., non-smoothness and non-rotational invariance) to effectively handle the discrete and weakly correlated features in tabular datasets. To improve numerical fidelity and capture multimodal distributions, we introduce a novel tokenizer that learns token sequences based on the complexity of tabular values. This reduces vocabulary size and sequence length, yielding more compact and efficient representations without sacrificing performance. We evaluate TabTreeFormer on nine diverse datasets, benchmarking against eight generative models. We show that TabTreeFormer consistently outperforms baselines in utility, fidelity, and privacy metrics with competitive efficiency. Notably, in scenarios prioritizing data utility over privacy and efficiency, the best variant of TabTreeFormer delivers a 44% performance gain relative to its baseline variant.
CRMar 28
Safety in Embodied AI: A Survey of Risks, Attacks, and DefensesXiao Li, Xiang Zheng, Yifeng Gao et al.
Embodied Artificial Intelligence (Embodied AI) integrates perception, cognition, planning, and interaction into agents that operate in open-world, safety-critical environments. As these systems gain autonomy and enter domains such as transportation, healthcare, and industrial or assistive robotics, ensuring their safety becomes both technically challenging and socially indispensable. Unlike digital AI systems, embodied agents must act under uncertain sensing, incomplete knowledge, and dynamic human-robot interactions, where failures can directly lead to physical harm. This survey provides a comprehensive and structured review of safety research in embodied AI, examining attacks and defenses across the full embodied pipeline, from perception and cognition to planning, action and interaction, and agentic systems. We introduce a multi-level taxonomy that unifies fragmented lines of work and connects embodied-specific safety findings with broader advances in vision, language, and multimodal foundation models. Our review synthesizes insights from over 400 papers spanning adversarial, backdoor, jailbreak, and hardware-level attacks; attack detection, safe training, and robust inference; and risk-aware human-agent interaction. This analysis reveals several overlooked challenges, including the fragility of multimodal perception fusion, the instability of planning under jailbreak attacks, and the trustworthiness of human-agent interaction in open-ended scenarios. By organizing the field into a coherent framework and identifying critical research gaps, this survey provides a roadmap for building embodied agents that are not only capable and autonomous but also safe, robust, and reliable in real-world deployment.
LGDec 13, 2025
High-Dimensional Tensor Discriminant Analysis: Low-Rank Discriminant Structure, Representation Synergy, and Theoretical GuaranteesElynn Chen, Yuefeng Han, Jiayu Li
High-dimensional tensor-valued predictors arise in modern applications, increasingly as learned representations from neural networks. Existing tensor classification methods rely on sparsity or Tucker structures and often lack theoretical guarantees. Motivated by empirical evidence that discriminative signals concentrate along a few multilinear components, we introduce CP low-rank structure for the discriminant tensor, a modeling perspective not previously explored. Under a Tensor Gaussian Mixture Model, we propose high-dimensional CP low-rank Tensor Discriminant Analysis (CP-TDA) with Randomized Composite PCA (rc-PCA) initialization, which is essential for handling dependent and anisotropic noise under weaker signal strength and incoherence conditions, followed by an iterative refinement algorithm. We establish global convergence and minimax-optimal misclassification rates. To handle tensor data deviating from tensor normality, we develop the first semiparametric tensor discriminant model, in which learned tensor representations are mapped via deep generative models into a latent space tailored for CP-TDA. Misclassification risk decomposes into representation, approximation, and estimation errors. Numerical studies and real data analysis on graph classification demonstrate substantial gains over existing tensor classifiers and state-of-the-art graph neural networks, particularly in high-dimensional, small-sample regimes.
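As a rough illustration of why a CP low-rank discriminant tensor is convenient, the sketch below evaluates a linear discriminant score for a 3-way tensor directly from CP factors, without forming the full discriminant tensor. This is not the CP-TDA estimator or its rc-PCA initialization. It assumes the factors are already known and the class priors are equal, and every name (`cp_tensor`, `lda_score`, the toy sizes) is hypothetical.

```python
import numpy as np

def cp_tensor(weights, factors):
    """Assemble sum_r w_r * (a_r outer b_r outer c_r) as a full 3-way tensor."""
    A, B, C = factors
    return np.einsum("r,ir,jr,kr->ijk", weights, A, B, C)

def lda_score(X, mean0, mean1, weights, factors):
    """Inner product <X - (mean0 + mean1)/2, B>, using only the CP factors of B."""
    A, B, C = factors
    centered = X - 0.5 * (mean0 + mean1)
    return float(np.einsum("ijk,r,ir,jr,kr->", centered, weights, A, B, C))

rng = np.random.default_rng(0)
rank, dims = 2, (4, 5, 6)
weights = np.array([1.0, 0.5])
factors = [rng.standard_normal((d, rank)) for d in dims]

# Toy setup (assumption): identity noise covariance, so the discriminant
# tensor equals the mean difference, which is CP low-rank by construction.
mean0 = np.zeros(dims)
mean1 = cp_tensor(weights, factors)

score = lda_score(mean1, mean0, mean1, weights, factors)
predicted_class = 1 if score > 0 else 0  # equal class priors assumed
```

The score costs one contraction against the factor matrices, so storage for the discriminant grows with the sum of the mode dimensions times the rank rather than with their product.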
CLOct 23, 2025
BoundRL: Efficient Structured Text Segmentation through Reinforced Boundary GenerationHaoyuan Li, Zhengyuan Shen, Sullam Jeoung et al.
As structured texts become increasingly complex across diverse domains -- from technical reports to generative AI prompts -- the need for text segmentation into semantically meaningful components becomes critical. Such texts often contain elements beyond plain language, including tables, code snippets, and placeholders, which conventional sentence- or paragraph-level segmentation methods cannot handle effectively. To address this challenge, we propose BoundRL, a novel and efficient approach that jointly performs token-level text segmentation and label prediction for long structured texts. Instead of generating complete contents for each segment, it generates only a sequence of starting tokens and reconstructs the complete contents by locating these tokens within the original texts, thereby reducing inference costs by orders of magnitude and minimizing hallucination. To adapt the model for the output format, BoundRL performs reinforcement learning with verifiable rewards (RLVR) with a specifically designed reward that jointly optimizes document reconstruction fidelity and semantic alignment. To mitigate entropy collapse, it further constructs intermediate candidates by systematically perturbing a fraction of generated sequences of segments to create stepping stones toward higher-quality solutions. To demonstrate BoundRL's effectiveness on particularly challenging structured texts, we focus evaluation on complex prompts used for LLM applications. Experiments show that BoundRL enables small language models (1.7B parameters) to outperform few-shot prompting of much larger models. Moreover, RLVR with our designed reward yields significant improvements over supervised fine-tuning, and incorporating intermediate candidates further improves both performance and generalization.
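The reconstruction step described above, locating generated starting tokens in the original text and cutting there, can be sketched as follows. This is only an illustration of the idea, not BoundRL's implementation. It works on plain strings rather than model tokens, ignores the label prediction, and the function name and the skip-if-missing policy are assumptions.

```python
def reconstruct_segments(text, starts):
    """Split `text` at the first occurrence of each start marker, searched left to right."""
    positions = []
    cursor = 0
    for marker in starts:
        if not marker:
            continue  # ignore empty markers
        idx = text.find(marker, cursor)
        if idx == -1:
            continue  # marker absent from the source text (assumed policy: skip it)
        positions.append(idx)
        cursor = idx + 1  # later markers must start after this one
    if not positions or positions[0] != 0:
        positions.insert(0, 0)  # make sure the first segment starts at the beginning
    bounds = positions + [len(text)]
    return [text[a:b] for a, b in zip(bounds, bounds[1:])]

prompt = "You are a helper.\nInput: {query}\nReturn JSON only."
segments = reconstruct_segments(prompt, ["You are", "Input:", "Return"])
for seg in segments:
    print(repr(seg))
```

Because every segment is a slice of the original text, concatenating the segments reproduces the input exactly, and the model only has to emit a few marker tokens per segment instead of the full content.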
CLOct 11, 2025
Unpacking Hateful Memes: Presupposed Context and False ClaimsWeibin Cai, Jiayu Li, Reza Zafarani
While memes are often humorous, they are frequently used to disseminate hate, causing serious harm to individuals and society. Current approaches to hateful meme detection mainly rely on pre-trained language models. However, less focus has been dedicated to what makes a meme hateful. Drawing on insights from philosophy and psychology, we argue that hateful memes are characterized by two essential features: a presupposed context and the expression of false claims. To capture presupposed context, we develop PCM for modeling contextual information across modalities. To detect false claims, we introduce the FACT module, which integrates external knowledge and harnesses cross-modal reference graphs. By combining PCM and FACT, we introduce SHIELD, a hateful meme detection framework designed to capture the fundamental nature of hate. Extensive experiments show that SHIELD outperforms state-of-the-art methods across datasets and metrics, while demonstrating versatility on other tasks, such as fake news detection.
IRJul 31, 2025
Are Recommenders Self-Aware? Label-Free Recommendation Performance Estimation via Model UncertaintyJiayu Li, Ziyi Ye, Guohao Jian et al.
Can a recommendation model be self-aware? This paper investigates the recommender's self-awareness by quantifying its uncertainty, which provides a label-free estimation of its performance. Such self-assessment can enable more informed understanding and decision-making before the recommender engages with any users. To this end, we propose an intuitive and effective method, probability-based List Distribution uncertainty (LiDu). LiDu measures uncertainty by determining the probability that a recommender will generate a certain ranking list based on the prediction distributions of individual items. We validate LiDu's ability to represent model self-awareness in two settings: (1) with a matrix factorization model on a synthetic dataset, and (2) with popular recommendation algorithms on real-world datasets. Experimental results show that LiDu is more correlated with recommendation performance than a series of label-free performance estimators. Additionally, LiDu provides valuable insights into the dynamic inner states of models throughout training and inference. This work establishes an empirical connection between recommendation uncertainty and performance, framing it as a step towards more transparent and self-evaluating recommender systems.
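One way to picture a list-level probability of this kind is a Monte Carlo estimate: sample item scores from per-item prediction distributions and count how often the sampled scores sort into a given ranking. The sketch below is not LiDu's actual formula. It assumes independent Gaussian score distributions, and the function name and all numbers are made up for illustration.

```python
import numpy as np

def ranking_probability(means, stds, ranking, n_samples=20_000, seed=0):
    """Monte Carlo estimate of P(sampled scores sort into `ranking`, best item first)."""
    rng = np.random.default_rng(seed)
    means = np.asarray(means, dtype=float)
    stds = np.asarray(stds, dtype=float)
    # One row per sample, one column per item (independence is an assumption).
    scores = rng.normal(means, stds, size=(n_samples, len(means)))
    order = np.argsort(-scores, axis=1)  # item indices by descending sampled score
    matches = np.all(order == np.asarray(ranking), axis=1)
    return float(np.mean(matches))

# Same predicted means, different predictive spread.
confident = ranking_probability([3.0, 2.0, 1.0], [0.1, 0.1, 0.1], [0, 1, 2])
uncertain = ranking_probability([3.0, 2.0, 1.0], [2.0, 2.0, 2.0], [0, 1, 2])
print(confident, uncertain)
```

A model whose item scores have narrow distributions reproduces its own top list almost every time, while wide distributions make the list unstable. That gap is the kind of label-free signal the abstract refers to.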
SYJun 2, 2025
Inverse Design in Distributed Circuits Using Single-Step Reinforcement LearningJiayu Li, Masood Mortazavi, Ning Yan et al.
The goal of inverse design in distributed circuits is to generate near-optimal designs that meet a desirable transfer function specification. Existing design exploration methods use some combination of strategies involving artificial grids, differentiable evaluation procedures, and specific template topologies. However, real-world design practices often require non-differentiable evaluation procedures, varying topologies, and near-continuous placement spaces. In this paper, we propose DCIDA, a design exploration framework that learns a near-optimal design sampling policy for a target transfer function. DCIDA decides all design factors in a compound single-step action by sampling from a set of jointly-trained conditional distributions generated by the policy. Utilizing an injective interdependent "map", DCIDA transforms raw sampled design "actions" into uniquely equivalent physical representations, enabling the framework to learn the conditional dependencies among joint "raw" design decisions. Our experiments demonstrate DCIDA's Transformer-based policy network achieves significant reductions in design error compared to state-of-the-art approaches, with significantly better fit in cases involving more complex transfer functions.
LGOct 4, 2020
Rank Position Forecasting in Car RacingBo Peng, Jiayu Li, Selahattin Akkas et al.
Forecasting is challenging because of the uncertainty introduced by exogenous factors. This work investigates the rank position forecasting problem in car racing, which predicts the rank positions of cars at future laps. Among the many factors that change rank positions, pit stops are critical but irregular and rare. We find that existing methods, including statistical models, machine learning regression models, and state-of-the-art deep forecasting models based on the encoder-decoder architecture, all have limitations in this forecasting task. Based on a detailed analysis of pit stop events, we propose a deep model, RankNet, with a cause-effect decomposition that models the rank position sequence and pit stop events separately. It also incorporates probabilistic forecasting to model the uncertainty inside each sub-model. In extensive experiments, RankNet demonstrates a strong performance improvement over the baselines, e.g., MAE improves by more than 10% consistently, and is also more stable when adapting to unseen new data. Details of model optimization and performance profiling are presented. This work promises to provide useful forecasting tools for car racing analysis and to shed light on solutions to similarly challenging issues in general forecasting problems.
LGDec 31, 2018
ADMM-NN: An Algorithm-Hardware Co-Design Framework of DNNs Using Alternating Direction Method of MultipliersAo Ren, Tianyun Zhang, Shaokai Ye et al.
To facilitate efficient embedded and hardware implementations of deep neural networks (DNNs), two important categories of DNN model compression techniques, weight pruning and weight quantization, are investigated. The former leverages the redundancy in the number of weights, whereas the latter leverages the redundancy in the bit representation of weights. However, a systematic framework for joint weight pruning and quantization of DNNs is lacking, which limits the achievable model compression ratio. Moreover, computation reduction, energy efficiency improvement, and hardware performance overhead need to be accounted for beyond model size reduction alone. To address these limitations, we present ADMM-NN, the first algorithm-hardware co-optimization framework of DNNs using the Alternating Direction Method of Multipliers (ADMM), a powerful technique to deal with non-convex optimization problems with possibly combinatorial constraints. The first part of ADMM-NN is a systematic, joint framework of DNN weight pruning and quantization using ADMM. It can be understood as a smart regularization technique with the regularization target dynamically updated in each ADMM iteration, thereby resulting in higher performance in model compression than prior work. The second part is hardware-aware DNN optimizations to facilitate hardware-level implementations. Without accuracy loss, we can achieve 85$\times$ and 24$\times$ pruning on LeNet-5 and AlexNet models, respectively, significantly higher than prior work. The improvement becomes more significant when focusing on computation reductions. Combining weight pruning and quantization, we achieve 1,910$\times$ and 231$\times$ reductions in overall model size on these two benchmarks, when focusing on data storage. Highly promising results are also observed on other representative DNNs such as VGGNet and ResNet-50.
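The "regularization target dynamically updated in each ADMM iteration" can be illustrated on a toy problem. The sketch below applies ADMM to a least-squares objective with a cardinality constraint (at most `k` nonzero weights). It is not the ADMM-NN training procedure: a real DNN would replace the closed-form weight update with several steps of stochastic gradient descent, and the function names, `rho`, and the toy data are assumptions.

```python
import numpy as np

def project_topk(v, k):
    """Euclidean projection onto vectors with at most k nonzeros: keep the k largest magnitudes."""
    z = np.zeros_like(v)
    keep = np.argsort(np.abs(v))[len(v) - k:]
    z[keep] = v[keep]
    return z

def admm_prune(A, b, k, rho=1.0, iters=100):
    """ADMM for: minimize 0.5 * ||A w - b||^2 subject to w having at most k nonzeros."""
    n = A.shape[1]
    z = np.zeros(n)  # sparse copy of the weights
    u = np.zeros(n)  # scaled dual variable
    lhs = A.T @ A + rho * np.eye(n)
    for _ in range(iters):
        # Weight update: loss plus a quadratic pull toward the moving target (z - u).
        w = np.linalg.solve(lhs, A.T @ b + rho * (z - u))
        # Projection step: enforce the sparsity constraint.
        z = project_topk(w + u, k)
        # Dual update: accumulate the remaining disagreement between w and z.
        u = u + w - z
    return z

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 10))
w_true = np.zeros(10)
w_true[[1, 4, 7]] = [2.0, -3.0, 1.5]
b = A @ w_true
w_pruned = admm_prune(A, b, k=3)
print(np.count_nonzero(w_pruned), "nonzero weights")
```

The quadratic term acts as the regularizer, and its target `z - u` changes at every iteration, which matches the interpretation given in the abstract. Since the constraint is non-convex, this loop comes with no general convergence guarantee.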
NENov 5, 2018
A Unified Framework of DNN Weight Pruning and Weight Clustering/Quantization Using ADMMShaokai Ye, Tianyun Zhang, Kaiqi Zhang et al.
Many model compression techniques for Deep Neural Networks (DNNs) have been investigated, including weight pruning, weight clustering, and quantization. Weight pruning leverages the redundancy in the number of weights in DNNs, while weight clustering/quantization leverages the redundancy in the number of bits used to represent weights. They can be effectively combined in order to exploit the maximum degree of redundancy. However, the literature lacks a systematic investigation in this direction. In this paper, we fill this void and develop a unified, systematic framework of DNN weight pruning and clustering/quantization using the Alternating Direction Method of Multipliers (ADMM), a powerful technique in optimization theory to deal with non-convex optimization problems. Both DNN weight pruning and clustering/quantization, as well as their combinations, can be solved in a unified manner. For further performance improvement in this framework, we adopt multiple techniques, including iterative weight quantization and retraining, joint weight clustering training and centroid updating, and weight clustering retraining. The proposed framework achieves significant improvements both in individual weight pruning and clustering/quantization problems, as well as in their combinations. For weight pruning alone, we achieve 167x weight reduction in LeNet-5, 24.7x in AlexNet, and 23.4x in VGGNet, without any accuracy loss. For the combination of DNN weight pruning and clustering/quantization, we achieve 1,910x and 210x storage reduction of weight data on LeNet-5 and AlexNet, respectively, without accuracy loss. Our codes and models are released at http://bit.ly/2D3F0np
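In a projection-based framework of this kind, pruning and clustering/quantization differ mainly in the set the weights are projected onto. The sketch below shows a quantization-style projection that snaps each weight to the nearest value in a fixed set of levels. The level values and the function name are hypothetical, and the paper's centroid updating and retraining steps are not shown.

```python
import numpy as np

def project_to_levels(weights, levels):
    """Snap every weight to its nearest level; also return the level index per weight."""
    weights = np.asarray(weights, dtype=float)
    levels = np.asarray(levels, dtype=float)
    # Distance from every weight to every level, then pick the closest level.
    idx = np.argmin(np.abs(weights[:, None] - levels[None, :]), axis=1)
    return levels[idx], idx

levels = [-0.5, 0.0, 0.5]  # assumed shared values (centroids)
quantized, codes = project_to_levels([0.1, -0.4, 0.26], levels)
print(quantized, codes)
```

Only the small integer codes and the list of levels need to be stored, which is where the reduction in bits per weight comes from. Mapping a weight to the level 0.0 additionally prunes it.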
LGOct 17, 2018
Progressive Weight Pruning of Deep Neural Networks using ADMMShaokai Ye, Tianyun Zhang, Kaiqi Zhang et al.
Deep neural networks (DNNs), although achieving human-level performance in many domains, have very large model sizes that hinder their broader application on edge computing devices. Extensive research has been conducted on DNN model compression and pruning. However, most previous work took heuristic approaches. This work proposes a progressive weight pruning approach based on ADMM (Alternating Direction Method of Multipliers), a powerful technique to deal with non-convex optimization problems with potentially combinatorial constraints. Motivated by dynamic programming, the proposed method reaches an extremely high pruning rate through a sequence of partial prunings with moderate pruning rates. It thereby resolves the accuracy degradation and long convergence time that arise when pursuing extremely high pruning ratios. It achieves up to a 34 times pruning rate for the ImageNet dataset and a 167 times pruning rate for the MNIST dataset, significantly higher than those reported in the literature. Under the same number of epochs, the proposed method also achieves faster convergence and higher compression rates. The codes and pruned DNN models are released at bit.ly/2zxdlss
DCMar 28, 2018
Structured Weight Matrices-Based Hardware Accelerators in Deep Neural Networks: FPGAs and ASICsCaiwen Ding, Ao Ren, Geng Yuan et al.
Both industry and academia have extensively investigated hardware acceleration. In this work, to address the increasing demands in computational capability and memory requirement, we propose structured weight matrices (SWM)-based compression techniques for both field programmable gate array (FPGA) and application-specific integrated circuit (ASIC) implementations. On the algorithm side, the SWM-based framework adopts block-circulant matrices to achieve a fine-grained tradeoff between accuracy and compression ratio. The SWM-based technique can reduce computational complexity from O($n^2$) to O($n\log n$) and storage complexity from O($n^2$) to O($n$) for each layer, in both the training and inference phases. For FPGA implementations of deep convolutional neural networks (DCNNs), we achieve at least 152X and 72X improvement in performance and energy efficiency, respectively, using the SWM-based framework, compared with the baseline of the IBM TrueNorth processor under the same accuracy constraints on the MNIST, SVHN, and CIFAR-10 data sets. For FPGA implementations of long short-term memory (LSTM) networks, the proposed SWM-based LSTM can achieve up to 21X enhancement in performance and 33.5X gains in energy efficiency compared with the baseline accelerator. For ASIC implementations, the SWM-based ASIC design exhibits impressive advantages in terms of power, throughput, and energy efficiency. Experimental results indicate that this method is well suited for deploying DNNs on both FPGAs and mobile/IoT devices.
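The complexity reduction quoted above rests on a standard identity: multiplying by a circulant matrix is a circular convolution, which the FFT computes in O(n log n) while storing only one column of n values. The sketch below checks this for a single circulant block. It is a NumPy illustration rather than the paper's hardware design, and the function names are hypothetical.

```python
import numpy as np

def circulant_matvec(c, x):
    """Multiply the circulant matrix whose first column is c by x, using FFTs."""
    return np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))

def dense_circulant(c):
    """Build the full n x n circulant matrix with first column c (for checking only)."""
    n = len(c)
    return np.array([[c[(i - j) % n] for j in range(n)] for i in range(n)])

rng = np.random.default_rng(0)
c = rng.standard_normal(8)  # the n stored parameters of one block
x = rng.standard_normal(8)  # input vector

fast = circulant_matvec(c, x)   # O(n log n) time, O(n) storage
slow = dense_circulant(c) @ x   # O(n^2) time and storage
print(np.max(np.abs(fast - slow)))
```

A block-circulant weight matrix applies this product block by block, so the block size becomes the knob that trades compression ratio against accuracy.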
LGMar 14, 2018
On the Universal Approximation Property and Equivalence of Stochastic Computing-based Neural Networks and Binary Neural NetworksYanzhi Wang, Zheng Zhan, Jiayu Li et al.
Large-scale deep neural networks are both memory-intensive and computation-intensive, thereby posing stringent requirements on the computing platforms. Hardware acceleration of deep neural networks has been extensively investigated in both industry and academia. Specific forms of binary neural networks (BNNs) and stochastic computing-based neural networks (SCNNs) are particularly appealing for hardware implementations since they can be implemented almost entirely with binary operations. Despite the obvious advantages in hardware implementation, these approximate computing techniques are questioned by researchers in terms of accuracy and universal applicability. It is also important to understand the relative pros and cons of SCNNs and BNNs in theory and in actual hardware implementations. To address these concerns, in this paper we prove that the "ideal" SCNNs and BNNs satisfy the universal approximation property with probability 1 (due to the stochastic behavior). The proof is conducted by first proving the property for SCNNs from the strong law of large numbers, and then using SCNNs as a "bridge" to prove it for BNNs. Based on the universal approximation property, we further prove that SCNNs and BNNs exhibit the same energy complexity. In other words, they have the same asymptotic energy consumption as the network size grows. We also provide a detailed analysis of the pros and cons of SCNNs and BNNs for hardware implementations and conclude that SCNNs are more suitable for hardware.
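The role of the law of large numbers in stochastic computing can be seen in its simplest operation. In the unipolar encoding, a value p in [0, 1] is represented by a random bit stream whose bits are 1 with probability p, and multiplying two values reduces to a bitwise AND of two independent streams. The sketch below simulates this in software. The function names and stream length are assumptions, and it illustrates only the encoding, not the paper's proof.

```python
import numpy as np

def encode(p, length, rng):
    """Unipolar stochastic bit stream: each bit is 1 with probability p."""
    return rng.random(length) < p

def sc_multiply(a, b, length, rng):
    """Estimate a * b as the fraction of ones in the AND of two independent streams."""
    return float(np.mean(encode(a, length, rng) & encode(b, length, rng)))

rng = np.random.default_rng(0)
for length in (100, 10_000, 1_000_000):
    print(length, sc_multiply(0.6, 0.5, length, rng))

estimate = sc_multiply(0.6, 0.5, 100_000, rng)  # exact product is 0.3
```

Each AND output bit is 1 with probability a * b, so the stream average converges to the product as the stream gets longer. That gives one-gate multipliers at the cost of long bit streams.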