CVAug 15, 2024Code
HAIR: Hypernetworks-based All-in-One Image RestorationJin Cao, Yi Cao, Li Pang et al.
Image restoration aims to recover a high-quality clean image from its degraded version. Recent progress in image restoration has demonstrated the effectiveness of All-in-One image restoration models in addressing various unknown degradations simultaneously. However, these existing methods typically utilize the same parameters to tackle images with different types of degradation, forcing the model to balance the performance between different tasks and limiting its performance on each task. To alleviate this issue, we propose HAIR, a Hypernetworks-based All-in-One Image Restoration plug-and-play method that generates parameters based on the input image and thus makes the model to adapt to specific degradation dynamically. Specifically, HAIR consists of two main components, i.e., Classifier and Hyper Selecting Net (HSN). The Classifier is a simple image classification network used to generate a Global Information Vector (GIV) that contains the degradation information of the input image, and the HSN is a simple fully-connected neural network that receives the GIV and outputs parameters for the corresponding modules. Extensive experiments demonstrate that HAIR can significantly improve the performance of existing image restoration models in a plug-and-play manner, both in single-task and All-in-One settings. Notably, our proposed model Res-HAIR, which integrates HAIR into the well-known Restormer, can obtain superior or comparable performance compared with current state-of-the-art methods. Moreover, we theoretically demonstrate that to achieve a given small enough error, our proposed HAIR requires fewer parameters in contrast to mainstream embedding-based All-in-One methods. The code is available at https://github.com/toummHus/HAIR.
CLAug 7, 2023
PaniniQA: Enhancing Patient Education Through Interactive Question AnsweringPengshan Cai, Zonghai Yao, Fei Liu et al.
Patient portal allows discharged patients to access their personalized discharge instructions in electronic health records (EHRs). However, many patients have difficulty understanding or memorizing their discharge instructions. In this paper, we present PaniniQA, a patient-centric interactive question answering system designed to help patients understand their discharge instructions. PaniniQA first identifies important clinical content from patients' discharge instructions and then formulates patient-specific educational questions. In addition, PaniniQA is also equipped with answer verification functionality to provide timely feedback to correct patients' misunderstandings. Our comprehensive automatic and human evaluation results demonstrate our PaniniQA is capable of improving patients' mastery of their medical instructions through effective interactions
CLApr 9, 2023Code
Continual Graph Convolutional Network for Text ClassificationTiandeng Wu, Qijiong Liu, Yi Cao et al.
Graph convolutional network (GCN) has been successfully applied to capture global non-consecutive and long-distance semantic information for text classification. However, while GCN-based methods have shown promising results in offline evaluations, they commonly follow a seen-token-seen-document paradigm by constructing a fixed document-token graph and cannot make inferences on new documents. It is a challenge to deploy them in online systems to infer steaming text data. In this work, we present a continual GCN model (ContGCN) to generalize inferences from observed documents to unobserved documents. Concretely, we propose a new all-token-any-document paradigm to dynamically update the document-token graph in every batch during both the training and testing phases of an online system. Moreover, we design an occurrence memory module and a self-supervised contrastive learning objective to update ContGCN in a label-free manner. A 3-month A/B test on Huawei public opinion analysis system shows ContGCN achieves 8.86% performance gain compared with state-of-the-art methods. Offline experiments on five public datasets also show ContGCN can improve inference quality. The source code will be released at https://github.com/Jyonn/ContGCN.
IRAug 26, 2022
Extracting Biomedical Factual Knowledge Using Pretrained Language Model and Electronic Health Record ContextZonghai Yao, Yi Cao, Zhichao Yang et al.
Language Models (LMs) have performed well on biomedical natural language processing applications. In this study, we conducted some experiments to use prompt methods to extract knowledge from LMs as new knowledge Bases (LMs as KBs). However, prompting can only be used as a low bound for knowledge extraction, and perform particularly poorly on biomedical domain KBs. In order to make LMs as KBs more in line with the actual application scenarios of the biomedical domain, we specifically add EHR notes as context to the prompt to improve the low bound in the biomedical domain. We design and validate a series of experiments for our Dynamic-Context-BioLAMA task. Our experiments show that the knowledge possessed by those language models can distinguish the correct knowledge from the noise knowledge in the EHR notes, and such distinguishing ability can also be used as a new metric to evaluate the amount of knowledge possessed by the model.
CLNov 18, 2022
Context Variance Evaluation of Pretrained Language Models for Prompt-based Biomedical Knowledge ProbingZonghai Yao, Yi Cao, Zhichao Yang et al.
Pretrained language models (PLMs) have motivated research on what kinds of knowledge these models learn. Fill-in-the-blanks problem (e.g., cloze tests) is a natural approach for gauging such knowledge. BioLAMA generates prompts for biomedical factual knowledge triples and uses the Top-k accuracy metric to evaluate different PLMs' knowledge. However, existing research has shown that such prompt-based knowledge probing methods can only probe a lower bound of knowledge. Many factors like prompt-based probing biases make the LAMA benchmark unreliable and unstable. This problem is more prominent in BioLAMA. The severe long-tailed distribution in vocabulary and large-N-M relation make the performance gap between LAMA and BioLAMA remain notable. To address these, we introduce context variance into the prompt generation and propose a new rank-change-based evaluation metric. Different from the previous known-unknown evaluation criteria, we propose the concept of "Misunderstand" in LAMA for the first time. Through experiments on 12 PLMs, our context variance prompts and Understand-Confuse-Misunderstand (UCM) metric makes BioLAMA more friendly to large-N-M relations and rare relations. We also conducted a set of control experiments to disentangle "understand" from just "read and copy".
CLJan 29Code
CausalEmbed: Auto-Regressive Multi-Vector Generation in Latent Space for Visual Document EmbeddingJiahao Huo, Yu Huang, Yibo Yan et al.
Although Multimodal Large Language Models (MLLMs) have shown remarkable potential in Visual Document Retrieval (VDR) through generating high-quality multi-vector embeddings, the substantial storage overhead caused by representing a page with thousands of visual tokens limits their practicality in real-world applications. To address this challenge, we propose an auto-regressive generation approach, CausalEmbed, for constructing multi-vector embeddings. By incorporating iterative margin loss during contrastive training, CausalEmbed encourages the embedding models to learn compact and well-structured representations. Our method enables efficient VDR tasks using only dozens of visual tokens, achieving a 30-155x reduction in token count while maintaining highly competitive performance across various backbones and benchmarks. Theoretical analysis and empirical results demonstrate the unique advantages of auto-regressive embedding generation in terms of training efficiency and scalability at test time. As a result, CausalEmbed introduces a flexible test-time scaling strategy for multi-vector VDR representations and sheds light on the generative paradigm within multimodal document retrieval. Our code is available at https://github.com/Z1zs/Causal-Embed.
AIApr 25, 2023
Self-Supervised Temporal Analysis of Spatiotemporal DataYi Cao, Swetava Ganguli, Vipul Pandey · apple-ml
There exists a correlation between geospatial activity temporal patterns and type of land use. A novel self-supervised approach is proposed to stratify landscape based on mobility activity time series. First, the time series signal is transformed to the frequency domain and then compressed into task-agnostic temporal embeddings by a contractive autoencoder, which preserves cyclic temporal patterns observed in time series. The pixel-wise embeddings are converted to image-like channels that can be used for task-based, multimodal modeling of downstream geospatial tasks using deep semantic segmentation. Experiments show that temporal embeddings are semantically meaningful representations of time series data and are effective across different tasks such as classifying residential area and commercial areas.
61.8CVApr 11
Visual Late Chunking: An Empirical Study of Contextual Chunking for Efficient Visual Document RetrievalYibo Yan, Mingdong Ou, Yi Cao et al.
Multi-vector models dominate Visual Document Retrieval (VDR) due to their fine-grained matching capabilities, but their high storage and computational costs present a major barrier to practical deployment. In this paper, we propose ColChunk, a plug-and-play framework that introduces multimodal late chunking to construct efficient, contextualized multi-vectors. Unlike existing pruning or fixed-token approaches, ColChunk employs hierarchical clustering on patch-level embeddings, fused with a 2D position prior to ensure spatial-semantic coherence. This adaptive grouping allows for a content-aware representation that preserves global context while drastically reducing the vector count. Evaluations across 24 VDR datasets demonstrate ColChunk achieves over a 90% reduction in storage requirements while simultaneously delivering a 9-point average improvement in nDCG@5 across representative single-vector models. ColChunk provides a practical solution for balancing retrieval accuracy and efficiency in visual document systems.
CLFeb 23
Sculpting the Vector Space: Towards Efficient Multi-Vector Visual Document Retrieval via Prune-then-Merge FrameworkYibo Yan, Mingdong Ou, Yi Cao et al.
Visual Document Retrieval (VDR), which aims to retrieve relevant pages within vast corpora of visually-rich documents, is of significance in current multimodal retrieval applications. The state-of-the-art multi-vector paradigm excels in performance but suffers from prohibitive overhead, a problem that current efficiency methods like pruning and merging address imperfectly, creating a difficult trade-off between compression rate and feature fidelity. To overcome this dilemma, we introduce Prune-then-Merge, a novel two-stage framework that synergizes these complementary approaches. Our method first employs an adaptive pruning stage to filter out low-information patches, creating a refined, high-signal set of embeddings. Subsequently, a hierarchical merging stage compresses this pre-filtered set, effectively summarizing semantic content without the noise-induced feature dilution seen in single-stage methods. Extensive experiments on 29 VDR datasets demonstrate that our framework consistently outperforms existing methods, significantly extending the near-lossless compression range and providing robust performance at high compression ratios.
CVOct 16, 2023
Temporal Embeddings: Scalable Self-Supervised Temporal Representation Learning from Spatiotemporal Data for Multimodal Computer VisionYi Cao, Swetava Ganguli, Vipul Pandey · apple-ml
There exists a correlation between geospatial activity temporal patterns and type of land use. A novel self-supervised approach is proposed to stratify landscape based on mobility activity time series. First, the time series signal is transformed to the frequency domain and then compressed into task-agnostic temporal embeddings by a contractive autoencoder, which preserves cyclic temporal patterns observed in time series. The pixel-wise embeddings are converted to image-like channels that can be used for task-based, multimodal modeling of downstream geospatial tasks using deep semantic segmentation. Experiments show that temporal embeddings are semantically meaningful representations of time series data and are effective across different tasks such as classifying residential area and commercial areas. Temporal embeddings transform sequential, spatiotemporal motion trajectory data into semantically meaningful image-like tensor representations that can be combined (multimodal fusion) with other data modalities that are or can be transformed into image-like tensor representations (for e.g., RBG imagery, graph embeddings of road networks, passively collected imagery like SAR, etc.) to facilitate multimodal learning in geospatial computer vision. Multimodal computer vision is critical for training machine learning models for geospatial feature detection to keep a geospatial mapping service up-to-date in real-time and can significantly improve user experience and above all, user safety.
CLFeb 23
Unlocking Multimodal Document Intelligence: From Current Triumphs to Future Frontiers of Visual Document RetrievalYibo Yan, Jiahao Huo, Guanbo Feng et al.
With the rapid proliferation of multimodal information, Visual Document Retrieval (VDR) has emerged as a critical frontier in bridging the gap between unstructured visually rich data and precise information acquisition. Unlike traditional natural image retrieval, visual documents exhibit unique characteristics defined by dense textual content, intricate layouts, and fine-grained semantic dependencies. This paper presents the first comprehensive survey of the VDR landscape, specifically through the lens of the Multimodal Large Language Model (MLLM) era. We begin by examining the benchmark landscape, and subsequently dive into the methodological evolution, categorizing approaches into three primary aspects: multimodal embedding models, multimodal reranker models, and the integration of Retrieval-Augmented Generation (RAG) and Agentic systems for complex document intelligence. Finally, we identify persistent challenges and outline promising future directions, aiming to provide a clear roadmap for future multimodal document intelligence.
IRJul 2, 2024
ECAT: A Entire space Continual and Adaptive Transfer Learning Framework for Cross-Domain RecommendationChaoqun Hou, Yuanhang Zhou, Yi Cao et al.
In industrial recommendation systems, there are several mini-apps designed to meet the diverse interests and needs of users. The sample space of them is merely a small subset of the entire space, making it challenging to train an efficient model. In recent years, there have been many excellent studies related to cross-domain recommendation aimed at mitigating the problem of data sparsity. However, few of them have simultaneously considered the adaptability of both sample and representation continual transfer setting to the target task. To overcome the above issue, we propose a Entire space Continual and Adaptive Transfer learning framework called ECAT which includes two core components: First, as for sample transfer, we propose a two-stage method that realizes a coarse-to-fine process. Specifically, we perform an initial selection through a graph-guided method, followed by a fine-grained selection using domain adaptation method. Second, we propose an adaptive knowledge distillation method for continually transferring the representations from a model that is well-trained on the entire space dataset. ECAT enables full utilization of the entire space samples and representations under the supervision of the target task, while avoiding negative migration. Comprehensive experiments on real-world industrial datasets from Taobao show that ECAT advances state-of-the-art performance on offline metrics, and brings +13.6% CVR and +8.6% orders for Baiyibutie, a famous mini-app of Taobao.
LGNov 16, 2022
Deep Intention-Aware Network for Click-Through Rate PredictionYaxian Xia, Yi Cao, Sihao Hu et al.
E-commerce platforms provide entrances for customers to enter mini-apps that can meet their specific shopping requirements. Trigger items displayed on entrance icons can attract more entering. However, conventional Click-Through-Rate (CTR) prediction models, which ignore user instant interest in trigger item, fail to be applied to the new recommendation scenario dubbed Trigger-Induced Recommendation in Mini-Apps (TIRA). Moreover, due to the high stickiness of customers to mini-apps, we argue that existing trigger-based methods that over-emphasize the importance of trigger items, are undesired for TIRA, since a large portion of customer entries are because of their routine shopping habits instead of triggers. We identify that the key to TIRA is to extract customers' personalized entering intention and weigh the impact of triggers based on this intention. To achieve this goal, we convert CTR prediction for TIRA into a separate estimation form, and present Deep Intention-Aware Network (DIAN) with three key elements: 1) Intent Net that estimates user's entering intention, i.e., whether he/she is affected by the trigger or by the habits; 2) Trigger-Aware Net and 3) Trigger-Free Net that estimate CTRs given user's intention is to the trigger-item and the mini-app respectively. Following a joint learning way, DIAN can both accurately predict user intention and dynamically balance the results of trigger-free and trigger-based recommendations based on the estimated intention. Experiments show that DIAN advances state-of-the-art performance in a large real-world dataset, and brings a 9.39% lift of online Item Page View and 4.74% CTR for Juhuasuan, a famous mini-app of Taobao.
CLMar 2
Beyond the Grid: Layout-Informed Multi-Vector Retrieval with Parsed Visual Document RepresentationsYibo Yan, Mingdong Ou, Yi Cao et al.
Harnessing the full potential of visually-rich documents requires retrieval systems that understand not just text, but intricate layouts, a core challenge in Visual Document Retrieval (VDR). The prevailing multi-vector architectures, while powerful, face a crucial storage bottleneck that current optimization strategies, such as embedding merging, pruning, or using abstract tokens, fail to resolve without compromising performance or ignoring vital layout cues. To address this, we introduce ColParse, a novel paradigm that leverages a document parsing model to generate a small set of layout-informed sub-image embeddings, which are then fused with a global page-level vector to create a compact and structurally-aware multi-vector representation. Extensive experiments demonstrate that our method reduces storage requirements by over 95% while simultaneously yielding significant performance gains across numerous benchmarks and base models. ColParse thus bridges the critical gap between the fine-grained accuracy of multi-vector retrieval and the practical demands of large-scale deployment, offering a new path towards efficient and interpretable multimodal information systems.
IRDec 15, 2025
Towards Practical Large-scale Dynamical Heterogeneous Graph Embedding: Cold-start Resilient RecommendationMabiao Long, Jiaxi Liu, Yufeng Li et al.
Deploying dynamic heterogeneous graph embeddings in production faces key challenges of scalability, data freshness, and cold-start. This paper introduces a practical, two-stage solution that balances deep graph representation with low-latency incremental updates. Our framework combines HetSGFormer, a scalable graph transformer for static learning, with Incremental Locally Linear Embedding (ILLE), a lightweight, CPU-based algorithm for real-time updates. HetSGFormer captures global structure with linear scalability, while ILLE provides rapid, targeted updates to incorporate new data, thus avoiding costly full retraining. This dual approach is cold-start resilient, leveraging the graph to create meaningful embeddings from sparse data. On billion-scale graphs, A/B tests show HetSGFormer achieved up to a 6.11% lift in Advertiser Value over previous methods, while the ILLE module added another 3.22% lift and improved embedding refresh timeliness by 83.2%. Our work provides a validated framework for deploying dynamic graph learning in production environments.
81.7OCMay 7
Global self-optimizing control of batch processesChenchen Zhou, Hongxin Su, Xinhui Tang et al.
This work considers to achieve near-optimal operation for a class of batch processes by employing self-optimizing control (SOC). Comparing with a continuous one, a batch process exhibits stronger nonlinearity with dynamics because of the non-steady operation condition. This necessitates a global version of SOC to achieve satisfactory performance. Meanwhile, it also makes the existing global SOC (gSOC) not directly applicable to batch processes due to the causality amongst variables. Therefore, it is necessary to extend the original gSOC to batch processes. In addition to the nonconvexity challenge of the original gSOC problem, the new extension for batch processes has to face even more challenges. Particularly, the causality due to dynamics of batch processes brings in structural constraints on controlled variables (CVs), making a CV selection problem even more difficult. To address these challenges, the gSOC problem is recast in a vectorized formulation and it is proved that the structural constraints considered are linear in the vectorized formulation. Moreover, a novel shortcut method is proposed to efficiently find sub-optimal but more transparent solutions for this problem. The effectiveness of the new approach is validated through a case study of a fed-batch reactor, where CVs are constructed through a combination matrix with a repetitive structure, resulting in a simple SOC scheme. This simplicity facilitates the implementation of the SOC approach and enhances its practical applicability and robustness.
20.5OCMay 7
Performance guaranteed MPC Policy Approximation via Cost Guided LearningChenchen Zhou, Yi Cao, Shuang-hua Yang
Model predictive control (MPC) is widely used in industries but implementing it poses challenges due to hardware or time constraints. A promising solution is to approximate the MPC policy using function approximators like neural networks. Existing methods focus on minimizing the error between the approximators outputs and the MPC optimal control actions on training data, which is called error guided learning approach in this paper. However, the goals of control law design is not to minimize the fitting error but to minimize the operation cost. This paper proposes a novel cost-guided learning approach that utilizes the cost sensitivity information from the MPC problem to directly minimize the loss in closed-loop performance. A theoretical analysis shows cost-guided learning provides tighter guarantees on optimality loss compared to traditional error-guided learning. Experiments on a continuous stirred tank reactor (CSTR) benchmark demonstrate that the proposed technique results in approximate MPC policies that achieve substantially better closed-loop performance. This work makes an important contribution by connecting the fitting errors with operational objectives, overcoming key limitations of existing approximation methods. The core idea could be applied more broadly for data-driven control.
48.5OCMay 7
Dynamic Controlled Variables Based Dynamic Self-Optimizing ControlChenchen Zhou, Shaoqi Wang, Hongxin Su et al.
Self-optimizing control is a strategy for selecting controlled variables, where the economic objective guides the selection and design of controlled variables, with the expectation that maintaining the controlled variables at constant values can achieve optimization effects, translating the process optimization problem into a process control problem. Currently, self-optimizing control is widely applied to steady-state optimization problems. However, the development of process systems exhibits a trend towards refinement, highlighting the importance of optimizing dynamic processes such as batch processes and grade transitions. This paper formally introduces the self-optimizing control problem for dynamic optimization, termed the dynamic self-optimizing control problem, extending the original definition of self-optimizing control. A novel concept, "dynamic controlled variables" (DCVs), is proposed, and an implicit control policy is presented based on this concept. The paper theoretically analyzes the advantages and generality of DCVs compared to explicit control strategies and elucidates the relationship between DCVs and traditional controllers. Moreover, this paper puts forth a data-driven approach to designing self-optimizing DCVs, which considers DCV design as a mapping identification problem and employs deep neural networks to parameterize the variables. Three case studies validate the efficacy and superiority of DCVs in approximating multi-valued and discontinuous functions, as well as their application to dynamic optimization problems with non-fixed horizons, which traditional self-optimizing control methods are unable to address.
39.6OCMay 8
Generalized Global Self-Optimizing Control for Chemical Processes: Part II Objective-Guided Controlled Variable Learning ApproachChenchen Zhou, Hongxin Su, Xinhui Tang et al.
Self-optimizing control (SOC) aims to maintain near-optimal process operation by judiciously selecting controlled variables (CVs). In this series of work, the generalized global SOC (g2SOC) approach is proposed, which extends the concept of SOC to the whole operation space and uses general nonlinear functions to design CVs instead of linear combinations. In the first part of this series work, two numerical approaches for g2SOC are proposed: the optimization-based approach and the regression-based approach, based on a theoretical analysis of the existence of perfect self-optimizing CVs. The CVs designed by the former perform better, but are usually infeasible for large-scale problems. In this paper, we propose an algorithm called objective-guided controlled variable learning (OGCVL) that combines the advantages of both and has a better scalability. OGCVL is proposed for efficient CV design that seamlessly integrates symbolic and numerical computation techniques. Finally, the effectiveness of the OGCVL method is verified in two numerical examples. Both examples illustrate show that the OGCVL method is able to achieve good results while maintaining computational efficiency and is also feasible in large-scale problems.
89.3SEMay 1
Can Coding Agents Reproduce Findings in Computational Materials Science?Ziyang Huang, Yi Cao, Ali K. Shargh et al.
Large language models are increasingly deployed as autonomous coding agents and have achieved remarkably strong performance on software engineering benchmarks. However, it is unclear whether such success transfers to computational scientific workflows, where tasks require not only strong coding ability, but also the ability to navigate complex, domain-specific procedures and to interpret results in the context of scientific claims. To address this question, we present AutoMat, a benchmark for evaluating LLM-based agents' ability to reproduce claims from computational materials science. AutoMat poses three interrelated challenges: recovering underspecified computational procedures, navigating specialized toolchains, and determining whether the resulting evidence supports a claim. By working closely with subject matter experts, we curate a set of claims from real materials science papers to test whether coding agents can recover and execute the end-to-end workflow needed to support (or undermine) such claims. We then evaluate multiple representative coding agent settings across several foundation models. Our results show that current LLM-based agents obtain low overall success rates on AutoMat, with the best-performing setting achieving a success rate of only 54.1%. Error analysis further reveals that agents perform worst when workflows must be reconstructed from paper text alone and that they fail primarily due to incomplete procedures, methodological deviations, and execution fragility. Taken together, these findings position AutoMat as both a benchmark for computational scientific reproducibility and a tool for diagnosing the current limitations of agentic systems in AI-for-science settings.
CVJul 2, 2024
SparseSSP: 3D Subcellular Structure Prediction from Sparse-View Transmitted Light ImagesJintu Zheng, Yi Ding, Qizhe Liu et al.
Traditional fluorescence staining is phototoxic to live cells, slow, and expensive; thus, the subcellular structure prediction (SSP) from transmitted light (TL) images is emerging as a label-free, faster, low-cost alternative. However, existing approaches utilize 3D networks for one-to-one voxel level dense prediction, which necessitates a frequent and time-consuming Z-axis imaging process. Moreover, 3D convolutions inevitably lead to significant computation and GPU memory overhead. Therefore, we propose an efficient framework, SparseSSP, predicting fluorescent intensities within the target voxel grid in an efficient paradigm instead of relying entirely on 3D topologies. In particular, SparseSSP makes two pivotal improvements to prior works. First, SparseSSP introduces a one-to-many voxel mapping paradigm, which permits the sparse TL slices to reconstruct the subcellular structure. Secondly, we propose a hybrid dimensions topology, which folds the Z-axis information into channel features, enabling the 2D network layers to tackle SSP under low computational cost. We conduct extensive experiments to validate the effectiveness and advantages of SparseSSP on diverse sparse imaging ratios, and our approach achieves a leading performance compared to pure 3D topologies. SparseSSP reduces imaging frequencies compared to previous dense-view SSP (i.e., the number of imaging is reduced up to 87.5% at most), which is significant in visualizing rapid biological dynamics on low-cost devices and samples.
87.8MTRL-SCIMay 4
From Knowledge to Action: Outcomes of the 2025 Large Language Model (LLM) Hackathon for Applications in Materials Science and ChemistryAritra Roy, Kevin Shen, Andrew MacBride et al.
Large language models (LLMs) are rapidly changing how researchers in materials science and chemistry discover, organize, and act on scientific knowledge. This paper analyzes a broad set of community-developed LLM applications in an effort to identify emerging patterns in how these systems can be used across the scientific research lifecycle. We organize the projects into two complementary categories: Knowledge Infrastructure, systems that structure, retrieve, synthesize, and validate scientific information; and Action Systems, systems that execute, coordinate, or automate scientific work across computational and experimental environments. The submissions reveal a shift from single-purpose LLM tools toward integrated, multi-agent workflows that combine retrieval, reasoning, tool use, and domain-specific validation. Prominent themes include retrieval-augmented generation as grounding infrastructure, persistent structured knowledge representations, multimodal and multilingual scientific inputs, and early progress toward laboratory-integrated closed-loop systems. Together, these results suggest that LLMs are evolving from general-purpose assistants into composable infrastructure for scientific reasoning and action. This work provides a community snapshot of that transition and a practical taxonomy for understanding emerging LLM-enabled workflows in materials science and chemistry.
CVApr 6, 2025
M$^2$IV: Towards Efficient and Fine-grained Multimodal In-Context Learning via Representation EngineeringYanshu Li, Yi Cao, Hongyang He et al.
Multimodal in-context learning (ICL) equips Large Vision-language Models (LVLMs) with the ability to adapt to new tasks via multiple user-provided demonstrations, without requiring any model parameter updates. However, its effectiveness is constrained by the token-intensive nature of multimodal inputs and the complexity of cross-modal few-shot reasoning, which together hinder LVLMs from extracting useful patterns from demonstrations. To address these challenges, we propose \textbf{M$^2$IV}, a novel representation engineering approach that replaces explicit token-level demonstrations with a set of learnable Multimodal In-context Vectors directly injected into the residual streams of LVLMs. By analyzing the distinct roles of multi-head attention (MHA) and multi-layer perceptrons (MLP) in the ICL process, we design a training strategy that enables M$^2$IV to perform fine-grained semantic distillation and robust cross-modal representation learning. M$^2$IV not only improves performance across diverse tasks and LVLMs but also significantly reduces token overhead, enabling graceful scaling to many-shot scenarios. To further enhance usability, we introduce \textbf{VLibrary}, a repository that stores trained M$^2$IVs for flexible retrieval and injection. With VLibrary, users can steer pre-trained LVLMs in a customized manner that meets diverse requirements. Extensive experiments demonstrate that M$^2$IV consistently outperforms vanilla ICL and prior representation engineering baselines, achieving an average accuracy gain of 3.74\% with substantial improvements in overall efficiency.
CVMay 18, 2024
InfRS: Incremental Few-Shot Object Detection in Remote Sensing ImagesWuzhou Li, Jiawei Zhou, Xiang Li et al.
Recently, the field of few-shot detection within remote sensing imagery has witnessed significant advancements. Despite these progresses, the capacity for continuous conceptual learning still poses a significant challenge to existing methodologies. In this paper, we explore the intricate task of incremental few-shot object detection in remote sensing images. We introduce a pioneering fine-tuningbased technique, termed InfRS, designed to facilitate the incremental learning of novel classes using a restricted set of examples, while concurrently preserving the performance on established base classes without the need to revisit previous datasets. Specifically, we pretrain the model using abundant data from base classes and then generate a set of class-wise prototypes that represent the intrinsic characteristics of the data. In the incremental learning stage, we introduce a Hybrid Prototypical Contrastive (HPC) encoding module for learning discriminative representations. Furthermore, we develop a prototypical calibration strategy based on the Wasserstein distance to mitigate the catastrophic forgetting problem. Comprehensive evaluations on the NWPU VHR-10 and DIOR datasets demonstrate that our model can effectively solve the iFSOD problem in remote sensing images. Code will be released.
CVMar 20, 2024
Few-shot Oriented Object Detection with Memorable Contrastive Learning in Remote Sensing ImagesJiawei Zhou, Wuzhou Li, Yi Cao et al.
Few-shot object detection (FSOD) has garnered significant research attention in the field of remote sensing due to its ability to reduce the dependency on large amounts of annotated data. However, two challenges persist in this area: (1) axis-aligned proposals, which can result in misalignment for arbitrarily oriented objects, and (2) the scarcity of annotated data still limits the performance for unseen object categories. To address these issues, we propose a novel FSOD method for remote sensing images called Few-shot Oriented object detection with Memorable Contrastive learning (FOMC). Specifically, we employ oriented bounding boxes instead of traditional horizontal bounding boxes to learn a better feature representation for arbitrary-oriented aerial objects, leading to enhanced detection performance. To the best of our knowledge, we are the first to address oriented object detection in the few-shot setting for remote sensing images. To address the challenging issue of object misclassification, we introduce a supervised contrastive learning module with a dynamically updated memory bank. This module enables the use of large batches of negative samples and enhances the model's capability to learn discriminative features for unseen classes. We conduct comprehensive experiments on the DOTA and HRSC2016 datasets, and our model achieves state-of-the-art performance on the few-shot oriented object detection task. Code and pretrained models will be released.
39.6MLApr 1
Bridging Structured Knowledge and Data: A Unified Framework with Finance ApplicationsYi Cao, Zexun Chen, Lin William Cong et al.
We develop Structured-Knowledge-Informed Neural Networks (SKINNs), a unified estimation framework that embeds theoretical, simulated, previously learned, or cross-domain insights as differentiable constraints within flexible neural function approximation. SKINNs jointly estimate neural network parameters and economically meaningful structural parameters in a single optimization problem, enforcing theoretical consistency not only on observed data but over a broader input domain through collocation, and therefore nesting approaches such as functional GMM, Bayesian updating, transfer learning, PINNs, and surrogate modeling. SKINNs define a class of M-estimators that are consistent and asymptotically normal with root-N convergence, sandwich covariance, and recovery of pseudo-true parameters under misspecification. We establish identification of structural parameters under joint flexibility, derive generalization and target-risk bounds under distributional shift in a convex proxy, and provide a restricted-optimal characterization of the weighting parameter that governs the bias-variance tradeoff. In an illustrative financial application to option pricing, SKINNs improve out-of-sample valuation and hedging performance, particularly at longer horizons and during high-volatility regimes, while recovering economically interpretable structural parameters with improved stability relative to conventional calibration. More broadly, SKINNs provide a general econometric framework for combining model-based reasoning with high-dimensional, data-driven estimation.
NIMar 6
A Dual-AoI-based Approach for Optimal Transmission Scheduling in Wireless Monitoring Systems with Random Data ArrivalsYuchong Zhang, Yi Cao, Xianghui Cao
In Internet of Things (IoTs), the freshness of system status information is crucial for real-time monitoring and decision-making. This paper studies the transmission scheduling problem in wireless monitoring systems, where information freshness -- typically quantified by the Age of Information (AoI) -- is heavily constrained by limited channel resources and influenced by factors such as the randomness of data arrivals and unreliable wireless channel. Such randomness leads to asynchronous AoI evolution at local sensors and the monitoring center, rendering conventional scheduling policies that rely solely on the monitoring center's AoI inefficient. To this end, we propose a dual-AoI model that captures asynchronous AoI dynamics and formulate the problem as minimizing a long-term time-average AoI function. We develop a scheduling policy based on Markov decision process (MDP) to solve the problem, and analyze the existence and monotonicity of a deterministic stationary optimal policy. Moreover, we derive a low-complexity scheduling policy which exhibits a channel-state-dependent threshold structure. In addition, we establish a necessary and sufficient condition for the stability of the AoI objective. Simulation results demonstrate that the proposed policy outperforms existing approaches.
CHEM-PHAug 27, 2025
Migration as a Probe: A Generalizable Benchmark Framework for Specialist vs. Generalist Machine-Learned Force FieldsYi Cao, Paulette Clancy
Machine-learned force fields (MLFFs), especially pre-trained foundation models, are transforming computational materials science by enabling ab initio-level accuracy at molecular dynamics scales. Yet their rapid rise raises a key question: should researchers train specialist models from scratch, fine-tune generalist foundation models, or use hybrid approaches? The trade-offs in data efficiency, accuracy, cost, and robustness to out-of-distribution failure remain unclear. We introduce a benchmarking framework using defect migration pathways, evaluated through nudged elastic band trajectories, as diagnostic probes that test both interpolation and extrapolation. Using Cr-doped Sb2Te3 as a representative two-dimensional material, we benchmark multiple training paradigms within the MACE architecture across equilibrium, kinetic (atomic migration), and mechanical (interlayer sliding) tasks. Fine-tuned models substantially outperform from-scratch and zero-shot approaches for kinetic properties but show partial loss of long-range physics. Representational analysis reveals distinct, non-overlapping latent encodings, indicating that different training strategies learn different aspects of system physics. This framework provides practical guidelines for MLFF development and establishes migration-based probes as efficient diagnostics linking performance to learned representations, guiding future uncertainty-aware active learning.
LGJan 25, 2025
PIP: Perturbation-based Iterative Pruning for Large Language ModelsYi Cao, Wei-Jie Xu, Yucheng Shen et al.
The rapid increase in the parameter counts of Large Language Models (LLMs), which often reach into the billions or even trillions, presents significant challenges for their practical deployment, particularly in resource-constrained environments. To address this issue, we propose PIP (Perturbation-based Iterative Pruning), a novel double-view structured pruning method to optimize LLMs, which combines information from two different views: the unperturbed view and the perturbed view. With the calculation of gradient differences, PIP iteratively prunes those that struggle to distinguish between these two views. Our experiments show that PIP reduces the parameter count by approximately 20% while retaining over 85% of the original model's accuracy across varied benchmarks. In some cases, the performance of the pruned model is within 5% of the unpruned version, demonstrating PIP's ability to preserve key aspects of model effectiveness. Moreover, PIP consistently outperforms existing state-of-the-art (SOTA) structured pruning methods, establishing it as a leading technique for optimizing LLMs in constrained environments.
IRFeb 21, 2022
GIFT: Graph-guIded Feature Transfer for Cold-Start Video Click-Through Rate PredictionSihao Hu, Yi Cao, Yu Gong et al.
Short video has witnessed rapid growth in the past few years in e-commerce platforms like Taobao. To ensure the freshness of the content, platforms need to release a large number of new videos every day, making conventional click-through rate (CTR) prediction methods suffer from the item cold-start problem. In this paper, we propose GIFT, an efficient Graph-guIded Feature Transfer system, to fully take advantages of the rich information of warmed-up videos to compensate for the cold-start ones. Specifically, we establish a heterogeneous graph that contains physical and semantic linkages to guide the feature transfer process from warmed-up video to cold-start videos. The physical linkages represent explicit relationships, while the semantic linkages measure the proximity of multi-modal representations of two videos. We elaborately design the feature transfer function to make aware of different types of transferred features (e.g., id representations and historical statistics) from different metapaths on the graph. We conduct extensive experiments on a large real-world dataset, and the results show that our GIFT system outperforms SOTA methods significantly and brings a 6.82% lift on CTR in the homepage of Taobao App.
CRJun 25, 2020
A framework of blockchain-based secure and privacy-preserving E-government systemNoe Elisa, Longzhi Yang, Fei Chao et al.
Electronic government (e-government) uses information and communication technologies to deliver public services to individuals and organisations effectively, efficiently and transparently. E-government is one of the most complex systems which needs to be distributed, secured and privacy-preserved, and the failure of these can be very costly both economically and socially. Most of the existing e-government systems such as websites and electronic identity management systems (eIDs) are centralized at duplicated servers and databases. A centralized management and validation system may suffer from a single point of failure and make the system a target to cyber attacks such as malware, denial of service attacks (DoS), and distributed denial of service attacks (DDoS). The blockchain technology enables the implementation of highly secure and privacy-preserving decentralized systems where transactions are not under the control of any third party organizations. Using the blockchain technology, exiting data and new data are stored in a sealed compartment of blocks (i.e., ledger) distributed across the network in a verifiable and immutable way. Information security and privacy are enhanced by the blockchain technology in which data are encrypted and distributed across the entire network. This paper proposes a framework of a decentralized e-government peer-to-peer (p2p) system using the blockchain technology, which can ensure both information security and privacy while simultaneously increasing the trust of the public sectors. In addition, a prototype of the proposed system is presented, with the support of a theoretical and qualitative analysis of the security and privacy implications of such system.
PMApr 20, 2020
The new methods for equity fund selection and optimal portfolio constructionYi Cao
We relook at the classic equity fund selection and portfolio construction problems from a new perspective and propose an easy-to-implement framework to tackle the problem in practical investment. Rather than the conventional way by constructing a long only portfolio from a big universe of stocks or macro factors, we show how to produce a long-short portfolio from a smaller pool of stocks from mutual fund top holdings and generate impressive results. As these methods are based on statistical evidence, we need closely monitoring the model validity, and prepare repair strategies.
IRAug 15, 2018
Neural Collaborative RankingBo Song, Xin Yang, Yi Cao et al.
Recommender systems are aimed at generating a personalized ranked list of items that an end user might be interested in. With the unprecedented success of deep learning in computer vision and speech recognition, recently it has been a hot topic to bridge the gap between recommender systems and deep neural network. And deep learning methods have been shown to achieve state-of-the-art on many recommendation tasks. For example, a recent model, NeuMF, first projects users and items into some shared low-dimensional latent feature space, and then employs neural nets to model the interaction between the user and item latent features to obtain state-of-the-art performance on the recommendation tasks. NeuMF assumes that the non-interacted items are inherent negative and uses negative sampling to relax this assumption. In this paper, we examine an alternative approach which does not assume that the non-interacted items are necessarily negative, just that they are less preferred than interacted items. Specifically, we develop a new classification strategy based on the widely used pairwise ranking assumption. We combine our classification strategy with the recently proposed neural collaborative filtering framework, and propose a general collaborative ranking framework called Neural Network based Collaborative Ranking (NCR). We resort to a neural network architecture to model a user's pairwise preference between items, with the belief that neural network will effectively capture the latent structure of latent factors. The experimental results on two real-world datasets show the superior performance of our models in comparison with several state-of-the-art approaches.