CV | Feb 22, 2023 | Code
Human MotionFormer: Transferring Human Motions with Vision Transformers
Hongyu Liu, Xintong Han, Chengbin Jin et al.
Human motion transfer aims to transfer motions from a target dynamic person to a source static one for motion synthesis. An accurate matching between the source person and the target motion in both large and subtle motion changes is vital for improving the transferred motion quality. In this paper, we propose Human MotionFormer, a hierarchical ViT framework that leverages global and local perceptions to capture large and subtle motion matching, respectively. It consists of two ViT encoders to extract input features (i.e., a target motion image and a source human image) and a ViT decoder with several cascaded blocks for feature matching and motion transfer. In each block, we set the target motion feature as Query and the source person as Key and Value, calculating the cross-attention maps to conduct a global feature matching. Further, we introduce a convolutional layer to improve the local perception after the global cross-attention computations. This matching process is implemented in both warping and generation branches to guide the motion transfer. During training, we propose a mutual learning loss to enable the co-supervision between warping and generation branches for better motion representations. Experiments show that our Human MotionFormer sets a new state of the art both qualitatively and quantitatively. Project page: https://github.com/KumapowerLIU/Human-MotionFormer
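The cross-attention step described in this abstract (target motion features as Query, source person features as Key and Value) can be illustrated with a minimal sketch of standard scaled dot-product attention. This is not the authors' implementation: the function names, the toy feature values, and the absence of learned projections and of the convolutional layer are all simplifying assumptions.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention on plain Python lists.

    queries: target-motion tokens, shape (n_q, d)
    keys, values: source-person tokens, shapes (n_k, d) and (n_k, d_v)
    Returns (outputs, weights), where weights[i][j] is how strongly
    query token i attends to source token j.
    """
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    weights = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights.append(softmax(scores))
    outputs = [
        [sum(w * v[j] for w, v in zip(row, values)) for j in range(len(values[0]))]
        for row in weights
    ]
    return outputs, weights

# Toy features (hypothetical values, not taken from the paper).
target_motion = [[1.0, 0.0], [0.0, 1.0]]
source_person = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = cross_attention(target_motion, source_person, source_person)
```

Each row of `weights` is a probability distribution over source tokens, which is what makes the matching global: every target-motion token can draw on every source-person token.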
CV | Sep 20, 2023
Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning
Chen Jiang, Hong Liu, Xuzheng Yu et al.
In recent years, the explosion of web videos makes text-video retrieval increasingly essential and popular for video filtering, recommendation, and search. Text-video retrieval aims to rank relevant text/video higher than irrelevant ones. The core of this task is to precisely measure the cross-modal similarity between texts and videos. Recently, contrastive learning methods have shown promising results for text-video retrieval, most of which focus on the construction of positive and negative pairs to learn text and video representations. Nevertheless, they do not pay enough attention to hard negative pairs and lack the ability to model different levels of semantic similarity. To address these two issues, this paper improves contrastive learning using two novel techniques. First, to exploit hard examples for robust discriminative power, we propose a novel Dual-Modal Attention-Enhanced Module (DMAE) to mine hard negative pairs from textual and visual clues. By further introducing a Negative-aware InfoNCE (NegNCE) loss, we are able to adaptively identify all these hard negatives and explicitly highlight their impacts in the training loss. Second, our work argues that triplet samples can better model fine-grained semantic similarity compared to pairwise samples. We thereby present a new Triplet Partial Margin Contrastive Learning (TPM-CL) module to construct partial order triplet samples by automatically generating fine-grained hard negatives for matched text-video pairs. The proposed TPM-CL designs an adaptive token masking strategy with cross-modal interaction to model subtle semantic differences. Extensive experiments demonstrate that the proposed approach outperforms existing methods on four widely-used text-video retrieval datasets, including MSR-VTT, MSVD, DiDeMo and ActivityNet.
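For orientation, the contrastive objective that this line of work builds on can be sketched as the standard InfoNCE loss over a text-video similarity matrix. This sketch is the plain baseline loss, not the NegNCE or TPM-CL variants proposed in the paper; the function name and the temperature value are assumptions chosen for illustration.

```python
import math

def info_nce(similarity, temperature=0.05):
    """Text-to-video InfoNCE over a square similarity matrix.

    similarity[i][j] is the similarity between text i and video j;
    matched pairs are assumed to sit on the diagonal.
    Returns the mean negative log-likelihood of the matched video.
    """
    n = len(similarity)
    total = 0.0
    for i, row in enumerate(similarity):
        logits = [s / temperature for s in row]
        peak = max(logits)
        # log-sum-exp computed stably
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / n

# Hypothetical similarity matrices for three text-video pairs.
uniform = [[0.5, 0.5, 0.5] for _ in range(3)]
aligned = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
uniform_loss = info_nce(uniform)
aligned_loss = info_nce(aligned)
```

When every candidate looks equally similar, the loss equals the log of the number of candidates; it approaches zero as matched pairs pull away from the negatives. Hard negatives are the off-diagonal entries whose similarity stays close to the diagonal one.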
CL | May 15, 2022
Discovering Latent Concepts Learned in BERT
Fahim Dalvi, Abdul Rafae Khan, Firoj Alam et al.
A large number of studies that analyze deep neural network models and their ability to encode various linguistic and non-linguistic concepts provide an interpretation of the inner mechanics of these models. The scope of the analyses is limited to pre-defined concepts that reinforce the traditional linguistic knowledge and do not reflect on how novel concepts are learned by the model. We address this limitation by discovering and analyzing latent concepts learned in neural network models in an unsupervised fashion and provide interpretations from the model's perspective. In this work, we study: i) what latent concepts exist in the pre-trained BERT model, ii) how the discovered latent concepts align or diverge from classical linguistic hierarchy and iii) how the latent concepts evolve across layers. Our findings show: i) a model learns novel concepts (e.g. animal categories and demographic groups), which do not strictly adhere to any pre-defined categorization (e.g. POS, semantic tags), ii) several latent concepts are based on multiple properties which may include semantics, syntax, and morphology, iii) the lower layers in the model dominate in learning shallow lexical concepts while the higher layers learn semantic relations and iv) the discovered latent concepts highlight potential biases learned in the model. We also release a novel BERT ConceptNet dataset (BCN) consisting of 174 concept labels and 1M annotated instances.
CL | Jun 27, 2022
Analyzing Encoded Concepts in Transformer Language Models
Hassan Sajjad, Nadir Durrani, Fahim Dalvi et al.
We propose a novel framework, ConceptX, to analyze how latent concepts are encoded in representations learned within pre-trained language models. It uses clustering to discover the encoded concepts and explains them by aligning with a large set of human-defined concepts. Our analysis of seven transformer language models reveals interesting insights: i) the latent space within the learned representations overlaps with different linguistic concepts to a varying degree, ii) the lower layers in the model are dominated by lexical concepts (e.g., affixation), whereas the core-linguistic concepts (e.g., morphological or syntactic relations) are better represented in the middle and higher layers, iii) some encoded concepts are multi-faceted and cannot be adequately explained using the existing human-defined concepts.
CR | Feb 22, 2023
PAD: Towards Principled Adversarial Malware Detection Against Evasion Attacks
Deqiang Li, Shicheng Cui, Yun Li et al.
Machine Learning (ML) techniques can facilitate the automation of malicious software (malware for short) detection, but suffer from evasion attacks. Many studies counter such attacks in heuristic manners, lacking theoretical guarantees and defense effectiveness. In this paper, we propose a new adversarial training framework, termed Principled Adversarial Malware Detection (PAD), which offers convergence guarantees for robust optimization methods. PAD lays on a learnable convex measurement that quantifies distribution-wise discrete perturbations to protect malware detectors from adversaries, whereby for smooth detectors, adversarial training can be performed with theoretical treatments. To promote defense effectiveness, we propose a new mixture of attacks to instantiate PAD to enhance deep neural network-based measurements and malware detectors. Experimental results on two Android malware datasets demonstrate: (i) the proposed method significantly outperforms the state-of-the-art defenses; (ii) it can harden ML-based malware detection against 27 evasion attacks with detection accuracies greater than 83.45%, at the price of suffering an accuracy decrease smaller than 2.16% in the absence of attacks; (iii) it matches or outperforms many anti-malware scanners in VirusTotal against realistic adversarial malware.
90.2 | CV | Mar 19 | Code
LVOmniBench: Pioneering Long Audio-Video Understanding Evaluation for Omnimodal LLMs
Keda Tao, Yuhua Zheng, Jia Xu et al.
Recent advancements in omnimodal large language models (OmniLLMs) have significantly improved the comprehension of audio and video inputs. However, current evaluations primarily focus on short audio and video clips ranging from 10 seconds to 5 minutes, failing to reflect the demands of real-world applications, where videos typically run for tens of minutes. To address this critical gap, we introduce LVOmniBench, a new benchmark designed specifically for the cross-modal comprehension of long-form audio and video. This dataset comprises high-quality videos sourced from open platforms that feature rich audio-visual dynamics. Through rigorous manual selection and annotation, LVOmniBench comprises 275 videos, ranging in duration from 10 to 90 minutes, and 1,014 question-answer (QA) pairs. LVOmniBench aims to rigorously evaluate the capabilities of OmniLLMs across domains, including long-term memory, temporal localization, fine-grained understanding, and multimodal perception. Our extensive evaluation reveals that current OmniLLMs encounter significant challenges when processing extended audio-visual inputs. Open-source models generally achieve accuracies below 35%, whereas the Gemini 3 Pro reaches a peak accuracy of approximately 65%. We anticipate that this dataset, along with our empirical findings, will stimulate further research and the development of advanced models capable of resolving complex cross-modal understanding problems within long-form audio-visual contexts.
AI | Jan 13 | Code
The Agent's First Day: Benchmarking Learning, Exploration, and Scheduling in the Workplace Scenarios
Daocheng Fu, Jianbiao Mei, Rong Wu et al.
The rapid evolution of Multi-modal Large Language Models (MLLMs) has advanced workflow automation; however, existing research mainly targets performance upper bounds in static environments, overlooking robustness for stochastic real-world deployment. We identify three key challenges: dynamic task scheduling, active exploration under uncertainty, and continuous learning from experience. To bridge this gap, we introduce \method{}, a dynamic evaluation environment that simulates a "trainee" agent continuously exploring a novel setting. Unlike traditional benchmarks, \method{} evaluates agents along three dimensions: (1) context-aware scheduling for streaming tasks with varying priorities; (2) prudent information acquisition to reduce hallucination via active exploration; and (3) continuous evolution by distilling generalized strategies from rule-based, dynamically generated tasks. Experiments show that cutting-edge agents have significant deficiencies in dynamic environments, especially in active exploration and continual learning. Our work establishes a framework for assessing agent reliability, shifting evaluation from static tests to realistic, production-oriented scenarios. Our codes are available at https://github.com/KnowledgeXLab/EvoEnv
SY | Jul 25, 2023
Towards Integrated Traffic Control with Operating Decentralized Autonomous Organization
Shengyue Yao, Jingru Yu, Yi Yu et al.
With the growing complexity of the intelligent traffic system (ITS), an integrated control of ITS that can account for many heterogeneous intelligent agents is desired. However, existing control methods based on centralized or decentralized schemes have not demonstrated that they can achieve optimality and scalability simultaneously. To address this issue, we propose an integrated control method based on the framework of Decentralized Autonomous Organization (DAO). The proposed method achieves a global consensus on energy consumption efficiency (ECE) while optimizing the local objectives of all involved intelligent agents through a consensus and incentive mechanism. Furthermore, an operation algorithm is proposed to address the issue of structural rigidity in DAO. Specifically, the proposed operation approach identifies critical agents to execute the smart contract in DAO, which ultimately extends the capability of DAO-based control. In addition, a numerical experiment is designed to examine the performance of the proposed method. The experimental results indicate that, compared to existing decentralized control methods, the controlled agents reach a consensus on the global objective faster and with improved local objectives under the proposed method. In general, the proposed method shows great potential for developing an integrated control system in the ITS.
CL | Nov 12, 2022
ConceptX: A Framework for Latent Concept Analysis
Firoj Alam, Fahim Dalvi, Nadir Durrani et al.
The opacity of deep neural networks remains a challenge in deploying solutions where explanation is as important as precision. We present ConceptX, a human-in-the-loop framework for interpreting and annotating latent representational space in pre-trained Language Models (pLMs). We use an unsupervised method to discover concepts learned in these models and enable a graphical interface for humans to generate explanations for the concepts. To facilitate the process, we provide auto-annotations of the concepts (based on traditional linguistic ontologies). Such annotations enable development of a linguistic resource that directly represents latent concepts learned within deep NLP models. These include not just traditional linguistic concepts, but also task-specific or sensitive concepts (words grouped based on gender or religious connotation) that helps the annotators to mark bias in the model. The framework consists of two parts (i) concept discovery and (ii) annotation platform.
IT | Jun 18, 2011
Anti-measurement Matrix Uncertainty Sparse Signal Recovery for Compressive Sensing
Yipeng Liu, Qun Wan, Fei Wen et al.
Compressive sensing (CS) is a technique for estimating a sparse signal from random measurements and the measurement matrix. Traditional sparse signal recovery methods degrade seriously under measurement matrix uncertainty (MMU). Here the MMU is modeled as a bounded additive error. An anti-uncertainty constraint in the form of a mixed L2 and L1 norm is deduced from the sparse signal model with MMU. We then combine the sparse constraint with the anti-uncertainty constraint to obtain an anti-uncertainty sparse signal recovery operator. Numerical simulations demonstrate that the proposed operator achieves better reconstruction performance under MMU than traditional methods.
64.0 | CL | May 11 | Code
Multi-domain Multi-modal Document Classification Benchmark with a Multi-level Taxonomy
Denghao Ma, Qing Liu, Zulong Chen et al.
Document classification forms the backbone of modern enterprise content management, yet existing benchmarks remain trapped in oversimplified paradigms -- single domain settings with flat label structures -- that bear little resemblance to the hierarchical, multi-modal, and cross-domain nature of real-world business documents. This gap not only misrepresents practical complexity but also stifles progress toward industrially viable document intelligence. To bridge this gap, we construct the first Multi-level, Multi-domain, Multi-modal document classification Benchmark (MMM-Bench). MMM-Bench includes (1) a deeply hierarchical taxonomy spanning five levels that capture the authentic organizational logic of business documentation; and (2) 5,990 real-world multi-modal documents meticulously curated from 12 commercial domains in Alibaba. Each document is manually annotated with a complete hierarchical path by domain experts. We establish comprehensive baselines on MMM-Bench, which consists of open-weight models and API-based models. Through systematic experiments, we identify four fundamental challenges within MMM-Bench and propose corresponding insights. To provide a solid foundation for advancing research in multi-level, multi-domain document classification, we release all of the data and the evaluation toolkit of MMM-Bench at https://github.com/MMMDC-Bench/MMMDC-Bench.
IT | Jun 18, 2011
Sparse Support Recovery with Phase-Only Measurements
Yipeng Liu, Qun Wan, Fei Wen et al.
Sparse support recovery (SSR) is an important part of compressive sensing (CS). Most current SSR methods assume full-information measurements. In practice, however, the amplitude part of the measurements may be seriously corrupted. The corrupted measurements mismatch current SSR algorithms, which leads to serious performance degradation. This paper considers the problem of SSR with only phase information. In the proposed method, minimizing the l1 norm of the estimated sparse signal enforces sparsity, while a nonzero constraint on the uncorrupted random measurements' amplitudes with respect to the reconstructed sparse signal is introduced. Because the constraint requires only the phase components of the measurements, it avoids the performance deterioration caused by corrupted amplitude components. Simulations demonstrate that the proposed phase-only SSR is superior in support reconstruction accuracy when the amplitude components of the measurements are contaminated.
LG | Sep 27, 2023
Fair Canonical Correlation Analysis
Zhuoping Zhou, Davoud Ataee Tarzanagh, Bojian Hou et al.
This paper investigates fairness and bias in Canonical Correlation Analysis (CCA), a widely used statistical technique for examining the relationship between two sets of variables. We present a framework that alleviates unfairness by minimizing the correlation disparity error associated with protected attributes. Our approach enables CCA to learn global projection matrices from all data points while ensuring that these matrices yield comparable correlation levels to group-specific projection matrices. Experimental evaluation on both synthetic and real-world datasets demonstrates the efficacy of our method in reducing correlation disparity error without compromising CCA accuracy.
CL | Oct 21, 2022
SIT at MixMT 2022: Fluent Translation Built on Giant Pre-trained Models
Abdul Rafae Khan, Hrishikesh Kanade, Girish Amar Budhrani et al.
This paper describes the Stevens Institute of Technology's submission for the WMT 2022 Shared Task: Code-mixed Machine Translation (MixMT). The task consisted of two subtasks: subtask 1, Hindi/English to Hinglish, and subtask 2, Hinglish to English translation. Our findings lie in the improvements made through the use of large pre-trained multilingual NMT models and in-domain datasets, as well as back-translation and ensemble techniques. The translation output is automatically evaluated against the reference translations using ROUGE-L and WER. Our system achieves 1st position on subtask 2 according to ROUGE-L, WER, and human evaluation, 1st position on subtask 1 according to WER and human evaluation, and 3rd position on subtask 1 with respect to the ROUGE-L metric.
IR | Aug 10, 2023
Beyond Semantics: Learning a Behavior Augmented Relevance Model with Self-supervised Learning
Zeyuan Chen, Wei Chen, Jia Xu et al.
Relevance modeling aims to locate desirable items for corresponding queries, which is crucial for search engines to ensure user experience. Although most conventional approaches address this problem by assessing the semantic similarity between the query and item, pure semantic matching is not everything. In reality, auxiliary query-item interactions extracted from user historical behavior data of the search log could provide hints to reveal users' search intents further. Drawing inspiration from this, we devise a novel Behavior Augmented Relevance Learning model for Alipay Search (BARL-ASe) that leverages neighbor queries of target item and neighbor items of target query to complement target query-item semantic matching. Specifically, our model builds multi-level co-attention for distilling coarse-grained and fine-grained semantic representations from both neighbor and target views. The model subsequently employs neighbor-target self-supervised learning to improve the accuracy and robustness of BARL-ASe by strengthening representation and logit learning. Furthermore, we discuss how to deal with the long-tail query-item matching of the mini apps search scenario of Alipay practically. Experiments on real-world industry data and online A/B testing demonstrate our proposal achieves promising performance with low latency.
IR | Aug 18, 2024
Towards Boosting LLMs-driven Relevance Modeling with Progressive Retrieved Behavior-augmented Prompting
Zeyuan Chen, Haiyan Wu, Kaixin Wu et al.
Relevance modeling is a critical component for enhancing user experience in search engines, with the primary objective of identifying items that align with users' queries. Traditional models only rely on the semantic congruence between queries and items to ascertain relevance. However, this approach represents merely one aspect of the relevance judgement, and is insufficient in isolation. Even powerful Large Language Models (LLMs) still cannot accurately judge the relevance of a query and an item from a semantic perspective. To augment LLMs-driven relevance modeling, this study proposes leveraging user interactions recorded in search logs to yield insights into users' implicit search intentions. The challenge lies in the effective prompting of LLMs to capture dynamic search intentions, which poses several obstacles in real-world relevance scenarios, i.e., the absence of domain-specific knowledge, the inadequacy of an isolated prompt, and the prohibitive costs associated with deploying LLMs. In response, we propose ProRBP, a novel Progressive Retrieved Behavior-augmented Prompting framework for integrating search scenario-oriented knowledge with LLMs effectively. Specifically, we perform the user-driven behavior neighbors retrieval from the daily search logs to obtain domain-specific knowledge in time, retrieving candidates that users consider to meet their expectations. Then, we guide LLMs for relevance modeling by employing advanced prompting techniques that progressively improve the outputs of the LLMs, followed by a progressive aggregation with comprehensive consideration of diverse aspects. For online serving, we have developed an industrial application framework tailored for the deployment of LLMs in relevance modeling. Experiments on real-world industry data and online A/B testing demonstrate our proposal achieves promising performance.
LG | Mar 13, 2025 | Code
MentalChat16K: A Benchmark Dataset for Conversational Mental Health Assistance
Jia Xu, Tianyi Wei, Bojian Hou et al.
We introduce MentalChat16K, an English benchmark dataset combining a synthetic mental health counseling dataset and a dataset of anonymized transcripts from interventions between Behavioral Health Coaches and Caregivers of patients in palliative or hospice care. Covering a diverse range of conditions like depression, anxiety, and grief, this curated dataset is designed to facilitate the development and evaluation of large language models for conversational mental health assistance. By providing a high-quality resource tailored to this critical domain, MentalChat16K aims to advance research on empathetic, personalized AI solutions to improve access to mental health support services. The dataset prioritizes patient privacy, ethical considerations, and responsible data usage. MentalChat16K presents a valuable opportunity for the research community to innovate AI technologies that can positively impact mental well-being. The dataset is available at https://huggingface.co/datasets/ShenLab/MentalChat16K and the code and documentation are hosted on GitHub at https://github.com/ChiaPatricia/MentalChat16K.
20.7 | AI | Apr 13
Intelligent Approval of Access Control Flow in Office Automation Systems via Relational Modeling
Dugang Liu, Zulong Chen, Chuanfei Xu et al.
Office automation (OA) systems play a crucial role in enterprise operations and management, with access control flow approval (ACFA) being a key component that manages the accessibility of various resources. However, traditional ACFA requires approval from the person in charge at each step, which consumes a significant amount of manpower and time. Its intelligence is a crucial issue that needs to be addressed urgently by all companies. In this paper, we propose a novel relational modeling-driven intelligent approval (RMIA) framework to automate ACFA. Specifically, our RMIA consists of two core modules: (1) The binary relation modeling module aims to characterize the coupling relation between applicants and approvers and provide reliable basic information for ACFA decision-making from a coarse-grained perspective. (2) The ternary relation modeling module utilizes specific resource information as its core, characterizing the complex relations between applicants, resources, and approvers, and thus provides fine-grained gain information for informed decision-making. Then, our RMIA effectively fuses these two kinds of information to form the final decision. Finally, extensive experiments are conducted on two product datasets and an online A/B test to verify the effectiveness of RMIA.
LG | Aug 4, 2022
Transferable Multi-Agent Reinforcement Learning with Dynamic Participating Agents
Xuting Tang, Jia Xu, Shusen Wang
We study multi-agent reinforcement learning (MARL) with centralized training and decentralized execution. During the training, new agents may join, and existing agents may unexpectedly leave the training. In such situations, a standard deep MARL model must be trained again from scratch, which is very time-consuming. To tackle this problem, we propose a special network architecture with a few-shot learning algorithm that allows the number of agents to vary during centralized training. In particular, when a new agent joins the centralized training, our few-shot learning algorithm trains its policy network and value network using a small number of samples; when an agent leaves the training, the training process of the remaining agents is not affected. Our experiments show that using the proposed network architecture and algorithm, model adaptation when new agents join can be 100+ times faster than the baseline. Our work is applicable to any setting, including cooperative, competitive, and mixed.
CR | Dec 13, 2025 | Code
UniMark: Artificial Intelligence Generated Content Identification Toolkit
Meilin Li, Ji He, Yi Yu et al.
The rapid proliferation of Artificial Intelligence Generated Content has precipitated a crisis of trust and urgent regulatory demands. However, existing identification tools suffer from fragmentation and a lack of support for visible compliance marking. To address these gaps, we introduce UniMark, an open-source, unified framework for multimodal content governance. Our system features a modular unified engine that abstracts complexities across text, image, audio, and video modalities. Crucially, we propose a novel dual-operation strategy, natively supporting both Hidden Watermarking for copyright protection and Visible Marking for regulatory compliance. Furthermore, we establish a standardized evaluation framework with three specialized benchmarks (Image/Video/Audio-Bench) to ensure rigorous performance assessment. This toolkit bridges the gap between advanced algorithms and engineering implementation, fostering a more transparent and secure digital ecosystem.
CV | Mar 26, 2021 | Code
Few-Shot Human Motion Transfer by Personalized Geometry and Texture Modeling
Zhichao Huang, Xintong Han, Jia Xu et al.
We present a new method for few-shot human motion transfer that achieves realistic human image generation with only a small number of appearance inputs. Despite recent advances in single person motion transfer, prior methods often require a large number of training images and take a long time to train. One promising direction is to perform few-shot human motion transfer, which needs only a few source images for appearance transfer. However, it is particularly challenging to obtain satisfactory transfer results. In this paper, we address this issue by rendering a human texture map to a surface geometry (represented as a UV map), which is personalized to the source person. Our geometry generator combines the shape information from source images and the pose information from 2D keypoints to synthesize the personalized UV map. A texture generator then generates the texture map conditioned on the texture of source images to fill out invisible parts. Furthermore, we may fine-tune the texture map on the manifold of the texture generator from a few source images at test time, which improves the quality of the texture map without over-fitting or artifacts. Extensive experiments show the proposed method outperforms state-of-the-art methods both qualitatively and quantitatively. Our code is available at https://github.com/HuangZhiChao95/FewShotMotionTransfer.
55.7 | SE | Mar 19
SQL-Commenter: Aligning Large Language Models for SQL Comment Generation with Direct Preference Optimization
Lei Yu, Peng Wang, Jingyuan Zhang et al.
SQL query comprehension is a significant challenge due to complex syntax, diverse join types, and deep nesting. Many queries lack adequate comments, severely hindering code readability, maintainability, and knowledge transfer. Automated SQL comment generation faces two main challenges: limited datasets that inadequately represent complex real-world queries, and Large Language Models' (LLMs) insufficient understanding of SQL-specific semantics. Our empirical analysis shows that even after continual pre-training and supervised fine-tuning, LLMs struggle with complex SQL semantics, yielding inaccurate comments. To address this, we propose SQL-Commenter, an advanced method based on LLaMA-3.1-8B. We first construct a comprehensive dataset of complex SQL queries with expert-verified comments. Next, we perform continual pre-training on a large SQL corpus to enhance the LLM's syntax and semantic understanding, followed by supervised fine-tuning. Finally, we introduce Direct Preference Optimization (DPO) using human feedback. SQL-Commenter utilizes a preference-based loss function to favor preferred outputs, enhancing fine-grained semantic learning and context-dependent quality assessment. Evaluated on the Spider and Bird benchmarks, SQL-Commenter significantly outperforms state-of-the-art baselines. On average, it surpasses the strongest baseline (Qwen3-14B) by 9.29, 4.99, and 13.23 percentage points on BLEU-4, METEOR, and ROUGE-L, respectively. Moreover, human evaluation demonstrates the superior quality of comments generated by SQL-Commenter in terms of correctness, completeness, and naturalness.
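The preference-based loss mentioned in this abstract can be illustrated with the standard per-pair DPO objective, which rewards the policy for raising the log-probability of the preferred comment relative to a frozen reference model. This is a sketch of the generic DPO formula, not SQL-Commenter's training code; the function name, argument names, and the `beta` value are assumptions.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-pair DPO loss from sequence log-probabilities.

    policy_*: log-probabilities under the model being trained
    ref_*:    log-probabilities under the frozen reference model
    Computes -log(sigmoid(beta * margin)), where margin compares how much
    the policy has shifted towards the chosen output versus the rejected one.
    """
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    # -log(sigmoid(m)) = log(1 + exp(-m)), evaluated in a numerically stable way
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Hypothetical log-probabilities for one (preferred, rejected) comment pair.
neutral = dpo_loss(-1.0, -1.0, -1.0, -1.0)
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
worsened = dpo_loss(-3.0, -1.0, -2.0, -2.0)
```

A policy identical to the reference incurs a loss of log 2; the loss falls as the policy favors the preferred comment more than the reference does, and rises in the opposite case.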
AI | Feb 16
Frontier AI Risk Management Framework in Practice: A Risk Analysis Technical Report v1.5
Dongrui Liu, Yi Yu, Jie Zhang et al.
To understand and identify the unprecedented risks posed by rapidly advancing artificial intelligence (AI) models, Frontier AI Risk Management Framework in Practice presents a comprehensive assessment of their frontier risks. As the general capabilities of Large Language Models (LLMs) rapidly evolve and agentic AI proliferates, this version of the risk analysis technical report presents an updated and granular assessment of five critical dimensions: cyber offense, persuasion and manipulation, strategic deception, uncontrolled AI R&D, and self-replication. Specifically, we introduce more complex scenarios for cyber offense. For persuasion and manipulation, we evaluate the risk of LLM-to-LLM persuasion on newly released LLMs. For strategic deception and scheming, we add a new experiment on emergent misalignment. For uncontrolled AI R&D, we focus on the "mis-evolution" of agents as they autonomously expand their memory substrates and toolsets. In addition, we monitor and evaluate the safety performance of OpenClaw during the interaction on the Moltbook. For self-replication, we introduce a new resource-constrained scenario. More importantly, we propose and validate a series of robust mitigation strategies to address these emerging threats, providing a preliminary technical and actionable pathway for the secure deployment of frontier AI. This work reflects our current understanding of AI frontier risks and urges collective action to mitigate these challenges.
79.5 | SC | Apr 21
Majorization and Inequalities among Complete Homogeneous Symmetric Functions
Jia Xu, Yong Yao
Inequalities among symmetric functions are fundamental in various branches of mathematics, thus motivating a systematic study of their structure. Majorization has been shown to characterize inequalities among commonly used symmetric functions, except for complete homogeneous symmetric functions (shortened as CHs). In 2011, Cuttler, Greene, and Skandera posed a natural question: Can majorization also characterize inequalities among CHs? Their work demonstrated that majorization characterizes inequalities among CHs up to degree 7 and suggested exploring its validity for higher degrees. In this paper, we show that, for every degree greater than 7, majorization does not characterize inequalities among CHs.
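The two objects this abstract relates, majorization and complete homogeneous symmetric functions, can be made concrete with a small sketch. It only defines the majorization order on integer partitions and evaluates h_k at a numeric point; the function names are assumptions, and the exact normalization of the CH terms compared in the paper is not stated in the abstract, so it is not reproduced here.

```python
from itertools import combinations_with_replacement
from math import prod

def majorizes(a, b):
    """True if a majorizes b: equal totals, and every partial sum of a
    (sorted in descending order) is at least the matching partial sum of b.
    Shorter sequences are padded with zeros, as for integer partitions."""
    a_sorted = sorted(a, reverse=True)
    b_sorted = sorted(b, reverse=True)
    length = max(len(a_sorted), len(b_sorted))
    a_sorted += [0] * (length - len(a_sorted))
    b_sorted += [0] * (length - len(b_sorted))
    if sum(a_sorted) != sum(b_sorted):
        return False
    partial_a = partial_b = 0
    for x, y in zip(a_sorted, b_sorted):
        partial_a += x
        partial_b += y
        if partial_a < partial_b:
            return False
    return True

def complete_homogeneous(k, xs):
    """h_k(xs): the sum of all degree-k monomials in the variables xs."""
    return sum(prod(combo) for combo in combinations_with_replacement(xs, k))

# h_2(1, 2) = 1*1 + 1*2 + 2*2
h2_value = complete_homogeneous(2, [1, 2])
```

For example, `majorizes([3], [1, 1, 1])` holds because the partition (3) dominates (1, 1, 1) at every partial sum, while the reverse does not.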
AIJan 5, 2025
A Survey of Test-Time Compute: From Intuitive Inference to Deliberate ReasoningYixin Ji, Juntao Li, Yang Xiang et al.
The remarkable performance of the o1 model in complex reasoning demonstrates that test-time compute scaling can further unlock the model's potential, enabling powerful System-2 thinking. However, there is still a lack of comprehensive surveys for test-time compute scaling. We trace the concept of test-time compute back to System-1 models. In System-1 models, test-time compute addresses distribution shifts and improves robustness and generalization through parameter updating, input modification, representation editing, and output calibration. In System-2 models, it enhances the model's reasoning ability to solve complex problems through repeated sampling, self-correction, and tree search. We organize this survey according to the trend of System-1 to System-2 thinking, highlighting the key role of test-time compute in the transition from System-1 models to weak System-2 models, and then to strong System-2 models. We also point out advanced topics and future directions.
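Of the System-2 techniques listed in the abstract above, repeated sampling is the simplest to illustrate. The sketch below assumes one common variant, majority voting over independently sampled answers; `sample_fn` is a hypothetical stand-in for a model call and is not taken from the survey.

```python
import random
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer (ties go to the first one seen)."""
    return Counter(answers).most_common(1)[0][0]


def repeated_sampling(sample_fn, n_samples, seed=0):
    """Draw n_samples answers from sample_fn and aggregate them by majority vote.

    sample_fn is a hypothetical callable taking a random.Random instance and
    returning one candidate answer; a real system would call a model here.
    """
    rng = random.Random(seed)
    answers = [sample_fn(rng) for _ in range(n_samples)]
    return majority_vote(answers)
```

Spending more test-time compute here simply means raising `n_samples`; the aggregation step stays the same.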
CVMay 20, 2025
Unify Graph Learning with Text: Unleashing LLM Potentials for Session SearchSonghao Wu, Quan Tu, Hong Liu et al.
Session search involves a series of interactive queries and actions to fulfill a user's complex information need. Current strategies typically prioritize sequential modeling for deep semantic understanding, overlooking the graph structure in interactions. While some approaches focus on capturing structural information, they use a generalized representation for documents, neglecting word-level semantic modeling. In this paper, we propose Symbolic Graph Ranker (SGR), which aims to take advantage of both text-based and graph-based approaches by leveraging the power of recent Large Language Models (LLMs). Concretely, we first introduce a set of symbolic grammar rules to convert the session graph into text. This allows integrating session history, interaction process, and task instruction seamlessly as inputs for the LLM. Moreover, given the natural discrepancy between LLMs pre-trained on textual corpora and the symbolic language we produce using our graph-to-text grammar, our objective is to enhance LLMs' ability to capture graph structures within a textual format. To achieve this, we introduce a set of self-supervised symbolic learning tasks, including link prediction, node content generation, and generative contrastive learning, to enable LLMs to capture topological information from coarse-grained to fine-grained. Experimental results and comprehensive analysis on two benchmark datasets, AOL and Tiangong-ST, confirm the superiority of our approach. Our paradigm also offers a novel and effective methodology that bridges the gap between traditional search strategies and modern LLMs.
AIJul 22, 2025
Frontier AI Risk Management Framework in Practice: A Risk Analysis Technical ReportShanghai AI Lab, Xiaoyang Chen, Yunhao Chen et al.
To understand and identify the unprecedented risks posed by rapidly advancing artificial intelligence (AI) models, this report presents a comprehensive assessment of their frontier risks. Drawing on the E-T-C analysis (deployment environment, threat source, enabling capability) from the Frontier AI Risk Management Framework (v1.0) (SafeWork-F1-Framework), we identify critical risks in seven areas: cyber offense, biological and chemical risks, persuasion and manipulation, uncontrolled autonomous AI R\&D, strategic deception and scheming, self-replication, and collusion. Guided by the "AI-$45^\circ$ Law," we evaluate these risks using "red lines" (intolerable thresholds) and "yellow lines" (early warning indicators) to define risk zones: green (manageable risk for routine deployment and continuous monitoring), yellow (requiring strengthened mitigations and controlled deployment), and red (necessitating suspension of development and/or deployment). Experimental results show that all recent frontier AI models reside in green and yellow zones, without crossing red lines. Specifically, no evaluated models cross the yellow line for cyber offense or uncontrolled AI R\&D risks. For self-replication, and strategic deception and scheming, most models remain in the green zone, except for certain reasoning models in the yellow zone. In persuasion and manipulation, most models are in the yellow zone due to their effective influence on humans. For biological and chemical risks, we are unable to rule out the possibility of most models residing in the yellow zone, although detailed threat modeling and in-depth assessment are required to make further claims. This work reflects our current understanding of AI frontier risks and urges collective action to mitigate these challenges.
IRDec 17, 2024
Boosting LLM-based Relevance Modeling with Distribution-Aware Robust LearningHong Liu, Saisai Gong, Yixin Ji et al.
With the rapid advancement of pre-trained large language models (LLMs), recent endeavors have leveraged the capabilities of LLMs in relevance modeling, resulting in enhanced performance. This is usually done through the process of fine-tuning LLMs on specifically annotated datasets to determine the relevance between queries and items. However, there are two limitations when LLMs are naively employed for relevance modeling through fine-tuning and inference. First, it is not inherently efficient for performing nuanced tasks beyond simple yes or no answers, such as assessing search relevance. It may therefore tend to be overconfident and struggle to distinguish fine-grained degrees of relevance (e.g., strong relevance, weak relevance, irrelevance) used in search engines. Second, it exhibits significant performance degradation when confronted with data distribution shift in real-world scenarios. In this paper, we propose a novel Distribution-Aware Robust Learning framework (DaRL) for relevance modeling in Alipay Search. Specifically, we design an effective loss function to enhance the discriminability of LLM-based relevance modeling across various fine-grained degrees of query-item relevance. To improve the generalizability of LLM-based relevance modeling, we first propose the Distribution-Aware Sample Augmentation (DASA) module. This module utilizes out-of-distribution (OOD) detection techniques to actively select appropriate samples that are not well covered by the original training set for model fine-tuning. Furthermore, we adopt a multi-stage fine-tuning strategy to simultaneously improve in-distribution (ID) and OOD performance, bridging the performance gap between them. DaRL has been deployed online to serve the Alipay's insurance product search...
AINov 13, 2024
Responsible AI in Construction Safety: Systematic Evaluation of Large Language Models and Prompt EngineeringFarouq Sammour, Jia Xu, Xi Wang et al.
Construction remains one of the most hazardous sectors. Recent advancements in AI, particularly Large Language Models (LLMs), offer promising opportunities for enhancing workplace safety. However, responsible integration of LLMs requires systematic evaluation, as deploying them without understanding their capabilities and limitations risks generating inaccurate information, fostering misplaced confidence, and compromising worker safety. This study evaluates the performance of two widely used LLMs, GPT-3.5 and GPT-4o, across three standardized exams administered by the Board of Certified Safety Professionals (BCSP). Using 385 questions spanning seven safety knowledge areas, the study analyzes the models' accuracy, consistency, and reliability. Results show that both models consistently exceed the BCSP benchmark, with GPT-4o achieving an accuracy rate of 84.6% and GPT-3.5 reaching 73.8%. Both models demonstrate strengths in safety management systems and hazard identification and control, but exhibit weaknesses in science, mathematics, emergency response, and fire prevention. An error analysis identifies four primary limitations affecting LLM performance: lack of knowledge, reasoning flaws, memory issues, and calculation errors. Our study also highlights the impact of prompt engineering strategies, with variations in accuracy reaching 13.5% for GPT-3.5 and 7.9% for GPT-4o. However, no single prompt configuration proves universally effective. This research advances knowledge in three ways: by identifying areas where LLMs can support safety practices and where human oversight remains essential, by offering practical insights into improving LLM implementation through prompt engineering, and by providing evidence-based direction for future research and development. These contributions support the responsible integration of AI in construction safety management toward achieving zero injuries.
LGMay 25, 2025
Tokenizing Electron Cloud in Protein-Ligand Interaction LearningHaitao Lin, Odin Zhang, Jia Xu et al.
The affinity and specificity of protein-molecule binding directly impact functional outcomes, uncovering the mechanisms underlying biological regulation and signal transduction. Most deep-learning-based prediction approaches focus on structures of atoms or fragments. However, quantum chemical properties, such as electronic structures, are the key to unveiling interaction patterns but remain largely underexplored. To bridge this gap, we propose ECBind, a method for tokenizing electron cloud signals into quantized embeddings, enabling their integration into downstream tasks such as binding affinity prediction. By incorporating electron densities, ECBind helps uncover binding modes that cannot be fully represented by atom-level models. Specifically, to remove the redundancy inherent in electron cloud signals, a structure-aware transformer and hierarchical codebooks encode 3D binding sites enriched with electron structures into tokens. These tokenized codes are then used for specific tasks with labels. To extend its applicability to a wider range of scenarios, we utilize knowledge distillation to develop an electron-cloud-agnostic prediction model. Experimentally, ECBind demonstrates state-of-the-art performance across multiple tasks, achieving improvements of 6.42\% and 15.58\% in per-structure Pearson and Spearman correlation coefficients, respectively.
HCOct 14, 2024
Tracing Human Stress from Physiological Signals using UWB RadarJia Xu, Teng Xiao, Pin Lv et al.
Stress tracing is an important research domain that supports many applications, such as health care and stress management, and its most closely related work derives from stress detection. However, existing work does not adequately address two important challenges facing stress detection. First, most of these studies involve asking users to wear physiological sensors to detect their stress states, which has a negative impact on the user experience. Second, these studies have failed to effectively utilize multimodal physiological signals, which results in less satisfactory detection results. This paper formally defines the stress tracing problem, which emphasizes the continuous detection of human stress states. A novel deep stress tracing method, named DST, is presented. DST traces human stress based on physiological signals collected by a noncontact ultrawideband radar, which is friendlier to users when collecting their physiological signals. In DST, a signal extraction module is first designed to robustly extract multimodal physiological signals from the raw RF data of the radar, even in the presence of body movement. Afterward, a multimodal fusion module is proposed in DST to ensure that the extracted multimodal physiological signals can be effectively fused and utilized. Extensive experiments are conducted on three real-world datasets, including one self-collected dataset and two public datasets. Experimental results show that the proposed DST method significantly outperforms all the baselines in terms of tracing human stress states. On average, DST provides a 6.31% increase in detection accuracy across all datasets compared with the best baselines.
CLFeb 18, 2024
A Note on Bias to CompleteJia Xu, Mona Diab
Minimizing social bias strengthens societal bonds, promoting shared understanding and better decision-making. We revisit the definition of bias by discovering new bias types (e.g., societal status) in dynamic environments and describe them relative to context, such as culture, region, time, and personal background. Our framework includes eight hypotheses about bias and a bias-minimizing strategy for each hypothesis, as well as five methods as proposed solutions for LLMs. The realization of the framework is yet to be completed.
IRMar 27, 2025
Alleviating LLM-based Generative Retrieval Hallucination in Alipay SearchYedan Shen, Kaixin Wu, Yuechen Ding et al.
Generative retrieval (GR) has revolutionized document retrieval with the advent of large language models (LLMs), and LLM-based GR is gradually being adopted by the industry. Despite its remarkable advantages and potential, LLM-based GR suffers from hallucination and generates documents that are irrelevant to the query in some instances, severely challenging its credibility in practical applications. We thereby propose an optimized GR framework designed to alleviate retrieval hallucination, which integrates knowledge distillation reasoning in model training and incorporates a decision agent to further improve retrieval precision. Specifically, we employ LLMs to assess and reason about GR-retrieved query-document (q-d) pairs, and then distill the reasoning data as transferred knowledge to the GR model. Moreover, we utilize a decision agent as post-processing to extend the GR-retrieved documents through a retrieval model and select the most relevant ones from multiple perspectives as the final generative retrieval result. Extensive offline experiments on real-world datasets and online A/B tests on Fund Search and Insurance Search in Alipay demonstrate our framework's superiority and effectiveness in improving search quality and conversion gains.
AIDec 2, 2024
CPRM: A LLM-based Continual Pre-training Framework for Relevance Modeling in Commercial SearchKaixin Wu, Yixin Ji, Zeyuan Chen et al.
Relevance modeling between queries and items stands as a pivotal component in commercial search engines, directly affecting the user experience. Given the remarkable achievements of large language models (LLMs) in various natural language processing (NLP) tasks, LLM-based relevance modeling is gradually being adopted within industrial search systems. Nevertheless, foundational LLMs lack domain-specific knowledge and do not fully exploit the potential of in-context learning. Furthermore, structured item text remains underutilized, and there is a shortage in the supply of corresponding queries and background knowledge. We thereby propose CPRM (Continual Pre-training for Relevance Modeling), a framework designed for the continual pre-training of LLMs to address these issues. Our CPRM framework includes three modules: 1) employing both queries and multi-field item to jointly pre-train for enhancing domain knowledge, 2) applying in-context pre-training, a novel approach where LLMs are pre-trained on a sequence of related queries or items, and 3) conducting reading comprehension on items to produce associated domain knowledge and background information (e.g., generating summaries and corresponding queries) to further strengthen LLMs. Results on offline experiments and online A/B testing demonstrate that our model achieves convincing performance compared to strong baselines.
LGSep 28, 2025
Echo Flow NetworksHongbo Liu, Jia Xu
At the heart of time-series forecasting (TSF) lies a fundamental challenge: how can models efficiently and effectively capture long-range temporal dependencies across ever-growing sequences? While deep learning has brought notable progress, conventional architectures often face a trade-off between computational complexity and their ability to retain accumulative information over extended horizons. Echo State Networks (ESNs), a class of reservoir computing models, have recently regained attention for their exceptional efficiency, offering constant memory usage and per-step training complexity regardless of input length. This makes them particularly attractive for modeling extremely long-term event history in TSF. However, traditional ESNs fall short of state-of-the-art performance due to their limited nonlinear capacity, which constrains both their expressiveness and stability. We introduce Echo Flow Networks (EFNs), a framework composed of a group of extended Echo State Networks (X-ESNs) with MLP readouts, enhanced by our novel Matrix-Gated Composite Random Activation (MCRA), which enables complex, neuron-specific temporal dynamics, significantly expanding the network's representational capacity without compromising computational efficiency. In addition, we propose a dual-stream architecture in which recent input history dynamically selects signature reservoir features from an infinite-horizon memory, leading to improved prediction accuracy and long-term stability. Extensive evaluations on five benchmarks demonstrate that EFNs achieve up to 4x faster training and 3x smaller model size compared to leading methods like PatchTST, reducing forecasting error from 43% to 35%, a 20% relative improvement. One instantiation of our framework, EchoFormer, consistently achieves new state-of-the-art performance across five benchmark datasets: ETTh, ETTm, DMV, Weather, and Air Quality.
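The constant-memory property that the abstract above attributes to Echo State Networks can be seen in a minimal sketch of a plain reservoir update, x_t = tanh(W_in u_t + W x_{t-1}). This is the textbook ESN recurrence only, not the X-ESN, MCRA, or dual-stream components of the paper; the function names are illustrative, and the small uniform recurrent weights are an assumed stand-in for proper spectral-radius scaling.

```python
import math
import random


def make_reservoir(n_units, n_inputs, scale=0.1, seed=0):
    """Create fixed random input and recurrent weights (never trained)."""
    rng = random.Random(seed)
    w_in = [[rng.uniform(-1, 1) for _ in range(n_inputs)] for _ in range(n_units)]
    # Assumption: small uniform weights stand in for spectral-radius scaling.
    w = [[rng.uniform(-scale, scale) for _ in range(n_units)] for _ in range(n_units)]
    return w_in, w


def step(state, u, w_in, w):
    """One reservoir update: new_state = tanh(W_in u + W state)."""
    new_state = []
    for i in range(len(state)):
        pre = sum(w_in[i][j] * u[j] for j in range(len(u)))
        pre += sum(w[i][k] * state[k] for k in range(len(state)))
        new_state.append(math.tanh(pre))
    return new_state


def run(inputs, n_units=8, seed=0):
    """Feed a sequence through the reservoir and return the final state."""
    w_in, w = make_reservoir(n_units, len(inputs[0]), seed=seed)
    state = [0.0] * n_units
    for u in inputs:
        state = step(state, u, w_in, w)
    return state
```

The state has `n_units` entries whether the sequence has five steps or five thousand, which is why memory use does not grow with input length; only a readout on top of the state would be trained.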
CLSep 28, 2025
ResFormer: All-Time Reservoir Memory for Long Sequence ClassificationHongbo Liu, Jia Xu
Sequence classification is essential in NLP for understanding and categorizing language patterns in tasks like sentiment analysis, intent detection, and topic classification. Transformer-based models, despite achieving state-of-the-art performance, have inherent limitations due to quadratic time and memory complexity, restricting their input length. Although extensive efforts have aimed at reducing computational demands, processing extensive contexts remains challenging. To overcome these limitations, we propose ResFormer, a novel neural network architecture designed to model varying context lengths efficiently through a cascaded methodology. ResFormer integrates a reservoir computing network featuring a nonlinear readout to effectively capture long-term contextual dependencies in linear time. Concurrently, short-term dependencies within sentences are modeled using a conventional Transformer architecture with fixed-length inputs. Experiments demonstrate that ResFormer significantly outperforms baseline models of DeepSeek-Qwen and ModernBERT, delivering an accuracy improvement of up to +22.3% on the EmoryNLP dataset and consistent gains on MultiWOZ, MELD, and IEMOCAP. In addition, ResFormer exhibits reduced memory consumption, underscoring its effectiveness and efficiency in modeling extensive contextual information.
NCJun 10, 2025
Sparse Autoencoders Bridge The Deep Learning Model and The BrainZiming Mao, Jia Xu, Zeqi Zheng et al.
We present SAE-BrainMap, a novel framework that directly aligns deep learning visual model representations with voxel-level fMRI responses using sparse autoencoders (SAEs). First, we train layer-wise SAEs on model activations and use cosine similarity to measure the correspondence between SAE unit activations and cortical fMRI signals elicited by the same natural image stimuli, revealing strong activation correspondence (maximum similarity up to 0.76). Based on this alignment, we construct a voxel dictionary by optimally assigning the most similar SAE feature to each voxel, demonstrating that SAE units preserve the functional structure of predefined regions of interest (ROIs) and exhibit ROI-consistent selectivity. Finally, we establish a fine-grained hierarchical mapping between model layers and the human ventral visual pathway. By projecting voxel dictionary activations onto individual cortical surfaces, we also visualize the dynamic transformation of visual information in deep learning models. We find that ViT-B/16$_{CLIP}$ tends to use low-level information to generate high-level semantic information in the early layers and reconstructs the low-dimensional information later. Our results establish a direct, downstream-task-free bridge between deep neural networks and human visual cortex, offering new insights into model interpretability.
IRMay 20, 2025
Bridge the Gap between Past and Future: Siamese Model Optimization for Context-Aware Document RankingSonghao Wu, Quan Tu, Mingjie Zhong et al.
In the realm of information retrieval, users often engage in multi-turn interactions with search engines to acquire information, leading to the formation of sequences of user feedback behaviors. Leveraging the session context has proven to be beneficial for inferring user search intent and document ranking. A multitude of approaches have been proposed to exploit in-session context for improved document ranking. Despite these advances, the limitation of historical session data for capturing evolving user intent remains a challenge. In this work, we explore the integration of future contextual information into the session context to enhance document ranking. We present the siamese model optimization framework, comprising a history-conditioned model and a future-aware model. The former processes only the historical behavior sequence, while the latter integrates both historical and anticipated future behaviors. Both models are trained collaboratively using the supervised labels and pseudo labels predicted by the other. The history-conditioned model, referred to as ForeRanker, progressively learns future-relevant information to enhance ranking, while it singly uses historical session at inference time. To mitigate inconsistencies during training, we introduce the peer knowledge distillation method with a dynamic gating mechanism, allowing models to selectively incorporate contextual information. Experimental results on benchmark datasets demonstrate the effectiveness of our ForeRanker, showcasing its superior performance compared to existing methods.
LGDec 19, 2024
A hybrid framework for effective and efficient machine unlearningMingxin Li, Yizhen Yu, Ning Wang et al.
Machine unlearning (MU) has recently been proposed to remove the imprints of revoked samples from already trained model parameters, addressing users' privacy concerns. As alternatives to expensive retraining from scratch, there are two research lines, exact MU and approximate MU, which make different trade-offs between accuracy and efficiency. In this paper, we present a novel hybrid strategy that builds on both. It implements the unlearning operation at an acceptable computation cost while improving accuracy as much as possible. Specifically, it selects an appropriate unlearning technique by estimating the retraining workload caused by revocations. If the workload is lightweight, it performs retraining to derive model parameters consistent with the accurate ones retrained from scratch. Otherwise, it outputs the unlearned model by directly modifying the current parameters, for better efficiency. To improve the accuracy in the latter case, we propose an optimized version that amends the output model with a lightweight runtime penalty. We also study the boundary between the two approaches in our framework so that the selection can be made adaptively. Extensive experiments on real datasets validate that our proposals can improve the unlearning efficiency by 1.5$\times$ to 8$\times$ while achieving comparable accuracy.
AIOct 21, 2024
Subword Embedding from Bytes Gains Privacy without Sacrificing Accuracy and ComplexityMengjiao Zhang, Jia Xu
While NLP models significantly impact our lives, there are rising concerns about privacy invasion. Although federated learning enhances privacy, attackers may recover private training data by exploiting model parameters and gradients. Therefore, protecting against such embedding attacks remains an open challenge. To address this, we propose Subword Embedding from Bytes (SEB) and encode subwords to byte sequences using deep neural networks, making input text recovery harder. Importantly, our method requires a smaller memory with $256$ bytes of vocabulary while keeping efficiency with the same input length. Thus, our solution outperforms conventional approaches by preserving privacy without sacrificing efficiency or accuracy. Our experiments show SEB can effectively protect against embedding-based attacks from recovering original sentences in federated learning. Meanwhile, we verify that SEB obtains comparable and even better results over standard subword embedding methods in machine translation, sentiment analysis, and language modeling with even lower time and space complexity.
LGFeb 14, 2024
Changes by Butterflies: Farsighted Forecasting with Group Reservoir TransformerMd Kowsher, Abdul Rafae Khan, Jia Xu
In Chaos, a minor divergence between two initial conditions exhibits exponential amplification over time, leading to far-away outcomes, known as the butterfly effect. Thus, the distant future is full of uncertainty and hard to forecast. We introduce Group Reservoir Transformer to predict long-term events more accurately and robustly by overcoming two challenges in Chaos: (1) the extensive historical sequences and (2) the sensitivity to initial conditions. A reservoir is attached to a Transformer to efficiently handle arbitrarily long historical lengths, with an extension to a group of reservoirs to reduce the sensitivity to initialization variations. Our architecture consistently outperforms state-of-the-art models in multivariate time series, including TimeLLM, GPT2TS, PatchTST, DLinear, TimeNet, and the baseline Transformer, with an error reduction of up to 59\% in various fields such as ETTh, ETTm, and air quality, demonstrating that an ensemble of butterfly learning can improve the adequacy and certainty of event prediction, despite the traveling time to the unknown future.
CLJan 19, 2024
Investigating Training Strategies and Model Robustness of Low-Rank Adaptation for Language Modeling in Speech RecognitionYu Yu, Chao-Han Huck Yang, Tuan Dinh et al.
The use of low-rank adaptation (LoRA) with frozen pretrained language models (PLMs) has become increasingly popular as a mainstream, resource-efficient modeling approach for memory-constrained hardware. In this study, we first explore how to enhance model performance by introducing various LoRA training strategies, achieving relative word error rate reductions of 3.50\% on the public Librispeech dataset and of 3.67\% on an internal dataset in the messaging domain. To further characterize the stability of LoRA-based second-pass speech recognition models, we examine robustness against input perturbations. These perturbations are rooted in homophone replacements and a novel metric called N-best Perturbation-based Rescoring Robustness (NPRR), both designed to measure the relative degradation in the performance of rescoring models. Our experimental results indicate that while advanced variants of LoRA, such as dynamic rank-allocated LoRA, lead to performance degradation in $1$-best perturbation, they alleviate the degradation in $N$-best perturbation. This finding is in comparison to fully-tuned models and vanilla LoRA tuning baselines, suggesting that a comprehensive selection is needed when using LoRA-based adaptation for compute-cost savings and robust language modeling.
STMay 9, 2023
Copula Variational LSTM for High-dimensional Cross-market Multivariate Dependence ModelingJia Xu, Longbing Cao
We address an important yet challenging problem - modeling high-dimensional dependencies across multivariates such as financial indicators in heterogeneous markets. In reality, a market couples and influences others over time, and the financial variables of a market are also coupled. We make the first attempt to integrate variational sequential neural learning with copula-based dependence modeling to characterize both temporal observable and latent variable-based dependence degrees and structures across non-normal multivariates. Our variational neural network WPVC-VLSTM models variational sequential dependence degrees and structures across multivariate time series by variational long short-term memory networks and regular vine copula. The regular vine copula models nonnormal and long-range distributional couplings across multiple dynamic variables. WPVC-VLSTM is verified in terms of both technical significance and portfolio forecasting performance. It outperforms benchmarks including linear models, stochastic volatility models, deep neural networks, and variational recurrent networks in cross-market portfolio forecasting.
LGFeb 10, 2022
Reinforcement Learning in the Wild: Scalable RL Dispatching Algorithm Deployed in Ridehailing MarketplaceSoheil Sadeghi Eshkevari, Xiaocheng Tang, Zhiwei Qin et al.
In this study, a real-time dispatching algorithm based on reinforcement learning is proposed and, for the first time, deployed at large scale. Current dispatching methods in ridehailing platforms are predominantly based on myopic or rule-based non-myopic approaches. Reinforcement learning enables dispatching policies that are informed by historical data and able to employ the learned information to optimize returns of expected future trajectories. Previous studies in this field yielded promising results, yet have left room for further improvements in terms of performance gain, self-dependency, transferability, and scalable deployment mechanisms. The present study proposes a standalone RL-based dispatching solution that is equipped with multiple mechanisms to ensure robust and efficient on-policy learning and inference while being adaptable for full-scale deployment. A new form of value updating based on temporal difference is proposed that is better adapted to the inherent uncertainty of the problem. For the driver-order assignment, a customized utility function is proposed that, when tuned based on the statistics of the market, results in remarkable performance improvement and interpretability. In addition, to reduce the risk of cancellation after drivers' assignment, an adaptive graph pruning strategy based on the multi-armed bandit problem is introduced. The method is evaluated using offline simulation with real data and yields notable performance improvement. In addition, the algorithm is deployed online in multiple cities under DiDi's operation for A/B testing and is launched in one of the major international markets as the primary mode of dispatch. The deployed algorithm shows over 1.3% improvement in total driver income from A/B testing. In addition, by causal inference analysis, as much as 5.3% improvement in major performance metrics is detected after full-scale deployment.
IRAug 5, 2021
Itinerary-aware Personalized Deep Matching at FliggyJia Xu, Ziyi Wang, Zulong Chen et al.
Matching items for a user from a travel item pool of large cardinality has been the most important technology for increasing the business at Fliggy, one of the most popular online travel platforms (OTPs) in China. There are three major challenges facing OTPs: sparsity, diversity, and implicitness. In this paper, we present a novel Fliggy ITinerary-aware deep matching NETwork (FitNET) to address these three challenges. FitNET is designed based on the popular deep matching network, which has been successfully employed in many industrial recommendation systems due to its effectiveness. The concept of an itinerary, defined as the list of unconsumed orders of a user, is proposed for the first time in the context of recommendation systems for OTPs. All orders in a user itinerary are learned as a whole, based on which the implicit travel intention of each user can be more accurately inferred. To alleviate the sparsity problem, users' profiles are incorporated into FitNET. Meanwhile, a series of itinerary-aware attention mechanisms that capture the vital interactions between a user's itinerary and other input categories are carefully designed. These mechanisms are very helpful in inferring a user's travel intention or preference and in handling the diversity of a user's needs. Further, two training objectives, i.e., prediction accuracy of a user's travel intention and prediction accuracy of a user's click behavior, are utilized by FitNET, so that these two objectives can be optimized simultaneously. An offline experiment on a Fliggy production dataset with over 0.27 million users and 1.55 million travel items, and an online A/B test, both show that FitNET effectively learns users' travel intentions, preferences, and diverse needs based on their itineraries and achieves superior performance compared with state-of-the-art methods. FitNET has now been successfully deployed at Fliggy, serving major online traffic.
LG Jun 25, 2021
Interpreting Criminal Charge Prediction and Its Algorithmic Bias via Quantum-Inspired Complex Valued Networks
Abdul Rafae Khan, Jia Xu, Peter Varsanyi et al.
While predictive policing has become increasingly common in assisting with decisions in the criminal justice system, the use of these results is still controversial. Some software based on deep learning lacks accuracy (e.g., in F-1), and importantly, many decision processes are not transparent, causing doubt about decision bias, such as perceived racial and age disparities. This paper addresses bias issues with post-hoc explanations to provide a trustable prediction of whether a person will receive future criminal charges given their previous criminal records, by learning temporal behavior patterns over twenty years. Bi-LSTM relieves the vanishing gradient problem, attentional mechanisms allow learning and interpretation of feature importance, and complex-valued networks inspired by quantum physics facilitate a certain level of transparency in modeling the decision process. Our approach shows consistent and reliable prediction precision and recall on a real-life dataset. Our analysis of the importance of each input feature shows the critical causal impact on decision-making, suggesting that criminal histories are statistically significant factors, while identifiers, such as race and age, are not. Finally, our algorithm indicates that a suspect tends to increase crime severity level gradually over time rather than suddenly.
CV Jun 8, 2021
Learning by Distillation: A Self-Supervised Learning Framework for Optical Flow Estimation
Pengpeng Liu, Michael R. Lyu, Irwin King et al.
We present DistillFlow, a knowledge distillation approach to learning optical flow. DistillFlow trains multiple teacher models and a student model, where challenging transformations are applied to the input of the student model to generate hallucinated occlusions as well as less confident predictions. Then, a self-supervised learning framework is constructed: confident predictions from the teacher models serve as annotations to guide the student model to learn optical flow for those less confident predictions. The self-supervised learning framework enables us to effectively learn optical flow from unlabeled data, not only for non-occluded pixels, but also for occluded pixels. DistillFlow achieves state-of-the-art unsupervised learning performance on both the KITTI and Sintel datasets. Our self-supervised pre-trained model also provides an excellent initialization for supervised fine-tuning, suggesting an alternate training paradigm in contrast to current supervised learning methods that rely heavily on pre-training on synthetic data. At the time of writing, our fine-tuned models ranked 1st among all monocular methods on the KITTI 2015 benchmark and outperformed all published methods on the Sintel Final benchmark. More importantly, we demonstrate the generalization capability of DistillFlow in three aspects: framework generalization, correspondence generalization and cross-dataset generalization.
SE Jan 30, 2021
EdgeWorkflowReal: An Edge Computing based Workflow Execution Engine for Smart Systems
Xuejun Li, Ran Ding, Xiao Liu et al.
Current cloud-based smart systems suffer from weaknesses such as high response latency, limited network bandwidth and the restricted computing power of smart end devices which seriously affect the system's QoS (Quality of Service). Recently, given its advantages of low latency, high bandwidth and location awareness, edge computing has become a promising solution for smart systems. However, the development of edge computing based smart systems is a very challenging job for software developers who do not have the skills for the creation of edge computing environments. The management of edge computing resources and computing tasks is also very challenging. Workflow technology has been widely used in smart systems to automate task and resource management, but there does not yet exist a real-world deployable edge computing based workflow execution engine. To fill this gap, we present EdgeWorkflowReal, an edge computing based workflow execution engine for smart systems. EdgeWorkflowReal supports: 1) automatic creation of a real edge computing environment according to user settings; 2) visualized modelling of edge workflow applications; and 3) automatic deployment, monitoring and performance evaluation of edge workflow applications in a smart system.
CV Oct 9, 2020
Learning 3D Face Reconstruction with a Pose Guidance Network
Pengpeng Liu, Xintong Han, Michael Lyu et al.
We present a self-supervised learning approach to learning monocular 3D face reconstruction with a pose guidance network (PGN). First, we unveil the bottleneck of pose estimation in prior parametric 3D face learning methods, and propose to utilize 3D face landmarks for estimating pose parameters. With our specially designed PGN, our model can learn from both faces with fully labeled 3D landmarks and unlimited unlabeled in-the-wild face images. Our network is further augmented with a self-supervised learning scheme, which exploits face geometry information embedded in multiple frames of the same person, to alleviate the ill-posed nature of regressing 3D face geometry from a single image. These three insights yield a single approach that combines the complementary strengths of parametric model learning and data-driven learning techniques. We conduct a rigorous evaluation on the challenging AFLW2000-3D, Florence and FaceWarehouse datasets, and show that our method outperforms the state-of-the-art for all metrics.
CV Apr 5, 2020
Flow2Stereo: Effective Self-Supervised Learning of Optical Flow and Stereo Matching
Pengpeng Liu, Irwin King, Michael Lyu et al.
In this paper, we propose a unified method to jointly learn optical flow and stereo matching. Our first intuition is that stereo matching can be modeled as a special case of optical flow, and that we can leverage the 3D geometry behind stereoscopic videos to guide the learning of these two forms of correspondences. We then incorporate this knowledge into the state-of-the-art self-supervised learning framework, and train one single network to estimate both flow and stereo. Second, we unveil the bottlenecks in prior self-supervised learning approaches, and propose to create a new set of challenging proxy tasks to boost performance. These two insights yield a single model that achieves the highest accuracy among all existing unsupervised flow and stereo methods on the KITTI 2012 and 2015 benchmarks. More remarkably, our self-supervised method even outperforms several state-of-the-art fully supervised methods, including PWC-Net and FlowNet2, on KITTI 2012.