CLJun 15, 2022
Alexa Teacher Model: Pretraining and Distilling Multi-Billion-Parameter Encoders for Natural Language Understanding SystemsJack FitzGerald, Shankar Ananthakrishnan, Konstantine Arkoudas et al. · amazon-science, gatech
We present results from a large-scale experiment on pretraining encoders with non-embedding parameter counts ranging from 700M to 9.3B, their subsequent distillation into smaller models ranging from 17M-170M parameters, and their application to the Natural Language Understanding (NLU) component of a virtual assistant system. Though we train using 70% spoken-form data, our teacher models perform comparably to XLM-R and mT5 when evaluated on the written-form Cross-lingual Natural Language Inference (XNLI) corpus. We perform a second stage of pretraining on our teacher models using in-domain data from our system, improving error rates by 3.86% relative for intent classification and 7.01% relative for slot filling. We find that even a 170M-parameter model distilled from our Stage 2 teacher model has 2.88% better intent classification and 7.69% better slot filling error rates when compared to the 2.3B-parameter teacher trained only on public data (Stage 1), emphasizing the importance of in-domain data for pretraining. When evaluated offline using labeled NLU data, our 17M-parameter Stage 2 distilled model outperforms both XLM-R Base (85M params) and DistillBERT (42M params) by 4.23% to 6.14%, respectively. Finally, we present results from a full virtual assistant experimentation platform, where we find that models trained using our pretraining and distillation pipeline outperform models distilled from 85M-parameter teachers by 3.74%-4.91% on an automatic measurement of full-system user dissatisfaction.
CVAug 15, 2024Code
HAIR: Hypernetworks-based All-in-One Image RestorationJin Cao, Yi Cao, Li Pang et al.
Image restoration aims to recover a high-quality clean image from its degraded version. Recent progress in image restoration has demonstrated the effectiveness of All-in-One image restoration models in addressing various unknown degradations simultaneously. However, these existing methods typically utilize the same parameters to tackle images with different types of degradation, forcing the model to balance the performance between different tasks and limiting its performance on each task. To alleviate this issue, we propose HAIR, a Hypernetworks-based All-in-One Image Restoration plug-and-play method that generates parameters based on the input image and thus makes the model to adapt to specific degradation dynamically. Specifically, HAIR consists of two main components, i.e., Classifier and Hyper Selecting Net (HSN). The Classifier is a simple image classification network used to generate a Global Information Vector (GIV) that contains the degradation information of the input image, and the HSN is a simple fully-connected neural network that receives the GIV and outputs parameters for the corresponding modules. Extensive experiments demonstrate that HAIR can significantly improve the performance of existing image restoration models in a plug-and-play manner, both in single-task and All-in-One settings. Notably, our proposed model Res-HAIR, which integrates HAIR into the well-known Restormer, can obtain superior or comparable performance compared with current state-of-the-art methods. Moreover, we theoretically demonstrate that to achieve a given small enough error, our proposed HAIR requires fewer parameters in contrast to mainstream embedding-based All-in-One methods. The code is available at https://github.com/toummHus/HAIR.
CLApr 28, 2022
Instilling Type Knowledge in Language Models via Multi-Task QAShuyang Li, Mukund Sridhar, Chandana Satya Prakash et al. · amazon-science
Understanding human language often necessitates understanding entities and their place in a taxonomy of knowledge -- their types. Previous methods to learn entity types rely on training classifiers on datasets with coarse, noisy, and incomplete labels. We introduce a method to instill fine-grained type knowledge in language models with text-to-text pre-training on type-centric questions leveraging knowledge base documents and knowledge graphs. We create the WikiWiki dataset: entities and passages from 10M Wikipedia articles linked to the Wikidata knowledge graph with 41K types. Models trained on WikiWiki achieve state-of-the-art performance in zero-shot dialog state tracking benchmarks, accurately infer entity types in Wikipedia articles, and can discover new types deemed useful by human judges.
LGApr 20, 2022
DAME: Domain Adaptation for Matching EntitiesMohamed Trabelsi, Jeff Heflin, Jin Cao
Entity matching (EM) identifies data records that refer to the same real-world entity. Despite the effort in the past years to improve the performance in EM, the existing methods still require a huge amount of labeled data in each domain during the training phase. These methods treat each domain individually, and capture the specific signals for each dataset in EM, and this leads to overfitting on just one dataset. The knowledge that is learned from one dataset is not utilized to better understand the EM task in order to make predictions on the unseen datasets with fewer labeled samples. In this paper, we propose a new domain adaptation-based method that transfers the task knowledge from multiple source domains to a target domain. Our method presents a new setting for EM where the objective is to capture the task-specific knowledge from pretraining our model using multiple source domains, then testing our model on a target domain. We study the zero-shot learning case on the target domain, and demonstrate that our method learns the EM task and transfers knowledge to the target domain. We extensively study fine-tuning our model on the target dataset from multiple domains, and demonstrate that our model generalizes better than state-of-the-art methods in EM.
CVDec 9, 2025
WonderZoom: Multi-Scale 3D World GenerationJin Cao, Hong-Xing Yu, Jiajun Wu
We present WonderZoom, a novel approach to generating 3D scenes with contents across multiple spatial scales from a single image. Existing 3D world generation models remain limited to single-scale synthesis and cannot produce coherent scene contents at varying granularities. The fundamental challenge is the lack of a scale-aware 3D representation capable of generating and rendering content with largely different spatial sizes. WonderZoom addresses this through two key innovations: (1) scale-adaptive Gaussian surfels for generating and real-time rendering of multi-scale 3D scenes, and (2) a progressive detail synthesizer that iteratively generates finer-scale 3D contents. Our approach enables users to "zoom into" a 3D region and auto-regressively synthesize previously non-existent fine details from landscapes to microscopic features. Experiments demonstrate that WonderZoom significantly outperforms state-of-the-art video and 3D models in both quality and alignment, enabling multi-scale 3D world creation from a single image. We show video results and an interactive viewer of generated multi-scale 3D worlds in https://wonderzoom.github.io/
CVSep 24, 2024
Training Data Attribution: Was Your Model Secretly Trained On Data Created By Mine?Likun Zhang, Hao Wu, Lingcui Zhang et al.
The emergence of text-to-image models has recently sparked significant interest, but the attendant is a looming shadow of potential infringement by violating the user terms. Specifically, an adversary may exploit data created by a commercial model to train their own without proper authorization. To address such risk, it is crucial to investigate the attribution of a suspicious model's training data by determining whether its training data originates, wholly or partially, from a specific source model. To trace the generated data, existing methods require applying extra watermarks during either the training or inference phases of the source model. However, these methods are impractical for pre-trained models that have been released, especially when model owners lack security expertise. To tackle this challenge, we propose an injection-free training data attribution method for text-to-image models. It can identify whether a suspicious model's training data stems from a source model, without additional modifications on the source model. The crux of our method lies in the inherent memorization characteristic of text-to-image models. Our core insight is that the memorization of the training dataset is passed down through the data generated by the source model to the model trained on that data, making the source model and the infringing model exhibit consistent behaviors on specific samples. Therefore, our approach involves developing algorithms to uncover these distinct samples and using them as inherent watermarks to verify if a suspicious model originates from the source model. Our experiments demonstrate that our method achieves an accuracy of over 80\% in identifying the source of a suspicious model's training data, without interfering the original training or generation process of the source model.
LGJul 3, 2024
Multi-Scenario Combination Based on Multi-Agent Reinforcement Learning to Optimize the Advertising Recommendation SystemYang Zhao, Chang Zhou, Jin Cao et al.
This paper explores multi-scenario optimization on large platforms using multi-agent reinforcement learning (MARL). We address this by treating scenarios like search, recommendation, and advertising as a cooperative, partially observable multi-agent decision problem. We introduce the Multi-Agent Recurrent Deterministic Policy Gradient (MARDPG) algorithm, which aligns different scenarios under a shared objective and allows for strategy communication to boost overall performance. Our results show marked improvements in metrics such as click-through rate (CTR), conversion rate, and total sales, confirming our method's efficacy in practical settings.
LGMay 22, 2024
Optimizing Search Advertising Strategies: Integrating Reinforcement Learning with Generalized Second-Price Auctions for Enhanced Ad Ranking and BiddingChang Zhou, Yang Zhao, Jin Cao et al.
This paper explores the integration of strategic optimization methods in search advertising, focusing on ad ranking and bidding mechanisms within E-commerce platforms. By employing a combination of reinforcement learning and evolutionary strategies, we propose a dynamic model that adjusts to varying user interactions and optimizes the balance between advertiser cost, user relevance, and platform revenue. Our results suggest significant improvements in ad placement accuracy and cost efficiency, demonstrating the model's applicability in real-world scenarios.
IVMay 17, 2024
Infrared Image Super-Resolution via Lightweight Information Split NetworkShijie Liu, Kang Yan, Feiwei Qin et al.
Single image super-resolution (SR) is an established pixel-level vision task aimed at reconstructing a high-resolution image from its degraded low-resolution counterpart. Despite the notable advancements achieved by leveraging deep neural networks for SR, most existing deep learning architectures feature an extensive number of layers, leading to high computational complexity and substantial memory demands. These issues become particularly pronounced in the context of infrared image SR, where infrared devices often have stringent storage and computational constraints. To mitigate these challenges, we introduce a novel, efficient, and precise single infrared image SR model, termed the Lightweight Information Split Network (LISN). The LISN comprises four main components: shallow feature extraction, deep feature extraction, dense feature fusion, and high-resolution infrared image reconstruction. A key innovation within this model is the introduction of the Lightweight Information Split Block (LISB) for deep feature extraction. The LISB employs a sequential process to extract hierarchical features, which are then aggregated based on the relevance of the features under consideration. By integrating channel splitting and shift operations, the LISB successfully strikes an optimal balance between enhanced SR performance and a lightweight framework. Comprehensive experimental evaluations reveal that the proposed LISN achieves superior performance over contemporary state-of-the-art methods in terms of both SR quality and model complexity, affirming its efficacy for practical deployment in resource-constrained infrared imaging applications.
CLJan 3, 2025
Time Series Language Model for Descriptive Caption GenerationMohamed Trabelsi, Aidan Boyd, Jin Cao et al.
The automatic generation of representative natural language descriptions for observable patterns in time series data enhances interpretability, simplifies analysis and increases cross-domain utility of temporal data. While pre-trained foundation models have made considerable progress in natural language processing (NLP) and computer vision (CV), their application to time series analysis has been hindered by data scarcity. Although several large language model (LLM)-based methods have been proposed for time series forecasting, time series captioning is under-explored in the context of LLMs. In this paper, we introduce TSLM, a novel time series language model designed specifically for time series captioning. TSLM operates as an encoder-decoder model, leveraging both text prompts and time series data representations to capture subtle temporal patterns across multiple phases and generate precise textual descriptions of time series inputs. TSLM addresses the data scarcity problem in time series captioning by first leveraging an in-context prompting synthetic data generation, and second denoising the generated data via a novel cross-modal dense retrieval scoring applied to time series-caption pairs. Experimental findings on various time series captioning datasets demonstrate that TSLM outperforms existing state-of-the-art approaches from multiple data modalities by a significant margin.
ASNov 27, 2024
Wearable intelligent throat enables natural speech in stroke patients with dysarthriaChenyu Tang, Shuo Gao, Cong Li et al.
Wearable silent speech systems hold significant potential for restoring communication in patients with speech impairments. However, seamless, coherent speech remains elusive, and clinical efficacy is still unproven. Here, we present an AI-driven intelligent throat (IT) system that integrates throat muscle vibrations and carotid pulse signal sensors with large language model (LLM) processing to enable fluent, emotionally expressive communication. The system utilizes ultrasensitive textile strain sensors to capture high-quality signals from the neck area and supports token-level processing for real-time, continuous speech decoding, enabling seamless, delay-free communication. In tests with five stroke patients with dysarthria, IT's LLM agents intelligently corrected token errors and enriched sentence-level emotional and logical coherence, achieving low error rates (4.2% word error rate, 2.9% sentence error rate) and a 55% increase in user satisfaction. This work establishes a portable, intuitive communication platform for patients with dysarthria with the potential to be applied broadly across different neurological conditions and in multi-language support systems.
CVOct 11, 2024
Chain-of-Restoration: Multi-Task Image Restoration Models are Zero-Shot Step-by-Step Universal Image RestorersJin Cao, Deyu Meng, Xiangyong Cao
Despite previous image restoration (IR) methods have often concentrated on isolated degradations, recent research has increasingly focused on addressing composite degradations involving a complex combination of multiple isolated degradations. However, current IR methods for composite degradations require building training data that contain an exponential number of possible degradation combinations, which brings in a significant burden. To alleviate this issue, this paper proposes a new task setting, i.e. Universal Image Restoration (UIR). Specifically, UIR doesn't require training on all the degradation combinations but only on a set of degradation bases and then removing any degradation that these bases can potentially compose in a zero-shot manner. Inspired by the Chain-of-Thought that prompts large language models (LLMs) to address problems step-by-step, we propose Chain-of-Restoration (CoR) mechanism, which instructs models to remove unknown composite degradations step-by-step. By integrating a simple Degradation Discriminator into pre-trained multi-task models, CoR facilitates the process where models remove one degradation basis per step, continuing this process until the image is fully restored from the unknown composite degradation. Extensive experiments show that CoR can significantly improve model performance in removing composite degradations, achieving comparable or better results than those state-of-the-art (SoTA) methods trained on all degradations.
CEDec 16, 2025
Wearable-informed generative digital avatars predict task-conditioned post-stroke locomotionYanning Dai, Chenyu Tang, Ruizhi Zhang et al.
Dynamic prediction of locomotor capacity after stroke could enable more individualized rehabilitation, yet current assessments largely provide static impairment scores and do not indicate whether patients can perform specific tasks such as slope walking or stair climbing. Here, we present a wearable-informed data-physics hybrid generative framework that reconstructs a stroke survivor's locomotor control from wearable inertial sensing and predicts task-conditioned post-stroke locomotion in new environments. From a single 20 m level-ground walking trial recorded by five IMUs, the framework personalizes a physics-based digital avatar using a healthy-motion prior and hybrid imitation learning, generating dynamically feasible, patient-specific movements for inclined walking and stair negotiation. Across 11 stroke inpatients, predicted postures reached 82.2% similarity for slopes and 69.9% for stairs, substantially exceeding a physics-only baseline. In a multicentre pilot randomized study (n = 21; 28 days), access to scenario-specific locomotion predictions to support task selection and difficulty titration was associated with larger gains in Fugl-Meyer lower-extremity scores than standard care (mean change 6.0 vs 3.7 points; $p < 0.05$). These results suggest that wearable-informed generative digital avatars may augment individualized gait rehabilitation planning and provide a pathway toward dynamically personalized post-stroke motor recovery strategies.
CVOct 9, 2025
FOLK: Fast Open-Vocabulary 3D Instance Segmentation via Label-guided Knowledge DistillationHongrui Wu, Zhicheng Gao, Jin Cao et al.
Open-vocabulary 3D instance segmentation seeks to segment and classify instances beyond the annotated label space. Existing methods typically map 3D instances to 2D RGB-D images, and then employ vision-language models (VLMs) for classification. However, such a mapping strategy usually introduces noise from 2D occlusions and incurs substantial computational and memory costs during inference, slowing down the inference speed. To address the above problems, we propose a Fast Open-vocabulary 3D instance segmentation method via Label-guided Knowledge distillation (FOLK). Our core idea is to design a teacher model that extracts high-quality instance embeddings and distills its open-vocabulary knowledge into a 3D student model. In this way, during inference, the distilled 3D model can directly classify instances from the 3D point cloud, avoiding noise caused by occlusions and significantly accelerating the inference process. Specifically, we first design a teacher model to generate a 2D CLIP embedding for each 3D instance, incorporating both visibility and viewpoint diversity, which serves as the learning target for distillation. We then develop a 3D student model that directly produces a 3D embedding for each 3D instance. During training, we propose a label-guided distillation algorithm to distill open-vocabulary knowledge from label-consistent 2D embeddings into the student model. FOLK conducted experiments on the ScanNet200 and Replica datasets, achieving state-of-the-art performance on the ScanNet200 dataset with an AP50 score of 35.7, while running approximately 6.0x to 152.2x faster than previous methods. All codes will be released after the paper is accepted.
CVOct 2, 2025
UniVerse: Unleashing the Scene Prior of Video Diffusion Models for Robust Radiance Field ReconstructionJin Cao, Hongrui Wu, Ziyong Feng et al.
This paper tackles the challenge of robust reconstruction, i.e., the task of reconstructing a 3D scene from a set of inconsistent multi-view images. Some recent works have attempted to simultaneously remove image inconsistencies and perform reconstruction by integrating image degradation modeling into neural 3D scene representations. However, these methods rely heavily on dense observations for robustly optimizing model parameters. To address this issue, we propose to decouple robust reconstruction into two subtasks: restoration and reconstruction, which naturally simplifies the optimization process. To this end, we introduce UniVerse, a unified framework for robust reconstruction based on a video diffusion model. Specifically, UniVerse first converts inconsistent images into initial videos, then uses a specially designed video diffusion model to restore them into consistent images, and finally reconstructs the 3D scenes from these restored images. Compared with case-by-case per-view degradation modeling, the diffusion model learns a general scene prior from large-scale data, making it applicable to diverse image inconsistencies. Extensive experiments on both synthetic and real-world datasets demonstrate the strong generalization capability and superior performance of our method in robust reconstruction. Moreover, UniVerse can control the style of the reconstructed 3D scene. Project page: https://jin-cao-tma.github.io/UniVerse.github.io/
LGSep 27, 2025
DPFNAS: Differential Privacy-Enhanced Federated Neural Architecture Search for 6G Edge IntelligenceYang Lv, Jin Cao, Ben Niu et al.
The Sixth-Generation (6G) network envisions pervasive artificial intelligence (AI) as a core goal, enabled by edge intelligence through on-device data utilization. To realize this vision, federated learning (FL) has emerged as a key paradigm for collaborative training across edge devices. However, the sensitivity and heterogeneity of edge data pose key challenges to FL: parameter sharing risks data reconstruction, and a unified global model struggles to adapt to diverse local distributions. In this paper, we propose a novel federated learning framework that integrates personalized differential privacy (DP) and adaptive model design. To protect training data, we leverage sample-level representations for knowledge sharing and apply a personalized DP strategy to resist reconstruction attacks. To ensure distribution-aware adaptation under privacy constraints, we develop a privacy-aware neural architecture search (NAS) algorithm that generates locally customized architectures and hyperparameters. To the best of our knowledge, this is the first personalized DP solution tailored for representation-based FL with theoretical convergence guarantees. Our scheme achieves strong privacy guarantees for training data while significantly outperforming state-of-the-art methods in model performance. Experiments on benchmark datasets such as CIFAR-10 and CIFAR-100 demonstrate that our scheme improves accuracy by 6.82\% over the federated NAS method PerFedRLNAS, while reducing model size to 1/10 and communication cost to 1/20.
IVJun 22, 2025
CT Radiomics-Based Explainable Machine Learning Model for Accurate Differentiation of Malignant and Benign Endometrial Tumors: A Two-Center StudyTingrui Zhang, Honglin Wu, Zekun Jiang et al.
Aimed to develop and validate a CT radiomics-based explainable machine learning model for precise diagnosing malignancy and benignity specifically in endometrial cancer (EC) patients. A total of 83 EC patients from two centers, including 46 with malignant and 37 with benign conditions, were included, with data split into a training set (n=59) and a testing set (n=24). The regions of interest (ROIs) were manually segmented from pre-surgical CT scans, and 1132 radiomic features were extracted from the pre-surgical CT scans using Pyradiomics. Six explainable machine learning (ML) modeling algorithms were implemented respectively, for determining the optimal radiomics pipeline. The diagnostic performance of the radiomic model was evaluated by using sensitivity, specificity, accuracy, precision, F1 score, AUROC, and AUPRC. To enhance clinical understanding and usability, we separately implemented SHAP analysis and feature mapping visualization, and evaluated the calibration curve and decision curve. By comparing six modeling strategies, the Random Forest model emerged as the optimal choice for diagnosing EC, with a training AUROC of 1.00 and a testing AUROC of 0.96. SHAP identified the most important radiomic features, revealing that all selected features were significantly associated with EC (P < 0.05). Radiomics feature maps also provide a feasible assessment tool for clinical applications. Decision Curve Analysis (DCA) indicated a higher net benefit for our model compared to the "All" and "None" strategies, suggesting its clinical utility in identifying high-risk cases and reducing unnecessary interventions. In conclusion, the CT radiomics-based explainable ML model achieved high diagnostic performance, which could be used as an intelligent auxiliary tool for the diagnosis of endometrial cancer.
DCDec 29, 2024
Dynamic Adaptation in Data Storage: Real-Time Machine Learning for Enhanced PrefetchingChiyu Cheng, Chang Zhou, Yang Zhao et al.
The exponential growth of data storage demands has necessitated the evolution of hierarchical storage management strategies [1]. This study explores the application of streaming machine learning [3] to revolutionize data prefetching within multi-tiered storage systems. Unlike traditional batch-trained models, streaming machine learning [5] offers adaptability, real-time insights, and computational efficiency, responding dynamically to workload variations. This work designs and validates an innovative framework that integrates streaming classification models for predicting file access patterns, specifically the next file offset. Leveraging comprehensive feature engineering and real-time evaluation over extensive production traces, the proposed methodology achieves substantial improvements in prediction accuracy, memory efficiency, and system adaptability. The results underscore the potential of streaming models in real-time storage management, setting a precedent for advanced caching and tiering strategies.
DCDec 29, 2024
Optimizing SSD Caches for Cloud Block Storage Systems Using Machine Learning ApproachesChiyu Cheng, Chang Zhou, Yang Zhao et al.
The growing demand for efficient cloud storage solutions has led to the widespread adoption of Solid-State Drives (SSDs) for caching in cloud block storage systems. The management of data writes to SSD caches plays a crucial role in improving overall system performance, reducing latency, and extending the lifespan of storage devices. A critical challenge arises from the large volume of write-only data, which significantly impacts the performance of SSD caches when handled inefficiently. Specifically, writes that have not been read for a certain period may introduce unnecessary write traffic to the SSD cache without offering substantial benefits for cache performance. This paper proposes a novel approach to mitigate this issue by leveraging machine learning techniques to dynamically optimize the write policy in cloud-based storage systems. The proposed method identifies write-only data and selectively filters it out in real-time, thereby minimizing the number of unnecessary write operations and improving the overall performance of the cache system. Experimental results demonstrate that the proposed machine learning-based policy significantly outperforms traditional approaches by reducing the number of harmful writes and optimizing cache utilization. This solution is particularly suitable for cloud environments with varying and unpredictable workloads, where traditional cache management strategies often fall short.
CRJun 7, 2024
Advanced Payment Security System:XGBoost, LightGBM and SMOTE IntegratedQi Zheng, Chang Yu, Jin Cao et al.
With the rise of various online and mobile payment systems, transaction fraud has become a significant threat to financial security. This study explores the application of advanced machine learning models, specifically based on XGBoost and LightGBM, for developing a more accurate and robust Payment Security Protection Model. To enhance data reliability, we meticulously processed the data sources and applied SMOTE (Synthetic Minority Over-sampling Technique) to address class imbalance and improve data representation. By selecting highly correlated features, we aimed to strengthen the training process and boost model performance. We conducted thorough performance evaluations of our proposed models, comparing them against traditional methods including Random Forest, Neural Network, and Logistic Regression. Using metrics such as Precision, Recall, and F1 Score, we rigorously assessed their effectiveness. Our detailed analyses and comparisons reveal that the combination of SMOTE with XGBoost and LightGBM offers a highly efficient and powerful mechanism for payment security protection. Moreover, the integration of XGBoost and LightGBM in a Local Ensemble model further demonstrated outstanding performance. After incorporating SMOTE, the new combined model achieved a significant improvement of nearly 6\% over traditional models and around 5\% over its sub-models, showcasing remarkable results.
LGJun 6, 2024
Credit Card Fraud Detection Using Advanced Transformer ModelChang Yu, Yongshun Xu, Jin Cao et al.
With the proliferation of various online and mobile payment systems, credit card fraud has emerged as a significant threat to financial security. This study focuses on innovative applications of the latest Transformer models for more robust and precise fraud detection. To ensure the reliability of the data, we meticulously processed the data sources, balancing the dataset to address the issue of data sparsity significantly. We also selected highly correlated vectors to strengthen the training process.To guarantee the reliability and practicality of the new Transformer model, we conducted performance comparisons with several widely adopted models, including Support Vector Machine (SVM), Random Forest, Neural Network, and Logistic Regression. We rigorously compared these models using metrics such as Precision, Recall, and F1 Score. Through these detailed analyses and comparisons, we present to the readers a highly efficient and powerful anti-fraud mechanism with promising prospects. The results demonstrate that the Transformer model not only excels in traditional applications but also shows great potential in niche areas like fraud detection, offering a substantial advancement in the field.
IRJun 4, 2024
Predict Click-Through Rates with Deep Interest Network Model in E-commerce AdvertisingChang Zhou, Yang Zhao, Yuelin Zou et al.
This paper proposes new methods to enhance click-through rate (CTR) prediction models using the Deep Interest Network (DIN) model, specifically applied to the advertising system of Alibaba's Taobao platform. Unlike traditional deep learning approaches, this research focuses on localized user behavior activation for tailored ad targeting by leveraging extensive user behavior data. Compared to traditional models, this method demonstrates superior ability to handle diverse and dynamic user data, thereby improving the efficiency of ad systems and increasing revenue.
CVJun 1, 2024
Research on the Application of Computer Vision Based on Deep Learning in Autonomous Driving TechnologyJingyu Zhang, Jin Cao, Jinghao Chang et al.
This research aims to explore the application of deep learning in autonomous driving computer vision technology and its impact on improving system performance. By using advanced technologies such as convolutional neural networks (CNN), multi-task joint learning methods, and deep reinforcement learning, this article analyzes in detail the application of deep learning in image recognition, real-time target tracking and classification, environment perception and decision support, and path planning and navigation. Application process in key areas. Research results show that the proposed system has an accuracy of over 98% in image recognition, target tracking and classification, and also demonstrates efficient performance and practicality in environmental perception and decision support, path planning and navigation. The conclusion points out that deep learning technology can significantly improve the accuracy and real-time response capabilities of autonomous driving systems. Although there are still challenges in environmental perception and decision support, with the advancement of technology, it is expected to achieve wider applications and greater capabilities in the future. potential.
LGJun 22, 2021
Recurrent Neural Network from Adder's Perspective: Carry-lookahead RNNHaowei Jiang, Feiwei Qin, Jin Cao et al.
The recurrent network architecture is a widely used model in sequence modeling, but its serial dependency hinders the computation parallelization, which makes the operation inefficient. The same problem was encountered in serial adder at the early stage of digital electronics. In this paper, we discuss the similarities between recurrent neural network (RNN) and serial adder. Inspired by carry-lookahead adder, we introduce carry-lookahead module to RNN, which makes it possible for RNN to run in parallel. Then, we design the method of parallel RNN computation, and finally Carry-lookahead RNN (CL-RNN) is proposed. CL-RNN takes advantages in parallelism and flexible receptive field. Through a comprehensive set of tests, we verify that CL-RNN can perform better than existing typical RNNs in sequence modeling tasks which are specially designed for RNNs.
CLJan 20, 2021
Zero-shot Generalization in Dialog State Tracking through Generative Question AnsweringShuyang Li, Jin Cao, Mukund Sridhar et al.
Dialog State Tracking (DST), an integral part of modern dialog systems, aims to track user preferences and constraints (slots) in task-oriented dialogs. In real-world settings with constantly changing services, DST systems must generalize to new domains and unseen slot types. Existing methods for DST do not generalize well to new slot names and many require known ontologies of slot types and values for inference. We introduce a novel ontology-free framework that supports natural language queries for unseen constraints and slots in multi-domain task-oriented dialogs. Our approach is based on generative question-answering using a conditional language model pre-trained on substantive English sentences. Our model improves joint goal accuracy in zero-shot domain adaptation settings by up to 9% (absolute) over the previous state-of-the-art on the MultiWOZ 2.1 dataset.
CLNov 11, 2020
Towards Semi-Supervised Semantics Understanding from SpeechCheng-I Lai, Jin Cao, Sravan Bodapati et al.
Much recent work on Spoken Language Understanding (SLU) falls short in at least one of three ways: models were trained on oracle text input and neglected the Automatics Speech Recognition (ASR) outputs, models were trained to predict only intents without the slot values, or models were trained on a large amount of in-house data. We proposed a clean and general framework to learn semantics directly from speech with semi-supervision from transcribed speech to address these. Our framework is built upon pretrained end-to-end (E2E) ASR and self-supervised language models, such as BERT, and fine-tuned on a limited amount of target SLU corpus. In parallel, we identified two inadequate settings under which SLU models have been tested: noise-robustness and E2E semantics evaluation. We tested the proposed framework under realistic environmental noises and with a new metric, the slots edit F1 score, on two public SLU corpora. Experiments show that our SLU framework with speech as input can perform on par with those with oracle text as input in semantics understanding, while environmental noises are present, and a limited amount of labeled semantics data is available.
LGOct 30, 2020
Semantic Labeling Using a Deep Contextualized Language ModelMohamed Trabelsi, Jin Cao, Jeff Heflin
Generating schema labels automatically for column values of data tables has many data science applications such as schema matching, and data discovery and linking. For example, automatically extracted tables with missing headers can be filled by the predicted schema labels which significantly minimizes human effort. Furthermore, the predicted labels can reduce the impact of inconsistent names across multiple data tables. Understanding the connection between column values and contextual information is an important yet neglected aspect as previously proposed methods treat each column independently. In this paper, we propose a context-aware semantic labeling method using both the column values and context. Our new method is based on a new setting for semantic labeling, where we sequentially predict labels for an input table with missing headers. We incorporate both the values and context of each data column using the pre-trained contextualized language model, BERT, that has achieved significant improvements in multiple natural language processing tasks. To our knowledge, we are the first to successfully apply BERT to solve the semantic labeling task. We evaluate our approach using two real-world datasets from different domains, and we demonstrate substantial improvements in terms of evaluation metrics over state-of-the-art feature-based methods.
CLOct 9, 2020
Style Attuned Pre-training and Parameter Efficient Fine-tuning for Spoken Language UnderstandingJin Cao, Jun Wang, Wael Hamza et al.
Neural models have yielded state-of-the-art results in deciphering spoken language understanding (SLU) problems; however, these models require a significant amount of domain-specific labeled examples for training, which is prohibitively expensive. While pre-trained language models like BERT have been shown to capture a massive amount of knowledge by learning from unlabeled corpora and solve SLU using fewer labeled examples for adaption, the encoding of knowledge is implicit and agnostic to downstream tasks. Such encoding results in model inefficiencies in parameter usage: an entirely new model is required for every domain. To address these challenges, we introduce a novel SLU framework, comprising a conversational language modeling (CLM) pre-training task and a light encoder architecture. The CLM pre-training enables networks to capture the representation of the language in conversation style with the presence of ASR errors. The light encoder architecture separates the shared pre-trained networks from the mappings of generally encoded knowledge to specific domains of SLU, allowing for the domain adaptation to be performed solely at the light encoder and thus increasing efficiency. With the framework, we match the performance of state-of-the-art SLU results on Alexa internal datasets and on two public ones (ATIS, SNIPS), adding only 4.4% parameters per task.
DSSep 7, 2020
A Lightweight Algorithm to Uncover Deep Relationships in Data TablesJin Cao, Yibo Zhao, Linjun Zhang et al.
Many data we collect today are in tabular form, with rows as records and columns as attributes associated with each record. Understanding the structural relationship in tabular data can greatly facilitate the data science process. Traditionally, much of this relational information is stored in table schema and maintained by its creators, usually domain experts. In this paper, we develop automated methods to uncover deep relationships in a single data table without expert or domain knowledge. Our method can decompose a data table into layers of smaller tables, revealing its deep structure. The key to our approach is a computationally lightweight forward addition algorithm that we developed to recursively extract the functional dependencies between table columns that are scalable to tables with many columns. With our solution, data scientists will be provided with automatically generated, data-driven insights when exploring new data sets.
DSSep 7, 2020
A Fast Randomized Algorithm for Finding the Maximal Common SubsequencesJin Cao, Dewei Zhong
Finding the common subsequences of $L$ multiple strings has many applications in the area of bioinformatics, computational linguistics, and information retrieval. A well-known result states that finding a Longest Common Subsequence (LCS) for $L$ strings is NP-hard, e.g., the computational complexity is exponential in $L$. In this paper, we develop a randomized algorithm, referred to as {\em Random-MCS}, for finding a random instance of Maximal Common Subsequence ($MCS$) of multiple strings. A common subsequence is {\em maximal} if inserting any character into the subsequence no longer yields a common subsequence. A special case of MCS is LCS where the length is the longest. We show the complexity of our algorithm is linear in $L$, and therefore is suitable for large $L$. Furthermore, we study the occurrence probability for a single instance of MCS and demonstrate via both theoretical and experimental studies that the longest subsequence from multiple runs of {\em Random-MCS} often yields a solution to $LCS$.
AISep 12, 2019
Augmented Data Science: Towards Industrialization and Democratization of Data ScienceHuseyin Uzunalioglu, Jin Cao, Chitra Phadke et al.
Conversion of raw data into insights and knowledge requires substantial amounts of effort from data scientists. Despite breathtaking advances in Machine Learning (ML) and Artificial Intelligence (AI), data scientists still spend the majority of their effort in understanding and then preparing the raw data for ML/AI. The effort is often manual and ad hoc, and requires some level of domain knowledge. The complexity of the effort increases dramatically when data diversity, both in form and context, increases. In this paper, we introduce our solution, Augmented Data Science (ADS), towards addressing this "human bottleneck" in creating value from diverse datasets. ADS is a data-driven approach and relies on statistics and ML to extract insights from any data set in a domain-agnostic way to facilitate the data science process. Key features of ADS are the replacement of rudimentary data exploration and processing steps with automation and the augmentation of data scientist judgment with automatically-generated insights. We present building blocks of our end-to-end solution and provide a case study to exemplify its capabilities.
CLDec 11, 2017
Learning Robust Dialog Policies in Noisy EnvironmentsMaryam Fazel-Zarandi, Shang-Wen Li, Jin Cao et al.
Modern virtual personal assistants provide a convenient interface for completing daily tasks via voice commands. An important consideration for these assistants is the ability to recover from automatic speech recognition (ASR) and natural language understanding (NLU) errors. In this paper, we focus on learning robust dialog policies to recover from these errors. To this end, we develop a user simulator which interacts with the assistant through voice commands in realistic scenarios with noisy audio, and use it to learn dialog policies through deep reinforcement learning. We show that dialogs generated by our simulator are indistinguishable from human generated dialogs, as determined by human evaluators. Furthermore, preliminary experimental results show that the learned policies in noisy environments achieve the same execution success rate with fewer dialog turns compared to fixed rule-based policies.