AI · Jul 31, 2024
The Llama 3 Herd of Models
Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri et al. · allen-ai, berkeley
Modern artificial intelligence (AI) systems are powered by foundation models. This paper presents a new set of foundation models, called Llama 3. It is a herd of language models that natively support multilinguality, coding, reasoning, and tool usage. Our largest model is a dense Transformer with 405B parameters and a context window of up to 128K tokens. This paper presents an extensive empirical evaluation of Llama 3. We find that Llama 3 delivers comparable quality to leading language models such as GPT-4 on a plethora of tasks. We publicly release Llama 3, including pre-trained and post-trained versions of the 405B parameter language model and our Llama Guard 3 model for input and output safety. The paper also presents the results of experiments in which we integrate image, video, and speech capabilities into Llama 3 via a compositional approach. We observe this approach performs competitively with the state-of-the-art on image, video, and speech recognition tasks. The resulting models are not yet being broadly released as they are still under development.
SP · Jul 10, 2023
Towards Enabling Cardiac Digital Twins of Myocardial Infarction Using Deep Computational Models for Inverse Inference
Lei Li, Julia Camps, Zhinuo et al.
Cardiac digital twins (CDTs) have the potential to offer individualized evaluation of cardiac function in a non-invasive manner, making them a promising approach for personalized diagnosis and treatment planning of myocardial infarction (MI). The inference of accurate myocardial tissue properties is crucial in creating a reliable CDT of MI. In this work, we investigate the feasibility of inferring myocardial tissue properties from the electrocardiogram (ECG) within a CDT platform. The platform integrates multi-modal data, such as cardiac MRI and ECG, to enhance the accuracy and reliability of the inferred tissue properties. We perform a sensitivity analysis based on computer simulations, systematically exploring the effects of infarct location, size, degree of transmurality, and electrical activity alteration on the simulated QRS complex of ECG, to establish the limits of the approach. We subsequently present a novel deep computational model, comprising a dual-branch variational autoencoder and an inference model, to infer infarct location and distribution from the simulated QRS. The proposed model achieves mean Dice scores of 0.457 ± 0.317 and 0.302 ± 0.273 for the inference of left ventricle scars and border zone, respectively. The sensitivity analysis enhances our understanding of the complex relationship between infarct characteristics and electrophysiological features. The in silico experimental results show that the model can effectively capture the relationship for the inverse inference, with promising potential for clinical application in the future. The code will be released publicly once the manuscript is accepted for publication.
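For readers unfamiliar with the Dice score reported above, the following is a minimal sketch of how it is commonly computed for two binary masks. It is not the authors' evaluation code: the function name `dice_score`, the flattened-list mask format, and the convention that two empty masks score 1.0 are all assumptions made for illustration.

```python
def dice_score(pred, target):
    """Dice coefficient between two equal-length binary masks (flattened lists)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    # Count elements that are positive in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # Sum of the positive counts of each mask.
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Assumed convention: two empty masks count as a perfect match.
    if total == 0:
        return 1.0
    return 2.0 * intersection / total


# Hypothetical masks: 1 marks scar tissue, 0 marks healthy tissue.
pred = [1, 1, 0, 0, 1]
target = [1, 0, 0, 1, 1]
print(dice_score(pred, target))
```

A score of 1 means the predicted and reference regions coincide exactly and 0 means they do not overlap at all, which puts the reported means of 0.457 and 0.302 in context.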
LG · Jan 21, 2023
The Best of Both Worlds: Accurate Global and Personalized Models through Federated Learning with Data-Free Hyper-Knowledge Distillation
Huancheng Chen, Johnny, Wang et al.
Heterogeneity of data distributed across clients limits the performance of global models trained through federated learning, especially in settings with highly imbalanced class distributions of local datasets. In recent years, personalized federated learning (pFL) has emerged as a potential solution to the challenges presented by heterogeneous data. However, existing pFL methods typically enhance performance of local models at the expense of the global model's accuracy. We propose FedHKD (Federated Hyper-Knowledge Distillation), a novel FL algorithm in which clients rely on knowledge distillation (KD) to train local models. In particular, each client extracts and sends to the server the means of local data representations and the corresponding soft predictions, information that we refer to as "hyper-knowledge". The server aggregates this information and broadcasts it to the clients in support of local training. Notably, unlike other KD-based pFL methods, FedHKD does not rely on a public dataset, nor does it deploy a generative model at the server. We analyze convergence of FedHKD and conduct extensive experiments on visual datasets in a variety of scenarios, demonstrating that FedHKD provides significant improvement in both personalized and global model performance compared to state-of-the-art FL methods designed for heterogeneous data settings.
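The exchange of per-class summaries described in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the FedHKD implementation: the function names, the plain-list data layout, and in particular the sample-count-weighted averaging at the server are hypothetical choices, since the abstract only says that the server "aggregates" the information.

```python
from collections import defaultdict


def mean_vectors(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def client_hyper_knowledge(features, soft_preds, labels):
    """Per-class means of representations and soft predictions on one client.

    Returns {label: (mean_representation, mean_soft_prediction, sample_count)}.
    """
    by_class = defaultdict(lambda: ([], []))
    for f, q, y in zip(features, soft_preds, labels):
        by_class[y][0].append(f)
        by_class[y][1].append(q)
    return {y: (mean_vectors(fs), mean_vectors(qs), len(fs))
            for y, (fs, qs) in by_class.items()}


def server_aggregate(client_summaries):
    """Count-weighted average of per-class summaries (assumed aggregation rule)."""
    sums = {}
    for summary in client_summaries:
        for y, (h, q, n) in summary.items():
            sh, sq, sn = sums.get(y, ([0.0] * len(h), [0.0] * len(q), 0))
            sums[y] = ([a + n * b for a, b in zip(sh, h)],
                       [a + n * b for a, b in zip(sq, q)],
                       sn + n)
    return {y: ([a / n for a in sh], [a / n for a in sq])
            for y, (sh, sq, n) in sums.items()}


# Two hypothetical clients holding samples of class 0 only.
client_a = client_hyper_knowledge(
    features=[[1.0, 0.0], [3.0, 0.0]],
    soft_preds=[[0.8, 0.2], [0.6, 0.4]],
    labels=[0, 0],
)
client_b = client_hyper_knowledge(
    features=[[5.0, 3.0]],
    soft_preds=[[1.0, 0.0]],
    labels=[0],
)
global_knowledge = server_aggregate([client_a, client_b])
print(global_knowledge[0])
```

Only class-level means leave each client in this sketch, which is consistent with the abstract's point that no public dataset or server-side generative model is required.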
LG · Oct 28, 2023
Episodic Multi-Task Learning with Heterogeneous Neural Processes
Jiayi Shen, Xiantong Zhen, Qi et al. · tsinghua
This paper focuses on the data-insufficiency problem in multi-task learning within an episodic training setup. Specifically, we explore the potential of heterogeneous information across tasks and meta-knowledge among episodes to effectively tackle each task with limited data. Existing meta-learning methods often fail to take advantage of crucial heterogeneous information in a single episode, while multi-task learning models neglect reusing experience from earlier episodes. To address the problem of insufficient data, we develop Heterogeneous Neural Processes (HNPs) for the episodic multi-task setup. Within the framework of hierarchical Bayes, HNPs effectively capitalize on prior experiences as meta-knowledge and capture task-relatedness among heterogeneous tasks, mitigating data-insufficiency. Meanwhile, transformer-structured inference modules are designed to enable efficient inferences toward meta-knowledge and task-relatedness. In this way, HNPs can learn more powerful functional priors for adapting to novel heterogeneous tasks in each meta-test episode. Experimental results show the superior performance of the proposed HNPs over typical baselines, and ablation studies verify the effectiveness of the designed inference modules.
CV · Apr 4, 2023
Influence of Myocardial Infarction on QRS Properties: A Simulation Study
Lei Li, Julia Camps, Zhinuo et al.
The interplay between structural and electrical changes in the heart after myocardial infarction (MI) plays a key role in the initiation and maintenance of arrhythmia. The anatomical and electrophysiological properties of scar, border zone, and normal myocardium modify the electrocardiographic morphology, which is routinely analysed in clinical settings. However, the influence of various MI properties on the QRS is not intuitively predictable. In this work, we have systematically investigated the effects of 17 post-MI scenarios, varying the location, size, transmural extent, and conductive level of scarring and border zone area, on the forward-calculated QRS. Additionally, we have compared the contributions of different QRS score criteria for quantifying post-MI pathophysiology. The propagation of electrical activity in the ventricles is simulated via an Eikonal model on a unified coordinate system. The analysis has been performed on 49 subjects, and the results imply that the QRS is capable of identifying MI, suggesting the feasibility of inversely reconstructing infarct regions from QRS. The sensitivity of different QRS criteria varies across the 17 MI scenarios, which is informative for solving the inverse problem.
CL · Apr 28
Can LLM Agents Simulate Multi-Turn Human Behavior? Evidence from Real Online Customer Behavior Data
Yuxuan Lu, Jing Huang, Yan Han et al.
Recent research shows that LLM Agents can generate "believable" human behaviors via prompt-only methods, and such agents have been increasingly adopted in downstream applications. However, existing evaluation of these agents only focuses on qualitative believability (whether human raters think they are accurate), leaving open questions of whether LLM agents can accurately generate step-by-step actions mimicking a particular human's behavior in a multi-turn interaction task. In this work, we take shopping as a case study and present the first large-scale quantitative evaluation of state-of-the-art LLMs' ability to accurately simulate human behavior. Using real-world data from 31,865 online shopping sessions containing 230,965 user actions, our evaluation reveals that prompt-based LLMs (DeepSeek-R1, Llama, Claude) achieve only 11.86% accuracy in generating human actions, highlighting a substantial gap in actual behavioral accuracy. Through experiments, we also showcase that strategies as simple as fine-tuning LLMs on real human click-through data augmented with synthesized reasoning traces can greatly enhance models' performance. The fine-tuned Qwen2.5-7B achieves 17.26% action generation accuracy and 33.86% F1 score on final purchase prediction, representing substantial improvements of 5.4% and 13.85% over prompt-only baselines. This work establishes the first rigorous benchmark for human behavior simulation and provides actionable insights for developing more accurate LLM agents for future downstream applications.
LG · Dec 20, 2022
Modeling Human Eye Movements with Neural Networks in a Maze-Solving Task
Jason Li, Nicholas Watters, Yingting et al.
From smoothly pursuing moving objects to rapidly shifting gazes during visual search, humans employ a wide variety of eye movement strategies in different contexts. While eye movements provide a rich window into mental processes, building generative models of eye movements is notoriously difficult, and to date the computational objectives guiding eye movements remain largely a mystery. In this work, we tackled these problems in the context of a canonical spatial planning task, maze-solving. We collected eye movement data from human subjects and built deep generative models of eye movements using a novel differentiable architecture for gaze fixations and gaze shifts. We found that human eye movements are best predicted by a model that is optimized not to perform the task as efficiently as possible but instead to run an internal simulation of an object traversing the maze. This not only provides a generative model of eye movements in this task but also suggests a computational theory for how humans solve the task, namely that humans use mental simulation.
IV · May 9, 2022
Masked Co-attentional Transformer reconstructs 100x ultra-fast/low-dose whole-body PET from longitudinal images and anatomically guided MRI
Yan-Ran, Wang, Liangqiong Qu et al.
Despite its tremendous value for the diagnosis, treatment monitoring and surveillance of children with cancer, whole body staging with positron emission tomography (PET) is time-consuming and associated with considerable radiation exposure. 100x (1% of the standard clinical dosage) ultra-low-dose/ultra-fast whole-body PET reconstruction has the potential for cancer imaging with unprecedented speed and improved safety, but it cannot be achieved by the naive use of machine learning techniques. In this study, we utilize the global similarity between baseline and follow-up PET and magnetic resonance (MR) images to develop Masked-LMCTrans, a longitudinal multi-modality co-attentional CNN-Transformer that provides interaction and joint reasoning between serial PET/MRs of the same patient. We mask the tumor area in the referenced baseline PET and reconstruct the follow-up PET scans. In this manner, Masked-LMCTrans reconstructs 100x almost-zero radio-exposure whole-body PET that was not possible before. The technique also opens a new pathway for longitudinal radiology imaging reconstruction, a significantly under-explored area to date. Our model was trained and tested with Stanford PET/MRI scans of pediatric lymphoma patients and evaluated externally on PET/MRI images from Tübingen University. The high image quality of the reconstructed 100x whole-body PET images resulting from the application of Masked-LMCTrans will substantially advance the development of safer imaging approaches and shorter exam durations for pediatric patients, as well as expand the possibilities for frequent longitudinal monitoring of these patients by PET.
PM · Jul 4, 2023 · Code
MOPO-LSI: A User Guide
Yong Zheng, Kumar Neelotpal Shukla, Jasmine Xu et al.
MOPO-LSI is an open-source Multi-Objective Portfolio Optimization Library for Sustainable Investments. This document provides a user guide for MOPO-LSI version 1.0, including problem setup, workflow and the hyper-parameters in configurations.
RO · May 24
HumanEgo: Zero-Shot Robot Learning from Minutes of Human Egocentric Videos
Zhi, Wang, Botao He et al.
Human egocentric video captures rich manipulation demonstrations without any robot hardware, yet transferring these skills to robots remains challenging due to the embodiment gap between human and robot in both visual appearance and kinematics. We present HumanEgo, a framework that bridges the embodiment gap by lifting each human demonstration to an entity-level representation of hand-object interaction, and training a flow matching policy with dense auxiliary objectives that amplify supervision from every trajectory. HumanEgo is robot-data-free, hardware-agnostic, data-efficient, and zero-shot human-to-robot transferable. With only 30 minutes of human videos per task, HumanEgo achieves 92.5% average success across four real-world tasks (75% with just 15 minutes), outperforms matched-time robot teleoperation by 41%, and robustly transfers zero-shot across novel robots, cameras, and environments.
CV · Feb 16 · Code
Architectural Insights for Post-Tornado Damage Recognition
Robinson Umeike, Thang Dao, Shane Crawford et al.
Rapid and accurate building damage assessment in the immediate aftermath of tornadoes is critical for coordinating life-saving search and rescue operations, optimizing emergency resource allocation, and accelerating community recovery. However, current automated methods struggle with the unique visual complexity of tornado-induced wreckage, primarily due to severe domain shift from standard pre-training datasets and extreme class imbalance in real-world disaster data. To address these challenges, we introduce a systematic experimental framework evaluating 79 open-source deep learning models, encompassing both Convolutional Neural Networks (CNNs) and Vision Transformers, across over 2,300 controlled experiments on our newly curated Quad-State Tornado Damage (QSTD) benchmark dataset. Our findings reveal that achieving operational-grade performance hinges on a complex interaction between architecture and optimization, rather than architectural selection alone. Most strikingly, we demonstrate that optimizer choice can be more consequential than architecture: switching from Adam to SGD provided dramatic F1 gains of +25 to +38 points for Vision Transformer and Swin Transformer families, fundamentally reversing their ranking from bottom-tier to competitive with top-performing CNNs. Furthermore, a low learning rate of $1 \times 10^{-4}$ proved universally critical, boosting average F1 performance by +10.2 points across all architectures. Our champion model, ConvNeXt-Base trained with these optimized settings, demonstrated strong cross-event generalization on the held-out Tuscaloosa-Moore Tornado Damage (TMTD) dataset, achieving 46.4% Macro F1 (+34.6 points over baseline) and retaining 85.5% Ordinal Top-1 Accuracy despite temporal and sensor domain shifts.
LG · Apr 9
$p1$: Better Prompt Optimization with Fewer Prompts
Zhaolin Gao, Yu, Wang et al.
Prompt optimization improves language models without updating their weights by searching for a better system prompt, but its effectiveness varies widely across tasks. We study what makes a task amenable to prompt optimization. We show that the reward variance across different system prompts can be decomposed into two components: variance among responses, which captures generation stochasticity, and variance among system prompts, which captures differences in system prompt quality. Prompt optimization succeeds when variance among system prompts is sufficiently large, but fails when variance among responses dominates the variance of the system prompts. Surprisingly, we further show that scaling to more user prompts can hurt optimization by reducing variance among system prompts, especially on heterogeneous datasets where different user prompts favor different system prompts. Motivated by this insight, we propose $p1$, a simple user prompt filtering method that selects a small subset of user prompts with high variance across candidate system prompts. This subset of user prompts allows one to distinguish a good system prompt from a bad one, making system optimization easier. Experiments on reasoning benchmarks show that $p1$ substantially improves prompt optimization over training on the full dataset and outperforms strong baselines such as GEPA. Notably, training on only two prompts from AIME 24 yields a system prompt that generalizes well to other reasoning benchmarks.
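The two-component decomposition mentioned in the abstract matches the shape of the standard law of total variance, which can be checked numerically. The sketch below is illustrative only and is not the paper's estimator: the function name `decompose_reward_variance`, the `rewards[i][j]` layout (system prompt i, sampled response j), the use of population variances, and the requirement of an equal number of responses per system prompt are all assumptions.

```python
from statistics import fmean, pvariance


def decompose_reward_variance(rewards):
    """Split reward variance into within-prompt and between-prompt parts.

    rewards[i][j] is the reward of the j-th sampled response under the
    i-th system prompt; every row is assumed to have the same length.
    """
    prompt_means = [fmean(row) for row in rewards]
    # Variance among responses: average spread of rewards within a system prompt.
    among_responses = fmean([pvariance(row) for row in rewards])
    # Variance among system prompts: spread of the per-prompt mean rewards.
    among_prompts = pvariance(prompt_means)
    return among_responses, among_prompts


# Hypothetical binary rewards for two system prompts, four responses each.
rewards = [
    [0.0, 1.0, 1.0, 0.0],
    [1.0, 1.0, 0.0, 1.0],
]
among_responses, among_prompts = decompose_reward_variance(rewards)
total = pvariance([r for row in rewards for r in row])
print(among_responses, among_prompts, total)
```

In this toy case the response-level term is much larger than the prompt-level term, which is the regime the abstract describes as hard for prompt optimization.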
CL · Jul 7, 2025
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities
Gheorghe Comanici, Eric Bieber, Mike Schaekermann et al. · amazon-science, baidu
In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal understanding and it is now able to process up to 3 hours of video content. Its unique combination of long context, multimodal and reasoning capabilities can be combined to unlock new agentic workflows. Gemini 2.5 Flash provides excellent reasoning abilities at a fraction of the compute and latency requirements and Gemini 2.0 Flash and Flash-Lite provide high performance at low latency and cost. Taken together, the Gemini 2.X model generation spans the full Pareto frontier of model capability vs cost, allowing users to explore the boundaries of what is possible with complex agentic problem solving.
CV · Aug 29, 2025 · Code
Generative AI for Industrial Contour Detection: A Language-Guided Vision System
Liang Gong, Tommy, Wang et al.
Industrial computer vision systems often struggle with noise, material variability, and uncontrolled imaging conditions, limiting the effectiveness of classical edge detectors and handcrafted pipelines. In this work, we present a language-guided generative vision system for remnant contour detection in manufacturing, designed to achieve CAD-level precision. The system is organized into three stages: data acquisition and preprocessing, contour generation using a conditional GAN, and multimodal contour refinement through vision-language modeling, where standardized prompts are crafted in a human-in-the-loop process and applied through image-text guided synthesis. On proprietary FabTrack datasets, the proposed system improved contour fidelity, enhancing edge continuity and geometric alignment while reducing manual tracing. For the refinement stage, we benchmarked several vision-language models, including Google's Gemini 2.0 Flash, OpenAI's GPT-image-1 integrated within a VLM-guided workflow, and open-source baselines. Under standardized conditions, GPT-image-1 consistently outperformed Gemini 2.0 Flash in both structural accuracy and perceptual quality. These findings demonstrate the promise of VLM-guided generative workflows for advancing industrial computer vision beyond the limitations of classical pipelines.
CV · Sep 10, 2021 · Code
LibFewShot: A Comprehensive Library for Few-shot Learning
Wenbin Li, Ziyi, Wang et al.
Few-shot learning, especially few-shot image classification, has received increasing attention and witnessed significant advances in recent years. Some recent studies implicitly show that many generic techniques or "tricks", such as data augmentation, pre-training, knowledge distillation, and self-supervision, may greatly boost the performance of a few-shot learning method. Moreover, different works may employ different software platforms, backbone architectures and input image sizes, making fair comparisons difficult and leaving practitioners struggling with reproducibility. To address these situations, we propose a comprehensive library for few-shot learning (LibFewShot) by re-implementing eighteen state-of-the-art few-shot learning methods in a unified framework with the same single codebase in PyTorch. Furthermore, based on LibFewShot, we provide comprehensive evaluations on multiple benchmarks with various backbone architectures to evaluate common pitfalls and effects of different training tricks. In addition, with respect to the recent doubts on the necessity of meta- or episodic-training mechanism, our evaluation results confirm that such a mechanism is still necessary especially when combined with pre-training. We hope our work can not only lower the barriers for beginners to enter the area of few-shot learning but also elucidate the effects of nontrivial tricks to facilitate intrinsic research on few-shot learning. The source code is available from https://github.com/RL-VIG/LibFewShot.
LG · Mar 2
Personalized Multi-Agent Average Reward TD-Learning via Joint Linear Approximation
Leo, Wang, Pengkun Yang et al.
We study personalized multi-agent average reward TD learning, in which a collection of agents interacts with different environments and jointly learns their respective value functions. We focus on the setting where there exists a shared linear representation, and the agents' optimal weights collectively lie in an unknown linear subspace. Inspired by the recent success of personalized federated learning (PFL), we study the convergence of cooperative single-timescale TD learning in which agents iteratively estimate the common subspace and local heads. We show that this decomposition can filter out conflicting signals, effectively mitigating the negative impacts of "misaligned" signals, and achieving linear speedup. The main technical challenges lie in the heterogeneity, the Markovian sampling, and their intricate interplay in shaping error evolutions. Specifically, not only are the error dynamics of multiple variables closely interconnected, but there is also no direct contraction for the principal angle distance between the optimal subspace and the estimated subspace. We hope our analytical techniques can be useful to inspire research on deeper exploration into leveraging common structures. Experiments are provided to show the benefits of learning via a shared structure to the more general control problem.
AI · Sep 8, 2025
An AI system to help scientists write expert-level empirical software
Eser Aygün, Anastasiya Belyaeva, Gheorghe Comanici et al.
The cycle of scientific discovery is frequently bottlenecked by the slow, manual creation of software to support computational experiments. To address this, we present an AI system that creates expert-level scientific software whose goal is to maximize a quality metric. The system uses a Large Language Model (LLM) and Tree Search (TS) to systematically improve the quality metric and intelligently navigate the large space of possible solutions. The system achieves expert-level results when it explores and integrates complex research ideas from external sources. The effectiveness of tree search is demonstrated across a wide range of benchmarks. In bioinformatics, it discovered 40 novel methods for single-cell data analysis that outperformed the top human-developed methods on a public leaderboard. In epidemiology, it generated 14 models that outperformed the CDC ensemble and all other individual models for forecasting COVID-19 hospitalizations. Our method also produced state-of-the-art software for geospatial analysis, neural activity prediction in zebrafish, time series forecasting and numerical solution of integrals. By devising and implementing novel solutions to diverse tasks, the system represents a significant step towards accelerating scientific progress.
CL · Mar 26, 2025
Can LLM Agents Simulate Multi-Turn Human Behavior? Evidence from Real Online Customer Behavior Data
Yuxuan Lu, Jing Huang, Yan Han et al.
Recent research shows that LLM Agents can generate "believable" human behaviors via prompt-only methods, and such agents have been increasingly adopted in downstream applications. However, existing evaluation of these agents only focuses on qualitative believability (whether human raters think they are accurate), leaving open questions of whether LLM agents can accurately generate step-by-step actions mimicking a particular human's behavior in a multi-turn interaction task. In this work, we take shopping as a case study and present the first large-scale quantitative evaluation of state-of-the-art LLMs' ability to accurately simulate human behavior. Using real-world data from 31,865 online shopping sessions containing 230,965 user actions, our evaluation reveals that prompt-based LLMs (DeepSeek-R1, Llama, Claude) achieve only 11.86% accuracy in generating human actions, highlighting a substantial gap in actual behavioral accuracy. Through experiments, we also showcase that strategies as simple as fine-tuning LLMs on real human click-through data augmented with synthesized reasoning traces can greatly enhance models' performance. The fine-tuned Qwen2.5-7B achieves 17.26% action generation accuracy and 33.86% F1 score on final purchase prediction, representing substantial improvements of 5.4% and 13.85% over prompt-only baselines. This work establishes the first rigorous benchmark for human behavior simulation and provides actionable insights for developing more accurate LLM agents for future downstream applications.
CL · Jun 2, 2025
Growing Through Experience: Scaling Episodic Grounding in Language Models
Chunhui Zhang, Sirui, Wang et al.
Language models (LMs) require robust episodic grounding (the capacity to learn from and apply past experiences) to excel at physical planning tasks. Current episodic grounding approaches struggle with scalability and integration, limiting their effectiveness, especially for medium-sized LMs (7B parameters). While larger LMs (70-405B parameters) possess superior hierarchical representations and extensive pre-trained knowledge, they encounter a fundamental scale paradox: despite their advanced abstraction capabilities, they lack efficient mechanisms to leverage experience streams. We propose a scalable weak-to-strong episodic learning framework that effectively transfers episodic behaviors from smaller to larger LMs. This framework integrates Monte Carlo tree search for structured experience collection with a novel distillation method, preserving the inherent LM capabilities while embedding episodic memory. Experiments demonstrate our method surpasses state-of-the-art proprietary LMs by 3.45% across diverse planning and question-answering tasks. Layer-wise probing further indicates significant improvements in task alignment, especially within deeper LM layers, highlighting stable generalization even for previously unseen scenarios with increased planning complexity, conditions where baseline methods degrade markedly.
LG · Feb 18
On the Power of Source Screening for Learning Shared Feature Extractors
Leo, Wang, Connor Mclaughlin et al.
Learning with shared representation is widely recognized as an effective way to separate commonalities from heterogeneity across various heterogeneous sources. Most existing work includes all related data sources via simultaneously training a common feature extractor and source-specific heads. It is well understood that data sources with low relevance or poor quality may hinder representation learning. In this paper, we further dive into the question of which data sources should be learned jointly by focusing on the traditionally deemed "good" collection of sources, in which individual sources have similar relevance and qualities with respect to the true underlying common structure. Towards tractability, we focus on the linear setting where sources share a low-dimensional subspace. We find that source screening can play a central role in statistically optimal subspace estimation. We show that, for a broad class of problem instances, training on a carefully selected subset of sources suffices to achieve minimax optimality, even when a substantial portion of data is discarded. We formalize the notion of an informative subpopulation, develop algorithms and practical heuristics for identifying such subsets, and validate their effectiveness through both theoretical analysis and empirical evaluations on synthetic and real-world datasets.
LG · May 27, 2025
TuneComp: Joint Fine-tuning and Compression for Large Foundation Models
Xiangyu Chen, Jing Liu, Ye Wang et al.
To reduce model size during post-training, compression methods, including knowledge distillation, low-rank approximation, and pruning, are often applied after fine-tuning the model. However, sequential fine-tuning and compression sacrifices performance, while creating a larger than necessary model as an intermediate step. In this work, we aim to reduce this gap, by directly constructing a smaller model while guided by the downstream task. We propose to jointly fine-tune and compress the model by gradually distilling it to a pruned low-rank structure. Experiments demonstrate that joint fine-tuning and compression significantly outperforms other sequential compression methods.
LG · May 23, 2025
LatentLLM: Attention-Aware Joint Tensor Compression
Toshiaki Koike-Akino, Xiangyu Chen, Jing Liu et al.
Modern foundation models such as large language models (LLMs) and large multi-modal models (LMMs) require a massive amount of computational and memory resources. We propose a new framework to convert such LLMs/LMMs into a reduced-dimension latent structure. Our method extends a local activation-aware tensor decomposition to a global attention-aware joint tensor decomposition. Our framework can significantly improve the model accuracy over the existing model compression methods when reducing the latent dimension to realize computationally/memory-efficient LLMs/LMMs. We show the benefit on several benchmarks including multi-modal reasoning tasks.
CV · May 4, 2023
A Cross-direction Task Decoupling Network for Small Logo Detection
Hou, Sujuan, Li et al.
Logo detection plays an integral role in many applications. However, handling small logos is still difficult since they occupy too few pixels in the image, which hampers the extraction of discriminative features. The aggregation of small logos also brings a great challenge to the classification and localization of logos. To solve these problems, we propose the Cross-direction Task Decoupling Network (CTDNet) for small logo detection. We first introduce a Cross-direction Feature Pyramid (CFP) to realize cross-direction feature fusion by adopting horizontal transmission and vertical transmission. In addition, a Multi-frequency Task Decoupling Head (MTDH) decouples the classification and localization tasks into two branches. A multi-frequency attention convolution branch is designed to achieve more accurate regression by combining discrete cosine transform and convolution. Comprehensive experiments on four logo datasets demonstrate the effectiveness and efficiency of the proposed method.
SP · Feb 19, 2022
Multi-Modal Recurrent Fusion for Indoor Localization
Jianyuan Yu, Pu, Wang et al.
This paper considers indoor localization using multi-modal wireless signals including Wi-Fi, inertial measurement unit (IMU), and ultra-wideband (UWB). By formulating the localization as a multi-modal sequence regression problem, a multi-stream recurrent fusion method is proposed to combine the current hidden state of each modality in the context of recurrent neural networks while accounting for the modality uncertainty which is directly learned from its own immediate past states. The proposed method was evaluated on the large-scale SPAWC2021 multi-modal localization dataset and compared with a wide range of baseline methods including the trilateration method, traditional fingerprinting methods, and convolution network-based methods.
NIDec 28, 2021
Multi-Band Wi-Fi Sensing with Matched Feature GranularityJianyuan Yu, Pu, Wang et al.
Complementary to the fine-grained channel state information (CSI) from the physical layer and coarse-grained received signal strength indicator (RSSI) measurements, the mid-grained spatial beam attributes (e.g., beam SNR) that are available at millimeter-wave (mmWave) bands during the mandatory beam training phase can be repurposed for Wi-Fi sensing applications. In this paper, we propose a multi-band Wi-Fi fusion method for Wi-Fi sensing that hierarchically fuses the features from both the fine-grained CSI at sub-6 GHz and the mid-grained beam SNR at 60 GHz in a granularity matching framework. The granularity matching is realized by pairing two feature maps from the CSI and beam SNR at different granularity levels and linearly combining all paired feature maps into a fused feature map with learnable weights. To further address the issue of limited labeled training data, we propose an autoencoder-based multi-band Wi-Fi fusion network that can be pre-trained in an unsupervised fashion. Once the autoencoder-based fusion network is pre-trained, we detach the decoders and append multi-task sensing heads to the fused feature map by fine-tuning the fusion block and re-training the multi-task heads from scratch. The multi-band Wi-Fi fusion framework is thoroughly validated by in-house experimental Wi-Fi sensing datasets spanning three tasks: 1) pose recognition; 2) occupancy sensing; and 3) indoor localization. Comparison to four baseline methods (i.e., CSI-only, beam SNR-only, input fusion, and feature fusion) demonstrates that granularity matching improves the multi-task sensing performance. Quantitative performance is evaluated as a function of the amount of labeled training data, latent space dimension, and fine-tuning learning rates.
IRAug 13, 2021
Multi-Objective Recommendations: A TutorialYong Zheng, David, Wang
Recommender systems (RecSys) have been well developed to assist user decision making. Traditional RecSys usually optimize a single objective (e.g., rating prediction errors or ranking quality) in the model. Recently, there has been an emerging demand for multi-objective optimization in RecSys, especially in the areas of multi-stakeholder and multi-task recommender systems. This article provides an overview of multi-objective recommendations, followed by discussions with case studies. The document serves as supplementary material for our tutorial on multi-objective recommendations at ACM SIGKDD 2021.
CRJul 1, 2021
VeriDL: Integrity Verification of Outsourced Deep Learning Services (Extended Version)Boxiang Dong, Bo Zhang, Hui et al.
Deep neural networks (DNNs) are prominent due to their superior performance in many fields. The deep-learning-as-a-service (DLaaS) paradigm enables individuals and organizations (clients) to outsource their DNN learning tasks to the cloud-based platforms. However, the DLaaS server may return incorrect DNN models due to various reasons (e.g., Byzantine failures). This raises the serious concern of how to verify if the DNN models trained by potentially untrusted DLaaS servers are indeed correct. To address this concern, in this paper, we design VeriDL, a framework that supports efficient correctness verification of DNN models in the DLaaS paradigm. The key idea of VeriDL is the design of a small-size cryptographic proof of the training process of the DNN model, which is associated with the model and returned to the client. Through the proof, VeriDL can verify the correctness of the DNN model returned by the DLaaS server with a deterministic guarantee and cheap overhead. Our experiments on four real-world datasets demonstrate the efficiency and effectiveness of VeriDL.
CRMar 17, 2021
Cyber Intrusion Detection by Using Deep Neural Networks with Attack-sharing LossBoxiang Dong, Hui, Wang et al.
Cyber attacks pose crucial threats to computer system security, and put digital treasuries at excessive risk. This leads to an urgent call for an effective intrusion detection system that can identify intrusion attacks with high accuracy. It is challenging to classify intrusion events due to the wide variety of attacks. Furthermore, in a normal network environment, a majority of the connections are initiated by benign behaviors. The class imbalance issue in intrusion detection forces the classifier to be biased toward the majority/benign class, thus leaving many attack incidents undetected. Spurred by the success of deep neural networks in computer vision and natural language processing, in this paper, we design a new system named DeepIDEA that takes full advantage of deep learning to enable intrusion detection and classification. To achieve high detection accuracy on imbalanced data, we design a novel attack-sharing loss function that can effectively move the decision boundary towards the attack classes and eliminate the bias towards the majority/benign class. By using this loss function, DeepIDEA respects the fact that the intrusion mis-classification should receive a higher penalty than the attack mis-classification. Extensive experimental results on three benchmark datasets demonstrate the high detection accuracy of DeepIDEA. In particular, compared with eight state-of-the-art approaches, DeepIDEA always provides the best class-balanced accuracy.
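To make the class-imbalance point above concrete, here is a minimal sketch of a class-balanced accuracy metric. It assumes the metric is the unweighted mean of per-class recall, which is a common definition; the abstract does not define it, so the exact formula used for DeepIDEA may differ. The function name and the toy labels are hypothetical.

```python
from collections import defaultdict


def class_balanced_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recall (assumed definition, see lead-in)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    # Every class contributes equally, regardless of how many samples it has.
    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)


# Toy imbalanced data: 8 benign connections, 2 attacks.
y_true = ["benign"] * 8 + ["attack"] * 2
# A classifier biased toward the majority class predicts "benign" everywhere.
y_pred = ["benign"] * 10

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(plain_accuracy)                           # 0.8
print(class_balanced_accuracy(y_true, y_pred))  # 0.5
```

The always-benign classifier looks acceptable under plain accuracy but scores no better than chance on the balanced metric, which is why a balanced metric is the more informative one for imbalanced intrusion data.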
LGJan 16, 2020
An adversarial learning framework for preserving users' anonymity in face-based emotion recognitionVansh Narula, Zhangyang, Wang et al.
Image and video-capturing technologies have permeated our everyday life. Such technologies can continuously monitor individuals' expressions in real-life settings, affording us new insights into their emotional states and transitions, thus paving the way to novel well-being and healthcare applications. Yet, due to strong privacy concerns, the use of such technologies is met with strong skepticism, since current face-based emotion recognition systems relying on deep learning techniques tend to preserve substantial information related to the identity of the user, apart from the emotion-specific information. This paper proposes an adversarial learning framework which relies on a convolutional neural network (CNN) architecture trained through an iterative procedure for minimizing identity-specific information and maximizing emotion-dependent information. The proposed approach is evaluated through emotion classification and face identification metrics, and is compared against two CNNs, one trained solely for emotion recognition and the other trained solely for face identification. Experiments are performed using the Yale Face Dataset and Japanese Female Facial Expression Database. Results indicate that the proposed approach can learn a convolutional transformation for preserving emotion recognition accuracy and degrading face identity recognition, providing a foundation toward privacy-aware emotion recognition technologies.
CROct 14, 2019
Man-in-the-Middle Attacks against Machine Learning Classifiers via Malicious Generative ModelsDerui, Wang, Chaoran Li et al.
Deep Neural Networks (DNNs) are vulnerable to deliberately crafted adversarial examples. In the past few years, many efforts have been spent on exploring query-optimisation attacks to find adversarial examples of either black-box or white-box DNN models, as well as the defending countermeasures against those attacks. In this work, we explore vulnerabilities of DNN models under the umbrella of Man-in-the-Middle (MitM) attacks, which have not been investigated before. From the perspective of an MitM adversary, the aforementioned adversarial example attacks are not viable anymore. First, such attacks must acquire the outputs from the models multiple times before actually launching attacks, which is difficult for the MitM adversary in practice. Second, such attacks are one-off and cannot be directly generalised onto new data examples, which decreases the rate of return for the attacker. In contrast, using generative models to craft adversarial examples on the fly can mitigate these drawbacks. However, the adversarial capability of generative models, such as the Variational Auto-Encoder (VAE), has not been extensively studied. Therefore, given a classifier, we investigate using a VAE decoder to either transform benign inputs to their adversarial counterparts or decode outputs from benign VAE encoders to be adversarial examples. The proposed method can endue more capability to MitM attackers. Based on our evaluation, the proposed attack can achieve above 95% success rate on both MNIST and CIFAR10 datasets, which is better than or comparable to state-of-the-art query-optimisation attacks. Meanwhile, the attack is 104 times faster than the query-optimisation attacks.
CVJul 26, 2019
Improved Super-Resolution Convolution Neural Network for Large ImagesJunyu, Wang, Rong Song
Single image super-resolution (SISR) is a very popular topic nowadays, which has both research value and practical value. In daily life, we crop a large image into sub-images to do super-resolution and then merge them together. Although convolutional neural networks perform very well in the research field, if we use them to do super-resolution, we can easily observe cutting lines in the merged pictures. To address these problems, in this paper, we propose a refined architecture of SRCNN with 'Symmetric padding', 'Random learning' and 'Residual learning'. Moreover, we have done a lot of experiments to show that our model performs best among many state-of-the-art methods.
CRAug 24, 2018
Truth Inference on Sparse Crowdsourcing Data with Local Differential PrivacyHaipei Sun, Boxiang Dong, Hui et al.
Crowdsourcing has arisen as a new problem-solving paradigm for tasks that are difficult for computers but easy for humans. However, since the answers collected from the recruited participants (workers) may contain sensitive information, crowdsourcing raises serious privacy concerns. In this paper, we investigate the problem of protecting answer privacy under local differential privacy (LDP), by which individual workers randomize their answers independently and send the perturbed answers to the task requester. The utility goal is to enable inference of the true answer (i.e., truth) from the perturbed data with high accuracy. One of the challenges of LDP perturbation is the sparsity of worker answers (i.e., each worker only answers a small number of tasks). A simple extension of the existing approaches (e.g., Laplace perturbation and randomized response) may incur a large error of truth inference on sparse data. Thus we design an efficient new matrix factorization (MF) algorithm under LDP. We prove that our MF algorithm can provide both an LDP guarantee and a small error of truth inference, regardless of the sparsity of worker answers. We perform extensive experiments on real-world and synthetic datasets, and demonstrate that the MF algorithm performs better than the existing LDP algorithms on sparse crowdsourcing data.
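As an illustration of the local perturbation step described above, the following is a minimal sketch of classic binary randomized response, one of the existing baselines the abstract names. It is not the paper's matrix factorization algorithm. The function names, the privacy budget, and the simulated worker answers are all hypothetical choices made for the example.

```python
import math
import random


def randomized_response(bit, epsilon, rng):
    """Report the true bit with probability e^eps / (1 + e^eps), else flip it.

    The ratio of the two reporting probabilities is e^eps, which is the
    epsilon-LDP condition for a binary answer.
    """
    p_truth = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return bit if rng.random() < p_truth else 1 - bit


def estimate_fraction_of_ones(reports, epsilon):
    """Debias the observed mean of the perturbed reports.

    With f the true fraction of ones and p the truth probability,
    E[observed] = p * f + (1 - p) * (1 - f), so
    f = (observed - (1 - p)) / (2p - 1).
    """
    p = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    observed = sum(reports) / len(reports)
    return (observed - (1.0 - p)) / (2.0 * p - 1.0)


# Simulated workers: 30% answer 1, 70% answer 0 (hypothetical data).
rng = random.Random(0)
true_answers = [1] * 6000 + [0] * 14000
reports = [randomized_response(b, epsilon=1.0, rng=rng) for b in true_answers]
print(estimate_fraction_of_ones(reports, epsilon=1.0))
```

Each worker perturbs independently, and the requester only ever sees the noisy reports. The estimate is accurate here because every worker answers the same question; when each worker answers only a few tasks, far fewer reports are available per task, which is the sparsity problem the abstract raises.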
CVJul 20, 2018
Future Semantic Segmentation with Convolutional LSTMSeyed shahabeddin Nabavi, Mrigank Rochan, Yang et al.
We consider the problem of predicting semantic segmentation of future frames in a video. Given several observed frames in a video, our goal is to predict the semantic segmentation map of future frames that are not yet observed. A reliable solution to this problem is useful in many applications that require real-time decision making, such as autonomous driving. We propose a novel model that uses convolutional LSTM (ConvLSTM) to encode the spatiotemporal information of observed frames for future prediction. We also extend our model to use bidirectional ConvLSTM to capture temporal information in both directions. Our proposed approach outperforms other state-of-the-art methods on the benchmark dataset.
MLSep 24, 2015
High Dimensional Data Modeling Techniques for Detection of Chemical Plumes and Anomalies in Hyperspectral Images and MoviesYi, Wang, Guangliang Chen et al.
We briefly review recent progress in techniques for modeling and analyzing hyperspectral images and movies, in particular for detecting plumes of both known and unknown chemicals. For detecting chemicals of known spectrum, we extend the technique of using a single subspace for modeling the background to a "mixture of subspaces" model to tackle more complicated backgrounds. Furthermore, we use partial least squares regression on a resampled training set to boost performance. For the detection of unknown chemicals, we view the problem as an anomaly detection problem, and use novel estimators with low-sampled complexity for intrinsically low-dimensional data in high dimensions that enable us to model the "normal" spectra and detect anomalies. We apply these algorithms to benchmark data sets made available by the Automated Target Detection program co-funded by NSF, DTRA and NGA, and compare, when applicable, to current state-of-the-art algorithms, with favorable results.