CLMar 15, 2023
GPT-4 Technical ReportJosh Achiam, Steven Adler, Sandhini Agarwal et al. · berkeley, deepmind
We report the development of GPT-4, a large-scale, multimodal model which can accept image and text inputs and produce text outputs. While less capable than humans in many real-world scenarios, GPT-4 exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers. GPT-4 is a Transformer-based model pre-trained to predict the next token in a document. The post-training alignment process results in improved performance on measures of factuality and adherence to desired behavior. A core component of this project was developing infrastructure and optimization methods that behave predictably across a wide range of scales. This allowed us to accurately predict some aspects of GPT-4's performance based on models trained with no more than 1/1,000th the compute of GPT-4.
CVJun 2
Cross-Modality Feature Fusion Based on Structured State Space Duality for Multimodal Image Registration NetworkZhikang Li, Yan Wu, Xin Hu et al.
In multi-modal image registration, the primary challenge lies in shared structural information extraction. Compared to Transformers, Structured State Space Duality (SSD) offers greater global structural feature extraction with higher efficiency during training and inference. Inspired by these advantages, we propose a novel algorithm for multi-modal image registration, named RegNetMamba-2. Our algorithm incorporates SSD into coarse-to-fine matching process to extract local and global structural features effectively. Firstly, SSD is applied in three different scales for multi-modal feature extraction in our network. To strengthen local representation, we pay more attention on foreground edge and structural information by feature scaling function of SSD. Secondly, for shared feature extraction of input images and multi-modal feature fusion in all scales, we propose cross-modality feature fusion model based on SSD, consisting of Cross-Modality feature Interaction (CMI) module and Multi-Scale feature Fusion (MSF) module. CMI module is designed for cross-modality feature extraction of each scale by SSD in cross form. MSF module is designed to employ a progressive upward fusion in feature-level to obtain fine features, consisting of multi-modal features in all scales. Following coarse-to-fine, the features in 1/8 scale from CMI and 1/2 scale from MSF are collected to calculate matching probability scores. Then we respectively establish matching process by correspondences of pixel-wise. Extensive experiments demonstrate that comparing with state-of-the-art deep-learning based algorithms, RegNetMamba-2 has achieved good effects in both performance and efficiency for multi-modal image registration on the following datasets: VIS-SAR (OSDataset), VIS-IR (LGHD/RoadSence) and VIS-NIR (RGB-NIR sense).
CVJun 13, 2022
A Multi-purpose Realistic Haze Benchmark with Quantifiable Haze Levels and Ground TruthPriya Narayanan, Xin Hu, Zhenyu Wu et al.
Imagery collected from outdoor visual environments is often degraded due to the presence of dense smoke or haze. A key challenge for research in scene understanding in these degraded visual environments (DVE) is the lack of representative benchmark datasets. These datasets are required to evaluate state-of-the-art vision algorithms (e.g., detection and tracking) in degraded settings. In this paper, we address some of these limitations by introducing the first realistic hazy image benchmark, from both aerial and ground view, with paired haze-free images, and in-situ haze density measurements. This dataset was produced in a controlled environment with professional smoke generating machines that covered the entire scene, and consists of images captured from the perspective of both an unmanned aerial vehicle (UAV) and an unmanned ground vehicle (UGV). We also evaluate a set of representative state-of-the-art dehazing approaches as well as object detectors on the dataset. The full dataset presented in this paper, including the ground truth object classification bounding boxes and haze density measurements, is provided for the community to evaluate their algorithms at: https://a2i2-archangel.vision. A subset of this dataset has been used for the ``Object Detection in Haze'' Track of CVPR UG2 2022 challenge at http://cvpr2022.ug2challenge.org/track1.html.
AIFeb 19, 2023
Language-Specific Representation of Emotion-Concept Knowledge Causally Supports Emotion InferenceMing Li, Yusheng Su, Hsiu-Yuan Huang et al. · tsinghua
Humans no doubt use language to communicate about their emotional experiences, but does language in turn help humans understand emotions, or is language just a vehicle of communication? This study used a form of artificial intelligence (AI) known as large language models (LLMs) to assess whether language-based representations of emotion causally contribute to the AI's ability to generate inferences about the emotional meaning of novel situations. Fourteen attributes of human emotion concept representation were found to be represented by the LLM's distinct artificial neuron populations. By manipulating these attribute-related neurons, we in turn demonstrated the role of emotion concept knowledge in generative emotion inference. The attribute-specific performance deterioration was related to the importance of different attributes in human mental space. Our findings provide a proof-in-concept that even a LLM can learn about emotions in the absence of sensory-motor representations and highlight the contribution of language-derived emotion-concept knowledge for emotion inference.
CVApr 9, 2022
E^2TAD: An Energy-Efficient Tracking-based Action DetectorXin Hu, Zhenyu Wu, Hao-Yu Miao et al.
Video action detection (spatio-temporal action localization) is usually the starting point for human-centric intelligent analysis of videos nowadays. It has high practical impacts for many applications across robotics, security, healthcare, etc. The two-stage paradigm of Faster R-CNN inspires a standard paradigm of video action detection in object detection, i.e., firstly generating person proposals and then classifying their actions. However, none of the existing solutions could provide fine-grained action detection to the "who-when-where-what" level. This paper presents a tracking-based solution to accurately and efficiently localize predefined key actions spatially (by predicting the associated target IDs and locations) and temporally (by predicting the time in exact frame indices). This solution won first place in the UAV-Video Track of 2021 Low-Power Computer Vision Challenge (LPCVC).
CVDec 29, 2022
GPTR: Gestalt-Perception Transformer for Diagram Object DetectionXin Hu, Lingling Zhang, Jun Liu et al.
Diagram object detection is the key basis of practical applications such as textbook question answering. Because the diagram mainly consists of simple lines and color blocks, its visual features are sparser than those of natural images. In addition, diagrams usually express diverse knowledge, in which there are many low-frequency object categories in diagrams. These lead to the fact that traditional data-driven detection model is not suitable for diagrams. In this work, we propose a gestalt-perception transformer model for diagram object detection, which is based on an encoder-decoder architecture. Gestalt perception contains a series of laws to explain human perception, that the human visual system tends to perceive patches in an image that are similar, close or connected without abrupt directional changes as a perceptual whole object. Inspired by these thoughts, we build a gestalt-perception graph in transformer encoder, which is composed of diagram patches as nodes and the relationships between patches as edges. This graph aims to group these patches into objects via laws of similarity, proximity, and smoothness implied in these edges, so that the meaningful objects can be effectively detected. The experimental results demonstrate that the proposed GPTR achieves the best results in the diagram object detection task. Our model also obtains comparable results over the competitors in natural image object detection.
CVSep 14, 2024
Enhancing Skin Disease Diagnosis: Interpretable Visual Concept Discovery with SAMXin Hu, Janet Wang, Jihun Hamm et al.
Current AI-assisted skin image diagnosis has achieved dermatologist-level performance in classifying skin cancer, driven by rapid advancements in deep learning architectures. However, unlike traditional vision tasks, skin images in general present unique challenges due to the limited availability of well-annotated datasets, complex variations in conditions, and the necessity for detailed interpretations to ensure patient safety. Previous segmentation methods have sought to reduce image noise and enhance diagnostic performance, but these techniques require fine-grained, pixel-level ground truth masks for training. In contrast, with the rise of foundation models, the Segment Anything Model (SAM) has been introduced to facilitate promptable segmentation, enabling the automation of the segmentation process with simple yet effective prompts. Efforts applying SAM predominantly focus on dermatoscopy images, which present more easily identifiable lesion boundaries than clinical photos taken with smartphones. This limitation constrains the practicality of these approaches to real-world applications. To overcome the challenges posed by noisy clinical photos acquired via non-standardized protocols and to improve diagnostic accessibility, we propose a novel Cross-Attentive Fusion framework for interpretable skin lesion diagnosis. Our method leverages SAM to generate visual concepts for skin diseases using prompts, integrating local visual concepts with global image features to enhance model performance. Extensive evaluation on two skin disease datasets demonstrates our proposed method's effectiveness on lesion diagnosis and interpretability.
CRSep 10, 2024
Adversarial Attacks to Multi-Modal ModelsZhihao Dou, Xin Hu, Haibo Yang et al.
Multi-modal models have gained significant attention due to their powerful capabilities. These models effectively align embeddings across diverse data modalities, showcasing superior performance in downstream tasks compared to their unimodal counterparts. Recent study showed that the attacker can manipulate an image or audio file by altering it in such a way that its embedding matches that of an attacker-chosen targeted input, thereby deceiving downstream models. However, this method often underperforms due to inherent disparities in data from different modalities. In this paper, we introduce CrossFire, an innovative approach to attack multi-modal models. CrossFire begins by transforming the targeted input chosen by the attacker into a format that matches the modality of the original image or audio file. We then formulate our attack as an optimization problem, aiming to minimize the angular deviation between the embeddings of the transformed input and the modified image or audio file. Solving this problem determines the perturbations to be added to the original media. Our extensive experiments on six real-world benchmark datasets reveal that CrossFire can significantly manipulate downstream tasks, surpassing existing attacks. Additionally, we evaluate six defensive strategies against CrossFire, finding that current defenses are insufficient to counteract our CrossFire.
ROFeb 2Code
UniDWM: Towards a Unified Driving World Model via Multifaceted Representation LearningShuai Liu, Siheng Ren, Xiaoyao Zhu et al.
Achieving reliable and efficient planning in complex driving environments requires a model that can reason over the scene's geometry, appearance, and dynamics. We present UniDWM, a unified driving world model that advances autonomous driving through multifaceted representation learning. UniDWM constructs a structure- and dynamic-aware latent world representation that serves as a physically grounded state space, enabling consistent reasoning across perception, prediction, and planning. Specifically, a joint reconstruction pathway learns to recover the scene's structure, including geometry and visual texture, while a collaborative generation framework leverages a conditional diffusion transformer to forecast future world evolution within the latent space. Furthermore, we show that our UniDWM can be deemed as a variation of VAE, which provides theoretical guidance for the multifaceted representation learning. Extensive experiments demonstrate the effectiveness of UniDWM in trajectory planning, 4D reconstruction and generation, highlighting the potential of multifaceted world representations as a foundation for unified driving intelligence. The code will be publicly available at https://github.com/Say2L/UniDWM.
GROct 30, 2025Code
StructLayoutFormer:Conditional Structured Layout Generation via Structure Serialization and DisentanglementXin Hu, Pengfei Xu, Jin Zhou et al.
Structured layouts are preferable in many 2D visual contents (\eg, GUIs, webpages) since the structural information allows convenient layout editing. Computational frameworks can help create structured layouts but require heavy labor input. Existing data-driven approaches are effective in automatically generating fixed layouts but fail to produce layout structures. We present StructLayoutFormer, a novel Transformer-based approach for conditional structured layout generation. We use a structure serialization scheme to represent structured layouts as sequences. To better control the structures of generated layouts, we disentangle the structural information from the element placements. Our approach is the first data-driven approach that achieves conditional structured layout generation and produces realistic layout structures explicitly. We compare our approach with existing data-driven layout generation approaches by including post-processing for structure extraction. Extensive experiments have shown that our approach exceeds these baselines in conditional structured layout generation. We also demonstrate that our approach is effective in extracting and transferring layout structures. The code is publicly available at %\href{https://github.com/Teagrus/StructLayoutFormer} {https://github.com/Teagrus/StructLayoutFormer}.
SEApr 12
VulWeaver: Weaving Broken Semantics for Grounded Vulnerability DetectionYiheng Cao, Yihao Chen, Xin Hu et al.
Detecting vulnerabilities in source code remains critical yet challenging, as conventional static analysis tools construct inaccurate program representations, while existing LLM-based approaches often miss essential vulnerability context and lack grounded reasoning. To mitigate these challenges, we introduce VulWeaver, a novel LLM-based approach that weaves broken program semantics into accurate representations and extracts holistic vulnerability context for grounded vulnerability detection. Specifically, VulWeaver first constructs an enhanced unified dependency graph (UDG) by integrating deterministic rules with LLM-based semantic inference to address static analysis inaccuracies. It then extracts holistic vulnerability context by combining explicit contexts from program slicing with implicit contexts, including usage, definition, and declaration information. Finally, VulWeaver employs meta-prompting with vulnerability type specific expert guidelines to steer LLMs through systematic reasoning, aggregated via majority voting for robustness. Extensive experiments on PrimeVul4J dataset have demonstrated that VulWeaver achieves a F1-score of 0.75, outperforming state-of-the-art learning-based, LLM-based, and agent-based baselines by 23%, 15%, and 60% in F1-score, respectively. VulWeaver has also detected 26 true vulnerabilities across 9 realworld Java projects, with 15 confirmed by developers and 5 CVE identifiers assigned. In industrial deployment, VulWeaver identified 40 confirmed vulnerabilities in an internal repository.
CRApr 2
From Component Manipulation to System Compromise: Understanding and Detecting Malicious MCP ServersYiheng Huang, Zhijia Zhao, Bihuan Chen et al.
The model context protocol (MCP) standardizes how LLMs connect to external tools and data sources, enabling faster integration but introducing new attack vectors. Despite the growing adoption of MCP, existing MCP security studies classify attacks by their observable effects, obscuring how attacks behave across different MCP server components and overlooking multi-component attack chains. Meanwhile, existing defenses are less effective when facing multi-component attacks or previously unknown malicious behaviors. This work presents a component-centric perspective for understanding and detecting malicious MCP servers. First, we build the first component-centric PoC dataset of 114 malicious MCP servers where attacks are achieved as manipulation over MCP components and their compositions. We evaluate these attacks' effectiveness across two MCP hosts and five LLMs, and uncover that (1) component position shapes attack success rate; and (2) multi-component compositions often outperform single-component attacks by distributing malicious logic. Second, we propose and implement Connor, a two-stage behavioral deviation detector for malicious MCP servers. It first performs pre-execution analysis to detect malicious shell commands and extract each tool's function intent, and then conducts step-wise in-execution analysis to trace each tool's behavioral trajectories and detect deviations from its function intent. Evaluation on our curated dataset indicates that Connor achieves an F1-score of 94.6%, outperforming the state of the art by 8.9% to 59.6%. In real-world detection, Connor identifies two malicious servers.
CVFeb 23
Seeing Clearly, Reasoning Confidently: Plug-and-Play Remedies for Vision Language Model BlindnessXin Hu, Haomiao Ni, Yunbei Zhang et al.
Vision language models (VLMs) have achieved remarkable success in broad visual understanding, yet they remain challenged by object-centric reasoning on rare objects due to the scarcity of such instances in pretraining data. While prior efforts alleviate this issue by retrieving additional data or introducing stronger vision encoders, these methods are still computationally intensive during finetuning VLMs and don't fully exploit the original training data. In this paper, we introduce an efficient plug-and-play module that substantially improves VLMs' reasoning over rare objects by refining visual tokens and enriching input text prompts, without VLMs finetuning. Specifically, we propose to learn multi-modal class embeddings for rare objects by leveraging prior knowledge from vision foundation models and synonym-augmented text descriptions, compensating for limited training examples. These embeddings refine the visual tokens in VLMs through a lightweight attention-based enhancement module that improves fine-grained object details. In addition, we use the learned embeddings as object-aware detectors to generate informative hints, which are injected into the text prompts to help guide the VLM's attention toward relevant image regions. Experiments on two benchmarks show consistent and substantial gains for pretrained VLMs in rare object recognition and reasoning. Further analysis reveals how our method strengthens the VLM's ability to focus on and reason about rare objects.
CVNov 17, 2025Code
VLMs Guided Interpretable Decision Making for Autonomous DrivingXin Hu, Taotao Jing, Renran Tian et al.
Recent advancements in autonomous driving (AD) have explored the use of vision-language models (VLMs) within visual question answering (VQA) frameworks for direct driving decision-making. However, these approaches often depend on handcrafted prompts and suffer from inconsistent performance, limiting their robustness and generalization in real-world scenarios. In this work, we evaluate state-of-the-art open-source VLMs on high-level decision-making tasks using ego-view visual inputs and identify critical limitations in their ability to deliver reliable, context-aware decisions. Motivated by these observations, we propose a new approach that shifts the role of VLMs from direct decision generators to semantic enhancers. Specifically, we leverage their strong general scene understanding to enrich existing vision-based benchmarks with structured, linguistically rich scene descriptions. Building on this enriched representation, we introduce a multi-modal interactive architecture that fuses visual and linguistic features for more accurate decision-making and interpretable textual explanations. Furthermore, we design a post-hoc refinement module that utilizes VLMs to enhance prediction reliability. Extensive experiments on two autonomous driving benchmarks demonstrate that our approach achieves state-of-the-art performance, offering a promising direction for integrating VLMs into reliable and interpretable AD systems.
CVApr 18
Can We Build Scene Graphs, Not Classify Them? FlowSG: Progressive Image-Conditioned Scene Graph Generation with Flow MatchingXin Hu, Ke Qin, Wen Yin et al.
Scene Graph Generation (SGG) unifies object localization and visual relationship reasoning by predicting boxes and subject-predicate-object triples. Yet most pipelines treat SGG as a one-shot, deterministic classification problem rather than a genuinely progressive, generative task. We propose FlowSG, which recasts SGG as continuous-time transport on a hybrid discrete-continuous state: starting from a noised graph, the model progressively grows an image-conditioned scene graph through constraint-aware refinements that jointly synthesize nodes (objects) and edges (predicates). Specifically, we first leverage a VQ-VAE to quantize a scene graph (e.g., continuous visual features) into compact, predictable tokens; a graph Transformer then (i) predicts a conditional velocity field to transport continuous geometry (boxes) and (ii) updates discrete posteriors for categorical tokens (object features and predicate labels), coupling semantics and geometry via flow-conditioned message aggregation. Training combines flow-matching losses for geometry with a discrete-flow objective for tokens, yielding few-step inference and plug-and-play compatibility with standard detectors and segmenters. Extensive experiments on VG and PSG under closed- and open-vocabulary protocols show consistent gains in predicate R/mR and graph-level metrics, validating the mixed discrete-continuous generative formulation over one-shot classification baselines, with an average improvement of about 3 points over the state-of-the-art USG-Par.
CVMay 5, 2024
Residual-Conditioned Optimal Transport: Towards Structure-Preserving Unpaired and Paired Image RestorationXiaole Tang, Xin Hu, Xiang Gu et al.
Deep learning-based image restoration methods generally struggle with faithfully preserving the structures of the original image. In this work, we propose a novel Residual-Conditioned Optimal Transport (RCOT) approach, which models image restoration as an optimal transport (OT) problem for both unpaired and paired settings, introducing the transport residual as a unique degradation-specific cue for both the transport cost and the transport map. Specifically, we first formalize a Fourier residual-guided OT objective by incorporating the degradation-specific information of the residual into the transport cost. We further design the transport map as a two-pass RCOT map that comprises a base model and a refinement process, in which the transport residual is computed by the base model in the first pass and then encoded as a degradation-specific embedding to condition the second-pass restoration. By duality, the RCOT problem is transformed into a minimax optimization problem, which can be solved by adversarially training neural networks. Extensive experiments on multiple restoration tasks show that RCOT achieves competitive performance in terms of both distortion measures and perceptual quality, restoring images with more faithful structures as compared with state-of-the-art methods.
CVNov 3, 2024
Degradation-Aware Residual-Conditioned Optimal Transport for Unified Image RestorationXiaole Tang, Xiang Gu, Xiaoyi He et al.
All-in-one image restoration has emerged as a practical and promising low-level vision task for real-world applications. In this context, the key issue lies in how to deal with different types of degraded images simultaneously. In this work, we present a Degradation-Aware Residual-Conditioned Optimal Transport (DA-RCOT) approach that models (all-in-one) image restoration as an optimal transport (OT) problem for unpaired and paired settings, introducing the transport residual as a degradation-specific cue for both the transport cost and the transport map. Specifically, we formalize image restoration with a residual-guided OT objective by exploiting the degradation-specific patterns of the Fourier residual in the transport cost. More crucially, we design the transport map for restoration as a two-pass DA-RCOT map, in which the transport residual is computed in the first pass and then encoded as multi-scale residual embeddings to condition the second-pass restoration. This conditioning process injects intrinsic degradation knowledge (e.g., degradation type and level) and structural information from the multi-scale residual embeddings into the OT map, which thereby can dynamically adjust its behaviors for all-in-one restoration. Extensive experiments across five degradations demonstrate the favorable performance of DA-RCOT as compared to state-of-the-art methods, in terms of distortion measures, perceptual quality, and image structure preservation. Notably, DA-RCOT delivers superior adaptability to real-world scenarios even with multiple degradations and shows distinctive robustness to both degradation levels and the number of degradations.
CLSep 9, 2025
Dynamic Prompt Fusion for Multi-Task and Cross-Domain Adaptation in LLMsXin Hu, Yue Kang, Guanzi Yao et al.
This study addresses the generalization limitations commonly observed in large language models under multi-task and cross-domain settings. Unlike prior methods such as SPoT, which depends on fixed prompt templates, our study introduces a unified multi-task learning framework with dynamic prompt scheduling mechanism. By introducing a prompt pool and a task-aware scheduling strategy, the method dynamically combines and aligns prompts for different tasks. This enhances the model's ability to capture semantic differences across tasks. During prompt fusion, the model uses task embeddings and a gating mechanism to finely control the prompt signals. This ensures alignment between prompt content and task-specific demands. At the same time, it builds flexible sharing pathways across tasks. In addition, the proposed optimization objective centers on joint multi-task learning. It incorporates an automatic learning strategy for scheduling weights, which effectively mitigates task interference and negative transfer. To evaluate the effectiveness of the method, a series of sensitivity experiments were conducted. These experiments examined the impact of prompt temperature parameters and task number variation. The results confirm the advantages of the proposed mechanism in maintaining model stability and enhancing transferability. Experimental findings show that the prompt scheduling method significantly improves performance on a range of language understanding and knowledge reasoning tasks. These results fully demonstrate its applicability and effectiveness in unified multi-task modeling and cross-domain adaptation.
CVOct 1, 2025
Does Bigger Mean Better? Comparitive Analysis of CNNs and Biomedical Vision Language Modles in Medical DiagnosisRan Tong, Jiaqi Liu, Tong Wang et al.
The accurate interpretation of chest radiographs using automated methods is a critical task in medical imaging. This paper presents a comparative analysis between a supervised lightweight Convolutional Neural Network (CNN) and a state-of-the-art, zero-shot medical Vision-Language Model (VLM), BiomedCLIP, across two distinct diagnostic tasks: pneumonia detection on the PneumoniaMNIST benchmark and tuberculosis detection on the Shenzhen TB dataset. Our experiments show that supervised CNNs serve as highly competitive baselines in both cases. While the default zero-shot performance of the VLM is lower, we demonstrate that its potential can be unlocked via a simple yet crucial remedy: decision threshold calibration. By optimizing the classification threshold on a validation set, the performance of BiomedCLIP is significantly boosted across both datasets. For pneumonia detection, calibration enables the zero-shot VLM to achieve a superior F1-score of 0.8841, surpassing the supervised CNN's 0.8803. For tuberculosis detection, calibration dramatically improves the F1-score from 0.4812 to 0.7684, bringing it close to the supervised baseline's 0.7834. This work highlights a key insight: proper calibration is essential for leveraging the full diagnostic power of zero-shot VLMs, enabling them to match or even outperform efficient, task-specific supervised models.
CLJul 21, 2025
Collaborative Distillation Strategies for Parameter-Efficient Language Model DeploymentXiandong Meng, Yan Wu, Yexin Tian et al.
This paper addresses the challenges of high computational cost and slow inference in deploying large language models. It proposes a distillation strategy guided by multiple teacher models. The method constructs several teacher models and integrates their output probability distributions and intermediate semantic features. This guides the student model to learn from multiple sources of knowledge. As a result, the student model gains stronger language understanding and generation ability while maintaining a small parameter size. To achieve this, the paper introduces a weighted output fusion mechanism, a feature alignment loss function, and an entropy-driven dynamic teacher weighting strategy. These components improve the quality and stability of knowledge transfer during distillation. Under multi-teacher guidance, the student model captures semantic information more effectively and demonstrates strong performance across multiple evaluation metrics. In particular, the method shows high consistency in expression, generalization ability, and task adaptability in tasks such as language modeling, text generation, and multi-task learning. The experiments compare the proposed method with several widely adopted distillation approaches. The results further confirm its overall advantages in perplexity, distillation loss, and generation quality. This study provides a feasible technical path for the efficient compression of large-scale language models. It also demonstrates the effectiveness of multi-teacher collaborative mechanisms in complex language modeling tasks.
LGMay 14, 2024
Airport Delay Prediction with Temporal Fusion TransformersKe Liu, Kaijing Ding, Xi Cheng et al.
Since flight delay hurts passengers, airlines, and airports, its prediction becomes crucial for the decision-making of all stakeholders in the aviation industry and thus has been attempted by various previous research. However, previous delay predictions are often categorical and at a highly aggregated level. To improve that, this study proposes to apply the novel Temporal Fusion Transformer model and predict numerical airport arrival delays at quarter hour level for U.S. top 30 airports. Inputs to our model include airport demand and capacity forecasts, historic airport operation efficiency information, airport wind and visibility conditions, as well as enroute weather and traffic conditions. The results show that our model achieves satisfactory performance measured by small prediction errors on the test set. In addition, the interpretability analysis of the model outputs identifies the important input factors for delay prediction.
SENov 4, 2024
Evaluating the Ability of Large Language Models to Generate Verifiable Specifications in VeriFastWen Fan, Marilyn Rego, Xin Hu et al.
Static verification is a powerful method for enhancing software quality, but it demands significant human labor and resources. This is particularly true of static verifiers that reason about heap manipulating programs using an ownership logic. LLMs have shown promise in a number of software engineering activities, including code generation, test generation, proof generation for theorem provers, and specification generation for static verifiers. However, prior work has not explored how well LLMs can perform specification generation for specifications based in an ownership logic, such as separation logic. To address this gap, this paper explores OpenAI's GPT-4o model's effectiveness in generating specifications on C programs that are verifiable with VeriFast, a separation logic based static verifier. Our experiment employs three different types of user inputs as well as basic and Chain-of-Thought (CoT) prompting to assess GPT's capabilities. Our results indicate that the specifications generated by GPT-4o preserve functional behavior, but struggle to be verifiable. When the specifications are verifiable they contain redundancies. Future directions are discussed to improve the performance.
LGOct 19, 2025
Renaissance of RNNs in Streaming Clinical Time Series: Compact Recurrence Remains Competitive with TransformersRan Tong, Jiaqi Liu, Su Liu et al.
We present a compact, strictly causal benchmark for streaming clinical time series on the MIT--BIH Arrhythmia Database using per-second heart rate. Two tasks are studied under record-level, non-overlapping splits: near-term tachycardia risk (next ten seconds) and one-step heart rate forecasting. We compare a GRU-D (RNN) and a Transformer under matched training budgets against strong non-learned baselines. Evaluation is calibration-aware for classification and proper for forecasting, with temperature scaling and grouped bootstrap confidence intervals. On MIT-BIH, GRU-D slightly surpasses the Transformer for tachycardia risk, while the Transformer clearly lowers forecasting error relative to GRU-D and persistence. Our results show that, in longitudinal monitoring, model choice is task-dependent: compact RNNs remain competitive for short-horizon risk scoring, whereas compact Transformers deliver clearer gains for point forecasting.
CLOct 11, 2025
Lightweight Baselines for Medical Abstract Classification: DistilBERT with Cross-Entropy as a Strong DefaultJiaqi Liu, Tong Wang, Su Liu et al.
The research evaluates lightweight medical abstract classification methods to establish their maximum performance capabilities under financial budget restrictions. On the public medical abstracts corpus, we finetune BERT base and Distil BERT with three objectives cross entropy (CE), class weighted CE, and focal loss under identical tokenization, sequence length, optimizer, and schedule. DistilBERT with plain CE gives the strongest raw argmax trade off, while a post hoc operating point selection (validation calibrated, classwise thresholds) sub stantially improves deployed performance; under this tuned regime, focal benefits most. We report Accuracy, Macro F1, and WeightedF1, release evaluation artifacts, and include confusion analyses to clarify error structure. The practical takeaway is to start with a compact encoder and CE, then add lightweight calibration or thresholding when deployment requires higher macro balance.
CVApr 2
AdamFlow: Adam-based Wasserstein Gradient Flows for Surface Registration in Medical ImagingQiang Ma, Qingjie Meng, Xin Hu et al.
Surface registration plays an important role for anatomical shape analysis in medical imaging. Existing surface registration methods often face a trade-off between efficiency and robustness. Local point matching methods are computationally efficient, but vulnerable to noise and initialisation. Methods designed for global point set alignment tend to incur a high computational cost. To address the challenge, here we present a fast surface registration method, which formulates surface meshes as probability measures and surface registration as a distributional optimisation problem. The discrepancy between two meshes is measured using an efficient sliced Wasserstein distance with log-linear computational complexity. We propose a novel optimisation method, AdamFlow, which generalises the well-known Adam optimisation method from the Euclidean space to the probability space for minimising the sliced Wasserstein distance. We theoretically analyse the asymptotic convergence of AdamFlow and empirically demonstrate its superior performance in both affine and non-rigid surface registration across various anatomical structures.
AIAug 26, 2025
eSkinHealth: A Multimodal Dataset for Neglected Tropical Skin DiseasesJanet Wang, Xin Hu, Yunbei Zhang et al.
Skin Neglected Tropical Diseases (NTDs) impose severe health and socioeconomic burdens in impoverished tropical communities. Yet, advancements in AI-driven diagnostic support are hindered by data scarcity, particularly for underrepresented populations and rare manifestations of NTDs. Existing dermatological datasets often lack the demographic and disease spectrum crucial for developing reliable recognition models of NTDs. To address this, we introduce eSkinHealth, a novel dermatological dataset collected on-site in Côte d'Ivoire and Ghana. Specifically, eSkinHealth contains 5,623 images from 1,639 cases and encompasses 47 skin diseases, focusing uniquely on skin NTDs and rare conditions among West African populations. We further propose an AI-expert collaboration paradigm to implement foundation language and segmentation models for efficient generation of multimodal annotations, under dermatologists' guidance. In addition to patient metadata and diagnosis labels, eSkinHealth also includes semantic lesion masks, instance-specific visual captions, and clinical concepts. Overall, our work provides a valuable new resource and a scalable annotation framework, aiming to catalyze the development of more equitable, accurate, and interpretable AI tools for global dermatology.
CVJul 8, 2025
SPADE: Spatial-Aware Denoising Network for Open-vocabulary Panoptic Scene Graph Generation with Long- and Local-range Context ReasoningXin Hu, Ke Qin, Guiduo Duan et al.
Panoptic Scene Graph Generation (PSG) integrates instance segmentation with relation understanding to capture pixel-level structural relationships in complex scenes. Although recent approaches leveraging pre-trained vision-language models (VLMs) have significantly improved performance in the open-vocabulary setting, they commonly ignore the inherent limitations of VLMs in spatial relation reasoning, such as difficulty in distinguishing object relative positions, which results in suboptimal relation prediction. Motivated by the denoising diffusion model's inversion process in preserving the spatial structure of input images, we propose SPADE (SPatial-Aware Denoising-nEtwork) framework -- a novel approach for open-vocabulary PSG. SPADE consists of two key steps: (1) inversion-guided calibration for the UNet adaptation, and (2) spatial-aware context reasoning. In the first step, we calibrate a general pre-trained teacher diffusion model into a PSG-specific denoising network with cross-attention maps derived during inversion through a lightweight LoRA-based fine-tuning strategy. In the second step, we develop a spatial-aware relation graph transformer that captures both local and long-range contextual information, facilitating the generation of high-quality relation queries. Extensive experiments on benchmark PSG and Visual Genome datasets demonstrate that SPADE outperforms state-of-the-art methods in both closed- and open-set scenarios, particularly for spatial relationship prediction.
CVNov 19, 2025
TiCAL:Typicality-Based Consistency-Aware Learning for Multimodal Emotion RecognitionWen Yin, Siyu Zhan, Cencen Liu et al.
Multimodal Emotion Recognition (MER) aims to accurately identify human emotional states by integrating heterogeneous modalities such as visual, auditory, and textual data. Existing approaches predominantly rely on unified emotion labels to supervise model training, often overlooking a critical challenge: inter-modal emotion conflicts, wherein different modalities within the same sample may express divergent emotional tendencies. In this work, we address this overlooked issue by proposing a novel framework, Typicality-based Consistent-aware Multimodal Emotion Recognition (TiCAL), inspired by the stage-wise nature of human emotion perception. TiCAL dynamically assesses the consistency of each training sample by leveraging pseudo unimodal emotion labels alongside a typicality estimation. To further enhance emotion representation, we embed features in a hyperbolic space, enabling the capture of fine-grained distinctions among emotional categories. By incorporating consistency estimates into the learning process, our method improves model performance, particularly on samples exhibiting high modality inconsistency. Extensive experiments on benchmark datasets, e.g, CMU-MOSEI and MER2023, validate the effectiveness of TiCAL in mitigating inter-modal emotional conflicts and enhancing overall recognition accuracy, e.g., with about 2.6% improvements over the state-of-the-art DMD.
CVNov 27, 2025
HybridWorldSim: A Scalable and Controllable High-fidelity Simulator for Autonomous DrivingQiang Li, Yingwenqi Jiang, Tuoxi Li et al.
Realistic and controllable simulation is critical for advancing end-to-end autonomous driving, yet existing approaches often struggle to support novel view synthesis under large viewpoint changes or to ensure geometric consistency. We introduce HybridWorldSim, a hybrid simulation framework that integrates multi-traversal neural reconstruction for static backgrounds with generative modeling for dynamic agents. This unified design addresses key limitations of previous methods, enabling the creation of diverse and high-fidelity driving scenarios with reliable visual and spatial consistency. To facilitate robust benchmarking, we further release a new multi-traversal dataset MIRROR that captures a wide range of routes and environmental conditions across different cities. Extensive experiments demonstrate that HybridWorldSim surpasses previous state-of-the-art methods, providing a practical and scalable solution for high-fidelity simulation and a valuable resource for research and development in autonomous driving.
LGOct 25, 2025
Dynamic Graph Neural Network for Data-Driven Physiologically Based Pharmacokinetic ModelingSu Liu, Xin Hu, Shurong Wen et al.
Physiologically Based Pharmacokinetic (PBPK) modeling plays a critical role in drug development by predicting drug concentration dynamics across organs. Traditional approaches rely on ordinary differential equations with strong simplifying assumptions, which limit their adaptability to nonlinear physiological interactions. In this study, we explore data-driven alternatives for PBPK prediction using deep learning. Two baseline architectures - a multilayer perceptron (MLP) and a long short-term memory (LSTM) network - are implemented to capture molecular and temporal dependencies, respectively. To incorporate inter-organ interactions, we propose a Dynamic Graph Neural Network (Dynamic GNN) that models physiological connections as recurrent message-passing processes between organs. Experimental results demonstrate that the proposed Dynamic GNN achieves the highest predictive performance among all models, with an R^2 of 0.9342, an RMSE of 0.0159, and an MAE of 0.0116. In comparison, the MLP baseline obtains an R^2 of 0.8705 and the LSTM achieves 0.8059. These results highlight that explicitly modeling the spatial and temporal dependencies of organ interactions enables more accurate and generalizable drug concentration prediction. The Dynamic GNN provides a scalable, equation-free alternative to traditional PBPK formulations and demonstrates strong potential for data-driven pharmacokinetic modeling in preclinical and clinical research.
LGFeb 18, 2025
Multi-branch of Attention Yields Accurate Results for Tabular DataXuechen Li, Yupeng Li, Jian Liu et al.
Tabular data inherently exhibits significant feature heterogeneity, but existing transformer-based methods lack specialized mechanisms to handle this property. To bridge the gap, we propose MAYA, an encoder-decoder transformer-based framework. In the encoder, we design a Multi-Branch of Attention (MBA) that constructs multiple parallel attention branches and averages the features at each branch, effectively fusing heterogeneous features while limiting parameter growth. Additionally, we employ collaborative learning with a dynamic consistency weight constraint to produce more robust representations. In the decoder stage, cross-attention is utilized to seamlessly integrate tabular data with corresponding label features. This dual-attention mechanism effectively captures both intra-instance and inter-instance interactions. We evaluate the proposed method on a wide range of datasets and compare it with other state-of-the-art transformer-based methods. Extensive experiments demonstrate that our model achieves superior performance among transformer-based methods in both tabular classification and regression tasks.
LOOct 17, 2021
Learning First-Order Rules with Relational Path Contrast for Inductive Relation ReasoningYudai Pan, Jun Liu, Lingling Zhang et al.
Relation reasoning in knowledge graphs (KGs) aims at predicting missing relations in incomplete triples, whereas the dominant paradigm is learning the embeddings of relations and entities, which is limited to a transductive setting and has restriction on processing unseen entities in an inductive situation. Previous inductive methods are scalable and consume less resource. They utilize the structure of entities and triples in subgraphs to own inductive ability. However, in order to obtain better reasoning results, the model should acquire entity-independent relational semantics in latent rules and solve the deficient supervision caused by scarcity of rules in subgraphs. To address these issues, we propose a novel graph convolutional network (GCN)-based approach for interpretable inductive reasoning with relational path contrast, named RPC-IR. RPC-IR firstly extracts relational paths between two entities and learns representations of them, and then innovatively introduces a contrastive strategy by constructing positive and negative relational paths. A joint training strategy considering both supervised and contrastive information is also proposed. Comprehensive experiments on three inductive datasets show that RPC-IR achieves outstanding performance comparing with the latest inductive reasoning methods and could explicitly represent logical rules for interpretability.
HCSep 20, 2021
Contrastive Learning of Subject-Invariant EEG Representations for Cross-Subject Emotion RecognitionXinke Shen, Xianggen Liu, Xin Hu et al.
EEG signals have been reported to be informative and reliable for emotion recognition in recent years. However, the inter-subject variability of emotion-related EEG signals still poses a great challenge for the practical applications of EEG-based emotion recognition. Inspired by recent neuroscience studies on inter-subject correlation, we proposed a Contrastive Learning method for Inter-Subject Alignment (CLISA) to tackle the cross-subject emotion recognition problem. Contrastive learning was employed to minimize the inter-subject differences by maximizing the similarity in EEG signal representations across subjects when they received the same emotional stimuli in contrast to different ones. Specifically, a convolutional neural network was applied to learn inter-subject aligned spatiotemporal representations from EEG time series in contrastive learning. The aligned representations were subsequently used to extract differential entropy features for emotion classification. CLISA achieved state-of-the-art cross-subject emotion recognition performance on our THU-EP dataset with 80 subjects and the publicly available SEED dataset with 15 subjects. It could generalize to unseen subjects or unseen emotional stimuli in testing. Furthermore, the spatiotemporal representations learned by CLISA could provide insights into the neural mechanisms of human emotion processing.
CVMay 11, 2021
A Feature Fusion-Net Using Deep Spatial Context Encoder and Nonstationary Joint Statistical Model for High Resolution SAR Image ClassificationWenkai Liang, Yan Wu, Ming Li et al.
Convolutional neural networks (CNNs) have been applied to learn spatial features for high-resolution (HR) synthetic aperture radar (SAR) image classification. However, there has been little work on integrating the unique statistical distributions of SAR images which can reveal physical properties of terrain objects, into CNNs in a supervised feature learning framework. To address this problem, a novel end-to-end supervised classification method is proposed for HR SAR images by considering both spatial context and statistical features. First, to extract more effective spatial features from SAR images, a new deep spatial context encoder network (DSCEN) is proposed, which is a lightweight structure and can be effectively trained with a small number of samples. Meanwhile, to enhance the diversity of statistics, the nonstationary joint statistical model (NS-JSM) is adopted to form the global statistical features. Specifically, SAR images are transformed into the Gabor wavelet domain and the produced multi-subbands magnitudes and phases are modeled by the log-normal and uniform distribution. The covariance matrix is further utilized to capture the inter-scale and intra-scale nonstationary correlation between the statistical subbands and make the joint statistical features more compact and distinguishable. Considering complementary advantages, a feature fusion network (Fusion-Net) base on group compression and smooth normalization is constructed to embed the statistical features into the spatial features and optimize the fusion feature representation. As a result, our model can learn the discriminative features and improve the final classification performance. Experiments on four HR SAR images validate the superiority of the proposed method over other related algorithms.
CVMar 10, 2021
RL-CSDia: Representation Learning of Computer Science DiagramsShaowei Wang, LingLing Zhang, Xuan Luo et al.
Recent studies on computer vision mainly focus on natural images that express real-world scenes. They achieve outstanding performance on diverse tasks such as visual question answering. Diagram is a special form of visual expression that frequently appears in the education field and is of great significance for learners to understand multimodal knowledge. Current research on diagrams preliminarily focuses on natural disciplines such as Biology and Geography, whose expressions are still similar to natural images. Another type of diagrams such as from Computer Science is composed of graphics containing complex topologies and relations, and research on this type of diagrams is still blank. The main challenges of graphic diagrams understanding are the rarity of data and the confusion of semantics, which are mainly reflected in the diversity of expressions. In this paper, we construct a novel dataset of graphic diagrams named Computer Science Diagrams (CSDia). It contains more than 1,200 diagrams and exhaustive annotations of objects and relations. Considering the visual noises caused by the various expressions in diagrams, we introduce the topology of diagrams to parse topological structure. After that, we propose Diagram Parsing Net (DPN) to represent the diagram from three branches: topology, visual feature, and text, and apply the model to the diagram classification task to evaluate the ability of diagrams understanding. The results show the effectiveness of the proposed DPN on diagrams understanding.
IVDec 27, 2020
WHU-Hi: UAV-borne hyperspectral with high spatial resolution (H2) benchmark datasets for hyperspectral image classificationXin Hu, Yanfei Zhong, Chang Luo et al.
Classification is an important aspect of hyperspectral images processing and application. At present, the researchers mostly use the classic airborne hyperspectral imagery as the benchmark dataset. However, existing datasets suffer from three bottlenecks: (1) low spatial resolution; (2) low labeled pixels proportion; (3) low degree of subclasses distinction. In this paper, a new benchmark dataset named the Wuhan UAV-borne hyperspectral image (WHU-Hi) dataset was built for hyperspectral image classification. The WHU-Hi dataset with a high spectral resolution (nm level) and a very high spatial resolution (cm level), which we refer to here as H2 imager. Besides, the WHU-Hi dataset has a higher pixel labeling ratio and finer subclasses. Some start-of-art hyperspectral image classification methods benchmarked the WHU-Hi dataset, and the experimental results show that WHU-Hi is a challenging dataset. We hope WHU-Hi dataset can become a strong benchmark to accelerate future research.
SPJul 13, 2020
Inertial Sensing Meets Artificial Intelligence: Opportunity or Challenge?You Li, Ruizhi Chen, Xiaoji Niu et al.
The inertial navigation system (INS) has been widely used to provide self-contained and continuous motion estimation in intelligent transportation systems. Recently, the emergence of chip-level inertial sensors has expanded the relevant applications from positioning, navigation, and mobile mapping to location-based services, unmanned systems, and transportation big data. Meanwhile, benefit from the emergence of big data and the improvement of algorithms and computing power, artificial intelligence (AI) has become a consensus tool that has been successfully applied in various fields. This article reviews the research on using AI technology to enhance inertial sensing from various aspects, including sensor design and selection, calibration and error modeling, navigation and motion-sensing algorithms, multi-sensor information fusion, system evaluation, and practical application. Based on the over 30 representative articles selected from the nearly 300 related publications, this article summarizes the state of the art, advantages, and challenges on each aspect. Finally, it summarizes nine advantages and nine challenges of AI-enhanced inertial sensing and then points out future research directions.
LGApr 9, 2020
Deep Reinforcement Learning (DRL): Another Perspective for Unsupervised Wireless LocalizationYou Li, Xin Hu, Yuan Zhuang et al.
Location is key to spatialize internet-of-things (IoT) data. However, it is challenging to use low-cost IoT devices for robust unsupervised localization (i.e., localization without training data that have known location labels). Thus, this paper proposes a deep reinforcement learning (DRL) based unsupervised wireless-localization method. The main contributions are as follows. (1) This paper proposes an approach to model a continuous wireless-localization process as a Markov decision process (MDP) and process it within a DRL framework. (2) To alleviate the challenge of obtaining rewards when using unlabeled data (e.g., daily-life crowdsourced data), this paper presents a reward-setting mechanism, which extracts robust landmark data from unlabeled wireless received signal strengths (RSS). (3) To ease requirements for model re-training when using DRL for localization, this paper uses RSS measurements together with agent location to construct DRL inputs. The proposed method was tested by using field testing data from multiple Bluetooth 5 smart ear tags in a pasture. Meanwhile, the experimental verification process reflected the advantages and challenges for using DRL in wireless localization.
CRJul 15, 2017
Android Malware Clustering through Malicious Payload MiningYuping Li, Jiyong Jang, Xin Hu et al.
Clustering has been well studied for desktop malware analysis as an effective triage method. Conventional similarity-based clustering techniques, however, cannot be immediately applied to Android malware analysis due to the excessive use of third-party libraries in Android application development and the widespread use of repackaging in malware development. We design and implement an Android malware clustering system through iterative mining of malicious payload and checking whether malware samples share the same version of malicious payload. Our system utilizes a hierarchical clustering technique and an efficient bit-vector format to represent Android apps. Experimental results demonstrate that our clustering approach achieves precision of 0.90 and recall of 0.75 for Android Genome malware dataset, and average precision of 0.98 and recall of 0.96 with respect to manually verified ground-truth.