CLOct 17, 2022Code
MoSE: Modality Split and Ensemble for Multimodal Knowledge Graph CompletionYu Zhao, Xiangrui Cai, Yike Wu et al.
Multimodal knowledge graph completion (MKGC) aims to predict missing entities in MKGs. Previous works usually share relation representation across modalities. This results in mutual interference between modalities during training, since for a pair of entities, the relation from one modality probably contradicts that from another modality. Furthermore, making a unified prediction based on the shared relation representation treats the input in different modalities equally, while their importance to the MKGC task should be different. In this paper, we propose MoSE, a Modality Split representation learning and Ensemble inference framework for MKGC. Specifically, in the training phase, we learn modality-split relation embeddings for each modality instead of a single modality-shared one, which alleviates the modality interference. Based on these embeddings, in the inference phase, we first make modality-split predictions and then exploit various ensemble methods to combine the predictions with different weights, which models the modality importance dynamically. Experimental results on three KG datasets show that MoSE outperforms state-of-the-art MKGC methods. Codes are available at https://github.com/OreOZhao/MoSE4MKGC.
CLSep 18, 2022Code
Overcoming Language Priors in Visual Question Answering via Distinguishing Superficially Similar InstancesYike Wu, Yu Zhao, Shiwan Zhao et al.
Despite the great progress of Visual Question Answering (VQA), current VQA models heavily rely on the superficial correlation between the question type and its corresponding frequent answers (i.e., language priors) to make predictions, without really understanding the input. In this work, we define the training instances with the same question type but different answers as \textit{superficially similar instances}, and attribute the language priors to the confusion of VQA model on such instances. To solve this problem, we propose a novel training framework that explicitly encourages the VQA model to distinguish between the superficially similar instances. Specifically, for each training instance, we first construct a set that contains its superficially similar counterparts. Then we exploit the proposed distinguishing module to increase the distance between the instance and its counterparts in the answer space. In this way, the VQA model is forced to further focus on the other parts of the input beyond the question type, which helps to overcome the language priors. Experimental results show that our method achieves the state-of-the-art performance on VQA-CP v2. Codes are available at \href{https://github.com/wyk-nku/Distinguishing-VQA.git}{Distinguishing-VQA}.
LGJun 3
New Benchmarking Shows Limited Generalization Power of TCR Antigenic Epitope Prediction ModelsYiming Liao, Yiheng Li, Ning Jiang et al.
Accurate computational prediction of T cell receptor (TCR) antigen specificity would transform the study of T cell biology and enable scalable immune engineering, yet existing models lack sufficient sensitivity and specificity for broad applications. A major limitation is the absence of rigorously defined, unseen benchmark datasets that allow unbiased evaluation of model performance and generalizability. Here, we describe two complementary classes of datasets that meet this criterion and argue that they provide both a robust framework for model assessment and a foundation for next-generation TCR-antigen prediction algorithm development.
SDJul 2, 2022Code
Improving Transformer-based Conversational ASR by Inter-Sentential Attention MechanismKun Wei, Pengcheng Guo, Ning Jiang
Transformer-based models have demonstrated their effectiveness in automatic speech recognition (ASR) tasks and even shown superior performance over the conventional hybrid framework. The main idea of Transformers is to capture the long-range global context within an utterance by self-attention layers. However, for scenarios like conversational speech, such utterance-level modeling will neglect contextual dependencies that span across utterances. In this paper, we propose to explicitly model the inter-sentential information in a Transformer based end-to-end architecture for conversational speech recognition. Specifically, for the encoder network, we capture the contexts of previous speech and incorporate such historic information into current input by a context-aware residual attention mechanism. For the decoder, the prediction of current utterance is also conditioned on the historic linguistic information through a conditional decoder framework. We show the effectiveness of our proposed method on several open-source dialogue corpora and the proposed method consistently improved the performance from the utterance-level Transformer-based ASR models.
CLFeb 22, 2023
MADI: Inter-domain Matching and Intra-domain Discrimination for Cross-domain Speech RecognitionJiaming Zhou, Shiwan Zhao, Ning Jiang et al.
End-to-end automatic speech recognition (ASR) usually suffers from performance degradation when applied to a new domain due to domain shift. Unsupervised domain adaptation (UDA) aims to improve the performance on the unlabeled target domain by transferring knowledge from the source to the target domain. To improve transferability, existing UDA approaches mainly focus on matching the distributions of the source and target domains globally and/or locally, while ignoring the model discriminability. In this paper, we propose a novel UDA approach for ASR via inter-domain MAtching and intra-domain DIscrimination (MADI), which improves the model transferability by fine-grained inter-domain matching and discriminability by intra-domain contrastive discrimination simultaneously. Evaluations on the Libri-Adapt dataset demonstrate the effectiveness of our approach. MADI reduces the relative word error rate (WER) on cross-device and cross-environment ASR by 17.7% and 22.8%, respectively.
CVJan 22Code
Explainable Deepfake Detection with RL Enhanced Self-Blended ImagesNing Jiang, Dingheng Zeng, Yanhong Liu et al.
Most prior deepfake detection methods lack explainable outputs. With the growing interest in multimodal large language models (MLLMs), researchers have started exploring their use in interpretable deepfake detection. However, a major obstacle in applying MLLMs to this task is the scarcity of high-quality datasets with detailed forgery attribution annotations, as textual annotation is both costly and challenging - particularly for high-fidelity forged images or videos. Moreover, multiple studies have shown that reinforcement learning (RL) can substantially enhance performance in visual tasks, especially in improving cross-domain generalization. To facilitate the adoption of mainstream MLLM frameworks in deepfake detection with reduced annotation cost, and to investigate the potential of RL in this context, we propose an automated Chain-of-Thought (CoT) data generation framework based on Self-Blended Images, along with an RL-enhanced deepfake detection framework. Extensive experiments validate the effectiveness of our CoT data construction pipeline, tailored reward mechanism, and feedback-driven synthetic data generation approach. Our method achieves performance competitive with state-of-the-art (SOTA) approaches across multiple cross-dataset benchmarks. Implementation details are available at https://github.com/deon1219/rlsbi.
SDOct 22, 2023
Conversational Speech Recognition by Learning Audio-textual Cross-modal Contextual RepresentationKun Wei, Bei Li, Hang Lv et al.
Automatic Speech Recognition (ASR) in conversational settings presents unique challenges, including extracting relevant contextual information from previous conversational turns. Due to irrelevant content, error propagation, and redundancy, existing methods struggle to extract longer and more effective contexts. To address this issue, we introduce a novel conversational ASR system, extending the Conformer encoder-decoder model with cross-modal conversational representation. Our approach leverages a cross-modal extractor that combines pre-trained speech and text models through a specialized encoder and a modal-level mask input. This enables the extraction of richer historical speech context without explicit error propagation. We also incorporate conditional latent variational modules to learn conversational level attributes such as role preference and topic coherence. By introducing both cross-modal and conversational representations into the decoder, our model retains context over longer sentences without information loss, achieving relative accuracy improvements of 8.8% and 23% on Mandarin conversation datasets HKUST and MagicData-RAMC, respectively, compared to the standard Conformer model.
IVJun 8, 2023
Connectional-Style-Guided Contextual Representation Learning for Brain Disease DiagnosisGongshu Wang, Ning Jiang, Yunxiao Ma et al.
Structural magnetic resonance imaging (sMRI) has shown great clinical value and has been widely used in deep learning (DL) based computer-aided brain disease diagnosis. Previous approaches focused on local shapes and textures in sMRI that may be significant only within a particular domain. The learned representations are likely to contain spurious information and have a poor generalization ability in other diseases and datasets. To facilitate capturing meaningful and robust features, it is necessary to first comprehensively understand the intrinsic pattern of the brain that is not restricted within a single data/task domain. Considering that the brain is a complex connectome of interlinked neurons, the connectional properties in the brain have strong biological significance, which is shared across multiple domains and covers most pathological information. In this work, we propose a connectional style contextual representation learning model (CS-CRL) to capture the intrinsic pattern of the brain, used for multiple brain disease diagnosis. Specifically, it has a vision transformer (ViT) encoder and leverages mask reconstruction as the proxy task and Gram matrices to guide the representation of connectional information. It facilitates the capture of global context and the aggregation of features with biological plausibility. The results indicate that CS-CRL achieves superior accuracy in multiple brain disease diagnosis tasks across six datasets and three diseases and outperforms state-of-the-art models. Furthermore, we demonstrate that CS-CRL captures more brain-network-like properties, better aggregates features, is easier to optimize and is more robust to noise, which explains its superiority in theory. Our source code will be released soon.
CLApr 6, 2022
Modeling Temporal-Modal Entity Graph for Procedural Multimodal Machine ComprehensionHuibin Zhang, Zhengkun Zhang, Yao Zhang et al.
Procedural Multimodal Documents (PMDs) organize textual instructions and corresponding images step by step. Comprehending PMDs and inducing their representations for the downstream reasoning tasks is designated as Procedural MultiModal Machine Comprehension (M3C). In this study, we approach Procedural M3C at a fine-grained level (compared with existing explorations at a document or sentence level), that is, entity. With delicate consideration, we model entity both in its temporal and cross-modal relation and propose a novel Temporal-Modal Entity Graph (TMEG). Specifically, graph structure is formulated to capture textual and visual entities and trace their temporal-modal evolution. In addition, a graph aggregation module is introduced to conduct graph encoding and reasoning. Comprehensive experiments across three Procedural M3C tasks are conducted on a traditional dataset RecipeQA and our new dataset CraftQA, which can better evaluate the generalization of TMEG.
CVSep 12, 2024
Learn from Balance: Rectifying Knowledge Transfer for Long-Tailed ScenariosXinlei Huang, Jialiang Tang, Xubin Zheng et al.
Knowledge Distillation (KD) transfers knowledge from a large pre-trained teacher network to a compact and efficient student network, making it suitable for deployment on resource-limited media terminals. However, traditional KD methods require balanced data to ensure robust training, which is often unavailable in practical applications. In such scenarios, a few head categories occupy a substantial proportion of examples. This imbalance biases the trained teacher network towards the head categories, resulting in severe performance degradation on the less represented tail categories for both the teacher and student networks. In this paper, we propose a novel framework called Knowledge Rectification Distillation (KRDistill) to address the imbalanced knowledge inherited in the teacher network through the incorporation of the balanced category priors. Furthermore, we rectify the biased predictions produced by the teacher network, particularly focusing on the tail categories. Consequently, the teacher network can provide balanced and accurate knowledge to train a reliable student network. Intensive experiments conducted on various long-tailed datasets demonstrate that our KRDistill can effectively train reliable student networks in realistic scenarios of data imbalance.
CVAug 21, 2024
AnyDesign: Versatile Area Fashion Editing via Mask-Free DiffusionYunfang Niu, Lingxiang Wu, Dong Yi et al.
Fashion image editing aims to modify a person's appearance based on a given instruction. Existing methods require auxiliary tools like segmenters and keypoint extractors, lacking a flexible and unified framework. Moreover, these methods are limited in the variety of clothing types they can handle, as most datasets focus on people in clean backgrounds and only include generic garments such as tops, pants, and dresses. These limitations restrict their applicability in real-world scenarios. In this paper, we first extend an existing dataset for human generation to include a wider range of apparel and more complex backgrounds. This extended dataset features people wearing diverse items such as tops, pants, dresses, skirts, headwear, scarves, shoes, socks, and bags. Additionally, we propose AnyDesign, a diffusion-based method that enables mask-free editing on versatile areas. Users can simply input a human image along with a corresponding prompt in either text or image format. Our approach incorporates Fashion DiT, equipped with a Fashion-Guidance Attention (FGA) module designed to fuse explicit apparel types and CLIP-encoded apparel features. Both Qualitative and quantitative experiments demonstrate that our method delivers high-quality fashion editing and outperforms contemporary text-guided fashion editing methods.
CVJan 2, 2025Code
Towards Interactive Deepfake AnalysisLixiong Qin, Ning Jiang, Yang Zhang et al.
Existing deepfake analysis methods are primarily based on discriminative models, which significantly limit their application scenarios. This paper aims to explore interactive deepfake analysis by performing instruction tuning on multi-modal large language models (MLLMs). This will face challenges such as the lack of datasets and benchmarks, and low training efficiency. To address these issues, we introduce (1) a GPT-assisted data construction process resulting in an instruction-following dataset called DFA-Instruct, (2) a benchmark named DFA-Bench, designed to comprehensively evaluate the capabilities of MLLMs in deepfake detection, deepfake classification, and artifact description, and (3) construct an interactive deepfake analysis system called DFA-GPT, as a strong baseline for the community, with the Low-Rank Adaptation (LoRA) module. The dataset and code will be made available at https://github.com/lxq1000/DFA-Instruct to facilitate further research.
CVJun 20, 2023
Multi-task Collaborative Pre-training and Individual-adaptive-tokens Fine-tuning: A Unified Framework for Brain Representation LearningNing Jiang, Gongshu Wang, Tianyi Yan
Structural magnetic resonance imaging (sMRI) provides accurate estimates of the brain's structural organization and learning invariant brain representations from sMRI is an enduring issue in neuroscience. Previous deep representation learning models ignore the fact that the brain, as the core of human cognitive activity, is distinct from other organs whose primary attribute is anatomy. Therefore, capturing the semantic structure that dominates interindividual cognitive variability is key to accurately representing the brain. Given that this high-level semantic information is subtle, distributed, and interdependently latent in the brain structure, sMRI-based models need to capture fine-grained details and understand how they relate to the overall global structure. However, existing models are optimized by simple objectives, making features collapse into homogeneity and worsening simultaneous representation of fine-grained information and holistic semantics, causing a lack of biological plausibility and interpretation of cognition. Here, we propose MCIAT, a unified framework that combines Multi-task Collaborative pre-training and Individual-Adaptive-Tokens fine-tuning. Specifically, we first synthesize restorative learning, age prediction auxiliary learning and adversarial learning as a joint proxy task for deep semantic representation learning. Then, a mutual-attention-based token selection method is proposed to highlight discriminative features. The proposed MCIAT achieves state-of-the-art diagnosis performance on the ADHD-200 dataset compared with several sMRI-based approaches and shows superior generalization on the MCIC and OASIS datasets. Moreover, we studied 12 behavioral tasks and found significant associations between cognitive functions and MCIAT-established representations, which verifies the interpretability of our proposed framework.
LGMay 21, 2025Code
Harnessing On-Device Large Language Model: Empirical Results and Implications for AI PCQingyu Song, Peiyu Liao, Wenqian Zhao et al.
The increasing deployment of Large Language Models (LLMs) on edge devices, driven by model advancements and hardware improvements, offers significant privacy benefits. However, these on-device LLMs inherently face performance limitations due to reduced model capacity and necessary compression techniques. To address this, we introduce a systematic methodology -- encompassing model capability, development efficiency, and system resources -- for evaluating on-device LLMs. Our comprehensive evaluation, encompassing models from 0.5B to 14B parameters and seven post-training quantization (PTQ) methods on commodity laptops, yields several critical insights: 1) System-level metrics exhibit near-linear scaling with effective bits-per-weight (BPW). 2) A practical threshold exists around $\sim$3.5 effective BPW, larger models subjected to low-bit quantization consistently outperform smaller models utilizing higher bit-precision. 3) Quantization with low BPW incurs marginal accuracy loss but significant memory savings. 4) Determined by low-level implementation specifics power consumption on CPU, where computation-intensive operations spend more power than memory-intensive ones. These findings offer crucial insights and practical guidelines for the efficient deployment and optimized configuration of LLMs on resource-constrained edge devices. Our codebase is available at https://github.com/simmonssong/LLMOnDevice.
IVJan 20Code
Unsupervised Deformable Image Registration with Local-Global Attention and Image DecompositionZhengyong Huang, Xingwen Sun, Xuting Chang et al.
Deformable image registration is a critical technology in medical image analysis, with broad applications in clinical practice such as disease diagnosis, multi-modal fusion, and surgical navigation. Traditional methods often rely on iterative optimization, which is computationally intensive and lacks generalizability. Recent advances in deep learning have introduced attention-based mechanisms that improve feature alignment, yet accurately registering regions with high anatomical variability remains challenging. In this study, we proposed a novel unsupervised deformable image registration framework, LGANet++, which employs a novel local-global attention mechanism integrated with a unique technique for feature interaction and fusion to enhance registration accuracy, robustness, and generalizability. We evaluated our approach using five publicly available datasets, representing three distinct registration scenarios: cross-patient, cross-time, and cross-modal CT-MR registration. The results demonstrated that our approach consistently outperforms several state-of-the-art registration methods, improving registration accuracy by 1.39% in cross-patient registration, 0.71% in cross-time registration, and 6.12% in cross-modal CT-MR registration tasks. These results underscore the potential of LGANet++ to support clinical workflows requiring reliable and efficient image registration. The source code is available at https://github.com/huangzyong/LGANet-Registration.
CLOct 22, 2025Code
MINED: Probing and Updating with Multimodal Time-Sensitive Knowledge for Large Multimodal ModelsKailin Jiang, Ning Jiang, Yuntao Du et al.
Large Multimodal Models (LMMs) encode rich factual knowledge via cross-modal pre-training, yet their static representations struggle to maintain an accurate understanding of time-sensitive factual knowledge. Existing benchmarks remain constrained by static designs, inadequately evaluating LMMs' ability to understand time-sensitive knowledge. To address this gap, we propose MINED, a comprehensive benchmark that evaluates temporal awareness along 6 key dimensions and 11 challenging tasks: cognition, awareness, trustworthiness, understanding, reasoning, and robustness. MINED is constructed from Wikipedia by two professional annotators, containing 2,104 time-sensitive knowledge samples spanning six knowledge types. Evaluating 15 widely used LMMs on MINED shows that Gemini-2.5-Pro achieves the highest average CEM score of 63.07, while most open-source LMMs still lack time understanding ability. Meanwhile, LMMs perform best on organization knowledge, whereas their performance is weakest on sport. To address these challenges, we investigate the feasibility of updating time-sensitive knowledge in LMMs through knowledge editing methods and observe that LMMs can effectively update knowledge via knowledge editing methods in single editing scenarios.
CVJun 23, 2025Code
Referring Expression Instance Retrieval and A Strong End-to-End BaselineXiangzhao Hao, Kuan Zhu, Hongyu Guo et al.
Using natural language to query visual information is a fundamental need in real-world applications. Text-Image Retrieval (TIR) retrieves a target image from a gallery based on an image-level description, while Referring Expression Comprehension (REC) localizes a target object within a given image using an instance-level description. However, real-world applications often present more complex demands. Users typically query an instance-level description across a large gallery and expect to receive both relevant image and the corresponding instance location. In such scenarios, TIR struggles with fine-grained descriptions and object-level localization, while REC is limited in its ability to efficiently search large galleries and lacks an effective ranking mechanism. In this paper, we introduce a new task called \textbf{Referring Expression Instance Retrieval (REIR)}, which supports both instance-level retrieval and localization based on fine-grained referring expressions. First, we propose a large-scale benchmark for REIR, named REIRCOCO, constructed by prompting advanced vision-language models to generate high-quality referring expressions for instances in the MSCOCO and RefCOCO datasets. Second, we present a baseline method, Contrastive Language-Instance Alignment with Relation Experts (CLARE), which employs a dual-stream architecture to address REIR in an end-to-end manner. Given a referring expression, the textual branch encodes it into a query embedding. The visual branch detects candidate objects and extracts their instance-level visual features. The most similar candidate to the query is selected for bounding box prediction. CLARE is first trained on object detection and REC datasets to establish initial grounding capabilities, then optimized via Contrastive Language-Instance Alignment (CLIA) for improved retrieval across images. We will release our code and benchmark publicly.
CVSep 9, 2021Code
PIMNet: A Parallel, Iterative and Mimicking Network for Scene Text RecognitionZhi Qiao, Yu Zhou, Jin Wei et al.
Nowadays, scene text recognition has attracted more and more attention due to its various applications. Most state-of-the-art methods adopt an encoder-decoder framework with attention mechanism, which generates text autoregressively from left to right. Despite the convincing performance, the speed is limited because of the one-by-one decoding strategy. As opposed to autoregressive models, non-autoregressive models predict the results in parallel with a much shorter inference time, but the accuracy falls behind the autoregressive counterpart considerably. In this paper, we propose a Parallel, Iterative and Mimicking Network (PIMNet) to balance accuracy and efficiency. Specifically, PIMNet adopts a parallel attention mechanism to predict the text faster and an iterative generation mechanism to make the predictions more accurate. In each iteration, the context information is fully explored. To improve learning of the hidden layer, we exploit the mimicking learning in the training phase, where an additional autoregressive decoder is adopted and the parallel decoder mimics the autoregressive decoder with fitting outputs of the hidden layer. With the shared backbone between the two decoders, the proposed PIMNet can be trained end-to-end without pre-training. During inference, the branch of the autoregressive decoder is removed for a faster speed. Extensive experiments on public benchmarks demonstrate the effectiveness and efficiency of PIMNet. Our code will be available at https://github.com/Pay20Y/PIMNet.
CVDec 5, 2023
PMMTalk: Speech-Driven 3D Facial Animation from Complementary Pseudo Multi-modal FeaturesTianshun Han, Shengnan Gui, Yiqing Huang et al.
Speech-driven 3D facial animation has improved a lot recently while most related works only utilize acoustic modality and neglect the influence of visual and textual cues, leading to unsatisfactory results in terms of precision and coherence. We argue that visual and textual cues are not trivial information. Therefore, we present a novel framework, namely PMMTalk, using complementary Pseudo Multi-Modal features for improving the accuracy of facial animation. The framework entails three modules: PMMTalk encoder, cross-modal alignment module, and PMMTalk decoder. Specifically, the PMMTalk encoder employs the off-the-shelf talking head generation architecture and speech recognition technology to extract visual and textual information from speech, respectively. Subsequently, the cross-modal alignment module aligns the audio-image-text features at temporal and semantic levels. Then PMMTalk decoder is employed to predict lip-syncing facial blendshape coefficients. Contrary to prior methods, PMMTalk only requires an additional random reference face image but yields more accurate results. Additionally, it is artist-friendly as it seamlessly integrates into standard animation production workflows by introducing facial blendshape coefficients. Finally, given the scarcity of 3D talking face datasets, we introduce a large-scale 3D Chinese Audio-Visual Facial Animation (3D-CAVFA) dataset. Extensive experiments and user studies show that our approach outperforms the state of the art. We recommend watching the supplementary video.
LGJan 7, 2025
AADNet: Exploring EEG Spatiotemporal Information for Fast and Accurate Orientation and Timbre Detection of Auditory Attention Based on A Cue-Masked ParadigmKeren Shi, Xu Liu, Xue Yuan et al.
Auditory attention decoding from electroencephalogram (EEG) could infer to which source the user is attending in noisy environments. Decoding algorithms and experimental paradigm designs are crucial for the development of technology in practical applications. To simulate real-world scenarios, this study proposed a cue-masked auditory attention paradigm to avoid information leakage before the experiment. To obtain high decoding accuracy with low latency, an end-to-end deep learning model, AADNet, was proposed to exploit the spatiotemporal information from the short time window of EEG signals. The results showed that with a 0.5-second EEG window, AADNet achieved an average accuracy of 93.46% and 91.09% in decoding auditory orientation attention (OA) and timbre attention (TA), respectively. It significantly outperformed five previous methods and did not need the knowledge of the original audio source. This work demonstrated that it was possible to detect the orientation and timbre of auditory attention from EEG signals fast and accurately. The results are promising for the real-time multi-property auditory attention decoding, facilitating the application of the neuro-steered hearing aids and other assistive listening devices.
CLMay 30, 2025
When Large Multimodal Models Confront Evolving Knowledge:Challenges and PathwaysKailin Jiang, Yuntao Du, Yukai Ding et al.
Large language/multimodal models (LLMs/LMMs) store extensive pre-trained knowledge but struggle to maintain consistency with real-world updates, making it difficult to avoid catastrophic forgetting while acquiring evolving knowledge. Previous work focused on constructing textual knowledge datasets and exploring knowledge injection in LLMs, lacking exploration of multimodal evolving knowledge injection in LMMs. To address this, we propose the EVOKE benchmark to evaluate LMMs' ability to inject multimodal evolving knowledge in real-world scenarios. Meanwhile, a comprehensive evaluation of multimodal evolving knowledge injection revealed two challenges: (1) Existing knowledge injection methods perform terribly on evolving knowledge. (2) Supervised fine-tuning causes catastrophic forgetting, particularly instruction following ability is severely compromised. Additionally, we provide pathways and find that: (1) Text knowledge augmentation during the training phase improves performance, while image augmentation cannot achieve it. (2) Continual learning methods, especially Replay and MoELoRA, effectively mitigate forgetting. Our findings indicate that current knowledge injection methods have many limitations on evolving knowledge, which motivates further research on more efficient and stable knowledge injection methods.
CVDec 9, 2023
R2-Talker: Realistic Real-Time Talking Head Synthesis with Hash Grid Landmarks Encoding and Progressive Multilayer ConditioningZhiling Ye, LiangGuo Zhang, Dingheng Zeng et al.
Dynamic NeRFs have recently garnered growing attention for 3D talking portrait synthesis. Despite advances in rendering speed and visual quality, challenges persist in enhancing efficiency and effectiveness. We present R2-Talker, an efficient and effective framework enabling realistic real-time talking head synthesis. Specifically, using multi-resolution hash grids, we introduce a novel approach for encoding facial landmarks as conditional features. This approach losslessly encodes landmark structures as conditional features, decoupling input diversity, and conditional spaces by mapping arbitrary landmarks to a unified feature space. We further propose a scheme of progressive multilayer conditioning in the NeRF rendering pipeline for effective conditional feature fusion. Our new approach has the following advantages as demonstrated by extensive experiments compared with the state-of-the-art works: 1) The lossless input encoding enables acquiring more precise features, yielding superior visual quality. The decoupling of inputs and conditional spaces improves generalizability. 2) The fusing of conditional features and MLP outputs at each MLP layer enhances conditional impact, resulting in more accurate lip synthesis and better visual quality. 3) It compactly structures the fusion of conditional features, significantly enhancing computational efficiency.
CLOct 22, 2025
KORE: Enhancing Knowledge Injection for Large Multimodal Models via Knowledge-Oriented Augmentations and ConstraintsKailin Jiang, Hongbo Jiang, Ning Jiang et al.
Large Multimodal Models encode extensive factual knowledge in their pre-trained weights. However, its knowledge remains static and limited, unable to keep pace with real-world developments, which hinders continuous knowledge acquisition. Effective knowledge injection thus becomes critical, involving two goals: knowledge adaptation (injecting new knowledge) and knowledge retention (preserving old knowledge). Existing methods often struggle to learn new knowledge and suffer from catastrophic forgetting. To address this, we propose KORE, a synergistic method of KnOwledge-oRientEd augmentations and constraints for injecting new knowledge into large multimodal models while preserving old knowledge. Unlike general text or image data augmentation, KORE automatically converts individual knowledge items into structured and comprehensive knowledge to ensure that the model accurately learns new knowledge, enabling accurate adaptation. Meanwhile, KORE stores previous knowledge in the covariance matrix of LMM's linear layer activations and initializes the adapter by projecting the original weights into the matrix's null space, defining a fine-tuning direction that minimizes interference with previous knowledge, enabling powerful retention. Extensive experiments on various LMMs, including LLaVA-v1.5-7B, LLaVA-v1.5-13B, and Qwen2.5-VL-7B, show that KORE achieves superior new knowledge injection performance and effectively mitigates catastrophic forgetting.
IVJan 20
Partial Decoder Attention Network with Contour-weighted Loss Function for Data-Imbalance Medical Image SegmentationZhengyong Huang, Ning Jiang, Xingwen Sun et al.
Image segmentation is pivotal in medical image analysis, facilitating clinical diagnosis, treatment planning, and disease evaluation. Deep learning has significantly advanced automatic segmentation methodologies by providing superior modeling capability for complex structures and fine-grained anatomical regions. However, medical images often suffer from data imbalance issues, such as large volume disparities among organs or tissues, and uneven sample distributions across different anatomical structures. This imbalance tends to bias the model toward larger organs or more frequently represented structures, while overlooking smaller or less represented structures, thereby affecting the segmentation accuracy and robustness. To address these challenges, we proposed a novel contour-weighted segmentation approach, which improves the model's capability to represent small and underrepresented structures. We developed PDANet, a lightweight and efficient segmentation network based on a partial decoder mechanism. We evaluated our method using three prominent public datasets. The experimental results show that our methodology excelled in three distinct tasks: segmenting multiple abdominal organs, brain tumors, and pelvic bone fragments with injuries. It consistently outperformed nine state-of-the-art methods. Moreover, the proposed contour-weighted strategy improved segmentation for other comparison methods across the three datasets, yielding average enhancements in Dice scores of 2.32%, 1.67%, and 3.60%, respectively. These results demonstrate that our contour-weighted segmentation method surpassed current leading approaches in both accuracy and robustness. As a model-independent strategy, it can seamlessly fit various segmentation frameworks, enhancing their performance. This flexibility highlighted its practical importance and potential for broad use in medical image analysis.
IROct 24, 2025
Massive Memorization with Hundreds of Trillions of Parameters for Sequential Transducer Generative RecommendersZhimin Chen, Chenyu Zhao, Ka Chun Mo et al.
Modern large-scale recommendation systems rely heavily on user interaction history sequences to enhance the model performance. The advent of large language models and sequential modeling techniques, particularly transformer-like architectures, has led to significant advancements recently (e.g., HSTU, SIM, and TWIN models). While scaling to ultra-long user histories (10k to 100k items) generally improves model performance, it also creates significant challenges on latency, queries per second (QPS) and GPU cost in industry-scale recommendation systems. Existing models do not adequately address these industrial scalability issues. In this paper, we propose a novel two-stage modeling framework, namely VIrtual Sequential Target Attention (VISTA), which decomposes traditional target attention from a candidate item to user history items into two distinct stages: (1) user history summarization into a few hundred tokens; followed by (2) candidate item attention to those tokens. These summarization token embeddings are then cached in storage system and then utilized as sequence features for downstream model training and inference. This novel design for scalability enables VISTA to scale to lifelong user histories (up to one million items) while keeping downstream training and inference costs fixed, which is essential in industry. Our approach achieves significant improvements in offline and online metrics and has been successfully deployed on an industry leading recommendation platform serving billions of users.
CLAug 15, 2025
Cross-Granularity Hypergraph Retrieval-Augmented Generation for Multi-hop Question AnsweringChangjian Wang, Weihong Deng, Weili Guan et al.
Multi-hop question answering (MHQA) requires integrating knowledge scattered across multiple passages to derive the correct answer. Traditional retrieval-augmented generation (RAG) methods primarily focus on coarse-grained textual semantic similarity and ignore structural associations among dispersed knowledge, which limits their effectiveness in MHQA tasks. GraphRAG methods address this by leveraging knowledge graphs (KGs) to capture structural associations, but they tend to overly rely on structural information and fine-grained word- or phrase-level retrieval, resulting in an underutilization of textual semantics. In this paper, we propose a novel RAG approach called HGRAG for MHQA that achieves cross-granularity integration of structural and semantic information via hypergraphs. Structurally, we construct an entity hypergraph where fine-grained entities serve as nodes and coarse-grained passages as hyperedges, and establish knowledge association through shared entities. Semantically, we design a hypergraph retrieval method that integrates fine-grained entity similarity and coarse-grained passage similarity via hypergraph diffusion. Finally, we employ a retrieval enhancement module, which further refines the retrieved results both semantically and structurally, to obtain the most relevant passages as context for answer generation with the LLM. Experimental results on benchmark datasets demonstrate that our approach outperforms state-of-the-art methods in QA performance, and achieves a 6$\times$ speedup in retrieval efficiency.
CVJan 23, 2025
Device-aware Optical Adversarial Attack for a Portable Projector-camera SystemNing Jiang, Yanhong Liu, Dingheng Zeng et al.
Deep-learning-based face recognition (FR) systems are susceptible to adversarial examples in both digital and physical domains. Physical attacks present a greater threat to deployed systems as adversaries can easily access the input channel, allowing them to provide malicious inputs to impersonate a victim. This paper addresses the limitations of existing projector-camera-based adversarial light attacks in practical FR setups. By incorporating device-aware adaptations into the digital attack algorithm, such as resolution-aware and color-aware adjustments, we mitigate the degradation from digital to physical domains. Experimental validation showcases the efficacy of our proposed algorithm against real and spoof adversaries, achieving high physical similarity scores in FR models and state-of-the-art commercial systems. On average, there is only a 14% reduction in scores from digital to physical attacks, with high attack success rate in both white- and black-box scenarios.
PEJun 17, 2024
Modeling, Inference, and Prediction in Mobility-Based Compartmental Models for EpidemiologyNing Jiang, Weiqi Chu, Yao Li
Classical compartmental models in epidemiology often assume a homogeneous population for simplicity, which neglects the inherent heterogeneity among individuals. This assumption frequently leads to inaccurate predictions when applied to real-world data. For example, evidence has shown that classical models overestimate the final pandemic size in the H1N1-2009 and COVID-19 outbreaks. To address this issue, we introduce individual mobility as a key factor in disease transmission and control. We characterize disease dynamics using mobility distribution functions for each compartment and propose a mobility-based compartmental model that incorporates population heterogeneity. Our results demonstrate that, for the same basic reproduction number, our mobility-based model predicts a smaller final pandemic size compared to the classical models, effectively addressing the common overestimation problem. Additionally, we infer mobility distributions from the time series of the infected population. We provide sufficient conditions for uniquely identifying the mobility distribution from a dataset and propose a machine-learning-based approach to learn mobility from both synthesized and real-world data.
CLMar 19, 2024
Factorized Learning Assisted with Large Language Model for Gloss-free Sign Language TranslationZhigang Chen, Benjia Zhou, Jun Li et al.
Previous Sign Language Translation (SLT) methods achieve superior performance by relying on gloss annotations. However, labeling high-quality glosses is a labor-intensive task, which limits the further development of SLT. Although some approaches work towards gloss-free SLT through jointly training the visual encoder and translation network, these efforts still suffer from poor performance and inefficient use of the powerful Large Language Model (LLM). Most seriously, we find that directly introducing LLM into SLT will lead to insufficient learning of visual representations as LLM dominates the learning curve. To address these problems, we propose Factorized Learning assisted with Large Language Model (FLa-LLM) for gloss-free SLT. Concretely, we factorize the training process into two stages. In the visual initialing stage, we employ a lightweight translation model after the visual encoder to pre-train the visual encoder. In the LLM fine-tuning stage, we freeze the acquired knowledge in the visual encoder and integrate it with a pre-trained LLM to inspire the LLM's translation potential. This factorized training strategy proves to be highly effective as evidenced by significant improvements achieved across three SLT datasets which are all conducted under the gloss-free setting.
CRJan 4, 2022
Open Access Dataset for Electromyography based Multi-code Biometric AuthenticationAshirbad Pradhan, Jiayuan He, Ning Jiang
Recently, surface electromyogram (EMG) has been proposed as a novel biometric trait for addressing some key limitations of current biometrics, such as spoofing and liveness. The EMG signals possess a unique characteristic: they are inherently different for individuals (biometrics), and they can be customized to realize multi-length codes or passwords (for example, by performing different gestures). However, current EMG-based biometric research has two critical limitations: 1) a small subject pool, compared to other more established biometric traits, and 2) limited to single-session or single-day data sets. In this study, forearm and wrist EMG data were collected from 43 participants over three different days with long separation while they performed static hand and wrist gestures. The multi-day biometric authentication resulted in a median EER of 0.017 for the forearm setup and 0.025 for the wrist setup, comparable to well-established biometric traits suggesting consistent performance over multiple days. The presented large-sample multi-day data set and findings could facilitate further research on EMG-based biometrics and other gesture recognition-based applications.
CVOct 25, 2021
TPSNet: Reverse Thinking of Thin Plate Splines for Arbitrary Shape Scene Text RepresentationWei Wang, Yu Zhou, Jiahao Lv et al.
The research focus of scene text detection and recognition has shifted to arbitrary shape text in recent years, where the text shape representation is a fundamental problem. An ideal representation should be compact, complete, efficient, and reusable for subsequent recognition in our opinion. However, previous representations have flaws in one or more aspects. Thin-Plate-Spline (TPS) transformation has achieved great success in scene text recognition. Inspired by this, we reversely think of its usage and sophisticatedly take TPS as an exquisite representation for arbitrary shape text representation. The TPS representation is compact, complete, and efficient. With the predicted TPS parameters, the detected text region can be directly rectified to a near-horizontal one to assist the subsequent recognition. To further exploit the potential of the TPS representation, the Border Alignment Loss is proposed. Based on these designs, we implement the text detector TPSNet, which can be extended to a text spotter conveniently. Extensive evaluation and ablation of several public benchmarks demonstrate the effectiveness and superiority of the proposed method for text representation and spotting. Particularly, TPSNet achieves the detection F-Measure improvement of 4.4\% (78.4\% vs. 74.0\%) on Art dataset and the end-to-end spotting F-Measure improvement of 5.0\% (78.5\% vs. 73.5\%) on Total-Text, which are large margins with no bells and whistles.
CVSep 16, 2021
Dense Semantic Contrast for Self-Supervised Visual Representation LearningXiaoni Li, Yu Zhou, Yifei Zhang et al.
Self-supervised representation learning for visual pre-training has achieved remarkable success with sample (instance or pixel) discrimination and semantics discovery of instance, whereas there still exists a non-negligible gap between pre-trained model and downstream dense prediction tasks. Concretely, these downstream tasks require more accurate representation, in other words, the pixels from the same object must belong to a shared semantic category, which is lacking in the previous methods. In this work, we present Dense Semantic Contrast (DSC) for modeling semantic category decision boundaries at a dense level to meet the requirement of these tasks. Furthermore, we propose a dense cross-image semantic contrastive learning framework for multi-granularity representation learning. Specially, we explicitly explore the semantic structure of the dataset by mining relations among pixels from different perspectives. For intra-image relation modeling, we discover pixel neighbors from multiple views. And for inter-image relations, we enforce pixel representation from the same semantic class to be more similar than the representation from different classes in one mini-batch. Experimental results show that our DSC model outperforms state-of-the-art methods when transferring to downstream dense prediction tasks, including object detection, semantic segmentation, and instance segmentation. Code will be made available.
CVSep 8, 2021
Mask is All You Need: Rethinking Mask R-CNN for Dense and Arbitrary-Shaped Scene Text DetectionXugong Qin, Yu Zhou, Youhui Guo et al.
Due to the large success in object detection and instance segmentation, Mask R-CNN attracts great attention and is widely adopted as a strong baseline for arbitrary-shaped scene text detection and spotting. However, two issues remain to be settled. The first is dense text case, which is easy to be neglected but quite practical. There may exist multiple instances in one proposal, which makes it difficult for the mask head to distinguish different instances and degrades the performance. In this work, we argue that the performance degradation results from the learning confusion issue in the mask head. We propose to use an MLP decoder instead of the "deconv-conv" decoder in the mask head, which alleviates the issue and promotes robustness significantly. And we propose instance-aware mask learning in which the mask head learns to predict the shape of the whole instance rather than classify each pixel to text or non-text. With instance-aware mask learning, the mask branch can learn separated and compact masks. The second is that due to large variations in scale and aspect ratio, RPN needs complicated anchor settings, making it hard to maintain and transfer across different datasets. To settle this issue, we propose an adaptive label assignment in which all instances especially those with extreme aspect ratios are guaranteed to be associated with enough anchors. Equipped with these components, the proposed method named MAYOR achieves state-of-the-art performance on five benchmarks including DAST1500, MSRA-TD500, ICDAR2015, CTW1500, and Total-Text.
AIOct 15, 2020
GMH: A General Multi-hop Reasoning Model for KG CompletionYao Zhang, Hongru Liang, Adam Jatowt et al.
Knowledge graphs are essential for numerous downstream natural language processing applications, but are typically incomplete with many facts missing. This results in research efforts on multi-hop reasoning task, which can be formulated as a search process and current models typically perform short distance reasoning. However, the long-distance reasoning is also vital with the ability to connect the superficially unrelated entities. To the best of our knowledge, there lacks a general framework that approaches multi-hop reasoning in mixed long-short distance reasoning scenarios. We argue that there are two key issues for a general multi-hop reasoning model: i) where to go, and ii) when to stop. Therefore, we propose a general model which resolves the issues with three modules: 1) the local-global knowledge module to estimate the possible paths, 2) the differentiated action dropout module to explore a diverse set of paths, and 3) the adaptive stopping search module to avoid over searching. The comprehensive results on three datasets demonstrate the superiority of our model with significant improvements against baselines in both short and long distance reasoning scenarios.
SPMay 30, 2019
Simultaneous induction of SSMVEP and SMR Using a Gaiting video stimulus: a novel hybrid brain-computer interfaceXin Zhang, Guanghua Xu, Aravind Ravi et al.
We proposed a novel visual stimulus for brain-computer interface. The stimulus is in the form gaiting sequence of a human. The hypothesis is that observing such a visual stimulus would simultaneously induce 1) steady-state motion visual evoked potential (SSMVEP) in the occipital area, similarly to an SSVEP stimulus; and 2) sensorimotor rhythm (SMR) in the primary sensorimotor area, because such action observation (AO) could activate the mirror neuron system. Canonical correlation analysis (CCA) was used to detect SSMVEP from occipital EEG, and event-related spectral perturbations (ERSP) were used to identify SMR in the EEG from the sensorimotor area. The results showed that the proposed visual gaiting stimulus-induced SSMVEP, with classification accuracies of 88.9 $\pm$ 12.0% in a four-class scenario. More importantly, it induced clear and sustained event-related desynchronization/synchronization (ERD/ERS) in the EEG from the sensorimotor area, while no ERD/ERS in the sensorimotor area could be observed when the other two SSVEP stimuli were used. Further, for participants with sufficiently clear SSMVEP pattern (classification accuracy > 85%), the ERD index values in mu-beta band induced by the proposed gaiting stimulus were statistically different from that of the other two types of stimulus. Therefore, a novel BCI based on the proposed stimulus has potential in neurorehabilitation applications because it simultaneously has the high accuracy of an SSMVEP (~90% accuracy in a four-class setup) and the ability to activate sensorimotor cortex. And such potential will be further explored in future studies.