Lei Zhang

h-index37

18papers

5,941citations

Novelty53%

AI Score47

Ranked #32,219 of 194,257 authors (top 17%)#11,507 in CV (top 19%)

18 Papers

5.2CVJul 23, 2024Code

SOAP: Enhancing Spatio-Temporal Relation and Motion Information Capturing for Few-Shot Action Recognition

Wenbo Huang, Jinghui Zhang, Xuwei Qian et al.

High frame-rate (HFR) videos of action recognition improve fine-grained expression while reducing the spatio-temporal relation and motion information density. Thus, large amounts of video samples are continuously required for traditional data-driven training. However, samples are not always sufficient in real-world scenarios, promoting few-shot action recognition (FSAR) research. We observe that most recent FSAR works build spatio-temporal relation of video samples via temporal alignment after spatial feature extraction, cutting apart spatial and temporal features within samples. They also capture motion information via narrow perspectives between adjacent frames without considering density, leading to insufficient motion information capturing. Therefore, we propose a novel plug-and-play architecture for FSAR called Spatio-tempOral frAme tuPle enhancer (SOAP) in this paper. The model we designed with such architecture refers to SOAP-Net. Temporal connections between different feature channels and spatio-temporal relation of features are considered instead of simple feature extraction. Comprehensive motion information is also captured, using frame tuples with multiple frames containing more motion information than adjacent frames. Combining frame tuples of diverse frame counts further provides a broader perspective. SOAP-Net achieves new state-of-the-art performance across well-known benchmarks such as SthSthV2, Kinetics, UCF101, and HMDB51. Extensive empirical evaluations underscore the competitiveness, pluggability, generalization, and robustness of SOAP. The code is released at https://github.com/wenbohuang1002/SOAP.

5.5CLMar 21, 2024Code

MMIDR: Teaching Large Language Model to Interpret Multimodal Misinformation via Knowledge Distillation

Longzheng Wang, Xiaohan Xu, Lei Zhang et al.

Automatic detection of multimodal misinformation has gained a widespread attention recently. However, the potential of powerful Large Language Models (LLMs) for multimodal misinformation detection remains underexplored. Besides, how to teach LLMs to interpret multimodal misinformation in cost-effective and accessible way is still an open question. To address that, we propose MMIDR, a framework designed to teach LLMs in providing fluent and high-quality textual explanations for their decision-making process of multimodal misinformation. To convert multimodal misinformation into an appropriate instruction-following format, we present a data augmentation perspective and pipeline. This pipeline consists of a visual information processing module and an evidence retrieval module. Subsequently, we prompt the proprietary LLMs with processed contents to extract rationales for interpreting the authenticity of multimodal misinformation. Furthermore, we design an efficient knowledge distillation approach to distill the capability of proprietary LLMs in explaining multimodal misinformation into open-source LLMs. To explore several research questions regarding the performance of LLMs in multimodal misinformation detection tasks, we construct an instruction-following multimodal misinformation dataset and conduct comprehensive experiments. The experimental findings reveal that our MMIDR exhibits sufficient detection performance and possesses the capacity to provide compelling rationales to support its assessments.

3.3SPApr 13, 2020Code

Sequential Weakly Labeled Multi-Activity Localization and Recognition on Wearable Sensors using Recurrent Attention Networks

Kun Wang, Jun He, Lei Zhang

With the popularity and development of the wearable devices such as smartphones, human activity recognition (HAR) based on sensors has become as a key research area in human computer interaction and ubiquitous computing. The emergence of deep learning leads to a recent shift in the research of HAR, which requires massive strictly labeled data. In comparison with video data, activity data recorded from accelerometer or gyroscope is often more difficult to interpret and segment. Recently, several attention mechanisms are proposed to handle the weakly labeled human activity data, which do not require accurate data annotation. However, these attention-based models can only handle the weakly labeled dataset whose sample includes one target activity, as a result it limits efficiency and practicality. In the paper, we propose a recurrent attention networks (RAN) to handle sequential weakly labeled multi-activity recognition and location tasks. The model can repeatedly perform steps of attention on multiple activities of one sample and each step is corresponding to the current focused activity. The effectiveness of the RAN model is validated on a collected sequential weakly labeled multi-activity dataset and the other two public datasets. The experiment results show that our RAN model can simultaneously infer multi-activity types from the coarse-grained sequential weak labels and determine specific locations of every target activity with only knowledge of which types of activities contained in the long sequence. It will greatly reduce the burden of manual labeling. The code of our work is available at https://github.com/KennCoder7/RAN.

7.1LGNov 12, 2018Code

Learning data augmentation policies using augmented random search

Mingyang Geng, Kele Xu, Bo Ding et al.

Previous attempts for data augmentation are designed manually, and the augmentation policies are dataset-specific. Recently, an automatic data augmentation approach, named AutoAugment, is proposed using reinforcement learning. AutoAugment searches for the augmentation polices in the discrete search space, which may lead to a sub-optimal solution. In this paper, we employ the Augmented Random Search method (ARS) to improve the performance of AutoAugment. Our key contribution is to change the discrete search space to continuous space, which will improve the searching performance and maintain the diversities between sub-policies. With the proposed method, state-of-the-art accuracies are achieved on CIFAR-10, CIFAR-100, and ImageNet (without additional data). Our code is available at https://github.com/gmy2013/ARS-Aug.

6.5CVDec 10, 2024

Manta: Enhancing Mamba for Few-Shot Action Recognition of Long Sub-Sequence

Wenbo Huang, Jinghui Zhang, Guang Li et al.

In few-shot action recognition (FSAR), long sub-sequences of video naturally express entire actions more effectively. However, the high computational complexity of mainstream Transformer-based methods limits their application. Recent Mamba demonstrates efficiency in modeling long sequences, but directly applying Mamba to FSAR overlooks the importance of local feature modeling and alignment. Moreover, long sub-sequences within the same class accumulate intra-class variance, which adversely impacts FSAR performance. To solve these challenges, we propose a Matryoshka MAmba and CoNtrasTive LeArning framework (Manta). Firstly, the Matryoshka Mamba introduces multiple Inner Modules to enhance local feature representation, rather than directly modeling global features. An Outer Module captures dependencies of timeline between these local features for implicit temporal alignment. Secondly, a hybrid contrastive learning paradigm, combining both supervised and unsupervised methods, is designed to mitigate the negative effects of intra-class variance accumulation. The Matryoshka Mamba and the hybrid contrastive learning paradigm operate in two parallel branches within Manta, enhancing Mamba for FSAR of long sub-sequence. Manta achieves new state-of-the-art performance on prominent benchmarks, including SSv2, Kinetics, UCF101, and HMDB51. Extensive empirical studies prove that Manta significantly improves FSAR of long sub-sequence from multiple perspectives.

6.5CVMar 4, 2024

DragTex: Generative Point-Based Texture Editing on 3D Mesh

Yudi Zhang, Qi Xu, Lei Zhang

Creating 3D textured meshes using generative artificial intelligence has garnered significant attention recently. While existing methods support text-based generative texture generation or editing on 3D meshes, they often struggle to precisely control pixels of texture images through more intuitive interaction. While 2D images can be edited generatively using drag interaction, applying this type of methods directly to 3D mesh textures still leads to issues such as the lack of local consistency among multiple views, error accumulation and long training times. To address these challenges, we propose a generative point-based 3D mesh texture editing method called DragTex. This method utilizes a diffusion model to blend locally inconsistent textures in the region near the deformed silhouette between different views, enabling locally consistent texture editing. Besides, we fine-tune a decoder to reduce reconstruction errors in the non-drag region, thereby mitigating overall error accumulation. Moreover, we train LoRA using multi-view images instead of training each view individually, which significantly shortens the training time. The experimental results show that our method effectively achieves dragging textures on 3D meshes and generates plausible textures that align with the desired intent of drag interaction.

3.6CVAug 2, 2025

3DRot: 3D Rotation Augmentation for RGB-Based 3D Tasks

Shitian Yang, Deyu Li, Xiaoke Jiang et al.

RGB-based 3D tasks, e.g., 3D detection, depth estimation, 3D keypoint estimation, still suffer from scarce, expensive annotations and a thin augmentation toolbox, since most image transforms, including resize and rotation, disrupt geometric consistency. In this paper, we introduce 3DRot, a plug-and-play augmentation that rotates and mirrors images about the camera's optical center while synchronously updating RGB images, camera intrinsics, object poses, and 3D annotations to preserve projective geometry-achieving geometry-consistent rotations and reflections without relying on any scene depth. We validate 3DRot with a classical 3D task, monocular 3D detection. On SUN RGB-D dataset, 3DRot raises $IoU_{3D}$ from 43.21 to 44.51, cuts rotation error (ROT) from 22.91$^\circ$ to 20.93$^\circ$, and boosts $mAP_{0.5}$ from 35.70 to 38.11. As a comparison, Cube R-CNN adds 3 other datasets together with SUN RGB-D for monocular 3D estimation, with a similar mechanism and test dataset, increases $IoU_{3D}$ from 36.2 to 37.8, boosts $mAP_{0.5}$ from 34.7 to 35.4. Because it operates purely through camera-space transforms, 3DRot is readily transferable to other 3D tasks.

8.4CVJul 5, 2025

Deconfounding Causal Inference through Two-Branch Framework with Early-Forking for Sensor-Based Cross-Domain Activity Recognition

Di Xiong, Lei Zhang, Shuoyuan Wang et al.

Recently, domain generalization (DG) has emerged as a promising solution to mitigate distribution-shift issue in sensor-based human activity recognition (HAR) scenario. However, most existing DG-based works have merely focused on modeling statistical dependence between sensor data and activity labels, neglecting the importance of intrinsic casual mechanism. Intuitively, every sensor input can be viewed as a mixture of causal (category-aware) and non-causal factors (domain-specific), where only the former affects activity classification judgment. In this paper, by casting such DG-based HAR as a casual inference problem, we propose a causality-inspired representation learning algorithm for cross-domain activity recognition. To this end, an early-forking two-branch framework is designed, where two separate branches are respectively responsible for learning casual and non-causal features, while an independence-based Hilbert-Schmidt Information Criterion is employed to implicitly disentangling them. Additionally, an inhomogeneous domain sampling strategy is designed to enhance disentanglement, while a category-aware domain perturbation layer is performed to prevent representation collapse. Extensive experiments on several public HAR benchmarks demonstrate that our causality-inspired approach significantly outperforms eleven related state-of-the-art baselines under cross-person, cross-dataset, and cross-position settings. Detailed ablation and visualizations analyses reveal underlying casual mechanism, indicating its effectiveness, efficiency, and universality in cross-domain activity recognition scenario.

18.1CVJul 16, 2021

Conditional Directed Graph Convolution for 3D Human Pose Estimation

Wenbo Hu, Changgong Zhang, Fangneng Zhan et al.

Graph convolutional networks have significantly improved 3D human pose estimation by representing the human skeleton as an undirected graph. However, this representation fails to reflect the articulated characteristic of human skeletons as the hierarchical orders among the joints are not explicitly presented. In this paper, we propose to represent the human skeleton as a directed graph with the joints as nodes and bones as edges that are directed from parent joints to child joints. By so doing, the directions of edges can explicitly reflect the hierarchical relationships among the nodes. Based on this representation, we further propose a spatial-temporal conditional directed graph convolution to leverage varying non-local dependence for different poses by conditioning the graph topology on input poses. Altogether, we form a U-shaped network, named U-shaped Conditional Directed Graph Convolutional Network, for 3D human pose estimation from monocular videos. To evaluate the effectiveness of our method, we conducted extensive experiments on two challenging large-scale benchmarks: Human3.6M and MPI-INF-3DHP. Both quantitative and qualitative results show that our method achieves top performance. Also, ablation studies show that directed graphs can better exploit the hierarchy of articulated human skeletons than undirected graphs, and the conditional connections can yield adaptive graph topologies for different poses.

10.6CVDec 25, 2020

Self-supervised Pre-training with Hard Examples Improves Visual Representations

Chunyuan Li, Xiujun Li, Lei Zhang et al.

Self-supervised pre-training (SSP) employs random image transformations to generate training data for visual representation learning. In this paper, we first present a modeling framework that unifies existing SSP methods as learning to predict pseudo-labels. Then, we propose new data augmentation methods of generating training examples whose pseudo-labels are harder to predict than those generated via random image transformations. Specifically, we use adversarial training and CutMix to create hard examples (HEXA) to be used as augmented views for MoCo-v2 and DeepCluster-v2, leading to two variants HEXA_{MoCo} and HEXA_{DCluster}, respectively. In our experiments, we pre-train models on ImageNet and evaluate them on multiple public benchmarks. Our evaluation shows that the two new algorithm variants outperform their original counterparts, and achieve new state-of-the-art on a wide range of tasks where limited task supervision is available for fine-tuning. These results verify that hard examples are instrumental in improving the generalization of the pre-trained models.

5.0CVJun 5, 2020

Real-time Human Activity Recognition Using Conditionally Parametrized Convolutions on Mobile and Wearable Devices

Xin Cheng, Lei Zhang, Yin Tang et al.

Recently, deep learning has represented an important research trend in human activity recognition (HAR). In particular, deep convolutional neural networks (CNNs) have achieved state-of-the-art performance on various HAR datasets. For deep learning, improvements in performance have to heavily rely on increasing model size or capacity to scale to larger and larger datasets, which inevitably leads to the increase of operations. A high number of operations in deep leaning increases computational cost and is not suitable for real-time HAR using mobile and wearable sensors. Though shallow learning techniques often are lightweight, they could not achieve good performance. Therefore, deep learning methods that can balance the trade-off between accuracy and computation cost is highly needed, which to our knowledge has seldom been researched. In this paper, we for the first time propose a computation efficient CNN using conditionally parametrized convolution for real-time HAR on mobile and wearable devices. We evaluate the proposed method on four public benchmark HAR datasets consisting of WISDM dataset, PAMAP2 dataset, UNIMIB-SHAR dataset, and OPPORTUNITY dataset, achieving state-of-the-art accuracy without compromising computation cost. Various ablation experiments are performed to show how such a network with large capacity is clearly preferable to baseline while requiring a similar amount of operations. The method can be used as a drop-in replacement for the existing deep HAR architectures and easily deployed onto mobile and wearable devices for real-time HAR applications.

7.2CVMay 8, 2020

Layer-wise training convolutional neural networks with smaller filters for human activity recognition using wearable sensors

Yin Tang, Qi Teng, Lei Zhang et al.

Recently, convolutional neural networks (CNNs) have set latest state-of-the-art on various human activity recognition (HAR) datasets. However, deep CNNs often require more computing resources, which limits their applications in embedded HAR. Although many successful methods have been proposed to reduce memory and FLOPs of CNNs, they often involve special network architectures designed for visual tasks, which are not suitable for deep HAR tasks with time series sensor signals, due to remarkable discrepancy. Therefore, it is necessary to develop lightweight deep models to perform HAR. As filter is the basic unit in constructing CNNs, it deserves further research whether re-designing smaller filters is applicable for deep HAR. In the paper, inspired by the idea, we proposed a lightweight CNN using Lego filters for HAR. A set of lower-dimensional filters is used as Lego bricks to be stacked for conventional filters, which does not rely on any special network structure. The local loss function is used to train model. To our knowledge, this is the first paper that proposes lightweight CNN for HAR in ubiquitous and wearable computing arena. The experiment results on five public HAR datasets, UCI-HAR dataset, OPPORTUNITY dataset, UNIMIB-SHAR dataset, PAMAP2 dataset, and WISDM dataset collected from either smartphones or multiple sensor nodes, indicate that our novel Lego CNN with local loss can greatly reduce memory and computation cost over CNN, while achieving higher accuracy. That is to say, the proposed model is smaller, faster and more accurate. Finally, we evaluate the actual performance on an Android smartphone.

15.9LGFeb 27, 2020

Bridging the Gap between Spatial and Spectral Domains: A Survey on Graph Neural Networks

Zhiqian Chen, Fanglan Chen, Lei Zhang et al.

Deep learning's success has been widely recognized in a variety of machine learning tasks, including image classification, audio recognition, and natural language processing. As an extension of deep learning beyond these domains, graph neural networks (GNNs) are designed to handle the non-Euclidean graph-structure which is intractable to previous deep learning techniques. Existing GNNs are presented using various techniques, making direct comparison and cross-reference more complex. Although existing studies categorize GNNs into spatial-based and spectral-based techniques, there hasn't been a thorough examination of their relationship. To close this gap, this study presents a single framework that systematically incorporates most GNNs. We organize existing GNNs into spatial and spectral domains, as well as expose the connections within each domain. A review of spectral graph theory and approximation theory builds a strong relationship across the spatial and spectral domains in further investigation.

0.9CVNov 11, 2019Code

Conditionally Learn to Pay Attention for Sequential Visual Task

Jun He, Quan-Jie Cao, Lei Zhang

Sequential visual task usually requires to pay attention to its current interested object conditional on its previous observations. Different from popular soft attention mechanism, we propose a new attention framework by introducing a novel conditional global feature which represents the weak feature descriptor of the current focused object. Specifically, for a standard CNN (Convolutional Neural Network) pipeline, the convolutional layers with different receptive fields are used to produce the attention maps by measuring how the convolutional features align to the conditional global feature. The conditional global feature can be generated by different recurrent structure according to different visual tasks, such as a simple recurrent neural network for multiple objects recognition, or a moderate complex language model for image caption. Experiments show that our proposed conditional attention model achieves the best performance on the SVHN (Street View House Numbers) dataset with / without extra bounding box; and for image caption, our attention model generates better scores than the popular soft attention model.

3.4LGAug 27, 2019

Learning Continually from Low-shot Data Stream

Canyu Le, Xihan Wei, Biao Wang et al.

While deep learning has achieved remarkable results on various applications, it is usually data hungry and struggles to learn over non-stationary data stream. To solve these two limits, the deep learning model should not only be able to learn from a few of data, but also incrementally learn new concepts from data stream over time without forgetting the previous knowledge. Limited literature simultaneously address both problems. In this work, we propose a novel approach, MetaCL, which enables neural networks to effectively learn meta knowledge from low-shot data stream without catastrophic forgetting. MetaCL trains a model to exploit the intrinsic feature of data (i.e. meta knowledge) and dynamically penalize the important model parameters change to preserve learned knowledge. In this way, the deep learning model can efficiently obtain new knowledge from small volume of data and still keep high performance on previous tasks. MetaCL is conceptually simple, easy to implement and model-agnostic. We implement our method on three recent regularization-based methods. Extensive experiments show that our approach leads to state-of-the-art performance on image classification benchmarks.

7.7LGMar 24, 2019

Attention-based Convolutional Neural Network for Weakly Labeled Human Activities Recognition with Wearable Sensors

Kun Wang, Jun He, Lei Zhang

Unlike images or videos data which can be easily labeled by human being, sensor data annotation is a time-consuming process. However, traditional methods of human activity recognition require a large amount of such strictly labeled data for training classifiers. In this paper, we present an attention-based convolutional neural network for human recognition from weakly labeled data. The proposed attention model can focus on labeled activity among a long sequence of sensor data, and while filter out a large amount of background noise signals. In experiment on the weakly labeled dataset, we show that our attention model outperforms classical deep learning methods in accuracy. Besides, we determine the specific locations of the labeled activity in a long sequence of weakly labeled data by converting the compatibility score which is generated from attention model to compatibility density. Our method greatly facilitates the process of sensor data annotation, and makes data collection more easy.

38.0CVNov 25, 2018

Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation

Xin Wang, Qiuyuan Huang, Asli Celikyilmaz et al.

Vision-language navigation (VLN) is the task of navigating an embodied agent to carry out natural language instructions inside real 3D environments. In this paper, we study how to address three critical challenges for this task: the cross-modal grounding, the ill-posed feedback, and the generalization problems. First, we propose a novel Reinforced Cross-Modal Matching (RCM) approach that enforces cross-modal grounding both locally and globally via reinforcement learning (RL). Particularly, a matching critic is used to provide an intrinsic reward to encourage global matching between instructions and trajectories, and a reasoning navigator is employed to perform cross-modal grounding in the local visual scene. Evaluation on a VLN benchmark dataset shows that our RCM model significantly outperforms previous methods by 10% on SPL and achieves the new state-of-the-art performance. To improve the generalizability of the learned policy, we further introduce a Self-Supervised Imitation Learning (SIL) method to explore unseen environments by imitating its own past, good decisions. We demonstrate that SIL can approximate a better and more efficient policy, which tremendously minimizes the success rate performance gap between seen and unseen environments (from 30.7% to 11.7%).

55.6CVJul 25, 2017Code

Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering

Peter Anderson, Xiaodong He, Chris Buehler et al.

Top-down visual attention mechanisms have been used extensively in image captioning and visual question answering (VQA) to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning. In this work, we propose a combined bottom-up and top-down attention mechanism that enables attention to be calculated at the level of objects and other salient image regions. This is the natural basis for attention to be considered. Within our approach, the bottom-up mechanism (based on Faster R-CNN) proposes image regions, each with an associated feature vector, while the top-down mechanism determines feature weightings. Applying this approach to image captioning, our results on the MSCOCO test server establish a new state-of-the-art for the task, achieving CIDEr / SPICE / BLEU-4 scores of 117.9, 21.5 and 36.9, respectively. Demonstrating the broad applicability of the method, applying the same approach to VQA we obtain first place in the 2017 VQA Challenge.