Jianhua Yang

CV
h-index3
12papers
298citations
Novelty47%
AI Score49

12 Papers

CVFeb 14, 2023Code
YOWOv2: A Stronger yet Efficient Multi-level Detection Framework for Real-time Spatio-temporal Action Detection

Jianhua Yang, Kun Dai

Designing a real-time framework for the spatio-temporal action detection task is still a challenge. In this paper, we propose a novel real-time action detection framework, YOWOv2. In this new framework, YOWOv2 takes advantage of both the 3D backbone and 2D backbone for accurate action detection. A multi-level detection pipeline is designed to detect action instances of different scales. To achieve this goal, we carefully build a simple and efficient 2D backbone with a feature pyramid network to extract different levels of classification features and regression features. For the 3D backbone, we adopt the existing efficient 3D CNN to save development time. By combining 3D backbones and 2D backbones of different sizes, we design a YOWOv2 family including YOWOv2-Tiny, YOWOv2-Medium, and YOWOv2-Large. We also introduce the popular dynamic label assignment strategy and anchor-free mechanism to make the YOWOv2 consistent with the advanced model architecture design. With our improvement, YOWOv2 is significantly superior to YOWO, and can still keep real-time detection. Without any bells and whistles, YOWOv2 achieves 87.0 % frame mAP and 52.8 % video mAP with over 20 FPS on the UCF101-24. On the AVA, YOWOv2 achieves 21.7 % frame mAP with over 20 FPS. Our code is available on https://github.com/yjh0410/YOWOv2.

CVOct 20, 2022Code
YOWO-Plus: An Incremental Improvement

Jianhua Yang

In this technical report, we would like to introduce our updates to YOWO, a real-time method for spatio-temporal action detection. We make a bunch of little design changes to make it better. For network structure, we use the same ones of official implemented YOWO, including 3D-ResNext-101 and YOLOv2, but we use a better pretrained weight of our reimplemented YOLOv2, which is better than the official YOLOv2. We also optimize the label assignment used in YOWO. To accurately detection action instances, we deploy GIoU loss for box regression. After our incremental improvement, YOWO achieves 84.9\% frame mAP and 50.5\% video mAP on the UCF101-24, significantly higher than the official YOWO. On the AVA, our optimized YOWO achieves 20.6\% frame mAP with 16 frames, also exceeding the official YOWO. With 32 frames, our YOWO achieves 21.6 frame mAP with 25 FPS on an RTX 3090 GPU. We name the optimized YOWO as YOWO-Plus. Moreover, we replace the 3D-ResNext-101 with the efficient 3D-ShuffleNet-v2 to design a lightweight action detector, YOWO-Nano. YOWO-Nano achieves 81.0 \% frame mAP and 49.7\% video frame mAP with over 90 FPS on the UCF101-24. It also achieves 18.4 \% frame mAP with about 90 FPS on the AVA. As far as we know, YOWO-Nano is the fastest state-of-the-art action detector. Our code is available on https://github.com/yjh0410/PyTorch_YOWO.

CVJun 16, 2021Code
CMF: Cascaded Multi-model Fusion for Referring Image Segmentation

Jianhua Yang, Yan Huang, Zhanyu Ma et al.

In this work, we address the task of referring image segmentation (RIS), which aims at predicting a segmentation mask for the object described by a natural language expression. Most existing methods focus on establishing unidirectional or directional relationships between visual and linguistic features to associate two modalities together, while the multi-scale context is ignored or insufficiently modeled. Multi-scale context is crucial to localize and segment those objects that have large scale variations during the multi-modal fusion process. To solve this problem, we propose a simple yet effective Cascaded Multi-modal Fusion (CMF) module, which stacks multiple atrous convolutional layers in parallel and further introduces a cascaded branch to fuse visual and linguistic features. The cascaded branch can progressively integrate multi-scale contextual information and facilitate the alignment of two modalities during the multi-modal fusion process. Experimental results on four benchmark datasets demonstrate that our method outperforms most state-of-the-art methods. Code is available at https://github.com/jianhua2022/CMF-Refseg.

26.4CYMar 27
Multi-Level Barriers to Generative AI Adoption Across Disciplines and Professional Roles in Higher Education

Jianhua Yang, Kerem Öge, Adrian von Mühlenen et al.

Generative Artificial Intelligence (GenAI) is rapidly reshaping higher education, yet barriers to its adoption across different disciplines and institutional roles remain underexplored. Existing literature frequently attributes adoption barriers to individual-level factors such as perceived usefulness and ease of use. This study instead investigates whether such barriers are structurally produced. Drawing on a multi-method survey analysis of 272 academic and professional services (PSs) staff at a Russell Group university, we examine how disciplinary contexts and institutional roles shape perceived barriers. By integrating multinomial logistic regression (MLR), structural equation modelling (SEM), and semantic clustering of open-ended responses, we move beyond descriptive accounts to provide a multi-level explanation of GenAI adoption. Our findings reveal clear, systematic differences: non-STEM academics primarily report ethical and cultural barriers related to academic integrity, whereas STEM and PSs staff disproportionately emphasize institutional, governance, and infrastructure constraints. We conclude that GenAI adoption barriers are deeply embedded in organizational ecosystems and epistemic norms, suggesting that universities must move beyond generalized training to develop role-specific governance and support frameworks.

CVSep 24, 2025
Frequency-domain Multi-modal Fusion for Language-guided Medical Image Segmentation

Bo Yu, Jianhua Yang, Zetao Du et al.

Automatically segmenting infected areas in radiological images is essential for diagnosing pulmonary infectious diseases. Recent studies have demonstrated that the accuracy of the medical image segmentation can be improved by incorporating clinical text reports as semantic guidance. However, the complex morphological changes of lesions and the inherent semantic gap between vision-language modalities prevent existing methods from effectively enhancing the representation of visual features and eliminating semantically irrelevant information, ultimately resulting in suboptimal segmentation performance. To address these problems, we propose a Frequency-domain Multi-modal Interaction model (FMISeg) for language-guided medical image segmentation. FMISeg is a late fusion model that establishes interaction between linguistic features and frequency-domain visual features in the decoder. Specifically, to enhance the visual representation, our method introduces a Frequency-domain Feature Bidirectional Interaction (FFBI) module to effectively fuse frequency-domain features. Furthermore, a Language-guided Frequency-domain Feature Interaction (LFFI) module is incorporated within the decoder to suppress semantically irrelevant visual features under the guidance of linguistic information. Experiments on QaTa-COV19 and MosMedData+ demonstrated that our method outperforms the state-of-the-art methods qualitatively and quantitatively.

CVJul 11, 2025
Towards Imperceptible JPEG Image Hiding: Multi-range Representations-driven Adversarial Stego Generation

Junxue Yang, Xin Liao, Weixuan Tang et al.

Image hiding fully explores the hidden potential of deep learning-based models, aiming to conceal image-level messages within cover images and reveal them from stego images to achieve covert communication. Existing hiding schemes are easily detected by the naked eyes or steganalyzers due to the cover type confined to the spatial domain, single-range feature extraction and attacks, and insufficient loss constraints. To address these issues, we propose a multi-range representations-driven adversarial stego generation framework called MRAG for JPEG image hiding. This design stems from the fact that steganalyzers typically combine local-range and global-range information to better capture hidden traces. Specifically, MRAG integrates the local-range characteristic of the convolution and the global-range modeling of the transformer. Meanwhile, a features angle-norm disentanglement loss is designed to launch multi-range representations-driven feature-level adversarial attacks. It computes the adversarial loss between covers and stegos based on the surrogate steganalyzer's classified features, i.e., the features before the last fully connected layer. Under the dual constraints of features angle and norm, MRAG can delicately encode the concatenation of cover and secret into subtle adversarial perturbations from local and global ranges relevant to steganalysis. Therefore, the resulting stego can achieve visual and steganalysis imperceptibility. Moreover, coarse-grained and fine-grained frequency decomposition operations are devised to transform the input, introducing multi-grained information. Extensive experiments demonstrate that MRAG can achieve state-of-the-art performance.

MMJul 28, 2021
JPEG Steganography with Embedding Cost Learning and Side-Information Estimation

Jianhua Yang, Yi Liao, Fei Shang et al.

A great challenge to steganography has arisen with the wide application of steganalysis methods based on convolutional neural networks (CNNs). To this end, embedding cost learning frameworks based on generative adversarial networks (GANs) have been proposed and achieved success for spatial steganography. However, the application of GAN to JPEG steganography is still in the prototype stage; its anti-detectability and training efficiency should be improved. In conventional steganography, research has shown that the side-information calculated from the precover can be used to enhance security. However, it is hard to calculate the side-information without the spatial domain image. In this work, an embedding cost learning framework for JPEG Steganography via a Generative Adversarial Network (JS-GAN) has been proposed, the learned embedding cost can be further adjusted asymmetrically according to the estimated side-information. Experimental results have demonstrated that the proposed method can automatically learn a content-adaptive embedding cost function, and use the estimated side-information properly can effectively improve the security performance. For example, under the attack of a classic steganalyzer GFR with quality factor 75 and 0.4 bpnzAC, the proposed JS-GAN can increase the detection error 2.58% over J-UNIWARD, and the estimated side-information aided version JS-GAN(ESI) can further increase the security performance by 11.25% over JS-GAN.

CVNov 2, 2020
Actor and Action Modular Network for Text-based Video Segmentation

Jianhua Yang, Yan Huang, Kai Niu et al.

Text-based video segmentation aims to segment an actor in video sequences by specifying the actor and its performing action with a textual query. Previous methods fail to explicitly align the video content with the textual query in a fine-grained manner according to the actor and its action, due to the problem of \emph{semantic asymmetry}. The \emph{semantic asymmetry} implies that two modalities contain different amounts of semantic information during the multi-modal fusion process. To alleviate this problem, we propose a novel actor and action modular network that individually localizes the actor and its action in two separate modules. Specifically, we first learn the actor-/action-related content from the video and textual query, and then match them in a symmetrical manner to localize the target tube. The target tube contains the desired actor and action which is then fed into a fully convolutional network to predict segmentation masks of the actor. Our method also establishes the association of objects cross multiple frames with the proposed temporal proposal aggregation mechanism. This enables our method to segment the video effectively and keep the temporal consistency of predictions. The whole model is allowed for joint learning of the actor-action matching and segmentation, as well as achieves the state-of-the-art performance for both single-frame segmentation and full video segmentation on A2D Sentences and J-HMDB Sentences datasets.

LGAug 22, 2019
The Learning of Fuzzy Cognitive Maps With Noisy Data: A Rapid and Robust Learning Method With Maximum Entropy

Guoliang Feng, Wei Lu, Witold Pedrycz et al.

Numerous learning methods for fuzzy cognitive maps (FCMs), such as the Hebbian-based and the population-based learning methods, have been developed for modeling and simulating dynamic systems. However, these methods are faced with several obvious limitations. Most of these models are extremely time consuming when learning the large-scale FCMs with hundreds of nodes. Furthermore, the FCMs learned by those algorithms lack robustness when the experimental data contain noise. In addition, reasonable distribution of the weights is rarely considered in these algorithms, which could result in the reduction of the performance of the resulting FCM. In this article, a straightforward, rapid, and robust learning method is proposed to learn FCMs from noisy data, especially, to learn large-scale FCMs. The crux of the proposed algorithm is to equivalently transform the learning problem of FCMs to a classic-constrained convex optimization problem in which the least-squares term ensures the robustness of the well-learned FCM and the maximum entropy term regularizes the distribution of the weights of the well-learned FCM. A series of experiments covering two frequently used activation functions (the sigmoid and hyperbolic tangent functions) are performed on both synthetic datasets with noise and real-world datasets. The experimental results show that the proposed method is rapid and robust against data containing noise and that the well-learned weights have better distribution. In addition, the FCMs learned by the proposed method also exhibit superior performance in comparison with the existing methods. Index Terms-Fuzzy cognitive maps (FCMs), maximum entropy, noisy data, rapid and robust learning.

CVApr 28, 2018
Deep Learning based Inter-Modality Image Registration Supervised by Intra-Modality Similarity

Xiaohuan Cao, Jianhua Yang, Li Wang et al.

Non-rigid inter-modality registration can facilitate accurate information fusion from different modalities, but it is challenging due to the very different image appearances across modalities. In this paper, we propose to train a non-rigid inter-modality image registration network, which can directly predict the transformation field from the input multimodal images, such as CT and MR images. In particular, the training of our inter-modality registration network is supervised by intra-modality similarity metric based on the available paired data, which is derived from a pre-aligned CT and MR dataset. Specifically, in the training stage, to register the input CT and MR images, their similarity is evaluated on the warped MR image and the MR image that is paired with the input CT. So that, the intra-modality similarity metric can be directly applied to measure whether the input CT and MR images are well registered. Moreover, we use the idea of dual-modality fashion, in which we measure the similarity on both CT modality and MR modality. In this way, the complementary anatomies in both modalities can be jointly considered to more accurately train the inter-modality registration network. In the testing stage, the trained inter-modality registration network can be directly applied to register the new multimodal images without any paired data. Experimental results have shown that, the proposed method can achieve promising accuracy and efficiency for the challenging non-rigid inter-modality registration task and also outperforms the state-of-the-art approaches.

MMApr 21, 2018
Spatial Image Steganography Based on Generative Adversarial Network

Jianhua Yang, Kai Liu, Xiangui Kang et al.

With the recent development of deep learning on steganalysis, embedding secret information into digital images faces great challenges. In this paper, a secure steganography algorithm by using adversarial training is proposed. The architecture contain three component modules: a generator, an embedding simulator and a discriminator. A generator based on U-NET to translate a cover image into an embedding change probability is proposed. To fit the optimal embedding simulator and propagate the gradient, a function called Tanh-simulator is proposed. As for the discriminator, the selection-channel awareness (SCA) is incorporated to resist the SCA based steganalytic methods. Experimental results have shown that the proposed framework can increase the security performance dramatically over the recently reported method ASDL-GAN, while the training time is only 30% of that used by ASDL-GAN. Furthermore, it also performs better than the hand-crafted steganographic algorithm S-UNIWARD.

MMNov 26, 2017
JPEG Steganalysis Based on DenseNet

Jianhua Yang, Yun-Qing Shi, Edward K. Wong et al.

Different from the conventional deep learning work based on an images content in computer vision, deep steganalysis is an art to detect the secret information embedded in an image via deep learning, pose challenge of detection weak information invisible hidden in a host image thus learning in a very low signal-to-noise (SNR) case. In this paper, we propose a 32- layer convolutional neural Networks (CNNs) in to improve the efficiency of preprocess and reuse the features by concatenating all features from the previous layers with the same feature- map size, thus improve the flow of information and gradient. The shared features and bottleneck layers further improve the feature propagation and reduce the CNN model parameters dramatically. Experimental results on the BOSSbase, BOWS2 and ImageNet datasets have showed that the proposed CNN architecture can improve the performance and enhance the robustness. To further boost the detection accuracy, an ensemble architecture called as CNN-SCA-GFR is proposed, CNN-SCA- GFR is also the first work to combine the CNN architecture and conventional method in the JPEG domain. Experiments show that it can further lower detection errors. Compared with the state-of-the-art method XuNet [1] on BOSSbase, the proposed CNN-SCA-GFR architecture can reduce detection error rate by 5.67% for 0.1 bpnzAC and by 4.41% for 0.4 bpnzAC while the number of training parameters in CNN is only 17% of what used by XuNet. It also decreases the detection errors from the conventional method SCA-GFR by 7.89% for 0.1 bpnzAC and 8.06% for 0.4 bpnzAC, respectively.