CVNov 24, 2022
1st Workshop on Maritime Computer Vision (MaCVi) 2023: Challenge ResultsBenjamin Kiefer, Matej Kristan, Janez Perš et al.
The 1$^{\text{st}}$ Workshop on Maritime Computer Vision (MaCVi) 2023 focused on maritime computer vision for Unmanned Aerial Vehicles (UAV) and Unmanned Surface Vehicle (USV), and organized several subchallenges in this domain: (i) UAV-based Maritime Object Detection, (ii) UAV-based Maritime Object Tracking, (iii) USV-based Maritime Obstacle Segmentation and (iv) USV-based Maritime Obstacle Detection. The subchallenges were based on the SeaDronesSee and MODS benchmarks. This report summarizes the main findings of the individual subchallenges and introduces a new benchmark, called SeaDronesSee Object Detection v2, which extends the previous benchmark by including more classes and footage. We provide statistical and qualitative analyses, and assess trends in the best-performing methodologies of over 130 submissions. The methods are summarized in the appendix. The datasets, evaluation code and the leaderboard are publicly available at https://seadronessee.cs.uni-tuebingen.de/macvi.
IVOct 8, 2023
VisionFM: a Multi-Modal Multi-Task Vision Foundation Model for Generalist Ophthalmic Artificial IntelligenceJianing Qiu, Jian Wu, Hao Wei et al.
We present VisionFM, a foundation model pre-trained with 3.4 million ophthalmic images from 560,457 individuals, covering a broad range of ophthalmic diseases, modalities, imaging devices, and demography. After pre-training, VisionFM provides a foundation to foster multiple ophthalmic artificial intelligence (AI) applications, such as disease screening and diagnosis, disease prognosis, subclassification of disease phenotype, and systemic biomarker and disease prediction, with each application enhanced with expert-level intelligence and accuracy. The generalist intelligence of VisionFM outperformed ophthalmologists with basic and intermediate levels in jointly diagnosing 12 common ophthalmic diseases. Evaluated on a new large-scale ophthalmic disease diagnosis benchmark database, as well as a new large-scale segmentation and detection benchmark database, VisionFM outperformed strong baseline deep neural networks. The ophthalmic image representations learned by VisionFM exhibited noteworthy explainability, and demonstrated strong generalizability to new ophthalmic modalities, disease spectrum, and imaging devices. As a foundation model, VisionFM has a large capacity to learn from diverse ophthalmic imaging data and disparate datasets. To be commensurate with this capacity, in addition to the real data used for pre-training, we also generated and leveraged synthetic ophthalmic imaging data. Experimental results revealed that synthetic data that passed visual Turing tests, can also enhance the representation learning capability of VisionFM, leading to substantial performance gains on downstream ophthalmic AI tasks. Beyond the ophthalmic AI applications developed, validated, and demonstrated in this work, substantial further applications can be achieved in an efficient and cost-effective manner using VisionFM as the foundation.
CVApr 14, 2023
Masked Pre-Training of Transformers for Histology Image AnalysisShuai Jiang, Liesbeth Hondelink, Arief A. Suriawinata et al.
In digital pathology, whole slide images (WSIs) are widely used for applications such as cancer diagnosis and prognosis prediction. Visual transformer models have recently emerged as a promising method for encoding large regions of WSIs while preserving spatial relationships among patches. However, due to the large number of model parameters and limited labeled data, applying transformer models to WSIs remains challenging. Inspired by masked language models, we propose a pretext task for training the transformer model without labeled data to address this problem. Our model, MaskHIT, uses the transformer output to reconstruct masked patches and learn representative histological features based on their positions and visual features. The experimental results demonstrate that MaskHIT surpasses various multiple instance learning approaches by 3% and 2% on survival prediction and cancer subtype classification tasks, respectively. Furthermore, MaskHIT also outperforms two of the most recent state-of-the-art transformer-based methods. Finally, a comparison between the attention maps generated by the MaskHIT model with pathologist's annotations indicates that the model can accurately identify clinically relevant histological structures in each task.
34.5LGMay 7
On the Convergence Behavior of Preconditioned Gradient Descent Toward the Rich Learning RegimeShuai Jiang, Alexey Voronin, Eric Cyr et al.
Spectral bias, the tendency of neural networks to learn low frequencies first, can be both a blessing and a curse. While it enhances the generalization capabilities by suppressing high-frequency noise, it can be a limitation in scientific tasks that require capturing fine-scale structures. The delayed generalization phenomenon known as grokking is another barrier to rapid training of neural networks. Grokking has been hypothesized to arise as learning transitions from the NTK to the feature-rich regime. This paper explores the impact of preconditioned gradient descent (PGD), such as Gauss-Newton, on spectral bias and grokking phenomena. We demonstrate through theoretical and empirical results how PGD can mitigate issues associated with spectral bias. Additionally, building on the rich learning regime grokking hypothesis, we study how PGD can be used to reduce delays associated with grokking. Our conjecture is that PGD, without the impediment of spectral bias, enables uniform exploration of the parameter space in the NTK regime. Our experimental results confirm this prediction, providing strong evidence that grokking represents a transitional behavior between the lazy regime characterized by the NTK and the rich regime. These findings deepen our understanding of the interplay between optimization dynamics, spectral bias, and the phases of neural network learning.
SPAug 9, 2024
Generative AI on SpectrumNet: An Open Benchmark of Multiband 3D Radio MapsShuhang Zhang, Shuai Jiang, Wanjie Lin et al.
Radio map is an efficient demonstration for visually displaying the wireless signal coverage within a certain region. It has been considered to be increasingly helpful for the future sixth generation (6G) of wireless networks, as wireless nodes are becoming more crowded and complicated. However, the construction of high resolution radio map is very challenging due to the sparse sampling in practical systems. Generative artificial intelligence (AI), which is capable to create synthetic data to fill in gaps in real-world measurements, is an effective technique to construct high precision radio maps. Currently, generative models for radio map construction are trained with two-dimension (2D) single band radio maps in urban scenario, which has poor generalization in diverse terrain scenarios, spectrum bands, and heights. To tackle this problem, we provide a multiband three-dimension (3D) radio map dataset with consideration of terrain and climate information, named SpectrumNet. It is the largest radio map dataset in terms of dimensions and scale, which contains the radio map of 3 spacial dimensions, 5 frequency bands, 11 terrain scenarios, and 3 climate scenarios. We introduce the parameters and settings for the SpectrumNet dataset generation, and evaluate three baseline methods for radio map construction based on the SpectrumNet dataset. Experiments show the necessity of the SpectrumNet dataset for training models with strong generalization in spacial, frequency, and scenario domains. Future works on the SpectrumNet dataset are also discussed, including the dataset expansion and calibration, as well as the extended studies on generative models for radio map construction based on the SpectrumNet dataset.
IVNov 14, 2023
GlanceSeg: Real-time microaneurysm lesion segmentation with gaze-map-guided foundation model for early detection of diabetic retinopathyHongyang Jiang, Mengdi Gao, Zirong Liu et al.
Early-stage diabetic retinopathy (DR) presents challenges in clinical diagnosis due to inconspicuous and minute microangioma lesions, resulting in limited research in this area. Additionally, the potential of emerging foundation models, such as the segment anything model (SAM), in medical scenarios remains rarely explored. In this work, we propose a human-in-the-loop, label-free early DR diagnosis framework called GlanceSeg, based on SAM. GlanceSeg enables real-time segmentation of microangioma lesions as ophthalmologists review fundus images. Our human-in-the-loop framework integrates the ophthalmologist's gaze map, allowing for rough localization of minute lesions in fundus images. Subsequently, a saliency map is generated based on the located region of interest, which provides prompt points to assist the foundation model in efficiently segmenting microangioma lesions. Finally, a domain knowledge filter refines the segmentation of minute lesions. We conducted experiments on two newly-built public datasets, i.e., IDRiD and Retinal-Lesions, and validated the feasibility and superiority of GlanceSeg through visualized illustrations and quantitative measures. Additionally, we demonstrated that GlanceSeg improves annotation efficiency for clinicians and enhances segmentation performance through fine-tuning using annotations. This study highlights the potential of GlanceSeg-based annotations for self-model optimization, leading to enduring performance advancements through continual learning.
CLJun 23, 2023
Mutually Guided Few-shot Learning for Relational Triple ExtractionChengmei Yang, Shuai Jiang, Bowei He et al.
Knowledge graphs (KGs), containing many entity-relation-entity triples, provide rich information for downstream applications. Although extracting triples from unstructured texts has been widely explored, most of them require a large number of labeled instances. The performance will drop dramatically when only few labeled data are available. To tackle this problem, we propose the Mutually Guided Few-shot learning framework for Relational Triple Extraction (MG-FTE). Specifically, our method consists of an entity-guided relation proto-decoder to classify the relations firstly and a relation-guided entity proto-decoder to extract entities based on the classified relations. To draw the connection between entity and relation, we design a proto-level fusion module to boost the performance of both entity extraction and relation classification. Moreover, a new cross-domain few-shot triple extraction task is introduced. Extensive experiments show that our method outperforms many state-of-the-art methods by 12.6 F1 score on FewRel 1.0 (single-domain) and 20.5 F1 score on FewRel 2.0 (cross-domain).
50.7LGMay 23
Muon in Vision Transformers: Optimizer-Recipe Interactions and Gradient SpectraBen S. Southworth, Shuai Jiang, Daniel McBride et al.
Muon is a recently developed matrix-aware optimizer that has shown strong results in transformer training, but its behavior in vision transformers (ViTs) is not yet well understood. We study Muon for ViT training, largely on ImageNet-100 and Pl@ntNet-300K, comparing against AdamW under standard vision recipes involving mixup, cutmix, smoothing, and random augmentation and erasing. Muon consistently outperforms AdamW, with especially large gains on long-tailed Pl@ntNet macro top-1. These gains are also recipe-dependent, where Muon benefits much more than AdamW from advanced and significant data augmentation techniques. To understand this interaction, we analyze the singular-value structure of matrix gradients throughout the ViT. Within Muon training runs, removing heavy data augmentation induces a late-training spectral concentration and mode collapse in gradient matrices, primarily in deep MLP-down blocks. Under a fixed "full" augmentation recipe, the clearest Muon-AdamW contrast appears instead in QKV gradients, where AdamW gradient energy remains concentrated in a much narrower basis while Muon spreads energy across substantially more singular modes. Muon in ViTs is therefore best understood as an optimizer-recipe interaction. Under a fixed recipe, Muon differs from AdamW most clearly in attention projections, where its gradients consist of a broader spectral basis. Within Muon, a full training recipe is important for preventing late spectral concentration and mode collapse in deep feedforward blocks. We further demonstrate efficacy in training ViTs on image segmentation and masked autoencoder models, where Muon outperforms AdamW in all settings considered.
CVMay 24, 2024Code
MindShot: A Few-Shot Brain Decoding Framework via Transferring Cross-Subject Prior and Distilling Frequency Domain KnowledgeShuai Jiang, Zhu Meng, Haiwen Li et al.
Aiming to reconstruct visual stimuli from brain signals, brain decoding has recently made significant progress using functional magnetic resonance imaging (fMRI). However, it still has challenging issues such as substantial individual differences and high data collection costs. To simplify these problems, most methods adopt the per-subject-per-model paradigm, but this greatly limits their applications. In this paper, we design a few-shot brain decoding setting specifically for potential clinical scenarios and propose a novel two-stage decoding framework named MindShot, comprising a Multi-Subject Pretraining (MSP) stage and Fourier-based cross-subject Knowledge Distillation (FKD) stage. Firstly, a MSP framework based on multi-modal contrastive learning is constructed to mine the cross-subject prior. Secondly, the FKD is presented to decrease inter-individual differences while improving the decoding adaptability to new individuals. Our approach achieves high semantic fidelity in visual reconstruction on the largest dataset and has the potential to reduce scanning time by up to 99%. Remarkably, MindShot achieves a CLIP accuracy of 83.6% using only 1.8% of the fMRI-image pairs, surpassing the 77.4% accuracy of the method trained on the entire NSD dataset. This makes it feasible to train large-scale brain decoding frameworks that require less data, facilitating practical applications. The code is available at https://github.com/JSinBUPT/MindShot.
51.9LGMay 11
Temporal Preference Concepts and their Functions in a Large Language ModelIan Rios-Sialer, Shantanu Darveshi, Shuai Jiang et al.
Large Language Models (LLMs) are increasingly being deployed to make decisions that require trading off near-term gains against long-term consequences, yet little is known about how they internally represent or resolve these tradeoffs. In this work, we causally localize an underlying subgraph for temporal preference in a distilled LLM (Qwen3-4B-Instruct-2507), identifying mid-to-upper-layer nodes through converging evidence from gradient-based attribution and activation patching. We find that the geometry of time horizon is encoded in the residual stream at the expected localized layers. A behavioral analysis reveals that unintervened LLMs discount the future several times less steeply than humans, yet this preference is unstable across contexts, motivating explicit control rather than implicit reliance on training. Finally, we find suggestive evidence that steering vectors can shift temporal preference. Our work demonstrates how mechanistic interpretability can bring us closer to reliable control over how LLMs plan and reason
CVOct 8, 2025Code
Automated Neural Architecture Design for Industrial Defect DetectionYuxi Liu, Yunfeng Ma, Yi Tang et al.
Industrial surface defect detection (SDD) is critical for ensuring product quality and manufacturing reliability. Due to the diverse shapes and sizes of surface defects, SDD faces two main challenges: intraclass difference and interclass similarity. Existing methods primarily utilize manually designed models, which require extensive trial and error and often struggle to address both challenges effectively. To overcome this, we propose AutoNAD, an automated neural architecture design framework for SDD that jointly searches over convolutions, transformers, and multi-layer perceptrons. This hybrid design enables the model to capture both fine-grained local variations and long-range semantic context, addressing the two key challenges while reducing the cost of manual network design. To support efficient training of such a diverse search space, AutoNAD introduces a cross weight sharing strategy, which accelerates supernet convergence and improves subnet performance. Additionally, a searchable multi-level feature aggregation module (MFAM) is integrated to enhance multi-scale feature learning. Beyond detection accuracy, runtime efficiency is essential for industrial deployment. To this end, AutoNAD incorporates a latency-aware prior to guide the selection of efficient architectures. The effectiveness of AutoNAD is validated on three industrial defect datasets and further applied within a defect imaging and detection platform. Code will be available at https://github.com/Yuxi104/AutoNAD.
CVSep 3, 2025Code
Resilient Multimodal Industrial Surface Defect Detection with Uncertain Sensors AvailabilityShuai Jiang, Yunfeng Ma, Jingyu Zhou et al.
Multimodal industrial surface defect detection (MISDD) aims to identify and locate defect in industrial products by fusing RGB and 3D modalities. This article focuses on modality-missing problems caused by uncertain sensors availability in MISDD. In this context, the fusion of multiple modalities encounters several troubles, including learning mode transformation and information vacancy. To this end, we first propose cross-modal prompt learning, which includes: i) the cross-modal consistency prompt serves the establishment of information consistency of dual visual modalities; ii) the modality-specific prompt is inserted to adapt different input patterns; iii) the missing-aware prompt is attached to compensate for the information vacancy caused by dynamic modalities-missing. In addition, we propose symmetric contrastive learning, which utilizes text modality as a bridge for fusion of dual vision modalities. Specifically, a paired antithetical text prompt is designed to generate binary text semantics, and triple-modal contrastive pre-training is offered to accomplish multimodal learning. Experiment results show that our proposed method achieves 73.83% I-AUROC and 93.05% P-AUROC with a total missing rate 0.7 for RGB and 3D modalities (exceeding state-of-the-art methods 3.84% and 5.58% respectively), and outperforms existing approaches to varying degrees under different missing types and rates. The source code will be available at https://github.com/SvyJ/MISDD-MM.
CVJan 30
OpenVTON-Bench: A Large-Scale High-Resolution Benchmark for Controllable Virtual Try-On EvaluationJin Li, Tao Chen, Shuai Jiang et al.
Recent advances in diffusion models have significantly elevated the visual fidelity of Virtual Try-On (VTON) systems, yet reliable evaluation remains a persistent bottleneck. Traditional metrics struggle to quantify fine-grained texture details and semantic consistency, while existing datasets fail to meet commercial standards in scale and diversity. We present OpenVTON-Bench, a large-scale benchmark comprising approximately 100K high-resolution image pairs (up to $1536 \times 1536$). The dataset is constructed using DINOv3-based hierarchical clustering for semantically balanced sampling and Gemini-powered dense captioning, ensuring a uniform distribution across 20 fine-grained garment categories. To support reliable evaluation, we propose a multi-modal protocol that measures VTON quality along five interpretable dimensions: background consistency, identity fidelity, texture fidelity, shape plausibility, and overall realism. The protocol integrates VLM-based semantic reasoning with a novel Multi-Scale Representation Metric based on SAM3 segmentation and morphological erosion, enabling the separation of boundary alignment errors from internal texture artifacts. Experimental results show strong agreement with human judgments (Kendall's $τ$ of 0.833 vs. 0.611 for SSIM), establishing a robust benchmark for VTON evaluation.
CVSep 16, 2025
Hunyuan3D Studio: End-to-End AI Pipeline for Game-Ready 3D Asset GenerationBiwen Lei, Yang Li, Xinhai Liu et al.
The creation of high-quality 3D assets, a cornerstone of modern game development, has long been characterized by labor-intensive and specialized workflows. This paper presents Hunyuan3D Studio, an end-to-end AI-powered content creation platform designed to revolutionize the game production pipeline by automating and streamlining the generation of game-ready 3D assets. At its core, Hunyuan3D Studio integrates a suite of advanced neural modules (such as Part-level 3D Generation, Polygon Generation, Semantic UV, etc.) into a cohesive and user-friendly system. This unified framework allows for the rapid transformation of a single concept image or textual description into a fully-realized, production-quality 3D model complete with optimized geometry and high-fidelity PBR textures. We demonstrate that assets generated by Hunyuan3D Studio are not only visually compelling but also adhere to the stringent technical requirements of contemporary game engines, significantly reducing iteration time and lowering the barrier to entry for 3D content creation. By providing a seamless bridge from creative intent to technical asset, Hunyuan3D Studio represents a significant leap forward for AI-assisted workflows in game development and interactive media.
LGApr 13, 2025
Transformer-Based Representation Learning for Robust Gene Expression Modeling and Cancer PrognosisShuai Jiang, Saeed Hassanpour
Transformer-based models have achieved remarkable success in natural language and vision tasks, but their application to gene expression analysis remains limited due to data sparsity, high dimensionality, and missing values. We present GexBERT, a transformer-based autoencoder framework for robust representation learning of gene expression data. GexBERT learns context-aware gene embeddings by pretraining on large-scale transcriptomic profiles with a masking and restoration objective that captures co-expression relationships among thousands of genes. We evaluate GexBERT across three critical tasks in cancer research: pan-cancer classification, cancer-specific survival prediction, and missing value imputation. GexBERT achieves state-of-the-art classification accuracy from limited gene subsets, improves survival prediction by restoring expression of prognostic anchor genes, and outperforms conventional imputation methods under high missingness. Furthermore, its attention-based interpretability reveals biologically meaningful gene patterns across cancer types. These findings demonstrate the utility of GexBERT as a scalable and effective tool for gene expression modeling, with translational potential in settings where gene coverage is limited or incomplete.
IVDec 26, 2023
Automatic laminectomy cutting plane planning based on artificial intelligence in robot assisted laminectomy surgeryZhuofu Li, Yonghong Zhang, Chengxia Wang et al.
Objective: This study aims to use artificial intelligence to realize the automatic planning of laminectomy, and verify the method. Methods: We propose a two-stage approach for automatic laminectomy cutting plane planning. The first stage was the identification of key points. 7 key points were manually marked on each CT image. The Spatial Pyramid Upsampling Network (SPU-Net) algorithm developed by us was used to accurately locate the 7 key points. In the second stage, based on the identification of key points, a personalized coordinate system was generated for each vertebra. Finally, the transverse and longitudinal cutting planes of laminectomy were generated under the coordinate system. The overall effect of planning was evaluated. Results: In the first stage, the average localization error of the SPU-Net algorithm for the seven key points was 0.65mm. In the second stage, a total of 320 transverse cutting planes and 640 longitudinal cutting planes were planned by the algorithm. Among them, the number of horizontal plane planning effects of grade A, B, and C were 318(99.38%), 1(0.31%), and 1(0.31%), respectively. The longitudinal planning effects of grade A, B, and C were 622(97.18%), 1(0.16%), and 17(2.66%), respectively. Conclusions: In this study, we propose a method for automatic surgical path planning of laminectomy based on the localization of key points in CT images. The results showed that the method achieved satisfactory results. More studies are needed to confirm the reliability of this approach in the future.
CVMar 7
Aligning What EEG Can See: Structural Representations for Brain-Vision MatchingJingyi Tang, Shuai Jiang, Fei Su et al.
Visual decoding from electroencephalography (EEG) has emerged as a highly promising avenue for non-invasive brain-computer interfaces (BCIs). Existing EEG-based decoding methods predominantly align brain signals with the final-layer semantic embeddings of deep visual models. However, relying on these highly abstracted embeddings inevitably leads to severe cross-modal information mismatch. In this work, we introduce the concept of Neural Visibility and accordingly propose the EEG-Visible Layer Selection Strategy, aligning EEG signals with intermediate visual layers to minimize this mismatch. Furthermore, to accommodate the multi-stage nature of human visual processing, we propose a novel Hierarchically Complementary Fusion (HCF) framework that jointly integrates visual representations from different hierarchical levels. Extensive experiments demonstrate that our method achieves state-of-the-art performance, reaching an 84.6% accuracy (+21.4%) on zero-shot visual decoding on the THINGS-EEG dataset. Moreover, our method achieves up to a 129.8% performance gain across diverse EEG baselines, demonstrating its robust generalizability.
LGJan 13
Layer-Parallel Training for TransformersShuai Jiang, Marc Salvado, Eric C. Cyr et al.
We present a new training methodology for transformers using a multilevel, layer-parallel approach. Through a neural ODE formulation of transformers, our application of a multilevel parallel-in-time algorithm for the forward and backpropagation phases of training achieves parallel acceleration over the layer dimension. This dramatically enhances parallel scalability as the network depth increases, which is particularly useful for increasingly large foundational models. However, achieving this introduces errors that cause systematic bias in the gradients, which in turn reduces convergence when closer to the minima. We develop an algorithm to detect this critical transition and either switch to serial training or systematically increase the accuracy of layer-parallel training. Results, including BERT, GPT2, ViT, and machine translation architectures, demonstrate parallel-acceleration as well as accuracy commensurate with serial pre-training while fine-tuning is unaffected.
LGJun 18, 2025
Interpretability and Generalization Bounds for Learning Spatial PhysicsAlejandro Francisco Queiruga, Theo Gutman-Solo, Shuai Jiang
While there are many applications of ML to scientific problems that look promising, visuals can be deceiving. For scientific applications, actual quantitative accuracy is crucial. This work applies the rigor of numerical analysis for differential equations to machine learning by specifically quantifying the accuracy of applying different ML techniques to the elementary 1D Poisson differential equation. Beyond the quantity and discretization of data, we identify that the function space of the data is critical to the generalization of the model. We prove generalization bounds and convergence rates under finite data discretizations and restricted training data subspaces by analyzing the training dynamics and deriving optimal parameters for both a white-box differential equation discovery method and a black-box linear model. The analytically derived generalization bounds are replicated empirically. Similar lack of generalization is empirically demonstrated for deep linear models, shallow neural networks, and physics-specific DeepONets and Neural Operators. We theoretically and empirically demonstrate that generalization to the true physical equation is not guaranteed in each explored case. Surprisingly, we find that different classes of models can exhibit opposing generalization behaviors. Based on our theoretical analysis, we also demonstrate a new mechanistic interpretability lens on scientific models whereby Green's function representations can be extracted from the weights of black-box models. Our results inform a new cross-validation technique for measuring generalization in physical systems. We propose applying it to the Poisson equation as an evaluation benchmark of future methods.
CVOct 13, 2024
Improving Colorectal Cancer Screening and Risk Assessment through Predictive Modeling on Medical Images and RecordsShuai Jiang, Christina Robinson, Joseph Anderson et al.
Colonoscopy screening effectively identifies and removes polyps before they progress to colorectal cancer (CRC), but current follow-up guidelines rely primarily on histopathological features, overlooking other important CRC risk factors. Variability in polyp characterization among pathologists also hinders consistent surveillance decisions. Advances in digital pathology and deep learning enable the integration of pathology slides and medical records for more accurate CRC risk prediction. Using data from the New Hampshire Colonoscopy Registry, including longitudinal follow-up, we adapted a transformer-based model for histopathology image analysis to predict 5-year CRC risk. We further explored multi-modal fusion strategies to combine clinical records with deep learning-derived image features. Training the model to predict intermediate clinical variables improved 5-year CRC risk prediction (AUC = 0.630) compared to direct prediction (AUC = 0.615, p = 0.013). Incorporating both imaging and non-imaging data, without requiring manual slide review, further improved performance (AUC = 0.674) compared to traditional features from colonoscopy and microscopy reports (AUC = 0.655, p = 0.001). These results highlight the value of integrating diverse data modalities with computational methods to enhance CRC risk stratification.
CLSep 3, 2023
A Visual Interpretation-Based Self-Improved Classification System Using Virtual Adversarial TrainingShuai Jiang, Sayaka Kamei, Chen Li et al.
The successful application of large pre-trained models such as BERT in natural language processing has attracted more attention from researchers. Since the BERT typically acts as an end-to-end black box, classification systems based on it usually have difficulty in interpretation and low robustness. This paper proposes a visual interpretation-based self-improving classification model with a combination of virtual adversarial training (VAT) and BERT models to address the above problems. Specifically, a fine-tuned BERT model is used as a classifier to classify the sentiment of the text. Then, the predicted sentiment classification labels are used as part of the input of another BERT for spam classification via a semi-supervised training manner using VAT. Additionally, visualization techniques, including visualizing the importance of words and normalizing the attention head matrix, are employed to analyze the relevance of each component to classification accuracy. Moreover, brand-new features will be found in the visual analysis, and classification performance will be improved. Experimental results on Twitter's tweet dataset demonstrate the effectiveness of the proposed model on the classification task. Furthermore, the ablation study results illustrate the effect of different components of the proposed model on the classification results.
IVOct 22, 2021
MHAttnSurv: Multi-Head Attention for Survival Prediction Using Whole-Slide Pathology ImagesShuai Jiang, Arief A. Suriawinata, Saeed Hassanpour
In pathology, whole-slide images (WSI) based survival prediction has attracted increasing interest. However, given the large size of WSIs and the lack of pathologist annotations, extracting the prognostic information from WSIs remains a challenging task. Previous studies have used multiple instance learning approaches to combine the information from multiple randomly sampled patches, but different visual patterns may contribute differently to prognosis prediction. In this study, we developed a multi-head attention approach to focus on various parts of a tumor slide, for more comprehensive information extraction from WSIs. We evaluated our approach on four cancer types from The Cancer Genome Atlas database. Our model achieved an average c-index of 0.640, outperforming two existing state-of-the-art approaches for WSI-based survival prediction, which have an average c-index of 0.603 and 0.619 on these datasets. Visualization of our attention maps reveals each attention head focuses synergistically on different morphological patterns.
CVSep 8, 2021
Unsupervised clothing change adaptive person ReIDZiyue Zhang, Shuai Jiang, Congzhentao Huang et al.
Clothing changes and lack of data labels are both crucial challenges in person ReID. For the former challenge, people may occur multiple times at different locations wearing different clothing. However, most of the current person ReID research works focus on the benchmarks in which a person's clothing is kept the same all the time. For the last challenge, some researchers try to make model learn information from a labeled dataset as a source to an unlabeled dataset. Whereas purely unsupervised training is less used. In this paper, we aim to solve both problems at the same time. We design a novel unsupervised model, Sync-Person-Cloud ReID, to solve the unsupervised clothing change person ReID problem. We developer a purely unsupervised clothing change person ReID pipeline with person sync augmentation operation and same person feature restriction. The person sync augmentation is to supply additional same person resources. These same person's resources can be used as part supervised input by same person feature restriction. The extensive experiments on clothing change ReID datasets show the out-performance of our methods.
CVJan 12, 2021
Resolution-invariant Person ReID Based on Feature Transformation and Self-weighted AttentionZiyue Zhang, Shuai Jiang, Congzhentao Huang et al.
Person Re-identification (ReID) is a critical computer vision task which aims to match the same person in images or video sequences. Most current works focus on settings where the resolution of images is kept the same. However, the resolution is a crucial factor in person ReID, especially when the cameras are at different distances from the person or the camera's models are different from each other. In this paper, we propose a novel two-stream network with a lightweight resolution association ReID feature transformation (RAFT) module and a self-weighted attention (SWA) ReID module to evaluate features under different resolutions. RAFT transforms the low resolution features to corresponding high resolution features. SWA evaluates both features to get weight factors for the person ReID. Both modules are jointly trained to get a resolution-invariant representation. Extensive experiments on five benchmark datasets show the effectiveness of our method. For instance, we achieve Rank-1 accuracy of 43.3% and 83.2% on CAVIAR and MLR-CUHK03, outperforming the state-of-the-art.
CVJul 15, 2020
RGB-IR Cross-modality Person ReID based on Teacher-Student GAN ModelZiyue Zhang, Shuai Jiang, Congzhentao Huang et al.
RGB-Infrared (RGB-IR) person re-identification (ReID) is a technology where the system can automatically identify the same person appearing at different parts of a video when light is unavailable. The critical challenge of this task is the cross-modality gap of features under different modalities. To solve this challenge, we proposed a Teacher-Student GAN model (TS-GAN) to adopt different domains and guide the ReID backbone to learn better ReID information. (1) In order to get corresponding RGB-IR image pairs, the RGB-IR Generative Adversarial Network (GAN) was used to generate IR images. (2) To kick-start the training of identities, a ReID Teacher module was trained under IR modality person images, which is then used to guide its Student counterpart in training. (3) Likewise, to better adapt different domain features and enhance model ReID performance, three Teacher-Student loss functions were used. Unlike other GAN based models, the proposed model only needs the backbone module at the test stage, making it more efficient and resource-saving. To showcase our model's capability, we did extensive experiments on the newly-released SYSU-MM01 RGB-IR Re-ID benchmark and achieved superior performance to the state-of-the-art with 49.8% Rank-1 and 47.4% mAP.
CVFeb 5, 2020
Illumination adaptive person reid based on teacher-student model and adversarial trainingZiyue Zhang, Richard YD Xu, Shuai Jiang et al.
Most existing works in Person Re-identification (ReID) focus on settings where illumination either is kept the same or has very little fluctuation. However, the changes in the illumination degree may affect the robustness of a ReID algorithm significantly. To address this problem, we proposed a Two-Stream Network that can separate ReID features from lighting features to enhance ReID performance. Its innovations are threefold: (1) A discriminative entropy loss to ensure the ReID features contain no lighting information. (2) A ReID Teacher model trained by images under "neutral" lighting conditions to guide ReID classification. (3) An illumination Teacher model trained by the differences between the illumination-adjusted and original images to guide illumination classification. We construct two augmented datasets by synthetically changing a set of predefined lighting conditions in two of the most popular ReID benchmarks: Market1501 and DukeMTMC-ReID. Experiments demonstrate that our algorithm outperforms other state-of-the-art works and particularly potent in handling images under extremely low light.
LGJul 15, 2018
Magnitude Bounded Matrix Factorisation for Recommender SystemsShuai Jiang, Kan Li, Richard Yi Da Xu
Low rank matrix factorisation is often used in recommender systems as a way of extracting latent features. When dealing with large and sparse datasets, traditional recommendation algorithms face the problem of acquiring large, unrestrained, fluctuating values over predictions especially for users/items with very few corresponding observations. Although the problem has been somewhat solved by imposing bounding constraints over its objectives, and/or over all entries to be within a fixed range, in terms of gaining better recommendations, these approaches have two major shortcomings that we aim to mitigate in this work: one is they can only deal with one pair of fixed bounds for all entries, and the other one is they are very time-consuming when applied on large scale recommender systems. In this paper, we propose a novel algorithm named Magnitude Bounded Matrix Factorisation (MBMF), which allows different bounds for individual users/items and performs very fast on large scale datasets. The key idea of our algorithm is to construct a model by constraining the magnitudes of each individual user/item feature vector. We achieve this by converting from the Cartesian to Spherical coordinate system with radii set as the corresponding magnitudes, which allows the above constrained optimisation problem to become an unconstrained one. The Stochastic Gradient Descent (SGD) method is then applied to solve the unconstrained task efficiently. Experiments on synthetic and real datasets demonstrate that in most cases the proposed MBMF is superior over all existing algorithms in terms of accuracy and time complexity.
LGMar 5, 2018
Relative Pairwise Relationship Constrained Non-negative Matrix FactorisationShuai Jiang, Kan Li, Richard Yida Xu
Non-negative Matrix Factorisation (NMF) has been extensively used in machine learning and data analytics applications. Most existing variations of NMF only consider how each row/column vector of factorised matrices should be shaped, and ignore the relationship among pairwise rows or columns. In many cases, such pairwise relationship enables better factorisation, for example, image clustering and recommender systems. In this paper, we propose an algorithm named, Relative Pairwise Relationship constrained Non-negative Matrix Factorisation (RPR-NMF), which places constraints over relative pairwise distances amongst features by imposing penalties in a triplet form. Two distance measures, squared Euclidean distance and Symmetric divergence, are used, and exponential and hinge loss penalties are adopted for the two measures respectively. It is well known that the so-called "multiplicative update rules" result in a much faster convergence than gradient descend for matrix factorisation. However, applying such update rules to RPR-NMF and also proving its convergence is not straightforward. Thus, we use reasonable approximations to relax the complexity brought by the penalties, which are practically verified. Experiments on both synthetic datasets and real datasets demonstrate that our algorithms have advantages on gaining close approximation, satisfying a high proportion of expected constraints, and achieving superior performance compared with other algorithms.