CVNov 7, 2022Code
MogaNet: Multi-order Gated Aggregation NetworkSiyuan Li, Zedong Wang, Zicheng Liu et al.
By contextualizing the kernel as global as possible, Modern ConvNets have shown great potential in computer vision tasks. However, recent progress on multi-order game-theoretic interaction within deep neural networks (DNNs) reveals the representation bottleneck of modern ConvNets, where the expressive interactions have not been effectively encoded with the increased kernel size. To tackle this challenge, we propose a new family of modern ConvNets, dubbed MogaNet, for discriminative visual representation learning in pure ConvNet-based models with favorable complexity-performance trade-offs. MogaNet encapsulates conceptually simple yet effective convolutions and gated aggregation into a compact module, where discriminative features are efficiently gathered and contextualized adaptively. MogaNet exhibits great scalability, impressive efficiency of parameters, and competitive performance compared to state-of-the-art ViTs and ConvNets on ImageNet and various downstream vision benchmarks, including COCO object detection, ADE20K semantic segmentation, 2D&3D human pose estimation, and video prediction. Notably, MogaNet hits 80.0% and 87.8% accuracy with 5.2M and 181M parameters on ImageNet-1K, outperforming ParC-Net and ConvNeXt-L, while saving 59% FLOPs and 17M parameters, respectively. The source code is available at https://github.com/Westlake-AI/MogaNet.
CVMar 10, 2023
CVT-SLR: Contrastive Visual-Textual Transformation for Sign Language Recognition with Variational AlignmentJiangbin Zheng, Yile Wang, Cheng Tan et al.
Sign language recognition (SLR) is a weakly supervised task that annotates sign videos as textual glosses. Recent studies show that insufficient training caused by the lack of large-scale available sign datasets becomes the main bottleneck for SLR. Most SLR works thereby adopt pretrained visual modules and develop two mainstream solutions. The multi-stream architectures extend multi-cue visual features, yielding the current SOTA performances but requiring complex designs and might introduce potential noise. Alternatively, the advanced single-cue SLR frameworks using explicit cross-modal alignment between visual and textual modalities are simple and effective, potentially competitive with the multi-cue framework. In this work, we propose a novel contrastive visual-textual transformation for SLR, CVT-SLR, to fully explore the pretrained knowledge of both the visual and language modalities. Based on the single-cue cross-modal alignment framework, we propose a variational autoencoder (VAE) for pretrained contextual knowledge while introducing the complete pretrained language module. The VAE implicitly aligns visual and textual modalities while benefiting from pretrained contextual knowledge as the traditional contextual module. Meanwhile, a contrastive cross-modal alignment algorithm is designed to explicitly enhance the consistency constraints. Extensive experiments on public datasets (PHOENIX-2014 and PHOENIX-2014T) demonstrate that our proposed CVT-SLR consistently outperforms existing single-cue methods and even outperforms SOTA multi-cue methods.
BMDec 2, 2022Code
Deciphering RNA Secondary Structure Prediction: A Probabilistic K-Rook Matching PerspectiveCheng Tan, Zhangyang Gao, Hanqun Cao et al.
The secondary structure of ribonucleic acid (RNA) is more stable and accessible in the cell than its tertiary structure, making it essential for functional prediction. Although deep learning has shown promising results in this field, current methods suffer from poor generalization and high complexity. In this work, we reformulate the RNA secondary structure prediction as a K-Rook problem, thereby simplifying the prediction process into probabilistic matching within a finite solution space. Building on this innovative perspective, we introduce RFold, a simple yet effective method that learns to predict the most matching K-Rook solution from the given sequence. RFold employs a bi-dimensional optimization strategy that decomposes the probabilistic matching problem into row-wise and column-wise components to reduce the matching complexity, simplifying the solving process while guaranteeing the validity of the output. Extensive experiments demonstrate that RFold achieves competitive performance and about eight times faster inference efficiency than the state-of-the-art approaches. The code and Colab demo are available in (http://github.com/A4Bio/RFold).
QMNov 30, 2022
Protein Language Models and Structure Prediction: Connection and ProgressionBozhen Hu, Jun Xia, Jiangbin Zheng et al.
The prediction of protein structures from sequences is an important task for function prediction, drug design, and related biological processes understanding. Recent advances have proved the power of language models (LMs) in processing the protein sequence databases, which inherit the advantages of attention networks and capture useful information in learning representations for proteins. The past two years have witnessed remarkable success in tertiary protein structure prediction (PSP), including evolution-based and single-sequence-based PSP. It seems that instead of using energy-based models and sampling procedures, protein language model (pLM)-based pipelines have emerged as mainstream paradigms in PSP. Despite the fruitful progress, the PSP community needs a systematic and up-to-date survey to help bridge the gap between LMs in the natural language processing (NLP) and PSP domains and introduce their methodologies, advancements and practical applications. To this end, in this paper, we first introduce the similarities between protein and human languages that allow LMs extended to pLMs, and applied to protein databases. Then, we systematically review recent advances in LMs and pLMs from the perspectives of network architectures, pre-training strategies, applications, and commonly-used protein databases. Next, different types of methods for PSP are discussed, particularly how the pLM-based architectures function in the process of protein folding. Finally, we identify challenges faced by the PSP community and foresee promising research directions along with the advances of pLMs. This survey aims to be a hands-on guide for researchers to understand PSP methods, develop pLMs and tackle challenging problems in this field for practical purposes.
CLOct 30, 2022
Using Context-to-Vector with Graph Retrofitting to Improve Word EmbeddingsJiangbin Zheng, Yile Wang, Ge Wang et al.
Although contextualized embeddings generated from large-scale pre-trained models perform well in many tasks, traditional static embeddings (e.g., Skip-gram, Word2Vec) still play an important role in low-resource and lightweight settings due to their low computational cost, ease of deployment, and stability. In this paper, we aim to improve word embeddings by 1) incorporating more contextual information from existing pre-trained models into the Skip-gram framework, which we call Context-to-Vec; 2) proposing a post-processing retrofitting method for static embeddings independent of training by employing priori synonym knowledge and weighted vector distribution. Through extrinsic and intrinsic tasks, our methods are well proven to outperform the baselines by a large margin.
LGFeb 5, 2023
Data-Efficient Protein 3D Geometric Pretraining via Refinement of Diffused Protein Structure DecoyYufei Huang, Lirong Wu, Haitao Lin et al.
Learning meaningful protein representation is important for a variety of biological downstream tasks such as structure-based drug design. Having witnessed the success of protein sequence pretraining, pretraining for structural data which is more informative has become a promising research topic. However, there are three major challenges facing protein structure pretraining: insufficient sample diversity, physically unrealistic modeling, and the lack of protein-specific pretext tasks. To try to address these challenges, we present the 3D Geometric Pretraining. In this paper, we propose a unified framework for protein pretraining and a 3D geometric-based, data-efficient, and protein-specific pretext task: RefineDiff (Refine the Diffused Protein Structure Decoy). After pretraining our geometric-aware model with this task on limited data(less than 1% of SOTA models), we obtained informative protein representations that can achieve comparable performance for various downstream tasks.
BMApr 21, 2023
Cross-Gate MLP with Protein Complex Invariant Embedding is A One-Shot Antibody DesignerCheng Tan, Zhangyang Gao, Lirong Wu et al.
Antibodies are crucial proteins produced by the immune system in response to foreign substances or antigens. The specificity of an antibody is determined by its complementarity-determining regions (CDRs), which are located in the variable domains of the antibody chains and form the antigen-binding site. Previous studies have utilized complex techniques to generate CDRs, but they suffer from inadequate geometric modeling. Moreover, the common iterative refinement strategies lead to an inefficient inference. In this paper, we propose a \textit{simple yet effective} model that can co-design 1D sequences and 3D structures of CDRs in a one-shot manner. To achieve this, we decouple the antibody CDR design problem into two stages: (i) geometric modeling of protein complex structures and (ii) sequence-structure co-learning. We develop a novel macromolecular structure invariant embedding, typically for protein complexes, that captures both intra- and inter-component interactions among the backbone atoms, including C$α$, N, C, and O atoms, to achieve comprehensive geometric modeling. Then, we introduce a simple cross-gate MLP for sequence-structure co-learning, allowing sequence and structure representations to implicitly refine each other. This enables our model to design desired sequences and structures in a one-shot manner. Extensive experiments are conducted to evaluate our results at both the sequence and structure levels, which demonstrate that our model achieves superior performance compared to the state-of-the-art antibody CDR design methods.
BMMar 19, 2023
CCPL: Cross-modal Contrastive Protein LearningJiangbin Zheng, Stan Z. Li
Effective protein representation learning is crucial for predicting protein functions. Traditional methods often pretrain protein language models on large, unlabeled amino acid sequences, followed by finetuning on labeled data. While effective, these methods underutilize the potential of protein structures, which are vital for function determination. Common structural representation techniques rely heavily on annotated data, limiting their generalizability. Moreover, structural pretraining methods, similar to natural language pretraining, can distort actual protein structures. In this work, we introduce a novel unsupervised protein structure representation pretraining method, cross-modal contrastive protein learning (CCPL). CCPL leverages a robust protein language model and uses unsupervised contrastive alignment to enhance structure learning, incorporating self-supervised structural constraints to maintain intrinsic structural information. We evaluated our model across various benchmarks, demonstrating the framework's superiority.
CLNov 1, 2022
Leveraging Graph-based Cross-modal Information Fusion for Neural Sign Language TranslationJiangbin Zheng, Siyuan Li, Cheng Tan et al.
Sign Language (SL), as the mother tongue of the deaf community, is a special visual language that most hearing people cannot understand. In recent years, neural Sign Language Translation (SLT), as a possible way for bridging communication gap between the deaf and the hearing people, has attracted widespread academic attention. We found that the current mainstream end-to-end neural SLT models, which tries to learning language knowledge in a weakly supervised manner, could not mine enough semantic information under the condition of low data resources. Therefore, we propose to introduce additional word-level semantic knowledge of sign language linguistics to assist in improving current end-to-end neural SLT models. Concretely, we propose a novel neural SLT model with multi-modal feature fusion based on the dynamic graph, in which the cross-modal information, i.e. text and video, is first assembled as a dynamic graph according to their correlation, and then the graph is processed by a multi-modal graph encoder to generate the multi-modal embeddings for further usage in the subsequent neural translation models. To the best of our knowledge, we are the first to introduce graph neural networks, for fusing multi-modal information, into neural sign language translation models. Moreover, we conducted experiments on a publicly available popular SLT dataset RWTH-PHOENIX-Weather-2014T. and the quantitative experiments show that our method can improve the model.
LGOct 4, 2023Code
Co-modeling the Sequential and Graphical Routes for Peptide Representation LearningZihan Liu, Ge Wang, Jiaqi Wang et al.
Peptides are formed by the dehydration condensation of multiple amino acids. The primary structure of a peptide can be represented either as an amino acid sequence or as a molecular graph consisting of atoms and chemical bonds. Previous studies have indicated that deep learning routes specific to sequential and graphical peptide forms exhibit comparable performance on downstream tasks. Despite the fact that these models learn representations of the same modality of peptides, we find that they explain their predictions differently. Considering sequential and graphical models as two experts making inferences from different perspectives, we work on fusing expert knowledge to enrich the learned representations for improving the discriminative performance. To achieve this, we propose a peptide co-modeling method, RepCon, which employs a contrastive learning-based framework to enhance the mutual information of representations from decoupled sequential and graphical end-to-end models. It considers representations from the sequential encoder and the graphical encoder for the same peptide sample as a positive pair and learns to enhance the consistency of representations between positive sample pairs and to repel representations between negative pairs. Empirical studies of RepCon and other co-modeling methods are conducted on open-source discriminative datasets, including aggregation propensity, retention time, antimicrobial peptide prediction, and family classification from Peptide Database. Our results demonstrate the superiority of the co-modeling approach over independent modeling, as well as the superiority of RepCon over other methods under the co-modeling framework. In addition, the attribution on RepCon further corroborates the validity of the approach at the level of model explanation.
LGOct 14, 2023
Protein 3D Graph Structure Learning for Robust Structure-based Protein Property PredictionYufei Huang, Siyuan Li, Jin Su et al.
Protein structure-based property prediction has emerged as a promising approach for various biological tasks, such as protein function prediction and sub-cellular location estimation. The existing methods highly rely on experimental protein structure data and fail in scenarios where these data are unavailable. Predicted protein structures from AI tools (e.g., AlphaFold2) were utilized as alternatives. However, we observed that current practices, which simply employ accurately predicted structures during inference, suffer from notable degradation in prediction accuracy. While similar phenomena have been extensively studied in general fields (e.g., Computer Vision) as model robustness, their impact on protein property prediction remains unexplored. In this paper, we first investigate the reason behind the performance decrease when utilizing predicted structures, attributing it to the structure embedding bias from the perspective of structure representation learning. To study this problem, we identify a Protein 3D Graph Structure Learning Problem for Robust Protein Property Prediction (PGSL-RP3), collect benchmark datasets, and present a protein Structure embedding Alignment Optimization framework (SAO) to mitigate the problem of structure embedding bias between the predicted and experimental protein structures. Extensive experiments have shown that our framework is model-agnostic and effective in improving the property prediction of both predicted structures and experimental structures. The benchmark datasets and codes will be released to benefit the community.
AIDec 18, 2025
Probing Scientific General Intelligence of LLMs with Scientist-Aligned WorkflowsWanghan Xu, Yuhao Zhou, Yifan Zhou et al.
Despite advances in scientific AI, a coherent framework for Scientific General Intelligence (SGI)-the ability to autonomously conceive, investigate, and reason across scientific domains-remains lacking. We present an operational SGI definition grounded in the Practical Inquiry Model (PIM: Deliberation, Conception, Action, Perception) and operationalize it via four scientist-aligned tasks: deep research, idea generation, dry/wet experiments, and experimental reasoning. SGI-Bench comprises over 1,000 expert-curated, cross-disciplinary samples inspired by Science's 125 Big Questions, enabling systematic evaluation of state-of-the-art LLMs. Results reveal gaps: low exact match (10--20%) in deep research despite step-level alignment; ideas lacking feasibility and detail; high code executability but low execution result accuracy in dry experiments; low sequence fidelity in wet protocols; and persistent multimodal comparative-reasoning challenges. We further introduce Test-Time Reinforcement Learning (TTRL), which optimizes retrieval-augmented novelty rewards at inference, enhancing hypothesis novelty without reference answer. Together, our PIM-grounded definition, workflow-centric benchmark, and empirical insights establish a foundation for AI systems that genuinely participate in scientific discovery.
BMAug 5, 2024
MetaEnzyme: Meta Pan-Enzyme Learning for Task-Adaptive RedesignJiangbin Zheng, Han Zhang, Qianqing Xu et al.
Enzyme design plays a crucial role in both industrial production and biology. However, this field faces challenges due to the lack of comprehensive benchmarks and the complexity of enzyme design tasks, leading to a dearth of systematic research. Consequently, computational enzyme design is relatively overlooked within the broader protein domain and remains in its early stages. In this work, we address these challenges by introducing MetaEnzyme, a staged and unified enzyme design framework. We begin by employing a cross-modal structure-to-sequence transformation architecture, as the feature-driven starting point to obtain initial robust protein representation. Subsequently, we leverage domain adaptive techniques to generalize specific enzyme design tasks under low-resource conditions. MetaEnzyme focuses on three fundamental low-resource enzyme redesign tasks: functional design (FuncDesign), mutation design (MutDesign), and sequence generation design (SeqDesign). Through novel unified paradigm and enhanced representation capabilities, MetaEnzyme demonstrates adaptability to diverse enzyme design tasks, yielding outstanding results. Wet lab experiments further validate these findings, reinforcing the efficacy of the redesign process.
GNMay 13, 2024
VQDNA: Unleashing the Power of Vector Quantization for Multi-Species Genomic Sequence ModelingSiyuan Li, Zedong Wang, Zicheng Liu et al.
Similar to natural language models, pre-trained genome language models are proposed to capture the underlying intricacies within genomes with unsupervised sequence modeling. They have become essential tools for researchers and practitioners in biology. However, the hand-crafted tokenization policies used in these models may not encode the most discriminative patterns from the limited vocabulary of genomic data. In this paper, we introduce VQDNA, a general-purpose framework that renovates genome tokenization from the perspective of genome vocabulary learning. By leveraging vector-quantized codebooks as learnable vocabulary, VQDNA can adaptively tokenize genomes into pattern-aware embeddings in an end-to-end manner. To further push its limits, we propose Hierarchical Residual Quantization (HRQ), where varying scales of codebooks are designed in a hierarchy to enrich the genome vocabulary in a coarse-to-fine manner. Extensive experiments on 32 genome datasets demonstrate VQDNA's superiority and favorable parameter efficiency compared to existing genome language models. Notably, empirical analysis of SARS-CoV-2 mutations reveals the fine-grained pattern awareness and biological significance of learned HRQ vocabulary, highlighting its untapped potential for broader applications in genomics.
AIDec 11, 2023
Progressive Multi-Modality Learning for Inverse Protein FoldingJiangbin Zheng, Stan Z. Li
While deep generative models show promise for learning inverse protein folding directly from data, the lack of publicly available structure-sequence pairings limits their generalization. Previous improvements and data augmentation efforts to overcome this bottleneck have been insufficient. To further address this challenge, we propose a novel protein design paradigm called MMDesign, which leverages multi-modality transfer learning. To our knowledge, MMDesign is the first framework that combines a pretrained structural module with a pretrained contextual module, using an auto-encoder (AE) based language model to incorporate prior protein semantic knowledge. Experimental results, only training with the small dataset, demonstrate that MMDesign consistently outperforms baselines on various public benchmarks. To further assess the biological plausibility, we present systematic quantitative analysis techniques that provide interpretability and reveal more about the laws of protein design.
59.2CVApr 9
DiffVC: A Non-autoregressive Framework Based on Diffusion Model for Video CaptioningJunbo Wang, Liangyu Fu, Yuke Li et al.
Current video captioning methods usually use an encoder-decoder structure to generate text autoregressively. However, autoregressive methods have inherent limitations such as slow generation speed and large cumulative error. Furthermore, the few non-autoregressive counterparts suffer from deficiencies in generation quality due to the lack of sufficient multimodal interaction modeling. Therefore, we propose a non-autoregressive framework based on Diffusion model for Video Captioning (DiffVC) to address these issues. Its parallel decoding can effectively solve the problems of generation speed and cumulative error. At the same time, our proposed discriminative conditional Diffusion Model can generate higher-quality textual descriptions. Specifically, we first encode the video into a visual representation. During training, Gaussian noise is added to the textual representation of the ground-truth caption. Then, a new textual representation is generated via the discriminative denoiser with the visual representation as a conditional constraint. Finally, we input the new textual representation into a non-autoregressive language model to generate captions. During inference, we directly sample noise from the Gaussian distribution for generation. Experiments on MSVD, MSR-VTT, and VATEX show that our method can outperform previous non-autoregressive methods and achieve comparable performance to autoregressive methods, e.g., it achieved a maximum improvement of 9.9 on the CIDEr and improvement of 2.6 on the B@4, while having faster generation speed. The source code will be available soon.
CVMar 4, 2025
10K is Enough: An Ultra-Lightweight Binarized Network for Infrared Small-Target DetectionBiqiao Xin, Qianchen Mao, Bingshu Wang et al.
The widespread deployment of Infrared Small-Target Detection (IRSTD) algorithms on edge devices necessitates the exploration of model compression techniques. Binarized neural networks (BNNs) are distinguished by their exceptional efficiency in model compression. However, the small size of infrared targets introduces stringent precision requirements for the IRSTD task, while the inherent precision loss during binarization presents a significant challenge. To address this, we propose the Binarized Infrared Small-Target Detection Network (BiisNet), which preserves the core operations of binarized convolutions while integrating full-precision features into the network's information flow. Specifically, we propose the Dot Binary Convolution, which retains fine-grained semantic information in feature maps while still leveraging the binarized convolution operations. In addition, we introduce a smooth and adaptive Dynamic Softsign function, which provides more comprehensive and progressively finer gradient during backpropagation, enhancing model stability and promoting an optimal weight distribution. Experimental results demonstrate that BiisNet not only significantly outperforms other binary architectures but also has strong competitiveness among state-of-the-art full-precision models.
LGNov 17, 2025
Departures: Distributional Transport for Single-Cell Perturbation Prediction with Neural Schrödinger BridgesChangxi Chi, Yufei Huang, Jun Xia et al.
Predicting single-cell perturbation outcomes directly advances gene function analysis and facilitates drug candidate selection, making it a key driver of both basic and translational biomedical research. However, a major bottleneck in this task is the unpaired nature of single-cell data, as the same cell cannot be observed both before and after perturbation due to the destructive nature of sequencing. Although some neural generative transport models attempt to tackle unpaired single-cell perturbation data, they either lack explicit conditioning or depend on prior spaces for indirect distribution alignment, limiting precise perturbation modeling. In this work, we approximate Schrödinger Bridge (SB), which defines stochastic dynamic mappings recovering the entropy-regularized optimal transport (OT), to directly align the distributions of control and perturbed single-cell populations across different perturbation conditions. Unlike prior SB approximations that rely on bidirectional modeling to infer optimal source-target sample coupling, we leverage Minibatch-OT based pairing to avoid such bidirectional inference and the associated ill-posedness of defining the reverse process. This pairing directly guides bridge learning, yielding a scalable approximation to the SB. We approximate two SB models, one modeling discrete gene activation states and the other continuous expression distributions. Joint training enables accurate perturbation modeling and captures single-cell heterogeneity. Experiments on public genetic and drug perturbation datasets show that our model effectively captures heterogeneous single-cell responses and achieves state-of-the-art performance.
LGSep 15, 2025
Multimodal Regression for Enzyme Turnover Rates PredictionBozhen Hu, Cheng Tan, Siyuan Li et al.
The enzyme turnover rate is a fundamental parameter in enzyme kinetics, reflecting the catalytic efficiency of enzymes. However, enzyme turnover rates remain scarce across most organisms due to the high cost and complexity of experimental measurements. To address this gap, we propose a multimodal framework for predicting the enzyme turnover rate by integrating enzyme sequences, substrate structures, and environmental factors. Our model combines a pre-trained language model and a convolutional neural network to extract features from protein sequences, while a graph neural network captures informative representations from substrate molecules. An attention mechanism is incorporated to enhance interactions between enzyme and substrate representations. Furthermore, we leverage symbolic regression via Kolmogorov-Arnold Networks to explicitly learn mathematical formulas that govern the enzyme turnover rate, enabling interpretable and accurate predictions. Extensive experiments demonstrate that our framework outperforms both traditional and state-of-the-art deep learning approaches. This work provides a robust tool for studying enzyme kinetics and holds promise for applications in enzyme engineering, biotechnology, and industrial biocatalysis.
QMNov 26, 2024
DapPep: Domain Adaptive Peptide-agnostic Learning for Universal T-cell Receptor-antigen Binding Affinity PredictionJiangbin Zheng, Qianhui Xu, Ruichen Xia et al.
Identifying T-cell receptors (TCRs) that interact with antigenic peptides provides the technical basis for developing vaccines and immunotherapies. The emergent deep learning methods excel at learning antigen binding patterns from known TCRs but struggle with novel or sparsely represented antigens. However, binding specificity for unseen antigens or exogenous peptides is critical. We introduce a domain-adaptive peptide-agnostic learning framework DapPep for universal TCR-antigen binding affinity prediction to address this challenge. The lightweight self-attention architecture combines a pre-trained protein language model with an inner-loop self-supervised regime to enable robust TCR-peptide representations. Extensive experiments on various benchmarks demonstrate that DapPep consistently outperforms existing tools, showcasing robust generalization capability, especially for data-scarce settings and unseen peptides. Moreover, DapPep proves effective in challenging clinical tasks such as sorting reactive T cells in tumor neoantigen therapy and identifying key positions in 3D structures.
QMNov 26, 2024
Pan-protein Design Learning Enables Task-adaptive Generalization for Low-resource Enzyme DesignJiangbin Zheng, Ge Wang, Han Zhang et al.
Computational protein design (CPD) offers transformative potential for bioengineering, but current deep CPD models, focused on universal domains, struggle with function-specific designs. This work introduces a novel CPD paradigm tailored for functional design tasks, particularly for enzymes-a key protein class often lacking specific application efficiency. To address structural data scarcity, we present CrossDesign, a domain-adaptive framework that leverages pretrained protein language models (PPLMs). By aligning protein structures with sequences, CrossDesign transfers pretrained knowledge to structure models, overcoming the limitations of limited structural data. The framework combines autoregressive (AR) and non-autoregressive (NAR) states in its encoder-decoder architecture, applying it to enzyme datasets and pan-proteins. Experimental results highlight CrossDesign's superior performance and robustness, especially with out-of-domain enzymes. Additionally, the model excels in fitness prediction when tested on large-scale mutation data, showcasing its stability.
CVMar 19, 2024
Inter- and intra-uncertainty based feature aggregation model for semi-supervised histopathology image segmentationQiangguo Jin, Hui Cui, Changming Sun et al.
Acquiring pixel-level annotations is often limited in applications such as histology studies that require domain expertise. Various semi-supervised learning approaches have been developed to work with limited ground truth annotations, such as the popular teacher-student models. However, hierarchical prediction uncertainty within the student model (intra-uncertainty) and image prediction uncertainty (inter-uncertainty) have not been fully utilized by existing methods. To address these issues, we first propose a novel inter- and intra-uncertainty regularization method to measure and constrain both inter- and intra-inconsistencies in the teacher-student architecture. We also propose a new two-stage network with pseudo-mask guided feature aggregation (PG-FANet) as the segmentation model. The two-stage structure complements with the uncertainty regularization strategy to avoid introducing extra modules in solving uncertainties and the aggregation mechanisms enable multi-scale and multi-stage feature integration. Comprehensive experimental results over the MoNuSeg and CRAG datasets show that our PG-FANet outperforms other state-of-the-art methods and our semi-supervised learning framework yields competitive performance with a limited amount of labeled data.
LGJan 5, 2024
Graph-level Protein Representation Learning by Structure Knowledge RefinementGe Wang, Zelin Zang, Jiangbin Zheng et al.
This paper focuses on learning representation on the whole graph level in an unsupervised manner. Learning graph-level representation plays an important role in a variety of real-world issues such as molecule property prediction, protein structure feature extraction, and social network analysis. The mainstream method is utilizing contrastive learning to facilitate graph feature extraction, known as Graph Contrastive Learning (GCL). GCL, although effective, suffers from some complications in contrastive learning, such as the effect of false negative pairs. Moreover, augmentation strategies in GCL are weakly adaptive to diverse graph datasets. Motivated by these problems, we propose a novel framework called Structure Knowledge Refinement (SKR) which uses data structure to determine the probability of whether a pair is positive or negative. Meanwhile, we propose an augmentation strategy that naturally preserves the semantic meaning of the original data and is compatible with our SKR framework. Furthermore, we illustrate the effectiveness of our SKR framework through intuition and experiments. The experimental results on the tasks of graph-level classification demonstrate that our SKR framework is superior to most state-of-the-art baselines.
CVJan 13, 2022
A Survey on Masked Facial Detection Methods and Datasets for Fighting Against COVID-19Bingshu Wang, Jiangbin Zheng, C. L. Philip Chen
Coronavirus disease 2019 (COVID-19) continues to pose a great challenge to the world since its outbreak. To fight against the disease, a series of artificial intelligence (AI) techniques are developed and applied to real-world scenarios such as safety monitoring, disease diagnosis, infection risk assessment, lesion segmentation of COVID-19 CT scans,etc. The coronavirus epidemics have forced people wear masks to counteract the transmission of virus, which also brings difficulties to monitor large groups of people wearing masks. In this paper, we primarily focus on the AI techniques of masked facial detection and related datasets. We survey the recent advances, beginning with the descriptions of masked facial detection datasets. Thirteen available datasets are described and discussed in details. Then, the methods are roughly categorized into two classes: conventional methods and neural network-based methods. Conventional methods are usually trained by boosting algorithms with hand-crafted features, which accounts for a small proportion. Neural network-based methods are further classified as three parts according to the number of processing stages. Representative algorithms are described in detail, coupled with some typical techniques that are described briefly. Finally, we summarize the recent benchmarking results, give the discussions on the limitations of datasets and methods, and expand future research directions. To our knowledge, this is the first survey about masked facial detection methods and datasets. Hopefully our survey could provide some help to fight against epidemics.
CVJun 5, 2020
SparseFusion: Dynamic Human Avatar Modeling from Sparse RGBD ImagesXinxin Zuo, Sen Wang, Jiangbin Zheng et al.
In this paper, we propose a novel approach to reconstruct 3D human body shapes based on a sparse set of RGBD frames using a single RGBD camera. We specifically focus on the realistic settings where human subjects move freely during the capture. The main challenge is how to robustly fuse these sparse frames into a canonical 3D model, under pose changes and surface occlusions. This is addressed by our new framework consisting of the following steps. First, based on a generative human template, for every two frames having sufficient overlap, an initial pairwise alignment is performed; It is followed by a global non-rigid registration procedure, in which partial results from RGBD frames are collected into a unified 3D shape, under the guidance of correspondences from the pairwise alignment; Finally, the texture map of the reconstructed human model is optimized to deliver a clear and spatially consistent texture. Empirical evaluations on synthetic and real datasets demonstrate both quantitatively and qualitatively the superior performance of our framework in reconstructing complete 3D human models with high fidelity. It is worth noting that our framework is flexible, with potential applications going beyond shape reconstruction. As an example, we showcase its use in reshaping and reposing to a new avatar.
CVMar 5, 2018
Beyond Context: Exploring Semantic Similarity for Tiny Face DetectionYue Xi, Jiangbin Zheng, Xiangjian He et al.
Tiny face detection aims to find faces with high degrees of variability in scale, resolution and occlusion in cluttered scenes. Due to the very little information available on tiny faces, it is not sufficient to detect them merely based on the information presented inside the tiny bounding boxes or their context. In this paper, we propose to exploit the semantic similarity among all predicted targets in each image to boost current face detectors. To this end, we present a novel framework to model semantic similarity as pairwise constraints within the metric learning scheme, and then refine our predictions with the semantic similarity by utilizing the graph cut techniques. Experiments conducted on three widely-used benchmark datasets have demonstrated the improvement over the-state-of-the-arts gained by applying this idea.
CVFeb 6, 2017
Detailed Surface Geometry and Albedo Recovery from RGB-D Video Under Natural IlluminationXinxin Zuo, Sen Wang, Jiangbin Zheng et al.
In this paper we present a novel approach for depth map enhancement from an RGB-D video sequence. The basic idea is to exploit the shading information in the color image. Instead of making assumption about surface albedo or controlled object motion and lighting, we use the lighting variations introduced by casual object movement. We are effectively calculating photometric stereo from a moving object under natural illuminations. The key technical challenge is to establish correspondences over the entire image set. We therefore develop a lighting insensitive robust pixel matching technique that out-performs optical flow method in presence of lighting variations. In addition we present an expectation-maximization framework to recover the surface normal and albedo simultaneously, without any regularization term. We have validated our method on both synthetic and real datasets to show its superior performance on both surface details recovery and intrinsic decomposition.