Siyuan Zhang

CV
h-index49
24papers
1,133citations
Novelty45%
AI Score53

24 Papers

LGMay 12Code
Safety Alignment as Continual Learning: Mitigating the Alignment Tax via Orthogonal Gradient Projection

Guanglong Sun, Siyuan Zhang, Liyuan Wang et al.

Safety post-training can improve the harmfulness and policy compliance of Large Language Models (LLMs), but it may also reduce general utility, a phenomenon often described as the \emph{alignment tax}. We study this trade-off through the lens of continual learning: sequential alignment stages expose the model to shifted data distributions and objectives, and their gradients may interfere with directions that support previously acquired general capabilities. This view does not claim that all alignment degradation has a single cause; rather, it provides a useful first-order mechanism for mitigating one important source of capability regression. We propose \textbf{O}rthogonal \textbf{G}radient \textbf{P}rojection for \textbf{S}afety \textbf{A}lignment (\textbf{OGPSA}), a lightweight update rule that estimates a low-rank reference subspace from gradients on a small set of general-capability data and removes from each safety gradient the component lying in this subspace. The resulting update is the steepest local safety-descent direction subject to first-order preservation constraints on the reference objectives. OGPSA is compatible with standard post-training pipelines and avoids large-scale replay, although it introduces periodic reference-gradient computation. Across Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and sequential SFT$\rightarrow$DPO settings, OGPSA improves the observed safety--utility trade-off over standard baselines. Under the sequential SFT$\rightarrow$DPO pipeline, the average performance gain increases from 33.98\% to 42.74\% on Qwen2.5-7B-Instruct and from 19.74\% to 32.98\% on Llama3.1-8B-Instruct. We have open sourced our code at https://github.com/SunGL001/OGPSA.

CVJan 14Code
STEP3-VL-10B Technical Report

Ailin Huang, Chengyuan Yao, Chunrui Han et al.

We present STEP3-VL-10B, a lightweight open-source foundation model designed to redefine the trade-off between compact efficiency and frontier-level multimodal intelligence. STEP3-VL-10B is realized through two strategic shifts: first, a unified, fully unfrozen pre-training strategy on 1.2T multimodal tokens that integrates a language-aligned Perception Encoder with a Qwen3-8B decoder to establish intrinsic vision-language synergy; and second, a scaled post-training pipeline featuring over 1k iterations of reinforcement learning. Crucially, we implement Parallel Coordinated Reasoning (PaCoRe) to scale test-time compute, allocating resources to scalable perceptual reasoning that explores and synthesizes diverse visual hypotheses. Consequently, despite its compact 10B footprint, STEP3-VL-10B rivals or surpasses models 10$\times$-20$\times$ larger (e.g., GLM-4.6V-106B, Qwen3-VL-235B) and top-tier proprietary flagships like Gemini 2.5 Pro and Seed-1.5-VL. Delivering best-in-class performance, it records 92.2% on MMBench and 80.11% on MMMU, while excelling in complex reasoning with 94.43% on AIME2025 and 75.95% on MathVision. We release the full model suite to provide the community with a powerful, efficient, and reproducible baseline.

MLOct 31, 2022
Probability-Dependent Gradient Decay in Large Margin Softmax

Siyuan Zhang, Linbo Xie, Ying Chen

In the past few years, Softmax has become a common component in neural network frameworks. In this paper, a gradient decay hyperparameter is introduced in Softmax to control the probability-dependent gradient decay rate during training. By following the theoretical analysis and empirical results of a variety of model architectures trained on MNIST, CIFAR-10/100 and SVHN, we find that the generalization performance depends significantly on the gradient decay rate as the confidence probability rises, i.e., the gradient decreases convexly or concavely as the sample probability increases. Moreover, optimization with the small gradient decay shows a similar curriculum learning sequence where hard samples are in the spotlight only after easy samples are convinced sufficiently, and well-separated samples gain a higher gradient to reduce intra-class distance. Based on the analysis results, we can provide evidence that the large margin Softmax will affect the local Lipschitz constraint of the loss function by regulating the probability-dependent gradient decay rate. This paper provides a new perspective and understanding of the relationship among concepts of large margin Softmax, local Lipschitz constraint and curriculum learning by analyzing the gradient decay rate. Besides, we propose a warm-up strategy to dynamically adjust Softmax loss in training, where the gradient decay rate increases from over-small to speed up the convergence rate.

IVMar 13, 2023
Direct Motif Extraction from High Resolution Crystalline STEM Images

Amel Shamseldeen Ali Alhassan, Siyuan Zhang, Benjamin Berkels

During the last decade, automatic data analysis methods concerning different aspects of crystal analysis have been developed, e.g., unsupervised primitive unit cell extraction and automated crystal distortion and defects detection. However, an automatic, unsupervised motif extraction method is still not widely available yet. Here, we propose and demonstrate a novel method for the automatic motif extraction in real space from crystalline images based on a variational approach involving the unit cell projection operator. Due to the non-convex nature of the resulting minimization problem, a multi-stage algorithm is used. First, we determine the primitive unit cell in form of two lattice vectors. Second, a motif image is estimated using the unit cell information. Finally, the motif is determined in terms of atom positions inside the unit cell. The method was tested on various synthetic and experimental HAADF STEM images. The results are a representation of the motif in form of an image, atomic positions, primitive unit cell vectors, and a denoised and a modeled reconstruction of the input image. The method was applied to extract the primitive cells of complex $μ$-phase structures Nb$_\text{6.4}$Co$_\text{6.6}$ and Nb$_\text{7}$Co$_\text{6}$, where subtle differences between their interplanar spacings were determined.

GRMar 7, 2023
NEPHELE: A Neural Platform for Highly Realistic Cloud Radiance Rendering

Haimin Luo, Siyuan Zhang, Fuqiang Zhao et al.

We have recently seen tremendous progress in neural rendering (NR) advances, i.e., NeRF, for photo-real free-view synthesis. Yet, as a local technique based on a single computer/GPU, even the best-engineered Instant-NGP or i-NGP cannot reach real-time performance when rendering at a high resolution, and often requires huge local computing resources. In this paper, we resort to cloud rendering and present NEPHELE, a neural platform for highly realistic cloud radiance rendering. In stark contrast with existing NR approaches, our NEPHELE allows for more powerful rendering capabilities by combining multiple remote GPUs and facilitates collaboration by allowing multiple people to view the same NeRF scene simultaneously. We introduce i-NOLF to employ opacity light fields for ultra-fast neural radiance rendering in a one-query-per-ray manner. We further resemble the Lumigraph with geometry proxies for fast ray querying and subsequently employ a small MLP to model the local opacity lumishperes for high-quality rendering. We also adopt Perfect Spatial Hashing in i-NOLF to enhance cache coherence. As a result, our i-NOLF achieves an order of magnitude performance gain in terms of efficiency than i-NGP, especially for the multi-user multi-viewpoint setting under cloud rendering scenarios. We further tailor a task scheduler accompanied by our i-NOLF representation and demonstrate the advance of our methodological design through a comprehensive cloud platform, consisting of a series of cooperated modules, i.e., render farms, task assigner, frame composer, and detailed streaming strategies. Using such a cloud platform compatible with neural rendering, we further showcase the capabilities of our cloud radiance rendering through a series of applications, ranging from cloud VR/AR rendering.

CLFeb 4, 2025Code
STAIR: Improving Safety Alignment with Introspective Reasoning

Yichi Zhang, Siyuan Zhang, Yao Huang et al.

Ensuring the safety and harmlessness of Large Language Models (LLMs) has become equally critical as their performance in applications. However, existing safety alignment methods typically suffer from safety-performance trade-offs and the susceptibility to jailbreak attacks, primarily due to their reliance on direct refusals for malicious queries. In this paper, we propose STAIR, a novel framework that integrates SafeTy Alignment with Itrospective Reasoning. We enable LLMs to identify safety risks through step-by-step analysis by self-improving chain-of-thought (CoT) reasoning with safety awareness. STAIR first equips the model with a structured reasoning capability and then advances safety alignment via iterative preference optimization on step-level reasoning data generated using our newly proposed Safety-Informed Monte Carlo Tree Search (SI-MCTS). We further train a process reward model on this data to guide test-time searches for improved responses. Extensive experiments show that STAIR effectively mitigates harmful outputs while better preserving helpfulness, compared to instinctive alignment strategies. With test-time scaling, STAIR achieves a safety performance comparable to Claude-3.5 against popular jailbreak attacks. Relevant resources in this work are available at https://github.com/thu-ml/STAIR.

LGApr 26, 2024Code
MER 2024: Semi-Supervised Learning, Noise Robustness, and Open-Vocabulary Multimodal Emotion Recognition

Zheng Lian, Haiyang Sun, Licai Sun et al.

Multimodal emotion recognition is an important research topic in artificial intelligence. Over the past few decades, researchers have made remarkable progress by increasing the dataset size and building more effective algorithms. However, due to problems such as complex environments and inaccurate annotations, current systems are hard to meet the demands of practical applications. Therefore, we organize the MER series of competitions to promote the development of this field. Last year, we launched MER2023, focusing on three interesting topics: multi-label learning, noise robustness, and semi-supervised learning. In this year's MER2024, besides expanding the dataset size, we further introduce a new track around open-vocabulary emotion recognition. The main purpose of this track is that existing datasets usually fix the label space and use majority voting to enhance the annotator consistency. However, this process may lead to inaccurate annotations, such as ignoring non-majority or non-candidate labels. In this track, we encourage participants to generate any number of labels in any category, aiming to describe emotional states as accurately as possible. Our baseline code relies on MERTools and is available at: https://github.com/zeroQiaoba/MERTools/tree/master/MER2024.

AIMay 18
EXG: Self-Evolving Agents with Experience Graphs

Yuxin Jin, Siyuan Zhang, Hanchen Wang et al.

Large language model (LLM)-based agents have demonstrated strong capabilities in complex reasoning and problem solving through multi-step interactions, yet most deployed agents remain behaviorally static, with knowledge acquired during execution rarely translating into systematic improvement over time. In response, a growing line of work on self-evolving agents explores how agents can improve through experience during deployment, but most existing approaches either rely on ad hoc reflection limited to single-task correction or adopt unstructured memory that accumulates fragmented experience with delayed usability. To address this limitation, we introduce EXG, an experience graph framework for self-evolving agents that explicitly organizes accumulated successes and failures into a structured, relational representation. EXG is the first experience graph designed for self-evolving agents, supporting both online, real-time graph growth during execution for immediate cross-task experience reuse, and offline reuse of a consolidated experience graph as an external memory module. This design also enables EXG to serve as a plug-and-play component for existing self-evolving agents, organizing prior experience into a unified experience graph and improving both solution quality and resource efficiency as deployment progresses. Extensive experiments across code generation and reasoning benchmarks show that EXG attains more favorable performance-efficiency trade-offs than reflection- and memory-based baselines in both online and offline evaluations. Our results suggest that structuring experience as a graph provides a principled foundation for scalable and transferable self-evolving agent behavior.

CVMay 22, 2025Code
REOBench: Benchmarking Robustness of Earth Observation Foundation Models

Xiang Li, Yong Tao, Siyuan Zhang et al.

Earth observation foundation models have shown strong generalization across multiple Earth observation tasks, but their robustness under real-world perturbations remains underexplored. To bridge this gap, we introduce REOBench, the first comprehensive benchmark for evaluating the robustness of Earth observation foundation models across six tasks and twelve types of image corruptions, including both appearance-based and geometric perturbations. To ensure realistic and fine-grained evaluation, our benchmark focuses on high-resolution optical remote sensing images, which are widely used in critical applications such as urban planning and disaster response. We conduct a systematic evaluation of a broad range of models trained using masked image modeling, contrastive learning, and vision-language pre-training paradigms. Our results reveal that (1) existing Earth observation foundation models experience significant performance degradation when exposed to input corruptions. (2) The severity of degradation varies across tasks, model architectures, backbone sizes, and types of corruption, with performance drop varying from less than 1% to over 20%. (3) Vision-language models show enhanced robustness, particularly in multimodal tasks. REOBench underscores the vulnerability of current Earth observation foundation models to real-world corruptions and provides actionable insights for developing more robust and reliable models. Code and data are publicly available at https://github.com/lx709/REOBench.

CVMar 22, 2024Code
Multimodal Fusion with Pre-Trained Model Features in Affective Behaviour Analysis In-the-wild

Zhuofan Wen, Fengyu Zhang, Siyuan Zhang et al.

Multimodal fusion is a significant method for most multimodal tasks. With the recent surge in the number of large pre-trained models, combining both multimodal fusion methods and pre-trained model features can achieve outstanding performance in many multimodal tasks. In this paper, we present our approach, which leverages both advantages for addressing the task of Expression (Expr) Recognition and Valence-Arousal (VA) Estimation. We evaluate the Aff-Wild2 database using pre-trained models, then extract the final hidden layers of the models as features. Following preprocessing and interpolation or convolution to align the extracted features, different models are employed for modal fusion. Our code is available at GitHub - FulgenceWen/ABAW6th.

CVMar 21, 2025Code
Feature-Based Dual Visual Feature Extraction Model for Compound Multimodal Emotion Recognition

Ran Liu, Fengyu Zhang, Cong Yu et al.

This article presents our results for the eighth Affective Behavior Analysis in-the-wild (ABAW) competition.Multimodal emotion recognition (ER) has important applications in affective computing and human-computer interaction. However, in the real world, compound emotion recognition faces greater issues of uncertainty and modal conflicts. For the Compound Expression (CE) Recognition Challenge,this paper proposes a multimodal emotion recognition method that fuses the features of Vision Transformer (ViT) and Residual Network (ResNet). We conducted experiments on the C-EXPR-DB and MELD datasets. The results show that in scenarios with complex visual and audio cues (such as C-EXPR-DB), the model that fuses the features of ViT and ResNet exhibits superior performance.Our code are avalible on https://github.com/MyGitHub-ax/8th_ABAW

CVApr 17, 2024
Exploring the Transferability of Visual Prompting for Multimodal Large Language Models

Yichi Zhang, Yinpeng Dong, Siyuan Zhang et al.

Although Multimodal Large Language Models (MLLMs) have demonstrated promising versatile capabilities, their performance is still inferior to specialized models on downstream tasks, which makes adaptation necessary to enhance their utility. However, fine-tuning methods require independent training for every model, leading to huge computation and memory overheads. In this paper, we propose a novel setting where we aim to improve the performance of diverse MLLMs with a group of shared parameters optimized for a downstream task. To achieve this, we propose Transferable Visual Prompting (TVP), a simple and effective approach to generate visual prompts that can transfer to different models and improve their performance on downstream tasks after trained on only one model. We introduce two strategies to address the issue of cross-model feature corruption of existing visual prompting methods and enhance the transferability of the learned prompts, including 1) Feature Consistency Alignment: which imposes constraints to the prompted feature changes to maintain task-agnostic knowledge; 2) Task Semantics Enrichment: which encourages the prompted images to contain richer task-specific semantics with language guidance. We validate the effectiveness of TVP through extensive experiments with 6 modern MLLMs on a wide variety of tasks ranging from object recognition and counting to multimodal reasoning and hallucination correction.

CLFeb 26, 2025
Exploring the Generalizability of Factual Hallucination Mitigation via Enhancing Precise Knowledge Utilization

Siyuan Zhang, Yichi Zhang, Yinpeng Dong et al.

Large Language Models (LLMs) often struggle to align their responses with objective facts, resulting in the issue of factual hallucinations, which can be difficult to detect and mislead users without relevant knowledge. Although post-training techniques have been employed to mitigate the issue, existing methods usually suffer from poor generalization and trade-offs in other different capabilities. In this paper, we propose to address these by directly augmenting LLM's fundamental ability to precisely leverage its knowledge and introduce PKUE (Precise Knowledge Utilization Enhancement), which fine-tunes the model on self-generated responses to precise and simple factual questions through preference optimization. Furthermore, we construct FactualBench, a comprehensive and precise factual QA dataset containing 181k Chinese data spanning 21 domains, to facilitate both evaluation and training. Extensive experiments demonstrate that PKUE significantly improves LLM overall performance, with consistent enhancement across factual tasks of various forms, general tasks beyond factuality, and tasks in different language.

LGMar 10, 2025
Log Optimization Simplification Method for Predicting Remaining Time

Jianhong Ye, Siyuan Zhang, Yan Lin

Information systems generate a large volume of event log data during business operations, much of which consists of low-value and redundant information. When performance predictions are made directly from these logs, the accuracy of the predictions can be compromised. Researchers have explored methods to simplify and compress these data while preserving their valuable components. Most existing approaches focus on reducing the dimensionality of the data by eliminating redundant and irrelevant features. However, there has been limited investigation into the efficiency of execution both before and after event log simplification. In this paper, we present a prediction point selection algorithm designed to avoid the simplification of all points that function similarly. We select sequences or self-loop structures to form a simplifiable segment, and we optimize the deviation between the actual simplifiable value and the original data prediction value to prevent over-simplification. Experiments indicate that the simplified event log retains its predictive performance and, in some cases, enhances its predictive accuracy compared to the original event log.

OCMar 18, 2024
Convergence of Decentralized Stochastic Subgradient-based Methods for Nonsmooth Nonconvex functions

Siyuan Zhang, Nachuan Xiao, Xin Liu

In this paper, we focus on the decentralized stochastic subgradient-based methods in minimizing nonsmooth nonconvex functions without Clarke regularity, especially in the decentralized training of nonsmooth neural networks. We propose a general framework that unifies various decentralized subgradient-based methods, such as decentralized stochastic subgradient descent (DSGD), DSGD with gradient-tracking technique (DSGD-T), and DSGD with momentum (DSGD-M). To establish the convergence properties of our proposed framework, we relate the discrete iterates to the trajectories of a continuous-time differential inclusion, which is assumed to have a coercive Lyapunov function with a stable set $\mathcal{A}$. We prove the asymptotic convergence of the iterates to the stable set $\mathcal{A}$ with sufficiently small and diminishing step-sizes. These results provide first convergence guarantees for some well-recognized of decentralized stochastic subgradient-based methods without Clarke regularity of the objective function. Preliminary numerical experiments demonstrate that our proposed framework yields highly efficient decentralized stochastic subgradient-based methods with convergence guarantees in the training of nonsmooth neural networks.

CVFeb 14, 2025
Using MRNet to Predict Lunar Rock Categories Detected by Chang'e 5 Probe

Jin Cui, Yifei Zou, Siyuan Zhang

China's Chang'e 5 mission has been a remarkable success, with the chang'e 5 lander traveling on the Oceanus Procellarum to collect images of the lunar surface. Over the past half century, people have brought back some lunar rock samples, but its quantity does not meet the need for research. Under current circumstances, people still mainly rely on the analysis of rocks on the lunar surface through the detection of lunar rover. The Oceanus Procellarum, chosen by Chang'e 5 mission, contains various kind of rock species. Therefore, we first applied to the National Astronomical Observatories of the China under the Chinese Academy of Sciences for the Navigation and Terrain Camera (NaTeCam) of the lunar surface image, and established a lunar surface rock image data set CE5ROCK. The data set contains 100 images, which randomly divided into training, validation and test set. Experimental results show that the identification accuracy testing on convolutional neural network (CNN) models like AlexNet or MobileNet is about to 40.0%. In order to make full use of the global information in Moon images, this paper proposes the MRNet (MoonRockNet) network architecture. The encoding structure of the network uses VGG16 for feature extraction, and the decoding part adds dilated convolution and commonly used U-Net structure on the original VGG16 decoding structure, which is more conducive to identify more refined but more sparsely distributed types of lunar rocks. We have conducted extensive experiments on the established CE5ROCK data set, and the experimental results show that MRNet can achieve more accurate rock type identification, and outperform other existing mainstream algorithms in the identification performance.

LGDec 19, 2024
Parametric $ρ$-Norm Scaling Calibration

Siyuan Zhang, Linbo Xie

Output uncertainty indicates whether the probabilistic properties reflect objective characteristics of the model output. Unlike most loss functions and metrics in machine learning, uncertainty pertains to individual samples, but validating it on individual samples is unfeasible. When validated collectively, it cannot fully represent individual sample properties, posing a challenge in calibrating model confidence in a limited data set. Hence, it is crucial to consider confidence calibration characteristics. To counter the adverse effects of the gradual amplification of the classifier output amplitude in supervised learning, we introduce a post-processing parametric calibration method, $ρ$-Norm Scaling, which expands the calibrator expression and mitigates overconfidence due to excessive amplitude while preserving accuracy. Moreover, bin-level objective-based calibrator optimization often results in the loss of significant instance-level information. Therefore, we include probability distribution regularization, which incorporates specific priori information that the instance-level uncertainty distribution after calibration should resemble the distribution before calibration. Experimental results demonstrate the substantial enhancement in the post-processing calibrator for uncertainty calibration with our proposed method.

LGOct 26, 2021
Towards Hyperparameter-free Policy Selection for Offline Reinforcement Learning

Siyuan Zhang, Nan Jiang

How to select between policies and value functions produced by different training algorithms in offline reinforcement learning (RL) -- which is crucial for hyperpa-rameter tuning -- is an important open question. Existing approaches based on off-policy evaluation (OPE) often require additional function approximation and hence hyperparameters, creating a chicken-and-egg situation. In this paper, we design hyperparameter-free algorithms for policy selection based on BVFT [XJ21], a recent theoretical advance in value-function selection, and demonstrate their effectiveness in discrete-action benchmarks such as Atari. To address performance degradation due to poor critics in continuous-action domains, we further combine BVFT with OPE to get the best of both worlds, and obtain a hyperparameter-tuning method for Q-function based OPE with theoretical guarantees as a side product.

ASJul 27, 2021
Microphone Array Generalization for Multichannel Narrowband Deep Speech Enhancement

Siyuan Zhang, Xiaofei Li

This paper addresses the problem of microphone array generalization for deep-learning-based end-to-end multichannel speech enhancement. We aim to train a unique deep neural network (DNN) potentially performing well on unseen microphone arrays. The microphone array geometry shapes the network's parameters when training on a fixed microphone array, and thus restricts the generalization of the trained network to another microphone array. To resolve this problem, a single network is trained using data recorded by various microphone arrays of different geometries. We design three variants of our recently proposed narrowband network to cope with the agnostic number of microphones. Overall, the goal is to make the network learn the universal information for speech enhancement that is available for any array geometry, rather than learn the one-array-dedicated characteristics. The experiments on both simulated and real room impulse responses (RIR) demonstrate the excellent across-array generalization capability of the proposed networks, in the sense that their performance measures are very close to, or even exceed the network trained with test arrays. Moreover, they notably outperform various beamforming methods and other advanced deep-learning-based methods.

CVAug 5, 2020
A feature-supervised generative adversarial network for environmental monitoring during hazy days

Ke Wang, Siyuan Zhang, Junlan Chen et al.

The adverse haze weather condition has brought considerable difficulties in vision-based environmental applications. While, until now, most of the existing environmental monitoring studies are under ordinary conditions, and the studies of complex haze weather conditions have been ignored. Thence, this paper proposes a feature-supervised learning network based on generative adversarial networks (GAN) for environmental monitoring during hazy days. Its main idea is to train the model under the supervision of feature maps from the ground truth. Four key technical contributions are made in the paper. First, pairs of hazy and clean images are used as inputs to supervise the encoding process and obtain high-quality feature maps. Second, the basic GAN formulation is modified by introducing perception loss, style loss, and feature regularization loss to generate better results. Third, multi-scale images are applied as the input to enhance the performance of discriminator. Finally, a hazy remote sensing dataset is created for testing our dehazing method and environmental detection. Extensive experimental results show that the proposed method has achieved better performance than current state-of-the-art methods on both synthetic datasets and real-world remote sensing images.

CLFeb 12, 2020
Joint Embedding in Named Entity Linking on Sentence Level

Wei Shi, Siyuan Zhang, Zhiwei Zhang et al.

Named entity linking is to map an ambiguous mention in documents to an entity in a knowledge base. The named entity linking is challenging, given the fact that there are multiple candidate entities for a mention in a document. It is difficult to link a mention when it appears multiple times in a document, since there are conflicts by the contexts around the appearances of the mention. In addition, it is difficult since the given training dataset is small due to the reason that it is done manually to link a mention to its mapping entity. In the literature, there are many reported studies among which the recent embedding methods learn vectors of entities from the training dataset at document level. To address these issues, we focus on how to link entity for mentions at a sentence level, which reduces the noises introduced by different appearances of the same mention in a document at the expense of insufficient information to be used. We propose a new unified embedding method by maximizing the relationships learned from knowledge graphs. We confirm the effectiveness of our method in our experimental studies.

CVDec 26, 2018
A Poisson-Gaussian Denoising Dataset with Real Fluorescence Microscopy Images

Yide Zhang, Yinhao Zhu, Evan Nichols et al.

Fluorescence microscopy has enabled a dramatic development in modern biology. Due to its inherently weak signal, fluorescence microscopy is not only much noisier than photography, but also presented with Poisson-Gaussian noise where Poisson noise, or shot noise, is the dominating noise source. To get clean fluorescence microscopy images, it is highly desirable to have effective denoising algorithms and datasets that are specifically designed to denoise fluorescence microscopy images. While such algorithms exist, no such datasets are available. In this paper, we fill this gap by constructing a dataset - the Fluorescence Microscopy Denoising (FMD) dataset - that is dedicated to Poisson-Gaussian denoising. The dataset consists of 12,000 real fluorescence microscopy images obtained with commercial confocal, two-photon, and wide-field microscopes and representative biological samples such as cells, zebrafish, and mouse brain tissues. We use image averaging to effectively obtain ground truth images and 60,000 noisy images with different noise levels. We use this dataset to benchmark 10 representative denoising algorithms and find that deep learning methods have the best performance. To our knowledge, this is the first real microscopy image dataset for Poisson-Gaussian denoising purposes and it could be an important tool for high-quality, real-time denoising applications in biomedical research.

CVJun 28, 2018
Deep Learning Based Instance Segmentation in 3D Biomedical Images Using Weak Annotation

Zhuo Zhao, Lin Yang, Hao Zheng et al.

Instance segmentation in 3D images is a fundamental task in biomedical image analysis. While deep learning models often work well for 2D instance segmentation, 3D instance segmentation still faces critical challenges, such as insufficient training data due to various annotation difficulties in 3D biomedical images. Common 3D annotation methods (e.g., full voxel annotation) incur high workloads and costs for labeling enough instances for training deep learning 3D instance segmentation models. In this paper, we propose a new weak annotation approach for training a fast deep learning 3D instance segmentation model without using full voxel mask annotation. Our approach needs only 3D bounding boxes for all instances and full voxel annotation for a small fraction of the instances, and uses a novel two-stage 3D instance segmentation model utilizing these two kinds of annotation, respectively. We evaluate our approach on several biomedical image datasets, and the experimental results show that (1) with full annotated boxes and a small amount of masks, our approach can achieve similar performance as the best known methods using full annotation, and (2) with similar annotation time, our approach outperforms the best known methods that use full annotation.

CVJun 15, 2017
Suggestive Annotation: A Deep Active Learning Framework for Biomedical Image Segmentation

Lin Yang, Yizhe Zhang, Jianxu Chen et al.

Image segmentation is a fundamental problem in biomedical image analysis. Recent advances in deep learning have achieved promising results on many biomedical image segmentation benchmarks. However, due to large variations in biomedical images (different modalities, image settings, objects, noise, etc), to utilize deep learning on a new application, it usually needs a new set of training data. This can incur a great deal of annotation effort and cost, because only biomedical experts can annotate effectively, and often there are too many instances in images (e.g., cells) to annotate. In this paper, we aim to address the following question: With limited effort (e.g., time) for annotation, what instances should be annotated in order to attain the best performance? We present a deep active learning framework that combines fully convolutional network (FCN) and active learning to significantly reduce annotation effort by making judicious suggestions on the most effective annotation areas. We utilize uncertainty and similarity information provided by FCN and formulate a generalized version of the maximum set cover problem to determine the most representative and uncertain areas for annotation. Extensive experiments using the 2015 MICCAI Gland Challenge dataset and a lymph node ultrasound image segmentation dataset show that, using annotation suggestions by our method, state-of-the-art segmentation performance can be achieved by using only 50% of training data.