IVJul 12, 2023
Local Conditional Neural Fields for Versatile and Generalizable Large-Scale Reconstructions in Computational ImagingHao Wang, Jiabei Zhu, Yunzhe Li et al.
Deep learning has transformed computational imaging, but traditional pixel-based representations limit their ability to capture continuous, multiscale details of objects. Here we introduce a novel Local Conditional Neural Fields (LCNF) framework, leveraging a continuous implicit neural representation to address this limitation. LCNF enables flexible object representation and facilitates the reconstruction of multiscale information. We demonstrate the capabilities of LCNF in solving the highly ill-posed inverse problem in Fourier ptychographic microscopy (FPM) with multiplexed measurements, achieving robust, scalable, and generalizable large-scale phase retrieval. Unlike traditional neural fields frameworks, LCNF incorporates a local conditional representation that promotes model generalization, learning multiscale information, and efficient processing of large-scale imaging data. By combining an encoder and a decoder conditioned on a learned latent vector, LCNF achieves versatile continuous-domain super-resolution image reconstruction. We demonstrate accurate reconstruction of wide field-of-view, high-resolution phase images using only a few multiplexed measurements. LCNF robustly captures the continuous object priors and eliminates various phase artifacts, even when it is trained on imperfect datasets. The framework exhibits strong generalization, reconstructing diverse objects even with limited training data. Furthermore, LCNF can be trained on a physics simulator using natural images and successfully applied to experimental measurements on biological samples. Our results highlight the potential of LCNF for solving large-scale inverse problems in computational imaging, with broad applicability in various deep-learning-based techniques.
CLSep 10, 2022
Adversarial Learning-based Stance Classifier for COVID-19-related Health PoliciesFeng Xie, Zhong Zhang, Xuechen Zhao et al.
The ongoing COVID-19 pandemic has caused immeasurable losses for people worldwide. To contain the spread of the virus and further alleviate the crisis, various health policies (e.g., stay-at-home orders) have been issued which spark heated discussions as users turn to share their attitudes on social media. In this paper, we consider a more realistic scenario on stance detection (i.e., cross-target and zero-shot settings) for the pandemic and propose an adversarial learning-based stance classifier to automatically identify the public's attitudes toward COVID-19-related health policies. Specifically, we adopt adversarial learning that allows the model to train on a large amount of labeled data and capture transferable knowledge from source topics, so as to enable generalize to the emerging health policies with sparse labeled data. To further enhance the model's deeper understanding, we incorporate policy descriptions as external knowledge into the model. Meanwhile, a GeoEncoder is designed which encourages the model to capture unobserved background factors specified by each region and then represent them as non-text information. We evaluate the performance of a broad range of baselines on the stance detection task for COVID-19-related health policies, and experimental results show that our proposed method achieves state-of-the-art performance in both cross-target and zero-shot settings.
CLOct 7, 2022
Zero-shot stance detection based on cross-domain feature enhancement by contrastive learningXuechen Zhao, Jiaying Zou, Zhong Zhang et al.
Zero-shot stance detection is challenging because it requires detecting the stance of previously unseen targets in the inference phase. The ability to learn transferable target-invariant features is critical for zero-shot stance detection. In this work, we propose a stance detection approach that can efficiently adapt to unseen targets, the core of which is to capture target-invariant syntactic expression patterns as transferable knowledge. Specifically, we first augment the data by masking the topic words of sentences, and then feed the augmented data to an unsupervised contrastive learning module to capture transferable features. Then, to fit a specific target, we encode the raw texts as target-specific features. Finally, we adopt an attention mechanism, which combines syntactic expression patterns with target-specific features to obtain enhanced features for predicting previously unseen targets. Experiments demonstrate that our model outperforms competitive baselines on four benchmark datasets.
LGFeb 17, 2023
Cross-Domain Label Propagation for Domain Adaptation with Discriminative Graph Self-LearningLei Tian, Yongqiang Tang, Liangchen Hu et al.
Domain adaptation manages to transfer the knowledge of well-labeled source data to unlabeled target data. Many recent efforts focus on improving the prediction accuracy of target pseudo-labels to reduce conditional distribution shift. In this paper, we propose a novel domain adaptation method, which infers target pseudo-labels through cross-domain label propagation, such that the underlying manifold structure of two domain data can be explored. Unlike existing cross-domain label propagation methods that separate domain-invariant feature learning, affinity matrix constructing and target labels inferring into three independent stages, we propose to integrate them into a unified optimization framework. In such way, these three parts can boost each other from an iterative optimization perspective and thus more effective knowledge transfer can be achieved. Furthermore, to construct a high-quality affinity matrix, we propose a discriminative graph self-learning strategy, which can not only adaptively capture the inherent similarity of the data from two domains but also effectively exploit the discriminative information contained in well-labeled source data and pseudo-labeled target data. An efficient iterative optimization algorithm is designed to solve the objective function of our proposal. Notably, the proposed method can be extended to semi-supervised domain adaptation in a simple but effective way and the corresponding optimization problem can be solved with the identical algorithm. Extensive experiments on six standard datasets verify the significant superiority of our proposal in both unsupervised and semi-supervised domain adaptation settings.
LGNov 8, 2023
MixTEA: Semi-supervised Entity Alignment with Mixture TeachingFeng Xie, Xin Song, Xiang Zeng et al.
Semi-supervised entity alignment (EA) is a practical and challenging task because of the lack of adequate labeled mappings as training data. Most works address this problem by generating pseudo mappings for unlabeled entities. However, they either suffer from the erroneous (noisy) pseudo mappings or largely ignore the uncertainty of pseudo mappings. In this paper, we propose a novel semi-supervised EA method, termed as MixTEA, which guides the model learning with an end-to-end mixture teaching of manually labeled mappings and probabilistic pseudo mappings. We firstly train a student model using few labeled mappings as standard. More importantly, in pseudo mapping learning, we propose a bi-directional voting (BDV) strategy that fuses the alignment decisions in different directions to estimate the uncertainty via the joint matching confidence score. Meanwhile, we also design a matching diversity-based rectification (MDR) module to adjust the pseudo mapping learning, thus reducing the negative influence of noisy mappings. Extensive results on benchmark datasets as well as further analyses demonstrate the superiority and the effectiveness of our proposed method.
CVJan 17, 2024Code
UniVG: Towards UNIfied-modal Video GenerationLudan Ruan, Lei Tian, Chuanwei Huang et al.
Diffusion based video generation has received extensive attention and achieved considerable success within both the academic and industrial communities. However, current efforts are mainly concentrated on single-objective or single-task video generation, such as generation driven by text, by image, or by a combination of text and image. This cannot fully meet the needs of real-world application scenarios, as users are likely to input images and text conditions in a flexible manner, either individually or in combination. To address this, we propose a Unified-modal Video Genearation system that is capable of handling multiple video generation tasks across text and image modalities. To this end, we revisit the various video generation tasks within our system from the perspective of generative freedom, and classify them into high-freedom and low-freedom video generation categories. For high-freedom video generation, we employ Multi-condition Cross Attention to generate videos that align with the semantics of the input images or text. For low-freedom video generation, we introduce Biased Gaussian Noise to replace the pure random Gaussian Noise, which helps to better preserve the content of the input conditions. Our method achieves the lowest Fréchet Video Distance (FVD) on the public academic benchmark MSR-VTT, surpasses the current open-source methods in human evaluations, and is on par with the current close-source method Gen2. For more samples, visit https://univg-baidu.github.io.
CLMay 6, 2025Code
Uncertainty-Aware Large Language Models for Explainable Disease DiagnosisShuang Zhou, Jiashuo Wang, Zidu Xu et al.
Explainable disease diagnosis, which leverages patient information (e.g., signs and symptoms) and computational models to generate probable diagnoses and reasonings, offers clear clinical values. However, when clinical notes encompass insufficient evidence for a definite diagnosis, such as the absence of definitive symptoms, diagnostic uncertainty usually arises, increasing the risk of misdiagnosis and adverse outcomes. Although explicitly identifying and explaining diagnostic uncertainties is essential for trustworthy diagnostic systems, it remains under-explored. To fill this gap, we introduce ConfiDx, an uncertainty-aware large language model (LLM) created by fine-tuning open-source LLMs with diagnostic criteria. We formalized the task and assembled richly annotated datasets that capture varying degrees of diagnostic ambiguity. Evaluating ConfiDx on real-world datasets demonstrated that it excelled in identifying diagnostic uncertainties, achieving superior diagnostic performance, and generating trustworthy explanations for diagnoses and uncertainties. To our knowledge, this is the first study to jointly address diagnostic uncertainty recognition and explanation, substantially enhancing the reliability of automatic diagnostic systems.
OPTICSMay 13
DeepFilters: Scattering-Aware Pupil Engineering with Learned Digital Filter Reconstruction for Extended Depth of Field MicroscopyJoseph L. Greene, Suet YIng Chan, Qilin Deng et al.
Extended depth of field microscopy encodes axial information into a single acquisition through engineered point spread functions, but conventional and deep optics approaches are subject to degradation in scattering tissue. We introduce DeepFilters, a scattering-aware deep optics framework that jointly optimizes a parameterized pupil filter and a digital-filter-based reconstruction network through a calibrated differentiable forward model to achieve broad generalization without retraining. Incorporating empirical scattering kernels, physics-guided regularization, and a hybrid genetic-gradient initialization strategy, DeepFilters extends the PSF from 16 micron to >400 micron in clear media and enables signal recovery beyond 120 micron deep in biological tissues, validated across fixed brain slices and sea urchin embryos.
CVSep 23, 2024
StarVid: Enhancing Semantic Alignment in Video Diffusion Models via Spatial and SynTactic Guided Attention RefocusingYuanhang Li, Qi Mao, Lan Chen et al.
Recent advances in text-to-video (T2V) generation with diffusion models have garnered significant attention. However, they typically perform well in scenes with a single object and motion, struggling in compositional scenarios with multiple objects and distinct motions to accurately reflect the semantic content of text prompts. To address these challenges, we propose \textbf{StarVid}, a plug-and-play, training-free method that improves semantic alignment between multiple subjects, their motions, and text prompts in T2V models. StarVid first leverages the spatial reasoning capabilities of large language models (LLMs) for two-stage motion trajectory planning based on text prompts. Such trajectories serve as spatial priors, guiding a spatial-aware loss to refocus cross-attention (CA) maps into distinctive regions. Furthermore, we propose a syntax-guided contrastive constraint to strengthen the correlation between the CA maps of verbs and their corresponding nouns, enhancing motion-subject binding. Both qualitative and quantitative evaluations demonstrate that the proposed framework significantly outperforms baseline methods, delivering videos of higher quality with improved semantic consistency.
CVJan 11, 2024
HiCAST: Highly Customized Arbitrary Style Transfer with Adapter Enhanced Diffusion ModelsHanzhang Wang, Haoran Wang, Jinze Yang et al.
The goal of Arbitrary Style Transfer (AST) is injecting the artistic features of a style reference into a given image/video. Existing methods usually focus on pursuing the balance between style and content, whereas ignoring the significant demand for flexible and customized stylization results and thereby limiting their practical application. To address this critical issue, a novel AST approach namely HiCAST is proposed, which is capable of explicitly customizing the stylization results according to various source of semantic clues. In the specific, our model is constructed based on Latent Diffusion Model (LDM) and elaborately designed to absorb content and style instance as conditions of LDM. It is characterized by introducing of \textit{Style Adapter}, which allows user to flexibly manipulate the output results by aligning multi-level style information and intrinsic knowledge in LDM. Lastly, we further extend our model to perform video AST. A novel learning objective is leveraged for video diffusion model training, which significantly improve cross-frame temporal consistency in the premise of maintaining stylization strength. Qualitative and quantitative comparisons as well as comprehensive user studies demonstrate that our HiCAST outperforms the existing SoTA methods in generating visually plausible stylization results.
CVNov 19, 2025
DCL-SE: Dynamic Curriculum Learning for Spatiotemporal Encoding of Brain ImagingMeihua Zhou, Xinyu Tong, Jiarui Zhao et al.
High-dimensional neuroimaging analyses for clinical diagnosis are often constrained by compromises in spatiotemporal fidelity and by the limited adaptability of large-scale, general-purpose models. To address these challenges, we introduce Dynamic Curriculum Learning for Spatiotemporal Encoding (DCL-SE), an end-to-end framework centered on data-driven spatiotemporal encoding (DaSE). We leverage Approximate Rank Pooling (ARP) to efficiently encode three-dimensional volumetric brain data into information-rich, two-dimensional dynamic representations, and then employ a dynamic curriculum learning strategy, guided by a Dynamic Group Mechanism (DGM), to progressively train the decoder, refining feature extraction from global anatomical structures to fine pathological details. Evaluated across six publicly available datasets, including Alzheimer's disease and brain tumor classification, cerebral artery segmentation, and brain age prediction, DCL-SE consistently outperforms existing methods in accuracy, robustness, and interpretability. These findings underscore the critical importance of compact, task-specific architectures in the era of large-scale pretrained networks.
IVNov 27, 2021
Recovery of Continuous 3D Refractive Index Maps from Discrete Intensity-Only Measurements using Neural FieldsRenhao Liu, Yu Sun, Jiabei Zhu et al.
Intensity diffraction tomography (IDT) refers to a class of optical microscopy techniques for imaging the 3D refractive index (RI) distribution of a sample from a set of 2D intensity-only measurements. The reconstruction of artifact-free RI maps is a fundamental challenge in IDT due to the loss of phase information and the missing cone problem. Neural fields (NF) has recently emerged as a new deep learning (DL) approach for learning continuous representations of physical fields. NF uses a coordinate-based neural network to represent the field by mapping the spatial coordinates to the corresponding physical quantities, in our case the complex-valued refractive index values. We present DeCAF as the first NF-based IDT method that can learn a high-quality continuous representation of a RI volume from its intensity-only and limited-angle measurements. The representation in DeCAF is learned directly from the measurements of the test sample by using the IDT forward model, without any ground-truth RI maps. We qualitatively and quantitatively evaluate DeCAF on the simulated and experimental biological samples. Our results show that DeCAF can generate high-contrast and artifact-free RI maps and lead to up to 2.1 times reduction in MSE over existing methods.
IVMar 29, 2021
Physical model simulator-trained neural network for computational 3D phase imaging of multiple-scattering samplesAlex Matlock, Lei Tian
Recovering 3D phase features of complex, multiple-scattering biological samples traditionally sacrifices computational efficiency and processing time for physical model accuracy and reconstruction quality. This trade-off hinders the rapid analysis of living, dynamic biological samples that are often of greatest interest to biological research. Here, we overcome this bottleneck by combining annular intensity diffraction tomography (aIDT) with an approximant-guided deep learning framework. Using a novel physics model simulator-based learning strategy trained entirely on natural image datasets, we show our network can robustly reconstruct complex 3D biological samples of arbitrary size and structure. This approach highlights that large-scale multiple-scattering models can be leveraged in place of acquiring experimental datasets for achieving highly generalizable deep learning models. We devise a new model-based data normalization pre-processing procedure for homogenizing the sample contrast and achieving uniform prediction quality regardless of scattering strength. To achieve highly efficient training and prediction, we implement a lightweight 2D network structure that utilizes a multi-channel input for encoding the axial information. We demonstrate this framework's capabilities on experimental measurements of epithelial buccal cells and Caenorhabditis elegans worms. We highlight the robustness of this approach by evaluating dynamic samples on a living worm video, and we emphasize our approach's generalizability by recovering algae samples evaluated with different experimental setups. To assess the prediction quality, we develop a novel quantitative evaluation metric and show that our predictions are consistent with our experimental measurements and multiple-scattering physics.
CVMar 29, 2021
An Adversarial Human Pose Estimation Network Injected with Graph StructureLei Tian, Guoqiang Liang, Peng Wang et al.
Because of the invisible human keypoints in images caused by illumination, occlusion and overlap, it is likely to produce unreasonable human pose prediction for most of the current human pose estimation methods. In this paper, we design a novel generative adversarial network (GAN) to improve the localization accuracy of visible joints when some joints are invisible. The network consists of two simple but efficient modules, Cascade Feature Network (CFN) and Graph Structure Network (GSN). First, the CFN utilizes the prediction maps from the previous stages to guide the prediction maps in the next stage to produce accurate human pose. Second, the GSN is designed to contribute to the localization of invisible joints by passing message among different joints. According to GAN, if the prediction pose produced by the generator G cannot be distinguished by the discriminator D, the generator network G has successfully obtained the underlying dependence of human joints. We conduct experiments on three widely used human pose estimation benchmark datasets, LSP, MPII and COCO, whose results show the effectiveness of our proposed framework.
CVMar 20, 2020
Domain Adaptation by Class Centroid Matching and Local Manifold Self-LearningLei Tian, Yongqiang Tang, Liangchen Hu et al.
Domain adaptation has been a fundamental technology for transferring knowledge from a source domain to a target domain. The key issue of domain adaptation is how to reduce the distribution discrepancy between two domains in a proper way such that they can be treated indifferently for learning. In this paper, we propose a novel domain adaptation approach, which can thoroughly explore the data distribution structure of target domain.Specifically, we regard the samples within the same cluster in target domain as a whole rather than individuals and assigns pseudo-labels to the target cluster by class centroid matching. Besides, to exploit the manifold structure information of target data more thoroughly, we further introduce a local manifold self-learning strategy into our proposal to adaptively capture the inherent local connectivity of target samples. An efficient iterative optimization algorithm is designed to solve the objective function of our proposal with theoretical convergence guarantee. In addition to unsupervised domain adaptation, we further extend our method to the semi-supervised scenario including both homogeneous and heterogeneous settings in a direct but elegant way. Extensive experiments on seven benchmark datasets validate the significant superiority of our proposal in both unsupervised and semi-supervised manners.
CVOct 31, 2018
Regularized Fourier Ptychography using an Online Plug-and-Play AlgorithmYu Sun, Shiqi Xu, Yunzhe Li et al.
The plug-and-play priors (PnP) framework has been recently shown to achieve state-of-the-art results in regularized image reconstruction by leveraging a sophisticated denoiser within an iterative algorithm. In this paper, we propose a new online PnP algorithm for Fourier ptychographic microscopy (FPM) based on the fast iterative shrinkage/threshold algorithm (FISTA). Specifically, the proposed algorithm uses only a subset of measurements, which makes it scalable to a large set of measurements. We validate the algorithm by showing that it can lead to significant performance gains on both simulated and experimental data.
CVApr 27, 2018
Deep learning approach to Fourier ptychographic microscopyThanh Nguyen, Yujia Xue, Yunzhe Li et al.
Convolutional neural networks (CNNs) have gained tremendous success in solving complex inverse problems. The aim of this work is to develop a novel CNN framework to reconstruct video sequence of dynamic live cells captured using a computational microscopy technique, Fourier ptychographic microscopy (FPM). The unique feature of the FPM is its capability to reconstruct images with both wide field-of-view (FOV) and high resolution, i.e. a large space-bandwidth-product (SBP), by taking a series of low resolution intensity images. For live cell imaging, a single FPM frame contains thousands of cell samples with different morphological features. Our idea is to fully exploit the statistical information provided by this large spatial ensemble so as to make predictions in a sequential measurement, without using any additional temporal dataset. Specifically, we show that it is possible to reconstruct high-SBP dynamic cell videos by a CNN trained only on the first FPM dataset captured at the beginning of a time-series experiment. Our CNN approach reconstructs a 12800X10800 pixels phase image using only ~25 seconds, a 50X speedup compared to the model-based FPM algorithm. In addition, the CNN further reduces the required number of images in each time frame by ~6X. Overall, this significantly improves the imaging throughput by reducing both the acquisition and computational times. The proposed CNN is based on the conditional generative adversarial network (cGAN) framework. Additionally, we also exploit transfer learning so that our pre-trained CNN can be further optimized to image other cell types. Our technique demonstrates a promising deep learning approach to continuously monitor large live-cell populations over an extended time and gather useful spatial and temporal information with sub-cellular resolution.
CVSep 30, 2017
PCANet-II: When PCANet Meets the Second Order PoolingLei Tian, Xiaopeng Hong, Guoying Zhao et al.
PCANet, as one noticeable shallow network, employs the histogram representation for feature pooling. However, there are three main problems about this kind of pooling method. First, the histogram-based pooling method binarizes the feature maps and leads to inevitable discriminative information loss. Second, it is difficult to effectively combine other visual cues into a compact representation, because the simple concatenation of various visual cues leads to feature representation inefficiency. Third, the dimensionality of histogram-based output grows exponentially with the number of feature maps used. In order to overcome these problems, we propose a novel shallow network model, named as PCANet-II. Compared with the histogram-based output, the second order pooling not only provides more discriminative information by preserving both the magnitude and sign of convolutional responses, but also dramatically reduces the size of output features. Thus we combine the second order statistical pooling method with the shallow network, i.e., PCANet. Moreover, it is easy to combine other discriminative and robust cues by using the second order pooling. So we introduce the binary feature difference encoding scheme into our PCANet-II to further improve robustness. Experiments demonstrate the effectiveness and robustness of our proposed PCANet-II method.
CVOct 27, 2016
Compressive Holographic VideoZihao Wang, Leonidas Spinoulas, Kuan He et al.
Compressed sensing has been discussed separately in spatial and temporal domains. Compressive holography has been introduced as a method that allows 3D tomographic reconstruction at different depths from a single 2D image. Coded exposure is a temporal compressed sensing method for high speed video acquisition. In this work, we combine compressive holography and coded exposure techniques and extend the discussion to 4D reconstruction in space and time from one coded captured image. In our prototype, digital in-line holography was used for imaging macroscopic, fast moving objects. The pixel-wise temporal modulation was implemented by a digital micromirror device. In this paper we demonstrate $10\times$ temporal super resolution with multiple depths recovery from a single image. Two examples are presented for the purpose of recording subtle vibrations and tracking small particles within 5 ms.
CVOct 26, 2016
Structured illumination microscopy with unknown patterns and a statistical priorLi-Hao Yeh, Lei Tian, Laura Waller
Structured illumination microscopy (SIM) improves resolution by down-modulating high-frequency information of an object to fit within the passband of the optical system. Generally, the reconstruction process requires prior knowledge of the illumination patterns, which implies a well-calibrated and aberration-free system. Here, we propose a new \textit{algorithmic self-calibration} strategy for SIM that does not need to know the exact patterns {\it a priori}, but only their covariance. The algorithm, termed PE-SIMS, includes a Pattern-Estimation (PE) step requiring the uniformity of the sum of the illumination patterns and a SIM reconstruction procedure using a Statistical prior (SIMS). Additionally, we perform a pixel reassignment process (SIMS-PR) to enhance the reconstruction quality. We achieve 2$\times$ better resolution than a conventional widefield microscope, while remaining insensitive to aberration-induced pattern distortion and robust against parameter tuning.
OPTICSNov 10, 2015
Experimental robustness of Fourier Ptychography phase retrieval algorithmsLi-Hao Yeh, Jonathan Dong, Jingshan Zhong et al.
Fourier ptychography is a new computational microscopy technique that provides gigapixel-scale intensity and phase images with both wide field-of-view and high resolution. By capturing a stack of low-resolution images under different illumination angles, a nonlinear inverse algorithm can be used to computationally reconstruct the high-resolution complex field. Here, we compare and classify multiple proposed inverse algorithms in terms of experimental robustness. We find that the main sources of error are noise, aberrations and mis-calibration (i.e. model mis-match). Using simulations and experiments, we demonstrate that the choice of cost function plays a critical role, with amplitude-based cost functions performing better than intensity-based ones. The reason for this is that Fourier ptychography datasets consist of images from both brightfield and darkfield illumination, representing a large range of measured intensities. Both noise (e.g. Poisson noise) and model mis-match errors are shown to scale with intensity. Hence, algorithms that use an appropriate cost function will be more tolerant to both noise and model mis-match. Given these insights, we propose a global Newton's method algorithm which is robust and computationally efficient. Finally, we discuss the impact of procedures for algorithmic correction of aberrations and mis-calibration.