IVAug 27, 2023Code
Bi-Modality Medical Image Synthesis Using Semi-Supervised Sequential Generative Adversarial NetworksXin Yang, Yi Lin, Zhiwei Wang et al.
In this paper, we propose a bi-modality medical image synthesis approach based on sequential generative adversarial network (GAN) and semi-supervised learning. Our approach consists of two generative modules that synthesize images of the two modalities in a sequential order. A method for measuring the synthesis complexity is proposed to automatically determine the synthesis order in our sequential GAN. Images of the modality with a lower complexity are synthesized first, and the counterparts with a higher complexity are generated later. Our sequential GAN is trained end-to-end in a semi-supervised manner. In supervised training, the joint distribution of bi-modality images are learned from real paired images of the two modalities by explicitly minimizing the reconstruction losses between the real and synthetic images. To avoid overfitting limited training images, in unsupervised training, the marginal distribution of each modality is learned based on unpaired images by minimizing the Wasserstein distance between the distributions of real and fake images. We comprehensively evaluate the proposed model using two synthesis tasks based on three types of evaluate metrics and user studies. Visual and quantitative results demonstrate the superiority of our method to the state-of-the-art methods, and reasonable visual quality and clinical significance. Code is made publicly available at https://github.com/hustlinyi/Multimodal-Medical-Image-Synthesis.
SYNov 26, 2018
Optimization of Battery Energy Storage to Improve Power System Oscillation DampingYongli Zhu, Chengxi Liu, Kai Sun et al.
A placement problem for multiple Battery Energy Storage System (BESS) units is formulated towards power system transient voltage stability enhancement in this paper. The problem is solved by the Cross-Entropy (CE) optimization method. A simulation-based approach is adopted to incorporate higher-order dynamics and nonlinearities of generators and loads. The objective is to maximize the voltage stability index, which is setup based on certain grid-codes. Formulations of the optimization problem are then discussed. Finally, the proposed approach is implemented in MATLAB/DIgSILENT and tested on the New England 39-Bus system. Results indicate that installing BESS units at the optimized location can alleviate transient voltage instability issue compared with the original system with no BESS. The CE placement algorithm is also compared with the classic PSO (Particle Swarm Optimization) method, and its superiority is demonstrated in terms of a faster convergence rate with matched solution qualities.
CVNov 26, 2022Code
Robust One-shot Segmentation of Brain Tissues via Image-aligned Style TransformationJinxin Lv, Xiaoyu Zeng, Sheng Wang et al.
One-shot segmentation of brain tissues is typically a dual-model iterative learning: a registration model (reg-model) warps a carefully-labeled atlas onto unlabeled images to initialize their pseudo masks for training a segmentation model (seg-model); the seg-model revises the pseudo masks to enhance the reg-model for a better warping in the next iteration. However, there is a key weakness in such dual-model iteration that the spatial misalignment inevitably caused by the reg-model could misguide the seg-model, which makes it converge on an inferior segmentation performance eventually. In this paper, we propose a novel image-aligned style transformation to reinforce the dual-model iterative learning for robust one-shot segmentation of brain tissues. Specifically, we first utilize the reg-model to warp the atlas onto an unlabeled image, and then employ the Fourier-based amplitude exchange with perturbation to transplant the style of the unlabeled image into the aligned atlas. This allows the subsequent seg-model to learn on the aligned and style-transferred copies of the atlas instead of unlabeled images, which naturally guarantees the correct spatial correspondence of an image-mask training pair, without sacrificing the diversity of intensity patterns carried by the unlabeled images. Furthermore, we introduce a feature-aware content consistency in addition to the image-level similarity to constrain the reg-model for a promising initialization, which avoids the collapse of image-aligned style transformation in the first iteration. Experimental results on two public datasets demonstrate 1) a competitive segmentation performance of our method compared to the fully-supervised method, and 2) a superior performance over other state-of-the-art with an increase of average Dice by up to 4.67%. The source code is available at: https://github.com/JinxLv/One-shot-segmentation-via-IST.
SYMay 10, 2017
Online Calibration of Phasor Measurement Unit Using Density-Based Spatial ClusteringXinan Wang, Di Shi, Zhiwei Wang et al.
Data quality of Phasor Measurement Unit (PMU) is receiving increasing attention as it has been identified as one of the limiting factors that affect many wide-area measurement system (WAMS) based applications. In general, existing PMU calibration methods include offline testing and model based approaches. However, in practice, the effectiveness of both is limited due to the very strong assumptions employed. This paper presents a novel framework for online bias error detection and calibration of PMU measurement using density-based spatial clustering of applications with noise (DBSCAN) based on much relaxed assumptions. With a new problem formulation, the proposed data mining based methodology is applicable across a wide spectrum of practical conditions and one side-product of it is more accurate transmission line parameters for EMS database and protective relay settings. Case studies demonstrate the effectiveness of the proposed approach.
MLMar 5, 2023Code
MFAI: A Scalable Bayesian Matrix Factorization Approach to Leveraging Auxiliary InformationZhiwei Wang, Fa Zhang, Cong Zheng et al.
In various practical situations, matrix factorization methods suffer from poor data quality, such as high data sparsity and low signal-to-noise ratio (SNR). Here, we consider a matrix factorization problem by utilizing auxiliary information, which is massively available in real-world applications, to overcome the challenges caused by poor data quality. Unlike existing methods that mainly rely on simple linear models to combine auxiliary information with the main data matrix, we propose to integrate gradient boosted trees in the probabilistic matrix factorization framework to effectively leverage auxiliary information (MFAI). Thus, MFAI naturally inherits several salient features of gradient boosted trees, such as the capability of flexibly modeling nonlinear relationships and robustness to irrelevant features and missing values in auxiliary information. The parameters in MFAI can be automatically determined under the empirical Bayes framework, making it adaptive to the utilization of auxiliary information and immune to overfitting. Moreover, MFAI is computationally efficient and scalable to large datasets by exploiting variational inference. We demonstrate the advantages of MFAI through comprehensive numerical results from simulation studies and real data analyses. Our approach is implemented in the R package mfair available at https://github.com/YangLabHKUST/mfair.
46.5CVMar 12Code
FL-MedSegBench: A Comprehensive Benchmark for Federated Learning on Medical Image SegmentationMeilu Zhu, Zhiwei Wang, Axiu Mao et al.
Federated learning (FL) offers a privacy-preserving paradigm for collaborative medical image analysis without sharing raw data. However, the absence of standardized benchmarks for medical image segmentation hinders fair and comprehensive evaluation of FL methods. To address this gap, we introduce FL-MedSegBench, the first comprehensive benchmark for federated learning on medical image segmentation. Our benchmark encompasses nine segmentation tasks across ten imaging modalities, covering both 2D and 3D formats with realistic clinical heterogeneity. We systematically evaluate eight generic FL (gFL) and five personalized FL (pFL) methods across multiple dimensions: segmentation accuracy, fairness, communication efficiency, convergence behavior, and generalization to unseen domains. Extensive experiments reveal several key insights: (i) pFL methods, particularly those with client-specific batch normalization (\textit{e.g.}, FedBN), consistently outperform generic approaches; (ii) No single method universally dominates, with performance being dataset-dependent; (iii) Communication frequency analysis shows normalization-based personalization methods exhibit remarkable robustness to reduced communication frequency; (iv) Fairness evaluation identifies methods like Ditto and FedRDN that protect underperforming clients; (v) A method's generalization to unseen domains is strongly tied to its ability to perform well across participating clients. We will release an open-source toolkit to foster reproducible research and accelerate clinically applicable FL solutions, providing empirically grounded guidelines for real-world clinical deployment. The source code is available at https://github.com/meiluzhu/FL-MedSegBench.
SYMar 22, 2019
Bayesian Estimation Based Parameter Estimation for Composite LoadChang Fu, Zhe Yu, Di Shi et al.
Accurate identification of parameters of load models is essential in power system computations, including simulation, prediction, and stability and reliability analysis. Conventional point estimation based composite load modeling approaches suffer from disturbances and noises and provide limited information of the system dynamics. In this work, a statistic (Bayesian Estimation) based distribution estimation approach is proposed for both static (ZIP) and dynamic (Induction Motor) load modeling. When dealing with multiple parameters, Gibbs sampling method is employed. In each iteration, the proposal samples each parameter while keeps others fixed. The proposed method provides a distribution estimation of load models coefficients and is robust to measurement errors.
CVAug 16, 2024Code
Decoupling Feature Representations of Ego and Other Modalities for Incomplete Multi-modal Brain Tumor SegmentationKaixiang Yang, Wenqi Shan, Xudong Li et al.
Multi-modal brain tumor segmentation typically involves four magnetic resonance imaging (MRI) modalities, while incomplete modalities significantly degrade performance. Existing solutions employ explicit or implicit modality adaptation, aligning features across modalities or learning a fused feature robust to modality incompleteness. They share a common goal of encouraging each modality to express both itself and the others. However, the two expression abilities are entangled as a whole in a seamless feature space, resulting in prohibitive learning burdens. In this paper, we propose DeMoSeg to enhance the modality adaptation by Decoupling the task of representing the ego and other Modalities for robust incomplete multi-modal Segmentation. The decoupling is super lightweight by simply using two convolutions to map each modality onto four feature sub-spaces. The first sub-space expresses itself (Self-feature), while the remaining sub-spaces substitute for other modalities (Mutual-features). The Self- and Mutual-features interactively guide each other through a carefully-designed Channel-wised Sparse Self-Attention (CSSA). After that, a Radiologist-mimic Cross-modality expression Relationships (RCR) is introduced to have available modalities provide Self-feature and also `lend' their Mutual-features to compensate for the absent ones by exploiting the clinical prior knowledge. The benchmark results on BraTS2020, BraTS2018 and BraTS2015 verify the DeMoSeg's superiority thanks to the alleviated modality adaptation difficulty. Concretely, for BraTS2020, DeMoSeg increases Dice by at least 0.92%, 2.95% and 4.95% on whole tumor, tumor core and enhanced tumor regions, respectively, compared to other state-of-the-arts. Codes are at https://github.com/kk42yy/DeMoSeg
34.2CVMar 15Code
Joint Segmentation and Grading with Iterative Optimization for Multimodal Glaucoma DiagnosisZhiwei Wang, Yuxing Li, Meilu Zhu et al.
Accurate diagnosis of glaucoma is challenging, as early-stage changes are subtle and often lack clear structural or appearance cues. Most existing approaches rely on a single modality, such as fundus or optical coherence tomography (OCT), capturing only partial pathological information and often missing early disease progression. In this paper, we propose an iterative multimodal optimization model (IMO) for joint segmentation and grading. IMO integrates fundus and OCT features through a mid-level fusion strategy, enhanced by a cross-modal feature alignment (CMFA) module to reduce modality discrepancies. An iterative refinement decoder progressively optimizes the multimodal features through a denoising diffusion mechanism, enabling fine-grained segmentation of the optic disc and cup while supporting accurate glaucoma grading. Extensive experiments show that our method effectively integrates multimodal features, providing a comprehensive and clinically significant approach to glaucoma assessment. Source codes are available at https://github.com/warren-wzw/IMO.git.
SYMar 8, 2018
Exploration of Graph Computing in Power System State EstimationChen Yuan, Yuqi Zhou, Guofang Zhang et al.
With the increased complexity of power systems due to the integration of smart grid technologies and renewable energy resources, more frequent changes have been introduced to system status, and the traditional serial mode of state estimation algorithm cannot well meet the restrict time-constrained requirement for the future dynamic power grid, even with advanced computer hardware. To guarantee the grid reliability and minimize the impacts caused by system status fluctuations, a fast, even SCADA-rate, state estimator is urgently needed. In this paper, a graph based power system modeling is firstly explored and a graph computing based state estimation is proposed to speed up its performance. The power system is represented by a graph, which is a collection of vertices and edges, and the measurements are attributes of vertices and edges. Each vertex can independently implement local computation, like formulations of the node-based H matrix, gain matrix and righthand-side (RHS) vector, only with the information on its connected edges and neighboring vertices. Then, by taking advantages of graph database, these node-based data are conveniently collected and stored in the compressed sparse row (CSR) format avoiding the complexity and heaviness introduced by the sparse matrices. With communications and synchronization, centralized computation of solving the weighted least square (WLS) state estimation is completed with hierarchical parallel computing. The proposed strategy is implemented on a graph database platform. The testing results of IEEE 14-bus, IEEE 118-bus systems and a provincial system in China verify the accuracy and high-performance of the proposed methodology.
OCMar 13, 2018
Sensitivity Based Thevenin Index for Voltage Stability Assessment Considering N-1 ContingencyXiaohu Zhang, Di Shi, Xiao Lu et al.
This paper proposes an approach to address the voltage stability assessment (VSA) considering N-1 contingency. The approach leverages the sensitivity based Thevenin index (STI) which involves evaluating the Jacobian matrix at current operating condition. Since the N-1 contingency case is hypothetical, there is no information regarding the operating condition after a foreseen contingency. The proposed approach first estimates the post-contingency operating point as well as possible PV-PQ transitions based on the current operating point. Then the STI for each contingency can be predicted using the estimated operating condition. Numerical results based on IEEE 14-bus system demonstrate the accuracy of the proposed approach in predicting the voltage stability margin under contingency. Moreover, the on-line implementation of the proposed approach is promising since it only involves solving several linear equations.
CVJun 19, 2023
Dual-view Correlation Hybrid Attention Network for Robust Holistic Mammogram ClassificationZhiwei Wang, Junlin Xian, Kangyi Liu et al.
Mammogram image is important for breast cancer screening, and typically obtained in a dual-view form, i.e., cranio-caudal (CC) and mediolateral oblique (MLO), to provide complementary information. However, previous methods mostly learn features from the two views independently, which violates the clinical knowledge and ignores the importance of dual-view correlation. In this paper, we propose a dual-view correlation hybrid attention network (DCHA-Net) for robust holistic mammogram classification. Specifically, DCHA-Net is carefully designed to extract and reinvent deep features for the two views, and meanwhile to maximize the underlying correlations between them. A hybrid attention module, consisting of local relation and non-local attention blocks, is proposed to alleviate the spatial misalignment of the paired views in the correlation maximization. A dual-view correlation loss is introduced to maximize the feature similarity between corresponding strip-like regions with equal distance to the chest wall, motivated by the fact that their features represent the same breast tissues, and thus should be highly-correlated. Experimental results on two public datasets, i.e., INbreast and CBIS-DDSM, demonstrate that DCHA-Net can well preserve and maximize feature correlations across views, and thus outperforms the state-of-the-arts for classifying a whole mammogram as malignant or not.
CVNov 15, 2025Code
FIA-Edit: Frequency-Interactive Attention for Efficient and High-Fidelity Inversion-Free Text-Guided Image EditingKaixiang Yang, Boyang Shen, Xin Li et al.
Text-guided image editing has advanced rapidly with the rise of diffusion models. While flow-based inversion-free methods offer high efficiency by avoiding latent inversion, they often fail to effectively integrate source information, leading to poor background preservation, spatial inconsistencies, and over-editing due to the lack of effective integration of source information. In this paper, we present FIA-Edit, a novel inversion-free framework that achieves high-fidelity and semantically precise edits through a Frequency-Interactive Attention. Specifically, we design two key components: (1) a Frequency Representation Interaction (FRI) module that enhances cross-domain alignment by exchanging frequency components between source and target features within self-attention, and (2) a Feature Injection (FIJ) module that explicitly incorporates source-side queries, keys, values, and text embeddings into the target branch's cross-attention to preserve structure and semantics. Comprehensive and extensive experiments demonstrate that FIA-Edit supports high-fidelity editing at low computational cost (~6s per 512 * 512 image on an RTX 4090) and consistently outperforms existing methods across diverse tasks in visual quality, background fidelity, and controllability. Furthermore, we are the first to extend text-guided image editing to clinical applications. By synthesizing anatomically coherent hemorrhage variations in surgical images, FIA-Edit opens new opportunities for medical data augmentation and delivers significant gains in downstream bleeding classification. Our project is available at: https://github.com/kk42yy/FIA-Edit.
SPOct 25, 2018
Graph Based Power Flow Calculation for Energy Management SystemJunjie Shi, Guangyi Liu, Renchang Dai et al.
Power flow calculation in EMS is required to accommodate a large and complex power system. To achieve a faster than real-time calculation, a graph based power flow calculation is proposed in this paper. Graph database and graph computing advantages in power system calculations are presented. A linear solver for power flow application is formulated and decomposed in nodal parallelism and hierarchical parallelism to fully utilize graph parallel computing capability. Comparison of the algorithm with traditional sequential programs shows significant benefits on computation efficiency. Case studies on practical large-scale systems provide supporting evidence that the new algorithm is promising for online computing for EMS.
CVSep 10, 2024Code
DACAT: Dual-stream Adaptive Clip-aware Time Modeling for Robust Online Surgical Phase RecognitionKaixiang Yang, Qiang Li, Zhiwei Wang
Surgical phase recognition has become a crucial requirement in laparoscopic surgery, enabling various clinical applications like surgical risk forecasting. Current methods typically identify the surgical phase using individual frame-wise embeddings as the fundamental unit for time modeling. However, this approach is overly sensitive to current observations, often resulting in discontinuous and erroneous predictions within a complete surgical phase. In this paper, we propose DACAT, a novel dual-stream model that adaptively learns clip-aware context information to enhance the temporal relationship. In one stream, DACAT pretrains a frame encoder, caching all historical frame-wise features. In the other stream, DACAT fine-tunes a new frame encoder to extract the frame-wise feature at the current moment. Additionally, a max clip-response read-out (Max-R) module is introduced to bridge the two streams by using the current frame-wise feature to adaptively fetch the most relevant past clip from the feature cache. The clip-aware context feature is then encoded via cross-attention between the current frame and its fetched adaptive clip, and further utilized to enhance the time modeling for accurate online surgical phase recognition. The benchmark results on three public datasets, i.e., Cholec80, M2CAI16, and AutoLaparo, demonstrate the superiority of our proposed DACAT over existing state-of-the-art methods, with improvements in Jaccard scores of at least 4.5%, 4.6%, and 2.7%, respectively. Our code and models have been released at https://github.com/kk42yy/DACAT.
CVJun 5, 2022
Accurate Scoliosis Vertebral Landmark Localization on X-ray Images via Shape-constrained Multi-stage Cascaded CNNsZhiwei Wang, Jinxin Lv, Yunqiao Yang et al.
Vertebral landmark localization is a crucial step for variant spine-related clinical applications, which requires detecting the corner points of 17 vertebrae. However, the neighbor landmarks often disturb each other for the homogeneous appearance of vertebrae, which makes vertebral landmark localization extremely difficult. In this paper, we propose multi-stage cascaded convolutional neural networks (CNNs) to split the single task into two sequential steps, i.e., center point localization to roughly locate 17 center points of vertebrae, and corner point localization to find 4 corner points for each vertebra without distracted by others. Landmarks in each step are located gradually from a set of initialized points by regressing offsets via cascaded CNNs. Principal Component Analysis (PCA) is employed to preserve a shape constraint in offset regression to resist the mutual attraction of vertebrae. We evaluate our method on the AASCE dataset that consists of 609 tight spinal anterior-posterior X-ray images and each image contains 17 vertebrae composed of the thoracic and lumbar spine for spinal shape characterization. Experimental results demonstrate our superior performance of vertebral landmark localization over other state-of-the-arts with the relative error decreasing from 3.2e-3 to 7.2e-4.
61.6CVMar 17Code
EPOFusion: Exposure aware Progressive Optimization Method for Infrared and Visible Image FusionZhiwei Wang, Yayu Zheng, Defeng He et al.
Overexposure frequently occurs in practical scenarios, causing the loss of critical visual information. However, existing infrared and visible fusion methods still exhibit unsatisfactory performance in highly bright regions. To address this, we propose EPOFusion, an exposure-aware fusion model. Specifically, a guidance module is introduced to facilitate the encoder in extracting fine-grained infrared features from overexposed regions. Meanwhile, an iterative decoder incorporating a multiscale context fusion module is designed to progressively enhance the fused image, ensuring consistent details and superior visual quality. Finally, an adaptive loss function dynamically constrains the fusion process, enabling an effective balance between the modalities under varying exposure conditions. To achieve better exposure awareness, we construct the first infrared and visible overexposure dataset (IVOE) with high quality infrared guided annotations for overexposed regions. Extensive experiments show that EPOFusion outperforms existing methods. It maintains infrared cues in overexposed regions while achieving visually faithful fusion in non-overexposed areas, thereby enhancing both visual fidelity and downstream task performance. Code, fusion results and IVOE dataset will be made available at https://github.com/warren-wzw/EPOFusion.git.
SPNov 8, 2017
An Extended Kalman Filter Enhanced Hilbert-Huang Transform in Oscillation DetectionZhe Yu, Di Shi, Haifeng Li et al.
Hilbert-Huang transform (HHT) has drawn great attention in power system analysis due to its capability to deal with dynamic signal and provide instantaneous characteristics such as frequency, damping, and amplitudes. However, its shortcomings, including mode mixing and end effects, are as significant as its advantages. A preliminary result of an extended Kalman filter (EKF) method to enhance HHT and hopefully to overcome these disadvantages is presented in this paper. The proposal first removes dynamic DC components in signals using empirical mode decomposition. Then an EKF model is applied to extract instant coefficients. Numerical results using simulated and real-world low-frequency oscillation data suggest the proposal can help to overcome the mode mixing and end effects with a properly chosen number of modes.
CVOct 15, 2022
Bidirectional Semi-supervised Dual-branch CNN for Robust 3D Reconstruction of Stereo Endoscopic Images via Adaptive Cross and Parallel SupervisionsHongkuan Shi, Zhiwei Wang, Ying Zhou et al.
Semi-supervised learning via teacher-student network can train a model effectively on a few labeled samples. It enables a student model to distill knowledge from the teacher's predictions of extra unlabeled data. However, such knowledge flow is typically unidirectional, having the performance vulnerable to the quality of teacher model. In this paper, we seek to robust 3D reconstruction of stereo endoscopic images by proposing a novel fashion of bidirectional learning between two learners, each of which can play both roles of teacher and student concurrently. Specifically, we introduce two self-supervisions, i.e., Adaptive Cross Supervision (ACS) and Adaptive Parallel Supervision (APS), to learn a dual-branch convolutional neural network. The two branches predict two different disparity probability distributions for the same position, and output their expectations as disparity values. The learned knowledge flows across branches along two directions: a cross direction (disparity guides distribution in ACS) and a parallel direction (disparity guides disparity in APS). Moreover, each branch also learns confidences to dynamically refine its provided supervisions. In ACS, the predicted disparity is softened into a unimodal distribution, and the lower the confidence, the smoother the distribution. In APS, the incorrect predictions are suppressed by lowering the weights of those with low confidence. With the adaptive bidirectional learning, the two branches enjoy well-tuned supervisions, and eventually converge on a consistent and more accurate disparity estimation. The extensive and comprehensive experimental results on four public datasets demonstrate our superior performance over other state-of-the-arts with a relative decrease of averaged disparity error by at least 9.76%.
18.3ITMay 22
A Posterior MWPM Decoding Boosts the XYZ Planar CodeZhiwei Wang, Liqi Wang
The minimum-weight perfect matching (MWPM) decoder is a standard decoding strategy for surface codes, but its performance degrades considerably under biased noise. In this paper, a modified surface code, termed the XYZ planar code, is introduced, and the MWPM decoder is extended to posterior MWPM (pMWPM) with almost no increase in decoding complexity. The XYZ planar code exhibits higher and more stable thresholds than the planar code under almost all bias conditions, while also achieving significantly lower logical error rates. Specifically, in the infinite-bias case, the threshold of the XYZ planar code is improved by about \(36\%\) compared to that of the surface code, and it maintains comparable or higher thresholds under other biases -- for example, the threshold reaches approximately \(15.5\%\) at bias \(η= 1\) and \(14.2\%\) at \(η= 100\). Furthermore, pMWPM can be adapted to a wide range of modified surface codes, and the results presented in this work also indicate its excellent potential in other scenarios, such as configurations in which \(Y\) operators involve a larger number of data qubits.
SYFeb 26, 2019
Adaptive Online Learning with Momentum for Contingency-based Voltage Stability AssessmentZhijie Nie, Xiaohu Zhang, Xiaoying Zhao et al.
Voltage stability refers to the ability of a power system to maintain acceptable voltages among all buses under normal operating conditions and after a disturbance. In this paper, a measurement-based voltage stability assessment (VSA) framework using online deep learning is developed. Since the topology changes induced by transmission contingencies may significantly reduce the voltage stability margin, different network topologies under different operating conditions are involved in our training dataset. To achieve high accuracy in the training process, a gradient-based adaptive learning algorithms is adopted. Numerical results based on the NETS-NYPS 68-bus system demonstrate the effectiveness of the proposed VSA approach. Moreover, with the proximal function modified adaptively, the adaptive algorithm with momentum outperforms traditional nonadaptive algorithms whose learning rate is constant.
NANov 8, 2017
Recover the lost Phasor Measurement Unit Data Using Alternating Direction Multipliers MethodMang Liao, Di Shi, Zhe Yu et al.
This paper presents a novel algorithm for recovering missing data of phasor measurement units (PMUs). Due to the low-rank property of PMU data, missing measurement recovery can be formulated as a low-rank matrix-completion problem. Based on maximum-margin matrix factorization, we propose an efficient algorithm based on alternating direction method of multipliers (ADMM) for solving the matrix completion problem. Comparing to existing approaches, the proposed ADMM based algorithm does not need to estimate the rank of the target data matrix and provides better performance in computation complexity. In addition, we consider the case of measurements missing from all PMU channels and provide a strategy of reshaping the matrix which contains the received PMU data for recovery. Numerical results using PMU measurements from IEEE 68-bus power system model illustrate the effectiveness and efficiency of the proposed approaches.
96.8CVApr 30Code
Echo-α: Large Agentic Multimodal Reasoning Model for Ultrasound InterpretationJing Zhang, Wentao Jiang, Tao Huang et al.
Ultrasound interpretation requires both precise lesion localization and holistic clinical reasoning, yet existing methods typically excel at only one of these capabilities: specialized detectors offer strong localization but limited reasoning, whereas multimodal large language models (MLLMs) provide flexible reasoning but weak grounding in specialized medical domains. We present Echo-α, an agentic multimodal reasoning model for ultrasound interpretation that unifies these strengths within an invoke-and-reason framework. Echo-α is trained to coordinate organ-specific detector outputs, integrate them with global visual context, and convert the resulting evidence into grounded diagnostic decisions beyond detector-only inference. This behavior is established through a nine-task supervised curriculum and then refined by sequential reinforcement learning under different reward trade-offs, yielding Echo-α-Grounding for lesion anchoring and Echo-α-Diagnosis for final diagnosis. On multi-center renal and breast ultrasound benchmarks, Echo-α outperforms competitive baselines on both grounding and diagnosis. In particular, on cross-center test sets, Echo-α-Grounding attains 56.73%/43.78% F1@0.5 and Echo- α-Diagnosis reaches 74.90%/49.20% overall accuracy on renal/breast ultrasound. These results suggest that agentic multimodal reasoning can turn specialized detectors into verifiable clinical evidence, offering a practical route toward ultrasound AI systems that are more accurate, interpretable, and transferable. The repository is at https://github.com/MiliLab/Echo-Alpha.
CVNov 11, 2025Code
MAUGIF: Mechanism-Aware Unsupervised General Image Fusion via Dual Cross-Image AutoencodersKunjing Yang, Zhiwei Wang, Minru Bai
Image fusion aims to integrate structural and complementary information from multi-source images. However, existing fusion methods are often either highly task-specific, or general frameworks that apply uniform strategies across diverse tasks, ignoring their distinct fusion mechanisms. To address this issue, we propose a mechanism-aware unsupervised general image fusion (MAUGIF) method based on dual cross-image autoencoders. Initially, we introduce a classification of additive and multiplicative fusion according to the inherent mechanisms of different fusion tasks. Then, dual encoders map source images into a shared latent space, capturing common content while isolating modality-specific details. During the decoding phase, dual decoders act as feature injectors, selectively reintegrating the unique characteristics of each modality into the shared content for reconstruction. The modality-specific features are injected into the source image in the fusion process, generating the fused image that integrates information from both modalities. The architecture of decoders varies according to their fusion mechanisms, enhancing both performance and interpretability. Extensive experiments are conducted on diverse fusion tasks to validate the effectiveness and generalization ability of our method. The code is available at https://anonymous.4open.science/r/MAUGIF.
19.6CVMar 16
Personalized Federated Learning with Residual Fisher Information for Medical Image SegmentationMeilu Zhu, Yuxing Li, Zhiwei Wang et al.
Federated learning enables multiple clients (institutions) to collaboratively train machine learning models without sharing their private data. To address the challenge of data heterogeneity across clients, personalized federated learning (pFL) aims to learn customized models for each client. In this work, we propose pFL-ResFIM, a novel pFL framework that achieves client-adaptive personalization at the parameter level. Specifically, we introduce a new metric, Residual Fisher Information Matrix (ResFIM), to quantify the sensitivity of model parameters to domain discrepancies. To estimate ResFIM for each client model under privacy constraints, we employ a spectral transfer strategy that generates simulated data reflecting the domain styles of different clients. Based on the estimated ResFIM, we partition model parameters into domain-sensitive and domain-invariant components. A personalized model for each client is then constructed by aggregating only the domain-invariant parameters on the server. Extensive experiments on public datasets demonstrate that pFL-ResFIM consistently outperforms state-of-the-art methods, validating its effectiveness.
CVApr 1, 2024Code
MonoBox: Tightness-free Box-supervised Polyp Segmentation using Monotonicity ConstraintQiang Hu, Zhenyu Yi, Ying Zhou et al.
We propose MonoBox, an innovative box-supervised segmentation method constrained by monotonicity to liberate its training from the user-unfriendly box-tightness assumption. In contrast to conventional box-supervised segmentation, where the box edges must precisely touch the target boundaries, MonoBox leverages imprecisely-annotated boxes to achieve robust pixel-wise segmentation. The 'linchpin' is that, within the noisy zones around box edges, MonoBox discards the traditional misguiding multiple-instance learning loss, and instead optimizes a carefully-designed objective, termed monotonicity constraint. Along directions transitioning from the foreground to background, this new constraint steers responses to adhere to a trend of monotonically decreasing values. Consequently, the originally unreliable learning within the noisy zones is transformed into a correct and effective monotonicity optimization. Moreover, an adaptive label correction is introduced, enabling MonoBox to enhance the tightness of box annotations using predicted masks from the previous epoch and dynamically shrink the noisy zones as training progresses. We verify MonoBox in the box-supervised segmentation task of polyps, where satisfying box-tightness is challenging due to the vague boundaries between the polyp and normal tissues. Experiments on both public synthetic and in-house real noisy datasets demonstrate that MonoBox exceeds other anti-noise state-of-the-arts by improving Dice by at least 5.5% and 3.3%, respectively. Codes are at https://github.com/Huster-Hq/MonoBox.
IVFeb 7, 2025Code
Multi-Class Segmentation of Aortic Branches and Zones in Computed Tomography Angiography: The AortaSeg24 ChallengeMuhammad Imran, Jonathan R. Krebs, Vishal Balaji Sivaraman et al.
Multi-class segmentation of the aorta in computed tomography angiography (CTA) scans is essential for diagnosing and planning complex endovascular treatments for patients with aortic dissections. However, existing methods reduce aortic segmentation to a binary problem, limiting their ability to measure diameters across different branches and zones. Furthermore, no open-source dataset is currently available to support the development of multi-class aortic segmentation methods. To address this gap, we organized the AortaSeg24 MICCAI Challenge, introducing the first dataset of 100 CTA volumes annotated for 23 clinically relevant aortic branches and zones. This dataset was designed to facilitate both model development and validation. The challenge attracted 121 teams worldwide, with participants leveraging state-of-the-art frameworks such as nnU-Net and exploring novel techniques, including cascaded models, data augmentation strategies, and custom loss functions. We evaluated the submitted algorithms using the Dice Similarity Coefficient (DSC) and Normalized Surface Distance (NSD), highlighting the approaches adopted by the top five performing teams. This paper presents the challenge design, dataset details, evaluation metrics, and an in-depth analysis of the top-performing algorithms. The annotated dataset, evaluation code, and implementations of the leading methods are publicly available to support further research. All resources can be accessed at https://aortaseg24.grand-challenge.org.
CVOct 11, 2023
IBoxCLA: Towards Robust Box-supervised Segmentation of Polyp via Improved Box-dice and Contrastive Latent-anchorsZhiwei Wang, Qiang Hu, Hongkuan Shi et al.
Box-supervised polyp segmentation attracts increasing attention for its cost-effective potential. Existing solutions often rely on learning-free methods or pretrained models to laboriously generate pseudo masks, triggering Dice constraint subsequently. In this paper, we found that a model guided by the simplest box-filled masks can accurately predict polyp locations/sizes, but suffers from shape collapsing. In response, we propose two innovative learning fashions, Improved Box-dice (IBox) and Contrastive Latent-Anchors (CLA), and combine them to train a robust box-supervised model IBoxCLA. The core idea behind IBoxCLA is to decouple the learning of location/size and shape, allowing for focused constraints on each of them. Specifically, IBox transforms the segmentation map into a proxy map using shape decoupling and confusion-region swapping sequentially. Within the proxy map, shapes are disentangled, while locations/sizes are encoded as box-like responses. By constraining the proxy map instead of the raw prediction, the box-filled mask can well supervise IBoxCLA without misleading its shape learning. Furthermore, CLA contributes to shape learning by generating two types of latent anchors, which are learned and updated using momentum and segmented polyps to steadily represent polyp and background features. The latent anchors facilitate IBoxCLA to capture discriminative features within and outside boxes in a contrastive manner, yielding clearer boundaries. We benchmark IBoxCLA on five public polyp datasets. The experimental results demonstrate the competitive performance of IBoxCLA compared to recent fully-supervised polyp segmentation methods, and its superiority over other box-supervised state-of-the-arts with a relative increase of overall mDice and mIoU by at least 6.5% and 7.5%, respectively.
IVJul 25, 2025Code
Joint Holistic and Lesion Controllable Mammogram Synthesis via Gated Conditional Diffusion ModelXin Li, Kaixiang Yang, Qiang Li et al.
Mammography is the most commonly used imaging modality for breast cancer screening, driving an increasing demand for deep-learning techniques to support large-scale analysis. However, the development of accurate and robust methods is often limited by insufficient data availability and a lack of diversity in lesion characteristics. While generative models offer a promising solution for data synthesis, current approaches often fail to adequately emphasize lesion-specific features and their relationships with surrounding tissues. In this paper, we propose Gated Conditional Diffusion Model (GCDM), a novel framework designed to jointly synthesize holistic mammogram images and localized lesions. GCDM is built upon a latent denoising diffusion framework, where the noised latent image is concatenated with a soft mask embedding that represents breast, lesion, and their transitional regions, ensuring anatomical coherence between them during the denoising process. To further emphasize lesion-specific features, GCDM incorporates a gated conditioning branch that guides the denoising process by dynamically selecting and fusing the most relevant radiomic and geometric properties of lesions, effectively capturing their interplay. Experimental results demonstrate that GCDM achieves precise control over small lesion areas while enhancing the realism and diversity of synthesized mammograms. These advancements position GCDM as a promising tool for clinical applications in mammogram synthesis. Our code is available at https://github.com/lixinHUST/Gated-Conditional-Diffusion-Model/
73.7AIMay 11
LoopVLA: Learning Sufficiency in Recurrent Refinement for Vision-Language-Action ModelsBoyang Shen, Kaixiang Yang, Hao Wang et al.
Current Vision-Language-Action (VLA) models typically treat the deepest representation of a vision-language backbone as universally optimal for action prediction. However, robotic manipulation is composed of many frequent closed-loop spatial adjustments, for which excessive abstraction may waste computation and weaken low-level geometric cues essential for precise control. Existing early-exit strategies attempt to reduce computation by stopping at predefined layers or applying heuristic rules such as action consistency, but they do not directly answer when a representation is actually sufficient for action. In this paper, we present LoopVLA, a recurrent VLA architecture that jointly learns representation refinement, action prediction, and sufficiency estimation. LoopVLA iteratively applies a shared Transformer block to refine multimodal tokens, and at each iteration produces both a candidate action and a sufficiency score that estimates whether further refinement is necessary. By sharing parameters across iterations, LoopVLA decouples refinement from absolute layer indices and grounds sufficiency estimation in the evolving representation itself. Since sufficiency has no direct supervision, we introduce a self-supervised distribution alignment objective, where intermediate confidence scores are trained to match the relative action quality across refinement steps, thereby linking sufficiency learning to policy optimization signals. Experiments on LIBERO, LIBERO-Plus, and VLA-Arena show that LoopVLA pushes the efficiency-performance frontier of VLA policies, reducing parameters by 45% and improving inference throughput by up to 1.7 times while matching or outperforming strong baselines in task success.
CVJun 22, 2025Code
Targeted False Positive Synthesis via Detector-guided Adversarial Diffusion Attacker for Robust Polyp DetectionQuan Zhou, Gan Luo, Qiang Hu et al.
Polyp detection is crucial for colorectal cancer screening, yet existing models are limited by the scale and diversity of available data. While generative models show promise for data augmentation, current methods mainly focus on enhancing polyp diversity, often overlooking the critical issue of false positives. In this paper, we address this gap by proposing an adversarial diffusion framework to synthesize high-value false positives. The extensive variability of negative backgrounds presents a significant challenge in false positive synthesis. To overcome this, we introduce two key innovations: First, we design a regional noise matching strategy to construct a negative synthesis space using polyp detection datasets. This strategy trains a negative-centric diffusion model by masking polyp regions, ensuring the model focuses exclusively on learning diverse background patterns. Second, we introduce the Detector-guided Adversarial Diffusion Attacker (DADA) module, which perturbs the negative synthesis process to disrupt a pre-trained detector's decision, guiding the negative-centric diffusion model to generate high-value, detector-confusing false positives instead of low-value, ordinary backgrounds. Our approach is the first to apply adversarial diffusion to lesion detection, establishing a new paradigm for targeted false positive synthesis and paving the way for more reliable clinical applications in colorectal cancer screening. Extensive results on public and in-house datasets verify the superiority of our method over the current state-of-the-arts, with our synthesized data improving the detectors by at least 2.6% and 2.7% in F1-score, respectively, over the baselines. Codes are at https://github.com/Huster-Hq/DADA.
CVMay 25, 2025Code
Holistic White-light Polyp Classification via Alignment-free Dense Distillation of Auxiliary Optical ChromoendoscopyQiang Hu, Qimei Wang, Jia Chen et al.
White Light Imaging (WLI) and Narrow Band Imaging (NBI) are the two main colonoscopic modalities for polyp classification. While NBI, as optical chromoendoscopy, offers valuable vascular details, WLI remains the most common and often the only available modality in resource-limited settings. However, WLI-based methods typically underperform, limiting their clinical applicability. Existing approaches transfer knowledge from NBI to WLI through global feature alignment but often rely on cropped lesion regions, which are susceptible to detection errors and neglect contextual and subtle diagnostic cues. To address this, this paper proposes a novel holistic classification framework that leverages full-image diagnosis without requiring polyp localization. The key innovation lies in the Alignment-free Dense Distillation (ADD) module, which enables fine-grained cross-domain knowledge distillation regardless of misalignment between WLI and NBI images. Without resorting to explicit image alignment, ADD learns pixel-wise cross-domain affinities to establish correspondences between feature maps, guiding the distillation along the most relevant pixel connections. To further enhance distillation reliability, ADD incorporates Class Activation Mapping (CAM) to filter cross-domain affinities, ensuring the distillation path connects only those semantically consistent regions with equal contributions to polyp diagnosis. Extensive results on public and in-house datasets show that our method achieves state-of-the-art performance, relatively outperforming the other approaches by at least 2.5% and 16.2% in AUC, respectively. Code is available at: https://github.com/Huster-Hq/ADD.
CVMar 13, 2025Code
CoStoDet-DDPM: Collaborative Training of Stochastic and Deterministic Models Improves Surgical Workflow Anticipation and RecognitionKaixiang Yang, Xin Li, Qiang Li et al.
Anticipating and recognizing surgical workflows are critical for intelligent surgical assistance systems. However, existing methods rely on deterministic decision-making, struggling to generalize across the large anatomical and procedural variations inherent in real-world surgeries.In this paper, we introduce an innovative framework that incorporates stochastic modeling through a denoising diffusion probabilistic model (DDPM) into conventional deterministic learning for surgical workflow analysis. At the heart of our approach is a collaborative co-training paradigm: the DDPM branch captures procedural uncertainties to enrich feature representations, while the task branch focuses on predicting surgical phases and instrument usage.Theoretically, we demonstrate that this mutual refinement mechanism benefits both branches: the DDPM reduces prediction errors in uncertain scenarios, and the task branch directs the DDPM toward clinically meaningful representations. Notably, the DDPM branch is discarded during inference, enabling real-time predictions without sacrificing accuracy.Experiments on the Cholec80 dataset show that for the anticipation task, our method achieves a 16% reduction in eMAE compared to state-of-the-art approaches, and for phase recognition, it improves the Jaccard score by 1.0%. Additionally, on the AutoLaparo dataset, our method achieves a 1.5% improvement in the Jaccard score for phase recognition, while also exhibiting robust generalization to patient-specific variations. Our code and weight are available at https://github.com/kk42yy/CoStoDet-DDPM.
CVJun 19, 2024Code
SALI: Short-term Alignment and Long-term Interaction Network for Colonoscopy Video Polyp SegmentationQiang Hu, Zhenyu Yi, Ying Zhou et al.
Colonoscopy videos provide richer information in polyp segmentation for rectal cancer diagnosis. However, the endoscope's fast moving and close-up observing make the current methods suffer from large spatial incoherence and continuous low-quality frames, and thus yield limited segmentation accuracy. In this context, we focus on robust video polyp segmentation by enhancing the adjacent feature consistency and rebuilding the reliable polyp representation. To achieve this goal, we in this paper propose SALI network, a hybrid of Short-term Alignment Module (SAM) and Long-term Interaction Module (LIM). The SAM learns spatial-aligned features of adjacent frames via deformable convolution and further harmonizes them to capture more stable short-term polyp representation. In case of low-quality frames, the LIM stores the historical polyp representations as a long-term memory bank, and explores the retrospective relations to interactively rebuild more reliable polyp features for the current segmentation. Combing SAM and LIM, the SALI network of video segmentation shows a great robustness to the spatial variations and low-visual cues. Benchmark on the large-scale SUNSEG verifies the superiority of SALI over the current state-of-the-arts by improving Dice by 2.1%, 2.5%, 4.1% and 1.9%, for the four test sub-sets, respectively. Codes are at https://github.com/Scatteredrain/SALI.
LGJul 12, 2019Code
R-Transformer: Recurrent Neural Network Enhanced TransformerZhiwei Wang, Yao Ma, Zitao Liu et al.
Recurrent Neural Networks have long been the dominating choice for sequence modeling. However, it severely suffers from two issues: impotent in capturing very long-term dependencies and unable to parallelize the sequential computation procedure. Therefore, many non-recurrent sequence models that are built on convolution and attention operations have been proposed recently. Notably, models with multi-head attention such as Transformer have demonstrated extreme effectiveness in capturing long-term dependencies in a variety of sequence modeling tasks. Despite their success, however, these models lack necessary components to model local structures in sequences and heavily rely on position embeddings that have limited effects and require a considerable amount of design efforts. In this paper, we propose the R-Transformer which enjoys the advantages of both RNNs and the multi-head attention mechanism while avoids their respective drawbacks. The proposed model can effectively capture both local structures and global long-term dependencies in sequences without any use of position embeddings. We evaluate R-Transformer through extensive experiments with data from a wide range of domains and the empirical results show that R-Transformer outperforms the state-of-the-art methods by a large margin in most of the tasks. We have made the code publicly available at \url{https://github.com/DSE-MSU/R-transformer}.
CVJan 14
Pairing-free Group-level Knowledge Distillation for Robust Gastrointestinal Lesion Classification in White-Light EndoscopyQiang Hu, Qimei Wang, Yingjie Guo et al.
White-Light Imaging (WLI) is the standard for endoscopic cancer screening, but Narrow-Band Imaging (NBI) offers superior diagnostic details. A key challenge is transferring knowledge from NBI to enhance WLI-only models, yet existing methods are critically hampered by their reliance on paired NBI-WLI images of the same lesion, a costly and often impractical requirement that leaves vast amounts of clinical data untapped. In this paper, we break this paradigm by introducing PaGKD, a novel Pairing-free Group-level Knowledge Distillation framework that that enables effective cross-modal learning using unpaired WLI and NBI data. Instead of forcing alignment between individual, often semantically mismatched image instances, PaGKD operates at the group level to distill more complete and compatible knowledge across modalities. Central to PaGKD are two complementary modules: (1) Group-level Prototype Distillation (GKD-Pro) distills compact group representations by extracting modality-invariant semantic prototypes via shared lesion-aware queries; (2) Group-level Dense Distillation (GKD-Den) performs dense cross-modal alignment by guiding group-aware attention with activation-derived relation maps. Together, these modules enforce global semantic consistency and local structural coherence without requiring image-level correspondence. Extensive experiments on four clinical datasets demonstrate that PaGKD consistently and significantly outperforms state-of-the-art methods, achieving relative AUC improvements of 3.3%, 1.1%, 2.8%, and 3.2%, respectively, establishing a new direction for cross-modal learning from unpaired data.
CVNov 10, 2025
TiS-TSL: Image-Label Supervised Surgical Video Stereo Matching via Time-Switchable Teacher-Student LearningRui Wang, Ying Zhou, Hao Wang et al.
Stereo matching in minimally invasive surgery (MIS) is essential for next-generation navigation and augmented reality. Yet, dense disparity supervision is nearly impossible due to anatomical constraints, typically limiting annotations to only a few image-level labels acquired before the endoscope enters deep body cavities. Teacher-Student Learning (TSL) offers a promising solution by leveraging a teacher trained on sparse labels to generate pseudo labels and associated confidence maps from abundant unlabeled surgical videos. However, existing TSL methods are confined to image-level supervision, providing only spatial confidence and lacking temporal consistency estimation. This absence of spatio-temporal reliability results in unstable disparity predictions and severe flickering artifacts across video frames. To overcome these challenges, we propose TiS-TSL, a novel time-switchable teacher-student learning framework for video stereo matching under minimal supervision. At its core is a unified model that operates in three distinct modes: Image-Prediction (IP), Forward Video-Prediction (FVP), and Backward Video-Prediction (BVP), enabling flexible temporal modeling within a single architecture. Enabled by this unified model, TiS-TSL adopts a two-stage learning strategy. The Image-to-Video (I2V) stage transfers sparse image-level knowledge to initialize temporal modeling. The subsequent Video-to-Video (V2V) stage refines temporal disparity predictions by comparing forward and backward predictions to calculate bidirectional spatio-temporal consistency. This consistency identifies unreliable regions across frames, filters noisy video-level pseudo labels, and enforces temporal coherence. Experimental results on two public datasets demonstrate that TiS-TSL exceeds other image-based state-of-the-arts by improving TEPE and EPE by at least 2.11% and 4.54%, respectively.
IVDec 15, 2023
SegRap2023: A Benchmark of Organs-at-Risk and Gross Tumor Volume Segmentation for Radiotherapy Planning of Nasopharyngeal CarcinomaXiangde Luo, Jia Fu, Yunxin Zhong et al.
Radiation therapy is a primary and effective NasoPharyngeal Carcinoma (NPC) treatment strategy. The precise delineation of Gross Tumor Volumes (GTVs) and Organs-At-Risk (OARs) is crucial in radiation treatment, directly impacting patient prognosis. Previously, the delineation of GTVs and OARs was performed by experienced radiation oncologists. Recently, deep learning has achieved promising results in many medical image segmentation tasks. However, for NPC OARs and GTVs segmentation, few public datasets are available for model development and evaluation. To alleviate this problem, the SegRap2023 challenge was organized in conjunction with MICCAI2023 and presented a large-scale benchmark for OAR and GTV segmentation with 400 Computed Tomography (CT) scans from 200 NPC patients, each with a pair of pre-aligned non-contrast and contrast-enhanced CT scans. The challenge's goal was to segment 45 OARs and 2 GTVs from the paired CT scans. In this paper, we detail the challenge and analyze the solutions of all participants. The average Dice similarity coefficient scores for all submissions ranged from 76.68\% to 86.70\%, and 70.42\% to 73.44\% for OARs and GTVs, respectively. We conclude that the segmentation of large-size OARs is well-addressed, and more efforts are needed for GTVs and small-size or thin-structure OARs. The benchmark will remain publicly available here: https://segrap2023.grand-challenge.org
84.5CVApr 5
NTIRE 2026 3D Restoration and Reconstruction in Real-world Adverse Conditions: RealX3D Challenge ResultsShuhong Liu, Chenyu Bao, Ziteng Cui et al.
This paper presents a comprehensive review of the NTIRE 2026 3D Restoration and Reconstruction (3DRR) Challenge, detailing the proposed methods and results. The challenge seeks to identify robust reconstruction pipelines that are robust under real-world adverse conditions, specifically extreme low-light and smoke-degraded environments, as captured by our RealX3D benchmark. A total of 279 participants registered for the competition, of whom 33 teams submitted valid results. We thoroughly evaluate the submitted approaches against state-of-the-art baselines, revealing significant progress in 3D reconstruction under adverse conditions. Our analysis highlights shared design principles among top-performing methods and provides insights into effective strategies for handling 3D scene degradation.
LGMay 8, 2024
Initialization is Critical to Whether Transformers Fit Composite Functions by Reasoning or MemorizingZhongwang Zhang, Pengxiao Lin, Zhiwei Wang et al.
Transformers have shown impressive capabilities across various tasks, but their performance on compositional problems remains a topic of debate. In this work, we investigate the mechanisms of how transformers behave on unseen compositional tasks. We discover that the parameter initialization scale plays a critical role in determining whether the model learns inferential (reasoning-based) solutions, which capture the underlying compositional primitives, or symmetric (memory-based) solutions, which simply memorize mappings without understanding the compositional structure. By analyzing the information flow and vector representations within the model, we reveal the distinct mechanisms underlying these solution types. We further find that inferential (reasoning-based) solutions exhibit low complexity bias, which we hypothesize is a key factor enabling them to learn individual mappings for single anchors. We validate our conclusions on various real-world datasets. Our findings provide valuable insights into the role of initialization scale in tuning the reasoning and memorizing ability and we propose the initialization rate $γ$ to be a convenient tunable hyper-parameter in common deep learning frameworks, where $1/d_{\mathrm{in}}^γ$ is the standard deviation of parameters of the layer with $d_{\mathrm{in}}$ input neurons.
CLJan 15, 2025
Complexity Control Facilitates Reasoning-Based Compositional Generalization in TransformersZhongwang Zhang, Pengxiao Lin, Zhiwei Wang et al.
Transformers have demonstrated impressive capabilities across various tasks, yet their performance on compositional problems remains a subject of debate. In this study, we investigate the internal mechanisms underlying Transformers' behavior in compositional tasks. We find that complexity control strategies significantly influence whether the model learns primitive-level rules that generalize out-of-distribution (reasoning-based solutions) or relies solely on memorized mappings (memory-based solutions). By applying masking strategies to the model's information circuits and employing multiple complexity metrics, we reveal distinct internal working mechanisms associated with different solution types. Further analysis reveals that reasoning-based solutions exhibit a lower complexity bias, which aligns with the well-studied neuron condensation phenomenon. This lower complexity bias is hypothesized to be the key factor enabling these solutions to learn reasoning rules. We validate these conclusions across multiple real-world datasets, including image generation and natural language processing tasks, confirming the broad applicability of our findings.
LGFeb 10, 2024
LiRank: Industrial Large Scale Ranking Models at LinkedInFedor Borisyuk, Mingzhou Zhou, Qingquan Song et al.
We present LiRank, a large-scale ranking framework at LinkedIn that brings to production state-of-the-art modeling architectures and optimization methods. We unveil several modeling improvements, including Residual DCN, which adds attention and residual connections to the famous DCNv2 architecture. We share insights into combining and tuning SOTA architectures to create a unified model, including Dense Gating, Transformers and Residual DCN. We also propose novel techniques for calibration and describe how we productionalized deep learning based explore/exploit methods. To enable effective, production-grade serving of large ranking models, we detail how to train and compress models using quantization and vocabulary compression. We provide details about the deployment setup for large-scale use cases of Feed ranking, Jobs Recommendations, and Ads click-through rate (CTR) prediction. We summarize our learnings from various A/B tests by elucidating the most effective technical approaches. These ideas have contributed to relative metrics improvements across the board at LinkedIn: +0.5% member sessions in the Feed, +1.76% qualified job applications for Jobs search and recommendations, and +4.3% for Ads CTR. We hope this work can provide practical insights and solutions for practitioners interested in leveraging large-scale deep ranking systems.
CVApr 25, 2024
MonoPCC: Photometric-invariant Cycle Constraint for Monocular Depth Estimation of Endoscopic ImagesZhiwei Wang, Ying Zhou, Shiquan He et al.
Photometric constraint is indispensable for self-supervised monocular depth estimation. It involves warping a source image onto a target view using estimated depth&pose, and then minimizing the difference between the warped and target images. However, the endoscopic built-in light causes significant brightness fluctuations, and thus makes the photometric constraint unreliable. Previous efforts only mitigate this relying on extra models to calibrate image brightness. In this paper, we propose MonoPCC to address the brightness inconsistency radically by reshaping the photometric constraint into a cycle form. Instead of only warping the source image, MonoPCC constructs a closed loop consisting of two opposite forward-backward warping paths: from target to source and then back to target. Thus, the target image finally receives an image cycle-warped from itself, which naturally makes the constraint invariant to brightness changes. Moreover, MonoPCC transplants the source image's phase-frequency into the intermediate warped image to avoid structure lost, and also stabilizes the training via an exponential moving average (EMA) strategy to avoid frequent changes in the forward warping. The comprehensive and extensive experimental results on four endoscopic datasets demonstrate that our proposed MonoPCC shows a great robustness to the brightness inconsistency, and exceeds other state-of-the-arts by reducing the absolute relative error by at least 7.27%, 9.38%, 9.90% and 3.17%, respectively.
AIMay 24, 2024
Understanding the Language Model to Solve the Symbolic Multi-Step Reasoning Problem from the Perspective of Buffer MechanismZhiwei Wang, Yunji Wang, Zhongwang Zhang et al.
Large language models have consistently struggled with complex reasoning tasks, such as mathematical problem-solving. Investigating the internal reasoning mechanisms of these models can help us design better model architectures and training strategies, ultimately enhancing their reasoning capability. In this study, we constructed a symbolic multi-step reasoning task to investigate the information propagation mechanisms in Transformer models when solving the task through direct answering and Chain-of-Thought (CoT) reasoning. We introduced the concept of buffer mechanism: the model stores various information in distinct buffers and selectively extracts it through the query-key matrix. We proposed a random matrix-based algorithm to enhance the model's reasoning ability. This algorithm introduces only 132 trainable parameters, yet leads to significant performance improvements on 7 multi-step reasoning datasets, including PrOntoQA, LogicAsker, and LogicInference. These findings provide new insights into understanding the large language models.
LGMay 29, 2025
Scalable Complexity Control Facilitates Reasoning Ability of LLMsLiangkai Hang, Junjie Yao, Zhiwei Bai et al.
The reasoning ability of large language models (LLMs) has been rapidly advancing in recent years, attracting interest in more fundamental approaches that can reliably enhance their generalizability. This work demonstrates that model complexity control, conveniently implementable by adjusting the initialization rate and weight decay coefficient, improves the scaling law of LLMs consistently over varying model sizes and data sizes. This gain is further illustrated by comparing the benchmark performance of 2.4B models pretrained on 1T tokens with different complexity hyperparameters. Instead of fixing the initialization std, we found that a constant initialization rate (the exponent of std) enables the scaling law to descend faster in both model and data sizes. These results indicate that complexity control is a promising direction for the continual advancement of LLMs.
CVDec 21, 2024
First-frame Supervised Video Polyp Segmentation via Propagative and Semantic Dual-teacher NetworkQiang Hu, Mei Liu, Qiang Li et al.
Automatic video polyp segmentation plays a critical role in gastrointestinal cancer screening, but the cost of frameby-frame annotations is prohibitively high. While sparse-frame supervised methods have reduced this burden proportionately, the cost remains overwhelming for long-duration videos and large-scale datasets. In this paper, we, for the first time, reduce the annotation cost to just a single frame per polyp video, regardless of the video's length. To this end, we introduce a new task, First-Frame Supervised Video Polyp Segmentation (FSVPS), and propose a novel Propagative and Semantic Dual-Teacher Network (PSDNet). Specifically, PSDNet adopts a teacher-student framework but employs two distinct types of teachers: the propagative teacher and the semantic teacher. The propagative teacher is a universal object tracker that propagates the first-frame annotation to subsequent frames as pseudo labels. However, tracking errors may accumulate over time, gradually degrading the pseudo labels and misguiding the student model. To address this, we introduce the semantic teacher, an exponential moving average of the student model, which produces more stable and time-invariant pseudo labels. PSDNet merges the pseudo labels from both teachers using a carefully-designed back-propagation strategy. This strategy assesses the quality of the pseudo labels by tracking them backward to the first frame. High-quality pseudo labels are more likely to spatially align with the firstframe annotation after this backward tracking, ensuring more accurate teacher-to-student knowledge transfer and improved segmentation performance. Benchmarking on SUN-SEG, the largest VPS dataset, demonstrates the competitive performance of PSDNet compared to fully-supervised approaches, and its superiority over sparse-frame supervised state-of-the-arts with a minimum improvement of 4.5% in Dice score.
CLMar 25, 2025
KSHSeek: Data-Driven Approaches to Mitigating and Detecting Knowledge-Shortcut Hallucinations in Generative ModelsZhiwei Wang, Zhongxin Liu, Ying Li et al.
The emergence of large language models (LLMs) has significantly advanced the development of natural language processing (NLP), especially in text generation tasks like question answering. However, model hallucinations remain a major challenge in natural language generation (NLG) tasks due to their complex causes. We systematically expand on the causes of factual hallucinations from the perspective of knowledge shortcuts, analyzing hallucinations arising from correct and defect-free data and demonstrating that knowledge-shortcut hallucinations are prevalent in generative models. To mitigate this issue, we propose a high similarity pruning algorithm at the data preprocessing level to reduce spurious correlations in the data. Additionally, we design a specific detection method for knowledge-shortcut hallucinations to evaluate the effectiveness of our mitigation strategy. Experimental results show that our approach effectively reduces knowledge-shortcut hallucinations, particularly in fine-tuning tasks, without negatively impacting model performance in question answering. This work introduces a new paradigm for mitigating specific hallucination issues in generative models, enhancing their robustness and reliability in real-world applications.
LGOct 25, 2024
Robust Thompson Sampling Algorithms Against Reward Poisoning AttacksYinglun Xu, Zhiwei Wang, Gagandeep Singh
Thompson sampling is one of the most popular learning algorithms for online sequential decision-making problems and has rich real-world applications. However, current Thompson sampling algorithms are limited by the assumption that the rewards received are uncorrupted, which may not be true in real-world applications where adversarial reward poisoning exists. To make Thompson sampling more reliable, we want to make it robust against adversarial reward poisoning. The main challenge is that one can no longer compute the actual posteriors for the true reward, as the agent can only observe the rewards after corruption. In this work, we solve this problem by computing pseudo-posteriors that are less likely to be manipulated by the attack. We propose robust algorithms based on Thompson sampling for the popular stochastic and contextual linear bandit settings in both cases where the agent is aware or unaware of the budget of the attacker. We theoretically show that our algorithms guarantee near-optimal regret under any attack strategy.
LGFeb 21, 2024
Stealthy Adversarial Attacks on Stochastic Multi-Armed BanditsZhiwei Wang, Huazheng Wang, Hongning Wang
Adversarial attacks against stochastic multi-armed bandit (MAB) algorithms have been extensively studied in the literature. In this work, we focus on reward poisoning attacks and find most existing attacks can be easily detected by our proposed detection method based on the test of homogeneity, due to their aggressive nature in reward manipulations. This motivates us to study the notion of stealthy attack against stochastic MABs and investigate the resulting attackability. Our analysis shows that against two popularly employed MAB algorithms, UCB1 and $ε$-greedy, the success of a stealthy attack depends on the environmental conditions and the realized reward of the arm pulled in the first round. We also analyze the situation for general MAB algorithms equipped with our attack detection method and find that it is possible to have a stealthy attack that almost always succeeds. This brings new insights into the security risks of MAB algorithms.
LGFeb 11
CADET: Context-Conditioned Ads CTR Prediction With a Decoder-Only TransformerDavid Pardoe, Neil Daftary, Miro Furtado et al.
Click-through rate (CTR) prediction is fundamental to online advertising systems. While Deep Learning Recommendation Models (DLRMs) with explicit feature interactions have long dominated this domain, recent advances in generative recommenders have shown promising results in content recommendation. However, adapting these transformer-based architectures to ads CTR prediction still presents unique challenges, including handling post-scoring contextual signals, maintaining offline-online consistency, and scaling to industrial workloads. We present CADET (Context-Conditioned Ads Decoder-Only Transformer), an end-to-end decoder-only transformer for ads CTR prediction deployed at LinkedIn. Our approach introduces several key innovations: (1) a context-conditioned decoding architecture with multi-tower prediction heads that explicitly model post-scoring signals such as ad position, resolving the chicken-and-egg problem between predicted CTR and ranking; (2) a self-gated attention mechanism that stabilizes training by adaptively regulating information flow at both representation and interaction levels; (3) a timestamp-based variant of Rotary Position Embedding (RoPE) that captures temporal relationships across timescales from seconds to months; (4) session masking strategies that prevent the model from learning dependencies on unavailable in-session events, addressing train-serve skew; and (5) production engineering techniques including tensor packing, sequence chunking, and custom Flash Attention kernels that enable efficient training and serving at scale. In online A/B testing, CADET achieves a 11.04\% CTR lift compared to the production LiRank baseline model, a hybrid ensemble of DCNv2 and sequential encoders. The system has been successfully deployed on LinkedIn's advertising platform, serving the main traffic for homefeed sponsored updates.