CV Jan 16, 2023
Linguistic Query-Guided Mask Generation for Referring Image Segmentation
Zhichao Wei, Xiaohao Chen, Mingqiang Chen et al.
Referring image segmentation aims to segment the image region of interest according to a given language expression, which makes it a typical multi-modal task. Existing methods adopt either a pixel classification-based or a learnable query-based framework for mask generation; both rely on a fixed number of parametric prototypes and are therefore insufficient to handle diverse text-image pairs. In this work, we propose an end-to-end transformer-based framework that performs Linguistic query-Guided mask generation, dubbed LGFormer. It treats the linguistic features as queries to generate a specialized prototype for an arbitrary input image-text pair, thus producing more consistent segmentation results. Moreover, we design several cross-modal interaction modules (e.g., the vision-language bidirectional attention module, VLBA) in both the encoder and decoder to achieve better cross-modal alignment.
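The idea of using language features as the query can be illustrated with a minimal NumPy sketch. It is not the LGFormer architecture: the function name `linguistic_query_mask`, the mean-pooled sentence query, and the single attention step are all assumptions made for illustration. The sketch only shows how a text-derived query can attend over pixel features to form an input-specific prototype, which is then compared with every pixel to give mask logits.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def linguistic_query_mask(text_feats, pixel_feats):
    """Hypothetical helper, not taken from the paper.

    text_feats:  (T, C) word-level language features
    pixel_feats: (H, W, C) visual features
    Returns mask logits of shape (H, W) and the prototype of shape (C,).
    """
    H, W, C = pixel_feats.shape
    flat = pixel_feats.reshape(H * W, C)

    # Assumption: pool the words into one sentence-level query.
    query = text_feats.mean(axis=0, keepdims=True)            # (1, C)

    # Cross-attention: the language query attends over all pixels.
    attn = softmax(query @ flat.T / np.sqrt(C), axis=-1)      # (1, H*W)

    # The attended visual feature acts as a prototype specific to this pair.
    prototype = attn @ flat                                   # (1, C)

    # Mask logits: similarity between each pixel and the prototype.
    logits = (flat @ prototype.T).reshape(H, W)
    return logits, prototype[0]
```

Because the prototype is computed from the input pair rather than stored as a learned parameter, two different expressions over the same image yield two different prototypes.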
CE Apr 10
Transfer-learned Kolosov-Muskhelishvili Informed Neural Networks for Fracture Mechanics
Shuwei Zhou, Christian Haeffner, Shuancheng Wang et al.
Physics-informed neural networks have been widely applied to solid mechanics problems. However, balancing the governing partial differential equations and boundary conditions remains challenging, particularly in fracture mechanics, where accurate predictions strongly depend on refined sampling near crack tips. To overcome these limitations, a Kolosov-Muskhelishvili informed neural network with Williams enrichment is developed in this study. Benefiting from the holomorphic representation, the governing equations are satisfied by construction, and only boundary points are required for training. Across a series of benchmark problems, the Kolosov-Muskhelishvili informed neural network shows excellent agreement with analytical and finite element method references, achieving average relative errors below 1% and $R^2$ above 0.99 for both mode I and mode II loadings. Furthermore, three crack propagation criteria (maximum tangential stress, maximum energy release rate, and principle of local symmetry) are integrated into the framework using a transfer learning strategy to predict crack propagation directions. The predicted paths are nearly identical across all criteria, and the transfer learning strategy reduces the required training time by more than 70%. Overall, the developed framework provides a unified, mesh-free, and physically consistent approach for accurate and efficient crack propagation analysis.
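For context, the statement that the governing equations are satisfied by construction follows from the classical Kolosov-Muskhelishvili representation of plane elasticity without body forces. The formulas below are the standard textbook form, not the paper's specific network parameterization; the symbols $\varphi$, $\psi$, $\mu$, and $\kappa$ are the usual notation and are introduced here as assumptions about the setup.

```latex
% z = x + i y; \varphi and \psi are holomorphic potentials;
% \mu is the shear modulus; \kappa = 3 - 4\nu (plane strain)
% or \kappa = (3 - \nu)/(1 + \nu) (plane stress).
\begin{align}
\sigma_{xx} + \sigma_{yy} &= 4\,\operatorname{Re}\,\varphi'(z), \\
\sigma_{yy} - \sigma_{xx} + 2i\,\sigma_{xy} &= 2\left[\bar{z}\,\varphi''(z) + \psi'(z)\right], \\
2\mu\,(u_x + i\,u_y) &= \kappa\,\varphi(z) - z\,\overline{\varphi'(z)} - \overline{\psi(z)}.
\end{align}
```

Any pair of holomorphic $\varphi$ and $\psi$ yields stresses and displacements that satisfy equilibrium and compatibility in the interior, so a network that outputs such potentials only needs to be fitted to the boundary conditions.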
CV Aug 15, 2025 [Code]
Ovis2.5 Technical Report
Shiyin Lu, Yang Li, Yu Xia et al.
We present Ovis2.5, a successor to Ovis2 designed for native-resolution visual perception and strong multimodal reasoning. Ovis2.5 integrates a native-resolution vision transformer that processes images at their native, variable resolutions, avoiding the degradation from fixed-resolution tiling and preserving both fine detail and global layout -- crucial for visually dense content like complex charts. To strengthen reasoning, we train the model to move beyond linear chain-of-thought and perform reflection -- including self-checking and revision. This advanced capability is exposed as an optional "thinking mode" at inference time, allowing users to trade latency for enhanced accuracy on difficult inputs. The model is trained via a comprehensive five-phase curriculum that progressively builds its skills. The process begins with foundational visual and multimodal pretraining, advances through large-scale instruction tuning, and culminates in alignment and reasoning enhancement using DPO and GRPO. To scale these upgrades efficiently, we employ multimodal data packing and hybrid parallelism, yielding a significant end-to-end speedup. We release two open-source models: Ovis2.5-9B and Ovis2.5-2B. The latter continues the "small model, big performance" philosophy of Ovis2, making it ideal for resource-constrained, on-device scenarios. On the OpenCompass multimodal leaderboard, Ovis2.5-9B averages 78.3, marking a substantial improvement over its predecessor, Ovis2-8B, and achieving state-of-the-art results among open-source MLLMs in the sub-40B parameter range; Ovis2.5-2B scores 73.9, establishing SOTA for its size. Beyond aggregate scores, Ovis2.5 achieves leading results on STEM benchmarks, exhibits strong capabilities on grounding and video tasks, and achieves open-source SOTA at its scale for complex chart analysis.
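As a reference point for the alignment stage, the standard DPO objective can be sketched in a few lines. This is the generic published loss, not Ovis2.5's training code; the function name `dpo_loss`, the default `beta=0.1`, and the assumption that inputs are summed sequence log-probabilities are all illustrative.

```python
import numpy as np

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Generic DPO loss over a batch of preference pairs (illustrative sketch).

    Each argument is an array of sequence log-probabilities, one entry per pair.
    """
    # Log-ratio of the policy to the frozen reference model for each response.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp

    # The policy is rewarded for raising the chosen response's ratio
    # relative to the rejected response's ratio.
    logits = beta * (chosen_margin - rejected_margin)

    # -log(sigmoid(x)) = log(1 + exp(-x)), computed stably with logaddexp.
    return np.logaddexp(0.0, -logits).mean()
```

When the policy equals the reference model, every logit is zero and the loss is log 2; it falls below that value once the policy favors the chosen responses more strongly than the reference does.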
CV May 22, 2023 [Code]
UVOSAM: A Mask-free Paradigm for Unsupervised Video Object Segmentation via Segment Anything Model
Zhenghao Zhang, Shengfan Zhang, Zhichao Wei et al.
The current state-of-the-art methods for unsupervised video object segmentation (UVOS) require extensive training on video datasets with mask annotations, limiting their effectiveness in handling challenging scenarios. However, the Segment Anything Model (SAM) introduces a new prompt-driven paradigm for image segmentation, offering new possibilities. In this study, we investigate SAM's potential for UVOS through different prompt strategies. We then propose UVOSAM, a mask-free paradigm for UVOS that utilizes the STD-Net tracker. STD-Net incorporates a spatial-temporal decoupled deformable attention mechanism to establish an effective correlation between intra- and inter-frame features, remarkably enhancing the quality of box prompts in complex video scenes. Extensive experiments on the DAVIS2017-unsupervised and YoutubeVIS19&21 datasets demonstrate the superior performance of UVOSAM without mask supervision compared to existing mask-supervised methods, as well as its ability to generalize to weakly-annotated video datasets. Code can be found at https://github.com/alibaba/UVOSAM.
CE May 4
A Variational Kolosov--Muskhelishvili Network for Elasticity and Fracture
Shuwei Zhou, Christian Häffner, Sophie Stebner et al.
Physics-informed neural networks provide a mesh-free framework for solving partial differential equation-governed problems in solid mechanics. However, most existing formulations in linear elasticity still learn the displacement field directly, which does not explicitly exploit the analytic structure of two-dimensional elasticity and becomes restrictive for fracture problems with crack face discontinuities and crack tip singularities. Moreover, existing Kolosov--Muskhelishvili informed neural network formulations still rely on residual-based loss functions with multiple boundary and interface terms, whereas a variational concept has not yet been established. To address these issues, a variational Kolosov--Muskhelishvili informed neural network framework for two-dimensional linear elastic problems with and without cracks is proposed in this work. The solution is represented by two holomorphic Kolosov--Muskhelishvili potentials and trained through an energy-based loss function derived from the principle of minimum total potential energy. For crack problems, a discontinuous stress potential representation is further introduced to embed the crack face condition and crack tip singularity directly into the solution ansatz. The proposed framework is validated on a series of benchmark problems with and without cracks. The results show that the variational Kolosov--Muskhelishvili informed neural network can accurately predict stress and displacement fields as well as stress intensity factors. Compared with traditional neural network models, it achieves higher accuracy, simpler loss construction, and faster convergence in the considered cases. Overall, the proposed variational Kolosov--Muskhelishvili informed neural network provides an effective and physically consistent variational framework for two-dimensional linear elastic fracture analysis.
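The energy-based loss mentioned above rests on the principle of minimum total potential energy. The generic form below assumes zero body forces and prescribed tractions $\bar{\mathbf{t}}$ on the boundary part $\Gamma_t$; the notation, including the network parameters $\theta$, is introduced here for illustration and may differ from the paper's.

```latex
% Total potential energy of a linear elastic body without body forces.
% u_\theta, \sigma_\theta, \varepsilon_\theta are the fields derived from
% the network potentials \varphi_\theta and \psi_\theta.
\begin{align}
\Pi(\mathbf{u}) &= \frac{1}{2}\int_{\Omega}
    \boldsymbol{\sigma}(\mathbf{u}) : \boldsymbol{\varepsilon}(\mathbf{u})\,\mathrm{d}\Omega
  \;-\; \int_{\Gamma_t} \bar{\mathbf{t}} \cdot \mathbf{u}\,\mathrm{d}\Gamma, \\
\mathcal{L}(\theta) &= \Pi(\mathbf{u}_\theta), \qquad
\theta^{\ast} = \arg\min_{\theta}\, \mathcal{L}(\theta).
\end{align}
```

Among all admissible displacement fields that satisfy the displacement boundary conditions, the true solution minimizes $\Pi$. A single scalar energy therefore replaces the separate residual terms for the governing equations and traction conditions that a residual-based loss would otherwise need.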
CV Mar 22, 2024
MM-Diff: High-Fidelity Image Personalization via Multi-Modal Condition Integration
Zhichao Wei, Qingkun Su, Long Qin et al.
Recent advances in tuning-free personalized image generation based on diffusion models are impressive. However, to improve subject fidelity, existing methods either retrain the diffusion model or infuse it with dense visual embeddings, both of which suffer from poor generalization and efficiency. Also, these methods falter in multi-subject image generation due to the unconstrained cross-attention mechanism. In this paper, we propose MM-Diff, a unified and tuning-free image personalization framework capable of generating high-fidelity images of both single and multiple subjects in seconds. Specifically, to simultaneously enhance text consistency and subject fidelity, MM-Diff employs a vision encoder to transform the input image into CLS and patch embeddings. CLS embeddings are used on the one hand to augment the text embeddings, and on the other hand together with patch embeddings to derive a small number of detail-rich subject embeddings, both of which are efficiently integrated into the diffusion model through the well-designed multimodal cross-attention mechanism. Additionally, MM-Diff introduces cross-attention map constraints during the training phase, ensuring flexible multi-subject image sampling during inference without any predefined inputs (e.g., layout). Extensive experiments demonstrate the superior performance of MM-Diff over other leading methods.
CV Dec 4, 2024
PEMF-VTO: Point-Enhanced Video Virtual Try-on via Mask-free Paradigm
Tianyu Chang, Xiaohao Chen, Zhichao Wei et al.
Video Virtual Try-on aims to seamlessly transfer a reference garment onto a target person in a video while preserving both visual fidelity and temporal coherence. Existing methods typically rely on inpainting masks to define the try-on area, enabling accurate garment transfer for simple scenes (e.g., in-shop videos). However, these mask-based approaches struggle with complex real-world scenarios, as overly large and inconsistent masks often destroy spatial-temporal information, leading to distorted results. Mask-free methods alleviate this issue but face challenges in accurately determining the try-on area, especially for videos with dynamic body movements. To address these limitations, we propose PEMF-VTO, a novel Point-Enhanced Mask-Free Video Virtual Try-On framework that leverages sparse point alignments to explicitly guide garment transfer. Our key innovation is the introduction of point-enhanced guidance, which provides flexible and reliable control over both spatial-level garment transfer and temporal-level video coherence. Specifically, we design a Point-Enhanced Transformer (PET) with two core components: Point-Enhanced Spatial Attention (PSA), which uses frame-cloth point alignments to precisely guide garment transfer, and Point-Enhanced Temporal Attention (PTA), which leverages frame-frame point correspondences to enhance temporal coherence and ensure smooth transitions across frames. Extensive experiments demonstrate that our PEMF-VTO outperforms state-of-the-art methods, generating more natural, coherent, and visually appealing try-on videos, particularly for challenging in-the-wild scenarios. The link to our paper's homepage is https://pemf-vto.github.io/.