GEO-PH May 10, 2017
Application of Optimal Transport and the Quadratic Wasserstein Metric to Full-Waveform Inversion
Yunan Yang, Björn Engquist, Junzhe Sun et al.
Conventional full-waveform inversion (FWI) using the least-squares norm ($L^2$) as a misfit function is known to suffer from cycle skipping. This increases the risk of computing a local rather than the global minimum of the misfit. In our previous work, we proposed the quadratic Wasserstein metric ($W_2$) as a new misfit function for FWI. The $W_2$ metric has been proved to have many ideal properties with regard to convexity and insensitivity to noise. When the observed and predicted seismic data are regarded as two density functions, the quadratic Wasserstein metric corresponds to the optimal cost of rearranging one density into the other, where the transportation cost is quadratic in distance. The difficulty of transforming seismic signals into nonnegative density functions is discussed. Unlike the $L^2$ norm, $W_2$ measures not only amplitude differences, but also global phase shifts, which helps to avoid cycle-skipping issues. In this work, we build on our earlier method to cover more realistic high-resolution applications by embedding the $W_2$ technique into the framework of the adjoint-state method and applying it to seismically relevant 2D examples: the Camembert, the Marmousi, and the 2004 BP models. We propose a new way of using the $W_2$ metric trace-by-trace in FWI and compare it to global $W_2$ via the solution of the Monge-Ampère equation. With the corresponding adjoint source, the velocity model can be updated using the L-BFGS method. Numerical results show the effectiveness of $W_2$ for alleviating cycle-skipping issues and sensitivity to noise. Both mathematical theory and numerical examples demonstrate that the quadratic Wasserstein metric is a good candidate for a misfit function in seismic inversion.
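To make the trace-by-trace idea concrete, here is a minimal sketch of the squared $W_2$ distance between two one-dimensional nonnegative signals, computed with the monotone (sorted) coupling that is optimal in 1D. It is not the authors' implementation: the function names `w2_squared_1d` and `place_bump`, the grid, and the bump shape are assumptions made for illustration. The example also sidesteps the nonnegativity issue the abstract raises by using a signal that is already nonnegative.

```python
import math


def w2_squared_1d(positions, f, g):
    """Squared W2 between two nonnegative signals sampled on the same sorted grid.

    Both signals are rescaled to unit mass and treated as point masses at
    `positions`. In 1D the optimal plan is monotone, so mass is matched
    left to right with two pointers.
    """
    total_f, total_g = sum(f), sum(g)
    if total_f <= 0 or total_g <= 0:
        raise ValueError("signals must be nonnegative with positive mass")
    f = [v / total_f for v in f]
    g = [v / total_g for v in g]

    n = len(positions)
    i = j = 0
    rem_f, rem_g = f[0], g[0]  # mass still unmatched at the current points
    cost = 0.0
    while i < n and j < n:
        moved = min(rem_f, rem_g)
        cost += moved * (positions[i] - positions[j]) ** 2
        rem_f -= moved
        rem_g -= moved
        if rem_f <= 1e-15:  # source point exhausted, move to the next one
            i += 1
            if i < n:
                rem_f = f[i]
        if rem_g <= 1e-15:  # target point exhausted, move to the next one
            j += 1
            if j < n:
                rem_g = g[j]
    return cost


def place_bump(bump, start, n):
    """Hypothetical helper: a length-n trace that is zero except for `bump`."""
    trace = [0.0] * n
    trace[start:start + len(bump)] = bump
    return trace


dt = 0.01
times = [k * dt for k in range(200)]
bump = [math.exp(-((m - 10) / 4.0) ** 2) for m in range(21)]
observed = place_bump(bump, 50, 200)
predicted = place_bump(bump, 80, 200)  # same pulse, delayed by 30 samples
misfit = w2_squared_1d(times, observed, predicted)
```

For a pure time shift of an otherwise identical pulse, this misfit equals the squared shift, so it grows smoothly with the delay. That is the behavior the abstract contrasts with the $L^2$ norm.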
CV Dec 31, 2025
PhyGDPO: Physics-Aware Groupwise Direct Preference Optimization for Physically Consistent Text-to-Video Generation
Yuanhao Cai, Kunpeng Li, Menglin Jia et al.
Recent advances in text-to-video (T2V) generation have achieved good visual quality, yet synthesizing videos that faithfully follow physical laws remains an open challenge. Existing methods, mainly based on graphics or prompt extension, struggle to generalize beyond simple simulated environments or to learn implicit physical reasoning. The scarcity of training data with rich physics interactions and phenomena is also a problem. In this paper, we first introduce a Physics-Augmented video data construction Pipeline, PhyAugPipe, that leverages a vision-language model (VLM) with chain-of-thought reasoning to collect a large-scale training dataset, PhyVidGen-135K. Then we formulate a principled Physics-aware Groupwise Direct Preference Optimization (PhyGDPO) framework that uses real-world videos as the winning case to guarantee correct physics learning and builds upon the groupwise Plackett-Luce probabilistic model to capture holistic preferences beyond pairwise comparisons. In PhyGDPO, we design a Physics-Guided Rewarding (PGR) scheme that leverages VLM-based physical rewards to direct the optimization to focus on challenging physics cases. In addition, we propose a LoRA-Switch Reference (LoRA-SR) scheme that avoids duplicating the full model as the reference, for efficient DPO training. Experiments show that our method significantly outperforms state-of-the-art open-source methods on PhyGenBench and VideoPhy2. Please check our project page at https://caiyuanhao1998.github.io/project/PhyGDPO for more video results. Our code, models, and data will be released at https://github.com/caiyuanhao1998/Open-PhyGDPO
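As background for the groupwise formulation, the sketch below computes the negative log-likelihood of a ranking under the generic Plackett-Luce model. The abstract does not give the actual PhyGDPO objective, so this shows only the textbook likelihood. The function name `plackett_luce_nll` and the example scores are hypothetical, and the physics-guided rewards and LoRA-based reference are deliberately left out.

```python
import math


def plackett_luce_nll(scores):
    """Negative log-likelihood of a ranking under the Plackett-Luce model.

    `scores` are real-valued utilities listed from most preferred to least
    preferred. The ranking probability is a product of softmax choices:
    at each step the next item is picked among those not yet ranked.
    """
    nll = 0.0
    for k in range(len(scores) - 1):  # the last choice has probability 1
        remaining = scores[k:]
        peak = max(remaining)  # subtract the max for a stable log-sum-exp
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in remaining))
        nll -= scores[k] - log_norm
    return nll


# Hypothetical utilities: one preferred sample followed by two alternatives.
group_loss = plackett_luce_nll([2.0, 0.5, -1.0])

# With only two items the model reduces to the pairwise logistic form.
pair_loss = plackett_luce_nll([1.0, 0.0])
```

With a group of two, the loss equals the negative log-sigmoid of the score gap, which is the pairwise comparison that the groupwise model generalizes.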
CV Feb 12
UniT: Unified Multimodal Chain-of-Thought Test-time Scaling
Leon Liangyu Chen, Haoyu Ma, Zhipeng Fan et al.
Unified models can handle both multimodal understanding and generation within a single architecture, yet they typically operate in a single pass without iteratively refining their outputs. Many multimodal tasks, especially those involving complex spatial compositions, multiple interacting objects, or evolving instructions, require decomposing instructions, verifying intermediate results, and making iterative corrections. While test-time scaling (TTS) has demonstrated that allocating additional inference compute for iterative reasoning substantially improves language model performance, extending this paradigm to unified multimodal models remains an open challenge. We introduce UniT, a framework for multimodal chain-of-thought test-time scaling that enables a single unified model to reason, verify, and refine across multiple rounds. UniT combines agentic data synthesis, unified model training, and flexible test-time inference to elicit cognitive behaviors including verification, subgoal decomposition, and content memory. Our key findings are: (1) unified models trained on short reasoning trajectories generalize to longer inference chains at test time; (2) sequential chain-of-thought reasoning provides a more scalable and compute-efficient TTS strategy than parallel sampling; (3) training on generation and editing trajectories improves out-of-distribution visual reasoning. These results establish multimodal test-time scaling as an effective paradigm for advancing both generation and understanding in unified models.
CV Dec 12, 2025
Exploring MLLM-Diffusion Information Transfer with MetaCanvas
Han Lin, Xichen Pan, Ziqi Huang et al.
Multimodal learning has rapidly advanced visual understanding, largely via multimodal large language models (MLLMs) that use powerful LLMs as cognitive cores. In visual generation, however, these powerful core models are typically reduced to global text encoders for diffusion models, leaving most of their reasoning and planning ability unused. This creates a gap: current multimodal LLMs can parse complex layouts, attributes, and knowledge-intensive scenes, yet struggle to generate images or videos with equally precise and structured control. We propose MetaCanvas, a lightweight framework that lets MLLMs reason and plan directly in spatial and spatiotemporal latent spaces and interface tightly with diffusion generators. We empirically implement MetaCanvas on three different diffusion backbones and evaluate it across six tasks, including text-to-image generation, text/image-to-video generation, image/video editing, and in-context video generation, each requiring precise layouts, robust attribute binding, and reasoning-intensive control. MetaCanvas consistently outperforms global-conditioning baselines, suggesting that treating MLLMs as latent-space planners is a promising direction for narrowing the gap between multimodal understanding and generation.