NA May 24, 2016
Online Adaptive Local Multiscale Model Reduction for Heterogeneous Problems in Perforated Domains
Eric T. Chung, Yalchin Efendiev, Wing Tat Leung et al.
In this paper, we develop and analyze an adaptive multiscale approach for heterogeneous problems in perforated domains. In many applications, these problems have a multiscale nature arising because of the perforations, their geometries, the sizes of the perforations, and configurations. In this paper, we present a general offline/online procedure, which can adequately and adaptively represent the local degrees of freedom and derive appropriate coarse-grid equations. The main contributions of this paper are (1) the rigorous analysis of the offline approach (2) the development of the online procedures and their analysis (3) the development of adaptive strategies. We present an online procedure, which allows adaptively incorporating global information and is important for a fast convergence when combined with the adaptivity. Our methodology allows adding and guides constructing new online multiscale basis functions adaptively in appropriate regions. We present the convergence analysis of the online adaptive enrichment algorithm for the Stokes system. In particular, we show that the online procedure has a rapid convergence with a rate related to the number of offline basis functions, and one can obtain fast convergence by a sufficient number of offline basis functions, which are computed in the offline stage. To illustrate the performance of our method, we present numerical results with both small and large perforations. We see that only a few (1 or 2) online iterations can significantly improve the offline solution.
97.2 CV Jun 1
MetaWorld: Scaling Multi-Agent Video World Model from Single-view Video Data
Teng Hu, Mingchun Lu, Yating Wang et al.
Video world models are a foundational generative technology for embodied AI and the Metaverse, yet existing approaches are inherently limited to a single agent observing from a single perspective. Extending these models to multi-agent settings introduces two critical challenges: data scarcity (coordinated multi-view recordings are prohibitively expensive to collect for general open-domain scenarios) and world state alignment (independently generated video streams cannot ensure that shared physical environments and events evolve consistently across views). To address these challenges, we propose MetaWorld, a novel framework that scales multi-agent video world models to open-domain environments directly from single-view videos. First, we introduce Monocular World-State Unrolling (MWSU) to explicitly decompose monocular footage into the camera operator's ego-motion and the visible subject's spatial trajectory. This camera-trajectory decomposition naturally extracts synchronized multi-agent motion data within a shared 3D space, completely bypassing the need for multi-camera setups. Second, for precise visual control, we develop the Subject-Aware World Generator to enable appearance-driven simulation conditioned on per-agent identity images. Finally, to ensure both views are grounded in the identical physical reality, we propose World-State Alignment, a per-frame inter-branch cross-attention mechanism inserted at every transformer layer of the video DiT. By jointly synchronizing the denoising process, WSA enforces both static geometric consistency and dynamic motion consistency, encouraging that the shared 3D environment and physical events remain well-aligned across both egocentric views. Extensive experiments demonstrate that MetaWorld achieves superior cross-view consistency and identity fidelity, establishing a highly scalable, physics-driven paradigm for multi-agent video world modeling.
NA Jul 16, 2018
Nonlocal multicontinua upscaling for multicontinua flow problems in fractured porous media
Maria Vasilyeva, Eric T. Chung, Siu Wun Cheung et al.
The goal of this paper is to develop a new upscaling method for multicontinua flow problems in fractured porous media. We consider a system of equations that describes flow phenomena with multiple flow variables defined on both matrix and fractures. To construct our upscaled model, we apply the nonlocal multicontinua (NLMC) upscaling technique. The upscaled coefficients are obtained using multiscale basis functions, which are solutions of local problems defined on oversampled regions. For each continuum within a target coarse element, we solve a local problem defined on an oversampling region obtained by extending the target element by a few coarse-grid layers, with a set of constraints that force the local solution to have mean value one on the chosen continuum and zero mean otherwise. The resulting multiscale basis functions have been shown to have good approximation properties. To illustrate the idea of our approach, we consider a dual continua background model consisting of discrete fractures in two space dimensions, that is, we consider a system with three continua. We present several numerical examples, which show that our method is able to capture the interaction between matrix continua and discrete fractures on the coarse grid efficiently.
NA Oct 29, 2018
Prediction of Discretization of GMsFEM using Deep Learning
Min Wang, Siu Wun Cheung, Eric T. Chung et al.
In this paper, we propose a deep-learning-based approach to a class of multiscale problems. The Generalized Multiscale Finite Element Method (GMsFEM) has been proven successful as a model reduction technique for flow problems in heterogeneous and high-contrast porous media. The key ingredients of GMsFEM include multiscale basis functions and coarse-scale parameters, which are obtained from solving local problems in each coarse neighborhood. Given a fixed medium, these quantities are precomputed by solving local problems in an offline stage, and result in a reduced-order model. However, these quantities have to be re-computed in case of varying media. The objective of our work is to make use of deep learning techniques to mimic the nonlinear relation between the permeability field and the GMsFEM discretizations, and use neural networks to perform fast computation of GMsFEM ingredients repeatedly for a class of media. We provide numerical experiments to investigate the predictive power of neural networks and the usefulness of the resultant multiscale model in solving channelized porous media flow problems.
NA May 23, 2018
Upscaling method for problems in perforated domains with non-homogeneous boundary conditions on perforations using Non-Local Multi-Continuum method (NLMC)
Maria Vasilyeva, Eric T. Chung, Wing Tat Leung et al.
In this paper, we present an upscaling method for problems in perforated domains with non-homogeneous boundary conditions on perforations. Our methodology is based on the recently developed Non-local multicontinuum method (NLMC). The main ingredient of the method is the construction of suitable local basis functions with the capability of capturing multiscale features and non-local effects. We construct multiscale basis functions for the coarse regions and additional multiscale basis functions for perforations, with the aim of handling non-homogeneous boundary conditions on perforations. We start by describing our method for the Laplace equation, and then extend the framework to the elasticity problem and parabolic equations. The resulting upscaled model has minimal size and the solution has physical meaning on the coarse grid. We present numerical results (1) for steady and unsteady problems, (2) for Laplace and Elastic operators, and (3) for Neumann and Robin non-homogeneous boundary conditions on perforations. Numerical results show that the proposed method can provide good accuracy and a significant reduction in the degrees of freedom.
LG Jul 24, 2022
AMS-Net: Adaptive Multiscale Sparse Neural Network with Interpretable Basis Expansion for Multiphase Flow Problems
Yating Wang, Wing Tat Leung, Guang Lin
In this work, we propose an adaptive sparse learning algorithm that can be applied to learn the physical processes and obtain a sparse representation of the solution given a large snapshot space. Assume that there is a rich class of precomputed basis functions that can be used to approximate the quantity of interest. We then design a neural network architecture to learn the coefficients of solutions in the spaces which are spanned by these basis functions. The information of the basis functions is incorporated in the loss function, which minimizes the differences between the downscaled reduced order solutions and reference solutions at multiple time steps. The network contains multiple submodules and the solutions at different time steps can be learned simultaneously. We propose some strategies in the learning framework to identify important degrees of freedom. To find a sparse solution representation, a soft thresholding operator is applied to enforce the sparsity of the output coefficient vectors of the neural network. To avoid over-simplification and enrich the approximation space, some degrees of freedom can be added back to the system through a greedy algorithm. In both scenarios, that is, removing and adding degrees of freedom, the corresponding network connections are pruned or reactivated guided by the magnitude of the solution coefficients obtained from the network outputs. The proposed adaptive learning process is applied to some toy case examples to demonstrate that it can achieve a good basis selection and accurate approximation. More numerical tests are performed on two-phase multiscale flow problems to show the capability and interpretability of the proposed method on complicated applications.
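The soft thresholding step mentioned in this abstract can be illustrated with a minimal sketch. It assumes the standard operator S_lam(c) = sign(c) * max(|c| - lam, 0) applied elementwise to a coefficient vector; the function names, the threshold value, and the sample coefficients are hypothetical and are not taken from the paper.

```python
def soft_threshold(coeffs, lam):
    """Shrink each coefficient toward zero by lam; zero out those with |c| <= lam."""
    out = []
    for c in coeffs:
        mag = abs(c) - lam
        if mag <= 0:
            out.append(0.0)          # coefficient is pruned
        else:
            out.append(mag if c > 0 else -mag)  # keep sign, shrink magnitude
    return out


def active_dofs(coeffs):
    """Indices of the degrees of freedom that survive thresholding."""
    return [i for i, c in enumerate(coeffs) if c != 0.0]


# Hypothetical output coefficients of a network (illustrative values only).
coeffs = [0.75, -0.125, 0.25, -1.5]
sparse = soft_threshold(coeffs, 0.25)
print(sparse)               # [0.5, 0.0, 0.0, -1.25]
print(active_dofs(sparse))  # [0, 3]
```

In this sketch, the indices that drop out of `active_dofs` would correspond to the connections a pruning step could remove; how the paper couples thresholding to pruning and greedy re-activation is not specified here.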
RO Feb 4, 2024 Code
Point Cloud Matters: Rethinking the Impact of Different Observation Spaces on Robot Learning
Haoyi Zhu, Yating Wang, Di Huang et al.
In robot learning, the observation space is crucial due to the distinct characteristics of different modalities, which can potentially become a bottleneck alongside policy design. In this study, we explore the influence of various observation spaces on robot learning, focusing on three predominant modalities: RGB, RGB-D, and point cloud. We introduce OBSBench, a benchmark comprising two simulators and 125 tasks, along with standardized pipelines for various encoders and policy baselines. Extensive experiments on diverse contact-rich manipulation tasks reveal a notable trend: point cloud-based methods, even those with the simplest designs, frequently outperform their RGB and RGB-D counterparts. This trend persists in both scenarios: training from scratch and utilizing pre-training. Furthermore, our findings demonstrate that point cloud observations often yield better policy performance and significantly stronger generalization capabilities across various geometric and visual conditions. These outcomes suggest that the 3D point cloud is a valuable observation modality for intricate robotic tasks. We also suggest that incorporating both appearance and coordinate information can enhance the performance of point cloud methods. We hope our work provides valuable insights and guidance for designing more generalizable and robust robotic models. Codes are available at https://github.com/HaoyiZhu/PointCloudMatters.
CV Nov 6, 2024 Code
Textual Decomposition Then Sub-motion-space Scattering for Open-Vocabulary Motion Generation
Ke Fan, Jiangning Zhang, Ran Yi et al.
Text-to-motion generation is a crucial task in computer vision, which generates the target 3D motion by the given text. The existing annotated datasets are limited in scale, resulting in most existing methods overfitting to the small datasets and unable to generalize to the motions of the open domain. Some methods attempt to solve the open-vocabulary motion generation problem by aligning to the CLIP space or using the Pretrain-then-Finetuning paradigm. However, the current annotated dataset's limited scale only allows them to achieve mapping from sub-text-space to sub-motion-space, instead of mapping between full-text-space and full-motion-space (full mapping), which is the key to attaining open-vocabulary motion generation. To this end, this paper proposes to leverage the atomic motion (simple body part motions over a short time period) as an intermediate representation, and leverage two orderly coupled steps, i.e., Textual Decomposition and Sub-motion-space Scattering, to address the full mapping problem. For Textual Decomposition, we design a fine-grained description conversion algorithm, and combine it with the generalization ability of a large language model to convert any given motion text into atomic texts. Sub-motion-space Scattering learns the compositional process from atomic motions to the target motions, to make the learned sub-motion-space scattered to form the full-motion-space. For a given motion of the open domain, it transforms the extrapolation into interpolation and thereby significantly improves generalization. Our network, $DSO$-Net, combines textual $d$ecomposition and sub-motion-space $s$cattering to solve the $o$pen-vocabulary motion generation. Extensive experiments demonstrate that our DSO-Net achieves significant improvements over the state-of-the-art methods on open-vocabulary motion generation. Code is available at https://vankouf.github.io/DSONet/.
74.3 CL Apr 9 Code
Distributed Multi-Layer Editing for Rule-Level Knowledge in Large Language Models
Yating Wang, Wenting Zhao, Yaqi Zhao et al.
Large language models store not only isolated facts but also rules that support reasoning across symbolic expressions, natural language explanations, and concrete instances. Yet most model editing methods are built for fact-level knowledge, assuming that a target edit can be achieved through a localized intervention. This assumption does not hold for rule-level knowledge, where a single rule must remain consistent across multiple interdependent forms. We investigate this problem through a mechanistic study of rule-level knowledge editing. To support this study, we extend the RuleEdit benchmark from 80 to 200 manually verified rules spanning mathematics and physics. Fine-grained causal tracing reveals a form-specific organization of rule knowledge in transformer layers: formulas and descriptions are concentrated in earlier layers, while instances are more associated with middle layers. These results suggest that rule knowledge is not uniformly localized, and therefore cannot be reliably edited by a single-layer or contiguous-block intervention. Based on this insight, we propose Distributed Multi-Layer Editing (DMLE), which applies a shared early-layer update to formulas and descriptions and a separate middle-layer update to instances. While remaining competitive on standard editing metrics, DMLE achieves substantially stronger rule-level editing performance. On average, it improves instance portability and rule understanding by 13.91 and 50.19 percentage points, respectively, over the strongest baseline across GPT-J-6B, Qwen2.5-7B, Qwen2-7B, and LLaMA-3-8B. The code is available at https://github.com/Pepper66/DMLE.
30.4 CV Apr 8
Holistic Optimal Label Selection for Robust Prompt Learning under Partial Labels
Yaqi Zhao, Haoliang Sun, Yating Wang et al.
Prompt learning has gained significant attention as a parameter-efficient approach for adapting large pre-trained vision-language models to downstream tasks. However, when only partial labels are available, its performance is often limited by label ambiguity and insufficient supervisory information. To address this issue, we propose Holistic Optimal Label Selection (HopS), leveraging the generalization ability of pre-trained feature encoders through two complementary strategies. First, we design a local density-based filter that selects the top frequent labels from the nearest neighbors' candidate sets and uses the softmax scores to identify the most plausible label, capturing structural regularities in the feature space. Second, we introduce a global selection objective based on optimal transport that maps the uniform sampling distribution to the candidate label distributions across a batch. By minimizing the expected transport cost, it can determine the most likely label assignments. These two strategies work together to provide robust label selection from both local and global perspectives. Extensive experiments on eight benchmark datasets show that HopS consistently improves performance under partial supervision and outperforms all baselines. Those results highlight the merit of holistic label selection and offer a practical solution for prompt learning in weakly supervised settings.
CV Jan 5
HeadLighter: Disentangling Illumination in Generative 3D Gaussian Heads via Lightstage Captures
Yating Wang, Yuan Sun, Xuan Wang et al.
Recent 3D-aware head generative models based on 3D Gaussian Splatting achieve real-time, photorealistic and view-consistent head synthesis. However, a fundamental limitation persists: the deep entanglement of illumination and intrinsic appearance prevents controllable relighting. Existing disentanglement methods rely on strong assumptions to enable weakly supervised learning, which restricts their capacity for complex illumination. To address this challenge, we introduce HeadLighter, a novel supervised framework that learns a physically plausible decomposition of appearance and illumination in head generative models. Specifically, we design a dual-branch architecture that separately models lighting-invariant head attributes and physically grounded rendering components. A progressive disentanglement training is employed to gradually inject head appearance priors into the generative architecture, supervised by multi-view images captured under controlled light conditions with a light stage setup. We further introduce a distillation strategy to generate high-quality normals for realistic rendering. Experiments demonstrate that our method preserves high-quality generation and real-time rendering, while simultaneously supporting explicit lighting and viewpoint editing. We will publicly release our code and dataset.
28.7 CV May 6
Joint Semantic Token Selection and Prompt Optimization for Interpretable Prompt Learning
Yating Wang, Yaqi Zhao, Yongshun Gong et al.
Vision-language models such as CLIP achieve strong visual-textual alignment, but often suffer from overfitting and limited interpretability when adapted through continuous prompt learning. While discrete prompt optimization improves interpretability, it usually depends on large external models, leading to high computational costs and limited scalability. In this paper, we propose Interpretable Prompt Learning (IPL), a hybrid framework that alternates between discrete semantic token selection and continuous prompt optimization. Specifically, IPL formulates semantic token selection as an approximate submodular optimization problem, encouraging tokens that are both human-understandable and semantically diverse. It further adopts an alternating optimization strategy to integrate discrete token selection with continuous prompt tuning, improving interpretability while preserving adaptability to downstream tasks. Our framework is plug-and-play, allowing seamless integration with existing prompt learning methods. Extensive experiments on multiple benchmarks show that IPL consistently improves both interpretability and accuracy across five representative prompt learning methods, providing an effective and scalable extension to existing frameworks.
CV Apr 24, 2024
MotionMaster: Training-free Camera Motion Transfer For Video Generation
Teng Hu, Jiangning Zhang, Ran Yi et al.
The emergence of diffusion models has greatly propelled the progress in image and video generation. Recently, some efforts have been made in controllable video generation, including text-to-video generation and video motion control, among which camera motion control is an important topic. However, existing camera motion control methods rely on training a temporal camera module, and necessitate substantial computation resources due to the large number of parameters in video generation models. Moreover, existing methods pre-define camera motion types during training, which limits their flexibility in camera control. Therefore, to reduce training costs and achieve flexible camera control, we propose COMD, a novel training-free video motion transfer model, which disentangles camera motions and object motions in source videos and transfers the extracted camera motions to new videos. We first propose a one-shot camera motion disentanglement method to extract camera motion from a single source video, which separates the moving objects from the background and estimates the camera motion in the moving objects region based on the motion in the background by solving a Poisson equation. Furthermore, we propose a few-shot camera motion disentanglement method to extract the common camera motion from multiple videos with similar camera motions, which employs a window-based clustering technique to extract the common features in temporal attention maps of multiple videos. Finally, we propose a motion combination method to combine different types of camera motions together, giving our model more controllable and flexible camera control. Extensive experiments demonstrate that our training-free approach can effectively decouple camera-object motion and apply the decoupled camera motion to a wide range of controllable video generation tasks, achieving flexible and diverse camera motion control.
RO Jul 1, 2025
VQ-VLA: Improving Vision-Language-Action Models via Scaling Vector-Quantized Action Tokenizers
Yating Wang, Haoyi Zhu, Mingyu Liu et al.
In this paper, we introduce an innovative vector quantization based action tokenizer built upon the largest-scale action trajectory dataset to date, leveraging over 100 times more data than previous approaches. This extensive dataset enables our tokenizer to capture rich spatiotemporal dynamics, resulting in a model that not only accelerates inference but also generates smoother and more coherent action outputs. Once trained, the tokenizer can be seamlessly adapted to a wide range of downstream tasks in a zero-shot manner, from short-horizon reactive behaviors to long-horizon planning. A key finding of our work is that the domain gap between synthetic and real action trajectories is marginal, allowing us to effectively utilize a vast amount of synthetic data during training without compromising real-world performance. To validate our approach, we conducted extensive experiments in both simulated environments and on real robotic platforms. The results demonstrate that as the volume of synthetic trajectory data increases, the performance of our tokenizer on downstream tasks improves significantly, most notably achieving up to a 30% higher success rate on two real-world tasks in long-horizon scenarios. These findings highlight the potential of our action tokenizer as a robust and scalable solution for real-time embodied intelligence systems, paving the way for more efficient and reliable robotic control in diverse application domains. Project website: https://xiaoxiao0406.github.io/vqvla.github.io
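The core lookup behind vector-quantized tokenization can be sketched as a nearest-codebook search. This is only an illustration of generic vector quantization, not the paper's tokenizer: the codebook below is hand-written rather than learned, and the function names, action dimensionality, and sample trajectory are all hypothetical.

```python
def quantize(action, codebook):
    """Return the index of the codebook vector closest to `action`
    under squared Euclidean distance."""
    best_idx, best_dist = -1, float("inf")
    for idx, code in enumerate(codebook):
        dist = sum((a - c) ** 2 for a, c in zip(action, code))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def tokenize(trajectory, codebook):
    """Map each continuous action in a trajectory to a discrete token."""
    return [quantize(action, codebook) for action in trajectory]


def detokenize(tokens, codebook):
    """Map discrete tokens back to their codebook vectors."""
    return [codebook[t] for t in tokens]


# Hypothetical 2-D codebook and trajectory (illustrative values only).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
trajectory = [[0.1, -0.1], [0.9, 0.2], [0.2, 0.8]]

tokens = tokenize(trajectory, codebook)
print(tokens)                        # [0, 1, 2]
print(detokenize(tokens, codebook))  # [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
```

A learned tokenizer would typically place an encoder before the lookup and a decoder after it; those parts are omitted here because the abstract does not describe them.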
CV May 22, 2025
CoMo: Learning Continuous Latent Motion from Internet Videos for Scalable Robot Learning
Jiange Yang, Yansong Shi, Haoyi Zhu et al.
Learning latent motion from Internet videos is crucial for building generalist robots. However, existing discrete latent action methods suffer from information loss and struggle with complex and fine-grained dynamics. We propose CoMo, which aims to learn more informative continuous motion representations from diverse, internet-scale videos. CoMo employs an early temporal feature difference mechanism to prevent model collapse and suppress static appearance noise, effectively discouraging shortcut learning. Furthermore, guided by the information bottleneck principle, we constrain the latent motion embedding dimensionality to achieve a better balance between retaining sufficient action-relevant information and minimizing the inclusion of action-irrelevant appearance noise. Additionally, we introduce two new metrics for more robustly and affordably evaluating motion and guiding the development of motion learning methods: (i) the linear probing MSE of action prediction, and (ii) the cosine similarity between past-to-current and future-to-current motion embeddings. Critically, CoMo exhibits strong zero-shot generalization, enabling it to generate continuous pseudo actions for previously unseen video domains. This capability facilitates unified policy joint learning using pseudo actions derived from various action-less video datasets (such as cross-embodiment videos and, notably, human demonstration videos), potentially augmented with limited labeled robot data. Extensive experiments show that policies co-trained with CoMo pseudo actions achieve superior performance with both diffusion and autoregressive architectures in simulated and real-world settings.
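The two quantities named in this abstract, a mean squared error over predicted actions and a cosine similarity between two motion embeddings, can be computed as in the minimal sketch below. It shows only the arithmetic: the linear probe itself, the embedding model, and how the paper interprets the resulting values are not covered, and all function names and sample vectors are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def mean_squared_error(predictions, targets):
    """Average squared difference over all entries of two lists of vectors."""
    total, count = 0.0, 0
    for pred, target in zip(predictions, targets):
        for p, t in zip(pred, target):
            total += (p - t) ** 2
            count += 1
    return total / count


# Hypothetical embeddings standing in for past-to-current and
# future-to-current motion (illustrative values only).
past_to_current = [3.0, 4.0]
future_to_current = [-3.0, -4.0]
print(cosine_similarity(past_to_current, future_to_current))  # -1.0

# Hypothetical probe predictions versus ground-truth actions.
print(mean_squared_error([[1.0, 2.0]], [[0.0, 4.0]]))  # 2.5
```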
CV Apr 21, 2025
3D Gaussian Head Avatars with Expressive Dynamic Appearances by Compact Tensorial Representations
Yating Wang, Xuan Wang, Ran Yi et al.
Recent studies have combined 3D Gaussian and 3D Morphable Models (3DMM) to construct high-quality 3D head avatars. In this line of research, existing methods either fail to capture the dynamic textures or incur significant overhead in terms of runtime speed or storage space. To this end, we propose a novel method that addresses all the aforementioned demands. Specifically, we introduce an expressive and compact representation that encodes texture-related attributes of the 3D Gaussians in the tensorial format. We store the appearance of the neutral expression in static tri-planes, and represent dynamic texture details for different expressions using lightweight 1D feature lines, which are then decoded into opacity offset relative to the neutral face. We further propose adaptive truncated opacity penalty and class-balanced sampling to improve generalization across different expressions. Experiments show this design enables accurate capture of dynamic facial details while maintaining real-time rendering and significantly reducing storage costs, thus broadening the applicability to more scenarios.
80.0 CV Apr 6
AvatarPointillist: AutoRegressive 4D Gaussian Avatarization
Hongyu Liu, Xuan Wang, Yating Wang et al.
We introduce AvatarPointillist, a novel framework for generating dynamic 4D Gaussian avatars from a single portrait image. At the core of our method is a decoder-only Transformer that autoregressively generates a point cloud for 3D Gaussian Splatting. This sequential approach allows for precise, adaptive construction, dynamically adjusting point density and the total number of points based on the subject's complexity. During point generation, the AR model also jointly predicts per-point binding information, enabling realistic animation. After generation, a dedicated Gaussian decoder converts the points into complete, renderable Gaussian attributes. We demonstrate that conditioning the decoder on the latent features from the AR generator enables effective interaction between stages and markedly improves fidelity. Extensive experiments validate that AvatarPointillist produces high-quality, photorealistic, and controllable avatars. We believe this autoregressive formulation represents a new paradigm for avatar generation, and we will release our code to inspire future research.
CV Oct 4, 2025
Bridge Thinking and Acting: Unleashing Physical Potential of VLM with Generalizable Action Expert
Mingyu Liu, Zheng Huang, Xiaoyi Lin et al.
Although Vision-Language Models (VLMs) have demonstrated impressive planning and reasoning capabilities, translating these abilities into the physical world introduces significant challenges. Conventional Vision-Language-Action (VLA) models, which integrate reasoning and action into a monolithic architecture, generalize poorly because they are constrained by scarce, narrow-domain data. While recent dual-system approaches attempt to decouple "thinking" from "acting", they are often constrained by semantic ambiguities within the action module. This ambiguity makes large-scale, cross-task training infeasible. Consequently, these systems typically necessitate fine-tuning on newly collected data when deployed to novel environments, and the cooperation mechanism between the two systems remains ill-defined. To address these limitations, we introduce, for the first time, a framework centered around a generalizable action expert. Our approach utilizes sparse 3D trajectories as an intermediate representation, effectively bridging the high-level planning capabilities of the VLM with the low-level physical action module. During the planning phase, the VLM is only required to generate coarse 3D waypoints. These waypoints are then processed by our generalizable action expert, which refines them into dense, executable action sequences by sampling real-time point cloud observations of the environment. To promote training efficiency and robust generalization, we introduce a novel "Action Pre-training, Pointcloud Fine-tuning" paradigm. Our method combines the broad generalization capabilities of VLMs in visual understanding and planning with the fine-grained, action-level generalization of the action expert.
CV Sep 15, 2025
OmniWorld: A Multi-Domain and Multi-Modal Dataset for 4D World Modeling
Yang Zhou, Yifan Wang, Jianjun Zhou et al.
The field of 4D world modeling - aiming to jointly capture spatial geometry and temporal dynamics - has witnessed remarkable progress in recent years, driven by advances in large-scale generative models and multimodal learning. However, the development of truly general 4D world models remains fundamentally constrained by the availability of high-quality data. Existing datasets and benchmarks often lack the dynamic complexity, multi-domain diversity, and spatial-temporal annotations required to support key tasks such as 4D geometric reconstruction, future prediction, and camera-control video generation. To address this gap, we introduce OmniWorld, a large-scale, multi-domain, multi-modal dataset specifically designed for 4D world modeling. OmniWorld consists of a newly collected OmniWorld-Game dataset and several curated public datasets spanning diverse domains. Compared with existing synthetic datasets, OmniWorld-Game provides richer modality coverage, larger scale, and more realistic dynamic interactions. Based on this dataset, we establish a challenging benchmark that exposes the limitations of current state-of-the-art (SOTA) approaches in modeling complex 4D environments. Moreover, fine-tuning existing SOTA methods on OmniWorld leads to significant performance gains across 4D reconstruction and video generation tasks, strongly validating OmniWorld as a powerful resource for training and evaluation. We envision OmniWorld as a catalyst for accelerating the development of general-purpose 4D world models, ultimately advancing machines' holistic understanding of the physical world.
RO Aug 28, 2025
Learning Primitive Embodied World Models: Towards Scalable Robotic Learning
Qiao Sun, Liujia Yang, Wei Tang et al.
While video-generation-based embodied world models have gained increasing attention, their reliance on large-scale embodied interaction data remains a key bottleneck. The scarcity, difficulty of collection, and high dimensionality of embodied data fundamentally limit the alignment granularity between language and actions and exacerbate the challenge of long-horizon video generation--hindering generative models from achieving a "GPT moment" in the embodied domain. There is a naive observation: the diversity of embodied data far exceeds the relatively small space of possible primitive motions. Based on this insight, we propose a novel paradigm for world modeling--Primitive Embodied World Models (PEWM). By restricting video generation to fixed short horizons, our approach 1) enables fine-grained alignment between linguistic concepts and visual representations of robotic actions, 2) reduces learning complexity, 3) improves data efficiency in embodied data collection, and 4) decreases inference latency. By equipping with a modular Vision-Language Model (VLM) planner and a Start-Goal heatmap Guidance mechanism (SGG), PEWM further enables flexible closed-loop control and supports compositional generalization of primitive-level policies over extended, complex tasks. Our framework leverages the spatiotemporal vision priors in video models and the semantic awareness of VLMs to bridge the gap between fine-grained physical interaction and high-level reasoning, paving the way toward scalable, interpretable, and general-purpose embodied intelligence.
CL Jun 18, 2025
MinosEval: Distinguishing Factoid and Non-Factoid for Tailored Open-Ended QA Evaluation with LLMs
Yongqi Fan, Yating Wang, Guandong Wang et al.
Open-ended question answering (QA) is a key task for evaluating the capabilities of large language models (LLMs). Compared to closed-ended QA, it demands longer answer statements, more nuanced reasoning processes, and diverse expressions, making refined and interpretable automatic evaluation both crucial and challenging. Traditional metrics like ROUGE and BERTScore struggle to capture semantic similarities because model responses and reference answers follow different patterns. Current LLM-based evaluation approaches, such as pairwise or listwise comparisons of candidate answers, lack intuitive interpretability. While pointwise scoring of each response provides some descriptions, it fails to adapt across different question contents. Most notably, existing methods overlook the distinction between factoid and non-factoid questions. To address these challenges, we propose MinosEval, a novel evaluation method that first classifies open-ended questions as factoid or non-factoid and then ranks candidate answers using different evaluation strategies. For factoid questions, it applies an adaptive key-point scoring strategy, while for non-factoid questions, it uses an instance-aware listwise ranking strategy. Experiments on multiple open-ended QA datasets, including self-built ones with more candidate responses to complement community resources, show that MinosEval better aligns with human annotations and offers more interpretable results.
CV Jun 24, 2024
ID-Sculpt: ID-aware 3D Head Generation from Single In-the-wild Portrait Image
Jinkun Hao, Junshu Tang, Jiangning Zhang et al.
While recent works have achieved great success on image-to-3D object generation, high-quality, high-fidelity 3D head generation from a single image remains a great challenge. Previous text-based methods for generating 3D heads were limited by text descriptions, and image-based methods struggled to produce high-quality head geometry. To handle this challenging problem, we propose a novel framework, ID-Sculpt, to generate high-quality 3D heads while preserving their identities. Our work incorporates the identity information of the portrait image into three stages: 1) geometry initialization, 2) geometry sculpting, and 3) texture generation. Given a reference portrait image, we first align the identity features with text features to realize ID-aware guidance enhancement, which contains the control signals representing the face information. We then use the Canny map, ID features of the portrait image, and a pre-trained text-to-normal/depth diffusion model to generate ID-aware geometry supervision, and 3D-GAN inversion is employed to generate ID-aware geometry initialization. Furthermore, with the ability to inject identity information into 3D head generation, we use ID-aware guidance to calculate ID-aware Score Distillation (ISD) for geometry sculpting. For texture generation, we adopt ID Consistent Texture Inpainting and Refinement, which progressively expands the view for texture inpainting to obtain an initial UV texture map. We then use the ID-aware guidance to provide image-level supervision for noisy multi-view images to obtain a refined texture map. Extensive experiments demonstrate that we can generate high-quality 3D heads with accurate geometry and texture from a single in-the-wild portrait image.
CV Apr 8, 2024
Reconstructing Topology-Consistent Face Mesh by Volume Rendering from Multi-View Images
Yating Wang, Ran Yi, Xiaoning Lei et al.
Industrial 3D face asset creation typically reconstructs topology-consistent face meshes from multi-view images for downstream production. However, high-quality reconstruction usually requires manual processing or specific capture settings. Recently, NeRF has shown great advantages in 3D reconstruction by representing scenes as density and radiance fields and utilizing neural volume rendering for novel view synthesis. Inspired by this, we introduce a novel method that combines an explicit mesh with neural volume rendering to optimize the geometry of an artist-made template face mesh from multi-view images while keeping the topology unchanged. Our method derives density fields from meshes using distance fields as an intermediary and encodes the radiance field in compact tri-planes. To improve convergence, several adaptations tailored for meshes are introduced to the volume rendering. Experiments demonstrate that our method achieves superior reconstruction quality compared to previous approaches, validating the feasibility of integrating mesh and neural volume rendering.
NA Mar 21, 2024
Learning-based Multi-continuum Model for Multiscale Flow Problems
Fan Wang, Yating Wang, Wing Tat Leung et al.
Multiscale problems can usually be approximated through numerical homogenization by an equation with some effective parameters that can capture the macroscopic behavior of the original system on the coarse grid to speed up the simulation. However, this approach usually assumes scale separation and that the heterogeneity of the solution can be approximated by the solution average in each coarse block. For complex multiscale problems, the computed single effective properties/continuum might be inadequate. In this paper, we propose a novel learning-based multi-continuum model to enrich the homogenized equation and improve the accuracy of the single continuum model for multiscale problems with some given data. Without loss of generality, we consider a two-continuum case. The first flow equation keeps the information of the original homogenized equation with an additional interaction term. The second continuum is newly introduced, and the effective permeability in the second flow equation is determined by a neural network. The interaction term between the two continua aligns with that used in the dual-porosity model but with a learnable coefficient determined by another neural network. The new model with neural network terms is then optimized using trusted data. We discuss both direct back-propagation and the adjoint method for the PDE-constrained optimization problem. Our proposed learning-based multi-continuum model can resolve multiple interacting media within each coarse grid block and describe the mass transfer among them, and it has been demonstrated to significantly improve the simulation results through numerical experiments involving both linear and nonlinear flow equations.
NA Oct 3, 2020
An adaptive Hessian approximated stochastic gradient MCMC method
Yating Wang, Wei Deng, Guang Lin
Bayesian approaches have been successfully integrated into training deep neural networks. One popular family is stochastic gradient Markov chain Monte Carlo methods (SG-MCMC), which have gained increasing interest due to their scalability to large datasets and their ability to avoid overfitting. Although standard SG-MCMC methods have shown great performance in a variety of problems, they may be inefficient when the random variables in the target posterior densities have scale differences or are highly correlated. In this work, we present an adaptive Hessian approximated stochastic gradient MCMC method to incorporate local geometric information while sampling from the posterior. The idea is to apply stochastic approximation to sequentially update a preconditioning matrix at each iteration. The preconditioner possesses second-order information and can guide the random walk of a sampler efficiently. Instead of computing and saving the full Hessian of the log posterior, we use a limited memory of samples and their stochastic gradients to approximate the inverse Hessian-vector multiplication in the updating formula. Moreover, by smoothly optimizing the preconditioning matrix, our proposed algorithm can asymptotically converge to the target distribution with a controllable bias under mild conditions. To reduce the training and testing computational burden, we adopt a magnitude-based weight pruning method to enforce the sparsity of the network. Our method is user-friendly and can be applied to standard SG-MCMC updating rules by implementing an additional preconditioner. The sparse approximation of the inverse Hessian alleviates storage and computational complexities for high-dimensional models. The bias introduced by stochastic approximation is controllable and can be analyzed theoretically. Numerical experiments are performed on several problems.
NA Jun 29, 2020
Bayesian Sparse learning with preconditioned stochastic gradient MCMC and its applications
Yating Wang, Wei Deng, Lin Guang
In this work, we propose a Bayesian-type sparse deep learning algorithm. The algorithm utilizes a set of spike-and-slab priors for the parameters in the deep neural network. The hierarchical Bayesian mixture will be trained using an adaptive empirical method. That is, one will alternately sample from the posterior using preconditioned stochastic gradient Langevin dynamics (PSGLD) and optimize the latent variables via stochastic approximation. The sparsity of the network is achieved while optimizing the hyperparameters with adaptive searching and penalizing. A popular SG-MCMC approach is stochastic gradient Langevin dynamics (SGLD). However, considering the complex geometry in the model parameter space in non-convex learning, updating parameters using a universal step size in each component as in SGLD may cause slow mixing. To address this issue, we apply a computationally manageable preconditioner in the updating rule, which provides a step-size parameter to adapt to local geometric properties. Moreover, by smoothly optimizing the hyperparameter in the preconditioning matrix, our proposed algorithm ensures a decreasing bias, which is introduced by ignoring the correction term in preconditioned SGLD. According to the existing theoretical framework, we show that the proposed algorithm can asymptotically converge to the correct distribution with a controllable bias under mild conditions. Numerical tests are performed on both synthetic regression problems and learning the solutions of elliptic PDEs, which demonstrate the accuracy and efficiency of the present work.
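To make the preconditioned update concrete, here is a minimal sketch of one preconditioned SGLD step. It assumes the common RMSprop-style diagonal preconditioner and drops the correction term, as the abstract mentions; it is not necessarily the preconditioner used in the paper. The names `psgld_step` and `grad_log_gaussian`, the toy Gaussian target, and all hyperparameter values are hypothetical choices for illustration.

```python
import numpy as np

def psgld_step(theta, grad_log_post, v, rng, eps=1e-2, alpha=0.99, lam=1e-5):
    """One preconditioned SGLD step (assumed RMSprop-style diagonal preconditioner)."""
    g = grad_log_post(theta)
    # Running average of squared gradients estimates the local scale per component.
    v = alpha * v + (1.0 - alpha) * g * g
    # Diagonal preconditioner: larger steps where gradients are small, and vice versa.
    G = 1.0 / (lam + np.sqrt(v))
    # Langevin update; the correction term of full preconditioned SGLD is ignored here.
    noise = rng.normal(size=theta.shape) * np.sqrt(eps * G)
    theta = theta + 0.5 * eps * G * g + noise
    return theta, v

def grad_log_gaussian(theta, scales=np.array([1.0, 0.1])):
    """Gradient of the log-density of a toy zero-mean Gaussian with unequal scales."""
    return -theta / scales**2

rng = np.random.default_rng(0)
theta, v = np.array([1.0, 1.0]), np.zeros(2)
samples = []
for _ in range(2000):
    theta, v = psgld_step(theta, grad_log_gaussian, v, rng)
    samples.append(theta.copy())
samples = np.asarray(samples)
```

The two components of the toy target differ in scale by a factor of ten, which is the situation where a single universal step size mixes slowly and a per-component preconditioner is expected to help.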
SI Mar 1, 2019
Data-driven Approach for Quality Evaluation on Knowledge Sharing Platform
Lu Xu, Jinhai Xiang, Yating Wang et al.
In recent years, voice knowledge sharing and question answering (Q&A) platforms have attracted much attention, as they greatly facilitate knowledge acquisition. However, little research has addressed quality evaluation of voice knowledge sharing. This paper presents a data-driven approach to automatically evaluate the quality of a specific Q&A platform (Zhihu Live). Extensive experiments demonstrate the effectiveness of the proposed method. Furthermore, we introduce a dataset of Zhihu Live as an open resource for researchers in related areas. This dataset will facilitate the development of new methods for quality evaluation of knowledge sharing services.
NA Jun 13, 2018
Deep Multiscale Model Learning
Yating Wang, Siu Wun Cheung, Eric T. Chung et al.
The objective of this paper is to design novel multi-layer neural network architectures for multiscale simulations of flows, taking into account the observed data and physical modeling concepts. Our approaches use deep learning concepts combined with local multiscale model reduction methodologies to predict flow dynamics. Using reduced-order model concepts is important for constructing robust deep learning architectures since the reduced-order models provide fewer degrees of freedom. Flow dynamics can be thought of as multi-layer networks. More precisely, the solution (e.g., pressures and saturations) at the time instant $n+1$ depends on the solution at the time instant $n$ and input parameters, such as permeability fields, forcing terms, and initial conditions. One can regard the solution as a multi-layer network, where each layer, in general, is a nonlinear forward map and the number of layers relates to the internal time steps. We will rely on rigorous model reduction concepts to define unknowns and connections for each layer. In each layer, our reduced-order models will provide a forward map, which will be modified ("trained") using available data. It is critical to use reduced-order models for this purpose, which will identify the regions of influence and the appropriate number of variables. Because of the lack of available data, the training will be supplemented with computational data as needed and with interpolation between data-rich and data-deficient models. We will also use deep learning algorithms to train the elements of the reduced model discrete system. We will present the main ingredients of our approach and numerical results. Numerical results show that, using deep learning and multiscale models, we can improve the forward models, which are conditioned to the available data.
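The "time steps as layers" picture can be sketched as follows. This is only an illustration under stated assumptions, not the architecture from the paper: the reduced-order forward map is taken to be an implicit Euler step for a small 1D diffusion problem, and each layer adds a hypothetical trainable correction `tanh(W u + b)`. The names `coarse_step_matrix`, `layer`, and `forward` are invented for this example.

```python
import numpy as np

def coarse_step_matrix(n, dt=0.1):
    """Assumed reduced-order one-step map: implicit Euler for 1D diffusion on n coarse unknowns."""
    K = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)  # tridiagonal stiffness-like matrix
    return np.linalg.inv(np.eye(n) + dt * K)

def layer(u, A, W, b):
    """One time step as one layer: physics-based map A plus a trainable correction."""
    return A @ u + np.tanh(W @ u + b)

def forward(u0, A, params):
    """Compose one layer per time step; params is a list of (W, b) pairs."""
    u = u0
    for W, b in params:
        u = layer(u, A, W, b)
    return u

n, n_steps = 6, 4
A = coarse_step_matrix(n)
u0 = np.linspace(0.0, 1.0, n)
# With zero weights the correction vanishes and the network is the untrained reduced model.
params = [(np.zeros((n, n)), np.zeros(n)) for _ in range(n_steps)]
u_final = forward(u0, A, params)
```

Starting from zero correction weights means the network initially reproduces the reduced-order solver exactly, so training on data only has to learn the discrepancy between the model and the observations.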
NA Aug 28, 2017
Non-local Multi-continua Upscaling for Flows in Heterogeneous Fractured Media
Eric T. Chung, Yalchin Efendiev, Wing Tat Leung et al.
In this paper, we propose a rigorous and accurate non-local (in the oversampled region) upscaling framework based on some recently developed multiscale methods [10]. Our proposed method consists of identifying multi-continua parameters via local basis functions and constructing non-local (in the oversampled region) transfer and effective properties. To achieve this, we significantly modify our recent work proposed within the Generalized Multiscale Finite Element Method (GMsFEM) in [10] and derive appropriate local problems in oversampled regions once we identify important modes representing each continuum. We use piecewise constant functions in each fracture network and in the matrix to write an upscaled equation. Thus, the resulting upscaled equation is of minimal size and the unknowns are average pressures in the fractures and the matrix. We note that the use of non-local upscaled models for porous media flows is not new, e.g., in [14], the authors derive a non-local approach. Our main contribution is identifying appropriate local problems together with local spectral modes to represent each continuum. The model problem for fractures assumes that one can identify fracture networks. The resulting non-local equation (restricted to the oversampling region, which is several times larger compared to the target coarse block) has the same form as \cite{Hamdi_Nonlocal} with much smaller local regions. We present numerical results, which show that the proposed approach can provide good accuracy.
NA Aug 25, 2016
A conservative local multiscale model reduction technique for Stokes flows in heterogeneous perforated domains
Eric T. Chung, Maria Vasilyeva, Yating Wang
In this paper, we present a new multiscale model reduction technique for Stokes flows in heterogeneous perforated domains. The challenge in the numerical simulations of this problem lies in the fact that the solution contains many multiscale features and requires a very fine mesh to resolve all details. In order to efficiently compute the solutions, some model reduction is necessary. To obtain a reduced model, we apply the generalized multiscale finite element approach, which is a framework allowing systematic construction of reduced models. Based on this general framework, we will first construct a local snapshot space, which contains many possible multiscale features of the solution. Using the snapshot space and a local spectral problem, we identify dominant modes in the snapshot space and use them as the multiscale basis functions. Our basis functions are constructed locally with non-overlapping supports, which enhances the sparsity of the resulting linear system. In order to enforce mass conservation, we propose a hybridized technique that uses a Lagrange multiplier. We will mathematically analyze the stability and the convergence of the proposed method. In addition, we will present some numerical examples to show the performance of the scheme. We show that, with a few basis functions per coarse region, one can obtain a solution with excellent accuracy.
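The step "snapshot space plus local spectral problem gives dominant modes" can be illustrated with a small generalized eigenvalue problem. The sketch below assumes that the local operators have already been projected onto the snapshot space as symmetric positive definite matrices `A` and `S`, and that the dominant modes are the eigenvectors of `A phi = lambda S phi` with the smallest eigenvalues, as is usual in generalized multiscale finite element methods. The matrices here are toy stand-ins rather than Stokes operators, and `dominant_modes` is a hypothetical helper name.

```python
import numpy as np

def dominant_modes(A, S, n_modes):
    """Solve A phi = lambda S phi and return the n_modes smallest eigenpairs."""
    L = np.linalg.cholesky(S)      # S = L L^T
    Linv = np.linalg.inv(L)
    M = Linv @ A @ Linv.T          # equivalent standard symmetric eigenproblem
    M = 0.5 * (M + M.T)            # remove round-off asymmetry
    lam, Y = np.linalg.eigh(M)     # eigenvalues in ascending order
    Phi = Linv.T @ Y[:, :n_modes]  # map eigenvectors back to snapshot coordinates
    return lam[:n_modes], Phi

n = 8
# Toy stand-ins for the projected local matrices (assumptions, not Stokes operators).
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)  # stiffness-like, SPD
S = np.diag(np.linspace(1.0, 10.0, n))                # mass-like with varying weight, SPD
lam, Phi = dominant_modes(A, S, 3)
```

The columns of `Phi` are orthonormal with respect to `S`; in a multiscale method they would be multiplied by the snapshot vectors to obtain the basis functions on the fine grid.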