CVNov 22, 2023
Compact 3D Gaussian Representation for Radiance FieldJoo Chan Lee, Daniel Rho, Xiangyu Sun et al.
Neural Radiance Fields (NeRFs) have demonstrated remarkable potential in capturing complex 3D scenes with high fidelity. However, one persistent challenge that hinders the widespread adoption of NeRFs is the computational bottleneck due to the volumetric rendering. On the other hand, 3D Gaussian splatting (3DGS) has recently emerged as an alternative representation that leverages a 3D Gaussisan-based representation and adopts the rasterization pipeline to render the images rather than volumetric rendering, achieving very fast rendering speed and promising image quality. However, a significant drawback arises as 3DGS entails a substantial number of 3D Gaussians to maintain the high fidelity of the rendered images, which requires a large amount of memory and storage. To address this critical issue, we place a specific emphasis on two key objectives: reducing the number of Gaussian points without sacrificing performance and compressing the Gaussian attributes, such as view-dependent color and covariance. To this end, we propose a learnable mask strategy that significantly reduces the number of Gaussians while preserving high performance. In addition, we propose a compact but effective representation of view-dependent color by employing a grid-based neural field rather than relying on spherical harmonics. Finally, we learn codebooks to compactly represent the geometric attributes of Gaussian by vector quantization. With model compression techniques such as quantization and entropy coding, we consistently show over 25$\times$ reduced storage and enhanced rendering speed, while maintaining the quality of the scene representation, compared to 3DGS. Our work provides a comprehensive framework for 3D scene representation, achieving high performance, fast training, compactness, and real-time rendering. Our project page is available at https://maincold2.github.io/c3dgs/.
DBApr 7, 2023
ChatPipe: Orchestrating Data Preparation Program by Optimizing Human-ChatGPT InteractionsSibei Chen, Hanbing Liu, Weiting Jin et al.
Orchestrating a high-quality data preparation program is essential for successful machine learning (ML), but it is known to be time and effort consuming. Despite the impressive capabilities of large language models like ChatGPT in generating programs by interacting with users through natural language prompts, there are still limitations. Specifically, a user must provide specific prompts to iteratively guide ChatGPT in improving data preparation programs, which requires a certain level of expertise in programming, the dataset used and the ML task. Moreover, once a program has been generated, it is non-trivial to revisit a previous version or make changes to the program without starting the process over again. In this paper, we present ChatPipe, a novel system designed to facilitate seamless interaction between users and ChatGPT. ChatPipe provides users with effective recommendation on next data preparation operations, and guides ChatGPT to generate program for the operations. Also, ChatPipe enables users to easily roll back to previous versions of the program, which facilitates more efficient experimentation and testing. We have developed a web application for ChatPipe and prepared several real-world ML tasks from Kaggle. These tasks can showcase the capabilities of ChatPipe and enable VLDB attendees to easily experiment with our novel features to rapidly orchestrate a high-quality data preparation program.
LGJan 26, 2023
Cause-Effect Inference in Location-Scale Noise Models: Maximum Likelihood vs. Independence TestingXiangyu Sun, Oliver Schulte
A fundamental problem of causal discovery is cause-effect inference, learning the correct causal direction between two random variables. Significant progress has been made through modelling the effect as a function of its cause and a noise term, which allows us to leverage assumptions about the generating function class. The recently introduced heteroscedastic location-scale noise functional models (LSNMs) combine expressive power with identifiability guarantees. LSNM model selection based on maximizing likelihood achieves state-of-the-art accuracy, when the noise distributions are correctly specified. However, through an extensive empirical evaluation, we demonstrate that the accuracy deteriorates sharply when the form of the noise distribution is misspecified by the user. Our analysis shows that the failure occurs mainly when the conditional variance in the anti-causal direction is smaller than that in the causal direction. As an alternative, we find that causal model selection through residual independence testing is much more robust to noise misspecification and misleading conditional variance.
CVAug 7, 2024
Compact 3D Gaussian Splatting for Static and Dynamic Radiance FieldsJoo Chan Lee, Daniel Rho, Xiangyu Sun et al.
3D Gaussian splatting (3DGS) has recently emerged as an alternative representation that leverages a 3D Gaussian-based representation and introduces an approximated volumetric rendering, achieving very fast rendering speed and promising image quality. Furthermore, subsequent studies have successfully extended 3DGS to dynamic 3D scenes, demonstrating its wide range of applications. However, a significant drawback arises as 3DGS and its following methods entail a substantial number of Gaussians to maintain the high fidelity of the rendered images, which requires a large amount of memory and storage. To address this critical issue, we place a specific emphasis on two key objectives: reducing the number of Gaussian points without sacrificing performance and compressing the Gaussian attributes, such as view-dependent color and covariance. To this end, we propose a learnable mask strategy that significantly reduces the number of Gaussians while preserving high performance. In addition, we propose a compact but effective representation of view-dependent color by employing a grid-based neural field rather than relying on spherical harmonics. Finally, we learn codebooks to compactly represent the geometric and temporal attributes by residual vector quantization. With model compression techniques such as quantization and entropy coding, we consistently show over 25x reduced storage and enhanced rendering speed compared to 3DGS for static scenes, while maintaining the quality of the scene representation. For dynamic scenes, our approach achieves more than 12x storage efficiency and retains a high-quality reconstruction compared to the existing state-of-the-art methods. Our work provides a comprehensive framework for 3D scene representation, achieving high performance, fast training, compactness, and real-time rendering. Our project page is available at https://maincold2.github.io/c3dgs/.
85.1CVMar 24
2Xplat: Two Experts Are Better Than One GeneralistHwasik Jeong, Seungryong Lee, Gyeongjin Kang et al.
Pose-free feed-forward 3D Gaussian Splatting (3DGS) has opened a new frontier for rapid 3D modeling, enabling high-quality Gaussian representations to be generated from uncalibrated multi-view images in a single forward pass. The dominant approach in this space adopts unified monolithic architectures, often built on geometry-centric 3D foundation models, to jointly estimate camera poses and synthesize 3DGS representations within a single network. While architecturally streamlined, such "all-in-one" designs may be suboptimal for high-fidelity 3DGS generation, as they entangle geometric reasoning and appearance modeling within a shared representation. In this work, we introduce 2Xplat, a pose-free feed-forward 3DGS framework based on a two-expert design that explicitly separates geometry estimation from Gaussian generation. A dedicated geometry expert first predicts camera poses, which are then explicitly passed to a powerful appearance expert that synthesizes 3D Gaussians. Despite its conceptual simplicity, being largely underexplored in prior works, the proposed approach proves highly effective. In fewer than 5K training iterations, the proposed two-experts pipeline substantially outperforms prior pose-free feed-forward 3DGS approaches and achieves performance on par with state-of-the-art posed methods. These results challenge the prevailing unified paradigm and suggest the potential advantages of modular design principles for complex 3D geometric estimation and appearance synthesis tasks.
CVDec 31, 2024Code
OVGaussian: Generalizable 3D Gaussian Segmentation with Open VocabulariesRunnan Chen, Xiangyu Sun, Zhaoqing Wang et al.
Open-vocabulary scene understanding using 3D Gaussian (3DGS) representations has garnered considerable attention. However, existing methods mostly lift knowledge from large 2D vision models into 3DGS on a scene-by-scene basis, restricting the capabilities of open-vocabulary querying within their training scenes so that lacking the generalizability to novel scenes. In this work, we propose \textbf{OVGaussian}, a generalizable \textbf{O}pen-\textbf{V}ocabulary 3D semantic segmentation framework based on the 3D \textbf{Gaussian} representation. We first construct a large-scale 3D scene dataset based on 3DGS, dubbed \textbf{SegGaussian}, which provides detailed semantic and instance annotations for both Gaussian points and multi-view images. To promote semantic generalization across scenes, we introduce Generalizable Semantic Rasterization (GSR), which leverages a 3D neural network to learn and predict the semantic property for each 3D Gaussian point, where the semantic property can be rendered as multi-view consistent 2D semantic maps. In the next, we propose a Cross-modal Consistency Learning (CCL) framework that utilizes open-vocabulary annotations of 2D images and 3D Gaussians within SegGaussian to train the 3D neural network capable of open-vocabulary semantic segmentation across Gaussian-based 3D scenes. Experimental results demonstrate that OVGaussian significantly outperforms baseline methods, exhibiting robust cross-scene, cross-domain, and novel-view generalization capabilities. Code and the SegGaussian dataset will be released. (https://github.com/runnanchen/OVGaussian).
CVAug 5, 2025Code
Uni3R: Unified 3D Reconstruction and Semantic Understanding via Generalizable Gaussian Splatting from Unposed Multi-View ImagesXiangyu Sun, Haoyi Jiang, Liu Liu et al.
Reconstructing and semantically interpreting 3D scenes from sparse 2D views remains a fundamental challenge in computer vision. Conventional methods often decouple semantic understanding from reconstruction or necessitate costly per-scene optimization, thereby restricting their scalability and generalizability. In this paper, we introduce Uni3R, a novel feed-forward framework that jointly reconstructs a unified 3D scene representation enriched with open-vocabulary semantics, directly from unposed multi-view images. Our approach leverages a Cross-View Transformer to robustly integrate information across arbitrary multi-view inputs, which then regresses a set of 3D Gaussian primitives endowed with semantic feature fields. This unified representation facilitates high-fidelity novel view synthesis, open-vocabulary 3D semantic segmentation, and depth prediction, all within a single, feed-forward pass. Extensive experiments demonstrate that Uni3R establishes a new state-of-the-art across multiple benchmarks, including 25.07 PSNR on RE10K and 55.84 mIoU on ScanNet. Our work signifies a novel paradigm towards generalizable, unified 3D scene reconstruction and understanding. The code is available at https://github.com/HorizonRobotics/Uni3R.
73.7AIApr 2
MolClaw: An Autonomous Agent with Hierarchical Skills for Drug Molecule Evaluation, Screening, and OptimizationLisheng Zhang, Lilong Wang, Xiangyu Sun et al.
Computational drug discovery, particularly the complex workflows of drug molecule screening and optimization, requires orchestrating dozens of specialized tools in multi-step workflows, yet current AI agents struggle to maintain robust performance and consistently underperform in these high-complexity scenarios. Here we present MolClaw, an autonomous agent that leads drug molecule evaluation, screening, and optimization. It unifies over 30 specialized domain resources through a three-tier hierarchical skill architecture (70 skills in total) that facilitates agent long-term interaction at runtime: tool-level skills standardize atomic operations, workflow-level skills compose them into validated pipelines with quality check and reflection, and a discipline-level skill supplies scientific principles governing planning and verification across all scenarios in the field. Additionally, we introduce MolBench, a benchmark comprising molecular screening, optimization, and end-to-end discovery challenges spanning 8 to 50+ sequential tool calls. MolClaw achieves state-of-the-art performance across all metrics, and ablation studies confirm that gains concentrate on tasks that demand structured workflows while vanishing on those solvable with ad hoc scripting, establishing workflow orchestration competence as the primary capability bottleneck for AI-driven drug discovery.
LGSep 9, 2021Code
NTS-NOTEARS: Learning Nonparametric DBNs With Prior KnowledgeXiangyu Sun, Oliver Schulte, Guiliang Liu et al.
We describe NTS-NOTEARS, a score-based structure learning method for time-series data to learn dynamic Bayesian networks (DBNs) that captures nonlinear, lagged (inter-slice) and instantaneous (intra-slice) relations among variables. NTS-NOTEARS utilizes 1D convolutional neural networks (CNNs) to model the dependence of child variables on their parents; 1D CNN is a neural function approximation model well-suited for sequential data. DBN-CNN structure learning is formulated as a continuous optimization problem with an acyclicity constraint, following the NOTEARS DAG learning approach. We show how prior knowledge of dependencies (e.g., forbidden and required edges) can be included as additional optimization constraints. Empirical evaluation on simulated and benchmark data show that NTS-NOTEARS achieves state-of-the-art DAG structure quality compared to both parametric and nonparametric baseline methods, with improvement in the range of 10-20% on the F1-score. We also evaluate NTS-NOTEARS on complex real-world data acquired from professional ice hockey games that contain a mixture of continuous and discrete variables. The code is available online.
CVJan 1, 2024
Deblurring 3D Gaussian SplattingByeonghyeon Lee, Howoong Lee, Xiangyu Sun et al.
Recent studies in Radiance Fields have paved the robust way for novel view synthesis with their photorealistic rendering quality. Nevertheless, they usually employ neural networks and volumetric rendering, which are costly to train and impede their broad use in various real-time applications due to the lengthy rendering time. Lately 3D Gaussians splatting-based approach has been proposed to model the 3D scene, and it achieves remarkable visual quality while rendering the images in real-time. However, it suffers from severe degradation in the rendering quality if the training images are blurry. Blurriness commonly occurs due to the lens defocusing, object motion, and camera shake, and it inevitably intervenes in clean image acquisition. Several previous studies have attempted to render clean and sharp images from blurry input images using neural fields. The majority of those works, however, are designed only for volumetric rendering-based neural radiance fields and are not straightforwardly applicable to rasterization-based 3D Gaussian splatting methods. Thus, we propose a novel real-time deblurring framework, Deblurring 3D Gaussian Splatting, using a small Multi-Layer Perceptron (MLP) that manipulates the covariance of each 3D Gaussian to model the scene blurriness. While Deblurring 3D Gaussian Splatting can still enjoy real-time rendering, it can reconstruct fine and sharp details from blurry images. A variety of experiments have been conducted on the benchmark, and the results have revealed the effectiveness of our approach for deblurring. Qualitative results are available at https://benhenryl.github.io/Deblurring-3D-Gaussian-Splatting/
CVDec 9, 2024
Generative Densification: Learning to Densify Gaussians for High-Fidelity Generalizable 3D ReconstructionSeungtae Nam, Xiangyu Sun, Gyeongjin Kang et al.
Generalized feed-forward Gaussian models have achieved significant progress in sparse-view 3D reconstruction by leveraging prior knowledge from large multi-view datasets. However, these models often struggle to represent high-frequency details due to the limited number of Gaussians. While the densification strategy used in per-scene 3D Gaussian splatting (3D-GS) optimization can be adapted to the feed-forward models, it may not be ideally suited for generalized scenarios. In this paper, we propose Generative Densification, an efficient and generalizable method to densify Gaussians generated by feed-forward models. Unlike the 3D-GS densification strategy, which iteratively splits and clones raw Gaussian parameters, our method up-samples feature representations from the feed-forward models and generates their corresponding fine Gaussians in a single forward pass, leveraging the embedded prior knowledge for enhanced generalization. Experimental results on both object-level and scene-level reconstruction tasks demonstrate that our method outperforms state-of-the-art approaches with comparable or smaller model sizes, achieving notable improvements in representing fine details.
93.0CVMar 13
VGGT-World: Transforming VGGT into an Autoregressive Geometry World ModelXiangyu Sun, Shijie Wang, Fengyi Zhang et al.
World models that forecast scene evolution by generating future video frames devote the bulk of their capacity to photometric details, yet the resulting predictions often remain geometrically inconsistent. We present VGGT-World, a geometry world model that side-steps video generation entirely and instead forecasts the temporal evolution of frozen geometry-foundation-model (GFM) features. Concretely, we repurpose the latent tokens of a frozen VGGT as the world state and train a lightweight temporal flow transformer to autoregressively predict their future trajectory. Two technical challenges arise in this high-dimensional (d=1024) feature space: (i) standard velocity-prediction flow matching collapses, and (ii) autoregressive rollout suffers from compounding exposure bias. We address the first with a clean-target (z-prediction) parameterization that yields a substantially higher signal-to-noise ratio, and the second with a two-stage latent flow-forcing curriculum that progressively conditions the model on its own partially denoised rollouts. Experiments on KITTI, Cityscapes, and TartanAir demonstrate that VGGT-World significantly outperforms the strongest baselines in depth forecasting while running 3.6-5 times faster with only 0.43B trainable parameters, establishing frozen GFM features as an effective and efficient predictive state for 3D world modeling.
CVJul 31, 2025
iLRM: An Iterative Large 3D Reconstruction ModelGyeongjin Kang, Seungtae Nam, Seungkwon Yang et al.
Feed-forward 3D modeling has emerged as a promising approach for rapid and high-quality 3D reconstruction. In particular, directly generating explicit 3D representations, such as 3D Gaussian splatting, has attracted significant attention due to its fast and high-quality rendering, as well as numerous applications. However, many state-of-the-art methods, primarily based on transformer architectures, suffer from severe scalability issues because they rely on full attention across image tokens from multiple input views, resulting in prohibitive computational costs as the number of views or image resolution increases. Toward a scalable and efficient feed-forward 3D reconstruction, we introduce an iterative Large 3D Reconstruction Model (iLRM) that generates 3D Gaussian representations through an iterative refinement mechanism, guided by three core principles: (1) decoupling the scene representation from input-view images to enable compact 3D representations; (2) decomposing fully-attentional multi-view interactions into a two-stage attention scheme to reduce computational costs; and (3) injecting high-resolution information at every layer to achieve high-fidelity reconstruction. Experimental results on widely used datasets, such as RE10K and DL3DV, demonstrate that iLRM outperforms existing methods in both reconstruction quality and speed.
CVMay 27, 2025
Intern-GS: Vision Model Guided Sparse-View 3D Gaussian SplattingXiangyu Sun, Runnan Chen, Mingming Gong et al.
Sparse-view scene reconstruction often faces significant challenges due to the constraints imposed by limited observational data. These limitations result in incomplete information, leading to suboptimal reconstructions using existing methodologies. To address this, we present Intern-GS, a novel approach that effectively leverages rich prior knowledge from vision foundation models to enhance the process of sparse-view Gaussian Splatting, thereby enabling high-quality scene reconstruction. Specifically, Intern-GS utilizes vision foundation models to guide both the initialization and the optimization process of 3D Gaussian splatting, effectively addressing the limitations of sparse inputs. In the initialization process, our method employs DUSt3R to generate a dense and non-redundant gaussian point cloud. This approach significantly alleviates the limitations encountered by traditional structure-from-motion (SfM) methods, which often struggle under sparse-view constraints. During the optimization process, vision foundation models predict depth and appearance for unobserved views, refining the 3D Gaussians to compensate for missing information in unseen regions. Extensive experiments demonstrate that Intern-GS achieves state-of-the-art rendering quality across diverse datasets, including both forward-facing and large-scale scenes, such as LLFF, DTU, and Tanks and Temples.
LGJul 14, 2025
Compliance Minimization via Physics-Informed Gaussian ProcessesXiangyu Sun, Amin Yousefpour, Shirin Hosseinmardi et al.
Machine learning (ML) techniques have recently gained significant attention for solving compliance minimization (CM) problems. However, these methods typically provide poor feature boundaries, are very expensive, and lack a systematic mechanism to control the design complexity. Herein, we address these limitations by proposing a mesh-free and simultaneous framework based on physics-informed Gaussian processes (GPs). In our approach, we parameterize the design and state variables with GP priors which have independent kernels but share a multi-output neural network (NN) as their mean function. The architecture of this NN is based on Parametric Grid Convolutional Attention Networks (PGCANs) which not only mitigate spectral bias issues, but also provide an interpretable mechanism to control design complexity. We estimate all the parameters of our GP-based representations by simultaneously minimizing the compliance, total potential energy, and residual of volume fraction constraint. Importantly, our loss function exclude all data-based residuals as GPs automatically satisfy them. We also develop computational schemes based on curriculum training and numerical integration to increase the efficiency and robustness of our approach which is shown to (1) produce super-resolution topologies with fast convergence, (2) achieve comparable compliance and less gray area fraction compared to traditional numerical methods, (3) provide control over fine-scale features, and (4) outperform competing ML-based methods.
LGMar 18, 2025
Localized Physics-informed Gaussian Processes with Curriculum Training for Topology OptimizationAmin Yousefpour, Shirin Hosseinmardi, Xiangyu Sun et al.
We introduce a simultaneous and meshfree topology optimization (TO) framework based on physics-informed Gaussian processes (GPs). Our framework endows all design and state variables via GP priors which have a shared, multi-output mean function that is parametrized via a customized deep neural network (DNN). The parameters of this mean function are estimated by minimizing a multi-component loss function that depends on the performance metric, design constraints, and the residuals on the state equations. Our TO approach yields well-defined material interfaces and has a built-in continuation nature that promotes global optimality. Other unique features of our approach include (1) its customized DNN which, unlike fully connected feed-forward DNNs, has a localized learning capacity that enables capturing intricate topologies and reducing residuals in high gradient fields, (2) its loss function that leverages localized weights to promote solution accuracy around interfaces, and (3) its use of curriculum training to avoid local optimality.To demonstrate the power of our framework, we validate it against commercial TO package COMSOL on three problems involving dissipated power minimization in Stokes flow.
CVJan 31, 2025
Fairness Analysis of CLIP-Based Foundation Models for X-Ray Image ClassificationXiangyu Sun, Xiaoguang Zou, Yuanquan Wu et al.
X-ray imaging is pivotal in medical diagnostics, offering non-invasive insights into a range of health conditions. Recently, vision-language models, such as the Contrastive Language-Image Pretraining (CLIP) model, have demonstrated potential in improving diagnostic accuracy by leveraging large-scale image-text datasets. However, since CLIP was not initially designed for medical images, several CLIP-like models trained specifically on medical images have been developed. Despite their enhanced performance, issues of fairness - particularly regarding demographic attributes - remain largely unaddressed. In this study, we perform a comprehensive fairness analysis of CLIP-like models applied to X-ray image classification. We assess their performance and fairness across diverse patient demographics and disease categories using zero-shot inference and various fine-tuning techniques, including Linear Probing, Multilayer Perceptron (MLP), Low-Rank Adaptation (LoRA), and full fine-tuning. Our results indicate that while fine-tuning improves model accuracy, fairness concerns persist, highlighting the need for further fairness interventions in these foundational models.
LGMay 27, 2025
TabReason: A Reinforcement Learning-Enhanced Reasoning LLM for Explainable Tabular Data PredictionTommy Xu, Zhitian Zhang, Xiangyu Sun et al.
Predictive modeling on tabular data is the cornerstone of many real-world applications. Although gradient boosting machines and some recent deep models achieve strong performance on tabular data, they often lack interpretability. On the other hand, large language models (LLMs) have demonstrated powerful capabilities to generate human-like reasoning and explanations, but remain under-performed for tabular data prediction. In this paper, we propose a new approach that leverages reasoning-based LLMs, trained using reinforcement learning, to perform more accurate and explainable predictions on tabular data. Our method introduces custom reward functions that guide the model not only toward better prediction accuracy but also toward human-understandable reasons for its predictions. The proposed method is evaluated on financial benchmark datasets and compared against established LLMs.
LGMay 14, 2025
Should We Simultaneously Calibrate Multiple Computer Models?Jonathan Tammer Eweis-Labolle, Tyler Johnson, Xiangyu Sun et al.
In an increasing number of applications designers have access to multiple computer models which typically have different levels of fidelity and cost. Traditionally, designers calibrate these models one at a time against some high-fidelity data (e.g., experiments). In this paper, we question this tradition and assess the potential of calibrating multiple computer models at the same time. To this end, we develop a probabilistic framework that is founded on customized neural networks (NNs) that are designed to calibrate an arbitrary number of computer models. In our approach, we (1) consider the fact that most computer models are multi-response and that the number and nature of calibration parameters may change across the models, and (2) learn a unique probability distribution for each calibration parameter of each computer model, (3) develop a loss function that enables our NN to emulate all data sources while calibrating the computer models, and (4) aim to learn a visualizable latent space where model-form errors can be identified. We test the performance of our approach on analytic and engineering problems to understand the potential advantages and pitfalls in simultaneous calibration of multiple computer models. Our method can improve predictive accuracy, however, it is prone to non-identifiability issues in higher-dimensional input spaces that are normally constrained by underlying physics.
LGJun 4, 2020
Cracking the Black Box: Distilling Deep Sports AnalyticsXiangyu Sun, Jack Davis, Oliver Schulte et al.
This paper addresses the trade-off between Accuracy and Transparency for deep learning applied to sports analytics. Neural nets achieve great predictive accuracy through deep learning, and are popular in sports analytics. But it is hard to interpret a neural net model and harder still to extract actionable insights from the knowledge implicit in it. Therefore, we built a simple and transparent model that mimics the output of the original deep learning model and represents the learned knowledge in an explicit interpretable way. Our mimic model is a linear model tree, which combines a collection of linear models with a regression-tree structure. The tree version of a neural network achieves high fidelity, explains itself, and produces insights for expert stakeholders such as athletes and coaches. We propose and compare several scalable model tree learning heuristics to address the computational challenge from datasets with millions of data points.