MTRL-SCIDec 21, 2022Code
End-to-end AI framework for interpretable prediction of molecular and crystal propertiesHyun Park, Ruijie Zhu, E. A. Huerta et al.
We introduce an end-to-end computational framework that allows for hyperparameter optimization using the DeepHyper library, accelerated model training, and interpretable AI inference. The framework is based on state-of-the-art AI models including CGCNN, PhysNet, SchNet, MPNN, MPNN-transformer, and TorchMD-NET. We employ these AI models along with the benchmark QM9, hMOF, and MD17 datasets to showcase how the models can predict user-specified material properties within modern computing environments. We demonstrate transferable applications in the modeling of small molecules, inorganic crystals and nanoporous metal organic frameworks with a unified, standalone framework. We have deployed and tested this framework in the ThetaGPU supercomputer at the Argonne Leadership Computing Facility, and in the Delta supercomputer at the National Center for Supercomputing Applications to provide researchers with modern tools to conduct accelerated AI-driven discovery in leadership-class computing environments. We release these digital assets as open source scientific software in GitLab, and ready-to-use Jupyter notebooks in Google Colab.
MTRL-SCIJun 14, 2023
A generative artificial intelligence framework based on a molecular diffusion model for the design of metal-organic frameworks for carbon captureHyun Park, Xiaoli Yan, Ruijie Zhu et al.
Metal-organic frameworks (MOFs) exhibit great promise for CO2 capture. However, finding the best performing materials poses computational and experimental grand challenges in view of the vast chemical space of potential building blocks. Here, we introduce GHP-MOFassemble, a generative artificial intelligence (AI), high performance framework for the rational and accelerated design of MOFs with high CO2 adsorption capacity and synthesizable linkers. GHP-MOFassemble generates novel linkers, assembled with one of three pre-selected metal nodes (Cu paddlewheel, Zn paddlewheel, Zn tetramer) into MOFs in a primitive cubic topology. GHP-MOFassemble screens and validates AI-generated MOFs for uniqueness, synthesizability, structural validity, uses molecular dynamics simulations to study their stability and chemical consistency, and crystal graph neural networks and Grand Canonical Monte Carlo simulations to quantify their CO2 adsorption capacities. We present the top six AI-generated MOFs with CO2 capacities greater than 2 $m mol/g$, i.e., higher than 96.9% of structures in the hypothetical MOF dataset.
LGMar 8, 2022
Multi-Agent Broad Reinforcement Learning for Intelligent Traffic Light ControlRuijie Zhu, Lulu Li, Shuning Wu et al.
Intelligent Traffic Light Control System (ITLCS) is a typical Multi-Agent System (MAS), which comprises multiple roads and traffic lights.Constructing a model of MAS for ITLCS is the basis to alleviate traffic congestion. Existing approaches of MAS are largely based on Multi-Agent Deep Reinforcement Learning (MADRL). Although the Deep Neural Network (DNN) of MABRL is effective, the training time is long, and the parameters are difficult to trace. Recently, Broad Learning Systems (BLS) provided a selective way for learning in the deep neural networks by a flat network. Moreover, Broad Reinforcement Learning (BRL) extends BLS in Single Agent Deep Reinforcement Learning (SADRL) problem with promising results. However, BRL does not focus on the intricate structures and interaction of agents. Motivated by the feature of MADRL and the issue of BRL, we propose a Multi-Agent Broad Reinforcement Learning (MABRL) framework to explore the function of BLS in MAS. Firstly, unlike most existing MADRL approaches, which use a series of deep neural networks structures, we model each agent with broad networks. Then, we introduce a dynamic self-cycling interaction mechanism to confirm the "3W" information: When to interact, Which agents need to consider, What information to transmit. Finally, we do the experiments based on the intelligent traffic light control scenario. We compare the MABRL approach with six different approaches, and experimental results on three datasets verify the effectiveness of MABRL.
CVJul 27, 2023
The RoboDepth Challenge: Methods and Advancements Towards Robust Depth EstimationLingdong Kong, Yaru Niu, Shaoyuan Xie et al.
Accurate depth estimation under out-of-distribution (OoD) scenarios, such as adverse weather conditions, sensor failure, and noise contamination, is desirable for safety-critical applications. Existing depth estimation systems, however, suffer inevitably from real-world corruptions and perturbations and are struggled to provide reliable depth predictions under such cases. In this paper, we summarize the winning solutions from the RoboDepth Challenge -- an academic competition designed to facilitate and advance robust OoD depth estimation. This challenge was developed based on the newly established KITTI-C and NYUDepth2-C benchmarks. We hosted two stand-alone tracks, with an emphasis on robust self-supervised and robust fully-supervised depth estimation, respectively. Out of more than two hundred participants, nine unique and top-performing solutions have appeared, with novel designs ranging from the following aspects: spatial- and frequency-domain augmentations, masked image modeling, image restoration and super-resolution, adversarial training, diffusion-based noise suppression, vision-language pre-training, learned model ensembling, and hierarchical feature enhancement. Extensive experimental analyses along with insightful observations are drawn to better understand the rationale behind each design. We hope this challenge could lay a solid foundation for future research on robust and reliable depth estimation and beyond. The datasets, competition toolkit, workshop recordings, and source code from the winning teams are publicly available on the challenge website.
CLSep 11, 2024
Gated Slot Attention for Efficient Linear-Time Sequence ModelingYu Zhang, Songlin Yang, Ruijie Zhu et al.
Linear attention Transformers and their gated variants, celebrated for enabling parallel training and efficient recurrent inference, still fall short in recall-intensive tasks compared to traditional Transformers and demand significant resources for training from scratch. This paper introduces Gated Slot Attention (GSA), which enhances Attention with Bounded-memory-Control (ABC) by incorporating a gating mechanism inspired by Gated Linear Attention (GLA). Essentially, GSA comprises a two-layer GLA linked via $\operatorname{softmax}$, utilizing context-aware memory reading and adaptive forgetting to improve memory capacity while maintaining compact recurrent state size. This design greatly enhances both training and inference efficiency through GLA's hardware-efficient training algorithm and reduced state size. Additionally, retaining the $\operatorname{softmax}$ operation is particularly beneficial in "finetuning pretrained Transformers to RNNs" (T2R) settings, reducing the need for extensive training from scratch. Extensive experiments confirm GSA's superior performance in scenarios requiring in-context recall and in T2R settings.
LGOct 23, 2023
Tensor Decomposition Based Attention Module for Spiking Neural NetworksHaoyu Deng, Ruijie Zhu, Xuerui Qiu et al.
The attention mechanism has been proven to be an effective way to improve spiking neural network (SNN). However, based on the fact that the current SNN input data flow is split into tensors to process on GPUs, none of the previous works consider the properties of tensors to implement an attention module. This inspires us to rethink current SNN from the perspective of tensor-relevant theories. Using tensor decomposition, we design the \textit{projected full attention} (PFA) module, which demonstrates excellent results with linearly growing parameters. Specifically, PFA is composed by the \textit{linear projection of spike tensor} (LPST) module and \textit{attention map composing} (AMC) module. In LPST, we start by compressing the original spike tensor into three projected tensors using a single property-preserving strategy with learnable parameters for each dimension. Then, in AMC, we exploit the inverse procedure of the tensor decomposition process to combine the three tensors into the attention map using a so-called connecting factor. To validate the effectiveness of the proposed PFA module, we integrate it into the widely used VGG and ResNet architectures for classification tasks. Our method achieves state-of-the-art performance on both static and dynamic benchmark datasets, surpassing the existing SNN models with Transformer-based and CNN-based backbones.
CVApr 7, 2024Code
MagicTime: Time-lapse Video Generation Models as Metamorphic SimulatorsShenghai Yuan, Jinfa Huang, Yujun Shi et al.
Recent advances in Text-to-Video generation (T2V) have achieved remarkable success in synthesizing high-quality general videos from textual descriptions. A largely overlooked problem in T2V is that existing models have not adequately encoded physical knowledge of the real world, thus generated videos tend to have limited motion and poor variations. In this paper, we propose \textbf{MagicTime}, a metamorphic time-lapse video generation model, which learns real-world physics knowledge from time-lapse videos and implements metamorphic generation. First, we design a MagicAdapter scheme to decouple spatial and temporal training, encode more physical knowledge from metamorphic videos, and transform pre-trained T2V models to generate metamorphic videos. Second, we introduce a Dynamic Frames Extraction strategy to adapt to metamorphic time-lapse videos, which have a wider variation range and cover dramatic object metamorphic processes, thus embodying more physical knowledge than general videos. Finally, we introduce a Magic Text-Encoder to improve the understanding of metamorphic video prompts. Furthermore, we create a time-lapse video-text dataset called \textbf{ChronoMagic}, specifically curated to unlock the metamorphic video generation ability. Extensive experiments demonstrate the superiority and effectiveness of MagicTime for generating high-quality and dynamic metamorphic videos, suggesting time-lapse video generation is a promising path toward building metamorphic simulators of the physical world. Code: https://github.com/PKU-YuanGroup/MagicTime
CLMar 20, 2025Code
A Comprehensive Survey on Long Context Language ModelingJiaheng Liu, Dawei Zhu, Zhiqi Bai et al. · pku
Efficient processing of long contexts has been a persistent pursuit in Natural Language Processing. With the growing number of long documents, dialogues, and other textual data, it is important to develop Long Context Language Models (LCLMs) that can process and analyze extensive inputs in an effective and efficient way. In this paper, we present a comprehensive survey on recent advances in long-context modeling for large language models. Our survey is structured around three key aspects: how to obtain effective and efficient LCLMs, how to train and deploy LCLMs efficiently, and how to evaluate and analyze LCLMs comprehensively. For the first aspect, we discuss data strategies, architectural designs, and workflow approaches oriented with long context processing. For the second aspect, we provide a detailed examination of the infrastructure required for LCLM training and inference. For the third aspect, we present evaluation paradigms for long-context comprehension and long-form generation, as well as behavioral analysis and mechanism interpretability of LCLMs. Beyond these three key aspects, we thoroughly explore the diverse application scenarios where existing LCLMs have been deployed and outline promising future development directions. This survey provides an up-to-date review of the literature on long-context LLMs, which we wish to serve as a valuable resource for both researchers and engineers. An associated GitHub repository collecting the latest papers and repos is available at: \href{https://github.com/LCLM-Horizon/A-Comprehensive-Survey-For-Long-Context-Language-Modeling}{\color[RGB]{175,36,67}{LCLM-Horizon}}.
CVFeb 9
MotionCrafter: Dense Geometry and Motion Reconstruction with a 4D VAERuijie Zhu, Jiahao Lu, Wenbo Hu et al.
We introduce MotionCrafter, a video diffusion-based framework that jointly reconstructs 4D geometry and estimates dense motion from a monocular video. The core of our method is a novel joint representation of dense 3D point maps and 3D scene flows in a shared coordinate system, and a novel 4D VAE to effectively learn this representation. Unlike prior work that forces the 3D value and latents to align strictly with RGB VAE latents-despite their fundamentally different distributions-we show that such alignment is unnecessary and leads to suboptimal performance. Instead, we introduce a new data normalization and VAE training strategy that better transfers diffusion priors and greatly improves reconstruction quality. Extensive experiments across multiple datasets demonstrate that MotionCrafter achieves state-of-the-art performance in both geometry reconstruction and dense scene flow estimation, delivering 38.64% and 25.0% improvements in geometry and motion reconstruction, respectively, all without any post-optimization. Project page: https://ruijiezhu94.github.io/MotionCrafter_Page
CVSep 4, 2024
Plane2Depth: Hierarchical Adaptive Plane Guidance for Monocular Depth EstimationLi Liu, Ruijie Zhu, Jiacheng Deng et al.
Monocular depth estimation aims to infer a dense depth map from a single image, which is a fundamental and prevalent task in computer vision. Many previous works have shown impressive depth estimation results through carefully designed network structures, but they usually ignore the planar information and therefore perform poorly in low-texture areas of indoor scenes. In this paper, we propose Plane2Depth, which adaptively utilizes plane information to improve depth prediction within a hierarchical framework. Specifically, in the proposed plane guided depth generator (PGDG), we design a set of plane queries as prototypes to softly model planes in the scene and predict per-pixel plane coefficients. Then the predicted plane coefficients can be converted into metric depth values with the pinhole camera model. In the proposed adaptive plane query aggregation (APGA) module, we introduce a novel feature interaction approach to improve the aggregation of multi-scale plane features in a top-down manner. Extensive experiments show that our method can achieve outstanding performance, especially in low-texture or repetitive areas. Furthermore, under the same backbone network, our method outperforms the state-of-the-art methods on the NYU-Depth-v2 dataset, achieves competitive results with state-of-the-art methods KITTI dataset and can be generalized to unseen scenes effectively.
CVJul 11, 2024
ScaleDepth: Decomposing Metric Depth Estimation into Scale Prediction and Relative Depth EstimationRuijie Zhu, Chuxin Wang, Ziyang Song et al.
Estimating depth from a single image is a challenging visual task. Compared to relative depth estimation, metric depth estimation attracts more attention due to its practical physical significance and critical applications in real-life scenarios. However, existing metric depth estimation methods are typically trained on specific datasets with similar scenes, facing challenges in generalizing across scenes with significant scale variations. To address this challenge, we propose a novel monocular depth estimation method called ScaleDepth. Our method decomposes metric depth into scene scale and relative depth, and predicts them through a semantic-aware scale prediction (SASP) module and an adaptive relative depth estimation (ARDE) module, respectively. The proposed ScaleDepth enjoys several merits. First, the SASP module can implicitly combine structural and semantic features of the images to predict precise scene scales. Second, the ARDE module can adaptively estimate the relative depth distribution of each image within a normalized depth space. Third, our method achieves metric depth estimation for both indoor and outdoor scenes in a unified framework, without the need for setting the depth range or fine-tuning model. Extensive experiments demonstrate that our method attains state-of-the-art performance across indoor, outdoor, unconstrained, and unseen scenes. Project page: https://ruijiezhu94.github.io/ScaleDepth
CVOct 12, 2023
EC-Depth: Exploring the consistency of self-supervised monocular depth estimation in challenging scenesZiyang Song, Ruijie Zhu, Chuxin Wang et al.
Self-supervised monocular depth estimation holds significant importance in the fields of autonomous driving and robotics. However, existing methods are typically trained and tested on standard datasets, overlooking the impact of various adverse conditions prevalent in real-world applications, such as rainy days. As a result, it is commonly observed that these methods struggle to handle these challenging scenarios. To address this issue, we present EC-Depth, a novel self-supervised two-stage framework to achieve a robust depth estimation. In the first stage, we propose depth consistency regularization to propagate reliable supervision from standard to challenging scenes. In the second stage, we adopt the Mean Teacher paradigm and propose a novel consistency-based pseudo-label filtering strategy to improve the quality of pseudo-labels, further improving both the accuracy and robustness of our model. Extensive experiments demonstrate that our method achieves accurate and consistent depth predictions in both standard and challenging scenarios, surpassing existing state-of-the-art methods on KITTI, KITTI-C, DrivingStereo, and NuScenes-Night benchmarks.
CVMar 3
Track4World: Feedforward World-centric Dense 3D Tracking of All PixelsJiahao Lu, Jiayi Xu, Wenbo Hu et al.
Estimating the 3D trajectory of every pixel from a monocular video is crucial and promising for a comprehensive understanding of the 3D dynamics of videos. Recent monocular 3D tracking works demonstrate impressive performance, but are limited to either tracking sparse points on the first frame or a slow optimization-based framework for dense tracking. In this paper, we propose a feedforward model, called Track4World, enabling an efficient holistic 3D tracking of every pixel in the world-centric coordinate system. Built on the global 3D scene representation encoded by a VGGT-style ViT, Track4World applies a novel 3D correlation scheme to simultaneously estimate the pixel-wise 2D and 3D dense flow between arbitrary frame pairs. The estimated scene flow, along with the reconstructed 3D geometry, enables subsequent efficient 3D tracking of every pixel of this video. Extensive experiments on multiple benchmarks demonstrate that our approach consistently outperforms existing methods in 2D/3D flow estimation and 3D tracking, highlighting its robustness and scalability for real-world 4D reconstruction tasks.
CVDec 15, 2023Code
TIFace: Improving Facial Reconstruction through Tensorial Radiance Fields and Implicit SurfacesRuijie Zhu, Jiahao Chang, Ziyang Song et al.
This report describes the solution that secured the first place in the "View Synthesis Challenge for Human Heads (VSCHH)" at the ICCV 2023 workshop. Given the sparse view images of human heads, the objective of this challenge is to synthesize images from novel viewpoints. Due to the complexity of textures on the face and the impact of lighting, the baseline method TensoRF yields results with significant artifacts, seriously affecting facial reconstruction. To address this issue, we propose TI-Face, which improves facial reconstruction through tensorial radiance fields (T-Face) and implicit surfaces (I-Face), respectively. Specifically, we employ an SAM-based approach to obtain the foreground mask, thereby filtering out intense lighting in the background. Additionally, we design mask-based constraints and sparsity constraints to eliminate rendering artifacts effectively. The experimental results demonstrate the effectiveness of the proposed improvements and superior performance of our method on face reconstruction. The code will be available at https://github.com/RuijieZhu94/TI-Face.
CVFeb 4
SkeletonGaussian: Editable 4D Generation through Gaussian SkeletonizationLifan Wu, Ruijie Zhu, Yubo Ai et al.
4D generation has made remarkable progress in synthesizing dynamic 3D objects from input text, images, or videos. However, existing methods often represent motion as an implicit deformation field, which limits direct control and editability. To address this issue, we propose SkeletonGaussian, a novel framework for generating editable dynamic 3D Gaussians from monocular video input. Our approach introduces a hierarchical articulated representation that decomposes motion into sparse rigid motion explicitly driven by a skeleton and fine-grained non-rigid motion. Concretely, we extract a robust skeleton and drive rigid motion via linear blend skinning, followed by a hexplane-based refinement for non-rigid deformations, enhancing interpretability and editability. Experimental results demonstrate that SkeletonGaussian surpasses existing methods in generation quality while enabling intuitive motion editing, establishing a new paradigm for editable 4D generation. Project page: https://wusar.github.io/projects/skeletongaussian/
LGJul 22, 2025Code
Scaling Linear Attention with Sparse State ExpansionYuqi Pan, Yongqi An, Zheng Li et al.
The Transformer architecture, despite its widespread success, struggles with long-context scenarios due to quadratic computation and linear memory growth. While various linear attention variants mitigate these efficiency constraints by compressing context into fixed-size states, they often degrade performance in tasks such as in-context retrieval and reasoning. To address this limitation and achieve more effective context compression, we propose two key innovations. First, we introduce a row-sparse update formulation for linear attention by conceptualizing state updating as information classification. This enables sparse state updates via softmax-based top-$k$ hard classification, thereby extending receptive fields and reducing inter-class interference. Second, we present Sparse State Expansion (SSE) within the sparse framework, which expands the contextual state into multiple partitions, effectively decoupling parameter size from state capacity while maintaining the sparse classification paradigm. Supported by efficient parallelized implementations, our design achieves effective classification and highly discriminative state representations. We extensively validate SSE in both pure linear and hybrid (SSE-H) architectures across language modeling, in-context retrieval, and mathematical reasoning benchmarks. SSE demonstrates strong retrieval performance and scales favorably with state size. Moreover, after reinforcement learning (RL) training, our 2B SSE-H model achieves state-of-the-art mathematical reasoning performance among small reasoning models, scoring 64.5 on AIME24 and 50.2 on AIME25, significantly outperforming similarly sized open-source Transformers. These results highlight SSE as a promising and efficient architecture for long-context modeling.
CVApr 2
ReFlow: Self-correction Motion Learning for Dynamic Scene ReconstructionYanzhe Liang, Ruijie Zhu, Hanzhi Chang et al.
We present ReFlow, a unified framework for monocular dynamic scene reconstruction that learns 3D motion in a novel self-correction manner from raw video. Existing methods often suffer from incomplete scene initialization for dynamic regions, leading to unstable reconstruction and motion estimation, which often resorts to external dense motion guidance such as pre-computed optical flow to further stabilize and constrain the reconstruction of dynamic components. However, this introduces additional complexity and potential error propagation. To address these issues, ReFlow integrates a Complete Canonical Space Construction module for enhanced initialization of both static and dynamic regions, and a Separation-Based Dynamic Scene Modeling module that decouples static and dynamic components for targeted motion supervision. The core of ReFlow is a novel self-correction flow matching mechanism, consisting of Full Flow Matching to align 3D scene flow with time-varying 2D observations, and Camera Flow Matching to enforce multi-view consistency for static objects. Together, these modules enable robust and accurate dynamic scene reconstruction. Extensive experiments across diverse scenarios demonstrate that ReFlow achieves superior reconstruction quality and robustness, establishing a novel self-correction paradigm for monocular 4D reconstruction.
CVOct 17, 2024
DN-4DGS: Denoised Deformable Network with Temporal-Spatial Aggregation for Dynamic Scene RenderingJiahao Lu, Jiacheng Deng, Ruijie Zhu et al.
Dynamic scenes rendering is an intriguing yet challenging problem. Although current methods based on NeRF have achieved satisfactory performance, they still can not reach real-time levels. Recently, 3D Gaussian Splatting (3DGS) has garnered researchers attention due to their outstanding rendering quality and real-time speed. Therefore, a new paradigm has been proposed: defining a canonical 3D gaussians and deforming it to individual frames in deformable fields. However, since the coordinates of canonical 3D gaussians are filled with noise, which can transfer noise into the deformable fields, and there is currently no method that adequately considers the aggregation of 4D information. Therefore, we propose Denoised Deformable Network with Temporal-Spatial Aggregation for Dynamic Scene Rendering (DN-4DGS). Specifically, a Noise Suppression Strategy is introduced to change the distribution of the coordinates of the canonical 3D gaussians and suppress noise. Additionally, a Decoupled Temporal-Spatial Aggregation Module is designed to aggregate information from adjacent points and frames. Extensive experiments on various real-world datasets demonstrate that our method achieves state-of-the-art rendering quality under a real-time level.
LGNov 16, 2024
MetaLA: Unified Optimal Linear Approximation to Softmax Attention MapYuhong Chou, Man Yao, Kexin Wang et al.
Various linear complexity models, such as Linear Transformer (LinFormer), State Space Model (SSM), and Linear RNN (LinRNN), have been proposed to replace the conventional softmax attention in Transformer structures. However, the optimal design of these linear models is still an open question. In this work, we attempt to answer this question by finding the best linear approximation to softmax attention from a theoretical perspective. We start by unifying existing linear complexity models as the linear attention form and then identify three conditions for the optimal linear attention design: 1) Dynamic memory ability; 2) Static approximation ability; 3) Least parameter approximation. We find that none of the current linear models meet all three conditions, resulting in suboptimal performance. Instead, we propose Meta Linear Attention (MetaLA) as a solution that satisfies these conditions. Our experiments on Multi-Query Associative Recall (MQAR) task, language modeling, image classification, and Long-Range Arena (LRA) benchmark demonstrate that MetaLA is more effective than the existing linear models.
CVJan 5, 2025
DepthMaster: Taming Diffusion Models for Monocular Depth EstimationZiyang Song, Zerong Wang, Bo Li et al.
Monocular depth estimation within the diffusion-denoising paradigm demonstrates impressive generalization ability but suffers from low inference speed. Recent methods adopt a single-step deterministic paradigm to improve inference efficiency while maintaining comparable performance. However, they overlook the gap between generative and discriminative features, leading to suboptimal results. In this work, we propose DepthMaster, a single-step diffusion model designed to adapt generative features for the discriminative depth estimation task. First, to mitigate overfitting to texture details introduced by generative features, we propose a Feature Alignment module, which incorporates high-quality semantic features to enhance the denoising network's representation capability. Second, to address the lack of fine-grained details in the single-step deterministic framework, we propose a Fourier Enhancement module to adaptively balance low-frequency structure and high-frequency details. We adopt a two-stage training strategy to fully leverage the potential of the two modules. In the first stage, we focus on learning the global scene structure with the Feature Alignment module, while in the second stage, we exploit the Fourier Enhancement module to improve the visual quality. Through these efforts, our model achieves state-of-the-art performance in terms of generalization and detail preservation, outperforming other diffusion-based methods across various datasets. Our project page can be found at https://indu1ge.github.io/DepthMaster_page.
LGNov 20, 2024
Reflections from the 2024 Large Language Model (LLM) Hackathon for Applications in Materials Science and ChemistryYoel Zimmermann, Adib Bazgir, Zartashia Afzal et al.
Here, we present the outcomes from the second Large Language Model (LLM) Hackathon for Applications in Materials Science and Chemistry, which engaged participants across global hybrid locations, resulting in 34 team submissions. The submissions spanned seven key application areas and demonstrated the diverse utility of LLMs for applications in (1) molecular and material property prediction; (2) molecular and material design; (3) automation and novel interfaces; (4) scientific communication and education; (5) research data management and automation; (6) hypothesis generation and evaluation; and (7) knowledge extraction and reasoning from scientific literature. Each team submission is presented in a summary table with links to the code and as brief papers in the appendix. Beyond team results, we discuss the hackathon event and its hybrid format, which included physical hubs in Toronto, Montreal, San Francisco, Berlin, Lausanne, and Tokyo, alongside a global online hub to enable local and virtual collaboration. Overall, the event highlighted significant improvements in LLM capabilities since the previous year's hackathon, suggesting continued expansion of LLMs for applications in materials science and chemistry research. These outcomes demonstrate the dual utility of LLMs as both multipurpose models for diverse machine learning tasks and platforms for rapid prototyping custom applications in scientific research.
CLFeb 6, 2025
Reformulation for Pretraining Data AugmentationXintong Hao, Ruijie Zhu, Ge Zhang et al.
Despite the impressive capabilities of large language models across various tasks, their continued scaling is severely hampered not only by data scarcity but also by the performance degradation associated with excessive data repetition during training. To overcome this critical bottleneck, we propose the Massive Genre-Audience(MGA) reformulation method, a lightweight and scalable data augmentation technique inspired by synthetic data methodologies. MGA systematically reformulates existing corpora into diverse, contextually-rich variations to mitigate the negative effects of repetition, and we introduce this approach along with the resulting 770 billion token MGACorpus in this work. We experimentally validate its core benefit by demonstrating superior performance against data repetition and upsampling in scaling scenarios (up to 13B parameters). Furthermore, comprehensive analysis investigates the role of prompt engineering in generation quality and reveals nuances in evaluating model capabilities using standard loss metrics. Our work shows that MGA provides a reliable pathway to substantially augment training datasets, effectively alleviating repetition bottlenecks and enabling more efficient scaling of large language models.
GRJul 21, 2025
ObjectGS: Object-aware Scene Reconstruction and Scene Understanding via Gaussian SplattingRuijie Zhu, Mulin Yu, Linning Xu et al.
3D Gaussian Splatting is renowned for its high-fidelity reconstructions and real-time novel view synthesis, yet its lack of semantic understanding limits object-level perception. In this work, we propose ObjectGS, an object-aware framework that unifies 3D scene reconstruction with semantic understanding. Instead of treating the scene as a unified whole, ObjectGS models individual objects as local anchors that generate neural Gaussians and share object IDs, enabling precise object-level reconstruction. During training, we dynamically grow or prune these anchors and optimize their features, while a one-hot ID encoding with a classification loss enforces clear semantic constraints. We show through extensive experiments that ObjectGS not only outperforms state-of-the-art methods on open-vocabulary and panoptic segmentation tasks, but also integrates seamlessly with applications like mesh extraction and scene editing. Project page: https://ruijiezhu94.github.io/ObjectGS_page
LGJul 1, 2025
ZeCO: Zero Communication Overhead Sequence Parallelism for Linear AttentionYuhong Chou, Zehao Liu, Ruijie Zhu et al.
Linear attention mechanisms deliver significant advantages for Large Language Models (LLMs) by providing linear computational complexity, enabling efficient processing of ultra-long sequences (e.g., 1M context). However, existing Sequence Parallelism (SP) methods, essential for distributing these workloads across devices, become the primary bottleneck due to substantial communication overhead. In this paper, we introduce ZeCO (Zero Communication Overhead) sequence parallelism for linear attention models, a new SP method designed to overcome these limitations and achieve end-to-end near-linear scalability for long sequence training. For example, training a model with a 1M sequence length across 64 devices using ZeCO takes roughly the same time as training with an 16k sequence on a single device. At the heart of ZeCO lies All-Scan, a new collective communication primitive. All-Scan provides each SP rank with precisely the initial operator state it requires while maintaining a minimal communication footprint, effectively eliminating communication overhead. Theoretically, we prove the optimaity of ZeCO, showing that it introduces only negligible time and space overhead. Empirically, we compare the communication costs of different sequence parallelism strategies and demonstrate that All-Scan achieves the fastest communication in SP scenarios. Specifically, on 256 GPUs with an 8M sequence length, ZeCO achieves a 60\% speedup compared to the current state-of-the-art (SOTA) SP method. We believe ZeCO establishes a clear path toward efficiently training next-generation LLMs on previously intractable sequence lengths.
GRAug 25, 2025
MeshSplat: Generalizable Sparse-View Surface Reconstruction via Gaussian SplattingHanzhi Chang, Ruijie Zhu, Wenjie Chang et al.
Surface reconstruction has been widely studied in computer vision and graphics. However, existing surface reconstruction works struggle to recover accurate scene geometry when the input views are extremely sparse. To address this issue, we propose MeshSplat, a generalizable sparse-view surface reconstruction framework via Gaussian Splatting. Our key idea is to leverage 2DGS as a bridge, which connects novel view synthesis to learned geometric priors and then transfers these priors to achieve surface reconstruction. Specifically, we incorporate a feed-forward network to predict per-view pixel-aligned 2DGS, which enables the network to synthesize novel view images and thus eliminates the need for direct 3D ground-truth supervision. To improve the accuracy of 2DGS position and orientation prediction, we propose a Weighted Chamfer Distance Loss to regularize the depth maps, especially in overlapping areas of input views, and also a normal prediction network to align the orientation of 2DGS with normal vectors predicted by a monocular normal estimator. Extensive experiments validate the effectiveness of our proposed improvement, demonstrating that our method achieves state-of-the-art performance in generalizable sparse-view mesh reconstruction tasks. Project Page: https://hanzhichang.github.io/meshsplat_web
CVJun 26, 2024
ChronoMagic-Bench: A Benchmark for Metamorphic Evaluation of Text-to-Time-lapse Video GenerationShenghai Yuan, Jinfa Huang, Yongqi Xu et al.
We propose a novel text-to-video (T2V) generation benchmark, ChronoMagic-Bench, to evaluate the temporal and metamorphic capabilities of the T2V models (e.g. Sora and Lumiere) in time-lapse video generation. In contrast to existing benchmarks that focus on visual quality and textual relevance of generated videos, ChronoMagic-Bench focuses on the model's ability to generate time-lapse videos with significant metamorphic amplitude and temporal coherence. The benchmark probes T2V models for their physics, biology, and chemistry capabilities, in a free-form text query. For these purposes, ChronoMagic-Bench introduces 1,649 prompts and real-world videos as references, categorized into four major types of time-lapse videos: biological, human-created, meteorological, and physical phenomena, which are further divided into 75 subcategories. This categorization comprehensively evaluates the model's capacity to handle diverse and complex transformations. To accurately align human preference with the benchmark, we introduce two new automatic metrics, MTScore and CHScore, to evaluate the videos' metamorphic attributes and temporal coherence. MTScore measures the metamorphic amplitude, reflecting the degree of change over time, while CHScore assesses the temporal coherence, ensuring the generated videos maintain logical progression and continuity. Based on ChronoMagic-Bench, we conduct comprehensive manual evaluations of ten representative T2V models, revealing their strengths and weaknesses across different categories of prompts, and providing a thorough evaluation framework that addresses current gaps in video generation research. Moreover, we create a large-scale ChronoMagic-Pro dataset, containing 460k high-quality pairs of 720p time-lapse videos and detailed captions ensuring high physical pertinence and large metamorphic amplitude. [Homepage](https://pku-yuangroup.github.io/ChronoMagic-Bench/).