IV · Mar 3, 2022 · Code
E-CIR: Event-Enhanced Continuous Intensity Recovery
Chen Song, Qixing Huang, Chandrajit Bajaj
A camera begins to sense light the moment we press the shutter button. During the exposure interval, relative motion between the scene and the camera causes motion blur, a common undesirable visual artifact. This paper presents E-CIR, which converts a blurry image into a sharp video represented as a parametric function from time to intensity. E-CIR leverages events as an auxiliary input. We discuss how to exploit the temporal event structure to construct the parametric bases. We demonstrate how to train a deep learning model to predict the function coefficients. To improve the appearance consistency, we further introduce a refinement module to propagate visual features among consecutive frames. Compared to state-of-the-art event-enhanced deblurring approaches, E-CIR generates smoother and more realistic results. The implementation of E-CIR is available at https://github.com/chensong1995/E-CIR.
LG · Aug 19, 2024 · Code
Machine Learning with Physics Knowledge for Prediction: A Survey
Joe Watson, Chen Song, Oliver Weeger et al. · cambridge
This survey examines the broad suite of methods and models for combining machine learning with physics knowledge for prediction and forecast, with a focus on partial differential equations. These methods have attracted significant interest due to their potential impact on advancing scientific research and industrial practices by improving predictive models with small- or large-scale datasets and expressive predictive models with useful inductive biases. The survey has two parts. The first considers incorporating physics knowledge on an architectural level through objective functions, structured predictive models, and data augmentation. The second considers data as physics knowledge, which motivates looking at multi-task, meta, and contextual learning as an alternative approach to incorporating physics knowledge in a data-driven fashion. Finally, we also provide an industrial perspective on the application of these methods and a survey of the open-source ecosystem for physics-informed machine learning.
CV · Mar 15, 2023 · Code
DeblurSR: Event-Based Motion Deblurring Under the Spiking Representation
Chen Song, Chandrajit Bajaj, Qixing Huang
We present DeblurSR, a novel motion deblurring approach that converts a blurry image into a sharp video. DeblurSR utilizes event data to compensate for motion ambiguities and exploits the spiking representation to parameterize the sharp output video as a mapping from time to intensity. Our key contribution, the Spiking Representation (SR), is inspired by the neuromorphic principles determining how biological neurons communicate with each other in living organisms. We discuss why the spikes can represent sharp edges and how the spiking parameters are interpreted from the neuromorphic perspective. DeblurSR has higher output quality and requires fewer computing resources than state-of-the-art event-based motion deblurring methods. We additionally show that our approach easily extends to video super-resolution when combined with recent advances in implicit neural representation. The implementation and animated visualization of DeblurSR are available at https://github.com/chensong1995/DeblurSR.
CV · Nov 7, 2023 · Code
Detecting Any Human-Object Interaction Relationship: Universal HOI Detector with Spatial Prompt Learning on Foundation Models
Yichao Cao, Qingfei Tang, Xiu Su et al.
Human-object interaction (HOI) detection aims to comprehend the intricate relationships between humans and objects, predicting <human, action, object> triplets, and serves as the foundation for numerous computer vision tasks. The complexity and diversity of human-object interactions in the real world, however, pose significant challenges for both annotation and recognition, particularly in recognizing interactions within an open-world context. This study explores universal interaction recognition in an open-world setting through the use of Vision-Language (VL) foundation models and large language models (LLMs). The proposed method is dubbed UniHOI. We conduct a deep analysis of the three hierarchical features inherent in visual HOI detectors and propose a method for high-level relation extraction aimed at VL foundation models, which we call HO prompt-based learning. Our design includes an HO Prompt-guided Decoder (HOPD), which facilitates the association of high-level relation representations in the foundation model with various HO pairs within the image. Furthermore, we utilize an LLM (i.e., GPT) for interaction interpretation, generating a richer linguistic understanding of complex HOIs. For open-category interaction recognition, our method supports either of two input types: interaction phrase or interpretive sentence. Our efficient architecture design and learning methods effectively unleash the potential of the VL foundation models and LLMs, allowing UniHOI to surpass all existing methods by a substantial margin under both supervised and zero-shot settings. The code and pre-trained weights are available at https://github.com/Caoyichao/UniHOI.
LG · Mar 10, 2023
Deep Anomaly Detection on Tennessee Eastman Process Data
Fabian Hartung, Billy Joe Franks, Tobias Michels et al.
This paper provides the first comprehensive evaluation and analysis of modern (deep-learning) unsupervised anomaly detection methods for chemical process data. We focus on the Tennessee Eastman process dataset, which has been a standard litmus test to benchmark anomaly detection methods for nearly three decades. Our extensive study will facilitate choosing appropriate anomaly detection methods in industrial applications.
CV · Jun 5, 2023
Multi-View Representation is What You Need for Point-Cloud Pre-Training
Siming Yan, Chen Song, Youkang Kong et al.
A promising direction for pre-training 3D point clouds is to leverage the massive amount of data in 2D, whereas the domain gap between 2D and 3D creates a fundamental challenge. This paper proposes a novel approach to point-cloud pre-training that learns 3D representations by leveraging pre-trained 2D networks. Different from the popular practice of predicting 2D features first and then obtaining 3D features through dimensionality lifting, our approach directly uses a 3D network for feature extraction. We train the 3D feature extraction network with the help of the novel 2D knowledge transfer loss, which enforces the 2D projections of the 3D feature to be consistent with the output of pre-trained 2D networks. To prevent the feature from discarding 3D signals, we introduce the multi-view consistency loss that additionally encourages the projected 2D feature representations to capture pixel-wise correspondences across different views. Such correspondences induce 3D geometry and effectively retain 3D features in the projected 2D features. Experimental results demonstrate that our pre-trained model can be successfully transferred to various downstream tasks, including 3D shape classification, part segmentation, 3D object detection, and semantic segmentation, achieving state-of-the-art performance.
CV · Apr 4, 2023
LiDAR-Based 3D Object Detection via Hybrid 2D Semantic Scene Generation
Haitao Yang, Zaiwei Zhang, Xiangru Huang et al.
Bird's-Eye View (BEV) features are popular intermediate scene representations shared by the 3D backbone and the detector head in LiDAR-based object detectors. However, little research has been done to investigate how to incorporate additional supervision on the BEV features to improve proposal generation in the detector head, while still balancing the number of powerful 3D layers and efficient 2D network operations. This paper proposes a novel scene representation that encodes both the semantics and geometry of the 3D environment in 2D, which serves as a dense supervision signal for better BEV feature learning. The key idea is to use auxiliary networks to predict a combination of explicit and implicit semantic probabilities by exploiting their complementary properties. Extensive experiments show that our simple yet effective design can be easily integrated into most state-of-the-art 3D object detectors and consistently improves upon baseline models.
CV · Sep 29, 2024 · Code
PPLNs: Parametric Piecewise Linear Networks for Event-Based Temporal Modeling and Beyond
Chen Song, Zhenxiao Liang, Bo Sun et al.
We present Parametric Piecewise Linear Networks (PPLNs) for temporal vision inference. Motivated by the neuromorphic principles that regulate biological neural behaviors, PPLNs are ideal for processing data captured by event cameras, which are built to simulate neural activities in the human retina. We discuss how to represent the membrane potential of an artificial neuron by a parametric piecewise linear function with learnable coefficients. This design echoes the idea of building deep models from learnable parametric functions recently popularized by Kolmogorov-Arnold Networks (KANs). Experiments demonstrate the state-of-the-art performance of PPLNs in event-based and image-based vision applications, including steering prediction, human pose estimation, and motion deblurring. The source code of our implementation is available at https://github.com/chensong1995/PPLN.
CV · May 21
Scene Reconstruction as Mapping Priors for 3D Detection
Yang Fu, Yuliang Zou, Hao Xiang et al.
In autonomous driving, mapping is critical for motion planning but remains an under-utilized resource for perception tasks such as 3D object detection. Maps can provide robust structural priors of the static environment, helping resolve ambiguities and correct for sensor data sparsity or noise, especially for distant objects or under adverse weather conditions. However, conventional High-Definition (HD) maps are resource-intensive to obtain and maintain, which presents a challenge for efficient, large-scale deployment. In this paper, we propose a scalable solution to systematically leverage mapping to improve 3D detection by overcoming two primary challenges. First, we introduce a pipeline to automatically build dense mapping priors from aggregated sensor data, eliminating the need for human labeling. Second, we design a novel Mapping Priors Augmented 3D Detection (MPA3D) framework to effectively integrate mapping priors with different sensor modalities. Extensive experiments on the Waymo Open Dataset demonstrate that our approach achieves new state-of-the-art results, proving the effectiveness of scalable reconstructed scene priors for enhancing 3D detection.
CV · May 19
STELLAR: Scaling 3D Perception Large Models for Autonomous Driving
Yingwei Li, Xin Huang, Yang Liu et al.
Model scaling has demonstrated remarkable success through large-scale training on diverse datasets. It remains an open question whether the same paradigm applies to autonomous driving perception systems, given unique challenges such as fusing heterogeneous sensor data and the need for sophisticated 3D spatial understanding. To bridge this gap, we present a comprehensive study that systematically analyzes the impact of scale on these systems. We develop our STELLAR model based on the Sparse Window Transformer, extending the input modalities to include LiDAR, radar, camera, and map prior. We train the model on a large-scale dataset of 50 million driving examples with up to 500 million parameters. Our large-scale experiments reveal empirical scaling trends that connect model performance to model size, data, and compute. The resulting model establishes a new state of the art on the Waymo Open Dataset challenge, outperforming prior art by a large margin. Our work demonstrates that large-scale training is a highly promising path for advancing the capabilities of perception models for autonomous driving.
CV · Jan 7, 2020 · Code
HybridPose: 6D Object Pose Estimation under Hybrid Representations
Chen Song, Jiaru Song, Qixing Huang
We introduce HybridPose, a novel 6D object pose estimation approach. HybridPose utilizes a hybrid intermediate representation to express different geometric information in the input image, including keypoints, edge vectors, and symmetry correspondences. Compared to a unitary representation, our hybrid representation allows pose regression to exploit more, and more diverse, features when one type of predicted representation is inaccurate (e.g., because of occlusion). The different intermediate representations used by HybridPose can all be predicted by the same simple neural network, and outliers in the predicted intermediate representations are filtered by a robust regression module. Compared to state-of-the-art pose estimation approaches, HybridPose is comparable in running time and accuracy. For example, on the Occlusion Linemod dataset, our method achieves a prediction speed of 30 fps with a mean ADD(-S) accuracy of 47.5%, representing state-of-the-art performance. The implementation of HybridPose is available at https://github.com/chensong1995/HybridPose.
CV · Dec 16, 2025
HyperVL: An Efficient and Dynamic Multimodal Large Language Model for Edge Devices
HyperAI Team, Yuchen Liu, Kaiyang Han et al.
Current multimodal large language models possess strong perceptual and reasoning capabilities; however, their high computational and memory requirements make them difficult to deploy directly in on-device environments. While small-parameter models are progressively endowed with strong general capabilities, standard Vision Transformer (ViT) encoders remain a critical bottleneck, suffering from excessive latency and memory consumption when processing high-resolution inputs. To address these challenges, we introduce HyperVL, an efficient multimodal large language model tailored for on-device inference. HyperVL adopts an image-tiling strategy to cap peak memory usage and incorporates two novel techniques: (1) a Visual Resolution Compressor (VRC) that adaptively predicts optimal encoding resolutions to eliminate redundant computation, and (2) Dual Consistency Learning (DCL), which aligns multi-scale ViT encoders within a unified framework, enabling dynamic switching between visual branches under a shared LLM. Extensive experiments demonstrate that HyperVL achieves state-of-the-art performance among models of comparable size across multiple benchmarks. Furthermore, it significantly reduces latency and power consumption on real mobile devices, demonstrating its practicality for on-device multimodal inference.
SI · Apr 29
Impact of Attitude and Bounded Rationality on Collective Behavioral Transitions
Chen Song, Vladimir Cvetkovic, Angela Fontan et al.
The theory of planned behavior (TPB) is one of the most influential frameworks in social psychology, stating that a person's behavior is driven by intention, which is primarily shaped by attitude, subjective norms, and perceived behavioral control. Despite its strong empirical support, TPB remains a static conceptual framework without explicit mathematical formulations that capture the temporal evolution of its components. To address this gap, we develop a dynamic agent-based modeling framework that integrates the core principles of TPB with a behavior-to-attitude feedback mechanism. Specifically, we define behaviors based on their feedback effects on attitude and examine when the population undergoes collective transitions by either adopting a beneficial behavior or rejecting a harmful one. Results from our model demonstrate that collective transitions can be effectively controlled by adjusting two key behavioral parameters that reflect agents' attitude influence and decision rationality. These findings provide quantitative insights on TPB, highlighting the key factors that drive collective behavioral transitions and the need for further socio-psychological case studies.
FLU-DYN · Apr 22
RG-Based Local Hopf Reduction and Slow-Manifold Reconstruction for Nonlinear Aeroelastic Systems
Gelin Chen, Chen Song, Chao Yang
Self-excited limit-cycle oscillations (LCOs) from Hopf bifurcations are a key feature of nonlinear aeroelasticity and depend sensitively on structural and aerodynamic parameters. Classical center-manifold and normal-form theory describe this local behavior, but can be cumbersome to apply in large discretized models and standard reduced-order modeling (ROM) workflows. A renormalization-group (RG)-based reduction is developed that directly yields a Hopf-type amplitude equation on a local invariant manifold, specialized for polynomial nonlinearities in tensor-based discretizations and compatible with finite-element-type settings. The method provides explicit coefficients governing the Hopf threshold, criticality, and leading LCO amplitude/frequency trends, and admits a companion slow-manifold approximation with selected stable modes retained as static coordinates. Representative nonlinear-aeroelastic examples illustrate how the proposed framework supplies compact, parameter-aware Hopf/LCO descriptors suitable for local ROM construction near flutter.
SY · Apr 7
On the Convergence of an Opinion-Action Coevolution Model with Bounded Confidence
Chen Song, Angela Fontan, Rong Su et al.
This paper presents a theoretical convergence analysis for an opinion-action coevolution model that integrates the opinion updating rule of the Hegselmann-Krause model with a utility-based decision-making mechanism. The model is reformulated into an augmented state-space representation, where the state matrix induces a time-varying social interaction digraph. The convergence analysis is grounded on two existing theoretical findings that establish convergence for the Hegselmann-Krause type of models and containment control systems with multiple stationary leaders, respectively. Results indicate that, if the structure of the interaction digraph stabilizes within finite time, the model either converges to consensus, where all agents' opinions and actions reach an identical state, or exhibits clustering, where some opinion nodes act as stationary leaders while the remaining nodes approach the convex hull formed by the leaders. Numerical simulations are then provided to validate the theoretical results.
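For readers unfamiliar with the opinion rule referenced above, the following is a minimal sketch of the classical Hegselmann-Krause update only: each agent moves to the average opinion of all agents within its confidence bound. It does not implement the paper's opinion-action coevolution model or its utility-based decision mechanism. The function names `hk_step` and `hk_run`, the parameter `eps`, and the stopping tolerance are illustrative assumptions, not taken from the paper.

```python
import numpy as np


def hk_step(opinions, eps):
    """One synchronous step of the classical Hegselmann-Krause model.

    opinions: 1-D array of scalar opinions, one per agent.
    eps: confidence bound (hypothetical name); agent j influences agent i
         only if |x_i - x_j| <= eps.
    """
    x = np.asarray(opinions, dtype=float)
    # neighbors[i, j] is 1.0 when agent j lies within agent i's confidence bound.
    # Every agent is its own neighbor, so each row sum is at least 1.
    neighbors = (np.abs(x[:, None] - x[None, :]) <= eps).astype(float)
    # Each agent adopts the mean opinion of its current neighbors.
    return (neighbors @ x) / neighbors.sum(axis=1)


def hk_run(opinions, eps, max_steps=1000, tol=1e-12):
    """Iterate hk_step until opinions stop changing (illustrative stopping rule)."""
    x = np.asarray(opinions, dtype=float)
    for _ in range(max_steps):
        nxt = hk_step(x, eps)
        if np.max(np.abs(nxt - x)) < tol:
            return nxt
        x = nxt
    return x


if __name__ == "__main__":
    # Two groups farther apart than eps settle into two separate clusters.
    print(hk_run([0.0, 0.1, 0.9, 1.0], eps=0.2))
    # A bound wide enough to connect everyone yields consensus.
    print(hk_run([0.0, 0.1, 0.2], eps=0.5))
```

In this toy setting, the first call illustrates clustering and the second illustrates consensus, the two kinds of limiting behavior the abstract describes for the opinion component.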
CV · Mar 22, 2024
An Optimization Framework to Enforce Multi-View Consistency for Texturing 3D Meshes
Zhengyi Zhao, Chen Song, Xiaodong Gu et al.
A fundamental problem in the texturing of 3D meshes using pre-trained text-to-image models is to ensure multi-view consistency. State-of-the-art approaches typically use diffusion models to aggregate multi-view inputs, where common issues are the blurriness caused by the averaging operation in the aggregation step or inconsistencies in local features. This paper introduces an optimization framework that proceeds in four stages to achieve multi-view consistency. Specifically, the first stage generates an over-complete set of 2D textures from a predefined set of viewpoints using an MV-consistent diffusion process. The second stage selects a subset of views that are mutually consistent while covering the underlying 3D model. We show how to achieve this goal by solving semi-definite programs. The third stage performs non-rigid alignment to align the selected views across overlapping regions. The fourth stage solves an MRF problem to associate each mesh face with a selected view. In particular, the third and fourth stages are iterated, with the cuts obtained in the fourth stage encouraging non-rigid alignment in the third stage to focus on regions close to the cuts. Experimental results show that our approach significantly outperforms baseline approaches both qualitatively and quantitatively. Project page: https://aigc3d.github.io/ConsistenTex.
CV · Jun 17, 2024
TutteNet: Injective 3D Deformations by Composition of 2D Mesh Deformations
Bo Sun, Thibault Groueix, Chen Song et al.
This work proposes a novel representation of injective deformations of 3D space, which overcomes existing limitations of injective methods: inaccuracy, lack of robustness, and incompatibility with general learning and optimization frameworks. The core idea is to reduce the problem to a deep composition of multiple 2D mesh-based piecewise-linear maps. Namely, we build differentiable layers that produce mesh deformations through Tutte's embedding (guaranteed to be injective in 2D), and compose these layers over different planes to create complex 3D injective deformations of the 3D volume. We show our method provides the ability to efficiently and accurately optimize and learn complex deformations, outperforming other injective approaches. As a main application, we produce complex and artifact-free NeRF and SDF deformations.
CV · Jan 3, 2022
Implicit Autoencoder for Point-Cloud Self-Supervised Representation Learning
Siming Yan, Zhenpei Yang, Haoxiang Li et al.
This paper advocates the use of implicit surface representation in autoencoder-based self-supervised 3D representation learning. The most popular and accessible 3D representation, i.e., point clouds, involves discrete samples of the underlying continuous 3D surface. This discretization process introduces sampling variations on the 3D shape, making it challenging to develop transferable knowledge of the true 3D geometry. In the standard autoencoding paradigm, the encoder is compelled to encode not only the 3D geometry but also information on the specific discrete sampling of the 3D shape into the latent code. This is because the point cloud reconstructed by the decoder is considered unacceptable unless there is a perfect mapping between the original and the reconstructed point clouds. This paper introduces the Implicit AutoEncoder (IAE), a simple yet effective method that addresses the sampling variation issue by replacing the commonly-used point-cloud decoder with an implicit decoder. The implicit decoder reconstructs a continuous representation of the 3D shape, independent of the imperfections in the discrete samples. Extensive experiments demonstrate that the proposed IAE achieves state-of-the-art performance across various self-supervised learning benchmarks.
DC · Jan 20, 2020
A Simple Model for Portable and Fast Prediction of Execution Time and Power Consumption of GPU Kernels
Lorenz Braun, Sotirios Nikas, Chen Song et al.
Characterizing compute kernel execution behavior on GPUs for efficient task scheduling is a non-trivial task. We address this with a simple model enabling portable and fast predictions among different GPUs using only hardware-independent features. This model is built based on random forests using 189 individual compute kernels from benchmarks such as Parboil, Rodinia, Polybench-GPU and SHOC. Evaluation of the model performance using cross-validation yields a median Mean Average Percentage Error (MAPE) of 8.86-52.00% for time prediction and 1.84-2.94% for power prediction across five different GPUs, while latency for a single prediction varies between 15 and 108 milliseconds.
CV · Feb 15, 2018
Image Dataset for Visual Objects Classification in 3D Printing
Hongjia Li, Xiaolong Ma, Aditya Singh Rathore et al.
The rapid development of additive manufacturing (AM), also known as 3D printing, has brought about potential risks and security issues along with significant benefits. To enhance the security of the 3D printing process, the present research aims to detect and recognize illegal components using deep learning. In this work, we collect a dataset of 61,340 2D images (28x28 each) in 10 classes, including guns and other non-gun objects, corresponding to the projection results of the original 3D models. To validate the dataset, we train a convolutional neural network (CNN) model for gun classification, which achieves 98.16% classification accuracy.