Yuhan Huang

CV
h-index 30
8 papers
66 citations
Novelty 56%
AI Score 52

8 Papers

64.5 · SE · May 28
CODEFUSE-DEBENCH: An Empirical Study on Readability, Recompilability, and Functionality

Puzhuo Liu, Yuhan Huang, Jianlei Chi et al.

Binary decompilation aims to recover binaries into high-level source code, but existing evaluations mainly rely on syntactic similarity or single-axis readability metrics, which fail to capture practical reusability. We propose a reusability-driven evaluation paradigm that measures decompiler quality along three orthogonal dimensions: readability, recompilability, and functionality. We present DEBENCH, the first automated framework for multidimensional decompilation evaluation. DEBENCH contains 240 atomic test functions, organized into 8 source files and compiled into 640 binaries. It combines LLM-as-judge readability scoring with URAF (18 sub-dimensions), iterative compile-and-repair under a fixed 50-iteration budget, and Frida-based differential dynamic tracing at the program, function, and instruction levels. We evaluate five mainstream decompilers and three repair LLMs. Our study reveals four findings. First, the reusability cliff is steep: the best decompiler-LLM pair reaches 22.3% Exact+Partial program-level behavioral overlap but only 1.2% exact stdout match, nearly 50 points below recompilability. Second, settings that maximize readability do not maximize functionality: -O3 yields the lowest readability but the highest functionality, and Clang gives lower readability than GCC but 2.6x higher functionality. Third, cross-decompiler variation at the functional level is 20x, far larger than the 1.6x cross-LLM variation, showing that progress depends more on decompiler engines than larger repair models. Fourth, failures fall into three categories: syntactic noise, type-system collapse (about 19% of repair errors), and irreversible upstream losses such as ARM64 relocation idioms and C++ ABI features.

99.6 · CV · Apr 22 · Code
Building a Precise Video Language with Human-AI Oversight

Zhiqiu Lin, Chancharik Mitra, Siyuan Cen et al.

Video-language models (VLMs) learn to reason about the dynamic visual world through natural language. We introduce a suite of open datasets, benchmarks, and recipes for scalable oversight that enable precise video captioning. First, we define a structured specification for describing subjects, scenes, motion, spatial, and camera dynamics, grounded by hundreds of carefully defined visual primitives developed with professional video creators such as filmmakers. Next, to curate high-quality captions, we introduce CHAI (Critique-based Human-AI Oversight), a framework where trained experts critique and revise model-generated pre-captions into improved post-captions. This division of labor improves annotation accuracy and efficiency by offloading text generation to models, allowing humans to better focus on verification. Additionally, these critiques and preferences between pre- and post-captions provide rich supervision for improving open-source models (Qwen3-VL) on caption generation, reward modeling, and critique generation through SFT, DPO, and inference-time scaling. Our ablations show that critique quality in precision, recall, and constructiveness, ensured by our oversight framework, directly governs downstream performance. With modest expert supervision, the resulting model outperforms closed-source models such as Gemini-3.1-Pro. Finally, we apply our approach to re-caption large-scale professional videos (e.g., films, commercials, games) and fine-tune video generation models such as Wan to better follow detailed prompts of up to 400 words, achieving finer control over cinematography including camera motion, angle, lens, focus, point of view, and framing. Our results show that precise specification and human-AI oversight are key to professional-level video understanding and generation. Data and code are available on our project page: https://linzhiqiu.github.io/papers/chai/

CV · Apr 17, 2025 · Code
Collaborative Perception Datasets for Autonomous Driving: A Review

Naibang Wang, Deyong Shang, Yan Gong et al.

Collaborative perception has attracted growing interest from academia and industry due to its potential to enhance perception accuracy, safety, and robustness in autonomous driving through multi-agent information fusion. With the advancement of Vehicle-to-Everything (V2X) communication, numerous collaborative perception datasets have emerged, varying in cooperation paradigms, sensor configurations, data sources, and application scenarios. However, the absence of systematic summarization and comparative analysis hinders effective resource utilization and standardization of model evaluation. As the first comprehensive review focused on collaborative perception datasets, this work reviews and compares existing resources from a multi-dimensional perspective. We categorize datasets based on cooperation paradigms, examine their data sources and scenarios, and analyze sensor modalities and supported tasks. A detailed comparative analysis is conducted across multiple dimensions. We also outline key challenges and future directions, including dataset scalability, diversity, domain adaptation, standardization, privacy, and the integration of large language models. To support ongoing research, we provide a continuously updated online repository of collaborative perception datasets and related literature: https://github.com/frankwnb/Collaborative-Perception-Datasets-for-Autonomous-Driving.

87.5 · LG · May 18
Alignment Dynamics in LLM Fine-Tuning

Yuhan Huang, Huanran Chen, Yinpeng Dong

Although Large Language Models (LLMs) achieve strong alignment through supervised fine-tuning and reinforcement learning from human feedback, the alignment is often fragile under subsequent fine-tuning. Existing explanations either attribute alignment fragility to gradient geometry or characterize it as a distributional shift in model outputs, yet few provide a unified account that bridges parameter-space learning dynamics with function-space alignment behavior during fine-tuning. In this work, we introduce a tractable alignment score and derive its closed-form update during fine-tuning, yielding a unified framework for alignment dynamics. Our analysis decomposes alignment updates into two competing components: a Rebound Force, governed jointly by the current alignment state and the narrowness of model distribution, and a Driving Force, determined by how the training distribution aligns with outcome-conditioned posteriors over aligned and non-aligned completions. This decomposition explains why prior alignment can be reversed by later fine-tuning and why narrower posterior structure strengthens such reversal. Moreover, our framework predicts a Rehearsal Priming Effect: prior alignment leaves a latent posterior imprint that amplifies the effective Driving Force upon re-exposure, leading to faster re-alignment. We validate these predictions across safety alignment, emergent misalignment, and sentiment settings, demonstrating consistent alignment reversal and accelerated re-alignment under re-exposure. In addition, controlled experiments in safety alignment confirm the predicted dependence of rebound strength on posterior narrowness. Together, these results provide a unified dynamical perspective on how alignment is disrupted and reactivated during LLM fine-tuning.

81.2 · GR · May 16
VoxScene: Anchor-Conditioned Voxel Diffusion for Indoor Scene Arrangement

Haotian Mao, Yuhan Huang, Jiatao Lin et al.

We present VoxScene, a novel anchor-conditioned voxel diffusion framework tailored for 3D scene synthesis. Current data-driven layout generation techniques typically rely on bounding proxies or implicit representations, which overlook volumetric structures. This geometric blindness inevitably leads to severe physical collisions and structural entanglement, particularly in densely populated environments. To overcome these limitations, we shift the paradigm to an explicit, object-centric voxel representation. Our pipeline sequentially synthesizes discrete volumetric occupancies conditioned on prior anchors and local context. By exploiting the mutually exclusive nature of discrete voxels, our approach eliminates spatial ambiguities and guarantees collision-free arrangements, even in highly complex environments. Furthermore, the synthesized high-fidelity voxel grids serve as discriminative geometric queries for downstream asset retrieval. Extensive experiments demonstrate the universality of our method, achieving state-of-the-art physical plausibility and unlocking shape diversity compared to existing layout planners.

CV · Apr 21, 2025
Towards Understanding Camera Motions in Any Video

Zhiqiu Lin, Siyuan Cen, Daniel Jiang et al.

We introduce CameraBench, a large-scale dataset and benchmark designed to assess and improve camera motion understanding. CameraBench consists of ~3,000 diverse internet videos, annotated by experts through a rigorous multi-stage quality control process. One of our contributions is a taxonomy of camera motion primitives, designed in collaboration with cinematographers. We find, for example, that some motions like "follow" (or tracking) require understanding scene content like moving subjects. We conduct a large-scale human study to quantify human annotation performance, revealing that domain expertise and tutorial-based training can significantly enhance accuracy. For example, a novice may confuse zoom-in (a change of intrinsics) with translating forward (a change of extrinsics), but can be trained to differentiate the two. Using CameraBench, we evaluate Structure-from-Motion (SfM) and Video-Language Models (VLMs), finding that SfM models struggle to capture semantic primitives that depend on scene content, while VLMs struggle to capture geometric primitives that require precise estimation of trajectories. We then fine-tune a generative VLM on CameraBench to achieve the best of both worlds and showcase its applications, including motion-augmented captioning, video question answering, and video-text retrieval. We hope our taxonomy, benchmark, and tutorials will drive future efforts towards the ultimate goal of understanding camera motions in any video.
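The zoom-in versus forward-translation confusion mentioned above can be illustrated with a toy pinhole camera. The following is a minimal sketch, not code from CameraBench: the function names, the two sample points, and the focal lengths are all assumptions chosen for illustration. It shows that changing the focal length (intrinsics) scales every image coordinate by the same factor, whereas moving the camera forward (extrinsics) scales near points more than far points, which is the parallax cue that separates the two motions.

```python
# Toy 1D pinhole model: a point at lateral offset x and depth z
# projects to image coordinate focal * x / z.
def project(x, z, focal):
    return focal * x / z

# Hypothetical scene: one near point and one far point, same lateral offset.
POINTS = [(1.0, 2.0), (1.0, 10.0)]  # (x, z) pairs

def image_coords(focal, camera_z):
    """Project all points for a camera with the given focal length,
    located at depth camera_z and looking along +z."""
    return [project(x, z - camera_z, focal) for x, z in POINTS]

base = image_coords(focal=1.0, camera_z=0.0)
zoomed = image_coords(focal=2.0, camera_z=0.0)   # intrinsics change
dollied = image_coords(focal=1.0, camera_z=1.0)  # extrinsics change

# Per-point magnification relative to the base view.
zoom_ratios = [after / before for before, after in zip(base, zoomed)]
dolly_ratios = [after / before for before, after in zip(base, dollied)]

print("zoom :", zoom_ratios)   # same factor for near and far points
print("dolly:", dolly_ratios)  # near point magnified more than far point
```

In this sketch the near point doubles in size under both motions, so a viewer who watches only the foreground could plausibly mistake one for the other; the far point is what disambiguates them.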

LG · Dec 2, 2024
Fire-Image-DenseNet (FIDN) for predicting wildfire burnt area using remote sensing data

Bo Pang, Sibo Cheng, Yuhan Huang et al.

Predicting the extent of massive wildfires once ignited is essential to reduce the subsequent socioeconomic losses and environmental damage, but challenging because of the complexity of fire behaviour. Existing physics-based models are limited in predicting large or long-duration wildfire events. Here, we develop a deep-learning-based predictive model, Fire-Image-DenseNet (FIDN), that uses spatial features derived from both near real-time and reanalysis data on the environmental and meteorological drivers of wildfire. We trained and tested this model using more than 300 individual wildfires that occurred between 2012 and 2019 in the western US. In contrast to existing models, the performance of FIDN does not degrade with fire size or duration. Furthermore, it predicts final burnt area accurately even in very heterogeneous landscapes in terms of fuel density and flammability. The FIDN model showed higher accuracy, with a mean squared error (MSE) about 82% and 67% lower than those of the predictive models based on cellular automata (CA) and the minimum travel time (MTT) approaches, respectively. Its structural similarity index measure (SSIM) averages 97%, outperforming the CA and FlamMap MTT models by 6% and 2%, respectively. Additionally, FIDN is approximately three orders of magnitude faster than both CA and MTT models. The enhanced computational efficiency and accuracy advancements offer vital insights for strategic planning and resource allocation for firefighting operations.

SP · Jun 1, 2025
LD-RPMNet: Near-Sensor Diagnosis for Railway Point Machines

Wei Li, Xiaochun Wu, Xiaoxi Hu et al.

Near-sensor diagnosis has become increasingly prevalent in industry. This study proposes a lightweight model named LD-RPMNet that integrates Transformers and Convolutional Neural Networks, leveraging both local and global feature extraction to optimize computational efficiency for a practical railway application. The LD-RPMNet introduces a Multi-scale Depthwise Separable Convolution (MDSC) module, which decomposes cross-channel convolutions into pointwise and depthwise convolutions while employing multi-scale kernels to enhance feature extraction. Meanwhile, a Broadcast Self-Attention (BSA) mechanism is incorporated to simplify complex matrix multiplications and improve computational efficiency. Experimental results based on collected sound signals during the operation of railway point machines demonstrate that the optimized model reduces parameter count and computational complexity by 50% while improving diagnostic accuracy by nearly 3%, ultimately achieving an accuracy of 98.86%. This demonstrates the possibility of near-sensor fault diagnosis applications in railway point machines.
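To make the parameter savings from depthwise separable convolution concrete, here is a small arithmetic sketch. It is not the MDSC module itself: it assumes a generic single-scale layer with square 2D kernels and no bias terms, and the channel counts and kernel size are illustrative values not taken from the paper.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out

def depthwise_separable_params(c_in, c_out, k):
    """Weights in a depthwise k x k convolution followed by a
    1 x 1 pointwise convolution (bias ignored)."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise

# Illustrative layer sizes (assumed, not from the paper).
c_in, c_out, k = 64, 128, 3

std = standard_conv_params(c_in, c_out, k)
sep = depthwise_separable_params(c_in, c_out, k)

print(f"standard:            {std} weights")
print(f"depthwise separable: {sep} weights")
print(f"ratio:               {sep / std:.3f}")
```

The ratio simplifies to 1/c_out + 1/k², so the saving grows with both the number of output channels and the kernel size. A multi-scale variant would sum the depthwise term over several kernel sizes.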