LGMar 13, 2025Code
Fine-Tuning Diffusion Generative Models via Rich Preference OptimizationHanyang Zhao, Haoxian Chen, Yucheng Guo et al.
We introduce Rich Preference Optimization (RPO), a novel pipeline that leverages rich feedback signals to improve the curation of preference pairs for fine-tuning text-to-image diffusion models. Traditional methods, like Diffusion-DPO, often rely solely on reward model labeling, which can be opaque, offer limited insights into the rationale behind preferences, and are prone to issues such as reward hacking or overfitting. In contrast, our approach begins with generating detailed critiques of synthesized images, from which we extract reliable and actionable image editing instructions. By implementing these instructions, we create refined images, resulting in synthetic, informative preference pairs that serve as enhanced tuning datasets. We demonstrate the effectiveness of our pipeline and the resulting datasets in fine-tuning state-of-the-art diffusion models. Our code is available at https://github.com/Diffusion-RLHF/RPO.
DCAug 26, 2025Code
ClusterFusion: Expanding Operator Fusion Scope for LLM Inference via Cluster-Level Collective PrimitiveXinhao Luo, Zihan Liu, Yangjie Zhou et al.
Large language model (LLM) decoding suffers from high latency due to fragmented execution across operators and heavy reliance on off-chip memory for data exchange and reduction. This execution model limits opportunities for fusion and incurs significant memory traffic and kernel launch overhead. While modern architectures such as NVIDIA Hopper provide distributed shared memory and low-latency intra-cluster interconnects, they expose only low-level data movement instructions, lacking structured abstractions for collective on-chip communication. To bridge this software-hardware gap, we introduce two cluster-level communication primitives, ClusterReduce and ClusterGather, which abstract common communication patterns and enable structured, high-speed data exchange and reduction between thread blocks within a cluster, allowing intermediate results to be on-chip without involving off-chip memory. Building on these abstractions, we design ClusterFusion, an execution framework that schedules communication and computation jointly to expand operator fusion scope by composing decoding stages such as QKV Projection, Attention, and Output Projection into a single fused kernels. Evaluations on H100 GPUs show that ClusterFusion outperforms state-of-the-art inference frameworks by 1.61x on average in end-to-end latency across different models and configurations. The source code is available at https://github.com/xinhao-luo/ClusterFusion.
LOJan 24, 2025
Deeply Optimizing the SAT Solver for the IC3 AlgorithmYuheng Su, Qiusong Yang, Yiwei Ci et al.
The IC3 algorithm, also known as PDR, is a SAT-based model checking algorithm that has significantly influenced the field in recent years due to its efficiency, scalability, and completeness. It utilizes SAT solvers to solve a series of SAT queries associated with relative induction. In this paper, we introduce several optimizations for the SAT solver in IC3 based on our observations of the unique characteristics of these SAT queries. By observing that SAT queries do not necessarily require decisions on all variables, we compute a subset of variables that need to be decided before each solving process while ensuring that the result remains unaffected. Additionally, noting that the overhead of binary heap operations in VSIDS is non-negligible, we replace the binary heap with buckets to achieve constant-time operations. Furthermore, we support temporary clauses without the need to allocate a new activation variable for each solving process, thereby eliminating the need to reset solvers. We developed a novel lightweight CDCL SAT solver, GipSAT, which integrates these optimizations. A comprehensive evaluation highlights the performance improvements achieved by GipSAT. Specifically, the GipSAT-based IC3 demonstrates an average speedup of 3.61 times in solving time compared to the IC3 implementation based on MiniSat.