LGDec 18, 2024
PowerMLP: An Efficient Version of KANRuichen Qiu, Yibo Miao, Shiwen Wang et al.
The Kolmogorov-Arnold Network (KAN) is a new network architecture known for its high accuracy in several tasks such as function fitting and PDE solving. The superior expressive capability of KAN arises from the Kolmogorov-Arnold representation theorem and learnable spline functions. However, the computation of spline functions involves multiple iterations, which renders KAN significantly slower than MLP, thereby increasing the cost associated with model training and deployment. The authors of KAN have also noted that ``the biggest bottleneck of KANs lies in its slow training. KANs are usually 10x slower than MLPs, given the same number of parameters.'' To address this issue, we propose a novel MLP-type neural network PowerMLP that employs simpler non-iterative spline function representation, offering approximately the same training time as MLP while theoretically demonstrating stronger expressive power than KAN. Furthermore, we compare the FLOPs of KAN and PowerMLP, quantifying the faster computation speed of PowerMLP. Our comprehensive experiments demonstrate that PowerMLP generally achieves higher accuracy and a training speed about 40 times faster than KAN in various tasks.
CVFeb 17, 2025
SayAnything: Audio-Driven Lip Synchronization with Conditional Video DiffusionJunxian Ma, Shiwen Wang, Jian Yang et al.
Recent advances in diffusion models have led to significant progress in audio-driven lip synchronization. However, existing methods typically rely on constrained audio-visual alignment priors or multi-stage learning of intermediate representations to force lip motion synthesis. This leads to complex training pipelines and limited motion naturalness. In this paper, we present SayAnything, a conditional video diffusion framework that directly synthesizes lip movements from audio input while preserving speaker identity. Specifically, we propose three specialized modules including identity preservation module, audio guidance module, and editing control module. Our novel design effectively balances different condition signals in the latent space, enabling precise control over appearance, motion, and region-specific generation without requiring additional supervision signals or intermediate representations. Extensive experiments demonstrate that SayAnything generates highly realistic videos with improved lip-teeth coherence, enabling unseen characters to say anything, while effectively generalizing to animated characters.
CVOct 30, 2024
LumiSculpt: Enabling Consistent Portrait Lighting in Video GenerationYuxin Zhang, Dandan Zheng, Biao Gong et al.
Lighting plays a pivotal role in ensuring the naturalness and aesthetic quality of video generation. However, the impact of lighting is deeply coupled with other factors of videos, e.g., objects and scenes. Thus, it remains challenging to disentangle and model coherent lighting conditions independently, limiting the flexibility to control lighting in video generation. In this paper, inspired by the established controllable T2I models, we propose LumiSculpt, which enables precise and consistent lighting control in T2V generation models. LumiSculpt equips the video generation with new interactive capabilities, allowing the input of reference image sequences with customized lighting conditions. Furthermore, the core learnable plug-and-play module of LumiSculpt facilitates direct control over the intensity, position and trajectory of an assumed light source in video diffusion models. To effectively train LumiSculpt and address the issue of insufficient lighting data, we construct LumiHuman, a new lightweight and flexible dataset for portrait lighting of images and videos. Experimental results demonstrate that LumiSculpt achieves precise and high-quality lighting control in video generation. The analysis demonstrates the flexibility of LumiHuman.