CV · Sep 10, 2024
RealisDance: Equip controllable character animation with realistic hands
Jingkai Zhou, Benzhi Wang, Weihua Chen et al.
Controllable character animation is an emerging task that generates character videos controlled by pose sequences from given character images. Although character consistency has made significant progress via reference UNet, another crucial factor, pose control, has not been well studied by existing methods yet, resulting in several issues: 1) The generation may fail when the input pose sequence is corrupted. 2) The hands generated using the DWPose sequence are blurry and unrealistic. 3) The generated video will be shaky if the pose sequence is not smooth enough. In this paper, we present RealisDance to handle all the above issues. RealisDance adaptively leverages three types of poses, avoiding failed generation caused by corrupted pose sequences. Among these pose types, HaMeR provides accurate 3D and depth information of hands, enabling RealisDance to generate realistic hands even for complex gestures. Besides using temporal attention in the main UNet, RealisDance also inserts temporal attention into the pose guidance network, smoothing the video from the pose condition aspect. Moreover, we introduce pose shuffle augmentation during training to further improve generation robustness and video smoothness. Qualitative experiments demonstrate the superiority of RealisDance over other existing methods, especially in hand quality.
82.3 · DC · Mar 26
The Complexity of Distributed Minimum Weight Cycle Approximation
Yi-Jun Chang, Yanyu Chen, Dipan Dey et al.
We investigate the \emph{minimum weight cycle (MWC)} problem in the $\mathsf{CONGEST}$ model of distributed computing. For undirected weighted graphs, we design a randomized algorithm that achieves a $(k+1)$-approximation, for any \emph{real} number $k \ge 1$. The round complexity of the algorithm is \[ \tilde{O}\!\Big( n^{\frac{k+1}{2k+1}} + n^{\frac{1}{k}} + D\, n^{\frac{1}{2(2k+1)}} + D^{\frac{2}{5}} n^{\frac{2}{5}+\frac{1}{2(2k+1)}} \Big), \] where $n$ denotes the number of nodes and $D$ is the unweighted diameter of the graph. This result yields a smooth trade-off between approximation ratio and round complexity. In particular, when $k \geq 2$ and $D = \tilde{O}(n^{1/4})$, the bound simplifies to \[ \tilde{O}\!\left( n^{\frac{k+1}{2k+1}} \right). \] On the lower bound side, assuming the Erdős girth conjecture, we prove that for every \emph{integer} $k \ge 1$, any randomized $(k+1-ε)$-approximation algorithm for MWC requires \[ \tilde{Ω}\!\left( n^{\frac{k+1}{2k+1}} \right) \] rounds. This lower bound holds for both directed unweighted and undirected weighted graphs, and applies even to graphs with small diameter $D = Θ(\log n)$. Taken together, our upper and lower bounds \emph{match up to polylogarithmic factors} for graphs of sufficiently small diameter $D = \tilde{O}(n^{1/4})$ (when $k \geq 2$), yielding a nearly tight bound on the distributed complexity of the problem. Our results improve upon the previous state of the art: Manoharan and Ramachandran (PODC~2024) demonstrated a $(2+ε)$-approximation algorithm for undirected weighted graphs with round complexity $\tilde{O}(n^{2/3}+D)$, and proved that for any arbitrarily large number $α$, any $α$-approximation algorithm for directed unweighted or undirected weighted graphs requires $Ω(\sqrt{n}/\log n)$ rounds.
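The simplification of the upper bound for $k \geq 2$ and $D = \tilde{O}(n^{1/4})$ can be checked by comparing the exponents of $n$ in the four terms. The following is a minimal sketch, not code from the paper: the function names are hypothetical, the diameter is written as $D = n^d$, and polylogarithmic factors are ignored.

```python
from fractions import Fraction


def term_exponents(k, d):
    """Exponent of n in each of the four terms of the round complexity,
    assuming D = n**d and ignoring polylogarithmic factors."""
    k = Fraction(k)
    d = Fraction(d)
    tail = 1 / (2 * (2 * k + 1))  # the recurring exponent 1 / (2(2k+1))
    return [
        (k + 1) / (2 * k + 1),                      # n^{(k+1)/(2k+1)}
        1 / k,                                      # n^{1/k}
        d + tail,                                   # D * n^{1/(2(2k+1))}
        Fraction(2, 5) * d + Fraction(2, 5) + tail  # D^{2/5} * n^{2/5 + 1/(2(2k+1))}
    ]


def dominant_exponent(k, d):
    """Largest exponent of n among the four terms."""
    return max(term_exponents(k, d))


if __name__ == "__main__":
    for k in (1, 2, 3):
        print(k, term_exponents(k, Fraction(1, 4)), dominant_exponent(k, Fraction(1, 4)))
```

With $d = 1/4$, the fourth term has exponent $\frac{1}{2} + \frac{1}{2(2k+1)} = \frac{k+1}{2k+1}$, so it ties with the first term. The second term $n^{1/k}$ stays below the first only when $k^2 - k - 1 \ge 0$, which is why the simplified bound is stated for $k \geq 2$ and not for $k = 1$.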
CL · Dec 18, 2024
TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks
Frank F. Xu, Yufan Song, Boxuan Li et al. · cmu
We interact with computers on a daily basis, be it in everyday life or work, and many aspects of work can be done entirely with access to a computer and the Internet. At the same time, thanks to improvements in large language models (LLMs), there has also been a rapid development in AI agents that interact with and effect change in their surrounding environments. But how performant are AI agents at accelerating or even autonomously performing work-related tasks? The answer to this question has important implications both for industry looking to adopt AI into their workflows and for economic policy to understand the effects that adoption of AI may have on the labor market. To measure the progress of these LLM agents' performance on real-world professional tasks, in this paper we introduce TheAgentCompany, an extensible benchmark for evaluating AI agents that interact with the world in similar ways to those of a digital worker: by browsing the Web, writing code, running programs, and communicating with other coworkers. We build a self-contained environment with internal web sites and data that mimics a small software company environment, and create a variety of tasks that may be performed by workers in such a company. We test baseline agents powered by both closed API-based and open-weights language models (LMs), and find that the most competitive agent can complete 30% of tasks autonomously. This paints a nuanced picture of task automation with LM agents--in a setting simulating a real workplace, a good portion of simpler tasks could be solved autonomously, but more difficult long-horizon tasks are still beyond the reach of current systems. We release code, data, environment, and experiments on https://the-agent-company.com.
69.7 · RO · Mar 19
Inductance-Based Force Self-Sensing in Fiber-Reinforced Pneumatic Twisted-and-Coiled Actuators
Yunsong Zhang, Tianlin Li, Mingyang Yang et al.
Fiber-reinforced pneumatic twisted-and-coiled actuators (FR-PTCAs) offer high power density and compliance, but their strong hysteresis and lack of intrinsic proprioception limit effective closed-loop control. This paper presents a self-sensing FR-PTCA integrated with a conductive nickel wire that enables intrinsic force estimation and indirect displacement inference via inductance feedback. Experimental characterization reveals that the actuator exhibits a deterministic, low-hysteresis inductance-force relationship at constant pressures, in contrast to the strongly hysteretic inductance-length behavior. Leveraging this property, this paper develops a parametric self-sensing model and a nonlinear hybrid observer that integrates an Extended Kalman Filter (EKF) with constrained optimization to resolve the ambiguity in the inductance-force mapping and estimate actuator states. Experimental results demonstrate that the proposed approach achieves force estimation accuracy comparable to that of external load cells and maintains robust performance under varying load conditions.
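To illustrate the general idea of estimating force from an inductance reading with an EKF, here is a minimal scalar sketch. It is not the paper's observer: the quadratic inductance-force model, its constants `L0`, `A`, `B`, the random-walk force model, and the noise values are all hypothetical assumptions, and the constrained-optimization step that the abstract describes for resolving mapping ambiguity is omitted (the assumed mapping is monotone for non-negative force).

```python
# Hypothetical calibration constants, NOT taken from the paper.
L0, A, B = 10.0, 0.5, 0.01  # assumed model: L(F) = L0 + A*F + B*F**2


def inductance(force):
    """Assumed measurement model mapping force to inductance."""
    return L0 + A * force + B * force * force


def inductance_slope(force):
    """Derivative dL/dF of the assumed model, used to linearize the EKF."""
    return A + 2.0 * B * force


def ekf_step(f_est, p_est, z, q=0.01, r=1e-4):
    """One scalar EKF predict/update cycle.

    f_est, p_est: previous force estimate and its variance
    z: new inductance measurement
    q, r: assumed process and measurement noise variances
    """
    # Predict: random-walk force model, so the mean is unchanged.
    f_pred = f_est
    p_pred = p_est + q
    # Update: linearize the measurement model around the prediction.
    h = inductance_slope(f_pred)
    s = h * p_pred * h + r            # innovation variance
    gain = p_pred * h / s             # Kalman gain
    f_new = f_pred + gain * (z - inductance(f_pred))
    p_new = (1.0 - gain * h) * p_pred
    return f_new, p_new


def estimate_force(measurements, f0=5.0, p0=100.0):
    """Run the filter over a sequence of inductance measurements."""
    f, p = f0, p0
    for z in measurements:
        f, p = ekf_step(f, p, z)
    return f


if __name__ == "__main__":
    true_force = 20.0
    readings = [inductance(true_force)] * 50  # noiseless readings for illustration
    print(estimate_force(readings))
```

Under these assumptions the estimate settles near the true force within a few updates, because each update behaves like a damped Newton step on the assumed mapping.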
CV · Jun 5, 2025
FPSAttention: Training-Aware FP8 and Sparsity Co-Design for Fast Video Diffusion
Akide Liu, Zeyu Zhang, Zhexin Li et al.
Diffusion generative models have become the standard for producing high-quality, coherent video content, yet their slow inference speeds and high computational demands hinder practical deployment. Although both quantization and sparsity can independently accelerate inference while maintaining generation quality, naively combining these techniques in existing training-free approaches leads to significant performance degradation due to the lack of joint optimization. We introduce FPSAttention, a novel training-aware co-design of FP8 quantization and sparsity for video generation, with a focus on the 3D bi-directional attention mechanism. Our approach features three key innovations: 1) A unified 3D tile-wise granularity that simultaneously supports both quantization and sparsity; 2) A denoising step-aware strategy that adapts to the noise schedule, addressing the strong correlation between quantization/sparsity errors and denoising steps; 3) A native, hardware-friendly kernel that leverages FlashAttention and is implemented with optimized Hopper architecture features for highly efficient execution. Trained on Wan2.1's 1.3B and 14B models and evaluated on the VBench benchmark, FPSAttention achieves a 7.09x kernel speedup for attention operations and a 4.96x end-to-end speedup for video generation compared to the BF16 baseline at 720p resolution, without sacrificing generation quality.
CV · Nov 25, 2025
Inferix: A Block-Diffusion based Next-Generation Inference Engine for World Simulation
Inferix Team, Tianyu Feng, Yizeng Han et al.
World models serve as core simulators for fields such as agentic AI, embodied AI, and gaming, capable of generating long, physically realistic, and interactive high-quality videos. Moreover, scaling these models could unlock emergent capabilities in visual perception, understanding, and reasoning, paving the way for a new paradigm that moves beyond current LLM-centric vision foundation models. A key breakthrough empowering them is the semi-autoregressive (block-diffusion) decoding paradigm, which merges the strengths of diffusion and autoregressive methods by generating video tokens in blocks, applying diffusion within each block while conditioning on previous ones, resulting in more coherent and stable video sequences. Crucially, it overcomes limitations of standard video diffusion by reintroducing LLM-style KV Cache management, enabling efficient, variable-length, and high-quality generation. Therefore, Inferix is specifically designed as a next-generation inference engine to enable immersive world synthesis through optimized semi-autoregressive decoding processes. This dedicated focus on world simulation distinctly sets it apart from systems engineered for high-concurrency scenarios (like vLLM or SGLang) and from classic video diffusion models (such as xDiTs). Inferix further enhances its offering with interactive video streaming and profiling, enabling real-time interaction and realistic simulation to accurately model world dynamics. Additionally, it supports efficient benchmarking through seamless integration of LV-Bench, a new fine-grained evaluation benchmark tailored for minute-long video generation scenarios. We hope the community will work together to advance Inferix and foster world model exploration.