Understanding Multimodal Failure in Action-Chunking Behavioral Cloning
For researchers in behavioral cloning and imitation learning, this work provides a mechanistic understanding of failure modes in multimodal action prediction, though the insights are incremental.
The paper analyzes how different multimodal parameterizations in action-chunking behavioral cloning fail when the demonstrations are multimodal. Latent-variable policies face a trade-off between posterior-prior regularization and mode preservation, while action-space generative policies are limited by the Lipschitz constant of the base-to-action transport map. Experiments on synthetic tasks and robotic simulation benchmarks support these findings.
Behavioral cloning becomes difficult when the same observation admits several valid actions. We study this problem for action-chunking policies and show that different multimodal parameterizations fail in different ways. For latent-variable policies, posterior-prior regularization makes deployment-time sampling more reliable, but excessive regularization removes the action-conditioned information needed to distinguish demonstrated modes. Reducing this regularization can preserve mode information, but then success depends on whether the prior covers the relevant latent regions. For action-space generative policies, multimodality is constrained by the smoothness of the base-to-action transport: a map with small Lipschitz constant cannot assign substantial probability to many well-separated modes. Covering many modes therefore requires either sharp transitions in base space or off-support bridge regions in action space. Experiments on synthetic multimodal tasks and robotic simulation benchmarks support these mechanisms.
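The transport-smoothness argument can be illustrated with a one-dimensional toy example. The sketch below is not the paper's construction: it assumes a uniform base distribution on [0, 1], point-mass modes spaced `gap` apart, and a hypothetical plateau-and-ramp map (`plateau_ramp_map`) whose ramps have slope equal to the Lipschitz constant. In this simplified setting, any L-Lipschitz map that reaches K modes must send at least (K - 1) * gap / L of the base mass into the regions between modes, and the map below attains that bound, so the measured off-support fraction shows how the bridge mass shrinks only as the transitions become sharper.

```python
import numpy as np


def plateau_ramp_map(u, n_modes, gap, lipschitz):
    """Hypothetical 1-D transport from a uniform base on [0, 1] to actions.

    Flat plateaus sit exactly on the modes 0, gap, 2*gap, ...; neighbouring
    plateaus are joined by linear ramps whose slope equals `lipschitz`.
    """
    ramp_len = gap / lipschitz                 # base-space length one ramp needs
    total_ramp = (n_modes - 1) * ramp_len      # base-space length of all ramps
    if total_ramp >= 1.0:
        raise ValueError("Lipschitz constant too small to reach every mode")
    plateau_len = (1.0 - total_ramp) / n_modes
    period = plateau_len + ramp_len            # one plateau followed by one ramp
    k = np.minimum(np.floor(u / period), n_modes - 1)  # segment index
    r = u - k * period                                  # offset inside segment
    # On the plateau the clipped term is 0; on the ramp it grows linearly.
    return k * gap + lipschitz * np.clip(r - plateau_len, 0.0, ramp_len)


def off_support_fraction(n_modes, gap, lipschitz, n_grid=200_000, tol=1e-9):
    """Fraction of base mass mapped strictly between modes (the bridge mass)."""
    u = (np.arange(n_grid) + 0.5) / n_grid     # deterministic grid on (0, 1)
    actions = plateau_ramp_map(u, n_modes, gap, lipschitz)
    nearest_mode = np.round(actions / gap) * gap
    return float(np.mean(np.abs(actions - nearest_mode) > tol))


for lipschitz in (5.0, 10.0, 100.0):
    measured = off_support_fraction(n_modes=4, gap=1.0, lipschitz=lipschitz)
    bound = (4 - 1) * 1.0 / lipschitz
    print(f"L={lipschitz:6.1f}  bridge mass={measured:.4f}  (K-1)*gap/L={bound:.4f}")
```

With four modes and unit spacing, the function raises an error once the Lipschitz constant drops to 3 or below, because the ramps alone would consume the entire base interval. This is the toy analogue of the statement that a smooth map cannot cover many well-separated modes without either sharp transitions or substantial off-support mass.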