Statistical transformer networks: learning shape and appearance models via self supervision
This work addresses unsupervised shape and appearance modeling in computer vision. Its approach removes the need for direct supervision such as landmarks, although the contribution is incremental in that it builds on existing Spatial Transformer Networks.
The authors learn shape and appearance models without supervision by generalizing Spatial Transformer Networks: the fixed sampling grid is replaced with a deformable, statistical shape model, giving a Statistical Transformer Network (StaTN). Trained end-to-end, a StaTN learns the optimal nonrigid alignment for a given task. It can also be trained with generic loss functions, including one inspired by the minimum description length principle, in which case it learns an active appearance model without supervision.
We generalise Spatial Transformer Networks (STN) by replacing the parametric transformation of a fixed, regular sampling grid with a deformable, statistical shape model which is itself learnt. We call this a Statistical Transformer Network (StaTN). By training a network containing a StaTN end-to-end for a particular task, the network learns the optimal nonrigid alignment of the input data for the task. Moreover, the statistical shape model is learnt with no direct supervision (such as landmarks) and can be reused for other tasks. Besides training for a specific task, we also show that a StaTN can learn a shape model using generic loss functions. This includes a loss inspired by the minimum description length principle in which an appearance model is also learnt from scratch. In this configuration, our model learns an active appearance model and a means to fit the model from scratch with no supervision at all, not even identity labels.
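To make the central idea concrete, the following is a minimal NumPy sketch of sampling an image at points produced by a shape model rather than by a parametric warp of a fixed grid. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the form of the shape model, so a linear (PCA-style) model, points = mean + basis × parameters, is assumed here, and the names `shape_model_points`, `bilinear_sample`, and `regular_grid` are hypothetical. In a trainable setting the mean, basis, and the network predicting the parameters would all be learnt; here they are fixed by hand.

```python
import numpy as np

def regular_grid(h, w):
    """Return the (h*w, 2) array of (x, y) pixel-centre coordinates, row by row."""
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    return np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)

def shape_model_points(mean_shape, basis, params):
    """Assumed linear shape model: points = mean + basis @ params.

    mean_shape: (N, 2) mean sampling locations as (x, y), in pixel units.
    basis:      (2N, K) matrix; each column is one mode of shape variation,
                stored as interleaved (x0, y0, x1, y1, ...) offsets.
    params:     (K,) shape coefficients (in an STN-style network these would
                be predicted from the input image).
    """
    offsets = (basis @ params).reshape(-1, 2)
    return mean_shape + offsets

def bilinear_sample(image, points):
    """Sample a 2D greyscale image at real-valued (x, y) points.

    Points outside the image are clamped to the border.
    """
    h, w = image.shape
    x = np.clip(points[:, 0], 0, w - 1)
    y = np.clip(points[:, 1], 0, h - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    fx = x - x0
    fy = y - y0
    top = (1 - fx) * image[y0, x0] + fx * image[y0, x1]
    bottom = (1 - fx) * image[y1, x0] + fx * image[y1, x1]
    return (1 - fy) * top + fy * bottom

# Toy example: a 3x4 image and a one-mode model whose single mode
# shifts every sampling point along x.
image = np.arange(12, dtype=float).reshape(3, 4)
mean_shape = regular_grid(3, 4)
basis = np.zeros((2 * mean_shape.shape[0], 1))
basis[0::2, 0] = 1.0  # x-components of every point

identity = bilinear_sample(image, shape_model_points(mean_shape, basis, np.array([0.0])))
shifted = bilinear_sample(image, shape_model_points(mean_shape, basis, np.array([0.5])))
```

With zero shape parameters the sampled values reproduce the input image, which mirrors the identity transform of a standard STN. Non-zero parameters move the sampling points along the modes of the basis, so a richer basis would allow nonrigid deformations that a single affine warp of a regular grid cannot express.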