Learning to Edit Visual Programs with Self-Supervision
This work addresses the challenge of improving visual program accuracy in domains that lack program annotations. It offers an editing-based paradigm that improves programs incrementally through local edits and yields gains over one-shot prediction alone.
The paper tackles the problem of editing visual programs to better match visual targets. It introduces a self-supervised learning approach that combines an edit network with a one-shot prediction model. Across multiple domains, the combined approach infers more accurate visual programs than the one-shot model alone, even under equal search-time budgets.
We design a system that learns how to edit visual programs. Our edit network consumes a complete input program and a visual target. From this input, we task our network with predicting a local edit operation that could be applied to the input program to improve its similarity to the target. To apply this scheme to domains that lack program annotations, we develop a self-supervised learning approach that integrates this edit network into a bootstrapped finetuning loop along with a network that predicts entire programs in one shot. Our joint finetuning scheme, when coupled with an inference procedure that initializes a population from the one-shot model and evolves members of this population with the edit network, helps to infer more accurate visual programs. Over multiple domains, we experimentally compare our method against the alternative of using only the one-shot model, and find that even under equal search-time budgets, our editing-based paradigm provides significant advantages.
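The inference procedure described above (initialize a population from the one-shot model, then evolve its members with the edit network) can be illustrated with a minimal toy sketch. Everything here is an assumption made for illustration and is not the paper's implementation: programs are tuples of integers, `score` is a negative L1 distance standing in for visual similarity, and `one_shot_sample` and `propose_edit` are hypothetical random stubs standing in for the two learned networks.

```python
import random
from typing import List, Sequence, Tuple

Program = Tuple[int, ...]  # toy stand-in for a visual program (assumption)


def score(program: Program, target: Sequence[int]) -> float:
    """Toy similarity: negative L1 distance to the target (higher is better)."""
    return -float(sum(abs(p - t) for p, t in zip(program, target)))


def one_shot_sample(target: Sequence[int], rng: random.Random) -> Program:
    """Hypothetical stub for the one-shot network: a noisy guess at the target."""
    return tuple(t + rng.randint(-3, 3) for t in target)


def propose_edit(program: Program, rng: random.Random) -> Program:
    """Hypothetical stub for the edit network: one local edit (one token +/- 1).

    A learned edit network would condition on the target; this stub does not.
    """
    i = rng.randrange(len(program))
    edited = list(program)
    edited[i] += rng.choice((-1, 1))
    return tuple(edited)


def infer(
    target: Sequence[int],
    pop_size: int = 8,
    rounds: int = 20,
    edits_per_member: int = 4,
    seed: int = 0,
) -> Tuple[Program, float, float]:
    """Return (best program, best initial score, best final score)."""
    rng = random.Random(seed)

    # Step 1: initialize the population from the one-shot model.
    population: List[Program] = [one_shot_sample(target, rng) for _ in range(pop_size)]
    initial_best = max(score(p, target) for p in population)

    # Step 2: evolve members of the population with the edit model.
    for _ in range(rounds):
        candidates = set(population)  # keep current members as candidates
        for member in population:
            for _ in range(edits_per_member):
                candidates.add(propose_edit(member, rng))
        # Keep the top-scoring candidates; ties are broken by program value
        # so that the result is deterministic for a fixed seed.
        ranked = sorted(candidates, key=lambda p: (-score(p, target), p))
        population = ranked[:pop_size]

    best = population[0]
    return best, initial_best, score(best, target)


if __name__ == "__main__":
    best, before, after = infer([5, 2, 7, 1])
    print("best program:", best)
    print("best score before editing:", before, "after editing:", after)
```

Because the current members are always kept among the candidates, the best score in this sketch never decreases from one round to the next. The selection rule (keep the top `pop_size` candidates) is one simple choice among many and is not claimed to match the paper's procedure.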