CV AI LG NCMar 7, 2024

A spatiotemporal style transfer algorithm for dynamic visual stimulus generation

arXiv:2403.04940v15.210 citationsh-index: 10Nat Comput Sci

Originality Incremental advance

AI Analysis

This provides a versatile tool for vision scientists to generate and manipulate video stimuli for testing hypotheses in biological and artificial vision systems, though it is incremental as it builds on existing style transfer methods.

The authors tackled the scarcity of video generation methods for vision research by introducing the Spatiotemporal Style Transfer (STST) algorithm, which generates dynamic visual stimuli that match low-level spatiotemporal features of natural videos but lack high-level semantic features, enabling studies on object recognition and probing deep network representations.

Understanding how visual information is encoded in biological and artificial systems often requires vision scientists to generate appropriate stimuli to test specific hypotheses. Although deep neural network models have revolutionized the field of image generation with methods such as image style transfer, available methods for video generation are scarce. Here, we introduce the Spatiotemporal Style Transfer (STST) algorithm, a dynamic visual stimulus generation framework that allows powerful manipulation and synthesis of video stimuli for vision research. It is based on a two-stream deep neural network model that factorizes spatial and temporal features to generate dynamic visual stimuli whose model layer activations are matched to those of input videos. As an example, we show that our algorithm enables the generation of model metamers, dynamic stimuli whose layer activations within our two-stream model are matched to those of natural videos. We show that these generated stimuli match the low-level spatiotemporal features of their natural counterparts but lack their high-level semantic features, making it a powerful paradigm to study object recognition. Late layer activations in deep vision models exhibited a lower similarity between natural and metameric stimuli compared to early layers, confirming the lack of high-level information in the generated stimuli. Finally, we use our generated stimuli to probe the representational capabilities of predictive coding deep networks. These results showcase potential applications of our algorithm as a versatile tool for dynamic stimulus generation in vision science.

View on arXiv PDF

Similar