Antagonising explanation and revealing bias directly through sequencing and multimodal inference
This work addresses the problem of bias and lack of explanation in generative models for researchers and practitioners in film and audiovisual arts, though it appears incremental as it builds on existing critiques of bias in AI.
The paper argues that deep generative models, particularly diffusion models, inherently reflect cultural biases from their training data, and that these biases can be revealed by analysing the generative process as a form of 'going back in time' through sequencing and multimodal inference. It suggests this approach can expose predictive failures in contemporary synthesis methods for video production and audiovisual arts.
Deep generative models, such as diffusion models, produce data according to a learned representation, through a process of approximation that computes possible samples. Approximation can be understood as reconstruction, and the large datasets used to train models as sets of records in which we represent the physical world with some data structure (photographs, audio recordings, manuscripts). During reconstruction, image frames, for example, develop at each timestep towards a textual input description. While moving forward in time, frame sets are shaped according to learned bias, and their production, we argue here, can be considered as going back in time: not by analogy with the backward diffusion process, but by acknowledging that culture is specifically marked in the records. Futures of generative modelling, namely in film and audiovisual arts, can benefit from treating diffusion systems as a process that computes the future while inevitably being tied to the past, provided we acknowledge that the records capture fields of view at a specific time and correlate with our own finite memory ideals. Models generating new data distributions can target video production as signal processors, and by developing sequences through timelines we ourselves also go back to decade-old algorithmic and multi-track methodologies, revealing the actual predictive failure of contemporary approaches to synthesis in moving image, approaches that are both relevant to composition and not explanatory.
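
The idea that a sample moving forward through timesteps is pulled back towards the records can be illustrated with a toy loop. The sketch below is not a diffusion model and is not the method of this work: it assumes, purely for illustration, that "training" reduces to averaging a handful of records and that "reconstruction" is a noisy interpolation from random noise towards that average. The names `learn_from_records`, `reconstruct` and `distance` are hypothetical.

```python
import math
import random


def learn_from_records(records):
    """Stand-in for training: summarise the records by their per-dimension mean."""
    n = len(records)
    dim = len(records[0])
    return [sum(r[d] for r in records) / n for d in range(dim)]


def distance(a, b):
    """Euclidean distance between two equally sized vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def reconstruct(learned, steps=50, seed=0):
    """Start from noise and move towards the learned summary at each timestep.

    Returns the whole trajectory, so that the forward sequence of frames
    can be inspected against the past it was learned from.
    """
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in learned]
    trajectory = [list(x)]
    for t in range(steps):
        rate = 1.0 / (steps - t)  # grows to 1.0 at the final step
        x = [
            (1.0 - rate) * xi + rate * li + 0.1 * (1.0 - rate) * rng.gauss(0.0, 1.0)
            for xi, li in zip(x, learned)
        ]
        trajectory.append(list(x))
    return trajectory


if __name__ == "__main__":
    # Hypothetical two-dimensional "records" standing in for a dataset.
    records = [[0.9, 0.1], [1.1, -0.1], [1.0, 0.0]]
    learned = learn_from_records(records)
    trajectory = reconstruct(learned, steps=20, seed=1)
    for t in (0, 10, 20):
        print(t, round(distance(trajectory[t], learned), 4))
```

Whatever the starting noise, the final frame coincides with the summary of the records: the sequence advances in time while its destination was fixed by what was recorded in the past. In a real diffusion system the learned summary is replaced by a trained denoising network conditioned on text, but the dependence on the records has the same direction.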