What Do AI-Generated Images Want?
This is an incremental theoretical contribution to art history and AI ethics. It examines the agency of AI-generated images and proposes no practical applications.
The paper reframes W.J.T. Mitchell's question about the agency of pictures to ask what AI-generated images want. It argues that, because they are fundamentally abstract, these images want specificity and concreteness. The argument starts from the observation that multimodal models rest on the premise that text and image are interchangeable tokens.
W.J.T. Mitchell's influential essay 'What do pictures want?' shifts the theoretical focus away from the interpretative act of understanding pictures, and from the motivations of the humans who create them, toward the possibility that the picture itself is an entity with agency and wants. In this article, I reframe Mitchell's question in light of contemporary AI image generation tools to ask: what do AI-generated images want? Drawing on art historical discourse on the nature of abstraction, I argue that AI-generated images want specificity and concreteness because they are fundamentally abstract. Multimodal text-to-image models, which are the primary subject of this article, are based on the premise that text and image are interchangeable tokens and that the two are commensurable, at least as represented mathematically in data. The user pipeline, in which textual input becomes visual output, obscures this representational regress and makes it seem that one form transforms into the other, as if by magic.