CV Aug 7, 2024
ArtVLM: Attribute Recognition Through Vision-Based Prefix Language Modeling
William Yicheng Zhu, Keren Ye, Junjie Ke et al.
Recognizing and disentangling visual attributes from objects is foundational to many computer vision applications. While large vision-language representations like CLIP have largely resolved the task of zero-shot object recognition, zero-shot visual attribute recognition remains a challenge because CLIP's contrastively-learned vision-language representation cannot effectively capture object-attribute dependencies. In this paper, we target this weakness and propose a sentence generation-based retrieval formulation for attribute recognition that is novel in 1) explicitly modeling a to-be-measured and retrieved object-attribute relation as a conditional probability graph, which converts the recognition problem into a dependency-sensitive language-modeling problem, and 2) applying a large pretrained Vision-Language Model (VLM) to this reformulation and naturally distilling its knowledge of image-object-attribute relations for attribute recognition. Specifically, for each attribute to be recognized in an image, we measure the visual-conditioned probability of generating a short sentence encoding the attribute's relation to objects in the image. Unlike contrastive retrieval, which measures likelihood by globally aligning elements of the sentence to the image, generative retrieval is sensitive to the order and dependency of objects and attributes in the sentence. We demonstrate through experiments that generative retrieval consistently outperforms contrastive retrieval on two visual reasoning datasets, Visual Attributes in the Wild (VAW) and our newly proposed Visual Genome Attribute Ranking (VGARank).
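The generative retrieval idea described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function `toy_next_token_probs`, its probability table, and the sentence template "object is attribute" are all hypothetical stand-ins for a pretrained VLM decoder and for whatever templates the paper actually uses. The sketch only shows the scoring principle, in which each candidate attribute is ranked by the log-probability of generating a short sentence token by token, so every token is conditioned on the image and on the tokens before it.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical interface: (image, prefix tokens) -> next-token distribution.
NextTokenFn = Callable[[str, Tuple[str, ...]], Dict[str, float]]


def toy_next_token_probs(image: str, prefix: Tuple[str, ...]) -> Dict[str, float]:
    """Stand-in for a VLM decoder. The numbers are invented for illustration,
    and this toy ignores the image argument entirely."""
    table = {
        (): {"cat": 0.7, "dog": 0.3},
        ("cat",): {"is": 1.0},
        ("dog",): {"is": 1.0},
        ("cat", "is"): {"fluffy": 0.6, "wet": 0.3, "metallic": 0.1},
        ("dog", "is"): {"fluffy": 0.2, "wet": 0.7, "metallic": 0.1},
    }
    return table[prefix]


def sentence_log_prob(image: str, tokens: Sequence[str], next_token_probs: NextTokenFn) -> float:
    """Sum of log p(token_i | image, token_0..token_{i-1}); order-sensitive by construction."""
    total = 0.0
    for i, token in enumerate(tokens):
        probs = next_token_probs(image, tuple(tokens[:i]))
        total += math.log(probs.get(token, 1e-12))  # small floor for unseen tokens
    return total


def rank_attributes(
    image: str, obj: str, attributes: Sequence[str], next_token_probs: NextTokenFn
) -> List[Tuple[str, float]]:
    """Score the assumed template "<obj> is <attribute>" for each candidate, best first."""
    scored = [
        (attr, sentence_log_prob(image, [obj, "is", attr], next_token_probs))
        for attr in attributes
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranking = rank_attributes("toy_image", "cat", ["wet", "metallic", "fluffy"], toy_next_token_probs)
```

Because the attribute token is scored conditionally on the object token that precedes it, the same attribute receives a different score for "cat" than for "dog" in this toy table, which is the kind of object-attribute dependency a global image-sentence similarity score does not model explicitly.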
CV Nov 28, 2022
Perceive, Ground, Reason, and Act: A Benchmark for General-purpose Visual Representation
Jiangyong Huang, William Yicheng Zhu, Baoxiong Jia et al.
Current computer vision models, unlike the human visual system, cannot yet achieve general-purpose visual understanding. Existing efforts to create a general vision model are limited in the scope of assessed tasks and offer no overarching framework to perform them holistically. We present a new comprehensive benchmark, General-purpose Visual Understanding Evaluation (G-VUE), covering the full spectrum of visual cognitive abilities with four functional domains: Perceive, Ground, Reason, and Act. The four domains are embodied in 11 carefully curated tasks, from 3D reconstruction to visual reasoning and manipulation. Along with the benchmark, we provide a general encoder-decoder framework to allow for the evaluation of arbitrary visual representations on all 11 tasks. We evaluate various pre-trained visual representations with our framework and observe that (1) Transformer-based visual backbones generally outperform CNN-based backbones on G-VUE, and (2) visual representations from vision-language pre-training are superior to those from vision-only pre-training across visual tasks. With G-VUE, we provide a holistic evaluation standard to motivate research toward building general-purpose visual systems via obtaining more general-purpose visual representations.
SOC-PH Apr 3
The Planetary Cost of AI Acceleration, Part II: The 10th Planetary Boundary and the 6.5-Year Countdown
William Yicheng Zhu, Lei Zhu
The recent, super-exponential scaling of autonomous Large Language Model (LLM) agents signals a broader, fundamental paradigm shift from machines replacing human hands (manual labor and mechanical processing) to machines acting as delegates for human minds (thinking, reasoning, and intention). This uncontrolled offloading and scaling of "thinking" itself has profound consequences for humanity's heat balance sheet, since thinking, or intelligence, carries thermodynamic weight. The Earth has already surpassed the heat dissipation threshold required for long-term ecological stability, and projections based on empirical data reveal a concerning trajectory: without radical structural intervention, anthropogenic heat accumulation will breach critical planetary ecological thresholds in less than 6.5 years, even under the most ideal scenario where Earth Energy Imbalance (EEI) holds constant. In this work, we identify six interacting factors that govern the global heat dissipation rate and delineate how their interplay drives society toward one of four macroscopic trajectories: legacy, accelerationist, centrist, or restorative. We propose that the integration of artificial intelligence and its heat dissipation into the planetary system constitutes the 10th planetary boundary (9+1). The core measurement of this new boundary is the net-new waste heat generated by exponential AI growth, balanced against its impact on reducing economic and societal inefficiencies and, through them, baseline anthropogenic waste heat emissions. We demonstrate that managing AI scaling lacks a moderate middle ground: it will either accelerate the imminent breach of critical thermodynamic thresholds, or it will serve as the single most effective lever capable of stabilizing the other planetary boundaries and securing the survival of human civilization.