SIMar 12
A large-scale analysis of public-facing, community-built chatbots on Character.AIOwen Lee, Kenneth Joseph
This paper presents the first large-scale analysis of public-facing chatbots on Character.AI, a rapidly growing social media platform where users create and interact with chatbots. Character.AI is distinctive in that it merges generative AI with user-generated content, enabling users to build bots for others to engage with. It is also popular, with over 20 million monthly active users, and impactful, with headlines detailing significant issues with youth engagement on the site. Character.AI is thus of interest to study both substantively and conceptually. To this end, we present a descriptive overview using a dataset of 2.1 million English-language prompts (or "greetings") from chatbots on the site, created by around 1 million users. Our work explores the prevalence of different fandoms on the site, broader tropes that persist across fandoms, and how dynamics of power intersect with gender within greetings. Overall, our findings illuminate an emerging form of online (para)social interaction at a unique and important intersection between generative AI and user-generated content.
SDJun 1, 2025Code
FusionAudio-1.2M: Towards Fine-grained Audio Captioning with Multimodal Contextual FusionShunian Chen, Xinyuan Xie, Zheshu Chen et al.
High-quality, large-scale audio captioning is crucial for advancing audio understanding, yet current automated methods often generate captions that lack fine-grained detail and contextual accuracy, primarily due to their reliance on limited unimodal or superficial multimodal information. Drawing inspiration from human auditory perception, which adeptly integrates cross-modal cues and performs sophisticated auditory scene analysis, we introduce a novel two-stage automated pipeline. This pipeline first employs specialized pretrained models to extract diverse contextual cues (e.g., speech, music, general sounds, and visual information from associated video). A large language model (LLM) then synthesizes these rich, multimodal inputs to generate detailed and context-aware audio captions. Key contributions of this work include: (1) the proposed scalable method for fine-grained audio caption generation; (2) FusionAudio, a new large-scale dataset comprising 1.2 million such detailed captions, combined with 6 million QA pairs; and (3) enhanced audio models developed using FusionAudio, specifically a CLAP-based audio encoder with superior audio-text alignment and instruction following. This paper paves the way for more nuanced and accurate automated understanding of complex audio environments. Code and data can be found in https://github.com/satsuki2486441738/FusionAudio.
CVJan 28Code
Shape of Thought: Progressive Object Assembly via Visual Chain-of-ThoughtYu Huo, Siyu Zhang, Kun Zeng et al.
Multimodal models for text-to-image generation have achieved strong visual fidelity, yet they remain brittle under compositional structural constraints-notably generative numeracy, attribute binding, and part-level relations. To address these challenges, we propose Shape-of-Thought (SoT), a visual CoT framework that enables progressive shape assembly via coherent 2D projections without external engines at inference time. SoT trains a unified multimodal autoregressive model to generate interleaved textual plans and rendered intermediate states, helping the model capture shape-assembly logic without producing explicit geometric representations. To support this paradigm, we introduce SoT-26K, a large-scale dataset of grounded assembly traces derived from part-based CAD hierarchies, and T2S-CompBench, a benchmark for evaluating structural integrity and trace faithfulness. Fine-tuning on SoT-26K achieves 88.4% on component numeracy and 84.8% on structural topology, outperforming text-only baselines by around 20%. SoT establishes a new paradigm for transparent, process-supervised compositional generation. The code is available at https://anonymous.4open.science/r/16FE/. The SoT-26K dataset will be released upon acceptance.