CY · Mar 6

THETA: A Textual Hybrid Embedding-based Topic Analysis Framework and AI Scientist Agent for Scalable Computational Social Science

arXiv:2603.05972v1 · 11.0 · h-index: 5 · Has Code

Predicted impact: top 27% in CY · last 90 days · Originality: Incremental advance

AI Analysis

The paper addresses the scalability trap in computational social science by enabling more efficient and theoretically rigorous analysis of big social data, though it appears incremental because it builds on existing embedding and agent-based approaches.

The paper tackles the scalability and semantic limitations of traditional topic modeling for social science research by introducing THETA, a framework combining domain-adapted embeddings with an AI agent system; experiments across six domains show it significantly outperforms models like LDA, ETM, and CTM in capturing domain-specific constructs while maintaining superior coherence.

The explosion of big social data has created a scalability trap for traditional qualitative research, as manual coding remains labor-intensive and conventional topic models often suffer from semantic thinning and a lack of domain awareness. This paper introduces Textual Hybrid Embedding-based Topic Analysis (THETA), a novel computational paradigm and open-source tool designed to bridge the gap between massive data scale and rich theoretical depth. THETA moves beyond frequency-based statistics by implementing Domain-Adaptive Fine-tuning (DAFT) via LoRA on foundation embedding models, which effectively optimizes semantic vector structures within specific social contexts to capture latent meanings. To ensure epistemological rigor, we encapsulate this process into an AI Scientist Agent framework, comprising Data Steward, Modeling Analyst, and Domain Expert agents, to simulate the human-in-the-loop expert judgment and constant comparison processes central to grounded theory. Departing from purely computational models, this framework enables agents to iteratively evaluate algorithmic clusters, perform cross-topic semantic alignment, and refine raw outputs into logically consistent theoretical categories. To validate the effectiveness of THETA, we conducted experiments across six domains, including financial regulation and public health. Our results demonstrate that THETA significantly outperforms traditional models, such as LDA, ETM, and CTM, in capturing domain-specific interpretive constructs while maintaining superior coherence. By providing an interactive analysis platform, THETA democratizes advanced natural language processing for social scientists and ensures the trustworthiness and reproducibility of research findings. Code is available at https://github.com/CodeSoul-co/THETA.
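
For readers unfamiliar with LoRA, the abstract's "Domain-Adaptive Fine-tuning (DAFT) via LoRA" rests on a simple idea: the pretrained weight matrix W is frozen, and only a low-rank correction B·A, scaled by alpha/rank, is trained. The sketch below illustrates that general technique in plain Python. It is not taken from the THETA codebase; the function names, toy dimensions, and hyperparameter values are assumptions chosen for illustration only.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def init_lora(d_out, d_in, rank, seed=0):
    """Standard LoRA initialisation: A is small random, B is all zeros."""
    rng = random.Random(seed)
    a = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
    b = [[0.0] * rank for _ in range(d_out)]
    return a, b


def lora_weight(w, a, b, alpha, rank):
    """Return W + (alpha / rank) * B @ A. W itself is never modified."""
    scale = alpha / rank
    delta = matmul(b, a)  # shape (d_out, d_in), rank at most `rank`
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Toy sizes (hypothetical); real embedding layers are far larger.
d_out, d_in, rank, alpha = 4, 6, 2, 4.0
rng = random.Random(1)
w = [[rng.gauss(0.0, 1.0) for _ in range(d_in)] for _ in range(d_out)]

a, b = init_lora(d_out, d_in, rank)
w_start = lora_weight(w, a, b, alpha, rank)

# Because B starts at zero, the adapted layer initially equals the base layer.
unchanged_at_init = (w_start == w)

# Parameter counts: the adapter trains rank * (d_in + d_out) values
# instead of all d_out * d_in entries of W.
trainable = rank * (d_in + d_out)
full = d_out * d_in
```

At these toy sizes the saving is small, but the adapter's parameter count grows linearly with the layer dimensions while the full matrix grows with their product, which is what makes per-domain fine-tuning of a large embedding model affordable.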
