CLAIHCSep 22, 2025

HICode: Hierarchical Inductive Coding with LLMs

arXiv:2509.17946v19 citationsh-index: 3EMNLP
Originality Incremental advance
AI Analysis

This addresses the scalability issue for researchers conducting nuanced text analysis, offering a tool to automate manual labeling processes.

The paper tackles the problem of scaling fine-grained corpus analysis by proposing HICode, a hierarchical inductive coding pipeline using LLMs, which shows alignment with human themes and reveals aggressive marketing strategies in opioid litigation documents.

Despite numerous applications for fine-grained corpus analysis, researchers continue to rely on manual labeling, which does not scale, or statistical tools like topic modeling, which are difficult to control. We propose that LLMs have the potential to scale the nuanced analyses that researchers typically conduct manually to large text corpora. To this effect, inspired by qualitative research methods, we develop HICode, a two-part pipeline that first inductively generates labels directly from analysis data and then hierarchically clusters them to surface emergent themes. We validate this approach across three diverse datasets by measuring alignment with human-constructed themes and demonstrating its robustness through automated and human evaluations. Finally, we conduct a case study of litigation documents related to the ongoing opioid crisis in the U.S., revealing aggressive marketing strategies employed by pharmaceutical companies and demonstrating HICode's potential for facilitating nuanced analyses in large-scale data.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes