CV · Apr 30

Iterative Definition Refinement for Zero-Shot Classification via LLM-Based Semantic Prototype Optimization

arXiv:2604.27335 · 54.5 · Has Code
Predicted impact top 63% in CV · last 90 days · Originality: Incremental advance
AI Analysis

For practitioners of zero-shot classification in dynamic web filtering, this work provides a method to improve classification accuracy without retraining, though the gains are incremental and domain-specific.

The paper proposes a training-free iterative definition refinement framework for zero-shot web content classification that uses LLMs to optimize category definitions, achieving consistent performance improvements across 13 embedding models on a new 10-category benchmark with 1,000 samples per class.

Web filtering systems rely on accurate web content classification to block cyber threats, prevent data exfiltration, and ensure compliance. However, classification is increasingly difficult due to the dynamic and rapidly evolving nature of the modern web. Embedding-based zero-shot approaches map content and category descriptions into a shared semantic space, enabling label assignment without labeled training data, but remain highly sensitive to definition quality. Poorly specified or ambiguous definitions create semantic overlap in the embedding space, leading to systematic misclassification. In this paper, we propose a training-free, adaptive iterative definition refinement framework that improves zero-shot web content classification by progressively optimizing category definitions rather than updating model parameters. Using LLMs as feedback-driven definition optimizers, we investigate three refinement strategies (example-guided, confusion-aware, and history-aware), each of which refines class descriptions using structured signals from misclassified instances. Furthermore, we introduce a human-labeled benchmark of 10 URL categories with 1,000 samples per class and evaluate across 13 state-of-the-art embedding foundation models. Results demonstrate that iterative definition refinement consistently improves classification performance across diverse architectures, establishing definition quality as a critical and underexplored factor in embedding-based systems. The dataset is available at https://github.com/naeemrehmat/B2MWT-10C.
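
The loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `embed` is a toy bag-of-words stand-in for a real embedding model, and `append_text_optimizer` is a hypothetical placeholder for the LLM that rewrites definitions. All function names, the example categories, and the sample texts are assumptions made for illustration only.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: bag-of-words token counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def classify(text, definitions):
    """Zero-shot step: pick the category whose definition is most similar."""
    doc = embed(text)
    return max(definitions, key=lambda c: cosine(doc, embed(definitions[c])))

def misclassified(samples, definitions):
    """Collect (text, true_label, predicted_label) for every error."""
    errors = []
    for text, label in samples:
        pred = classify(text, definitions)
        if pred != label:
            errors.append((text, label, pred))
    return errors

def refine(definitions, samples, optimizer, rounds=3):
    """Iteratively rewrite definitions; no model parameters are updated."""
    for _ in range(rounds):
        errors = misclassified(samples, definitions)
        if not errors:
            break
        definitions = optimizer(definitions, errors)
    return definitions

def append_text_optimizer(definitions, errors):
    """Hypothetical placeholder for the LLM optimizer: it simply appends
    each misclassified text to the definition of its true category."""
    updated = dict(definitions)
    for text, label, _pred in errors:
        updated[label] = updated[label] + " " + text
    return updated

# Assumed toy data, not taken from the paper's benchmark.
definitions = {"news": "site with articles", "shopping": "site with products"}
samples = [
    ("breaking headlines and daily articles", "news"),
    ("buy shoes online cart checkout", "shopping"),
]
refined = refine(definitions, samples, append_text_optimizer)
```

In this toy run the second sample shares no words with either initial definition, so it is misclassified until the refinement step enriches the "shopping" definition. A real system would replace `embed` with one of the evaluated embedding models and the optimizer with an LLM prompt built from the errors (for instance, the confused category pairs for a confusion-aware strategy).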

