IR, AI, Jan 16

Learn Before Represent: Bridging Generative and Contrastive Learning for Domain-Specific LLM Embeddings

arXiv:2601.11124v1, 1 citation, h-index: 7
Originality: Highly original
AI Analysis

This work addresses the challenge of building accurate and robust LLM embeddings for specialized domains, and it proposes a new paradigm rather than an incremental improvement.

LLM embeddings struggle in vertical domains such as chemistry and law. The paper addresses this with a two-stage framework, Learn Before Represent (LBR), which first injects domain knowledge via generative learning and then refines the resulting representations with contrastive learning. The authors report that LBR significantly outperforms strong baselines on medical, chemistry, and code retrieval tasks.

Large Language Models (LLMs) adapted via contrastive learning excel in general representation learning but struggle in vertical domains like chemistry and law, primarily due to a lack of domain-specific knowledge. This work identifies a core bottleneck: the prevailing "LLM+CL" paradigm focuses on semantic alignment but cannot perform knowledge acquisition, leading to failures on specialized terminology. To bridge this gap, we propose Learn Before Represent (LBR), a novel two-stage framework. LBR first injects domain knowledge via an Information Bottleneck-Constrained Generative Learning stage, preserving the LLM's causal attention to maximize knowledge acquisition while compressing semantics. It then performs Generative-Refined Contrastive Learning on the compressed representations for alignment. This approach maintains architectural consistency and resolves the objective conflict between generative and contrastive learning. Extensive experiments on medical, chemistry, and code retrieval tasks show that LBR significantly outperforms strong baselines. Our work establishes a new paradigm for building accurate and robust representations in vertical domains.
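
The abstract does not spell out the contrastive objective used in the second stage. As a rough illustration of what "contrastive learning for alignment" usually means in embedding work, the sketch below computes a generic InfoNCE loss with in-batch negatives over toy vectors. It is an assumption for illustration only, not the paper's loss: the function names (`cosine`, `info_nce`), the temperature value, and the toy embeddings are all hypothetical, and the Information Bottleneck stage is not modeled at all.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(queries, docs, temperature=0.05):
    """Mean InfoNCE loss with in-batch negatives.

    docs[i] is treated as the positive for queries[i]; every other
    document in the batch acts as a negative. The temperature is an
    assumed value, not one taken from the paper.
    """
    losses = []
    for i, q in enumerate(queries):
        logits = [cosine(q, d) / temperature for d in docs]
        # Numerically stable log-sum-exp over the whole batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        # Cross-entropy with the positive at index i.
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)

# Toy embeddings (hypothetical): two queries, two documents.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(queries, [[1.0, 0.0], [0.0, 1.0]])  # positives match
swapped = info_nce(queries, [[0.0, 1.0], [1.0, 0.0]])  # positives mismatched
print(aligned < swapped)
```

The loss is low when each query is closest to its own document and high when it is closer to another document in the batch, which is the alignment pressure a contrastive stage applies. In the framework described above, that pressure would act on representations that have already absorbed domain knowledge during the generative stage.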

