CL · Jun 22, 2025

LLMs for Customized Marketing Content Generation and Evaluation at Scale

arXiv:2506.17863v1 · 9 citations · h-index: 1
Originality: Incremental advance
AI Analysis

This work addresses the need for scalable, effective marketing content generation and evaluation in e-commerce, though its contribution is incremental: it combines existing techniques such as retrieval augmentation and LLM-as-a-Judge.

The paper tackles the problem of generic offsite marketing content by proposing MarketingFM, a retrieval-augmented system for generating keyword-specific ad copy that achieved up to 9% higher CTR and 12% more impressions in online A/B tests. It also introduces AutoEval-Main for automated ad evaluation, which reached 89.57% agreement with human reviewers, and AutoEval-Update, which refines evaluation prompts with minimal human input.
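The headline figures are relative changes and an agreement rate. A minimal sketch of one common way such numbers are computed follows: CTR as clicks over impressions, CPC as cost over clicks, relative lift as (treatment - control) / control, and agreement as the share of ads where an automated verdict matches the human one. The function names and toy counts are hypothetical, and treating agreement as simple percent agreement over binary pass/fail labels is an assumption, not the paper's stated definition.

```python
from typing import Sequence


def ctr(clicks: int, impressions: int) -> float:
    """Click-through rate: clicks per impression (0.0 when there are no impressions)."""
    return clicks / impressions if impressions else 0.0


def cpc(cost: float, clicks: int) -> float:
    """Cost per click (0.0 when there are no clicks)."""
    return cost / clicks if clicks else 0.0


def relative_lift(treatment: float, control: float) -> float:
    """Relative change of treatment over control; 0.09 means +9%, -0.0038 means -0.38%."""
    if control == 0:
        raise ValueError("control metric must be non-zero")
    return (treatment - control) / control


def agreement_rate(auto: Sequence[bool], human: Sequence[bool]) -> float:
    """Share of ads where the automated pass/fail verdict equals the human verdict.

    Assumes simple percent agreement over binary labels (an assumption here,
    not necessarily the paper's definition).
    """
    if len(auto) != len(human) or not auto:
        raise ValueError("need two non-empty label sequences of equal length")
    return sum(a == h for a, h in zip(auto, human)) / len(auto)


if __name__ == "__main__":
    # Toy counts for illustration only; these are not data from the paper.
    control = ctr(clicks=1_000, impressions=100_000)
    treatment = ctr(clicks=1_090, impressions=100_000)
    print(f"CTR lift: {relative_lift(treatment, control):+.2%}")  # CTR lift: +9.00%
    print(agreement_rate([True, True, False, True], [True, False, False, True]))  # 0.75
```

Percent agreement is the simplest choice; a chance-corrected statistic such as Cohen's kappa is a common alternative when pass and fail labels are heavily imbalanced.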

Offsite marketing is essential in e-commerce, enabling businesses to reach customers through external platforms and drive traffic to retail websites. However, most current offsite marketing content is overly generic, template-based, and poorly aligned with landing pages, limiting its effectiveness. To address these limitations, we propose MarketingFM, a retrieval-augmented system that integrates multiple data sources to generate keyword-specific ad copy with minimal human intervention. We validate MarketingFM via offline human and automated evaluations and large-scale online A/B tests. In one experiment, keyword-focused ad copy outperformed templates, achieving up to 9% higher CTR, 12% more impressions, and 0.38% lower CPC, demonstrating gains in ad ranking and cost efficiency. Despite these gains, human review of generated ads remains costly. To address this, we propose AutoEval-Main, an automated evaluation system that combines rule-based metrics with LLM-as-a-Judge techniques to ensure alignment with marketing principles. In experiments with large-scale human annotations, AutoEval-Main achieved 89.57% agreement with human reviewers. Building on this, we propose AutoEval-Update, a cost-efficient LLM-human collaborative framework to dynamically refine evaluation prompts and adapt to shifting criteria with minimal human input. By selectively sampling representative ads for human review and using a critic LLM to generate alignment reports, AutoEval-Update improves evaluation consistency while reducing manual effort. Experiments show the critic LLM suggests meaningful refinements, improving LLM-human agreement. Nonetheless, human oversight remains essential for setting thresholds and validating refinements before deployment.
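AutoEval-Main is described as combining rule-based metrics with LLM-as-a-Judge scoring, and AutoEval-Update as selecting representative ads for human review. The following is a minimal sketch of that general shape under stated assumptions: `judge` is a hypothetical callable standing in for an LLM call, the three rules and `MAX_CHARS` are illustrative, and stratified sampling over judge-score bins is a swapped-in way of picking "representative" ads. It is not the paper's implementation, and the critic-LLM prompt-refinement step is not modeled.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical judge interface: (ad_text, keyword) -> quality score in [0, 1].
# In practice this would wrap an LLM-as-a-Judge prompt; here it is any callable.
JudgeFn = Callable[[str, str], float]

MAX_CHARS = 90  # hypothetical length limit, not taken from the paper


@dataclass
class EvalResult:
    passed: bool
    rule_failures: List[str] = field(default_factory=list)
    judge_score: Optional[float] = None


def rule_checks(ad: str, keyword: str) -> List[str]:
    """Cheap deterministic checks; returns the names of the rules that failed."""
    failures = []
    if len(ad) > MAX_CHARS:
        failures.append("too_long")
    if keyword.lower() not in ad.lower():
        failures.append("missing_keyword")
    if "{" in ad or "}" in ad:
        failures.append("unfilled_template_slot")
    return failures


def evaluate(ad: str, keyword: str, judge: JudgeFn, threshold: float = 0.5) -> EvalResult:
    """Run rules first; only ads that pass every rule are sent to the (costlier) judge."""
    failures = rule_checks(ad, keyword)
    if failures:
        return EvalResult(passed=False, rule_failures=failures)
    score = judge(ad, keyword)
    return EvalResult(passed=score >= threshold, judge_score=score)


def stratified_sample(
    scored: List[Tuple[str, float]], n_bins: int = 5, per_bin: int = 2, seed: int = 0
) -> List[str]:
    """Pick up to `per_bin` ads from each judge-score bin so that human reviewers
    see the whole score range rather than only the most common cases."""
    rng = random.Random(seed)
    bins: Dict[int, List[str]] = {}
    for ad, score in scored:
        idx = min(int(score * n_bins), n_bins - 1)  # a score of 1.0 lands in the top bin
        bins.setdefault(idx, []).append(ad)
    sample: List[str] = []
    for idx in sorted(bins):
        members = bins[idx]
        sample.extend(rng.sample(members, min(per_bin, len(members))))
    return sample


if __name__ == "__main__":
    def toy_judge(ad: str, keyword: str) -> float:
        # Stand-in for an LLM call: a fixed score, for illustration only.
        return 0.8

    ok = evaluate("Shop waterproof hiking boots for every trail", "hiking boots", toy_judge)
    print(ok.passed, ok.judge_score)  # True 0.8
    bad = evaluate("Buy {PRODUCT} now", "hiking boots", toy_judge)
    print(bad.passed, bad.rule_failures)  # False ['missing_keyword', 'unfilled_template_slot']
```

Running the rules before the judge keeps the expensive LLM call off ads that already violate a hard constraint, and stratifying by score is one simple way to make a small human-review batch span the full quality range.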
