LG CV PF Sep 6, 2025

ProfilingAgent: Profiling-Guided Agentic Reasoning for Adaptive Model Optimization

Sadegh Jafari, Aishwarya Sarkar, Mohiuddin Bilwal, Ali Jannesari

arXiv:2509.05584v1 4.1 h-index: 6

Originality: Incremental advance

AI Analysis

This work addresses deployment challenges for foundation models on resource-limited platforms. It offers an incremental improvement over existing compression techniques by integrating profiling into automated pipelines.

The paper targets the compute and memory bottlenecks that hinder deployment of foundation models on resource-limited platforms. It proposes ProfilingAgent, a profiling-guided agentic approach that uses LLMs to automate compression via structured pruning and post-training dynamic quantization. Pruning maintains competitive or improved accuracy (about a 1% drop on ImageNet-1K, +2% gains for ViT-B/16 on smaller datasets), while quantization achieves up to 74% memory savings with <0.5% accuracy loss.

Foundation models face growing compute and memory bottlenecks, hindering deployment on resource-limited platforms. While compression techniques such as pruning and quantization are widely used, most rely on uniform heuristics that ignore architectural and runtime heterogeneity. Profiling tools expose per-layer latency, memory, and compute cost, yet are rarely integrated into automated pipelines. We propose ProfilingAgent, a profiling-guided, agentic approach that uses large language models (LLMs) to automate compression via structured pruning and post-training dynamic quantization. Our modular multi-agent system reasons over static metrics (MACs, parameter counts) and dynamic signals (latency, memory) to design architecture-specific strategies. Unlike heuristic baselines, ProfilingAgent tailors layer-wise decisions to bottlenecks. Experiments on ImageNet-1K, CIFAR-10, and CIFAR-100 with ResNet-101, ViT-B/16, Swin-B, and DeiT-B/16 show that pruning maintains competitive or improved accuracy (about a 1% drop on ImageNet-1K, +2% gains for ViT-B/16 on smaller datasets), while quantization achieves up to 74% memory savings with <0.5% accuracy loss. Quantization also yields consistent inference speedups of up to 1.74×. Comparative studies with GPT-4o and GPT-4-Turbo highlight the importance of LLM reasoning quality for iterative pruning. These results establish agentic systems as scalable solutions for profiling-guided model optimization.
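To make the static metrics in the abstract concrete, here is a minimal sketch of how parameter counts, MACs, and weight-memory savings could be computed per layer. It is not the authors' code. The `LinearLayer` class, the `memory_bytes` helper, and the two example layers are hypothetical, chosen to resemble one ViT-B/16 MLP block (768 and 3072 features, 197 tokens). The sketch assumes that quantization stores weights in 8 bits, keeps biases in 32-bit floats, and ignores the small overhead of scales and zero points.

```python
from dataclasses import dataclass


@dataclass
class LinearLayer:
    """Hypothetical static description of one fully connected layer."""
    name: str
    in_features: int
    out_features: int

    @property
    def weights(self) -> int:
        return self.in_features * self.out_features

    @property
    def params(self) -> int:
        # Weights plus one bias term per output feature.
        return self.weights + self.out_features

    def macs(self, tokens: int) -> int:
        # One multiply-accumulate per weight for every input token.
        return tokens * self.weights


def memory_bytes(layers, quantized=(), fp_bytes=4, int_bytes=1):
    """Parameter memory, storing the weights of `quantized` layers in int8.

    Assumption: biases stay in floating point and quantization metadata
    (scales, zero points) is ignored.
    """
    total = 0
    for layer in layers:
        weight_width = int_bytes if layer.name in quantized else fp_bytes
        total += layer.weights * weight_width + layer.out_features * fp_bytes
    return total


# Illustrative layers only: sized like one ViT-B/16 MLP block.
layers = [
    LinearLayer("mlp.fc1", 768, 3072),
    LinearLayer("mlp.fc2", 3072, 768),
]
tokens = 197  # 196 patches plus one class token at 224x224 input

# Static profile: rank layers by compute cost.
ranked = sorted(layers, key=lambda l: l.macs(tokens), reverse=True)
for layer in ranked:
    print(f"{layer.name}: params={layer.params:,} MACs={layer.macs(tokens):,}")

baseline = memory_bytes(layers)
compressed = memory_bytes(layers, quantized={"mlp.fc1", "mlp.fc2"})
savings = 1 - compressed / baseline
print(f"estimated weight-memory savings: {savings:.1%}")
```

Under these assumptions, moving weights from 32-bit to 8-bit storage bounds the savings at 75%, which is the same range as the reported figure of up to 74%. Dynamic signals such as latency and peak memory cannot be derived this way and would have to be measured on the target hardware.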
