Woosang Lim

LG
h-index4
9papers
196citations
Novelty52%
AI Score58

9 Papers

LGSep 2, 2024Code
Efficient and Scalable Estimation of Tool Representations in Vector Space

Suhong Moon, Siddharth Jha, Lutfi Eren Erdogan et al.

Recent advancements in function calling and tool use have significantly enhanced the capabilities of large language models (LLMs) by enabling them to interact with external information sources and execute complex tasks. However, the limited context window of LLMs presents challenges when a large number of tools are available, necessitating efficient methods to manage prompt length and maintain accuracy. Existing approaches, such as fine-tuning LLMs or leveraging their reasoning capabilities, either require frequent retraining or incur significant latency overhead. A more efficient solution involves training smaller models to retrieve the most relevant tools for a given query, although this requires high quality, domain-specific data. To address those challenges, we present a novel framework for generating synthetic data for tool retrieval applications and an efficient data-driven tool retrieval strategy using small encoder models. Empowered by LLMs, we create ToolBank, a new tool retrieval dataset that reflects real human user usages. For tool retrieval methodologies, we propose novel approaches: (1) Tool2Vec: usage-driven tool embedding generation for tool retrieval, (2) ToolRefiner: a staged retrieval method that iteratively improves the quality of retrieved tools, and (3) MLC: framing tool retrieval as a multi-label classification problem. With these new methods, we achieve improvements of up to 27.28 in Recall@K on the ToolBench dataset and 30.5 in Recall@K on ToolBank. Additionally, we present further experimental results to rigorously validate our methods. Our code is available at \url{https://github.com/SqueezeAILab/Tool2Vec}

CLJul 6, 2024Code
MMSci: A Dataset for Graduate-Level Multi-Discipline Multimodal Scientific Understanding

Zekun Li, Xianjun Yang, Kyuri Choi et al.

Scientific figure interpretation is a crucial capability for AI-driven scientific assistants built on advanced Large Vision Language Models. However, current datasets and benchmarks primarily focus on simple charts or other relatively straightforward figures from limited science domains. To address this gap, we present a comprehensive dataset compiled from peer-reviewed Nature Communications articles covering 72 scientific fields, encompassing complex visualizations such as schematic diagrams, microscopic images, and experimental data which require graduate-level expertise to interpret. We evaluated 19 proprietary and open-source models on two benchmark tasks, figure captioning and multiple-choice, and conducted human expert annotation. Our analysis revealed significant task challenges and performance gaps among models. Beyond serving as a benchmark, this dataset serves as a valuable resource for large-scale training. Fine-tuning Qwen2-VL-7B with our task-specific data achieved better performance than GPT-4o and even human experts in multiple-choice evaluations. Furthermore, continuous pre-training on our interleaved article and figure data substantially enhanced the model's downstream task performance in materials science. We have released our dataset to support further research.

LGMay 27
Locality-Aware Redundancy Pruning for LLM Depth Compression

Vincent-Daniel Yun, Youngrae Kim, Woosang Lim et al.

Large language models are known to contain representational redundancy across network depth, making depth pruning an effective approach for improving inference efficiency. Existing one-shot pruning methods rely on local layer importance or fixed redundancy assumptions across architectures. We propose Locality-Aware Redundancy Pruning (LoRP), a training-free one-shot depth pruning framework guided by representation locality. We show that inter-layer redundancy can be either localized or globally distributed depending on the LLM architecture. To characterize this phenomenon, we introduce Representation Locality Score (RLS), derived from global inter-layer hidden-state similarity. Using a small calibration set, LoRP computes pairwise layer similarity, clusters layers by representational similarity, and allocates pruning according to residual intra-cluster redundancy. Experiments across diverse LLM families show improvements in both perplexity and downstream task accuracy.

CLMar 19Code
EntropyCache: Decoded Token Entropy Guided KV Caching for Diffusion Language Models

Minsoo Cheong, Donghyun Son, Woosang Lim et al.

Diffusion-based large language models (dLLMs) rely on bidirectional attention, which prevents lossless KV caching and requires a full forward pass at every denoising step. Existing approximate KV caching methods reduce this cost by selectively updating cached states, but their decision overhead scales with context length or model depth. We propose EntropyCache, a training-free KV caching method that uses the maximum entropy of newly decoded token distributions as a constant-cost signal for deciding when to recompute. Our design is grounded in two empirical observations: (1) decoded token entropy correlates with KV cache drift, providing a cheap proxy for cache staleness, and (2) feature volatility of decoded tokens persists for multiple steps after unmasking, motivating recomputation of the $k$ most recently decoded tokens. The skip-or-recompute decision requires only $O(V)$ computation per step, independent of context length and model scale. Experiments on LLaDA-8B-Instruct and Dream-7B-Instruct show that EntropyCache achieves $15.2\times$-$26.4\times$ speedup on standard benchmarks and $22.4\times$-$24.1\times$ on chain-of-thought benchmarks, with competitive accuracy and decision overhead accounting for only $0.5\%$ of inference time. Code is available at https://github.com/mscheong01/EntropyCache.

LGApr 27
Rethinking Layer Redundancy in Large Language Models: Calibration Objectives and Search for Depth Pruning

Minkyu Kim, Vincent-Daniel Yun, Youngrae Kim et al.

Depth pruning improves the inference efficiency of large language models by removing Transformer blocks. Prior work has focused on importance criteria and search algorithms, often treating layer redundancy as an inherent structural property of pretrained networks. In contrast, we adopt a \emph{functional perspective}, where redundancy is jointly influenced by the model and the evaluation objective, suggesting that a universal ranking may not be sufficient. Through an empirical study across three LLM families, two calibration objectives, and seven search algorithms, we observe that different objectives yield qualitatively different redundant layers, and that perplexity and downstream accuracy rankings do not consistently align. Under a fixed objective, however, search algorithms tend to produce similar solutions. Overall, our results suggest that the calibration objective may play a more influential role than the choice of search algorithm, indicating that further attention to objective design could be beneficial.

CLMay 10, 2025Code
MacRAG: Compress, Slice, and Scale-up for Multi-Scale Adaptive Context RAG

Woosang Lim, Zekun Li, Gyuwan Kim et al.

Long-context large language models (LC LLMs) combined with retrieval-augmented generation (RAG) hold strong potential for complex multi-hop and large-document tasks. However, existing RAG systems often suffer from imprecise retrieval, incomplete context coverage under constrained windows, and fragmented information from suboptimal context construction. We introduce Multi-scale Adaptive Context RAG (MacRAG), a hierarchical RAG framework that compresses and partitions documents into coarse-to-fine granularities, then adaptively merges relevant contexts through real-time chunk- and document-level expansions. By initiating with finest-level retrieval and progressively incorporating broader, higher-level context, MacRAG constructs effective query-specific long contexts, optimizing both precision and coverage. Evaluations on challenging LongBench expansions of HotpotQA, 2WikiMultihopQA, and Musique confirm MacRAG consistently surpasses baseline RAG pipelines in single- and multi-step generation using Llama-3.1-8B, Gemini-1.5-pro, and GPT-4o. Our results establish MacRAG as an efficient, scalable solution for real-world long-context, multi-hop reasoning. Our code is available at https://github.com/Leezekun/MacRAG.

LGFeb 17, 2025Code
Rotate, Clip, and Partition: Towards W2A4KV4 Quantization by Integrating Rotation and Learnable Non-uniform Quantizer

Euntae Choi, Sumin Song, Woosang Lim et al.

We propose Rotate, Clip, and Partition (RCP), a quantization-aware training (QAT) approach that first realizes extreme compression of LLMs with W2A4KV4(2-bit weight, 4-bit activation, and 4-bit KV cache) configuration. RCP integrates recent rotation techniques with a novel non-uniform weight quantizer design, by quantitatively analyzing the impact of random rotation on 2-bit weight quantization. Our weight quantizer features Learnable Direct Partitioning (LDP), which introduces learnable parameters to directly learn non-uniform intervals jointly with LLM weights. We also present a specialized GPU kernel that supports GEMV on non-uniform W2A4. Experiments show that RCP can compress LLaMA-2-7B to W2A4KV4 with a loss of only 2.84 WikiText2 ppl and 5.29 times reduced memory footprint. Furthermore, RCP can quantize challenging mobile-targeted LLaMA-3.2 models and domain-specific WizardCoder-7B and MetaMath-7B with no critical problems such as convergence failure and repetition. Code is available at https://github.com/ songsm921/RCP.

LGMay 2, 2025
Grouped Sequency-arranged Rotation: Optimizing Rotation Transformation for Quantization for Free

Euntae Choi, Sumin Song, Woosang Lim et al.

Large Language Models (LLMs) face deployment challenges due to high computational costs, and while Post-Training Quantization (PTQ) offers a solution, existing rotation-based methods struggle at very low bit-widths like 2-bit. We introduce a novel, training-free approach to construct an improved rotation matrix, addressing the limitations of current methods. The key contributions include leveraging the Walsh-Hadamard transform with sequency ordering, which clusters similar frequency components to reduce quantization error compared to standard Hadamard matrices, significantly improving performance. Furthermore, we propose a Grouped Sequency-arranged Rotation (GSR) using block-diagonal matrices with smaller Walsh blocks, effectively isolating outlier impacts and achieving performance comparable to optimization-based methods without requiring any training. Our method demonstrates robust performance on reasoning tasks and Perplexity (PPL) score on WikiText-2. Our method also enhances results even when applied over existing learned rotation techniques.

CVOct 14, 2016
Hadamard Product for Low-rank Bilinear Pooling

Jin-Hwa Kim, Kyoung-Woon On, Woosang Lim et al.

Bilinear models provide rich representations compared with linear models. They have been applied in various visual tasks, such as object recognition, segmentation, and visual question-answering, to get state-of-the-art performances taking advantage of the expanded representations. However, bilinear representations tend to be high-dimensional, limiting the applicability to computationally complex tasks. We propose low-rank bilinear pooling using Hadamard product for an efficient attention mechanism of multimodal learning. We show that our model outperforms compact bilinear pooling in visual question-answering tasks with the state-of-the-art results on the VQA dataset, having a better parsimonious property.