CLOct 21, 2022
Syntax-guided Localized Self-attention by Constituency Syntactic DistanceShengyuan Hou, Jushi Kai, Haotian Xue et al. · gatech, meta-ai
Recent works have revealed that Transformers are implicitly learning the syntactic information in its lower layers from data, albeit is highly dependent on the quality and scale of the training data. However, learning syntactic information from data is not necessary if we can leverage an external syntactic parser, which provides better parsing quality with well-defined syntactic structures. This could potentially improve Transformer's performance and sample efficiency. In this work, we propose a syntax-guided localized self-attention for Transformer that allows directly incorporating grammar structures from an external constituency parser. It prohibits the attention mechanism to overweight the grammatically distant tokens over close ones. Experimental results show that our model could consistently improve translation performance on a variety of machine translation datasets, ranging from small to large dataset sizes, and with different source languages.
AIFeb 11, 2024Code
ITINERA: Integrating Spatial Optimization with Large Language Models for Open-domain Urban Itinerary PlanningYihong Tang, Zhaokai Wang, Ao Qu et al. · mit
Citywalk, a recently popular form of urban travel, requires genuine personalization and understanding of fine-grained requests compared to traditional itinerary planning. In this paper, we introduce the novel task of Open-domain Urban Itinerary Planning (OUIP), which generates personalized urban itineraries from user requests in natural language. We then present ITINERA, an OUIP system that integrates spatial optimization with large language models to provide customized urban itineraries based on user needs. This involves decomposing user requests, selecting candidate points of interest (POIs), ordering the POIs based on cluster-aware spatial optimization, and generating the itinerary. Experiments on real-world datasets and the performance of the deployed system demonstrate our system's capacity to deliver personalized and spatially coherent itineraries compared to current solutions. Source codes of ITINERA are available at https://github.com/YihongT/ITINERA.
CLJan 11, 2024
SH2: Self-Highlighted Hesitation Helps You Decode More TruthfullyJushi Kai, Tianhang Zhang, Hai Hu et al.
Large language models (LLMs) demonstrate great performance in text generation. However, LLMs are still suffering from hallucinations. In this work, we propose an inference-time method, Self-Highlighted Hesitation (SH2), to help LLMs decode more truthfully. SH2 is based on a simple fact rooted in information theory that for an LLM, the tokens predicted with lower probabilities are prone to be more informative than others. Our analysis shows that the tokens assigned with lower probabilities by an LLM are more likely to be closely related to factual information, such as nouns, proper nouns, and adjectives. Therefore, we propose to ''highlight'' the factual information by selecting the tokens with the lowest probabilities and concatenating them to the original context, thus forcing the model to repeatedly read and hesitate on these tokens before generation. During decoding, we also adopt contrastive decoding to emphasize the difference in the output probabilities brought by the hesitation. Experimental results demonstrate that our SH2, requiring no additional data or models, can effectively help LLMs elicit factual knowledge and distinguish hallucinated contexts. Significant and consistent improvements are achieved by SH2 for LLaMA-7b, LLaMA2-7b and Mistral-7b on multiple hallucination tasks.
CLMay 1, 2025
FreqKV: Frequency Domain Key-Value Compression for Efficient Context Window ExtensionJushi Kai, Boyi Zeng, Yixuan Wang et al.
Frequency-domain compression has proven effective in reducing redundancies for spatial signals. In this work, we propose FreqKV, a novel frequency domain key-value (KV) compression technique that enables efficient context window extension for decoder-only large language models (LLMs). Our approach is motivated by a key observation that, in the frequency domain, the energy distribution of the KV cache is predominantly concentrated in low-frequency components. By discarding high-frequency components, we achieve efficient compression of the KV cache with minimal information loss. FreqKV iteratively compresses the increasing KV cache to a fixed size in the frequency domain, allowing models to process lengthy contexts efficiently. Introducing no additional parameters or architectural modifications, FreqKV is applicable to both fine-tuning and inference. With minimal fine-tuning, LLMs can learn to leverage the limited cache that is compressed in the frequency domain and extend the context window. Experiments on a range of long context language modeling and understanding tasks demonstrate the efficiency and effectiveness of the proposed method.
LGFeb 11
Towards Compressive and Scalable Recurrent MemoryYunchong Song, Jushi Kai, Liming Lu et al.
Transformers face a quadratic bottleneck in attention when scaling to long contexts. Recent approaches introduce recurrent memory to extend context beyond the current window, yet these often face a fundamental trade-off between theoretical principles and practical scalability. To address this, we introduce Elastic Memory, a novel memory architecture grounded in the HiPPO framework for online function approximation. Elastic Memory treats historical sequence as samples from continuous signals, applying optimal online compression to encode them into a fixed-size memory state. For retrieval, we propose a flexible \textit{polynomial sampling} mechanism that reconstructs a history summary from this compressed state. Elastic Memory consistently outperformed baselines on long-context (32k+) datasets across three domains. With equal parameters, it beat Memorizing Transformer by 16x memory and outperformed Melodi at all memory sizes, even when Melodi had 30% more parameters. When scaling model size, Elastic Memory stayed ahead of all baselines and was significantly faster than Melodi at 4x size. Furthermore, its decoupled design allows for injecting inductive biases at test-time to boost performance.
CVAug 8, 2025
Fourier-VLM: Compressing Vision Tokens in the Frequency Domain for Large Vision-Language ModelsHuanyu Wang, Jushi Kai, Haoli Bai et al.
Vision-Language Models (VLMs) typically replace the predefined image placeholder token (<image>) in textual instructions with visual features from an image encoder, forming the input to a backbone Large Language Model (LLM). However, the large number of vision tokens significantly increases the context length, leading to high computational overhead and inference latency. While previous efforts mitigate this by selecting only important visual features or leveraging learnable queries to reduce token count, they often compromise performance or introduce substantial extra costs. In response, we propose Fourier-VLM, a simple yet efficient method that compresses visual representations in the frequency domain. Our approach is motivated by the observation that vision features output from the vision encoder exhibit concentrated energy in low-frequency components. Leveraging this, we apply a low-pass filter to the vision features using a two-dimensional Discrete Cosine Transform (DCT). Notably, the DCT is efficiently computed via the Fast Fourier Transform (FFT) operator with a time complexity of $\mathcal{O}(n\log n)$, minimizing the extra computational cost while introducing no additional parameters. Extensive experiments across various image-based benchmarks demonstrate that Fourier-VLM achieves competitive performance with strong generalizability across both LLaVA and Qwen-VL architectures. Crucially, it reduce inference FLOPs by up to 83.8% and boots generation speed by 31.2% compared to LLaVA-v1.5, highlighting the superior efficiency and practicality.
CLAug 3, 2025
MLP Memory: A Retriever-Pretrained Memory for Large Language ModelsRubin Wei, Jiaqi Cao, Jiarui Wang et al.
Modern approaches to enhancing Large Language Models' factual accuracy and knowledge utilization face a fundamental trade-off: non-parametric retrieval-augmented generation (RAG) provides flexible access to external knowledge but suffers from high inference latency and shallow integration, while parametric fine-tuning methods like LoRA risk catastrophic forgetting and degraded general capabilities. In this work, we propose MLP Memory, a lightweight parametric module that learns to internalize retrieval patterns without explicit document access. By pretraining an MLP to imitate a $k$NN retriever's behavior on the entire pretraining dataset, we create a differentiable memory component that captures the benefits of retrieval-based knowledge access in a fully parametric form. Our architecture integrates this pretrained MLP Memory with Transformer decoders through simple probability interpolation, yielding 17.5\% and 24.1\% scaling gains on WikiText-103 and Web datasets, respectively. It further achieves 12.3\% relative improvement on five question-answering benchmarks and 5.2 points absolute gain across nine general NLP tasks, while reducing hallucinations by up to 10 points on HaluEval. Moreover, MLP Memory delivers 2.5$\times$ faster inference than RAG with superior accuracy. Our findings show that learning retrieval patterns parametrically bridges the gap between efficient inference and effective knowledge access, offering a practical alternative to both RAG and fine-tuning approaches.