Jay Shim

CL
h-index2
4papers
7citations
Novelty41%
AI Score42

4 Papers

CLJun 3
Query-based Cross-Modal Projector Bolstering Mamba Multimodal LLM

SooHwan Eom, Jay Shim, Gwanhyeong Koo et al.

The Transformer's quadratic complexity with input length imposes an unsustainable computational load on large language models (LLMs). In contrast, the Selective Scan Structured State-Space Model, or Mamba, addresses this computational challenge effectively. This paper explores a query-based cross-modal projector designed to bolster Mamba's efficiency for vision-language modeling by compressing visual tokens based on input through the cross-attention mechanism. This innovative projector also removes the need for manually designing the 2D scan order of original image features when converting them into an input sequence for Mamba LLM. Experimental results across various vision-language understanding benchmarks show that the proposed cross-modal projector enhances Mamba-based multimodal LLMs, boosting both performance and throughput.

LGMar 12Code
Simple Recipe Works: Vision-Language-Action Models are Natural Continual Learners with Reinforcement Learning

Jiaheng Hu, Jay Shim, Chen Tang et al.

Continual Reinforcement Learning (CRL) for Vision-Language-Action (VLA) models is a promising direction toward self-improving embodied agents that can adapt in openended, evolving environments. However, conventional wisdom from continual learning suggests that naive Sequential Fine-Tuning (Seq. FT) leads to catastrophic forgetting, necessitating complex CRL strategies. In this work, we take a step back and conduct a systematic study of CRL for large pretrained VLAs across three models and five challenging lifelong RL benchmarks. We find that, contrary to established belief, simple Seq. FT with low-rank adaptation (LoRA) is remarkably strong: it achieves high plasticity, exhibits little to no forgetting, and retains strong zero-shot generalization, frequently outperforming more sophisticated CRL methods. Through detailed analysis, we show that this robustness arises from a synergy between the large pretrained model, parameter-efficient adaptation, and on-policy RL. Together, these components reshape the stability-plasticity trade-off, making continual adaptation both stable and scalable. Our results position Sequential Fine-Tuning as a powerful method for continual RL with VLAs and provide new insights into lifelong learning in the large model era. Code is available at github.com/UT-Austin-RobIn/continual-vla-rl.

CLJul 4, 2024
Chain-of-Thought Augmentation with Logit Contrast for Enhanced Reasoning in Language Models

Jay Shim, Grant Kruttschnitt, Alyssa Ma et al.

Rapidly increasing model scales coupled with steering methods such as chain-of-thought prompting have led to drastic improvements in language model reasoning. At the same time, models struggle with compositional generalization and are far from human performance on many reasoning-based benchmarks. Leveraging the success of chain-of-thought prompting, and also taking inspiration from context-aware decoding (CAD), we explore input-based contrasting methods to further encourage the type of reasoning induced by chain-of-thought prompting. While work remains to stabilize these results across datasets and models, the improvements we find warrant further investigation into input-based steering methods for context-aware reasoning.

LGOct 29, 2024
Evaluating K-Fold Cross Validation for Transformer Based Symbolic Regression Models

Kaustubh Kislay, Shlok Singh, Soham Joshi et al.

Symbolic Regression remains an NP-Hard problem, with extensive research focusing on AI models for this task. Transformer models have shown promise in Symbolic Regression, but performance suffers with smaller datasets. We propose applying k-fold cross-validation to a transformer-based symbolic regression model trained on a significantly reduced dataset (15,000 data points, down from 500,000). This technique partitions the training data into multiple subsets (folds), iteratively training on some while validating on others. Our aim is to provide an estimate of model generalization and mitigate overfitting issues associated with smaller datasets. Results show that this process improves the model's output consistency and generalization by a relative improvement in validation loss of 53.31%. Potentially enabling more efficient and accessible symbolic regression in resource-constrained environments.