CL Jun 5, 2023
Skill over Scale: The Case for Medium, Domain-Specific Models for SE
Manisha Mukherjee, Vincent J. Hellendoorn
Recent advancements in AI have sparked a trend in constructing large, generalist language models that handle a multitude of tasks, including many code-related ones. While these models are expensive to train and are often closed-source, they have enjoyed broad adoption because they tend to outperform smaller, domain-specific models of code. In this work, we argue that this is not a foregone conclusion. We show that modestly sized domain-specific models can outperform much larger ones on code labeling tasks, provided they are trained to the same standards. Concretely, we focus on StackOverflow (SO), which offers large volumes of aligned code and text data. We align established best practices for pre-training large language models with properties of SO as a data source, especially using a large context window (2,048 tokens), coupled with a powerful toolkit (Megatron-LM) to train two models: SOBertBase (125M parameters) and SOBertLarge (762M parameters), at a budget of just $374 and $1,600, respectively. We compare the performance of our models with a prior domain-specific model which did not adopt many of these practices (BERTOverflow), as well as two general-purpose BERT models and two models in OpenAI's GPT series (GPT-3.5 and GPT-4). We study four labeling tasks: question quality prediction, closed question prediction, named entity recognition (NER), and obsoletion prediction. The final task is a new benchmark we introduce, on which we additionally compare SOBert with a fine-tuned CodeLlama and StackLlama (models with 10x more parameters than SOBertLarge). Our models consistently outperform all baselines. In contrast, BERTOverflow is outperformed by generalist models in most tasks. These results demonstrate that pre-training both extensively and properly on in-domain data can yield a powerful and affordable alternative to leveraging closed-source general-purpose models. Both models are released to the public on Hugging Face.
49.2 AI Apr 15
Reward Design for Physical Reasoning in Vision-Language Models
Derek Lilienthal, Manisha Mukherjee, Sameera Horawalavithana
Physical reasoning over visual inputs demands tight integration of visual perception, domain knowledge, and multi-step symbolic inference. Yet even state-of-the-art Vision Language Models (VLMs) fall far short of human performance on physics benchmarks. While post-training algorithms such as Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO) have demonstrated strong reasoning gains in language models, how reward design shapes VLM physical reasoning behavior remains poorly understood. We present a systematic reward ablation study for GRPO-based VLM training on physical reasoning. We compare four reward signals of increasing semantic richness: format compliance, answer accuracy, a composite rubric reward (answer correctness, physics principle identification, and unit consistency), and a novel internal reward derived from model attention weights over input image regions. We evaluate on PhyX, a 3,000-problem benchmark spanning six physics domains and six reasoning types across multiple-choice and open-ended formats, using IBM Granite Vision 3.3 (2B). Across both formats, GRPO with accuracy-based rewards outperforms SFT on most domains, though gains vary substantially by reward type and domain. Reward design does not uniformly improve performance. Instead, it induces domain-specific reasoning behaviors. Accuracy-based rewards provide the strongest overall gains. Rubric rewards improve structured reasoning quality without consistent accuracy improvements. Attention-based rewards enhance spatial reasoning while degrading performance in symbolic domains. Our internal attention-weight reward requires no spatial annotations and improves spatial relation accuracy from 0.27 to 0.50, suggesting that supervising where the model attends during generation is a promising direction for visually grounded physical reasoning.
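As a rough illustration of the reward signals named in this abstract, the following Python sketch scores completions with a format reward, an accuracy reward, and a composite rubric reward, then converts a group of rewards into group-relative advantages in the way GRPO is usually described (reward minus group mean, divided by group standard deviation). The `<answer>` tag convention, the substring checks, the rubric weights, and all function names are assumptions made for illustration and are not taken from the paper; the attention-based reward is omitted because the abstract does not specify how it is computed.

```python
import re
from statistics import mean, pstdev

# Assumed output convention: the final answer is wrapped in <answer> tags.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def extract_answer(completion):
    """Return the text inside <answer>...</answer>, or None if absent."""
    match = ANSWER_RE.search(completion)
    return match.group(1).strip() if match else None


def format_reward(completion):
    """1.0 if the completion follows the assumed answer format, else 0.0."""
    return 1.0 if extract_answer(completion) is not None else 0.0


def accuracy_reward(completion, gold):
    """1.0 if the extracted answer matches the gold answer (case-insensitive)."""
    answer = extract_answer(completion)
    return 1.0 if answer is not None and answer.lower() == gold.strip().lower() else 0.0


def rubric_reward(completion, gold, principle, unit, weights=(0.6, 0.2, 0.2)):
    """Weighted sum of answer correctness, principle mention, and unit match.

    The weights and the simple string checks are hypothetical placeholders.
    """
    answer = extract_answer(completion) or ""
    correct = accuracy_reward(completion, gold)
    names_principle = 1.0 if principle.lower() in completion.lower() else 0.0
    unit_ok = 1.0 if answer.endswith(unit) else 0.0
    w_correct, w_principle, w_unit = weights
    return w_correct * correct + w_principle * names_principle + w_unit * unit_ok


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    group = [
        "Newton's second law gives F = ma. <answer>20 N</answer>",
        "<answer>20 J</answer>",
        "The force is 20 N.",
    ]
    rewards = [rubric_reward(c, "20 N", "Newton's second law", "N") for c in group]
    print(rewards)
    print(group_relative_advantages(rewards))
```

Because advantages are computed relative to the other completions sampled for the same prompt, a group in which every completion receives the same reward yields zero advantage for all of them, which is one reason the choice of reward signal can matter so much.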
SE Mar 2
Inference-Time Safety For Code LLMs Via Retrieval-Augmented Revision
Manisha Mukherjee, Vincent J. Hellendoorn
Large Language Models (LLMs) are increasingly deployed for code generation in high-stakes software development, yet their limited transparency in security reasoning and brittleness to evolving vulnerability patterns raise critical trustworthiness concerns. Models trained on static datasets cannot readily adapt to newly discovered vulnerabilities or changing security standards without retraining, leading to the repeated generation of unsafe code. We present a principled approach to trustworthy code generation by design that operates as an inference-time safety mechanism. Our approach employs retrieval-augmented generation to surface relevant security risks in generated code and retrieve related security discussions from a curated Stack Overflow knowledge base, which are then used to guide an LLM during code revision. This design emphasizes three aspects relevant to trustworthiness: (1) interpretability, through transparent safety interventions grounded in expert community explanations; (2) robustness, by allowing adaptation to evolving security practices without model retraining; and (3) safety alignment, through real-time intervention before unsafe code reaches deployment. Across real-world and benchmark datasets, our approach improves the security of LLM-generated code compared to prompting alone, while introducing no new vulnerabilities as measured by static analysis. These results suggest that principled, retrieval-augmented inference-time interventions can serve as a complementary mechanism for improving the safety of LLM-based code generation, and highlight the ongoing value of community knowledge in supporting trustworthy AI deployment.
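The pipeline described in this abstract (flag risks in generated code, retrieve related discussions, guide a revision) can be sketched in Python as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the substring-based risk patterns, the token-overlap retriever, the `Discussion` type, and the prompt wording are all hypothetical stand-ins, and the actual call to an LLM is left out.

```python
import re
from dataclasses import dataclass


@dataclass
class Discussion:
    """One entry in a (hypothetical) curated knowledge base of security posts."""
    title: str
    body: str


# Hypothetical substring patterns standing in for a real risk detector.
RISK_PATTERNS = {
    "eval(": "arbitrary code execution via eval",
    "shell=True": "shell injection via subprocess",
    "hashlib.md5": "weak hash function md5",
}


def detect_risks(code):
    """Return descriptions of every risk pattern found in the code."""
    return [desc for pattern, desc in RISK_PATTERNS.items() if pattern in code]


def tokenize(text):
    """Lowercase word tokens, used for the toy overlap-based retriever."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def retrieve(query, knowledge_base, k=2):
    """Rank discussions by token overlap with the query; drop zero-overlap ones."""
    query_tokens = tokenize(query)
    scored = []
    for doc in knowledge_base:
        overlap = len(query_tokens & tokenize(doc.title + " " + doc.body))
        if overlap > 0:
            scored.append((overlap, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k]]


def build_revision_prompt(code, knowledge_base):
    """Build a revision prompt grounded in retrieved discussions.

    Returns None when no risk is flagged, so unflagged code is left untouched.
    """
    risks = detect_risks(code)
    if not risks:
        return None
    docs = retrieve(" ".join(risks), knowledge_base)
    context = "\n".join(f"- {d.title}: {d.body}" for d in docs)
    return (
        "Revise the code to address these risks: " + "; ".join(risks) + "\n"
        "Relevant community discussions:\n" + context + "\n"
        "Code:\n" + code
    )


if __name__ == "__main__":
    kb = [
        Discussion("Why is shell=True dangerous in subprocess?",
                   "It enables shell injection when input is untrusted."),
        Discussion("How to center a div", "Use flexbox."),
    ]
    print(build_revision_prompt("subprocess.run(cmd, shell=True)", kb))
```

Keeping retrieval separate from the model is what would let such a knowledge base be updated as security practices change without retraining, and including the retrieved discussions in the prompt makes the reason for each revision visible to the developer.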