Francesca Lucchetti

AI
h-index9
5papers
135citations
Novelty52%
AI Score31

5 Papers

LGJul 18, 2024Code
NNsight and NDIF: Democratizing Access to Open-Weight Foundation Model Internals

Jaden Fiotto-Kaufman, Alexander R. Loftus, Eric Todd et al.

We introduce NNsight and NDIF, technologies that work in tandem to enable scientific study of the representations and computations learned by very large neural networks. NNsight is an open-source system that extends PyTorch to introduce deferred remote execution. The National Deep Inference Fabric (NDIF) is a scalable inference service that executes NNsight requests, allowing users to share GPU resources and pretrained models. These technologies are enabled by the Intervention Graph, an architecture developed to decouple experimental design from model runtime. Together, this framework provides transparent and efficient access to the internals of deep neural networks such as very large language models (LLMs) without imposing the cost or complexity of hosting customized models individually. We conduct a quantitative survey of the machine learning literature that reveals a growing gap in the study of the internals of large-scale AI. We demonstrate the design and use of our framework to address this gap by enabling a range of research methods on huge models. Finally, we conduct benchmarks to compare performance with previous approaches. Code, documentation, and tutorials are available at https://nnsight.net/.

PLAug 19, 2023
Knowledge Transfer from High-Resource to Low-Resource Programming Languages for Code LLMs

Federico Cassano, John Gouwar, Francesca Lucchetti et al.

Over the past few years, Large Language Models of Code (Code LLMs) have started to have a significant impact on programming practice. Code LLMs are also emerging as building blocks for research in programming languages and software engineering. However, Code LLMs produce impressive results on programming languages that are well represented in their training data (e.g., Java, Python, or JavaScript), but struggle with low-resource languages that have limited training data available. Low resource languages include OCaml, Racket, and several others. This paper presents an effective approach for boosting the performance of Code LLMs on low-resource languages using semi-synthetic data. Our approach, MultiPL-T, translates training data from high-resource languages into training data for low-resource languages in the following way. 1) We use a Code LLM to synthesize tests for commented code from a high-resource language, filtering out faulty tests and code with low test coverage. 2) We use a Code LLM to translate Python code to a target low-resource language, and use tests to validate the translation. We apply this approach to generate tens of thousands of validated training items for Julia, Lua, OCaml, R, and Racket. Furthermore, we use an open model (StarCoderBase) with open training data (The Stack), which allows us to decontaminate benchmarks, train models without violating licenses, and run experiments that could not otherwise be done. With MultiPL-T generated data, we present fine-tuned versions of StarCoderBase and Code Llama for Julia, Lua, OCaml, R, and Racket. On established benchmarks (MultiPL-E), these models outperform other open Code LLMs. The MultiPL-T approach is easy to apply to new languages, and is significantly more efficient and effective than alternatives such as training longer.

AIFeb 3, 2025
PhD Knowledge Not Required: A Reasoning Challenge for Large Language Models

Zixuan Wu, Francesca Lucchetti, Aleksander Boruch-Gruszecki et al.

Existing benchmarks for frontier models often test specialized, "PhD-level" knowledge that is difficult for non-experts to grasp. In contrast, we present a benchmark with 594 problems based on the NPR Sunday Puzzle Challenge that requires only general knowledge. Our benchmark is challenging for both humans and models; however correct solutions are easy to verify, and models' mistakes are easy to spot. As LLMs are more widely deployed in society, we believe it is useful to develop benchmarks for frontier models that humans can understand without the need for deep domain expertise. Our work reveals capability gaps that are not evident in existing benchmarks: OpenAI o1 significantly outperforms other reasoning models on our benchmark, despite being on par with other models when tested on benchmarks that test specialized knowledge. Furthermore, our analysis of reasoning outputs uncovers new kinds of failures. DeepSeek R1, for instance, often concedes with "I give up" before providing an answer that it knows is wrong. R1 can also be remarkably "uncertain" in its output and in rare cases, it does not "finish thinking," which suggests the need for techniques to "wrap up" before the context window limit is reached. We also quantify the effectiveness of reasoning longer to identify the point beyond which more reasoning is unlikely to improve accuracy on our benchmark.

CYOct 15, 2024
Substance Beats Style: Why Beginning Students Fail to Code with LLMs

Francesca Lucchetti, Zixuan Wu, Arjun Guha et al.

Although LLMs are increasing the productivity of professional programmers, existing work shows that beginners struggle to prompt LLMs to solve text-to-code tasks. Why is this the case? This paper explores two competing hypotheses about the cause of student-LLM miscommunication: (1) students simply lack the technical vocabulary needed to write good prompts, and (2) students do not understand the extent of information that LLMs need to solve code generation tasks. We study (1) with a causal intervention experiment on technical vocabulary and (2) by analyzing graphs that abstract how students edit prompts and the different failures that they encounter. We find that substance beats style: a poor grasp of technical vocabulary is merely correlated with prompt failure; that the information content of prompts predicts success; that students get stuck making trivial edits; and more. Our findings have implications for the use of LLMs in programming education, and for efforts to make computing more accessible with LLMs.

CLApr 2, 2024
Understanding How CodeLLMs (Mis)Predict Types with Activation Steering

Francesca Lucchetti, Arjun Guha

Large Language Models (LLMs) are widely used by software engineers for programming tasks. However, research shows that LLMs often lack a deep understanding of program semantics. Even minor changes to syntax, such as renaming variables, can significantly degrade performance across various tasks. In this work, we examine the task of type prediction: given a partially typed program, can a model predict a missing type annotations such that the resulting program is more typed? We construct a dataset of adversarial examples where models initially predict the correct types, but begin to fail after semantically irrelevant edits. This is problematic, as models should ideally generalize across different syntactic forms of semantically equivalent code. This lack of robustness suggests that models may have a shallow understanding of code semantics. Despite this, we provide evidence that LLMs do, in fact, learn robust mechanisms for type prediction-though these mechanisms often fail to activate in adversarial scenarios. By using activation steering, a method that manipulates a model's internal activations to guide it toward using latent knowledge, we restore accurate predictions on adversarial inputs. We show that steering successfully activates a type prediction mechanism that is shared by both Python and TypeScript, and is more effective than prompting with in-context examples. Across five different models, our comprehensive evaluation demonstrates that LLMs can learn generalizable representations of code semantics that transfer across programming languages.