CLJun 15, 2023
Inverse Scaling: When Bigger Isn't BetterIan R. McKenzie, Alexander Lyzhov, Michael Pieler et al. · stanford, utoronto
Work on scaling laws has found that large language models (LMs) show predictable improvements to overall loss with increased scale (model size, training data, and compute). Here, we present evidence for the claim that LMs may show inverse scaling, or worse task performance with increased scale, e.g., due to flaws in the training objective and data. We present empirical evidence of inverse scaling on 11 datasets collected by running a public contest, the Inverse Scaling Prize, with a substantial prize pool. Through analysis of the datasets, along with other examples found in the literature, we identify four potential causes of inverse scaling: (i) preference to repeat memorized sequences over following in-context instructions, (ii) imitation of undesirable patterns in the training data, (iii) tasks containing an easy distractor task which LMs could focus on, rather than the harder real task, and (iv) correct but misleading few-shot demonstrations of the task. We release the winning datasets at https://inversescaling.com/data to allow for further investigation of inverse scaling. Our tasks have helped drive the discovery of U-shaped and inverted-U scaling trends, where an initial trend reverses, suggesting that scaling trends are less reliable at predicting the behavior of larger-scale models than previously understood. Overall, our results suggest that there are tasks for which increased model scale alone may not lead to progress, and that more careful thought needs to go into the data and objectives for training language models.
AISep 23, 2022
Predicting the Future of AI with AI: High-quality link prediction in an exponentially growing knowledge networkMario Krenn, Lorenzo Buffoni, Bruno Coutinho et al.
A tool that could suggest new personalized research directions and ideas by taking insights from the scientific literature could significantly accelerate the progress of science. A field that might benefit from such an approach is artificial intelligence (AI) research, where the number of scientific publications has been growing exponentially over the last years, making it challenging for human researchers to keep track of the progress. Here, we use AI techniques to predict the future research directions of AI itself. We develop a new graph-based benchmark based on real-world data -- the Science4Cast benchmark, which aims to predict the future state of an evolving semantic network of AI. For that, we use more than 100,000 research papers and build up a knowledge network with more than 64,000 concept nodes. We then present ten diverse methods to tackle this task, ranging from pure statistical to pure learning methods. Surprisingly, the most powerful methods use a carefully curated set of network features, rather than an end-to-end AI approach. It indicates a great potential that can be unleashed for purely ML approaches without human knowledge. Ultimately, better predictions of new future research directions will be a crucial component of more advanced research suggestion tools.
CHEM-PHSep 3, 2024
SmileyLlama: Modifying Large Language Models for Directed Chemical Space ExplorationJoseph M. Cavanagh, Kunyang Sun, Andrew Gritsevskiy et al.
Here we show that a general-purpose large language model (LLM) chatbot, Llama-3.1-8B-Instruct, can be transformed via supervised fine-tuning of engineered prompts into a chemical language model (CLM), SmileyLlama, for molecule generation. We benchmark SmileyLlama by comparing it to CLMs trained from scratch on large amounts of ChEMBL data for their ability to generate valid and novel drug-like molecules. We also use direct preference optimization to both improve SmileyLlama's adherence to a prompt and to generate molecules within the iMiner reinforcement learning framework to predict new drug molecules with optimized 3D conformations and high binding affinity to drug targets, illustrated with the SARS-Cov-2 Main Protease. This overall framework allows a LLM to speak directly as a CLM which can generate molecules with user-specified properties, rather than acting only as a chatbot with knowledge of chemistry or as a helpful virtual assistant. While our dataset and analyses are geared toward drug discovery, this general procedure can be extended to other chemical applications such as chemical synthesis.
LGJan 24, 2025
Humanity's Last ExamLong Phan, Alice Gatti, Ziwen Han et al. · amazon-science, apple-ml
Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve over 90\% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities. In response, we introduce Humanity's Last Exam (HLE), a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage. HLE consists of 2,500 questions across dozens of subjects, including mathematics, humanities, and the natural sciences. HLE is developed globally by subject-matter experts and consists of multiple-choice and short-answer questions suitable for automated grading. Each question has a known solution that is unambiguous and easily verifiable, but cannot be quickly answered via internet retrieval. State-of-the-art LLMs demonstrate low accuracy and calibration on HLE, highlighting a significant gap between current LLM capabilities and the expert human frontier on closed-ended academic questions. To inform research and policymaking upon a clear understanding of model capabilities, we publicly release HLE at https://lastexam.ai.
CRJun 3, 2024Code
Unelicitable Backdoors in Language Models via Cryptographic Transformer CircuitsAndis Draguns, Andrew Gritsevskiy, Sumeet Ramesh Motwani et al.
The rapid proliferation of open-source language models significantly increases the risks of downstream backdoor attacks. These backdoors can introduce dangerous behaviours during model deployment and can evade detection by conventional cybersecurity monitoring systems. In this paper, we introduce a novel class of backdoors in transformer models, that, in contrast to prior art, are unelicitable in nature. Unelicitability prevents the defender from triggering the backdoor, making it impossible to properly evaluate ahead of deployment even if given full white-box access and using automated techniques, such as red-teaming or certain formal verification methods. We show that our novel construction is not only unelicitable thanks to using cryptographic techniques, but also has favourable robustness properties. We confirm these properties in empirical investigations, and provide evidence that our backdoors can withstand state-of-the-art mitigation strategies. Additionally, we expand on previous work by showing that our universal backdoors, while not completely undetectable in white-box settings, can be harder to detect than some existing designs. By demonstrating the feasibility of seamlessly integrating backdoors into transformer models, this paper fundamentally questions the efficacy of pre-deployment detection strategies. This offers new insights into the offence-defence balance in AI safety and security.
LGMar 16, 2025
SynLlama: Generating Synthesizable Molecules and Their Analogs with Large Language ModelsKunyang Sun, Dorian Bagni, Joseph M. Cavanagh et al.
Generative machine learning models for exploring chemical space have shown immense promise, but many molecules they generate are too difficult to synthesize, making them impractical for further investigation or development. In this work, we present a novel approach by fine-tuning Meta's Llama3 Large Language Models (LLMs) to create SynLlama, which generates full synthetic pathways made of commonly accessible building blocks and robust organic reaction templates. SynLlama explores a large synthesizable space using significantly less data, and offers strong performance in both forward and bottom-up synthesis planning compared to other state-of-the-art methods. We find that SynLlama, even without training on external building blocks, can effectively generalize to unseen yet purchasable building blocks, meaning that its reconstruction capabilities extend to a broader synthesizable chemical space than the training data. We also demonstrate the use of SynLlama in a pharmaceutical context for synthesis planning of analog molecules and hit expansion leads for proposed inhibitors of target proteins, offering medicinal chemists a valuable tool for discovery.
CLJan 11, 2024
REBUS: A Robust Evaluation Benchmark of Understanding SymbolsAndrew Gritsevskiy, Arjun Panickssery, Aaron Kirtland et al.
We propose a new benchmark evaluating the performance of multimodal large language models on rebus puzzles. The dataset covers 333 original examples of image-based wordplay, cluing 13 categories such as movies, composers, major cities, and food. To achieve good performance on the benchmark of identifying the clued word or phrase, models must combine image recognition and string manipulation with hypothesis testing, multi-step reasoning, and an understanding of human cognition, making for a complex, multimodal evaluation of capabilities. We find that GPT-4o significantly outperforms all other models, followed by proprietary models outperforming all other evaluated models. However, even the best model has a final accuracy of only 42\%, which goes down to just 7\% on hard puzzles, highlighting the need for substantial improvements in reasoning. Further, models rarely understand all parts of a puzzle, and are almost always incapable of retroactively explaining the correct answer. Our benchmark can therefore be used to identify major shortcomings in the knowledge and reasoning of multimodal large language models.
CVApr 26, 2018
Capsule networks for low-data transfer learningAndrew Gritsevskiy, Maksym Korablyov
We propose a capsule network-based architecture for generalizing learning to new data with few examples. Using both generative and non-generative capsule networks with intermediate routing, we are able to generalize to new information over 25 times faster than a similar convolutional neural network. We train the networks on the multiMNIST dataset lacking one digit. After the networks reach their maximum accuracy, we inject 1-100 examples of the missing digit into the training set, and measure the number of batches needed to return to a comparable level of accuracy. We then discuss the improvement in low-data transfer learning that capsule networks bring, and propose future directions for capsule research.