Casey Kennington

CL
h-index43
22papers
125citations
Novelty21%
AI Score44

22 Papers

CVJul 6, 2023
Vision Language Transformers: A Survey

Clayton Fields, Casey Kennington

Vision language tasks, such as answering questions about or generating captions that describe an image, are difficult tasks for computers to perform. A relatively recent body of research has adapted the pretrained transformer architecture introduced in \citet{vaswani2017attention} to vision language modeling. Transformer models have greatly improved performance and versatility over previous vision language models. They do so by pretraining models on a large generic datasets and transferring their learning to new tasks with minor changes in architecture and parameter values. This type of transfer learning has become the standard modeling practice in both natural language processing and computer vision. Vision language transformers offer the promise of producing similar advancements in tasks which require both vision and language. In this paper, we provide a broad synthesis of the currently available research on vision language transformer models and offer some analysis of their strengths, limitations and some open questions that remain.

67.1SEApr 3Code
Precision or Peril: A PoC of Python Code Quality from Quantized Large Language Models

Eric L. Melin, Adam J. Torek, Nasir U. Eisty et al.

Context: Large Language Models (LLMs) like GPT-5 and LLaMA-405b exhibit advanced code generation abilities, but their deployment demands substantial computation resources and energy. Quantization can reduce memory footprint and hardware requirements, yet may degrade code quality. Objective: This study investigates code generation performance of smaller LLMs, examines the effect of quantization, and identifies common code quality issues as a proof of concepts (PoC). Method: Four open-source LLMs are evaluated on Python benchmarks using code similarity metrics, with an analysis on 8-bit and 4-bit quantization, alongside static code quality assessment. Results: While smaller LLMs can generate functional code, benchmark performance is limited. Quantization impacts are variable, and generated code exhibits quality and maintainability concerns. Conclusions: LLM-generated code should be carefully validated before integration into software projects.

CLFeb 23, 2023
Conversational Agents and Children: Let Children Learn

Casey Kennington, Jerry Alan Fails, Katherine Landau Wright et al.

Using online information discovery as a case study, in this position paper we discuss the need to design, develop, and deploy (conversational) agents that can -- non-intrusively -- guide children in their quest for online resources rather than simply finding resources for them. We argue that agents should "let children learn" and should be built to take on a teacher-facilitator function, allowing children to develop their technical and critical thinking abilities as they interact with varied technology in a broad range of use cases.

AIMar 15, 2023
Who's in Charge? Roles and Responsibilities of Decision-Making Components in Conversational Robots

Pierre Lison, Casey Kennington

Software architectures for conversational robots typically consist of multiple modules, each designed for a particular processing task or functionality. Some of these modules are developed for the purpose of making decisions about the next action that the robot ought to perform in the current context. Those actions may relate to physical movements, such as driving forward or grasping an object, but may also correspond to communicative acts, such as asking a question to the human user. In this position paper, we reflect on the organization of those decision modules in human-robot interaction platforms. We discuss the relative benefits and limitations of modular vs. end-to-end architectures, and argue that, despite the increasing popularity of end-to-end approaches, modular architectures remain preferable when developing conversational robots designed to execute complex tasks in collaboration with human users. We also show that most practical HRI architectures tend to be either robot-centric or dialogue-centric, depending on where developers wish to place the ``command center'' of their system. While those design choices may be justified in some application domains, they also limit the robot's ability to flexibly interleave physical movements and conversational behaviours. We contend that architectures placing ``action managers'' and ``interaction managers'' on an equal footing may provide the best path forward for future human-robot interaction systems.

CLFeb 23, 2023
Evaluating Automatic Speech Recognition in an Incremental Setting

Ryan Whetten, Mir Tahsin Imtiaz, Casey Kennington

The increasing reliability of automatic speech recognition has proliferated its everyday use. However, for research purposes, it is often unclear which model one should choose for a task, particularly if there is a requirement for speed as well as accuracy. In this paper, we systematically evaluate six speech recognizers using metrics including word error rate, latency, and the number of updates to already recognized words on English test data, as well as propose and compare two methods for streaming audio into recognizers for incremental recognition. We further propose Revokes per Second as a new metric for evaluating incremental recognition and demonstrate that it provides insights into overall model performance. We find that, generally, local recognizers are faster and require fewer updates than cloud-based recognizers. Finally, we find Meta's Wav2Vec model to be the fastest, and find Mozilla's DeepSpeech model to be the most stable in its predictions.

CLJul 10, 2023
On the Computational Modeling of Meaning: Embodied Cognition Intertwined with Emotion

Casey Kennington

This document chronicles this author's attempt to explore how words come to mean what they do, with a particular focus on child language acquisition and what that means for models of language understanding.\footnote{I say \emph{historical} because I synthesize the ideas based on when I discovered them and how those ideas influenced my later thinking.} I explain the setting for child language learning, how embodiment -- being able to perceive and enact in the world, including knowledge of concrete and abstract concepts -- is crucial, and how emotion and cognition relate to each other and the language learning process. I end with what I think are some of the requirements for a language-learning agent that learns language in a setting similar to that of children. This paper can act as a potential guide for ongoing and future work in modeling language.

86.0AIApr 21Code
SafetyALFRED: Evaluating Safety-Conscious Planning of Multimodal Large Language Models

Josue Torres-Fonseca, Naihao Deng, Yinpei Dai et al.

Multimodal Large Language Models are increasingly adopted as autonomous agents in interactive environments, yet their ability to proactively address safety hazards remains insufficient. We introduce SafetyALFRED, built upon the embodied agent benchmark ALFRED, augmented with six categories of real-world kitchen hazards. While existing safety evaluations focus on hazard recognition through disembodied question answering (QA) settings, we evaluate eleven state-of-the-art models from the Qwen, Gemma, and Gemini families on not only hazard recognition, but also active risk mitigation through embodied planning. Our experimental results reveal a significant alignment gap: while models can accurately recognize hazards in QA settings, average mitigation success rates for these hazards are low in comparison. Our findings demonstrate that static evaluations through QA are insufficient for physical safety, thus we advocate for a paradigm shift toward benchmarks that prioritize corrective actions in embodied contexts. We open-source our code and dataset under https://github.com/sled-group/SafetyALFRED.git

CVNov 11, 2024Code
Renaissance: Investigating the Pretraining of Vision-Language Encoders

Clayton Fields, Casey Kennington

In the past several years there has been an explosion of available models for vision-language tasks. Unfortunately, the literature still leaves open a number of questions related to best practices in designing and training such models. In this paper we seek to answer several questions related to the pretraining of vision-language encoders through meta-analysis. In our first set of experiments, we show that we can save significant compute at no cost to downstream performance, by freezing large parts of vision-language models during pretraining. In our second set of experiments we examine the effect of basing a VL transformer on a vision model versus a text model. Additionally, we introduce a VL modeling platform called Renaissance that we use to conduct all of the experiments. This program offers a great deal of flexibility in creating, training and evaluating transformer encoders for VL modeling. The source code for Renaissance can be found at https://github.com/bsu-slim/renaissance.

2.8CVApr 20
ESsEN: Training Compact Discriminative Vision-Language Transformers in a Low-Resource Setting

Clayton Fields, Casey Kennington

Vision-language modeling is rapidly increasing in popularity with an ever expanding list of available models. In most cases, these vision-language models have parameters in the tens of billions, which is necessary for some needs, but in many cases smaller models are necessary (e.g., on edge devices or independent robotic platforms). Unfortunately, there is little research in producing light-weight models or in training them with small datasets. Inspired by the language learning progression and data sparsity in child development, in this paper, we address both of these goals in a systematic fashion. We show that two-tower encoder models are superior to one-tower encoders in low-resource settings for discriminative English tasks. We show also that incorporating traditional convolutional networks into the two-tower transformer architecture can help produce parameter efficient vision-language models. Finally, we show that the cross-modal fusion module of two-tower encoders can vary significantly in shape and size while producing the same results. In addition, we present ESsEN, a compact vision-language model that can be trained end-to-end with relatively few resources that performs as well on several tasks with only a fraction of the parameters compared to other models. The experimental results and the tools we present here make vision-language modeling more accessible to a wider variety of researchers.

CLJul 11, 2019Code
Incrementalizing RASA's Open-Source Natural Language Understanding Pipeline

Andrew Rafla, Casey Kennington

As spoken dialogue systems and chatbots are gaining more widespread adoption, commercial and open-sourced services for natural language understanding are emerging. In this paper, we explain how we altered the open-source RASA natural language understanding pipeline to process incrementally (i.e., word-by-word), following the incremental unit framework proposed by Schlangen and Skantze. To do so, we altered existing RASA components to process incrementally, and added an update-incremental intent recognition model as a component to RASA. Our evaluations on the Snips dataset show that our changes allow RASA to function as an effective incremental natural language understanding service.

CLFeb 16, 2024
Understanding Survey Paper Taxonomy about Large Language Models via Graph Representation Learning

Jun Zhuang, Casey Kennington

As new research on Large Language Models (LLMs) continues, it is difficult to keep up with new research and models. To help researchers synthesize the new research many have written survey papers, but even those have become numerous. In this paper, we develop a method to automatically assign survey papers to a taxonomy. We collect the metadata of 144 LLM survey papers and explore three paradigms to classify papers within the taxonomy. Our work indicates that leveraging graph structure information on co-category graphs can significantly outperform the language models in two paradigms; pre-trained language models' fine-tuning and zero-shot/few-shot classifications using LLMs. We find that our model surpasses an average human recognition level and that fine-tuning LLMs using weak labels generated by a smaller model, such as the GCN in this study, can be more effective than using ground-truth labels, revealing the potential of weak-to-strong generalization in the taxonomy classification task.

CLApr 1, 2024
Dialogue with Robots: Proposals for Broadening Participation and Research in the SLIVAR Community

Casey Kennington, Malihe Alikhani, Heather Pon-Barry et al. · cmu

The ability to interact with machines using natural human language is becoming not just commonplace, but expected. The next step is not just text interfaces, but speech interfaces and not just with computers, but with all machines including robots. In this paper, we chronicle the recent history of this growing field of spoken dialogue with robots and offer the community three proposals, the first focused on education, the second on benchmarks, and the third on the modeling of language when it comes to spoken interaction with robots. The three proposals should act as white papers for any researcher to take and build upon.

CLJul 8, 2025
Could the Road to Grounded, Neuro-symbolic AI be Paved with Words-as-Classifiers?

Casey Kennington, David Schlangen

Formal, Distributional, and Grounded theories of computational semantics each have their uses and their drawbacks. There has been a shift to ground models of language by adding visual knowledge, and there has been a call to enrich models of language with symbolic methods to gain the benefits from formal, distributional, and grounded theories. In this paper, we attempt to make the case that one potential path forward in unifying all three semantic fields is paved with the words-as-classifier model, a model of word-level grounded semantics that has been incorporated into formalisms and distributional language models in the literature, and it has been well-tested within interactive dialogue settings. We review that literature, motivate the words-as-classifiers model with an appeal to recent work in cognitive science, and describe a small experiment. Finally, we sketch a model of semantics unified through words-as-classifiers.

CLJan 1, 2025
Prior Lessons of Incremental Dialogue and Robot Action Management for the Age of Language Models

Casey Kennington, Pierre Lison, David Schlangen

Efforts towards endowing robots with the ability to speak have benefited from recent advancements in natural language processing, in particular large language models. However, current language models are not fully incremental, as their processing is inherently monotonic and thus lack the ability to revise their interpretations or output in light of newer observations. This monotonicity has important implications for the development of dialogue systems for human--robot interaction. In this paper, we review the literature on interactive systems that operate incrementally (i.e., at the word level or below it). We motivate the need for incremental systems, survey incremental modeling of important aspects of dialogue like speech recognition and language generation. Primary focus is on the part of the system that makes decisions, known as the dialogue manager. We find that there is very little research on incremental dialogue management, offer some requirements for practical incremental dialogue management, and the implications of incremental dialogue for embodied, robotic platforms in the age of large language models.

CLApr 3, 2024
Unsupervised, Bottom-up Category Discovery for Symbol Grounding with a Curious Robot

Catherine Henry, Casey Kennington

Towards addressing the Symbol Grounding Problem and motivated by early childhood language development, we leverage a robot which has been equipped with an approximate model of curiosity with particular focus on bottom-up building of unsupervised categories grounded in the physical world. That is, rather than starting with a top-down symbol (e.g., a word referring to an object) and providing meaning through the application of predetermined samples, the robot autonomously and gradually breaks up its exploration space into a series of increasingly specific unlabeled categories at which point an external expert may optionally provide a symbol association. We extend prior work by using a robot that can observe the visual world, introducing a higher dimensional sensory space, and using a more generalizable method of category building. Our experiments show that the robot learns categories based on actions and what it visually observes, and that those categories can be symbolically grounded into.https://info.arxiv.org/help/prep#comments

CLAug 24, 2021
The State of SLIVAR: What's next for robots, human-robot interaction, and (spoken) dialogue systems?

Casey Kennington

We synthesize the reported results and recommendations of recent workshops and seminars that convened to discuss open questions within the important intersection of robotics, human-robot interaction, and spoken dialogue systems research. The goal of this growing area of research interest is to enable people to more effectively and naturally communicate with robots. To carry forward opportunities networking and discussion towards concrete, potentially fundable projects, we encourage interested parties to consider participating in future virtual and in-person discussions and workshops.

CLJun 30, 2021
An Analysis of the Recent Visibility of the SigDial Conference

Casey Kennington, McKenzie Steenson

Automated speech and text interfaces are continuing to improve, resulting in increased research in the area of dialogue systems. Moreover, conferences and workshops from various fields are focusing more on language through speech and text mediums as candidates for interaction with applications such as search interfaces and robots. In this paper, we explore how visible the SigDial conference is to outside conferences by analysing papers from top Natural Langauge Processing conferences since 2015 to determine the popularity of certain SigDial-related topics, as well as analysing what SigDial papers are being cited by others outside of SigDial. We find that despite a dramatic increase in dialogue-related research, SigDial visibility has not increased. We conclude by offering some suggestions.

CLMay 10, 2021
Language Acquisition is Embodied, Interactive, Emotive: a Research Proposal

Casey Kennington

Humans' experience of the world is profoundly multimodal from the beginning, so why do existing state-of-the-art language models only use text as a modality to learn and represent semantic meaning? In this paper we review the literature on the role of embodiment and emotion in the interactive setting of spoken dialogue as necessary prerequisites for language learning for human children, including how words in child vocabularies are largely concrete, then shift to become more abstract as the children get older. We sketch a model of semantics that leverages current transformer-based models and a word-level grounded model, then explain the robot-dialogue system that will make use of our semantic model, the setting for the system to learn language, and existing benchmarks for evaluation.

CYMay 7, 2021
CASTing a Net: Supporting Teachers with Search Technology

Garrett Allen, Katherine Landau Wright, Jerry Alan Fails et al.

Past and current research has typically focused on ensuring that search technology for the classroom serves children. In this paper, we argue for the need to broaden the research focus to include teachers and how search technology can aid them. In particular, we share how furnishing a behind-the-scenes portal for teachers can empower them by providing a window into the spelling, writing, and concept connection skills of their students.

CLNov 8, 2019
Composing and Embedding the Words-as-Classifiers Model of Grounded Semantics

Daniele Moro, Stacy Black, Casey Kennington

The words-as-classifiers model of grounded lexical semantics learns a semantic fitness score between physical entities and the words that are used to denote those entities. In this paper, we explore how such a model can incrementally perform composition and how the model can be unified with a distributional representation. For the latter, we leverage the classifier coefficients as an embedding. For composition, we leverage the underlying mechanics of three different classifier types (i.e., logistic regression, decision trees, and multi-layer perceptrons) to arrive at a several systematic approaches to composition unique to each classifier including both denotational and connotational methods of composition. We compare these approaches to each other and to prior work in a visual reference resolution task using the refCOCO dataset. Our results demonstrate the need to expand upon existing composition strategies and bring together grounded and distributional representations.

CLSep 29, 2017
Symbol, Conversational, and Societal Grounding with a Toy Robot

Casey Kennington, Sarah Plane

Essential to meaningful interaction is grounding at the symbolic, conversational, and societal levels. We present ongoing work with Anki's Cozmo toy robot as a research platform where we leverage the recent words-as-classifiers model of lexical semantics in interactive reference resolution tasks for language grounding.

CLOct 7, 2015
Resolving References to Objects in Photographs using the Words-As-Classifiers Model

David Schlangen, Sina Zarriess, Casey Kennington

A common use of language is to refer to visually present objects. Modelling it in computers requires modelling the link between language and perception. The "words as classifiers" model of grounded semantics views words as classifiers of perceptual contexts, and composes the meaning of a phrase through composition of the denotations of its component words. It was recently shown to perform well in a game-playing scenario with a small number of object types. We apply it to two large sets of real-world photographs that contain a much larger variety of types and for which referring expressions are available. Using a pre-trained convolutional neural network to extract image features, and augmenting these with in-picture positional information, we show that the model achieves performance competitive with the state of the art in a reference resolution task (given expression, find bounding box of its referent), while, as we argue, being conceptually simpler and more flexible.