CLOct 15, 2022
Temporal Word Meaning Disambiguation using TimeLMsMihir Godbole, Parth Dandavate, Aditya Kane
Meaning of words constantly changes given the events in modern civilization. Large Language Models use word embeddings, which are often static and thus cannot cope with this semantic change. Thus,it is important to resolve ambiguity in word meanings. This paper is an effort in this direction, where we explore methods for word sense disambiguation for the EvoNLP shared task. We conduct rigorous ablations for two solutions to this problem. We see that an approach using time-aware language models helps this task. Furthermore, we explore possible future directions to this problem.
CVJul 11, 2024
Manipulating a Tetris-Inspired 3D Video RepresentationMihir Godbole
Video Synopsis is a technique that performs video compression in a way that preserves the activity in the video. This technique is particularly useful in surveillance and monitoring applications. Although it is still a nascent field of research, there have been several approaches proposed over the last two decades varying with the application, optimization type, nature of data feed, etc. The primary data required for these algorithms arises from some sort of object tracking method. In this paper, we discuss different spatio-temporal data representations suitable for different applications. We also present a formal definition for the video synopsis algorithm. We further discuss the assumptions and modifications to this definition required for a simpler version of the problem. We explore the application of a packing algorithm to solve the problem of video synopsis. Since the nature of the data is three dimensional, we consider 3D packing problems in the discussion. This paper also provides an extensive literature review of different video synopsis methods and packing problems. Lastly, we examine the different applications of this algorithm and how the different data representations discussed earlier can make the problem simpler. We also discuss the future directions of research that can be explored following this discussion.
CVMay 10, 2024
Non-Uniform Spatial Alignment Errors in sUAS Imagery From Wide-Area DisastersThomas Manzini, Priyankari Perali, Raisa Karnik et al. · microsoft-research
This work presents the first quantitative study of alignment errors between small uncrewed aerial systems (sUAS) georectified imagery and a priori building polygons and finds that alignment errors are non-uniform and irregular, which negatively impacts field robotics systems and human-robot interfaces that rely on geospatial information. There are no efforts that have considered the alignment of a priori spatial data with georectified sUAS imagery, possibly because straight-forward linear transformations often remedy any misalignment in satellite imagery. However, an attempt to develop machine learning models for an sUAS field robotics system for disaster response from nine wide-area disasters using the CRASAR-U-DROIDs dataset uncovered serious translational alignment errors. The analysis considered 21,608 building polygons in 51 orthomosaic images, covering 16787.2 Acres (26.23 square miles), and 7,880 adjustment annotations, averaging 75.36 pixels and an average intersection over union of 0.65. Further analysis found no uniformity among the angle and distance metrics of the building polygon alignments, presenting an average circular variance of 0.28 and an average distance variance of 0.45 pixels2, making it impossible to use the linear transform used to align satellite imagery. The study's primary contribution is alerting field robotics and human-robot interaction (HRI) communities to the problem of spatial alignment and that a new method will be needed to automate and communicate the alignment of spatial data in sUAS georectified imagery. This paper also contributes a description of the updated CRASAR-U-DROIDs dataset of sUAS imagery, which contains building polygons and human-curated corrections to spatial misalignment for further research in field robotics and HRI.
CVJun 21, 2025
DRAMA-X: A Fine-grained Intent Prediction and Risk Reasoning Benchmark For DrivingMihir Godbole, Xiangbo Gao, Zhengzhong Tu
Understanding the short-term motion of vulnerable road users (VRUs) like pedestrians and cyclists is critical for safe autonomous driving, especially in urban scenarios with ambiguous or high-risk behaviors. While vision-language models (VLMs) have enabled open-vocabulary perception, their utility for fine-grained intent reasoning remains underexplored. Notably, no existing benchmark evaluates multi-class intent prediction in safety-critical situations, To address this gap, we introduce DRAMA-X, a fine-grained benchmark constructed from the DRAMA dataset via an automated annotation pipeline. DRAMA-X contains 5,686 accident-prone frames labeled with object bounding boxes, a nine-class directional intent taxonomy, binary risk scores, expert-generated action suggestions for the ego vehicle, and descriptive motion summaries. These annotations enable a structured evaluation of four interrelated tasks central to autonomous decision-making: object detection, intent prediction, risk assessment, and action suggestion. As a reference baseline, we propose SGG-Intent, a lightweight, training-free framework that mirrors the ego vehicle's reasoning pipeline. It sequentially generates a scene graph from visual input using VLM-backed detectors, infers intent, assesses risk, and recommends an action using a compositional reasoning stage powered by a large language model. We evaluate a range of recent VLMs, comparing performance across all four DRAMA-X tasks. Our experiments demonstrate that scene-graph-based reasoning enhances intent prediction and risk assessment, especially when contextual cues are explicitly modeled.
CLMar 31, 2024
Humane Speech Synthesis through Zero-Shot Emotion and Disfluency GenerationRohan Chaudhury, Mihir Godbole, Aakash Garg et al.
Contemporary conversational systems often present a significant limitation: their responses lack the emotional depth and disfluent characteristic of human interactions. This absence becomes particularly noticeable when users seek more personalized and empathetic interactions. Consequently, this makes them seem mechanical and less relatable to human users. Recognizing this gap, we embarked on a journey to humanize machine communication, to ensure AI systems not only comprehend but also resonate. To address this shortcoming, we have designed an innovative speech synthesis pipeline. Within this framework, a cutting-edge language model introduces both human-like emotion and disfluencies in a zero-shot setting. These intricacies are seamlessly integrated into the generated text by the language model during text generation, allowing the system to mirror human speech patterns better, promoting more intuitive and natural user interactions. These generated elements are then adeptly transformed into corresponding speech patterns and emotive sounds using a rule-based approach during the text-to-speech phase. Based on our experiments, our novel system produces synthesized speech that's almost indistinguishable from genuine human communication, making each interaction feel more personal and authentic.