Marcelo Sandoval-Castaneda

CV
h-index5
3papers
22citations
Novelty55%
AI Score41

3 Papers

CLMar 8
Cross-Modal Taxonomic Generalization in (Vision-) Language Models

Tianyang Xu, Marcelo Sandoval-Castaneda, Karen Livescu et al.

What is the interplay between semantic representations learned by language models (LM) from surface form alone to those learned from more grounded evidence? We study this question for a scenario where part of the input comes from a different modality -- in our case, in a vision-language model (VLM), where a pretrained LM is aligned with a pretrained image encoder. As a case study, we focus on the task of predicting hypernyms of objects represented in images. We do so in a VLM setup where the image encoder and LM are kept frozen, and only the intermediate mappings are learned. We progressively deprive the VLM of explicit evidence for hypernyms, and test whether knowledge of hypernyms is recoverable from the LM. We find that the LMs we study can recover this knowledge and generalize even in the most extreme version of this experiment (when the model receives no evidence of a hypernym during training). Additional experiments suggest that this cross-modal taxonomic generalization persists under counterfactual image-label mappings only when the counterfactual data have high visual similarity within each category. Taken together, these findings suggest that cross-modal generalization in LMs arises as a result of both coherence in the extralinguistic input and knowledge derived from language cues.

CVSep 13, 2025
EditDuet: A Multi-Agent System for Video Non-Linear Editing

Marcelo Sandoval-Castaneda, Bryan Russell, Josef Sivic et al.

Automated tools for video editing and assembly have applications ranging from filmmaking and advertisement to content creation for social media. Previous video editing work has mainly focused on either retrieval or user interfaces, leaving actual editing to the user. In contrast, we propose to automate the core task of video editing, formulating it as sequential decision making process. Ours is a multi-agent approach. We design an Editor agent and a Critic agent. The Editor takes as input a collection of video clips together with natural language instructions and uses tools commonly found in video editing software to produce an edited sequence. On the other hand, the Critic gives natural language feedback to the editor based on the produced sequence or renders it if it is satisfactory. We introduce a learning-based approach for enabling effective communication across specialized agents to address the language-driven video editing task. Finally, we explore an LLM-as-a-judge metric for evaluating the quality of video editing system and compare it with general human preference. We evaluate our system's output video sequences qualitatively and quantitatively through a user study and find that our system vastly outperforms existing approaches in terms of coverage, time constraint satisfaction, and human preference.

CVSep 2, 2023
Self-Supervised Video Transformers for Isolated Sign Language Recognition

Marcelo Sandoval-Castaneda, Yanhong Li, Diane Brentari et al.

This paper presents an in-depth analysis of various self-supervision methods for isolated sign language recognition (ISLR). We consider four recently introduced transformer-based approaches to self-supervised learning from videos, and four pre-training data regimes, and study all the combinations on the WLASL2000 dataset. Our findings reveal that MaskFeat achieves performance superior to pose-based and supervised video models, with a top-1 accuracy of 79.02% on gloss-based WLASL2000. Furthermore, we analyze these models' ability to produce representations of ASL signs using linear probing on diverse phonological features. This study underscores the value of architecture and pre-training task choices in ISLR. Specifically, our results on WLASL2000 highlight the power of masked reconstruction pre-training, and our linear probing results demonstrate the importance of hierarchical vision transformers for sign language representation.