Jeffrey Mark Siskind

h-index28

24papers

580citations

Novelty40%

AI Score38

Ranked #85,490 of 194,257 authors (top 44%)#28,817 in CV (top 49%)

24 Papers

1.2PLMar 8, 2012

AD in Fortran, Part 1: Design

Alexey Radul, Barak A. Pearlmutter, Jeffrey Mark Siskind

We propose extensions to Fortran which integrate forward and reverse Automatic Differentiation (AD) directly into the programming model. Irrespective of implementation technology, embedding AD constructs directly into the language extends the reach and convenience of AD while allowing abstraction of concepts of interest to scientific-computing practice, such as root finding, optimization, and finding equilibria of continuous games. Multiple different subprograms for these tasks can share common interfaces, regardless of whether and how they use AD internally. A programmer can maximize a function F by calling a library maximizer, XSTAR=ARGMAX(F,X0), which internally constructs derivatives of F by AD, without having to learn how to use any particular AD tool. We illustrate the utility of these extensions by example: programs become much more concise and closer to traditional mathematical notation. A companion paper describes how these extensions can be implemented by a program that generates input to existing Fortran-based AD tools.

1.2NANov 27, 2024

Automatic Differentiation: Inverse Accumulation Mode

Barak A. Pearlmutter, Jeffrey Mark Siskind

We show that, under certain circumstances, it is possible to automatically compute Jacobian-inverse-vector and Jacobian-inverse-transpose-vector products about as efficiently as Jacobian-vector and Jacobian-transpose-vector products. The key insight is to notice that the Jacobian corresponding to the use of one basis function is of a form whose sparsity is invariant to inversion. The main restriction of the method is a constraint on the number of active variables, which suggests a variety of techniques or generalization to allow the constraint to be enforced or relaxed. This technique has the potential to allow the efficient direct calculation of Newton steps as well as other numeric calculations of interest.

15.5ROFeb 7, 2018Code

A Critical Investigation of Deep Reinforcement Learning for Navigation

Vikas Dhiman, Shurjo Banerjee, Brent Griffin et al.

The navigation problem is classically approached in two steps: an exploration step, where map-information about the environment is gathered; and an exploitation step, where this information is used to navigate efficiently. Deep reinforcement learning (DRL) algorithms, alternatively, approach the problem of navigation in an end-to-end fashion. Inspired by the classical approach, we ask whether DRL algorithms are able to inherently explore, gather and exploit map-information over the course of navigation. We build upon Mirowski et al. [2017] work and introduce a systematic suite of experiments that vary three parameters: the agent's starting location, the agent's target location, and the maze structure. We choose evaluation metrics that explicitly measure the algorithm's ability to gather and exploit map-information. Our experiments show that when trained and tested on the same maps, the algorithm successfully gathers and exploits map-information. However, when trained and tested on different sets of maps, the algorithm fails to transfer the ability to gather and exploit map-information to unseen maps. Furthermore, we find that when the goal location is randomized and the map is kept static, the algorithm is able to gather and exploit map-information but the exploitation is far from optimal. We open-source our experimental suite in the hopes that it serves as a framework for the comparison of future algorithms and leads to the discovery of robust alternatives to classical navigation methods.

10.2CVJul 24, 2025

Towards Effective Human-in-the-Loop Assistive AI Agents

Filippos Bellos, Yayuan Li, Cary Shu et al.

Effective human-AI collaboration for physical task completion has significant potential in both everyday activities and professional domains. AI agents equipped with informative guidance can enhance human performance, but evaluating such collaboration remains challenging due to the complexity of human-in-the-loop interactions. In this work, we introduce an evaluation framework and a multimodal dataset of human-AI interactions designed to assess how AI guidance affects procedural task performance, error reduction and learning outcomes. Besides, we develop an augmented reality (AR)-equipped AI agent that provides interactive guidance in real-world tasks, from cooking to battlefield medicine. Through human studies, we share empirical insights into AI-assisted human performance and demonstrate that AI-assisted collaboration improves task completion.

1.2NCAug 1, 2025

The Repeated-Stimulus Confound in Electroencephalography

Jack A. Kilgallen, Barak A. Pearlmutter, Jeffrey Mark Siskind

In neural-decoding studies, recordings of participants' responses to stimuli are used to train models. In recent years, there has been an explosion of publications detailing applications of innovations from deep-learning research to neural-decoding studies. The data-hungry models used in these experiments have resulted in a demand for increasingly large datasets. Consequently, in some studies, the same stimuli are presented multiple times to each participant to increase the number of trials available for use in model training. However, when a decoding model is trained and subsequently evaluated on responses to the same stimuli, stimulus identity becomes a confounder for accuracy. We term this the repeated-stimulus confound. We identify a susceptible dataset, and 16 publications which report model performance based on evaluation procedures affected by the confound. We conducted experiments using models from the affected studies to investigate the likely extent to which results in the literature have been misreported. Our findings suggest that the decoding accuracies of these models were overestimated by between 4.46-7.42%. Our analysis also indicates that per 1% increase in accuracy under the confound, the magnitude of the overestimation increases by 0.26%. The confound not only results in optimistic estimates of decoding performance, but undermines the validity of several claims made within the affected publications. We conducted further experiments to investigate the implications of the confound in alternative contexts. We found that the same methodology used within the affected studies could also be used to justify an array of pseudoscientific claims, such as the existence of extrasensory perception.

4.1ROOct 28, 2020Code

The Amazing Race TM: Robot Edition

Jared Sigurd Johansen, Thomas Victor Ilyevsky, Jeffrey Mark Siskind

State-of-the-art natural-language-driven autonomous-navigation systems generally lack the ability to operate in real unknown environments without crutches, such as having a map of the environment in advance or requiring a strict syntactic structure for natural-language commands. Practical artificial-intelligent systems should not have to depend on such prior knowledge. To encourage effort towards this goal, we propose The Amazing Race TM: Robot Edition, a new task of finding a room in an unknown and unmodified office environment by following instructions obtained in spoken dialog from an untrained person. We present a solution that treats this challenge as a series of sub-tasks: natural-language interpretation, autonomous navigation, and semantic mapping. The solution consists of a finite-state-machine system design whose states solve these sub-tasks to complete The Amazing Race TM. Our design is deployed on a real robot and its performance is demonstrated in 52 trials on 4 floors of each of 3 different previously unseen buildings with 13 untrained volunteers.

11.7SPApr 9, 2020

Object classification from randomized EEG trials

Hamad Ahmed, Ronnie B Wilbur, Hari M Bharadwaj et al.

New results suggest strong limits to the feasibility of classifying human brain activity evoked from image stimuli, as measured through EEG. Considerable prior work suffers from a confound between the stimulus class and the time since the start of the experiment. A prior attempt to avoid this confound using randomized trials was unable to achieve results above chance in a statistically significant fashion when the data sets were of the same size as the original experiments. Here, we again attempt to replicate these experiments with randomized trials on a far larger (20x) dataset of 1,000 stimulus presentations of each of forty classes, all from a single subject. To our knowledge, this is the largest such EEG data collection effort from a single subject and is at the bounds of feasibility. We obtain classification accuracy that is marginally above chance and above chance in a statistically significant fashion, and further assess how accuracy depends on the classifier used, the amount of training data used, and the number of classes. Reaching the limits of data collection without substantial improvement in classification accuracy suggests limits to the feasibility of this enterprise.

3.9CVDec 18, 2018

Training on the test set? An analysis of Spampinato et al. [31]

Ren Li, Jared S. Johansen, Hamad Ahmed et al.

A recent paper [31] claims to classify brain processing evoked in subjects watching ImageNet stimuli as measured with EEG and to use a representation derived from this processing to create a novel object classifier. That paper, together with a series of subsequent papers [8, 15, 17, 20, 21, 30, 35], claims to revolutionize the field by achieving extremely successful results on several computer-vision tasks, including object classification, transfer learning, and generation of images depicting human perception and thought using brain-derived representations measured through EEG. Our novel experiments and analyses demonstrate that their results crucially depend on the block design that they use, where all stimuli of a given class are presented together, and fail with a rapid-event design, where stimuli of different classes are randomly intermixed. The block design leads to classification of arbitrary brain states based on block-level temporal correlations that tend to exist in all EEG data, rather than stimulus-related activity. Because every trial in their test sets comes from the same block as many trials in the corresponding training sets, their block design thus leads to surreptitiously training on the test set. This invalidates all subsequent analyses performed on this data in multiple published papers and calls into question all of the purported results. We further show that a novel object classifier constructed with a random codebook performs as well as or better than a novel object classifier constructed with the representation extracted from EEG data, suggesting that the performance of their classifier constructed with a representation extracted from EEG data does not benefit at all from the brain-derived representation. Our results calibrate the underlying difficulty of the tasks involved and caution against sensational and overly optimistic, but false, claims to the contrary.

5.7LGSep 25, 2018

Floyd-Warshall Reinforcement Learning: Learning from Past Experiences to Reach New Goals

Vikas Dhiman, Shurjo Banerjee, Jeffrey M. Siskind et al.

Consider mutli-goal tasks that involve static environments and dynamic goals. Examples of such tasks, such as goal-directed navigation and pick-and-place in robotics, abound. Two types of Reinforcement Learning (RL) algorithms are used for such tasks: model-free or model-based. Each of these approaches has limitations. Model-free RL struggles to transfer learned information when the goal location changes, but achieves high asymptotic accuracy in single goal tasks. Model-based RL can transfer learned information to new goal locations by retaining the explicitly learned state-dynamics, but is limited by the fact that small errors in modelling these dynamics accumulate over long-term planning. In this work, we improve upon the limitations of model-free RL in multi-goal domains. We do this by adapting the Floyd-Warshall algorithm for RL and call the adaptation Floyd-Warshall RL (FWRL). The proposed algorithm learns a goal-conditioned action-value function by constraining the value of the optimal path between any two states to be greater than or equal to the value of paths via intermediary states. Experimentally, we show that FWRL is more sample-efficient and learns higher reward strategies in multi-goal tasks as compared to Q-learning, model-based RL and other relevant baselines in a tabular domain.

4.3LGNov 10, 2016

Tricks from Deep Learning

Atılım Güneş Baydin, Barak A. Pearlmutter, Jeffrey Mark Siskind

The deep learning community has devised a diverse set of methods to make gradient optimization, using large datasets, of large and highly complex models with deeply cascaded nonlinearities, practical. Taken as a whole, these methods constitute a breakthrough, allowing computational structures which are quite wide, very deep, and with an enormous number and variety of free parameters to be effectively optimized. The result now dominates much of practical machine learning, with applications in machine translation, computer vision, and speech recognition. Many of these methods, viewed through the lens of algorithmic differentiation (AD), can be seen as either addressing issues with the gradient itself, or finding ways of achieving increased efficiency using tricks that are AD-related, but not provided by current AD systems. The goal of this paper is to explain not just those methods of most relevance to AD, but also the technical constraints and mindset which led to their discovery. After explaining this context, we present a "laundry list" of methods developed by the deep learning community. Two of these are discussed in further mathematical detail: a way to dramatically reduce the size of the tape when performing reverse-mode AD on a (theoretically) time-reversible process like an ODE integrator; and a new mathematical insight that allows for the implementation of a stochastic Newton's method.

1.2PLNov 10, 2016

Binomial Checkpointing for Arbitrary Programs with No User Annotation

Jeffrey Mark Siskind, Barak A. Pearlmutter

Heretofore, automatic checkpointing at procedure-call boundaries, to reduce the space complexity of reverse mode, has been provided by systems like Tapenade. However, binomial checkpointing, or treeverse, has only been provided in Automatic Differentiation (AD) systems in special cases, e.g., through user-provided pragmas on DO loops in Tapenade, or as the nested taping mechanism in adol-c for time integration processes, which requires that user code be refactored. We present a framework for applying binomial checkpointing to arbitrary code with no special annotation or refactoring required. This is accomplished by applying binomial checkpointing directly to a program trace. This trace is produced by a general-purpose checkpointing mechanism that is orthogonal to AD.

1.3CVNov 18, 2015

Collecting and Annotating the Large Continuous Action Dataset

Daniel Paul Barrett, Ran Xu, Haonan Yu et al.

We make available to the community a new dataset to support action-recognition research. This dataset is different from prior datasets in several key ways. It is significantly larger. It contains streaming video with long segments containing multiple action occurrences that often overlap in space and/or time. All actions were filmed in the same collection of backgrounds so that background gives little clue as to action class. We had five humans replicate the annotation of temporal extent of action occurrences labeled with their class and measured a surprisingly low level of intercoder agreement. A baseline experiment shows that recent state-of-the-art methods perform poorly on this dataset. This suggests that this will be a challenging dataset to foster advances in action-recognition research. This manuscript serves to describe the novel content and characteristics of the LCA dataset, present the design decisions made when filming the dataset, and document the novel methods employed to annotate the dataset.

10.1ROAug 25, 2015

Robot Language Learning, Generation, and Comprehension

Daniel Paul Barrett, Scott Alan Bronikowski, Haonan Yu et al.

We present a unified framework which supports grounding natural-language semantics in robotic driving. This framework supports acquisition (learning grounded meanings of nouns and prepositions from human annotation of robotic driving paths), generation (using such acquired meanings to generate sentential description of new robotic driving paths), and comprehension (using such acquired meanings to support automated driving to accomplish navigational goals specified in natural language). We evaluate the performance of these three tasks by having independent human judges rate the semantic fidelity of the sentences associated with paths, achieving overall average correctness of 94.6% and overall average completeness of 85.6%.

2.5CVJun 5, 2015

Sentence Directed Video Object Codetection

Haonan Yu, Jeffrey Mark Siskind

We tackle the problem of video object codetection by leveraging the weak semantic constraint implied by sentences that describe the video content. Unlike most existing work that focuses on codetecting large objects which are usually salient both in size and appearance, we can codetect objects that are small or medium sized. Our method assumes no human pose or depth information such as is required by the most recent state-of-the-art method. We employ weak semantic constraint on the codetection process by pairing the video with sentences. Although the semantic information is usually simple and weak, it can greatly boost the performance of our codetection framework by reducing the search space of the hypothesized object detections. Our experiment demonstrates an average IoU score of 0.423 on a new challenging dataset which contains 15 object classes and 150 videos with 12,509 frames in total, and an average IoU score of 0.373 on a subset of an existing dataset, originally intended for activity recognition, which contains 5 object classes and 75 videos with 8,854 frames in total.

1.9CVNov 14, 2014

A Faster Method for Tracking and Scoring Videos Corresponding to Sentences

Haonan Yu, Daniel P. Barrett, Jeffrey Mark Siskind

Prior work presented the sentence tracker, a method for scoring how well a sentence describes a video clip or alternatively how well a video clip depicts a sentence. We present an improved method for optimizing the same cost function employed by this prior work, reducing the space complexity from exponential in the sentence length to polynomial, as well as producing a qualitatively identical result in time polynomial in the sentence length instead of exponential. Since this new method is plug-compatible with the prior method, it can be used for the same applications: video retrieval with sentential queries, generating sentential descriptions of video clips, and focusing the attention of a tracker with a sentence, while allowing these applications to scale with significantly larger numbers of object detections, word meanings modeled with HMMs with significantly larger numbers of states, and significantly longer sentences, with no appreciable degradation in quality of results.

17.5CVAug 9, 2014

Video In Sentences Out

Andrei Barbu, Alexander Bridge, Zachary Burchill et al.

We present a system that produces sentential descriptions of video: who did what to whom, and where and how they did it. Action class is rendered as a verb, participant objects as noun phrases, properties of those objects as adjectival modifiers in those noun phrases, spatial relations between those participants as prepositional phrases, and characteristics of the event as prepositional-phrase adjuncts and adverbial modifiers. Extracting the information needed to render these linguistic entities requires an approach to event recognition that recovers object tracks, the trackto-role assignments, and changing body posture.

12.0CVSep 20, 2013

Saying What You're Looking For: Linguistics Meets Video Search

Andrei Barbu, N. Siddharth, Jeffrey Mark Siskind

We present an approach to searching large video corpora for video clips which depict a natural-language query in the form of a sentence. This approach uses compositional semantics to encode subtle meaning that is lost in other systems, such as the difference between two sentences which have identical words but entirely different meaning: "The person rode the horse} vs. \emph{The horse rode the person". Given a video-sentence pair and a natural-language parser, along with a grammar that describes the space of sentential queries, we produce a score which indicates how well the video depicts the sentence. We produce such a score for each video clip in a corpus and return a ranked list of clips. Furthermore, this approach addresses two fundamental problems simultaneously: detecting and tracking objects, and recognizing whether those tracks depict the query. Because both tracking and object detection are unreliable, this uses knowledge about the intended sentential query to focus the tracker on the relevant participants and ensures that the resulting tracks are described by the sentential query. While earlier work was limited to single-word queries which correspond to either verbs or nouns, we show how one can search for complex queries which contain multiple phrases, such as prepositional phrases, and modifiers, such as adverbs. We demonstrate this approach by searching for 141 queries involving people and horses interacting with each other in 10 full-length Hollywood movies.

14.2CVAug 19, 2013

Seeing What You're Told: Sentence-Guided Activity Recognition In Video

N. Siddharth, Andrei Barbu, Jeffrey Mark Siskind

We present a system that demonstrates how the compositional structure of events, in concert with the compositional structure of language, can interplay with the underlying focusing mechanisms in video action recognition, thereby providing a medium, not only for top-down and bottom-up integration, but also for multi-modal integration between vision and language. We show how the roles played by participants (nouns), their characteristics (adjectives), the actions performed (verbs), the manner of such actions (adverbs), and changing spatial relations between participants (prepositions) in the form of whole sentential descriptions mediated by a grammar, guides the activity-recognition process. Further, the utility and expressiveness of our framework is demonstrated by performing three separate tasks in the domain of multi-activity videos: sentence-guided focus of attention, generation of sentential descriptions of video, and query-based video search, simply by leveraging the framework in different manners.

3.1CVJun 21, 2013

Discriminative Training: Learning to Describe Video with Sentences, from Video Described with Sentences

Haonan Yu, Jeffrey Mark Siskind

We present a method for learning word meanings from complex and realistic video clips by discriminatively training (DT) positive sentential labels against negative ones, and then use the trained word models to generate sentential descriptions for new video. This new work is inspired by recent work which adopts a maximum likelihood (ML) framework to address the same problem using only positive sentential labels. The new method, like the ML-based one, is able to automatically determine which words in the sentence correspond to which concepts in the video (i.e., ground words to meanings) in a weakly supervised fashion. While both DT and ML yield comparable results with sufficient training data, DT outperforms ML significantly with smaller training sets because it can exploit negative training labels to better constrain the learning problem.

5.5CVJun 20, 2013

Felzenszwalb-Baum-Welch: Event Detection by Changing Appearance

Daniel Paul Barrett, Jeffrey Mark Siskind

We propose a method which can detect events in videos by modeling the change in appearance of the event participants over time. This method makes it possible to detect events which are characterized not by motion, but by the changing state of the people or objects involved. This is accomplished by using object detectors as output models for the states of a hidden Markov model (HMM). The method allows an HMM to model the sequence of poses of the event participants over time, and is effective for poses of humans and inanimate objects. The ability to use existing object-detection methods as part of an event model makes it possible to leverage ongoing work in the object-detection community. A novel training method uses an EM loop to simultaneously learn the temporal structure and object models automatically, without the need to specify either the individual poses to be modeled or the frames in which they occur. The E-step estimates the latent assignment of video frames to HMM states, while the M-step estimates both the HMM transition probabilities and state output models, including the object detectors, which are trained on the weighted subset of frames assigned to their state. A new dataset was gathered because little work has been done on events characterized by changing object pose, and suitable datasets are not available. Our method produced results superior to that of comparison systems on this dataset.

1.5CVApr 16, 2012

Large-Scale Automatic Labeling of Video Events with Verbs Based on Event-Participant Interaction

Andrei Barbu, Alexander Bridge, Dan Coroian et al.

We present an approach to labeling short video clips with English verbs as event descriptions. A key distinguishing aspect of this work is that it labels videos with verbs that describe the spatiotemporal interaction between event participants, humans and objects interacting with each other, abstracting away all object-class information and fine-grained image characteristics, and relying solely on the coarse-grained motion of the event participants. We apply our approach to a large set of 22 distinct verb classes and a corpus of 2,584 videos, yielding two surprising outcomes. First, a classification accuracy of greater than 70% on a 1-out-of-22 labeling task and greater than 85% on a variety of 1-out-of-10 subsets of this labeling task is independent of the choice of which of two different time-series classifiers we employ. Second, we achieve this level of accuracy using a highly impoverished intermediate representation consisting solely of the bounding boxes of one or two event participants as a function of time. This indicates that successful event recognition depends more on the choice of appropriate features that characterize the linguistic invariants of the event classes than on the particular classifier algorithms.

7.7CVApr 12, 2012

Seeing Unseeability to See the Unseeable

Siddharth Narayanaswamy, Andrei Barbu, Jeffrey Mark Siskind

We present a framework that allows an observer to determine occluded portions of a structure by finding the maximum-likelihood estimate of those occluded portions consistent with visible image evidence and a consistency model. Doing this requires determining which portions of the structure are occluded in the first place. Since each process relies on the other, we determine a solution to both problems in tandem. We extend our framework to determine confidence of one's assessment of which portions of an observed structure are occluded, and the estimate of that occluded structure, by determining the sensitivity of one's assessment to potential new observations. We further extend our framework to determine a robotic action whose execution would allow a new observation that would maximally increase one's confidence.

14.6CVApr 12, 2012

Video In Sentences Out

Andrei Barbu, Alexander Bridge, Zachary Burchill et al.

We present a system that produces sentential descriptions of video: who did what to whom, and where and how they did it. Action class is rendered as a verb, participant objects as noun phrases, properties of those objects as adjectival modifiers in those noun phrases,spatial relations between those participants as prepositional phrases, and characteristics of the event as prepositional-phrase adjuncts and adverbial modifiers. Extracting the information needed to render these linguistic entities requires an approach to event recognition that recovers object tracks, the track-to-role assignments, and changing body posture.

12.7CVApr 12, 2012

Simultaneous Object Detection, Tracking, and Event Recognition

Andrei Barbu, Aaron Michaux, Siddharth Narayanaswamy et al.

The common internal structure and algorithmic organization of object detection, detection-based tracking, and event recognition facilitates a general approach to integrating these three components. This supports multidirectional information flow between these components allowing object detection to influence tracking and event recognition and event recognition to influence tracking and object detection. The performance of the combination can exceed the performance of the components in isolation. This can be done with linear asymptotic complexity.