Bertram E. Shi

CV
h-index: 6
26 papers
2,650 citations
Novelty: 47%
AI Score: 32

26 Papers

CV Sep 2, 2024
Merging Multiple Datasets for Improved Appearance-Based Gaze Estimation

Liang Wu, Bertram E. Shi

Multiple datasets have been created for training and testing appearance-based gaze estimators. Intuitively, more data should lead to better performance. However, combining datasets to train a single estimator rarely improves gaze estimation performance. One reason may be differences in the experimental protocols used to obtain the gaze samples, resulting in differences in the distributions of head poses, gaze angles, illumination, etc. Another reason may be the inconsistency between methods used to define gaze angles (label mismatch). We propose two innovations to improve the performance of gaze estimation by leveraging multiple datasets: a change in the estimator architecture and the introduction of a gaze adaptation module. Most state-of-the-art estimators either merge information extracted from images of the two eyes and the entire face in parallel, or combine information from the eyes first and then with the face. Our proposed Two-stage Transformer-based Gaze-feature Fusion (TTGF) method uses transformers to merge information from each eye and the face separately and then merge across the two eyes. We argue that this improves head pose invariance since changes in head pose affect left and right eye images in different ways. Our proposed Gaze Adaptation Module (GAM) method handles annotation inconsistency by applying a Gaze Adaptation Module for each dataset to correct gaze estimates from a single shared estimator. This enables us to combine information across datasets despite differences in labeling. Our experiments show that these innovations improve gaze estimation performance over the SOTA both individually and collectively (by 10% - 20%). Our code is available at https://github.com/HKUST-NISL/GazeSetMerge.

RO Jul 30, 2024
Survey of Design Paradigms for Social Robots

Rita Frieske, Xiaoyu Mo, Yini Fang et al.

The demand for social robots in fields like healthcare, education, and entertainment increases due to their emotional adaptation features. These robots leverage multimodal communication, incorporating speech, facial expressions, and gestures to enhance user engagement and emotional support. The understanding of design paradigms of social robots is obstructed by the complexity of the system and the necessity to tune it to a specific task. This article provides a structured review of social robot design paradigms, categorizing them into cognitive architectures, role design models, linguistic models, communication flow, activity system models, and integrated design models. By breaking down the articles on social robot design and application based on these paradigms, we highlight the strengths and areas for improvement in current approaches. We further propose our original integrated design model that combines the most important aspects of the design of social robots. Our approach shows the importance of integrating operational, communicational, and emotional dimensions to create more adaptive and empathetic interactions between robots and humans.

CV Jul 25, 2024
ERIT Lightweight Multimodal Dataset for Elderly Emotion Recognition and Multimodal Fusion Evaluation

Rita Frieske, Bertram E. Shi

ERIT is a novel multimodal dataset designed to facilitate research in lightweight multimodal fusion. It contains text and image data collected from videos of elderly individuals reacting to various situations, as well as seven emotion labels for each data sample. Because it uses labeled images of elderly users reacting emotionally, it also facilitates research on visual emotion recognition in an age group that is underrepresented in machine learning. The dataset is validated through comprehensive experiments indicating its importance in neural multimodal fusion research.

CL Jan 11, 2022
CI-AVSR: A Cantonese Audio-Visual Speech Dataset for In-car Command Recognition

Wenliang Dai, Samuel Cahyawijaya, Tiezheng Yu et al.

With the rise of deep learning and intelligent vehicles, the smart assistant has become an essential in-car component to facilitate driving and provide extra functionalities. In-car smart assistants should be able to process general as well as car-related commands and perform corresponding actions, which eases driving and improves safety. However, there is a data scarcity issue for low resource languages, hindering the development of research and applications. In this paper, we introduce a new dataset, Cantonese In-car Audio-Visual Speech Recognition (CI-AVSR), for in-car command recognition in the Cantonese language with both video and audio data. It consists of 4,984 samples (8.3 hours) of 200 in-car commands recorded by 30 native Cantonese speakers. Furthermore, we augment our dataset using common in-car background noises to simulate real environments, producing a dataset 10 times larger than the collected one. We provide detailed statistics of both the clean and the augmented versions of our dataset. Moreover, we implement two multimodal baselines to demonstrate the validity of CI-AVSR. Experiment results show that leveraging the visual signal improves the overall performance of the model. Although our best model can achieve considerable quality on the clean test set, the speech recognition quality on the noisy data is still inferior, and this remains an extremely challenging task for real in-car speech recognition systems. The dataset and code will be released at https://github.com/HLTCHKUST/CI-AVSR.

CL Jan 3, 2024
Hallucinations in Neural Automatic Speech Recognition: Identifying Errors and Hallucinatory Models

Rita Frieske, Bertram E. Shi

Hallucinations are a type of output error produced by deep neural networks. While they have been studied in natural language processing, they have not previously been researched in automatic speech recognition (ASR). Here, we define hallucinations in ASR as transcriptions generated by a model that are semantically unrelated to the source utterance, yet still fluent and coherent. The similarity of hallucinations to probable natural language outputs of the model creates a danger of deception and impacts the credibility of the system. We show that commonly used metrics, such as word error rates, cannot differentiate between hallucinatory and non-hallucinatory models. To address this, we propose a perturbation-based method for assessing the susceptibility of an ASR model to hallucination at test time, which does not require access to the training dataset. We demonstrate that this method helps to distinguish between hallucinatory and non-hallucinatory models that have similar baseline word error rates. We further explore the relationship between the types of ASR errors and the types of dataset noise to determine what types of noise are most likely to create hallucinatory outputs. We devise a framework for identifying hallucinations by analysing their semantic connection with the ground truth and their fluency. Finally, we discover how to induce hallucinations with a random noise injection to the utterance.

CL Jan 7, 2022
Automatic Speech Recognition Datasets in Cantonese: A Survey and New Dataset

Tiezheng Yu, Rita Frieske, Peng Xu et al.

Automatic speech recognition (ASR) on low resource languages improves the access of linguistic minorities to technological advantages provided by artificial intelligence (AI). In this paper, we address the problem of data scarcity for the Hong Kong Cantonese language by creating a new Cantonese dataset. Our dataset, Multi-Domain Cantonese Corpus (MDCC), consists of 73.6 hours of clean read speech paired with transcripts, collected from Cantonese audiobooks from Hong Kong. It comprises philosophy, politics, education, culture, lifestyle and family domains, covering a wide range of topics. We also review all existing Cantonese datasets and analyze them according to their speech type, data source, total size and availability. We further conduct experiments with Fairseq S2T Transformer, a state-of-the-art ASR model, on the biggest existing dataset, Common Voice zh-HK, and our proposed MDCC, and the results show the effectiveness of our dataset. In addition, we create a powerful and robust Cantonese ASR model by applying multi-dataset learning on MDCC and Common Voice zh-HK.

CL Dec 12, 2021
ASCEND: A Spontaneous Chinese-English Dataset for Code-switching in Multi-turn Conversation

Holy Lovenia, Samuel Cahyawijaya, Genta Indra Winata et al.

Code-switching is a speech phenomenon occurring when a speaker switches language during a conversation. Despite the spontaneous nature of code-switching in conversational spoken language, most existing works collect code-switching data from read speech instead of spontaneous speech. ASCEND (A Spontaneous Chinese-English Dataset) is a high-quality Mandarin Chinese-English code-switching corpus built on spontaneous multi-turn conversational dialogue sources collected in Hong Kong. We report ASCEND's design and procedure for collecting the speech data, including annotations. ASCEND consists of 10.62 hours of clean speech, collected from 23 bilingual speakers of Chinese and English. Furthermore, we conduct baseline experiments using pre-trained wav2vec 2.0 models, achieving a best performance of 22.69% character error rate and 27.05% mixed error rate.
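
For readers unfamiliar with the metric, the sketch below computes a character error rate under its common definition: the Levenshtein edit distance between reference and hypothesis characters, divided by the reference length. The function names are illustrative, and the exact text normalization used for ASCEND is not stated in the abstract, so none is applied here. The mixed error rate is not covered.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, using one DP row at a time."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference character
                            curr[j - 1] + 1,      # insert a hypothesis character
                            prev[j - 1] + cost))  # substitute, or match at no cost
        prev = curr
    return prev[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance over characters, normalized by the reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Because insertions count as errors, this value can exceed 1.0 when the hypothesis is much longer than the reference.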

CV Jul 21, 2021
Iterative Distillation for Better Uncertainty Estimates in Multitask Emotion Recognition

Didan Deng, Liang Wu, Bertram E. Shi

When recognizing emotions, subtle nuances in displays of emotion generate ambiguity or uncertainty in emotion perception. Emotion uncertainty has been previously interpreted as inter-rater disagreement among multiple annotators. In this paper, we consider a more common and challenging scenario: modeling emotion uncertainty when only single emotion labels are available. From a Bayesian perspective, we propose to use deep ensembles to capture uncertainty for multiple emotion descriptors, i.e., action units, discrete expression labels and continuous descriptors. We further apply iterative self-distillation. Iterative distillation over multiple generations significantly improves performance in both emotion recognition and uncertainty estimation. Our method generates single student models that provide accurate estimates of uncertainty for in-domain samples and a student ensemble that can detect out-of-domain samples. Our experiments on emotion recognition and uncertainty estimation using the Aff-wild2 dataset demonstrate that our algorithm gives more reliable uncertainty estimates than both Temperature Scaling and Monte Carlo Dropout.

CV Jan 29, 2021
Learning Hierarchical Integration of Foveal and Peripheral Vision for Vergence Control by Active Efficient Coding

Zhetuo Zhao, Jochen Triesch, Bertram E. Shi

The active efficient coding (AEC) framework parsimoniously explains the joint development of visual processing and eye movements, e.g., the emergence of binocular disparity selective neurons and fusional vergence, the disjunctive eye movements that align left and right eye images. Vergence can be driven by information in both the fovea and periphery, which play complementary roles. The high resolution fovea can drive precise short range movements. The lower resolution periphery supports coarser long range movements. The fovea and periphery may also contain conflicting information, e.g. due to objects at different depths. While past AEC models did integrate peripheral and foveal information, they did not explicitly take into account these characteristics. We propose here a two-level hierarchical approach that does. The bottom level generates different vergence actions from foveal and peripheral regions. The top level selects one. We demonstrate that the hierarchical approach performs better than prior approaches in realistic environments, exhibiting better alignment and less oscillation.

CV Jan 27, 2021
Self-Calibrating Active Binocular Vision via Active Efficient Coding with Deep Autoencoders

Charles Wilmot, Bertram E. Shi, Jochen Triesch

We present a model of the self-calibration of active binocular vision comprising the simultaneous learning of visual representations, vergence, and pursuit eye movements. The model follows the principle of Active Efficient Coding (AEC), a recent extension of the classic Efficient Coding Hypothesis to active perception. In contrast to previous AEC models, the present model uses deep autoencoders to learn sensory representations. We also propose a new formulation of the intrinsic motivation signal that guides the learning of behavior. We demonstrate the performance of the model in simulations.

CV Jan 14, 2021
AVGCN: Trajectory Prediction using Graph Convolutional Networks Guided by Human Attention

Congcong Liu, Yuying Chen, Ming Liu et al.

Pedestrian trajectory prediction is a critical yet challenging task, especially for crowded scenes. We suggest that introducing an attention mechanism to infer the importance of different neighbors is critical for accurate trajectory prediction in scenes with varying crowd size. In this work, we propose a novel method, AVGCN, for trajectory prediction utilizing graph convolutional networks (GCN) based on human attention (A denotes attention, V denotes visual field constraints). First, we train an attention network that estimates the importance of neighboring pedestrians, using gaze data collected as subjects perform a bird's eye view crowd navigation task. Then, we incorporate the learned attention weights modulated by constraints on the pedestrian's visual field into a trajectory prediction network that uses a GCN to aggregate information from neighbors efficiently. AVGCN also considers the stochastic nature of pedestrian trajectories by taking advantage of variational trajectory prediction. Our approach achieves state-of-the-art performance on several trajectory prediction benchmarks, and the lowest average prediction error over all considered benchmarks.

CV Sep 15, 2020
HGCN-GJS: Hierarchical Graph Convolutional Network with Groupwise Joint Sampling for Trajectory Prediction

Yuying Chen, Congcong Liu, Xiaodong Mei et al.

Accurate pedestrian trajectory prediction is of great importance for downstream tasks such as autonomous driving and mobile robot navigation. Fully investigating the social interactions within the crowd is crucial for accurate pedestrian trajectory prediction. However, most existing methods do not capture group level interactions well, focusing only on pairwise interactions and neglecting group-wise interactions. In this work, we propose a hierarchical graph convolutional network, HGCN-GJS, for trajectory prediction that effectively leverages group level interactions within the crowd. Furthermore, we introduce a novel joint sampling scheme for modeling the joint distribution of the future trajectories of multiple pedestrians. Based on the group information, this scheme associates the trajectory of one person with the trajectory of other people in the group, but maintains the independence of the trajectories of outsiders. We demonstrate the performance of our network on several trajectory prediction datasets, achieving state-of-the-art results on all datasets considered.

CV Feb 10, 2020
Multitask Emotion Recognition with Incomplete Labels

Didan Deng, Zhaokang Chen, Bertram E. Shi

We train a unified model to perform three tasks: facial action unit detection, expression classification, and valence-arousal estimation. We address two main challenges of learning the three tasks. First, most existing datasets are highly imbalanced. Second, most existing datasets do not contain labels for all three tasks. To tackle the first challenge, we apply data balancing techniques to experimental datasets. To tackle the second challenge, we propose an algorithm for the multitask model to learn from missing (incomplete) labels. This algorithm has two steps. We first train a teacher model to perform all three tasks, where each instance is trained by the ground truth label of its corresponding task. Secondly, we refer to the outputs of the teacher model as the soft labels. We use the soft labels and the ground truth to train the student model. We find that most of the student models outperform their teacher model on all the three tasks. Finally, we use model ensembling to boost performance further on the three tasks.

CV Jan 25, 2020
Towards High Performance Low Complexity Calibration in Appearance Based Gaze Estimation

Zhaokang Chen, Bertram E. Shi

Appearance-based gaze estimation from RGB images provides relatively unconstrained gaze tracking. We have previously proposed a gaze decomposition method that decomposes the gaze angle into the sum of a subject-independent gaze estimate from the image and a subject-dependent bias. This paper extends that work with a more complete characterization of the interplay between the complexity of the calibration dataset and estimation accuracy. We analyze the effect of the number of gaze targets, the number of images used per gaze target and the number of head positions in calibration data using a new NISLGaze dataset, which is well suited for analyzing these effects as it includes more diversity in head positions and orientations for each subject than other datasets. A better understanding of these factors enables low complexity high performance calibration. Our results indicate that using only a single gaze target and single head position is sufficient to achieve high quality calibration, outperforming state-of-the-art methods by more than 6.3%. One of the surprising findings is that the same estimator yields the best performance both with and without calibration. To better understand the reasons, we provide a new theoretical analysis that specifies the conditions under which this can be expected.

RO Sep 23, 2019
Robot Navigation in Crowds by Graph Convolutional Networks with Attention Learned from Human Gaze

Yuying Chen, Congcong Liu, Ming Liu et al.

Safe and efficient crowd navigation for mobile robots is a crucial yet challenging task. Previous work has shown the power of deep reinforcement learning frameworks to train efficient policies. However, their performance deteriorates when the crowd size grows. We suggest that this can be addressed by enabling the network to identify and pay attention to the humans in the crowd that are most critical to navigation. We propose a novel network utilizing a graph representation to learn the policy. We first train a graph convolutional network based on human gaze data that accurately predicts human attention to different agents in the crowd. Then we incorporate the learned attention into a graph-based reinforcement learning architecture. The proposed attention mechanism enables the assignment of meaningful weightings to the neighbors of the robot, and has the additional benefit of interpretability. Experiments on real-world dense pedestrian datasets with various crowd sizes demonstrate that our model outperforms state-of-the-art methods by 18.4% in task accomplishment and by 16.4% in time efficiency.

CV May 11, 2019
Offset Calibration for Appearance-Based Gaze Estimation via Gaze Decomposition

Zhaokang Chen, Bertram E. Shi

Appearance-based gaze estimation provides relatively unconstrained gaze tracking. However, subject-independent models achieve limited accuracy partly due to individual variations. To improve estimation, we propose a novel gaze decomposition method and a single gaze point calibration method, motivated by our finding that the inter-subject squared bias exceeds the intra-subject variance for a subject-independent estimator. We decompose the gaze angle into a subject-dependent bias term and a subject-independent term between the gaze angle and the bias. The subject-independent term is estimated by a deep convolutional network. For calibration-free tracking, we set the subject-dependent bias term to zero. For single gaze point calibration, we estimate the bias from a few images taken as the subject gazes at a point. Experiments on three datasets indicate that as a calibration-free estimator, the proposed method outperforms the state-of-the-art methods by up to $10.0\%$. The proposed calibration method is robust and reduces estimation error significantly (up to $35.6\%$), achieving state-of-the-art performance for appearance-based eye trackers with calibration.
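
The decomposition described above can be illustrated with a minimal sketch: a gaze estimate is the subject-independent network output plus a per-subject bias, and the bias is set to zero when no calibration data exist. Estimating the bias as the mean residual over the calibration images is an assumption made here for illustration, as the abstract does not give the estimator. The function names and the (yaw, pitch) tuple representation are likewise hypothetical.

```python
def estimate_bias(predictions, target):
    """Estimate a subject's (yaw, pitch) bias from a single calibration point.

    predictions: subject-independent estimates for images taken while the
                 subject looks at one known target.
    target:      the true gaze angle for that target.
    The mean-residual estimator is an illustrative assumption.
    """
    n = len(predictions)
    if n == 0:
        return (0.0, 0.0)  # calibration-free case: bias set to zero
    yaw = sum(target[0] - p[0] for p in predictions) / n
    pitch = sum(target[1] - p[1] for p in predictions) / n
    return (yaw, pitch)


def calibrated_gaze(prediction, bias):
    """Add the subject-dependent bias to a subject-independent estimate."""
    return (prediction[0] + bias[0], prediction[1] + bias[1])


# Two calibration images of the same target, angles in degrees.
bias = estimate_bias([(1.0, 2.0), (3.0, 2.0)], target=(4.0, 1.0))
corrected = calibrated_gaze((0.5, 0.5), bias)
```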

CV Apr 17, 2019
Gaze Training by Modulated Dropout Improves Imitation Learning

Yuying Chen, Congcong Liu, Lei Tai et al.

Imitation learning by behavioral cloning is a prevalent method that has achieved some success in vision-based autonomous driving. The basic idea behind behavioral cloning is to have the neural network learn from observing a human expert's behavior. Typically, a convolutional neural network learns to predict the steering commands from raw driver-view images by mimicking the behaviors of human drivers. However, there are other cues, such as gaze behavior, available from human drivers that have yet to be exploited. Previous research has shown that novice human learners can benefit from observing experts' gaze patterns. We show here that deep neural networks can also profit from this. We propose a method, gaze-modulated dropout, for integrating this gaze information into a deep driving network implicitly rather than as an additional input. Our experimental results demonstrate that gaze-modulated dropout enhances the generalization capability of the network to unseen scenes. Prediction error in steering commands is reduced by 23.5% compared to uniform dropout. Running closed loop in the simulator, the gaze-modulated dropout net increased the average distance travelled between infractions by 58.5%. Consistent with these results, the gaze-modulated dropout net shows lower model uncertainty.

CV Mar 18, 2019
Appearance-Based Gaze Estimation Using Dilated-Convolutions

Zhaokang Chen, Bertram E. Shi

Appearance-based gaze estimation has attracted more and more attention because of its wide range of applications. The use of deep convolutional neural networks has improved the accuracy significantly. In order to improve the estimation accuracy further, we focus on extracting better features from eye images. Relatively large changes in gaze angles may result in relatively small changes in eye appearance. We argue that current architectures for gaze estimation may not be able to capture such small changes, as they apply multiple pooling layers or other downsampling layers so that the spatial resolution of the high-level layers is reduced significantly. To evaluate whether the use of features extracted at high resolution can benefit gaze estimation, we adopt dilated-convolutions to extract high-level features without reducing spatial resolution. In cross-subject experiments on the Columbia Gaze dataset for eye contact detection and the MPIIGaze dataset for 3D gaze vector regression, the resulting Dilated-Nets achieve significant (up to 20.8%) gains when compared to similar networks without dilated-convolutions. Our proposed Dilated-Net achieves state-of-the-art results on both the Columbia Gaze and the MPIIGaze datasets.

CV Dec 25, 2018
Coupled Recurrent Network (CRN)

Lin Sun, Kui Jia, Yuejia Shen et al.

Many semantic video analysis tasks can benefit from multiple, heterogeneous signals. For example, in addition to the original RGB input sequences, sequences of optical flow are usually used to boost the performance of human action recognition in videos. To learn from these heterogeneous input sources, existing methods rely on two-stream architectural designs that contain independent, parallel streams of Recurrent Neural Networks (RNNs). However, two-stream RNNs do not fully exploit the reciprocal information contained in the multiple signals, let alone exploit it in a recurrent manner. To this end, we propose in this paper a novel recurrent architecture, termed Coupled Recurrent Network (CRN), to deal with multiple input sources. In CRN, the parallel streams of RNNs are coupled together. The key design of CRN is a Recurrent Interpretation Block (RIB) that supports learning of reciprocal feature representations from multiple signals in a recurrent manner. Different from RNNs which stack the training loss at each time step or the last time step, we propose an effective and efficient training strategy for CRN. Experiments show the efficacy of the proposed CRN. In particular, we achieve the new state of the art on the benchmark datasets of human action recognition and multi-person pose estimation.

IV May 2, 2018
Multimodal Utterance-level Affect Analysis using Visual, Audio and Text Features

Didan Deng, Yuqian Zhou, Jimin Pi et al.

The integration of information across multiple modalities and across time is a promising way to enhance the emotion recognition performance of affective systems. Much previous work has focused on instantaneous emotion recognition. The 2018 One-Minute Gradual-Emotion Recognition (OMG-Emotion) challenge, which was held in conjunction with the IEEE World Congress on Computational Intelligence, encouraged participants to address long-term emotion recognition by integrating cues from multiple modalities, including facial expression, audio and language. Intuitively, a multi-modal inference network should be able to leverage information from each modality and their correlations to improve recognition over that achievable by a single modality network. We describe here a multi-modal neural architecture that integrates visual information over time using an LSTM, and combines it with utterance level audio and text cues to recognize human sentiment from multimodal clips. Our model outperforms the unimodal baseline, achieving the concordance correlation coefficients (CCC) of 0.400 on the arousal task, and 0.353 on the valence task.
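
The concordance correlation coefficient reported above is commonly defined as CCC = 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²). A minimal sketch of that standard definition follows, using population (divide-by-n) statistics. The function name is illustrative, and the challenge's own evaluation script may differ in detail.

```python
def concordance_correlation(x, y):
    """Concordance correlation coefficient between two equal-length sequences.

    Uses population (divide-by-n) variance and covariance.
    """
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("x and y must be non-empty and of equal length")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    denom = var_x + var_y + (mean_x - mean_y) ** 2
    if denom == 0:
        raise ValueError("CCC is undefined when both inputs are the same constant")
    return 2 * cov / denom
```

Unlike Pearson correlation, the CCC penalizes a constant offset between predictions and labels: predictions that track the labels perfectly but are shifted by a fixed amount score below 1.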

CV Aug 13, 2017
Lattice Long Short-Term Memory for Human Action Recognition

Lin Sun, Kui Jia, Kevin Chen et al.

Human actions captured in video sequences are three-dimensional signals characterizing visual appearance and motion dynamics. To learn action patterns, existing methods adopt Convolutional and/or Recurrent Neural Networks (CNNs and RNNs). CNN based methods are effective in learning spatial appearances, but are limited in modeling long-term motion dynamics. RNNs, especially Long Short-Term Memory (LSTM), are able to learn temporal motion dynamics. However, naively applying RNNs to video sequences in a convolutional manner implicitly assumes that motions in videos are stationary across different spatial locations. This assumption is valid for short-term motions but invalid when the duration of the motion is long. In this work, we propose Lattice-LSTM (L2STM), which extends LSTM by learning independent hidden state transitions of memory cells for individual spatial locations. This method effectively enhances the ability to model dynamics across time and addresses the non-stationary issue of long-term motion dynamics without significantly increasing the model complexity. Additionally, we introduce a novel multi-modal training procedure for training our network. Unlike traditional two-stream architectures which use RGB and optical flow information as input, our two-stream model leverages both modalities to jointly train both input gates and both forget gates in the network rather than treating the two streams as separate entities with no information about the other. We apply this end-to-end system to benchmark datasets (UCF-101 and HMDB-51) of human action recognition. Experiments show that on both datasets, our proposed method outperforms all existing ones that are based on LSTM and/or CNNs of similar model complexities.

HC Apr 21, 2017
Using Variable Dwell Time to Accelerate Gaze-Based Web Browsing with Two-Step Selection

Zhaokang Chen, Bertram E. Shi

In order to avoid the "Midas Touch" problem, gaze-based interfaces for selection often introduce a dwell time: a fixed amount of time the user must fixate upon an object before it is selected. Past interfaces have used a uniform dwell time across all objects. Here, we propose a gaze-based browser using a two-step selection policy with variable dwell time. In the first step, a command, e.g. "back" or "select", is chosen from a menu using a dwell time that is constant across the different commands. In the second step, if the "select" command is chosen, the user selects a hyperlink using a dwell time that varies between different hyperlinks. We assign shorter dwell times to more likely hyperlinks and longer dwell times to less likely hyperlinks. In order to infer the likelihood each hyperlink will be selected, we have developed a probabilistic model of natural gaze behavior while surfing the web. We have evaluated a number of heuristic and probabilistic methods for varying the dwell times using both simulation and experiment. Our results demonstrate that varying dwell time improves the user experience in comparison with fixed dwell time, resulting in fewer errors and increased speed. While all of the methods for varying dwell time resulted in improved performance, the probabilistic models yielded much greater gains than the simple heuristics. The best performing model reduces error rate by 50% compared to 100ms uniform dwell time while maintaining a similar response time. It reduces response time by 60% compared to 300ms uniform dwell time while maintaining a similar error rate.
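
The idea of giving likelier hyperlinks shorter dwell times can be sketched as follows. The linear mapping, the function name, and the default 100 ms and 300 ms bounds are illustrative assumptions. The abstract does not specify which heuristic or probabilistic rules the authors evaluated, so this is not their method.

```python
def assign_dwell_times(link_probs, min_dwell_ms=100.0, max_dwell_ms=300.0):
    """Map selection probabilities to dwell times (hypothetical linear rule).

    The most likely link gets min_dwell_ms, a link with zero probability
    gets max_dwell_ms, and the rest are interpolated linearly in between.
    """
    if not link_probs:
        return {}
    top = max(link_probs.values())
    if top <= 0:
        # No link is favored, so fall back to the longest (safest) dwell time.
        return {link: max_dwell_ms for link in link_probs}
    span = max_dwell_ms - min_dwell_ms
    return {link: max_dwell_ms - span * (p / top)
            for link, p in link_probs.items()}


dwell = assign_dwell_times({"home": 0.5, "news": 0.25, "ads": 0.0})
```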

NC Jun 21, 2016
An active efficient coding model of the optokinetic nystagmus

Chong Zhang, Jochen Triesch, Bertram E. Shi

Optokinetic nystagmus (OKN) is an involuntary eye movement responsible for stabilizing retinal images in the presence of relative motion between an observer and the environment. Fully understanding the development of optokinetic nystagmus requires a neurally plausible computational model that accounts for the neural development and the behavior. To date, work in this area has been limited. We propose a neurally plausible framework for the joint development of disparity and motion tuning in the visual cortex and of the optokinetic and vergence eye movements. This framework models the joint emergence of both perception and behavior, and accounts for the importance of the development of normal vergence control and binocular vision in achieving normal monocular OKN (mOKN) behaviors. Because the model includes behavior, we can simulate the same perturbations as performed in past experiments, such as artificially induced strabismus. The proposed model agrees both qualitatively and quantitatively with a number of findings from the literature on both binocular vision and the optokinetic reflex. Finally, our model also makes quantitative predictions about the OKN behavior using the same methods used to characterize the OKN in the experimental literature.

CV Apr 15, 2016
Invariant feature extraction from event based stimuli

Thusitha N. Chandrapala, Bertram E. Shi

We propose a novel architecture, the event-based GASSOM, for learning and extracting invariant representations from event streams originating from neuromorphic vision sensors. The framework is inspired by feed-forward cortical models for visual processing. The model, which is based on the concepts of sparsity and temporal slowness, is able to learn feature extractors that resemble neurons in the primary visual cortex. Layers of units in the proposed model can be cascaded to learn feature extractors with different levels of complexity and selectivity. We explore the applicability of the framework on real world tasks by using the learned network for object recognition. The proposed model achieves higher classification accuracy compared to other state-of-the-art event based processing methods. Our results also demonstrate the generality and robustness of the method, as the recognizers for different data sets and different tasks all used the same set of learned feature detectors, which were trained on data collected independently of the testing data.

CV Oct 2, 2015
Human Action Recognition using Factorized Spatio-Temporal Convolutional Networks

Lin Sun, Kui Jia, Dit-Yan Yeung et al.

Human actions in video sequences are three-dimensional (3D) spatio-temporal signals characterizing both the visual appearance and motion dynamics of the involved humans and objects. Inspired by the success of convolutional neural networks (CNN) for image classification, recent attempts have been made to learn 3D CNNs for recognizing human actions in videos. However, partly due to the high complexity of training 3D convolution kernels and the need for large quantities of training videos, only limited success has been reported. This has triggered us to investigate in this paper a new deep architecture which can handle 3D signals more effectively. Specifically, we propose factorized spatio-temporal convolutional networks (FstCN) that factorize the original 3D convolution kernel learning as a sequential process of learning 2D spatial kernels in the lower layers (called spatial convolutional layers), followed by learning 1D temporal kernels in the upper layers (called temporal convolutional layers). We introduce a novel transformation and permutation operator to make factorization in FstCN possible. Moreover, to address the issue of sequence alignment, we propose an effective training and inference strategy based on sampling multiple video clips from a given action video sequence. We have tested FstCN on two commonly used benchmark datasets (UCF-101 and HMDB-51). Without using auxiliary training videos to boost the performance, FstCN outperforms existing CNN based methods and achieves comparable performance with a recent method that benefits from using auxiliary training videos.
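
To give a rough sense of why factorizing a 3D kernel reduces complexity, the sketch below compares weight counts for a single full 3D convolution layer against a 2D spatial layer followed by a 1D temporal layer. The function names, kernel sizes, and channel widths are illustrative assumptions. This simplified count ignores biases and does not reproduce the actual FstCN layer layout.

```python
def conv3d_weights(k_t, k_h, k_w, c_in, c_out):
    """Weights in one full 3D convolution layer (biases ignored)."""
    return k_t * k_h * k_w * c_in * c_out


def factorized_weights(k_t, k_h, k_w, c_in, c_mid, c_out):
    """Weights in a 2D spatial layer followed by a 1D temporal layer."""
    spatial = k_h * k_w * c_in * c_mid   # 2D kernels applied within each frame
    temporal = k_t * c_mid * c_out       # 1D kernels applied along time
    return spatial + temporal


# Hypothetical sizes: 3x3x3 kernels and 64 channels throughout.
full = conv3d_weights(3, 3, 3, 64, 64)
factored = factorized_weights(3, 3, 3, 64, 64, 64)
```

With these hypothetical sizes the factorized pair needs fewer than half the weights of the full 3D layer. The saving grows with kernel size, since the spatial and temporal extents add rather than multiply.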

CV Feb 14, 2014
Intrinsically Motivated Learning of Visual Motion Perception and Smooth Pursuit

Chong Zhang, Yu Zhao, Jochen Triesch et al.

We extend the framework of efficient coding, which has been used to model the development of sensory processing in isolation, to model the development of the perception/action cycle. Our extension combines sparse coding and reinforcement learning so that sensory processing and behavior co-develop to optimize a shared intrinsic motivational signal: the fidelity of the neural encoding of the sensory input under resource constraints. Applying this framework to a model system consisting of an active eye behaving in a time varying environment, we find that this generic principle leads to the simultaneous development of both smooth pursuit behavior and model neurons whose properties are similar to those of primary visual cortical neurons selective for different directions of visual motion. We suggest that this general principle may form the basis for a unified and integrated explanation of many perception/action loops.