CVJun 2, 2022
Unified Recurrence Modeling for Video Action AnticipationTsung-Ming Tai, Giuseppe Fiameni, Cheng-Kuang Lee et al.
Forecasting future events based on evidence of current conditions is an innate skill of human beings, and key for predicting the outcome of any decision making. In artificial vision for example, we would like to predict the next human action before it happens, without observing the future video frames associated to it. Computer vision models for action anticipation are expected to collect the subtle evidence in the preamble of the target actions. In prior studies recurrence modeling often leads to better performance, the strong temporal inference is assumed to be a key element for reasonable prediction. To this end, we propose a unified recurrence modeling for video action anticipation via message passing framework. The information flow in space-time can be described by the interaction between vertices and edges, and the changes of vertices for each incoming frame reflects the underlying dynamics. Our model leverages self-attention as the building blocks for each of the message passing functions. In addition, we introduce different edge learning strategies that can be end-to-end optimized to gain better flexibility for the connectivity between vertices. Our experimental results demonstrate that our proposed method outperforms previous works on the large-scale EPIC-Kitchen dataset.
CVDec 17, 2022
Inductive Attention for Video Action AnticipationTsung-Ming Tai, Giuseppe Fiameni, Cheng-Kuang Lee et al.
Anticipating future actions based on spatiotemporal observations is essential in video understanding and predictive computer vision. Moreover, a model capable of anticipating the future has important applications, it can benefit precautionary systems to react before an event occurs. However, unlike in the action recognition task, future information is inaccessible at observation time -- a model cannot directly map the video frames to the target action to solve the anticipation task. Instead, the temporal inference is required to associate the relevant evidence with possible future actions. Consequently, existing solutions based on the action recognition models are only suboptimal. Recently, researchers proposed extending the observation window to capture longer pre-action profiles from past moments and leveraging attention to retrieve the subtle evidence to improve the anticipation predictions. However, existing attention designs typically use frame inputs as the query which is suboptimal, as a video frame only weakly connects to the future action. To this end, we propose an inductive attention model, dubbed IAM, which leverages the current prediction priors as the query to infer future action and can efficiently process the long video content. Furthermore, our method considers the uncertainty of the future via the many-to-many association in the attention design. As a result, IAM consistently outperforms the state-of-the-art anticipation models on multiple large-scale egocentric video datasets while using significantly fewer model parameters.
CVJun 22, 2022
NVIDIA-UNIBZ Submission for EPIC-KITCHENS-100 Action Anticipation Challenge 2022Tsung-Ming Tai, Oswald Lanz, Giuseppe Fiameni et al.
In this report, we describe the technical details of our submission for the EPIC-Kitchen-100 action anticipation challenge. Our modelings, the higher-order recurrent space-time transformer and the message-passing neural network with edge learning, are both recurrent-based architectures which observe only 2.5 seconds inference context to form the action anticipation prediction. By averaging the prediction scores from a set of models compiled with our proposed training pipeline, we achieved strong performance on the test set, which is 19.61% overall mean top-5 recall, recorded as second place on the public leaderboard.
IVMay 18, 2022
Speckle Image Restoration without Clean DataTsung-Ming Tai, Yun-Jie Jhang, Wen-Jyi Hwang et al.
Speckle noise is an inherent disturbance in coherent imaging systems such as digital holography, synthetic aperture radar, optical coherence tomography, or ultrasound systems. These systems usually produce only single observation per view angle of the same interest object, imposing the difficulty to leverage the statistic among observations. We propose a novel image restoration algorithm that can perform speckle noise removal without clean data and does not require multiple noisy observations in the same view angle. Our proposed method can also be applied to the situation without knowing the noise distribution as prior. We demonstrate our method is especially well-suited for spectral images by first validating on the synthetic dataset, and also applied on real-world digital holography samples. The results are superior in both quantitative measurement and visual inspection compared to several widely applied baselines. Our method even shows promising results across different speckle noise strengths, without the clean data needed.
CVMar 2
Action-Guided Attention for Video Action AnticipationTsung-Ming Tai, Sofia Casarin, Andrea Pilzer et al.
Anticipating future actions in videos is challenging, as the observed frames provide only evidence of past activities, requiring the inference of latent intentions to predict upcoming actions. Existing transformer-based approaches, which rely on dot-product attention over pixel representations, often lack the high-level semantics necessary to model video sequences for effective action anticipation. As a result, these methods tend to overfit to explicit visual cues present in the past frames, limiting their ability to capture underlying intentions and degrading generalization to unseen samples. To address this, we propose Action-Guided Attention (AGA), an attention mechanism that explicitly leverages predicted action sequences as queries and keys to guide sequence modeling. Our approach fosters the attention module to emphasize relevant moments from the past based on the upcoming activity and combine this information with the current frame embedding via a dedicated gating function. The design of AGA enables post-training analysis of the knowledge discovered from the training set. Experiments on the widely adopted EPIC-Kitchens-100 benchmark demonstrate that AGA generalizes well from validation to unseen test sets. Post-training analysis can further examine the action dependencies captured by the model and the counterfactual evidence it has internalized, offering transparent and interpretable insights into its anticipative predictions.
CVJul 10, 2023
Robust Feature Learning Against Noisy LabelsTsung-Ming Tai, Yun-Jie Jhang, Wen-Jyi Hwang
Supervised learning of deep neural networks heavily relies on large-scale datasets annotated by high-quality labels. In contrast, mislabeled samples can significantly degrade the generalization of models and result in memorizing samples, further learning erroneous associations of data contents to incorrect annotations. To this end, this paper proposes an efficient approach to tackle noisy labels by learning robust feature representation based on unsupervised augmentation restoration and cluster regularization. In addition, progressive self-bootstrapping is introduced to minimize the negative impact of supervision from noisy labels. Our proposed design is generic and flexible in applying to existing classification architectures with minimal overheads. Experimental results show that our proposed method can efficiently and effectively enhance model robustness under severely noisy labels.
CVApr 17, 2021
Higher Order Recurrent Space-Time Transformer for Video Action PredictionTsung-Ming Tai, Giuseppe Fiameni, Cheng-Kuang Lee et al.
Endowing visual agents with predictive capability is a key step towards video intelligence at scale. The predominant modeling paradigm for this is sequence learning, mostly implemented through LSTMs. Feed-forward Transformer architectures have replaced recurrent model designs in ML applications of language processing and also partly in computer vision. In this paper we investigate on the competitiveness of Transformer-style architectures for video predictive tasks. To do so we propose HORST, a novel higher order recurrent layer design whose core element is a spatial-temporal decomposition of self-attention for video. HORST achieves state of the art competitive performance on Something-Something early action recognition and EPIC-Kitchens action anticipation, showing evidence of predictive capability that we attribute to our recurrent higher order design of self-attention.
NIJul 22, 2020
An Intelligent QoS Algorithm for Home NetworksWen-Jyi Hwang, Tsung-Ming Tai, Bo-Ting Pan et al.
A novel quality of service (QoS) management algorithm for home networks is presented in this letter. The algorithm is based on service prediction for intelligent QoS management. The service prediction is carried out by a general regression neural network with a profile containing the past records of the service. A novel profile updating technique is proposed so that the profile size can remain small for fast bandwidth allocation. The analytical study and experiments over real LAN reveal that the proposed algorithm provides reliable QoS management for home networks with low computational overhead.
SPJul 22, 2020
Sensor-Based Continuous Hand Gesture Recognition by Long Short-Term MemoryTsung-Ming Tai, Yun-Jie Jhang, Zhen-Wei Liao et al.
This article aims to present a novel sensor-based continuous hand gesture recognition algorithm by long short-term memory (LSTM). Only the basic accelerators and/or gyroscopes are required by the algorithm. Given a sequence of input sensory data, a many-to-many LSTM scheme is adopted to produce an output path. A maximum a posteriori estimation is then carried out based on the observed path to obtain the final classification results. A prototype system based on smartphones has been implemented for the performance evaluation. Experimental results show that the proposed algorithm is an effective alternative for robust and accurate hand-gesture recognition.