Clayton Frederick Souza Leite

h-index: 6

4 papers

24 citations

Novelty: 41%

AI Score: 32

Ranked #126,772 of 194,257 authors (top 65%); #27,915 in LG (top 69%)

4 Papers

9.2 · LG · Oct 17, 2024

Transformer-Based Approaches for Sensor-Based Human Activity Recognition: Opportunities and Challenges

Clayton Souza Leite, Henry Mauranen, Aziza Zhanabatyrova et al.

Transformers have excelled in natural language processing and computer vision, paving their way to sensor-based Human Activity Recognition (HAR). Previous studies show that transformers outperform their counterparts exclusively when they harness abundant data or employ compute-intensive optimization algorithms. However, neither of these scenarios is viable in sensor-based HAR due to the scarcity of data in this field and the frequent need to perform training and inference on resource-constrained devices. Our extensive investigation into various implementations of transformer-based versus non-transformer-based HAR using wearable sensors, encompassing more than 500 experiments, corroborates these concerns. We observe that transformer-based solutions pose higher computational demands, consistently yield inferior performance, and experience significant performance degradation when quantized to accommodate resource-constrained devices. Additionally, transformers demonstrate lower robustness to adversarial attacks, posing a potential threat to user trust in HAR.

4.1 · LG · Sep 1, 2025

Hierarchical Motion Captioning Utilizing External Text Data Source

Clayton Leite, Yu Xiao

This paper introduces a novel approach to enhance existing motion captioning methods, which directly map representations of movement to high-level descriptive captions (e.g., "a person doing jumping jacks"). The existing methods require motion data annotated with high-level descriptions (e.g., "jumping jacks"). However, such data is rarely available in existing motion-text datasets, which additionally do not include low-level motion descriptions. To address this, we propose a two-step hierarchical approach. First, we employ large language models to create detailed descriptions corresponding to each high-level caption that appears in the motion-text datasets (e.g., "jumping while synchronizing arm extensions with the opening and closing of legs" for "jumping jacks"). These refined annotations are used to retrain motion-to-text models to produce captions with low-level details. Second, we introduce a pioneering retrieval-based mechanism. It aligns the detailed low-level captions with candidate high-level captions from additional text data sources, and combines them with motion features to generate precise high-level captions. Our methodology is distinctive in its ability to harness knowledge from external text sources to greatly increase motion captioning accuracy, especially for movements not covered in existing motion-text datasets. Experiments on three distinct motion-text datasets (HumanML3D, KIT, and BOTH57M) demonstrate that our method achieves an improvement in average performance (across BLEU-1, BLEU-4, CIDEr, and ROUGE-L) ranging from 6% to 50% compared to the state-of-the-art M2T-Interpretable.
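As a rough illustration of the retrieval step described in this abstract, the sketch below ranks candidate high-level captions against a generated low-level caption. It is not the authors' mechanism: the bag-of-words cosine similarity, the function names (`bow`, `cosine`, `retrieve`), and the assumption that each candidate label comes with a detailed description are all hypothetical, and the fusion with motion features is omitted.

```python
import math
from collections import Counter


def bow(text):
    """Lower-cased bag-of-words vector (a simple stand-in for a learned text encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(low_level_caption, candidates, k=1):
    """Return the k candidate labels whose descriptions best match the caption.

    `candidates` maps a high-level label to a detailed description
    (hypothetical format, assumed for this sketch).
    """
    query = bow(low_level_caption)
    ranked = sorted(
        candidates,
        key=lambda label: cosine(query, bow(candidates[label])),
        reverse=True,
    )
    return ranked[:k]


candidates = {
    "jumping jacks": "jumping while synchronizing arm extensions "
                     "with the opening and closing of legs",
    "squat": "bending the knees and lowering the hips then standing up",
}
top = retrieve(
    "a person is jumping and opening and closing the legs with arm extensions",
    candidates,
)
```

In this toy setup the query shares far more tokens with the "jumping jacks" description than with "squat", so that label is ranked first.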

2.6 · LG · Oct 11, 2024

Enhancing Motion Variation in Text-to-Motion Models via Pose and Video Conditioned Editing

Clayton Leite, Yu Xiao

Text-to-motion models that generate sequences of human poses from textual descriptions are garnering significant attention. However, due to data scarcity, the range of motions these models can produce is still limited. For instance, current text-to-motion models cannot generate a motion of kicking a football with the instep of the foot, since the training data only includes martial arts kicks. We propose a novel method that uses short video clips or images as conditions to modify existing basic motions. In this approach, the model's understanding of a kick serves as the prior, while the video or image of a football kick acts as the posterior, enabling the generation of the desired motion. By incorporating these additional modalities as conditions, our method can create motions not present in the training set, overcoming the limitations of text-motion datasets. A user study with 26 participants demonstrated that our approach produces unseen motions with realism comparable to commonly represented motions in text-motion datasets (e.g., HumanML3D), such as walking, running, squatting, and kicking.

2.6 · CV · Sep 24, 2021

Automatic Map Update Using Dashcam Videos

Aziza Zhanabatyrova, Clayton Souza Leite, Yu Xiao

Autonomous driving requires 3D maps that provide accurate and up-to-date information about semantic landmarks. Due to the wider availability and lower cost of cameras compared with laser scanners, vision-based mapping solutions, especially the ones using crowdsourced visual data, have attracted much attention from academia and industry. However, previous works have mainly focused on creating 3D point clouds, leaving automatic change detection as an open issue. We propose in this paper a pipeline for initiating and updating 3D maps with dashcam videos, with a focus on automatic change detection based on comparison of metadata (e.g., the types and locations of traffic signs). To improve the performance of metadata generation, which depends on the accuracy of 3D object detection and localization, we introduce a novel deep learning-based pixel-wise 3D localization algorithm. The algorithm, trained directly with SfM point cloud data, can locate objects detected from 2D images in a 3D space with high accuracy by estimating not only depth from monocular images but also lateral and height distances. In addition, we propose a point clustering and thresholding algorithm to improve the robustness of the system to errors. We have performed experiments on two distinct areas - a campus and a residential area - with different types of cameras, lighting, and weather conditions. The changes were detected with 85% and 100% accuracy in the campus and residential areas, respectively. The errors in the campus area were mainly due to traffic signs that were far from the vehicle and intended for pedestrians and cyclists only. We also conducted cause analysis of the detection and localization errors to measure the impact from the performance of the background technology in use.
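The metadata comparison mentioned in this abstract can be pictured with a minimal sketch, assuming each landmark is a `(sign_type, x, y)` tuple in a shared map frame. The function name `detect_changes`, the greedy nearest-neighbour matching, and the `max_dist` radius are illustrative assumptions, not the pipeline from the paper.

```python
from math import hypot


def detect_changes(old_signs, new_signs, max_dist=2.0):
    """Compare two lists of (sign_type, x, y) landmarks.

    Returns (added, removed): signs found only in the new observations
    and signs found only in the old map. `max_dist` is a hypothetical
    matching radius in metres.
    """
    unmatched_old = list(old_signs)
    added = []
    for sign in new_signs:
        kind, x, y = sign
        # Look for the closest still-unmatched old sign of the same type.
        best, best_d = None, max_dist
        for cand in unmatched_old:
            if cand[0] != kind:
                continue
            d = hypot(cand[1] - x, cand[2] - y)
            if d <= best_d:
                best, best_d = cand, d
        if best is None:
            added.append(sign)          # nothing nearby: treat as a new sign
        else:
            unmatched_old.remove(best)  # matched: the sign is unchanged
    return added, unmatched_old


old_map = [("stop", 0.0, 0.0), ("yield", 10.0, 5.0)]
new_map = [("stop", 0.5, 0.2), ("speed_limit", 20.0, 1.0)]
added, removed = detect_changes(old_map, new_map)
```

Here the stop sign is matched despite a small localization offset, the speed-limit sign is reported as added, and the yield sign as removed. A distance threshold of this kind is one simple way to tolerate localization noise when comparing maps.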