Daniel Ajisafe

CV
h-index: 19
5 papers
12 citations
Novelty: 37%
AI Score: 35

5 Papers

CV | Sep 9, 2023
Mirror-Aware Neural Humans

Daniel Ajisafe, James Tang, Shih-Yang Su et al.

Human motion capture either requires multi-camera systems or is unreliable when using single-view input due to depth ambiguities. Meanwhile, mirrors are readily available in urban environments and form an affordable alternative by recording two views with only a single camera. However, the mirror setting poses the additional challenge of handling occlusions between the real and mirror images. Going beyond existing mirror approaches for 3D human pose estimation, we utilize mirrors for learning a complete body model, including shape and dense appearance. Our main contributions are extending articulated neural radiance fields to include a notion of a mirror and making the resulting model sample-efficient over potential occlusion regions. Together, our contributions realize a consumer-level 3D motion capture system that starts from off-the-shelf 2D poses by automatically calibrating the camera, estimating mirror orientation, and subsequently lifting 2D keypoint detections to a 3D skeleton pose that is used to condition the mirror-aware NeRF. We empirically demonstrate the benefit of learning a body model and accounting for occlusion in challenging mirror scenes.
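The geometric idea behind "two views with only a single camera" is that the mirror image of a body joint is its reflection across the mirror plane. The following is a minimal sketch of that reflection only, not the paper's implementation; the function name `reflect_point` and the plane parameterization (normal and offset) are assumptions made for illustration.

```python
import math


def reflect_point(point, normal, offset):
    """Reflect a 3D point across the plane {x : normal . x = offset}.

    The normal does not need to be unit length; it is normalized here.
    """
    norm = math.sqrt(sum(c * c for c in normal))
    unit = [c / norm for c in normal]
    # Signed distance from the point to the plane along the unit normal.
    signed_dist = sum(p * u for p, u in zip(point, unit)) - offset / norm
    # Move the point twice that distance back through the plane.
    return [p - 2.0 * signed_dist * u for p, u in zip(point, unit)]


# Hypothetical joint 1 m in front of a mirror lying in the plane z = 0.
real_joint = [0.3, 1.2, 1.0]
virtual_joint = reflect_point(real_joint, normal=[0.0, 0.0, 1.0], offset=0.0)
print(virtual_joint)
```

Reflecting the virtual joint a second time returns the original point, which is a quick sanity check for any estimated mirror plane.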

CV | May 10, 2024 | Code
CasCalib: Cascaded Calibration for Motion Capture from Sparse Unsynchronized Cameras

James Tang, Shashwat Suri, Daniel Ajisafe et al.

It is now possible to estimate 3D human pose from monocular images with off-the-shelf 3D pose estimators. However, many practical applications require fine-grained absolute pose information for which multi-view cues and camera calibration are necessary. Such multi-view recordings are laborious because they require manual calibration, and are expensive when using dedicated hardware. Our goal is full automation, which includes temporal synchronization, as well as intrinsic and extrinsic camera calibration. This is done by using persons in the scene as the calibration objects. Existing methods either address only synchronization or only calibration, assume one of the two as input, or have significant limitations. A common limitation is that they only consider single persons, which eases correspondence finding. We overcome these limitations by partitioning the high-dimensional time and calibration space into a cascade of subspaces and introducing tailored algorithms to optimize each efficiently and robustly. The outcome is an easy-to-use, flexible, and robust motion capture toolbox that we release to enable scientific applications, which we demonstrate on diverse multi-view benchmarks. Project website: https://github.com/jamestang1998/CasCalib.
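To make the temporal-synchronization step concrete, here is a minimal sketch of one generic way to align two unsynchronized recordings: a brute-force search for the integer frame shift that maximizes an unnormalized correlation between two per-frame signals. The abstract does not describe CasCalib's actual algorithm, so this is a swapped-in textbook technique; the function `best_offset` and the sample signals are hypothetical.

```python
def best_offset(signal_a, signal_b, max_shift):
    """Return the integer shift s that maximizes the mean of a[i] * b[i + s]."""
    best_shift, best_score = 0, float("-inf")
    for shift in range(-max_shift, max_shift + 1):
        # Keep only the frames where both signals overlap for this shift.
        pairs = [
            (signal_a[i], signal_b[i + shift])
            for i in range(len(signal_a))
            if 0 <= i + shift < len(signal_b)
        ]
        if not pairs:
            continue
        score = sum(a * b for a, b in pairs) / len(pairs)
        if score > best_score:
            best_shift, best_score = shift, score
    return best_shift


# Hypothetical per-frame motion magnitudes; camera B sees the same event 3 frames later.
camera_a = [0, 0, 1, 4, 9, 4, 1, 0, 0, 0, 0, 0]
camera_b = [0, 0, 0, 0, 0, 1, 4, 9, 4, 1, 0, 0]
offset = best_offset(camera_a, camera_b, max_shift=5)
print(offset)
```

A one-dimensional search like this is cheap because it is decoupled from the calibration parameters, which is the general appeal of splitting a joint time-and-calibration problem into smaller stages.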

51.3 | CV | Mar 20
Making Video Models Adhere to User Intent with Minor Adjustments

Daniel Ajisafe, Eric Hedlin, Helge Rhodin et al.

With the recent drastic advancements in text-to-video diffusion models, controlling their generations has drawn interest. A popular means of control is through bounding boxes or layouts. However, enforcing adherence to these control inputs is still an open problem. In this work, we show that by slightly adjusting user-provided bounding boxes we can improve both the quality of generations and the adherence to the control inputs. This is achieved by simply optimizing the bounding boxes to better align with the internal attention maps of the video diffusion model while carefully balancing the focus on foreground and background. In a sense, we are moving the bounding boxes to places that the model is familiar with. Surprisingly, we find that even with small modifications, the quality of generations can vary significantly. To optimize the boxes, we propose a smooth mask that makes the bounding box position differentiable and an attention-maximization objective that we use to alter the bounding boxes. We conduct thorough experiments, including a user study, to validate the effectiveness of our method. Our code is made available on the project webpage to foster future research from the community.
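A hard box mask is 1 inside the box and 0 outside, so its value does not change smoothly when the box moves and gives no useful gradient. One common way to obtain a smooth mask is to replace each box edge with a sigmoid; the sketch below illustrates that idea in plain Python. The abstract does not give the paper's exact formulation, so the function `soft_box_mask`, the normalized coordinates, and the `sharpness` parameter are all assumptions.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def soft_box_mask(width, height, box, sharpness=20.0):
    """Return a height x width grid of values in (0, 1) for a box (x0, y0, x1, y1).

    Coordinates are normalized to [0, 1]. Each edge is a sigmoid, so every mask
    value varies smoothly with the box coordinates.
    """
    x0, y0, x1, y1 = box
    mask = []
    for row in range(height):
        y = (row + 0.5) / height  # pixel-center coordinate
        wy = sigmoid(sharpness * (y - y0)) * sigmoid(sharpness * (y1 - y))
        line = []
        for col in range(width):
            x = (col + 0.5) / width
            wx = sigmoid(sharpness * (x - x0)) * sigmoid(sharpness * (x1 - x))
            line.append(wx * wy)
        mask.append(line)
    return mask


# Hypothetical 8 x 8 attention-map resolution with a centered box.
mask = soft_box_mask(8, 8, box=(0.25, 0.25, 0.75, 0.75))
print(round(mask[4][4], 3), round(mask[0][0], 5))
```

In an autodiff framework the same expression would let an attention-based objective push gradients back to the four box coordinates; a larger `sharpness` approaches the hard mask at the cost of vanishing gradients away from the edges.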

CVJan 21, 2024
The State of Computer Vision Research in Africa

Abdul-Hakeem Omotayo, Ashery Mbilinyi, Lukman Ismaila et al.

Despite significant efforts to democratize artificial intelligence (AI), computer vision, a sub-field of AI, still lags in Africa. A significant factor is the limited access to computing resources, datasets, and collaborations. As a result, Africa's contribution to top-tier publications in this field has only been 0.06% over the past decade. Towards improving the computer vision field and making it more accessible and inclusive, this study analyzes 63,000 Scopus-indexed computer vision publications from Africa. We utilize large language models to automatically parse their abstracts and to identify and categorize topics and datasets. This resulted in a list of more than 100 African datasets. Our objective is to provide a comprehensive taxonomy of dataset categories to facilitate better understanding and utilization of these resources. We also analyze collaboration trends of researchers within and outside the continent. Additionally, we conduct a large-scale questionnaire among African computer vision researchers to identify the structural barriers they believe require urgent attention. In conclusion, our study offers a comprehensive overview of the current state of computer vision research in Africa, to empower marginalized communities to participate in the design and development of computer vision systems.

AS | Oct 21, 2020
Towards End-to-End Training of Automatic Speech Recognition for Nigerian Pidgin

Amina Mardiyyah Rufai, Afolabi Abeeb, Esther Oduntan et al.

The prevalence of automatic speech recognition (ASR) systems in spoken language applications has increased significantly in recent years. Notably, many African languages lack sufficient linguistic resources to support the robustness of these systems. This paper focuses on the development of an end-to-end speech recognition system customized for Nigerian Pidgin English. We investigated and evaluated different pretrained state-of-the-art architectures on a new dataset. Our empirical results demonstrate strong performance of the Wav2Vec2 XLSR-53 variant on our dataset, achieving a word error rate (WER) of 29.6% on the test set and surpassing other architectures such as NEMO QUARTZNET and Wav2Vec2.0 BASE-100H in quantitative assessments. Additionally, we demonstrate that pretrained state-of-the-art architectures do not work well out-of-the-box. We performed zero-shot evaluation using XLSR-English as the baseline, chosen for its similarity to Nigerian Pidgin. This yielded a higher WER of 73.7%. By adapting this architecture to nuances represented in our dataset, we reduce the WER by 59.84% relative to this baseline. Our dataset comprises 4,288 recorded utterances from 10 native speakers, partitioned into training, validation, and test sets. This study underscores the potential for improving ASR systems for under-resourced languages like Nigerian Pidgin English, contributing to greater inclusion in speech technology applications. We publicly release our unique parallel dataset (speech-to-text) on Nigerian Pidgin, as well as the model weights on Hugging Face. Our code will be made available to foster future research from the community.
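The 59.84% figure is consistent with a relative reduction between the two reported word error rates, (73.7 - 29.6) / 73.7, rather than a difference in percentage points (which would be 44.1). A short check, using only the numbers quoted in the abstract; the helper name `relative_reduction` is ours:

```python
def relative_reduction(baseline, improved):
    """Relative error reduction in percent, measured against the baseline."""
    return (baseline - improved) / baseline * 100.0


zero_shot_wer = 73.7   # XLSR-English baseline, zero-shot (from the abstract)
fine_tuned_wer = 29.6  # Wav2Vec2 XLSR-53 after adaptation (from the abstract)

reduction = relative_reduction(zero_shot_wer, fine_tuned_wer)
absolute_drop = zero_shot_wer - fine_tuned_wer
print(f"relative reduction: {reduction:.2f}%")
print(f"absolute drop: {absolute_drop:.1f} percentage points")
```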