HCNov 26, 2025
STAR: Smartphone-analogous Typing in Augmented RealityTaejun Kim, Amy Karlson, Aakar Gupta et al.
While text entry is an essential and frequent task in Augmented Reality (AR) applications, devising an efficient and easy-to-use text entry method for AR remains an open challenge. This research presents STAR, a smartphone-analogous AR text entry technique that leverages a user's familiarity with smartphone two-thumb typing. With STAR, a user performs thumb typing on a virtual QWERTY keyboard that is overlain on the skin of their hands. During an evaluation study of STAR, participants achieved a mean typing speed of 21.9 WPM (i.e., 56% of their smartphone typing speed), and a mean error rate of 0.3% after 30 minutes of practice. We further analyze the major factors implicated in the performance gap between STAR and smartphone typing, and discuss ways this gap could be narrowed.
HCMar 20
HiFiGaze: Improving Eye Tracking Accuracy Using Screen Content KnowledgeTaejun Kim, Vimal Mollyn, Riku Arakawa et al.
We present a new and accurate approach for gaze estimation on consumer computing devices. We take advantage of continued strides in the quality of user-facing cameras found in e.g., smartphones, laptops, and desktops - 4K or greater in high-end devices - such that it is now possible to capture the 2D reflection of a device's screen in the user's eyes. This alone is insufficient for accurate gaze tracking due to the near-infinite variety of screen content. Crucially, however, the device knows what is being displayed on its own screen - in this work, we show this information allows for robust segmentation of the reflection, the location and size of which encodes the user's screen-relative gaze target. We explore several strategies to leverage this useful signal, quantifying performance in a user study. Our best performing model reduces mean tracking error by ~8% compared to a baseline appearance-based model. A supplemental study reveals an additional 10-20% improvement if the gaze-tracking camera is located at the bottom of the device.
ASAug 24, 2020
A Computational Analysis of Real-World DJ Mixes using Mix-To-Track Subsequence AlignmentTaejun Kim, Minsuk Choi, Evan Sacks et al.
A DJ mix is a sequence of music tracks concatenated seamlessly, typically rendered for audiences in a live setting by a DJ on stage. As a DJ mix is produced in a studio or the live version is recorded for music streaming services, computational methods to analyze DJ mixes, for example, extracting track information or understanding DJ techniques, have drawn research interests. Many of previous works are, however, limited to identifying individual tracks in a mix or segmenting it, and the sizes of the datasets are usually small. In this paper, we provide an in-depth analysis of DJ music by aligning a mix to its original music tracks. We set up the subsequence alignment such that the audio features are less sensitive to the tempo or key change of the original track in a mix. This approach provides temporally tight mix-to-track matching from which we can obtain cue-points, transition length, mix segmentation, and musical changes in DJ performance. Using 1,557 mixes from 1001Tracklists including 13,728 tracks and 20,765 transitions, we conduct the proposed analysis and show a wide range of statistics, which may elucidate the creative process of DJ music making.
ASOct 30, 2019
Temporal Feedback Convolutional Recurrent Neural Networks for Speech Command RecognitionTaejun Kim, Juhan Nam
End-to-end learning models using raw waveforms as input have shown superior performances in many audio recognition tasks. However, most model architectures are based on convolutional neural networks (CNN) which were mainly developed for visual recognition tasks. In this paper, we propose an extension of squeeze-and-excitation networks (SENets) which adds temporal feedback control from the top-layer features to channel-wise feature activations in lower layers using a recurrent module. This is analogous to the adaptive gain control mechanism of outer hair-cell in the human auditory system. We apply the proposed model to speech command recognition and show that it slightly outperforms the SENets and other CNN-based models. We also investigate the details of the performance improvement by conducting failure analysis and visualizing the channel-wise feature scaling induced by the temporal feedback.
SDDec 4, 2017
Raw Waveform-based Audio Classification Using Sample-level CNN ArchitecturesJongpil Lee, Taejun Kim, Jiyoung Park et al.
Music, speech, and acoustic scene sound are often handled separately in the audio domain because of their different signal characteristics. However, as the image domain grows rapidly by versatile image classification models, it is necessary to study extensible classification models in the audio domain as well. In this study, we approach this problem using two types of sample-level deep convolutional neural networks that take raw waveforms as input and uses filters with small granularity. One is a basic model that consists of convolution and pooling layers. The other is an improved model that additionally has residual connections, squeeze-and-excitation modules and multi-level concatenation. We show that the sample-level models reach state-of-the-art performance levels for the three different categories of sound. Also, we visualize the filters along layers and compare the characteristics of learned filters.
SDOct 28, 2017
Sample-level CNN Architectures for Music Auto-tagging Using Raw WaveformsTaejun Kim, Jongpil Lee, Juhan Nam
Recent work has shown that the end-to-end approach using convolutional neural network (CNN) is effective in various types of machine learning tasks. For audio signals, the approach takes raw waveforms as input using an 1-D convolution layer. In this paper, we improve the 1-D CNN architecture for music auto-tagging by adopting building blocks from state-of-the-art image classification models, ResNets and SENets, and adding multi-level feature aggregation to it. We compare different combinations of the modules in building CNN architectures. The results show that they achieve significant improvements over previous state-of-the-art models on the MagnaTagATune dataset and comparable results on Million Song Dataset. Furthermore, we analyze and visualize our model to show how the 1-D CNN operates.