CV May 13, 2022
FontNet: Closing the gap to font designer performance in font synthesis
Ammar Ul Hassan Muhammad, Jaeyoung Choi
Font synthesis has been a very active topic in recent years because manual font design requires domain expertise and is labor-intensive and time-consuming. While remarkably successful, existing font synthesis methods have major shortcomings: they require finetuning for each unobserved font style with a large set of reference images, and recent few-shot font synthesis methods are either designed for specific language systems or operate on low-resolution images, which limits their use. In this paper, we tackle the font synthesis problem by learning the font style in the embedding space. To this end, we propose a model, called FontNet, that simultaneously learns to separate font styles in an embedding space where distances directly correspond to a measure of font similarity, and to translate input images into a given observed or unobserved font style. Additionally, we design a network architecture and training procedure that can be adopted for any language system and can produce high-resolution font images. Thanks to this approach, our proposed method outperforms existing state-of-the-art font generation methods in both qualitative and quantitative experiments.
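As a rough illustration of an embedding space in which distance tracks font similarity, the sketch below picks the closest known style for a query embedding. It is not FontNet's implementation: the function names, the 2-D vectors, and the style labels are all hypothetical, and a trained model would produce the embeddings.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_style(query, style_embeddings):
    """Return the style name whose embedding lies closest to the query."""
    return min(style_embeddings,
               key=lambda name: euclidean(query, style_embeddings[name]))

# Hypothetical 2-D style embeddings (assumed values, for illustration only).
styles = {"serif": [0.0, 1.0], "sans": [1.0, 0.0], "script": [5.0, 5.0]}

print(nearest_style([0.9, 0.2], styles))  # closest style under this toy metric
```

With these toy values the query lands nearest to the "sans" embedding, since a smaller distance is read as higher style similarity.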
MM Dec 28, 2017 (Code)
Field Studies with Multimedia Big Data: Opportunities and Challenges (Extended Version)
Mario Michael Krell, Julia Bernd, Yifan Li et al.
Social multimedia users are increasingly sharing all kinds of data about the world. They do this for their own reasons, not to provide data for field studies, but the trend presents a great opportunity for scientists. The Yahoo Flickr Creative Commons 100 Million (YFCC100M) dataset comprises 99 million images and nearly 800 thousand videos from Flickr, all shared under Creative Commons licenses. To enable scientists to leverage these media records for field studies, we propose a new framework that extracts targeted subcorpora from the YFCC100M, in a format usable by researchers who are not experts in big data retrieval and processing. This paper discusses a number of examples from the literature, as well as some entirely new ideas, of natural and social science field studies that could be piloted, supplemented, replicated, or conducted using YFCC100M data. These examples illustrate the need for a general new open-source framework for Multimedia Big Data Field Studies. There is currently a gap between the separate aspects of what multimedia researchers have shown to be possible with consumer-produced big data and the follow-through of creating a comprehensive field study framework that supports scientists across other disciplines. To bridge this gap, we must meet several challenges. For example, the framework must handle unlabeled and noisily labeled data to produce a filtered dataset for a scientist, who naturally wants it to be both as large and as clean as possible. This requires an iterative approach that provides access to statistical summaries and refines the search by constructing new classifiers. The first phase of our framework is available as Multimedia Commons Search, an intuitive interface that enables complex search queries at a large scale...
CV May 8
EggHand: A Multimodal Foundation Model for Egocentric Hand Pose Forecasting
Jaeyoung Choi, Hyeondong Kim, Yujin Kim et al.
Forecasting future 3D hand pose sequences from egocentric video is essential for understanding human intention and enabling embodied applications such as AR/VR assistance and human-robot interaction. However, this task remains highly challenging because egocentric hand motion is driven by complex human intent, exhibits highly dexterous articulations, and is observed under drastic viewpoint shifts induced by ego-motion. In this work, we introduce EggHand, a foundation-model-based framework for egocentric hand pose forecasting that unifies multimodal semantic reasoning with dynamic motion modeling. Our approach couples an action decoder from a Vision-Language-Action (VLA) model, which captures the structured temporal dynamics of hand motion, with an egocentric video-text encoder that provides viewpoint-aware contextual information learned from large-scale first-person video. Together, these components overcome the brittleness of generic visual encoders under ego-motion and enable joint reasoning over motion, context, and high-level intent, without relying on body pose or external tracking. Experiments on the EgoExo4D dataset show that EggHand sets a new state of the art in forecasting accuracy, remains robust under severe ego-motion, and further enables controllable prediction via language-based task prompts. Project page: https://jyoun9.github.io/EggHand
LG Sep 9, 2025
Hybrid GCN-GRU Model for Anomaly Detection in Cryptocurrency Transactions
Gyuyeon Na, Minjung Park, Hyeonjeong Cha et al.
Blockchain transaction networks are complex, with evolving temporal patterns and inter-node relationships. To detect illicit activities, we propose a hybrid GCN-GRU model that captures both structural and sequential features. Using real Bitcoin transaction data (2020-2024), our model achieved 0.9470 Accuracy and 0.9807 AUC-ROC, outperforming all baselines.
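For readers unfamiliar with the two reported metrics, the following minimal sketch shows how accuracy and AUC-ROC can be computed from labels and scores. It does not reproduce the paper's model or data: the labels, scores, and the 0.5 decision threshold are made-up assumptions, and the AUC is computed with the pairwise (rank-based) definition.

```python
def accuracy(labels, preds):
    """Fraction of predictions that match the ground-truth labels."""
    return sum(l == p for l, p in zip(labels, preds)) / len(labels)

def auc_roc(labels, scores):
    """AUC-ROC as the probability that a random positive outscores a random
    negative, counting ties as half a win."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical toy data: 1 = illicit, 0 = licit (assumed values).
labels = [1, 0, 1, 0]
scores = [0.9, 0.4, 0.35, 0.1]
preds = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

print(accuracy(labels, preds), auc_roc(labels, scores))
```

Accuracy depends on the chosen threshold, whereas AUC-ROC summarizes ranking quality across all thresholds, which is one reason both are commonly reported together.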
CV May 19, 2025
A Study on the Refining Handwritten Font by Mixing Font Styles
Avinash Kumar, Kyeolhee Kang, Ammar ul Hassan et al.
Handwritten fonts have a distinct expressive character, but they are often difficult to read due to unclear or inconsistent handwriting. FontFusionGAN (FFGAN) is a novel method for improving handwritten fonts by combining them with printed fonts. Our method uses a generative adversarial network (GAN) to generate fonts that mix the desirable features of handwritten and printed fonts. Trained on a dataset of handwritten and printed fonts, the GAN can generate legible and visually appealing font images. We apply our method to a dataset of handwritten fonts and demonstrate that it significantly enhances the readability of the original fonts while preserving their unique aesthetic. Our method has the potential to improve the readability of handwritten fonts, which would be helpful for a variety of applications including document creation, letter writing, and assisting individuals with reading and writing difficulties. In addition to addressing the difficulties of font creation for languages with complex character sets, our method is applicable to other text-image-related tasks, such as font attribute control and multilingual font style transfer.
CV Apr 30, 2025
Text-Conditioned Diffusion Model for High-Fidelity Korean Font Generation
Abdul Sami, Avinash Kumar, Irfanullah Memon et al.
Automatic font generation (AFG) is the process of creating a new font using only a few example style images. Generating fonts for complex languages like Korean and Chinese, particularly in handwritten styles, presents significant challenges. Traditional AFG methods based on generative adversarial networks (GANs) and variational autoencoders (VAEs) are usually unstable during training and often suffer from mode collapse. They also struggle to capture fine details within font images. To address these problems, we present a diffusion-based AFG method that generates high-quality, diverse Korean font images using only a single reference image, focusing on handwritten and printed styles. Our approach refines noisy images incrementally, ensuring stable training and visually appealing results. A key innovation is our text encoder, which processes phonetic representations to generate accurate and contextually correct characters, even for unseen characters. We use a pre-trained style encoder from DG FONT to encode the style images effectively and accurately. To further enhance the generation quality, we use a perceptual loss that guides the model to focus on the global style of generated images. Experimental results on over 2000 Korean characters demonstrate that our model consistently generates accurate and detailed font images and outperforms benchmark methods, making it a reliable tool for generating authentic Korean fonts across different styles.
IR Nov 4, 2021
Sequential Movie Genre Prediction using Average Transition Probability with Clustering
Jihyeon Kim, Jinkyung Kim, Jaeyoung Choi
In recent movie recommendation research, predicting a user's sequential behavior and suggesting the next movie to watch is one of the most important issues. However, capturing such sequential behavior is not easy because each user's short-term and long-term behavior must be taken into account. For this reason, many studies report that recommending a specific movie does not perform very well in sequential recommendation. In this paper, we propose a cluster-based method for grouping users with similar movie purchase patterns, and an algorithm that predicts the movie genre, rather than the movie itself, considering users' short-term and long-term behaviors. The genre prediction does not recommend a specific movie; it predicts the genre of the next movie to watch from each user's genre preferences, based on the genres assigned to each movie. This provides appropriate guidance for recommending movies, including their genre, to users who tend to prefer a specific genre. In particular, users with similar genre preferences are organized into clusters to recommend genres, and in clusters without a clear tendency, genres that are not needed for recommendation are trimmed before prediction to improve performance. We evaluate our method on well-known movie datasets and show qualitatively that it captures personalized dynamics and is able to make meaningful recommendations.
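One plausible reading of "average transition probability" within a cluster can be sketched as follows. This is an assumption for illustration, not the paper's algorithm: each user's first-order genre transition probabilities are estimated from their viewing sequence, averaged over the users in a cluster who have watched the current genre, and the highest-scoring genre is predicted. All function names and the sample sequences are hypothetical.

```python
from collections import Counter, defaultdict

def transition_probs(sequence):
    """Per-user first-order transition probabilities between genres."""
    counts = defaultdict(Counter)
    for cur, nxt in zip(sequence, sequence[1:]):
        counts[cur][nxt] += 1
    return {cur: {g: c / sum(nxts.values()) for g, c in nxts.items()}
            for cur, nxts in counts.items()}

def average_transition(sequences, current):
    """Average the transition row for `current` over users who have one."""
    rows = [transition_probs(s).get(current) for s in sequences]
    rows = [r for r in rows if r]
    avg = Counter()
    for row in rows:
        for genre, p in row.items():
            avg[genre] += p / len(rows)
    return dict(avg)

def predict_next_genre(sequences, current):
    """Most probable next genre for the cluster, or None if unseen."""
    avg = average_transition(sequences, current)
    return max(avg, key=avg.get) if avg else None

# Hypothetical cluster of two users' genre sequences (assumed data).
cluster = [["action", "comedy", "action", "drama"],
           ["action", "drama", "drama"]]

print(predict_next_genre(cluster, "action"))
```

Averaging per-user probabilities, rather than pooling raw counts, keeps heavy viewers from dominating the cluster estimate; whether the paper makes the same choice is not stated in the abstract.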
MM Oct 19, 2020
DIME: An Online Tool for the Visual Comparison of Cross-Modal Retrieval Models
Tony Zhao, Jaeyoung Choi, Gerald Friedland
Cross-modal retrieval relies on accurate models to retrieve relevant results for queries across modalities such as image, text, and video. In this paper, we build upon previous work by tackling the difficulty of quickly evaluating models both quantitatively and qualitatively. We present DIME (Dataset, Index, Model, Embedding), a modality-agnostic tool that handles multimodal datasets, trained models, and data preprocessors to support straightforward model comparison through a web-browser graphical user interface. DIME inherently supports building modality-agnostic queryable indexes and extracting relevant feature embeddings, and thus effectively doubles as an efficient cross-modal tool to explore and search through datasets.
CR Nov 15, 2018
Cybercasing 2.0: You Get What You Pay For
Jaeyoung Choi, Istemi Ekin Akkus, Serge Egelman et al.
Under U.S. law, marketing databases exist under almost no legal restrictions concerning accuracy, access, or confidentiality. We explore the possible (mis)use of these databases in a criminal context by conducting two experiments. First, we show how this data can be used for "cybercasing" by resolving the physical addresses of individuals who are likely to be on vacation. Second, we evaluate the utility of a "bride to be" mailing list augmented with data obtained by searching both Facebook and a bridal registry aggregator. We conclude that marketing data is not necessarily harmless and can represent a fruitful target for criminal misuse.
CR Aug 22, 2018
The Accuracy of the Demographic Inferences Shown on Google's Ad Settings
Michael Carl Tschantz, Serge Egelman, Jaeyoung Choi et al.
Google's Ad Settings shows the gender and age that Google has inferred about a web user. We compare the inferred values to the self-reported values of 501 survey participants. We find that Google often does not show an inference, but when it does, it is typically correct. We explore which usage characteristics, such as using privacy-enhancing technologies, are associated with Google's accuracy, but find no significant results.
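The distinction between "how often an inference is shown" and "how often a shown inference is correct" can be made concrete with a small sketch. The data below is invented for illustration and is unrelated to the study's 501 participants; `None` stands for a participant for whom no inference was shown, and the function name is hypothetical.

```python
def coverage_and_accuracy(inferred, reported):
    """Return (share of users with an inference shown,
    accuracy among those shown, or None if nothing was shown)."""
    shown = [(i, r) for i, r in zip(inferred, reported) if i is not None]
    coverage = len(shown) / len(inferred)
    acc = sum(i == r for i, r in shown) / len(shown) if shown else None
    return coverage, acc

# Hypothetical gender inferences vs. self-reports (assumed values).
inferred = [None, "F", "M", None, "F"]
reported = ["M", "F", "M", "F", "M"]

coverage, acc = coverage_and_accuracy(inferred, reported)
print(coverage, acc)
```

Reporting the two numbers separately avoids conflating missing inferences with wrong ones.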
SD Jul 15, 2016
DCAR: A Discriminative and Compact Audio Representation to Improve Event Detection
Liping Jing, Bo Liu, Jaeyoung Choi et al.
This paper presents a novel two-phase method for audio representation, Discriminative and Compact Audio Representation (DCAR), and evaluates its performance at detecting events in consumer-produced videos. In the first phase of DCAR, each audio track is modeled using a Gaussian mixture model (GMM) that includes several components to capture the variability within that track. The second phase takes into account both global structure and local structure. In this phase, the components are rendered more discriminative and compact by formulating an optimization problem on Grassmannian manifolds, which we found represents the structure of audio effectively. Our experiments used the YLI-MED dataset (an open TRECVID-style video corpus based on YFCC100M), which includes ten events. The results show that the proposed DCAR representation consistently outperforms state-of-the-art audio representations. DCAR's advantage over i-vector, mv-vector, and GMM representations is significant for both easier and harder discrimination tasks. We discuss how these performance differences across easy and hard cases follow from how each type of model leverages (or doesn't leverage) the intrinsic structure of the data. Furthermore, DCAR shows a particularly notable accuracy advantage on events where humans have more difficulty classifying the videos, i.e., events with lower mean annotator confidence.