Juhyeong Kim

AI
h-index4
5papers
6citations
Novelty47%
AI Score50

5 Papers

58.0CVMar 18Code
UniSAFE: A Comprehensive Benchmark for Safety Evaluation of Unified Multimodal Models

Segyu Lee, Boryeong Cho, Hojung Jung et al.

Unified Multimodal Models (UMMs) offer powerful cross-modality capabilities but introduce new safety risks not observed in single-task models. Despite their emergence, existing safety benchmarks remain fragmented across tasks and modalities, limiting the comprehensive evaluation of complex system-level vulnerabilities. To address this gap, we introduce UniSAFE, the first comprehensive benchmark for system-level safety evaluation of UMMs across 7 I/O modality combinations, spanning conventional tasks and novel multimodal-context image generation settings. UniSAFE is built with a shared-target design that projects common risk scenarios across task-specific I/O configurations, enabling controlled cross-task comparisons of safety failures. Comprising 6,802 curated instances, we use UniSAFE to evaluate 15 state-of-the-art UMMs, both proprietary and open-source. Our results reveal critical vulnerabilities across current UMMs, including elevated safety violations in multi-image composition and multi-turn settings, with image-output tasks consistently more vulnerable than text-output tasks. These findings highlight the need for stronger system-level safety alignment for UMMs. Our code and data are publicly available at https://github.com/segyulee/UniSAFE

PMApr 19, 2025Code
LLM-Enhanced Black-Litterman Portfolio Optimization

Youngbin Lee, Yejin Kim, Juhyeong Kim et al.

The Black-Litterman model addresses the sensitivity issues of tra- ditional mean-variance optimization by incorporating investor views, but systematically generating these views remains a key challenge. This study proposes and validates a systematic frame- work that translates return forecasts and predictive uncertainty from Large Language Models (LLMs) into the core inputs for the Black-Litterman model: investor views and their confidence lev- els. Through a backtest on S&P 500 constituents, we demonstrate that portfolios driven by top-performing LLMs significantly out- perform traditional baselines in both absolute and risk-adjusted terms. Crucially, our analysis reveals that each LLM exhibits a dis- tinct and consistent investment style which is the primary driver of performance. We found that the selection of an LLM is therefore not a search for a single best forecaster, but a strategic choice of an investment style whose success is contingent on its alignment with the prevailing market regime. The source code and data are available at https://github.com/youngandbin/LLM-BLM.

AIOct 2, 2025Code
GuruAgents: Emulating Wise Investors with Prompt-Guided LLM Agents

Yejin Kim, Youngbin Lee, Juhyeong Kim et al.

This study demonstrates that GuruAgents, prompt-guided AI agents, can systematically operationalize the strategies of legendary investment gurus. We develop five distinct GuruAgents, each designed to emulate an iconic investor, by encoding their distinct philosophies into LLM prompts that integrate financial tools and a deterministic reasoning pipeline. In a backtest on NASDAQ-100 constituents from Q4 2023 to Q2 2025, the GuruAgents exhibit unique behaviors driven by their prompted personas. The Buffett GuruAgent achieves the highest performance, delivering a 42.2\% CAGR that significantly outperforms benchmarks, while other agents show varied results. These findings confirm that prompt engineering can successfully translate the qualitative philosophies of investment gurus into reproducible, quantitative strategies, highlighting a novel direction for automated systematic investing. The source code and data are available at https://github.com/yejining99/GuruAgents.

LGMar 16, 2025Code
Decision by Supervised Learning with Deep Ensembles: A Practical Framework for Robust Portfolio Optimization

Juhyeong Kim, Sungyoon Choi, Youngbin Lee et al.

We propose Decision by Supervised Learning (DSL), a practical framework for robust portfolio optimization. DSL reframes portfolio construction as a supervised learning problem: models are trained to predict optimal portfolio weights, using cross-entropy loss and portfolios constructed by maximizing the Sharpe or Sortino ratio. To further enhance stability and reliability, DSL employs Deep Ensemble methods, substantially reducing variance in portfolio allocations. Through comprehensive backtesting across diverse market universes and neural architectures, shows superior performance compared to both traditional strategies and leading machine learning-based methods, including Prediction-Focused Learning and End-to-End Learning. We show that increasing the ensemble size leads to higher median returns and more stable risk-adjusted performance. The code is available at https://github.com/DSLwDE/DSLwDE.

IROct 15, 2025
FinAI Data Assistant: LLM-based Financial Database Query Processing with the OpenAI Function Calling API

Juhyeong Kim, Yejin Kim, Youngbin Lee et al.

We present FinAI Data Assistant, a practical approach for natural-language querying over financial databases that combines large language models (LLMs) with the OpenAI Function Calling API. Rather than synthesizing complete SQL via text-to-SQL, our system routes user requests to a small library of vetted, parameterized queries, trading generative flexibility for reliability, low latency, and cost efficiency. We empirically study three questions: (RQ1) whether LLMs alone can reliably recall or extrapolate time-dependent financial data without external retrieval; (RQ2) how well LLMs map company names to stock ticker symbols; and (RQ3) whether function calling outperforms text-to-SQL for end-to-end database query processing. Across controlled experiments on prices and fundamentals, LLM-only predictions exhibit non-negligible error and show look-ahead bias primarily for stock prices relative to model knowledge cutoffs. Ticker-mapping accuracy is near-perfect for NASDAQ-100 constituents and high for S\&P~500 firms. Finally, FinAI Data Assistant achieves lower latency and cost and higher reliability than a text-to-SQL baseline on our task suite. We discuss design trade-offs, limitations, and avenues for deployment.