CLMay 28
FoRA: Fisher-orthogonal Rank Adaptation for Parameter-Efficient Fine-TuningJuneyoung Park, Seongbae Lee, Han-Sang Lee et al.
Parameter-efficient fine-tuning(PEFT) has largely focused on LoRA and its accuracy-oriented variants, leaving the original goal of reducing trainable parameters has receivedcomparatively little attention. We introduce FoRA, which revisits this goal by reducing the number of adapted layers rather than adapter rank. FoRA selects task-informative layers via a single-pass diagonal Fisher score (under 1% of training cost) and trains the LoRA down-projection at selected layers on the Stiefel manifold, preserving column orthonormality and effective rank. FoRA consistently outperforms LoRA and DoRA at half their parameter budget, and falls within 0.7-0.8 accuracy points of AdaLoRA at one-quarter its parameter count, across five LLaMA-family backbones. Cross-architecture experiments on twelve backbones from the LLaMA, Qwen3, and Gemma families confirm consistent gains from 270M to 32B parameters. The two components combine super-additively: Fisher selection alone matches rank reduction at the same budget, while the Stiefel constraint provides the decisive additional gain.
LGMar 16Code
Cross-RAG: Zero-Shot Retrieval-Augmented Time Series Forecasting via Cross-AttentionSeunghan Lee, Jaehoon Lee, Jun Seo et al.
Recent advances in time series foundation models (TSFMs) demonstrate strong expressive capacity through large-scale pretraining across diverse time series domains. Zero-shot time series forecasting with TSFMs, however, exhibits limited generalization to unseen datasets, which retrieval-augmented forecasting addresses by leveraging an external knowledge base. Existing approaches rely on a fixed number of retrieved samples that may introduce irrelevant information. To this end, we propose Cross-RAG, a zero-shot retrieval-augmented forecasting framework that selectively attends to query-relevant retrieved samples. Cross-RAG models input-level relevance between the query and retrieved samples via query-retrieval cross-attention, while jointly incorporating information from the query and retrieved samples. Extensive experiments demonstrate that Cross-RAG consistently improves zero-shot forecasting performance across various TSFMs and RAG methods, and additional analyses confirm its effectiveness across diverse retrieval scenarios. Code is available at https://github.com/seunghan96/cross-rag/.
LGMay 6Code
Mitigating Label Shift in Tabular In-Context Learning via Test-Time Posterior AdjustmentSeunghan Lee, Jaehoon Lee, Jun Seo et al.
TabPFN has recently gained attention as a foundation model for tabular datasets, achieving strong performance by leveraging in-context learning on synthetic data. However, we find that TabPFN is vulnerable to label shift, often overfitting to the majority class in the training dataset. To address this limitation, we propose DistPFN, the first test-time posterior adjustment method designed for tabular foundation models. DistPFN rescales predicted class probabilities by downweighting the influence of the training prior (i.e., the class distribution of the context) and emphasizing the contribution of the model's predicted posterior, without architectural modification or additional training. We further introduce DistPFN-T, which incorporates temperature scaling to adaptively control the adjustment strength based on the discrepancy between prior and posterior. We evaluate our methods on over 250 OpenML datasets, demonstrating substantial improvements for various TabPFN-based models in classification tasks under label shift, while maintaining strong performance in standard settings without label shift. Code is available at this repository: https://github.com/seunghan96/DistPFN.
AIMay 8Code
AdaTKG: Adaptive Memory for Temporal Knowledge Graph ReasoningSeunghan Lee, Jun Seo, Jaehoon Lee et al.
Temporal knowledge graphs (TKGs) represent time-stamped relational facts and support a wide range of reasoning tasks over evolving events. However, existing methods produce entity representations that are static at the entity level, in that each representation is a function of learned parameters only and retains no trace of the interactions in which the entity has participated. In this paper, we depart from this static view and propose that each entity be modeled as an adaptive process whose representation is refined every time the entity participates in a fact. To this end, we propose AdaTKG, which maintains a per-entity memory that is updated with every observed interaction, with the memory accumulating online and predictions improving as more interactions arrive. Specifically, we instantiate the memory update as a learnable exponential moving average governed by a single shared scalar instead of using learnable parameters for each entity, enabling AdaTKG to handle entities unseen during training. Extensive experiments confirm consistent gains over TKG baselines, demonstrating the effectiveness of adaptive memory. Code is publicly available at: https://github.com/seunghan96/AdaTKG.
CVApr 14
Motif-Video 2B: Technical ReportJunghwan Lim, Wai Ting Cheung, Minsu Ha et al.
Training strong video generation models usually requires massive datasets, large parameter counts, and substantial compute. In this work, we ask whether strong text-to-video quality is possible at a much smaller budget: fewer than 10M clips and less than 100,000 H200 GPU hours. Our core claim is that part of the answer lies in how model capacity is organized, not only in how much of it is used. In video generation, prompt alignment, temporal consistency, and fine-detail recovery can interfere with one another when they are handled through the same pathway. Motif-Video 2B addresses this by separating these roles architecturally, rather than relying on scale alone. The model combines two key ideas. First, Shared Cross-Attention strengthens text control when video token sequences become long. Second, a three-part backbone separates early fusion, joint representation learning, and detail refinement. To make this design effective under a limited compute budget, we pair it with an efficient training recipe based on dynamic token routing and early-phase feature alignment to a frozen pretrained video encoder. Our analysis shows that later blocks develop clearer cross-frame attention structure than standard single-stream baselines. On VBench, Motif-Video~2B reaches 83.76\%, surpassing Wan2.1 14B while using 7$\times$ fewer parameters and substantially less training data. These results suggest that careful architectural specialization, combined with an efficiency-oriented training recipe, can narrow or exceed the quality gap typically associated with much larger video models.
CVMay 21
WorldKV: Efficient World Memory with World Retrieval and CompressionJung Yi, Minjae Kim, Paul Hyunbin Cho et al.
Autoregressive video diffusion models have enabled real-time, action-conditioned world generation. However, sustaining a persistent world, where revisiting a previously seen viewpoint yields consistent content, remains an open problem. Full KV-cache attention preserves this consistency but breaks real-time constraints: memory footprint and attention cost grow linearly with rollout length. Sliding window inference restores throughput but discards long-term consistency. We propose WorldKV, a training-free framework with two components: World Retrieval and World Compression. World Retrieval stores evicted KV-cache chunks in GPU/CPU memory and selectively retrieves scene-relevant chunks via camera/ action correspondence, inserting them back into the native attention window without re-encoding. World Compression prunes redundant tokens within each chunk via key-key similarity to an anchor frame, halving per-chunk storage to fit 2x more history under a fixed budget. On Matrix-Game-2.0 and LingBot- World-Fast, WorldKV matches or exceeds full-KV memory fidelity at roughly 2x the throughput, and is competitive with memory-trained baselines without any fine-tuning. Project Page: https://cvlab-kaist.github.io/WorldKV/
LGMar 23Code
Rethinking Multimodal Fusion for Time Series: Auxiliary Modalities Need Constrained FusionSeunghan Lee, Jun Seo, Jaehoon Lee et al.
Recent advances in multimodal learning have motivated the integration of auxiliary modalities such as text or vision into time series (TS) forecasting. However, most existing methods provide limited gains, often improving performance only in specific datasets or relying on architecture-specific designs that limit generalization. In this paper, we show that multimodal models with naive fusion strategies (e.g., simple addition or concatenation) often underperform unimodal TS models, which we attribute to the uncontrolled integration of auxiliary modalities which may introduce irrelevant information. Motivated by this observation, we explore various constrained fusion methods designed to control such integration and find that they consistently outperform naive fusion methods. Furthermore, we propose Controlled Fusion Adapter (CFA), a simple plug-in method that enables controlled cross-modal interactions without modifying the TS backbone, integrating only relevant textual information aligned with TS dynamics. CFA employs low-rank adapters to filter irrelevant textual information before fusing it into temporal representations. We conduct over 20K experiments across various datasets and TS/text models, demonstrating the effectiveness of the constrained fusion methods including CFA. Code is publicly available at: https://github.com/seunghan96/cfa/.
AIMay 5Code
FinSTaR: Towards Financial Reasoning with Time Series Reasoning ModelsSeunghan Lee, Jun Seo, Jaehoon Lee et al.
Time series (TS) reasoning models (TSRMs) have shown promising capabilities in general domains, yet they consistently fail on financial domain, which exhibit unique characteristics. We propose a general 2x2 capability taxonomy for TSRMs by crossing 1) single-entity vs. multi-entity analysis with 2) assessment of the current state vs. prediction of future behavior. We instantiate this taxonomy in the financial domain -- where the distinction between deterministic assessment and stochastic prediction is particularly critical -- as ten financial reasoning tasks, forming the FinTSR-Bench benchmark based on S&P stocks. To this end, we propose FinSTaR (Financial Time Series Thinking and Reasoning), trained on FinTSR-Bench with distinct chain-of-thought (CoT) strategies tailored to each category. For assessment, which is deterministic (i.e., computable from observable data), we employ Compute-in-CoT, a programmatic CoT that enables models to derive answers directly from raw prices. For prediction, which is inherently stochastic (i.e., subject to unobservable factors), we adopt Scenario-Aware CoT, which generates diverse scenarios before making a judgment, mirroring how financial analysts reason under uncertainty. The proposed method achieves 78.9% average accuracy on FinTSR-Bench, substantially outperforming LLM and TSRM baselines. Furthermore, we show that the four capability categories are complementary and mutually reinforcing through joint training, and that Scenario-Aware CoT consistently improves prediction accuracy over standard CoT. Code is publicly available at: https://github.com/seunghan96/FinSTaR.
CEMar 15
From Text to Alpha: Can LLMs Track Evolving Signals in Corporate Disclosures?Chanyeol Choi, Yoon Kim, Yu Yu et al.
Natural language processing (NLP) has been widely used in quantitative finance, but traditional methods often struggle to capture rich narratives in corporate disclosures, leaving potentially informative signals under-explored. Large language models (LLMs) offer a promising alternative due to their ability to extract nuanced semantics. In this paper, we ask whether semantic signals extracted by LLMs from corporate disclosures predict alpha, defined as abnormal returns beyond broad market movements and common risk factors. We introduce a simple framework, LLM as extractor, embedding as ruler, which extracts context-aware, metric-focused textual spans and quantifies semantic changes across consecutive disclosure periods using embedding-based similarity. This allows us to measure the degree of metric shifting -- how much firms move away from previously emphasized metrics, referred as moving targets. In experiments with portfolio and cross-sectional regression tests against a recent NER-based baseline, our method achieves more than twice the risk-adjusted alpha and shows significantly stronger predictive power. Qualitative analysis suggests that these gains stem from preserving contextual qualifiers and filtering out non-metric terms that keyword-based approaches often miss.
CVMay 13, 2022
StyLandGAN: A StyleGAN based Landscape Image Synthesis using Depth-mapGunhee Lee, Jonghwa Yim, Chanran Kim et al.
Despite recent success in conditional image synthesis, prevalent input conditions such as semantics and edges are not clear enough to express `Linear (Ridges)' and `Planar (Scale)' representations. To address this problem, we propose a novel framework StyLandGAN, which synthesizes desired landscape images using a depth map which has higher expressive power. Our StyleLandGAN is extended from the unconditional generation model to accept input conditions. We also propose a '2-phase inference' pipeline which generates diverse depth maps and shifts local parts so that it can easily reflect user's intend. As a comparison, we modified the existing semantic image synthesis models to accept a depth map as well. Experimental results show that our method is superior to existing methods in quality, diversity, and depth-accuracy.
CVJul 25, 2019Code
U-GAT-IT: Unsupervised Generative Attentional Networks with Adaptive Layer-Instance Normalization for Image-to-Image TranslationJunho Kim, Minjae Kim, Hyeonwoo Kang et al.
We propose a novel method for unsupervised image-to-image translation, which incorporates a new attention module and a new learnable normalization function in an end-to-end manner. The attention module guides our model to focus on more important regions distinguishing between source and target domains based on the attention map obtained by the auxiliary classifier. Unlike previous attention-based method which cannot handle the geometric changes between domains, our model can translate both images requiring holistic changes and images requiring large shape changes. Moreover, our new AdaLIN (Adaptive Layer-Instance Normalization) function helps our attention-guided model to flexibly control the amount of change in shape and texture by learned parameters depending on datasets. Experimental results show the superiority of the proposed method compared to the existing state-of-the-art models with a fixed network architecture and hyper-parameters. Our code and datasets are available at https://github.com/taki0112/UGATIT or https://github.com/znxlwm/UGATIT-pytorch.
AIMar 3
FinTexTS: Financial Text-Paired Time-Series Dataset via Semantic-Based and Multi-Level PairingJaehoon Lee, Suhwan Park, Tae Yoon Lim et al.
The financial domain involves a variety of important time-series problems. Recently, time-series analysis methods that jointly leverage textual and numerical information have gained increasing attention. Accordingly, numerous efforts have been made to construct text-paired time-series datasets in the financial domain. However, financial markets are characterized by complex interdependencies, in which a company's stock price is influenced not only by company-specific events but also by events in other companies and broader macroeconomic factors. Existing approaches that pair text with financial time-series data based on simple keyword matching often fail to capture such complex relationships. To address this limitation, we propose a semantic-based and multi-level pairing framework. Specifically, we extract company-specific context for the target company from SEC filings and apply an embedding-based matching mechanism to retrieve semantically relevant news articles based on this context. Furthermore, we classify news articles into four levels (macro-level, sector-level, related company-level, and target-company level) using large language models (LLMs), enabling multi-level pairing of news articles with the target company. Applying this framework to publicly-available news datasets, we construct \textbf{FinTexTS}, a new large-scale text-paired stock price dataset. Experimental results on \textbf{FinTexTS} demonstrate the effectiveness of our semantic-based and multi-level pairing strategy in stock price forecasting. In addition to publicly-available news underlying \textbf{FinTexTS}, we show that applying our method to proprietary yet carefully curated news sources leads to higher-quality paired data and improved stock price forecasting performance.
CVOct 11, 2022
Style-Guided Inference of Transformer for High-resolution Image SynthesisJonghwa Yim, Minjae Kim
Transformer is eminently suitable for auto-regressive image synthesis which predicts discrete value from the past values recursively to make up full image. Especially, combined with vector quantised latent representation, the state-of-the-art auto-regressive transformer displays realistic high-resolution images. However, sampling the latent code from discrete probability distribution makes the output unpredictable. Therefore, it requires to generate lots of diverse samples to acquire desired outputs. To alleviate the process of generating lots of samples repetitively, in this article, we propose to take a desired output, a style image, as an additional condition without re-training the transformer. To this end, our method transfers the style to a probability constraint to re-balance the prior, thereby specifying the target distribution instead of the original prior. Thus, generated samples from the re-balanced prior have similar styles to reference style. In practice, we can choose either an image or a category of images as an additional condition. In our qualitative assessment, we show that styles of majority of outputs are similar to the input style.
CVDec 5, 2024
A Noise is Worth Diffusion GuidanceDonghoon Ahn, Jiwon Kang, Sanghyun Lee et al.
Diffusion models excel in generating high-quality images. However, current diffusion models struggle to produce reliable images without guidance methods, such as classifier-free guidance (CFG). Are guidance methods truly necessary? Observing that noise obtained via diffusion inversion can reconstruct high-quality images without guidance, we focus on the initial noise of the denoising pipeline. By mapping Gaussian noise to `guidance-free noise', we uncover that small low-magnitude low-frequency components significantly enhance the denoising process, removing the need for guidance and thus improving both inference throughput and memory. Expanding on this, we propose \ours, a novel method that replaces guidance methods with a single refinement of the initial noise. This refined noise enables high-quality image generation without guidance, within the same diffusion pipeline. Our noise-refining model leverages efficient noise-space learning, achieving rapid convergence and strong performance with just 50K text-image pairs. We validate its effectiveness across diverse metrics and analyze how refined noise can eliminate the need for guidance. See our project page: https://cvlab-kaist.github.io/NoiseRefine/.
CLApr 2, 2024
HyperCLOVA X Technical ReportKang Min Yoo, Jaegeun Han, Sookyo In et al.
We introduce HyperCLOVA X, a family of large language models (LLMs) tailored to the Korean language and culture, along with competitive capabilities in English, math, and coding. HyperCLOVA X was trained on a balanced mix of Korean, English, and code data, followed by instruction-tuning with high-quality human-annotated datasets while abiding by strict safety guidelines reflecting our commitment to responsible AI. The model is evaluated across various benchmarks, including comprehensive reasoning, knowledge, commonsense, factuality, coding, math, chatting, instruction-following, and harmlessness, in both Korean and English. HyperCLOVA X exhibits strong reasoning capabilities in Korean backed by a deep understanding of the language and cultural nuances. Further analysis of the inherent bilingual nature and its extension to multilingualism highlights the model's cross-lingual proficiency and strong generalization ability to untargeted languages, including machine translation between several language pairs and cross-lingual inference tasks. We believe that HyperCLOVA X can provide helpful guidance for regions or countries in developing their sovereign LLMs.
CLNov 7, 2025
Motif 2 12.7B technical reportJunghwan Lim, Sungmin Lee, Dongseok Kim et al.
We introduce Motif-2-12.7B, a new open-weight foundation model that pushes the efficiency frontier of large language models by combining architectural innovation with system-level optimization. Designed for scalable language understanding and robust instruction generalization under constrained compute budgets, Motif-2-12.7B builds upon Motif-2.6B with the integration of Grouped Differential Attention (GDA), which improves representational efficiency by disentangling signal and noise-control attention pathways. The model is pre-trained on 5.5 trillion tokens spanning diverse linguistic, mathematical, scientific, and programming domains using a curriculum-driven data scheduler that gradually changes the data composition ratio. The training system leverages the MuonClip optimizer alongside custom high-performance kernels, including fused PolyNorm activations and the Parallel Muon algorithm, yielding significant throughput and memory efficiency gains in large-scale distributed environments. Post-training employs a three-stage supervised fine-tuning pipeline that successively enhances general instruction adherence, compositional understanding, and linguistic precision. Motif-2-12.7B demonstrates competitive performance across diverse benchmarks, showing that thoughtful architectural scaling and optimized training design can rival the capabilities of much larger models.
AIMay 25, 2025
Structuring the Unstructured: A Multi-Agent System for Extracting and Querying Financial KPIs and GuidanceChanyeol Choi, Alejandro Lopez-Lira, Yongjae Lee et al.
Extracting structured and quantitative insights from unstructured financial filings is essential in investment research, yet remains time-consuming and resource-intensive. Conventional approaches in practice rely heavily on labor-intensive manual processes, limiting scalability and delaying the research workflow. In this paper, we propose an efficient and scalable method for accurately extracting quantitative insights from unstructured financial documents, leveraging a multi-agent system composed of large language models. Our proposed multi-agent system consists of two specialized agents: the \emph{Extraction Agent} and the \emph{Text-to-SQL Agent}. The \textit{Extraction Agent} automatically identifies key performance indicators from unstructured financial text, standardizes their formats, and verifies their accuracy. On the other hand, the \textit{Text-to-SQL Agent} generates executable SQL statements from natural language queries, allowing users to access structured data accurately without requiring familiarity with the database schema. Through experiments, we demonstrate that our proposed system effectively transforms unstructured text into structured data accurately and enables precise retrieval of key information. First, we demonstrate that our system achieves approximately 95\% accuracy in transforming financial filings into structured data, matching the performance level typically attained by human annotators. Second, in a human evaluation of the retrieval task -- where natural language queries are used to search information from structured data -- 91\% of the responses were rated as correct by human evaluators. In both evaluations, our system generalizes well across financial document types, consistently delivering reliable performance.
PMAug 23, 2025
THEME: Enhancing Thematic Investing with Semantic Stock Representations and Temporal DynamicsHoyoung Lee, Wonbin Ahn, Suhwan Park et al.
Thematic investing, which aims to construct portfolios aligned with structural trends, remains a challenging endeavor due to overlapping sector boundaries and evolving market dynamics. A promising direction is to build semantic representations of investment themes from textual data. However, despite their power, general-purpose LLM embedding models are not well-suited to capture the nuanced characteristics of financial assets, since the semantic representation of investment assets may differ fundamentally from that of general financial text. To address this, we introduce THEME, a framework that fine-tunes embeddings using hierarchical contrastive learning. THEME aligns themes and their constituent stocks using their hierarchical relationship, and subsequently refines these embeddings by incorporating stock returns. This process yields representations effective for retrieving thematically aligned assets with strong return potential. Empirical results demonstrate that THEME excels in two key areas. For thematic asset retrieval, it significantly outperforms leading large language models. Furthermore, its constructed portfolios demonstrate compelling performance. By jointly modeling thematic relationships from text and market dynamics from returns, THEME generates stock embeddings specifically tailored for a wide range of practical investment applications.
IRAug 7, 2025
FinAgentBench: A Benchmark Dataset for Agentic Retrieval in Financial Question AnsweringChanyeol Choi, Jihoon Kwon, Alejandro Lopez-Lira et al.
Accurate information retrieval (IR) is critical in the financial domain, where investors must identify relevant information from large collections of documents. Traditional IR methods -- whether sparse or dense -- often fall short in retrieval accuracy, as it requires not only capturing semantic similarity but also performing fine-grained reasoning over document structure and domain-specific knowledge. Recent advances in large language models (LLMs) have opened up new opportunities for retrieval with multi-step reasoning, where the model ranks passages through iterative reasoning about which information is most relevant to a given query. However, there exists no benchmark to evaluate such capabilities in the financial domain. To address this gap, we introduce FinAgentBench, the first large-scale benchmark for evaluating retrieval with multi-step reasoning in finance -- a setting we term agentic retrieval. The benchmark consists of 26K expert-annotated examples on S&P-500 listed firms and assesses whether LLM agents can (1) identify the most relevant document type among candidates, and (2) pinpoint the key passage within the selected document. Our evaluation framework explicitly separates these two reasoning steps to address context limitations. This design enables to provide a quantitative basis for understanding retrieval-centric LLM behavior in finance. We evaluate a suite of state-of-the-art models and further demonstrated how targeted fine-tuning can significantly improve agentic retrieval performance. Our benchmark provides a foundation for studying retrieval-centric LLM behavior in complex, domain-specific tasks for finance.
LGMay 17, 2025
HARDMath2: A Benchmark for Applied Mathematics Built by Students as Part of a Graduate ClassJames V. Roggeveen, Erik Y. Wang, Will Flintoft et al.
Large language models (LLMs) have shown remarkable progress in mathematical problem-solving, but evaluation has largely focused on problems that have exact analytical solutions or involve formal proofs, often overlooking approximation-based problems ubiquitous in applied science and engineering. To fill this gap, we build on prior work and present HARDMath2, a dataset of 211 original problems covering the core topics in an introductory graduate applied math class, including boundary-layer analysis, WKB methods, asymptotic solutions of nonlinear partial differential equations, and the asymptotics of oscillatory integrals. This dataset was designed and verified by the students and instructors of a core graduate applied mathematics course at Harvard. We build the dataset through a novel collaborative environment that challenges students to write and refine difficult problems consistent with the class syllabus, peer-validate solutions, test different models, and automatically check LLM-generated solutions against their own answers and numerical ground truths. Evaluation results show that leading frontier models still struggle with many of the problems in the dataset, highlighting a gap in the mathematical reasoning skills of current LLMs. Importantly, students identified strategies to create increasingly difficult problems by interacting with the models and exploiting common failure modes. This back-and-forth with the models not only resulted in a richer and more challenging benchmark but also led to qualitative improvements in the students' understanding of the course material, which is increasingly important as we enter an age where state-of-the-art language models can solve many challenging problems across a wide domain of fields.
LGDec 11, 2025
Adaptive Information Routing for Multimodal Time Series ForecastingJun Seo, Hyeokjun Choe, Seohui Bae et al.
Time series forecasting is a critical task for artificial intelligence with numerous real-world applications. Traditional approaches primarily rely on historical time series data to predict the future values. However, in practical scenarios, this is often insufficient for accurate predictions due to the limited information available. To address this challenge, multimodal time series forecasting methods which incorporate additional data modalities, mainly text data, alongside time series data have been explored. In this work, we introduce the Adaptive Information Routing (AIR) framework, a novel approach for multimodal time series forecasting. Unlike existing methods that treat text data on par with time series data as interchangeable auxiliary features for forecasting, AIR leverages text information to dynamically guide the time series model by controlling how and to what extent multivariate time series information should be combined. We also present a text-refinement pipeline that employs a large language model to convert raw text data into a form suitable for multimodal forecasting, and we introduce a benchmark that facilitates multimodal forecasting experiments based on this pipeline. Experiment results with the real world market data such as crude oil price and exchange rates demonstrate that AIR effectively modulates the behavior of the time series model using textual inputs, significantly enhancing forecasting accuracy in various time series forecasting tasks.
AIDec 11, 2025
Motif-2-12.7B-Reasoning: A Practitioner's Guide to RL Training RecipesJunghwan Lim, Sungmin Lee, Dongseok Kim et al.
We introduce Motif-2-12.7B-Reasoning, a 12.7B parameter language model designed to bridge the gap between open-weight systems and proprietary frontier models in complex reasoning and long-context understanding. Addressing the common challenges of model collapse and training instability in reasoning adaptation, we propose a comprehensive, reproducible training recipe spanning system, data, and algorithmic optimizations. Our approach combines memory-efficient infrastructure for 64K-token contexts using hybrid parallelism and kernel-level optimizations with a two-stage Supervised Fine-Tuning (SFT) curriculum that mitigates distribution mismatch through verified, aligned synthetic data. Furthermore, we detail a robust Reinforcement Learning Fine-Tuning (RLFT) pipeline that stabilizes training via difficulty-aware data filtering and mixed-policy trajectory reuse. Empirical results demonstrate that Motif-2-12.7B-Reasoning achieves performance comparable to models with significantly larger parameter counts across mathematics, coding, and agentic benchmarks, offering the community a competitive open model and a practical blueprint for scaling reasoning capabilities under realistic compute constraints.
LGNov 24, 2025
VLM in a flash: I/O-Efficient Sparsification of Vision-Language Model via Neuron ChunkingKichang Yang, Seonjun Kim, Minjae Kim et al.
Edge deployment of large Vision-Language Models (VLMs) increasingly relies on flash-based weight offloading, where activation sparsification is used to reduce I/O overhead. However, conventional sparsification remains model-centric, selecting neurons solely by activation magnitude and neglecting how access patterns influence flash performance. We present Neuron Chunking, an I/O-efficient sparsification strategy that operates on chunks (i.e., groups of contiguous neurons in memory) and couples neuron importance with storage access cost. The method models I/O latency through a lightweight abstraction of access contiguity and selects chunks with high utility, defined as neuron importance normalized by estimated latency. By aligning sparsification decisions with the underlying storage behavior, Neuron Chunking improves I/O efficiency by up to 4.65x and 5.76x on Jetson Orin Nano and Jetson AGX Orin, respectively.
LGOct 11, 2025
FOSSIL: Regret-Minimizing Curriculum Learning for Metadata-Free and Low-Data Mpox DiagnosisSahng-Min Han, Minjae Kim, Jinho Cha et al.
Deep learning in small and imbalanced biomedical datasets remains fundamentally constrained by unstable optimization and poor generalization. We present the first biomedical implementation of FOSSIL (Flexible Optimization via Sample-Sensitive Importance Learning), a regret-minimizing weighting framework that adaptively balances training emphasis according to sample difficulty. Using softmax-based uncertainty as a continuous measure of difficulty, we construct a four-stage curriculum (Easy-Very Hard) and integrate FOSSIL into both convolutional and transformer-based architectures for Mpox skin lesion diagnosis. Across all settings, FOSSIL substantially improves discrimination (AUC = 0.9573), calibration (ECE = 0.053), and robustness under real-world perturbations, outperforming conventional baselines without metadata, manual curation, or synthetic augmentation. The results position FOSSIL as a generalizable, data-efficient, and interpretable framework for difficulty-aware learning in medical imaging under data scarcity.
LGAug 2, 2025
Motif 2.6B Technical ReportJunghwan Lim, Sungmin Lee, Dongseok Kim et al.
Recent advancements in Large Language Models (LLMs) have revolutionized artificial intelligence, yet developing an effective foundational LLM that balances high performance with computational efficiency remains challenging, especially for emerging research groups. To address this gap, we introduce Motif-2.6B, a 2.6-billion-parameter foundation model designed to democratize advanced LLM capabilities. Motif-2.6B incorporates several innovative architectural enhancements, including Differential Attention and PolyNorm activation functions, which improve long-context comprehension, reduce hallucination, and enhance in-context learning capabilities. We rigorously tested multiple novel architectural components through extensive experimentation to determine the optimal architecture for Motif-2.6B. Comprehensive evaluations demonstrate that Motif-2.6B consistently meets or exceeds the performance of similarly sized state-of-the-art models across diverse benchmarks, showcasing its effectiveness, scalability, and real-world applicability. Through detailed experiments and tailored techniques, Motif-2.6B significantly advances the landscape of efficient, scalable, and powerful foundational LLMs, offering valuable insights and a robust foundation for future research and deployment.
CVJun 12, 2025
Where and How to Perturb: On the Design of Perturbation Guidance in Diffusion and Flow ModelsDonghoon Ahn, Jiwon Kang, Sanghyun Lee et al.
Recent guidance methods in diffusion models steer reverse sampling by perturbing the model to construct an implicit weak model and guide generation away from it. Among these approaches, attention perturbation has demonstrated strong empirical performance in unconditional scenarios where classifier-free guidance is not applicable. However, existing attention perturbation methods lack principled approaches for determining where perturbations should be applied, particularly in Diffusion Transformer (DiT) architectures where quality-relevant computations are distributed across layers. In this paper, we investigate the granularity of attention perturbations, ranging from the layer level down to individual attention heads, and discover that specific heads govern distinct visual concepts such as structure, style, and texture quality. Building on this insight, we propose "HeadHunter", a systematic framework for iteratively selecting attention heads that align with user-centric objectives, enabling fine-grained control over generation quality and visual attributes. In addition, we introduce SoftPAG, which linearly interpolates each selected head's attention map toward an identity matrix, providing a continuous knob to tune perturbation strength and suppress artifacts. Our approach not only mitigates the oversmoothing issues of existing layer-level perturbation but also enables targeted manipulation of specific visual styles through compositional head selection. We validate our method on modern large-scale DiT-based text-to-image models including Stable Diffusion 3 and FLUX.1, demonstrating superior performance in both general quality enhancement and style-specific guidance. Our work provides the first head-level analysis of attention perturbation in diffusion models, uncovering interpretable specialization within attention layers and enabling practical design of effective perturbation strategies.
COMar 29, 2016
Spectral M-estimation with Applications to Hidden Markov ModelsDustin Tran, Minjae Kim, Finale Doshi-Velez
Method of moment estimators exhibit appealing statistical properties, such as asymptotic unbiasedness, for nonconvex problems. However, they typically require a large number of samples and are extremely sensitive to model misspecification. In this paper, we apply the framework of M-estimation to develop both a generalized method of moments procedure and a principled method for regularization. Our proposed M-estimator obtains optimal sample efficiency rates (in the class of moment-based estimators) and the same well-known rates on prediction accuracy as other spectral estimators. It also makes it straightforward to incorporate regularization into the sample moment conditions. We demonstrate empirically the gains in sample efficiency from our approach on hidden Markov models.