Haojun Zhao

CL
h-index18
3papers
258citations
Novelty50%
AI Score42

3 Papers

CLFeb 4, 2025
SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

Loubna Ben Allal, Anton Lozhkov, Elie Bakouch et al. · huggingface

While large language models have facilitated breakthroughs in many applications of artificial intelligence, their inherent largeness makes them computationally expensive and challenging to deploy in resource-constrained settings. In this paper, we document the development of SmolLM2, a state-of-the-art "small" (1.7 billion parameter) language model (LM). To attain strong performance, we overtrain SmolLM2 on ~11 trillion tokens of data using a multi-stage training process that mixes web text with specialized math, code, and instruction-following data. We additionally introduce new specialized datasets (FineMath, Stack-Edu, and SmolTalk) at stages where we found existing datasets to be problematically small or low-quality. To inform our design decisions, we perform both small-scale ablations as well as a manual refinement process that updates the dataset mixing rates at each stage based on the performance at the previous stage. Ultimately, we demonstrate that SmolLM2 outperforms other recent small LMs including Qwen2.5-1.5B and Llama3.2-1B. To facilitate future research on LM development as well as applications of small LMs, we release both SmolLM2 as well as all of the datasets we prepared in the course of this project.

CVNov 24, 2025Code
Vidi2.5: Large Multimodal Models for Video Understanding and Creation

Vidi Team, Chia-Wen Kuo, Chuang Huang et al.

Video has emerged as the primary medium for communication and creativity on the Internet, driving strong demand for scalable, high-quality video production. Vidi models continue to evolve toward next-generation video creation and have achieved state-of-the-art performance in multimodal temporal retrieval (TR). In its second release, Vidi2 advances video understanding with fine-grained spatio-temporal grounding (STG) and extends its capability to video question answering (Video QA), enabling comprehensive multimodal reasoning. Given a text query, Vidi2 can identify not only the corresponding timestamps but also the bounding boxes of target objects within the output time ranges. To enable comprehensive evaluation of STG, we introduce a new benchmark, VUE-STG, which offers critical improvements over existing STG datasets. In addition, we upgrade the previous VUE-TR benchmark to VUE-TR-V2, achieving a more balanced duration and query distribution. Remarkably, the Vidi2 model substantially outperforms leading proprietary systems, such as Gemini 3 Pro Preview and GPT-5, on both VUE-TR-V2 and VUE-STG, while achieving competitive results with popular open-source models with similar scale on video QA benchmarks. The latest Vidi2.5 offers significantly stronger STG capability and slightly better TR and Video QA performance over Vidi2. This update also introduces a Vidi2.5-Think model to handle plot understanding with complex plot reasoning. To comprehensively evaluate the performance of plot understanding, we propose VUE-PLOT benchmark with two tracks, Character and Reasoning. Notably, Vidi2.5-Think outperforms Gemini 3 Pro Preview on fine-grained character understanding with comparable performance on complex plot reasoning. Furthermore, we demonstrate the effectiveness of Vidi2.5 on a challenging real-world application, video editing planning.

DCMay 28, 2015
Optimized Password Recovery for Encrypted RAR on GPUs

Xiaojing An, Haojun Zhao, Lulu Ding et al.

RAR uses classic symmetric encryption algorithm SHA-1 hashing and AES algorithm for encryption, and the only method of password recovery is brute force, which is very time-consuming. In this paper, we present an approach using GPUs to speed up the password recovery process. However, because the major calculation and time-consuming part, SHA-1 hashing, is hard to be parallelized, so this paper adopts coarse granularity parallel. That is, one GPU thread is responsible for the validation of one password. We mainly use three optimization methods to optimize this parallel version: asynchronous parallel between CPU and GPU, redundant calculations and conditional statements reduction, and the usage of registers optimization. Experiment result shows that the final version reaches 43~57 times speedup on an AMD FirePro W8000 GPU, compared to a well-optimized serial version on Intel Core i5 CPU.