SE · AI · Dec 11, 2025

Cross-modal Retrieval Models for Stripped Binary Analysis

Guoqiang Chen, Lingyun Ying, Ziyang Song, Daguang Liu, Qiang Wang, Zhiqi Wang, Li Hu, Shaoyin Cheng, Weiming Zhang, Nenghai Yu

arXiv:2512.10393v25.91 citations

Originality: Incremental advance

AI Analysis

This paper addresses the challenge of identifying semantically relevant functions in stripped binaries, a domain-specific advance aimed at software security professionals.

The paper tackles the problem of retrieving binary code via natural language queries for software security tasks, introducing BinSeek, a two-stage cross-modal retrieval framework that achieved state-of-the-art performance, surpassing same-scale models by 31.42% in Rec@3 and 27.17% in MRR@3.
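For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how Rec@k and MRR@k are commonly computed. It assumes exactly one relevant function per query; the function names `recall_at_k` and `mrr_at_k` and the toy identifiers are illustrative, and the paper's exact definitions may differ.

```python
from typing import Sequence


def recall_at_k(ranked: Sequence[Sequence[str]], relevant: Sequence[str], k: int = 3) -> float:
    """Fraction of queries whose relevant item appears in the top k results.

    Assumption: each query has exactly one relevant item.
    """
    hits = sum(1 for results, gold in zip(ranked, relevant) if gold in results[:k])
    return hits / len(relevant)


def mrr_at_k(ranked: Sequence[Sequence[str]], relevant: Sequence[str], k: int = 3) -> float:
    """Mean reciprocal rank, counting only hits within the top k results."""
    total = 0.0
    for results, gold in zip(ranked, relevant):
        top = list(results[:k])
        if gold in top:
            # Ranks are 1-based, so the first position contributes 1.0.
            total += 1.0 / (top.index(gold) + 1)
    return total / len(relevant)


# Toy data: three queries, each with a ranked list of hypothetical function IDs.
ranked_lists = [
    ["f1", "f2", "f3"],        # relevant item at rank 1
    ["f9", "f4", "f5"],        # relevant item at rank 2
    ["f7", "f8", "f6", "f0"],  # relevant item at rank 4, outside the top 3
]
gold_ids = ["f1", "f4", "f0"]

rec3 = recall_at_k(ranked_lists, gold_ids, k=3)
mrr3 = mrr_at_k(ranked_lists, gold_ids, k=3)
```

Rec@3 rewards any hit in the top three, while MRR@3 additionally rewards placing the hit higher, which is why the two numbers can move differently between models.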

Retrieving binary code via natural language queries is a pivotal capability for downstream tasks in the software security domain, such as vulnerability detection and malware analysis. However, it is challenging to identify binary functions semantically relevant to the user query from thousands of candidates, as the absence of symbolic information distinguishes this task from source code retrieval. In this paper, we introduce BinSeek, a two-stage cross-modal retrieval framework for stripped binary code analysis. It consists of two models: BinSeek-Embedding is trained on a large-scale dataset to learn the semantic relevance between binary code and natural language descriptions, and BinSeek-Reranker learns to carefully judge the relevance of candidate code to the description with context augmentation. To this end, we built an LLM-based data synthesis pipeline to automate training data construction, and derived a domain benchmark for future research. Our evaluation results show that BinSeek achieves state-of-the-art performance, surpassing same-scale models by 31.42% in Rec@3 and 27.17% in MRR@3, and also outperforming advanced general-purpose models that have 16 times more parameters.
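The retrieve-then-rerank pattern described in the abstract can be sketched generically as follows. This is not BinSeek's implementation: `toy_embed` and `toy_rerank_score` are hypothetical stand-ins for the embedding and reranker models, the corpus is made up, and treating "context" as a short note about a function's callers is an assumption made only for illustration.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def toy_embed(text: str) -> Counter:
    """Stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def toy_rerank_score(query: str, code: str, context: str) -> float:
    """Stand-in for a reranker: share of query tokens found in code plus context."""
    q = set(query.lower().split())
    doc = set((code + " " + context).lower().split())
    return len(q & doc) / len(q) if q else 0.0


def two_stage_search(
    query: str,
    corpus: Dict[str, Tuple[str, str]],  # function ID -> (code text, context text)
    recall_k: int = 10,
    final_k: int = 3,
) -> List[str]:
    qv = toy_embed(query)
    # Stage 1: cheap similarity over every candidate, keep the top recall_k.
    stage1 = sorted(corpus, key=lambda fid: cosine(qv, toy_embed(corpus[fid][0])), reverse=True)[:recall_k]
    # Stage 2: costlier scoring that also sees the context, applied to survivors only.
    stage2 = sorted(stage1, key=lambda fid: toy_rerank_score(query, *corpus[fid]), reverse=True)
    return stage2[:final_k]


# Hypothetical corpus of stripped functions (names and contents are invented).
corpus = {
    "sub_401000": ("call recv ; xor buffer key ; call send", "caller sub_402000 decrypt network packet"),
    "sub_401100": ("call fopen ; call fread ; call fclose", "caller sub_402100 read config file"),
    "sub_401200": ("call malloc ; call memcpy ; call free", "caller sub_402200 copy buffer"),
}
results = two_stage_search("decrypt network buffer", corpus, recall_k=2, final_k=1)
```

The design point is cost: the first stage must score thousands of candidates and so has to be cheap, while the second stage can afford a more careful judgment because it only sees the short list.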

