SE LG Dec 15, 2024

A Progressive Transformer for Unifying Binary Code Embedding and Knowledge Transfer

Hanxiao Lu, Hongyu Cai, Yiming Liang, Antonio Bianchi, Z. Berkay Celik

arXiv:2412.11177v2 1.82 citations h-index: 3 SANER

Originality: Incremental advance

AI Analysis

This work addresses binary analysis for security and software engineering by offering a more efficient and robust approach, though it is incremental as it builds on existing transformer-based methods.

The paper tackles the problem of binary code analysis by introducing ProTST, a transformer-based method that uses a progressive teacher-student training paradigm to improve embedding quality, resulting in an average validation score improvement of 14.8% over traditional two-stage training and 10.7% over multimodal frameworks across seven tasks.

Language model approaches have recently been integrated into binary analysis tasks, such as function similarity detection and function signature recovery. These models typically employ a two-stage training process: pre-training via Masked Language Modeling (MLM) on machine code, followed by fine-tuning for specific tasks. While MLM helps the model understand binary code structures, it ignores essential code characteristics, including control and data flow, which negatively affects model generalization. Recent work leverages domain-specific features (e.g., control flow graphs and dynamic execution traces) in transformer-based approaches to improve binary code semantic understanding. However, this approach involves complex feature engineering, a cumbersome and time-consuming process that can introduce predictive uncertainty when dealing with stripped or obfuscated code, leading to a performance drop. In this paper, we introduce ProTST, a novel transformer-based methodology for binary code embedding. ProTST employs a hierarchical training process based on a unique tree-like structure, where knowledge progressively flows from fundamental tasks at the root to more specialized tasks at the leaves. This progressive teacher-student paradigm allows the model to build upon previously learned knowledge, resulting in high-quality embeddings that can be effectively leveraged for diverse downstream binary analysis tasks. The effectiveness of ProTST is evaluated on seven binary analysis tasks, and the results show that ProTST yields an average validation score (F1, MRR, and Recall@1) improvement of 14.8% compared to traditional two-stage training and of 10.7% compared to multimodal two-stage frameworks.
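The tree-structured, progressive teacher-student flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract gives no architecture or training details. The names `TaskNode`, `progressive_train`, and `nudge` are hypothetical, the placement of MLM at the root with two leaf tasks is an assumption, and the "training" step is a stand-in that moves a weight vector toward a per-task target. The only idea it demonstrates is that each child task starts from the weights its parent has already learned.

```python
import copy
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

Weights = List[float]  # toy stand-in for model parameters


@dataclass
class TaskNode:
    """One task in the (hypothetical) training tree."""
    name: str
    train: Callable[[Weights], Weights]  # hypothetical per-task training routine
    children: List["TaskNode"] = field(default_factory=list)


def progressive_train(
    node: TaskNode,
    init: Weights,
    out: Optional[Dict[str, Weights]] = None,
) -> Dict[str, Weights]:
    """Train root-to-leaf: each student starts from a copy of its teacher's weights."""
    if out is None:
        out = {}
    student = node.train(copy.deepcopy(init))  # inherit, then specialize
    out[node.name] = student
    for child in node.children:
        progressive_train(child, student, out)  # this node is now the teacher
    return out


def nudge(target: float, lr: float = 0.5) -> Callable[[Weights], Weights]:
    """Toy 'training': move every weight a fraction lr toward target."""
    def train(w: Weights) -> Weights:
        return [x + lr * (target - x) for x in w]
    return train


# Assumed layout: a fundamental task at the root, specialized tasks at the leaves.
tree = TaskNode("mlm", nudge(1.0), [
    TaskNode("signature_recovery", nudge(2.0)),
    TaskNode("function_similarity", nudge(3.0)),
])

trained = progressive_train(tree, [0.0, 0.0])
for name, weights in trained.items():
    print(name, weights)
```

In this sketch the leaf tasks never see the initial weights directly; they only see what the root has already learned, which is the contrast with training every task from the same pre-trained checkpoint.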
