Ahmed Mostafa

h-index: 1
2 papers

2 Papers

AI · Nov 5, 2025
How Different Tokenization Algorithms Impact LLMs and Transformer Models for Binary Code Analysis

Ahmed Mostafa, Raisul Arefin Nahid, Samuel Mulder

Tokenization is fundamental in assembly code analysis, impacting both intrinsic characteristics, such as vocabulary size and semantic coverage, and extrinsic performance in downstream tasks. Despite its significance, tokenization in the context of assembly code remains an underexplored area. This study aims to address this gap by evaluating the intrinsic properties of Natural Language Processing (NLP) tokenization models and parameter choices, such as vocabulary size. We explore preprocessing customization options and pre-tokenization rules tailored to the unique characteristics of assembly code. Additionally, we assess their impact on downstream tasks like function signature prediction, a critical problem in binary code analysis. To this end, we conduct a thorough study of various tokenization models, systematically analyzing their efficiency in encoding assembly instructions and capturing semantic nuances. Through intrinsic evaluations, we compare tokenizers based on tokenization efficiency, vocabulary compression, and representational fidelity for assembly code. Using state-of-the-art pre-trained models such as the decoder-only Large Language Model (LLM) Llama 3.2, the encoder-only transformer BERT, and the encoder-decoder model BART, we evaluate the effectiveness of these tokenizers across multiple performance metrics. Preliminary findings indicate that tokenizer choice significantly influences downstream performance, and that intrinsic metrics only partially predict extrinsic evaluation outcomes. These results reveal complex trade-offs between intrinsic tokenizer properties and their utility in practical assembly code tasks. Ultimately, this study provides valuable insights into optimizing tokenization models for low-level code analysis, contributing to the robustness and scalability of Natural Language Model (NLM)-based binary analysis workflows.
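As a rough illustration of what a pre-tokenization rule and an intrinsic metric might look like for assembly code, the sketch below splits instructions into mnemonics, registers, immediates, and punctuation, then reports vocabulary size and average tokens per instruction. The regular expression, the function names `pre_tokenize` and `intrinsic_stats`, and the sample instructions are assumptions made for this example. They are not taken from the paper, which evaluates trained NLP tokenization models rather than a fixed regex.

```python
import re
from collections import Counter

# Hypothetical pre-tokenization rule (an assumption, not the paper's):
# hex immediates, decimal immediates, identifiers (mnemonics/registers),
# and single punctuation symbols each become one token.
TOKEN_RE = re.compile(r"0x[0-9a-fA-F]+|\d+|[A-Za-z_.][A-Za-z0-9_.]*|[\[\]\+\-\*,:]")


def pre_tokenize(instruction: str) -> list[str]:
    """Split one assembly instruction into coarse tokens."""
    return TOKEN_RE.findall(instruction)


def intrinsic_stats(instructions: list[str]) -> dict[str, float]:
    """Compute two simple intrinsic measures over a list of instructions."""
    if not instructions:
        raise ValueError("need at least one instruction")
    tokens = [tok for ins in instructions for tok in pre_tokenize(ins)]
    vocab = Counter(tokens)
    return {
        "vocab_size": len(vocab),  # number of distinct tokens
        "tokens_per_instruction": len(tokens) / len(instructions),
    }


# Illustrative x86-64 style instructions (made up for this example).
SAMPLE = ["mov rax, [rbp-0x8]", "add rax, 0x10", "ret"]

if __name__ == "__main__":
    print(pre_tokenize(SAMPLE[0]))
    print(intrinsic_stats(SAMPLE))
```

Fewer tokens per instruction suggests a more compact encoding, but as the abstract notes, such intrinsic measures only partially predict downstream performance.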

HC · Nov 4, 2019
Modeling an Augmented Reality Game Environment to Enhance Behavior of ADHD Patients

Saad Alqithami, Musaad Alzahrani, Abdulkareem Alzahrani et al.

The paper generically models an augmented reality game-based environment to project the gamification of an online cognitive behavioral therapist that performs instant measurements for patients with a predefined Attention Deficit Hyperactivity Disorder (ADHD). ADHD is one of the most common neurodevelopmental disorders, in which patients have difficulties related to inattention, hyperactivity, and impulsivity. These patients are in need of psychological therapy; cognitive behavioral therapy, a firmly established treatment, is used to help improve the way they think and behave. A major limitation of traditional cognitive behavioral therapies is that therapists may have difficulty optimizing patients' neuropsychological stimulus following a specified treatment plan, i.e., therapists struggle to draw clear images when stimulating patients' mindset to the point where it should be. Other limitations recognized here include the availability, accessibility, and level of experience of the therapists. Therefore, the paper presents a gamification model, which we term "AR-Therapist," that takes advantage of augmented reality developments to engage patients in both real and virtual game-based environments. The model provides on-time measurements of patients' progress throughout the treatment sessions, which, as a result, overcomes the limitations observed in traditional cognitive behavioral therapies.