LGPL, Jul 6, 2021

Shell Language Processing: Unix command parsing for Machine Learning

arXiv:2107.02438v3, 14 citations
AI Analysis

This work addresses the need for specialized tokenization and encoding of shell commands in machine learning applications, particularly in security domains. It is incremental in that it builds upon existing information and communications technology (ICT) tokenization techniques.

The authors tackled the problem of parsing Unix and Linux shell commands by developing a Shell Language Preprocessing (SLP) library, improving the F1 score on a security classification task from 0.392 to 0.874.

In this article, we present a Shell Language Preprocessing (SLP) library, which implements tokenization and encoding directed at parsing Unix and Linux shell commands. We describe the rationale behind the need for a new approach, with specific examples of when conventional Natural Language Processing (NLP) pipelines fail. Furthermore, we evaluate our methodology on a security classification task against widely accepted information and communications technology (ICT) tokenization techniques and achieve a significant improvement in F1 score, from 0.392 to 0.874.
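To illustrate why a generic word-level tokenizer can lose information on shell input, here is a minimal sketch that contrasts a regex word splitter with Python's standard-library `shlex` lexer. This is not the SLP library's API or the authors' method. The example command and the helper names `naive_tokens` and `shell_tokens` are hypothetical, chosen only for illustration, and the sketch assumes Python 3.8 or later.

```python
import re
import shlex

# Hypothetical example command, not taken from the paper.
command = "grep -r 'failed login' /var/log | awk '{print $1}' > out.txt"


def naive_tokens(text):
    """Generic word-level tokenization: keep runs of word characters only."""
    return re.findall(r"\w+", text)


def shell_tokens(text):
    """Shell-aware tokenization using the standard-library shlex lexer.

    punctuation_chars=True makes operators such as | and > separate tokens;
    whitespace_split=True keeps flags, paths and quoted strings whole.
    """
    lexer = shlex.shlex(text, posix=True, punctuation_chars=True)
    lexer.whitespace_split = True
    return list(lexer)


if __name__ == "__main__":
    print("naive:", naive_tokens(command))
    print("shell:", shell_tokens(command))
```

In this sketch the regex splitter discards the pipe and redirection operators, strips the dash from `-r`, and breaks the path and the quoted argument into fragments, while the `shlex` lexer keeps each of them as a single token. A dedicated library such as SLP would presumably go further than this (for instance, in how tokens are encoded into features), which this example does not attempt to reproduce.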

Code Implementations: 1 repo
