SE, LG · Apr 8, 2020

Dependency-Based Neural Representations for Classifying Lines of Programs

Shashank Srikant, Nicolas Lesimple, Una-May O'Reilly

arXiv:2004.10166v1 · 7.32 citations

Originality: Incremental advance

AI Analysis

This paper addresses automated vulnerability detection in software, a problem relevant to developers and security analysts. It is an incremental improvement that applies deep learning to model program dependencies.

The paper classifies individual lines of program code as vulnerable or not. Its neural architecture, Vulcan, captures control and data dependencies through AST paths and recursive embeddings, and it compares favorably with a state-of-the-art classifier.

We investigate the problem of using machine learning to classify a line of a program as containing a vulnerability or not. Such a line-level classification task calls for a program representation that goes beyond reasoning from the tokens present in the line. We seek a distributed representation in a latent feature space that can capture the control and data dependencies of tokens appearing on a line of a program, while also ensuring that lines of similar meaning have similar features. We present a neural architecture, Vulcan, that meets both of these requirements. It extracts contextual information about the tokens in a line and feeds it, as Abstract Syntax Tree (AST) paths, to a bi-directional LSTM with an attention mechanism. It concurrently represents the meanings of the tokens in a line by recursively embedding the lines where they were most recently defined. In our experiments, Vulcan compares favorably with a state-of-the-art classifier that requires significant preprocessing of programs, suggesting the utility of using deep learning to model program dependence information.
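The idea of following each token on a line back to the line where it was most recently defined can be illustrated with a small sketch. This is not the authors' implementation: it uses Python's standard `ast` module as a stand-in for whatever parser the paper relies on, it approximates "most recently defined" by textual order only (no control-flow analysis), and the function names `defs_and_uses`, `most_recent_definition`, and `dependency_lines` are hypothetical. It collects only the line numbers a target line depends on; the embedding step itself is left out.

```python
import ast


def defs_and_uses(source):
    """Map each line number to the names it assigns (defs) and reads (uses)."""
    defs, uses = {}, {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            # Store context means the name is being assigned on this line;
            # every other context is treated as a read in this sketch.
            table = defs if isinstance(node.ctx, ast.Store) else uses
            table.setdefault(node.lineno, set()).add(node.id)
    return defs, uses


def most_recent_definition(name, line, defs):
    """Closest earlier line that assigns `name`, or None.

    Assumption: textual order approximates execution order.
    """
    earlier = [ln for ln, names in defs.items() if ln < line and name in names]
    return max(earlier) if earlier else None


def dependency_lines(line, defs, uses, seen=None):
    """Recursively collect the lines that `line` depends on through its tokens."""
    seen = set() if seen is None else seen
    for name in uses.get(line, ()):
        src = most_recent_definition(name, line, defs)
        if src is not None and src not in seen:
            seen.add(src)
            dependency_lines(src, defs, uses, seen)
    return seen


program = """\
n = 10
n = 20
buf = [0] * n
buf[n] = 1
"""

defs, uses = defs_and_uses(program)
deps = sorted(dependency_lines(4, defs, uses))
print(deps)  # [2, 3]
```

Line 1 is absent from the result because `n` is redefined on line 2 before it is used. In an architecture like the one described, the lines gathered this way would be the ones whose representations feed into the representation of the target line.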
