TAP: A static analysis model for PHP vulnerabilities based on token and deep learning technology.

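As Web applications become more widespread, security issues in their source code are increasing, and exposed vulnerabilities seriously endanger the interests of service providers and customers. Several models address this problem, but most rely on complex graphs generated from source code or on regex patterns based on expert experience. This paper proposes TAP, an analysis model based on a token mechanism and deep learning technology, to discover vulnerabilities in PHP: Hypertext Preprocessor (PHP) Web programs conveniently. Building on the token mechanism of the PHP language, a custom tokenizer was designed that unifies tokens, supports some features of PHP, and optimizes parsing. The tokenizer also implements parameter iteration to achieve data flow analysis. The deep learning model of TAP was trained on the Software Assurance Reference Dataset (SARD) and the SQLI-LABS dataset by combining the word2vec model with a Long Short-Term Memory (LSTM) network. In the experiment on the CWE-89 dataset, TAP achieves an Area Under the Curve (AUC) of 0.9941, which is better than the other models, and the highest accuracy, 0.9787. Further, compared with RIPS, TAP performs much better in multiclass classification, with a Kappa of 0.8319 and a Hamming distance of 0.0840.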
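The record's description mentions a custom tokenizer that "unifies tokens". As a rough illustration of what such unification could look like, the sketch below splits a small PHP snippet into tokens and renames user-defined variables to canonical placeholders. Everything here is an assumption made for illustration: the regex, the `unify_tokens` name, the `$VARn` and `STR` placeholders, and the choice to keep superglobals verbatim are hypothetical and are not taken from the paper's tokenizer.

```python
import re

# Hypothetical token pattern for a tiny subset of PHP (illustration only).
TOKEN_RE = re.compile(
    r"""
    (?P<VARIABLE>\$[A-Za-z_]\w*)     # $name
  | (?P<STRING>"[^"]*"|'[^']*')      # quoted literal, escapes not handled
  | (?P<NUMBER>\d+(?:\.\d+)?)        # integer or decimal literal
  | (?P<NAME>[A-Za-z_]\w*)           # function names and keywords
  | (?P<OP>[^\sA-Za-z0-9_])          # any single punctuation character
    """,
    re.VERBOSE,
)

# Assumption: user-input sources are kept as-is so they stay recognizable.
SUPERGLOBALS = {"$_GET", "$_POST", "$_COOKIE", "$_REQUEST"}


def unify_tokens(php_code):
    """Return a token list with variables and strings mapped to placeholders."""
    names = {}   # original variable name -> placeholder
    tokens = []
    for match in TOKEN_RE.finditer(php_code):
        kind, text = match.lastgroup, match.group()
        if kind == "VARIABLE" and text not in SUPERGLOBALS:
            # Number placeholders in order of first appearance.
            text = names.setdefault(text, f"$VAR{len(names)}")
        elif kind == "STRING":
            text = "STR"
        tokens.append(text)
    return tokens


if __name__ == "__main__":
    print(unify_tokens("$id = $_GET['id']; mysql_query($id);"))
```

Renaming variables consistently means that two snippets differing only in identifier names produce the same token sequence, which is one plausible reason to normalize tokens before feeding them to an embedding model.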


Bibliographic Details
Main Authors: Yong Fang, Shengjun Han, Cheng Huang, Runpu Wu
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2019-01-01
Series: PLoS ONE
Online Access: https://doi.org/10.1371/journal.pone.0225196
id doaj-91138365ed36418ba93676b1770f1341
record_format Article
spelling doaj-91138365ed36418ba93676b1770f1341 | 2021-03-03T21:17:06Z | eng | Public Library of Science (PLoS) | PLoS ONE | 1932-6203 | 2019-01-01 | vol. 14, no. 11, e0225196 | 10.1371/journal.pone.0225196 | TAP: A static analysis model for PHP vulnerabilities based on token and deep learning technology. | Yong Fang; Shengjun Han; Cheng Huang; Runpu Wu | https://doi.org/10.1371/journal.pone.0225196
collection DOAJ
language English
format Article
sources DOAJ
author Yong Fang
Shengjun Han
Cheng Huang
Runpu Wu
title TAP: A static analysis model for PHP vulnerabilities based on token and deep learning technology.
publisher Public Library of Science (PLoS)
series PLoS ONE
issn 1932-6203
publishDate 2019-01-01
description As Web applications become more widespread, security issues in their source code are increasing, and exposed vulnerabilities seriously endanger the interests of service providers and customers. Several models address this problem, but most rely on complex graphs generated from source code or on regex patterns based on expert experience. This paper proposes TAP, an analysis model based on a token mechanism and deep learning technology, to discover vulnerabilities in PHP: Hypertext Preprocessor (PHP) Web programs conveniently. Building on the token mechanism of the PHP language, a custom tokenizer was designed that unifies tokens, supports some features of PHP, and optimizes parsing. The tokenizer also implements parameter iteration to achieve data flow analysis. The deep learning model of TAP was trained on the Software Assurance Reference Dataset (SARD) and the SQLI-LABS dataset by combining the word2vec model with a Long Short-Term Memory (LSTM) network. In the experiment on the CWE-89 dataset, TAP achieves an Area Under the Curve (AUC) of 0.9941, which is better than the other models, and the highest accuracy, 0.9787. Further, compared with RIPS, TAP performs much better in multiclass classification, with a Kappa of 0.8319 and a Hamming distance of 0.0840.
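The description reports a Kappa score and a Hamming distance for multiclass classification. The sketch below shows how these two quantities are commonly computed from true and predicted labels. It assumes that "Kappa" means Cohen's kappa and that the Hamming distance is the normalized form (the fraction of mismatched labels); the record does not state either definition. The class names and label lists are made-up toy data, not results from the paper.

```python
from collections import Counter


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: agreement between two labelings, corrected for chance.

    Undefined (division by zero) when chance agreement equals 1.
    """
    n = len(y_true)
    # Observed agreement: fraction of samples where the labels match.
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Chance agreement: sum over classes of the product of marginal rates.
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (observed - expected) / (1 - expected)


def hamming_loss(y_true, y_pred):
    """Fraction of samples whose predicted label differs from the true label."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)


if __name__ == "__main__":
    # Hypothetical labels for six code samples (toy data).
    y_true = ["safe", "sqli", "xss", "sqli", "safe", "xss"]
    y_pred = ["safe", "sqli", "sqli", "sqli", "safe", "xss"]
    print(cohen_kappa(y_true, y_pred))
    print(hamming_loss(y_true, y_pred))
```

Under these definitions, a higher kappa and a lower Hamming loss both indicate better agreement with the ground truth. For single-label problems the Hamming loss equals one minus the accuracy.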
url https://doi.org/10.1371/journal.pone.0225196