Deobfuscation, unpacking, and decoding of obfuscated malicious JavaScript for machine learning models detection performance improvement

Obfuscation is rampant in both benign and malicious JavaScript (JS) code. It produces obscure, hard-to-detect code that hinders comprehension and analysis. Accurate detection of malicious JS code that masquerades as innocuous scripts is therefore vital. Existing deobfuscation methods assume that a...

Bibliographic Details
Main Authors:	Samuel Ndichu, Sangwook Kim, Seiichi Ozawa
Format:	Article
Language:	English
Published:	Wiley 2020-06-01
Series:	CAAI Transactions on Intelligence Technology
Subjects:	invasive software; java; internet; feature extraction; text analysis; vectors; learning (artificial intelligence); formatted js code; deobfuscation methods; unpacking; dud-preprocessed obfuscated malicious js codes; term frequency–inverse document frequency model; long short-term memory model; paragraph vector models; obfuscated benign js codes; learned feature vectors; fasttext model; js code editor; multilayer obfuscation; original js code; undetectable code; obscure code; machine learning models detection
Online Access:	https://digital-library.theiet.org/content/journals/10.1049/trit.2020.0026

id	doaj-1ec26bbd25b54437bb5faa62e4cf5359
record_format	Article
spelling	doaj-1ec26bbd25b54437bb5faa62e4cf5359; 2021-04-02T14:11:14Z; eng; Wiley; CAAI Transactions on Intelligence Technology; 2468-2322; 2020-06-01; 10.1049/trit.2020.0026; TRIT.2020.0026; Deobfuscation, unpacking, and decoding of obfuscated malicious JavaScript for machine learning models detection performance improvement; Samuel Ndichu, Sangwook Kim, Seiichi Ozawa (Graduate School of Engineering, Kobe University); https://digital-library.theiet.org/content/journals/10.1049/trit.2020.0026
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Samuel Ndichu, Sangwook Kim, Seiichi Ozawa
author_facet	Samuel Ndichu, Sangwook Kim, Seiichi Ozawa
author_sort	Samuel Ndichu
title	Deobfuscation, unpacking, and decoding of obfuscated malicious JavaScript for machine learning models detection performance improvement
publisher	Wiley
series	CAAI Transactions on Intelligence Technology
issn	2468-2322
publishDate	2020-06-01
description	Obfuscation is rampant in both benign and malicious JavaScript (JS) code. It produces obscure, hard-to-detect code that hinders comprehension and analysis. Accurate detection of malicious JS code that masquerades as innocuous scripts is therefore vital. Existing deobfuscation methods assume that a specific tool can recover the original JS code entirely. For multi-layer obfuscation, general tools produce formatted JS code, but some sections remain encoded. To detect such code, this study performs Deobfuscation, Unpacking, and Decoding (DUD-preprocessing) by function redefinition using a virtual machine (VM), a JS code editor, and a Python int_to_str() function, which facilitates feature learning by the FastText model. The learned feature vectors are passed to a classifier model that judges the maliciousness of the JS code. For performance evaluation, the authors use Hynek Petrak's dataset for obfuscated malicious JS code, and the SRILAB dataset and the top 10,000 websites from the Majestic Million service for obfuscated benign JS code. They then compare the performance with that of other models on the detection of DUD-preprocessed obfuscated malicious JS code. The experimental results show that the proposed approach enhances feature learning and improves accuracy in detecting obfuscated malicious JS code.
topic	invasive software; java; internet; feature extraction; text analysis; vectors; learning (artificial intelligence); formatted js code; deobfuscation methods; unpacking; dud-preprocessed obfuscated malicious js codes; term frequency–inverse document frequency model; long short-term memory model; paragraph vector models; obfuscated benign js codes; learned feature vectors; fasttext model; js code editor; multilayer obfuscation; original js code; undetectable code; obscure code; machine learning models detection
url	https://digital-library.theiet.org/content/journals/10.1049/trit.2020.0026
_version_	1721562769706188800
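
The description in this record notes that, after general tools have formatted a multi-layer-obfuscated script, some sections remain encoded. As a minimal sketch of what decoding such a section can look like, the Python below replaces JavaScript String.fromCharCode(...) calls with the strings they encode. This is an illustration only: the record does not say how the authors' int_to_str() function works, and the names ints_to_str and decode_char_codes, as well as the choice of String.fromCharCode as the encoding, are assumptions made for this example.

```python
import json
import re

# Matches calls such as String.fromCharCode(97, 108, 101, 114, 116).
# Assumption: the arguments are plain decimal integer literals.
_CHAR_CODE_CALL = re.compile(r"String\.fromCharCode\(\s*([0-9,\s]+?)\s*\)")


def ints_to_str(codes):
    """Map a sequence of integer character codes to the string they encode.

    Hypothetical helper; not the int_to_str() function named in the record.
    """
    return "".join(chr(code) for code in codes)


def decode_char_codes(js_source):
    """Replace each String.fromCharCode(...) call with a quoted string literal."""

    def _replace(match):
        # Split the argument list on commas and skip empty tokens.
        codes = [int(tok) for tok in match.group(1).split(",") if tok.strip()]
        # json.dumps yields a double-quoted literal with quotes escaped.
        return json.dumps(ints_to_str(codes))

    return _CHAR_CODE_CALL.sub(_replace, js_source)


if __name__ == "__main__":
    print(decode_char_codes("eval(String.fromCharCode(97, 108, 101, 114, 116))"))
```

Decoded text of this kind exposes readable tokens to a downstream feature learner, which is the motivation the description gives for preprocessing before FastText; the sketch handles only this one encoding and leaves other source text unchanged.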

Similar Items