Summary: | This thesis examines machine learning techniques for detecting malware using dynamic runtime opcodes. Previous work in the field has faltered on inadequately sized and poorly sampled datasets. A novel run-trace dataset is presented, the largest in the literature to date. Using this dataset, malware detection using opcode analysis is shown to be not only feasible, but highly accurate at short run-lengths and without computationally-expensive sequencing analysis. Second, unsupervised learning is used to investigate the effects of anti-virus (AV) labels on detection rates. AV labels offer an English-language description of the malware type, whereas it is found that using an assembly language description is more beneficial in malware triaging. Third, the machine learning techniques are applied to ransomware run-traces, which has not been explored in the literature to date. This offers four further novel contributions: examination of dynamic API calls vs opcode traces in ransomware; run-lengths necessary to detect ransomware accurately; creation of a logical feature reduction algorithm to minimise computational expense in machine learning; the first model in the literature which can differentiate between benign encryption (zipping) and malicious encryption. Lastly, the computational costs of 23 machine learning algorithms are investigated with respect to the run trace dataset. In the literature, researchers discuss the explosion of malware, yet opcode analyses have used fixed-size datasets, with no deference to how this model will cope with retraining on escalating datasets. The cost of retraining and testing updatable and non-updatable classifiers, both parallelised and non-parallelised, is examined with simulated escalating datasets. Lastly, a model is proposed and examined to mitigate the disadvantages of the most successful classifiers for future work.
|