Scaling Software Security Analysis to Millions of Malicious Programs and Billions of Lines of Code

Software security is a big data problem. The volume of new software artifacts created far outpaces the current capacity of software analysis. This gap has brought an urgent challenge to our security community: scalability. If our techniques cannot cope with an ever-increasing volume of software, we w...


Bibliographic Details
Main Author:	Jang, Jiyong
Format:	Others
Published:	Research Showcase @ CMU 2013
Subjects:	Malware Triage; Feature Hashing; Co-clustering; Hadoop; Unpatched Code Clone; Bloom Filter; Lineage; Binary Analysis; Code Reuse; Big Data; Electrical and Computer Engineering
Online Access:	http://repository.cmu.edu/dissertations/306 http://repository.cmu.edu/cgi/viewcontent.cgi?article=1308&context=dissertations

id	ndltd-cmu.edu-oai-repository.cmu.edu-dissertations-1308
record_format	oai_dc
spelling	ndltd-cmu.edu-oai-repository.cmu.edu-dissertations-1308 2014-07-24T15:36:16Z 2013-08-01T07:00:00Z text application/pdf Dissertations
collection	NDLTD
format	Others
sources	NDLTD
topic	Malware Triage; Feature Hashing; Co-clustering; Hadoop; Unpatched Code Clone; Bloom Filter; Lineage; Binary Analysis; Code Reuse; Big Data; Electrical and Computer Engineering
description	Software security is a big data problem. The volume of new software artifacts created far outpaces the current capacity of software analysis. This gap has brought an urgent challenge to our security community: scalability. If our techniques cannot cope with an ever-increasing volume of software, we will always be one step behind attackers. Thus developing scalable analysis to bridge the gap is essential. In this dissertation, we argue that automatic code reuse detection enables an efficient data reduction of a high volume of incoming malware for downstream analysis and enhances software security by efficiently finding known vulnerabilities across large code bases. To demonstrate the benefits of automatic software similarity detection, we discuss two representative problems that are remedied by scalable analysis: malware triage and unpatched code clone detection. First, we tackle the onslaught of malware. Although over one million new malware samples are reported each day, existing research shows that most malware is not written from scratch; instead, most samples are automatically generated variants of existing malware. When groups of highly similar variants are clustered together, new malware more easily stands out. Unfortunately, current systems struggle with handling this high volume of malware. We scale clustering using feature hashing and perform semantic analysis using co-clustering. Our evaluation demonstrates that these techniques are an order of magnitude faster than previous systems and automatically discover highly correlated features and malware groups. Furthermore, we design algorithms to infer evolutionary relationships among malware, which helps analysts understand trends over time and make informed decisions about which malware to analyze first. Second, we address the problem of detecting unpatched code clones at scale. When buggy code gets copied from project to project, eventually all projects will need to be patched. We call clones of buggy code that have been fixed in only a subset of projects unpatched code clones. Unfortunately, code copying is usually ad hoc and is often not tracked, which makes it challenging to identify all unpatched vulnerabilities in code bases at the scale of entire OS distributions. We scale unpatched code clone detection to spot over 15,000 latent security vulnerabilities in 2.1 billion lines of code from the Linux kernel, all Debian and Ubuntu packages, and all C/C++ projects in SourceForge in three hours on a single machine. To the best of our knowledge, this is the largest set of bugs ever reported in a single paper.
author	Jang, Jiyong
title	Scaling Software Security Analysis to Millions of Malicious Programs and Billions of Lines of Code
publisher	Research Showcase @ CMU
publishDate	2013
url	http://repository.cmu.edu/dissertations/306 http://repository.cmu.edu/cgi/viewcontent.cgi?article=1308&context=dissertations

Similar Items