Scaling Software Security Analysis to Millions of Malicious Programs and Billions of Lines of Code
Software security is a big data problem. The volume of new software artifacts created far outpaces the current capacity of software analysis. This gap has brought an urgent challenge to our security community: scalability. If our techniques cannot cope with an ever-increasing volume of software, we w...
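Main Author: | Jang, Jiyong |
---|---|
Format: | Others |
Published: | Research Showcase @ CMU, 2013 |
Subjects: | Malware Triage; Feature Hashing; Co-clustering; Hadoop; Unpatched Code Clone; Bloom Filter; Lineage; Binary Analysis; Code Reuse; Big Data; Electrical and Computer Engineering |
Online Access: | http://repository.cmu.edu/dissertations/306 http://repository.cmu.edu/cgi/viewcontent.cgi?article=1308&context=dissertations |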
id |
ndltd-cmu.edu-oai-repository.cmu.edu-dissertations-1308 |
---|---|
record_format |
oai_dc |
spelling |
ndltd-cmu.edu-oai-repository.cmu.edu-dissertations-1308 2014-07-24T15:36:16Z 2013-08-01T07:00:00Z text application/pdf Dissertations Research Showcase @ CMU |
collection |
NDLTD |
format |
Others |
sources |
NDLTD |
topic |
Malware Triage; Feature Hashing; Co-clustering; Hadoop; Unpatched Code Clone; Bloom Filter; Lineage; Binary Analysis; Code Reuse; Big Data; Electrical and Computer Engineering |
description |
Software security is a big data problem. The volume of new software artifacts far outpaces the current capacity of software analysis. This gap poses an urgent challenge to our security community: scalability. If our techniques cannot cope with an ever-increasing volume of software, we will always be one step behind attackers. Developing scalable analysis to bridge the gap is therefore essential.
In this dissertation, we argue that automatic code reuse detection enables efficient data reduction of a high volume of incoming malware for downstream analysis, and enhances software security by efficiently finding known vulnerabilities across large code bases. To demonstrate the benefits of automatic software similarity detection, we discuss two representative problems that scalable analysis remedies: malware triage and unpatched code clone detection.
First, we tackle the onslaught of malware. Although over one million new malware samples are reported each day, existing research shows that most malware is not written from scratch; instead, most samples are automatically generated variants of existing malware. When groups of highly similar variants are clustered together, new malware stands out more easily. Unfortunately, current systems struggle to handle this volume of malware. We scale clustering using feature hashing and perform semantic analysis using co-clustering. Our evaluation demonstrates that these techniques are an order of magnitude faster than previous systems and automatically discover highly correlated features and malware groups. Furthermore, we design algorithms to infer evolutionary relationships among malware, which help analysts understand trends over time and make informed decisions about which malware to analyze first.
Second, we address the problem of detecting unpatched code clones at scale. When buggy code gets copied from project to project, eventually all projects will need to be patched. We call clones of buggy code that have been fixed in only a subset of projects unpatched code clones. Unfortunately, code copying is usually ad-hoc and is often not tracked, which makes it challenging to identify all unpatched vulnerabilities in code bases at the scale of entire OS distributions. We scale unpatched code clone detection to spot over 15,000 latent security vulnerabilities in 2.1 billion lines of code from the Linux kernel, all Debian and Ubuntu packages, and all C/C++ projects in SourceForge in three hours on a single machine. To the best of our knowledge, this is the largest set of bugs ever reported in a single paper. |
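The unpatched code clone idea in the description above can be illustrated with a small sketch. It is not the dissertation's implementation, which this record does not describe in detail. The sketch assumes that code is normalized by removing whitespace and lowercasing, that matching is done on sliding windows of n consecutive lines, and that a plain Python set stands in for the Bloom filter named among the record's subjects. The function names (`normalize`, `ngrams`, `is_unpatched_clone`), the window size, and the C fragments are all hypothetical.

```python
import re


def normalize(code):
    """Return non-empty lines with all whitespace removed and lowercased."""
    lines = []
    for raw in code.splitlines():
        line = re.sub(r"\s+", "", raw).lower()
        if line:
            lines.append(line)
    return lines


def ngrams(lines, n=3):
    """Return the set of n-line sliding windows over the normalized lines."""
    return {tuple(lines[i:i + n]) for i in range(len(lines) - n + 1)}


def is_unpatched_clone(vulnerable_snippet, target_code, n=3):
    """True if every n-line window of the pre-patch snippet occurs in target_code."""
    needed = ngrams(normalize(vulnerable_snippet), n)
    if not needed:
        return False  # snippet shorter than n lines: nothing to match
    # A plain set is used here; a Bloom filter (an assumption about how this
    # could be scaled) would trade exactness for a much smaller memory footprint.
    return needed <= ngrams(normalize(target_code), n)


# Hypothetical pre-patch code that a security fix changed (made up for illustration).
VULNERABLE = """
len = read_length(pkt);
buf = malloc(len);
memcpy(buf, pkt->data, pkt->size);
"""

# A copy in another project that never received the fix.
UNPATCHED_COPY = """
int handle(struct packet *pkt) {
    len = read_length(pkt);
    buf  =  malloc(len);
    memcpy(buf, pkt->data, pkt->size);
    return 0;
}
"""

# A copy where a bounds check was added, so the vulnerable window no longer appears.
PATCHED_COPY = """
int handle(struct packet *pkt) {
    len = read_length(pkt);
    buf = malloc(len);
    if (pkt->size > len) return -1;
    memcpy(buf, pkt->data, pkt->size);
    return 0;
}
"""

unpatched_found = is_unpatched_clone(VULNERABLE, UNPATCHED_COPY)
patched_found = is_unpatched_clone(VULNERABLE, PATCHED_COPY)
```

In this sketch the copy without the bounds check is flagged and the patched copy is not, because the inserted check breaks the three-line window taken from the vulnerable code. Exact window matching is a deliberately simple choice: it tolerates whitespace changes but not renamed variables.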
author |
Jang, Jiyong |
title |
Scaling Software Security Analysis to Millions of Malicious Programs and Billions of Lines of Code |
publisher |
Research Showcase @ CMU |
publishDate |
2013 |
url |
http://repository.cmu.edu/dissertations/306 http://repository.cmu.edu/cgi/viewcontent.cgi?article=1308&context=dissertations |