Scaling Software Security Analysis to Millions of Malicious Programs and Billions of Lines of Code

Software security is a big data problem. The volume of new software artifacts created far outpaces the current capacity of software analysis. This gap has brought an urgent challenge to our security community—scalability. If our techniques cannot cope with an ever increasing volume of software, we w...


Bibliographic Details
Main Author: Jang, Jiyong
Format: Others
Published: Research Showcase @ CMU 2013
Subjects:
Online Access: http://repository.cmu.edu/dissertations/306
http://repository.cmu.edu/cgi/viewcontent.cgi?article=1308&context=dissertations
id ndltd-cmu.edu-oai-repository.cmu.edu-dissertations-1308
record_format oai_dc
spelling ndltd-cmu.edu-oai-repository.cmu.edu-dissertations-1308 2014-07-24T15:36:16Z Scaling Software Security Analysis to Millions of Malicious Programs and Billions of Lines of Code Jang, Jiyong Software security is a big data problem. The volume of new software artifacts created far outpaces the current capacity of software analysis. This gap has brought an urgent challenge to our security community—scalability. If our techniques cannot cope with an ever increasing volume of software, we will always be one step behind attackers. Thus developing scalable analysis to bridge the gap is essential. In this dissertation, we argue that automatic code reuse detection enables an efficient data reduction of a high volume of incoming malware for downstream analysis and enhances software security by efficiently finding known vulnerabilities across large code bases. In order to demonstrate the benefits of automatic software similarity detection, we discuss two representative problems that are remedied by scalable analysis: malware triage and unpatched code clone detection. First, we tackle the onslaught of malware. Although over one million new malware are reported each day, existing research shows that most malware are not written from scratch; instead, they are automatically generated variants of existing malware. When groups of highly similar variants are clustered together, new malware more easily stands out. Unfortunately, current systems struggle with handling this high volume of malware. We scale clustering using feature hashing and perform semantic analysis using co-clustering. Our evaluation demonstrates that these techniques are an order of magnitude faster than previous systems and automatically discover highly correlated features and malware groups. Furthermore, we design algorithms to infer evolutionary relationships among malware, which helps analysts understand trends over time and make informed decisions about which malware to analyze first.
Second, we address the problem of detecting unpatched code clones at scale. When buggy code gets copied from project to project, eventually all projects will need to be patched. We call clones of buggy code that have been fixed in only a subset of projects unpatched code clones. Unfortunately, code copying is usually ad-hoc and is often not tracked, which makes it challenging to identify all unpatched vulnerabilities in code bases at the scale of entire OS distributions. We scale unpatched code clone detection to spot over 15,000 latent security vulnerabilities in 2.1 billion lines of code from the Linux kernel, all Debian and Ubuntu packages, and all C/C++ projects in SourceForge in three hours on a single machine. To the best of our knowledge, this is the largest set of bugs ever reported in a single paper. 2013-08-01T07:00:00Z text application/pdf http://repository.cmu.edu/dissertations/306 http://repository.cmu.edu/cgi/viewcontent.cgi?article=1308&context=dissertations Dissertations Research Showcase @ CMU Malware Triage Feature Hashing Co-clustering Hadoop Unpatched Code Clone Bloom Filter Lineage Binary Analysis Code Reuse Big Data Electrical and Computer Engineering
collection NDLTD
format Others
sources NDLTD
topic Malware
Triage
Feature Hashing
Co-clustering
Hadoop
Unpatched Code Clone
Bloom Filter
Lineage
Binary Analysis
Code Reuse
Big Data
Electrical and Computer Engineering
description Software security is a big data problem. The volume of new software artifacts created far outpaces the current capacity of software analysis. This gap has brought an urgent challenge to our security community—scalability. If our techniques cannot cope with an ever increasing volume of software, we will always be one step behind attackers. Thus developing scalable analysis to bridge the gap is essential. In this dissertation, we argue that automatic code reuse detection enables an efficient data reduction of a high volume of incoming malware for downstream analysis and enhances software security by efficiently finding known vulnerabilities across large code bases. In order to demonstrate the benefits of automatic software similarity detection, we discuss two representative problems that are remedied by scalable analysis: malware triage and unpatched code clone detection. First, we tackle the onslaught of malware. Although over one million new malware are reported each day, existing research shows that most malware are not written from scratch; instead, they are automatically generated variants of existing malware. When groups of highly similar variants are clustered together, new malware more easily stands out. Unfortunately, current systems struggle with handling this high volume of malware. We scale clustering using feature hashing and perform semantic analysis using co-clustering. Our evaluation demonstrates that these techniques are an order of magnitude faster than previous systems and automatically discover highly correlated features and malware groups. Furthermore, we design algorithms to infer evolutionary relationships among malware, which helps analysts understand trends over time and make informed decisions about which malware to analyze first. Second, we address the problem of detecting unpatched code clones at scale. When buggy code gets copied from project to project, eventually all projects will need to be patched. 
We call clones of buggy code that have been fixed in only a subset of projects unpatched code clones. Unfortunately, code copying is usually ad-hoc and is often not tracked, which makes it challenging to identify all unpatched vulnerabilities in code bases at the scale of entire OS distributions. We scale unpatched code clone detection to spot over 15,000 latent security vulnerabilities in 2.1 billion lines of code from the Linux kernel, all Debian and Ubuntu packages, and all C/C++ projects in SourceForge in three hours on a single machine. To the best of our knowledge, this is the largest set of bugs ever reported in a single paper.
author Jang, Jiyong
title Scaling Software Security Analysis to Millions of Malicious Programs and Billions of Lines of Code
publisher Research Showcase @ CMU
publishDate 2013
url http://repository.cmu.edu/dissertations/306
http://repository.cmu.edu/cgi/viewcontent.cgi?article=1308&context=dissertations