Parallel Junction Tree Algorithm on GPU
<p>Bayesian networks (BNs) are an effective tool in a diverse range of applications that require representation and reasoning with uncertain knowledge and data. Mathematically, a Bayesian network (BN) is a type of statistical model that can compactly represent complex probability distributions...
Main Author: | |
---|---|
Language: | EN |
Published: |
Carnegie Mellon University
2013
|
Subjects: | |
Online Access: | http://pqdtopen.proquest.com/#viewpdf?dispub=3575068 |
id |
ndltd-PROQUEST-oai-pqdtoai.proquest.com-3575068 |
---|---|
record_format |
oai_dc |
spelling |
ndltd-PROQUEST-oai-pqdtoai.proquest.com-35750682013-12-05T04:00:26Z Parallel Junction Tree Algorithm on GPU Zheng, Lu Computer Science <p>Bayesian networks (BNs) are an effective tool in a diverse range of applications that require representation and reasoning with uncertain knowledge and data. Mathematically, a Bayesian network (BN) is a type of statistical model that can compactly represent complex probability distributions of random variables. Inference over BNs can be either exact or approximate. The junction tree algorithm is perhaps the most popular exact inference algorithm. The junction tree algorithm propagates beliefs (or posteriors) over a junction tree when a new piece of evidence comes in so that the factors over the junction tree stay consistent with each other. </p><p> However, belief propagation over junction trees is known to be computationally hard. The cornputational complexity hinders the application of BNs in cases where real-time inference is required. This thesis accelerates the junction tree belief propagation algorithm through parallelizing and using a state of the art manycore computing platform. Recent manycore computing platforms, like the recent Graphical Processing Units (GPUs) from NVIDIA and Intel's Knights Ferry, employ a Single Instruction Multiple Data (SIMD) architecture. They can provide massive computational power to address various computational problems. However, it is not trivial to map a problem instance to the manycore platform. The problem has to be carefully parallelized and the implementation has to be carefully optimized in order to make full use of the computation resources on a GPU. </p><p> In this thesis, we will thoroughly investigate the junction tree algorithm and identify the possible parallel opportunities. Our work mainly focuses on node-level parallelizati on, i.e., we parallelize each message passing of junction tree belief propagation. We first identify two kinds of parallelism in the message passing, namely, element-wise parallelism and arithmetic parallelism. Then, based on these two kinds parallelism, we propose a two-dimensional parallel message passing algorithm. In case of message passings that do not contain enough parallel opportunity, we also propose a junction tree pruning techniques called clique merging. Clique merging eliminates extremely small nodes in a junction tree so that the junction tree is better adjusted to the GPU platform. </p><p> Performance tuning for the parallel junction tree algorithm is necessary. Yet the diversity in size typically seen in the junction tree cliques makes the GPU input workload vary; the two dimensions of parallelism further add to the complexity of this tuning issue. Therefore it is hard to set the GPU configuration parameters for optimal performance. In this thesis, we parameterize the GPU tuning process and propose a machine learning framework to automatically optimize the performance. This framework essentially trains a machine learning model to capture causal relationships between a GPUs workload, its configuration, and resulting performance characteristics. The machine learning model is then used to optimize the GPU configuration. Experiments show that this auto-tuning framework can improve the performance significantly more than extensive manual tuning. </p><p> We implemented the parallel junction tree algorithm on a NVIDIA GTX 460 GPU card, and employed the clique merging and auto-tuning techniques. Compared with a benchmark sequential code that runs on an Intel Core 2 Quad CPU, the fully tuned code show up to 19.90x speedup. On average over all data-sets, we get a speedup of 10.70x (arithmetic average) or 8.68x (geometric average). </p> Carnegie Mellon University 2013-12-04 00:00:00.0 thesis http://pqdtopen.proquest.com/#viewpdf?dispub=3575068 EN |
collection |
NDLTD |
language |
EN |
sources |
NDLTD |
topic |
Computer Science |
spellingShingle |
Computer Science Zheng, Lu Parallel Junction Tree Algorithm on GPU |
description |
<p>Bayesian networks (BNs) are an effective tool in a diverse range of applications that require representation and reasoning with uncertain knowledge and data. Mathematically, a Bayesian network (BN) is a type of statistical model that can compactly represent complex probability distributions of random variables. Inference over BNs can be either exact or approximate. The junction tree algorithm is perhaps the most popular exact inference algorithm. The junction tree algorithm propagates beliefs (or posteriors) over a junction tree when a new piece of evidence comes in so that the factors over the junction tree stay consistent with each other. </p><p> However, belief propagation over junction trees is known to be computationally hard. The cornputational complexity hinders the application of BNs in cases where real-time inference is required. This thesis accelerates the junction tree belief propagation algorithm through parallelizing and using a state of the art manycore computing platform. Recent manycore computing platforms, like the recent Graphical Processing Units (GPUs) from NVIDIA and Intel's Knights Ferry, employ a Single Instruction Multiple Data (SIMD) architecture. They can provide massive computational power to address various computational problems. However, it is not trivial to map a problem instance to the manycore platform. The problem has to be carefully parallelized and the implementation has to be carefully optimized in order to make full use of the computation resources on a GPU. </p><p> In this thesis, we will thoroughly investigate the junction tree algorithm and identify the possible parallel opportunities. Our work mainly focuses on node-level parallelizati on, i.e., we parallelize each message passing of junction tree belief propagation. We first identify two kinds of parallelism in the message passing, namely, element-wise parallelism and arithmetic parallelism. Then, based on these two kinds parallelism, we propose a two-dimensional parallel message passing algorithm. In case of message passings that do not contain enough parallel opportunity, we also propose a junction tree pruning techniques called clique merging. Clique merging eliminates extremely small nodes in a junction tree so that the junction tree is better adjusted to the GPU platform. </p><p> Performance tuning for the parallel junction tree algorithm is necessary. Yet the diversity in size typically seen in the junction tree cliques makes the GPU input workload vary; the two dimensions of parallelism further add to the complexity of this tuning issue. Therefore it is hard to set the GPU configuration parameters for optimal performance. In this thesis, we parameterize the GPU tuning process and propose a machine learning framework to automatically optimize the performance. This framework essentially trains a machine learning model to capture causal relationships between a GPUs workload, its configuration, and resulting performance characteristics. The machine learning model is then used to optimize the GPU configuration. Experiments show that this auto-tuning framework can improve the performance significantly more than extensive manual tuning. </p><p> We implemented the parallel junction tree algorithm on a NVIDIA GTX 460 GPU card, and employed the clique merging and auto-tuning techniques. Compared with a benchmark sequential code that runs on an Intel Core 2 Quad CPU, the fully tuned code show up to 19.90x speedup. On average over all data-sets, we get a speedup of 10.70x (arithmetic average) or 8.68x (geometric average). </p> |
author |
Zheng, Lu |
author_facet |
Zheng, Lu |
author_sort |
Zheng, Lu |
title |
Parallel Junction Tree Algorithm on GPU |
title_short |
Parallel Junction Tree Algorithm on GPU |
title_full |
Parallel Junction Tree Algorithm on GPU |
title_fullStr |
Parallel Junction Tree Algorithm on GPU |
title_full_unstemmed |
Parallel Junction Tree Algorithm on GPU |
title_sort |
parallel junction tree algorithm on gpu |
publisher |
Carnegie Mellon University |
publishDate |
2013 |
url |
http://pqdtopen.proquest.com/#viewpdf?dispub=3575068 |
work_keys_str_mv |
AT zhenglu paralleljunctiontreealgorithmongpu |
_version_ |
1716616501824323584 |