An NoC Traffic Compiler for Efficient FPGA Implementation of Sparse Graph-Oriented Workloads

Parallel graph-oriented applications expressed in the Bulk-Synchronous Parallel (BSP) and Token Dataflow compute models generate highly structured communication workloads from messages propagating along graph edges. We can statically expose this structure to traffic compilers and optimization tools t...

Bibliographic Details
Main Authors: Nachiket Kapre, André Dehon
Format: Article
Language: English
Published: Hindawi Limited 2011-01-01
Series: International Journal of Reconfigurable Computing
Online Access: http://dx.doi.org/10.1155/2011/745147
id doaj-48d938f3a33c4a45939ba24827857aeb
record_format Article
spelling doaj-48d938f3a33c4a45939ba24827857aeb; 2020-11-24T23:43:24Z; eng; Hindawi Limited; International Journal of Reconfigurable Computing; 1687-7195; 1687-7209; 2011-01-01; 2011; 10.1155/2011/745147; 745147; An NoC Traffic Compiler for Efficient FPGA Implementation of Sparse Graph-Oriented Workloads; Nachiket Kapre (Department of Electrical and Electronic Engineering, Imperial College London, London SW7 2AZ, UK); André Dehon (Department of Electrical and Systems Engineering, University of Pennsylvania, Philadelphia, PA 19104, USA); http://dx.doi.org/10.1155/2011/745147
collection DOAJ
language English
format Article
sources DOAJ
author Nachiket Kapre
André Dehon
title An NoC Traffic Compiler for Efficient FPGA Implementation of Sparse Graph-Oriented Workloads
publisher Hindawi Limited
series International Journal of Reconfigurable Computing
issn 1687-7195
1687-7209
publishDate 2011-01-01
description Parallel graph-oriented applications expressed in the Bulk-Synchronous Parallel (BSP) and Token Dataflow compute models generate highly structured communication workloads from messages propagating along graph edges. We can statically expose this structure to traffic compilers and optimization tools to reshape and reduce traffic for higher performance (or lower area, lower energy, lower cost). Such offline traffic optimization eliminates the need for complex runtime NoC hardware and enables lightweight, scalable NoCs. We perform load balancing, placement, fanout routing, and fine-grained synchronization to optimize our workloads for large networks of up to 2025 parallel elements for the BSP model and 25 parallel elements for the Token Dataflow model. This allows us to demonstrate speedups between 1.2× and 22× (3.5× mean), area reductions (number of Processing Elements) between 3× and 15× (9× mean), and dynamic energy savings between 2× and 3.5× (2.7× mean) over a range of real-world graph applications in the BSP compute model. We deliver speedups of 0.5–13× (geomean 3.6×) for Sparse Direct Matrix Solve (Token Dataflow compute model) applied to a range of sparse matrices when using a high-quality placement algorithm. We expect such traffic optimization tools and techniques to become an essential part of the NoC application-mapping flow.
url http://dx.doi.org/10.1155/2011/745147