An NoC Traffic Compiler for Efficient FPGA Implementation of Sparse Graph-Oriented Workloads
Parallel graph-oriented applications expressed in the Bulk-Synchronous Parallel (BSP) and Token Dataflow compute models generate highly-structured communication workloads from messages propagating along graph edges. We can statically expose this structure to traffic compilers and optimization tools t...
Main Authors: | Nachiket Kapre, André Dehon |
---|---|
Format: | Article |
Language: | English |
Published: | Hindawi Limited, 2011-01-01 |
Series: | International Journal of Reconfigurable Computing |
Online Access: | http://dx.doi.org/10.1155/2011/745147 |
id |
doaj-48d938f3a33c4a45939ba24827857aeb |
---|---|
record_format |
Article |
spelling |
doaj-48d938f3a33c4a45939ba24827857aeb; 2020-11-24T23:43:24Z; eng; Hindawi Limited; International Journal of Reconfigurable Computing; 1687-7195; 1687-7209; 2011-01-01; 2011; 10.1155/2011/745147; 745147; An NoC Traffic Compiler for Efficient FPGA Implementation of Sparse Graph-Oriented Workloads; Nachiket Kapre 0; André Dehon 1; Department of Electrical and Electronic Engineering, Imperial College London, London SW7 2AZ, UK; Department of Electrical and Systems Engineering, University of Pennsylvania, Philadelphia, PA 19104, USA; Parallel graph-oriented applications expressed in the Bulk-Synchronous Parallel (BSP) and Token Dataflow compute models generate highly-structured communication workloads from messages propagating along graph edges. We can statically expose this structure to traffic compilers and optimization tools to reshape and reduce traffic for higher performance (or lower area, lower energy, lower cost). Such offline traffic optimization eliminates the need for complex, runtime NoC hardware and enables lightweight, scalable NoCs. We perform load balancing, placement, fanout routing, and fine-grained synchronization to optimize our workloads for large networks up to 2025 parallel elements for BSP model and 25 parallel elements for Token Dataflow. This allows us to demonstrate speedups between 1.2× and 22× (3.5× mean), area reductions (number of Processing Elements) between 3× and 15× (9× mean) and dynamic energy savings between 2× and 3.5× (2.7× mean) over a range of real-world graph applications in the BSP compute model. We deliver speedups of 0.5–13× (geomean 3.6×) for Sparse Direct Matrix Solve (Token Dataflow compute model) applied to a range of sparse matrices when using a high-quality placement algorithm. We expect such traffic optimization tools and techniques to become an essential part of the NoC application-mapping flow. http://dx.doi.org/10.1155/2011/745147 |
collection |
DOAJ |
language |
English |
format |
Article |
sources |
DOAJ |
author |
Nachiket Kapre André Dehon |
author_facet |
Nachiket Kapre André Dehon |
author_sort |
Nachiket Kapre |
title |
An NoC Traffic Compiler for Efficient FPGA Implementation of Sparse Graph-Oriented Workloads |
publisher |
Hindawi Limited |
series |
International Journal of Reconfigurable Computing |
issn |
1687-7195 1687-7209 |
publishDate |
2011-01-01 |
description |
Parallel graph-oriented applications expressed in the Bulk-Synchronous Parallel (BSP) and Token Dataflow compute models generate highly-structured
communication workloads from messages propagating along graph edges. We can statically expose this structure to traffic compilers and optimization tools to
reshape and reduce traffic for higher performance (or lower area, lower energy, lower cost). Such offline traffic optimization eliminates the need for
complex, runtime NoC hardware and enables lightweight, scalable NoCs. We perform load balancing, placement, fanout routing, and fine-grained
synchronization to optimize our workloads for large networks up to 2025 parallel elements for BSP model and 25 parallel elements for Token Dataflow.
This allows us to demonstrate speedups between 1.2× and 22× (3.5× mean), area reductions (number of Processing Elements) between 3× and 15× (9× mean) and dynamic energy savings between 2× and 3.5× (2.7× mean) over a range of real-world graph applications in the BSP compute model. We deliver speedups of 0.5–13× (geomean 3.6×) for Sparse Direct Matrix Solve (Token Dataflow compute model) applied to a range of sparse matrices when using a high-quality placement algorithm. We expect such traffic optimization tools and techniques to become an essential part of the NoC application-mapping flow. |
url |
http://dx.doi.org/10.1155/2011/745147 |