An NoC Traffic Compiler for Efficient FPGA Implementation of Sparse Graph-Oriented Workloads

Parallel graph-oriented applications expressed in the Bulk-Synchronous Parallel (BSP) and Token Dataflow compute models generate highly-structured communication workloads from messages propagating along graph edges. We can statically expose this structure to traffic compilers and optimization tools t...


Bibliographic Details
Main Authors:	Nachiket Kapre, André Dehon
Format:	Article
Language:	English
Published:	Hindawi Limited 2011-01-01
Series:	International Journal of Reconfigurable Computing
Online Access:	http://dx.doi.org/10.1155/2011/745147

id	doaj-48d938f3a33c4a45939ba24827857aeb
record_format	Article
spelling	doaj-48d938f3a33c4a45939ba24827857aeb; 2020-11-24T23:43:24Z; eng; Hindawi Limited; International Journal of Reconfigurable Computing; ISSN 1687-7195, 1687-7209; 2011-01-01; vol. 2011; doi:10.1155/2011/745147; article ID 745147; An NoC Traffic Compiler for Efficient FPGA Implementation of Sparse Graph-Oriented Workloads; Nachiket Kapre (Department of Electrical and Electronic Engineering, Imperial College London, London SW7 2AZ, UK); André Dehon (Department of Electrical and Systems Engineering, University of Pennsylvania, Philadelphia, PA 19104, USA); http://dx.doi.org/10.1155/2011/745147
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Nachiket Kapre André Dehon
title	An NoC Traffic Compiler for Efficient FPGA Implementation of Sparse Graph-Oriented Workloads
publisher	Hindawi Limited
series	International Journal of Reconfigurable Computing
issn	1687-7195 1687-7209
publishDate	2011-01-01
description	Parallel graph-oriented applications expressed in the Bulk-Synchronous Parallel (BSP) and Token Dataflow compute models generate highly-structured communication workloads from messages propagating along graph edges. We can statically expose this structure to traffic compilers and optimization tools to reshape and reduce traffic for higher performance (or lower area, lower energy, lower cost). Such offline traffic optimization eliminates the need for complex, runtime NoC hardware and enables lightweight, scalable NoCs. We perform load balancing, placement, fanout routing, and fine-grained synchronization to optimize our workloads for large networks up to 2025 parallel elements for the BSP model and 25 parallel elements for Token Dataflow. This allows us to demonstrate speedups between 1.2× and 22× (3.5× mean), area reductions (number of Processing Elements) between 3× and 15× (9× mean) and dynamic energy savings between 2× and 3.5× (2.7× mean) over a range of real-world graph applications in the BSP compute model. We deliver speedups of 0.5–13× (geomean 3.6×) for Sparse Direct Matrix Solve (Token Dataflow compute model) applied to a range of sparse matrices when using a high-quality placement algorithm. We expect such traffic optimization tools and techniques to become an essential part of the NoC application-mapping flow.
url	http://dx.doi.org/10.1155/2011/745147


Similar Items