Fast query processing by distributing an index over CPU caches

Data-intensive applications on clusters often require that requests be sent quickly to the node managing the desired data. In many applications, one must search a sorted tree structure to determine the node responsible for accessing or storing the data. Examples include object tracking in sensor networks, packet routing over the Internet, request processing in publish-subscribe middleware, and query processing in database systems. When the tree structure is larger than the CPU cache, the standard implementation potentially incurs many cache misses for each lookup: one cache miss at each successive level of the tree. As the CPU-RAM speed gap grows, this performance degradation will only worsen. We propose a solution that takes advantage of the growing speed of local area networks for clusters. We split the sorted tree structure among the nodes of the cluster, assuming that the structure fits within the aggregated CPU caches of the entire cluster. We then send a word over the network (as part of a larger packet containing other words) in order to examine the tree structure in another node's CPU cache. We show that this is often faster than the standard solution, which locally incurs multiple cache misses while accessing each successive level of the tree. The principle is demonstrated on a cluster of Pentium III nodes connected by a Myrinet network, where the new approach is 50% faster. The advantage is expected to grow as networks become faster and as cache lines grow longer (increasing the cache-miss penalty). The technique thus overcomes the memory latency inherent in cache misses.
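
The record contains no code, but the mechanism the abstract describes is concrete enough to sketch. Below is a minimal, self-contained C simulation under assumed names and sizes: the sorted tree is split into slices, each small enough to stay resident in one node's CPU cache, and a batch of lookups (the "packet" of words) advances one tree level per hop. The slice layout, fanout, batch size, and names such as slice_t and resolve are illustrative assumptions, not the thesis's implementation, which ran on Pentium III nodes over Myrinet.

    /* Illustrative sketch (assumed layout, not the thesis's code): a
     * two-level sorted tree split into slices, one slice per cluster
     * node, with lookups resolved a batch at a time. */
    #include <stdio.h>
    #include <stdint.h>

    #define FANOUT 4           /* ways per slice (tiny, for clarity)  */
    #define BATCH  8           /* lookups carried per "packet"        */

    typedef struct {
        uint32_t sep[FANOUT - 1]; /* sorted separator keys            */
        int next[FANOUT];         /* >=0: forward to that slice;      */
                                  /* <0: final owner is -(code)-1     */
    } slice_t;

    /* Slice 0 is the root; slices 1-3 are its children. Each slice is
     * small enough to stay resident in one node's CPU cache. */
    static slice_t slices[4] = {
        { {  40,  80, 120 }, {  1,   2,   3,   3 } }, /* root         */
        { {  10,  20,  30 }, { -1,  -2,  -3,  -4 } }, /* keys  0..39  */
        { {  50,  60,  70 }, { -5,  -6,  -7,  -8 } }, /* keys 40..79  */
        { { 100, 120, 140 }, { -9, -10, -11, -12 } }, /* keys 80+     */
    };

    /* Resolve one tree level; on a real node this search runs entirely
     * out of cache, so it never pays a trip to RAM. */
    static int resolve(const slice_t *s, uint32_t key)
    {
        int j = 0;
        while (j < FANOUT - 1 && key >= s->sep[j])
            j++;
        return s->next[j];
    }

    int main(void)
    {
        uint32_t keys[BATCH] = { 5, 45, 85, 125, 15, 55, 95, 135 };
        int route[BATCH];

        /* The whole batch hits the root slice first: one "hop". */
        for (int i = 0; i < BATCH; i++)
            route[i] = resolve(&slices[0], keys[i]);

        /* Re-bucket by destination and "forward"; a real system would
         * ship each bucket over the network as one packet. */
        for (int dst = 1; dst < 4; dst++)
            for (int i = 0; i < BATCH; i++)
                if (route[i] == dst) {
                    int owner = -resolve(&slices[dst], keys[i]) - 1;
                    printf("key %3u -> owner node %d\n", keys[i], owner);
                }
        return 0;
    }

The point of the batching is amortization: a real packet carries many words, so each key pays only a fraction of a network round trip per tree level, instead of a full cache miss per level as in the standard single-node lookup.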

Bibliographic Details
Published:
Online Access: http://hdl.handle.net/2047/d20000661