SAHA: A String Adaptive Hash Table for Analytical Databases

Hash tables are the fundamental data structure for analytical database workloads, such as aggregation, joining, set filtering and record deduplication. Their performance differs drastically depending on the kind of data being processed and on how many inserts, lookups and deletes are performed.


Bibliographic Details
Main Authors: Tianqi Zheng, Zhibin Zhang, Xueqi Cheng
Format: Article
Language: English
Published: MDPI AG 2020-03-01
Series: Applied Sciences
Subjects: hash table; analytical database; string data
Online Access: https://www.mdpi.com/2076-3417/10/6/1915
id doaj-70e3a74e41dc4244b9b86474889a06c7
record_format Article
doi 10.3390/app10061915
volume 10
issue 6
article_number 1915
affiliation CAS Key Laboratory of Network Data Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China (all three authors)
last_indexed 2020-11-25T02:01:59Z
collection DOAJ
issn 2076-3417
description Hash tables are the fundamental data structure for analytical database workloads, such as aggregation, joining, set filtering and record deduplication. The performance of hash tables differs drastically depending on the kind of data being processed and on how many inserts, lookups and deletes are performed. In this paper, we address some common use cases of hash tables: aggregating and joining over arbitrary string data. We designed a new hash table, SAHA, which is tightly integrated with modern analytical databases and optimized for string data, with the following advantages: (1) it inlines short strings and saves hash values for long strings only; (2) it uses special memory loading techniques to do quick dispatching and hashing computations; and (3) it utilizes vectorized processing to batch hashing operations. Our evaluation results reveal that SAHA outperforms state-of-the-art hash tables, including Google’s SwissTable and Facebook’s F14Table, by one to five times in analytical workloads. It has been merged into the ClickHouse database and shows promising results in production.
topic hash table
analytical database
string data