High Performance Packet Processing on Multi-queue and Multi-core Platforms

Bibliographic Details
Main Authors: Tsai, Wenyen, 蔡文嚴
Other Authors: Huang, Nen Fu
Format: Others
Language: en_US
Published: 2015
Online Access: http://ndltd.ncl.edu.tw/handle/18774014002518328624
id ndltd-TW-103NTHU5650125
record_format oai_dc
spelling ndltd-TW-103NTHU5650125 2016-11-20T04:18:16Z http://ndltd.ncl.edu.tw/handle/18774014002518328624 High Performance Packet Processing on Multi-queue and Multi-core Platforms 於具備多佇列網路卡的多核心平台上對高效能封包處理之研究 Tsai, Wenyen 蔡文嚴 PhD National Tsing Hua University (國立清華大學) Institute of Communications Engineering (通訊工程研究所) 103 Huang, Nen Fu 黃能富 2015 thesis 104 en_US
collection NDLTD
language en_US
format Others
sources NDLTD
description PhD === National Tsing Hua University (國立清華大學) === Institute of Communications Engineering (通訊工程研究所) === 103 === Advances in semiconductor technology are making way for multi-core and many-core processors that incorporate tens to hundreds of cores in a single package. Meanwhile, network interface cards (NICs) featuring multiple hardware reception (Rx) and transmission (Tx) queues are the networking community's response to the prevalence of multi-core computing. Multi-queue networking avoids the performance degradation caused by multiple cores contending for a single Rx/Tx queue by distributing packets across multiple queues. In addition, thanks to evolving interrupt handling techniques, a NIC can now be allocated enough interrupt resources for each of its queues to be associated with a dedicated core. Although the number of cores in CPUs continues to climb, many difficulties remain in building systems capable of keeping up with the packet volume of a modern medium- to large-scale network deployment. This is due to several factors, including the ever-increasing rate of network traffic (e.g., the now prevalent 10 Gbps, the cutting-edge 40 Gbps, and the upcoming 100 Gbps NICs) and some fundamental limitations in both software and hardware architectures. On the software side, synchronization overheads imposed by multi-core programming, such as atomic operations and locking, play a critical role in packet processing performance. On the hardware side, architectural complications such as cache coherency and NUMA effects bring new challenges that require developers to acquire new skills in order to unlock the real computing power. Correspondingly, researchers tackle these challenges with a hardware and software co-design approach that starts by investigating the underlying hardware to gather the knowledge needed to guide software development and optimization.
In this dissertation, we focus on two problems: 1) reducing lock contention when performing session tracking and 2) affinitizing interrupts from multi-queue NICs to CPU cores with the objective of maximizing packet processing performance. For the first problem, we propose a simple partitioning scheme that aims to strike a balance between excessive locking and lockless manipulation. A resource balancing mechanism is also given to prevent underutilization of session tracking resources under unbalanced traffic loads. The effectiveness of the scheme is demonstrated by the performance improvement observed as the number of cores contending for a single lock decreases. For the second problem, interrupt affinitization, we propose an algorithmic approach based on a numerical cost model to find the best affinitization. Comprehensive experiments covering 1G and 10G NICs with four networking applications ranging from L2 to L7 are conducted to demonstrate its effectiveness.
author2 Huang, Nen Fu
author_facet Huang, Nen Fu
Tsai, Wenyen
蔡文嚴
author Tsai, Wenyen
蔡文嚴
title High Performance Packet Processing on Multi-queue and Multi-core Platforms
publishDate 2015
url http://ndltd.ncl.edu.tw/handle/18774014002518328624