A Selective Mitigation Technique of Soft Errors for DNN Models Used in Healthcare Applications: DenseNet201 Case Study

Bibliographic Details
Main Authors: Khalid Adam, Izzeldin Ibrahim Mohamed, Younis Ibrahim
Format: Article
Language: English
Published: IEEE 2021-01-01
Series: IEEE Access
Online Access: https://ieeexplore.ieee.org/document/9419032/
Description
Summary: Deep neural networks (DNNs) have been successfully deployed in a wide range of domains, including healthcare applications. DenseNet201 is a recent DNN architecture used in healthcare systems (e.g., detecting the presence of surgical tools). Specialized accelerators such as GPUs are used to speed up the execution of DNNs. Nevertheless, GPUs are prone to transient effects and other reliability threats, which can compromise the reliability of DNN models. Safety-critical systems such as healthcare applications must be highly reliable, because even minor errors might lead to severe injury or death. In this paper, we propose a selective mitigation technique based on an in-depth vulnerability analysis. First, we inject faults into the DenseNet201 model running on a GPU using NVIDIA's SASSIFI fault injector. Second, we perform a comprehensive kernel-level and layer-level analysis to identify the most vulnerable portions of the model. Finally, we validate our technique by applying it to the most vulnerable kernels, selectively protecting only the sensitive portions of the model to avoid unnecessary overhead. Our experiments demonstrate that the mitigation technique significantly reduces the percentage of errors that cause malfunction (errors that lead to misclassification) from 6.463% to 0.21%. Moreover, we compare the performance overhead (execution time) of our technique with that of well-known protection techniques: Algorithm-Based Fault Tolerance (ABFT), Double Modular Redundancy (DMR), and Triple Modular Redundancy (TMR). The proposed solution incurs only 0.3035% overhead compared to these techniques while correcting up to 84.8% of the silent data corruption (SDC) errors in DenseNet201, remarkably improving model reliability in the healthcare domain.
ISSN: 2169-3536
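The summary describes the technique only at a high level: inject faults, rank kernels by vulnerability, then protect only the top-ranked kernels. The minimal Python sketch below illustrates that workflow under stated assumptions and is not the authors' implementation. It models a transient fault as a single bit flip in a float32 value. SASSIFI itself operates on compiled GPU code, which this sketch does not emulate. It ranks kernels by SDC rate (SDCs per injected fault). As a swapped-in protection mechanism, it applies triple modular redundancy with majority voting to the selected kernels only, because the summary does not say which mechanism the paper applies. All function names (`flip_bit`, `top_vulnerable`, `tmr_vote`, `run_kernel`), kernel names, and SDC counts are hypothetical.

```python
import struct
from collections import Counter
from typing import Callable, Dict, List, Optional, Sequence, Set

Vector = List[float]


def flip_bit(value: float, bit: int) -> float:
    """Flip one bit (0-31) of a float32, modelling a single-bit transient fault."""
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    (flipped,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return flipped


def top_vulnerable(sdc: Dict[str, int], injected: Dict[str, int], k: int) -> Set[str]:
    """Pick the k kernels with the highest SDC rate (SDCs per injected fault)."""
    ranked = sorted(sdc, key=lambda name: sdc[name] / injected[name], reverse=True)
    return set(ranked[:k])


def tmr_vote(copies: Sequence[Vector]) -> Vector:
    """Element-wise 2-of-3 majority vote over three redundant outputs."""
    voted = []
    for values in zip(*copies):
        value, count = Counter(values).most_common(1)[0]
        # With three copies, a value seen at least twice is the majority.
        if count < 2:
            raise ValueError("no majority: all three copies disagree")
        voted.append(value)
    return voted


def run_kernel(name: str, kernel: Callable[[Vector], Vector], x: Vector,
               protected: Set[str], fault_bit: Optional[int] = None) -> Vector:
    """Run a kernel once, or three times with voting if it is protected.

    `kernel` must return a fresh list on every call. If `fault_bit` is given,
    that bit of element 0 in the first copy is flipped, simulating a single
    transient fault during one execution.
    """
    copies = 3 if name in protected else 1
    outputs = [kernel(x) for _ in range(copies)]
    if fault_bit is not None:
        outputs[0][0] = flip_bit(outputs[0][0], fault_bit)
    return tmr_vote(outputs) if copies == 3 else outputs[0]


if __name__ == "__main__":
    scale = lambda v: [0.5 * a for a in v]  # toy stand-in for a GPU kernel
    # Hypothetical fault-injection tallies, not results from the paper.
    chosen = top_vulnerable({"conv2d_k1": 40, "relu_k2": 2},
                            {"conv2d_k1": 100, "relu_k2": 100}, k=1)
    print(chosen)
    print(run_kernel("conv2d_k1", scale, [2.0, -4.0], chosen, fault_bit=30))
    print(run_kernel("relu_k2", scale, [2.0, -4.0], chosen, fault_bit=30))
```

With these hypothetical counts, `top_vulnerable` selects only `conv2d_k1`. Flipping bit 30 of the output value 1.0 sets every exponent bit and turns it into infinity. In the protected `conv2d_k1`, the vote masks this and returns `[1.0, -2.0]`. In the unprotected `relu_k2`, it propagates as `[inf, -2.0]`. The design choice mirrors the selective idea in the summary: only the chosen kernels pay the extra cost of redundant execution (three runs each in this sketch), while the rest run unprotected.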