Assessment of Classification Models and Relevant Features on Nonalcoholic Steatohepatitis Using Random Forest

Nonalcoholic fatty liver disease (NAFLD) is the hepatic manifestation of metabolic syndrome and is the most common cause of chronic liver disease in developed countries. Certain conditions, including mild inflammation biomarkers, dyslipidemia, and insulin resistance, can trigger a progression to non...


Bibliographic Details
Main Authors:	Rafael García-Carretero, Roberto Holgado-Cuadrado, Óscar Barquero-Pérez
Format:	Article
Language:	English
Published:	MDPI AG 2021-06-01
Series:	Entropy
Subjects:	non-alcoholic fatty liver disease; random forest; interpretability
Online Access:	https://www.mdpi.com/1099-4300/23/6/763

id	doaj-49f254f8108440ebb6d65ca139bda668
record_format	Article
spelling	doaj-49f254f8108440ebb6d65ca139bda668; 2021-07-01T00:23:26Z; eng; MDPI AG; Entropy; 1099-4300; 2021-06-01; 23(6), 763; 10.3390/e23060763; Assessment of Classification Models and Relevant Features on Nonalcoholic Steatohepatitis Using Random Forest; Rafael García-Carretero, Roberto Holgado-Cuadrado, Óscar Barquero-Pérez (all: Department of Signal Theory and Communications and Telematics Systems and Computing, Rey Juan Carlos University, 28935 Mostoles, Spain); https://www.mdpi.com/1099-4300/23/6/763; non-alcoholic fatty liver disease; random forest; interpretability
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Rafael García-Carretero, Roberto Holgado-Cuadrado, Óscar Barquero-Pérez
title	Assessment of Classification Models and Relevant Features on Nonalcoholic Steatohepatitis Using Random Forest
publisher	MDPI AG
series	Entropy
issn	1099-4300
publishDate	2021-06-01
description	Nonalcoholic fatty liver disease (NAFLD) is the hepatic manifestation of metabolic syndrome and is the most common cause of chronic liver disease in developed countries. Certain conditions, including mild inflammation biomarkers, dyslipidemia, and insulin resistance, can trigger a progression to nonalcoholic steatohepatitis (NASH), a condition characterized by inflammation and liver cell damage. We demonstrate the usefulness of machine learning with a case study to analyze the most important features in random forest (RF) models for predicting patients at risk of developing NASH. We collected data from patients who attended the Cardiovascular Risk Unit of Mostoles University Hospital (Madrid, Spain) from 2005 to 2021. We reviewed electronic health records to assess the presence of NASH, which was used as the outcome. We chose RF as the algorithm to develop six models using different pre-processing strategies. Performance metrics were evaluated to choose an optimized model. Finally, several interpretability techniques, such as feature importance, contribution of each feature to predictions, and partial dependence plots, were used to explain the model and give a better understanding of machine learning-based predictions. In total, 1525 patients met the inclusion criteria. The mean age was 57.3 years, and 507 patients had NASH (prevalence of 33.2%). Filter methods (the chi-square and Mann–Whitney–Wilcoxon tests) did not produce additional insight in terms of interactions, contributions, or relationships among variables and their outcomes. The random forest model classified patients with NASH with an accuracy of 0.87 in the best model and 0.79 in the worst one. Four features were the most relevant: insulin resistance, ferritin, serum levels of insulin, and triglycerides. The contribution of each feature was assessed via partial dependence plots. Random forest-based modeling demonstrated that machine learning can be used to improve interpretability, produce understanding of the modeled behavior, and show how much certain features contribute to predictions.
topic	non-alcoholic fatty liver disease; random forest; interpretability
url	https://www.mdpi.com/1099-4300/23/6/763
_version_	1721348721642307584
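
The abstract in this record reports a prevalence figure and uses partial dependence plots to assess how each feature contributes to predictions. As a minimal sketch of both ideas, the Python below recomputes the prevalence from the reported counts and computes a partial dependence curve by hand. The model `toy_predict`, its thresholds, and the four example rows are hypothetical stand-ins invented for illustration; they are not the authors' random forest, cut-offs, or data.

```python
from statistics import mean
from typing import Callable, List, Sequence


def prevalence(n_positive: int, n_total: int) -> float:
    """Fraction of patients who have the outcome."""
    return n_positive / n_total


def partial_dependence(
    predict: Callable[[Sequence[float]], float],
    rows: Sequence[Sequence[float]],
    feature: int,
    grid: Sequence[float],
) -> List[float]:
    """Average prediction over all rows when one feature is forced to each grid value."""
    curve = []
    for value in grid:
        predictions = []
        for row in rows:
            modified = list(row)       # copy so the input data is not mutated
            modified[feature] = value  # overwrite only the feature of interest
            predictions.append(predict(modified))
        curve.append(mean(predictions))
    return curve


def toy_predict(row: Sequence[float]) -> float:
    """Hypothetical risk score; thresholds are invented, not taken from the article."""
    insulin_resistance, ferritin = row
    score = 0.1
    if insulin_resistance > 2.5:
        score += 0.5
    if ferritin > 200:
        score += 0.2
    return score


# Counts reported in the abstract: 507 of 1525 patients had NASH.
nash_prevalence = prevalence(507, 1525)

# Made-up rows of (insulin resistance index, ferritin).
rows = [(1.0, 100.0), (3.0, 150.0), (2.0, 300.0), (4.0, 250.0)]
pd_curve = partial_dependence(toy_predict, rows, feature=0, grid=[1.0, 3.0])

print(f"prevalence: {nash_prevalence:.1%}")
print("partial dependence on feature 0:", [round(v, 3) for v in pd_curve])
```

With a fitted classifier, `predict` would be replaced by a function returning the predicted probability of NASH for one row; the averaging logic stays the same.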

