A Benchmark of Prevalent Feature Selection Algorithms on a Diverse Set of Classification Problems

Feature selection is the process of automatically selecting important features from data. It is an essential part of machine learning, artificial intelligence, data mining, and modelling in general. There are many feature selection algorithms available and the appropriate choice can be difficult. Th...

Bibliographic Details
Main Authors:	Anette, Kniberg; Nokto, David
Format:	Others
Language:	English
Published:	KTH, Medicinteknik och hälsosystem 2018
Subjects:	feature selection; variable selection; attribute selection; machine learning; data mining; benchmark; classification; variabelselektion; maskininlärning; datautvinning; klassificering; Medical Engineering; Medicinteknik
Online Access:	http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-228614

id	ndltd-UPSALLA1-oai-DiVA.org-kth-228614
record_format	oai_dc
spelling	ndltd-UPSALLA1-oai-DiVA.org-kth-228614 | 2018-06-26T06:09:51Z | A Benchmark of Prevalent Feature Selection Algorithms on a Diverse Set of Classification Problems | eng | Anette, Kniberg; Nokto, David | KTH, Medicinteknik och hälsosystem | KTH, Medicinteknik och hälsosystem | 2018 | feature selection; variable selection; attribute selection; machine learning; data mining; benchmark; classification; variabelselektion; maskininlärning; datautvinning; klassificering; Medical Engineering; Medicinteknik | Student thesis | info:eu-repo/semantics/bachelorThesis | text | http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-228614 | TRITA-CBH-GRU ; 2018:32 | application/pdf | info:eu-repo/semantics/openAccess
collection	NDLTD
language	English
format	Others
sources	NDLTD
topic	feature selection; variable selection; attribute selection; machine learning; data mining; benchmark; classification; variabelselektion; maskininlärning; datautvinning; klassificering; Medical Engineering; Medicinteknik
spellingShingle	feature selection; variable selection; attribute selection; machine learning; data mining; benchmark; classification; variabelselektion; maskininlärning; datautvinning; klassificering; Medical Engineering; Medicinteknik | Anette, Kniberg; Nokto, David | A Benchmark of Prevalent Feature Selection Algorithms on a Diverse Set of Classification Problems
description	Feature selection is the process of automatically selecting important features from data. It is an essential part of machine learning, artificial intelligence, data mining, and modelling in general. There are many feature selection algorithms available and the appropriate choice can be difficult. The aim of this thesis was to compare feature selection algorithms in order to provide an experimental basis for which algorithm to choose. The first phase involved assessing which algorithms are most common in the scientific community, through a systematic literature study in the two largest reference databases: Scopus and Web of Science. The second phase involved constructing and implementing a benchmark pipeline to compare 31 algorithms’ performance on 50 data sets. The selected features were used to construct classification models and their predictive performances were compared, as well as the runtime of the selection process. The results show a small overall superiority of embedded-type algorithms, especially types that involve Decision Trees. However, there is no algorithm that is significantly superior in every case. The pipeline and data from the experiments can be used by practitioners in determining which algorithms to apply to their respective problems. === (Swedish abstract, translated) Variable selection is a process in which relevant variables are automatically selected from data. It is an essential part of machine learning, artificial intelligence, data mining, and modelling in general. The large number of variable selection algorithms can make it difficult to decide which algorithm to use. The aim of this thesis is to compare variable selection algorithms in order to provide an experimental basis for the choice of algorithm. In the first phase, the algorithms most prevalent in the scientific literature were identified through a systematic literature study in the two largest reference databases: Scopus and Web of Science. The second phase consisted of constructing and implementing experimental software to compare the algorithms’ performance on 50 data sets. The selected variables were used to construct classification models whose predictive performance, along with the runtime of the selection process, was compared. The results show that embedded algorithms are superior to some degree, especially types based on decision trees. However, no algorithm is significantly superior in every context. The software and the data from the experiments can be used by practitioners to decide which algorithm should be applied to their respective problems.
author	Anette, Kniberg; Nokto, David
author_facet	Anette, Kniberg; Nokto, David
author_sort	Anette, Kniberg
title	A Benchmark of Prevalent Feature Selection Algorithms on a Diverse Set of Classification Problems
title_short	A Benchmark of Prevalent Feature Selection Algorithms on a Diverse Set of Classification Problems
title_full	A Benchmark of Prevalent Feature Selection Algorithms on a Diverse Set of Classification Problems
title_fullStr	A Benchmark of Prevalent Feature Selection Algorithms on a Diverse Set of Classification Problems
title_full_unstemmed	A Benchmark of Prevalent Feature Selection Algorithms on a Diverse Set of Classification Problems
title_sort	benchmark of prevalent feature selection algorithms on a diverse set of classification problems
publisher	KTH, Medicinteknik och hälsosystem
publishDate	2018
url	http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-228614
work_keys_str_mv	AT anettekniberg abenchmarkofprevalentfeatureselectionalgorithmsonadiversesetofclassificationproblems AT noktodavid abenchmarkofprevalentfeatureselectionalgorithmsonadiversesetofclassificationproblems AT anettekniberg benchmarkofprevalentfeatureselectionalgorithmsonadiversesetofclassificationproblems AT noktodavid benchmarkofprevalentfeatureselectionalgorithmsonadiversesetofclassificationproblems
_version_	1718707735970709504

Similar Items