The extension of the largest generalized-eigenvalue based distance metric Dij(γ1) in arbitrary feature spaces to classify composite data points

Analyzing patterns in data points embedded in linear and non-linear feature spaces is a common research problem across areas such as data mining, machine learning, pattern recognition, and multivariate analysis. In this paper, data points are heterogen...


Bibliographic Details
Main Author: Mosaab Daoud
Format: Article
Language: English
Published: Korea Genome Organization 2019-11-01
Series: Genomics & Informatics
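Subjects: classification; clustering; composite data points; limiting dispersion map; linear (non-linear) transformation function; sets of sequences; statistical learning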
Online Access: http://genominfo.org/upload/pdf/gi-2019-17-4-e39.pdf
id doaj-cceab37ad68b4781a5834e861fd17797
record_format Article
spelling doaj-cceab37ad68b4781a5834e861fd17797; 2020-11-25T02:13:31Z; eng; Korea Genome Organization; Genomics & Informatics; 2234-0742; 2019-11-01; 17; 4; 10.5808/GI.2019.17.4.e39; 582; Mosaab Daoud; http://genominfo.org/upload/pdf/gi-2019-17-4-e39.pdf
collection DOAJ
language English
format Article
sources DOAJ
author Mosaab Daoud
title The extension of the largest generalized-eigenvalue based distance metric Dij(γ1) in arbitrary feature spaces to classify composite data points
publisher Korea Genome Organization
series Genomics & Informatics
issn 2234-0742
publishDate 2019-11-01
description Analyzing patterns in data points embedded in linear and non-linear feature spaces is a common research problem across areas such as data mining, machine learning, pattern recognition, and multivariate analysis. In this paper, data points are heterogeneous sets of biosequences (composite data points). A composite data point is a set of ordinary data points (e.g., a set of feature vectors). We theoretically extend the derivation of the largest generalized eigenvalue-based distance metric Dij(γ1) to any linear or non-linear feature space. We prove that Dij(γ1) is a metric under any linear or non-linear feature transformation function. We show the sufficiency and efficiency of the decision rule δΞi (i.e., the mean of Dij(γ1)) for classifying heterogeneous sets of biosequences, compared with the decision rules minΞi and medianΞi. We analyze the impact of linear and non-linear transformation functions on classifying and clustering collections of heterogeneous sets of biosequences. We also show empirically how the length of a sequence in a simulated heterogeneous sequence-set affects classification and clustering results in linear and non-linear feature spaces. We propose a new concept: the limiting dispersion map of the existing clusters in heterogeneous sets of biosequences embedded in linear and non-linear feature spaces, which is based on the limiting distribution of nucleotide compositions estimated from real data sets. Finally, empirical conclusions and scientific evidence are drawn from the experiments to support the theory presented in this paper.
topic classification
clustering
composite data points
limiting dispersion map
linear (non-linear) transformation function
sets of sequences
statistical learning
url http://genominfo.org/upload/pdf/gi-2019-17-4-e39.pdf