Revealing the Presence of a Symbolic Sequence Representing Multiple Nucleotides Based on K-Means Clustering of Oligonucleotides

In biological systems, a few sequence differences diversify the hybridization profile of nucleotides and enable the quantitative control of cellular metabolism in a cooperative manner. In this respect, the information required for a better understanding may not be in each nucleotide sequence, but re...

Bibliographic Details
Main Authors: Byoungsang Lee, So Yeon Ahn, Charles Park, James J. Moon, Jung Heon Lee, Dan Luo, Soong Ho Um, Seung Won Shin
Format: Article
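Language: English
Published: MDPI AG 2019-01-01
Series: Molecules
Subjects: representative nucleotide; hybridization profile; K-means clustering; multiple equilibria; sociogram
Online Access: http://www.mdpi.com/1420-3049/24/2/348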
id doaj-5a857145e51340d0aac152c694ec01b3
record_format Article
spelling doaj-5a857145e51340d0aac152c694ec01b3
2020-11-24T21:15:55Z
eng
MDPI AG
Molecules
1420-3049
2019-01-01
24
2
348
10.3390/molecules24020348
molecules24020348
Revealing the Presence of a Symbolic Sequence Representing Multiple Nucleotides Based on K-Means Clustering of Oligonucleotides
Byoungsang Lee [0]
So Yeon Ahn [1]
Charles Park [2]
James J. Moon [3]
Jung Heon Lee [4]
Dan Luo [5]
Soong Ho Um [6]
Seung Won Shin [7]
[0] School of Advanced Materials Science and Engineering, Sungkyunkwan University, Suwon 16419, Gyeonggi-do, South Korea
[1] School of Chemical Engineering, Sungkyunkwan University, Suwon 16419, Gyeonggi-do, South Korea
[2] Biointerfaces Institute, University of Michigan, Ann Arbor, MI 48109, USA
[3] Biointerfaces Institute, University of Michigan, Ann Arbor, MI 48109, USA
[4] School of Advanced Materials Science and Engineering, Sungkyunkwan University, Suwon 16419, Gyeonggi-do, South Korea
[5] Department of Biological and Environmental Engineering, Cornell University, Ithaca, NY 14850, USA
[6] School of Chemical Engineering, Sungkyunkwan University, Suwon 16419, Gyeonggi-do, South Korea
[7] School of Chemical Engineering, Sungkyunkwan University, Suwon 16419, Gyeonggi-do, South Korea
In biological systems, a few sequence differences diversify the hybridization profile of nucleotides and enable the quantitative control of cellular metabolism in a cooperative manner. In this respect, the information required for a better understanding may not be in each nucleotide sequence, but representative information contained among them. Existing methodologies for nucleotide sequence design have been optimized to track the function of the genetic molecule and predict interaction with others. However, there has been no attempt to extract new sequence information to represent their inheritance function. Here, we tried to conceptually reveal the presence of a representative sequence from groups of nucleotides. The combined application of the K-means clustering algorithm and the social network analysis theorem enabled the effective calculation of the representative sequence. First, a “common sequence” is made that has the highest hybridization property to analog sequences. Next, the sequence complementary to the common sequence is designated as a “representative sequence”. Based on this, we obtained a representative sequence from multiple analog sequences that are 8–10 bases long. Their hybridization was empirically tested, which confirmed that the common sequence had the highest hybridization tendency, and that the representative sequence had better alignment with the analogs than a merely complementary sequence.
http://www.mdpi.com/1420-3049/24/2/348
representative nucleotide
hybridization profile
K-means clustering
multiple equilibria
sociogram
collection DOAJ
language English
format Article
sources DOAJ
author Byoungsang Lee
So Yeon Ahn
Charles Park
James J. Moon
Jung Heon Lee
Dan Luo
Soong Ho Um
Seung Won Shin
title Revealing the Presence of a Symbolic Sequence Representing Multiple Nucleotides Based on K-Means Clustering of Oligonucleotides
publisher MDPI AG
series Molecules
issn 1420-3049
publishDate 2019-01-01
description In biological systems, a few sequence differences diversify the hybridization profile of nucleotides and enable the quantitative control of cellular metabolism in a cooperative manner. In this respect, the information required for a better understanding may not be in each nucleotide sequence, but representative information contained among them. Existing methodologies for nucleotide sequence design have been optimized to track the function of the genetic molecule and predict interaction with others. However, there has been no attempt to extract new sequence information to represent their inheritance function. Here, we tried to conceptually reveal the presence of a representative sequence from groups of nucleotides. The combined application of the K-means clustering algorithm and the social network analysis theorem enabled the effective calculation of the representative sequence. First, a “common sequence” is made that has the highest hybridization property to analog sequences. Next, the sequence complementary to the common sequence is designated as a “representative sequence”. Based on this, we obtained a representative sequence from multiple analog sequences that are 8–10 bases long. Their hybridization was empirically tested, which confirmed that the common sequence had the highest hybridization tendency, and that the representative sequence had better alignment with the analogs than a merely complementary sequence.
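The two-step construction in the description (a common sequence that hybridizes to the analogs, then its complement as the representative sequence) can be illustrated with a rough Python sketch. This is not the authors' procedure: the four 8-base analogs are hypothetical, a per-position majority vote stands in for a single K-means centroid over one-hot encoded bases, and hybridization is approximated by plain Watson-Crick complementarity rather than any thermodynamic model. The function names `centroid_sequence`, `reverse_complement`, and `mismatches` are invented for this illustration.

```python
from collections import Counter

# Watson-Crick pairing for DNA bases.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq):
    """Return the strand that would pair with seq (read 5' to 3')."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def centroid_sequence(analogs):
    """Per-position majority base across equal-length analogs.

    Assumption: this stands in for decoding one K-means centroid
    (the mean of one-hot encoded sequences) back to a base string.
    """
    length = len(analogs[0])
    if any(len(s) != length for s in analogs):
        raise ValueError("all analogs must have the same length")
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*analogs))


def mismatches(a, b):
    """Count positions where two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


# Hypothetical analog sequences, 8 bases long (not taken from the article).
analogs = ["ATGCGTAC", "ATGCGTTC", "ATGAGTAC", "TTGCGTAC"]

# Common sequence: the strand complementary to the centroid of the analogs.
common = reverse_complement(centroid_sequence(analogs))
# Representative sequence: the complement of the common sequence.
representative = reverse_complement(common)

print(common)          # GTACGCAT
print(representative)  # ATGCGTAC
print([mismatches(representative, s) for s in analogs])  # [0, 1, 1, 1]
```

In this toy case the representative sequence differs from each analog by at most one base, which is the sense in which it "represents" the group; a real analysis would need a hybridization score and more than one cluster.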
topic representative nucleotide
hybridization profile
K-means clustering
multiple equilibria
sociogram
url http://www.mdpi.com/1420-3049/24/2/348