Summary: | Single nucleotide polymorphisms (SNPs) have become increasingly popular for
a wide range of genetic studies. High-throughput genotyping technologies
usually involve a statistical genotype calling algorithm. Most calling
algorithms in the literature, using methods such as k-means and mixture models,
rely on elliptical structures of the genotyping data; they may fail
when the minor allele homozygous cluster is small or absent, or when the
data have extreme tails or linear patterns.
We propose an automatic genotype calling algorithm by further developing
a linear grouping algorithm (Van Aelst et al., 2006). The proposed
algorithm clusters unnormalized data points around lines rather than around
centroids. In addition, we associate a quality value, the silhouette width, with
each DNA sample and with the whole plate. This algorithm shows promise
for genotyping data generated from TaqMan technology (Applied Biosystems).
A key feature of the proposed algorithm is that it applies to unnormalized
fluorescent signals when the TaqMan SNP assay is used. The
algorithm could potentially also be adapted to other fluorescence-based SNP
genotyping technologies such as the Invader Assay.
Motivated by the SNP genotyping problem, we propose a partial likelihood
approach to linear clustering which explores potential linear clusters
in a data set. Instead of fully modelling the data, we assume only that the signed
orthogonal distance from each data point to a hyperplane is normally distributed.
The relationships of this approach to several existing clustering methods are discussed.
Some existing methods to determine the number of components in a
data set are adapted to this linear clustering setting. Several simulated and
real data sets are analyzed for comparison and illustration purposes. We also
investigate some asymptotic properties of the partial likelihood approach.
A Bayesian version of this methodology is helpful if some clusters are
sparse but there is strong prior information about their approximate locations
or properties. We propose a Bayesian hierarchical approach which is
particularly appropriate for identifying sparse linear clusters. We show that
the sparse cluster in SNP genotyping data sets can be successfully identified
after a careful specification of the prior distributions.
|