Summary: | Recent improvements in sequencing technologies have given rise to a variety of interesting problems. The ability to obtain millions of read sequences from sequencing a genome, at a lower cost than in the microarray era, has encouraged scientists to improve earlier methods in many areas of bioinformatics. One such area is genotyping and the construction of genetic maps, which are used to study inherited genotypes and analyze specific traits in a population; it involves generating different genetic markers and identifying polymorphisms among the individuals of a population.
Presence/absence markers are the main focus of this thesis. These are a type of Restriction-site Associated DNA (RAD) marker that is present in some samples and absent in others, indicating variation in the cut site of a restriction enzyme. However, the counts of markers in an experiment are highly correlated, and calling true presence and absence is not straightforward: a marker with a zero count is not necessarily absent from the sample under study, and a marker with a non-zero count is not necessarily present. A model that fits the data well should be able to make correct calls. We propose two different contexts for designing such models as a solution to this problem and investigate their performance.
In addition, exploiting next-generation sequencing technology even more efficiently requires the ability to multiplex a large number of samples in a single experimental run. In that case, appropriate barcoding that is robust to the various sources of noise in the machine becomes paramount. Designing such barcodes efficiently is a challenging task, which is addressed in detail as the second problem of this thesis.
We make two contributions. First, we propose an algorithm for barcoding multiplexed RADSeq samples. Second, we propose an algorithm for the statistical selection of presence/absence markers on the basis of RADSeq data from two related individuals. The operating characteristics of our methods are explored using both simulated and real data.
|