10:10 AM
11:00 AM
Bioinformatics methods often decompose variable-length sequence data into short fixed-length strings called k-mers. K-mer-based methods are widely used in bioinformatics, but there are many gaps in our understanding of their statistical properties. In this talk, we will derive some of their basic properties as well as the bias and prediction intervals associated with k-mer-based estimators. In the first part of the talk, we consider a simple model in which a sequence S (e.g. a genome or a read) undergoes a mutation process whereby each nucleotide is mutated independently with some probability r. How does this process affect the k-mers of S? We use Stein's method and derive the expectation and variance of several random variables in this model. We then derive prediction and confidence intervals for r given observed values of these random variables. In the second part of the talk (time permitting), we look at the accuracy of the k-mer-based minimizer sketch and its use in estimating the Jaccard similarity between two sequences. We show that this estimator is inconsistent by deriving an analytical formula for its bias. We show both theoretically and experimentally that there are families of sequences where the bias can be substantial. Theoretical results will be complemented by experimental evaluations. This is joint work with Mahdi Belbasi, Antonio Blanca, Bob Harris, and David Koslicki, based on two recent papers.
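
To make the mutation model from the first part concrete, here is a minimal simulation sketch. It is not the speakers' code, and all function names are hypothetical. It assumes that a "mutation" is a substitution to a different nucleotide, and it counts a k-mer as mutated if at least one of its k positions was mutated. Under independent per-nucleotide mutation with probability r, a given k-mer is left untouched with probability (1 - r)^k, so the expected fraction of mutated k-mers is 1 - (1 - r)^k. Inverting that relation gives a naive point estimate of r; the prediction and confidence intervals discussed in the talk are not reproduced here.

```python
import random


def mutate(seq, r, rng):
    """Substitute each nucleotide independently with probability r.

    Assumption: a mutation always replaces the base with a different one.
    Returns the mutated string and a list of per-position mutation flags.
    """
    out, flags = [], []
    for base in seq:
        hit = rng.random() < r
        if hit:
            base = rng.choice([b for b in "ACGT" if b != base])
        out.append(base)
        flags.append(hit)
    return "".join(out), flags


def count_mutated_kmers(flags, k):
    """Count k-mer windows that contain at least one mutated position."""
    n = len(flags)
    if n < k:
        return 0
    in_window = sum(flags[:k])  # mutated positions in the first window
    count = 1 if in_window > 0 else 0
    for i in range(k, n):
        # Slide the window one position to the right.
        in_window += flags[i] - flags[i - k]
        if in_window > 0:
            count += 1
    return count


def expected_mutated_fraction(r, k):
    """P(a given k-mer is mutated) = 1 - (1 - r)^k under the model."""
    return 1.0 - (1.0 - r) ** k


def estimate_r(n_mutated, n_kmers, k):
    """Naive point estimate of r obtained by inverting the expectation."""
    return 1.0 - (1.0 - n_mutated / n_kmers) ** (1.0 / k)


if __name__ == "__main__":
    rng = random.Random(0)
    k, r = 21, 0.05
    seq = "".join(rng.choice("ACGT") for _ in range(200_000))
    _, flags = mutate(seq, r, rng)
    n_kmers = len(seq) - k + 1
    n_mut = count_mutated_kmers(flags, k)
    print("observed fraction:", n_mut / n_kmers)
    print("expected fraction:", expected_mutated_fraction(r, k))
    print("estimated r:", estimate_r(n_mut, n_kmers, k))
```

Because neighbouring k-mers overlap in k - 1 positions, their mutation indicators are strongly dependent. The count of mutated k-mers is therefore not binomial, which is why its variance and the resulting intervals need a separate derivation.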