The statistics of k-mers from mutated sequences and the bias of k-mer-based Jaccard estimators
Location: 327 Thomas Building, University Park, PA
Start Date: Fri, Nov 12, 2021, 10:10 AM
End Date: Fri, Nov 12, 2021, 11:00 AM
Presented By
Paul Medvedev
Event Series: SMAC Talks

Bioinformatics methods often decompose variable-length sequence data into short fixed-length strings called k-mers. K-mer-based methods are widely used in bioinformatics, but there are many gaps in our understanding of their statistical properties. In this talk, we will derive some of their basic properties, as well as the bias and prediction intervals associated with k-mer-based estimators. In the first part of the talk, we consider a simple model in which a sequence S (e.g., a genome or a read) undergoes a mutation process whereby each nucleotide is mutated independently with some probability r. How does this process affect the k-mers of S? We use Stein's method and derive the expectation and variance of several random variables in this model. We then derive prediction and confidence intervals for r given observed values of these random variables. In the second part of the talk (time permitting), we look at the accuracy of the k-mer-based minimizer sketch and its use in estimating the Jaccard similarity between two sequences. We show that this estimator is inconsistent by deriving an analytical formula for the bias. We show both theoretically and experimentally that there are families of sequences where the bias can be substantial. Theoretical results will be complemented by experimental evaluations. This is joint work with Mahdi Belbasi, Antonio Blanca, Bob Harris, and David Koslicki, based on two recent papers.
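To make the mutation model in the first part concrete, here is a minimal simulation sketch in Python. It is not taken from the talk or the papers, and it makes several assumptions: a "mutation" is taken to be a substitution by a different nucleotide, a k-mer counts as mutated if any of its k positions was mutated, and the function names (`mutate`, `count_mutated_kmers`, `expected_mutated_fraction`, `estimate_r`) are hypothetical. Under independence, a given k-mer window is untouched with probability (1 - r)^k, so the expected fraction of mutated k-mers is 1 - (1 - r)^k. Inverting that expression gives a simple point estimate of r, which may differ from the estimators and intervals presented in the talk.

```python
import random


def mutate(seq, r, rng):
    """Substitute each nucleotide independently with probability r.

    Assumption: a mutation always replaces the base with a different one.
    Returns the mutated sequence and a per-position list of mutation flags.
    """
    out, flags = [], []
    for base in seq:
        if rng.random() < r:
            out.append(rng.choice([b for b in "ACGT" if b != base]))
            flags.append(True)
        else:
            out.append(base)
            flags.append(False)
    return "".join(out), flags


def count_mutated_kmers(flags, k):
    """Count k-mer windows that contain at least one mutated position."""
    n_kmers = len(flags) - k + 1
    return sum(1 for i in range(n_kmers) if any(flags[i:i + k]))


def expected_mutated_fraction(r, k):
    """Probability that a fixed k-mer window contains a mutation."""
    return 1.0 - (1.0 - r) ** k


def estimate_r(n_mutated, n_kmers, k):
    """Point estimate of r obtained by inverting 1 - (1 - r)^k (illustrative only)."""
    return 1.0 - (1.0 - n_mutated / n_kmers) ** (1.0 / k)


if __name__ == "__main__":
    rng = random.Random(0)
    k, r, length = 21, 0.05, 20000
    seq = "".join(rng.choice("ACGT") for _ in range(length))
    _, flags = mutate(seq, r, rng)
    n_kmers = length - k + 1
    n_mutated = count_mutated_kmers(flags, k)
    print("observed mutated fraction:", n_mutated / n_kmers)
    print("expected mutated fraction:", expected_mutated_fraction(r, k))
    print("point estimate of r:", estimate_r(n_mutated, n_kmers, k))
```

Because neighboring k-mers overlap, their mutation indicators are dependent, so the count of mutated k-mers is not binomial. That dependence is the reason the variance and the intervals for r need the more careful analysis described in the abstract.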