3:30 PM
4:30 PM
Abstract
Identifying patterns and structure within a dataset is a challenging problem in unsupervised learning, in part because there is no gold standard against which performance can be assessed. The concept of “stability” has been used as a surrogate for performance, primarily in data clustering, and has been defined in a number of ways. Measures of stability capture the quality and reproducibility of a clustering. In this talk, I will introduce an approach to cluster stability that relies on bootstrapped clustering of the data and the Jaccard distance. A distinguishing feature of this approach is that stability can be measured and summarized at the level of the individual items being clustered and of the clusters themselves, and can also be used for model selection (choosing the number of clusters). Recent extensions of this framework to community detection in undirected graphical models will be described. Applications include a metabolomics dataset from the Beijing Olympics Air Pollution (BoaP) study. These approaches are implemented in the “bootcluster” package for the R programming language.
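As a rough illustration of the general idea of bootstrap-based, item-level stability, the sketch below clusters a reference dataset, re-clusters bootstrap resamples, and averages, for each item, the Jaccard similarity (one minus the Jaccard distance) between the cluster containing that item in the reference clustering and in each bootstrap clustering. This is not the “bootcluster” implementation or its API, and it may differ from the method presented in the talk. It is written in Python rather than R, and the function names (`kmeans_1d`, `nearest`, `jaccard`, `item_stability`), the toy one-dimensional k-means, and the way sets are restricted to the resampled items are all assumptions made for illustration.

```python
import random


def kmeans_1d(values, k, iters=50):
    """Toy 1-D k-means (Lloyd's algorithm); centers start at evenly spaced quantiles."""
    s = sorted(values)
    centers = [s[(2 * j + 1) * len(s) // (2 * k)] for j in range(k)]
    for _ in range(iters):
        labels = [nearest(v, centers) for v in values]
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = sum(members) / len(members)
    return centers


def nearest(v, centers):
    """Index of the center closest to v."""
    return min(range(len(centers)), key=lambda j: abs(v - centers[j]))


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b|; the Jaccard distance is 1 minus this."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def item_stability(values, k, n_boot=100, seed=0):
    """Mean Jaccard agreement per item between reference and bootstrap clusterings."""
    rng = random.Random(seed)
    n = len(values)
    ref_centers = kmeans_1d(values, k)
    ref = [nearest(v, ref_centers) for v in values]
    total = [0.0] * n
    seen = [0] * n
    for _ in range(n_boot):
        # Resample item indices with replacement and re-cluster the resample.
        draw = [rng.randrange(n) for _ in range(n)]
        centers = kmeans_1d([values[i] for i in draw], k)
        items = sorted(set(draw))  # distinct items present in this resample
        boot = {i: nearest(values[i], centers) for i in items}
        for i in items:
            # Compare the item's cluster in both clusterings, restricted to resampled items.
            ref_set = {j for j in items if ref[j] == ref[i]}
            boot_set = {j for j in items if boot[j] == boot[i]}
            total[i] += jaccard(ref_set, boot_set)
            seen[i] += 1
    return [t / c if c else float("nan") for t, c in zip(total, seen)]


if __name__ == "__main__":
    data = [0.0, 0.1, 0.2, 0.3, 10.0, 10.1, 10.2, 10.3]
    for value, score in zip(data, item_stability(data, k=2)):
        print(value, round(score, 3))
```

Under these assumptions, averaging the per-item scores within each reference cluster would give a cluster-level summary, and averaging over all items for several values of `k` would give one possible criterion for comparing numbers of clusters.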