Benjamin Shaby, assistant professor of statistics, and Daisy Philtron, assistant research professor of statistics.
A pair of Penn State professors will conduct a study, funded by the National Science Foundation (NSF), that combines different sources of information to identify genes involved in disease progression more efficiently. Ultimately, discovering genetic modifiers of disease in the human genome may help to further precision medicine.
Benjamin Shaby, assistant professor of statistics and Penn State Institute for CyberScience faculty co-hire, and Daisy Philtron, assistant research professor of statistics, have received funding for a collaborative grant with a lab at the University of California, San Francisco. The award totals $1 million, of which Penn State is receiving $250,000.
“Genetics can play a key role in understanding the causes of and developing treatments for some diseases,” Shaby said. “Studying a single type of genetic information usually results in very poor ability to detect weak signals. We will develop new models that can study several types of genetic information simultaneously.”
The duo hopes that combining different sources of information will enhance their ability to identify genetic modifiers. For example, genome-wide association studies (GWAS) take the entire coding region of the DNA of study participants who have a particular phenotype or disease and compare that with the DNA of control participants.
This approach has had some success; however, genetic variants identified through GWAS usually explain only a small fraction of a disease’s known heritability, and the approach has a poor record of finding disease-causing variants.
“Our approach will allow for integration of disparate data types such as microarray data, genome-wide association data, and pedigree data,” Philtron said.
Shaby and Philtron’s project aims to develop tools to combine GWAS with other sources of data, such as family-based genetic studies that identify important heritable variants and transcriptomic studies that measure differences in gene expression, to find genetic modifiers that could be missed using GWAS alone.
They will use Parkinson’s disease as their model disease, but the tools they develop will be applicable to any heritable complex phenotype.
The project will identify these genes by combining information across different experimental types using statistical tools called hierarchical models. Their models hypothesize that each gene in a person’s genome belongs to one of three groups: a “null group” that is not associated with disease progression; a “deleterious group” that is associated with negative disease outcomes; or a “beneficial group” that is associated with positive disease outcomes. The members of each group will tend to influence the various types of experimental measurements in similar ways.
The three-group structure has two key features. First, unlike traditional methods, it automatically accounts for multiplicity, which means that the results remain valid even though thousands of genes are being analyzed simultaneously. Second, it allows the information to be shared across various kinds of experiments, meaning that the results of the different experiments are mutually reinforcing.
Taken together, these two features have the potential to result in enhanced power to detect weak signals and, at the same time, produce few false positive results. Furthermore, the modular structure of the model design means that it would easily accommodate future types of experimental outcomes, should they become available.
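As a rough illustration of the three-group idea, and not the researchers’ actual model, the sketch below assumes that each gene yields one z-score per experiment type, that z-scores are normally distributed around a group-specific mean, and that experiments are conditionally independent given a gene’s group. The prior weights, effect sizes and function names are all hypothetical.

```python
import math

# Hypothetical prior probabilities for the three groups (assumed, not from the study).
# A large weight on the null group is what guards against false positives when
# thousands of genes are scored at once.
PRIORS = {"null": 0.90, "deleterious": 0.05, "beneficial": 0.05}

# Hypothetical mean z-score that each group produces in every experiment type.
GROUP_MEANS = {"null": 0.0, "deleterious": 2.0, "beneficial": -2.0}


def normal_pdf(x, mean, sd=1.0):
    """Density of a normal distribution evaluated at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def group_posterior(z_scores):
    """Posterior group probabilities for one gene given its z-scores.

    z_scores holds one value per experiment type (for example GWAS and
    gene expression), treated as conditionally independent given the group.
    """
    weights = {}
    for group, prior in PRIORS.items():
        likelihood = 1.0
        for z in z_scores:
            likelihood *= normal_pdf(z, GROUP_MEANS[group])
        weights[group] = prior * likelihood
    total = sum(weights.values())
    return {group: w / total for group, w in weights.items()}


# One weak signal alone versus two weak signals that agree.
single = group_posterior([1.8])
combined = group_posterior([1.8, 1.9])
print(single["deleterious"], combined["deleterious"])
```

In this toy setting, a gene with a modest signal in one experiment is still most likely null, while a second experiment pointing the same way raises its posterior probability of belonging to the deleterious group. In a full hierarchical model the group weights and effect sizes would be estimated from the data rather than fixed in advance.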
“The work is exciting because of its inherent flexibility to incorporate new data types as they become available,” Philtron said. “We hope that our integrated analysis will detect important signals that may be missed in analyses of individual data types.”
This project, titled “Combining Heterogeneous Data Sources to Identify Genetic Modifiers of Diseases,” will span five years. Shaby and Philtron will fit their statistical models with computationally intensive algorithms called Markov chain Monte Carlo, applied to data that come with stringent privacy and security requirements. To make this possible, they will use the ICS Advanced CyberInfrastructure, Penn State’s supercomputer.
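For readers unfamiliar with Markov chain Monte Carlo, the following is a minimal sketch of one common variant, a random-walk Metropolis sampler, applied to a toy problem: estimating a single mean from three made-up observations. The target model, step size and function names are assumptions chosen for illustration; the project’s models are far larger and may use different samplers.

```python
import math
import random


def log_posterior(mu, data, prior_sd=10.0):
    """Log posterior (up to a constant) for a normal mean with unit variance."""
    log_prior = -0.5 * (mu / prior_sd) ** 2
    log_likelihood = sum(-0.5 * (x - mu) ** 2 for x in data)
    return log_prior + log_likelihood


def metropolis(data, n_steps=5000, step_sd=0.5, seed=0):
    """Draw samples of mu with a random-walk Metropolis sampler."""
    rng = random.Random(seed)
    mu = 0.0
    current = log_posterior(mu, data)
    samples = []
    for _ in range(n_steps):
        proposal = mu + rng.gauss(0.0, step_sd)
        candidate = log_posterior(proposal, data)
        # Accept the proposal with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            mu, current = proposal, candidate
        samples.append(mu)
    return samples


toy_data = [1.0, 2.0, 3.0]          # made-up observations
draws = metropolis(toy_data)[1000:]  # discard the first 1,000 draws as burn-in
posterior_mean = sum(draws) / len(draws)
print(posterior_mean)
```

The sampler needs many sequential evaluations of the model to produce a usable set of draws, which is one reason fitting large hierarchical models this way is computationally demanding.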
“Our goal is to develop powerful, generalizable tools to help identify which genes play a role in disease development,” Shaby said. “This project will hopefully lead to further understanding of the biological basis of Parkinson’s disease and potentially to therapeutic targets for drug development. These tools will also be useful for other researchers who are studying other heritable diseases.”