New method reveals high similarity between gorilla and human Y chromosome
A new, faster, and less expensive method has been developed and used to determine the DNA sequence of the male-specific Y chromosome in the gorilla. The technique will allow better access to the genetic information of the Y chromosome of any species and thus can be used to study male infertility disorders and male-specific mutations. It can also aid conservation genetics efforts by helping to trace paternity and to track how males move within and between populations of endangered species such as gorillas.
A paper describing the method, and the discovery that resulted from using it to compare the sequence of the gorilla Y chromosome with the sequences of the human and chimpanzee Y chromosomes, will be published on March 2, 2016 in the Advance Online edition of the journal Genome Research. The article will also appear in the April 2016 print issue of the journal.
"Surprisingly, we found that in many ways the gorilla Y chromosome is more similar to the human Y chromosome than either is to the chimpanzee Y chromosome," said Kateryna Makova, the Francis R. and Helen M. Pentz Professor of Science at Penn State and one of two corresponding authors of the paper. "In regions of the chromosome where we can align all three species, the sequence similarity fits with what we know about the evolutionary relationships among the species -- humans are more closely related to chimpanzees. However, the chimpanzee Y chromosome appears to have undergone more changes in the number of genes and contains a different amount of repetitive elements compared to the human or gorilla. Moreover, a greater proportion of the gorilla Y sequences can be aligned to the human than to the chimpanzee Y chromosome."
The Y chromosome of mammals is very difficult to sequence for a number of reasons. One reason is that the Y chromosome is present in only one copy per cell and makes up only about one to two percent of the total genetic material found in a male's cell. To reduce this difficulty, the researchers used an experimental technique called flow-sorting to preferentially select the Y chromosome for sequencing based on the chromosome's size and genetic content.
"Flow-sorting increased the amount of the Y chromosome in our dataset to about thirty percent," said Paul Medvedev, assistant professor of computer science and engineering and of biochemistry and molecular biology at Penn State, the other corresponding author of the paper. "To further enrich our data for the Y chromosome, we developed a computational technique -- called RecoverY -- to sort the data into Y and non-Y sequences based on how frequently similar sequences appeared in our data."
The Y chromosome, like all DNA, is composed of a series of molecules called "bases" that are represented by the letters A, T, C, and G. Current genetic sequencing technologies produce "reads" of sequence that are much shorter than the entire length of the chromosome. These reads need to be placed in order and pieced together into longer and longer chunks by finding places where they overlap. The research team used two different sequencing technologies to help with this assembly of the DNA sequence of the Y chromosome.
One sequencing technology used by the researchers produces massive amounts of very short reads -- about 150 to 250 bases in length. Using this method, the researchers sequenced enough reads to cover the entire length of the Y chromosome about 450 times. The researchers assembled these short reads into longer chunks, which they then connected further using the second sequencing technology, which produces longer reads -- about seven thousand bases in length on average.
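Piecing reads together at their overlaps can be sketched in a few lines. The example below is a deliberately simplified illustration, not the assembly software used in the study: it assumes error-free reads and exact suffix-to-prefix matches, and the function names and the `min_overlap` parameter are hypothetical.

```python
def overlap_length(left, right, min_overlap=3):
    """Return the length of the longest suffix of left that is a prefix of right.

    Returns 0 when no overlap of at least min_overlap bases exists.
    """
    max_len = min(len(left), len(right))
    # Try the longest possible overlap first and shrink until one matches.
    for size in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge_reads(left, right, min_overlap=3):
    """Join two reads at their overlap, or return None if they do not overlap."""
    size = overlap_length(left, right, min_overlap)
    if size == 0:
        return None
    # Keep left whole and append only the part of right beyond the overlap.
    return left + right[size:]


# The two toy reads share the four bases "TGCA".
merged = merge_reads("ACGTTGCA", "TGCAGGTC")
```

Here the two reads merge into the single chunk ACGTTGCAGGTC. Real assemblers must also cope with sequencing errors and with repeats, where many reads overlap equally well in more than one place.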
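The relationship between read count, read length, and how many times a chromosome is covered is simple arithmetic, sketched below. The target length of one million bases is a made-up figure used only for illustration, since the length of the gorilla Y chromosome is not given here, and both function names are hypothetical.

```python
import math


def sequencing_depth(num_reads, mean_read_length, target_length):
    """Average number of times each base of the target is covered."""
    return num_reads * mean_read_length / target_length


def reads_needed(depth, mean_read_length, target_length):
    """Smallest number of reads that reaches the requested average depth."""
    return math.ceil(depth * target_length / mean_read_length)


# Hypothetical target of 1,000,000 bases, read length of 150 bases.
TARGET_LENGTH = 1_000_000  # assumption for illustration only
n_reads = reads_needed(450, 150, TARGET_LENGTH)
depth = sequencing_depth(n_reads, 150, TARGET_LENGTH)
```

Under these assumed numbers, three million 150-base reads would be needed to cover the target about 450 times.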
"By reducing non-Y chromosome reads from our data with flow sorting and the RecoverY technique that we developed, and by using this combination of sequencing technologies, we were able to assemble the gorilla Y chromosome so that more than half of the sequence data was in chunks longer than about 100,000 bases in length," said Medvedev.
Another reason that determining the genetic sequence of the Y chromosome is so difficult is that it is composed of an unusually high number of repeated sequences -- regions where the sequence of As, Ts, Cs, and Gs is identical, or nearly identical, for thousands or millions of bases in a row. Many of these repeats, including some genes, appear as back-to-back series of the same repeated sequence or as long palindromes which, like the word "racecar," read the same forward and backward. The researchers used an experimental technique -- "droplet digital polymerase chain reaction" -- to determine the number of copies of the genes that appear in these series.
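The two repeat patterns described above can be shown with short toy sequences. One detail goes beyond the "racecar" analogy: in molecular biology, a DNA palindrome usually means a sequence that equals its own reverse complement, so that the two strands read the same in opposite directions. The sketch below uses that convention. The function names are hypothetical, and counting copies in a string is only an illustration, not a model of the droplet digital PCR experiment.

```python
# Translation table mapping each base to its complementary base.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the sequence of the opposite strand, read in its own direction."""
    return seq.translate(COMPLEMENT)[::-1]


def is_dna_palindrome(seq):
    """True when seq equals its own reverse complement."""
    return seq == reverse_complement(seq)


def tandem_copy_count(seq, unit):
    """Count back-to-back copies of unit starting at the beginning of seq."""
    if not unit:
        raise ValueError("unit must be a non-empty string")
    count = 0
    # Check each successive window of len(unit) bases for another copy.
    while seq.startswith(unit, count * len(unit)):
        count += 1
    return count


palindromic = is_dna_palindrome("GAATTC")
copies = tandem_copy_count("ACGACGACGTT", "ACG")
```

GAATTC is a palindrome in this sense, and the second toy sequence begins with three back-to-back copies of ACG. On the real Y chromosome such repeats run for thousands of bases, far longer than a single read, which is why they are so hard to assemble.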
"Sequencing the Y chromosome is like trying to put together a jigsaw puzzle, without knowing the final picture, from a pile of pieces where only about one out of every hundred is useful, and most of the pieces you do need look identical," said Makova. "We've developed a pipeline for sequencing the Y chromosome that is more efficient than previous methods and reduces a number of the difficulties associated with determining the genetic sequence of the Y chromosome. Our method will open the door for studying the Y chromosome for more labs, more species, and more individuals within those species."
To demonstrate the utility of the gorilla Y chromosome sequence they generated, the researchers designed genetic markers that can be used to assess genetic relatedness among male gorillas and thus to aid conservation genetics efforts aimed at preserving this endangered species.
In addition to Makova and Medvedev, the research team includes Marta Tomaszkiewicz, Samarth Rangavittal, Monika Cechova, Rebeca Campos-Sanchez, Howard W. Fescemyer, Robert Harris, Danling Ye, and Rayan Chikhi at Penn State; Malcolm A. Ferguson-Smith and Patricia C. M. O'Brien at the University of Cambridge in the United Kingdom; and Oliver Ryder at the San Diego Zoo.
The research was funded by the National Science Foundation (award numbers DBI-ABI 0965596, DBI-1356529, IIS-1453527, IIS-1421908, and CCF-1439057); the Penn State Clinical and Translational Sciences Institute; the Pennsylvania Department of Health; the Computation, Bioinformatics, and Statistics Predoctoral Training Program funded by the National Institutes of Health and Penn State; the John and Beverly Stauffer Foundation; the Alice B. Tyler Charitable Trust; and the Leverhulme Trust.