Data Deluge | Eberly College of Science

Whittney Gould

1 June 2016

Big Data is a ubiquitous term in today’s society. With modern technology seamlessly incorporated into so many aspects of human life, the possibilities for extracting information about our habits, our health, and the many other events of our daily lives are huge. But how can academia, research, and industry learn to harness the power of the large sets of data we’re now able to collect?

Whether it’s helping to analyze large sets of data for research or educating the next generation of students, scientists in the Eberly College of Science are unlocking the potential of new opportunities in the era of big data.

New Data Sciences Major

Traditionally, many researchers analyzed small data sets using a single computer to help industry and academia make decisions. The questions researchers could address scientifically were constrained by limited streams of information: how many units were sold, for example, or during what time of year. Often, an analyst could tell that something was happening, but could not explain why it was happening or predict how to make it happen again.

Recent advances in technology have led to much more voluminous and rapid data collection, whether part of carefully designed scientific experiments or unstructured data mined from social media. Thanks to concurrent improvements in computing power and algorithms, we’re able to use these large data sets in new ways.

For example, instead of just being able to calculate what is selling and when, we are now able to extract information about customers and identify correlations in customer choices to better meet the needs of each customer. This use of big data can be found pretty much anywhere on the Internet, like in the way that Amazon and Netflix recommend other products or movies that you might be interested in based on both information you provide and patterns in the behavior of other users.
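The idea of recommending items from patterns in other customers' behavior can be sketched in a few lines. The example below is a minimal illustration only, not the method Amazon or Netflix actually uses: it counts how often pairs of items appear in the same purchase history and suggests the most frequent companions. The purchase data and the function names `co_occurrence` and `recommend` are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase histories: one set of items per customer.
baskets = [
    {"tent", "sleeping bag", "lantern"},
    {"tent", "sleeping bag"},
    {"tent", "lantern"},
    {"novel", "lantern"},
]


def co_occurrence(baskets):
    """Count how often each ordered pair of items shares a basket."""
    counts = Counter()
    for basket in baskets:
        for a, b in combinations(sorted(basket), 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1
    return counts


def recommend(item, baskets, top_n=2):
    """Return up to top_n items most often bought together with `item`."""
    counts = co_occurrence(baskets)
    partners = {b: n for (a, b), n in counts.items() if a == item}
    # Highest count first; alphabetical order breaks ties deterministically.
    ranked = sorted(partners, key=lambda b: (-partners[b], b))
    return ranked[:top_n]


print(recommend("tent", baskets))
```

Real recommendation systems work with far larger data sets and more sophisticated models, but the underlying principle of learning from correlations in many customers' choices is the same.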

This new influx of information and endless potential for analysis has opened the door to a whole new career field in data science.

A new intercollege undergraduate major at Penn State has been created to address the need for education in this area. The new Data Sciences major combines expertise from three colleges to address the many facets of the new field of data science that has quickly emerged over the last few years.

The new major, which launches this summer, is a partnership among the College of Information Sciences and Technology, the College of Engineering, and the Eberly College of Science. While partnering across colleges and units is common for graduate programs at the University, it’s much less common on the undergraduate side, particularly among three colleges as equal partners. But this new discipline truly requires substantial expertise and input from all three colleges, said David Hunter, head of the Department of Statistics at Penn State.

“There’s a huge need for people who can do this,” he said. “Everybody knows that there’s a deluge of data out there, but to make sense of it requires skills that don’t fit well into the University’s existing majors.”

Dean Douglas Cavener of the Eberly College of Science agrees with Hunter about the major’s importance. “Our world is overwhelmed with massive amounts of data that can potentially address global challenges,” he said. “To employ those data in solving problems and advancing society requires people with high-level skills in data analysis, and this is why our new major in Data Sciences is so important and timely.”

Students will learn the technical fundamentals associated with data sciences and the skills needed to manage and analyze large-scale data to address an expanding range of problems in industry, government, and academia. In addition to learning computer programming, data wrangling, statistical analysis techniques, and algorithms, students in the major will address ethics in the era of big data.

“Privacy is a major concern, and not just in cases where there is a federal law to back it up,” said Hunter. “At what point do people start to get concerned that you know too much about their online habits? We want our students to be aware of these issues if they are going to be the experts in analyzing data.”

The program will hire a full-time faculty member in data ethics so that students will have a resource on campus to teach them about this important aspect of the job of a data scientist.

The need for increased computer science expertise is growing rapidly due to the massive size of today’s available data sets: “On the one hand, we’ve got a computer science program that trains students to do the coding side of dealing with data, which is substantial. Sometimes data sets are so large in their raw form that just dealing with files that size requires computer science expertise,” said Hunter.

Working with big data requires a different perspective than many statisticians are used to. “As statisticians we’re sort of coming at it from the other side. We have always talked about how to analyze data, but we have traditionally dealt with small data sets by today’s standards and we haven’t been preparing our students well enough to handle the computing side of things,” Hunter said.

The College of Information Sciences and Technology approached the other two colleges to propose a well-rounded approach to teaching data sciences for this interdisciplinary major.

Recognizing the importance of collaboration with other disciplines is a key aspect of the program. “This is a real strength of Penn State’s,” said Hunter. “We in Statistics have been involved in this conversation with IST and Computer Science from day one. In a lot of other places, the feeling is that one area owns data sciences, which hasn’t been the case here. This broader conversation makes our program well positioned to tackle future challenges.”

The major will feature three options for students to choose from: Applied Data Sciences for those who want to focus on applications to science and industry, Computational Data Sciences for those who want to explore more of the computational challenges, and Statistical Modeling Data Sciences for those who are more interested in statistical modeling.

Where could students find jobs with a Data Sciences degree? “I don’t think I could tell you realistically where you couldn’t work as a data scientist,” said Hunter. “In most fields, there will be a need to analyze data.”

Other units at Penn State are also starting to recognize the importance of big data and to incorporate it into their curricula, among them the Department of Political Science, with its Social Data Analytics major; the Smeal College of Business; and Penn State Great Valley, with its certificate program in Data Analytics.

“It’s a huge movement, and the new major is just one outgrowth of it,” Hunter said.

Early Adopters: Astronomy and Physics

While the University is just now officially creating a full major in data sciences, some scientific fields have been training students in this type of research for years.

“In fields like astronomy and astrophysics, we have been training students to be data scientists since before that term had been invented,” said Eric Ford, professor of astronomy and astrophysics.

Astronomy and physics research has been garnering large data sets for years now, especially from large-scale projects like the Sloan Digital Sky Survey, which has had an enormous impact on the field, leading to more publications and citations than the Hubble Space Telescope. The Sloan Digital Sky Survey has ushered in a new wave of survey science, as astronomers learn to search through large databases to find patterns in the data, new classes of objects, and rare objects that don’t fit in the existing categories. Penn State scientists have participated in the planning and execution of the Sloan Digital Sky Survey since 1994.

Today, Penn State is also a partner in the next generation of surveys, including the Hobby-Eberly Dark Energy Experiment and the Habitable Zone Planet Finder surveys to start in the coming year. Experience with these will prepare astronomers to make use of data from the Large Synoptic Survey Telescope, scheduled to start surveying the sky in 2023.

“Traditionally, astronomers spend lots of effort to develop instruments and collect data, yet often apply relatively simple statistical analyses or computational methods,” Ford explained. “As scientists collect scientific data more rapidly, it becomes essential that we be able to work with the resulting large and complex data sets.”

Ford uses his own research as an example. He focuses on interpreting observations of extrasolar planets to draw accurate inferences about the planets, the planetary systems they reside in, and the abundance of planets in our galaxy. He uses advanced statistical methods and dozens to thousands of computer cores to analyze observations from a variety of astronomical observatories, including ground-based telescopes like the Hobby-Eberly Telescope and space telescopes such as NASA’s Kepler mission. By applying data science techniques, his group has advanced the state of the art in searching for small planets and measuring their masses and densities to make inferences about their composition and formation.

“My research aims to enhance the science return of astronomical observatories and surveys by using the most powerful statistical and computational techniques available to analyze the data,” he said.

This translates to the education side of things, too. As astronomy students conduct their research, they have to learn many data science skills that are typically not in the core curriculum of any major, and they were doing so long before the term “data science” was even coined.

“In the past, it could be a real challenge to explain to a company why a student trained in analyzing astronomical observations would be a great asset to their organization. As industry and government agencies have begun to embrace data-driven decision making, ‘data science’ became a compact way to describe the complex set of skills that our students develop and that employers are looking for,” said Ford.

Astronomy students feel the benefits of this training when it’s time to apply for a job. “We train students to develop a combination of statistics and computational tools that makes their domain expertise even more valuable,” Ford said.

The Galaxy Project

While data science has been part of astronomy and physics for a long time, the story in biology has been quite different.

“Data sciences have existed for a long time, but they didn’t extend to biology much,” said Anton Nekrutenko, professor of biochemistry and molecular biology.

Then, around 2005, it became possible to sequence DNA on a very large scale, which meant that biologists were suddenly faced with the types of large data sets that scientists in astronomy and physics were working with. Since these large data sets were a new challenge for biologists, no one really knew how to handle them.

“For the first time, biology became a data-driven science. The idea of Galaxy was to simplify large-data logistics for biologists through the web,” Nekrutenko said. “You upload your data somewhere and you analyze it through a web interface.”

The project he refers to, The Galaxy Project, is data analysis software created by Nekrutenko and collaborator James Taylor of Johns Hopkins University. Nekrutenko is a biologist and Taylor a computer scientist by trade, so their combined skill sets brought a new perspective to the idea of data analysis in the life sciences. This combination of life sciences knowledge and computational skills is now referred to as bioinformatics, a field in which both Nekrutenko and Taylor now work.

Galaxy uses a cloud-computing approach to data analysis, so that software, data, and computer hardware infrastructure can be accessed from any location in the world rather than being tied to any single physical location. This makes it easy for a project to accommodate international collaborations.

Galaxy’s tagline, “Data intensive biology for everyone,” clearly states its mission. “The software is freely available to everyone, so they can take it and run their own Galaxies,” Nekrutenko said.

And by everyone, he means researchers in fields beyond his own. Galaxy may have started with a focus on biology and the life sciences, but it now also serves fields like sociology, economics, astronomy, natural language processing, and climate research.

Whether users install Galaxy locally or use the web-based software, Galaxy is open source and will stay that way, according to Nekrutenko.

“There is no business model here. People should be able to analyze data using best practices, for free,” he said. “But our goal is not the freeness of the analysis, it’s that the analysis that you conduct today can be repeated in five years, so that when you publish your paper, the results are reproducible. One of the biggest challenges in biology is that the analyses in most of the papers you read cannot be reproduced. Data analysis should be reproducible.”

Galaxy records all analysis steps, settings, inputs, and outputs. Data can be organized visually in a variety of chart or graph forms. The information stays on the Galaxy server, and the analysis can be run again if the scientist needs it. Galaxy tries to serve as many types of analyses as possible, and even offers an app store of sorts, called the Galaxy Tool Shed, with utilities and add-ons. Galaxy’s community of users creates many of these utilities and add-ons.
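To make the idea of a recorded, repeatable analysis concrete, here is a minimal sketch. It is not Galaxy's code or API; the tool names, the `run_step` and `replay` functions, and the record layout are all hypothetical. Each step logs the tool used, its settings, and fingerprints of its input and output, so the same analysis can later be rerun and checked against the original.

```python
import hashlib
import json


def uppercase(text):
    return text.upper()


def take_prefix(text, length=3):
    return text[:length]


# Hypothetical registry of analysis tools; these are not Galaxy tools.
TOOLS = {"uppercase": uppercase, "take_prefix": take_prefix}


def digest(text):
    """Short fingerprint of a string, used to compare inputs and outputs."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]


def run_step(history, tool, data, **settings):
    """Run one tool and append a record of what was done to the history."""
    output = TOOLS[tool](data, **settings)
    history.append({
        "tool": tool,
        "settings": settings,
        "input": digest(data),
        "output": digest(output),
    })
    return output


def replay(history, data):
    """Rerun every recorded step and verify each result matches the record."""
    for step in history:
        if digest(data) != step["input"]:
            raise ValueError("input differs from the recorded run")
        data = TOOLS[step["tool"]](data, **step["settings"])
        if digest(data) != step["output"]:
            raise ValueError("output differs from the recorded run")
    return data


history = []
result = run_step(history, "uppercase", "acgtacgt")
result = run_step(history, "take_prefix", result, length=3)

# The history is plain data, so it can be saved and reloaded later.
saved = json.dumps(history)
print(replay(json.loads(saved), "acgtacgt"))
```

Because the history holds only plain data, it can be stored alongside a paper and replayed years later, which is the kind of reproducibility Nekrutenko describes.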

It seems that scientists find the site useful, as currently Galaxy’s main server boasts 80,000 registered users worldwide and performs an average of 250,000 analyses per month. Galaxy has been directly cited in scientific papers more than 3,000 times since Galaxy Project staff began tracking the citations in 2011.

Galaxy has even spawned its own online and in-person user communities, with workshops and meetups taking place all over the world. The 2016 Galaxy Community Conference, an annual gathering of Galaxy enthusiasts organized by Galaxy Project staff, will take place later this month. 2016 marks the conference’s seventh year of bringing Galaxy staff and users together.

The Future of Data in Science

Regardless of their discipline, scientists in the Eberly College of Science agree that big data is here to stay. Learning to analyze the large amounts of data collected in a specific field is the future.

“It’s the new norm. If you want to publish the best papers, you have to take advantage of the available data, no matter what your field, whether you’re in biology or social sciences,” said Nekrutenko.

“Increasingly, we’re able to automate the data collection process, making it practical to collect much more data than before. If we want to realize the full potential of such large data sets, then we must develop statistical frameworks and computational methods that can work on ‘big data,’” said Ford. “Data science is about how to draw appropriate conclusions from data, drawing upon a combination of domain expertise, mathematical modeling, and computational tools. What could be more important for science?”