Science Journal

Finding meaning in data

Two Penn State statisticians search for the story behind the numbers.
17 October 2018
Matthew Reimherr and Ben Shaby. Credit: Nate Follmer, Penn State.

It is used in nearly every field of research, is vital to business and industry, guides critical decisions in government, enables breakthroughs in medicine, and informs practically everything from agriculture to weather forecasting. In this modern era of omnipresent and progressively more sophisticated technology, it has been invaluable in parsing the resultant data deluge. But despite its tremendous impact, statistics remains one of science’s unsung heroes.

“I think there’s a very common misunderstanding as to what statistics actually is,” says Matthew Reimherr. “Basically, we’re trying to extract information from data, and that’s really what statistics is at the heart.”

Part of the issue is that many statisticians collaborate with scientists in other fields, so the impact of statistics shows up in those fields, though it is not necessarily recognized as statistics. Reimherr collaborates with biologists, anthropologists, and other non-statisticians to study complex human phenotypes such as facial shapes and growth trajectories, which are ultimately statistical problems that can be, as he puts it, “a bit challenging to model.”

“Then when we have, say, millions of genetic markers that we want to build into the models—all of a sudden we run into a pretty serious computational problem,” he says. “We need lots of computational power to model these high-dimensional phenotypes alongside high-dimensional predictors like genetic mutations.”

Reimherr found a solution in the Advanced CyberInfrastructure of Penn State’s Institute for CyberScience (ICS–ACI). With more than 23,000 computing cores and 20 petabytes—20 million gigabytes—of encrypted storage on a state-of-the-art high-speed network, the ICS–ACI has the computational power to analyze massive datasets and run sophisticated simulations, and its cybersecurity meets both Penn State’s and the National Institutes of Health’s stringent requirements for the human-subject research data Reimherr works with. Many other researchers at Penn State—in fields ranging from evolutionary biology to astronomy and astrophysics—use the ICS–ACI to perform computationally intensive tasks up to several thousand times faster than on the most powerful desktop system, where they could take weeks or even months to run. One of Reimherr’s colleagues, Ben Shaby, also uses the ICS–ACI for complex statistical modeling, but with applications that are practically a world apart from those of Reimherr’s work.

“There’s a tremendous diversity of interests and approaches among statisticians,” Shaby says. “The way I would approach a problem is probably completely different from how Matthew would, because we look at what it means to do statistics in very different ways. Even the sorts of computing we do and the computational strategies we use are very different.”

Shaby works with civil engineers, meteorologists, and atmospheric scientists to model the spatial structure of rare, extreme weather events. “For example,” he says, “if you’re a city planner building coastal defenses or bridges, you might care about extreme precipitation because of flood risk. And one thing you might want to know is what’s the highest-intensity storm you’re likely to see. But knowing about such an event in a single location doesn’t really tell you all you need to know. What you really need to know is how much rain is going to fall in a drainage basin or a catchment.”

To accurately describe such events, Shaby combines a number of probability models that collectively approximate the larger scenario and then, through a statistical process known as decomposition, renders that composite as a hierarchical series of smaller models that represent the underlying patterns—altogether a highly complex set of operations that he, like Reimherr, relies on the ICS–ACI to accomplish.

Despite the striking contrast in the scientific questions they seek to answer, at a high level Shaby and Reimherr have very similar goals and perspectives—both apply probability-based models to make population-level inferences from data—but the specific methods and statistical tools they use are on nearly opposite ends of the spectrum. One practically has to wonder how such an apparent paradox, of two scientist-colleagues simultaneously so alike and so different, came to be.

Initially, Shaby was intrigued by the dynamics of the global carbon cycle. “I found myself interested in environmental problems as an undergrad,” he says, “and it quickly became apparent to me that what was happening was a really big statistical problem and that if I wanted to be good at this, I needed to learn a whole lot more about statistics. Once I got into grad school, my focus shifted a little bit, but still on these issues of understanding the importance of spatial structure. The places where events are occurring can tell you a lot, and so that’s been a recurring theme, the spatial structure of these extremely rare events.”

Reimherr, though, was captivated by mathematics from the start and, over time, gradually migrated into statistics. “I was very interested in the math and in the methods,” he says, “and I wanted to work on problems that I thought were important, so this is what pushed me into areas like genetics. They have a combination of things: very interesting problems, very interesting data and methods. And I can work on something that I think is meaningful to society as a whole.”

These two scientists, Shaby and Reimherr, exemplify the value of statistics and statisticians to society, and there are many more like them working, often behind the scenes, on our most pressing scientific issues. But their time in the shadows is waning with each such example that comes to light. Bit by bit, it’s becoming clearer that in the age of “big data” and beyond, the progress we make on the issues facing the human race, from global warming to genetic disorders and disease, will be owed increasingly to statisticians and their contributions to our understanding of what the data are telling us. “These are difficult questions,” Shaby says, “and they’re complicated. But this is what we do: We look into problems and try to find the meaning in the data.”

Matthew Reimherr is an assistant professor of statistics at Penn State.

Ben Shaby is an assistant professor of statistics and a member of the Institute for CyberScience (ICS) at Penn State.