
Foundational Questions in Machine Learning: How to Compare Distributional Data

23 January 2026
Jia Li consults with graduate students Sebastian Pena (middle) and Evgenii Kuriabov. Credit: Michelle Bixby

At its core, machine learning, whether supervised or unsupervised, helps identify patterns in data. To compare data points mathematically, an algorithm often relies on a notion of distance between them.

“When we talk about distance between two points, for example in inches or dollars or degrees of temperature, we usually think about multidimensional measurements that are treated as vectors—with both magnitude and direction—in what statisticians call the Euclidean space,” said Jia Li, professor of statistics. “But what happens when, instead of single data points, we are comparing groups of data points, or measuring the difference in how they are distributed? One increasingly popular measure of distance between distributions is called Wasserstein distance.”
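For readers curious to see the idea in code, here is a minimal Python sketch of the Euclidean distance Li describes; the NumPy library and the toy measurements are illustrative choices, not drawn from her work.

```python
import numpy as np

# Hypothetical measurements for two children:
# (height in inches, weight in pounds, age in years)
child_a = np.array([52.0, 70.0, 9.0])
child_b = np.array([48.0, 62.0, 8.0])

# Euclidean distance between the two measurement vectors
distance = np.linalg.norm(child_a - child_b)
print(f"Euclidean distance: {distance:.2f}")
```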

The original mathematical problems motivating the metric date back to eighteenth-century France and deal with the optimization of transporting something of interest from multiple source locations to multiple destinations. The exploration of this “optimal transport” problem in different contexts has earned the metric several names, including the Kantorovich-Rubinstein metric, Mallows distance in statistics, and since the late 1990s, the Earth Mover’s Distance in computer vision. In 1969, Leonid Vaserstein, professor emeritus of mathematics at Penn State, explored the idea in the context of mathematical game theory, earning the metric the name Wasserstein distance based on an alternate spelling of his name.

“Say you measured a variety of characteristics from 100 children in different geographic regions, including their height,” Li said. “One region might have a lot of children of similar heights except for a few very tall kids, while another region might have a larger variation in this trait but without any extreme cases. When it comes to height, you could represent each region by one number, such as the average, but important distributional information can be lost. Additionally, a very small number of particularly tall or particularly short children—statistical outliers—can skew that number.”
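A short sketch can make this concrete. The hypothetical Python example below, using NumPy and SciPy's wasserstein_distance function, builds two invented height samples whose averages nearly coincide but whose shapes differ, a difference the Wasserstein distance still registers.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)

# Region A: mostly similar heights (inches) plus a few very tall kids
region_a = np.concatenate([
    rng.normal(loc=50.0, scale=1.0, size=95),
    rng.normal(loc=62.0, scale=1.0, size=5),
])

# Region B: larger variation in height but no extreme cases
region_b = rng.normal(loc=50.6, scale=3.0, size=100)

# The averages are nearly identical...
print(f"mean A: {region_a.mean():.1f}, mean B: {region_b.mean():.1f}")

# ...but the Wasserstein distance still detects the difference in shape
print(f"Wasserstein distance: {wasserstein_distance(region_a, region_b):.2f}")
```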

According to Li, several metrics exist for comparing distributions of data. Because calculating the Wasserstein distance requires solving an optimal transport problem and is computationally intensive, an alternative called Kullback-Leibler (KL) divergence has been widely used in many machine learning algorithms.

“As computational power in terms of both hardware and software has improved, the Wasserstein distance has become more popular over the last 15 years,” Li said. “It has several advantages over KL divergence and is easy to interpret. The metric used by a machine learning algorithm also profoundly affects the final model.”
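The contrast is easy to see in a small example. In the hypothetical sketch below, built on invented histograms with SciPy, the KL divergence is infinite whenever one distribution has mass where the other has none, and it cannot distinguish a small shift from a large one, while the Wasserstein distance grows with how far the mass has to travel.

```python
import numpy as np
from scipy.stats import entropy, wasserstein_distance

# Two-spike histograms over ten bins, identical in shape but shifted
bins = np.arange(10)
p      = np.array([0.0, 0.5, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
q_near = np.array([0.0, 0.0, 0.0, 0.5, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0])
q_far  = np.array([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5, 0.5, 0.0])

# KL divergence is infinite whenever q has zero mass where p has mass,
# so it cannot tell a small shift from a large one here.
print(entropy(p, q_near), entropy(p, q_far))        # inf inf

# The Wasserstein distance grows with how far the mass must move.
print(wasserstein_distance(bins, bins, p, q_near))  # 2.0
print(wasserstein_distance(bins, bins, p, q_far))   # 6.0
```

This transport reading, the cost of moving mass from one arrangement to another, is what makes the metric easy to interpret, echoing Li's point.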

Li originally became interested in comparing distributions 20 years ago as a way to assess the similarity of images.

“To capture the rich information in images, we represented each by a distribution of pixel-level characteristics, and that’s where the Wasserstein distance comes into play,” she said. “I developed algorithms to cluster images using the Wasserstein distance, and a core challenge in this task is the computation of the Wasserstein barycenter, which is a type of center or average of a set of distributions.”
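In one dimension, the barycenter Li mentions happens to have a closed form: its quantile function is the pointwise average of the input quantile functions. The sketch below applies that special case to invented samples; in higher dimensions no such shortcut exists, and computing the barycenter becomes a genuine optimization problem, which is why it is a core challenge.

```python
import numpy as np

rng = np.random.default_rng(1)

# Three toy one-dimensional samples standing in for three distributions
samples = [
    rng.normal(0.0, 1.0, size=500),
    rng.normal(4.0, 1.5, size=500),
    rng.normal(8.0, 0.5, size=500),
]

# In 1-D, the 2-Wasserstein barycenter's quantile function is the
# pointwise average of the input quantile functions.
levels = np.linspace(0.0, 1.0, 201)
quantiles = np.array([np.quantile(s, levels) for s in samples])
barycenter_quantiles = quantiles.mean(axis=0)

# The barycenter's median lands near the average of the three means
print(f"barycenter median: {barycenter_quantiles[100]:.2f}")  # about 4.0
```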

More recently, Li has investigated how to extend the standard framework of optimal transport to help analyze genomic data with heterogeneity, such as inconsistencies that might arise from different data sources. She is also exploring how to develop dimension reduction techniques under the Wasserstein metric to address statistical challenges in large datasets.

“Dimension reduction is a well-explored area of research, but most research efforts have focused on vector data, and much less attention has been given to distributional data,” Li said. “In general, the optimization problem is more difficult because the calculation of the Wasserstein distance itself requires a level of optimization.”

Looking to the future, Li anticipates that AI will become increasingly embedded in scientific inquiry, particularly in areas where data are complex and heterogeneous, a mode of research referred to by some as “vertical AI.” Developing frameworks that can characterize differences between distributions will be an important part of this evolution.

“Universities with strong interdisciplinary environments provide an ideal setting for this type of work,” she said. “Much like electricity and computing, AI is on a path to becoming part of the basic infrastructure of science and society, shaping the ways researchers analyze data, generate insights, and pursue new questions.”


Editor’s Note: This story is part of a larger feature about artificial intelligence developed for the Winter 2026 issue of the Eberly College of Science Science Journal.