As technology continues to improve—from more sensitive genetic sequencing tools to advanced telescopes that can see further into the cosmos—scientists from a wide variety of fields are increasingly inundated with data. AI is poised to help researchers deal with some of this data.
“The amount of molecular and cellular biology data is doubling every 12 to 20 months,” said Ed O’Brien, professor of chemistry and Institute for Computational and Data Sciences (ICDS) co-hire. “There are currently 350 petabytes of publicly available data—that’s 350,000 terabytes—in this area, and most of that data is used only in the original study and never reused. We believe that there are opportunities to combine and connect that data and interrogate it in new and novel ways to address a variety of important questions.”
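To give a rough sense of what that doubling rate implies, here is a minimal Python sketch that projects the 350-petabyte figure forward. It assumes steady exponential growth, which the quote does not promise. The function name `projected_volume_pb` and the five-year horizon are illustrative choices and do not come from the center.

```python
def projected_volume_pb(current_pb, doubling_months, years):
    """Project a data volume in petabytes, assuming steady exponential doubling."""
    doublings = years * 12 / doubling_months
    return current_pb * 2 ** doublings


CURRENT_PB = 350  # publicly available data today, as quoted above

# The two ends of the quoted range: doubling every 12 or every 20 months.
for doubling_months in (12, 20):
    volume = projected_volume_pb(CURRENT_PB, doubling_months, years=5)
    print(f"Doubling every {doubling_months} months: about {volume:,.0f} PB after 5 years")
```

Under these assumptions, five years of growth turns 350 petabytes into roughly 2,800 petabytes at the slow end of the range and 11,200 petabytes at the fast end.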
With the goal of synthesizing publicly available data to gain deeper and broader insights, the U.S. National Science Foundation National Synthesis Center for Emergence in the Molecular and Cellular Sciences (NCEMS) opened last year at Penn State.
“The research engines of the center are our working groups, which are collaborative teams of scientists from diverse fields from across the country and around the world coming together to address specific questions using this publicly available data,” said O’Brien, who is also the director of the center. “There are biochemists, biologists, statisticians, physicists, computational biologists, and bioinformaticians on these teams collaborating in this way.”
To date, the center has supported 10 working groups by providing both the support of staff scientists and access to cyberinfrastructure. The center will support 34 groups in its first five years. One of these working groups, led by Hyebin Song, assistant professor of statistics, and James Stephenson, lead data scientist at the European Bioinformatics Institute in Cambridge, United Kingdom, is exploring how proteins that do not fold correctly are linked to disease. The team is using interpretable machine learning, meaning models whose predictions are clear and understandable, to analyze datasets of genes with known disease associations alongside publicly available clinical datasets of human genetic mutations, and to find patterns and links between them.
“When proteins fail to fold properly, they do not carry out their intended function, which can lead to disease,” said O’Brien. “We hope to uncover potential relationships between proteins that are likely to misfold in particular ways and disease, which may reveal previously unrecognized causes of disease and provide a new perspective on their molecular origins. Ultimately, we hope this work could suggest potential therapeutic targets for interventions that could correct misfolding and thus prevent the disease.”
Dealing with astronomy data
The new U.S. National Science Foundation–Department of Energy Vera C. Rubin Observatory is poised to launch a new era of astronomy. Using the largest camera ever built, a 3,200-megapixel camera the size of a car, the Rubin Observatory will scan the entire visible sky every three to four nights. By stitching the resulting images together, the international Legacy Survey of Space and Time (LSST) collaboration, which includes Penn State astronomers, will produce the most detailed time-lapse view of the cosmos that has ever existed. The LSST will allow astronomers to investigate important questions about Earth-crossing asteroids in our solar system, the structure of the Milky Way galaxy, the evolution of supermassive black holes, and the nature of dark matter and dark energy.
“Each night, this effort will produce about 20 terabytes of data,” said W. Niel Brandt, holder of the Eberly Family Chair in Astronomy and Astrophysics and professor of physics at Penn State and co-chair of the LSST Active Galactic Nuclei Science Collaboration. “At that kind of scale, human inspection of the data is infeasible. We’re going to have to use novel informatics techniques. Machine-learning and AI-based algorithms will support this effort—for example, in classifying the billions of detected galaxies and stars, filtering the millions of nightly sources of transient signals, and discovering unexpected outlier cosmic signals.”
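As a toy illustration of what automated outlier flagging can look like, the Python sketch below marks measurements that sit far from the rest using the median absolute deviation, a standard robust statistic. It is not the LSST pipeline or any method described by the collaboration. The function name `flag_outliers`, the 3.5 threshold, and the brightness values are all made up for this example.

```python
from statistics import median


def flag_outliers(values, threshold=3.5):
    """Return the indices of values whose modified z-score exceeds the threshold."""
    med = median(values)
    # Median absolute deviation: a spread estimate that a single extreme value barely moves.
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no measurable spread, so nothing can be ranked as unusual
    # 0.6745 rescales the MAD so the score is comparable to a standard z-score.
    return [i for i, v in enumerate(values)
            if 0.6745 * abs(v - med) / mad > threshold]


# Hypothetical brightness readings for one source; the seventh value is a sudden spike.
brightness = [10.1, 9.9, 10.0, 10.2, 9.8, 10.0, 14.5, 10.1]
print(flag_outliers(brightness))
```

With these invented readings, only the spike at index 6 is flagged. A real survey would apply far more sophisticated, learned classifiers, but the goal is the same: let software surface the handful of unusual signals so that people do not have to inspect every measurement.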
Editor's Note: This story is part of a larger feature about artificial intelligence developed for the Winter 2026 issue of the Eberly College of Science Science Journal.