Qunhua Li | Eberly College of Science

I am a Professor of Statistics at Penn State and the associate chair of the Bioinformatics and Genomics graduate program. My research focuses on developing statistical methods to uncover complex patterns in large-scale biological data, with a long-standing emphasis on assessing and improving reproducibility in high-throughput genomics. I develop latent variable models and machine learning approaches to identify scientifically meaningful structure in high-dimensional data.

More recently, my work has expanded to social science data and the study of large language models, with a focus on human–AI collaboration and the reliability and reproducibility of data-driven scientific workflows.

I received my Ph.D. in Statistics from the University of Washington and completed my postdoctoral training at the University of California, Berkeley.

Selected Publications

See Google Scholar for the full list.

Singh, R., Xi, H., Park, A.K., Hardison, R.C., Zhu, X.+, and Li, Q.+ (2026) Retrofit: Reference-free deconvolution of cell mixtures in spatial transcriptomics. Nature Communications (Accepted)
Ranilli, M., Lyu, Y., Koch, H., and Li, Q.+ (2026). A statistical framework for measuring reproducibility and replicability of high-throughput experiments from multiple sources. Statistics in Medicine. 45(3-5): e70354
Zeng, Q., Jin, C., Wang, X., Zheng, Y., and Li, Q. (2025) AIRepr: An Analyst-Inspector Framework for Evaluating Reproducibility of LLMs in Data Science. Findings of the Association for Computational Linguistics: EMNLP 2025.
Seal, S., Li, Q., Basner, E.B., Saba, L.M. and Kechris, K. (2023) RCFGL: Rapid Condition adaptive Fused Graphical Lasso and application to modeling brain region co-expression networks. PLoS Computational Biology, Jan 6, p1-26.
Koch, H., Keller, C. A., Giardine, B., Xiang, G., Zhang, F., Wang, Y., Hardison, R. C.+, and Li, Q.+ (2022) High-dimensional association detection in large scale genomic data. Nature Communications, Nov 12, 13: 6874, p1-15
Zhang, F., and Li, Q.+ (2022). Segmented correspondence curve regression for quantifying covariate effects on the reproducibility of high-throughput experiments. Biometrics. Sept 3, p1-14
Singh, R., Zhang, F., and Li, Q.+ (2022). Assessing reproducibility of high-throughput experiments in the case of missing data. Statistics in Medicine, 41(10), p1884-1899.
Osotsi, A., and Li, Q.+ (2021) Learning Robust Representations using a Change Point Framework. KDD 2021, the 7th Workshop on Mining and Learning from Time Series (MiLeTS), paper 20, p1-8.
McGuire, D., Jiang, Y., Liu, M., Weissenkampen, J.D., Eckert, S., Yang, L., Chen, F., GWAS and Sequencing Consortium of Alcohol and Nicotine Use (GSCAN), Berg, A., Vrieze, S., Jiang, B.+, Li, Q.+, and Liu, D.+ (2021) Model-based Assessment of Replicability for Genome-wide Association Meta-analysis, Nature Communications, 12(1), p1-14.
An, L., Yang, T., Yang, J., Nuebler, J., Xiang, G., Hardison, R.C., Li, Q.+, and Zhang, Y.+. OnTAD: hierarchical domain structure reveals the divergence of activity among TADs and boundaries, Genome Biology 20, 282 (2019) (+: co-corresponding authors).
Koch, H., Starenki, D., Cooper, S.J., Myers, R.M., Li, Q.+. (2018) powerTCR: a model-based approach to comparative analysis of the clone size distribution of the T cell receptor repertoire, PLoS Computational Biology 14(11): e1006571.
Lyu, Y., Xue, L., Zhang, F., Koch, H., Saba, L., Kechris, K., Li, Q.+. (2018) Condition-adaptive fused graphical lasso (CFGL): an adaptive procedure for inferring condition specific gene co-expression network, PLoS Computational Biology.
Philtron, D., Lyu, Y., Li, Q.+, and Ghosh, D.+ (2018) Maximum rank reproducibility: a non-parametric approach to assessing reproducibility in replicate experiments, Journal of the American Statistical Association. 113: 1028-1039 (+: co-corresponding authors).
Li, Q.+ and Zhang, F. (2018) A regression framework for assessing covariate effects on the reproducibility of high-throughput experiments. Biometrics 74(3): 781-1138.
Yang, T., Zhang, F., Yardimci, G.G., Song, F., Hardison, R.C., Noble, W.S., Yue, F., Li, Q.+. (2017) HiCRep: assessing the reproducibility of Hi-C data using a stratum-adjusted correlation coefficient. Genome Research, 27(11):1939-1949.
Zhang, F. and Li, Q.+. (2017) A continuous threshold expectile model. Computational Statistics & Data Analysis, 116: 49–66.
Zhang, F. and Li, Q.+. (2017) Robust bent line regression. Journal of Statistical Planning and Inference, 185: 41–55.
Lyu, Y. and Li, Q.+ (2016) A semi-parametric statistical model for integrating gene expression profiles across different platforms. BMC Bioinformatics, 17(Suppl 1):S5
Bailey, T.*, Krajewski, P.*, Ladunga, I.*, Lefebvre, C.*, Li, Q.*, Liu, T.*, Madrigal, P.*, Taslim, C.*, and Zhang, J.* (2013). Practical guidelines for the comprehensive analysis of ChIP-seq data. PLoS Computational Biology, 9(11): e1003326.
Li, Q., Eng, J.K. and Stephens, M. (2012), A model-based algorithm for peptide identification using mass spectrometry via database searching. Annals of Applied Statistics, 6(4), 1775-1794
Chen, Y., Negre, N., Li, Q., Mieczkowska, J.O., Slattery, M., Zhang, Y., Kim, Y., He, H., Zieba, J., Ruan, Y., Bickel, P.J., Myers, R.M., Wold, B.J., White, K.P., Lieb, J.D., Liu, X.S. (2012) Systematic evaluation of factors influencing ChIP-seq fidelity using ultra-deep sequencing. Nature Methods, 9, 609–614.
The ENCODE project consortium. (2012) An integrated encyclopedia of DNA elements in the human genome, Nature, 489, 57-74 (Li, Q. as one of the lead analysts)
Li, Q., Brown, J.B., Huang, H. and Bickel, P.J. (2011) Measuring reproducibility of high-throughput experiments, Annals of Applied Statistics, 5(3), 1752-1779.
Li, Q., MacCoss, M. and Stephens, M. (2010) A nested mixture model for protein identification using mass spectrometry, Annals of Applied Statistics, 4(2), 962-987.

Teaching

STAT 555 - Statistical genomics

STAT 504 - Analysis of discrete data

STAT 557 - Data mining

STAT 544 - Categorical data analysis

STAT 414 - Introduction to probability

STAT 415 - Introduction to mathematical statistics