Biography
I am a Professor of Statistics at Penn State and the associate chair of the Bioinformatics and Genomics graduate program. My research focuses on developing statistical methods to uncover complex patterns in large-scale biological data, with a long-standing emphasis on assessing and improving reproducibility in high-throughput genomics. I develop latent variable models and machine learning approaches to identify scientifically meaningful structure in high-dimensional data.
More recently, my work has expanded to social science data and the study of large language models, with a focus on human–AI collaboration and the reliability and reproducibility of data-driven scientific workflows.
I received my Ph.D. in Statistics from the University of Washington and completed my postdoctoral training at the University of California, Berkeley.
Selected Publications
- Singh, R., Xi, H., Park, A.K., Hardison, R.C., Zhu, X.+, and Li, Q.+ (2026) Retrofit: Reference-free deconvolution of cell mixtures in spatial transcriptomics. Nature Communications (Accepted)
- Ranilli, M., Lyu, Y., Koch, H., and Li, Q.+ (2026). A statistical framework for measuring reproducibility and replicability of high-throughput experiments from multiple sources. Statistics in Medicine. 45(3-5): e70354
- Zeng, Q., Jin, C., Wang, X., Zheng, Y., and Li, Q. (2025) AIRepr: An Analyst-Inspector Framework for Evaluating Reproducibility of LLMs in Data Science. Findings of the Association for Computational Linguistics: EMNLP 2025.
- Seal, S., Li, Q., Basner, E.B., Saba, L.M. and Kechris, K. (2023) RCFGL: Rapid Condition adaptive Fused Graphical Lasso and application to modeling brain region co-expression networks. PLoS Computational Biology, Jan 6, p1-26.
- Koch, H., Keller, C. A., Giardine, B., Xiang, G., Zhang, F., Wang, Y., Hardison, R. C.+, and Li, Q.+ (2022) High-dimensional association detection in large scale genomic data. Nature Communications, Nov 12, 13: 6874, p1-15
- Zhang, F., and Li, Q.+ (2022). Segmented correspondence curve regression for quantifying covariate effects on the reproducibility of high-throughput experiments. Biometrics. Sept 3, p1-14
- Singh, R., Zhang, F., and Li, Q.+ (2022). Assessing reproducibility of high-throughput experiments in the case of missing data. Statistics in Medicine, 41(10), p1884-1899.
- Osotsi, A., and Li, Q.+ (2021) Learning Robust Representations using a Change Point Framework. KDD 2021, the 7th Workshop on Mining and Learning from Time Series (MiLeTS), paper 20, p1-8.
- McGuire, D., Jiang, Y., Liu, M., Weissenkampen, J.D., Eckert, S., Yang, L., Chen, F., GWAS and Sequencing Consortium of Alcohol and Nicotine Use (GSCAN), Berg, A., Vrieze, S., Jiang, B.+, Li, Q.+, and Liu, D.+ (2021) Model-based Assessment of Replicability for Genome-wide Association Meta-analysis, Nature Communications, 12(1), p1-14.
- An, L., Yang, T., Yang, J., Nuebler, J., Xiang, G., Hardison, R.C., Li, Q.+, and Zhang, Y.+. OnTAD: hierarchical domain structure reveals the divergence of activity among TADs and boundaries, Genome Biology 20, 282 (2019) (+: co-corresponding authors).
- Koch, H., Starenki, D., Cooper, S.J., Myers, R.M., Li, Q.+. (2018) powerTCR: a model-based approach to comparative analysis of the clone size distribution of the T cell receptor repertoire, PLoS Computational Biology 14(11): e1006571.
- Lyu, Y., Xue, L., Zhang, F., Koch, H., Saba, L., Kechris, K., Li, Q.+. (2018) Condition-adaptive fused graphical lasso (CFGL): an adaptive procedure for inferring condition specific gene co-expression network, PLoS Computational Biology.
- Philtron, D., Lyu, Y., Li, Q.+, and Ghosh, D.+ (2018) Maximum rank reproducibility: a non-parametric approach to assessing reproducibility in replicate experiments, Journal of the American Statistical Association. 113: 1028-1039 (+: co-corresponding authors).
- Li, Q.+ and Zhang, F. (2018) A regression framework for assessing covariate effects on the reproducibility of high-throughput experiments. Biometrics 74(3): 781-1138.
- Yang, T., Zhang, F., Yardimci, G.G., Song, F., Hardison, R.C., Noble, W.S., Yue, F., Li, Q.+. (2017) HiCRep: assessing the reproducibility of Hi-C data using a stratum-adjusted correlation coefficient. Genome Research, 27(11):1939-1949.
- Zhang, F. and Li, Q.+. (2017) A continuous threshold expectile model. Computational Statistics & Data Analysis, 116: 49–66.
- Zhang, F. and Li, Q.+. (2017) Robust bent line regression. Journal of Statistical Planning and Inference, 185: 41–55.
- Lyu, Y. and Li, Q.+ (2016) A semi-parametric statistical model for integrating gene expression profiles across different platforms. BMC Bioinformatics, 17(Suppl 1):S5
- Bailey, T.*, Krajewski, P.*, Ladunga, I.*, Lefebvre, C.*, Li, Q.*, Liu, T.*, Madrigal, P.*, Taslim, C.*, and Zhang J.* (2013). Practical guidelines for the comprehensive analysis of ChIP-seq data. PLoS Computational Biology, 9(11): e1003326.
- Li, Q., Eng, J.K. and Stephens, M. (2012), A model-based algorithm for peptide identification using mass spectrometry via database searching. Annals of Applied Statistics, 6(4), 1775-1794
- Chen, Y., Negre, N., Li, Q., Mieczkowska, J.O., Slattery, M., Zhang, Y., Kim, Y., He, H., Zieba, J., Ruan, Y., Bickel, P.J., Myers, R.M., Wold, B.J., White, K.P., Lieb, J.D., Liu, X.S. (2012) Systematic evaluation of factors influencing ChIP-seq fidelity using ultra-deep sequencing. Nature Methods, 9, 609–614.
- The ENCODE Project Consortium. (2012) An integrated encyclopedia of DNA elements in the human genome, Nature, 489, 57-74 (Li, Q. as one of the lead analysts)
- Li, Q., Brown, J.B., Huang, H. and Bickel, P.J. (2011) Measuring reproducibility of high-throughput experiments, Annals of Applied Statistics, 5(3), 1752-1779.
- Li, Q., MacCoss, M. and Stephens, M. (2010) A nested mixture model for protein identification using mass spectrometry, Annals of Applied Statistics, 4(2), 962-987.
Teaching
STAT 555 - Statistical genomics
STAT 504 - Analysis of discrete data
STAT 557 - Data mining
STAT 544 - Categorical data analysis
STAT 414 - Introduction to probability
STAT 415 - Introduction to mathematical statistics