About Me

I am a postdoctoral research fellow at the Fred Hutchinson Cancer Research Center. I am currently working with Dr. Ollivier Hyrien on a project on single-cell data cytokine clustering in vaccine studies, which combines graph-based approaches with machine learning algorithms.

I obtained my Ph.D. from the Department of Statistical Science at Temple University, under the mentorship of Prof. Subhadeep Mukhopadhyay. My thesis was titled "Graph-based Modern Nonparametrics For High-dimensional Data". Before that, I earned my M.S. in Mathematical Finance at the University of Southern California.

My research interests include nonparametric statistical learning, graph data science, and computational statistics. I have developed methods that combine modern nonparametrics with spectral graph theory to analyze high-dimensional data sets, with several implementations available on CRAN.

Here's a copy of my CV.

Selected Publications

  1. Mukhopadhyay, S. and Kaijun Wang (2021) “Nonparametric Clustering for Single-Cell Subsets in Vaccine Studies”. Technical Report, Fred Hutchinson Cancer Research Center.

  2. Mukhopadhyay, S. and Kaijun Wang (2021) “Statistical Machine Learning: An Integrated Approach”. Journal of Business & Economic Statistics (under review), arXiv:2005.13596

  3. Mukhopadhyay, S. and Kaijun Wang (2021) “On The Problem of Relevance in Statistical Inference”. Econometrics and Statistics (invited revision, special issue), arXiv:2004.09588

  4. Mukhopadhyay, S. and Kaijun Wang (2021) “Spectral Graph Analysis: A Unified Explanation and Modern Perspectives”. Nature Scientific Reports (accepted), arXiv:1901.07090

  5. Mukhopadhyay, S. and Kaijun Wang (2020) “A Nonparametric Approach to High-dimensional K-sample Comparison Problem”. Biometrika 107(3), pages 555–572, arXiv:1810.01724

  6. Zhu, Jiaqi, Kaijun Wang, Yunkun Wu, Zhongyi Hu, and Hongan Wang (2016) “Mining User-Aware Rare Sequential Topic Patterns in Document Streams”. IEEE Transactions on Knowledge and Data Engineering 28, pages 1790–1804.

Education

Postdoctoral Research (Current)

Fred Hutchinson Cancer Research Center, Vaccine and Infectious Disease Division


Ph.D., Statistics (2015 - 2019)

Temple University, Fox School of Business, Department of Statistical Science


M.S., Mathematical Finance (2012 - 2014)

University of Southern California


B.S., Applied Mathematics (2008 - 2012)

Central University of Finance and Economics

Research Projects

Statistical Machine Learning: An Integrated Approach

We introduce a new data analysis framework, called Integrated Statistical Learning (ISL) theory, which, for the first time, offers a way to blend parametric statistical modeling and algorithmic machine learning into a coherent whole by establishing a link between them. This integrated statistical method provides novel solutions to conditional density estimation, goodness-of-fit evaluation, quantile regression, and much more.


Heterogeneity, Relevance and Customized Inference

In modern large-scale inference problems, it is important to take extra covariate information into account when performing inference on heterogeneous data. Together with Prof. Deep Mukhopadhyay, I am developing a new paradigm of statistical modeling called “Global-to-Local Inference”, which will provide the necessary theory and algorithms for this individualized inference. By borrowing strength from the full ensemble, this method generates simulated relevance samples to power subsequent analysis.


Single-cell Data Cytokine Clustering in Vaccine Studies

Flow cytometry single-cell data are widely used in modern studies of immune-related diseases. So far, various models have been proposed for processing the signal intensity to identify cell subsets with different marker combinations, yet little has been said about what to do with the resulting count data. This is not a trivial matter, as the resulting data set consists of sparse, high-dimensional count data, and analyzing it by traditional means is challenging. We believe that an important step toward understanding such a data set is finding out which subsets respond similarly to the stimuli. To this end, I'm developing a framework based on mixture models and graph theory that can quickly cluster together marker combinations with similar reactions to stimuli over a given time period.


Nonparametric Approach to High-dimensional K-sample Comparison Problems

The multivariate k-sample comparison problem appears frequently in a wide range of data-rich scientific fields. In this project, I developed an approach based on modern LP-nonparametric tools and previously unexplored connections with spectral graph theory, which proved robust on noise-contaminated data sets. Furthermore, the method comes with an exploratory interface, which not only provides more insight into the problem but can also be used to develop a better predictive model in the next phase of data modeling.


High-dimensional Nonparametric Change-point Detection

A large class of applied problems, arising in fields ranging from health to military to environmental monitoring, can be formulated as tracking massive amounts of data collected from a large number of sensors, which can be summarized as high-dimensional change-point detection problems. Approaching this kind of problem in a brute-force classical manner is known to be extremely challenging. I've developed a spectral graph analysis approach for this purpose, which significantly reduces the memory footprint and speeds up the computation.


Presentations and Talks

  1. Graph-Based Compressive High-Dimensional Testing. --JSM Conference Presentation, 2020.

  2. Graph-Based Modern Nonparametrics For High-Dimensional Data. --Temple University, Dissertation Topic, 2018.

  3. Nonparametric High-dimensional K-sample Comparison. --Institute of Software, Chinese Academy of Sciences, Invited Talk, 2018.

Teaching Experience

STAT 1001, Quantitative Methods for Business I

Temple University, Fall 2017 and Spring 2018, 2019


STAT 1001, Quantitative Methods for Business I

Temple University, Fall 2016


R Packages

 LPMachineLearning

Statistical modeling tools for converting a black-box ML algorithm into an interpretable conditional distribution prediction machine, which provides a wide range of facilities, including goodness-of-fit, various types of exploratory graphical diagnostics, generalized feature selection, predictive inference methods, and others.


 LPKsample

A graph-based nonparametric algorithm for the high-dimensional k-sample problem that includes (i) a confirmatory test; (ii) exploratory results; and (iii) options to output a data-driven LP-transformed matrix for classification.


 LPGraph

Fast and compressive nonparametric spectral algorithm for ordered graphs, with application to high-dimensional change-point analysis.


 LPRelevance

A framework of methods to perform customized inference at the individual level by taking contextual covariates into account. Three main functions are provided in this package: (i) LASER(): generates specially designed artificial relevant samples for a given case; (ii) g2l.proc(): computes customized conditional fdr; and (iii) rEB.proc(): performs empirical Bayes inference based on LASERs.