Mind the gap: using variable selection in genomics

7 March 2017

Leaps in information and computing technology now allow scientists to understand genomes better and so advance public health. The field of biostatistics enables researchers to analyse a wide range of data, from mutations to gene and protein expression.

“Statisticians are always interested in developing new methods to handle new structures of data. At the same time we need the underlying biological knowledge to help us analyse and build models,” said Kin Yau Wong, a PhD student in biostatistics at the University of North Carolina at Chapel Hill. “We collaborate with biologists and experts in genomics to tell us how different types of data are expected to interact. With this information we can build disease models and do the estimation.”

Wong was a student of actuarial science, but during his undergraduate study he realised that he wasn’t interested in financial risk analysis. While he was pursuing his master’s degree in the Department of Statistics and Actuarial Science at HKU, his supervisor was working on statistics with a focus on medical data, and that is how Wong began to develop his interest in biostatistics.

 “I’m not only driven to biostatistics because of its application. Statistics as a discipline of mathematics is interesting and fascinating to me. What’s more important is its impact on the well-being of human society. I’d like to see my work actually benefiting people,” he added.

Wong’s research is primarily focused on an integrative analysis of genomic data. In his first dissertation project he applied a statistical framework called structural equation modelling for the integrative analysis of survival data along with different types of genomic data in cancer studies.  

Structural equation modelling is a broad area of research that has been studied for some time, but it is still not well understood, particularly with regard to model identifiability, which Wong has worked on.

“We discussed how we can perform estimation under a very flexible model and using this framework, we were able to draw better conclusions than by simply using standard regression or other existing statistical methods.”

Another aim of Wong’s integrative analysis is to identify genes associated with certain outcomes in cancer. When the expression of a particular gene is known to be strongly related to tumour progression, scientists can develop a drug that targets that gene to inhibit its expression.

Biostatisticians can also help discover a list of genes that biologists can follow up on to better understand the underlying mechanisms. “This is an important application, as it assists in the development of the drug to control and maybe eventually to treat cancer,” added Wong.

Another application of this research is in the prediction of survival time. At present, medical doctors predict clinical outcomes using other clinical data or a single type of genomic data collected from patients.

“In the future we’d like to predict patients’ survival time using all available genomic data. We are trying to develop methods of making use of new genomic data to help us to do more accurate prediction. This is a very popular application of integrative analysis in cancer,” he added.

Future projects

Genomic data are high dimensional, which means there are more variables than samples in a data set. Statisticians perform variable selection, in which they assume that only a small subset of features has an effect on the clinical outcome of interest.
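To make the idea of variable selection concrete, here is a minimal sketch in Python using the lasso, a widely used penalised regression technique. The article does not say which method Wong uses, so the lasso is an assumption chosen purely for illustration, and the simulated data, dimensions, penalty value and function names are all hypothetical.

```python
import random

def soft_threshold(z, t):
    """Shrink z toward zero by t; values within [-t, t] become exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_coordinate_descent(X, y, lam, n_iter=100):
    """Minimise (1/2n) * ||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent.

    X is a list of n rows (each a list of p floats); y is a list of n floats.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X b, and b starts at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Covariance of column j with the residual that ignores feature j
            rho = sum(X[i][j] * (residual[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new_bj = soft_threshold(rho, lam) / col_sq[j]
            delta = new_bj - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                beta[j] = new_bj
    return beta

# Hypothetical simulated study: 20 subjects, 50 features, only 3 truly active
random.seed(0)
n, p = 20, 50
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
true_beta = [0.0] * p
true_beta[0], true_beta[1], true_beta[2] = 3.0, -2.0, 1.5
y = [sum(X[i][j] * true_beta[j] for j in range(p)) + random.gauss(0, 0.5)
     for i in range(n)]

beta_hat = lasso_coordinate_descent(X, y, lam=0.5)
selected = [j for j, b in enumerate(beta_hat) if b != 0.0]
print("features kept by the lasso:", selected)
```

The penalty sets most coefficients exactly to zero, which is what makes estimation possible when there are more variables than samples. In practice the penalty level would be tuned, for example by cross-validation, rather than fixed in advance as it is here.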

Most variable selection methods assume that there is no missing data, that is, that every subject has an observation on every variable of interest. If that is not the case, statisticians will typically either discard the subjects with missing data or impute values for the missing data, treat the data set as complete and continue with the analysis.

But that is not an optimal way to do it, adds Wong. He is interested in variable selection in the presence of missing data and in developing a better way to handle missing values than simply imputing them.

“This is motivated by integrative analysis that involves multiple types of genomic data, where it is common that one subject has observations on one type of data but not the other one,” explained Wong. “When the number of variables is very large, we may need to do variable selection. Work in the literature on the interplay of missing data and variable selection is scarce. This is something I’d like to explore further.”
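The two conventional approaches mentioned above, discarding incomplete subjects and imputing values, can be sketched in a few lines of Python. This is only an illustration of those standard practices, not of the method Wong is developing; the toy data set, the use of column means for imputation and the function names are assumptions made for the example.

```python
def complete_cases(rows):
    """Drop every subject (row) that has at least one missing value (None)."""
    return [row for row in rows if all(v is not None for v in row)]

def mean_impute(rows):
    """Replace each missing value with the mean of the observed values in its column."""
    p = len(rows[0])
    means = []
    for j in range(p):
        observed = [row[j] for row in rows if row[j] is not None]
        means.append(sum(observed) / len(observed) if observed else None)
    return [[means[j] if row[j] is None else row[j] for j in range(p)]
            for row in rows]

# Hypothetical toy data: 4 subjects measured on 2 types of genomic data
data = [
    [1.0, 2.0],
    [3.0, None],   # second data type not observed for this subject
    [None, 6.0],   # first data type not observed for this subject
    [5.0, 4.0],
]

kept = complete_cases(data)   # only the 2 fully observed subjects remain
filled = mean_impute(data)    # all 4 subjects remain, gaps filled with column means
print(len(kept), "of", len(data), "subjects survive complete-case deletion")
print(filled)
```

In this toy example, deletion throws away half of the subjects, while imputation keeps them all but then treats the filled-in numbers as if they had been measured.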


Working in the field of biostatistics, it is essential to be able to communicate with people from different fields and to translate the questions they want to address into well-defined statistical problems.

“We are interested in developing statistical methods of real practical value that can be applied to genomic problems. But we need to understand the areas that we are applying the statistical methods to,” said Wong.  

Despite these challenges, Wong is motivated to continue with his research because of the support he and fellow scholars in the field have received from the Croucher Foundation.

 After graduating this summer, he is interested in continuing in academia. “I hope to contribute to the research community in Hong Kong in particular, where my home is.”


Kin Yau Wong obtained his BSc in Actuarial Science and his Master of Philosophy, working on survival analysis, from the University of Hong Kong in 2010 and 2012 respectively. He received a Croucher Scholarship at the University of North Carolina at Chapel Hill in 2012 and is currently a PhD student in biostatistics there.
