Research

This page introduces the individual research projects in our lab.

Algorithm Development or Mathematical Theory

Scaled MinHash

Participant(s): Mahmudur Rahman Hera

MinHash is a popular method for estimating the Jaccard similarity (as well as the containment index) between two sets. The idea behind MinHash is that, under a random hash function, the probability that two sets share the same minimum hash value equals their Jaccard index. Therefore, by using multiple hash functions and comparing the resulting minimum hash values, the Jaccard index between the two sets can be estimated. However, if the two sets are very large, computing the minimum hash value many times can be computationally expensive, and working with only a portion of each set is more feasible. In this project, we are developing sound mathematical theory that incorporates a scaling factor indicating the size of each subset relative to its original set, and that ensures the containment and Jaccard values of the smaller sets estimate the true values well. In metagenomic applications, the two sets are typically sets of k-mers derived from two genomes. We are therefore also extending the theory with a simple mutation model (in which one genome is mutated into another) to determine the mutation rate from the containment and Jaccard values of the two scaled subsets of k-mers.
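The scaling idea can be illustrated with a minimal Python sketch. It is not the lab's implementation: the function names, the choice of BLAKE2b as the hash function, and the 64-bit hash range are all assumptions made for illustration. Each element is kept only if its hash falls below a fraction `scale` of the hash range, and the Jaccard and containment indices are then computed on the retained subsets. The last function assumes the common simple mutation model in which each base mutates independently, so the expected containment is roughly (1 - p)^k; that formula is a point estimate only and is not the theory developed in this project.

```python
import hashlib

MAX_HASH = 2**64  # assumed size of the hash range (64-bit hashes)


def hash64(item: str) -> int:
    """Map a string to a 64-bit integer (BLAKE2b is an illustrative choice)."""
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def scaled_sketch(items, scale: float) -> set:
    """Keep the hashes that fall below scale * MAX_HASH.

    On average this retains a fraction `scale` of the distinct items.
    """
    threshold = int(MAX_HASH * scale)
    return {h for h in map(hash64, items) if h < threshold}


def jaccard(a: set, b: set) -> float:
    """Jaccard index |A & B| / |A | B| of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def containment(a: set, b: set) -> float:
    """Containment index |A & B| / |A| of A in B (0.0 if A is empty)."""
    return len(a & b) / len(a) if a else 0.0


def mutation_rate_from_containment(c: float, k: int) -> float:
    """Point estimate of the mutation rate p, assuming c is about (1 - p)**k."""
    return 1.0 - c ** (1.0 / k)
```

With `scale=1.0` every hash is kept and the estimates coincide with the exact set values; smaller values of `scale` trade accuracy for smaller subsets, which is the regime the theory in this project addresses.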

Reasoning Over Large Knowledge Graphs

Automatic Drug Repurposing by Reasoning over Knowledge Graphs

Participant(s): Chunyu Ma

Traditional drug discovery methods are time-consuming and costly. In contrast, identifying new indications for approved drugs, also known as drug repositioning or drug repurposing, provides a relatively safe, cheap, and fast alternative. With the exponential growth of biomedical knowledge and data stored in different databases (e.g., DrugBank, KEGG, and Ensembl), we can integrate these knowledge sources into one large graph (known as a knowledge graph) and predict the probability of new uses for approved drugs via deep learning techniques (especially graph neural networks). In this project, we aim to design a powerful drug repurposing model that offers not only good prediction performance but also explainability.

Metagenomics Research

Multi-resolution estimation of k-mer-based Jaccard and containment indices

Participant(s): Shaopeng Liu

K-mer-based methods are used ubiquitously in computational biology. However, choosing the optimal value of k for a specific application often remains heuristic. Simply rebuilding the k-mer set for each new k-mer size is computationally expensive, especially in metagenomic analysis, where data sets are large.
In this project, we have developed a truncation-based method that computes multi-resolution estimates of the Jaccard and containment indices more efficiently, and we are working on demonstrating its efficacy and advantages in metagenomic analysis. The source code CMash can be found here, and the proof-of-concept implementation and benchmark analysis can be found here.
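As a rough illustration of what truncation means here, the following Python sketch builds the k-mer set once at a maximum size `k_max` and derives the sets for smaller k by keeping only the first k characters of each k-mer. It is a toy example on exact sets under our own assumptions, not the CMash implementation: the function names are hypothetical, and no sketching or indexing structure is used.

```python
def kmer_set(seq: str, k: int) -> set:
    """All distinct k-mers (length-k substrings) of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def truncate(kmers: set, k: int) -> set:
    """Derive shorter k-mers by keeping the length-k prefix of each k-mer."""
    return {kmer[:k] for kmer in kmers}


def jaccard(a: set, b: set) -> float:
    """Jaccard index |A & B| / |A | B| of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def multi_resolution_jaccard(seq_a: str, seq_b: str, k_max: int, k_values) -> dict:
    """Estimate the Jaccard index for each k <= k_max from one k_max-mer set.

    The k-mer sets are built only once, at k_max, and reused for every k.
    """
    base_a = kmer_set(seq_a, k_max)
    base_b = kmer_set(seq_b, k_max)
    return {
        k: jaccard(truncate(base_a, k), truncate(base_b, k))
        for k in k_values
    }
```

Truncated prefixes miss the few k-mers that start within the last `k_max - k` positions of a sequence, so the result is an estimate rather than the exact value. For example, `multi_resolution_jaccard("AAAAAA", "AAAAAC", 3, [3, 2])` reports 1.0 at k = 2 because the trailing 2-mer `AC` is never seen. On long sequences this boundary effect is small relative to the total number of k-mers.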
