Publications


2022

RTX-KG2: a system for building a semantically standardized knowledge graph for translational biomedicine

E. C. Wood, Amy K. Glen, Lindsey G. Kvarfordt, Finn Womack, Liliana Acevedo, Timothy S. Yoon, Chunyu Ma, Veronica Flores, Meghamala Sinha, Yodsawalai Chodpathumwan, Arash Termehchy, Jared C. Roach, Luis Mendoza, Andrew S. Hoffman, Eric W. Deutsch, David Koslicki, Stephen A. Ramsey

BMC Bioinformatics · 29 Sep 2022 ·

Overview:
We develop RTX-KG2, a biomedical knowledge graph that uses an Extract-Transform-Load (ETL) approach to integrate 70 knowledge sources (including UMLS, SemMedDB, ChEMBL, DrugBank, SMPDB, and 65 additional knowledge sources) into a single knowledge graph and conforms to the Biolink standard metamodel.

knowledge graphs biomedical knowledge integration semantic normalization

Highlights Github Preprint

ARAX: a graph-based modular reasoning tool for translational biomedicine

Amy K. Glen, Chunyu Ma, Luis Mendoza, Finn Womack, E. C. Wood, Meghamala Sinha, Liliana Acevedo, Lindsey G. Kvarfordt, Ross C. Peene, Shaopeng Liu, Andrew S. Hoffman, Jared C. Roach, Eric W. Deutsch, Stephen A. Ramsey, David Koslicki

Cold Spring Harbor Laboratory · 16 Aug 2022 ·

Overview:
We introduce ARAX, a new reasoning system for translational biomedicine that provides a web browser user interface and an application programming interface. ARAX enables users to encode translational biomedical questions and to integrate knowledge across sources to answer the user's query and facilitate exploration of results. For ARAX, we developed new approaches to query planning, knowledge-gathering, reasoning, and result ranking and dynamically integrate knowledge providers for answering biomedical questions.

knowledge graphs translational science biomedical reasoning

Highlights Github

CMash: fast, multi-resolution estimation of k-mer-based Jaccard and containment indices

Shaopeng Liu, David Koslicki

Bioinformatics · 24 Jun 2022 ·

Overview:
Utilizing truncation-based methods, we can obtain multi-resolution estimation of k-mer-based JI/CI in linear time for a range of k values.

k-mer Jaccard index MinHash Containment index Truncation

Highlights Github Preprint

Biolink Model: A universal schema for knowledge graphs in clinical, biomedical, and translational science

Deepak R. Unni, Sierra A. T. Moxon, Michael Bada, Matthew Brush, Richard Bruskiewich, J. Harry Caufield, Paul A. Clemons, Vlado Dancik, Michel Dumontier, Karamarie Fecho, Gustavo Glusman, Jennifer J. Hadlock, Nomi L. Harris, Arpita Joshi, Tim Putman, Guangrong Qin, Stephen A. Ramsey, Kent A. Shefchek, Harold Solbrig, Karthik Soman, Anne E. Thessen, Melissa A. Haendel, Chris Bizon, Christopher J. Mungall, Liliana Acevedo, Stanley C. Ahalt, John Alden, Ahmed Alkanaq, Nada Amin, Ricardo Avila, Jim Balhoff, Sergio E. Baranzini, Andrew Baumgartner, William Baumgartner, Basazin Belhu, MacKenzie Brandes, Namdi Brandon, Noel Burtt, William Byrd, Jackson Callaghan, Marco Alvarado Cano, Steven Carrell, Remzi Celebi, James Champion, Zhehuan Chen, Mei‐Jan Chen, Lawrence Chung, Kevin Cohen, Tom Conlin, Dan Corkill, Maria Costanzo, Steven Cox, Andrew Crouse, Camerron Crowder, Mary E. Crumbley, Cheng Dai, Vlado Dančík, Ricardo De Miranda Azevedo, Eric Deutsch, Jennifer Dougherty, Marc P. Duby, Venkata Duvvuri, Stephen Edwards, Vincent Emonet, Nathaniel Fehrmann, Jason Flannick, Aleksandra M. Foksinska, Vicki Gardner, Edgar Gatica, Amy Glen, Prateek Goel, Joseph Gormley, Alon Greyber, Perry Haaland, Kristina Hanspers, Kaiwen He, Kaiwen He, Jeff Henrickson, Eugene W. Hinderer, Maureen Hoatlin, Andrew Hoffman, Sui Huang, Conrad Huang, Robert Hubal, Kenneth Huellas‐Bruskiewicz, Forest B. Huls, Lawrence Hunter, Greg Hyde, Tursynay Issabekova, Matthew Jarrell, Lindsay Jenkins, Adam Johs, Jimin Kang, Richa Kanwar, Yaphet Kebede, Keum Joo Kim, Alexandria Kluge, Michael Knowles, Ryan Koesterer, Daniel Korn, David Koslicki, Ashok Krishnamurthy, Lindsey Kvarfordt, Jay Lee, Margaret Leigh, Jason Lin, Zheng Liu, Shaopeng Liu, Chunyu Ma, Andrew Magis, Tarun Mamidi, Meisha Mandal, Michelle Mantilla, Jeffrey Massung, Denise Mauldin, Jason McClelland, Julie McMurry, Philip Mease, Luis Mendoza, Marian Mersmann, Abrar Mesbah, Matthew Might, Kenny Morton, Sandrine Muller, Arun Teja Muluka, John Osborne, Phil Owen, Michael Patton, David B. Peden, R. Carter Peene, Bria Persaud, Emily Pfaff, Alexander Pico, Elizabeth Pollard, Guthrie Price, Shruti Raj, Jason Reilly, Anders Riutta, Jared Roach, Ryan T. Roper, Greg Rosenblatt, Irit Rubin, Sienna Rucka, Nathaniel Rudavsky‐Brody, Rayn Sakaguchi, Eugene Santos, Kevin Schaper, Charles P. Schmitt, Shepherd Schurman, Erik Scott, Sarah Seitanakis, Priya Sharma, Ilya Shmulevich, Manil Shrestha, Shalki Shrivastava, Meghamala Sinha, Brett Smith, Noel Southall, Nicholas Southern, Lisa Stillwell, Michael “Michi” Strasser, Andrew I. Su, Casey Ta, Anne E. Thessen, Jillian Tinglin, Lucas Tonstad, Thi Tran‐Nguyen, Alexander Tropsha, Gaurav Vaidya, Luke Veenhuis, Adam Viola, Marcin von Grotthuss, Max Wang, Patrick Wang, Paul B. Watkins, Rosina Weber, Qi Wei, Chunhua Weng, Jordan Whitlock, Mark D. Williams, Andrew Williams, Finn Womack, Erica Wood, Chunlei Wu, Jiwen Kevin Xin, Hao Xu, Colleen Xu, Chase Yakaboski, Yao Yao, Hong Yi, Arif Yilmaz, Marissa Zheng, Xinghua Zhou, Eric Zhou, Qian Zhu, Tom Zisk

Clinical and Translational Science · 06 Jun 2022 ·

Overview:
Biolink Model is an open-source data model that can be used to formalize the relationships between data structures in translational science. It incorporates object-oriented classification and graph-oriented features. The core of the model is a set of hierarchical, interconnected classes (or categories) and relationships between them (or predicates) representing biomedical entities such as gene, disease, chemical, anatomic structure, and phenotype. The model provides class and edge attributes and associations that guide how entities should relate to one another. Here, we highlight the need for a standardized data model for KGs, describe Biolink Model, and compare it with other models.

knowledge graphs biomedical translational

Highlights Preprint

Progress toward a universal biomedical data translator

Karamarie Fecho, Anne E. Thessen, Sergio E. Baranzini, Chris Bizon, Jennifer J. Hadlock, Sui Huang, Ryan T. Roper, Noel Southall, Casey Ta, Paul B. Watkins, Mark D. Williams, Hao Xu, William Byrd, Vlado Dančík, Marc P. Duby, Michel Dumontier, Gustavo Glusman, Nomi L. Harris, Eugene W. Hinderer, Greg Hyde, Adam Johs, Andrew I. Su, Guangrong Qin, Qian Zhu, Liliana Acevedo, Stanley C. Ahalt, John Alden, Ahmed Alkanaq, Nada Amin, Ricardo Avila, Michael Bada, Jim Balhoff, Andrew Baumgartner, William Baumgartner, Basazin Belhu, MacKenzie Brandes, Namdi Brandon, Matthew Brush, Richard Bruskiewich, Noel Burtt, Jackson Callaghan, Marco Alvarado Cano, Steven Carrell, J. Harry Caufield, Remzi Celebi, James Champion, Zhehuan Chen, Mei‐Jan Chen, Lawrence Chung, Paul A. Clemons, Kevin Cohen, Tom Conlin, Dan Corkill, Maria Costanzo, Steven Cox, Andrew Crouse, Camerron Crowder, Mary E. Crumbley, Cheng Dai, Ricardo De Miranda Azevedo, Eric Deutsch, Jennifer Dougherty, Venkata Duvvuri, Stephen Edwards, Vincent Emonet, Nathaniel Fehrmann, Jason Flannick, Aleksandra M. Foksinska, Vicki Gardner, Edgar Gatica, Amy Glen, Prateek Goel, Joseph Gormley, Alon Greyber, Perry Haaland, Melissa A. Haendel, Kristina Hanspers, Kaiwen He, Jeff Henrickson, Maureen Hoatlin, Andrew Hoffman, Conrad Huang, Robert Hubal, Kenneth Huellas‐Bruskiewicz, Forest B. Huls, Lawrence Hunter, Tursynay Issabekova, Matthew Jarrell, Lindsay Jenkins, Arpita Joshi, Jimin Kang, Richa Kanwar, Yaphet Kebede, Keum Joo Kim, Alexandria Kluge, Michael Knowles, Ryan Koesterer, Daniel Korn, David Koslicki, Ashok Krishnamurthy, Lindsey Kvarfordt, Jay Lee, Margaret Leigh, Jason Lin, Zheng Liu, Shaopeng Liu, Chunyu Ma, Andrew Magis, Tarun Mamidi, Meisha Mandal, Michelle Mantilla, Jeffrey Massung, Denise Mauldin, Jason McClelland, Julie McMurry, Philip Mease, Luis Mendoza, Marian Mersmann, Abrar Mesbah, Matthew Might, Kenny Morton, Sierra A. T. Moxon, Sandrine Muller, Arun Teja Muluka, Christopher J. Mungall, John Osborne, Phil Owen, Michael Patton, David B. Peden, R. Carter Peene, Bria Persaud, Emily Pfaff, Alexander Pico, Elizabeth Pollard, Guthrie Price, Tim Putman, Shruti Raj, Stephen A. Ramsey, Jason Reilly, Anders Riutta, Jared Roach, Greg Rosenblatt, Irit Rubin, Sienna Rucka, Nathaniel Rudavsky‐Brody, Rayn Sakaguchi, Eugene Santos, Kevin Schaper, Charles P. Schmitt, Shepherd Schurman, Erik Scott, Sarah Seitanakis, Priya Sharma, Kent A. Shefchek, Ilya Shmulevich, Manil Shrestha, Shalki Shrivastava, Meghamala Sinha, Brett Smith, Harold Solbrig, Karthik Soman, Nicholas Southern, Lisa Stillwell, Michael “Michi” Strasser, Anne E. Thessen, Jillian Tinglin, Lucas Tonstad, Thi Tran‐Nguyen, Alexander Tropsha, Deepak R. Unni, Gaurav Vaidya, Luke Veenhuis, Adam Viola, Marcin von Grotthuss, Max Wang, Patrick Wang, Rosina Weber, Qi Wei, Chunhua Weng, Jordan Whitlock, Andrew Williams, Finn Womack, Erica Wood, Chunlei Wu, Jiwen Kevin Xin, Colleen Xu, Chase Yakaboski, Yao Yao, Hong Yi, Arif Yilmaz, Marissa Zheng, Xinghua Zhou, Eric Zhou, Tom Zisk

Clinical and Translational Science · 06 Jun 2022 ·

Overview:
The Biomedical Data Translator Consortium has developed and tested a pilot knowledge graph-based “Translator” system capable of integrating existing biomedical data sets and “translating” those data into insights intended to augment human reasoning and accelerate translational science.

Highlights

TAMPA: interpretable analysis and visualization of metagenomics-based taxon abundance profiles

Varuni Sarwal, Jaqueline Brito, Serghei Mangul, David Koslicki

Cold Spring Harbor Laboratory · 29 Apr 2022 ·

Overview:
Here we report the development of TAMPA (Taxonomic metagenome profiling evaluation), a robust and easy-to-use method that allows scientists to easily interpret and interact with taxonomic profiles produced by the many different taxonomic profiler methods beyond the standard metrics used by the scientific community. We demonstrate the unique ability of TAMPA to provide important biological insights into the taxonomic differences between samples otherwise missed by commonly utilized metrics. Additionally, we show that TAMPA can help visualize the output of taxonomic profilers, enabling biologists to effectively choose the most appropriate profiling method to use on their metagenomics data.

metagenomic profiling visualization

Highlights Github

Critical Assessment of Metagenome Interpretation: the second round of challenges

Fernando Meyer, Adrian Fritz, Zhi-Luo Deng, David Koslicki, Till Robin Lesker, Alexey Gurevich, Gary Robertson, Mohammed Alser, Dmitry Antipov, Francesco Beghini, Denis Bertrand, Jaqueline J. Brito, C. Titus Brown, Jan Buchmann, Aydin Buluç, Bo Chen, Rayan Chikhi, Philip T. L. C. Clausen, Alexandru Cristian, Piotr Wojciech Dabrowski, Aaron E. Darling, Rob Egan, Eleazar Eskin, Evangelos Georganas, Eugene Goltsman, Melissa A. Gray, Lars Hestbjerg Hansen, Steven Hofmeyr, Pingqin Huang, Luiz Irber, Huijue Jia, Tue Sparholt Jørgensen, Silas D. Kieser, Terje Klemetsen, Axel Kola, Mikhail Kolmogorov, Anton Korobeynikov, Jason Kwan, Nathan LaPierre, Claire Lemaitre, Chenhao Li, Antoine Limasset, Fabio Malcher-Miranda, Serghei Mangul, Vanessa R. Marcelino, Camille Marchet, Pierre Marijon, Dmitry Meleshko, Daniel R. Mende, Alessio Milanese, Niranjan Nagarajan, Jakob Nissen, Sergey Nurk, Leonid Oliker, Lucas Paoli, Pierre Peterlongo, Vitor C. Piro, Jacob S. Porter, Simon Rasmussen, Evan R. Rees, Knut Reinert, Bernhard Renard, Espen Mikal Robertsen, Gail L. Rosen, Hans-Joachim Ruscheweyh, Varuni Sarwal, Nicola Segata, Enrico Seiler, Lizhen Shi, Fengzhu Sun, Shinichi Sunagawa, Søren Johannes Sørensen, Ashleigh Thomas, Chengxuan Tong, Mirko Trajkovski, Julien Tremblay, Gherman Uritskiy, Riccardo Vicedomini, Zhengyang Wang, Ziye Wang, Zhong Wang, Andrew Warren, Nils Peder Willassen, Katherine Yelick, Ronghui You, Georg Zeller, Zhengqiao Zhao, Shanfeng Zhu, Jie Zhu, Ruben Garrido-Oter, Petra Gastmeier, Stephane Hacquard, Susanne Häußler, Ariane Khaledi, Friederike Maechler, Fantin Mesny, Simona Radutoiu, Paul Schulze-Lefert, Nathiana Smit, Till Strowig, Andreas Bremges, Alexander Sczyrba, Alice Carolyn McHardy

Nature Methods · 01 Apr 2022 ·

Overview:
We analyze and present the results of the second round of the CAMI competition. Over 5,000 results were submitted by 76 different metagenomic tools covering the tasks of assembly, binning, and profiling.

metagenomics CAMI

Highlights Github Preprint

PressPurt: network sensitivity to press perturbations under interaction uncertainty

David Koslicki, Dana Gibbon, Mark Novak

F1000Research · 11 Feb 2022 ·

Overview:
Here, we introduce PressPurt, a computational package for identifying the interactions whose strengths must be estimated most accurately in order to produce robust predictions of a network’s response to press perturbations. The package provides methods for calculating and visualizing these edge-specific sensitivities (tolerances) when uncertainty is associated with one or more edges according to a variety of different error distributions. The software requires the network to be represented as a numerical (quantitative or qualitative) Jacobian matrix evaluated at stable equilibrium.

metagenomic visualization network

Highlights Github

The Statistics of k-mers from a Sequence Undergoing a Simple Mutation Process Without Spurious Matches

Antonio Blanca, Robert S. Harris, David Koslicki, Paul Medvedev

Journal of Computational Biology · 01 Feb 2022 ·

Overview:
We derive the expectation and variance of the number of mutated k-mers and of the number of islands (a maximal interval of mutated k-mers) and oceans (a maximal interval of nonmutated k-mers). We then derive hypothesis tests and confidence intervals (CIs) for r given an observed number of mutated k-mers, or, alternatively, given the Jaccard similarity (with or without MinHash).

k-mer statistics

Highlights Github Preprint

The minimizer Jaccard estimator is biased and inconsistent

Mahdi Belbasi, Antonio Blanca, Robert S. Harris, David Koslicki, Paul Medvedev

Cold Spring Harbor Laboratory · 17 Jan 2022 ·

Overview:
We show that the minimizer Jaccard estimator is biased and inconsistent, which means that the expected difference (i.e., the bias) between the estimator and the true value is not zero, even in the limit as the lengths of the sequences grow. We derive an analytical formula for the bias as a function of how the shared k-mers are laid out along the sequences. We show both theoretically and empirically that there are families of sequences where the bias can be substantial (e.g. the true Jaccard can be more than double the estimate). Finally, we demonstrate that this bias affects the accuracy of the widely used mashmap read mapping tool.

MinHash minimizer Jaccard

Highlights Github

Debiasing FracMinHash and deriving confidence intervals for mutation rates across a wide range of evolutionary distances

Mahmudur Rahman Hera, N. Tessa Pierce-Ward, David Koslicki

Cold Spring Harbor Laboratory · 12 Jan 2022 ·

Overview:
In this paper, we theoretically analyzed FracMinHash, proved a bias in the estimator, and finally showed how to debias the estimator. We then showed how this FracMinHash estimator can be used to estimate the evolutionary distance of two genomes. Our results show that mutation rates determined by our analyses are more accurate across a wider range of evolutionary distances compared to classical MinHash techniques.

FracMinHash ANI Confidence intervals Mutation rate

Highlights Github Preprint
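As a rough illustration of the sketching idea behind the FracMinHash entry above, the following minimal Python sketch keeps only the k-mers whose hash falls below a fixed fraction of the hash range and then estimates containment from the two sketches. The function names, the `fraction` parameter, and the use of SHA-256 as the hash are assumptions made for this example. It shows only the naive estimator; the debiasing and confidence intervals derived in the paper are not reproduced here.

```python
import hashlib

MAX_HASH = 2 ** 64  # size of the hash range used in this example


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (SHA-256 is an arbitrary choice here)."""
    digest = hashlib.sha256(kmer.encode()).digest()
    return int.from_bytes(digest[:8], "big")


def frac_minhash(seq, k, fraction):
    """Keep the hashes of k-mers that fall below fraction * MAX_HASH."""
    threshold = int(MAX_HASH * fraction)
    hashes = {hash_kmer(seq[i:i + k]) for i in range(len(seq) - k + 1)}
    return {h for h in hashes if h < threshold}


def containment_estimate(sketch_a, sketch_b):
    """Naive (uncorrected) estimate of the fraction of A contained in B."""
    if not sketch_a:
        return 0.0
    return len(sketch_a & sketch_b) / len(sketch_a)


genome_a = "ACGTACGTTTGACCAGGATTACA"
genome_b = "ACGTACGTTTGACCAGGATTACAGGCTTAACG"
full_a = frac_minhash(genome_a, 4, 1.0)   # fraction 1.0 keeps every k-mer
half_a = frac_minhash(genome_a, 4, 0.5)   # keeps roughly half of them
full_b = frac_minhash(genome_b, 4, 1.0)
print(containment_estimate(full_a, full_b))
```

Because the same threshold is applied to every sequence, a sketch taken at a smaller fraction is always a subset of one taken at a larger fraction, which is what allows sketches of very differently sized sets to be compared.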

Lightweight compositional analysis of metagenomes with FracMinHash and minimum metagenome covers

Luiz Irber, Phillip T. Brooks, Taylor Reiter, N. Tessa Pierce-Ward, Mahmudur Rahman Hera, David Koslicki, C. Titus Brown

Cold Spring Harbor Laboratory · 12 Jan 2022 ·

Overview:
We implement the FracMinHash sketching technique in the sourmash software package and analyze its use in shotgun metagenome compositional analysis by modeling the problem as a minimum metagenome cover.

FracMinHash Sourmash Metagenome Compositional analysis Minimum metagenome cover

Highlights Github

WGSUniFrac: Applying UniFrac Metric to Whole Genome Shotgun Data

Wei Wei, David Koslicki

Schloss Dagstuhl - Leibniz-Zentrum für Informatik · 01 Jan 2022 ·

Overview:
In this study, we demonstrate a method to overcome this intrinsic difference and compute the UniFrac metric on WGS data by assigning branch lengths to the taxonomic tree obtained from input taxonomic profiles. We conduct a series of experiments to demonstrate that this WGSUniFrac method is comparably robust to traditional 16S UniFrac and is not highly sensitive to branch lengths assignments, be they data-derived or model-prescribed.

metagenomic WGS Unifrac Similarity metric

Highlights Github Preprint

2021

Technology dictates algorithms: recent developments in read alignment

Mohammed Alser, Jeremy Rotman, Dhrithi Deshpande, Kodi Taraszka, Huwenbo Shi, Pelin Icer Baykal, Harry Taegyun Yang, Victor Xue, Sergey Knyazev, Benjamin D. Singer, Brunilda Balliu, David Koslicki, Pavel Skums, Alex Zelikovsky, Can Alkan, Onur Mutlu, Serghei Mangul

Genome Biology · 26 Aug 2021 ·

Overview:
We provide a systematic survey of 107 long and short read alignment tools and analyze how alignment algorithms have co-evolved with sequencing technology advancements, as well as how these algorithmic changes affect tool performance.

Read alignment Review

Highlights Github Preprint

Application of MCAT questions as a testing tool and evaluation metric for knowledge graph–based reasoning systems

Karamarie Fecho, James Balhoff, Chris Bizon, William E. Byrd, Sui Hang, David Koslicki, Stefano E. Rensi, Patrick L. Schmitt, Mathias J. Wawer, Mark Williams, Stanley C. Ahalt

Clinical and Translational Science · 09 Apr 2021 ·

Overview:
We apply MCAT questions as a testing tool and evaluation metric for reasoning systems based on knowledge graphs.

knowledge graphs reasoning

Highlights

Tutorial: assessing metagenomics software with the CAMI benchmarking toolkit

Fernando Meyer, Till-Robin Lesker, David Koslicki, Adrian Fritz, Alexey Gurevich, Aaron E. Darling, Alexander Sczyrba, Andreas Bremges, Alice C. McHardy

Nature Protocols · 01 Mar 2021 ·

Overview:
Tutorial for using the CAMI benchmarking toolkit to assess metagenomics software.

metagenomics CAMI

Highlights Github Preprint

2020

Metalign: efficient alignment-based metagenomic profiling via containment min hash

Nathan LaPierre, Mohammed Alser, Eleazar Eskin, David Koslicki, Serghei Mangul

Genome Biology · 10 Sep 2020 ·

Overview:
Utilizing a pre-filter strategy, Metalign performs efficient and accurate alignment-based metagenomic profiling through the estimation of containment indices of pre-built reference databases.

metagenomic profiling alignment

Highlights Github Preprint

Metagenomics for clinical diagnostics: technologies and informatics

Caitlin Loeffler, Keylie M. Gibson, Lana Martin, Liz Chang, Jeremy Rotman, Ian V. Toma, Christopher E. Mason, Eleazar Eskin, Joseph P. Zackular, Keith A. Crandall, David Koslicki, Serghei Mangul

arXiv · 11 Aug 2020 ·

Overview:
We present an overview of metagenomics methods with a discussion of computational challenges and limitations.

metagenomic clinical diagnostics

Highlights Preprint

Improving the usability and comprehensiveness of microbial databases

Caitlin Loeffler, Aaron Karlsberg, Lana S. Martin, Eleazar Eskin, David Koslicki, Serghei Mangul

BMC Biology · 07 Apr 2020 ·

Overview:
We analyze and discuss guidelines for the development of a master reference database for disparate reference databases.

metagenomics database

Highlights Github Preprint

Finer Metagenomic Reconstruction via Biodiversity Optimization

Simon Foucart, David Koslicki

arXiv · 24 Jan 2020 ·

Overview:
We leverage a recently developed notion of biological diversity that simultaneously accounts for organism similarities and retains the optimization strategy underlying compressive-sensing-based approaches.

metagenomic reconstruction biodiversity optimization

Highlights Github Preprint

2019

Leveraging Distributed Biomedical Knowledge Sources to Discover Novel Uses for Known Drugs

Finn Womack, Jason McClelland, David Koslicki

Cold Spring Harbor Laboratory · 11 Sep 2019 ·

Overview:
We leverage a graph-based approach to integrate biological knowledge and then employ a graph node embedding scheme to make novel predictions about current drugs.

knowledge graphs drug repurposing

Highlights Github Preprint

Improving MinHash via the containment index with applications to metagenomic analysis

David Koslicki, Hooman Zabeti

Applied Mathematics and Computation · 01 Aug 2019 ·

Overview:
This work improves upon the so-called “min hash” technique (a “probabilistic data analysis” method) to develop a very fast and efficient way to estimate the similarity of two sets of objects (in terms of how much they overlap). The approach we present is orders of magnitude faster (and uses orders of magnitude less space) when the two data sets under consideration are of very different sizes. The kinds of sets we consider are sets of sub-strings (called k-mers) of DNA sequences from communities of microorganisms.

MinHash Jaccard index containment index metagenomics

Highlights Github Preprint
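To make the two similarity measures in the entry above concrete, here is a minimal Python sketch that computes the exact Jaccard index and containment index of two k-mer sets. It is only an illustration of the quantities being estimated, not the paper's sketching method, and the helper names and example sequences are invented for this example.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings (k-mers) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """|A ∩ B| / |A ∪ B|: overlap relative to the union of both sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def containment(a, b):
    """|A ∩ B| / |A|: the fraction of A that also appears in B."""
    return len(a & b) / len(a) if a else 0.0


small = kmers("ACGTACGT", 4)
large = kmers("ACGTACGTTTGACCAGGATTACA", 4)
print(jaccard(small, large), containment(small, large))
```

In this example every k-mer of the short sequence occurs in the long one, so the containment index is 1 while the Jaccard index is pulled down by the size of the larger set. That gap is the reason containment is the more informative quantity when the two sets differ greatly in size.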

MiCoP: microbial community profiling method for detecting viral and fungal organisms in metagenomic samples

Nathan LaPierre, Serghei Mangul, Mohammed Alser, Igor Mandric, Nicholas C. Wu, David Koslicki, Eleazar Eskin

BMC Genomics · 01 Jun 2019 ·

Overview:
We show that read mapping, along with a probabilistic assignment of multi-mapped reads, outperforms other computational approaches to identify the presence and relative amount of viral and fungal organisms in a metagenomic sample of microorganismal DNA.

microbial community metagenomics

Highlights Github Preprint

Assessing taxonomic metagenome profilers with OPAL

Fernando Meyer, Andreas Bremges, Peter Belmann, Stefan Janssen, Alice C. McHardy, David Koslicki

Genome Biology · 04 Mar 2019 ·

Overview:
We introduce a framework to compare tools utilized to determine what microbes are present in a sample, and at what relative abundance. This will help computational biologists design better tools to analyze communities of microorganisms (which affect nearly everything in existence!).

metagenome taxonomic profile

Highlights Github Preprint

2018

The Biomedical Data Translator Program: Conception, Culture, and Community

Clinical and Translational Science · 09 Nov 2018 ·

Overview:
A description of the NIH NCATS culture that emerged during the Translator project.

biomedical translational science

Highlights

Toward A Universal Biomedical Data Translator

Clinical and Translational Science · 09 Nov 2018 ·

Overview:
The vision and high-level overview of the NIH National Center for Advancing Translational Science (NCATS) project entitled “Translator” (through which we have been funded). The goal of the project is essentially to build a biomedical “Siri”: an automated platform for answering biomedical research questions that leverages repositories of publicly available information.

biomedical translational science

Highlights Github UI

Transcriptome analysis in whole blood reveals increased microbial diversity in schizophrenia

Loes M. Olde Loohuis, Serghei Mangul, Anil P. S. Ori, Guillaume Jospin, David Koslicki, Harry Taegyun Yang, Timothy Wu, Marco P. Boks, Catherine Lomen-Hoerth, Martina Wiedau-Pazos, Rita M. Cantor, Willem M. de Vos, René S. Kahn, Eleazar Eskin, Roel A. Ophoff

Translational Psychiatry · 10 May 2018 ·

Overview:
Along with collaborators at UCLA, we were able to detect a small but significant amount of microbes in blood. This is surprising since it’s typically assumed that the immune system removes any microbial presence from human blood. I used a reference-free microbial community algorithm, called EMDeBruijn, to help corroborate the patterns we saw, which included an increase in microbial diversity in schizophrenia patients. EMDeBruijn is a metric based on the Wasserstein metric (aka the Earth Mover’s Distance) and a de Bruijn graph induced by the k-mers in a metagenomic DNA sample.

microbial community de bruijn k-mer

Highlights Preprint

EMDUniFrac: exact linear time computation of the UniFrac metric and identification of differentially abundant organisms

Jason McClelland, David Koslicki

Journal of Mathematical Biology · 25 Apr 2018 ·

Overview:
Rapidly answers “why are these data sets different” by leveraging hierarchical/relatedness information. In short, we develop an algorithm to quickly compute the Unifrac distance by leveraging the earth mover’s distance, prove its correctness, and derive time and space complexity characterizations.

Unifrac distance

Highlights Github Preprint
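The connection between UniFrac and the earth mover's distance in the entry above can be sketched in a few lines of Python. On a tree, moving mass optimally means that each branch carries the absolute difference in abundance of the subtree below it, so one leaves-to-root pass suffices. This is a minimal sketch under that assumption, not the published implementation; the dictionary-based tree encoding and all names are hypothetical.

```python
def emd_unifrac(parent, length, p, q):
    """Weighted UniFrac between abundance maps p and q on a rooted tree.

    parent[v] is the parent of node v (the root maps to None) and length[v]
    is the length of the branch joining v to its parent.
    """
    # Build child lists and find the root.
    children = {v: [] for v in parent}
    root = None
    for v, u in parent.items():
        if u is None:
            root = v
        else:
            children[u].append(v)

    # Breadth-first order from the root; reversed, it visits children first.
    order = [root]
    for v in order:
        order.extend(children[v])

    # diff[v] accumulates the abundance difference of the subtree below v.
    diff = {v: p.get(v, 0.0) - q.get(v, 0.0) for v in parent}
    total = 0.0
    for v in reversed(order):
        if parent[v] is None:
            continue
        total += length[v] * abs(diff[v])  # mass that must cross this branch
        diff[parent[v]] += diff[v]         # push the imbalance up the tree
    return total


parent = {"R": None, "X": "R", "C": "R", "A": "X", "B": "X"}
length = {"X": 1.0, "C": 2.0, "A": 1.0, "B": 1.0}
print(emd_unifrac(parent, length, {"A": 1.0}, {"B": 1.0}))
print(emd_unifrac(parent, length, {"A": 1.0}, {"C": 1.0}))
```

Each node is visited a constant number of times, so the pass is linear in the size of the tree. The branches with a large `length[v] * abs(diff[v])` contribution are also the ones that point to where the two samples differ.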

2017

IndeCut evaluates performance of network motif discovery algorithms

Mitra Ansariola, Molly Megraw, David Koslicki

Bioinformatics · 11 Dec 2017 ·

Overview:
A gene regulatory network is basically a representation of how genes interact with each other. In this work, we develop the only (to date) method to assess the accuracy of so-called “motif discovery algorithms” that seek to find important sub-networks of a given gene regulatory network. We develop a provably correct mathematical approach (based on a variety of metrics that say how close two matrices are to each other) and use this to assess the performance of a variety of motif discovery algorithms.

network motif

Highlights Github Preprint

Critical Assessment of Metagenome Interpretation—a benchmark of metagenomics software

Alexander Sczyrba, Peter Hofmann, Peter Belmann, David Koslicki, Stefan Janssen, Johannes Dröge, Ivan Gregor, Stephan Majda, Jessika Fiedler, Eik Dahms, Andreas Bremges, Adrian Fritz, Ruben Garrido-Oter, Tue Sparholt Jørgensen, Nicole Shapiro, Philip D Blood, Alexey Gurevich, Yang Bai, Dmitrij Turaev, Matthew Z DeMaere, Rayan Chikhi, Niranjan Nagarajan, Christopher Quince, Fernando Meyer, Monika Balvočiūtė, Lars Hestbjerg Hansen, Søren J Sørensen, Burton K H Chia, Bertrand Denis, Jeff L Froula, Zhong Wang, Robert Egan, Dongwan Don Kang, Jeffrey J Cook, Charles Deltel, Michael Beckstette, Claire Lemaitre, Pierre Peterlongo, Guillaume Rizk, Dominique Lavenier, Yu-Wei Wu, Steven W Singer, Chirag Jain, Marc Strous, Heiner Klingenberg, Peter Meinicke, Michael D Barton, Thomas Lingner, Hsin-Hung Lin, Yu-Chieh Liao, Genivaldo Gueiros Z Silva, Daniel A Cuevas, Robert A Edwards, Surya Saha, Vitor C Piro, Bernhard Y Renard, Mihai Pop, Hans-Peter Klenk, Markus Göker, Nikos C Kyrpides, Tanja Woyke, Julia A Vorholt, Paul Schulze-Lefert, Edward M Rubin, Aaron E Darling, Thomas Rattei, Alice C McHardy

Nature Methods · 02 Oct 2017 ·

Overview:
In a very reproducible fashion, we assess a wide variety of computational techniques in metagenomics, including assembly (putting together pieces of genomes, called contigs, from short reads), binning (figuring out where the contigs came from), and taxonomic profiling (determining which organisms are present in a sample and at what relative amount).

metagenomics computational methods

Highlights Github Preprint

Exact probabilities for the indeterminacy of complex networks as perceived through press perturbations

David Koslicki, Mark Novak

Journal of Mathematical Biology · 22 Jul 2017 ·

Overview:
In a network of interacting quantities (such as a food web), we examine how qualitative and quantitative predictions change when a quantity (such as the abundance of an organism or a set of organisms) is increased. This is quantified in terms of which model parameters cause the largest change in predictions.

network

Highlights Preprint

2016

Substitution Markov chains and Martin boundaries

David Koslicki, Manfred Denker

Rocky Mountain Journal of Mathematics · 01 Dec 2016 ·

Overview:
After introducing the notion of a random substitution Markov chain, we relate it to other notions of a “random substitution” and give a complete description of the Martin boundary for a few interesting examples.

Markov chains Martin boundaries

Highlights Preprint

Reference-free comparison of microbial communities via de Bruijn graphs

Serghei Mangul, David Koslicki

Proceedings of the 7th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics · 02 Oct 2016 ·

Overview:
We present the idea of using the “earth mover’s distance” (aka the first Wasserstein metric) to measure the distance between samples of DNA. This reduces to finding the most efficient way to transform one kind of graph (known as de Bruijn graphs) into another.

de Bruijn graph

Highlights Github Preprint

MetaPalette: a k-mer Painting Approach for Metagenomic Taxonomic Profiling and Quantification of Novel Strain Variation

David Koslicki, Daniel Falush

mSystems · 28 Jun 2016 ·

Overview:
We present a computational technique that answers the question “Which organisms are present in a given sample of DNA from a microbial community, and at what relative amount?” while simultaneously predicting the relatedness of novel (never-before-seen) organisms to known organisms. The method uses a mathematical technique referred to as sparsity-promoting optimization together with a technique similar to the Jaccard index.

metagenomic profiling

Highlights Github Preprint

2015

ARK: Aggregation of Reads by K-Means for Estimation of Bacterial Community Composition

David Koslicki, Saikat Chatterjee, Damon Shahrivar, Alan W. Walker, Suzanna C. Francis, Louise J. Fraser, Mikko Vehkaperä, Yueheng Lan, Jukka Corander

PLOS ONE · 23 Oct 2015 ·

Overview:
We develop a pre-processing step that significantly improves k-mer based metagenomic profiling techniques.

k-mer metagenomic profiling

Highlights Github Preprint

2014

SEK: sparsity exploiting k-mer-based estimation of bacterial community composition

S. Chatterjee, D. Koslicki, S. Dong, N. Innocenti, L. Cheng, Y. Lan, M. Vehkapera, M. Skoglund, L. K. Rasmussen, E. Aurell, J. Corander

Bioinformatics · 07 May 2014 ·

Overview:
In this paper, we improve both the accuracy and speed of the Quikr approach to classifying a given set of metagenomic DNA sequences (16S rRNA). This is accomplished by increasing the number of “feature vectors” we use for each training genome, and by modifying the Lawson-Hanson algorithm for non-negative least squares.

k-mer metagenome

Highlights Github Preprint

Sparse Recovery by Means of Nonnegative Least Squares

Simon Foucart, David Koslicki

IEEE Signal Processing Letters · 01 Apr 2014 ·

Overview:
We prove that nonnegative least squares (typically prone to over-fitting) can be slightly modified to return sparse results.

Highlights Reproducible Preprint

WGSQuikr: Fast Whole-Genome Shotgun Metagenomic Classification

David Koslicki, Simon Foucart, Gail Rosen

PLoS ONE · 13 Mar 2014 ·

Overview:
Extending the results of Quikr to whole genome shotgun metagenomic samples, we develop a method to automatically select a parameter that balances sparsity (how succinct the result is) with accuracy.

metagenomic classification k-mer

Highlights Sourceforge Preprint

Coding sequence density estimation via topological pressure

David Koslicki, Daniel J. Thompson

Journal of Mathematical Biology · 22 Jan 2014 ·

Overview:
We demonstrate that a concept of “weighted information content” (known as topological pressure, from the ergodic theory literature) can be used to facilitate the analysis of genomic data (in particular, to find areas of a genome that have many genes in them). This is a conceptual extension of the topological entropy approach presented earlier.

topological entropy

Highlights Sourceforge Preprint

On Entropy-Based Data Mining

Andreas Holzinger, Matthias Hörtenhuber, Christopher Mayer, Martin Bachler, Siegfried Wassertheurer, Armando J. Pinho, David Koslicki

Interactive Knowledge Discovery and Data Mining in Biomedical Informatics · 01 Jan 2014 ·

Overview:
We review a variety of entropy/randomness-based techniques that are useful in a variety of data mining applications.

entropy data mining

Highlights Preprint

2013

Quikr: a method for rapid reconstruction of bacterial communities via compressive sensing

David Koslicki, Simon Foucart, Gail Rosen

Bioinformatics · 20 Jun 2013 ·

Overview:
We introduce an extremely fast, light-weight, “big data” algorithm to quickly answer the question of “which bacteria are present?” in a given sample of DNA. The method is based on the theory of compressed sensing and aims to find the simplest explanation for the data in terms of known information.

k-mer compressed sensing metagenomic classification

Highlights Github Sourceforge Preprint

2011

Topological entropy of DNA sequences

David Koslicki

Bioinformatics · 10 Feb 2011 ·

Overview:
I define a new notion of “randomness” (called topological entropy) suitable for use on sequences of symbols (words) of finite length. I show that this can be used to distinguish between biologically interesting sequences in the human genome.

applied math entropy

Highlights Implementation Preprint
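For readers who want a feel for the quantity in the entry above, the following minimal Python sketch computes a finite-sequence topological entropy. It assumes the definition takes this form: pick the largest subword length n such that 4^n + n - 1 letters are available, truncate the sequence to that length, count the distinct subwords of length n, and return log base 4 of that count divided by n. The overview itself does not spell this out, so the formula and the function name should be read as assumptions of this example.

```python
import math


def topological_entropy(seq, alphabet_size=4):
    """Finite-sequence topological entropy, under the definition assumed above."""
    # Largest n with alphabet_size**n + n - 1 <= len(seq).
    n = 0
    while alphabet_size ** (n + 1) + (n + 1) - 1 <= len(seq):
        n += 1
    if n == 0:
        raise ValueError("sequence is too short for any subword length")

    # Truncate so that exactly alphabet_size**n subwords of length n are read.
    prefix = seq[:alphabet_size ** n + n - 1]
    distinct = {prefix[i:i + n] for i in range(len(prefix) - n + 1)}
    return math.log(len(distinct), alphabet_size) / n


print(topological_entropy("ACGT"))
print(topological_entropy("AAAA"))
```

Truncating to 4^n + n - 1 letters means a sequence can, at best, show every possible subword of length n exactly once, which keeps the value between 0 and 1 and makes sequences of different lengths comparable.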
