ESCaPe 2020 Proceedings

Data-driven discovery of nucleotide sequences belonging to species in a metagenome

Kasun Vimukthi 1, Geeth Wimalasiri 1*, Prabhath Bandara 1 and Damayanthi Herath 1

1 Department of Computer Engineering, Faculty of Engineering, University of Peradeniya, Sri Lanka
*E-mail: geethp@eng.pdn.ac.lk

Abstract: Metagenomics is a relatively recent area of microbiology that helps to discover novel species and to study existing species and their dynamics in various environments. A key step in a metagenomic study is classifying the nucleotide sequences in a sample by the species they belong to, which is known as "binning". Binning can be done against a reference or based on the mutual characteristics of the sequences (data-driven), in which case machine learning techniques such as unsupervised learning are involved. This paper focuses on optimizing data-driven binning by increasing the number of metagenomic sequences binned while maintaining reasonable binning accuracy. A dissimilarity-based approach is proposed to increase the number of contigs binned by an existing binning method. The proposed method is shown to increase the number of contigs binned by 10% while retaining reasonable accuracy compared to the original method. This work suggests that making effective use of observed data that would otherwise be discarded as outliers may improve binning performance.

Keywords: Binning, Metagenomics, Data-driven methods, Number of contig bins, Outlier handling, Improve contig assignments, Mahalanobis distance
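The abstract does not spell out the dissimilarity-based procedure, so the following is only a minimal sketch of one way contigs left unbinned by an existing method could be assigned to bins using the Mahalanobis distance named in the keywords. The function name `assign_unbinned`, the numeric feature representation of contigs, the `max_dist` cutoff, and the use of `-1` for contigs that stay unassigned are all assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def assign_unbinned(features, labels, unbinned, max_dist=None):
    """Assign each unbinned contig to its nearest bin by Mahalanobis distance.

    features : (n, d) array of feature vectors for already-binned contigs
    labels   : (n,) integer bin labels for those contigs
    unbinned : (m, d) array of feature vectors for contigs left unbinned
    max_dist : optional cutoff (hypothetical); contigs farther than this
               from every bin keep the label -1
    Returns (assigned_labels, nearest_distances).
    """
    features = np.asarray(features, dtype=float)
    unbinned = np.asarray(unbinned, dtype=float)
    labels = np.asarray(labels)
    bins = np.unique(labels)
    dists = np.empty((len(unbinned), len(bins)))

    for j, b in enumerate(bins):
        members = features[labels == b]
        mean = members.mean(axis=0)
        # Per-bin covariance; the pseudo-inverse tolerates singular matrices.
        cov = np.atleast_2d(np.cov(members, rowvar=False))
        cov_inv = np.linalg.pinv(cov)
        diff = unbinned - mean
        # Squared Mahalanobis distance: diff^T * cov_inv * diff, row by row.
        sq = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
        dists[:, j] = np.sqrt(np.maximum(sq, 0.0))

    nearest = dists.argmin(axis=1)
    nearest_dist = dists[np.arange(len(unbinned)), nearest]
    assigned = bins[nearest].astype(int)
    if max_dist is not None:
        # Leave contigs unassigned when no bin is close enough.
        assigned = np.where(nearest_dist <= max_dist, assigned, -1)
    return assigned, nearest_dist
```

In this sketch, a smaller `max_dist` keeps assignments conservative, while a larger one bins more contigs and risks more wrong assignments, which mirrors the trade-off between the number of contigs binned and binning accuracy described in the abstract.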
