Data-driven discovery of nucleotide sequences belonging to species in
a metagenome
Kasun Vimukthi 1, Geeth Wimalasiri 1*, Prabhath Bandara 1 and Damayanthi Herath 1
1 Department of Computer Engineering, Faculty of Engineering, University of Peradeniya, Sri Lanka
*E-mail: [email protected]
Abstract: Metagenomics is a relatively recent area of microbiology that helps to discover novel species and
to study known species and their dynamics in various environments. A key step in a metagenomic
study is grouping the nucleotide sequences in a sample by species, a process known as “binning.”
Binning can be done against a reference or from the characteristics shared among the sequences themselves
(data-driven), typically using machine learning techniques such as unsupervised learning. This paper focuses on
improving data-driven binning by increasing the number of metagenomic sequences binned while maintaining reasonable
binning accuracy. A dissimilarity-based approach is proposed to increase the number of contigs binned
by an existing binning method. The proposed method is shown to increase the number of contigs binned by
10% while retaining reasonable accuracy compared to the original method. This work suggests that making effective
use of observed data that would otherwise be discarded as outliers can improve binning performance.
Keywords: Binning, Metagenomics, Data-driven methods, Number of contig bins, Outlier handling, Improved
contig assignments, Mahalanobis distance