NTU Undergraduates' research April 2014 - Biosciences | Page 16
Examining the distribution of overlapping reading
frames within viruses
Sukhdeep Kaur
15/04/2014
Biomedical Sciences
Nottingham Trent University, Clifton Lane, Nottingham, NG11 8NS
Abstract:
This study examined the distribution of overlapping genes within viruses and how that
distribution differs from one family of viruses to another. It also examined the orientation
of each overlap in detail, identifying whether double-stranded DNA (dsDNA) viruses had more
convergent, divergent, parallel or nested overlaps.
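
The four orientations can be made concrete with a minimal sketch. It is not the script used
in the study. It assumes the commonly used definitions (nested: one gene lies wholly inside
the other; parallel: same strand, partial overlap; convergent: opposite strands with the 3'
ends overlapping; divergent: opposite strands with the 5' ends overlapping), and the Gene
record and classify_overlap function are hypothetical names introduced only for illustration.

```python
from collections import namedtuple

# Hypothetical record type (not from the study): 1-based inclusive coordinates
# on the forward strand, with strand = +1 (forward) or -1 (reverse).
Gene = namedtuple("Gene", ["start", "end", "strand"])


def classify_overlap(gene_a, gene_b):
    """Return (orientation, overlap_length), or None if the genes do not overlap."""
    # Order the pair so that `left` starts first; on a tie the longer gene goes first.
    left, right = sorted([gene_a, gene_b], key=lambda g: (g.start, -g.end))
    if right.start > left.end:
        return None  # the second gene starts after the first one ends

    # Number of shared positions (coordinates are inclusive).
    overlap_length = min(left.end, right.end) - right.start + 1

    if right.end <= left.end:
        orientation = "nested"      # `right` lies wholly inside `left`
    elif left.strand == right.strand:
        orientation = "parallel"    # same strand, partial overlap
    elif left.strand == 1:
        orientation = "convergent"  # + then -: the 3' ends overlap
    else:
        orientation = "divergent"   # - then +: the 5' ends overlap
    return orientation, overlap_length


print(classify_overlap(Gene(1, 100, 1), Gene(90, 200, -1)))  # ('convergent', 11)
```

Sorting the pair by start position first means the strand test only has to consider which
strand the left-hand gene is on, which keeps the convergent and divergent cases apart.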
Viruses with complete genomes available from the NCBI HTTP server were downloaded as
GenBank files and organised into four categories according to whether their genomes were
double or single stranded and DNA or RNA. A script was run in Python version 2.7.6 to extract
the relevant information about the genes that overlapped. One viral strain per family of
viruses was selected at random and the data were imported into R.
dsRNA viruses were not investigated further owing to a lack of data. The orientation of
overlaps within each category was then investigated. Frequencies of overlaps within dsDNA
viruses: convergent 0.17, divergent 0.1, parallel 0.363, nested 0.367. Frequencies of
overlaps within ssDNA viruses: convergent 0.06, divergent 0.02, parallel 0.53, nested 0.38.
Frequencies of overlaps within ssRNA viruses: convergent and divergent 0, parallel and
nested 0.5. A chi-squared test rejected the null hypothesis at the 99% confidence level,
indicating that the differences are unlikely to be due to chance.
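
As an illustration of this kind of test, the sketch below computes a chi-squared statistic
for a table of orientation counts by genome category. It assumes the test compared
orientation counts across categories, which the abstract does not state explicitly. The
counts are invented purely for demonstration and are not the study's data; chi_squared is a
hypothetical helper name. The critical value 16.812 is the standard table value for 6
degrees of freedom at the 0.01 significance level.

```python
def chi_squared(table):
    """Return (statistic, degrees_of_freedom) for a table of observed counts.

    Every row and column total must be non-zero, otherwise an expected
    count of zero would cause a division by zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if orientation were independent of category.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# INVENTED counts for illustration only (not the study's data).
# Rows: dsDNA, ssDNA, ssRNA. Columns: convergent, divergent, parallel, nested.
counts = [
    [20, 12, 40, 41],
    [3, 1, 27, 19],
    [0, 0, 6, 6],
]

CRITICAL_DF6_P01 = 16.812  # chi-squared critical value, 6 d.f., alpha = 0.01

statistic, dof = chi_squared(counts)
if dof == 6:
    print("reject null hypothesis at the 1% level:", statistic > CRITICAL_DF6_P01)
```

With very small expected counts, as in the sparse ssRNA row here, the chi-squared
approximation is unreliable, so a real analysis might pool categories or use an exact test.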
R was used to produce histograms and collate the data so that the overlaps, and the number of
nucleotide base pairs over which each overlap occurred, could be analysed within each
family. The dsDNA category contained a large amount of data in the first part of the study
and was therefore examined further, to investigate whether certain orientations of overlap
contributed more to the total number of overlaps found within dsDNA viruses. The results
showed that many of the shorter overlaps were convergent, divergent or parallel, whereas the
longer overlaps were nested.
Because of the degeneracy of the genetic code and the constraints this places on codon
choice, expectations were formed about what the coding sequences would be. This
investigation could be extended by examining the actual coding sequences within the randomly
selected overlapping genes to see whether those expectations are correct. Certain codons
may be more abundant than others within viruses owing to mutational biases and selective
forces within the environment; this may be because some codons are translated faster than
others or because some codons are less prone to mistranslation.