Comparative Microbial Genomics and Functional Annotation

Tammi Vesth
Seminar

In November 2013, around 21,000 different prokaryotic genomes had been sequenced and made publicly available, and the number is growing daily, with another 20,000 or more genomes expected to be sequenced and deposited by the end of 2014.

An important part of the analysis of this data is the functional annotation of genes: the descriptions assigned to genes that describe the likely function of the encoded proteins. This process is limited by several factors, including how specifically a function is defined and how many genes can actually be assigned a function based on known functions. This talk describes the development of new tools for comparative functional annotation and a system for comparative genomics in general. As newly sequenced genomes become more readily available, there is a need for standard analysis tools. CMG-biotools is presented here as an example of such a system. It was used to analyze a set of genomes from the Negativicutes class, a group of bacteria that is closely related to Gram-positive bacteria but has a different cell wall structure and stains Gram-negative. The results of this work show that genomes of this class have very little homology to other known genomes, making functional annotation based on sequence similarity very difficult.

Inspired in part by this analysis, an approach for comparative functional annotation, CMGfunc, was created based on publicly available sequenced genomes. Functionally related groups of proteins were clustered based on sequence domains so that each group represented a protein function. Each function was then modeled using Artificial Neural Networks (ANN), and each model was evaluated on its ability to identify true positives and true negatives. The models were used to annotate a number of proteins without functional annotations and predicted functions for 98% of the genes. The precision of the method was evaluated using data from the Critical Assessment of Functional Annotation (CAFA) project, and correct predictions were made in about 60% of the cases.
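As an illustration of what evaluating a model on true positives and true negatives can look like, the following minimal Python sketch computes sensitivity, specificity and precision for a single binary function model. It is not taken from CMGfunc: the function name `evaluate_binary_model` and the example labels are hypothetical, and the actual evaluation procedure used in the work may differ.

```python
def evaluate_binary_model(true_labels, predicted_labels):
    """Summarize one binary function model (hypothetical helper, not CMGfunc code).

    Labels are 1 if a protein has the modeled function and 0 otherwise.
    """
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t and p)          # true positives
    tn = sum(1 for t, p in pairs if not t and not p)  # true negatives
    fp = sum(1 for t, p in pairs if not t and p)      # false positives
    fn = sum(1 for t, p in pairs if t and not p)      # false negatives

    def ratio(numerator, denominator):
        # Return None when a rate is undefined (empty denominator).
        return numerator / denominator if denominator else None

    return {
        "tp": tp,
        "tn": tn,
        "fp": fp,
        "fn": fn,
        "sensitivity": ratio(tp, tp + fn),  # share of true members recovered
        "specificity": ratio(tn, tn + fp),  # share of non-members rejected
        "precision": ratio(tp, tp + fp),    # share of predictions that are correct
    }


# Made-up labels for eight proteins, for illustration only.
true_labels = [1, 1, 1, 0, 0, 0, 0, 1]
predicted_labels = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = evaluate_binary_model(true_labels, predicted_labels)
print(metrics)
```

Reporting specificity alongside sensitivity matters here because, for any single protein function, most proteins are non-members, so a model can miss few true members and still produce many false assignments.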

These projects have highlighted the difficulties and challenges in functional annotation and computational analysis of sequence data. They have provided possible solutions for creating reproducible pipelines for comparative genomics and have produced a number of functional models that are not based on sequence similarity. Although much work is still left to be done, resources are flowing into the area of sequence analysis and progress is being made every day. Many different approaches are being tried and tested which will, in time, improve the knowledge gained from sequencing genomes.