“AnnoTools”: Extending AnnoTree and AnnoView for Database-Wide Genome Annotation, Visualization, and Comparison

Date

2025-01-24

Advisor

Doxey, Andrew

Publisher

University of Waterloo

Abstract

Genomic analysis has revolutionized our understanding of the biology and evolutionary history of bacterial and archaeal microorganisms, leading to numerous applications in biotechnology, medicine, and environmental sciences. One of the fundamental aspects of genomic analysis is protein functional annotation, which involves assigning biological functions to protein-coding sequences identified within genomes. These annotations support analyses such as examining gene or function distributions across the tree of life and comparing gene neighborhoods across taxa. By combining these analyses, researchers can comprehensively explore the functions and mechanisms of given genes or gene clusters. In this thesis, I will introduce a pipeline that supports genomic analysis. The project consists of three parts: data annotation, visualization, and a language model. The first part of the pipeline is the generation of protein function annotations. Raw protein sequence data is downloaded from the Genome Taxonomy Database (GTDB) and submitted to two tools: KofamScan and DIAMOND. KofamScan assigns KEGG Orthology (KO) IDs to each input sequence, while DIAMOND assigns UniRef IDs, which are then mapped to InterPro IDs. Combining these IDs provides comprehensive and reliable annotations. The data is filtered for quality and stored on a remote server as an annotation database for further analysis. The second part of the pipeline involves updating two user-friendly, web-based visualization tools, AnnoTree and AnnoView, which utilize the annotation database. AnnoTree displays the distribution and taxonomy of different protein annotations across GTDB using a tree of life representation, offering insights into biological and evolutionary patterns through species phylogenies and supporting genome-wide co-occurrence analysis.
AnnoView focuses on comparing and exploring gene neighborhoods, identifying functionally related genes clustered together in genomes as "gene clusters," thus emphasizing window-based co-occurrence analysis. The new annotation database not only provides more comprehensive and accurate annotations, enhancing the databases that both visualization tools rely on, but also extends the tools' functionality with faster data retrieval and new features. The last part of the pipeline involves the application of the Word2Vec language model, which treats genome contigs as sentences in natural language and is trained on the annotation database. After training, the model can encode each protein-family annotation as a continuous-valued, high-dimensional vector, allowing researchers to explore annotations that share similar genomic contexts. This enables protein function prediction based on comparative gene neighborhood analysis. Finally, I will use one protein domain in the Type VI Secretion System (T6SS) as a case study. T6SS is a cell envelope-spanning machine that translocates toxic effector proteins into eukaryotic and prokaryotic cells. Besides the conserved essential core components, there are various effector and accessory proteins in the system. Some of these proteins are annotated as Domains of Unknown Function (DUFs) and are poorly explored. In this case study, I will focus on PF20598 (DUF6795), which shares a similar genomic context with one of the T6SS proteins. Using the visualization tools AnnoTree and AnnoView, I will demonstrate that this DUF is part of the T6SS cluster, supporting the hypothesis that it may function as an adaptor protein in T6SS. In summary, the AnnoTools pipeline integrates all components to enhance comparative genomic analysis with a large-scale annotation database. The user-friendly web-based tools enable researchers to visualize data both genome-wide and at a window-based scale.
The ultimate goal of this thesis is to provide researchers with a comprehensive and easy-to-use method for predicting functions of genes or gene clusters of interest.
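
The idea of treating contigs as sentences and comparing annotations by their genomic context can be illustrated with a minimal sketch. This is not the thesis implementation: it swaps the Word2Vec model for a simpler count-based stand-in (window-based co-occurrence vectors compared by cosine similarity), and the annotation labels (`famA`, `famX`, and so on), the window size, and all function names are hypothetical placeholders chosen for illustration.

```python
from collections import Counter, defaultdict
from math import sqrt

# Hypothetical toy data: each contig is an ordered list of annotation IDs,
# one per gene, in genomic order. The labels are placeholders, not real IDs.
contigs = [
    ["famA", "famB", "famX", "famC"],
    ["famA", "famB", "famY", "famC"],
    ["famD", "famE", "famF"],
]


def context_vectors(contigs, window=2):
    """For each annotation, count the annotations found within `window` genes."""
    vectors = defaultdict(Counter)
    for contig in contigs:
        for i, target in enumerate(contig):
            start = max(0, i - window)
            stop = min(len(contig), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    vectors[target][contig[j]] += 1
    return dict(vectors)


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def most_similar(query, vectors, top_n=3):
    """Rank the other annotations by how similar their genomic context is."""
    query_vec = vectors.get(query, Counter())
    scores = [
        (other, cosine(query_vec, vec))
        for other, vec in vectors.items()
        if other != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]


vectors = context_vectors(contigs)
ranking = most_similar("famX", vectors)
print(ranking)
```

In this toy data, `famX` and `famY` never appear on the same contig but are flanked by the same neighbors, so they rank as most similar to each other. A trained Word2Vec model captures the same kind of signal with dense learned vectors rather than raw counts.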

Keywords

NATURAL SCIENCES::Chemistry::Theoretical chemistry::Bioinformatics, genomics, NATURAL SCIENCES::Biology::Organism biology::Microbiology, visualization, word2vec
