Thursday, March 1, 2018

Would you like to annotate protein function with UniProt's annotation systems?

Register your interest here:

One of the core activities at UniProt is to develop computational methods for the functional annotation of protein sequences. UniProt has developed two prediction systems, UniRule and the Statistical Automatic Annotation System (SAAS), to automatically annotate the unreviewed records in UniProtKB/TrEMBL with high coverage and a high degree of accuracy.

These prediction systems can annotate protein properties such as protein names, function, catalytic activity, pathway membership, and subcellular location, along with sequence-specific information, such as the positions of post-translational modifications and active sites.

As a result of discussions with researchers and genome sequencing centres interested in functional annotation, we plan to make our annotation rules publicly available for download. We would like to engage with users in the development of a standardised format for sharing these annotation rules, to help you use the rules for functional annotation of your own data.

Apply the UniProt rules to your own proteins

We also plan to provide a standalone tool to execute the UniProt annotation rules and enrich your own data with high-quality annotations. We invite user feedback on the provision of such a tool for functional annotation of coding sequences.

Given input data such as protein sequences, taxonomy data and InterProScan signatures, along with the rules, a rule engine will be able to reason over the rules to infer new protein annotations.
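As a rough illustration of how such a rule engine might combine these inputs, here is a minimal Python sketch. It does not use the actual UniRule or SAAS rule format, which has not been published yet: the `Protein` and `Rule` classes, their fields, and every identifier in the example data are hypothetical placeholders. A rule fires when a protein carries all of the rule's required signatures and its lineage contains the rule's taxon.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Protein:
    """Input facts about one protein (all field names are illustrative)."""
    accession: str
    lineage: frozenset      # taxon names from the organism's lineage
    signatures: frozenset   # signature identifiers reported by InterProScan


@dataclass(frozen=True)
class Rule:
    """A hypothetical rule: if all conditions hold, assert the annotations."""
    rule_id: str
    required_signatures: frozenset
    required_taxon: str
    annotations: tuple      # (annotation_type, value) pairs


def apply_rules(protein, rules):
    """Return inferred annotations, each tagged with the rule that fired."""
    inferred = []
    for rule in rules:
        has_signatures = rule.required_signatures <= protein.signatures
        in_taxon = rule.required_taxon in protein.lineage
        if has_signatures and in_taxon:
            for annotation_type, value in rule.annotations:
                inferred.append((annotation_type, value, rule.rule_id))
    return inferred


# Made-up example data: these are placeholders, not real rules or signatures.
rules = [
    Rule(
        rule_id="EXAMPLE_RULE_1",
        required_signatures=frozenset({"IPR_EXAMPLE_A"}),
        required_taxon="Bacteria",
        annotations=(
            ("protein_name", "Example kinase"),
            ("subcellular_location", "Cytoplasm"),
        ),
    ),
    Rule(
        rule_id="EXAMPLE_RULE_2",
        required_signatures=frozenset({"IPR_EXAMPLE_B"}),
        required_taxon="Archaea",
        annotations=(("protein_name", "Example transporter"),),
    ),
]

protein = Protein(
    accession="EXAMPLE_0001",
    lineage=frozenset({"Bacteria", "Proteobacteria"}),
    signatures=frozenset({"IPR_EXAMPLE_A", "IPR_EXAMPLE_C"}),
)

for annotation in apply_rules(protein, rules):
    print(annotation)
```

Tagging each inferred annotation with the identifier of the rule that produced it keeps the evidence trail visible, which matters when predictions are later reviewed or revised.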

Get involved

Would you like to try out the UniProt rules to annotate your own data? Would you like an early peek at the systems, formats and functionality we plan to make available, and a chance to provide valuable feedback? Are you interested in integrating the UniProt rules (UniRule and SAAS) into your annotation pipeline? We would love to have your feedback and give you the opportunity to beta-test our latest developments.


Tuesday, February 6, 2018

UniProt and the Expanding Tree of Life

UniProt loves life in all its forms, but we especially love its complement of proteins. We want to bring you the protein sequences from the massive diversity of organisms across the whole planet. We have been closely following how the Tree of Life is expanding and being resolved with increasing accuracy. Here's a look at a couple of the most exciting discoveries and how they are reshaping what we do. Below is a revised Tree of Life presented by Laura Hug et al., which is based on an alignment of 16 ribosomal proteins.

Figure 1. The revised Tree of Life from Hug et al. 2016. Lineages lacking an isolated representative are highlighted with non-italicized names and red dots.

We can see that the large majority of organisms are microbial, and as yet we are unable to grow a large fraction of them in the lab. Red dots in the figure show phyla for which not even a single organism has been cultured. However, thanks to the power of next-generation sequencing and improving metagenomic assembly and binning tools, we now have access to thousands of complete or near-complete genomes assembled from metagenomic data (Anantharaman et al. 2016, Parks et al. 2017). These genomes are known as MAGs, for metagenome-assembled genomes.

Probably the most exciting MAG to have been assembled is that of an enigmatic archaeon that lives in deep-sea sediments. Lokiarchaeum is named after the location at which it was first identified (Spang et al. 2015): Loki’s Castle, a field of active hydrothermal vents, or black smokers, found in the mid-Atlantic. This microbe possesses many protein families that were considered to be characteristic of eukaryotes. It was suggested that an ancestor of Lokiarchaeum was the origin of all eukaryotic cells. The identification of more remotely related archaea led to the definition of the Asgard superphylum of archaea (Zaremba-Niedzwiedzka et al. 2017). Phylogenetic analysis showed that eukaryotes could be considered an ingroup of these archaea. This finding, together with the growing number of MAGs from these organisms, gives us an unprecedented opportunity to study the earliest events in the emergence of the eukaryotic lineage.

A huge yet mysterious branch on the Tree of Life has become known as the Candidate Phyla Radiation (CPR). This branch of bacteria contains vast numbers of uncultured organisms. Analysis of the MAGs of these organisms suggests that they lack some of the machinery necessary for a free-living lifestyle and are likely to exist in symbiotic associations. These cells are extremely small, yet offer a glimpse into hitherto unknown protein functions and diversity.

The authors of the influential papers mentioned here have submitted their data to the DNA sequence databanks, from which it flows into UniProt. The sets of protein sequences (proteomes) for these complete genomes can be searched in the UniProt Proteomes database. Some of these organisms have been computationally selected as part of the UniProt Reference Proteome collection (Chen et al. 2011), which aims to provide a selection of key proteomes and their proteins that cover the diversity of life. Reference proteomes are marked with a dedicated icon.

You can investigate the protein sequences in these organisms in UniProt by following the links below to relevant proteomes:
· Lokiarchaeota
· Woesebacteria – An example group from the Candidate Phyla Radiation.


Hug LA, Baker BJ, Anantharaman K, Brown CT, Probst AJ, Castelle CJ, Butterfield CN, Hernsdorf AW, Amano Y, Ise K, Suzuki Y, Dudek N, Relman DA, Banfield JF.

Anantharaman K, Brown CT, Hug LA, Sharon I, Castelle CJ, Probst AJ, Thomas BC, Singh A, Wilkins MJ, Karaoz U, Brodie EL, Williams KH, Hubbard SS, Banfield JF. Thousands of microbial genomes shed light on interconnected biogeochemical processes in an aquifer system. Nat Commun 24 Oct 2016, 7:13219.

Zaremba-Niedzwiedzka K, Caceres EF, Saw JH, Bäckström D, Juzokaite L, Vancaester E, Seitz KW, Anantharaman K, Starnawski P, Kjeldsen KU, Stott MB, Nunoura T, Banfield JF, Schramm A, Baker BJ, Spang A, Ettema TJ. Asgard archaea illuminate the origin of eukaryotic cellular complexity. Nature 11 Jan 2017, 541(7637):353-358.

Chen C, Natale DA, Finn RD, Huang H, Zhang J, Wu CH, Mazumder R. Representative proteomes: a stable, scalable and unbiased proteome set for sequence analysis and functional annotation. PLoS One 27 Apr 2011, 6(4):e18910.