Monday, May 4, 2015

UniProt Knowledgebase just got smaller!

With UniProt release 2015_04 at the beginning of April 2015, the number of protein entries in UniProtKB dropped from 92,672,207 to 47,262,724. Wondering what happened?


Prior to release 2015_04, UniProtKB had doubled in size within a year to over 90 million entries, with a high level of redundancy. This was especially true for bacterial species where many genomes of the same bacterium had been sequenced and submitted (e.g. 4,080 proteomes for Staphylococcus aureus, comprising 10.88 million entries).

To deal with this redundancy, we developed a procedure to identify highly redundant proteomes within species groups. This procedure was implemented for bacterial species, and the sequences corresponding to redundant proteomes (approximately 47 million entries) were deprecated. All of these protein entries belonged to the unreviewed TrEMBL part of the UniProt Knowledgebase. Reviewed Swiss-Prot protein entries are unaffected by the procedure.

So how does this procedure actually work? Here we break it down into four steps.

Step 1: Group proteomes by taxonomy level

Proteomes can only be redundant to other proteomes of the same taxonomy branch at species level or below (sub-species, strains, etc.).
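As a rough illustration of this grouping step, here is a minimal Python sketch. The proteome identifiers and the record layout are made up for the example, and it assumes each proteome has already been mapped to its species-level taxon. It is not the actual UniProt pipeline.

```python
from collections import defaultdict

# Hypothetical records: (proteome_id, species-level taxon).
# Strains and sub-species are assumed to be already mapped to their species.
proteomes = [
    ("PROTEOME_1", "Staphylococcus aureus"),
    ("PROTEOME_2", "Staphylococcus aureus"),
    ("PROTEOME_3", "Escherichia coli"),
    ("PROTEOME_4", "Staphylococcus aureus"),
]

def group_by_species(records):
    """Collect proteome IDs under their species-level taxon."""
    groups = defaultdict(list)
    for proteome_id, species in records:
        groups[species].append(proteome_id)
    return dict(groups)

groups = group_by_species(proteomes)
for species, members in groups.items():
    print(species, members)
```

Only proteomes that end up in the same group are compared with each other in the next step.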


Step 2: Pairwise comparison of proteomes within each group

We use the CD-Hit 2D program for pairwise comparison of proteomes within each group. Based on the results, we calculate the level of similarity between pairs of proteomes within the groups.
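The exact similarity measure is not spelled out here, so the following Python sketch is only one plausible reading: the similarity of a query proteome to a reference proteome is taken as the fraction of query sequences that have a match in the reference. Exact string equality stands in for a real CD-Hit 2D match, and the function names and toy sequences are hypothetical.

```python
from itertools import permutations

def fraction_matched(query, reference):
    """Fraction of sequences in `query` that also occur in `reference`.

    Assumption: exact equality is a stand-in for a CD-Hit 2D match.
    """
    if not query:
        return 0.0
    reference_set = set(reference)
    return sum(seq in reference_set for seq in query) / len(query)

def pairwise_similarity(group):
    """Return {(query_id, reference_id): similarity} for every ordered pair."""
    return {
        (q, r): fraction_matched(group[q], group[r])
        for q, r in permutations(group, 2)
    }

# Toy group of two proteomes with made-up sequences.
group = {
    "P1": ["MKT", "MSA", "MLV", "MQR"],
    "P2": ["MKT", "MSA", "MLV"],
}
scores = pairwise_similarity(group)
print(scores)
```

Note that the measure in this sketch is not symmetric: every sequence of P2 is found in P1, but not the other way round, which is why ordered pairs are compared.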


Step 3: Graph analysis 

We now select just the proteome pairs with similarity higher than 90%. With these proteomes as nodes, we create a directed weighted graph in which each edge is weighted by the level of similarity between the two proteomes it connects. To identify the most redundant proteomes, we rank all proteomes in this graph.

Ranking is by Proteome(Indegree, Outdegree) where for a proteome A:
Indegree (the higher the better): Number of proteomes that are redundant to proteome A.
Outdegree (the lower the better): Number of proteomes to which proteome A is redundant.
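So, for example:
A(5,1) is better than B(1,1), because five proteomes are redundant to A but only one is redundant to B, while A and B are each redundant to one other proteome.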
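The ranking can be sketched in Python as follows. This is an illustration under assumptions, not the production code: an edge is taken to point from the redundant proteome to the proteome it is redundant to, and ties are broken by sorting on indegree first (descending) and outdegree second (ascending). The similarity values are invented.

```python
from collections import defaultdict

THRESHOLD = 0.90  # keep only pairs with similarity higher than 90%

def rank_proteomes(similarities, threshold=THRESHOLD):
    """Rank proteomes from best to worst as (id, indegree, outdegree).

    `similarities` maps (redundant_id, target_id) to a similarity score.
    Assumption: sort by indegree (high first), then outdegree (low first).
    """
    indegree = defaultdict(int)
    outdegree = defaultdict(int)
    nodes = set()
    for (source, target), score in similarities.items():
        if score > threshold:
            nodes.update((source, target))
            outdegree[source] += 1  # source is redundant to target
            indegree[target] += 1   # target has one more redundant proteome
    ordered = sorted(nodes, key=lambda n: (-indegree[n], outdegree[n], n))
    return [(n, indegree[n], outdegree[n]) for n in ordered]

similarities = {
    ("B", "A"): 0.95,
    ("C", "A"): 0.97,
    ("C", "B"): 0.92,
    ("A", "B"): 0.80,  # below the threshold, so no edge is created
}
ranking = rank_proteomes(similarities)
print(ranking)
```

In this toy graph, A ranks first with (2,0) and C ranks last with (0,2).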


Step 4: Elimination of redundant proteomes

Proteomes that rank lowest are the most redundant. These are marked as ‘redundant’ on the UniProt proteomes web portal and protein entries belonging to these redundant proteomes are removed from UniProtKB/TrEMBL.

This process is run iteratively to identify all redundant proteomes. All proteomes remain searchable through UniProt’s Proteomes interface (http://www.uniprot.org/proteomes/) and redundant proteome sets are now available for download from the UniProt Archive UniParc.
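To make the iterative elimination more concrete, here is a minimal Python sketch. It assumes that each round removes the single lowest-ranked proteome among those that are redundant to at least one other proteome, then recomputes the degrees, and that the loop stops once no edges are left. The real procedure may choose candidates or stopping conditions differently.

```python
def eliminate_redundant(edges):
    """Return proteome IDs flagged as redundant, in order of removal.

    `edges` is a set of (redundant_id, target_id) pairs that passed the
    similarity threshold. Assumption: one proteome is removed per round.
    """
    remaining = set(edges)
    removed = []
    while remaining:
        indegree, outdegree = {}, {}
        for source, target in remaining:
            outdegree[source] = outdegree.get(source, 0) + 1
            indegree[target] = indegree.get(target, 0) + 1
        # Candidates are proteomes redundant to at least one other proteome.
        # The worst has the lowest indegree, then the highest outdegree.
        worst = max(
            outdegree,
            key=lambda n: (-indegree.get(n, 0), outdegree[n], n),
        )
        removed.append(worst)
        # Drop every edge that touches the removed proteome.
        remaining = {(s, t) for s, t in remaining if worst not in (s, t)}
    return removed

redundant = eliminate_redundant({("B", "A"), ("C", "A"), ("C", "B")})
print(redundant)
```

In this example C is removed first, then B, and A is kept as the representative proteome.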