The pEVE database
The pEVE database contains BLASTN search (e-value ≤ 1E-10) results of all available eukaryotic and viral genomes (as of August 14, 2017) as target and query sequences, respectively.
1. Details of the search
We used the GenomeSync database for the genomic sequence and taxonomy data (as of August 14, 2017). The eukaryotic data set comprised 4,102 genomes from 335 vertebrates, 472 invertebrates, 2,518 fungi, 224 land plants, and 553 other eukaryotic species (which are listed here). For viral genomes, we focused on non-retroviral DNA/RNA virus-like sequences and excluded retro-transcribing viruses, including Retroviridae, Caulimoviridae, and Hepadnaviridae. As a result, a total of 7,007 viral genomes were included in the study: 2,697 double-stranded DNA (dsDNA) viruses, 284 double-stranded RNA (dsRNA) viruses, 911 single-stranded DNA (ssDNA) viruses, 1,624 single-stranded RNA (ssRNA) viruses, 223 satellites, 4 environmental samples, and 1,264 other viruses including phages and viroids (which are listed here).
To comprehensively identify EVEs in eukaryotic genomes, we conducted BLASTN searches (e-value < 1E-10) using the Genome Search Toolkit (GSTK) scripts, with the 4,102 eukaryotic genomes as target sequences and the 7,007 viral genomes as query sequences. Before the search, we masked simple repeat regions in the query genome sequences using tantan with the options “-x N -r 0.0005” (2.2% were masked).
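The e-value filtering step can be illustrated with a short sketch. This is not the GSTK code; it assumes the hits are available in the standard 12-column BLAST tabular format (-outfmt 6), and the function name `filter_hits` and the sequence identifiers in the example are hypothetical. The strict comparison follows the "< 1E-10" wording of this section; use `<=` for the inclusive reading.

```python
import csv
import io

# Standard column order of BLAST tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

E_VALUE_CUTOFF = 1e-10  # threshold stated in the text


def filter_hits(handle, cutoff=E_VALUE_CUTOFF):
    """Yield hits (as dicts) whose e-value is strictly below the cutoff.

    Hypothetical helper for illustration; not part of GSTK.
    """
    reader = csv.reader(handle, delimiter="\t")
    for row in reader:
        # Skip blank lines and comment lines (e.g. from -outfmt 7).
        if not row or row[0].startswith("#"):
            continue
        hit = dict(zip(COLUMNS, row))
        hit["evalue"] = float(hit["evalue"])
        hit["bitscore"] = float(hit["bitscore"])
        if hit["evalue"] < cutoff:
            yield hit


# Made-up example rows: one strong hit and one weak hit.
example = (
    "virusA\teukaryote1_chr1\t92.5\t200\t15\t0\t1\t200\t5001\t5200\t3e-70\t270\n"
    "virusB\teukaryote1_chr2\t80.0\t50\t10\t0\t1\t50\t100\t149\t2e-05\t45.5\n"
)
kept = list(filter_hits(io.StringIO(example)))
for hit in kept:
    print(hit["qseqid"], hit["sseqid"], hit["evalue"], hit["bitscore"])
```

In this toy input, only the virusA row passes the cutoff; the virusB row (e-value 2e-05) is discarded.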
2. Summary of the search
55.0% (3,856/7,007) of the viral genomes were detected in at least one eukaryotic genome. Interestingly, 99.9% (4,098/4,102) of the eukaryotic genomes contained at least one EVE-like sequence. The most and second-most abundant viruses (found in 4,091 and 4,089 eukaryotic genomes, respectively) were uncultured human fecal viruses identified in metagenomic analyses; these hits may be artifacts. Apart from these, many of the abundant viruses in eukaryotic genomes were dsDNA viruses.
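The summary figures above can be derived from a list of (virus, eukaryotic genome) hit pairs. The following is a minimal sketch under that assumption; the function `summarize`, its return keys, and the toy identifiers are hypothetical and do not reflect the database's actual schema.

```python
from collections import defaultdict


def summarize(hit_pairs, n_viruses, n_eukaryotes):
    """Summarize hit pairs: coverage fractions and a per-virus ranking.

    hit_pairs: iterable of (virus_id, eukaryote_genome_id) tuples.
    """
    genomes_per_virus = defaultdict(set)
    viruses_per_genome = defaultdict(set)
    for virus, genome in hit_pairs:
        genomes_per_virus[virus].add(genome)
        viruses_per_genome[genome].add(virus)

    # Rank viruses by the number of distinct eukaryotic genomes they hit
    # (ties broken alphabetically for a stable order).
    ranked = sorted(genomes_per_virus.items(),
                    key=lambda kv: (-len(kv[1]), kv[0]))
    return {
        "virus_fraction": len(genomes_per_virus) / n_viruses,
        "eukaryote_fraction": len(viruses_per_genome) / n_eukaryotes,
        "ranking": [(virus, len(genomes)) for virus, genomes in ranked],
    }


# Toy data: duplicate pairs are counted once.
pairs = [("v1", "g1"), ("v1", "g2"), ("v2", "g1"), ("v1", "g1")]
summary = summarize(pairs, n_viruses=4, n_eukaryotes=2)
print(summary)

# The percentages reported in the text follow from the stated counts.
print(round(100 * 3856 / 7007, 1))
print(round(100 * 4098 / 4102, 1))
```

Ranking by distinct genomes rather than by raw hit counts matches how the text describes abundance ("found in 4,091 and 4,089 eukaryotic genomes").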
Figure. Taxonomic tree of the 7,007 viruses stored in this database, labeled and colored according to the virus groups. The bars in each circle indicate the maximum BLAST bit scores for hits of each virus to genomes in eight categories of eukaryotes; the height of the bars is log-scaled. The tree was generated using GraPhlAn.
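The bar heights described in the caption can be sketched as follows: take the maximum bit score per virus and eukaryote category, then apply a logarithm. The base of the logarithm is not stated in the caption, so base 10 is an assumption here, and the function names and category labels in the example are hypothetical.

```python
import math
from collections import defaultdict


def max_bitscores(hits):
    """Return {virus: {category: maximum bit score}}.

    hits: iterable of (virus_id, eukaryote_category, bitscore) tuples.
    """
    best = defaultdict(dict)
    for virus, category, bitscore in hits:
        previous = best[virus].get(category, 0.0)
        best[virus][category] = max(previous, bitscore)
    return best


def bar_heights(best):
    """Log-scale the maximum bit scores (base 10 is an assumption)."""
    return {
        virus: {cat: math.log10(score) for cat, score in cats.items()}
        for virus, cats in best.items()
    }


# Toy hits with made-up category labels and scores.
hits = [
    ("v1", "vertebrates", 100.0),
    ("v1", "vertebrates", 1000.0),
    ("v1", "fungi", 10.0),
]
heights = bar_heights(max_bitscores(hits))
print(heights)
```

A virus with no hit in a category simply has no entry for it, which would correspond to an absent bar in that ring.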