The entire workflow below is based upon instructions for prof. dr. Patrick Schloss on the mothur blog and is provided by me without any warranty! It’s just as a primer for me to show how I tried to reformat SILVA v128 based upon the instructions above and which issues and results I ran into, hoping it can help you to do the same. As a general suggestion I suggest to stick to the mothur recreated SILVA v123 SEED for alignement purposes (see last section on my motivation for this). For classification, feel free to follow the README below to generate the SILVA v128 nr yourself (available directly from me upon request as well).
This README file explains how we generated the silva reference files for use with mothur’s classify.seqs and align.seqs commands. I’ll assume that you have a functioning copy of arb installed on your computer. For this README we are using arb 6.0.3. First we need to download the database and decompress it. From the (linux) command line we do the following:
wget https://www.arb-silva.de/fileadmin/arb_web_db/release_128/ARB_files/SSURef_NR99_128_SILVA_07_09_16_opt.arb.gz
gunzip SSURef_NR99_128_SILVA_07_09_16_opt.arb.gz
arb SSURef_NR99_128_SILVA_07_09_16_opt.arb
This will launch us into the arb environment with the ‘’Ref NR 99’’ database opened. This database has 646,151 sequences within it that are not more than 99% similar to each other. The release notes for this database as well as the idea behind the non-redundant database are available from the silva website. Within arb do the following:
Export
option should be set to marked
, Filter
to none
, and Compression
should be set to no
.Choose an output file name enter
make sure the path has you in the arb_ref_128
folder and enter silva.full_v128.fasta
.Select a format: fasta_mothur.eft. This is a custom formatting file that was created that includes the sequences accession number and it’s taxonomy across the top line. To create one for you will need to create fasta_mothur.eft
in the /opt/local/share/arb/lib/export/
folder with the code below.
SUFFIX fasta
BEGIN
>*(acc).*(name)\t*(align_ident_slv)\t*(tax_slv);
*(|export_sequence)
You can now quit arb.
We used an E.coli (accession EU014689.2) 16S rRNA gene sequence trimmed within primers 27F and 1492R. The primers were removed. An alignment was made with the full length SILVA release.
align.seqs(fasta=ecoli27f1492rinside.fasta,reference=silva.full_v128.fasta,flip=T,processors=20)
Using 20 processors.
Reading in the silva.full_v128.fasta template sequences...DONE.
It took 1101 to read 577832 sequences.
Aligning sequences from ecoli27f1492rinside.fasta ...
It took 3 secs to align 1 sequences.
Output File Names:
ecoli27f1492rinside.align
ecoli27f1492rinside.align.report
> summary.seqs()
Using ecoli27f1492rinside.align as input file for the fasta parameter.
Using 20 processors.
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1044 43116 1463 0 6 1
2.5%-tile: 1044 43116 1463 0 6 1
25%-tile: 1044 43116 1463 0 6 1
Median: 1044 43116 1463 0 6 1
75%-tile: 1044 43116 1463 0 6 1
97.5%-tile: 1044 43116 1463 0 6 1
Maximum: 1044 43116 1463 0 6 1
Mean: 1044 43116 1463 0 6
# of Seqs: 1
Output File Names:
ecoli27f1492rinside.summary
It took 0 secs to summarize 1 sequences.
Now we need to screen the sequences for those that span the 27f and 1492r primer region, have 5 or fewer ambiguous base calls, and that are unique. We’ll also extract the taxonomic information from the header line:
mothur "#screen.seqs(fasta=silva.full_v128.fasta, start=1044, end=43116, maxambig=5, processors=18);
pcr.seqs(start=1044, end=43116, keepdots=T);
degap.seqs();
unique.seqs();"
> screen.seqs(fasta=silva.full_v128.fasta,start=1044,end=43116,maxambig=5)
Output File Names:
silva.full_v128.good.fasta
silva.full_v128.bad.accnos
It took 439 secs to screen 577832 sequences.
> get.current()
Current RAM usage: 0.0610046 Gigabytes. Total Ram: 31.4078 Gigabytes.
Current files saved by mothur:
fasta=silva.full_v128.good.fasta
processors=18
Current default directory saved by mothur: /home/fpkerckh/mothur/
Current working directory: /home/fpkerckh/Taxonomies/
Output File Names:
current_files.summary
> pcr.seqs(start=1044, end=43116, keepdots=T)
Output File Names:
silva.full_v128.good.pcr.fasta
It took 186 secs to screen 223622 sequences.
> degap.seqs()
Using silva.full_v128.good.pcr.fasta as input file for the fasta parameter.
Using 18 processors.
Degapping sequences from silva.full_v128.good.pcr.fasta ...
It took 46 secs to degap 223622 sequences.
Output File Names:
silva.full_v128.good.pcr.ng.fasta
> unique.seqs()
Using silva.full_v128.good.pcr.ng.fasta as input file for the fasta parameter.
Output File Names:
silva.full_v128.good.pcr.ng.names
silva.full_v128.good.pcr.ng.unique.fasta
grep ">" silva.full_v128.good.pcr.ng.unique.fasta | cut -f 1 | cut -c 2- > silva.full_v128.good.pcr.ng.unique.accnos
mothur "#get.seqs(fasta=silva.full_v128.good.pcr.fasta, accnos=silva.full_v128.good.pcr.ng.unique.accnos)"
Selected 190661 sequences from your fasta file.
Output File Names:
silva.full_v128.good.pcr.pick.fasta
#generate alignment file
mv silva.full_v128.good.pcr.pick.fasta silva.nr_v128.align
#generate taxonomy file
grep "^>" silva.full_v128.fasta | cut -f 1,3 | cut -c 2- > silva.full_v128.tax.temp
The mothur command above does several things. First the screen.seqs command removes sequences that are not full length and have more than 5 ambiguous base calls. Note: this will remove a number of Archaea since the ARB RN reference database lets in shorter (>900 bp) archaeal 16S rRNA gene sequences. Second, pcr.seqs convert any base calls that occur before position 1044 and after 43116 to .
to make them only span the region between the 27f and 1492r priming sites. Finally, it is possible that weird things happen in the alignments and so we unalign the sequences (degap.seqs) and identify the unique sequences (unique.seqs). We then convert the resulting fasta file into an accnos file so that we can go back into mothur and pull out the unique sequences from the aligned file (get.seqs).
Now we want to make sure the taxonomy file is properly formatted for use with mothur, for which we use R (version 3.3.1).
tax <- read.table(file="silva.full_v128.tax.temp", sep="\t")
tax$V2 <- gsub(" ", "_", tax$V2) #convert any spaces to underscores
tax$V2 <- gsub("uncultured;", "", tax$V2) #remove any "uncultured" taxa names
#tax$V2 <- paste0("Root;", tax$V2) #pre-empt all classifications with the Root level.
#we want to see whether everything has 7 (6) taxonomic levesl (Root to genus)
getDepth <- function(taxonString){
initial <- nchar(taxonString)
removed <- nchar(gsub(";", "", taxonString))
return(initial-removed)
}
depth <- getDepth(tax$V2)
bacteria <- grepl("Bacteria;", tax$V2)
archaea <- grepl("Archaea;", tax$V2)
eukarya <- grepl("Eukaryota;", tax$V2)
tax[depth > 6 & bacteria,] #good to go
tax[depth > 6 & archaea,] #good to go
tax[depth > 6 & eukarya,] #eh, there's a lot here - will truncate to the pseudo genus level
tax[depth > 6 & eukarya,2] <- gsub("([^;]*;[^;]*;[^;]*;[^;]*;[^;]*;[^;]*;).*", "\\1", tax[depth > 6 & eukarya,2])
depth <- getDepth(tax$V2)
tax[depth > 6 & eukarya,] #good to go
write.table(tax, file="silva.full_v128.tax", quote=F, row.names=F, col.names=F)
The first thing to note is that SILVA does not release their SEED; it is private. By screening through the ARB databases we can attempt to recreate it. Our previous publications show that classify.seqs with the recreated SEED does an excellent job of realigning sequences to look like they would if you used SINA and the true SEED. Now we want to figure out which sequences are part of the SEED. Earlier, when we exported the sequences from ARB, we included the align_ident_slv field from the database in our output. Let’s generate an accnos file that contains the names of the seuqences with 100% to the SEED database and then use mothur to generate SEED fasta and taxonomy files. While we’re at it we’ll also generate the nr_128 taxonomy file as well…:
grep ">" silva.nr_v128.align | cut -f 1,2 | grep "[[:space:]]100" | cut -f 1 | cut -c 2- > silva.seed_v128.accnos
grep ">" silva.nr_v128.align | cut -f 1,2 | grep -P "\t100" | cut -f 1 | cut -c 2- > silva.seed_v128.accnos
mothur "#get.seqs(fasta=silva.nr_v128.align, taxonomy=silva.full_v128.tax, accnos=silva.seed_v128.accnos)"
Selected 11213 sequences from your fasta file.
Selected 11213 sequences from your taxonomy file.
Output File Names:
silva.nr_v128.pick.align
silva.full_v128.pick.tax
mv silva.nr_v128.pick.align silva.seed_v128.align
mv silva.full_v128.pick.tax silva.seed_v128.tax
mothur "#get.seqs(taxonomy=silva.full_v128.tax, accnos=silva.full_v128.good.pcr.ng.unique.accnos)"
Selected 190661 sequences from your taxonomy file.
Output File Names:
silva.full_v128.pick.tax
mv silva.full_v128.pick.tax silva.nr_v128.tax
Let’s look to see how many different taxa we have for each taxonomic level within the silva.full_v128.tax
, silva.nr_v128.tax
, silva.seed_v128.tax
:
library(splitstackshape)
getNumTaxaNames <- function(file, kingdom){
taxonomy <- read.table(file=file, row.names=1)
sub.tax <- as.character(taxonomy[grepl(kingdom, taxonomy[,1]),])
phyla <- as.vector(levels(as.factor(gsub("[^;]*;([^;]*;).*", "\\1", sub.tax))))
phyla <- sum(!grepl(kingdom, phyla))
class <- as.vector(levels(as.factor(gsub("[^;]*;[^;]*;([^;]*;).*", "\\1", sub.tax))))
class <- sum(!grepl(kingdom, class))
order <- as.vector(levels(as.factor(gsub("[^;]*;[^;]*;[^;]*;([^;]*;).*", "\\1", sub.tax))))
order <- sum(!grepl(kingdom, order))
family <- as.vector(levels(as.factor(gsub("[^;]*;[^;]*;[^;]*;[^;]*;([^;]*;).*", "\\1", sub.tax))))
family <- sum(!grepl(kingdom, family))
genus <- as.vector(levels(as.factor(gsub("[^;]*;[^;]*;[^;]*;[^;]*;[^;]*;([^;]*;).*", "\\1", sub.tax))))
genus <- sum(!grepl(kingdom, genus))
n.seqs <- length(sub.tax)
return(c(phyla=phyla, class=class, order=order, family=family, genus=genus, n.seqs=n.seqs))
}
getNumTaxaNamesFM <- function(file, kingdom){
taxonomy <- read.table(file=file, row.names=1)
sub.tax <- as.character(taxonomy[grepl(kingdom, taxonomy[,1]),])
sub.tax.spl<-cSplit(as.data.frame(sub.tax),1,sep=";")
if(kingdom!="Eukaryota")
{
setnames(sub.tax.spl,old=colnames(sub.tax.spl),
new=c("Regnum","Phylum","Classis","Ordo","Familia","Genus"))
}else{
setnames(sub.tax.spl,old=colnames(sub.tax.spl),
new=c("Regnum","Supergroup","Phylum","Classis","Ordo",
"Familia","Genus"))
}
phyla <- nlevels(sub.tax.spl$Phylum)
classis <- nlevels(sub.tax.spl$Classis)
#avoid using class, an internal R function, as an object name
ordo <- nlevels(sub.tax.spl$Ordo)
#avoid using order, an internal R function, as an object name
familia <- nlevels(sub.tax.spl$Familia)
#avoid using family, an internal R function, as an object name
genus <- nlevels(sub.tax.spl$Genus)
n.seqs <- length(sub.tax)
return(c(phyla=phyla, classis=classis, ordo=ordo, familia=familia,
genus=genus, n.seqs=n.seqs))
}
kingdoms <- c("Bacteria", "Archaea", "Eukaryota")
tax.levels <- c("phyla", "class", "order", "family", "genus", "n.seqs")
tax.levels.FM <- c("phyla", "classis", "ordo", "familia", "genus", "n.seqs")
full.file <- "silva.full_v128.tax"
full.matrix <- matrix(rep(0,18), nrow=3)
full.matrix[1,] <- getNumTaxaNames(full.file, kingdoms[1])
full.matrix[2,] <- getNumTaxaNames(full.file, kingdoms[2])
full.matrix[3,] <- getNumTaxaNames(full.file, kingdoms[3])
rownames(full.matrix) <- kingdoms
colnames(full.matrix) <- tax.levels
full.matrix
# phyla class order family genus n.seqs
# Bacteria 75 220 314 655 2827 510886
# Archaea 27 44 25 66 148 23030
# Eukaryota 13 31 81 185 481 43916
nr.file <- "silva.nr_v128.tax"
nr.matrix <- matrix(rep(0,18), nrow=3)
nr.matrix[1,] <- getNumTaxaNames(nr.file, kingdoms[1])
nr.matrix[2,] <- getNumTaxaNames(nr.file, kingdoms[2])
nr.matrix[3,] <- getNumTaxaNames(nr.file, kingdoms[3])
rownames(nr.matrix) <- kingdoms
colnames(nr.matrix) <- tax.levels
nr.matrix
# phyla class order family genus n.seqs
# Bacteria 74 210 307 635 2706 168111
# Archaea 24 35 22 58 142 4337
# Eukaryota 11 26 76 139 410 18213
seed.file <- "silva.seed_v128.tax"
seed.matrix <- matrix(rep(0,18), nrow=3)
seed.matrix[1,] <- getNumTaxaNames(seed.file, kingdoms[1])
seed.matrix[2,] <- getNumTaxaNames(seed.file, kingdoms[2])
seed.matrix[3,] <- getNumTaxaNames(seed.file, kingdoms[3])
rownames(seed.matrix) <- kingdoms
colnames(seed.matrix) <- tax.levels
seed.matrix
# phyla class order family genus n.seqs
# Bacteria 54 122 166 345 1149 8512
# Archaea 9 12 16 27 43 147
# Eukaryota 7 14 23 47 85 2554
nr.matrix / full.matrix
# phyla class order family genus n.seqs
# Bacteria 0.9866667 0.9545455 0.9777070 0.9694656 0.9571984 0.3290578
# Archaea 0.8888889 0.7954545 0.8800000 0.8787879 0.9594595 0.1883196
# Eukaryota 0.8461538 0.8387097 0.9382716 0.7513514 0.8523909 0.4147236
seed.matrix / full.matrix
# phyla class order family genus n.seqs
# Bacteria 0.7200000 0.5545455 0.5286624 0.5267176 0.4064379 0.016661251
# Archaea 0.3333333 0.2727273 0.6400000 0.4090909 0.2905405 0.006382979
# Eukaryota 0.5384615 0.4516129 0.2839506 0.2540541 0.1767152 0.058156481
So… which to use for what application? Since we have the RAM (at least on the la06c801), we could be using silva.nr_v128.align
in align.seqs
. It took about 10 minutes to read in the database file. We extracted the V3-V4 from the SILVA v128 alignment with outside primer trimming to make sure diverging 16S rRNA genes can also align, but an inside trimmed reference alignment can work as well.:
mothur "#align.seqs(fasta=ecoli341f785routside.fasta,reference=silva.nr_v128.align, flip=T, processors=20)"
Using 20 processors.
Reading in the silva.nr_v128.align template sequences... DONE.
It took 304 to read 190661 sequences.
Aligning sequences from ecoli341f785routside.fasta ...
It took 0 secs to align 1 sequences.
Output File Names:
ecoli341f785routside.align
ecoli341f785routside.align.report
mothur "#summary.seqs()"
Using ecoli341f785routside.align as input file for the fasta parameter.
Using 20 processors.
Start End NBases Ambigs Polymer NumSeqs
Minimum: 6388 25316 465 1 6 1
2.5%-tile: 6388 25316 465 1 6 1
25%-tile: 6388 25316 465 1 6 1
Median: 6388 25316 465 1 6 1
75%-tile: 6388 25316 465 1 6 1
97.5%-tile: 6388 25316 465 1 6 1
Maximum: 6388 25316 465 1 6 1
Mean: 6388 25316 465 1 6
# of Seqs: 1
Output File Names:
ecoli341f785routside.summary
It took 0 secs to summarize 1 sequences.
mothur "#align.seqs(fasta=ecoli341f785rinside.fasta,reference=silva.nr_v128.align, flip=T, processors=20)"
Reading in the silva.nr_v128.align template sequences... DONE.
It took 297 to read 190661 sequences.
Aligning sequences from ecoli341f785rinside.fasta ...
It took 1 secs to align 1 sequences.
Output File Names:
ecoli341f785rinside.align
ecoli341f785rinside.align.report
mothur "#summary.seqs()"
Using ecoli341f785rinside.align as input file for the fasta parameter.
Start End NBases Ambigs Polymer NumSeqs
Minimum: 6428 23440 427 1 6 1
2.5%-tile: 6428 23440 427 1 6 1
25%-tile: 6428 23440 427 1 6 1
Median: 6428 23440 427 1 6 1
75%-tile: 6428 23440 427 1 6 1
97.5%-tile: 6428 23440 427 1 6 1
Maximum: 6428 23440 427 1 6 1
Mean: 6428 23440 427 1 6
# of Seqs: 1
Output File Names:
ecoli341f785rinside.summary
It took 0 secs to summarize 1 sequences.
mothur "#pcr.seqs(fasta=silva.nr_v128.align, start=6388, end=25316, keepdots=F, processors=20);unique.seqs();summary.seqs(name=current)"
Output File Names:
silva.nr_v128.pcr.align
It took 72 secs to screen 190661 sequences.
Using silva.nr_v128.pcr.align as input file for the fasta parameter.
Output File Names:
silva.nr_v128.pcr.names
silva.nr_v128.pcr.unique.align
Using silva.nr_v128.pcr.unique.align as input file for the fasta parameter.
Using silva.nr_v128.pcr.names as input file for the name parameter.
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 15382 312 0 3 1
2.5%-tile: 1 18928 434 0 4 4767
25%-tile: 1 18928 440 0 4 47666
Median: 1 18928 460 0 5 95331
75%-tile: 1 18928 464 0 6 142996
97.5%-tile: 1 18928 605 1 7 185895
Maximum: 3865 18928 1666 5 16 190661
Mean: 1.02972 18927.8 467.313 0.0611294 5.02901
# of unique seqs: 149371
total # of seqs: 190661
Output File Names:
silva.nr_v128.pcr.unique.summary
It took 14 secs to summarize 190661 sequences.
mv silva.nr_v128.pcr.unique.align silva.nr_v128.LGCoutside.unique.align
mothur "#pcr.seqs(fasta=silva.nr_v128.align, start=6428, end=23440, keepdots=F, processors=20);unique.seqs();summary.seqs(name=current)"
Output File Names:
silva.nr_v128.pcr.align
It took 67 secs to screen 190661 sequences.
Output File Names:
silva.nr_v128.pcr.names
silva.nr_v128.pcr.unique.align
Using silva.nr_v128.pcr.names as input file for the name parameter.
Using silva.nr_v128.pcr.unique.align as input file for the fasta parameter.
Start End NBases Ambigs Polymer NumSeqs
Minimum: 1 15342 274 0 3 1
2.5%-tile: 2 17012 396 0 4 4767
25%-tile: 2 17012 402 0 4 47666
Median: 2 17012 422 0 5 95331
75%-tile: 2 17012 426 0 6 142996
97.5%-tile: 2 17012 567 1 7 185895
Maximum: 3825 17012 1628 5 16 190661
Mean: 2.0663 17011.8 429.254 0.0561678 5.02073
# of unique seqs: 147682
total # of seqs: 190661
Output File Names:
silva.nr_v128.pcr.unique.summary
It took 14 secs to summarize 190661 sequences.
mv silva.nr_v128.pcr.unique.align silva.nr_v128.LGCinside.unique.align
This will get you 149,371 unique sequences (147682 with inside trimming) to then align against. Other tricks to consider would be to use get.lineage
to pull out the reference sequences that are from the Bacteria, this will probably only reduce the size of the database by ~10%. You could also try using filter.seqs
with vertical=T; however, that might be problematic if there are insertions in your sequences (can’t know a priori). You could just use the silva.seed_v128.align
reference for aligning, though the low number of sequences retained with the procedure above makes me wary of doing this. For classifying sequences, Patrick Schloss (and I) would strongly recommend using the silva.nr_v128.align
and silva.nr_v128.tax
references after running pcr.seqs on silva.nr_v128.align
. You shouldn’t run unique.seqs
on the output. I have attended the ARB/SILVA workshop from Primer to Paper in november 2016, where I had the pleasure to talk to the SILVA developers (Jörg Peplies, Jan Gerken, Christian Quast, Pelin Yilmaz, Antonio Fernandez-Guerra, Ivaylo Kostadinov) and PI (Frank Oliver Glöckner) on the SEED alignment. Apparently, even though they would like to make the SEED alignment accessible, it contains proprietary sequences which prohibit public sharing of the SEED alignment. Hence, dr. Schloss tried to recreate the SEED (and his README on how to do so has been used above and applied to release 128). It is however not exactly the same as the SEED of SILVA. Furthermore, the SEED alignment does not change very fast (it is not a multiple sequence alignment, but an incremental alignment). Therefore, given the issues outlined above for the v.128 mothur recreated SEED, currently at CMET we stick to the v.123 mothur recreated seed for alignment purposes.
If you are going to use the files generated in this README, you should be aware of SILVA’s dual use license. We’ll leave it to you to work out the details.