DISCLAIMER

The entire workflow below is based upon instructions for prof. dr. Patrick Schloss on the mothur blog and is provided by me without any warranty! It’s just as a primer for me to show how I tried to reformat SILVA v128 based upon the instructions above and which issues and results I ran into, hoping it can help you to do the same. As a general suggestion I suggest to stick to the mothur recreated SILVA v123 SEED for alignement purposes (see last section on my motivation for this). For classification, feel free to follow the README below to generate the SILVA v128 nr yourself (available directly from me upon request as well).

Curation of references

Getting the data in and out of the ARB database

This README file explains how we generated the silva reference files for use with mothur’s classify.seqs and align.seqs commands. I’ll assume that you have a functioning copy of arb installed on your computer. For this README we are using arb 6.0.3. First we need to download the database and decompress it. From the (linux) command line we do the following:

wget https://www.arb-silva.de/fileadmin/arb_web_db/release_128/ARB_files/SSURef_NR99_128_SILVA_07_09_16_opt.arb.gz
gunzip SSURef_NR99_128_SILVA_07_09_16_opt.arb.gz
arb SSURef_NR99_128_SILVA_07_09_16_opt.arb

This will launch us into the arb environment with the ‘’Ref NR 99’’ database opened. This database has 646,151 sequences within it that are not more than 99% similar to each other. The release notes for this database as well as the idea behind the non-redundant database are available from the silva website. Within arb do the following:

  1. Click the search button
  2. Set the first search field to ‘ARB_color’ and set it to 1. Click on the equal sign until it indicates not equal (this removes low quality reads and chimeras)
  3. Click ‘Search’. This yielded 557832 hits
  4. Click the “Mark Listed Unmark Rest” button
  5. Close the “Search and Query” box
  6. Now click on File->export->export to external format
  7. In this box the Export option should be set to marked, Filter to none, and Compression should be set to no.
  8. In the field for Choose an output file name enter make sure the path has you in the arb_ref_128 folder and enter silva.full_v128.fasta.
  9. Select a format: fasta_mothur.eft. This is a custom formatting file that was created that includes the sequences accession number and it’s taxonomy across the top line. To create one for you will need to create fasta_mothur.eft in the /opt/local/share/arb/lib/export/ folder with the code below.

    SUFFIX          fasta
    BEGIN
    >*(acc).*(name)\t*(align_ident_slv)\t*(tax_slv);
    *(|export_sequence)
  10. Save this as silva.full_v128.fasta (this can take a while >5min at least)
  11. You can now quit arb.

Determine trimming positions

We used an E.coli (accession EU014689.2) 16S rRNA gene sequence trimmed within primers 27F and 1492R. The primers were removed. An alignment was made with the full length SILVA release.

align.seqs(fasta=ecoli27f1492rinside.fasta,reference=silva.full_v128.fasta,flip=T,processors=20)
Using 20 processors.
Reading in the silva.full_v128.fasta template sequences...DONE.
It took 1101 to read  577832 sequences.
Aligning sequences from ecoli27f1492rinside.fasta ...
It took 3 secs to align 1 sequences.
Output File Names:
ecoli27f1492rinside.align
ecoli27f1492rinside.align.report


> summary.seqs()
Using ecoli27f1492rinside.align as input file for the fasta parameter.
Using 20 processors.
                Start   End     NBases  Ambigs  Polymer NumSeqs
Minimum:        1044    43116   1463    0       6       1
2.5%-tile:      1044    43116   1463    0       6       1
25%-tile:       1044    43116   1463    0       6       1
Median:         1044    43116   1463    0       6       1
75%-tile:       1044    43116   1463    0       6       1
97.5%-tile:     1044    43116   1463    0       6       1
Maximum:        1044    43116   1463    0       6       1
Mean:   1044    43116   1463    0       6
# of Seqs:      1
Output File Names:
ecoli27f1492rinside.summary
It took 0 secs to summarize 1 sequences.

Screening the sequences

Now we need to screen the sequences for those that span the 27f and 1492r primer region, have 5 or fewer ambiguous base calls, and that are unique. We’ll also extract the taxonomic information from the header line:

mothur "#screen.seqs(fasta=silva.full_v128.fasta, start=1044, end=43116, maxambig=5, processors=18);
        pcr.seqs(start=1044, end=43116, keepdots=T);
        degap.seqs();
        unique.seqs();"
> screen.seqs(fasta=silva.full_v128.fasta,start=1044,end=43116,maxambig=5)
Output File Names:
silva.full_v128.good.fasta
silva.full_v128.bad.accnos
It took 439 secs to screen 577832 sequences.


> get.current()
Current RAM usage: 0.0610046 Gigabytes. Total Ram: 31.4078 Gigabytes.
Current files saved by mothur:
fasta=silva.full_v128.good.fasta
processors=18
Current default directory saved by mothur: /home/fpkerckh/mothur/
Current working directory: /home/fpkerckh/Taxonomies/
Output File Names:
current_files.summary


> pcr.seqs(start=1044, end=43116, keepdots=T)
Output File Names:
silva.full_v128.good.pcr.fasta
It took 186 secs to screen 223622 sequences.


    
> degap.seqs()
Using silva.full_v128.good.pcr.fasta as input file for the fasta parameter.
Using 18 processors.
Degapping sequences from silva.full_v128.good.pcr.fasta ...
It took 46 secs to degap 223622 sequences.
Output File Names:
silva.full_v128.good.pcr.ng.fasta


> unique.seqs()
Using silva.full_v128.good.pcr.ng.fasta as input file for the fasta parameter.
Output File Names:
silva.full_v128.good.pcr.ng.names
silva.full_v128.good.pcr.ng.unique.fasta


grep ">" silva.full_v128.good.pcr.ng.unique.fasta | cut -f 1 | cut -c 2- > silva.full_v128.good.pcr.ng.unique.accnos


mothur "#get.seqs(fasta=silva.full_v128.good.pcr.fasta, accnos=silva.full_v128.good.pcr.ng.unique.accnos)"
Selected 190661 sequences from your fasta file.
Output File Names:
silva.full_v128.good.pcr.pick.fasta


#generate alignment file
mv silva.full_v128.good.pcr.pick.fasta silva.nr_v128.align

#generate taxonomy file
grep "^>" silva.full_v128.fasta | cut -f 1,3 | cut -c 2- > silva.full_v128.tax.temp

The mothur command above does several things. First the screen.seqs command removes sequences that are not full length and have more than 5 ambiguous base calls. Note: this will remove a number of Archaea since the ARB RN reference database lets in shorter (>900 bp) archaeal 16S rRNA gene sequences. Second, pcr.seqs convert any base calls that occur before position 1044 and after 43116 to . to make them only span the region between the 27f and 1492r priming sites. Finally, it is possible that weird things happen in the alignments and so we unalign the sequences (degap.seqs) and identify the unique sequences (unique.seqs). We then convert the resulting fasta file into an accnos file so that we can go back into mothur and pull out the unique sequences from the aligned file (get.seqs).

Formatting the taxonomy files

Now we want to make sure the taxonomy file is properly formatted for use with mothur, for which we use R (version 3.3.1).

tax <- read.table(file="silva.full_v128.tax.temp", sep="\t")
tax$V2 <- gsub(" ", "_", tax$V2)  #convert any spaces to underscores
tax$V2 <- gsub("uncultured;", "", tax$V2)   #remove any "uncultured" taxa names

#tax$V2 <- paste0("Root;", tax$V2)   #pre-empt all classifications with the Root level.

#we want to see whether everything has 7 (6) taxonomic levesl (Root to genus)
getDepth <- function(taxonString){
  initial <- nchar(taxonString)
    removed <- nchar(gsub(";", "", taxonString))
    return(initial-removed)
}


depth <- getDepth(tax$V2)
bacteria <- grepl("Bacteria;", tax$V2)
archaea <- grepl("Archaea;", tax$V2)
eukarya <- grepl("Eukaryota;", tax$V2)

tax[depth > 6 & bacteria,] #good to go
tax[depth > 6 & archaea,]  #good to go
tax[depth > 6 & eukarya,]  #eh, there's a lot here - will truncate to the pseudo genus level
tax[depth > 6 & eukarya,2] <- gsub("([^;]*;[^;]*;[^;]*;[^;]*;[^;]*;[^;]*;).*", "\\1", tax[depth > 6 & eukarya,2])

depth <- getDepth(tax$V2)
tax[depth > 6 & eukarya,]  #good to go

write.table(tax, file="silva.full_v128.tax", quote=F, row.names=F, col.names=F)

Building the SEED references

The first thing to note is that SILVA does not release their SEED; it is private. By screening through the ARB databases we can attempt to recreate it. Our previous publications show that classify.seqs with the recreated SEED does an excellent job of realigning sequences to look like they would if you used SINA and the true SEED. Now we want to figure out which sequences are part of the SEED. Earlier, when we exported the sequences from ARB, we included the align_ident_slv field from the database in our output. Let’s generate an accnos file that contains the names of the seuqences with 100% to the SEED database and then use mothur to generate SEED fasta and taxonomy files. While we’re at it we’ll also generate the nr_128 taxonomy file as well…:

grep ">" silva.nr_v128.align | cut -f 1,2 | grep "[[:space:]]100" | cut -f 1 | cut -c 2- > silva.seed_v128.accnos
grep ">" silva.nr_v128.align | cut -f 1,2 | grep -P "\t100" | cut -f 1 | cut -c 2- > silva.seed_v128.accnos
mothur "#get.seqs(fasta=silva.nr_v128.align, taxonomy=silva.full_v128.tax, accnos=silva.seed_v128.accnos)"
Selected 11213 sequences from your fasta file.
Selected 11213 sequences from your taxonomy file.
Output File Names:
silva.nr_v128.pick.align
silva.full_v128.pick.tax


mv silva.nr_v128.pick.align silva.seed_v128.align
mv silva.full_v128.pick.tax silva.seed_v128.tax

mothur "#get.seqs(taxonomy=silva.full_v128.tax, accnos=silva.full_v128.good.pcr.ng.unique.accnos)"
Selected 190661 sequences from your taxonomy file.
Output File Names:
silva.full_v128.pick.tax

mv silva.full_v128.pick.tax silva.nr_v128.tax

Taxonomic representation

Let’s look to see how many different taxa we have for each taxonomic level within the silva.full_v128.tax, silva.nr_v128.tax, silva.seed_v128.tax:

library(splitstackshape)
getNumTaxaNames <- function(file, kingdom){
  taxonomy <- read.table(file=file, row.names=1)
  sub.tax <- as.character(taxonomy[grepl(kingdom, taxonomy[,1]),])

  phyla <- as.vector(levels(as.factor(gsub("[^;]*;([^;]*;).*", "\\1", sub.tax))))
  phyla <- sum(!grepl(kingdom, phyla))

  class <- as.vector(levels(as.factor(gsub("[^;]*;[^;]*;([^;]*;).*", "\\1", sub.tax))))
  class <- sum(!grepl(kingdom, class))

  order <- as.vector(levels(as.factor(gsub("[^;]*;[^;]*;[^;]*;([^;]*;).*", "\\1", sub.tax))))
  order <- sum(!grepl(kingdom, order))

  family <- as.vector(levels(as.factor(gsub("[^;]*;[^;]*;[^;]*;[^;]*;([^;]*;).*", "\\1", sub.tax))))
  family <- sum(!grepl(kingdom, family))

  genus <- as.vector(levels(as.factor(gsub("[^;]*;[^;]*;[^;]*;[^;]*;[^;]*;([^;]*;).*", "\\1", sub.tax))))
  genus <- sum(!grepl(kingdom, genus))

  n.seqs <- length(sub.tax)
  return(c(phyla=phyla, class=class, order=order, family=family, genus=genus, n.seqs=n.seqs))
}
getNumTaxaNamesFM <- function(file, kingdom){
  taxonomy <- read.table(file=file, row.names=1)
    sub.tax <- as.character(taxonomy[grepl(kingdom, taxonomy[,1]),])
    sub.tax.spl<-cSplit(as.data.frame(sub.tax),1,sep=";")
    if(kingdom!="Eukaryota")
    {
        setnames(sub.tax.spl,old=colnames(sub.tax.spl),
                 new=c("Regnum","Phylum","Classis","Ordo","Familia","Genus"))
    }else{
        setnames(sub.tax.spl,old=colnames(sub.tax.spl),
                 new=c("Regnum","Supergroup","Phylum","Classis","Ordo",
                       "Familia","Genus"))
    }
    phyla <- nlevels(sub.tax.spl$Phylum)
  classis <- nlevels(sub.tax.spl$Classis) 
  #avoid using class, an internal R function, as an object name
  ordo <- nlevels(sub.tax.spl$Ordo)
    #avoid using order, an internal R function, as an object name
    familia <- nlevels(sub.tax.spl$Familia)
    #avoid using family, an internal R function, as an object name
    genus <- nlevels(sub.tax.spl$Genus)
    n.seqs <- length(sub.tax)
    return(c(phyla=phyla, classis=classis, ordo=ordo, familia=familia,
             genus=genus, n.seqs=n.seqs))
}

kingdoms <- c("Bacteria", "Archaea", "Eukaryota")
tax.levels <- c("phyla", "class", "order", "family", "genus", "n.seqs")
tax.levels.FM <- c("phyla", "classis", "ordo", "familia", "genus", "n.seqs")



full.file <- "silva.full_v128.tax"
full.matrix <- matrix(rep(0,18), nrow=3)
full.matrix[1,] <- getNumTaxaNames(full.file, kingdoms[1])
full.matrix[2,] <- getNumTaxaNames(full.file, kingdoms[2])
full.matrix[3,] <- getNumTaxaNames(full.file, kingdoms[3])
rownames(full.matrix) <- kingdoms
colnames(full.matrix) <- tax.levels
full.matrix
#           phyla class order family genus n.seqs
# Bacteria     75   220   314    655  2827 510886
# Archaea      27    44    25     66   148  23030
# Eukaryota    13    31    81    185   481  43916

nr.file <- "silva.nr_v128.tax"
nr.matrix <- matrix(rep(0,18), nrow=3)
nr.matrix[1,] <- getNumTaxaNames(nr.file, kingdoms[1])
nr.matrix[2,] <- getNumTaxaNames(nr.file, kingdoms[2])
nr.matrix[3,] <- getNumTaxaNames(nr.file, kingdoms[3])
rownames(nr.matrix) <- kingdoms
colnames(nr.matrix) <- tax.levels
nr.matrix
#           phyla class order family genus n.seqs
# Bacteria     74   210   307    635  2706 168111
# Archaea      24    35    22     58   142   4337
# Eukaryota    11    26    76    139   410  18213


seed.file <- "silva.seed_v128.tax"
seed.matrix <- matrix(rep(0,18), nrow=3)
seed.matrix[1,] <- getNumTaxaNames(seed.file, kingdoms[1])
seed.matrix[2,] <- getNumTaxaNames(seed.file, kingdoms[2])
seed.matrix[3,] <- getNumTaxaNames(seed.file, kingdoms[3])
rownames(seed.matrix) <- kingdoms
colnames(seed.matrix) <- tax.levels
seed.matrix
#           phyla class order family genus n.seqs
# Bacteria     54   122   166    345  1149   8512
# Archaea       9    12    16     27    43    147
# Eukaryota     7    14    23     47    85   2554


nr.matrix / full.matrix
#               phyla     class     order    family     genus    n.seqs
# Bacteria  0.9866667 0.9545455 0.9777070 0.9694656 0.9571984 0.3290578
# Archaea   0.8888889 0.7954545 0.8800000 0.8787879 0.9594595 0.1883196
# Eukaryota 0.8461538 0.8387097 0.9382716 0.7513514 0.8523909 0.4147236


seed.matrix / full.matrix
#               phyla     class     order    family     genus      n.seqs
# Bacteria  0.7200000 0.5545455 0.5286624 0.5267176 0.4064379 0.016661251
# Archaea   0.3333333 0.2727273 0.6400000 0.4090909 0.2905405 0.006382979
# Eukaryota 0.5384615 0.4516129 0.2839506 0.2540541 0.1767152 0.058156481

Application (CMET version)

So… which to use for what application? Since we have the RAM (at least on the la06c801), we could be using silva.nr_v128.align in align.seqs. It took about 10 minutes to read in the database file. We extracted the V3-V4 from the SILVA v128 alignment with outside primer trimming to make sure diverging 16S rRNA genes can also align, but an inside trimmed reference alignment can work as well.:

mothur "#align.seqs(fasta=ecoli341f785routside.fasta,reference=silva.nr_v128.align, flip=T, processors=20)"
Using 20 processors.
Reading in the silva.nr_v128.align template sequences...        DONE.
It took 304 to read  190661 sequences.
Aligning sequences from ecoli341f785routside.fasta ...
It took 0 secs to align 1 sequences.
Output File Names:
ecoli341f785routside.align
ecoli341f785routside.align.report

mothur "#summary.seqs()"
Using ecoli341f785routside.align as input file for the fasta parameter.
Using 20 processors.
                Start   End     NBases  Ambigs  Polymer NumSeqs
Minimum:        6388    25316   465     1       6       1
2.5%-tile:      6388    25316   465     1       6       1
25%-tile:       6388    25316   465     1       6       1
Median:         6388    25316   465     1       6       1
75%-tile:       6388    25316   465     1       6       1
97.5%-tile:     6388    25316   465     1       6       1
Maximum:        6388    25316   465     1       6       1
Mean:   6388    25316   465     1       6
# of Seqs:      1
Output File Names:
ecoli341f785routside.summary
It took 0 secs to summarize 1 sequences.

mothur "#align.seqs(fasta=ecoli341f785rinside.fasta,reference=silva.nr_v128.align, flip=T, processors=20)"
Reading in the silva.nr_v128.align template sequences...        DONE.
It took 297 to read  190661 sequences.
Aligning sequences from ecoli341f785rinside.fasta ...
It took 1 secs to align 1 sequences.
Output File Names:
ecoli341f785rinside.align
ecoli341f785rinside.align.report


mothur "#summary.seqs()"
Using ecoli341f785rinside.align as input file for the fasta parameter.
                Start   End     NBases  Ambigs  Polymer NumSeqs
Minimum:        6428    23440   427     1       6       1
2.5%-tile:      6428    23440   427     1       6       1
25%-tile:       6428    23440   427     1       6       1
Median:         6428    23440   427     1       6       1
75%-tile:       6428    23440   427     1       6       1
97.5%-tile:     6428    23440   427     1       6       1
Maximum:        6428    23440   427     1       6       1
Mean:   6428    23440   427     1       6
# of Seqs:      1
Output File Names:
ecoli341f785rinside.summary
It took 0 secs to summarize 1 sequences.

mothur "#pcr.seqs(fasta=silva.nr_v128.align, start=6388, end=25316, keepdots=F, processors=20);unique.seqs();summary.seqs(name=current)"
Output File Names:
silva.nr_v128.pcr.align
It took 72 secs to screen 190661 sequences.
Using silva.nr_v128.pcr.align as input file for the fasta parameter.
Output File Names:
silva.nr_v128.pcr.names
silva.nr_v128.pcr.unique.align
Using silva.nr_v128.pcr.unique.align as input file for the fasta parameter.
Using silva.nr_v128.pcr.names as input file for the name parameter.
                Start   End     NBases  Ambigs  Polymer NumSeqs
Minimum:        1       15382   312     0       3       1
2.5%-tile:      1       18928   434     0       4       4767
25%-tile:       1       18928   440     0       4       47666
Median:         1       18928   460     0       5       95331
75%-tile:       1       18928   464     0       6       142996
97.5%-tile:     1       18928   605     1       7       185895
Maximum:        3865    18928   1666    5       16      190661
Mean:   1.02972 18927.8 467.313 0.0611294       5.02901
# of unique seqs:       149371
total # of seqs:        190661
Output File Names:
silva.nr_v128.pcr.unique.summary
It took 14 secs to summarize 190661 sequences.


mv  silva.nr_v128.pcr.unique.align  silva.nr_v128.LGCoutside.unique.align

mothur "#pcr.seqs(fasta=silva.nr_v128.align, start=6428, end=23440, keepdots=F, processors=20);unique.seqs();summary.seqs(name=current)"
Output File Names:
silva.nr_v128.pcr.align
It took 67 secs to screen 190661 sequences.
Output File Names:
silva.nr_v128.pcr.names
silva.nr_v128.pcr.unique.align
Using silva.nr_v128.pcr.names as input file for the name parameter.
Using silva.nr_v128.pcr.unique.align as input file for the fasta parameter.
 Start   End     NBases  Ambigs  Polymer NumSeqs
Minimum:        1       15342   274     0       3       1
2.5%-tile:      2       17012   396     0       4       4767
25%-tile:       2       17012   402     0       4       47666
Median:         2       17012   422     0       5       95331
75%-tile:       2       17012   426     0       6       142996
97.5%-tile:     2       17012   567     1       7       185895
Maximum:        3825    17012   1628    5       16      190661
Mean:   2.0663  17011.8 429.254 0.0561678       5.02073
# of unique seqs:       147682
total # of seqs:        190661
Output File Names:
silva.nr_v128.pcr.unique.summary
It took 14 secs to summarize 190661 sequences.

mv  silva.nr_v128.pcr.unique.align  silva.nr_v128.LGCinside.unique.align

This will get you 149,371 unique sequences (147682 with inside trimming) to then align against. Other tricks to consider would be to use get.lineage to pull out the reference sequences that are from the Bacteria, this will probably only reduce the size of the database by ~10%. You could also try using filter.seqs with vertical=T; however, that might be problematic if there are insertions in your sequences (can’t know a priori). You could just use the silva.seed_v128.align reference for aligning, though the low number of sequences retained with the procedure above makes me wary of doing this. For classifying sequences, Patrick Schloss (and I) would strongly recommend using the silva.nr_v128.align and silva.nr_v128.tax references after running pcr.seqs on silva.nr_v128.align. You shouldn’t run unique.seqs on the output. I have attended the ARB/SILVA workshop from Primer to Paper in november 2016, where I had the pleasure to talk to the SILVA developers (Jörg Peplies, Jan Gerken, Christian Quast, Pelin Yilmaz, Antonio Fernandez-Guerra, Ivaylo Kostadinov) and PI (Frank Oliver Glöckner) on the SEED alignment. Apparently, even though they would like to make the SEED alignment accessible, it contains proprietary sequences which prohibit public sharing of the SEED alignment. Hence, dr. Schloss tried to recreate the SEED (and his README on how to do so has been used above and applied to release 128). It is however not exactly the same as the SEED of SILVA. Furthermore, the SEED alignment does not change very fast (it is not a multiple sequence alignment, but an incremental alignment). Therefore, given the issues outlined above for the v.128 mothur recreated SEED, currently at CMET we stick to the v.123 mothur recreated seed for alignment purposes.

Legalese

If you are going to use the files generated in this README, you should be aware of SILVA’s dual use license. We’ll leave it to you to work out the details.