7 Variant annotation
The final step in our pipeline is to annotate all the variants that were called in the previous step. For this, a tool like snpEff can be used.
7.1 snpEff configuration
SnpEff expects that all of its required are stored in a specific location and with specific filenames, which should the names used in the genome database section of a snpEff.config file. Briefly, you will need:
- A copy of the reference genome placed in
./data/snpEff_database/<reference-name>/sequences.fa(a symlink also works) - A gene annotation file placed in
./data/snpEff_database/<reference-name>/genes.gff - An optional CDS fasta file placed in:
./data/snpEff_database/<reference-name>/cds.fa - An optional protein fasta file placed in:
./data/snpEff_database/<reference-name>/protein.fa
Note that these files should all be decompressed (no .gz files allowed). Afterwards, you will need to run the snpEff build command to construct a snpEff database. You can read more about how to do this in the snpEff documentation. It is also useful to know that snpEff provides pre-built databases for various organisms too.
The generate_snpEff.sh script below can take care of all of these steps for you. First, you will need to modify the script and set the filepaths to your reference genomes’ location on your system.
This script expects that each of your reference genomes are stored in a separate directory, with the fasta, annotation and cds/protein files together.
It also makes some rather stringent assumptions about what the file names of the CDS and protein fasta files are. In this case, it expects the naming conventions used by PlasmoDB, but you can modify these lines of code yourself, or even comment them out and add these two files manually.
If the CDS or protein fasta files are missing for your reference genome of interest, or in case there are problems with the names in the annotation file, you can skip these files by adding the flags -noCheckCds -noCheckProtein to your snpEff build command to skip these checks. In this case, the protein check was disabled.
Lastly, it does not copy or move files, but instead makes use of symlinks. These are nothing more than smart shortcuts that can refer to the same file using a different filepath, saving space. For more info on symlinks, you can check out this tutorial.
#!/usr/bin/env bash
# set output directory if given as argument, default to current working directory
if [ ! -z $1 ]
then
output_dir=$1
else
# get file path of script and set it to output directory
output_dir=$(pwd)/
fi
# add the paths to your reference genomes here
declare -a genome_array=(
"/path/to/references/plasmodium_falciparum/PlasmoDB-release-68/PlasmoDB-68_Pfalciparum3D7_Genome.fasta"
"/path/to/references/human/Homo_sapiens.GRCh38.p14.GENCODE.release45/GRCh38.primary_assembly.genome.fa.gz"
)
snpeff_dir="${output_dir}/snpEff"
snpeff_db="${snpeff_dir}/snpEff_database"
snpeff_config="${snpeff_dir}/snpEff.config"
mkdir -p "${snpeff_dir}"
mkdir -p "${snpeff_db}"
# create snpeff.config file
cat > "${snpeff_config}" << EOF
#---------------------------------------
# Custom snpEff configuration
#---------------------------------------
data.dir = ${snpeff_db}
#-------------------------------------------------------------------------------
# Codon tables
#
# Format: It's a comma separated "codon/aminoAcid[+*]" list
# Where 'codon' is in uppper case, aminoAcid is a one letter
# code, '+' denotes start codon and '*' denotes stop codon.
#
# References: http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi
# ftp://ftp.ncbi.nih.gov/entrez/misc/data/gc.prt
#-------------------------------------------------------------------------------
codon.Standard : TTT/F , TTC/F , TTA/L , TTG/L+, TCT/S , TCC/S , TCA/S , TCG/S , TAT/Y , TAC/Y , TAA/* , TAG/* , TGT/C , TGC/C , TGA/* , TGG/W , CTT/L , CTC/L , CTA/L , CTG/L+, CCT/P , CCC/P , CCA/P , CCG/P , CAT/H , CAC/H , CAA/Q , CAG/Q , CGT/R , CGC/R , CGA/R , CGG/R , ATT/I , ATC/I , ATA/I , ATG/M+, ACT/T , ACC/T , ACA/T , ACG/T , AAT/N , AAC/N , AAA/K , AAG/K , AGT/S , AGC/S , AGA/R , AGG/R , GTT/V , GTC/V , GTA/V , GTG/V , GCT/A , GCC/A , GCA/A , GCG/A , GAT/D , GAC/D , GAA/E , GAG/E , GGT/G , GGC/G , GGA/G , GGG/G
codon.Vertebrate_Mitochondrial : TTT/F , TTC/F , TTA/L , TTG/L , TCT/S , TCC/S , TCA/S , TCG/S , TAT/Y , TAC/Y , TAA/* , TAG/* , TGT/C , TGC/C , TGA/W , TGG/W , CTT/L , CTC/L , CTA/L , CTG/L , CCT/P , CCC/P , CCA/P , CCG/P , CAT/H , CAC/H , CAA/Q , CAG/Q , CGT/R , CGC/R , CGA/R , CGG/R , ATT/I+, ATC/I+, ATA/M+, ATG/M+, ACT/T , ACC/T , ACA/T , ACG/T , AAT/N , AAC/N , AAA/K , AAG/K , AGT/S , AGC/S , AGA/* , AGG/* , GTT/V , GTC/V , GTA/V , GTG/V+, GCT/A , GCC/A , GCA/A , GCG/A , GAT/D , GAC/D , GAA/E , GAG/E , GGT/G , GGC/G , GGA/G , GGG/G
codon.Yeast_Mitochondrial : TTT/F , TTC/F , TTA/L , TTG/L , TCT/S , TCC/S , TCA/S , TCG/S , TAT/Y , TAC/Y , TAA/* , TAG/* , TGT/C , TGC/C , TGA/W , TGG/W , CTT/T , CTC/T , CTA/T , CTG/T , CCT/P , CCC/P , CCA/P , CCG/P , CAT/H , CAC/H , CAA/Q , CAG/Q , CGT/R , CGC/R , CGA/R , CGG/R , ATT/I , ATC/I , ATA/M+, ATG/M+, ACT/T , ACC/T , ACA/T , ACG/T , AAT/N , AAC/N , AAA/K , AAG/K , AGT/S , AGC/S , AGA/R , AGG/R , GTT/V , GTC/V , GTA/V , GTG/V , GCT/A , GCC/A , GCA/A , GCG/A , GAT/D , GAC/D , GAA/E , GAG/E , GGT/G , GGC/G , GGA/G , GGG/G
codon.Mold_Mitochondrial : TTT/F , TTC/F , TTA/L+, TTG/L+, TCT/S , TCC/S , TCA/S , TCG/S , TAT/Y , TAC/Y , TAA/* , TAG/* , TGT/C , TGC/C , TGA/W , TGG/W , CTT/L , CTC/L , CTA/L , CTG/L+, CCT/P , CCC/P , CCA/P , CCG/P , CAT/H , CAC/H , CAA/Q , CAG/Q , CGT/R , CGC/R , CGA/R , CGG/R , ATT/I+, ATC/I+, ATA/I+, ATG/M+, ACT/T , ACC/T , ACA/T , ACG/T , AAT/N , AAC/N , AAA/K , AAG/K , AGT/S , AGC/S , AGA/R , AGG/R , GTT/V , GTC/V , GTA/V , GTG/V+, GCT/A , GCC/A , GCA/A , GCG/A , GAT/D , GAC/D , GAA/E , GAG/E , GGT/G , GGC/G , GGA/G , GGG/G
codon.Protozoan_Mitochondrial : TTT/F , TTC/F , TTA/L+, TTG/L+, TCT/S , TCC/S , TCA/S , TCG/S , TAT/Y , TAC/Y , TAA/* , TAG/* , TGT/C , TGC/C , TGA/W , TGG/W , CTT/L , CTC/L , CTA/L , CTG/L+, CCT/P , CCC/P , CCA/P , CCG/P , CAT/H , CAC/H , CAA/Q , CAG/Q , CGT/R , CGC/R , CGA/R , CGG/R , ATT/I+, ATC/I+, ATA/I+, ATG/M+, ACT/T , ACC/T , ACA/T , ACG/T , AAT/N , AAC/N , AAA/K , AAG/K , AGT/S , AGC/S , AGA/R , AGG/R , GTT/V , GTC/V , GTA/V , GTG/V+, GCT/A , GCC/A , GCA/A , GCG/A , GAT/D , GAC/D , GAA/E , GAG/E , GGT/G , GGC/G , GGA/G , GGG/G
codon.Coelenterate : TTT/F , TTC/F , TTA/L+, TTG/L+, TCT/S , TCC/S , TCA/S , TCG/S , TAT/Y , TAC/Y , TAA/* , TAG/* , TGT/C , TGC/C , TGA/W , TGG/W , CTT/L , CTC/L , CTA/L , CTG/L+, CCT/P , CCC/P , CCA/P , CCG/P , CAT/H , CAC/H , CAA/Q , CAG/Q , CGT/R , CGC/R , CGA/R , CGG/R , ATT/I+, ATC/I+, ATA/I+, ATG/M+, ACT/T , ACC/T , ACA/T , ACG/T , AAT/N , AAC/N , AAA/K , AAG/K , AGT/S , AGC/S , AGA/R , AGG/R , GTT/V , GTC/V , GTA/V , GTG/V+, GCT/A , GCC/A , GCA/A , GCG/A , GAT/D , GAC/D , GAA/E , GAG/E , GGT/G , GGC/G , GGA/G , GGG/G
codon.Mitochondrial : TTT/F , TTC/F , TTA/L+, TTG/L+, TCT/S , TCC/S , TCA/S , TCG/S , TAT/Y , TAC/Y , TAA/* , TAG/* , TGT/C , TGC/C , TGA/W , TGG/W , CTT/L , CTC/L , CTA/L , CTG/L+, CCT/P , CCC/P , CCA/P , CCG/P , CAT/H , CAC/H , CAA/Q , CAG/Q , CGT/R , CGC/R , CGA/R , CGG/R , ATT/I+, ATC/I+, ATA/I+, ATG/M+, ACT/T , ACC/T , ACA/T , ACG/T , AAT/N , AAC/N , AAA/K , AAG/K , AGT/S , AGC/S , AGA/R , AGG/R , GTT/V , GTC/V , GTA/V , GTG/V+, GCT/A , GCC/A , GCA/A , GCG/A , GAT/D , GAC/D , GAA/E , GAG/E , GGT/G , GGC/G , GGA/G , GGG/G
codon.Mycoplasma : TTT/F , TTC/F , TTA/L+, TTG/L+, TCT/S , TCC/S , TCA/S , TCG/S , TAT/Y , TAC/Y , TAA/* , TAG/* , TGT/C , TGC/C , TGA/W , TGG/W , CTT/L , CTC/L , CTA/L , CTG/L+, CCT/P , CCC/P , CCA/P , CCG/P , CAT/H , CAC/H , CAA/Q , CAG/Q , CGT/R , CGC/R , CGA/R , CGG/R , ATT/I+, ATC/I+, ATA/I+, ATG/M+, ACT/T , ACC/T , ACA/T , ACG/T , AAT/N , AAC/N , AAA/K , AAG/K , AGT/S , AGC/S , AGA/R , AGG/R , GTT/V , GTC/V , GTA/V , GTG/V+, GCT/A , GCC/A , GCA/A , GCG/A , GAT/D , GAC/D , GAA/E , GAG/E , GGT/G , GGC/G , GGA/G , GGG/G
codon.Spiroplasma : TTT/F , TTC/F , TTA/L+, TTG/L+, TCT/S , TCC/S , TCA/S , TCG/S , TAT/Y , TAC/Y , TAA/* , TAG/* , TGT/C , TGC/C , TGA/W , TGG/W , CTT/L , CTC/L , CTA/L , CTG/L+, CCT/P , CCC/P , CCA/P , CCG/P , CAT/H , CAC/H , CAA/Q , CAG/Q , CGT/R , CGC/R , CGA/R , CGG/R , ATT/I+, ATC/I+, ATA/I+, ATG/M+, ACT/T , ACC/T , ACA/T , ACG/T , AAT/N , AAC/N , AAA/K , AAG/K , AGT/S , AGC/S , AGA/R , AGG/R , GTT/V , GTC/V , GTA/V , GTG/V+, GCT/A , GCC/A , GCA/A , GCG/A , GAT/D , GAC/D , GAA/E , GAG/E , GGT/G , GGC/G , GGA/G , GGG/G
codon.Invertebrate_Mitochondrial : TTT/F , TTC/F , TTA/L , TTG/L+, TCT/S , TCC/S , TCA/S , TCG/S , TAT/Y , TAC/Y , TAA/* , TAG/* , TGT/C , TGC/C , TGA/W , TGG/W , CTT/L , CTC/L , CTA/L , CTG/L , CCT/P , CCC/P , CCA/P , CCG/P , CAT/H , CAC/H , CAA/Q , CAG/Q , CGT/R , CGC/R , CGA/R , CGG/R , ATT/I+, ATC/I+, ATA/M+, ATG/M+, ACT/T , ACC/T , ACA/T , ACG/T , AAT/N , AAC/N , AAA/K , AAG/K , AGT/S , AGC/S , AGA/S , AGG/S , GTT/V , GTC/V , GTA/V , GTG/V+, GCT/A , GCC/A , GCA/A , GCG/A , GAT/D , GAC/D , GAA/E , GAG/E , GGT/G , GGC/G , GGA/G , GGG/G
codon.Ciliate_Nuclear : TTT/F , TTC/F , TTA/L , TTG/L , TCT/S , TCC/S , TCA/S , TCG/S , TAT/Y , TAC/Y , TAA/Q , TAG/Q , TGT/C , TGC/C , TGA/* , TGG/W , CTT/L , CTC/L , CTA/L , CTG/L , CCT/P , CCC/P , CCA/P , CCG/P , CAT/H , CAC/H , CAA/Q , CAG/Q , CGT/R , CGC/R , CGA/R , CGG/R , ATT/I , ATC/I , ATA/I , ATG/M+, ACT/T , ACC/T , ACA/T , ACG/T , AAT/N , AAC/N , AAA/K , AAG/K , AGT/S , AGC/S , AGA/R , AGG/R , GTT/V , GTC/V , GTA/V , GTG/V , GCT/A , GCC/A , GCA/A , GCG/A , GAT/D , GAC/D , GAA/E , GAG/E , GGT/G , GGC/G , GGA/G , GGG/G
codon.Dasycladacean_Nuclear : TTT/F , TTC/F , TTA/L , TTG/L , TCT/S , TCC/S , TCA/S , TCG/S , TAT/Y , TAC/Y , TAA/Q , TAG/Q , TGT/C , TGC/C , TGA/* , TGG/W , CTT/L , CTC/L , CTA/L , CTG/L , CCT/P , CCC/P , CCA/P , CCG/P , CAT/H , CAC/H , CAA/Q , CAG/Q , CGT/R , CGC/R , CGA/R , CGG/R , ATT/I , ATC/I , ATA/I , ATG/M+, ACT/T , ACC/T , ACA/T , ACG/T , AAT/N , AAC/N , AAA/K , AAG/K , AGT/S , AGC/S , AGA/R , AGG/R , GTT/V , GTC/V , GTA/V , GTG/V , GCT/A , GCC/A , GCA/A , GCG/A , GAT/D , GAC/D , GAA/E , GAG/E , GGT/G , GGC/G , GGA/G , GGG/G
codon.Hexamita_Nuclear : TTT/F , TTC/F , TTA/L , TTG/L , TCT/S , TCC/S , TCA/S , TCG/S , TAT/Y , TAC/Y , TAA/Q , TAG/Q , TGT/C , TGC/C , TGA/* , TGG/W , CTT/L , CTC/L , CTA/L , CTG/L , CCT/P , CCC/P , CCA/P , CCG/P , CAT/H , CAC/H , CAA/Q , CAG/Q , CGT/R , CGC/R , CGA/R , CGG/R , ATT/I , ATC/I , ATA/I , ATG/M+, ACT/T , ACC/T , ACA/T , ACG/T , AAT/N , AAC/N , AAA/K , AAG/K , AGT/S , AGC/S , AGA/R , AGG/R , GTT/V , GTC/V , GTA/V , GTG/V , GCT/A , GCC/A , GCA/A , GCG/A , GAT/D , GAC/D , GAA/E , GAG/E , GGT/G , GGC/G , GGA/G , GGG/G
codon.Echinoderm_Mitochondrial : TTT/F , TTC/F , TTA/L , TTG/L , TCT/S , TCC/S , TCA/S , TCG/S , TAT/Y , TAC/Y , TAA/* , TAG/* , TGT/C , TGC/C , TGA/W , TGG/W , CTT/L , CTC/L , CTA/L , CTG/L , CCT/P , CCC/P , CCA/P , CCG/P , CAT/H , CAC/H , CAA/Q , CAG/Q , CGT/R , CGC/R , CGA/R , CGG/R , ATT/I , ATC/I , ATA/I , ATG/M+, ACT/T , ACC/T , ACA/T , ACG/T , AAT/N , AAC/N , AAA/N , AAG/K , AGT/S , AGC/S , AGA/S , AGG/S , GTT/V , GTC/V , GTA/V , GTG/V+, GCT/A , GCC/A , GCA/A , GCG/A , GAT/D , GAC/D , GAA/E , GAG/E , GGT/G , GGC/G , GGA/G , GGG/G
codon.Flatworm_Mitochondrial : TTT/F , TTC/F , TTA/L , TTG/L , TCT/S , TCC/S , TCA/S , TCG/S , TAT/Y , TAC/Y , TAA/* , TAG/* , TGT/C , TGC/C , TGA/W , TGG/W , CTT/L , CTC/L , CTA/L , CTG/L , CCT/P , CCC/P , CCA/P , CCG/P , CAT/H , CAC/H , CAA/Q , CAG/Q , CGT/R , CGC/R , CGA/R , CGG/R , ATT/I , ATC/I , ATA/I , ATG/M+, ACT/T , ACC/T , ACA/T , ACG/T , AAT/N , AAC/N , AAA/N , AAG/K , AGT/S , AGC/S , AGA/S , AGG/S , GTT/V , GTC/V , GTA/V , GTG/V+, GCT/A , GCC/A , GCA/A , GCG/A , GAT/D , GAC/D , GAA/E , GAG/E , GGT/G , GGC/G , GGA/G , GGG/G
codon.Euplotid_Nuclear : TTT/F , TTC/F , TTA/L , TTG/L , TCT/S , TCC/S , TCA/S , TCG/S , TAT/Y , TAC/Y , TAA/* , TAG/* , TGT/C , TGC/C , TGA/C , TGG/W , CTT/L , CTC/L , CTA/L , CTG/L , CCT/P , CCC/P , CCA/P , CCG/P , CAT/H , CAC/H , CAA/Q , CAG/Q , CGT/R , CGC/R , CGA/R , CGG/R , ATT/I , ATC/I , ATA/I , ATG/M+, ACT/T , ACC/T , ACA/T , ACG/T , AAT/N , AAC/N , AAA/K , AAG/K , AGT/S , AGC/S , AGA/R , AGG/R , GTT/V , GTC/V , GTA/V , GTG/V , GCT/A , GCC/A , GCA/A , GCG/A , GAT/D , GAC/D , GAA/E , GAG/E , GGT/G , GGC/G , GGA/G , GGG/G
codon.Bacterial_and_Plant_Plastid : TTT/F , TTC/F , TTA/L , TTG/L+, TCT/S , TCC/S , TCA/S , TCG/S , TAT/Y , TAC/Y , TAA/* , TAG/* , TGT/C , TGC/C , TGA/* , TGG/W , CTT/L , CTC/L , CTA/L , CTG/L+, CCT/P , CCC/P , CCA/P , CCG/P , CAT/H , CAC/H , CAA/Q , CAG/Q , CGT/R , CGC/R , CGA/R , CGG/R , ATT/I+, ATC/I+, ATA/I+, ATG/M+, ACT/T , ACC/T , ACA/T , ACG/T , AAT/N , AAC/N , AAA/K , AAG/K , AGT/S , AGC/S , AGA/R , AGG/R , GTT/V , GTC/V , GTA/V , GTG/V+, GCT/A , GCC/A , GCA/A , GCG/A , GAT/D , GAC/D , GAA/E , GAG/E , GGT/G , GGC/G , GGA/G , GGG/G
codon.Alternative_Yeast_Nuclear : TTT/F , TTC/F , TTA/L , TTG/L , TCT/S , TCC/S , TCA/S , TCG/S , TAT/Y , TAC/Y , TAA/* , TAG/* , TGT/C , TGC/C , TGA/* , TGG/W , CTT/L , CTC/L , CTA/L , CTG/S+, CCT/P , CCC/P , CCA/P , CCG/P , CAT/H , CAC/H , CAA/Q , CAG/Q , CGT/R , CGC/R , CGA/R , CGG/R , ATT/I , ATC/I , ATA/I , ATG/M+, ACT/T , ACC/T , ACA/T , ACG/T , AAT/N , AAC/N , AAA/K , AAG/K , AGT/S , AGC/S , AGA/R , AGG/R , GTT/V , GTC/V , GTA/V , GTG/V , GCT/A , GCC/A , GCA/A , GCG/A , GAT/D , GAC/D , GAA/E , GAG/E , GGT/G , GGC/G , GGA/G , GGG/G
codon.Ascidian_Mitochondrial : TTT/F , TTC/F , TTA/L , TTG/L+, TCT/S , TCC/S , TCA/S , TCG/S , TAT/Y , TAC/Y , TAA/* , TAG/* , TGT/C , TGC/C , TGA/W , TGG/W , CTT/L , CTC/L , CTA/L , CTG/L , CCT/P , CCC/P , CCA/P , CCG/P , CAT/H , CAC/H , CAA/Q , CAG/Q , CGT/R , CGC/R , CGA/R , CGG/R , ATT/I , ATC/I , ATA/M+, ATG/M+, ACT/T , ACC/T , ACA/T , ACG/T , AAT/N , AAC/N , AAA/K , AAG/K , AGT/S , AGC/S , AGA/G , AGG/G , GTT/V , GTC/V , GTA/V , GTG/V+, GCT/A , GCC/A , GCA/A , GCG/A , GAT/D , GAC/D , GAA/E , GAG/E , GGT/G , GGC/G , GGA/G , GGG/G
codon.Alternative_Flatworm_Mitochondrial : TTT/F , TTC/F , TTA/L , TTG/L , TCT/S , TCC/S , TCA/S , TCG/S , TAT/Y , TAC/Y , TAA/Y , TAG/* , TGT/C , TGC/C , TGA/W , TGG/W , CTT/L , CTC/L , CTA/L , CTG/L , CCT/P , CCC/P , CCA/P , CCG/P , CAT/H , CAC/H , CAA/Q , CAG/Q , CGT/R , CGC/R , CGA/R , CGG/R , ATT/I , ATC/I , ATA/I , ATG/M+, ACT/T , ACC/T , ACA/T , ACG/T , AAT/N , AAC/N , AAA/N , AAG/K , AGT/S , AGC/S , AGA/S , AGG/S , GTT/V , GTC/V , GTA/V , GTG/V , GCT/A , GCC/A , GCA/A , GCG/A , GAT/D , GAC/D , GAA/E , GAG/E , GGT/G , GGC/G , GGA/G , GGG/G
codon.Blepharisma_Macronuclear : TTT/F , TTC/F , TTA/L , TTG/L , TCT/S , TCC/S , TCA/S , TCG/S , TAT/Y , TAC/Y , TAA/* , TAG/Q , TGT/C , TGC/C , TGA/* , TGG/W , CTT/L , CTC/L , CTA/L , CTG/L , CCT/P , CCC/P , CCA/P , CCG/P , CAT/H , CAC/H , CAA/Q , CAG/Q , CGT/R , CGC/R , CGA/R , CGG/R , ATT/I , ATC/I , ATA/I , ATG/M+, ACT/T , ACC/T , ACA/T , ACG/T , AAT/N , AAC/N , AAA/K , AAG/K , AGT/S , AGC/S , AGA/R , AGG/R , GTT/V , GTC/V , GTA/V , GTG/V , GCT/A , GCC/A , GCA/A , GCG/A , GAT/D , GAC/D , GAA/E , GAG/E , GGT/G , GGC/G , GGA/G , GGG/G
codon.Chlorophycean_Mitochondrial : TTT/F , TTC/F , TTA/L , TTG/L , TCT/S , TCC/S , TCA/S , TCG/S , TAT/Y , TAC/Y , TAA/* , TAG/L , TGT/C , TGC/C , TGA/* , TGG/W , CTT/L , CTC/L , CTA/L , CTG/L , CCT/P , CCC/P , CCA/P , CCG/P , CAT/H , CAC/H , CAA/Q , CAG/Q , CGT/R , CGC/R , CGA/R , CGG/R , ATT/I , ATC/I , ATA/I , ATG/M+, ACT/T , ACC/T , ACA/T , ACG/T , AAT/N , AAC/N , AAA/K , AAG/K , AGT/S , AGC/S , AGA/R , AGG/R , GTT/V , GTC/V , GTA/V , GTG/V , GCT/A , GCC/A , GCA/A , GCG/A , GAT/D , GAC/D , GAA/E , GAG/E , GGT/G , GGC/G , GGA/G , GGG/G
codon.Trematode_Mitochondrial : TTT/F , TTC/F , TTA/L , TTG/L , TCT/S , TCC/S , TCA/S , TCG/S , TAT/Y , TAC/Y , TAA/* , TAG/* , TGT/C , TGC/C , TGA/W , TGG/W , CTT/L , CTC/L , CTA/L , CTG/L , CCT/P , CCC/P , CCA/P , CCG/P , CAT/H , CAC/H , CAA/Q , CAG/Q , CGT/R , CGC/R , CGA/R , CGG/R , ATT/I , ATC/I , ATA/M , ATG/M+, ACT/T , ACC/T , ACA/T , ACG/T , AAT/N , AAC/N , AAA/N , AAG/K , AGT/S , AGC/S , AGA/S , AGG/S , GTT/V , GTC/V , GTA/V , GTG/V+, GCT/A , GCC/A , GCA/A , GCG/A , GAT/D , GAC/D , GAA/E , GAG/E , GGT/G , GGC/G , GGA/G , GGG/G
codon.Scenedesmus_obliquus_Mitochondrial : TTT/F , TTC/F , TTA/L , TTG/L , TCT/S , TCC/S , TCA/* , TCG/S , TAT/Y , TAC/Y , TAA/* , TAG/L , TGT/C , TGC/C , TGA/* , TGG/W , CTT/L , CTC/L , CTA/L , CTG/L , CCT/P , CCC/P , CCA/P , CCG/P , CAT/H , CAC/H , CAA/Q , CAG/Q , CGT/R , CGC/R , CGA/R , CGG/R , ATT/I , ATC/I , ATA/I , ATG/M+, ACT/T , ACC/T , ACA/T , ACG/T , AAT/N , AAC/N , AAA/K , AAG/K , AGT/S , AGC/S , AGA/R , AGG/R , GTT/V , GTC/V , GTA/V , GTG/V , GCT/A , GCC/A , GCA/A , GCG/A , GAT/D , GAC/D , GAA/E , GAG/E , GGT/G , GGC/G , GGA/G , GGG/G
codon.Thraustochytrium_Mitochondrial : TTT/F , TTC/F , TTA/* , TTG/L , TCT/S , TCC/S , TCA/S , TCG/S , TAT/Y , TAC/Y , TAA/* , TAG/* , TGT/C , TGC/C , TGA/* , TGG/W , CTT/L , CTC/L , CTA/L , CTG/L , CCT/P , CCC/P , CCA/P , CCG/P , CAT/H , CAC/H , CAA/Q , CAG/Q , CGT/R , CGC/R , CGA/R , CGG/R , ATT/I+, ATC/I , ATA/I , ATG/M+, ACT/T , ACC/T , ACA/T , ACG/T , AAT/N , AAC/N , AAA/K , AAG/K , AGT/S , AGC/S , AGA/R , AGG/R , GTT/V , GTC/V , GTA/V , GTG/V+, GCT/A , GCC/A , GCA/A , GCG/A , GAT/D , GAC/D , GAA/E , GAG/E , GGT/G , GGC/G , GGA/G , GGG/G
EOF
for genome in "${genome_array[@]}"; do
echo "Creating snpEff database for ${genome}..."
if [[ ! -f "${genome}" ]]; then
echo "ERROR: Genome file not found: ${genome}"
exit 1
fi
# fetch genome
genome_filename="$(basename "${genome}")"
genome_parent="$(dirname "${genome}")"
genome_dir="${snpeff_db}/${genome_filename}"
genome_dir="${genome_dir%_Genome.fasta}"
genome_dir="${genome_dir%.fa.gz}"
genome_dir="${genome_dir%.fa}"
mkdir -p "${genome_dir}"
db_name="$(basename "${genome_dir}")"
if file --mime-type "${genome}" | grep -q gzip$; then
ln -sf "${genome}" "${genome_dir}/sequences.fa.gz"
else
ln -sf "${genome}" "${genome_dir}/sequences.fa"
fi
# fetch annotation
annotation=""
type=""
# Look for GFF first
for f in "$genome_parent"/*.gff*; do
if [[ -f "$f" ]]; then
annotation="$f"
type="gff"
break
fi
done
# If no GFF found, look for GTF
if [[ -z "$annotation" ]]; then
for f in "$genome_parent"/*.gtf*; do
if [[ -f "$f" ]]; then
annotation="$f"
type="gtf"
break
fi
done
fi
if [[ -z "$annotation" ]]; then
echo "ERROR: No GFF or GTF found in $genome_parent"
exit 1
fi
echo "Found annotation: ${annotation} (${type})"
if file --mime-type "${annotation}" | grep -q gzip$; then
ln -sf "${annotation}" "${genome_dir}/genes.${type}.gz"
else
ln -sf "${annotation}" "${genome_dir}/genes.${type}"
fi
# fetch CDS
cds="${genome%_Genome.fasta}_AnnotatedCDSs.fasta"
ln -sf ${cds} "${genome_dir}/cds.fa"
# fetch protein
# protein="${genome%_Genome.fasta}_AnnotatedProteins.fasta"
# ln -sf ${protein} "${genome_dir}/protein.fa"
# append genome to config file
echo "# ${db_name} genome configuration" >> "${snpeff_config}"
echo "${db_name}.genome : ${db_name}" >> "${snpeff_config}"
echo "" >> "${snpeff_config}"
# optional: adjust codon table
bed="$(ls "${genome_parent}"/*.bed 2>/dev/null | head -n 1 || true)"
if [[ -n "${bed}" ]]; then
echo "Found BED file: ${bed}"
MITO_CONTIGS=$(awk '$1 ~ /[Mm][Ii][Tt]/ {print $1}' "${bed}" | sort -u)
API_CONTIGS=$(awk '$1 ~ /[Aa][Pp][Ii]/ {print $1}' "${bed}" | sort -u)
if [[ -n "${MITO_CONTIGS}" ]]; then
echo "# Mitochondrial codon table configuration" >> "${snpeff_config}"
for contig in ${MITO_CONTIGS}; do
echo "${db_name}.${contig}.codonTable : Protozoan_Mitochondrial" >> "${snpeff_config}"
done
echo "" >> "${snpeff_config}"
fi
if [[ -n "${API_CONTIGS}" ]]; then
echo "# Apicoplast codon table configuration" >> "${snpeff_config}"
for contig in ${API_CONTIGS}; do
echo "${db_name}.${contig}.codonTable : Bacterial_and_Plant_Plastid" >> "${snpeff_config}"
done
echo "" >> "${snpeff_config}"
fi
fi
# build database
echo "Building snpEff databases for ${genome}..."
if [ ! "${type}" == "gtf" ]; then
snpEff build -c "${snpeff_config}" -dataDir "${snpeff_db}" -gff3 -v -noCheckProtein $(basename ${genome_dir})
else
snpEff build -c "${snpeff_config}" -dataDir "${snpeff_db}" -gtf22 -v -noCheckProtein $(basename ${genome_dir})
fi
doneAfter running the script, you might still need to modify the ./config/snpEff.config file somewhat.
For example, you can change the following line:
data.dir = ./snpEff_database/, to point to the location of your newly created database directory.
It’s best to set this to the absolute path on your system. Alternatively, you can omit this line from the config file and use the flag -dataDir when calling snpEff. Keep in mind that all the paths that were defined earlier should be relative to this config file, not to the working directory where snpEff is called from (this is in contrast to how most other tools work).
Additionally, if your reference genome includes a mitochondrial (or apicoplast in the case of _Plasmodium) contig, you will need to modify its codon table to make sure variants are translated into the correct amino acid.
7.2 Annotation
To annotate your VCF files, you can now run the following command, where database_name should match one of the names you defined in your config file/one of the subdirectories in the snpeff database directory (e.g. PlasmoDB-68_Pfalciparum3D7):
# NOTE: dataDir is relative to snpEff.config when provided as a relative path
snpEff ann \\
-config /path/to/snpeff.config \\
-dataDir /path/to/snpeff_database \\
-csvStats spneff_annotation.csv \\
-stats snpeff_annotation.html \\
${args} \\
<database_name> \\
${vcf} \\
> snps.ann.vcfSimilar to the gatk commands we saw earlier, various java flags can be passed to snpeff in order to, for example, set the maximum amount of memory that can be used -Xmx$8g.
Run snpeff on your own data and explore the output files. You can refer to the documentation pages to learn more about the different outputs.
7.3 MultiQC
At the end of your pipeline, you will likely end up with a directory filled with various outputs: log files, alignment and qc stats, etc. This is where the multiqc becomes a lifesaver. It can search through a directory with these types of outputs and summarise the results for all the tools that it knows about. Fortunately, it is backed by an extensive database of compatible tools and outputs. Running it is as simple as multiqc <results_directory>. For more info, we refer to its extensive documentation.
You can explore an example report here.