9 The Salmonella Concord mystery: reconstructing transmission across borders
Several young patients in Ethiopia presented with severe gastrointestinal disease in the early 200’s. Stool cultures yielded Salmonella, raising concerns about a possible localized outbreak. At first glance, the cases appeared epidemiologically linked: similar symptoms, similar age group, and temporal proximity.
However, the situation quickly became more complex.
Additional isolates with highly similar preliminary typing results were identified in European surveillance databases. Some were recovered from patients with recent travel to Ethiopia. Others had no documented travel history at all. Were these sporadic cases? Imported infections? Or evidence of sustained international transmission?
You have been asked to investigate.
As a microbial genomics investigator, your role is to determine whether these isolates represent:
- a single clonal outbreak strain,
- multiple unrelated introductions,
- or a broader, circulating lineage with regional persistence.
To answer this, you will move beyond traditional typing and reconstruct their evolutionary relationships using whole-genome data.
9.1 Analytical framework
Whole-genome sequencing was performed using Illumina short reads for all isolates, with selected genomes additionally sequenced using long-read platforms to generate high-quality references.
Your analysis will proceed from raw sequencing reads to phylogenetic inference. Along the way, you will:
- assess read quality and assembly characteristics,
- confirm species identity and serotype,
- explore antimicrobial resistance determinants,
- construct a core-genome SNP alignment,
- and infer a phylogeny to evaluate genetic relatedness.
The central question is not simply whether the isolates are Salmonella Concord, but whether they are genetically indistinguishable at outbreak scale, and what their phylogenetic structure reveals about transmission dynamics.
- Are Ethiopian cases part of a tight monophyletic cluster?
- Do European isolates nest within that cluster?
- Is there evidence for repeated exportation events?
- Or does the tree suggest long-term endemic circulation?
9.2 Isolate overview (short-read run accessions)
The data in Table 9.1 provide the epidemiological context. The genome sequences will provide the evolutionary context.
In this exercise, you will integrate both.
| sequencing_ID | lab_ID | SR_run_accession | country_of_isolation | year_of_isolation | source | travel_origin |
|---|---|---|---|---|---|---|
| SRS4345282 | 679052 | SRR8552256 | UK | 2019 | human | Zambia |
| 32640_1_336 | H044240362 | ERR9516343 | UK | 2004 | human | Kenya |
| 32640_1_39 | 64206 | ERR9516195 | Austria | 2006 | human | Ethiopia |
| 32640_1_47 | 95907 | ERR9516199 | Austria | 2007 | human | Ethiopia |
| 32640_1_316 | 1035531 | ERR9516333 | The Netherlands | 2006 | stool | NA |
| 32640_1_372 | 254833 | ERR9516357 | The Netherlands | 2003 | stool | NA |
| 32640_1_328 | H04340470 | ERR9516339 | UK | 2004 | human | NA |
| SRS5777896 | 850890 | SRR10604677 | UK | 2019 | human | NA |
| 32640_1_296 | 0508H45184 | ERR9516323 | Denmark | 2005 | stool | Ethiopia |
| 32640_1_364 | 70366 | ERR9516355 | USA | 2007 | stool | Ethiopia |
| FD01848716 | GR536 | ERR3636805 | Ethiopia | 2006 | blood | None |
| FD01848803 | GR450 | ERR3636847 | Ethiopia | 2006 | stool | None |
9.3 retrieving short read data
Even in the era of long-read sequencing, short reads remain highly relevant in 2026. They are typically less expensive to generate and are well suited for phylogenetic inference. Moreover, several established pipelines are specifically optimised for short-read data. Let us therefore retrieve some Illumina sequencing data.
Note that in addition to the isolate accessions mentioned above, we will download reads for an old isolate, one from 1977 with accession ERR9516255, which will be convenient for rooting the phylogentic tree eventually. Here, we hypothesise that this isolate from 1977 from Djibouti has a most recent common ancestor with all other isolates studied (note: we are not 100% sure about this, but it is a plausible hypothesis).
runs=( SRR8552256
ERR9516343
ERR9516195
ERR9516199
ERR9516333
ERR9516357
ERR9516339
SRR10604677
ERR9516323
ERR9516355
ERR9516255
ERR3636805
ERR3636847
)
CPUS=16
mkdir -p fastq_illumina
for run in "${runs[@]}"
do
fasterq-dump "$run" \
--threads "$CPUS" \
--split-files \
--progress \
--outdir fastq_illumina
done
cd fastq_illumina
pigz -p "$CPUS" *.fastq 9.4 QC, mapping, and variant calling
Recall the previous sessions in which we performed quality control, read mapping, and variant calling. This was quite challenging for Plasmodium, correct?
For bacteria such as Salmonella, however, there is good news. Well-established pipelines are available that perform read mapping, variant calling, and even core SNP alignment automatically. This allows you to focus primarily on interpreting the resulting phylogenetic tree.
Nevertheless, it remains essential that you understand each step of the pipeline and the tools involved.
Perform quality control on the raw Illumina reads.
You may use tools such as fastqc and multiqc, or any equivalent workflow you are familiar with.
Answer briefly:
- What is the overall per-base quality profile?
- Is there evidence of adapter contamination?
- Do any samples require trimming?
- Are there major differences in sequencing depth between samples?
- Support your answers with specific observations from the QC reports.
Also, by now, you may be familiar with tools that perform rapid taxonomic classifiaction such as kraken2, or tools such as checkM. Leverage your expertise with these tools to asses whether your have actually sequenced Salmonella and not some other nasty bacterium that has contaminated your culture!
Tip: for the FA5 advanced bioinformatics course, you will find the minikraken database installed on your system, and you can use this database. Alternatively, this database can also be used for kraken2, you can retrive it using the wget command and point kraken2 to the directory where it downloaded.
cd /home
mkdir databases
cd /home/databases
wget https://genome-idx.s3.amazonaws.com/kraken/k2_pluspf_08_GB_20251015.tar.gz- Are most reads being classified as Salmonella?
- Use your bash skils to obtain a summary of the taxonomic classification results across all samples.
- Are there samples where you doubt the quality and suspect that there was a contamination?
Last but not least, if most of the data appears to be Salmonella, it would be interesting to confirm the sertype that was determined in the lab bioinformatically. For this purpose, a tools exists named seqsero2 that can determine in silico what serotype your isolate is.
Although you can run seqsero2 directly on the raw reads in theory, it is often recommended to run it on the draft assembly, as this may yield more accurate results. Also, while preparing this course, seqsero2 only worked on assemblies.
for read1 in *_R1.fastq.gz
do
read2=${read1/_R1/_R2}
sample=basename "$read1" _R1.fastq.gz
spades.py \
--pe1-1 "$read1" \
--pe1-2 "$read2" \
--threads "$CPUS" \
-o "spades_${sample}"
done
Great! now you have assembled your reads into a draft assembly, which means its a fragmented assembly, where one biological molecule (e.g the bacterial chromosome), is represented by multiple contis (continuous stretches of assembled reads).
You can now use the assemblies to run Seqsero, but also, you could leverage this opportunity of having assembles genomes for another round of QC!
Perform a quick assembly QC using Quast.
- What can you learn when assessing the number of contigs, and the contig N50 value? If it is not clear what these values mean, you can check the
Quastmanual.
Now, if everything looks good to go quality-wise, run SeqSero2 and determine whether your isolates are actually Serovar Concord and not some other serovar.
- Are your draft assemblies classified as Salmonella enterica subspecies enterica serovar Concord?
For bacterial variant calling, we will use Snippy, a widely adopted pipeline for haploid bacterial genomes:
https://github.com/tseemann/snippy
Carefully read the documentation and determine:
- how to provide paired-end Illumina reads as input
- how to specify the reference genome
- which output files are generated
For this project, we will use a reference genome that was generated using PacBio long-read sequencing. The genome originates from a Salmonella Concord isolate recovered from a stool sample of a young Ethiopian adoptee who was treated for gastroenteritis at the Institute of Tropical Medicine in Antwerp, Belgium, in 2008.
Using a suitable reference genome is an important step in bacterial genomics analyses. Many analyses, such as read mapping, SNP detection, and phylogenetic reconstruction, depend on aligning sequencing reads to a reference. When the reference genome is closely related to the isolates under study, reads map more accurately and fewer regions are incorrectly interpreted as variation.
This is particularly important for Salmonella. Although isolates within a serovar can be highly similar, their genomes may still contain substantial variation due to mobile genetic elements such as plasmids, genomic islands, and prophages. If a distant reference genome is used, parts of the genome may fail to map or may produce misleading signals of variation. A reference that is closely related to the study population therefore improves the reliability of downstream analyses.
You can download the this reference genome as follows:
wget https://zenodo.org/records/18813829/files/reference_str1309.fasta?download=1
mv -v reference_str1309.fasta?download=1 concord_str1309.fastaNow you are all set for the next excercise!
Run Snippy separately for each isolate. You will need a reference genome, which we will provide during the course. (link to Zenodo)
After completion, inspect the output directory for one sample. You should find, among others:
- a
*.bamfile (aligned reads)
- a
*.bam.baiindex file
- a
*.vcffile (variant calls)
- a
*.consensus.fafile
- summary statistics
9.4.0.1 Questions
- What information is stored in the
*.vcffile?- What do the columns represent?
- How are SNPs defined relative to the reference genome?
- What do the columns represent?
- What is the purpose of the BAM file?
- How could you visualise it?
- What does read depth tell you?
- How could you visualise it?
- For each isolate, how many variants are detected relative to the reference genome?
- How can you count them directly from the VCF file?
- Are all variants high quality? How do you know?
- How can you count them directly from the VCF file?
Be precise. Interpret the file contents rather than only reporting filenames.
9.5 From variants to outbreak inference
Once all isolates have been processed, your next step is to compare them.
Snippy provides an additional command that combines per-sample results into a core SNP alignment across all isolates. Identify the correct command in the documentation yourself and examine its parameters.
Here is the snippy repo linked: Snippy: Rapid haploid variant calling and core genome alignment
After generating a core SNP alignment, you should find a file named core.full.aln in the output directory. This file contains the aligned sequences of all isolates at the core SNP sites.
- Inspect the alignment file. What does each row represent?
- How many core SNP sites are included?
- Are there many differences between isolates, or only a few?
This alignment can be used as input for a phylogenetic inference tool. The resulting tree will allow you to:
- assess genetic relatedness among isolates
- determine whether they cluster tightly
- evaluate whether they are consistent with a single outbreak strain
Tip: u can use tools such as FastTree, IQtree, or RaxML-ng. They should all provide roughly the same results. RaxML-ng, when used correctly, may yield the most accurate hypothesis of the phylogeny of Salmonella Concord, but will also require ac higher runtime and more ocmputational resources.
In bacterial outbreak genomics, small SNP distances often indicate recent transmission, whereas large distances suggest unrelated cases.
Your final interpretation should integrate:
- SNP distances
- clustering patterns
- resistance phenotypes
- metadata (cfr. epidemiological context in Ethiopia
Only then can you conclude whether this dataset represents a localized outbreak of Salmonella Concord or multiple unrelated infections of strains imported from elsewhere.