9 The Salmonella Concord mystery: reconstructing transmission across borders

Author

Affiliation

Wim Cuypers

Several young patients in Ethiopia presented with severe gastrointestinal disease in the early 200’s. Stool cultures yielded Salmonella, raising concerns about a possible localized outbreak. At first glance, the cases appeared epidemiologically linked: similar symptoms, similar age group, and temporal proximity.

However, the situation quickly became more complex.

Additional isolates with highly similar preliminary typing results were identified in European surveillance databases. Some were recovered from patients with recent travel to Ethiopia. Others had no documented travel history at all. Were these sporadic cases? Imported infections? Or evidence of sustained international transmission?

You have been asked to investigate.

As a microbial genomics investigator, your role is to determine whether these isolates represent:

a single clonal outbreak strain,
multiple unrelated introductions,
or a broader, circulating lineage with regional persistence.

To answer this, you will move beyond traditional typing and reconstruct their evolutionary relationships using whole-genome data.

9.1 Analytical framework

Whole-genome sequencing was performed using Illumina short reads for all isolates, with selected genomes additionally sequenced using long-read platforms to generate high-quality references.

Your analysis will proceed from raw sequencing reads to phylogenetic inference. Along the way, you will:

assess read quality and assembly characteristics,
confirm species identity and serotype,
explore antimicrobial resistance determinants,
construct a core-genome SNP alignment,
and infer a phylogeny to evaluate genetic relatedness.

The central question is not simply whether the isolates are Salmonella Concord, but whether they are genetically indistinguishable at outbreak scale, and what their phylogenetic structure reveals about transmission dynamics.

Are Ethiopian cases part of a tight monophyletic cluster?
Do European isolates nest within that cluster?
Is there evidence for repeated exportation events?
Or does the tree suggest long-term endemic circulation?

9.2 Isolate overview (short-read run accessions)

The data in Table 9.1 provide the epidemiological context. The genome sequences will provide the evolutionary context.

In this exercise, you will integrate both.

Table 9.1: Overview of isolates included in the course, showing short-read run accessions and key epidemiological metadata.

sequencing_ID	lab_ID	SR_run_accession	country_of_isolation	year_of_isolation	source	travel_origin
SRS4345282	679052	SRR8552256	UK	2019	human	Zambia
32640_1_336	H044240362	ERR9516343	UK	2004	human	Kenya
32640_1_39	64206	ERR9516195	Austria	2006	human	Ethiopia
32640_1_47	95907	ERR9516199	Austria	2007	human	Ethiopia
32640_1_316	1035531	ERR9516333	The Netherlands	2006	stool	NA
32640_1_372	254833	ERR9516357	The Netherlands	2003	stool	NA
32640_1_328	H04340470	ERR9516339	UK	2004	human	NA
SRS5777896	850890	SRR10604677	UK	2019	human	NA
32640_1_296	0508H45184	ERR9516323	Denmark	2005	stool	Ethiopia
32640_1_364	70366	ERR9516355	USA	2007	stool	Ethiopia
FD01848716	GR536	ERR3636805	Ethiopia	2006	blood	None
FD01848803	GR450	ERR3636847	Ethiopia	2006	stool	None

9.3 retrieving short read data

Even in the era of long-read sequencing, short reads remain highly relevant in 2026. They are typically less expensive to generate and are well suited for phylogenetic inference. Moreover, several established pipelines are specifically optimised for short-read data. Let us therefore retrieve some Illumina sequencing data.

Note that in addition to the isolate accessions mentioned above, we will download reads for an old isolate, one from 1977 with accession ERR9516255, which will be convenient for rooting the phylogentic tree eventually. Here, we hypothesise that this isolate from 1977 from Djibouti has a most recent common ancestor with all other isolates studied (note: we are not 100% sure about this, but it is a plausible hypothesis).

runs=( SRR8552256
ERR9516343
ERR9516195
ERR9516199
ERR9516333
ERR9516357
ERR9516339
SRR10604677
ERR9516323
ERR9516355
ERR9516255
ERR3636805
ERR3636847
)


CPUS=16

mkdir -p fastq_illumina

for run in "${runs[@]}"
do
  fasterq-dump "$run" \
    --threads "$CPUS" \
    --split-files \
    --progress \
    --outdir fastq_illumina

done

cd fastq_illumina
pigz -p "$CPUS" *.fastq

9.4 QC, mapping, and variant calling

Recall the previous sessions in which we performed quality control, read mapping, and variant calling. This was quite challenging for Plasmodium, correct?

For bacteria such as Salmonella, however, there is good news. Well-established pipelines are available that perform read mapping, variant calling, and even core SNP alignment automatically. This allows you to focus primarily on interpreting the resulting phylogenetic tree.

Nevertheless, it remains essential that you understand each step of the pipeline and the tools involved.

Exercise 1: Quality control

Perform quality control on the raw Illumina reads.

You may use tools such as fastqc and multiqc, or any equivalent workflow you are familiar with.

Answer briefly:

What is the overall per-base quality profile?
Is there evidence of adapter contamination?
Do any samples require trimming?
Are there major differences in sequencing depth between samples?
Support your answers with specific observations from the QC reports.

Also, by now, you may be familiar with tools that perform rapid taxonomic classifiaction such as kraken2, or tools such as checkM. Leverage your expertise with these tools to asses whether your have actually sequenced Salmonella and not some other nasty bacterium that has contaminated your culture!

Tip: for the FA5 advanced bioinformatics course, you will find the minikraken database installed on your system, and you can use this database. Alternatively, this database can also be used for kraken2, you can retrive it using the wget command and point kraken2 to the directory where it downloaded.

cd /home
mkdir databases
cd /home/databases
wget https://genome-idx.s3.amazonaws.com/kraken/k2_pluspf_08_GB_20251015.tar.gz

Are most reads being classified as Salmonella?
Use your bash skils to obtain a summary of the taxonomic classification results across all samples.
Are there samples where you doubt the quality and suspect that there was a contamination?

Optional excercise 1.1: serotype determination

Last but not least, if most of the data appears to be Salmonella, it would be interesting to confirm the sertype that was determined in the lab bioinformatically. For this purpose, a tools exists named seqsero2 that can determine in silico what serotype your isolate is.

Although you can run seqsero2 directly on the raw reads in theory, it is often recommended to run it on the draft assembly, as this may yield more accurate results. Also, while preparing this course, seqsero2 only worked on assemblies.


for read1 in *_R1.fastq.gz
do
    read2=${read1/_R1/_R2}
    sample=basename "$read1" _R1.fastq.gz

    spades.py \
        --pe1-1 "$read1" \
        --pe1-2 "$read2" \
        --threads "$CPUS" \
        -o "spades_${sample}"
done

Great! now you have assembled your reads into a draft assembly, which means its a fragmented assembly, where one biological molecule (e.g the bacterial chromosome), is represented by multiple contis (continuous stretches of assembled reads).

You can now use the assemblies to run Seqsero, but also, you could leverage this opportunity of having assembles genomes for another round of QC!

Perform a quick assembly QC using Quast.

What can you learn when assessing the number of contigs, and the contig N50 value? If it is not clear what these values mean, you can check the Quast manual.

Now, if everything looks good to go quality-wise, run SeqSero2 and determine whether your isolates are actually Serovar Concord and not some other serovar.

Are your draft assemblies classified as Salmonella enterica subspecies enterica serovar Concord?

For bacterial variant calling, we will use Snippy, a widely adopted pipeline for haploid bacterial genomes:

https://github.com/tseemann/snippy

Carefully read the documentation and determine:

how to provide paired-end Illumina reads as input
how to specify the reference genome
which output files are generated

Tip

For this project, we will use a reference genome that was generated using PacBio long-read sequencing. The genome originates from a Salmonella Concord isolate recovered from a stool sample of a young Ethiopian adoptee who was treated for gastroenteritis at the Institute of Tropical Medicine in Antwerp, Belgium, in 2008.

Using a suitable reference genome is an important step in bacterial genomics analyses. Many analyses, such as read mapping, SNP detection, and phylogenetic reconstruction, depend on aligning sequencing reads to a reference. When the reference genome is closely related to the isolates under study, reads map more accurately and fewer regions are incorrectly interpreted as variation.

This is particularly important for Salmonella. Although isolates within a serovar can be highly similar, their genomes may still contain substantial variation due to mobile genetic elements such as plasmids, genomic islands, and prophages. If a distant reference genome is used, parts of the genome may fail to map or may produce misleading signals of variation. A reference that is closely related to the study population therefore improves the reliability of downstream analyses.

You can download the this reference genome as follows:

wget https://zenodo.org/records/18813829/files/reference_str1309.fasta?download=1 
mv -v reference_str1309.fasta?download=1 concord_str1309.fasta

Now you are all set for the next excercise!

Exercise 2: mapping, variant calling

Run Snippy separately for each isolate. You will need a reference genome, which we will provide during the course. (link to Zenodo)

After completion, inspect the output directory for one sample. You should find, among others:

a *.bam file (aligned reads)
a *.bam.bai index file
a *.vcf file (variant calls)
a *.consensus.fa file
summary statistics

9.4.0.1 Questions

What information is stored in the *.vcf file?
- What do the columns represent?
- How are SNPs defined relative to the reference genome?
What is the purpose of the BAM file?
- How could you visualise it?
- What does read depth tell you?
For each isolate, how many variants are detected relative to the reference genome?
- How can you count them directly from the VCF file?
- Are all variants high quality? How do you know?

Be precise. Interpret the file contents rather than only reporting filenames.

9.5 From variants to outbreak inference

Once all isolates have been processed, your next step is to compare them.

Snippy provides an additional command that combines per-sample results into a core SNP alignment across all isolates. Identify the correct command in the documentation yourself and examine its parameters.

Here is the snippy repo linked: Snippy: Rapid haploid variant calling and core genome alignment

Excercise 3: generating a core SNP alignment and phylogenetic inference

After generating a core SNP alignment, you should find a file named core.full.aln in the output directory. This file contains the aligned sequences of all isolates at the core SNP sites.

Inspect the alignment file. What does each row represent?
How many core SNP sites are included?
Are there many differences between isolates, or only a few?

This alignment can be used as input for a phylogenetic inference tool. The resulting tree will allow you to:

assess genetic relatedness among isolates
determine whether they cluster tightly
evaluate whether they are consistent with a single outbreak strain

Tip: u can use tools such as FastTree, IQtree, or RaxML-ng. They should all provide roughly the same results. RaxML-ng, when used correctly, may yield the most accurate hypothesis of the phylogeny of Salmonella Concord, but will also require ac higher runtime and more ocmputational resources.

In bacterial outbreak genomics, small SNP distances often indicate recent transmission, whereas large distances suggest unrelated cases.

Your final interpretation should integrate:

SNP distances
clustering patterns
resistance phenotypes
metadata (cfr. epidemiological context in Ethiopia

Only then can you conclude whether this dataset represents a localized outbreak of Salmonella Concord or multiple unrelated infections of strains imported from elsewhere.