10 The Salmonella Concord mystery: antimicrobial resistance (AMR)

Author

Affiliation

Wim Cuypers

In the previous chapter, we focused on phylogenetic inference and outbreak investigation. In this chapter, we shift our attention to the genomic features of the Salmonella Concord isolates, with particular emphasis on antimicrobial resistance (AMR).

An important clinical detail not discussed earlier is that, due to concerns about invasive bacterial infection in these young patients, several were treated empirically with antibiotics. In all cases, first-line therapy was ineffective, and treatment was switched to either ceftriaxone or ciprofloxacin. In one patient, ciprofloxacin also failed, and therapy was subsequently changed to azithromycin, which successfully resolved the infection.

Phenotypic antimicrobial susceptibility testing demonstrated resistance to selected antibiotics in vitro.

In this chapter, we will investigate how to identify AMR genes in genome assemblies and how to interpret these findings in light of phenotypic resistance data. Ideally, we generate long-read reference genomes to obtain a comprehensive view of genome structure, including plasmids and other mobile genetic elements that frequently carry AMR genes. We can then annotate the genomes to identify specific resistance determinants and compare these results with phenotypic susceptibility testing.

This integrated approach allows us to assess the potential clinical implications of genomic findings and to better understand the mechanisms underlying antimicrobial resistance in these isolates.

10.1 Objectives

The objectives of this tutorial are:

To understand the importance of assessing genome assembly quality before downstream analyses
To learn how to identify antimicrobial resistance (AMR) genes in bacterial genome assemblies using dedicated tools
To compare genomic findings with phenotypic antimicrobial susceptibility testing results and interpret discrepancies in the context of potential resistance mechanisms

10.2 Long-read sequencing to obtain reference genomes

Nanopore data for these isolates are available on the SRA/ENA. They were generated with the (now superseded) R9.4.1 chemistry. Although R9.4.1 reads have a lower per-read accuracy than current R10.4.1 reads, the sequencing depth was high, so the data are sufficient for de novo assembly and downstream analyses.

10.2.1 QC on Nanopore Reads

To retrieve the data, we will use ‘fasterq-dump’ from the NCBI SRA Toolkit. It can run on multiple threads and should therefore be relatively fast. To save time and computational resources, we will retrieve the FASTQ data for only one sample to illustrate Nanopore QC, but you can retrieve the data for all samples if you like. The sample we will use is ERR9855653.

conda activate concord_dataset

CPUS=8

run="ERR9855653"
output_dir="concord_fastq_ERR9855653"
mkdir -p "$output_dir"

fasterq-dump "$run" --threads "$CPUS" --progress --outdir "$output_dir"

# hint: use pigz to gzip the fastq files in parallel
# (fasterq-dump wrote them to "$output_dir", not to the current directory)

pigz -p "$CPUS" "$output_dir"/*.fastq

You will end up with a FASTQ file containing the raw reads. We can now perform some quality control on these reads to assess their quality and suitability for downstream analyses.

Note

Exercise 1: QC on the raw reads

Perform QC on the raw Nanopore reads using tools such as fastqc, fastplong or NanoPlot. Assess the read length distribution, quality scores, and overall data quality. Are there any issues with the data that might affect downstream analyses? If so, how would you address them?
When using fastqc, the quality values look terrible. Why is that? Do you think the data are of poor quality, or is there another explanation for the low quality scores? Use your knowledge of Nanopore sequencing and of the tools you are using to support your answer.
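As a quick sanity check next to the dedicated QC tools, read-length statistics can also be computed with standard command-line tools. The sketch below is a minimal example, assuming a plain four-line-per-record FASTQ; the function name `fastq_stats` and the toy reads are made up for illustration.

```bash
# Summarise read lengths from a FASTQ stream: read count, total bases, mean length, N50.
# Assumes a standard 4-line-per-record FASTQ (sequences not wrapped over several lines).
fastq_stats() {
  # Sequence lines are every 4th line, starting at line 2.
  awk 'NR % 4 == 2 { print length($0) }' |
    sort -rn |
    awk '{ len[NR] = $1; total += $1 }
         END {
           if (NR == 0) { print "reads=0"; exit }
           half = total / 2
           # Walk from the longest read down until half of all bases are covered.
           for (i = 1; i <= NR; i++) {
             cum += len[i]
             if (cum >= half) { n50 = len[i]; break }
           }
           printf "reads=%d total_bases=%.0f mean_length=%.1f N50=%d\n", NR, total, total / NR, n50
         }'
}

# Toy example (made-up reads of 2, 4 and 6 bases), only to show the output format.
tmpdir=$(mktemp -d)
printf '@r1\nAC\n+\n!!\n@r2\nACGT\n+\n!!!!\n@r3\nACGTAC\n+\n!!!!!!\n' > "$tmpdir/toy.fastq"
fastq_stats < "$tmpdir/toy.fastq"
```

For a compressed file, the reads can be streamed in, for example with `gzip -dc reads.fastq.gz | fastq_stats`. Dividing `total_bases` by the expected genome size (about 5 Mb for Salmonella) gives a rough estimate of sequencing depth.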

10.2.2 Reference genomes (this course) and de novo assembly (optional)

Because downloading the reads and performing de novo assembly take quite some time, we have already performed the assemblies for you. You can download the resulting FASTA files as follows:

wget https://zenodo.org/api/records/18813829/files-archive -O concord_fasta.zip
unzip concord_fasta.zip

If you do want to start from the raw reads and perform the assemblies yourself, you can find the raw data on the SRA/ENA and retrieve them with ‘fasterq-dump’, as shown in the next (optional) section. After retrieving the data, you can use ‘filtlong’ to filter the reads on quality and length and to keep the best reads up to a total of 100× the genome size (about 5 Mb for Salmonella). This reduces the computational time for assembly while retaining the most informative reads.

Optional background: retrieving the unassembled reads yourself from the European Nucleotide Archive and performing de novo genome assembly.

conda activate concord_dataset


runs=(
  ERR9855645
  ERR9855644
  ERR9855648
  ERR9855650
  ERR9855647
  ERR9855646
  ERR9855652
  ERR9855653
  ERR9855651
  ERR9855649
)

CPUS=8

# create the same directory that fasterq-dump writes to
mkdir -p fastq
for run in "${runs[@]}"
do
  fasterq-dump "$run" --threads "$CPUS" --progress --outdir fastq
done

Oh no, fasterq-dump downloaded unzipped fastq files! We want to gzip them to save some space on our computer or server, but we have 8 CPUs, so we can do it in parallel. For that, we can use a tool named ‘pigz’, which is a parallel implementation of gzip.


pigz -p "$CPUS" fastq/*.fastq

Now we can use a tool named ‘filtlong’ to filter the reads based on quality and length, and to select the best 100× of the genome size (5 Mb for Salmonella). This will help us to reduce the computational time for assembly, while still retaining the most informative reads.


# 100× of 5 Mb = 500 Mb
TARGET_BASES=500000000

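# the gzipped reads are in fastq/; the filtered subsets are written to the current directory
for file in fastq/*.fastq.gz
do
  sample=$(basename "$file" .fastq.gz)
  filtlong --target_bases "$TARGET_BASES" "$file" | gzip > "${sample}.best500Mb.fastq.gz"
done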

With these subsets, we can now run the assembly program ‘flye’ to perform de novo assembly of the genomes. If you use the subsets, the process will be a bit faster.

We will specify the genome size, the number of threads to use, and the number of iterations for polishing the assembly.

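# run flye

genome=5m
threads=16

for fq in *.best500Mb.fastq.gz; do

  # assemble
  # note: --nano-hq expects high-accuracy basecalls; for older,
  # lower-accuracy basecalls flye also offers --nano-raw
  flye --nano-hq "$fq" \
    --genome-size "$genome" \
    --out-dir "flye_${fq%.best500Mb.fastq.gz}" \
    --threads "$threads" \
    --iterations 2
done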

For the Salmonella Concord paper (https://www.nature.com/articles/s41467-023-38902-x), from which these data were sourced, the program Trycycler was used for genome assembly.

In that study, we used not only Flye but also several other assemblers, and we semi-manually curated the assemblies to obtain the best possible results.

Today, an exciting new tool called Autocycler (https://github.com/rrwick/Autocycler/wiki) automates this entire process. It is well worth exploring and may save you considerable time.

You can skip the section above where the reads are downloaded. Instead, we will move directly to the more interesting part: identifying features in de novo genome assemblies!

If you are unsure about the difference between de novo assembly and read mapping, see the information box below.

De novo assembly vs reference-based mapping

When sequencing a bacterial genome, we obtain millions of reads.
There are two main analysis strategies.

10.2.3 🔎 Reference-based mapping

Align reads to an existing reference genome.

Best for
- SNP and small indel detection
- Phylogenetic analysis
- Outbreak investigation
- Resistance gene screening

Strengths
- Fast and computationally light
- Works with moderate coverage (Illumina reads for instance have high quality)
- Easy comparison across isolates

Limitations
- Requires a close reference
- Biased toward the reference
- Misses novel genes
- Poor detection of large rearrangements
- Cannot reconstruct full genome structure

You see differences relative to a known genome — you do not rebuild it.

10.2.4 🧩 De novo genome assembly

Reconstruct the genome directly from reads.

Best for
- Novel strain characterisation
- Gene discovery
- Plasmid and mobile element analysis
- Genome structure analysis, including full-length AMR genes and their genomic context

Strengths
- No reference required
- Reconstructs chromosomes and plasmids
- Enables genome annotation (genes, rRNA, tRNA)
- Detects structural variation

Limitations
- Higher coverage required
- More computationally demanding
- Risk of fragmented assemblies, and the assembly process often requires manual intervention.

You rebuild the genome and can study its full structure.

10.2.5 🔬 In practice (bacterial genomes are small)

For bacteria, we usually use both approaches in parallel:

Mapping → phylogenetic inference and SNP-based comparisons
De novo assembly → gene content, plasmids, genome structure

Because bacterial genomes are relatively small, running both is feasible and complementary. Also, by running both approaches, you can cross-validate findings (also related to quality control!) and get a more complete picture of the genome and its epidemiological context.

10.2.6 📌 Quick comparison

Question	Mapping	De novo
Need a reference?	Yes	No
Best for SNPs?	Excellent	Secondary
Discover new genes?	Rarely	Yes
Study plasmids?	Difficult	Yes
Study genome structure?	Limited	Yes
Computational cost	Low	Moderate–High

10.2.7 🧠 Simple analogy

Mapping = compare to an existing book.
Assembly = rebuild the entire book from fragments.

10.3 Assessing the reference genomes, and identifying features in the assemblies

10.3.1 Quick QC assessments

Typically, quality control is performed on raw sequencing reads. However, whenever you download or generate a bacterial genome assembly, it is equally important to assess its quality.

When assemblies are obtained from curated repositories such as RefSeq, quality metrics are usually provided. In this course, the genomes were downloaded from a Zenodo repository, so we will evaluate their quality ourselves before proceeding with downstream analyses.

Exercise 2: QC on the reference genomes

Perform quality control on the genome sequences. First, let’s get to know our data. You already know several bash tips and tricks for this. Use them to answer the following questions:

How many contigs are there in each assembly?
What is the total genome size for each assembly?
Are there any contigs that are unusually short or long?
Support your answers with specific observations from the assembly FASTA files.
Do you find logical results when you compare the assemblies to each other? For example, do they all have a similar number of contigs and similar genome sizes? If not, what could be the reason for that? Use your microbiology expertise or literature search skills to support your answer.

You may use bash one-liners, or any tools and workflows you are familiar with. Present your results in a table to facilitate comparison between assemblies and for use in the subsequent exercises.
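One possible starting point for such a table is sketched below. It is a minimal example that relies only on awk; the function names `contig_lengths` and `assembly_summary` and the toy FASTA are hypothetical, and the `.fasta` extension in the usage note is an assumption about the downloaded files.

```bash
# Print one row per contig: file, contig name, length (also handles wrapped FASTA).
contig_lengths() {
  for fasta in "$@"; do
    awk -v file="$(basename "$fasta")" '
      /^>/ { if (name != "") print file "\t" name "\t" len
             name = substr($1, 2); len = 0; next }
           { len += length($0) }
      END  { if (name != "") print file "\t" name "\t" len }
    ' "$fasta"
  done
}

# Collapse to one row per assembly: file, number of contigs, total size, largest contig.
assembly_summary() {
  contig_lengths "$@" |
    awk -F '\t' '
      !($1 in n) { order[++k] = $1 }      # remember files in input order
      { n[$1]++; total[$1] += $3; if ($3 > max[$1]) max[$1] = $3 }
      END { print "assembly\tcontigs\ttotal_bp\tlargest_bp"
            for (i = 1; i <= k; i++)
              print order[i] "\t" n[order[i]] "\t" total[order[i]] "\t" max[order[i]] }'
}

# Toy example: a made-up assembly with a 12 bp and a 4 bp contig.
tmpdir=$(mktemp -d)
printf '>chr\nACGTACGT\nACGT\n>plasmid1\nACGT\n' > "$tmpdir/toy.fasta"
assembly_summary "$tmpdir/toy.fasta"
```

On the course data this could be run as `assembly_summary *.fasta`, and `contig_lengths *.fasta` gives the per-contig view needed to spot unusually short or long contigs.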

Good, we now have an idea of what our contigs represent (plasmids, chromosome, …), and we can start looking for features in the genomes!

10.3.2 Finding features in the assemblies

Now that we have an idea of the quality of the assemblies, we can start looking for features in the genomes. We are particularly interested in antimicrobial resistance genes and plasmid replicon genes, but you can also look for other features if you like. Typically, we begin by annotating the complete genome assembly. Once annotation is complete, specific features of interest can be extracted from the resulting annotation files. Bakta (see the Bakta GitHub repository) is the preferred tool for annotation and should be available in your conda environment for these exercises.

In addition, we tend to run specialist annotation tools for AMR and virulence genes (AMRFinder) and for replicon genes (ABRicate with the plasmidfinder database); detected replicon genes indicate the presence of plasmids.
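The three tools mentioned above could be run over all assemblies along the following lines. This is a hedged starting point rather than the recommended settings for this project: the `.fasta` extension, the output names, and the `BAKTA_DB` location are assumptions, and each tool is only called if it is found in the active environment.

```bash
# Sketch: run Bakta, AMRFinderPlus and ABRicate (plasmidfinder database) on every assembly.
# Assumptions: assemblies end in .fasta and BAKTA_DB points to a downloaded Bakta database.
THREADS=8
BAKTA_DB="${BAKTA_DB:-/path/to/bakta_db}"   # hypothetical path, adjust to your setup

for fasta in *.fasta; do
  [ -e "$fasta" ] || continue               # skip if the glob matched nothing
  sample=$(basename "$fasta" .fasta)

  # Full genome annotation
  if command -v bakta >/dev/null 2>&1; then
    bakta --db "$BAKTA_DB" --output "bakta_${sample}" --prefix "$sample" \
      --genus Salmonella --threads "$THREADS" "$fasta"
  fi

  # AMR genes and point mutations; --plus adds virulence and stress genes
  if command -v amrfinder >/dev/null 2>&1; then
    amrfinder --nucleotide "$fasta" --organism Salmonella --plus \
      --threads "$THREADS" --output "${sample}.amrfinder.tsv"
  fi

  # Plasmid replicon genes
  if command -v abricate >/dev/null 2>&1; then
    abricate --db plasmidfinder "$fasta" > "${sample}.plasmidfinder.tsv"
  fi
done
```

Check each tool's `--help` output before running, since available options and database setup steps can differ between installed versions.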

Exercise 3: Annotation, AMR genes, and replicon genes

Run Bakta to perform full annotation. Look for the best options for this project, run the pipeline, and inspect the output files. You may specifically want to inspect the .gff files and look for your favourite genes!
After inspecting the output files, provide the complete DNA sequence of the blaCTX-M-15 gene.
Find more information about this gene in the literature, and explain why it is important in the context of antimicrobial resistance in Ethiopia.
Run AMRFinder to identify AMR genes in the assemblies. How many AMR genes do you find in each assembly? Are there any differences between the assemblies? If so, what could be the reason for that? Use your microbiology expertise or literature search skills to support your answer. To further understand the types of AMR markers, you can look at the info below.
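For extracting the sequence of an annotated gene, one generic approach is to look up its coordinates in the GFF3 file and cut that region out of the assembly, reverse-complementing it if the gene lies on the minus strand. The sketch below assumes a standard nine-column GFF3; the function name `extract_cds`, the toy files, and the attribute text used for matching are hypothetical, so check how the gene is labelled in your own annotation.

```bash
# Print the DNA sequence of the first CDS whose GFF3 attributes contain a given text.
# Usage: extract_cds TEXT annotation.gff3 genome.fasta
extract_cds() {
  pattern=$1; gff=$2; fasta=$3
  # GFF3 columns 1, 4, 5 and 7 are contig, start, end (1-based, inclusive) and strand.
  coords=$(awk -F '\t' -v pat="$pattern" \
    '$3 == "CDS" && index($9, pat) { print $1, $4, $5, $7; exit }' "$gff")
  [ -n "$coords" ] || { echo "no CDS matching '$pattern'" >&2; return 1; }
  set -- $coords
  awk -v contig="$1" -v start="$2" -v end="$3" -v strand="$4" '
    /^>/ { keep = (substr($1, 2) == contig); next }
    keep { seq = seq $0 }
    END {
      gene = toupper(substr(seq, start, end - start + 1))
      if (strand == "-") {
        # Reverse-complement genes annotated on the minus strand.
        comp["A"] = "T"; comp["C"] = "G"; comp["G"] = "C"; comp["T"] = "A"
        rc = ""
        for (i = length(gene); i >= 1; i--) {
          base = substr(gene, i, 1)
          rc = rc ((base in comp) ? comp[base] : "N")
        }
        gene = rc
      }
      print gene
    }' "$fasta"
}

# Toy example with made-up data: one 12 bp contig and two CDS features.
tmpdir=$(mktemp -d)
printf '>c1\nAAACCC\nGGGTTT\n' > "$tmpdir/toy.fasta"
printf 'c1\ttoy\tCDS\t4\t9\t.\t+\t0\tID=x1;gene=geneA\n' > "$tmpdir/toy.gff3"
printf 'c1\ttoy\tCDS\t1\t6\t.\t-\t0\tID=x2;gene=geneB\n' >> "$tmpdir/toy.gff3"
extract_cds "gene=geneA" "$tmpdir/toy.gff3" "$tmpdir/toy.fasta"
extract_cds "gene=geneB" "$tmpdir/toy.gff3" "$tmpdir/toy.fasta"
```

A useful check on the result is that the extracted sequence has a length divisible by three and begins with a start codon. Annotation tools often also write a FASTA file of feature sequences, which can serve as a cross-check.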

Genomic AMR mechanisms and phenotype

Genomic AMR annotation identifies resistance determinants that are present, but this does not automatically predict the observed phenotype. Resistance can arise through several major mechanisms:

Target modification
Mutations in antibiotic targets reduce drug binding. For example, mutations in the quinolone resistance-determining region (QRDR) of gyrA or gyrB can increase fluoroquinolone MICs.

Drug inactivation
Enzymes such as β-lactamases (e.g. blaTEM, blaCTX-M) hydrolyse β-lactam antibiotics, leading to ampicillin or cephalosporin resistance.

Target protection
Proteins such as Qnr protect DNA gyrase and topoisomerase IV from quinolones, often resulting in low-level resistance.

Reduced intracellular drug concentration
Efflux pumps (e.g. acrAB-tolC) decrease intracellular antibiotic levels and can contribute to decreased susceptibility to several drug classes at once.

Acquisition via horizontal gene transfer (HGT)
Many AMR genes are located on plasmids, transposons, or integrons, enabling rapid spread between strains and species.

10.3.3 Important caveat

Genotype–phenotype discordance is common. A gene may be truncated, partially assembled, poorly expressed, or present at low copy number. Conversely, chromosomal mutations, regulatory changes, or combined mechanisms (e.g. porin loss plus efflux upregulation) can increase MICs without obvious acquired genes.

Genomic findings should ideally be interpreted together with phenotypic susceptibility testing and relevant literature.
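Truncated or partially assembled genes can often be spotted directly in the AMR annotation output. The sketch below flags such hits in an AMRFinderPlus-style table. It assumes the column headers "Gene symbol" (or "Element symbol" in newer releases), "Method" and "% Coverage of reference sequence"; the function name `flag_partial_hits` and the example rows are made up and are not results from the course data.

```bash
# List hits that may not be functional: incomplete coverage of the reference gene,
# or a detection method that reports a partial hit or an internal stop codon.
# Columns are looked up by header name, because their order can differ between versions.
flag_partial_hits() {
  awk -F '\t' '
    NR == 1 {
      for (i = 1; i <= NF; i++) {
        if ($i == "Gene symbol" || $i == "Element symbol") sym = i
        if ($i == "Method") method = i
        if ($i == "% Coverage of reference sequence") cov = i
      }
      if (!sym || !method || !cov) { print "expected columns not found" > "/dev/stderr"; exit 1 }
      next
    }
    $cov + 0 < 100 || $method ~ /^PARTIAL/ || $method == "INTERNAL_STOP" {
      print $sym "\t" $method "\t" $cov
    }' "$1"
}

# Toy table with made-up values, only to show the behaviour.
tmpdir=$(mktemp -d)
printf 'Gene symbol\tMethod\t%% Coverage of reference sequence\n' > "$tmpdir/toy_amr.tsv"
printf 'geneA_demo\tALLELEX\t100.00\n' >> "$tmpdir/toy_amr.tsv"
printf 'geneB_demo\tPARTIALX\t62.50\n' >> "$tmpdir/toy_amr.tsv"
flag_partial_hits "$tmpdir/toy_amr.tsv"
```

Hits flagged this way are candidates for a closer look, not proof of a non-functional gene; the phenotype and the sequence itself remain the deciding evidence.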

Plasmid incompatibility (Inc) types and their relation to AMR

Plasmids are classified into incompatibility (Inc) groups based on their replication and maintenance systems. Plasmids belonging to the same Inc group cannot stably coexist in the same (bacterial) cell.

Certain Inc types are strongly associated with specific AMR genes and epidemiological patterns. For example:

IncF plasmids are common in Enterobacteriaceae and frequently carry ESBL genes (e.g. blaCTX-M).
IncI and IncHI plasmids are often associated with multidrug resistance.
Broad-host-range plasmids (e.g. IncA/C) facilitate interspecies spread.

Identifying plasmid replicon types (e.g. with PlasmidFinder) helps assess:
- Whether resistance genes are likely mobile
- The potential for horizontal transfer
- The epidemiological relatedness of isolates

The presence of an AMR gene on a conjugative plasmid (which can transfer horizontally) has different clinical and public health implications than the same gene integrated into the chromosome.
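With contiguous long-read assemblies, a simple way to explore this is to check whether an AMR gene and a plasmid replicon were detected on the same contig. The sketch below joins an AMRFinderPlus-style table with an ABRicate-style table on the contig name. It assumes the headers "Contig id" and "Gene symbol" (or "Element symbol") in the former and "SEQUENCE" and "GENE" in the latter; the function name and all example values are hypothetical.

```bash
# Report AMR genes located on a contig that also carries a plasmid replicon.
# Usage: amr_on_replicon_contigs amrfinder.tsv abricate_plasmidfinder.tsv
amr_on_replicon_contigs() {
  awk -F '\t' '
    # First file (ABRicate): find the contig and gene columns in the header ...
    FILENAME == ARGV[1] && FNR == 1 {
      for (i = 1; i <= NF; i++) {
        if ($i == "SEQUENCE") cs = i
        if ($i == "GENE") cg = i
      }
      next
    }
    # ... then remember the replicon(s) found on each contig.
    FILENAME == ARGV[1] {
      rep[$cs] = (rep[$cs] == "" ? $cg : rep[$cs] "," $cg)
      next
    }
    # Second file (AMRFinderPlus): find the columns, then print genes on replicon contigs.
    FNR == 1 {
      for (i = 1; i <= NF; i++) {
        if ($i == "Contig id") ac = i
        if ($i == "Gene symbol" || $i == "Element symbol") ag = i
      }
      next
    }
    ($ac in rep) { print $ag "\t" $ac "\t" rep[$ac] }
  ' "$2" "$1"
}

# Toy tables with made-up names, only to show the behaviour.
tmpdir=$(mktemp -d)
printf '#FILE\tSEQUENCE\tGENE\ntoy.fasta\tcontig_2\tInc_demo\n' > "$tmpdir/toy_rep.tsv"
printf 'Contig id\tGene symbol\ncontig_1\tgeneX_demo\ncontig_2\tbla_demo\n' > "$tmpdir/toy_amr2.tsv"
amr_on_replicon_contigs "$tmpdir/toy_amr2.tsv" "$tmpdir/toy_rep.tsv"
```

A shared contig supports a plasmid location, but the absence of a replicon hit does not prove that a gene is chromosomal, since not every plasmid carries a replicon present in the database. Contig length and circularity from the assembly QC help with that judgement.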

10.3.4 Genotype-phenotype association

Now that we have identified the presence of AMR genes in the assemblies, we can compare these findings to the phenotypic antimicrobial susceptibility testing results (see Table 10.1). This is an important step to understand the potential clinical implications of the genomic findings, and to evaluate whether the presence of certain AMR genes correlates with resistance observed in the laboratory.

Table 10.1: Phenotypic antimicrobial susceptibility profiles (R/S) for isolates included in the Ethiopia course. AMP = ampicillin; SXT = trimethoprim/sulfamethoxazole; CHL = chloramphenicol; CRO = ceftriaxone; CIP = ciprofloxacin; AZM = azithromycin; CAZ/AVI = ceftazidime/avibactam; C/T = ceftolozane/tazobactam; CST = colistin; MER/VAB = meropenem/vaborbactam; TZP = piperacillin/tazobactam; TEM = temocillin; TGC = tigecycline; MEM = meropenem. R = resistant; S = susceptible.

sequencing_ID	AMP	SXT	CHL	CRO	CIP	AZM	CAZ_AVI	C_T	CST	MER_VAB	TZP	TEM	TGC	MEM
SRS4345282	S	S	S	S	S	S	S	S	S	S	S	S	S	S
32640_1_336	S	S	S	S	S	S	S	S	S	S	S	S	S	S
32640_1_39	R	R	R	R	S	S	S	S	R	S	S	R	S	S
32640_1_47	R	R	R	R	S	S	S	S	S	S	S	R	S	S
32640_1_316	R	R	R	R	S	S	S	S	S	S	S	R	S	S
32640_1_372	R	R	R	R	S	S	S	R	S	S	S	R	S	S
32640_1_328	S	S	S	S	S	S	S	S	S	S	S	S	S	S
SRS5777896	R	R	R	R	R	S	S	S	S	S	S	R	S	S
32640_1_296	R	R	R	R	S	R	S	S	S	S	S	R	S	S
32640_1_364	R	R	R	R	S	R	S	S	S	S	S	R	S	S
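If Table 10.1 is saved as a tab-separated file, the resistance profile of each isolate can be summarised with a few lines of awk. This is a minimal sketch; the function name `resistance_profile` is hypothetical, and the example uses the header and two rows copied from Table 10.1.

```bash
# For each isolate in a tab-separated AST table (header row, then R/S calls),
# print the number of agents called resistant and their names.
resistance_profile() {
  awk -F '\t' '
    NR == 1 { for (i = 2; i <= NF; i++) drug[i] = $i; next }
    {
      n = 0; list = ""
      for (i = 2; i <= NF; i++)
        if ($i == "R") { n++; list = list (list == "" ? "" : ",") drug[i] }
      print $1 "\t" n "\t" (list == "" ? "-" : list)
    }' "$1"
}

# Header and two rows taken from Table 10.1.
tmpdir=$(mktemp -d)
printf 'sequencing_ID\tAMP\tSXT\tCHL\tCRO\tCIP\tAZM\tCAZ_AVI\tC_T\tCST\tMER_VAB\tTZP\tTEM\tTGC\tMEM\n' > "$tmpdir/ast.tsv"
printf 'SRS4345282\tS\tS\tS\tS\tS\tS\tS\tS\tS\tS\tS\tS\tS\tS\n' >> "$tmpdir/ast.tsv"
printf '32640_1_39\tR\tR\tR\tR\tS\tS\tS\tS\tR\tS\tS\tR\tS\tS\n' >> "$tmpdir/ast.tsv"
resistance_profile "$tmpdir/ast.tsv"
```

Note that this counts individual agents, not antimicrobial classes. Several agents in the table are β-lactams, so the count alone is not a multidrug-resistance classification.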

Exercise 4: genotype-phenotype association

In this exercise, you will compare the AMR genes detected in the genome assemblies with the phenotypic antimicrobial susceptibility testing (AST) results shown in Table 10.1.

The goal is to evaluate concordance between genotype and phenotype and to interpret discrepancies in a biologically meaningful way.
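Concordance for a single antibiotic can be summarised as a two-by-two count of phenotype (R or S) against presence or absence of a matching resistance determinant. The sketch below assumes a tab-separated AST table and a plain list of isolate IDs in which the determinant was detected; the function name `concordance`, the isolate IDs, and the values are made up for illustration and are not results from the course data.

```bash
# Cross-tabulate one phenotype column against presence of a resistance determinant.
# Usage: concordance AST_TABLE DRUG GENOTYPE_POSITIVE_IDS
#   AST_TABLE: tab-separated, header row with drug names, first column = isolate ID
#   GENOTYPE_POSITIVE_IDS: one isolate ID per line in which the determinant was detected
concordance() {
  awk -F '\t' -v drug="$2" '
    FILENAME == ARGV[1] { pos[$1] = 1; next }          # first file: genotype-positive IDs
    FNR == 1 {
      for (i = 2; i <= NF; i++) if ($i == drug) col = i
      if (!col) { print "drug not found: " drug > "/dev/stderr"; exit 1 }
      next
    }
    { key = ($col == "R" ? "R" : "S") "_" (($1 in pos) ? "gene+" : "gene-"); n[key]++ }
    END {
      if (col) printf "R_gene+=%d R_gene-=%d S_gene+=%d S_gene-=%d\n",
                 n["R_gene+"], n["R_gene-"], n["S_gene+"], n["S_gene-"]
    }
  ' "$3" "$1"
}

# Toy input with made-up isolates: two resistant, one susceptible, one gene-positive.
tmpdir=$(mktemp -d)
printf 'sequencing_ID\tCRO\tCIP\niso1\tR\tS\niso2\tR\tS\niso3\tS\tS\n' > "$tmpdir/toy_ast.tsv"
printf 'iso1\n' > "$tmpdir/toy_pos.txt"
concordance "$tmpdir/toy_ast.tsv" CRO "$tmpdir/toy_pos.txt"
```

The discordant cells are the interesting ones: `R_gene-` points to a mechanism that was not detected, and `S_gene+` to a gene that may be truncated, poorly expressed, or insufficient on its own.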

10.3.5 Part 1 – β-lactam resistance

Identify isolates that are resistant to ampicillin (AMP) and ceftriaxone (CRO).
For these isolates:
- Which β-lactamase genes are present (e.g. blaTEM, blaCTX-M)?
- Are ESBL genes detected?
Do the genotypes explain the observed phenotypes?
Are there isolates carrying β-lactamase genes but remaining phenotypically susceptible?

10.3.6 Part 2 – Trimethoprim/sulfamethoxazole and chloramphenicol

For isolates resistant to SXT:
- Are dfr and/or sul genes detected?
For chloramphenicol resistance:
- Are cat or cml genes present?
Is genotype–phenotype concordance complete?

10.3.7 Part 3 – Fluoroquinolones and azithromycin

Identify isolates resistant to ciprofloxacin (CIP).
- Are plasmid-mediated quinolone resistance genes (e.g. qnr) detected?
- Are QRDR mutations present in gyrA and/or parC?
For azithromycin-resistant isolates:
- Are macrolide resistance genes (e.g. mph, erm) present?
Discuss possible mechanisms if resistance is observed without an obvious gene.

10.3.8 Part 4 – Carbapenems and last-line agents

Are any isolates resistant to meropenem (MEM), meropenem/vaborbactam (MER/VAB), or colistin (CST)?
Are carbapenemase genes (e.g. blaKPC, blaNDM, blaOXA-48-like) detected?
If carbapenem resistance is absent, what does this suggest about the clinical risk profile of this dataset?

10.3.9 Part 5 – Plasmid context

Using your plasmid replicon typing results:

Which Inc types are detected in the MDR isolates?
Are ESBL genes located on plasmid-associated contigs?
Do multiple isolates share the same Inc type and resistance genes?
What does this suggest about:
- Horizontal gene transfer?
- Epidemiological relatedness?
- The potential for international spread?

10.3.10 Final questions

For each isolate, would you consider the genomic AMR prediction clinically reliable?
Identify any genotype–phenotype discrepancies.
- Propose biological explanations (gene expression, porin loss, efflux, gene truncation, assembly fragmentation, breakpoint interpretation).
Based on both genomic and phenotypic data, which isolates would you prioritise for public health surveillance?

Summarise your findings in a concise table that includes:
- sequencing_ID
- key resistance genes
- key plasmid Inc types
- MDR status
- notable genotype–phenotype discrepancies

10.4 Wrap up and outlook

As you may have noticed, the relationship between genotype and phenotype is not always straightforward. Even in Salmonella, where genotype–phenotype concordance is relatively high compared with many other bacteria, discrepancies can occur.

For example, you may detect a partial match to a known antimicrobial resistance (AMR) gene, while phenotypic susceptibility testing indicates that the isolate remains susceptible to the corresponding antibiotic. How should such a result be interpreted?

One approach is to consult the literature to determine whether similar variants have been described and functionally characterised. Another strategy is to evaluate the genetic variation in more detail. For instance:

Does the variant introduce a missense mutation, a premature stop codon, or a frameshift?
Could the mutation alter protein function by affecting key residues?
Might the change influence the three-dimensional conformation of the target protein?
Assessing mutations in their structural context can provide additional insight into their potential functional impact.

The next chapter introduces tools for working with protein 3D structures.