Bacterial Pathogen Genomics Workshop

Author

Vishnu Raghuram

5-Day Pathogen Genomics Workshop Schedule

This workshop aims to give participants a solid foundation in pathogen genomics. It covers key bioinformatics tools and concepts through a mix of theoretical learning and hands-on exercises.


Day 1: Bash Basics & Genomics Workflow

Goal: Get everyone on the same page regarding command-line basics, the bacterial genomics workflow, and basic tools used in pathogen genomics.

1. Introduction to the Command Line

  • What is the command line? Why is it essential in genomics?
  • File structure, directories, and navigating the command line.
  • Key commands (e.g., ls, cd, cp, mkdir)
  • Simple text processing (using grep, sed, tr)
  • Hands-on exercises combining different commands, piping and output redirection.
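
As an illustration of the kind of exercise this session could include, the sketch below chains grep, sed, tr and wc with pipes and output redirection to summarise a FASTA file. The file demo.fasta, its contents, and the variable names are made up for this example; they are not workshop data.

```shell
# Work in a throwaway directory so nothing in the current folder is touched.
workdir=$(mktemp -d)

# Create a small hypothetical FASTA file (3 records, 8 bases each).
printf '>seq1 sample_A\nACGTACGT\n>seq2 sample_B\nGGCCTTAA\n>seq3 sample_A\nTTTTAAAA\n' \
  > "$workdir/demo.fasta"

# Count sequences: every FASTA record starts with a '>' header line.
n_seqs=$(grep -c '^>' "$workdir/demo.fasta")

# Extract record IDs: keep header lines, strip the leading '>',
# then drop everything after the first space.
seq_ids=$(grep '^>' "$workdir/demo.fasta" | sed 's/^>//; s/ .*//')

# Count bases: drop header lines, delete newlines, count what is left.
# The trailing tr removes padding that some wc implementations add.
n_bases=$(grep -v '^>' "$workdir/demo.fasta" | tr -d '\n' | wc -c | tr -d ' ')

# Redirect a one-line summary into a file, then display it.
echo "sequences=$n_seqs bases=$n_bases" > "$workdir/summary.txt"
cat "$workdir/summary.txt"
```

Each step reads the previous step's output through a pipe, so no intermediate files are needed until the final redirection.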

2. Overview of the Bacterial Genomics Workflow

  • Overview of the genomic analysis pipeline for bacteria: QC, genome assembly, annotation, variant calling
  • Why do we perform each step? What are the key questions each step answers in pathogen genomics?
  • Participants will run fastp, Shovill, bakta and snippy (or snippy-multi) on sample data.
  • Writing a simple loop to automate processing multiple samples
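
A minimal sketch of such a loop is shown below. It assumes paired-end reads named <sample>_R1.fastq.gz and <sample>_R2.fastq.gz, which is a common convention but only an assumption here. The sample names are hypothetical, and the echo line is a placeholder for a real tool call (for example fastp or Shovill, whose options should be taken from their own documentation).

```shell
# Set up a throwaway directory with empty placeholder read files.
workdir=$(mktemp -d)
mkdir -p "$workdir/reads"
for s in sampleA sampleB; do
  touch "$workdir/reads/${s}_R1.fastq.gz" "$workdir/reads/${s}_R2.fastq.gz"
done

processed=""
# Loop over every forward-read file; the glob expands in sorted order.
for r1 in "$workdir"/reads/*_R1.fastq.gz; do
  # Derive the sample name by stripping the directory and the suffix.
  sample=$(basename "$r1" _R1.fastq.gz)
  r2="$workdir/reads/${sample}_R2.fastq.gz"

  # Skip samples whose reverse-read file is missing.
  if [ ! -f "$r2" ]; then
    echo "Skipping $sample: no R2 file found" >&2
    continue
  fi

  # Placeholder: replace this echo with the actual command for the step.
  echo "Would process $sample using $r1 and $r2"
  processed="$processed $sample"
done
```

Deriving the sample name from the file name keeps the loop body identical for every sample, so adding more samples requires no changes to the script.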

Day 2: Public Databases & Genomic Distance Estimation

Goal: Introduce public pathogen surveillance databases and tools for genome comparison.

3. Public Pathogen Genomics Databases

  • Exploring NCBI, SRA, BioSample, and BioProject.
  • The importance of using public databases in surveillance and research
  • Retrieving genomic data from public databases (ncbi-genome-download, ncbi-datasets, sra-tools).
  • Exploring Enterobase.

4. Genomic distance estimation

  • MLST and how it’s used in pathogen typing
  • Introduction to genome distance estimation methods (Mash distances, ANI, SNP distances)
  • Participants will use MLST to assign sequence types to bacterial genomes.
  • Participants will run mash, skani, and snp-dists to calculate genome distances from sample data.
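
For reference, the Mash distance is derived from the Jaccard index j of the two genomes' k-mer sketches and the k-mer size k. The worked numbers below (j = 0.5, k = 21) are chosen only for illustration and do not come from the workshop data.

```latex
D = -\frac{1}{k}\,\ln\!\left(\frac{2j}{1+j}\right)
\qquad\text{e.g. } j = 0.5,\; k = 21:\quad
D = -\frac{1}{21}\,\ln\!\left(\frac{1}{1.5}\right) \approx 0.0193
```

Under the usual approximation ANI ≈ 1 − D, this example corresponds to roughly 98% average nucleotide identity. SNP distances, by contrast, are simple counts of differing positions in an alignment.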

5. Basic Data Visualization with RStudio

  • Setting up RStudio, installing essential packages, reading data tables
  • Data visualization in RStudio: ggplot2, ggtree, and other essential packages.
  • Participants will create basic visualizations using ggplot2 (e.g., bar plots, scatter plots).

Day 3: Analysis of Populations Using Pangenomics & Phylogenies

Goal: Introduce pangenomes and phylogenies.

6. Phylogenetic Trees and gene presence/absence data

  • What is a phylogenetic tree? How is it used to study pathogen evolution?
  • Discussion on tree-building methods (e.g. Neighbor-Joining, Maximum Likelihood)
  • Participants will generate a Neighbor-Joining tree using mashtree and an ML tree using IQ-TREE.
  • What is pangenomics? Why is it important for studying pathogen diversity?
  • Overview of pangenomic analysis workflows and tools (e.g., Panaroo for bacterial pangenomes).

Day 4: Advanced Visualization & Data Analysis

Goal: Dive deeper into RStudio for advanced data visualization techniques and analysis.

7. Advanced Visualization with R

  • Advanced visualizations: Heatmaps, PCA/UMAP, faceting.
  • Advanced ggtree usage with gheatmap
  • Exploring visualization of SNP distances, pangenomes, and phylogenetic trees.

8. Large-Scale Data Analysis Using HPC, Nextflow, and SLURM

  • Brief overview of writing pipelines with Nextflow and SLURM
  • Discussing scaling up and automation (using Bactopia)
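
To make the scaling idea concrete, the sketch below shows one common pattern: a SLURM array job in which each task picks one sample from a list. The resource values, job name, and sample names are hypothetical and would need adjusting for a given cluster. The echo line is a placeholder for the real analysis command. When run outside SLURM, the task ID falls back to 1 so the script can be tested locally.

```shell
#!/bin/bash
#SBATCH --job-name=demo_array
#SBATCH --array=1-3
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --time=01:00:00

# Hypothetical sample list, one sample name per line.
sample_list=$(mktemp)
printf 'sampleA\nsampleB\nsampleC\n' > "$sample_list"

# SLURM sets SLURM_ARRAY_TASK_ID for each array task; default to 1 elsewhere.
task_id=${SLURM_ARRAY_TASK_ID:-1}

# Select the line of the sample list that matches this task's index.
sample=$(sed -n "${task_id}p" "$sample_list")

# Placeholder: replace this echo with the actual per-sample command.
echo "Task $task_id would process $sample"
```

Submitted with sbatch, a script of this shape runs one task per sample in parallel. Workflow managers such as Nextflow automate this kind of job submission.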

Day 5: Troubleshooting & Open Session

Goal: Allow time for troubleshooting, revisiting any unclear topics, and addressing specific questions.

  • Extra time in case of any overflow from previous days
  • Brief recap of everything covered over the week.
  • Participants can ask specific questions or seek help with problems encountered during the week.
  • Help with more advanced problems that might relate to participants’ own research.