SNPKIT

Synopsis

SNPKIT is a microbial variant calling pipeline/toolkit that can be used for outbreak investigations and other clinical microbiology projects.

Author

Ali Pirani

Installation

The pipeline can be set up in two steps:

  1. Clone the snpkit GitHub repository onto your system.
git clone https://github.com/alipirani88/snpkit.git

  2. Use the snpkit/envs/environment.yml and snpkit/envs/environment_gubbins.yml files to set up the conda environments.
conda env create -f snpkit/envs/environment.yml -n snpkit
conda env create -f snpkit/envs/environment_gubbins.yml -n gubbins

Check installation

conda activate snpkit

python snpkit/snpkit.py -h

Quick Start

Suppose you want to call variants for more than a few samples against the reference genome KPNIH1 and run the analysis in parallel with the SLURM HPC scheduler. Start by running all variant calling steps:


python snpkit/snpkit.py \
-type PE \
-readsdir /Path-To-Your/test_readsdir/ \
-outdir /Path/test_output_core/ \
-analysis output_prefix \
-index KPNIH1 \
-steps All \
-cluster cluster \
-scheduler SLURM \
-clean

The variant calling results for each sample are placed in a separate folder created for that sample in the output directory. Once variant calling has finished, run the core steps on the same output directory:

python snpkit/snpkit.py \
-type PE \
-readsdir /Path-To-Your/test_readsdir/ \
-outdir /Path/test_output_core/ \
-analysis output_prefix \
-index reference.fasta \
-steps core_All \
-cluster cluster \
-gubbins yes \
-scheduler SLURM

This step gathers all the variant call results from the first step and generates SNP-Indel matrices, QC reports and core/non-core sequence alignments that can be used as input for downstream phylogenetic analysis with tools such as Gubbins, IQ-TREE and BEAST.
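
If the two invocations above need to be launched from a script, their argument lists can be assembled programmatically. The following is a minimal Python sketch and not part of snpkit: the helper name build_snpkit_command is hypothetical, the flags and values are copied from the commands above, and the paths are the same placeholders. It also assumes that the order of the flags is not significant.

```python
import shlex


def build_snpkit_command(steps, index, readsdir, outdir, analysis, extra=()):
    """Return the snpkit argument list for one stage (hypothetical helper)."""
    cmd = [
        "python", "snpkit/snpkit.py",
        "-type", "PE",
        "-readsdir", readsdir,
        "-outdir", outdir,
        "-analysis", analysis,
        "-index", index,
        "-steps", steps,
        "-cluster", "cluster",
        "-scheduler", "SLURM",
    ]
    # Stage-specific flags such as -clean or -gubbins yes go at the end.
    cmd.extend(extra)
    return cmd


# Placeholder paths and prefix, as in the commands above.
common = {
    "readsdir": "/Path-To-Your/test_readsdir/",
    "outdir": "/Path/test_output_core/",
    "analysis": "output_prefix",
}

variant_calling = build_snpkit_command("All", "KPNIH1", extra=["-clean"], **common)
core = build_snpkit_command(
    "core_All", "reference.fasta", extra=["-gubbins", "yes"], **common
)

# Print the commands for inspection; nothing is executed here.
for cmd in (variant_calling, core):
    print(shlex.join(cmd))
```

Each list can then be passed to subprocess.run(cmd, check=True) from the directory that contains the snpkit clone, with the snpkit conda environment active.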

Input

The pipeline requires three main inputs:

1. readsdir: Place your Illumina SE/PE reads in a folder and pass the path to this folder with the -readsdir argument. Apart from the standard MiSeq/HiSeq fastq naming convention (R1_001_final.fastq.gz), other acceptable fastq extensions are:


- R1.fastq.gz/_R1.fastq.gz
- 1_combine.fastq.gz
- 1_sequence.fastq.gz
- _forward.fastq.gz
- _1.fastq.gz/.1.fastq.gz

2. config: A high-level YAML format configuration file that lets you configure your system-wide runs and specify analysis parameters, paths to the installed tools, data and system-wide information.

3. index: A reference genome index name as specified in the config file.
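
Before starting a run, it can help to check which files in a reads folder match the forward-read naming conventions listed under readsdir. The sketch below is an illustration only and does not reproduce how snpkit itself matches file names: the suffixes are copied from that list, while the function names is_forward_read and forward_reads and the sample file names are hypothetical.

```python
# Suffixes copied from the accepted forward-read naming conventions
# listed under readsdir.
FORWARD_SUFFIXES = (
    "R1_001_final.fastq.gz",
    "R1.fastq.gz",  # also matches _R1.fastq.gz
    "1_combine.fastq.gz",
    "1_sequence.fastq.gz",
    "_forward.fastq.gz",
    "_1.fastq.gz",
    ".1.fastq.gz",
)


def is_forward_read(filename):
    """Return True if the file name ends with one of the listed suffixes."""
    return filename.endswith(FORWARD_SUFFIXES)


def forward_reads(filenames):
    """Return the sorted subset of file names that look like forward reads."""
    return sorted(name for name in filenames if is_forward_read(name))


# Hypothetical folder listing used only for demonstration.
example_files = [
    "s1_R1.fastq.gz",
    "s1_R2.fastq.gz",
    "s2_1.fastq.gz",
    "s2_2.fastq.gz",
    "notes.txt",
]
print(forward_reads(example_files))
```

In practice the list of names could come from os.listdir on the folder passed to -readsdir.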

For example, if you have set the reference genome path in the config file as shown below, the required command line argument would be -index KPNIH1.

[KPNIH1]
# path to the reference genome fasta file.
Ref_Path: /nfs/esnitkin/bin_group/variant_calling_bin/reference/KPNIH1/
# Name of reference genome fasta file.
Ref_Name: KPNIH1.fasta

Here, Ref_Name is the reference genome fasta file located in Ref_Path. Similarly, if you want to use a different version of the KPNIH1 reference genome, you can create a new section in your config file with a different index name.

[KPNIH1_new]
# path to the reference genome fasta file.
Ref_Path: /nfs/esnitkin/bin_group/variant_calling_bin/reference/KPNIH1_new/
# Name of reference genome fasta file.
Ref_Name: KPNIH1_new.fasta
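
To make the relationship between the index name, Ref_Path and Ref_Name concrete, the sketch below resolves an index name to a full fasta path. It assumes the section syntax shown above can be read with Python's standard configparser module; this is an assumption based on the syntax alone, not a statement about how snpkit parses its config. The function name reference_fasta is hypothetical.

```python
import configparser
import posixpath

# The two example sections from above, embedded as text for demonstration.
CONFIG_TEXT = """
[KPNIH1]
# path to the reference genome fasta file.
Ref_Path: /nfs/esnitkin/bin_group/variant_calling_bin/reference/KPNIH1/
# Name of reference genome fasta file.
Ref_Name: KPNIH1.fasta

[KPNIH1_new]
# path to the reference genome fasta file.
Ref_Path: /nfs/esnitkin/bin_group/variant_calling_bin/reference/KPNIH1_new/
# Name of reference genome fasta file.
Ref_Name: KPNIH1_new.fasta
"""


def reference_fasta(config_text, index):
    """Join Ref_Path and Ref_Name for the section named by `index`."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    if index not in parser:
        raise KeyError(f"no section [{index}] in config")
    section = parser[index]
    return posixpath.join(section["Ref_Path"], section["Ref_Name"])


print(reference_fasta(CONFIG_TEXT, "KPNIH1"))
print(reference_fasta(CONFIG_TEXT, "KPNIH1_new"))
```

Passing an index name that has no matching section raises a KeyError, which corresponds to giving -index a name that is not defined in the config file.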

The pipeline also requires PHASTER results for your reference genome in order to mask phage regions. To enable this, place the PHASTER result files in the reference genome folder.

For detailed information, please refer to the wiki page.