Input and the working directory

For each experiment (for example each sequencing run), you need to create a working directory inside which you run amplimap. All the input and output files will be kept in this directory.

Input overview

Required input

  • reads_in/ (or unmapped_bams_in/ or mapped_bams_in/): the sequencing reads
  • probes.csv (or picked_mips.txt): primer sequences and expected target locations
  • targets.csv (or targets.bed): target regions

Optional input

  • snps.txt: SNPs for the allele counting pileup
  • sample_info.csv: additional sample annotation for the coverage and variant tables

These files are described in detail below.

Input files

reads_in/

This subdirectory should contain pairs of gzipped FASTQ files from paired-end sequencing. Filenames should follow standard Illumina naming conventions (ending in _L001_R1_001.fastq.gz and _L001_R2_001.fastq.gz, where L001 is the lane). Data from multiple lanes (e.g. L001 and L002) will be merged, as long as all samples have a pair of files for each lane.
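
For example, a sample called Sample1 sequenced across two lanes would provide four files (hypothetical names):

Sample1_S1_L001_R1_001.fastq.gz
Sample1_S1_L001_R2_001.fastq.gz
Sample1_S1_L002_R1_001.fastq.gz
Sample1_S1_L002_R2_001.fastq.gz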

Unmapped BAM files (unmapped_bams_in/)

Instead of the reads_in/ directory, you can also provide your sequencing data as unmapped BAM files inside a directory called unmapped_bams_in/. This is useful for running amplimap on data from capture-based protocols.

If your reads contain UMIs, these should already have been trimmed off and provided in the BAM file through a BAM tag (e.g. RX). You will need to specify the name of this tag in the config file.
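
For example, a hypothetical snippet from the config file (the key name shown here is illustrative - check the configuration documentation for the exact setting):

parse_reads:
  umi_tag_name: 'RX'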

See also Running on capture-based data for more details and examples.

When running amplimap on capture-based data, probes can no longer be identified based on their primer arms. Thus, some of the output files cannot be generated.

If your BAM files have already been mapped (aligned) you can also put them into a directory called mapped_bams_in/ instead, which will skip the alignment step inside amplimap.

probes.csv

This file contains information about the sequences of the primers and their expected location on the reference genome. It can also be generated directly from the output table created by MIPGEN - see below!

The columns are:

  1. id: a unique name for the probe
  2. first_primer_5to3: sequence of the forward primer, in 5’ to 3’ orientation (needs to match start of read 1)
  3. second_primer_5to3: sequence of the reverse primer, in 5’ to 3’ orientation (needs to match start of read 2)
  4. chr: chromosome (needs to match the naming scheme used for the reference genome, e.g. 1 for Ensembl or chr1 for UCSC)
  5. target_start: start of the target region (after the primer, 1-based coordinate system)
  6. target_end: end of the target region
  7. strand: + or -

The first row of the file should be the header listing these column names. Any additional columns are ignored.
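
A minimal example with two made-up probes (primer sequences and coordinates are purely illustrative):

id,first_primer_5to3,second_primer_5to3,chr,target_start,target_end,strand
GENE1_Ex1_1,ACGTGGTCTAGGATTC,TGCACCTGAAGGTCAT,chr1,1000101,1000200,+
GENE1_Ex1_2,GGATTCACGTTGCATG,CATGGTAACCTGATCC,chr1,1000181,1000290,-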

No two primers should be so similar to each other that a read could match both of them. Otherwise, the read will be ambiguous and may end up being counted multiple times, once for each probe. This would skew any downstream analysis.

If you have two probes for the same region to account for a SNP in the primer sequence, you need to provide these as a single entry. To make that merged entry match both primers, replace all ambiguous (SNP) nucleotides in the primer sequences with a dot (.). This way, any nucleotide will be allowed in this location and reads from either version of the probe will be counted together.
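
For example, if the two versions of a forward primer differ at a single SNP position (made-up sequences), the merged entry would replace that base with a dot:

ACGTGGTCTAGGATTC (version a)
ACGTGGTCAAGGATTC (version b)
ACGTGGTC.AGGATTC (merged entry)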

MIP names may only contain alphanumeric characters (A-Z, 0-9) and the characters _:+-. Avoid using multiple colons in a row (e.g. ::), since this sequence is used as a field separator internally.

The file needs to be in plain CSV format with UNIX/Windows (not Mac) style line endings.

MIPGEN probe design table (picked_mips.txt)

Instead of the standard probes.csv file you can also provide a MIPGEN probe design table. Copy the xxxxx.picked_mips.txt file to the working directory and rename it to picked_mips.txt. When you run amplimap, this file will automatically be converted into a probes.csv file in the right format.
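
For example, assuming your design file is at /path/to/mydesign.picked_mips.txt (hypothetical path):

cp /path/to/mydesign.picked_mips.txt picked_mips.txt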

If MIPGEN generated two versions of the same probe to account for a SNP, amplimap will detect this based on the identical value in the mip_name column (optionally followed by _SNP_a and _SNP_b) and merge them into a single entry, replacing any differences between the primer sequences with a dot (see above). Probes with the same name but a different location, or probes whose primers differ by more than 10 characters, are not expected and will result in an error.

targets.csv

List of target regions (e.g. exons, not the MIPs themselves) in CSV format. This file should contain the following columns:

  1. chr (needs to match the naming scheme used for the reference genome, e.g. 1 for Ensembl or chr1 for UCSC)
  2. start (start position, 1-based coordinate system)
  3. end (end position)
  4. [optional] id (name of the target)

The first row of the file should be the header listing these column names. Any additional columns are ignored.
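
A minimal example (made-up coordinates, matching the probes.csv example above):

chr,start,end,id
chr1,1000101,1000290,GENE1-Ex1
chr3,2000500,2000700,GENE3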

Variants will only be called inside these target regions! If any of the target regions overlap, they will be merged for variant calling, but will cause an error when trying to calculate pileups.

targets.bed

You can also provide this data in BED format. In that case, the file should be called targets.bed and use the standard BED columns (chromosome, 0-based start position, end position, id). The score and strand columns may be included, but do not have any effect on the pipeline. Note that BED files do not contain column headers!
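
The same made-up regions as above in BED format; note that the start coordinates are shifted down by one, because BED uses a 0-based coordinate system:

chr1	1000100	1000290	GENE1-Ex1
chr3	2000499	2000700	GENE3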

snps.txt (for allele counting pileup)

If you have certain SNPs that you want to generate pileups for, you can provide a list in tab-separated text format here. The columns are:

  1. chr (needs to match the naming scheme used for the reference genome, e.g. 1 for Ensembl or chr1 for UCSC)
  2. pos (1-based coordinate system)
  3. id
  4. snp_ref
  5. snp_alt (only a single alt allele is supported)

This will generate the pileups_snps directory with reference/alternate allele counts for each SNP. The filter column in the pileup tables will reflect whether the observed alleles matched the SNP alleles, or whether additional alleles were found.

Note: This file does not contain column headers!
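
A minimal example with two made-up SNPs (columns separated by tabs, no header row):

chr1	1000150	rs0000001	A	G
chr3	2000600	rs0000002	C	T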

sample_info.csv (for variant calling)

This file can be provided to add additional sample information columns to the coverage and variant tables. If provided, it always needs to start with these two columns:

  1. Sample: the sample id, including the _S123 part
    • needs to match the sample identifiers of the input fastq files
    • example: Barcode-001_S1
  2. Targets: a semicolon-separated list of target ids
    • needs to match the ids provided in targets.csv
    • example: GENE1-Ex1.1;GENE1-Ex1.2;GENE1-Ex2;GENE3

These should then be followed by one or more annotation columns, which can contain information like the id of the corresponding individual or other information about the samples. All of these columns will be copied into the coverage and variant tables.

A single sample id (= barcode) can have multiple rows with different annotation columns, as long as none of the targets are the same. In other words, any combination of sample/target id may only occur once.
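
A minimal example with one annotation column (made-up ids), in which the same barcode is shared by two individuals with non-overlapping target sets:

Sample,Targets,Individual
Barcode-001_S1,GENE1-Ex1.1;GENE1-Ex1.2,Person-A
Barcode-001_S1,GENE3,Person-B
Barcode-002_S2,GENE1-Ex1.1;GENE1-Ex1.2;GENE3,Person-C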

If there are two overlapping target regions and a variant call is made in the overlapping part, it can get assigned to either of them. To avoid errors due to this, overlapping target regions must always be listed in pairs and never be split up. For example, if the targets GENE1-Ex1a and GENE1-Ex1b overlap, you should never have a row where you only list GENE1-Ex1a or only list GENE1-Ex1b. They should always be listed together (GENE1-Ex1a;GENE1-Ex1b) or not at all.

This file needs to be in plain CSV format with UNIX/Windows (not Mac) style line endings.

Running amplimap

The pipeline is based on Snakemake, which uses predefined rules to figure out what it needs to do to generate a certain output file.

To run the most basic version of the pipeline, just enter amplimap in your terminal:

amplimap

By default, this will only start a so-called dry run. This will not actually run any of the code yet. However, it will make sure that the expected input files are present and tell you which jobs it would be running.

If the output of this dry run looks as expected you can start the actual pipeline by adding the --run parameter:

amplimap --run

This will go through the first few steps of the pipeline but will not run the more advanced analysis-specific parts.

To run these additional analyses, you need to add so-called target rules to the amplimap command line. Some of these are listed below.

bams (read parsing and alignment)

This rule performs read parsing and read alignment, creating the bams/ and reads_parsed/ output directories.

amplimap bams

variants and variants_umi (germline variant calling/annotation)

These rules perform germline variant calling and annotation on the alignments generated by the bams rule. If that rule has not been called yet, it will be called implicitly.

To call variants from raw reads, creating the variants_raw/ directory:

amplimap variants

To call variants from UMI-deduplicated (but not consensus) reads, creating the variants_umi/ directory:

amplimap variants_umi

Please note that this command simply selects a random representative read per UMI group. It does not perform any consensus calling like the pileups command would.

coverages (target coverage tables)

This rule calculates the target region coverage of the alignments generated by the bams rule. If that rule has not been called yet, it will be called implicitly. Its output will be available in the bams/coverages/ directory.

amplimap coverages

pileups (allele counts and allele frequencies)

This rule calculates per-basepair allele counts and allele frequencies in the target regions, using the alignments generated by the bams rule. If that rule has not been called yet, it will be called implicitly.

Its output will be available in the pileups/ directory (if a targets.csv file has been provided) and/or the pileups_snps/ directory (if a snps.txt file has been provided).

amplimap pileups

To only perform a pileup of the SNPs, even if a targets file is present, run:

amplimap pileups_snps

variants_low_frequency (low-frequency/somatic variant calling)

[EXPERIMENTAL!] To call low-frequency variants using Mutect2, use this command:

amplimap variants_low_frequency

This function is still experimental and has not been thoroughly tested. Its output will be available in the variants_low_frequency/ directory.

Multiple analyses

You can also group together multiple target rules to run several analyses at once:

amplimap pileups variants coverages

Running on a cluster

You can specify the additional parameter --cluster=qsub to run jobs in parallel on a SGE cluster:

amplimap --cluster=qsub

amplimap --cluster=qsub variants

amplimap --cluster=qsub pileups

This can speed up the processing by an order of magnitude, as commands will be run in parallel instead of sequentially. However, this process is a bit more complex and may lead to unexpected errors. If you get an error message, try running the standard command without the --cluster parameter instead.

You can set the number of jobs to submit by setting the --njobs parameter:

amplimap --cluster=qsub --njobs=5

To use other cluster environments (such as LSF), add an entry with the submission command to the clusters: section of the config file.
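
The exact schema is defined in the default configuration file; as a hypothetical sketch (key names are illustrative), an LSF entry might look like this:

clusters:
  lsf:
    command: 'bsub -o cluster_log -e cluster_log'

You could then select it with amplimap --cluster=lsf.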

Cluster log files

When amplimap submits jobs to the cluster, it can no longer print their output to the screen. Instead, it will create a separate log file for each job containing all of its output. If one of your jobs fails because of an error, you may need to look for the log file that contains the full error message.

The naming of the log files will depend on the cluster architecture used, but they should usually be called “amplimap.RULENAME.JOBID.sh.oCLUSTERID”, where RULENAME is the name of the amplimap rule that failed, JOBID is its ID, and CLUSTERID is an additional ID assigned by the cluster. By default, these log files will be placed in a folder called cluster_log.

For example, if you see an error message like this:

(...)
Error in rule tool_version:
    jobid: 9
    (...)

Then you can find the error message in the log file that starts with cluster_log/amplimap.tool_version.9.sh.o followed by a number.
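
For example, to display it (the exact file name will differ, since the cluster assigns the final ID):

cat cluster_log/amplimap.tool_version.9.sh.o*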

Output: the analysis directory

All analysis results will be written to the subdirectory analysis inside the working directory. These include:

  • reads_parsed/stats_samples.csv: Sample statistics - number of read pairs matching expected primers per sample, etc.
  • reads_parsed/stats_reads.csv: Read statistics - number of reads per probe per sample, number of UMIs per probe per sample, etc.
  • bams/: BAM files with aligned reads
  • stats_alignment/stats_alignment.csv: Alignment statistics for each sample - number of read pairs and unique UMI groups aligning in the expected location, etc.
  • reads_parsed/: unknown primer arm files - sequences from the start of reads that didn’t match any of the expected primer sequences. Will only include data for the first 10,000 read pairs with unknown primer arms.
  • versions/ and versions.yaml: a set of files providing the version numbers of various tools used for the analysis.

In addition, the analysis directory will contain config_used.yaml, which is a copy of the configuration that was used at the time the pipeline was first run. Note that this will not be updated if you run the pipeline a second time, unless you delete the old copy first.

Target-specific output

The analysis directory will contain further subdirectories for the different analyses that were performed by the pipeline:

Germline variant calling and annotation analysis: coverages and variants

  • bams_umi_dedup/: only generated if reads contained UMIs – deduplicated BAM files, with one read pair per UMI group chosen at random (no consensus calling, no minimum coverage per UMI)
  • bams/coverages/: min/average/zero coverage of target regions based on raw reads (ignoring mapping quality)
  • bams_umi_dedup/coverages/: min/average/zero coverage of target regions after UMI deduplication (ignoring mapping quality)
  • variants_raw/: annotated variant calls (ignoring UMIs, if any)
    • full summary table: variants_raw/variants_summary.csv contains a summary of all variants in all samples, with deleteriousness scores, etc.
    • filtered summary table: variants_raw/variants_summary_filtered.csv contains all variants from the summary table that pass variant quality filters and have a coverage of at least 10
  • variants_umi/: only generated if reads contained UMIs – annotated variant calls based on UMI deduplicated reads (otherwise as above)

Low-frequency variation analysis: pileups

  • pileups/: target region pileup tables
    • per-basepair pileups for all positions and all samples: pileups/pileups_long.csv
    • coverage over target regions: pileups/target_coverage.csv
  • pileups_snps/: SNP pileup tables (optional, requires snps.txt)
    • per-SNP pileups: pileups_snps/target_snps_pileups_long.csv

Additional output

In addition to the analysis directory, these folders may be created:

  • cluster_log/: directory with log files for each job submitted to the cluster (contain error messages if cluster submission fails)