Configuration

Default configuration

The default configuration file is called config_default.yaml. It is located in the amplimap basedir, which is usually the amplimap.VERSION.egg/amplimap directory located in Python’s site-packages (where VERSION is the amplimap version number, e.g. 0.4.20). You can run amplimap --basedir to get the path to the basedir.

Any settings in this file will be applied every time you run amplimap. This is particularly helpful for setting up correct paths for the reference files (genome build, aligner index, reference genome fasta).

You can also save this file under /etc/amplimap/VERSION/config.yaml (where VERSION is the amplimap version number, e.g. 0.4.20) or provide a different path in the AMPLIMAP_CONFIG environment variable.

Local configuration

To specify experiment-specific settings, you can place a file called config.yaml in your working directory. Any setting that is specified in this local configuration file will override the default configuration. This is useful for setting analysis-specific parameters, such as the quality filters, UMI lengths, etc.
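
As a rough sketch, a local config.yaml that overrides only a couple of defaults might look like the following. The values are illustrative, and it assumes that a genome called hg19 has already been defined under paths: in the default configuration. Any setting not listed here keeps its value from the default configuration.

```yaml
# config.yaml in the working directory (illustrative values)
general:
  # use a genome that is already defined under paths: in the default configuration
  genome_name: "hg19"
parse_reads:
  # allow one more primer mismatch per mate than the default of 2
  max_mismatches: 3
```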

To see the configuration that amplimap will use, based on your default and local configuration files, run amplimap --print-config.

Common configuration changes

Selecting the aligner and variant caller

amplimap can work with different aligners and variant callers.

Supported aligners, specified through align: aligner:, are:

  • BWA (bwa)
  • Bowtie2 (bowtie2)
  • STAR (star)

Supported variant callers, specified through variants: caller:, are:

  • Platypus (platypus)
  • GATK 4 (gatk)
  • weCall (wecall, experimental)
  • Octopus (octopus, experimental)

For example:

align:
  aligner: "bowtie2"
variants:
  caller: "octopus"

Additional aligners and variant callers can also be added by specifying the relevant commands under tools:. See the comments in the config file for details.
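
For illustration, a custom variant caller could be registered roughly as follows. The tool name my_caller and all of its command-line flags are hypothetical. The placeholders (%s, {input.targets:q}, {input[0]:q}, {output[0]:q}) are copied from the commented example in the default configuration. Selecting the tool by the name of its tools: entry is an assumption based on how octopus is set up there.

```yaml
tools:
  # "my_caller" and its flags are hypothetical, shown only to illustrate the layout
  my_caller:
    # must print the tool version and return exit code 0
    version_command: 'my_caller --version'
    # placeholders follow the commented example in config_default.yaml
    call_command: 'my_caller call --reference-fasta=%s --targets={input.targets:q} --input-bam={input[0]:q} --output-vcf={output[0]:q}'
variants:
  # assumption: the caller is selected by the name of its tools: entry
  caller: "my_caller"
```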

Reference genome paths

amplimap requires a reference genome and associated indices (such as a FASTA index or a bwa index) to run. It is recommended that you specify these paths in the Default configuration file. For example, to specify paths for the hg19 and hg38 builds of the human genome and set the default to hg38:

paths:
  hg38:
    bwa: "/PATH/TO/Homo_sapiens.GRCh38.dna.primary_assembly.fa"
    fasta: "/PATH/TO/Homo_sapiens.GRCh38.dna.primary_assembly.fa"
    annovar: '/PATH/TO/annovar/humandb'
  hg19:
    bwa: "/PATH/TO/Homo_sapiens.GRCh37.dna.primary_assembly.fa"
    fasta: "/PATH/TO/Homo_sapiens.GRCh37.dna.primary_assembly.fa"
    annovar: '/PATH/TO/annovar/humandb'
general:
  genome_name: "hg38"

For suggestions on where to obtain these references and how to create indices see 4. Set up your reference genome and indices.

Once you have set up these paths, you can then choose the genome to use for each experiment by specifying the genome_name in your local configuration file:

general:
  genome_name: "hg38"

However, you can also specify the paths and genome in the same file. Add a new section with a name of your choice under paths: and then set your genome_name to the same name. For example:

paths:
  mm10:
    bwa: "mm10/bwa"
    fasta: "mm10/genome.fa"
general:
  genome_name: "mm10"

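Note that when you are doing variant annotation with Annovar, your genome_name has to match the name that Annovar uses, unless you provide Annovar's name separately through the annovar_name setting of that genome under paths:.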
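
If you prefer a genome name that differs from the one Annovar uses, the annovar_name setting listed in the default configuration can carry Annovar's name instead. A minimal sketch, in which the name GRCh38 is an arbitrary choice made for this example and the paths are placeholders:

```yaml
paths:
  # "GRCh38" is an arbitrary name chosen for this example
  GRCh38:
    bwa: "/PATH/TO/Homo_sapiens.GRCh38.dna.primary_assembly.fa"
    fasta: "/PATH/TO/Homo_sapiens.GRCh38.dna.primary_assembly.fa"
    annovar: '/PATH/TO/annovar/humandb'
    # reference genome name used by Annovar for this build
    annovar_name: "hg38"
general:
  genome_name: "GRCh38"
```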

Setting up Annovar

Annovar is the software amplimap uses to annotate variant calls. For licensing reasons it needs to be downloaded and installed manually. Please see the Annovar website for details.

Once you have Annovar installed, we recommend that you download the following indices:

  • refGene, esp6500siv2_all, 1000g2014oct_all, avsnp147, cosmic82, dbnsfp33a, clinvar_20150629, dbscsnv11, exac03, gnomad_genome, gnomad_exome

Finally, you need to specify the path to your Annovar index directory in your Default configuration file (see Reference genome paths). If you downloaded a different set of indices, you also need to adjust the protocols and operations settings under annotate: annovar: to match.
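
As an example of such an adjustment, a setup in which only the refGene and gnomad_exome indices were downloaded (a hypothetical selection) might use something like the following. The operation codes are taken from the default list, where refGene uses g and all other protocols use f.

```yaml
annotate:
  annovar:
    # only these two indices were downloaded in this hypothetical setup;
    # refGene is required, all others are optional
    protocols: 'refGene,gnomad_exome'
    # one operation per protocol, in the same order as the protocols
    operations: 'g,f'
```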

Running with UMIs (e.g. for smMIPs)

If one or both of your reads start with UMIs, you have to specify their lengths in the configuration file using the umi_one: and umi_two: settings under parse_reads:.

For example, to process an experiment with 5bp UMIs on each read, your config.yaml might look like this:

parse_reads:
  umi_one: 5
  umi_two: 5

Note that it is very important to specify the correct lengths here, since these UMIs will be trimmed off before amplimap tries to match the start of the read to the expected primer sequence. If the length is incorrect, the primer sequences will never match the reads and all of the reads will be discarded.

Primer trimming

By default, primer (extension/ligation) arms are removed from the beginnings and, if applicable, ends of reads before alignment. This is particularly important when using overlapping (tiled) probes, since the primers would otherwise skew the observed allele frequencies or even prevent a variant from being called in the first place. They can also lead to misalignment of off-target sequences that were inadvertently captured, introducing false positives. However, removing them also means that only the targeted region between the arms will be aligned to the genome. This can be problematic if its sequence is not unique, leading to off-target alignment and reads with mapping quality 0. To turn off primer trimming, specify trim_primers: false under parse_reads:.
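
In a local config.yaml, turning primer trimming off would look something like this:

```yaml
parse_reads:
  # keep primer arms on the reads (the default is true, which trims them)
  trim_primers: false
```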

Quality trimming of reads

Reads can optionally be trimmed at their beginnings and ends to remove low-quality bases. This may help remove potentially noisy base calls before variant calling, although most variant callers should be able to account for this independently. To enable this, set quality_trim_threshold under parse_reads: to the highest probability of an erroneous call that you would like to allow. The default is false, which turns quality trimming off. A suggested value to enable quality trimming is 0.01 (1%): quality_trim_threshold: 0.01.
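
A sketch of how this might look in context, together with the two related settings from the default configuration (shown here with their default values):

```yaml
parse_reads:
  # highest allowed probability of an erroneous base call (0.01 = 1%, i.e. Phred score 20)
  quality_trim_threshold: 0.01
  # Phred encoding of the quality scores in the input files (default)
  quality_trim_phred_base: 33
  # minimum read length required after primer and quality trimming (default)
  trim_min_length: 20
```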

Minimum mapping quality (for pileups only)

By default, no mapping quality filter is applied for the pileup and alignment stats tables. If you think that filtering out low-quality mappings may improve your results, you can change this by setting a minimum mapping quality in the pileup: section using something like min_mapq: 20. Note that this setting has no effect on coverage and standard variant calling!
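
For instance, a local config.yaml that applies this filter could contain:

```yaml
pileup:
  # ignore alignments with mapping quality below 20 in pileups and alignment stats
  # (the default of 0 applies no filter)
  min_mapq: 20
```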

Support for modules

amplimap has some basic support for loading and unloading optional software packages through the modules system. To use this feature, specify the modules that should be loaded for each of the software packages listed under modules:. If you leave a setting empty, no module will be loaded and the software will have to be available without loading a module.
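
A short sketch of what this might look like. The bwa module name comes from the comment in the default configuration, while the samtools module name is hypothetical and depends on what your modules system provides.

```yaml
modules:
  # load this module before running bwa
  bwa: 'bwa/0.7.12'
  # hypothetical module name, adjust to your system
  samtools: 'samtools/1.9'
  # left empty: gatk has to be available without loading a module
  gatk: ''
```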

All configuration options

A commented list of all configuration options supported by amplimap and their default values is available in config_default.yaml:

paths:
  # first reference genome name - each reference genome must have a unique name, eg. hg38, hg19, mm10, ...
  hg38:
    # path to bwa index prefix, if bwa is used as aligner
    bwa: ""
    # path to bowtie2 index prefix, if bowtie2 is used as aligner
    bowtie2: ""
    # path to STAR genome directory, if STAR is used as aligner
    star: ""
    # path to reference genome FASTA file (indexed using samtools faidx)
    fasta: ""
    # path to directory with annovar indices
    annovar: ""
    # reference genome name used by annovar (leave empty if same as name above)
    annovar_name: ""

  # second reference genome
  mm10:
    bwa: ""

# general setup
general:
  # default reference genome (must match name in `paths`)
  genome_name: "hg38"
  # whether to ignore UMIs during pileups and alignment stats (if any)
  ignore_umis: false
  # minimum number of reads supporting the consensus call in a UMI group
  umi_min_consensus_count: 1
  # minimum percentage of reads supporting the consensus call in a UMI group
  umi_min_consensus_percentage: 51
  # when using FASTQs as input: number of lanes (should be detected automatically if set to 0)
  lanes: 0
  # when using unmapped BAMs as input: tag in which the UMI has been saved
  umi_tag_name: ""
  # beta: optional bed file specifying bases that should be masked with Ns.
  # if specified, amplimap will try to create a masked copy of the reference genome
  # and build custom bwa/bowtie2/star indices for it
  mask_bed: ""
  # set to true if probe data is not available. this means you do not need to provide
  # a probes.csv file, but it disables trimming of primers, stats_reads etc.
  use_raw_reads: false

# settings for FASTQ read parsing
parse_reads:
  # maximum number of mismatches between expected primer arm and beginning of read (per mate)
  max_mismatches: 2
  # length of UMI on read 1
  umi_one: 0
  # length of UMI on read 2
  umi_two: 0
  # abort pipeline if fewer than this percentage of reads have a valid pair of primer arms
  min_percentage_good: 0
  # whether to trim off primers from beginning of reads
  trim_primers: true
  # quality trimming threshold as fraction between 0 and 1 (false = no trimming)
  quality_trim_threshold: false
  # quality phred score encoding (usually phred-33 or phred-64)
  quality_trim_phred_base: 33
  # minimum read length required after primer and quality trimming
  trim_min_length: 20

# alignment settings
align:
  # default aligner (bwa/bowtie2/star)
  aligner: "bwa"
  # settings for bowtie2
  bowtie2:
    # number of alignments to report (increase to allow multi-mapping reads)
    report_n: 1

# pileup settings
pileup:
  # minimum mapping quality
  min_mapq: 0
  # minimum base quality - amplimap will provide both total counts and counts with baseq>X in output files
  min_baseq: 30
  # maximum depth of reads to process at each base (set to false to leave practically unlimited, currently 1e7)
  subsample_reads: false
  # only consider base calls from reads that fall within the expected coordinates of their probe
  # this will remove erroneous calls due to untrimmed primer arms from overlapping probes
  validate_probe_targets: false
  # filter reads with softclipped bases (but still consider mate)
  filter_softclipped: false
  # only group reads together if they also have the same mate start positions (in addition to their UMI matching)
  group_with_mate_positions: false

# variant calling settings
variants:
  # default variant caller (platypus/gatk/wecall/octopus)
  caller: "gatk"
  # beta: variant caller to use for low frequency variant calling (only mutect2 currently supported)
  caller_low_frequency: "mutect2"
  # caller-specific parameters to use
  gatk:
    parameters: "--disable-read-filter MappingQualityAvailableReadFilter --disable-read-filter WellformedReadFilter"
  wecall:
    parameters: ""
  mutect2:
    parameters: ""
  platypus:
    parameters: "--filterDuplicates=0 --filterReadsWithDistantMates=0 --minFlank=0 --trimOverlapping=0 --sbThreshold=0 --abThreshold=0"

# variant annotation settings
annotate:
  # currently only annovar is supported
  annovar:
    # annovar protocols to use. note that these must have been downloaded to the index directory specified above
    # the "refGene" database is required for variant annotation, all others are optional
    protocols: 'refGene,esp6500siv2_all,1000g2014oct_all,avsnp147,cosmic82,dbnsfp33a,clinvar_20150629,dbscsnv11,exac03,gnomad_genome,gnomad_exome'
    # operations corresponding to the protocols. must be the same length as the protocols
    operations: 'g,f,f,f,f,f,f,f,f,f,f'

# cluster submission commands. you can specify multiple cluster commands here, each with their own name.
# if you specify the command as 'command_sync' it should pause until the job is completed,
# if you specify it as 'command_nosync' it does not need to.
# these will be used as the value of the --cluster or --cluster-sync parameters to Snakemake
# logfiles should be placed into a directory called '{workflow.workdir_init}/cluster_log' which will
# be created automatically on execution
clusters:
  qsub:
    command_sync: 'qsub -o {workflow.workdir_init}/cluster_log/ -sync yes -j y -m n'
  bsub:
    command_nosync: 'bsub -oo {workflow.workdir_init}/cluster_log/%J.log'
  slurm:
    command_nosync: 'sbatch -o {workflow.workdir_init}/cluster_log/%j_%x.log'

# if you are using the 'modules' system to provide additional software, specify the required module names here.
# eg. if the bwa aligner is provided in the module bwa/0.7.12, set bwa: 'bwa/0.7.12'.
# if a module is specified here, amplimap will load it before executing the corresponding command
modules:
  bedtools: ''
  samtools: ''
  bcftools: ''
  bwa: ''
  bowtie2: ''
  star: ''
  platypus: ''
  gatk: ''
  wecall: ''
  annovar: ''
  picard: ''
  octopus: ''

# options for external tools used by amplimap
tools:
  picard:
    # prefix to use when executing a picard tool.
    # for example, if your prefix is 'picard' then amplimap will run
    # `picard SamToFastq` to execute the `SamToFastq` tool
    prefix: 'picard'

  octopus:
    version_command: 'octopus --version'
    call_command: >-
      octopus --reference=%s
      --regions-file={input.targets:q}
      --reads={input[0]:q}
      --output={output[0]:q}

  # you can specify additional tools here:
  example_tool:
    # each tool needs a version_command that returns the version of the tool with exit code 0.
    version_command: 'my_tool --version'
    # specify one of these to make this tool available as aligner/caller, using the parameters as shown here:
    # align_command: 'my_tool align --reference-fasta=%s --threads={threads} --read1={input[0]:q} --read2={input[1]:q} --output={output[0]:q} --sample={wildcards.sample_full:q}'
    # call_command: 'my_tool call --reference-fasta=%s --targets={input.targets:q} --input-bam={input[0]:q} --output-vcf={output[0]:q}'