Configuration
Default configuration
The default configuration file is called config_default.yaml. It is
located in the amplimap basedir, which is usually the
amplimap.VERSION.egg/amplimap directory in Python’s site-packages
(where VERSION is the amplimap version number, e.g. 0.3).
You can run amplimap --basedir to get the path to the basedir.
Any settings in this file will be applied every time you run amplimap. This is particularly helpful for setting up correct paths for the reference files (genome build, aligner index, reference genome fasta).
Alternatively, you can save your default configuration as /etc/amplimap/VERSION/config.yaml
(where VERSION is the amplimap version number, e.g. 0.3) or provide a different
path in the AMPLIMAP_CONFIG environment variable.
Local configuration
To specify experiment-specific settings, you can place a file called config.yaml in your working
directory. Any setting that is specified in this local configuration
file will override the default configuration. This is useful for setting
analysis-specific parameters, such as the quality filters, UMI lengths,
etc.
To see the configuration that amplimap will use, based on your default and local
configuration files, run amplimap --print-config.
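As a minimal sketch, a local config.yaml only needs to contain the settings that should differ from the default configuration. The example below reuses settings described later on this page. The values are purely illustrative, and it assumes that paths for hg19 have already been set up in the default configuration file.

```yaml
# config.yaml in the working directory (illustrative values only)
general:
  # use hg19 for this experiment, overriding the default genome_name
  # (assumes an hg19 entry exists under paths: in the default configuration)
  genome_name: "hg19"
parse_reads:
  # 5 bp UMI at the start of each read
  umi_one: 5
  umi_two: 5
```

Running amplimap --print-config in that directory should then show these values in place of the defaults.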
Common configuration changes
Reference genome paths
amplimap requires a reference genome and associated indices (such as a FASTA index
or a bwa index) to run. For example, to specify paths for hg19 and hg38
of the human genome and set the default to hg38:
paths:
  hg38:
    bwa: "/PATH/TO/Homo_sapiens.GRCh38.dna.primary_assembly.fa"
    fasta: "/PATH/TO/Homo_sapiens.GRCh38.dna.primary_assembly.fa"
    annovar: '/PATH/TO/annovar/humandb'
  hg19:
    bwa: "/PATH/TO/Homo_sapiens.GRCh37.dna.primary_assembly.fa"
    fasta: "/PATH/TO/Homo_sapiens.GRCh37.dna.primary_assembly.fa"
    annovar: '/PATH/TO/annovar/humandb'
general:
  genome_name: "hg38"
We recommend specifying these paths in the default configuration file (see Default configuration).
Once you have set up these paths, you can choose the genome to use for each
experiment by specifying the genome_name in your local configuration file:
general:
  genome_name: "hg38"
However, you can also specify the paths and genome in the same file.
Add a new section with a name of your choice under paths: and then
set your genome_name to the same name. For example:
paths:
  mm10:
    bwa: "mm10/bwa"
    fasta: "mm10/genome.fa"
general:
  genome_name: "mm10"
Note that when you are doing variant annotation with Annovar, your
genome_name has to match the genome build name that Annovar uses (for example hg19 or hg38).
Setting up Annovar
For licensing reasons, Annovar needs to be downloaded and installed manually. Please see the Annovar website for details.
Once you have Annovar installed, we recommend that you download the following indices:
- refGene, esp6500siv2_all, 1000g2014oct_all, avsnp147, cosmic82, dbnsfp33a, clinvar_20150629, dbscsnv11, exac03, gnomad_genome, gnomad_exome
Finally, you need to specify the path to your Annovar index directory in your default configuration file (see Reference genome paths). If you downloaded a different set of indices, you also need to adjust the annovar protocols and operations parameters.
Running with UMIs (e.g. for smMIPs)
If one or both of your reads start with UMIs, you have to specify their lengths
in the configuration file using the umi_one: and umi_two: settings
under parse_reads:.
For example, to process an experiment with 5 bp UMIs on each read, your
config.yaml might look like this:
parse_reads:
  umi_one: 5
  umi_two: 5
Note that it is very important to specify the correct lengths here, since these UMIs will be trimmed off before amplimap tries to match the start of the read to the expected primer sequence. If the length is incorrect, the primer sequences will never match the reads and all of the reads will be discarded.
Primer trimming
By default, primer (extension/ligation) arms are removed from the
beginnings and, if applicable, ends of reads before alignment. This is
particularly important when using overlapping (tiled) probes, since the
primers would otherwise skew the observed allele frequencies or even
prevent a variant from being called in the first place. They can also
lead to misalignment of off-target sequences that were inadvertently
captured, introducing false positives. However, removing them also means
that only the targeted region between the arms will be aligned to the
genome. This can be problematic if its sequence is not unique, leading
to off-target alignment and reads with mapping quality 0. To turn off
primer trimming, specify trim_primers: false under general:.
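For illustration, a local config.yaml that turns off primer trimming for a single experiment could contain the following. Only the general: section and the trim_primers: key are taken from the text above; all other settings are assumed to stay at their defaults.

```yaml
general:
  # keep the primer arms on the reads (trimming is enabled by default)
  trim_primers: false
```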
Quality trimming of reads
Reads can optionally be trimmed at their beginnings/ends to remove
low-quality bases. This may be helpful to remove potentially noisy base
calls during variant calling, although most variant callers should be
able to account for this independently. To enable this, set a quality
trimming threshold, which is the highest probability of an erroneous
call that you would like to allow. The default is false, which turns
quality trimming off. A suggested value to enable quality trimming is
0.01 (1%): quality_trim_threshold: 0.01.
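For readers who think in Phred scores rather than error probabilities, the standard relationship between the two may help when choosing a threshold. This is the general Phred definition, with p as the error probability and Q as the quality score. It is not a description of how amplimap applies the threshold internally.

```latex
Q = -10 \log_{10}(p)
\quad\Longrightarrow\quad
p = 0.01 \;\Leftrightarrow\; Q = -10 \log_{10}(0.01) = 20
```

Under that definition, a threshold of 0.01 corresponds to Phred quality 20, and a stricter threshold of 0.001 would correspond to Phred quality 30.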
Minimum mapping quality (for pileups only)
By default, no mapping quality filter is applied for the pileup and
alignment stats tables. If you think that filtering out low-quality
mappings may improve your results, you can change this by setting a
minimum mapping quality in the pileup: section using something like
min_mapq: 20. Note that this setting has no effect on coverage and
standard variant calling!
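As a short sketch, the corresponding entry in a config.yaml could look like this. The value 20 is the example from the text above, not a recommendation for every data set.

```yaml
pileup:
  # only count alignments with mapping quality >= 20 in the pileup
  # and alignment stats tables
  min_mapq: 20
```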
Support for modules
amplimap has some basic support for loading and unloading optional software packages
through the modules system. To use this feature, specify the modules that should be loaded
for each of the software packages listed under modules:.
If you leave a setting empty, no module will be loaded and the software will have to be
available without loading a module.
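A rough sketch of what such a section might look like is given below. The key names and module names are hypothetical placeholders: the actual list of software packages is the one found under modules: in config_default.yaml, and module names depend on how the modules system is set up on your machine. The sketch also assumes that an empty string counts as an empty setting.

```yaml
modules:
  # hypothetical entry: load this module before running the tool
  bwa: "bwa/0.7.12"
  # hypothetical entry left empty: no module is loaded, so the tool
  # must already be available without loading one
  annovar: ""
```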