Code documentation

Overview

When amplimap is launched, the function amplimap.run.main() is called. This function will perform various pre-processing steps (see below) and then launch Snakemake. Snakemake, using the Snakefile provided in the amplimap package, will then determine the jobs that need to be executed.

Testing amplimap

To test amplimap, install the pytest package through pip and then run this command in the amplimap source directory:

pytest .

amplimap components

Snakefile

amplimap’s Snakefile is used by the amplimap command-line executable to determine the shell commands and library functions to execute to generate a given file. Each “rule” in this file either executes a shell command (e.g. running bwa), executes a small set of Python commands or calls a function from the amplimap Python package.

Running snakemake directly

In theory, this file can also be used directly with the snakemake executable. However, this will omit various steps performed by the amplimap executable, including checking and loading the configuration files. Thus, a configuration file needs to be provided with the --configfile parameter that contains all settings included in config_default.yaml.

In addition, you may want to add the following setting:

general:
  amplimap_parent_dir: /path/to/parent/of/amplimap/package/directory

This directory will be inserted into the Python path when running Snakemake to make sure import amplimap.xxx imports the correct files. If it is not provided Snakemake will load the version of amplimap installed in the default Python path.

amplimap.run

This module provides the amplimap.run.main() method called from the amplimap command-line executable, as well as some helper functions for reading and checking the config files.

amplimap.run.check_config_keys(default_config, my_config, path=[])

Recursively check that config keys provided in my_config also exist in default_config (ignoring ‘paths’ and ‘clusters’).

amplimap.run.compare_config_dicts(my_config, used_config, path=[])

Recursively search for differences in values between two dicts.

amplimap.run.main(argv=None)

Main function for the amplimap executable. This function:

  • parses command line arguments
  • reads, merges and checks each of these config files, if they exist:
    • config_default.yaml in the amplimap package
    • /etc/amplimap/VERSION/config.yaml (where VERSION is the amplimap version)
    • $AMPLIMAP_CONFIG
    • config.yaml in the working directory
  • checks for an existing analysis directory (and compares the amplimap version used to create it)
  • adds its own parent directory to the config file (to be inserted back into the python path inside Snakemake)
  • creates an analysis directory
  • writes config_used.yaml to the new analysis directory
  • creates a cluster_log directory (if running in cluster mode)
  • launches Snakemake, using the amplimap Snakefile, config_used.yaml as the config file and cluster parameters as specified in the command line arguments and config.
amplimap.run.read_config_file(print_config, path)

Helper function to read a config file, if it exists. Always returns a dict, but it may be empty.

amplimap.reader

This module contains methods related to reading input files, such as targets.csv, probes.csv etc.

exception amplimap.reader.AmplimapReaderException(e: Exception, filename: str, should_have_header: bool = None)

Will be raised if reading one of the standard input files has failed, with the aim of providing a more useful error message in the user-visible output.

amplimap.reader.get_code_versions(path: str = '.') → dict

Get the file modification times of common code files for versioning.

amplimap.reader.get_file_hashes(path: str = '.') → dict

Get SHA256 hashes for common input files, so we can make sure these didn’t change between runs.

amplimap.reader.merge_probes_by_id(rows: pandas.core.frame.DataFrame) → pandas.core.series.Series

Merge multiple probes.csv rows with the same probe ID together, to handle cases where MIPGEN generated multiple version because of a SNP.

amplimap.reader.process_probe_design(design: pandas.core.frame.DataFrame, reference_type: str = 'genome') → pandas.core.frame.DataFrame

Read amplimap probes.csv file and return pandas dataframe.

amplimap.reader.read_and_convert_heatseq_probes(path: str) → pandas.core.frame.DataFrame

UNTESTED: Read probes file from Roche heatseq in CSV format and generate an amplimap probes.csv from it.

amplimap.reader.read_and_convert_mipgen_probes(path: str, sep=', ') → pandas.core.frame.DataFrame

Read probes file from MIPGEN in CSV or TSV format and generate an amplimap probes.csv from it.

amplimap.reader.read_new_probe_design(path: str, reference_type: str = 'genome') → pandas.core.frame.DataFrame

Read amplimap probes.csv file and return pandas dataframe.

amplimap.reader.read_sample_info(path: str) → pandas.core.frame.DataFrame

Read amplimap sample_info.csv file and return pandas dataframe.

amplimap.reader.read_snps_txt(path: str, reference_type: str = 'genome') → pandas.core.frame.DataFrame

Read amplimap snps.txt file and return pandas dataframe.

amplimap.reader.read_targets(path: str, check_overlaps: bool = False, reference_type: str = 'genome', file_type: str = 'bed') → pandas.core.frame.DataFrame

Read amplimap targets.csv or targets.bed file and return pandas dataframe.

amplimap.reader.write_targets_bed(path: str, targets: pandas.core.frame.DataFrame)

Write targets dataframe to bed file.

amplimap.parse_reads

This module contains methods for reading raw FASTQ files, finding primers in them and trimming the reads.

amplimap.parse_reads.make_trimmed_read(read_id, read, probe, target_len, umi, umi_len, arm_len, other_arm_len, trim_primers=False, trim_min_length=None, check_sequence=None, check_sequence_mismatches=None, find_arm_sequence=None, find_arm_sequence_mismatches=None, quality_trim_threshold=None, quality_trim_phred_base=None)

Take a fastq read tuple and generate new fastq string after removing umi and potential opposite primer arm.

Returns:
  • str: FastQ lines
  • int: Index of first base trimmed off from the end (to trim opposite primer)
  • bool: was trimmed len great than trim_min_len
  • int: change after quality trimming
  • int: trimmed length
Return type:tuple
amplimap.parse_reads.parse_read_pairs(sample, files1, files2, fastq_out1, fastq_out2, probes_dict, stats_samples, stats_reads, unknown_arms_directory, umi_one, umi_two, mismatches, trim_primers, trim_min_length, trim_primers_strict, trim_primers_smart, quality_trim_threshold, quality_trim_phred_base, allow_multiple_probes, consensus_fastqs=None, min_consensus_count=1, min_consensus_fraction=0.51, debug=False, debug_probe=False)

Read pairs of reads from FASTQ files and write new FASTQ files with trimmed reads.

Recognizes probes based on primer sequences, trims off UMI and probe arms, optionally performs qualtiy trimming and writes consensus FASTQs based on UMI groups.

Parameters:
  • sample (str) – Sample name.
  • files1 (list of str) – Input path to paired FASTQ files (Read1)
  • files2 (list of str) – Input path to paired FASTQ files (Read2)
  • fastq_out1 (list of str) – Output path to paired, trimmed FASTQ files (Read1)
  • fastq_out2 (list of str) – Output path to paired, trimmed FASTQ files (Read2)
  • probes_dict (dict) – Dictionary of probe data (see amplimap.reader.read_probes).
  • stats_samples (dict of lists) – Sample stats. Can be empty, in which case it will be created.
  • stats_reads (list of dicts) – Read stats will be appended to this list.
  • unknown_arms_directory (str) – Directory to write unknown arms files to.
  • umi_two (umi_one,) – Length of UMIs on read1 and read2.
  • mismatches (int) – Number of mismatches to allow in primer sequences.
amplimap.parse_reads.quality_trim_read(read_seq, read_qual_raw, phred_base: int = 33, p_threshold: float = 0.01) → tuple

Use Richard Mott’s algorithm implemented in bwa and others to quality-trim reads.

Returns:length of read after quality trimming, trimmed sequence, trimmed quality scores
Return type:tuple

NOTE: our implementation will always trim from first high-q base (sum > 0) to the best part of the read (max of the running sum), even if there are some low-quality bases in-between.

amplimap.naive_mapper

This is a custom read mapper for amplicon data. It simply places the reads at the location they were expected to come from, based on the target coordinates of their associated probe. It will perform an aligment within the target region to handle small differences but will not correctly handle chimeric off-target reads or errors.

exception amplimap.naive_mapper.AmplimapNoAlignment
amplimap.naive_mapper.align_and_find_cigar(read, ref, debug=False, reverse=False) → tuple

Align read and reference sequence using parwise global alignment and return start offset and CIGAR ops.

amplimap.naive_mapper.create_bam(sample, files_in1, files_in2, ref_fasta, probes_dict, output, has_trimmed_primers=True, debug=False)

Create a BAM file with reads placed at their expected locations, adjusted through pairwise alignment to the target sequences.

This will give reasonable results as long as probes capture the exact target sequences, but will generate alignments with many mismatches if there are any discrepancies.

amplimap.naive_mapper.find_cigar_for_alignment(read_len, alignment, debug) → tuple

Generate tuple of start offset and CIGAR operations for given alignment.

amplimap.pileup

This module contains methods for generating pileups (per-base allele counts) using pysam/samtools.

exception amplimap.pileup.PileupGroupFilterException(filter_column)

Raised when UMI group fails a pileup filter, with filter_column being the column to count this in.

exception amplimap.pileup.PileupRowFilterException(filter_column, skip_read_pair=True)

Raised when row fails a pileup filter, with filter_column being the column to count this in.

amplimap.pileup.get_al_mate_starts(al: pysam.libcalignedsegment.AlignedSegment)

Get set of mate starts for AlignedSegment, always giving read1 first and read2 second

amplimap.pileup.get_group_consensus(group_calls, min_consensus_count=1, min_consensus_fraction=0.51, ignore_groups=False, debug=False)

Calculate consensus call, count and phred for UMI group.

amplimap.pileup.get_pileup_row(chrom, pos_0, raw_coverage=0, target_id=None, target_type=None, ref=None, validate_probe_targets=False)

Get ordered dict of columns for pileup table.

amplimap.pileup.process_pileup_base(ref, probes_dict, targets_dict, snps_dict, region_index, pileup_base, min_consensus_count, min_consensus_fraction, min_mapq, min_baseq, ignore_groups, group_with_mate_positions, validate_probe_targets, filter_softclipped, no_probe_data, read_metadata, filtered_pair_qnames, umi_to_group, debug)

Process a single basepair of pileup, generating a pileup table row.

amplimap.pileup.process_pileup_read(pr, probes_dict, reference_name, reference_pos_0, read_metadata, min_mapq, ignore_groups, group_with_mate_positions, validate_probe_targets, filter_softclipped, no_probe_data, debug)

Process single pileup read.

amplimap.pileup.process_pileup_row(row, seen_probes, call_groups, snps_dict, ignore_groups, min_consensus_count, min_consensus_fraction, min_baseq, ref, debug=False)

Generate a pileup table row from processed read data and calculate some additional stats and annotation.

amplimap.pileup.record_read_in_group(read_calls, my_call, my_phred, my_umi, read_name)

Add call data from read to read_calls dictionary of name -> call.

amplimap.stats_alignment

This module contains methods for generating statistics about the alignments generated by amplimap. This includes the number of reads and read families per probe, per sample as well as data on off-target alignments.

amplimap.stats_alignment.process_file(probes_path: str, input_path: str, output_path: str, min_mapq: int, min_consensus_count: int, include_primers: bool, use_naive_groups: bool, ignore_groups: bool, ignore_group_mismatch: bool = False, debug: bool = False, targets_path: str = None)

Loop through coordinate-sorted BAM file and collect alignment statistics.

Parameters:
  • probes_path (str) – Path to probes design CSV file
  • input_path (str) – Path to input BAM file
  • output_path (str) – Prefix of output CSV file (‘.stats_alignment.csv’ will be appended)

Output file columns:

  • read_pairs_total: Total number of read pairs (one pair may have multiple alignments)
  • alignments_total: Total number of read pair alignments (one pair may have multiple alignments)
  • alignments_good: Total number of alignments that were on target (coordinates match target coordinates) and fully covered the target region
  • umis_good: Total number of UMI groups (read families) that were on target and fully covered the target region
  • umis_good_coverage_ge_NN: Total number of UMI groups (read families) that were on target, fully covered the target region and also had at least NN supporting read pairs
  • alignments_partial: Total number of alignments that were on target but did not fully cover the target region
  • alignments_off_target: Total number of alignments with coordinates different from the target coordinates
  • alignments_unmapped_single: Total number of alignments where one mate was unmapped
  • alignments_unmapped_both: Total number of alignments where both mates were unmapped
  • alignments_flagged: Total number of alignments where at least one mate was flagged as QC fail or supplementary alignment
  • alignments_invalid_pair: Total number of alignments where mates were on different chromosomes or not in the expected orientation (one forward, one reverse)
  • alignments_mapq_lt_NN: Total number of alignments with at least one mate with mapping quality under NN
  • multimapping_alignments: Total number of read pairs that had multiple possible alignments
  • offtarget_locations: Locations and counts for the three most common off-target alignment locations

Will only process a read pair once both alignments have been found, which means that the first mate has to be kept in memory until the second mate has been reached. This may lead to high memory usage if mates are far away and there are many alignments but seems to work fine in practice.

However, reads with multiple alternative mate alignments may not be handled properly since an alignment is only kept in memory until the first matching mate has been found. These will be reported as “orphaned mates” at the end of the run.

amplimap.variants

This module contains methods to parse, merge, annotate and filter variant calls.

amplimap.variants.calculate_del_score(merged: pandas.core.frame.DataFrame)

Add a column DeleteriousScore to dataframe which contains a count of how many tools have assigned this variant a deletious scores.

Score ranges from 0-6, corresponding to the tools SIFT, Polyphen2, LRT, MutationTaster, GERP++ and phyloP100way_vertebrate.

Additionally, any stopgain, frameshift or splicing variants are always set to 6.

amplimap.variants.find_closest_exon(row: pandas.core.series.Series, gexs: dict) → int

Get distance of variant to the closest exon of the gene it has been annotated with.

amplimap.variants.load_gene_exons(file: str, genes: list) → dict

Load list of exon chr, strand, start and end locations for each gene in genes.

amplimap.variants.make_summary(input: list, output: list, config: dict, exon_table_path: str = None)

Load merged Annovar CSV file (plus targets and sample info), process them and output a new CSV file.

amplimap.variants.make_summary_condensed(input, output)

Make condensed summary table that only contains a subset of columns.

amplimap.variants.make_summary_dataframe(merged: pandas.core.frame.DataFrame, targets: pandas.core.frame.DataFrame = None, sample_info: pandas.core.frame.DataFrame = None, genome_name: str = None, include_gbrowse_links: bool = False, include_exon_distance: bool = False, include_score: bool = False, exon_table_path: str = None) → pandas.core.frame.DataFrame

Process merged Annovar dataframe (with optional targets and sample_info data frames) into a large summary table.

amplimap.variants.merge_variants_from_annovar(input, output)

Merge individual Annovar CSV files together.

amplimap.variants.merge_variants_unannotated(input_vcfs, output_file)

Merge unannotated VCF files into a single CSV.

amplimap.coverage

This module contains methods for processing and aggregating coverage files generated by bedtools.

amplimap.coverage.aggregate(input, output)

Read coverage summary files and create aggregate files.

Parameters:
  • input – dict containing ‘csvs’, the list of csvs fils to aggregate, and optionally ‘sample_info’, a table with additional sample annotation
  • output – dict containing paths for output files: merged, min_coverage, cov_per_bp, fraction_zero_coverage
amplimap.coverage.fraction_10x_coverage(coverage)

Calculate fraction of bases with coverage 10 or more.

amplimap.coverage.fraction_30x_coverage(coverage)

Calculate fraction of bases with coverage 30 or more.

amplimap.coverage.fraction_zero_coverage(coverage)

Calculate fraction of bases with coverage 0.

amplimap.coverage.process_file(input: str, output: str)

Read raw bedtools coverage file, calculate summary statistics and output them as CSV file.

Parameters:
  • input – path to a bedtools coverage file
  • output – path to the summary CSV file

amplimap.merge_folders

This module provides the amplimap.merge_folders.main() function called by the amplimap_merge script. This script merges coverage data and variant calls from different working directories together, making it possible to merge samples sequenced in different runs into a single output file.

Others

amplimap.common.find_umi_groups(umi_counts: dict, id_offset=0) → dict

Calculate dict with UMI group ID (values) for each raw UMI (keys).

Parameters:
  • umi_counts (dict) – Counts for each raw UMI sequence, keys must be bytes (not unicode strings).
  • id_offset (int) – Optional offset to add to all UMI IDs.
amplimap.common.make_extended_read_name(original_name: str, probe: str, umi: str) → str

Generate read name that contains probe name and UMI.

amplimap.common.parse_extended_read_name(extended_read_name: str) -> (<class 'str'>, <class 'str'>, <class 'str'>)

Obtain original read name, probe name and UMI from extended read name.

amplimap.simulate contains methods for simulating variants in reads by replacing a specific DNA sequence with a different sequence, some percentage of the time.