Code documentation¶
Overview¶
When amplimap
is launched, the function amplimap.run.main() is called. This function
will perform various pre-processing steps (see below) and then launch Snakemake.
Snakemake, using the Snakefile provided in the amplimap package, will then determine
the jobs that need to be executed.
Testing amplimap¶
To test amplimap, install the pytest
package through pip and then
run this command in the amplimap source directory:
pytest .
amplimap components¶
Snakefile¶
amplimap’s Snakefile
is used by the amplimap
command-line executable
to determine the shell commands and library functions to execute to generate a given file.
Each “rule” in this file either executes a shell command (e.g. running bwa
),
executes a small set of Python commands or calls a function from the amplimap Python package.
Running snakemake
directly¶
In theory, this file can also be used directly with the snakemake
executable.
However, this will omit various steps performed by the amplimap
executable,
including checking and loading the configuration files.
Thus, a configuration file needs to be provided with the --configfile
parameter that contains all settings
included in config_default.yaml
.
In addition, you may want to add the following setting:
general:
amplimap_parent_dir: /path/to/parent/of/amplimap/package/directory
This directory will be inserted into the Python path when running Snakemake to make
sure import amplimap.xxx
imports the correct files. If it is not provided
Snakemake will load the version of amplimap installed in the default Python path.
amplimap.run¶
This module provides the amplimap.run.main()
method called from the amplimap command-line executable,
as well as some helper functions for reading and checking the config files.
-
amplimap.run.
check_config_keys
(default_config, my_config, path=[])¶ Recursively check that config keys provided in my_config also exist in default_config (ignoring ‘paths’ and ‘clusters’).
-
amplimap.run.
compare_config_dicts
(my_config, used_config, path=[])¶ Recursively search for differences in values between two dicts.
-
amplimap.run.
main
(argv=None)¶ Main function for the
amplimap
executable. This function:- parses command line arguments
- reads, merges and checks each of these config files, if they exist:
config_default.yaml
in the amplimap package/etc/amplimap/VERSION/config.yaml
(where VERSION is the amplimap version)$AMPLIMAP_CONFIG
config.yaml
in the working directory
- checks for an existing analysis directory (and compares the amplimap version used to create it)
- adds its own parent directory to the config file (to be inserted back into the python path inside Snakemake)
- creates an analysis directory
- writes
config_used.yaml
to the new analysis directory - creates a
cluster_log
directory (if running in cluster mode) - launches Snakemake, using the amplimap Snakefile,
config_used.yaml
as the config file and cluster parameters as specified in the command line arguments and config.
-
amplimap.run.
read_config_file
(print_config, path)¶ Helper function to read a config file, if it exists. Always returns a dict, but it may be empty.
amplimap.reader¶
This module contains methods related to reading input files, such as targets.csv, probes.csv etc.
-
exception
amplimap.reader.
AmplimapReaderException
(e: Exception, filename: str, should_have_header: bool = None)¶ Will be raised if reading one of the standard input files has failed, with the aim of providing a more useful error message in the user-visible output.
-
amplimap.reader.
get_code_versions
(path: str = '.') → dict¶ Get the file modification times of common code files for versioning.
-
amplimap.reader.
get_file_hashes
(path: str = '.') → dict¶ Get SHA256 hashes for common input files, so we can make sure these didn’t change between runs.
-
amplimap.reader.
merge_probes_by_id
(rows: pandas.core.frame.DataFrame) → pandas.core.series.Series¶ Merge multiple probes.csv rows with the same probe ID together, to handle cases where MIPGEN generated multiple version because of a SNP.
-
amplimap.reader.
process_probe_design
(design: pandas.core.frame.DataFrame, reference_type: str = 'genome') → pandas.core.frame.DataFrame¶ Read amplimap probes.csv file and return pandas dataframe.
-
amplimap.reader.
read_and_convert_heatseq_probes
(path: str) → pandas.core.frame.DataFrame¶ UNTESTED: Read probes file from Roche heatseq in CSV format and generate an amplimap probes.csv from it.
-
amplimap.reader.
read_and_convert_mipgen_probes
(path: str, sep=', ') → pandas.core.frame.DataFrame¶ Read probes file from MIPGEN in CSV or TSV format and generate an amplimap probes.csv from it.
-
amplimap.reader.
read_new_probe_design
(path: str, reference_type: str = 'genome') → pandas.core.frame.DataFrame¶ Read amplimap probes.csv file and return pandas dataframe.
-
amplimap.reader.
read_sample_info
(path: str) → pandas.core.frame.DataFrame¶ Read amplimap sample_info.csv file and return pandas dataframe.
-
amplimap.reader.
read_snps_txt
(path: str, reference_type: str = 'genome') → pandas.core.frame.DataFrame¶ Read amplimap snps.txt file and return pandas dataframe.
-
amplimap.reader.
read_targets
(path: str, check_overlaps: bool = False, reference_type: str = 'genome', file_type: str = 'bed') → pandas.core.frame.DataFrame¶ Read amplimap targets.csv or targets.bed file and return pandas dataframe.
-
amplimap.reader.
write_targets_bed
(path: str, targets: pandas.core.frame.DataFrame)¶ Write targets dataframe to bed file.
amplimap.parse_reads¶
This module contains methods for reading raw FASTQ files, finding primers in them and trimming the reads.
-
amplimap.parse_reads.
make_trimmed_read
(read_id, read, probe, target_len, umi, umi_len, arm_len, other_arm_len, trim_primers=False, trim_min_length=None, check_sequence=None, check_sequence_mismatches=None, find_arm_sequence=None, find_arm_sequence_mismatches=None, quality_trim_threshold=None, quality_trim_phred_base=None)¶ Take a fastq read tuple and generate new fastq string after removing umi and potential opposite primer arm.
Returns: - str: FastQ lines
- int: Index of first base trimmed off from the end (to trim opposite primer)
- bool: was trimmed len great than trim_min_len
- int: change after quality trimming
- int: trimmed length
Return type: tuple
-
amplimap.parse_reads.
parse_read_pairs
(sample, files1, files2, fastq_out1, fastq_out2, probes_dict, stats_samples, stats_reads, unknown_arms_directory, umi_one, umi_two, mismatches, trim_primers, trim_min_length, trim_primers_strict, trim_primers_smart, quality_trim_threshold, quality_trim_phred_base, allow_multiple_probes, consensus_fastqs=None, min_consensus_count=1, min_consensus_fraction=0.51, debug=False, debug_probe=False)¶ Read pairs of reads from FASTQ files and write new FASTQ files with trimmed reads.
Recognizes probes based on primer sequences, trims off UMI and probe arms, optionally performs qualtiy trimming and writes consensus FASTQs based on UMI groups.
Parameters: - sample (str) – Sample name.
- files1 (list of str) – Input path to paired FASTQ files (Read1)
- files2 (list of str) – Input path to paired FASTQ files (Read2)
- fastq_out1 (list of str) – Output path to paired, trimmed FASTQ files (Read1)
- fastq_out2 (list of str) – Output path to paired, trimmed FASTQ files (Read2)
- probes_dict (dict) – Dictionary of probe data (see amplimap.reader.read_probes).
- stats_samples (dict of lists) – Sample stats. Can be empty, in which case it will be created.
- stats_reads (list of dicts) – Read stats will be appended to this list.
- unknown_arms_directory (str) – Directory to write unknown arms files to.
- umi_two (umi_one,) – Length of UMIs on read1 and read2.
- mismatches (int) – Number of mismatches to allow in primer sequences.
-
amplimap.parse_reads.
quality_trim_read
(read_seq, read_qual_raw, phred_base: int = 33, p_threshold: float = 0.01) → tuple¶ Use Richard Mott’s algorithm implemented in bwa and others to quality-trim reads.
Returns: length of read after quality trimming, trimmed sequence, trimmed quality scores Return type: tuple - see description: http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/650/Quality_trimming.html
- see sample code (but not sure if correct!): https://www.biostars.org/p/1923/
- see version in bio.seqio: http://biopython.org/DIST/docs/api/Bio.SeqIO.AbiIO-module.html
- see implementation (may be inefficient, keeps list of all cumulative scores…): http://biopython.org/DIST/docs/api/Bio.SeqIO.AbiIO-pysrc.html#_abi_trim
NOTE: our implementation will always trim from first high-q base (sum > 0) to the best part of the read (max of the running sum), even if there are some low-quality bases in-between.
amplimap.naive_mapper¶
This is a custom read mapper for amplicon data. It simply places the reads at the location they were expected to come from, based on the target coordinates of their associated probe. It will perform an aligment within the target region to handle small differences but will not correctly handle chimeric off-target reads or errors.
-
exception
amplimap.naive_mapper.
AmplimapNoAlignment
¶
-
amplimap.naive_mapper.
align_and_find_cigar
(read, ref, debug=False, reverse=False) → tuple¶ Align read and reference sequence using parwise global alignment and return start offset and CIGAR ops.
-
amplimap.naive_mapper.
create_bam
(sample, files_in1, files_in2, ref_fasta, probes_dict, output, has_trimmed_primers=True, debug=False)¶ Create a BAM file with reads placed at their expected locations, adjusted through pairwise alignment to the target sequences.
This will give reasonable results as long as probes capture the exact target sequences, but will generate alignments with many mismatches if there are any discrepancies.
-
amplimap.naive_mapper.
find_cigar_for_alignment
(read_len, alignment, debug) → tuple¶ Generate tuple of start offset and CIGAR operations for given alignment.
amplimap.pileup¶
This module contains methods for generating pileups (per-base allele counts) using pysam/samtools.
-
exception
amplimap.pileup.
PileupGroupFilterException
(filter_column)¶ Raised when UMI group fails a pileup filter, with filter_column being the column to count this in.
-
exception
amplimap.pileup.
PileupRowFilterException
(filter_column, skip_read_pair=True)¶ Raised when row fails a pileup filter, with filter_column being the column to count this in.
-
amplimap.pileup.
get_al_mate_starts
(al: pysam.libcalignedsegment.AlignedSegment)¶ Get set of mate starts for AlignedSegment, always giving read1 first and read2 second
-
amplimap.pileup.
get_group_consensus
(group_calls, min_consensus_count=1, min_consensus_fraction=0.51, ignore_groups=False, debug=False)¶ Calculate consensus call, count and phred for UMI group.
-
amplimap.pileup.
get_pileup_row
(chrom, pos_0, raw_coverage=0, target_id=None, target_type=None, ref=None, validate_probe_targets=False)¶ Get ordered dict of columns for pileup table.
-
amplimap.pileup.
process_pileup_base
(ref, probes_dict, targets_dict, snps_dict, region_index, pileup_base, min_consensus_count, min_consensus_fraction, min_mapq, min_baseq, ignore_groups, group_with_mate_positions, validate_probe_targets, filter_softclipped, no_probe_data, read_metadata, filtered_pair_qnames, umi_to_group, debug)¶ Process a single basepair of pileup, generating a pileup table row.
-
amplimap.pileup.
process_pileup_read
(pr, probes_dict, reference_name, reference_pos_0, read_metadata, min_mapq, ignore_groups, group_with_mate_positions, validate_probe_targets, filter_softclipped, no_probe_data, debug)¶ Process single pileup read.
-
amplimap.pileup.
process_pileup_row
(row, seen_probes, call_groups, snps_dict, ignore_groups, min_consensus_count, min_consensus_fraction, min_baseq, ref, debug=False)¶ Generate a pileup table row from processed read data and calculate some additional stats and annotation.
-
amplimap.pileup.
record_read_in_group
(read_calls, my_call, my_phred, my_umi, read_name)¶ Add call data from read to read_calls dictionary of name -> call.
amplimap.stats_alignment¶
This module contains methods for generating statistics about the alignments generated by amplimap. This includes the number of reads and read families per probe, per sample as well as data on off-target alignments.
-
amplimap.stats_alignment.
process_file
(probes_path: str, input_path: str, output_path: str, min_mapq: int, min_consensus_count: int, include_primers: bool, use_naive_groups: bool, ignore_groups: bool, ignore_group_mismatch: bool = False, debug: bool = False, targets_path: str = None)¶ Loop through coordinate-sorted BAM file and collect alignment statistics.
Parameters: - probes_path (str) – Path to probes design CSV file
- input_path (str) – Path to input BAM file
- output_path (str) – Prefix of output CSV file (‘.stats_alignment.csv’ will be appended)
Output file columns:
- read_pairs_total: Total number of read pairs (one pair may have multiple alignments)
- alignments_total: Total number of read pair alignments (one pair may have multiple alignments)
- alignments_good: Total number of alignments that were on target (coordinates match target coordinates) and fully covered the target region
- umis_good: Total number of UMI groups (read families) that were on target and fully covered the target region
- umis_good_coverage_ge_NN: Total number of UMI groups (read families) that were on target, fully covered the target region and also had at least NN supporting read pairs
- alignments_partial: Total number of alignments that were on target but did not fully cover the target region
- alignments_off_target: Total number of alignments with coordinates different from the target coordinates
- alignments_unmapped_single: Total number of alignments where one mate was unmapped
- alignments_unmapped_both: Total number of alignments where both mates were unmapped
- alignments_flagged: Total number of alignments where at least one mate was flagged as QC fail or supplementary alignment
- alignments_invalid_pair: Total number of alignments where mates were on different chromosomes or not in the expected orientation (one forward, one reverse)
- alignments_mapq_lt_NN: Total number of alignments with at least one mate with mapping quality under NN
- multimapping_alignments: Total number of read pairs that had multiple possible alignments
- offtarget_locations: Locations and counts for the three most common off-target alignment locations
Will only process a read pair once both alignments have been found, which means that the first mate has to be kept in memory until the second mate has been reached. This may lead to high memory usage if mates are far away and there are many alignments but seems to work fine in practice.
However, reads with multiple alternative mate alignments may not be handled properly since an alignment is only kept in memory until the first matching mate has been found. These will be reported as “orphaned mates” at the end of the run.
amplimap.variants¶
This module contains methods to parse, merge, annotate and filter variant calls.
-
amplimap.variants.
calculate_del_score
(merged: pandas.core.frame.DataFrame)¶ Add a column DeleteriousScore to dataframe which contains a count of how many tools have assigned this variant a deletious scores.
Score ranges from 0-6, corresponding to the tools SIFT, Polyphen2, LRT, MutationTaster, GERP++ and phyloP100way_vertebrate.
Additionally, any stopgain, frameshift or splicing variants are always set to 6.
-
amplimap.variants.
find_closest_exon
(row: pandas.core.series.Series, gexs: dict) → int¶ Get distance of variant to the closest exon of the gene it has been annotated with.
-
amplimap.variants.
load_gene_exons
(file: str, genes: list) → dict¶ Load list of exon chr, strand, start and end locations for each gene in genes.
-
amplimap.variants.
make_summary
(input: list, output: list, config: dict, exon_table_path: str = None)¶ Load merged Annovar CSV file (plus targets and sample info), process them and output a new CSV file.
-
amplimap.variants.
make_summary_condensed
(input, output)¶ Make condensed summary table that only contains a subset of columns.
-
amplimap.variants.
make_summary_dataframe
(merged: pandas.core.frame.DataFrame, targets: pandas.core.frame.DataFrame = None, sample_info: pandas.core.frame.DataFrame = None, genome_name: str = None, include_gbrowse_links: bool = False, include_exon_distance: bool = False, include_score: bool = False, exon_table_path: str = None) → pandas.core.frame.DataFrame¶ Process merged Annovar dataframe (with optional targets and sample_info data frames) into a large summary table.
-
amplimap.variants.
merge_variants_from_annovar
(input, output)¶ Merge individual Annovar CSV files together.
-
amplimap.variants.
merge_variants_unannotated
(input_vcfs, output_file)¶ Merge unannotated VCF files into a single CSV.
amplimap.coverage¶
This module contains methods for processing and aggregating coverage files generated by bedtools
.
-
amplimap.coverage.
aggregate
(input, output)¶ Read coverage summary files and create aggregate files.
Parameters: - input – dict containing ‘csvs’, the list of csvs fils to aggregate, and optionally ‘sample_info’, a table with additional sample annotation
- output – dict containing paths for output files: merged, min_coverage, cov_per_bp, fraction_zero_coverage
-
amplimap.coverage.
fraction_10x_coverage
(coverage)¶ Calculate fraction of bases with coverage 10 or more.
-
amplimap.coverage.
fraction_30x_coverage
(coverage)¶ Calculate fraction of bases with coverage 30 or more.
-
amplimap.coverage.
fraction_zero_coverage
(coverage)¶ Calculate fraction of bases with coverage 0.
-
amplimap.coverage.
process_file
(input: str, output: str)¶ Read raw bedtools coverage file, calculate summary statistics and output them as CSV file.
Parameters: - input – path to a bedtools coverage file
- output – path to the summary CSV file
amplimap.merge_folders¶
This module provides the amplimap.merge_folders.main() function called by the amplimap_merge
script.
This script merges coverage data and variant calls from different working directories together,
making it possible to merge samples sequenced in different runs into a single output file.
Others¶
-
amplimap.common.
find_umi_groups
(umi_counts: dict, id_offset=0) → dict¶ Calculate dict with UMI group ID (values) for each raw UMI (keys).
Parameters: - umi_counts (dict) – Counts for each raw UMI sequence, keys must be bytes (not unicode strings).
- id_offset (int) – Optional offset to add to all UMI IDs.
-
amplimap.common.
make_extended_read_name
(original_name: str, probe: str, umi: str) → str¶ Generate read name that contains probe name and UMI.
-
amplimap.common.
parse_extended_read_name
(extended_read_name: str) -> (<class 'str'>, <class 'str'>, <class 'str'>)¶ Obtain original read name, probe name and UMI from extended read name.
amplimap.simulate
contains methods for simulating variants in reads by replacing a specific DNA sequence with a different sequence,
some percentage of the time.