Advanced usage

Ignoring read families (UMI groups)

Even when your reads contain unique molecular identifiers (UMIs), you may want to ignore them and perform a pileup on the raw reads. To do this, set ignore_umis: true under general:

general:
  ignore_umis: true

Running on capture-based data

Capture-based sequencing produces reads of varying length from the target regions, and these reads do not contain specific primer sequences. This means that some features of amplimap, such as the trimming of primer sequences and low-quality bases or the detection of off-target capture events, are no longer applicable.

However, amplimap can still produce pileups and variant calls from the raw FASTQ or unmapped/mapped BAM files.

To do this, create a working directory containing a targets.bed or targets.csv file as usual, as well as a config.yaml file containing at least this setting:

general:
  use_raw_reads: true

If you have gzipped FASTQ files without UMIs, put them into the usual reads_in/ directory.

If you have BAM files with unmapped reads, do not create a reads_in directory but instead create a subdirectory called unmapped_bams_in inside your working directory. Place the files there (see also Unmapped BAM files (unmapped_bams_in/)).

If you have BAM files with reads that have already been mapped to the genome, you can put these into a directory called mapped_bams_in instead.
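
As a sketch, setting up such a working directory for mapped BAM files might look like this (all paths and file names here are placeholders):

mkdir capture_run && cd capture_run
cp /path/to/targets.csv .                 # your target regions, as usual
# create config.yaml containing use_raw_reads: true (see above)
mkdir mapped_bams_in
cp /path/to/bams/*.bam mapped_bams_in/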

Then you can run amplimap as usual to generate coverage data, pileups or variant calls.

If your reads have UMIs, these should be provided in the BAM file as a tag, the name of which should be given in the config file:

general:
  umi_tag_name: "RX"
  use_raw_reads: true
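
If you are not sure which tag name your BAM files use, you can inspect the optional fields of a few reads with samtools (assuming samtools is installed; the file name below is just an example):

samtools view unmapped_bams_in/sample1.bam | head -n 2
# look for an optional tag such as RX:Z:ACGTACGT in the output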

UMI-deduplicated variant calling (variants_umi) is currently not supported for capture-based data.

Simulating variation

In order to test that the pipeline works as expected, it may be helpful to simulate variants and check whether these are correctly reported in the output files. This can be particularly helpful in noisy regions, where an abundance of errors may mask real variants. To make this sort of analysis easier, the pipeline comes with a simple testing framework, which can generate variant reads based on an existing data set.

Simulation of variant reads is based on a very straightforward approach: we search for a given search sequence in each read pair and replace it with a replacement sequence. So to simulate an A>G SNP in the sequence “AAAAAAAAAAA”, we would simply ask the testing framework to search for “AAAAAAAAAAA” and replace it with “AAAAAGAAAAA”. The search sequence should be chosen such that it is unique in the targeted regions, but not so long that sequencing errors within it become likely (see Caveats).

The replacement is done on both the forward and reverse strand and in both mates of the read pairs. The pipeline is then run using these edited reads and the usual output files are generated.
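
amplimap handles the reverse strand automatically, but as a quick sanity check you can compute the reverse complement of the replacement sequence yourself with standard shell tools, to see what will be substituted on reverse-strand reads:

echo "AAAAAGAAAAA" | rev | tr 'ACGT' 'TGCA'
# prints TTTTTCTTTTT, which replaces TTTTTTTTTTT on the reverse strand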

To simulate low-frequency or heterozygous variants, the replacement percentage (the percentage of reads that should be edited) can be specified as any number between 0 and 100. When choosing the reads to edit, the simulation algorithm takes UMIs into account, so that read pairs with the same UMI are all treated the same way.

Running a simulation

To run a simulation of variant calls, use the following command, where SEARCH is the search sequence, REPLACE is the replacement sequence and PERCENTAGE is the replacement percentage:

amplimap test__SEARCH_REPLACE_PERCENTAGE/test_variants.done

For example:

amplimap test__TTTACCTCTATTGTTGG_TTTACCTATATTGTTGG_50/test_variants.done

Similarly, pileups can be simulated like this:

amplimap test__SEARCH_REPLACE_PERCENTAGE/test_pileups.done

For example:

amplimap test__TTTACCTCTATTGTTGG_TTTACCTATATTGTTGG_0.5/test_pileups.done

Simulation statistics

Some statistics on the number of occurrences of the search string that were found and the number of replacements that were actually carried out can be found in the file test__SEARCH_REPLACE_PERCENTAGE/stats_replacements/stats_replacements.csv. Note that the actual percentage of replacements may differ significantly from the percentage that was specified, especially if the read count or UMI count is low.
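
For example, to view these statistics for the simulation above as an aligned table in the terminal (using the standard column tool):

column -t -s, test__TTTACCTCTATTGTTGG_TTTACCTATATTGTTGG_50/stats_replacements/stats_replacements.csv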

Caveats

This testing framework is meant to be a useful first step towards simulating data, but it is not as sophisticated as other alternatives. In particular, there are several caveats that should be kept in mind when using this:

  • Only exact matches of the search sequence will be replaced, so any reads that contain mismatches in the region will be ignored. This means that random sequencing errors in the read will prevent replacement, potentially leading to a lower variant read percentage than specified.
  • Only full matches of the search sequence will be replaced, so search sequences that are only partially covered by a read will never be edited. Thus, the locations for simulations should be chosen such that they are fully contained in all overlapping reads.
  • Matches of the sequence will be replaced regardless of the genomic location. Consequently, if the chosen sequence is not unique, multiple variants may be introduced.
  • Matches inside primer sequences will be replaced as well. This may cause problems with matching primer sequences to expected probe arms.

Merging runs

To merge data from multiple runs together, use the amplimap_merge script. You can run amplimap_merge --help to see the parameters. Here is an example:

amplimap_merge /data/OUTPUT_FOLDER /data/working_directory1/analysis /data/working_directory2/analysis /data/working_directory3/analysis

This will merge the variant summary and coverage files from /data/working_directory1, 2 and 3 and save them in a folder called /data/OUTPUT_FOLDER. If you only want one row per sample, you can use the --unique-sample-id-column parameter to specify the name of the column containing the sample ID (e.g. DNAId). This will generate an additional file called variants_summary_filtered.unique.csv, which contains all unique filtered variants, and another file called coverage_full.unique.csv, which contains the highest coverage numbers observed for each sample.

For example:

amplimap_merge --unique-sample-id-column=DNAId /data/OUTPUT_FOLDER /data/working_directory1/analysis /data/working_directory2/analysis /data/working_directory3/analysis

Additional Notes

Platypus variant filters

The filters that a variant may have failed are described here: http://www.well.ox.ac.uk/~gerton/Platypus/ng.3036-S1.pdf

Using screen

While the pipeline is running, you normally need to keep your SSH terminal connected. If the connection is lost, the pipeline run will be aborted.

However, you can use the screen tool to make sure the pipeline keeps running even when you are not connected. To do this, run the command screen in the terminal. This will start a screen session, inside which you can run any normal commands. Even if you then disconnect your SSH session, any commands running inside screen will continue to run. To reconnect to the screen session later and check the status of the pipeline, connect to the same server and type screen -r (r = reattach).
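
Put together, a typical workflow might look like this (the session name is just an example):

screen -S amplimap   # start a named screen session
# ... run amplimap as usual inside the session ...
# press Ctrl-A, then d, to detach; the pipeline keeps running
screen -r amplimap   # reattach later to check on progress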

To scroll up and down in screen you need to use a special key combination: Press Ctrl-A, and then the ESC key to activate copy mode. In copy mode, you can use the arrow keys or Ctrl-U to go up and Ctrl-D to go down, as well as ? and / to search backwards/forwards. Press ESC again to get back to normal typing mode.

Linking files

Instead of copying large amounts of data into the working directory you can also just create a link from the working directory to the actual location of the files. This way, only one copy of the files is kept on the file system.

This is particularly useful if you make multiple working directories for the same set of samples, to analyse them with different parameters.

To create a link, use the ln -s command in the terminal, like this:

ln -s /path/to/source/location name_of_link

So for example, to link the probes.csv file from another directory into the current directory with the same name, you can run:

ln -s /other/directory/probes.csv probes.csv

You can also link multiple files at once using wildcards. For example, to link all fastq.gz files from your data directory into the reads_in folder:

ln -s /path/to/data/directory/*.fastq.gz reads_in/
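
You can then check that the links point to the right files with ls -l; each entry should show an arrow to its source (the file name below is just an example):

ls -l reads_in/
# sample1_R1.fastq.gz -> /path/to/data/directory/sample1_R1.fastq.gz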