Advanced usage
---------------

Ignoring read families (UMI groups)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Even when your reads contain unique molecular identifiers (UMIs), you may want to ignore them and perform a pileup on the raw reads instead. To do this, set ``ignore_umis: true`` under ``general:``:

::

    general:
      ignore_umis: true

.. _running-capture:

Running on capture-based data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Capture-based sequencing results in reads of varying length from the target regions, which do not contain specific primer sequences. This means that some features of amplimap, such as the trimming of primer sequences and low-quality bases or the detection of off-target capture events, are no longer applicable. However, amplimap can still produce pileups and variant calls from the raw FASTQ or unmapped/mapped BAM files.

To do this, create a working directory containing a ``targets.bed`` or ``targets.csv`` file as usual, as well as a ``config.yaml`` file containing at least this setting:

::

    general:
      use_raw_reads: true

If you have gzipped FASTQ files without UMIs, put them into the usual :ref:`reads-in` directory.

If you have BAM files with unmapped reads, do not create a ``reads_in`` directory. Instead, create a subdirectory called ``unmapped_bams_in`` inside your working directory and place the files there (see also :ref:`unmapped-bams`).

If you have BAM files with reads that have already been mapped to the genome, you can put these into a directory called ``mapped_bams_in`` instead.

Then you can run amplimap as usual to generate coverage data, pileups or variant calls.

If your reads have UMIs, these should be provided in the BAM files as a tag, the name of which should be given in the config file:

::

    general:
      umi_tag_name: "RX"
      use_raw_reads: true

UMI-deduplicated variant calling (``variants_umi``) is currently not supported for capture-based data.

Simulating variation
~~~~~~~~~~~~~~~~~~~~~~~~

To test that the pipeline works as expected, it may be helpful to simulate variants and check whether they are correctly reported in the output files. This can be particularly useful in noisy regions, where an abundance of errors may mask real variants. To make this sort of analysis easier, the pipeline comes with a simple testing framework that can generate variant reads based on an existing data set.

Simulation of variant reads follows a very straightforward approach: we search for a given **search sequence** in each read pair and replace it with a **replacement sequence**. For example, to simulate an A>G SNP in the sequence “AAAAAAAAAAA”, we would simply ask the testing framework to search for “AAAAAAAAAAA” and replace it with “AAAAAGAAAAA”. The search sequence should be chosen such that it is unique within the targeted regions, but not so long that sequencing errors become a problem (see Caveats below). The replacement is carried out on both the forward and the reverse strand, and in both mates of each read pair. The pipeline is then run on these edited reads and the usual output files are generated.

To simulate low-frequency or heterozygous variants, the **replacement percentage** of reads that should be edited can be specified as any number between 0 and 100. When choosing the reads to edit, the simulation algorithm takes UMIs into account, so that read pairs with the same UMI are all treated the same way.
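To make the mechanics concrete, here is a minimal Python sketch of this kind of strand-aware search-and-replace. It only illustrates the approach described above and is not amplimap's actual implementation:

::

    # Illustration of the search-and-replace idea -- not amplimap's
    # actual implementation.
    COMPLEMENT = str.maketrans("ACGT", "TGCA")

    def revcomp(seq):
        """Return the reverse complement of a DNA sequence."""
        return seq.translate(COMPLEMENT)[::-1]

    def edit_read(read, search, replace):
        """Replace `search` with `replace` on both strands of a read."""
        read = read.replace(search, replace)
        return read.replace(revcomp(search), revcomp(replace))

    # Forward-strand read: the SNP is introduced directly.
    print(edit_read("CCAAAAAAAAAAACC", "AAAAAAAAAAA", "AAAAAGAAAAA"))
    # -> CCAAAAAGAAAAACC

    # Reverse-strand read: the reverse complement of the search sequence
    # is replaced by the reverse complement of the replacement sequence.
    print(edit_read("GGTTTTTTTTTTTGG", "AAAAAAAAAAA", "AAAAAGAAAAA"))
    # -> GGTTTTTCTTTTTGG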
Running a simulation
^^^^^^^^^^^^^^^^^^^^^

To run a simulation of variant calls, use the following command, where SEARCH is the **search sequence**, REPLACE is the **replacement sequence** and PERCENTAGE is the **replacement percentage**:

::

    amplimap test__SEARCH_REPLACE_PERCENTAGE/test_variants.done

For example:

::

    amplimap test__TTTACCTCTATTGTTGG_TTTACCTATATTGTTGG_50/test_variants.done

Similarly, pileups can be simulated like this:

::

    amplimap test__SEARCH_REPLACE_PERCENTAGE/test_pileups.done

For example:

::

    amplimap test__TTTACCTCTATTGTTGG_TTTACCTATATTGTTGG_0.5/test_pileups.done

Simulation statistics
^^^^^^^^^^^^^^^^^^^^^

Statistics on the number of occurrences of the search string that were found, and the number of replacements that were actually carried out, can be found in the file ``test__SEARCH_REPLACE_PERCENTAGE/stats_replacements/stats_replacements.csv``. Note that the percentage of replacements actually performed may differ significantly from the percentage that was specified, especially if the read count or UMI count is low.

Caveats
^^^^^^^^^^^^^^^^^^^^^

This testing framework is meant to be a useful first step towards simulating data, but it is not as sophisticated as other alternatives. In particular, there are several caveats to keep in mind:

- Only exact matches of the search sequence will be replaced, so any reads that contain mismatches in the region will be ignored. This means that random sequencing errors in a read will prevent replacement, potentially leading to a lower variant read percentage than specified. For example, assuming independent errors, a 17 bp search sequence with a 0.5% per-base error rate will match perfectly in only about 0.995^17 ≈ 92% of overlapping reads.
- Only full matches of the search sequence will be replaced, so search sequences that are only partially covered by a read will never be edited. Thus, the locations for simulations should be chosen such that they are fully contained in all overlapping reads.
- Matches of the sequence will be replaced regardless of their genomic location. Consequently, if the chosen sequence is not unique, multiple variants may be introduced.
- Matches inside primer sequences will be replaced as well. This may cause problems with matching primer sequences to expected probe arms.

Merging runs
~~~~~~~~~~~~

To merge data from multiple runs together, use the ``amplimap_merge`` script. You can run ``amplimap_merge --help`` to see the parameters.

Here is an example:

::

    amplimap_merge /data/OUTPUT_FOLDER /data/working_directory1/analysis /data/working_directory2/analysis /data/working_directory3/analysis

This will merge the variant summary and coverage files from ``/data/working_directory1``, ``2`` and ``3`` and save them in a folder called ``/data/OUTPUT_FOLDER``.

If you only want to get one row per sample, you can use the ``--unique-sample-id-column`` parameter to specify the column name containing the sample ID (e.g. ``DNAId``). This will generate an additional file called ``variants_summary_filtered.unique.csv``, which contains all unique filtered variants, and another file called ``coverage_full.unique.csv``, which contains the highest coverage numbers observed for each sample. For example:

::

    amplimap_merge --unique-sample-id-column=DNAId /data/OUTPUT_FOLDER /data/working_directory1/analysis /data/working_directory2/analysis /data/working_directory3/analysis
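To illustrate what this per-sample deduplication amounts to, here is a rough pandas sketch. It is only meant to convey the idea, not ``amplimap_merge``'s actual code, and the column names (``Target``, ``DNAId``, ``min_coverage``) are assumptions:

::

    # Sketch of per-sample deduplication -- not amplimap_merge's actual
    # code. The column names are assumptions made for illustration.
    import pandas as pd

    coverage = pd.read_csv("coverage_full.csv")

    # For each target region and sample, keep the row with the highest
    # coverage observed across the merged runs.
    unique = (
        coverage.sort_values("min_coverage", ascending=False)
        .drop_duplicates(subset=["Target", "DNAId"])
    )
    unique.to_csv("coverage_full.unique.csv", index=False)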
Additional Notes
~~~~~~~~~~~~~~~~~~

Platypus variant filters
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The filters that a variant may have failed are described here: http://www.well.ox.ac.uk/~gerton/Platypus/ng.3036-S1.pdf

Using ``screen``
^^^^^^^^^^^^^^^^^^^^^

While the pipeline is running, you normally need to keep your SSH terminal connected; if the connection is lost, the pipeline run will be aborted. However, you can use the ``screen`` tool to make sure it keeps running even when you are not connected.

To do this, run the command ``screen`` in the terminal. This will start a ``screen`` session, inside which you can run any normal commands. Even if you then disconnect your SSH session, any commands running inside ``screen`` will continue to run. To reconnect to the ``screen`` session later and check the status of the pipeline, connect to the same server and type ``screen -r`` (r = reattach).

To scroll up and down in ``screen`` you need to use a special key combination: press ``Ctrl``-``A`` and then the ``ESC`` key to activate copy mode. In copy mode, you can use the arrow keys or ``Ctrl``-``U`` to go up and ``Ctrl``-``D`` to go down, as well as ``?`` and ``/`` to search backwards/forwards. Press ``ESC`` again to get back to normal typing mode.

Linking files
^^^^^^^^^^^^^^

Instead of copying large amounts of data into the working directory, you can also just create a link from the working directory to the actual location of the files. This way, only one copy of the files is kept on the file system. This is particularly useful if you create multiple working directories for the same set of samples, to analyse them with different parameters.

To create a link, use the ``ln -s`` command in the terminal, like this:

::

    ln -s /path/to/source/location name_of_link

For example, to link the ``probes.csv`` file from another directory into the current directory with the same name, you can run:

::

    ln -s /other/directory/probes.csv probes.csv

You can also link multiple files using wildcards. For example, to link all ``fastq.gz`` files from your data directory into the ``reads_in`` folder:

::

    ln -s /path/to/data/directory/*.fastq.gz reads_in/