open2c / pairtools Goto Github PK

View Code? Open in Web Editor NEW

95.0 11.0 33.0 3.33 MB

Extract 3D contacts (.pairs) from sequencing alignments

License: MIT License

Python 96.75% Makefile 0.18% Cython 3.07%

3d-genome bioinformatics file-formatter hi-c ngs python pairs-file

pairtools's Introduction

pairtools

Process Hi-C pairs with pairtools

pairtools is a simple and fast command-line framework to process sequencing data from a Hi-C experiment.

pairtools process pair-end sequence alignments and perform the following operations:

detect ligation junctions (a.k.a. Hi-C pairs) in aligned paired-end sequences of Hi-C DNA molecules
sort .pairs files for downstream analyses
detect, tag and remove PCR/optical duplicates
generate extensive statistics of Hi-C datasets
select Hi-C pairs given flexibly defined criteria
restore .sam alignments from Hi-C pairs
annotate restriction digestion sites
get the mutated positions in Hi-C pairs

To get started:

Visit pairtools tutorials,
Take a look at a quick example,
Check out the detailed documentation.

Data formats

pairtools produce and operate on tab-separated files compliant with the .pairs format defined by the 4D Nucleome Consortium. All pairtools properly manage file headers and keep track of the data processing history.

Additionally, pairtools define the .pairsam format, an extension of .pairs that includes the SAM alignments of a sequenced Hi-C molecule. .pairsam complies with the .pairs format, and can be processed by any tool that operates on .pairs files.

pairtools produces a set of additional extra columns, which describe properties of alignments, phase, mutations, restriction and complex walks. The full list of possible extra columns is provided in the pairtools format specification.

Installation

Requirements:

Python 3.x
Python packages cython, pysam, bioframe, pyyaml, numpy, scipy, pandas and click.
Command-line utilities sort (the Unix version), bgzip (shipped with samtools) and samtools. If available, pairtools can compress outputs with pbgzip and lz4.

For the full list of recommended versions, see requirements in the the GitHub repo.

We highly recommend using the conda package manager to install pairtools together with all its dependencies. To get it, you can either install the full Anaconda Python distribution or just the standalone conda package manager.

With conda, you can install pairtools and all of its dependencies from the bioconda channel.

$ conda install -c conda-forge -c bioconda pairtools

Alternatively, install non-Python dependencies and pairtools with Python-only dependencies from PyPI using pip:

$ pip install numpy pysam cython
$ pip install pairtools

Quick example

Setup a new test folder and download a small Hi-C dataset mapped to sacCer3 genome:

$ mkdir /tmp/test-pairtools
$ cd /tmp/test-pairtools
$ wget https://github.com/open2c/distiller-test-data/raw/master/bam/MATalpha_R1.bam

Additionally, we will need a .chromsizes file, a TAB-separated plain text table describing the names, sizes and the order of chromosomes in the genome assembly used during mapping:

$ wget https://raw.githubusercontent.com/open2c/distiller-test-data/master/genome/sacCer3.reduced.chrom.sizes

With pairtools parse, we can convert paired-end sequence alignments stored in .sam/.bam format into .pairs, a TAB-separated table of Hi-C ligation junctions:

$ pairtools parse -c sacCer3.reduced.chrom.sizes -o MATalpha_R1.pairs.gz --drop-sam MATalpha_R1.bam

Inspect the resulting table:

$ less MATalpha_R1.pairs.gz

Pipelines

We provide a simple working example of a mapping bash pipeline in /examples/.
distiller is a powerful Hi-C data analysis workflow, based on pairtools and nextflow.

Tools

parse: read .sam/.bam files produced by bwa and form Hi-C pairs
- form Hi-C pairs by reporting the outer-most mapped positions and the strand on the either side of each molecule;
- report unmapped/multimapped (ambiguous alignments)/chimeric alignments as chromosome "!", position 0, strand "-";
- perform upper-triangular flipping of the sides of Hi-C molecules such that the first side has a lower sorting index than the second side;
- form hybrid pairsam output, where each line contains all available data for one Hi-C molecule (outer-most mapped positions on the either side, read ID, pair type, and .sam entries for each alignment);
- report .sam tags or mutations of the alignments;
- print the .sam header as #-comment lines at the start of the file.
parse2: read .sam/.bam files with long paired-and or single-end reads and form Hi-C pairs from complex walks
- identify and rescue chrimeric alignments produced by singly-ligated Hi-C molecules with a sequenced ligation junction on one of the sides;
- annotate chimeric alignments by restriction fragments and report true junctions and hops (One-Read-Based Interactions Annotation, ORBITA);
- perform intra-molecule deduplication of paired-end data when one side reads through the DNA on the other side of the read;
- report index of the pair in the complex walk;
- make combinatorial expansion of pairs produced from the same walk;
sort: sort pairs files (the lexicographic order for chromosomes, the numeric order for the positions, the lexicographic order for pair types).
merge: merge sorted .pairs files
- merge sort .pairs;
- combine the .pairs headers from all input files;
- check that each .pairs file was mapped to the same reference genome index (by checking the identity of the @SQ sam header lines).
select: select pairs according to specified criteria
- select pairs entries according to the provided condition. A programmable interface allows for arbitrarily complex queries on specific pair types, chromosomes, positions, strands, read IDs (including matches to a wildcard/regexp/list).
- optionally print the non-matching entries into a separate file.
dedup: remove PCR duplicates from a sorted triu-flipped .pairs file
- remove PCR duplicates by finding pairs of entries with both sides mapped to similar genomic locations (+/- N bp);
- optionally output the PCR duplicate entries into a separate file;
- detect optical duplicates from the original Illumina read ids;
- apply filtering by various properties of pairs (MAPQ; orientation; distance) together with deduplication;
- output yaml or convenient tsv deduplication stats into text file.
- NOTE: in order to remove all PCR duplicates, the input must contain *all* mapped read pairs from a single experimental replicate;
maskasdup: mark all pairs in a pairsam as Hi-C duplicates
- change the field pair_type to DD;
- change the pair_type tag (Yt:Z:) for all sam alignments;
- set the PCR duplicate binary flag for all sam alignments (0x400).
split: split a .pairsam file into .pairs and .sam.
flip: flip pairs to get an upper-triangular matrix
header: manipulate the .pairs/.pairsam header
- generate new header for headerless .pairs file
- transfer header from one .pairs file to another
- set column names for the .pairs file
- validate that the header corresponds to the information stored in .pairs file
stats: calculate various statistics of .pairs files
restrict: identify the span of the restriction fragment forming a Hi-C junction
phase: phase pairs mapped to a diploid genome

Contributing

Pull requests are welcome.

For development, clone and install in "editable" (i.e. development) mode with the -e option. This way you can also pull changes on the fly.

$ git clone https://github.com/open2c/pairtools.git
$ cd pairtools
$ pip install -e .

Citing `pairtools`

Open2C*, Nezar Abdennur, Geoffrey Fudenberg, Ilya M. Flyamer, Aleksandra A. Galitsyna*, Anton Goloborodko*, Maxim Imakaev, Sergey V. Venev. "Pairtools: from sequencing data to chromosome contacts" bioRxiv, February 13, 2023. ; doi: https://doi.org/10.1101/2023.02.13.528389

License

MIT

pairtools's People

Contributors

Stargazers

Watchers

pairtools's Issues

calculate Pc(s) in stats

we can calculate Pc(s) during stats computation. We'd need chromsizes (can take from pairs header, or supply an external file) and centromeres/gaps (can take from a curated database/online/external file).
via @Phlya

make distiller leave a comment in the header section of pairsam

multimer compatability?

maybe not worth thinking about... but maybe it is? it looks like the pairsam should be flexible enough to extend to the case of things like c-walks?

make tests

add docs in readme

in parse, flip pairs according to the semantic order, provided via a chrom.sizes file

optical dedup stats

Enhancement:

readid contains info about spatial position of the physical read (short DNA molecule) on a sequencing flowcell.
This information can be used to find out how many of the observed duplicates are originating from PCR preceding sequencing step, and how many have been formed in the flowcell due to non-ideal conditions/read migration etc.

Proposition:

we can add an onlineDeduplicator-style buffer in the stats module that would keep track of incoming read-pairs and would start accumulating them as soon as it hits a DD-one, until the next one in the buffer in non-DD-one:

rid  c1   c2   p1   p2   s1   s2  pt  index
.    .    .    .    .    .    .   LL    1
.    .    .    .    .    .    .   DD    2
.    .    .    .    .    .    .   DD    3
.    .    .    .    .    .    .   DD    4
.    .    .    .    .    .    .   DD    5
.    .    .    .    .    .    .   LL    6

so with the input as above the content of the buffer would include pairs with index from 1-5 (including).

read_id (rid) column of that buffer can be further used to cluster pairs by their X,Y - position on the flowcell and to generate corresponding statistics.
We could use the read-clustering algorithm developed by Betul @betulakgol in the lab, and would have to optional because it is not really known, if readid of "any" sequencer would have such information and if it is in the same exact format.

Software:

from coding perspective, we would only need to make stats.add_pair(c1,p1,s1,c2,p2,s2,pt) to accept a read-id as well stats.add_pair(rid,c1,p1,s1,c2,p2,s2,pt) (either permanently or optionally - how? dunno ...) and implement that onlineDedup-style buffer inside the stats, so that everyhting will be nice and hidden from the dedup perspective.

stats are becoming pretty heavy at that point, and some Cythonization might be applied to it (split stats into _stats.pyx with the class and methods and pairsamtools_stats.py - a stand alone script capable of doing stats).

Seems like a doable and not very intrusive thing, which I already discussed with @hakanozadam,
@nvictus, @golobor - what do you think?

Versioning Pairsamtools

I think it would be nice to start versioning pairsamtools. I am seeing differences in functions and APIs between the versions of pairsamtools from May and December 2017. With versioning, we can impose a particular version on tools & pipelines using pairsamtools. Right now, in pairsamtools, I see

__version__ = '0.0.1-dev'

make a conda recipe

merge pairsam-related tools into pairsam tools

-- use click

in all pairsamtools, have a unified way to provide/override column indices and the separator

Eg have a flag --col NAME,index that can be repeated multiple types.

pairsamtools merge fails when symbolic links are provided as an input

Here is an example:
pairsamtools merge pairsam_chunk* --nproc 2 -o MATa_R2.lane1.pairsam.gz
where
pairsam_chunk1 -> /some_path/intermediates/pairsam/chunks/MATa_R2.lane1.01.pairsam.gz pairsam_chunk2 -> /some_path/intermediates/pairsam/chunks/MATa_R2.lane1.00.pairsam.gz
fail with the message:

...
File "/home/polzovatel/miniconda3/lib/python3.6/site-packages/pairsamtools/pairsam_merge.py", line 59, in merge_py
      h, _ = _headerops.get_header(f)
File "/home/polzovatel/miniconda3/lib/python3.6/site-packages/pairsamtools/_headerops.py", line 38, in get_header
    for line in instream:
File "/home/polzovatel/miniconda3/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

While running

pairsamtools merge /some_path/intermediates/pairsam/chunks/MATa_R2.lane1.0* --nproc 2 -o MATa_R2.lane1.pairsam.gz

yields non-empty MATa_R2.lane1.pairsam.gz

enable optional reporting of 5' alignments for non-rescuable chimeras

Drop_seq corrupts Sam entries

allow specifying column order for sorting and other tools

add a test for the example_pipeline

parse throws an error when a chromosome is present in the chromsizes, but not in the bwa index (== sam header).

in stats, report cis_1kb+, cis_3kb+, cis_10kb+, cis_30kb+

comply with the DCIC pairs standard

-- use DCIC pairs header
-- change the order of the columns
-- lump all sams into a single column

add non-additive stat summaries to dedup/stats: complexity estimation, cis/total, ...

use line.rstrip() to avoid bugs

line might not end with newline symbol, so strip it instead of line[:-1] ...

https://github.com/mirnylab/pairsamtools/blob/cd269671cc493819b68c021a2b4300abc1f34bdc/pairsamtools/pairsam_select.py#L138

add restriction tool

make pairsamtools conda-installable

in merge, add an option to take a header from the first file and ignore other headers

add mapq filters, possibly at the later stage

pairsamtools merge single file no action?

Wouldn't it be a good approach not to do anything with an input file, in case pairsamtools merge is provided with only a single file ?

Right now it's up to distiller to check if it's a single-file input. Single files are treated differently - they're copied by distiller and are left untouched.
pairsamtools merge however does something to the pairsam header, even if it's just a single file and there is nothing to merge. Modified headers cause downstream issues in case of hierarchical merge implemented in distiller. Dealing with that in pairsamtools merge would make distiller code cleaner.

make a splitted version of distiller.sh for step-wise execution

keep unmapped/singlesided/multimapped/abnormal chimeras in pairsam, split later

flip and resort pairs in sort

in parse, add other options to store (CIGAR, genomic span, NM, AS, XA, etc)

In merge,change the default number of threads per infile to 1

make auto_open default to stdin/stdout when path evaluates to False

add an option to use other compressors for input/output (lz4, etc)

Trim column list in header after split

After splitting off SAMs from pairs the #columns header still reads:

#columns: readID chrom1 pos1 chrom2 pos2 strand1 strand2 pair_type sam1 sam2

adding stats to dedup

In order to address open2c/distiller-nf#80 we agreed to add stats to the dedup.

And this: https://github.com/mirnylab/pairsamtools/blob/e725dbbd037f169a5def3891783f7b2cf3922463/pairsamtools/pairsam_dedup.py#L315 looks like a proper place to add something like out_stat.add_pair(algn2, algn1, pair_type)

Techincal Qs:

should we parse line_buffer[i] right next to where we're writing it to outstream?
should we instead activate cols_buffer[i] (used for mark duplicates) for stats and skip parsing line_buffer[i]?
stats-related: should we change stats-API, and instead of out_stat.add_pair(algn2, algn1, pair_type), say out_stat.add_pair(c2, p2, s2, c1, p1, s1, pair_type), where algn={'chrom':c,'pos':p,'strand':s}?
Such an API change would simplify stats module itself, here https://github.com/mirnylab/pairsamtools/blob/e725dbbd037f169a5def3891783f7b2cf3922463/pairsamtools/pairsam_stats.py#L78
It was written the way it is right now, in order to please and simplify parse-code, i.e. avoid unpacking align dictionaries there.

What do you @golobor , @nvictus guys think ?

We'll try to debug this, but in the meantime, maybe you guys could provide some extra insight into this?