Giter Site home page Giter Site logo

open2c / pairtools Goto Github PK

View Code? Open in Web Editor NEW
95.0 11.0 33.0 3.33 MB

Extract 3D contacts (.pairs) from sequencing alignments

License: MIT License

Python 96.75% Makefile 0.18% Cython 3.07%
3d-genome bioinformatics file-formatter hi-c ngs python pairs-file

pairtools's Introduction

pairtools

Documentation Status Build Status Join the chat on Slack DOI

Process Hi-C pairs with pairtools

pairtools is a simple and fast command-line framework to process sequencing data from a Hi-C experiment.

pairtools process pair-end sequence alignments and perform the following operations:

  • detect ligation junctions (a.k.a. Hi-C pairs) in aligned paired-end sequences of Hi-C DNA molecules
  • sort .pairs files for downstream analyses
  • detect, tag and remove PCR/optical duplicates
  • generate extensive statistics of Hi-C datasets
  • select Hi-C pairs given flexibly defined criteria
  • restore .sam alignments from Hi-C pairs
  • annotate restriction digestion sites
  • get the mutated positions in Hi-C pairs

To get started:

Data formats

pairtools produce and operate on tab-separated files compliant with the .pairs format defined by the 4D Nucleome Consortium. All pairtools properly manage file headers and keep track of the data processing history.

Additionally, pairtools define the .pairsam format, an extension of .pairs that includes the SAM alignments of a sequenced Hi-C molecule. .pairsam complies with the .pairs format, and can be processed by any tool that operates on .pairs files.

pairtools produces a set of additional extra columns, which describe properties of alignments, phase, mutations, restriction and complex walks. The full list of possible extra columns is provided in the pairtools format specification.

Installation

Requirements:

  • Python 3.x
  • Python packages cython, pysam, bioframe, pyyaml, numpy, scipy, pandas and click.
  • Command-line utilities sort (the Unix version), bgzip (shipped with samtools) and samtools. If available, pairtools can compress outputs with pbgzip and lz4.

For the full list of recommended versions, see requirements in the the GitHub repo.

We highly recommend using the conda package manager to install pairtools together with all its dependencies. To get it, you can either install the full Anaconda Python distribution or just the standalone conda package manager.

With conda, you can install pairtools and all of its dependencies from the bioconda channel.

$ conda install -c conda-forge -c bioconda pairtools

Alternatively, install non-Python dependencies and pairtools with Python-only dependencies from PyPI using pip:

$ pip install numpy pysam cython
$ pip install pairtools

Quick example

Setup a new test folder and download a small Hi-C dataset mapped to sacCer3 genome:

$ mkdir /tmp/test-pairtools
$ cd /tmp/test-pairtools
$ wget https://github.com/open2c/distiller-test-data/raw/master/bam/MATalpha_R1.bam

Additionally, we will need a .chromsizes file, a TAB-separated plain text table describing the names, sizes and the order of chromosomes in the genome assembly used during mapping:

$ wget https://raw.githubusercontent.com/open2c/distiller-test-data/master/genome/sacCer3.reduced.chrom.sizes

With pairtools parse, we can convert paired-end sequence alignments stored in .sam/.bam format into .pairs, a TAB-separated table of Hi-C ligation junctions:

$ pairtools parse -c sacCer3.reduced.chrom.sizes -o MATalpha_R1.pairs.gz --drop-sam MATalpha_R1.bam 

Inspect the resulting table:

$ less MATalpha_R1.pairs.gz

Pipelines

  • We provide a simple working example of a mapping bash pipeline in /examples/.
  • distiller is a powerful Hi-C data analysis workflow, based on pairtools and nextflow.

Tools

  • parse: read .sam/.bam files produced by bwa and form Hi-C pairs

    • form Hi-C pairs by reporting the outer-most mapped positions and the strand on the either side of each molecule;
    • report unmapped/multimapped (ambiguous alignments)/chimeric alignments as chromosome "!", position 0, strand "-";
    • perform upper-triangular flipping of the sides of Hi-C molecules such that the first side has a lower sorting index than the second side;
    • form hybrid pairsam output, where each line contains all available data for one Hi-C molecule (outer-most mapped positions on the either side, read ID, pair type, and .sam entries for each alignment);
    • report .sam tags or mutations of the alignments;
    • print the .sam header as #-comment lines at the start of the file.
  • parse2: read .sam/.bam files with long paired-and or single-end reads and form Hi-C pairs from complex walks

    • identify and rescue chrimeric alignments produced by singly-ligated Hi-C molecules with a sequenced ligation junction on one of the sides;
    • annotate chimeric alignments by restriction fragments and report true junctions and hops (One-Read-Based Interactions Annotation, ORBITA);
    • perform intra-molecule deduplication of paired-end data when one side reads through the DNA on the other side of the read;
    • report index of the pair in the complex walk;
    • make combinatorial expansion of pairs produced from the same walk;
  • sort: sort pairs files (the lexicographic order for chromosomes, the numeric order for the positions, the lexicographic order for pair types).

  • merge: merge sorted .pairs files

    • merge sort .pairs;
    • combine the .pairs headers from all input files;
    • check that each .pairs file was mapped to the same reference genome index (by checking the identity of the @SQ sam header lines).
  • select: select pairs according to specified criteria

    • select pairs entries according to the provided condition. A programmable interface allows for arbitrarily complex queries on specific pair types, chromosomes, positions, strands, read IDs (including matches to a wildcard/regexp/list).
    • optionally print the non-matching entries into a separate file.
  • dedup: remove PCR duplicates from a sorted triu-flipped .pairs file

    • remove PCR duplicates by finding pairs of entries with both sides mapped to similar genomic locations (+/- N bp);
    • optionally output the PCR duplicate entries into a separate file;
    • detect optical duplicates from the original Illumina read ids;
    • apply filtering by various properties of pairs (MAPQ; orientation; distance) together with deduplication;
    • output yaml or convenient tsv deduplication stats into text file.
    • NOTE: in order to remove all PCR duplicates, the input must contain *all* mapped read pairs from a single experimental replicate;
  • maskasdup: mark all pairs in a pairsam as Hi-C duplicates

    • change the field pair_type to DD;
    • change the pair_type tag (Yt:Z:) for all sam alignments;
    • set the PCR duplicate binary flag for all sam alignments (0x400).
  • split: split a .pairsam file into .pairs and .sam.

  • flip: flip pairs to get an upper-triangular matrix

  • header: manipulate the .pairs/.pairsam header

    • generate new header for headerless .pairs file
    • transfer header from one .pairs file to another
    • set column names for the .pairs file
    • validate that the header corresponds to the information stored in .pairs file
  • stats: calculate various statistics of .pairs files

  • restrict: identify the span of the restriction fragment forming a Hi-C junction

  • phase: phase pairs mapped to a diploid genome

Contributing

Pull requests are welcome.

For development, clone and install in "editable" (i.e. development) mode with the -e option. This way you can also pull changes on the fly.

$ git clone https://github.com/open2c/pairtools.git
$ cd pairtools
$ pip install -e .

Citing pairtools

Open2C*, Nezar Abdennur, Geoffrey Fudenberg, Ilya M. Flyamer, Aleksandra A. Galitsyna*, Anton Goloborodko*, Maxim Imakaev, Sergey V. Venev. "Pairtools: from sequencing data to chromosome contacts" bioRxiv, February 13, 2023. ; doi: https://doi.org/10.1101/2023.02.13.528389

License

MIT

pairtools's People

Contributors

agalitsyna avatar gfudenberg avatar golobor avatar hbbrandao avatar itsameerkat avatar nvictus avatar phlya avatar sergpolly avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

pairtools's Issues

calculate Pc(s) in stats

we can calculate Pc(s) during stats computation. We'd need chromsizes (can take from pairs header, or supply an external file) and centromeres/gaps (can take from a curated database/online/external file).
via @Phlya

multimer compatability?

maybe not worth thinking about... but maybe it is? it looks like the pairsam should be flexible enough to extend to the case of things like c-walks?

optical dedup stats

Enhancement:

readid contains info about spatial position of the physical read (short DNA molecule) on a sequencing flowcell.
This information can be used to find out how many of the observed duplicates are originating from PCR preceding sequencing step, and how many have been formed in the flowcell due to non-ideal conditions/read migration etc.

Proposition:

we can add an onlineDeduplicator-style buffer in the stats module that would keep track of incoming read-pairs and would start accumulating them as soon as it hits a DD-one, until the next one in the buffer in non-DD-one:

rid  c1   c2   p1   p2   s1   s2  pt  index
.    .    .    .    .    .    .   LL    1
.    .    .    .    .    .    .   DD    2
.    .    .    .    .    .    .   DD    3
.    .    .    .    .    .    .   DD    4
.    .    .    .    .    .    .   DD    5
.    .    .    .    .    .    .   LL    6

so with the input as above the content of the buffer would include pairs with index from 1-5 (including).

read_id (rid) column of that buffer can be further used to cluster pairs by their X,Y - position on the flowcell and to generate corresponding statistics.
We could use the read-clustering algorithm developed by Betul @betulakgol in the lab, and would have to optional because it is not really known, if readid of "any" sequencer would have such information and if it is in the same exact format.

Software:

from coding perspective, we would only need to make stats.add_pair(c1,p1,s1,c2,p2,s2,pt) to accept a read-id as well stats.add_pair(rid,c1,p1,s1,c2,p2,s2,pt) (either permanently or optionally - how? dunno ...) and implement that onlineDedup-style buffer inside the stats, so that everyhting will be nice and hidden from the dedup perspective.

stats are becoming pretty heavy at that point, and some Cythonization might be applied to it (split stats into _stats.pyx with the class and methods and pairsamtools_stats.py - a stand alone script capable of doing stats).

Seems like a doable and not very intrusive thing, which I already discussed with @hakanozadam,
@nvictus, @golobor - what do you think?

Versioning Pairsamtools

I think it would be nice to start versioning pairsamtools. I am seeing differences in functions and APIs between the versions of pairsamtools from May and December 2017. With versioning, we can impose a particular version on tools & pipelines using pairsamtools. Right now, in pairsamtools, I see

__version__ = '0.0.1-dev'

pairsamtools merge fails when symbolic links are provided as an input

Here is an example:
pairsamtools merge pairsam_chunk* --nproc 2 -o MATa_R2.lane1.pairsam.gz
where
pairsam_chunk1 -> /some_path/intermediates/pairsam/chunks/MATa_R2.lane1.01.pairsam.gz pairsam_chunk2 -> /some_path/intermediates/pairsam/chunks/MATa_R2.lane1.00.pairsam.gz
fail with the message:

...
File "/home/polzovatel/miniconda3/lib/python3.6/site-packages/pairsamtools/pairsam_merge.py", line 59, in merge_py
      h, _ = _headerops.get_header(f)
File "/home/polzovatel/miniconda3/lib/python3.6/site-packages/pairsamtools/_headerops.py", line 38, in get_header
    for line in instream:
File "/home/polzovatel/miniconda3/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

While running

pairsamtools merge /some_path/intermediates/pairsam/chunks/MATa_R2.lane1.0* --nproc 2 -o MATa_R2.lane1.pairsam.gz

yields non-empty MATa_R2.lane1.pairsam.gz

pairsamtools merge single file no action?

Wouldn't it be a good approach not to do anything with an input file, in case pairsamtools merge is provided with only a single file ?

  • Right now it's up to distiller to check if it's a single-file input. Single files are treated differently - they're copied by distiller and are left untouched.

  • pairsamtools merge however does something to the pairsam header, even if it's just a single file and there is nothing to merge. Modified headers cause downstream issues in case of hierarchical merge implemented in distiller. Dealing with that in pairsamtools merge would make distiller code cleaner.

adding stats to dedup

In order to address open2c/distiller-nf#80 we agreed to add stats to the dedup.

And this: https://github.com/mirnylab/pairsamtools/blob/e725dbbd037f169a5def3891783f7b2cf3922463/pairsamtools/pairsam_dedup.py#L315 looks like a proper place to add something like out_stat.add_pair(algn2, algn1, pair_type)

Techincal Qs:

  • should we parse line_buffer[i] right next to where we're writing it to outstream?
  • should we instead activate cols_buffer[i] (used for mark duplicates) for stats and skip parsing line_buffer[i]?
  • stats-related: should we change stats-API, and instead of out_stat.add_pair(algn2, algn1, pair_type), say out_stat.add_pair(c2, p2, s2, c1, p1, s1, pair_type), where algn={'chrom':c,'pos':p,'strand':s}?
    Such an API change would simplify stats module itself, here https://github.com/mirnylab/pairsamtools/blob/e725dbbd037f169a5def3891783f7b2cf3922463/pairsamtools/pairsam_stats.py#L78
    It was written the way it is right now, in order to please and simplify parse-code, i.e. avoid unpacking align dictionaries there.

What do you @golobor , @nvictus guys think ?

`parse_sort` takes forever when reads mapped to scaffold

This is just a preliminary thing - more of a question rather than issue:
In case we have a 30'000 pieces scaffold instead of a genome with ~20 well defined chromosomes, would you guys expect any "delays" with the parsing step?

In our experience it took more than a day on 100' million read chunks, and after that we killed it. Same step takes minutes for hg19 and ~10-30 million reads chunks.

We'll try to debug this, but in the meantime, maybe you guys could provide some extra insight into this?

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.