Giter Site home page Giter Site logo

schictools's Introduction

scHiCTools

Summary

A computational toolbox for analyzing single cell Hi-C (high-throughput sequencing for 3C) data which includes functions for:

  1. Loading single-cell HiC datasets
  2. Screening valid single-cell data.
  3. Smoothing the contact maps with linear convolution, random walk or network enhancing
  4. Calculating pairwise similarity using measures include InnerProduct, HiCRep and Selfish
  5. Calculating embeddings for single cell HiC datasets efficiently with MDS, t-SNE and PHATE
  6. Clustering the cells using scHiCluster, k-means and spectral clustering.
  7. Visualizing embedded cells via 2-D or 3-D scatter plot.

Installation

Required Python Packages

  • Python (version >= 3.6)
  • numpy (version >= 1.15.4)
  • scipy (version >= 1.0)
  • matplotlib (version >=3.1.1)
  • pandas (version >=0.19)
  • simplejson
  • six
  • h5py

** interactive_scatter feature requirement**

  • plotly (version >= 4.8.0)

Install from GitHub

You can install the package with following command:

  $ git clone https://github.com/liu-bioinfo-lab/scHiCTools.git
  $ cd scHiCTools
  $ python setup.py install

Install from PyPI

  $ pip install scHiCTools

Install optional interactive dependencie

  $ pip install scHiCTools[interactive_scatter]

or

  $ pip install -e .[interactive_scatter]

Usage

Supported Formats

  • Pre-processed Matrices: If the data is already processed into matrices for intra-chromosomal contacts, the chromosome from the same cell must be stored in the same folder with chromosome names as file names (e.g., scHiC/cell_1/chr1.txt). You only need to provide the folder name for a cell (e.g., scHiC/cell_1).

    • npy: numpy.array / numpy.matrix
    • npz: scipy.sparse.coo_matrix
    • matrix: matrix stored as pure text
    • matrix_txt: matrix stored as .txt file
    • HiCRep: the format required by HiCRep package
  • Edge List
    For all formats below:
      str - strand (forward / reverse)
      chr - chromosome
      pos - position
      score - contact reads
      frag - fragments (will be ignored)
      mapq - map quality

    • Shortest
    <chr1> <pos1> <chr2> <pos2>
    
    • Shortest_Score
    <chr1> <pos1> <chr2> <pos2> <score>
    
    • Short
    <str1> <chr1> <pos1> <frag1> <str2> <chr2> <pos2> <frag2>
    
    • Short_Score
    <str1> <chr1> <pos1> <frag1> <str2> <chr2> <pos2> <frag2> <score>
    
    • Medium
    <readname> <str1> <chr1> <pos1> <frag1> <str2> <chr2> <pos2> <frag2> <mapq1> <mapq2>
    
    • Long
    <str1> <chr1> <pos1> <frag1> <str2> <chr2> <pos2> <frag2> <mapq1> <cigar1> <sequence1> <mapq2> <cigar2> <sequence2> <readname1> <readname2>
    
    • 4DN
    ## pairs format v1.0
    #columns: readID chr1 position1 chr2 position2 strand1 strand2
    
  • .hic format: we adapted "straw" from JuiceTools.

  • .mcool format: we adapted "dump" from cool.

  • Other formats: simply give the indices (start from 1) in the order of
    "chromosome1 - position1 - chromosome2 - position2 - score" or
    "chromosome1 - position1 - chromosome2 - position2" or
    "chromosome1 - position1 - chromosome2 - position2 - mapq1 - mapq2".
    For example, you can provide "2356" or [2, 3, 5, 6] if the file takes this format:

<name> <chromosome1> <position1> <frag1> <chromosome2> <position2> <frag2> <strand1> <strand2>
contact_1 chr1 3000000 1 chr1 3001000 1 + -

Import Package

>>>import scHiCTools

Load scHiC data

The scHiC data is stored in a series of files, with each of the files corresponding to one cell. You need to specify the list of scHiC file paths.

Only intra-chromosomal interactions are counted.

>>>from scHiCTools import scHiCs
>>>files = ['test/data/cell_01', 'test/data/cell_02', 'test/data/cell_03']
>>>loaded_data = scHiCs(
... files, reference_genome='mm9',
... resolution=500000, keep_n_strata=10,
... format='customized', adjust_resolution=True,
... customized_format=12345, header=0, chromosomes='except Y',
... operations=['OE_norm', 'convolution']
... )
  • reference genome (dict or str): now supporting 'mm9', 'mm10', 'hg19', 'hg38'. If your reference genome is not in ['mm9', 'mm10', 'hg19', 'hg38'], you need to provide the lengths of chromosomes you are going to use with a Python dict. e.g. {'chr1': 150000000, 'chr2': 130000000, 'chr3': 200000000}
  • resolution (int): the resolution to separate genome into bins. If using .hic file format, the given resolution must match with the resolutions in .hic file.
  • keep_n_strata (None or int): only store contacts within n strata near the diagonal. Default: 10. If 'None', it will not store strata
  • store_full_map (bool): whether store full contact maps in numpy matrices or scipy sparse matrices,If False, it will save memory.
  • sparse (bool): whether to use sparse matrices
  • format (str): file format, supported formats: 'shortest', 'shortest_score', 'short', 'short_score' , 'medium', 'long', '4DN', '.hic', '.mcool', 'npy', 'npz', 'matrix', 'HiCRep', 'matrix_txt' and 'customized'. Default: 'customized'.
  • customized_format (int or str or list): the column indices in the order of "chromosome 1 - position 1 - chromosome 2 - position 2 - contact reads" or "chromosome 1 - position 1 - chromosome 2 - position 2" or "chromosome 1 - position 1 - chromosome 2 - position 2 - map quality 1 - map quality 2". e.g. if the line is "chr1 5000000 chr2 3500000 2", the format should be '12345' or [1, 2, 3, 4, 5]; if there is no column indicating number of reads, you can just provide 4 numbers like '1234', and contact read will be set as 1. Default: '12345'.
  • adjust_resolution: whether to adjust resolution for the input file. Sometimes the input file is already in the proper resolution (e.g. position 3000000 has already been changed to 6 with 500kb resolution). For this situation you can set adjust_resolution=False. Default: True.
  • map_filter (float): keep all contacts with mapq higher than this threshold. Default: 0.0
  • header (int): how many header lines does the file have. Default: 0.
  • chromosomes (list or str): chromosomes to use, eg. ['chr1', 'chr2'], or just 'except Y', 'except XY', 'all'. Default: 'all', which means chr 1-19 + XY for mouse and chr 1-22 + XY for human.
  • operations (list or None): the operations use for pre-processing or smoothing the maps given in a list. The operations will happen in the given order. Supported operations: 'logarithm', 'power', 'convolution', 'random_walk', 'network_enhancing', 'OE_norm', 'VC_norm', 'VC_SQRT_norm', 'KR_norm'。 Default: None.
  • For preprocessing and smoothing operations, sometimes you need additional arguments (introduced in next sub-section).

You can also skip pre-processing and smoothing in loading step (operations=None), and do them in next lines.

Plot number of contacts and select cells

You can plot the number of contacts of your cells.

>>>import matplotlib.pyplot as plt
>>>loaded_data.plot_contacts(hist=True, percent=True)
>>>plt.show()

If hist is True, plot Histogram of the number of contacts. If percent is True, plot the scatter plot of cells with of short-range contacts (< 2 Mb) versus contacts at the mitotic band (2-12 Mb).

You can select cells based on number of contacts:

>>>loaded_data.select_cells(min_n_contacts=10000,max_short_range_contact=0.99)

The above command select cells have number of contacts bigger than 10000 and percent of short range contacts small than .99.

Pre-processing and Smoothing Operations Stand alone pre-processing and smoothing:

>>>loaded_data.processing(['random_walk', 'network_enhancing'])

If you didn't store full map (i.e. store_full_map=False), processing is not doable in a separate step.

  • logarithm: new_W_ij = log_(base) (W_ij + epsilon). Additional arguments:

    • log_base: default: e
    • epsilon: default: 1
  • power: new_W_ij = (W_ij)^pow. Additional argument:

    • pow: default: 0.5 (i.e., sqrt(W_ij))
  • VC_norm: VC normalization - each value divided by the sum of corresponding row then divided by the sum of corresponding column

  • VC_SQRT_norm: VC_SQRT normalization - each value divided by the sqrt of the sum of corresponding row then divided by the sqrt of the sum of corresponding column

  • KR_norm: KR normalization - iterating until the sum of each row / column is one Argument:

    • maximum_error_rate (float): iteration ends when max error is smaller than (maximum_error_rate). Default: 1e-4
  • OE_norm: OE normalization - each value divided by the average of its corresponding strata (diagonal line)

  • convolution: smoothing with a N by N convolution kernel, with each value equal to 1/N^2. Argument:

    • kernel_shape: an integer. e.g. kernel_shape=3 means a 3*3 matrix with each value = 1/9. Default: 3.
  • Random walk: multiply by a transition matrix (also calculated from contact map itself). Argument:

    • random_walk_ratio: a value between 0 and 1, e.g. if ratio=0.9, the result will be 0.9 * matrix_after_random_walk + 0.1 * original_matrix. Default: 1.0.
  • Network enhancing: transition matrix only comes from k-nearest neighbors of each line. Arguments:

    • kNN: value 'k' in kNN. Default: 20.
    • iterations: number of iterations for network enhancing. Default: 1
    • alpha: similar with random_walk_ratio. Default: 0.9

Learn Embeddings

>>>embs = loaded_data.learn_embedding(
... dim=2, similarity_method='inner_product', embedding_method='MDS',
... n_strata=None, aggregation='median', return_distance=False
... )

This function will return the embeddings in the format of a numpy array with shape ( # of cells, # of dimensions).

  • dim (int): the dimension for the embedding
  • similarity_method (str): reproducibility measure, 'InnerProduct', 'HiCRep' or 'Selfish'. Default: 'InnerProduct'
  • embedding_method (str): 'MDS', 'tSNE' or 'UMAP'
  • n_strata (int): only consider contacts within this genomic distance. Default: None. If it is None, it will use the all strata kept (the argument keep_n_strata from previous loading process). Thus n_strata and keep_n_strata (loading step) cannot be None at the same time.
  • aggregation (str): method to aggregate different chromosomes, 'mean' or 'median'. Default: 'median'.
  • return_distance (bool): if True, return (embeddings, distance_matrix); if False, only return embeddings. Default: False.
  • Some additional argument for Selfish:
    • n_windows (int): split contact map into n windows, default: 10
    • sigma (float): sigma in the Gaussian-like kernel: default: 1.6

Clustering

There are two functions to cluster cells.

>>>label=loaded_data.clustering(
... n_clusters=4, clustering_method='kmeans', similarity_method='innerproduct',
... aggregation='median', n_strata=None
... )

clustering function returns a numpy array of cell labels clustered.

  • n_clusters (int): Number of clusters.

  • clustering_method (str): Clustering method in 'kmeans', 'spectral_clustering' or 'HAC'(hierarchical agglomerative clustering).

  • similarity_method (str): Reproducibility measure. Value in ‘InnerProduct’, ‘HiCRep’ or ‘Selfish’.

  • aggregation (str): Method to aggregate different chromosomes. Value is either 'mean' or 'median'. Default: 'median'.

  • n_strata (int or None): Only consider contacts within this genomic distance. If it is None, it will use the all strata kept (the argument keep_n_strata) from previous loading process. Default: None.

  • print_time (bool): Whether to print the processing time. Default: False.

>>>hicluster=loaded_data.scHiCluster(dim=2,cutoff=0.8,n_PCs=10,k=4)

scHiCluster function returns two componments. First componment is a numpy array of embedding of cells using HiCluster. Second componment is a numpy of cell labels clustered by HiCluster.

  • dim (int): Number of dimension of embedding. Default: 2.

  • cutoff (float): The cutoff proportion to convert the real contact matrix into binary matrix. Default: 0.8.

  • n_PCs (int): Number of principal components. Default: 10.

  • k (int): Number of clusters. Default: 4.

Visualization

>>>import matplotlib.pyplot as plt
>>>scatter(data, dimension="2D", point_size=3, sty='default',
... label=None, title=None, alpha=None, aes_label=None
... )
>>>plt.show()

This function is to plot scatter plot of embedding points of single cell data. Scatter plot of either two-dimensions or three-dimensions will be generated.

  • data (numpy.array): A numpy array which has 2 or 3 columns, every row represent a point.

  • dimension (str): Specifiy the dimension of the plot, either "2D" or "3D". Default: "2D".

  • point_size (float): Set the size of the points in scatter plot. Default: 3.

  • sty (str): Styles of Matplotlib. Default: 'default'.

  • label (list or None): specifiy the label of each point. Default: None.

  • title (str): Title of the plot. Default: None.

  • alpha (float): The alpha blending value. Default: None.

  • aes_label (list): Set the label of every axis. Default: None.

"scHiCTools" also support interactive scatter plot which require the module 'plotly'

>>>interactive_scatter(loaded_data, data, out_file, dimension='2D', point_size=3,
... label=None, title=None, alpha=1, aes_label=None)

This function is to generate an interactive scatter plot of embedded single cell data. The plot will be stored in a file.

  • schic (scHiCs): A scHiCs object.
  • data (numpy.array): A numpy array which has 2 or 3 columns, every row represent a point.
  • out_file (str): Output file path.
  • dimension (str): Specifiy the dimension of the plot, either "2D" or "3D". The default is "2D".
  • point_size (float): Set the size of the points in scatter plot. The default is 3.
  • label (list or None): Specifiy the label of each point. The default is None.
  • title (str): Title of the plot. The default is None.
  • alpha (float): The alpha blending value. The default is 1.
  • aes_label (list): Set the label of every axis. The default is None.

Citation

Xinjun Li, Fan Feng, Wai Yan Leung and Jie Liu. "scHiCTools: a computational toolbox for analyzing single cell Hi-C data." bioRxiv (2019): 769513.

References

A. R. Ardakany, F. Ay, and S. Lonardi. Selfish: Discovery of differential chromatininteractions via a self-similarity measure.bioRxiv, 2019.

N. C. Durand, J. T. Robinson, M. S. Shamim, I. Machol, J. P. Mesirov, E. S. Lander, and E. Lieberman Aiden. "Juicebox provides a visualization system for Hi-C contact maps with unlimited zoom." Cell Systems 3(1), 2016.

J. Liu, D. Lin, G. Yardimci, and W. S. Noble. Unsupervised embedding of single-cellHi-C data.Bioinformatics, 34:96–104, 2018.

T. Nagano, Y. Lubling, C. Várnai, C. Dudley, W. Leung, Y. Baran, N. M. Cohen,S. Wingett, P. Fraser, and A. Tanay. Cell-cycle dynamics of chromosomal organization at single-cell resolution.Nature, 547:61–67, 2017.

B. Wang, A. Pourshafeie, M. Zitnik, J. Zhu, C. D. Bustamante, S. Batzoglou, andJ. Leskovec. Network enhancement as a general method to denoise weighted biological networks.Nature Communications, 9(1):3108, 2018.

T. Yang, F. Zhang, G. G. Y. mcı, F. Song, R. C. Hardison, W. S. Noble, F. Yue, andQ. Li. HiCRep: assessing the reproducibility of Hi-C data using a stratum-adjusted correlation coefficient.Genome Research, 27(11):1939–1949, 2017.

G. G. Yardımcı, H. Ozadam, M. E. Sauria, O. Ursu, K. K. Yan, T. Yang,A. Chakraborty, A. Kaul, B. R. Lajoie, F. Song, et al. Measuring the reproducibilityand quality of hi-c data.Genome Biology, 20(1):57, 2019.

schictools's People

Contributors

dmcbffeng avatar liu-bioinfo-lab avatar xinjunum avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar

schictools's Issues

Questions in scHiCTools

Hi,
Thank you for this great software! However, I encountered two problems right now.

The first is that I don't really understand the format of the input data. Right now I have txt files looking like the input of "scHiCluster", which is like below:
1 1 9
1 2 1
.....
The first two columns are positions, and the third is count number. Which format does this belong to? matrix_txt?

The second question is that in my code, I wrote explicitly that chromosomes='chr1' (and I only have data in chr1 in that folder.), however, it asks for chr2. Did I do something wrong?

loaded_data = scHiCs(
files, sparse=True, reference_genome='mm9',
resolution=1000000, keep_n_strata=10,
format='matrix_txt', adjust_resolution=False, header=0, chromosomes='chr1',
operations=['convolution']
)

Processing 1 out of 1 files: C:/Users/solei/Downloads/Ren/FreeHiC_sc/2i
Traceback (most recent call last):
File "", line 5, in
File "C:\Users\solei\miniconda3\envs\scHiCTools\lib\site-packages\schictools-0.0.3-py3.7.egg\scHiCTools\load\ContactMaps.py", line 142, in init
**kwargs)
File "C:\Users\solei\miniconda3\envs\scHiCTools\lib\site-packages\schictools-0.0.3-py3.7.egg\scHiCTools\load\load_hic_file.py", line 166, in load_HiC
mat = np.loadtxt('{0}/{1}.txt'.format(file, chromosome))
File "C:\Users\solei\miniconda3\envs\scHiCTools\lib\site-packages\numpy-1.19.1-py3.7-win-amd64.egg\numpy\lib\npyio.py", line 961, in loadtxt
fh = np.lib._datasource.open(fname, 'rt', encoding=encoding)
File "C:\Users\solei\miniconda3\envs\scHiCTools\lib\site-packages\numpy-1.19.1-py3.7-win-amd64.egg\numpy\lib_datasource.py", line 195, in open
return ds.open(path, mode, encoding=encoding, newline=newline)
File "C:\Users\solei\miniconda3\envs\scHiCTools\lib\site-packages\numpy-1.19.1-py3.7-win-amd64.egg\numpy\lib_datasource.py", line 535, in open
raise IOError("%s not found." % path)
OSError: C:/Users/solei/Downloads/Ren/FreeHiC_sc/2i/chr2.txt not found.

Yours Sincerely,
Siqi Shen

ValueError: negative dimensions are not allowed

files = ['MII-13.allValidPairs.hic', 'MII-20.allValidPairs.hic']

Sus_scrofa = {'1': 274330532, '2': 151935994, '3': 132848913, '4':130910915, '5':104526007, '6':170843587, '7':121844099,'8':138966237,'9':139512083,'10':69359453,
'11':79169978,'12':61602749,'13':208334590,'14':141755446,'15':140412725,'16':79944280,'17':63494081,'18':55982971,'X':125939595,'Y':43547828,'MT':16613}

loaded_data = scHiCs(files, reference_genome=Sus_scrofa, resolution=40000,
max_distance=4000000, format='.hic', adjust_resolution=True,
chromosomes='all', operations=['convolution'], kernel_shape=3,
keep_n_strata=10, store_full_map=True)
assert x.num_of_cells==len(files)
error
Hello, I encountered this problem when running the above code, may I ask if I have misunderstood somewhere, and how should I solve it?

About creating loaded_data

Can the scHiCs( ) run the type of files which is " .matrix " ? If yes, which file format should I choose?

How should get the data format?

The Readme file uses the data test/data/cell_01', 'test/data/cell_02', 'test/data/cell_03 in test folder, but how can I transfer my data (fastq/ncc/hic) to the test data format?
截屏2024-04-18 14 07 54

Minor issue when genome_reference is a dict

Hi,

I think on line 91 of load_hic_file.py there's a small bug. The value of that dictionary should be divided by the resolution. Otherwise it would lead to an extremely large "maps" variable.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.