MetaCC

MetaCC allows scalable and integrative analyses of both long-read and short-read metagenomic Hi-C data

Update

  • v1.2.0 (05/2024): Removed NormCC's dependence on R scripts

Overview

MetaCC is an efficient and integrative framework for analyzing both long-read and short-read metaHi-C datasets. In the MetaCC framework, raw metagenomic Hi-C contacts are first efficiently and effectively normalized by a new normalization method, NormCC. Leveraging NormCC-normalized Hi-C contacts, the binning module in MetaCC enables the retrieval of high-quality MAGs and downstream analyses.

  • If you want to reproduce results in our MetaCC paper, please read our instructions here.

  • Some scripts to process the intermediate data and plot figures of our MetaCC paper are available here.

System Requirements

Hardware requirements

MetaCC requires only a standard computer with enough RAM to support the in-memory operations.

Software requirements

OS Requirements

MetaCC v1.2.0 is supported and has been tested on macOS and Linux systems.

Python Dependencies

MetaCC mainly depends on the Python scientific stack:

numpy
scipy
pysam
scikit-learn
pandas
Biopython
igraph
leidenalg
statsmodels

Installation Guide

We recommend using conda to install MetaCC. Typical installation time is 1-5 minutes depending on your system.

Clone the repository with git

git clone https://github.com/dyxstat/MetaCC.git

Once complete, enter the repository folder and then create a MetaCC environment using conda.

Enter the MetaCC folder

cd MetaCC

Add execute permissions for external software

Since MetaCC executes external software located in the Auxiliary folder, you may need to run the following command to make sure those programs are executable:

chmod +x Auxiliary/test_getmarker.pl Auxiliary/FragGeneScan/FragGeneScan Auxiliary/FragGeneScan/run_FragGeneScan.pl Auxiliary/hmmer-3.3.2/bin/hmmsearch

Construct the conda environment on Linux or macOS

conda env create -f MetaCC_env.yaml

Enter the conda environment

conda activate MetaCC_env

A test dataset to demo MetaCC

We provide a small dataset, located under the Test directory, to test the software:

python ./MetaCC.py test

Instruction to process raw data

Follow the instructions in this section to process the raw shotgun and Hi-C data and generate the input for the MetaCC framework:

Clean raw shotgun and Hi-C reads

Adapter sequences are removed by bbduk from the BBTools suite with parameters ktrim=r k=23 mink=11 hdist=1 minlen=50 tpe tbo, and reads are quality-trimmed using bbduk with parameters trimq=10 qtrim=r ftm=5 minlen=50. Additionally, the first 10 nucleotides of each Hi-C read are trimmed by bbduk with parameter ftl=10. Identical PCR optical and tile-edge duplicates of Hi-C reads are removed by the clumpify.sh script from the BBTools suite.

Assemble shotgun reads

For the shotgun library, a de novo metagenome assembly is produced by assembly software such as MEGAHIT.

megahit -1 SG1.fastq.gz -2 SG2.fastq.gz -o ASSEMBLY --min-contig-len 1000 --k-min 21 --k-max 141 --k-step 12 --merge-level 20,0.95

Align Hi-C paired-end reads to assembled contigs

Hi-C paired-end reads are aligned to the assembled contigs using DNA mapping software such as BWA-MEM. Then samtools with the parameters 'view -F 0x904' is applied to remove unmapped reads, supplementary alignments, and secondary alignments. The BAM file then needs to be sorted by read name using 'samtools sort'.

bwa index final.contigs.fa
bwa mem -5SP final.contigs.fa hic_read1.fastq.gz hic_read2.fastq.gz > MAP.sam
samtools view -F 0x904 -bS MAP.sam > MAP_UNSORTED.bam
samtools sort -n MAP_UNSORTED.bam -o MAP_SORTED.bam

MetaCC analysis

Implement the NormCC normalization module

Since raw metagenomic Hi-C contacts are biased, the MetaCC pipeline provides a comprehensive and scalable normalization module, NormCC, to eliminate systematic biases in Hi-C contacts, which significantly benefits downstream analysis.

python /path_to_MetaCC/MetaCC.py norm [Parameters] FASTA_file BAM_file OUTPUT_directory

Parameters

-e (required): Case-sensitive enzyme name. Use multiple times for multiple enzymes
--min-len: Minimum acceptable contig length (default 1000)
--min-mapq: Minimum acceptable alignment quality (default 30)
--min-match: Accepted alignments must have at least N matches (default 30)
--min-signal: Minimum acceptable signal (default 2)
--thres: Fraction of normalized Hi-C contacts to discard
         (default 0.05, i.e., the lowest 5% of normalized Hi-C contacts are discarded as spurious)
--cover (optional): Overwrite existing files. Otherwise, an error is raised if an output file already exists.
-v (optional): Verbose output with more details of the procedure.

Input File

  • FASTA_file: a FASTA file of the assembled contigs (e.g. Test/final.contigs.fa)
  • BAM_file: a BAM file of the Hi-C alignments (e.g. Test/MAP_SORTED.bam)

Output File

  • contig_info.csv: information on the assembled contigs, with three columns (contig name, number of restriction sites on the contig, and contig length).
  • Normalized_contact_matrix.npz: a sparse matrix of the normalized Hi-C contact map in CSR format; it can be reloaded with the Python command scipy.sparse.load_npz('Normalized_contact_matrix.npz').
  • NormCC_normalized_contact.gz: the normalized contacts and contig information, serialized with pickle and compressed. This file serves as the input to the MetaCC binning module.
  • MetaCC.log: log of the NormCC normalization run.
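
As a sanity check, the normalized contact matrix can be reloaded in Python. The sketch below round-trips a tiny stand-in CSR matrix under the same file name to show the reload step; the matrix values are illustrative, not real NormCC output:

```python
import numpy as np
from scipy import sparse

# Stand-in for NormCC output: a small symmetric contact matrix in CSR
# format, saved under the file name the pipeline uses (illustrative only).
demo = sparse.csr_matrix(np.array([[0.0, 2.5, 0.0],
                                   [2.5, 0.0, 1.0],
                                   [0.0, 1.0, 0.0]]))
sparse.save_npz('Normalized_contact_matrix.npz', demo)

# Reload the normalized contact map, as described above.
contacts = sparse.load_npz('Normalized_contact_matrix.npz')
print(contacts.shape)   # one row/column per contig
print(contacts[0, 1])   # normalized contact between contigs 0 and 1
```

The rows and columns follow the contig order recorded in contig_info.csv, so the two files are meant to be used together.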

Example

python ./MetaCC.py norm -e HindIII -e NcoI -v final.contigs.fa MAP_SORTED.bam out_directory

Implement the MetaCC binning module

The MetaCC binning module relies on the NormCC-normalized Hi-C contacts and therefore must be run after the NormCC normalization module.

python /path_to_MetaCC/MetaCC.py bin --cover [Parameters] FASTA_file OUTPUT_directory

Parameters

--min-binsize: Minimum bin size used in output (default 150,000)
--num-gene (optional): Number of marker genes detected. If not provided,
                       the number of marker genes is detected automatically.
--random-seed (optional): Seed for the Leiden clustering. If not provided, a random seed is used.
-v (optional): Verbose output with more details of the procedure.

Input File

  • FASTA_file: a FASTA file of the assembled contigs (e.g. Test/final.contigs.fa)
  • OUTPUT_directory: must be the same as the output directory of the NormCC normalization module.

Output File

  • BIN: folder containing the FASTA files of draft genomic bins
  • MetaCC.log: log of the MetaCC binning run

Example

python ./MetaCC.py bin --cover -v final.contigs.fa out_directory

Implement the post-processing step of the MetaCC binning module

Draft genomic bins are assessed using CheckM2 (or CheckM). The post-processing step of the MetaCC binning module is then applied to partially contaminated bins (completeness larger than 50% and contamination larger than 10%) in order to purify them.

python /path_to_MetaCC/MetaCC.py postprocess --cover [Parameters] FASTA_file Contaminated_Bins_file OUTPUT_directory

Parameters

--min-binsize: Minimum bin size used in output (default 150,000)
-v (optional): Verbose output with more details of the procedure.

Input File

  • FASTA_file: a FASTA file of the assembled contigs (e.g. Test/final.contigs.fa).
  • Contaminated_Bins_file: a CSV file listing the names of the partially contaminated bins, one name per line, without the .fa extension.
  • OUTPUT_directory: must be the same as the output directory used in the previous steps.

Example of a Contaminated_Bins_file:

BIN0001
BIN0003
BIN0005
...
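
A file in this format can be derived from a bin quality report. The sketch below assumes a CheckM2-style report with Name, Completeness, and Contamination columns (the column names and rows are assumptions for illustration) and applies the completeness > 50% / contamination > 10% criterion described above:

```python
# Hypothetical rows from a CheckM2-style quality report; in practice these
# would be parsed from the report file produced by CheckM2 or CheckM.
report_rows = [
    {'Name': 'BIN0001', 'Completeness': '72.3', 'Contamination': '15.8'},
    {'Name': 'BIN0002', 'Completeness': '95.1', 'Contamination': '1.2'},
    {'Name': 'BIN0003', 'Completeness': '61.0', 'Contamination': '22.4'},
]

# Keep partially contaminated bins: completeness > 50% and
# contamination > 10%, per the MetaCC post-processing criterion.
contaminated = [r['Name'] for r in report_rows
                if float(r['Completeness']) > 50
                and float(r['Contamination']) > 10]

# One bin name per line, without the .fa extension, as MetaCC expects.
with open('contaminated_bins.csv', 'w') as f:
    f.write('\n'.join(contaminated) + '\n')
```

With the rows above, only BIN0001 and BIN0003 meet the criterion, so the resulting file matches the example format shown.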

Output File

  • BIN: folder containing the FASTA files of draft genomic bins after cleaning the partially contaminated bins.
  • MetaCC.log: log of the post-processing step of the MetaCC binning module.

Example

python ./MetaCC.py postprocess --cover -v final.contigs.fa contaminated_bins.csv out_directory

Contacts and bug reports

If you have any questions or suggestions, feel free to contact Yuxuan Du ([email protected]).

Copyright and License Information

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.

metacc's People

Contributors

brambloemen, dyxstat


metacc's Issues

NormCC fails during glm fitting

Hi, I've found that the normalization step sometimes fails in normcc.R, where the glm fitter encounters NaN / Infinite values:
Error in glm.fitter(x = X, y = Y, w = w, etastart = eta, offset = offset, : NA/NaN/Inf in 'x'
A test dataset is attached.
contig_info.csv

I was able to resolve this issue by using python statsmodels for the glm fitting, but I don't know if that would be satisfactory here.

Program having issues finding the correct files

Hi Yuxuan! Thanks for making this program. It runs much better than HiCBin for sure.

These seem like relatively minor issues, but I figured they should be brought up at least.

When I run the program from outside the MetaCC directory, it seems to have a hard time finding its own program files. This is easily solved by cd'ing into the MetaCC directory and running the program from there. That's not so much a big deal.

A slightly more problematic issue is when I try running:

python MetaCC.py postprocess --cover -v \
/scratch/duhaimem_root/duhaimem1/jamestan/from_turbo_20230822/samples/samp_508/assembly/megahit_noNORM/final.contigs.renamed.fa \
/scratch/duhaimem_root/duhaimem1/jamestan/from_turbo_20230822/metacc/samp_508/contaminated_bins.csv \
/scratch/duhaimem_root/duhaimem1/jamestan/from_turbo_20230822/metacc/samp_508

after I've run checkM and also have all the correct files for the postprocessing steps.

It gives me the error: "File NormCC_normalized_contact.gz is not found", even though I've checked and the file is certainly in the designated output directory from my first pass through MetaCC.

Let me know if that makes sense or not.
