Giter Site home page Giter Site logo

kussell-lab / mcorr Goto Github PK

View Code? Open in Web Editor NEW
36.0 4.0 8.0 610 KB

Inferring bacterial recombination rates from large-scale sequencing datasets.

Home Page: https://www.nature.com/articles/s41592-018-0293-7

Python 35.46% Go 64.54%
correlation recombination

mcorr's Introduction

mcorr

Using Correlation Profiles of mutations to infer the recombination rate from large-scale sequencing data in bacteria.

Requirements

Installation

  1. Install mcorr-xmfa, mcorr-bam, and mcorr-fit from your terminal:
go get -u github.com/kussell-lab/mcorr/cmd/mcorr-xmfa
go get -u github.com/kussell-lab/mcorr/cmd/mcorr-bam
cd $HOME/go/src/github.com/kussell-lab/mcorr/cmd/mcorr-fit
python3 setup.py install

or to install mcorr-fit in local directory (~/.local/bin in Linux or ~/Library/Python/3.6/bin in MacOS):

python3 setup.py install --user
  1. Add $HOME/go/bin and $HOME/.local/bin to your $PATH environment. In Linux, you can do it in your terminal:
export PATH=$PATH:$HOME/go/bin:$HOME/.local/bin

In MacOS, you can do it as follows:

export PATH=$PATH:$HOME/go/bin:$HOME/Library/Python/3.6/bin

We have tested installation in Windows 10, Ubuntu 17.10, and MacOS Big Sur (on both Intel and M1 chips), using Python 3 and Go 1.15 and 1.16.

Typical installation time on an iMac is 10 minutes.

Basic Usage

The inference of recombination parameters requires two steps:

  1. Calculate Correlation Profile

    1. For whole-genome alignments (multiple gene alignments), use mcorr-xmfa:

      mcorr-xmfa <input XMFA file> <output prefix>

      The XMFA files should contain only coding sequences. The description of XMFA file can be found in http://darlinglab.org/mauve/user-guide/files.html. We provide two useful pipelines to generate whole-genome alignments:

    2. For read alignments, use mcorr-bam:

      mcorr-bam <GFF3 file> <sorted BAM file> <output prefix>

      The GFF3 file is used for extracting the coding regions of the sorted BAM file.

    3. For calculating correlation profiles between two clades or sequence clusters from whole-genome alignments, you can use mcorr-xmfa-2clades:

      mcorr-xmfa-2clades <input XMFA file 1> <input XMFA file 2>  <output prefix>

      Where file 1 and file 2 are the multiple gene alignments for the two clades.

    All programs will produce two files:

    • a .csv file stores the calculated Correlation Profile, which will be used for fitting in the next step;
    • a .json file stores the (intermediate) Correlation Profile for each gene.
  2. Fit the Correlation Profile using mcorr-fit:

    1. For fitting correlation profiles as described in the 2019 Nature Methods paper use mcorr-fit:

      mcorr-fit <.csv file> <output_prefix>

      It will produce four files:

      • <output_prefix>_best_fit.svg shows the plots of the Correlation Profile, fitting, and residuals;
      • <output_prefix>_fit_reports.txt shows the summary of the fitted parameters;
      • <output_prefix>_fit_results.csv shows the table of fitted parameters;
      • <output_prefix>_lmfit_report.csv shows goodness of fit-statistics from LMFIT
    2. To fit correlation profiles using the method from the Nature Methods paper and do model selection with AIC by comparing to the zero recombination case, use mcorrFitCompare:

      mcorrFitCompare <.csv file> <output_prefix>

      It will produce five files:

      • <output_prefix>_recombo_best_fit.svg and <output_prefix>_zero-recombo_best_fit.svg show the plots of the Correlation Profile, fitting, and residuals for the model with recombination and for the zero recombination case;
      • <output_prefix>_comparemodels.csv shows the table of fitted parameters and AIC values;
      • <output_prefix>_recombo_residuals.csv and <output_prefix>_zero-recombo_residuals.csv includes residuals for the model with recombination and the zero-recombination case

Examples

  1. Inferring recombination rates of Helicobacter pylori from whole genome sequences of a set of global strains;
  2. Inferring recombination rates of Helicobacter pylori from reads sequenced from a transformation experiment.

mcorr's People

Contributors

apsteinberg avatar edokussell avatar mingzhi avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar

mcorr's Issues

Installation into a new conda environment

Hey! First, thanks for mcorr it is a great tool. I had to make some adjustments to the install recommendations - partially related to an update with the subcommands in go - thought I would share just incase it is helpful to you or others looking to install mcorr

Install Go -  https://golang.org/doc/install #as in the recommended installation instructions
export PATH=$PATH:/usr/local/go/bin
go install github.com/kussell-lab/mcorr/cmd/mcorr-xmfa@latest
go install github.com/kussell-lab/mcorr/cmd/mcorr-bam@latest
git clone https://github.com/kussell-lab/mcorr
cd GITREPO/DOWNLOAD/LOCATION/mcorr/cmd/
conda create -n mcorr python=3.7 pip
pip install ./mcorr-fit #prior to running pip here, be sure to make suggested changes in issue #7 to avoid memory overflows in complex samples
pip install importlib-metadata #this is required for mcorr-xmfa & mcorr-bam but seems to not be part of the instructions
export PATH=$PATH:$HOME/go/bin:$HOME/.local/bin

This worked for me on Ubuntu 18.04.5 LTS with conda 4.8.3

bloated phi_pool output

Hi,

I am really new to this and not really familiar with whats possible in bacterial genomes. Phi pool describes the recombinational divergence of pools of related sequences. Confusingly, I get this bloated phi p of 3075362166333290000 (3.08E18). Is this even possible? These are closely related bacteria collected from the same animal host and organ. Here is the fit that I got. I am assuming there are sequences that I need to get rid of first before running with mcorr?

image

Does recombination inference apply only to core genomes?

Dear author,

Thank you for this great tool to detect the recombination events in the bacterial genomes. I noticed that the pipieline "
ReferenceAlignmentGenerator" or "AssemblyAlignmentGenerator" only extract the core genes from the genome, does this mean the mcorr software inference the recombination parameters only within the core-genome, not including the accessory genome?

Error during installation

Hello!

Thanks for developing this great tool. Unfortunately, I've been having some issues installing it mcorr-xmfa, seemingly related to installing v2 of the pb package.

When I run go get -u github.com/kussell-lab/mcorr/cmd/mcorr-xmfa I get the following output:

# cd .; git clone https://gopkg.in/cheggaaa/pb.v2 /home/debian/software/go/src/gopkg.in/cheggaaa/pb.v2
Cloning into '/home/debian/software/go/src/gopkg.in/cheggaaa/pb.v2'...
error: RPC failed; curl 52 Empty reply from server
fatal: The remote end hung up unexpectedly
package gopkg.in/cheggaaa/pb.v2: exit status 128

This is on debian 9

It looks like mcorr-bam and mcorr-fit have installed without issue though.

Run with core genome alignment as input?

Hello,

I was able to use mcorr for whole-genome alignments, but I was wondering if it's possible to use it for core genome alignments.

I tried extracting a core genome alignment from my whole-genome xmfa file using this approach:

stripSubsetLCBs full_alignment.xmfa full_alignment.xmfa.bbcols core_alignment.xmfa 500 4

But I keep getting this error:
Exception FileNotOpened thrown from Unknown() in gnFileSource.cpp 67 Called by Unknown() Read 41960 backbone entries seq_count is: 0 output_ivs.size() 3049 Segmentation fault

Are there any alternatives?

Error during installation

Hi,

I am trying to install mcorr-xmfa using the following command

go get -u github.com/kussell-lab/mcorr/cmd/mcorr-xmfa

However when I run the command with go version 1.17.6 I am getting the following error

go get: github.com/kussell-lab/[email protected]: parsing go.mod:
module declares its path as: github.com/apsteinberg/mcorr
but was required as: github.com/kussell-lab/mcorr

Do you know what may be causing this?

Thank you very much!

Best,
Sam

Understanding output plots

Hello!

Thanks for this awesome tool, it's super useful! I could use some help understanding the plots produced though. After running mcorr-bam and mcorr-fit on a population I got this plot:
Screen Shot 2021-06-01 at 10 26 07 PM

I re-ran it and got the same output. When I run it on other populations it works just fine, so it is probably something specific to this population. Do you have any ideas about why a population would plot like this?

Thank you for your time!

plot interpret

image
Could you please explain why isnt there a fit?
thank you in advance!

Could the multiple alignment generated by Roary and other softwares be directly converted to XMFA format?

Hey there,

Before diving into mcor, I noticed we need to set up some XMFA files. I've been using AssemblyAlignmentGenerate, which you've recommended in the past. It's a solid tool, but I've been feeling it's a little rigid, especially when we have innovative tools like Bakta and Panaroo coming up.

The part that trips me up in AssemblyAlignmentGenerate is when it has to transform Roary's output into XMFA format. You know, Roary and some other tools can autonomously spit out multiple sequence alignments of core genes using the "-e" parameter. If we could jump right from these alignments to XMFA format, it would be a great enhancement.

So, here's a thought. What if you consider weaving in a feature that directly flips these multiple sequence alignments, those produced by Roary for instance, into XMFA format? I feel like this tweak could seriously streamline the workflow, adding more flexibility and ramping up the efficiency. What do you think?

memory allocation error

hello, Thank you for creating and sharing this cool tool!
I have encountered this error when running mcorr-fit on a csv file that is ~6Mb.
File "/n/home12/yhwang/.local/bin/mcorr-fit", line 33, in
sys.exit(load_entry_point('mcorr==20180506', 'console_scripts', 'mcorr-fit')())
File "/n/home12/yhwang/.local/lib/python3.7/site-packages/mcorr-20180506-py3.7.egg/mcorr/cli.py", line 61, in main
plot_params(fit_results, model_params[1:7], params_hist_file)
File "/n/home12/yhwang/.local/lib/python3.7/site-packages/mcorr-20180506-py3.7.egg/mcorr/plot.py", line 86, in plot_params
bins='auto', color="green")
File "/n/home12/yhwang/.local/lib/python3.7/site-packages/matplotlib-3.3.4-py3.7-linux-x86_64.egg/matplotlib/init.py", line 1447, in inner
return func(ax, *map(sanitize_sequence, args), **kwargs)
File "/n/home12/yhwang/.local/lib/python3.7/site-packages/matplotlib-3.3.4-py3.7-linux-x86_64.egg/matplotlib/axes/_axes.py", line 6651, in hist
m, bins = np.histogram(x[i], bins, weights=w[i], **hist_kwargs)
File "<array_function internals>", line 6, in histogram
File "/n/home12/yhwang/.local/lib/python3.7/site-packages/numpy-1.20.1-py3.7-linux-x86_64.egg/numpy/lib/histograms.py", line 792, in histogram
bin_edges, uniform_bins = _get_bin_edges(a, bins, range, weights)
File "/n/home12/yhwang/.local/lib/python3.7/site-packages/numpy-1.20.1-py3.7-linux-x86_64.egg/numpy/lib/histograms.py", line 448, in _get_bin_edges
endpoint=True, dtype=bin_type)
File "<array_function internals>", line 6, in linspace
File "/n/home12/yhwang/.local/lib/python3.7/site-packages/numpy-1.20.1-py3.7-linux-x86_64.egg/numpy/core/function_base.py", line 135, in linspace
y = _nx.arange(0, num, dtype=dt).reshape((-1,) + (1,) * ndim(delta))
numpy.core._exceptions.MemoryError: Unable to allocate 45.5 TiB for an array with shape (6256526054292,) and data type float64

I get the best_fit.svg output and the fit_reports.txt and fit_reports.csv files. But no histograms.svg
Thank you in advance!

effects of sequence coverage on the estimate

Hello mcorr team,

I am wondering for the metagenomic case, whether the coverage of the target genome affected the final c estimate. I can see that the simulation use 5X coverage for the genome. In many real world cases, coverage varies by quite a lot sample by sample. Is there a rule to say what coverage threshold is the best, e.g., >5X?

Thanks for the great tool.

Jianshu

panic: runtime error: index out of range

Hi,
I've just received the following error while running mcorr-xmfa with an xmfa file of 530 isolates x 1064 genes. I've tested the code with a separate xmfa file of 178 isolates x 92 genes and it worked.

panic: runtime error: index out of range

goroutine 81 [running]:
github.com/kussell-lab/biogo/seq.XMFAReader.Read(0xc000202f60, 0x1000, 0x1000, 0xc000262000, 0x0, 0x0)
/home/webstejo/go/src/github.com/kussell-lab/biogo/seq/xmfa.go:33 +0x724
main.readXMFA.func1(0xc000228000, 0x7fff11ed030e, 0x39)
/home/webstejo/go/src/github.com/kussell-lab/mcorr/cmd/mcorr-xmfa/main.go:198 +0x1de
created by main.readXMFA
/home/webstejo/go/src/github.com/kussell-lab/mcorr/cmd/mcorr-xmfa/main.go:190 +0x6c

Add to bioconda?

Any plans for adding mcorr to bioconda? It would then be easier for researchers to add mcorr to reproducible bioinformatics computational environments

ZeroDivisionError running mcorr-fit

To the authors-
Thank you for writing mcorr! It has been running mostly smoothly for me. However, one issue that I've run into on a couple of my runs is this error when running mcorr-fit with default parameters on the .csv output of mcorr-bam. It only happens on maybe 1 in 10 of my .csv files, not on all of them.
Here is a link to a .csv file output by mcorr-bam where this issue occurs:
https://drive.google.com/file/d/1MQn0nOoLjUwx8SpU3mmGhf-bSAnGbSeQ/view?usp=sharing

  8%|██▌                             | 79/1001 [00:43<08:22,  1.83it/s]ZeroDivisionError
   <_ast.Module object at 0x7fde5f0a6c18>
               ^^^
float division by zero
Traceback (most recent call last):
  File "/home/alexcc/.pyenv/versions/3.6.0/bin/mcorr-fit", line 11, in <module>
    load_entry_point('mcorr==20180506', 'console_scripts', 'mcorr-fit')()
  File "/home/alexcc/.pyenv/versions/3.6.0/lib/python3.6/site-packages/mcorr-20180506-py3.6.egg/mcorr/cli.py", line 38, in main
    fit_results = fit_p2(fitdatas, r1_func=r1_func, disable_progress_bar=quiet)
  File "/home/alexcc/.pyenv/versions/3.6.0/lib/python3.6/site-packages/mcorr-20180506-py3.6.egg/mcorr/fit.py", line 87, in fit_p2
    fitres = fit_one(fitdata, r1_func)
  File "/home/alexcc/.pyenv/versions/3.6.0/lib/python3.6/site-packages/mcorr-20180506-py3.6.egg/mcorr/fit.py", line 80, in fit_one
    return FitRes(fitdata.group, fitres, dsample)
  File "/home/alexcc/.pyenv/versions/3.6.0/lib/python3.6/site-packages/mcorr-20180506-py3.6.egg/mcorr/fit_res.py", line 7, in __init__
    params = fit_res.params.valuesdict()
  File "/home/alexcc/.pyenv/versions/3.6.0/lib/python3.6/site-packages/lmfit/parameter.py", line 379, in valuesdict
    return OrderedDict(((p.name, p.value) for p in self.values()))
  File "/home/alexcc/.pyenv/versions/3.6.0/lib/python3.6/site-packages/lmfit/parameter.py", line 379, in <genexpr>
    return OrderedDict(((p.name, p.value) for p in self.values()))
  File "/home/alexcc/.pyenv/versions/3.6.0/lib/python3.6/site-packages/lmfit/parameter.py", line 791, in value
    return self._getval()
  File "/home/alexcc/.pyenv/versions/3.6.0/lib/python3.6/site-packages/lmfit/parameter.py", line 773, in _getval
    check_ast_errors(self._expr_eval)
  File "/home/alexcc/.pyenv/versions/3.6.0/lib/python3.6/site-packages/lmfit/parameter.py", line 21, in check_ast_errors
    expr_eval.raise_exception(None)
  File "/home/alexcc/.pyenv/versions/3.6.0/lib/python3.6/site-packages/asteval/asteval.py", line 248, in raise_exception
    raise exc(self.error_msg)
ZeroDivisionError:  in expr='<_ast.Module object at 0x7fde5f0a6c18>'

mcorr-bam - Very large phi_pool values from simulated data

I am trying to analyse some simulated datasets but am getting very large results for phi_pool (i.e. 1e50).

I have some simulated datasets that I generated using the program fastsimbac that I would like to try with mcorr-bam. With the simulated datasets only the bam and reference fasta is available. For the gff I tried to simply point to the entire reference sequence that was used for alignment.

I have attached an example simulated bam, fasta and gff file where the phi_pool value is really large. The bam has been simulated with a population recombination rate (rho) 0.5, 50 genomes of size 10Kbp and population mutation rate (theta) of 0.1.

I used the following commands for mcorr-bam and mcorr-fit and have attached the bootstrapping_report file along with the other files generated by mcorr.

mcorr-bam test.gff Aligned.bam test_bam
mcorr-fit test_bam.csv test_bam

I was also wondering how phi_pool differs from the population recombination rate (rho)?
data.zip

go get deprecated

As of go 1.7 go get only downloads and doesn't installs [1]

I managed to install with

go install github.com/kussell-lab/mcorr/cmd/mcorr-xmfa@latest
go install github.com/kussell-lab/mcorr/cmd/mcorr-bam@latest

instead of the mentioned

go get -u github.com/kussell-lab/mcorr/cmd/mcorr-xmfa
go get -u github.com/kussell-lab/mcorr/cmd/mcorr-bam

in the README.md file

[1] https://go.dev/doc/go-get-install-deprecation

panic: runtime error: index out of range

Hi,

I have been using your very well-thought program mcorr but unfortunately I am running into 2 errors messages I was hoping you could help me to solve them.

Before writing this message I tested the correct installation of the program by running one of the example you provided and I could run the whole workflow without having any problems.

1- I am getting the 1st error message after running the command bellow, but weirdly the program still output a .csv that seems to have all the information I need.

mcorr-xmfa --max-corr-length=1000 myInput.xmfa outputPrefix

panic: runtime error: index out of range

goroutine 1 [running]:
github.com/kussell-lab/mcorr.(*Collector).Results(0xc000050c30, 0xc0000a5c40, 0x15, 0x20)
/home/daron/go/src/github.com/kussell-lab/mcorr/collector.go:69 +0x70e
github.com/kussell-lab/mcorr.(*Bootstrap).Results(...)
/home/daron/go/src/github.com/kussell-lab/mcorr/bootstrap.go:51
github.com/kussell-lab/mcorr.CollectWrite(0xc00006e1e0, 0xc00001c300, 0x35, 0x3e8)
/home/daron/go/src/github.com/kussell-lab/mcorr/collect.go:71 +0x375
main.main()
/home/daron/go/src/github.com/kussell-lab/mcorr/cmd/mcorr-xmfa/main.go:73 +0xdca

2- The second error message is associated with the command mcorr-fit:

Traceback (most recent call last):
File "/home/daron/anaconda3/bin/mcorr-fit", line 11, in
load_entry_point('mcorr==20180506', 'console_scripts', 'mcorr-fit')()
File "/home/daron/anaconda3/lib/python3.6/site-packages/mcorr-20180506-py3.6.egg/mcorr/cli.py", line 61, in main
plot_params(fit_results, model_params[1:7], params_hist_file)
File "/home/daron/anaconda3/lib/python3.6/site-packages/mcorr-20180506-py3.6.egg/mcorr/plot.py", line 92, in plot_params
ax1.tick_params(axis='x', rotation=30)
File "/home/daron/anaconda3/lib/python3.6/site-packages/matplotlib/axes/_base.py", line 2727, in tick_params
self.xaxis.set_tick_params(**xkw)
File "/home/daron/anaconda3/lib/python3.6/site-packages/matplotlib/axis.py", line 791, in set_tick_params
kwtrans = self._translate_tick_kw(kw, to_init_kw=True)
File "/home/daron/anaconda3/lib/python3.6/site-packages/matplotlib/axis.py", line 862, in _translate_tick_kw
% (key, kwkeys))
ValueError: keyword rotation is not recognized; valid keywords are ['size', 'width', 'color', 'tickdir', 'pad', 'labelsize', 'labelcolor', 'zorder', 'gridOn', 'tick1On', 'tick2On', 'label1On', 'label2On', 'length', 'direction', 'left', 'bottom', 'right', 'top', 'labelleft', 'labelbottom', 'labelright', 'labeltop']

Thanks a lot for your help.
Best,
Josquin

Problems with the generation of an xmfa file by using AssemblyAlignmentGenerator

Dear author,
I encountered a problem when trying to generate an xmfa file using AssemblyAlignmentGenerator
Using the command :go run AssemblyAlignmentGenerator/AlignSequences.go test.db out/out.csv
However,the following error was encountered:
`0 / 3746 [---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------] 0.00%panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x577a0b]

goroutine 46 [running]:
github.com/kussell-lab/biogo/seq.(*FastaReader).Read(0xc000297770, 0x0, 0x1000, 0x7fbf95510f18)
/home/dave/go/pkg/mod/github.com/kussell-lab/[email protected]/seq/fasta.go:28 +0x6b
github.com/kussell-lab/biogo/seq.(*FastaReader).ReadAll(0xc000297770, 0x1000, 0x1000, 0xc000420000, 0x7867c0, 0xc000012f30)
/home/dave/go/pkg/mod/github.com/kussell-lab/[email protected]/seq/fasta.go:120 +0x52
main.MultiAlign(0xc0005dc000, 0x168, 0x266, 0x750868, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
/data/users/dave/softwares/AssemblyAlignmentGenerator/AlignSequences.go:142 +0x3bd
main.align(0xc0005dc000, 0x168, 0x266, 0xc0005e8000, 0x168, 0x266, 0x0, 0x0, 0x0, 0x0, ...)
/data/users/dave/softwares/AssemblyAlignmentGenerator/AlignSequences.go:88 +0x7a
main.main.func2(0xc0002100c0, 0xc000210120, 0xc00020e060)
/data/users/dave/softwares/AssemblyAlignmentGenerator/AlignSequences.go:53 +0x131
created by main.main
/data/users/dave/softwares/AssemblyAlignmentGenerator/AlignSequences.go:50 +0x606
exit status 2`

I used go 1.16 and go1.15 versions respectively, but both encountered the same error, how did this error happen and what should I do?
Yous,
Dave

how can I get the bam file?

Hi,
I notice that bam file could be the input for mcorr. But I am not sure how I can obtain the bam file.
For example, I have fastq data for 10 E. coli species.
Shall I just align 10 fastq to the E. coli reference genome to obtain 10 bam file? If so, no recombination information is in the bam file. How is the recombination rate measured?
Or is there special ways to set the reference genomes for the alignment?
Many thanks!

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.