philarevalo / popcogent

Microbial Populations as Clusters Of Gene Transfer

License: GNU General Public License v3.0

Python 14.29% Shell 0.34% R 0.42% Jupyter Notebook 0.95% Makefile 1.30% C++ 81.70% HTML 0.39% JavaScript 0.25% SWIG 0.32% Dockerfile 0.04%

popcogent's Issues

Plot the .graphml file

Hello,
I was wondering if there's any visualization tool or script you would suggest for plotting the .graphml file generated by PopCOGenT? I'd like to make a figure like Figure 3 in the PopCOGenT paper.

Thank you!

Yiyuan
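As a starting point for plotting, the graph can be read with nothing but the Python standard library and grouped into connected components, which can then be coloured or laid out in any plotting tool. This is a minimal sketch, not the authors' method for Figure 3. It assumes the file uses the standard GraphML namespace; the function names `read_graphml` and `connected_components` are made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Standard GraphML namespace; files written without it would need this set to "".
GRAPHML_NS = "{http://graphml.graphdrawing.org/xmlns}"


def read_graphml(source):
    """Return (nodes, edges) from a GraphML file path or file-like object."""
    root = ET.parse(source).getroot()
    nodes = [n.get("id") for n in root.iter(GRAPHML_NS + "node")]
    edges = [(e.get("source"), e.get("target"))
             for e in root.iter(GRAPHML_NS + "edge")]
    return nodes, edges


def connected_components(nodes, edges):
    """Group node ids into connected components (edges treated as undirected)."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(sorted(component))
    return components
```

GraphML is also opened directly by general graph viewers such as Cytoscape and Gephi, which may be the quicker route to a publication-style figure.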

flexible genome sweep error

Hello, I tried running flexible genome sweep and got the error attached. Any idea on what could be the problem? Marcela

Error running phybreak2.maf_to_fasta.py

Hello! I'm having an issue trying to run the core_gene_sweeps module. The error I get is:

Traceback (most recent call last):
  File "phybreak2.maf_to_fasta.py", line 343, in <module>
    corefile.write(">"+iso +"\n"+ full_seqdict[iso] +"\n")
KeyError: 'IRLA172'

However, this particular genome is found in both the 'strain_names.txt' file and the .maf alignment file. Any ideas on what might be causing this?
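One way to narrow this down is to compare the names in strain_names.txt with the names that actually appear on the sequence lines of the MAF file, since a trailing space or a differing suffix would be enough to cause a KeyError. The sketch below is only a diagnostic under stated assumptions: it assumes MAF sequence lines start with `s` and that the genome name is the part of the source field before the first `.` (as in `M1612_contigs.M1612_contigs_0`). The helper names are hypothetical and not part of PopCOGenT.

```python
def genomes_in_maf(maf_lines):
    """Collect genome names from MAF 's' lines (text before the first '.')."""
    names = set()
    for line in maf_lines:
        fields = line.split()
        # An 's' line has: s, source, start, size, strand, srcSize, sequence
        if len(fields) >= 7 and fields[0] == "s":
            names.add(fields[1].split(".", 1)[0])
    return names


def missing_strains(strain_names, maf_lines):
    """Return strain names (whitespace-stripped) that never occur in the MAF."""
    wanted = {name.strip() for name in strain_names if name.strip()}
    return sorted(wanted - genomes_in_maf(maf_lines))
```

Typical use would be `missing_strains(open("strain_names.txt"), open("project.maf"))`, where both file names are placeholders. An empty result means every listed strain occurs in at least one block, so the cause would lie elsewhere.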

Python 3 package rpy2 needs to be installed to use the R function.

Hi,
I'm still having trouble running flexible_genome_sweeps (bash snakemake.sh).

After setting the configuration file and running the script I get this error:

Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 align_core_genes
1 all
1 cluster_tsv_to_tidy
1 get_flex_genes
1 make_master_table
5

rule cluster_tsv_to_tidy:
input: proc/acidovorax/clusters/clusters.0.tsv
output: output/acidovorax/acidovorax.0.master_presence_absence.csv
jobid: 4
wildcards: organism=acidovorax

Error in job cluster_tsv_to_tidy while creating output file output/acidovorax/acidovorax.0.master_presence_absence.csv.
RuleException:
ValueError in line 129 of /home/rsiani/PopCOGenT-master/src/flexible_genome_sweeps/Snakefile:
Python 3 package rpy2 needs to be installed to use the R function.
File "/home/rsiani/PopCOGenT-master/src/flexible_genome_sweeps/Snakefile", line 129, in __rule_cluster_tsv_to_tidy
File "/home/rsiani/.conda/envs/PopCOGenT/lib/python3.6/concurrent/futures/thread.py", line 56, in run
Exiting because a job execution failed. Look above for error message
Will exit after finishing currently running jobs.
Exiting because a job execution failed. Look above for error message

However, I have now installed rpy2 in every way I can think of and still cannot get past this.

Any idea?

Thanks in advance, Rob
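The message suggests that importing rpy2 failed inside the Python process that Snakemake runs in, which can happen even when the package is listed as installed (for example if loading R itself fails). A minimal sketch for surfacing the underlying exception is shown here; `try_import` is a hypothetical helper, and running it with the same interpreter that runs Snakemake is an assumption about where the problem lies.

```python
import importlib


def try_import(module_name):
    """Return (ok, detail): the module's file path, or the exception text."""
    try:
        module = importlib.import_module(module_name)
    except Exception as exc:  # ImportError, or errors raised while loading R
        return False, "{}: {}".format(type(exc).__name__, exc)
    return True, getattr(module, "__file__", "built-in")


if __name__ == "__main__":
    # rpy2.robjects is the submodule that actually starts the embedded R.
    print("rpy2.robjects:", try_import("rpy2.robjects"))
```

If the import fails, the exception text usually points at the real cause, such as a missing shared library or an R version that rpy2 was not built against.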

PopCOGenT

I'm running into the error pasted below. I am running PopCOGenT on 3 assemblies.

Ouput directory does not exist. Creating new directory.
Traceback (most recent call last):
File "/GWSPH/groups/liu_price_lab/tools/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2646, in get_loc
return self._engine.get_loc(key)
File "pandas/_libs/index.pyx", line 111, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 1618, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 1626, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'Larger genome'

During handling of the above exception, another exception occurred:

Error running phybreak4.retrieveLikelihood.py

Hello. First of all, thanks for this great tool.

I'm trying to run this script but I get a KeyError:

Traceback (most recent call last):
File "phybreak4.retrieveLikelihood.py", line 142, in <module>
ML_dict[subseq][str(tree_no)] += tree
KeyError: '78'

I have read that this is a problem with dictionaries, but honestly I have no idea how to solve it. Can you help me, please?

Regards.
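The mechanism behind this kind of error is that `+=` on a dictionary entry first reads the existing value, so a key that was never created raises KeyError. The sketch below illustrates that and one way to accumulate without pre-creating keys. It is not a patch for phybreak4: the names mirror the traceback but the data are invented, and a missing key in the real script may equally mean that an earlier step produced fewer trees than expected.

```python
from collections import defaultdict


def collect_trees(records):
    """Concatenate tree strings per (subseq, tree_no) without pre-made keys."""
    ml_dict = defaultdict(lambda: defaultdict(str))
    for subseq, tree_no, tree in records:
        # defaultdict(str) supplies "" the first time a tree number is seen
        ml_dict[subseq][str(tree_no)] += tree
    return {subseq: dict(trees) for subseq, trees in ml_dict.items()}


if __name__ == "__main__":
    plain = {"block1": {}}
    try:
        plain["block1"]["78"] += "(A,B);"  # reads plain["block1"]["78"] first
    except KeyError as exc:
        print("plain dict raises KeyError for", exc)
    print(collect_trees([("block1", 78, "(A,B);")]))
```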

Python 3 package rpy2 needs to be installed to use the R function

I am trying to run flexible_genome_sweeps using the Ruminococcus example dataset, with the default snakemake.sh and config.yaml files (with the relevant paths and file names entered). I am not able to get past the following error. When I query which packages are present in the PopCOGenT environment (conda list -n PopCOGenT), I can see that rpy2 2.7.8 from bioconda is in the list. I am not sure what to try next.

Error in job cluster_tsv_to_tidy while creating output file output/Ruminococcus/Ruminococcus.0.master_presence_absence.csv.
RuleException:
ValueError in line 131 of /home/tate/PopCOGenT-master/src/flexible_genome_sweeps/Snakefile:
Python 3 package rpy2 needs to be installed to use the R function.
File "/home/tate/PopCOGenT-master/src/flexible_genome_sweeps/Snakefile", line 131, in __rule_cluster_tsv_to_tidy
File "/home/tate/miniconda3/envs/PopCOGenT/lib/python3.6/concurrent/futures/thread.py", line 56, in run
Exiting because a job execution failed. Look above for error message
Will exit after finishing currently running jobs.
Exiting because a job execution failed. Look above for error message

Long run time in runPhyML.py

Dear sir,
This is great work on gene flow, and I want to explore gene flow across different hosts. I ran the pipeline as described, using your example in the test directory. phybreak3.MSAsubset_runPhyML.py has now been running for three days without finishing or reporting any error. I hope you can help.
Best regards,
Yun

Plotting

Any recommendations for making a figure like Figure 3 in the paper?

create a release?

Can you please create a release (and thus a stable url to a tarball) so that I can try to add PopCOGenT to bioconda. See the bioconda docs if you're wondering why I'd need a tagged release. Adding PopCOGenT to bioconda will make it easier to install with all of its dependencies into compute environments with existing bioinfo tools and all of their complex dependency structures.

failed job run_mmseqs

Hi Phil (@philarevalo),
I ran into a problem when running "bash snakefile.sh" in flexible_genome_sweeps. The error is as follows.
Waiting at most 5 seconds for missing files.
MissingOutputException in line 88 of /disk1/cau/cvmljy/pop/PopCOGenT-master/src/flexible_genome_sweeps/Snakefile:
Missing files after 5 seconds:
proc/sulfolobus/clusters/clu.0
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Removing output files of failed job run_mmseqs since they might be corrupted:
proc/sulfolobus/clusters/DB.0
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message

Core Gene Sweeps phybreak_parameters.txt clarification

Hello,

I am unclear on the parameter inputs.

What is the difference between input_contig_dir and contig_dir and how do these relate to the input genomes dir for PopCOGenT?

What are ref_iso and ref_contig supposed to be?

Thanks,
Roth

Mapping of positions in the output core gene sweep

Hi,

I want to extract the sequences based on the output of *.core_sweeps.csv, which provided the start and end positions.
In the README, you mentioned that

*.core_sweeps.csv: The positions (in the coordinates of the whole genome alignment) of core genome sweeps.

I guess align/*.core.fasta is not the alignment you mean, since it is concatenated, contains only the core genome, and includes gaps. align/*.maf seems to be the whole genome alignment, but it is not concatenated, which makes it hard to map the positions. So I was wondering whether the positions are in the coordinates of the reference genome, the one given as ref_iso in phybreak_parameters.txt.

Thanks in advance!

Best,
Xiaojun
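Whichever coordinate system the CSV turns out to use, converting between alignment columns and reference positions comes down to counting the non-gap characters in the reference row. The sketch below shows only that conversion, assuming a gapped reference sequence taken from an alignment and 0-based, inclusive intervals. It makes no claim about which coordinates *.core_sweeps.csv actually reports, and the function names are hypothetical.

```python
def column_to_reference(gapped_reference, gap_char="-"):
    """Map each alignment column (0-based) to a 0-based reference position.

    Columns where the reference row has a gap map to None.
    """
    mapping = []
    position = 0
    for base in gapped_reference:
        if base == gap_char:
            mapping.append(None)
        else:
            mapping.append(position)
            position += 1
    return mapping


def interval_to_reference(gapped_reference, start_col, end_col):
    """Convert an inclusive column interval to inclusive reference coordinates.

    Returns None when the whole interval falls in a reference gap.
    """
    columns = column_to_reference(gapped_reference)[start_col:end_col + 1]
    positions = [p for p in columns if p is not None]
    if not positions:
        return None
    return positions[0], positions[-1]
```

For a reference row `AC--GT`, columns 2 to 4 cover two gap columns and one base, so the interval maps to reference position 2 only.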

Flexible gene sweep test output

The flexible gene sweep pipeline should have a test output to check against.

PopCOGenT, cannot find Bio

Hello, I have been trying to get PopCOGenT running with the test data. I have not been able to get past this error. Is this the right place to ask for help?

(PopCOGenT) tate@Zareason:~/PopCOGenT-master/src/PopCOGenT$ bash PopCOGenT.sh
Traceback (most recent call last):
File "get_alignment_and_length_bias.py", line 5, in <module>
from Bio import SeqIO
ModuleNotFoundError: No module named 'Bio'
Traceback (most recent call last):
File "cluster.py", line 2, in <module>
import networkx as nx
ModuleNotFoundError: No module named 'networkx'

Running the following shows that biopython is in the PopCOGenT env:
conda list -n PopCOGenT

I installed the PopCOGenT environment in the following manner:
conda config --set restore_free_channel true
conda env create -f PopCOGenT.yml
conda install --name PopCOGenT mugsy=1.2.3 muscle=3.8.31
(I was not able to install phyml, mmseqs2, or infomap this way and have yet to install them.)
I have also entered the miniconda3 path as the location of the mugsy installation and mugsyenv.sh.

Thank you,
Suzanne
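When a package is listed by conda but Python still reports it missing, a common explanation is that the script is being run by a different interpreter than the one inside the environment. A small check, run with the same `python` command that PopCOGenT.sh uses, can confirm or rule that out. This is a sketch only; the module list is taken from the traceback, and `missing_modules` is a hypothetical helper.

```python
import importlib.util
import sys


def missing_modules(module_names):
    """Return the top-level module names the running interpreter cannot find."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # If the interpreter path is not inside the PopCOGenT environment,
    # the environment is probably not the one actually being used.
    print("interpreter:", sys.executable)
    print("environment prefix:", sys.prefix)
    print("missing:", missing_modules(["Bio", "networkx"]))
```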

Test expected output for PopCOGenT

PopCOGenT should have an expected test output to check against.

conda installation issue

Hi,

Thanks for this pipeline! I ran into the issue below during installation and wonder whether it could affect the installation.

conda env create -f PopCOGenT.yml
Collecting package metadata (repodata.json): done
Solving environment: done

Downloading and Extracting Packages
_openmp_mutex-4.5 | 22 KB | ##################################### | 100%
Preparing transaction: done
Verifying transaction: \
SafetyError: The package for r-base located at /home/nico/miniconda3/pkgs/r-base-3.3.2-0
appears to be corrupted. The path 'lib/R/doc/html/packages.html'
has an incorrect size.
reported size: 2946 bytes
actual size: 11567 bytes

ClobberError: This transaction has incompatible packages due to a shared path.
packages: defaults/linux-64::libgfortran-3.0.0-1, defaults/linux-64::libgcc-7.2.0-h69d50b8_2
path: 'lib/libgfortran.so.3'

ClobberError: This transaction has incompatible packages due to a shared path.
packages: defaults/linux-64::openblas-0.2.19-0, defaults/linux-64::libopenblas-0.3.6-h5a2b251_2
path: 'lib/libopenblas.so'

ClobberError: This transaction has incompatible packages due to a shared path.
packages: defaults/linux-64::libopenblas-0.3.6-h5a2b251_2, conda-forge/linux-64::libblas-3.8.0-11_openblas
path: 'lib/libblas.so'

done
Executing transaction: done

To activate this environment, use

$ conda activate PopCOGenT

To deactivate an active environment, use

$ conda deactivate

When I run the program, I get this:

sh PopCOGenT.sh
PopCOGenT.sh: 4: source: not found
PopCOGenT.sh: 5: source: not found
PopCOGenT.sh: 6: source: not found
usage: get_alignment_and_length_bias.py [-h] [--genome_dir GENOME_DIR]
[--genome_ext GENOME_EXT]
[--alignment_dir ALIGNMENT_DIR]
[--mugsy_path MUGSY_PATH]
[--mugsy_env MUGSY_ENV]
[--base_name BASE_NAME]
[--final_output_dir FINAL_OUTPUT_DIR]
[--num_threads NUM_THREADS]
[--keep_alignments] [--slurm]
[--script_dir SCRIPT_DIR]
[--source_path SOURCE_PATH]
get_alignment_and_length_bias.py: error: argument --genome_dir: expected one argument
usage: cluster.py [-h] [--base_name BASE_NAME]
[--length_bias_file LENGTH_BIAS_FILE]
[--clonal_cutoff CLONAL_CUTOFF]
[--output_directory OUTPUT_DIRECTORY]
[--infomap_args INFOMAP_ARGS] [--infomap_path INFOMAP_PATH]
[--single_cell]
cluster.py: error: argument --base_name: expected one argument

Here is the config file:

# Base name for final output files. Just a prefix to identify your outputs.
base_name='MAGspop'

# Output directory for the final output files.
# This will create the directory if it does not already exist.
final_output_dir=/home/nico/programmes/PopCOGenT-master/output/
mkdir -p ${final_output_dir}

# Path to mugsy and mugsyenv.sh. Please provide absolute path.
mugsy_path=/home/nico/miniconda3/envs/PopCOGenT/bin/mugsy
mugsy_env=/home/nico/miniconda3/envs/PopCOGenT/bin/mugsyenv.sh

# Path to infomap executable. Please provide absolute path.
infomap_path=/home/nico/programmes/PopCOGenT-master/Infomap

# Path to genome files.
genome_dir=/media/nico/MyBook/test/

# Genome file filename extension.
genome_ext=.fasta

# Are you running on a single machine? Please specify the number of threads to run.
# This can, at maximum, be the number of logical cores your machine has.
num_threads=10

# Whether to keep alignments after length bias is calculated.
# Alignment files can be 10MB each and thus a run on 100 genomes can take up on the order of 50 GB of space if alignment files are not discarded.
# If you want to keep alignments, set to --keep_alignments. Otherwise leave as ''.
keep_alignments=--keep_alignments

# Directory for output alignments. Must provide absolute path.
alignment_dir=/home/nico/programmes/PopCOGenT-master/output/proc/
mkdir -p ${alignment_dir}

# Are your genomes single-cell genomes? If so, this should equal --single_cell. Otherwise leave as ''.
single_cell=''

# Are you using a slurm environment? Then this should equal --slurm, otherwise, leave as empty quotes.
slurm_str=''

# If using slurm, please specify the output directory for the runscripts and source scripts. Absolute paths required.
script_dir=''
source_path=''

I have probably done something wrong, but I'm not sure where.

Thanks for your help!

conda env create error: joblib=0.9.4 package not found

Hello!

Would love to try PopCOGenT but ran into the error below getting started. I tried adding channels with the indicated joblib version to the yml but that didn't do the trick. Any suggestions? Thank you very much!

[k6logc@eofe7 PopCOGenT]$ conda env create -f PopCOGenT.yml
Collecting package metadata (repodata.json): done
Solving environment: failed

ResolvePackageNotFound:

joblib=0.9.4

PopCOGenT run error

Hi Phil, thanks a lot for providing this tool.
But I'm running into the error pasted below.

Traceback (most recent call last):
File "get_alignment_and_length_bias.py", line 166, in <module>
main()
File "get_alignment_and_length_bias.py", line 89, in main
args.keep_alignments)
File "get_alignment_and_length_bias.py", line 142, in run_on_single_machine
renamed_genomes = [rename_for_mugsy(g) for g in glob.glob(genome_directory + '' + genome_extension)]
File "get_alignment_and_length_bias.py", line 142, in <listcomp>
renamed_genomes = [rename_for_mugsy(g) for g in glob.glob(genome_directory + '' + genome_extension)]
File "/home/zhang/Documents/PopCOGenT-master/src/PopCOGenT/length_bias_functions.py", line 45, in rename_for_mugsy
s.id = '{id}_{_num}'.format(id=mugsy_name, contig_num=str(i))
KeyError: '_num'
Traceback (most recent call last):
File "/home/zhang/miniconda3/envs/PopCOGenT/lib/python3.6/site-packages/pandas/indexes/base.py", line 2134, in get_loc
return self._engine.get_loc(key)
File "pandas/index.pyx", line 132, in pandas.index.IndexEngine.get_loc (pandas/index.c:4433)
File "pandas/index.pyx", line 154, in pandas.index.IndexEngine.get_loc (pandas/index.c:4279)
File "pandas/src/hashtable_class_helper.pxi", line 732, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13742)
File "pandas/src/hashtable_class_helper.pxi", line 740, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13696)
KeyError: 'Larger genome'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "cluster.py", line 315, in <module>
main()
File "cluster.py", line 70, in main
linear_model=negative_selection_linear_fit())
File "cluster.py", line 227, in make_edgefile
predict_df['Genome_size'] = trn_table['Larger genome'] / 1e6
File "/home/zhang/miniconda3/envs/PopCOGenT/lib/python3.6/site-packages/pandas/core/frame.py", line 2059, in __getitem__
return self._getitem_column(key)
File "/home/zhang/miniconda3/envs/PopCOGenT/lib/python3.6/site-packages/pandas/core/frame.py", line 2066, in _getitem_column
return self._get_item_cache(key)
File "/home/zhang/miniconda3/envs/PopCOGenT/lib/python3.6/site-packages/pandas/core/generic.py", line 1386, in _get_item_cache
values = self._data.get(item)
File "/home/zhang/miniconda3/envs/PopCOGenT/lib/python3.6/site-packages/pandas/core/internals.py", line 3543, in get
loc = self.items.get_loc(item)
File "/home/zhang/miniconda3/envs/PopCOGenT/lib/python3.6/site-packages/pandas/indexes/base.py", line 2136, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas/index.pyx", line 132, in pandas.index.IndexEngine.get_loc (pandas/index.c:4433)
File "pandas/index.pyx", line 154, in pandas.index.IndexEngine.get_loc (pandas/index.c:4279)
File "pandas/src/hashtable_class_helper.pxi", line 732, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13742)
File "pandas/src/hashtable_class_helper.pxi", line 740, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13696)
KeyError: 'Larger genome'

I also read an earlier issue report about this, but the suggested fix only caused a further error, which is the first one shown above: KeyError: '_num'
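The first KeyError follows directly from how `str.format` resolves named placeholders: every name inside braces must be supplied as a keyword argument. In the line shown in the traceback the template asks for `_num` while the call supplies `contig_num`. The sketch below reproduces just that mechanism. That `{contig_num}` is the intended placeholder is an assumption inferred from the keyword argument name, not something confirmed against the upstream source.

```python
def contig_id(mugsy_name, index):
    """Build '<name>_<index>'; placeholder names match the keyword arguments."""
    return "{id}_{contig_num}".format(id=mugsy_name, contig_num=str(index))


def broken_contig_id(mugsy_name, index):
    """Same call, but the template asks for '_num', which is never supplied."""
    return "{id}_{_num}".format(id=mugsy_name, contig_num=str(index))


if __name__ == "__main__":
    print(contig_id("genomeA", 0))
    try:
        broken_contig_id("genomeA", 0)
    except KeyError as exc:
        print("KeyError:", exc)
```

The second traceback (KeyError: 'Larger genome') may simply be a consequence of the first step failing and leaving an empty or incomplete length bias table, although that is a guess from the order of the errors.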

Error of running phybreak1.generate_maf.py

Hi,
I tried to find core gene sweeps in the test data using PopCOGenT, but I got an error when running phybreak1.generate_maf.py:
sh: 1: source: not found
12 genomes
Starting Nucmer: Tue Feb 2 21:05:42 CST 2021
sh: 1: cannot create ./align/sulpho.M1627_contigs.queries.fsa: Directory nonexistent
.sh: 1: cannot create ./align/sulpho.M1627_contigs.filt.delta: Directory nonexistent
Nucmer search failed. Can't find delta file ./align/sulpho.M1627_contigs.filt.delta at /home/lenovo/software/mugsy_x86-64-v1r2.2/mugsy line 812.

I have uploaded the result files of bash PopCOGenT.sh, the log file from running phybreak1.generate_maf.py, and phybreak_parameters.txt. Could you please give me any suggestions? Thanks.
strain_names.txt
phybreak_parameters.txt
sulfolobus_0.000355362.txt.cluster.tab.txt
sulfolobus.length_bias.txt

Best
Hao

Core gene sweeps module

Hi,
thanks a lot for providing this excellent software, very useful!

I have a few questions about the core_gene_sweeps module. I hope it's OK that I put them together in one issue here.

1) Empty directory project_dir/align/trees/
After running scripts 1-7 I noticed this directory was empty, and I was wondering if maybe some script did not finish?
The trees are present in a file project_dir/align/phy_split/output_prefix.phy_phyml_tree.txt though, so not sure if anything is missing.

2) Re-running core_gene_sweeps with changed focus population
Under usage it says:

For each population for which you wish to find sweeps, change the focus_population parameter and re-run scripts 3-7.

As script 3 (calculating the trees) was rather time-consuming, I was wondering if it's necessary to run script 3 again, or whether the trees from the previous run could be reused?

3) Generating output as given in Figure 5 (B and C) and Figure 6 in the paper
Are you planning to include scripts that generate this output (i.e. between-population Pi in the sweep regions, Fst values, SNPs in a sliding window, and trees for sweep and flanking regions) in the future? Or can some of this already be extracted directly from the output and I missed it?

4) Very minor: If I’m not mistaken, script 4 writes phybreak.leafdist_compare.R into the project_dir but then calls it from the working directory (PopCOGenT/src/core_gene_sweeps/).

Thank you!
Matthias

Is it acceptable to work with MAGs?

Hello,

I recently read your article about this tool and find it very interesting. However, I have a question about the expected inputs, because the publication mostly describes applying it to SAGs or even genomes from single cells. I wonder whether it would also be acceptable to include MAGs in the analysis, since a MAG is itself a composite of genomes from a population. And if so, which ones would be acceptable? Some groups work with medium-quality MAGs (completeness estimates of ≥50% and less than 10% contamination), while others work only with high-quality MAGs (>90% complete with less than 5% contamination).

Conda installation failed, indicating that the package is outdated

Hi, recently I came across an article about using PopCOGenT. When I tried to use it, I followed the installation command in the README. Unfortunately, conda informed me that the corresponding version mentioned in the yml file is no longer available.

Looking for: ['biopython=1.68', 'joblib=0.9.4', 'networkx=1.11', 'numpy=1.11.3', 'pandas=0.19.2', 'python=3.6', 'scipy=0.18.1', 'statsmodels=0.8.0', 'snakemake=3.11.2', 'rpy2=2.8.5', 'r-tidyverse==1.0.0=r3.3.2_0']


Encountered problems while solving:
  - package pandas-0.19.2-np112py36_1 is excluded by strict repo priority
  - package scipy-0.18.1-np112py36_blas_openblas_201 is excluded by strict repo priority
  - package biopython-1.68-py35_0 is excluded by strict repo priority
  - package joblib-0.9.4-py36_0 is excluded by strict repo priority
  - package r-tidyverse-1.0.0-r3.3.2_0 requires r-base 3.3.2*, but none of the providers can be installed
  - package rpy2-2.8.5-py27r3.3.1_2 requires r-base 3.3.1*, but none of the providers can be installed

Issue with R on phybreak4

When I run phybreak4.retrieveLikelihood.py, I get the following:

(PopCOGenT) connolly_j_husky_neu_edu@popcogent:/extra-space/home/PopCOGenT/src/core_gene_sweeps$ python3 phybreak4.retrieveLikelihood.py
/home/connolly_j_husky_neu_edu/.conda/envs/PopCOGenT/lib/R/bin/exec/R: /home/connolly_j_husky_neu_edu/.conda/envs/PopCOGenT/lib/R/bin/exec/../../lib/../../libtinfo.so.6: no version information available (required by /lib/x86_64-linux-gnu/libncursesw.so.5)
/home/connolly_j_husky_neu_edu/.conda/envs/PopCOGenT/lib/R/bin/exec/R: /home/connolly_j_husky_neu_edu/.conda/envs/PopCOGenT/lib/R/bin/exec/../../lib/../../libtinfo.so.6: no version information available (required by /lib/x86_64-linux-gnu/libncursesw.so.5)
/home/connolly_j_husky_neu_edu/.conda/envs/PopCOGenT/lib/R/bin/exec/R: /home/connolly_j_husky_neu_edu/.conda/envs/PopCOGenT/lib/R/bin/exec/../../lib/../../libtinfo.so.6: no version information available (required by /lib/x86_64-linux-gnu/libncursesw.so.5)
/home/connolly_j_husky_neu_edu/.conda/envs/PopCOGenT/lib/R/bin/exec/R: /home/connolly_j_husky_neu_edu/.conda/envs/PopCOGenT/lib/R/bin/exec/../../lib/../../libtinfo.so.6: no version information available (required by /lib/x86_64-linux-gnu/libncursesw.so.5)
/home/connolly_j_husky_neu_edu/.conda/envs/PopCOGenT/lib/R/bin/exec/R: /home/connolly_j_husky_neu_edu/.conda/envs/PopCOGenT/lib/R/bin/exec/../../lib/../../libtinfo.so.6: no version information available (required by /lib/x86_64-linux-gnu/libncursesw.so.5)
/home/connolly_j_husky_neu_edu/.conda/envs/PopCOGenT/lib/R/bin/exec/R: /home/connolly_j_husky_neu_edu/.conda/envs/PopCOGenT/lib/R/bin/exec/../../lib/../../libtinfo.so.6: no version information available (required by /lib/x86_64-linux-gnu/libncursesw.so.5)
/home/connolly_j_husky_neu_edu/.conda/envs/PopCOGenT/lib/R/bin/exec/R: /home/connolly_j_husky_neu_edu/.conda/envs/PopCOGenT/lib/R/bin/exec/../../lib/../../libtinfo.so.6: no version information available (required by /lib/x86_64-linux-gnu/libncursesw.so.5)
/home/connolly_j_husky_neu_edu/.conda/envs/PopCOGenT/lib/R/bin/exec/R: /home/connolly_j_husky_neu_edu/.conda/envs/PopCOGenT/lib/R/bin/exec/../../lib/../../libtinfo.so.6: no version information available (required by /lib/x86_64-linux-gnu/libncursesw.so.5)
/home/connolly_j_husky_neu_edu/.conda/envs/PopCOGenT/lib/R/bin/exec/R: /home/connolly_j_husky_neu_edu/.conda/envs/PopCOGenT/lib/R/bin/exec/../../lib/../../libtinfo.so.6: no version information available (required by /lib/x86_64-linux-gnu/libncursesw.so.5)
Fatal error: cannot open file 'phybreak.leafdist_compare.R': No such file or directory

It looks like some kind of versioning problem, but I haven't been able to resolve it. Any advice?

What if mugsy does not report all SNPs?

I'm reading the mugsy output for M1612_contigs and M1613_contigs from the test files. However, I found that some aligned blocks do not match perfectly:

a score=933 label=85 mult=2
s M1612_contigs.M1612_contigs_0		889370 931 + 1691529 GAATTATACAAAAATTTATAAATAATTATATCAAATATTACCCATGGGAAAGAGTAAGTACAAGAGGGATTGGAGCAAATACGATGAGAACGTTATAATGAGATATACCCTAATGTTCCCCTTCTACGTCTTTGAACACTGGTTTACTAGCAGAGGAGAATAGGAACGCTAGGGCAAAGTATAAAGCTCCAAAGGAATTTAACGAATTCCTCCACACCTACCCTATAGGGCCATAGAAGGAGAGCACTAGAAAGACTAAAGATCATCACAACAAGCCTAGACTACTCAACAATATGGGAAAGAATAAGAAACATGAACATAACATTCCCAGAGGCAAGTGATGAACTTGAAGCAGACGCAACGGGAATAAACAAGAGAGGACAATAGCAAAATGGGGTAAAACTAGAGACTCAAAATTCCTCAAGATGGACAAGGACGAATTCAACGTAATAAACGCTGAAGTAATTAGCAACGAAGTTAAGACGGTTAAGGATTCACAAGATAAGGGAAAGAAGGTTTTATGGGGATAAGGCTTATGATACCAACGAGGCTGGAGTTGAGGTTGTTGTCCCACCTAGGAAGAACGCTTCTACTAAACGCAGTCATCCTGCTAGGCTGTGAGGGAGTTCAAGAAACTTGGCTATAATCGTTGGAGGGAGGAGAAGGGTTATGGTGTTAGGTGGAGGGTTGAGTCCTTGTTTTCTGCTGTTAACTTTTGGGGAGTCTGTTAGGGCTACAAGTTTTTTAAGGCAAGTGGTTGAGGCCAAGTTCTGGGCTTATGCATGGATGGTCCACTTGGCTGTAGTCGATAGGGCTCACGGTATTAGGATGTGAGCTTGAGAATAACGTTGAAATAAATATTAATTACTGAAAAATTCTCC-TTATGTCG-TATCATGCTTATGAAATAAATTGAAGATATCAACAAAGCAAC
s M1613_contigs.M1613_contigs_0		48715 96 + 1741614 --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------GGGTATTAGGGTGTGAGCTTGCGAATAACGTTGAAATAAATATTAATTACTGAAAGATT-TCCGTTAT-ACGATATCGTTTTAATGAAATAAATTGAA-----------------

However, PopCOGenT does not appear to recognize the mismatch and simply treats each block as a whole sequence.

PopCOGenT/src/PopCOGenT/length_bias_functions.py

Lines 247 to 264 in 7296af9

    
    with open(alignment, 'r') as infile:
        '''
        Parser assumes a maf format where every alignment block begins with a
        statement of how many sequences are in that block, indicated by
        "mult=." Also assumes that the order of sequences in each block is
        the same.
        '''
        seqs = []
        total_len = 0
        for lines in infile:
            if 'mult=2' in lines:
                seq_line_1 = next(infile)
                block_1 = seq_line_1.split()[-1].strip()
                total_len += len(block_1)
                seq_line_2 = next(infile)
                block_2 = seq_line_2.split()[-1].replace('\n', '')
                seqs.append((block_1, block_2))
    return seqs, total_len

Is this an expected feature?

Thanks!
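The quoted parser only collects the two aligned rows of each `mult=2` block, so any comparison of the rows would have to happen in a later step. To inspect a block by hand, the columns can be classified as matches, mismatches, or gaps, as in the sketch below. This is an independent illustration with a made-up function name; it is not how PopCOGenT itself scores blocks.

```python
def summarize_block(block_1, block_2, gap_char="-"):
    """Count matching, mismatching and gapped columns between two aligned rows."""
    if len(block_1) != len(block_2):
        raise ValueError("aligned rows must have equal length")
    counts = {"match": 0, "mismatch": 0, "gap": 0}
    for a, b in zip(block_1.upper(), block_2.upper()):
        if a == gap_char or b == gap_char:
            counts["gap"] += 1        # at least one row has a gap here
        elif a == b:
            counts["match"] += 1
        else:
            counts["mismatch"] += 1   # a substitution between the two rows
    return counts
```

Applied to the `(block_1, block_2)` tuples returned by the parser, this would show how many substitutions each block carries alongside its gapped columns.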

flexible_genome_sweeps error

Hi,
I have been using your script on my data but an error occurred using flexible_genome_sweeps:

Waiting at most 5 seconds for missing files.
Error in job run_mmseqs while creating output files proc/acidovorax/clusters/DB.0, proc/acidovorax/clusters/clu.0.
MissingOutputException in line 88 of /home/rsiani/PopCOGenT-master/src/flexible_genome_sweeps/Snakefile:
Missing files after 5 seconds:
proc/acidovorax/clusters/clu.0
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Removing output files of failed job run_mmseqs since they might be corrupted:
proc/acidovorax/clusters/DB.0
Will exit after finishing currently running jobs.
Exiting because a job execution failed. Look above for error message

Have you got any idea of what could have caused it? Thanks in advance and congratulations for your work!

Fails when two input genomes are exactly identical

Need to make sure that an easy fix doesn't also break the clonal clustering portion of the pipeline.

FileNotFoundError #36

Hi Team,
I am facing the following issue [file not found error] while executing the PopCOGenT tool on around 361 gene sequences. I was able to run the tool on the test data.

context on the input data:
We are using sequences of a single gene from hundreds of isolates of the same genus. The sequence length of each gene is around 400 bp. Can this program be applied to single genes, or does it have to be used with draft genomes?

Thanks in advance,
Balu

Long run time for phyml script - phybreak3.MSAsubset_runPhyML.py

Hi,

We are currently running the core_gene_sweep module and are stuck at the third step, which runs phyml. Do you have any suggestions for making this step run faster, for example by increasing the number of threads or running the datasets in parallel?

Our dataset includes 53 genomes, and according to the phyml stats file, the number of datasets to run through is 108,133. At the current pace, the script would take more than 20 days to complete. The command being run is:

~/.local/bin/phyml -i PopCOGenT/src/core_gene_sweeps/output2align/phy_split/bovienii_core_genes.phy -n 108133 -q -m JC69 -f e -c 2 -a 0.022

Thank you,
Bhavya
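One possible direction, untested against PopCOGenT, is to split the multi-dataset PHYLIP file into several smaller files and run one phyml process per file, each with `-n` set to the number of datasets in that file. The sketch below covers only the splitting. It assumes sequential PHYLIP with one line per taxon (not interleaved), and the function names are hypothetical. Merging the per-chunk outputs back into the order that the later phybreak scripts expect is not shown and would need care.

```python
def split_phylip_datasets(text):
    """Split concatenated sequential PHYLIP datasets into dataset strings.

    Assumes each dataset is a header line 'ntaxa nsites' followed by exactly
    ntaxa non-empty lines (one line per taxon, not interleaved).
    """
    lines = [line for line in text.splitlines() if line.strip()]
    datasets = []
    i = 0
    while i < len(lines):
        ntaxa = int(lines[i].split()[0])
        block = lines[i:i + 1 + ntaxa]
        if len(block) != 1 + ntaxa:
            raise ValueError("truncated dataset at non-empty line {}".format(i + 1))
        datasets.append("\n".join(block) + "\n")
        i += 1 + ntaxa
    return datasets


def chunk(items, n_chunks):
    """Distribute items into n_chunks (>= 1) lists of near-equal size, in order."""
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for k in range(n_chunks):
        end = start + size + (1 if k < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks
```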

PopCOGenT run error

I had run PopCOGenT.sh successfully before, but it failed when I used another dataset. The error message is:
Traceback (most recent call last):
File "cluster.py", line 315, in <module>
main()
File "cluster.py", line 70, in main
linear_model=negative_selection_linear_fit())
File "cluster.py", line 298, in make_edgefile
n2 = ','.join(clonal_components[n2])
TypeError: sequence item 0: expected str instance, numpy.int64 found
Are there any special requirements for the input FASTA file names?
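The TypeError itself comes from `str.join`, which accepts only strings. One plausible trigger, offered here as a guess rather than a confirmed diagnosis, is genome file names made up solely of digits, which a table reader may load as integers instead of text. The sketch shows the mechanism with plain Python integers and a conversion that avoids it; `join_names` is a hypothetical helper.

```python
def join_names(names, sep=","):
    """Join identifiers of any type by converting each one to str first."""
    return sep.join(str(name) for name in names)


if __name__ == "__main__":
    numeric_names = [101, 202]
    try:
        ",".join(numeric_names)  # join() rejects non-string items
    except TypeError as exc:
        print("TypeError:", exc)
    print(join_names(numeric_names))
```

If the names are indeed numeric, renaming the input files so that they contain a letter would be a workaround that needs no code change.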

License and contributing

Hi,

thank you for providing this package. I am excited to use it.

Do you intend to provide a license, and perhaps a contributing guide?

Kind regards,

FileNotFoundError

Hi there,
Thank you for making this great tool. I'm running PopCOGenT on 190 bacterial genomes and got the error shown below. The complete log file is attached here: PopCOGenT.Gilliamella.log.

I was able to run PopCOGenT on the test dataset and another dataset with 80 genomes without any problem.

Thank you for any suggestions!

Yiyuan

multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/home/yli/miniconda3/envs/PopCOGenT/lib/python3.6/site-packages/joblib/parallel.py", line 130, in __call__
return self.func(*args, **kwargs)
File "/home/yli/miniconda3/envs/PopCOGenT/lib/python3.6/site-packages/joblib/parallel.py", line 72, in __call__
return [func(*args, **kwargs) for func, args, kwargs in self.items]
File "/home/yli/miniconda3/envs/PopCOGenT/lib/python3.6/site-packages/joblib/parallel.py", line 72, in <listcomp>
return [func(*args, **kwargs) for func, args, kwargs in self.items]
File "/User/yli/PopCoGenT/PopCOGenT/src/PopCOGenT/length_bias_functions.py", line 20, in align_and_calculate_length_bias
random_seed)
File "/User/yli/PopCoGenT/PopCOGenT/src/PopCOGenT/length_bias_functions.py", line 90, in align_genomes
remove('{align_directory}/{prefix}'.format(prefix=prefix, align_directory=alignment_dir))
FileNotFoundError: [Errno 2] No such file or directory: '/User/yli/Startover/step10_PopCOGenT/Gilliamella/proc//bl26xmXEZE6LhLqtpVeqoPmYmSCzNPVO'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/yli/miniconda3/envs/PopCOGenT/lib/python3.6/multiprocessing/pool.py", line 119, in worker
result = (True, func(*args, **kwds))
File "/home/yli/miniconda3/envs/PopCOGenT/lib/python3.6/site-packages/joblib/parallel.py", line 140, in __call__
raise TransportableException(text, e_type)
joblib.my_exceptions.TransportableException: TransportableException

FileNotFoundError Sat May 30 00:30:33 2020
PID: 41436 Python 3.6.10: /home/yli/miniconda3/envs/PopCOGenT/bin/python
...........................................................................
/home/yli/miniconda3/envs/PopCOGenT/lib/python3.6/site-packages/joblib/parallel.py in __call__(self=<joblib.parallel.BatchedCalls object>)
67 def __init__(self, iterator_slice):
68 self.items = list(iterator_slice)
69 self._size = len(self.items)
70
71 def __call__(self):
---> 72 return [func(*args, **kwargs) for func, args, kwargs in self.items]
self.items = [(, ('/User/yli/S...ella/Gilliamella_zhB3022_Acerana.fa.renamed.mugsy', '/User/yli/S...liamella_zhP0221M0141_Amellifera.fa.renamed.mugsy', '/User/yli/Startover/step10_PopCOGenT/Gilliamella/
73
74 def __len__(self):
75 return self._size
76
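
The first traceback ends in a remove() call on a temporary file that does not exist. The sketch below shows one way to make such a cleanup step tolerant of a missing file. remove_if_present is a hypothetical helper, not part of PopCOGenT, and the file name is made up; it also does not explain why the file was missing in the first place (for example, whether the aligner failed to write it for that genome pair).

```python
import os
import tempfile
from contextlib import suppress


def remove_if_present(path):
    """Delete path, ignoring the case where it is already gone."""
    with suppress(FileNotFoundError):
        os.remove(path)


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "example_alignment_prefix")
    open(target, "w").close()
    remove_if_present(target)  # removes the file
    remove_if_present(target)  # second call is a no-op instead of an error
    still_there = os.path.exists(target)
```

Suppressing the error only hides the symptom, so checking the log for the pair that produced no alignment output is still worthwhile.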

Failed to parse sequences

Hi,

I ran into trouble when running python get_alignment_and_length_bias.py on about 180 genomes.
The error is as follows:

.Parsing sequences for R2MyF9PMoHcjJAH9
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/home-user/miniconda3/envs/PopCOGenT/lib/python3.6/site-packages/joblib/parallel.py", line 130, in __call__
    return self.func(*args, **kwargs)
  File "/home-user/miniconda3/envs/PopCOGenT/lib/python3.6/site-packages/joblib/parallel.py", line 72, in __call__
    return [func(*args, **kwargs) for func, args, kwargs in self.items]
  File "/home-user/miniconda3/envs/PopCOGenT/lib/python3.6/site-packages/joblib/parallel.py", line 72, in <listcomp>
    return [func(*args, **kwargs) for func, args, kwargs in self.items]
  File "/mnt/home-user/software/PopCOGenT/src/PopCOGenT/length_bias_functions.py", line 26, in align_and_calculate_length_bias
    length_bias_file)
  File "/mnt/home-user/software/PopCOGenT/src/PopCOGenT/length_bias_functions.py", line 110, in calculate_length_bias
    g2size)
  File "/mnt/home-user/software/PopCOGenT/src/PopCOGenT/length_bias_functions.py", line 131, in get_transfer_measurement
    s1temp, s2temp = zip(*filtered_blocks)
ValueError: not enough values to unpack (expected 2, got 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home-user/miniconda3/envs/PopCOGenT/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/home-user/miniconda3/envs/PopCOGenT/lib/python3.6/site-packages/joblib/parallel.py", line 140, in __call__
    raise TransportableException(text, e_type)
joblib.my_exceptions.TransportableException: TransportableException
___________________________________________________________________________
ValueError                                         Wed Jul  7 14:32:25 2021
PID: 302719 Python 3.6.13: /home-user/miniconda3/envs/PopCOGenT/bin/python
...........................................................................
/home-user/miniconda3/envs/PopCOGenT/lib/python3.6/site-packages/joblib/parallel.py in __call__(self=<joblib.parallel.BatchedCalls object>)

I requested 16 threads for this job. It works when I apply it to other datasets with fewer than 100 genomes, so I am not sure whether the number of genomes matters.
Could you please give me some suggestions?

Thanks in advance!
Xiaojun
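
The ValueError is what Python raises when zip(*filtered_blocks) is unpacked into two names while filtered_blocks is empty, which would mean that no alignment blocks survived filtering for some genome pair. A minimal sketch of the failure and a guard follows; split_blocks is a hypothetical helper, the sequences are made up, and returning two empty tuples is only one possible policy (skipping or reporting the pair may be more appropriate).

```python
def split_blocks(filtered_blocks):
    """Return the two per-genome block tuples, or two empty tuples if no blocks remain."""
    if not filtered_blocks:
        return (), ()
    s1, s2 = zip(*filtered_blocks)
    return s1, s2


try:
    s1temp, s2temp = zip(*[])  # same failure mode as in the traceback
    failure = None
except ValueError as err:
    failure = str(err)

empty = split_blocks([])
pair = split_blocks([("ACGT", "ACGA"), ("TT", "TA")])
```

If this is the cause, the number of genomes matters only indirectly: a larger set is more likely to contain a pair of genomes that share no usable alignment blocks.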

Unexpected results using different genome sets

Hi, PopCOGenT is excellent software and we have used it to resolve many questions.

However, we have run into an issue as we gradually increased our genome set.
When the genome set is small (~500 genomes), we resolve the genomes into populations; for example, genomes A, B, and C all belong to pop1.
When the genome set is larger (~1000 genomes) and we resolve them again, genome A is in pop2, while genomes B and C are still in pop1.

This is the background to my questions.

While going through your code, I found that alpha=0.1 is set in the call to summary_frame. But the values in length_bias.txt should be 95% CIs. Should we therefore set alpha=0.05?
I believe this happens because the negative selection cutoff changes with the number of genomes. Is that normal, and how should we deal with it?
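
On the alpha question: assuming summary_frame refers to the statsmodels prediction-results method, its alpha argument is the significance level, so alpha=0.1 gives a 90% interval and alpha=0.05 a 95% interval. The sketch below uses only the standard library and a normal approximation to show how the interval half-width scales with alpha; the standard error is a made-up value and ci_half_width is a hypothetical helper.

```python
from statistics import NormalDist


def ci_half_width(std_error, alpha):
    """Half-width of a two-sided (1 - alpha) normal confidence interval."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z * std_error


std_error = 0.02  # hypothetical standard error of a fitted value
width_90 = ci_half_width(std_error, alpha=0.10)
width_95 = ci_half_width(std_error, alpha=0.05)
```

The 95% interval is wider by a factor of roughly 1.96 / 1.645, so switching alpha would move any cutoff derived from the interval bounds.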

Failed to run flexible_genome_sweeps

Hi,
The following is the error information with my data.

Error in job parse_orfs while creating output file output/Mabs/Mabs.0.orfs.csv.
RuleException:
CalledProcessError in line 86 of /home/dragon/Database/python3/PopCOGenT/src/flexible_genome_sweeps/Snakefile:
Command '/home/dragon/Database/python3/miniconda3/envs/PopCOGenT/bin/python /home/dragon/Database/python3/PopCOGenT/src/flexible_genome_sweeps/.snakemake.1730bs5o.parse_orfs.py' returned non-zero exit status 1.
File "/home/dragon/Database/python3/PopCOGenT/src/flexible_genome_sweeps/Snakefile", line 86, in __rule_parse_orfs
File "/home/dragon/Database/python3/miniconda3/envs/PopCOGenT/lib/python3.6/concurrent/futures/thread.py", line 56, in run
Will exit after finishing currently running jobs.
Exiting because a job execution failed. Look above for error message

I noticed the bug was from this line:

strain, contig, orf = re.match(r"(.*)([^_]+)(\d+)$", strain_contig_orf).groups()

Is there any special rule for contig names?
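
To see how ORF names interact with a pattern of this kind, the sketch below assumes names of the form strain_contig_orfnumber and an underscore-delimited pattern. Both the pattern and the sample names are assumptions for illustration and may differ from what the Snakefile actually uses. It shows that a name without a trailing _<digits> makes the match return None, in which case calling .groups() on the result would raise an AttributeError.

```python
import re

# Assumed pattern: <strain>_<contig>_<orf number>; not necessarily the one in the Snakefile.
PATTERN = re.compile(r"(.*)_([^_]+)_(\d+)$")


def parse_orf_name(name):
    """Return (strain, contig, orf) or None when the name does not fit the pattern."""
    match = PATTERN.match(name)
    return match.groups() if match else None


ok = parse_orf_name("Mabs_01_contig5_12")   # strain may itself contain underscores
bad = parse_orf_name("Mabs01|contig5")      # no underscore-delimited ORF number
```

Under this assumption, the contig name must not contain underscores and every ORF name must end in an underscore followed by digits.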

def parse_maf_blocks(alignment):  # placeholder name; the def line is not part of this excerpt
    """
    Parser assumes a maf format where every alignment block begins with a
    statement of how many sequences are in that block, indicated by
    "mult=." Also assumes that the order of sequences in each block is
    the same.
    """
    seqs = []       # one (block_1, block_2) tuple per two-sequence block
    total_len = 0   # summed length of the first sequence across kept blocks
    with open(alignment, 'r') as infile:
        for line in infile:
            # Only blocks that contain both genomes are kept.
            if 'mult=2' in line:
                seq_line_1 = next(infile)
                block_1 = seq_line_1.split()[-1].strip()
                total_len += len(block_1)
                seq_line_2 = next(infile)
                block_2 = seq_line_2.split()[-1].strip()
                seqs.append((block_1, block_2))
    return seqs, total_len

To activate this environment, use

$ conda activate PopCOGenT

To deactivate an active environment, use

$ conda deactivate

Base name for final output files. Just a prefix to identify your outputs.

Output directory for the final output files.

This will create the directory if it does not already exist.

Path to mugsy and mugsyenv.sh. Please provide absolute path.

Path to infomap executable. Please provide absolute path.

Path to genome files.

Genome file filename extension.

Are you running on a single machine? Please specify the number of threads to run.

This can, at maximum, be the number of logical cores your machine has.
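
The thread count can be checked against the machine before editing the configuration. A small sketch using only the standard library; requested_threads is a made-up value, and os.cpu_count() reports logical cores (it can return None on some platforms, handled here by falling back to 1):

```python
import os

requested_threads = 16  # hypothetical value taken from a job request

# Clamp the request to the number of logical cores actually available.
logical_cores = os.cpu_count() or 1
num_threads = max(1, min(requested_threads, logical_cores))
```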

Whether to keep alignments after length bias is calculated.

Alignment files can be 10MB each and thus a run on 100 genomes can take up on the order of 50 GB of space if alignment files are not discarded.

If you want to keep alignments, set to --keep_alignments. Otherwise leave as ''.
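
The 50 GB estimate follows from the number of genome pairs. The sketch below assumes one alignment file per unordered pair of genomes at the quoted 10 MB each; alignment_storage_gb is a hypothetical helper and actual file sizes will vary with genome size and similarity.

```python
from math import comb


def alignment_storage_gb(n_genomes, mb_per_alignment=10):
    """Approximate disk use if one alignment is kept per unordered genome pair."""
    n_pairs = comb(n_genomes, 2)
    return n_pairs * mb_per_alignment / 1000


pairs_100 = comb(100, 2)                 # 4950 pairs
gb_100 = alignment_storage_gb(100)       # 49.5 GB, matching the estimate above
gb_190 = alignment_storage_gb(190)       # 179.55 GB
```

Because the number of pairs grows quadratically, doubling the genome count roughly quadruples the space needed when alignments are kept.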

Directory for output alignments. Must provide absolute path.

Are your genomes single-cell genomes? If so, this should equal --single_cell. Otherwise leave as ''.

Are you using a slurm environment? Then this should equal --slurm, otherwise, leave as empty quotes.

If using slurm, please specify the output directory for the runscripts and source scripts. Absolute paths required.
