synerclust's Introduction

SynerClust README

Dependencies

Already included:

Installation

You can install SynerClust by running the following command from the main folder:

python INSTALL.py

By default, the installer assumes that Blast+ is in your PATH. If it is not, use the "-e" option to specify the path to the Blast+ bin folder.

Input Data

data_catalog.txt should be formatted as in the following example (paths can be relative or absolute):

//
Genome	Esch_coli_H296
Sequence	Esch_coli_H296/Esch_coli_H296.genome
Annotation	Esch_coli_H296/Esch_coli_H296_PRODIGAL_2.annotation.gff3
//
Genome	Esch_coli_H378_V1
Sequence	Esch_coli_H378_V1/Esch_coli_H378_V1.genome
Annotation	Esch_coli_H378_V1/Esch_coli_H378_V1_PRODIGAL_2.annotation.gff3
//
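When scripting around the catalog, the format above can be read with a few lines of Python. This is a minimal sketch and not part of SynerClust: `parse_data_catalog` is a hypothetical helper, and it assumes each field line is a key followed by whitespace and a value, with "//" lines separating records.

```python
EXAMPLE = """//
Genome\tEsch_coli_H296
Sequence\tEsch_coli_H296/Esch_coli_H296.genome
Annotation\tEsch_coli_H296/Esch_coli_H296_PRODIGAL_2.annotation.gff3
//
Genome\tEsch_coli_H378_V1
Sequence\tEsch_coli_H378_V1/Esch_coli_H378_V1.genome
Annotation\tEsch_coli_H378_V1/Esch_coli_H378_V1_PRODIGAL_2.annotation.gff3
//
"""


def parse_data_catalog(text):
    """Return one dict per '//'-delimited record (hypothetical helper)."""
    records = []
    current = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "//":
            # A separator closes the record collected so far, if any.
            if current:
                records.append(current)
                current = {}
            continue
        # Assumption: key and value are separated by a tab or spaces.
        parts = line.split(None, 1)
        current[parts[0]] = parts[1] if len(parts) > 1 else ""
    if current:
        # Tolerate a file that lacks the trailing separator.
        records.append(current)
    return records


if __name__ == "__main__":
    for record in parse_data_catalog(EXAMPLE):
        print(record["Genome"], record["Sequence"], record["Annotation"])
```

Listing the "Genome" values this way makes it easy to compare them against the leaf names of the Newick tree before starting a run.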

Running

On a single machine:

The minimal command to run SynerClust is the following:

/path/to/SynerClust/bin/synerclust.py -r /path/to/data_catalog.txt -w /working/directory/ -t /path/to/newick/tree.nwk [-n number_of_cores] [--run single]

If you use the "--run single" option, that is all you need to do! The results are written in the root folder, and symlinks to them can be found in the main "results" folder.

If you prefer to run step by step, the next steps are:

Run the script indicated (all tasks can be run in parallel on a grid):

/working/directory/genomes/needed_extractions.cmd.txt

You can then start the actual computation (in part parallelizable on a grid):

/working/directory/jobs.sh

Once all jobs have finished, run the following to get an easy-to-read output of the clusters:

/working/directory/post_process_root.sh

Among other outputs, this generates the final_clusters.txt and clusters_to_locus.txt files with the results in the "results" folder (which links to the root node).
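On a single machine, the three step-by-step scripts can also be chained from Python. The sketch below is an illustration only: `step_commands` and `run_steps` are hypothetical names, and running each file through `sh` is an assumption about how the generated scripts are meant to be executed.

```python
import os
import subprocess

# Generated files named in the steps above, relative to the working directory.
STEP_FILES = [
    os.path.join("genomes", "needed_extractions.cmd.txt"),
    "jobs.sh",
    "post_process_root.sh",
]


def step_commands(working_dir):
    """Build one argv list per step, in the order they must run."""
    # Assumption: each generated file is a plain shell script.
    return [["sh", os.path.join(working_dir, rel)] for rel in STEP_FILES]


def run_steps(working_dir):
    """Run the steps sequentially, stopping at the first failure."""
    for argv in step_commands(working_dir):
        # check_call raises CalledProcessError on a non-zero exit status.
        subprocess.check_call(argv)


if __name__ == "__main__":
    for argv in step_commands("/working/directory"):
        print(" ".join(argv))
```

This runs every task serially, so it gives up the parallelism that a grid would provide for the first two steps.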

With a UGE cluster:

Initialize your environment (if on UGE):

use Python-2.7
use UGER

The minimal command to run SynerClust is the following:

/path/to/SynerClust/bin/synerclust.py -r /path/to/data_catalog.txt -w /working/directory/ -t /path/to/newick/tree.nwk [-n number_of_cores] [--run uger]

If you use the "--run uger" option, that is all you need to do! The results are written in the root folder, and symlinks to them can be found in the main "results" folder.

If you prefer to run step by step, the next steps are:

Run the script indicated:

python /path/to/SynerClust/uger_auto_submit_simple.py -f /working/directory/genomes/needed_extractions.cmd.txt -tmp TMP_FOLDER

You can then start the actual computation (parallelizable on the grid):

/working/directory/uger_jobs.sh

If there are more jobs than your queue allows, run:

/path/to/SynerClust/uger_auto_submit.py -f /working/directory/uger_jobs.sh -l queue_size_limit [-n number_of_cores_per_job]

Once all jobs have finished, run the following to get an easy-to-read output of the clusters:

/working/directory/post_process_root.sh

Among other outputs, this generates the final_clusters.txt and clusters_to_locus.txt files with the results in the "results" folder (which links to the root node).

Help/Questions

Output files

clusters.txt: Contains a list of all genes grouped by cluster, with annotation information. One gene per line; clusters are separated by empty lines.

final_clusters.txt: List of all the transcript IDs per cluster, with the count of genes and taxa. One cluster per line.

clust_to_trans.txt: List of all transcript IDs, their cluster assignment, and the most commonly encountered gene name that is not "hypothetical protein". One gene per line.

cluster_dist_per_genome.txt: Table of the number of genes from each taxon per cluster. One cluster per line.
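Because clusters.txt puts one gene on each line and separates clusters with empty lines, it can be split into clusters without knowing its column layout. The following is a minimal sketch under that assumption only; `read_clusters` is a hypothetical helper, and each gene line is kept as an opaque string because the columns are not described here.

```python
def read_clusters(lines):
    """Group non-empty lines into clusters separated by empty lines."""
    clusters = []
    current = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.strip():
            current.append(line)
        elif current:
            # An empty line ends the cluster being collected.
            clusters.append(current)
            current = []
    if current:
        # The last cluster may not be followed by an empty line.
        clusters.append(current)
    return clusters


if __name__ == "__main__":
    # Placeholder gene lines; real lines also carry annotation information.
    demo = ["geneA\n", "geneB\n", "\n", "geneC\n"]
    for index, cluster in enumerate(read_clusters(demo), start=1):
        print(index, len(cluster))
```

With a real run, pass an open file handle instead of the demo list, for example `read_clusters(open("clusters.txt"))`.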

Keeping track of progress

A file "completion.txt" is generated in the main folder of the run where you can keep track of what nodes have finished being processed.

In case a node encountered an error, another file, "not_completed.txt" is generated containing the node identifier.
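A small script can summarize both files while a run is in progress. This is a sketch under stated assumptions: the layout of the two files is not described here beyond their names, so it assumes one node identifier per line, and `read_node_ids` and `progress` are hypothetical helpers.

```python
import os


def read_node_ids(path):
    """Return the non-empty lines of a file, or [] if it does not exist yet."""
    if not os.path.exists(path):
        return []
    with open(path) as handle:
        # Assumption: one node identifier per line.
        return [line.strip() for line in handle if line.strip()]


def progress(run_dir):
    """Return (finished_nodes, failed_nodes) for a run folder."""
    done = read_node_ids(os.path.join(run_dir, "completion.txt"))
    failed = read_node_ids(os.path.join(run_dir, "not_completed.txt"))
    return done, failed


if __name__ == "__main__":
    done, failed = progress("/working/directory")
    print("finished: %d, failed: %d" % (len(done), len(failed)))
```

A missing not_completed.txt is treated as "no failures so far", since that file is only generated when a node encounters an error.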

Running SynerClust on an extended dataset

Simply rerun SynerClust as you did on the first dataset, but set the working directory to the same one as the first run.

List of Parameters and their meaning

-t SPECIES_TREE, --tree SPECIES_TREE
    Species tree relating all of the genomes to be analyzed. (Required)

-r COBRA_REPO, --repo COBRA_REPO
    Complete path to data_catalog in the repository containing your genomic data. (Required)

-w WORKING_DIR, --working WORKING_DIR
    Complete path to the working directory for this analysis. (Required)

-m MIN_BEST_HIT, --min_best_hit MIN_BEST_HIT
    Minimal % of match length for Blastp hits compared to best one. (default = 0.8)

-B BLAST_EVAL, --blast_eval BLAST_EVAL
    Minimal e-value for Blastp hits. (default = 0.00001)

-l LOCUS_FILE, --locus LOCUS_FILE
    A locus_tag_file.txt that corresponds to the data in this repository

-N CODED_NWK_FILE, --newick_tag CODED_NWK_FILE
    Output file for the newick tree using tag names and number of genomes as distances.

-n NUM_CORES, --num_cores NUM_CORES
    The number of cores used for blast analysis (-a flag), (default = 4)

-F MINSYNFRAC, --min_syntenic_fraction MINSYNFRAC
    Minimum common syntenic fraction required for two genes from the same species to be considered paralogs, range [0.0,1.0], default=0.7

-D DIST, --dist DIST
    Maximum FastTree distance between a representative sequence and sequences being represented for representative selection. (default = 1.2)

-s SYNTENY_WINDOW, --synteny_window SYNTENY_WINDOW
    Distance in base pairs that will contribute to upstream and downstream to syntenic fraction. The total window size is [int]*2. (default = 6000)

--no-synteny
    Disable use of synteny (required if information not available).

--run {none,single,uger}
    Specify if you want all computation to be run directly. Use "single" to run on the local machine or "uger" to submit to the UGE grid.

--alignment {none,scc,all}
    Specify if you want cluster alignments using MUSCLE to be computed and written for the root node. Use "all" if you want all clusters to be aligned or "scc" if you only want Single Copy Core clusters to be aligned.
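When launching runs from another script, the command line can be assembled from the options listed above. The sketch below is illustrative: `build_command` is a hypothetical helper that uses only flags documented in this list and makes no claim about how synerclust.py parses them internally.

```python
RUN_MODES = ("none", "single", "uger")


def build_command(script, catalog, working_dir, tree,
                  num_cores=None, run=None, no_synteny=False):
    """Return an argv list for synerclust.py using the documented flags."""
    # -r, -w and -t are the three required options.
    argv = [script, "-r", catalog, "-w", working_dir, "-t", tree]
    if num_cores is not None:
        argv += ["-n", str(num_cores)]
    if run is not None:
        if run not in RUN_MODES:
            raise ValueError("run must be one of %s" % (RUN_MODES,))
        argv += ["--run", run]
    if no_synteny:
        argv.append("--no-synteny")
    return argv


if __name__ == "__main__":
    command = build_command(
        "/path/to/SynerClust/bin/synerclust.py",
        "/path/to/data_catalog.txt",
        "/working/directory/",
        "/path/to/newick/tree.nwk",
        num_cores=4,
        run="single",
    )
    print(" ".join(command))
```

The resulting list can be passed directly to `subprocess.check_call`, which avoids shell quoting problems with paths that contain spaces.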

synerclust's People

Contributors

brianjohnhaas, georgescuc

synerclust's Issues

ClusterPostProcessing

Hello,
I am running SynerClust on the test E. coli example files you provided but am having problems compiling the final output files.

The command given in the instructions is as follows:
path/to/SynerClust/bin/ClusterPostProcessing.py genomes/ nodes/N____*****/locus_mappings.pkl n

However, there are only locus_mapping.pkl files in the L_000000_* folders, not in the N____* folders. Is this a typo, or should there be "locus_mapping.pkl" files in the N_* node folders?

Thanks, Blake

AttributeError: 'Graph' object has no attribute 'edge'

I get this error when starting to run SynerClust. It then fails and exits.
Traceback (most recent call last):
  File "/gsap/garage-bacterial/Users/Tim/SynerClust/bin/synerclust.py", line 197, in <module>
    main()
  File "/gsap/garage-bacterial/Users/Tim/SynerClust/bin/synerclust.py", line 107, in main
    myTree.rootTree(root_edge)
  File "/gsap/garage-bacterial/Users/Tim/SynerClust/bin/TreeLib.py", line 193, in rootTree
    re_weight = (self.tree.edge[root_edge[0]][root_edge[1]]['weight']) / 2.0
AttributeError: 'Graph' object has no attribute 'edge'

FormatAnnotation_external.py GFF file format specificity

The FormatAnnotation_external.py helper script results in errors if the GFF3 file format/content deviates from the test dataset.

Specifically, the script assumes that "CDS" lines are preceded by "gene" lines. This is not always the case in prokaryote annotation done with Prodigal 2.6.3 (no "gene" lines by default) or Prokka 1.14 (no "gene" lines by default; when the --addgenes flag is used, they are added below the "CDS" lines).

I fixed this locally for my use case, but do not have a fix for the parser that is usable for the various GFF3 formats.
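One possible way to tolerate these variants is to collect "gene" and "pseudogene" features when they exist and fall back to "CDS" features otherwise, without relying on line order. The sketch below is not the code of FormatAnnotation_external.py; `pick_gene_features` is a hypothetical helper that only assumes the standard nine tab-separated GFF3 columns.

```python
def pick_gene_features(gff_lines, gene_types=("gene", "pseudogene")):
    """Return gene-like features, or CDS features if no gene lines exist.

    Each feature is (seqid, type, start, end, strand, attributes).
    """
    genes = []
    cds = []
    for raw in gff_lines:
        if raw.startswith("#") or not raw.strip():
            continue  # skip directives, comments and empty lines
        fields = raw.rstrip("\n").split("\t")
        if len(fields) != 9:
            continue  # not a feature line (e.g. embedded FASTA sequence)
        feature = (fields[0], fields[2], int(fields[3]), int(fields[4]),
                   fields[6], fields[8])
        if fields[2] in gene_types:
            genes.append(feature)
        elif fields[2] == "CDS":
            cds.append(feature)
    # Order of "gene" and "CDS" lines no longer matters: decide at the end.
    return genes if genes else cds


if __name__ == "__main__":
    # Made-up feature line for illustration only.
    demo = [
        "##gff-version 3\n",
        "contig1\tProdigal\tCDS\t10\t100\t.\t+\t0\tID=cds1\n",
    ]
    print(pick_gene_features(demo))
```

Deciding after the whole file has been read means a "gene" line placed below its "CDS" line is handled the same way as one placed above it.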

Support for pseudogenes in refseq gff3 files

In line 196 of FormatAnnotation_external.py, I had to change it to
if line[2] == "gene" or line[2] == "pseudogene":
Some genes were labeled as pseudogene instead of gene and it was affecting all downstream analysis.
I suggest updating the code to something like this to support gff3 files with pseudogenes.

Error building repo_spec

Hi,

I am having trouble with the first steps of SynerClust and cannot figure out what is going wrong. I am trying to apply it to a set of 80 genomes (a mix of draft and complete), using a Newick tree built with PhyloPhlan2 as input.

I always get the error message that one of my genomes is present in the tree but absent in repo_spec.

bin/synerclust.py -w wd/ -r wd/paths.txt -t wd/xanthomonadaceae.nwk --run single -n 3

Started
Wrote locus tags to locus_tag_file.txt
reading genome to locus
reading tree
[TREE.NWK]
parsing tree
Error: Genome
Stenotrophomonas_maltophilia_CFBP3035 found in the tree but not in the repo_spec.

I checked the spelling of the names between all the input files multiple times and nothing's wrong.

Here are the log files and paths files :
locus_tag_file.txt
needed_extractions.cmd.txt
paths.txt
run_SynerClust.log

Could you give me a hand to understand what's wrong here?

Thanks.

WF_RefineClusters.py errors at line 284 and stalls

I got this error message.
Traceback (most recent call last):
  File "/home/unix/tstraub/.conda/envs/synerclust/lib/python2.7/multiprocessing/process.py", line 267, in _bootstrap
    self.run()
  File "/gsap/garage-bacterial/Users/Tim/.conda/envs/synerclust/bin/WF_RefineClusters.py", line 58, in run
    identical_index = next_task(self.mrca, genes_to_cluster, self.cluster_counter, self.lock, ok_trees, identical_orphans_to_check, identical_orphans_to_check_dict, identical_index, potentials, self.minSynFrac, self.synteny)
  File "/gsap/garage-bacterial/Users/Tim/.conda/envs/synerclust/bin/WF_RefineClusters.py", line 284, in __call__
    if pairs[k][1] == 0.0 and self.graph[n1][k]['identical'] == 1 and self.graph[k][n1]['identical'] == 1:
KeyError: 0

I traced it to line 284. I edited the script to include checking for each key value before trying to access the dictionary with given value. I.e.

if pairs[k][1] == 0.0 and n1 in self.graph and k in self.graph[n1] and self.graph[n1][k]['identical'] == 1 and k in self.graph and n1 in self.graph[k] and self.graph[k][n1]['identical'] == 1:

This seemed to fix the error, though I did not verify that the tool performed the analysis as expected.
