fephyfofum / pyphlawd Goto Github PK

View Code? Open in Web Editor NEW

21.0 21.0 7.0 12.92 MB

Python version of PHLAWD

License: GNU General Public License v2.0

Python 98.84% Shell 0.05% Cython 1.11%

clustering molecular-evolution phylogenetics

pyphlawd's People

Contributors

Stargazers

Watchers

Forkers

lj-zhao stewartialhy lunasare rlp52 pebgroup alexanderabair teagerv

pyphlawd's Issues

Location of logfile

Could the logfile log.md.gz be written to the output directory by default? I have scripts which run things concurrently, but if launched from the same directory they write to the same file. Same goes for mafft.out (although this is a temp file, pyphlawd complains when trying to delete it if already deleted by another process).

FastTree NOT IN PATH (ఠ్ఠ ˓̭ ఠ్ఠ)

Fasttree is in my path, but as fasttree and fasttreeMP (installed with sudo apt-get). Given that it gets installed using different casing pyphlawd should probably accommodate this.

cnode

Any suggestions for the following error?
Traceback (most recent call last):
File "../programs/PyPHLAWD/src/setup_clade.py", line 3, in
import tree_reader
File "/home/listona/programs/PyPHLAWD/src/tree_reader.py", line 4, in
from cnode import Node
ImportError: No module named cnode

Traceback error

I am getting the following traceback error while trying to run the baited example.

MAKING TREE Adoxaceae
Traceback (most recent call last):
File "Desktop/PyPHLAWD/src/get_ncbi_tax_tree_no_species.py", line 97, in
tree = construct_tree(taxon, DB, taxalist)
File "Desktop/PyPHLAWD/src/get_ncbi_tax_tree_no_species.py", line 53, in construct_tree
c.execute("select ncbi_id from taxonomy where name = ? and node_rank != 'species'", (taxon, ))
sqlite3.OperationalError: no such table: taxonomy
Traceback (most recent call last):
File "../../src/setup_clade_bait.py", line 39, in
trn = tree_reader.read_tree_file_iter(tname).next().label
StopIteration

I cannot figure out which file exactly is causing the problem.

missing variable in find_good_clusters_for_concat.py

Hi there,
I cloned the repo this week to try and build a large tapeworm phylogeny. When I ran the find_good_clusters script I got the error that the 'py' variable was undefined. To remedy, I just added "from conf import py" to the beginning of the script.

sqlite3.OperationalError: no such column: custom_id

Hi guys,

I am having problems trying to do runs with a fresh install of PyPHLAWD and a freshly made pln db from phlawd_db_maker 0.3. The full error is this:

Traceback (most recent call last):
  File "/home/nat/Applications/PyPHLAWD/src/populate_dirs_first_wc.py", line 45, in <module>
    mfid_in(tid,DB,dirl+dirr+"/"+orig+".fas",dirl+dirr+"/"+orig+".table",True,limitlist = taxalist) 
  File "/home/nat/Applications/PyPHLAWD/src/get_subset_genbank_wc.py", line 134, in make_files_with_id_internal
    c.execute("select name,custom_id,custom_parent_id,custom_name from taxonomy where ncbi_id = ? and name_class = 'scientific name'",(str(taxonid),))
sqlite3.OperationalError: no such column: custom_id

Seems like maybe formatting of ncbi database has changed?

Cheers,

Nat

Work on incertae sedis

treemake = True not implemented

Tree files are created, but empty. I don't think it has to do with my installed version of fasttree, but this would be good to be confirmed.

Cannot download prebuilt databases

Hi,

I cannot download any of the prebuilt databases, is the server down? Now I'm about to build a database for invertebrates but it take ages, as you can imagine.

Any help or advice is highly appreciated,

Bastian

add run html

add html for the analyses that you start from within pyphlawd

Monotypic taxa

e.g. an Order with a single family. Two issues:

constraint tree from find_good_clusters_for_concat.py includes root edge, which RAxML barfs on. would be nice to not write these. probably will involve internal knuckles as well. pxcltr can remove these but would be nice to not have to bother.
again using find_good_clusters_for_concat.py I get different numbers of default clusters depending on if I run it from the top directory (e.g., 3) or down one (e.g., 1). this doesn't make sense as no new taxa are added moving to the top directory.

Add functionality to relax constraint to only higher taxonomic levels

The idea would be to have a constraint for levels starting at the family level and higher, but to leave the genera and species to be unconstrained. This would allow to identify possible misidentifies samples and/or problems in the taxonomy, e.g., non-monophyletic genera described from morphology only

Traceback errors

PyPHLAWD is returning the following traceback errors:

File "src/setup_clade.py", line 40, in
trn = tree_reader.read_tree_file_iter(tname).next().label
File "/Users/phillipharris/Projects/PyPHLAWD/src/tree_reader.py", line 105, in read_tree_file_iter
yield read_tree_string(i.strip())
File "/Users/phillipharris/Projects/PyPHLAWD/src/tree_reader.py", line 83, in read_tree_string
curnode.add_child(newnode)
AttributeError: 'NoneType' object has no attribute 'add_child'

I have checked all dependencies and required path statements and everything seems to be in order. Any help/suggestions would be most appreciated.

About function

Might be nice to have an about (or get_info) function that describes a db. Information would include:

date created
ncbi release
root taxon
number terminal taxa
other stuff

Just a thought. Feel free to disregard.

Migrate to Python 3

Python 2 is in its twilight phase now, and more and more everything is moving to Python 3. For posterity's sake, it would be good if PyPHLAWD made the move as well. This should be relatively simple, as most, if not all of the python code should be automatically translatable using 2to3

can't compile with cython

If I run bash compile_cython.sh, I get the following:

[ptitle@iu-eri-006207 src]$ bash compile_cython.sh 
running build_ext
building 'cnode' extension
cc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -O3 -Wall -fPIC -I/home/linuxbrew/.linuxbrew/include -I/home/linuxbrew/.linuxbrew/opt/openssl/include -I/home/linuxbrew/.linuxbrew/opt/sqlite/include -I/home/linuxbrew/.linuxbrew/Cellar/python/3.7.2_1/include/python3.7m -c cnode.c -o build/temp.linux-x86_64-3.7/cnode.o
cnode.c: In function '__Pyx_ExceptionSwap':
cnode.c:7149:22: error: 'PyThreadState {aka struct _ts}' has no member named 'exc_type'
     tmp_type = tstate->exc_type;
                      ^
cnode.c:7150:23: error: 'PyThreadState {aka struct _ts}' has no member named 'exc_value'
     tmp_value = tstate->exc_value;
                       ^
cnode.c:7151:20: error: 'PyThreadState {aka struct _ts}' has no member named 'exc_traceback'
     tmp_tb = tstate->exc_traceback;
                    ^
cnode.c:7152:11: error: 'PyThreadState {aka struct _ts}' has no member named 'exc_type'
     tstate->exc_type = *type;
           ^
cnode.c:7153:11: error: 'PyThreadState {aka struct _ts}' has no member named 'exc_value'
     tstate->exc_value = *value;
           ^
cnode.c:7154:11: error: 'PyThreadState {aka struct _ts}' has no member named 'exc_traceback'
     tstate->exc_traceback = *tb;
           ^
error: command 'cc' failed with exit status 1

I checked, and I have the latest version of cython, installed in the package library for python3:

[ptitle@iu-eri-006207 src]$ pip3 install --upgrade cython
Requirement already up-to-date: cython in /home/linuxbrew/.linuxbrew/lib/python3.7/site-packages (0.29.6)

Any ideas?

setup_clade_ap.py crashes

Hey,

I used the command once before on a larger database and it worked fine, but the second time I tried it stopped in the middle and began repeating
Warning: [blastn] Number of threads was reduced to 8 to match the number of available CPUs
non-stop for a long time without working.
I tried again from the start and now it's just frozen.

I'm new to this so if you can explain to me what might be the problem in a simple way I would really appreciate it.

Option for excluding below species sequences

wget crashes when trying to downloading db

Hi,

I am trying to download a pre-built database, but wget crashes regularly, same with curl. Not sure if this is because something goes wrong on my site or yours, I have tried different internet connections, but it's the same. Do you have any advice on how to download the pre-built databases?

wget output:
wget -c -v "http://141.211.236.35:10998/pln.05082018.db"
--2019-01-25 16:16:11-- http://141.211.236.35:10998/pln.05082018.db
Connecting to 141.211.236.35:10998... connected.
HTTP request sent, awaiting response... 206 Partial Content
Length: 22727467008 (21G), 10828490310 (10G) remaining [application/octet-stream]
Saving to: ‘pln.05082018.db’

pln.05082018.db 52%[+++++++++++++++++++++++++++ ] 11.09G 844KB/s in 16s

2019-01-25 16:16:27 (411 KB/s) - Connection closed at byte 11905569730. Retrying.

--2019-01-25 16:16:28-- (try: 2) http://141.211.236.35:10998/pln.05082018.db

Curl output:
curl -v -o plant.db http://141.211.236.35:10998/pln.05082018.db

Trying 141.211.236.35...
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0* Connected to 141.211.236.35 (141.211.236.35) port 10998 (#0)

GET /pln.05082018.db HTTP/1.1
Host: 141.211.236.35:10998
User-Agent: curl/7.47.0
Accept: /

< HTTP/1.1 200 OK
< Accept-Ranges: bytes
< Content-Length: 22727467008
< Content-Type: application/octet-stream
< Last-Modified: Tue, 08 May 2018 18:38:14 GMT
< Date: Sat, 26 Jan 2019 00:11:38 GMT
<
{ [1167 bytes data]
0 21.1G 0 6319k 0 0 434k 0 14:11:07 0:00:14 14:10:53 689k* transfer closed with 22719634945 bytes remaining to read
0 21.1G 0 7648k 0 0 496k 0 12:25:21 0:00:15 12:25:06 887k

Closing connection 0
curl: (18) transfer closed with 22719634945 bytes remaining to read

Problem creating seq files when running setup_clade_ap.py.

Question Where is the -s parameter (SEQGZFOLDER) for setup_clade_ap.py meant to point?

Issue: I seem to be having a problem populating the gzip directory with sequences. The .table file is all populated from the ncbi db, but it's not finding the sequences. I'm not sure where the -s parameter is supposed to be pointing maybe? ~/ is where all the compressed ncbi files are from phlawd_db_maker.

snail@snailbuntu:~/PyPHLAWD/src$ python3 setup_clade_ap.py -t Architaenioglossa -b /media/snail/RED1/ncbi/inv.db -o ~/Desktop/ -s ~/ -l ~/Desktop/logfile
STARTING PYPHLAWD *。ヾ(｡&gt;ｖ&lt;｡)ﾉﾞ*。
MAKING TREE Architaenioglossa ٩(๑꒦ິȏ꒦ິ๑)۶
MAKING DIRS IN /home/snail/Desktop ヽ(*´∀`)ﾉﾞ
PROBLEM CREATING /home/snail/Desktop/Architaenioglossa_75116 (´；ω；`)
POPULATING DIRS /home/snail/Desktop ヽ/❀o ل͜ o\ﾉ
Traceback (most recent call last):
  File "/home/snail/PyPHLAWD/src/populate_dirs_first.py", line 47, in <module>
    mfid_in(tid,DB,dirl+dirr+"/"+orig+".fas",dirl+dirr+"/"+orig+".table",gzfileloc,True,limitlist = taxalist) 
  File "/home/snail/PyPHLAWD/src/get_subset_genbank.py", line 275, in make_files_with_id_internal
    idstoseq = get_seqs_from_gz(gzfileloc,fn,files_ids[fn])
  File "/home/snail/PyPHLAWD/src/get_subset_genbank.py", line 24, in get_seqs_from_gz
    fl = gzip.open(gzdir+"/"+filename,"r")
  File "/usr/lib/python3.8/gzip.py", line 58, in open
    binary_file = GzipFile(filename, gz_mode, compresslevel)
  File "/usr/lib/python3.8/gzip.py", line 173, in __init__
    fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')
FileNotFoundError: [Errno 2] No such file or directory: '/home/snail//seqs.Viviparus subpurpureus voucher USNM 1292588 histone 3 (H3) gene, partial cds.'
CREATED TEMPDIR_44273/
CLUSTERING SINGLE /home/snail/Desktop/Architaenioglossa_75116/Cyclophoroidea_75117/Megalomastomatidae_928797/Acroptychia_928777 ヽ(｡´･д･)ﾉ
Traceback (most recent call last):
  File "/home/snail/PyPHLAWD/src/cluster_tree.py", line 38, in <module>
    tablename = [x for x in files if ".table" in x][0]
IndexError: list index out of range
PYPHLAWD DONE ヽ(^□^｡)ノ
Total time (H:M:S): 0:00:00.638717 ٩(º౪º๑)۶
(⌐■_■)

Steps taken: Followed the steps on the Install page. Built phlawd_db_maker and all dependencies without errors. Built the database with phlawd_db_maker with no errors. Followed directions on the Runs page for a clustering analysis. Python version is 3.8.10

I know Python pretty well, so if I find a fix I'll make a pull request.

seqs at internal nodes

it is not correctly populating sequences at internal nodes.
a good test is fusarium

add outgroup

adding the functionality to add outgroups to the matrices and will make a new matrix.

enhancement: create conda package

Setup of PyPHLAWD would be much more convenient if it and all of its dependencies were wrapped up into a conda package. As it stands, it is a bit tedious. Many of the dependencies are already available via bioconda, and this package would be a perfect fit in bioconda.

switch to argparse

eh, this is one that probably should have made it in there a couple weeks ago, but slipped. so will do soon.

beautify html

PROBLEM REDOING ALIGNMENT

Hi!

I keep encountering "PROBLEM REDOING ALIGNMENT" when performing either the cluster or baited analysis. MAFFT is root installed and all the dependencies have been exported in PATH.

When doing a baited analysis to get Adoxaceae sequences, the info.csv contains:
"species,rbcL.fa
Viburnum kansuense,x
Viburnum erosum,x
..."

I also get problem_subMSAtable and problem_temp.mergealn. It is the same when doing a cluster analysis.

I suspected that it may be pycat, but it is clearly in PATH (I can see it when echo $PATH). I am not really getting an output overall. I would appreciate any insight to troubleshoot this issue.

Thank you very much,
Shing

find_good_clusters_for_concat should have extra args

It has become abundantly clear that a global conf.py does not work for every clade. Specifically:

smallest_cluster
cluster_prop

Presently find_good_clusters_for_concat.py just uses the values from conf.py. It is a bit of a pain to edit the latter while troubleshooting. It would be nice if the former used these values by default but could be overridden by optional args. This should be easy with the switch to argparse (#14).

Missing stuff docs

This gets it done for me on linux:

sudo pip install cython

So, easy, but just not expected until the error surfaces when trying to execute things.

change_id_to_name_fasta.py This will allow you to change names in a user input fasta file with a list of given names. The input is a tab delimited file containing the current names in the first column and the names to be replaced with in the second.
python change_id_to_name_fasta.py Table.tsv InputFasta.fa OutputFile

That means the table should only has two columns: old name spls[0] and new name to replace spls[1]

Pardon me if I misinterpret it :0

Miao