gbouras13 / dnaapler Goto Github PK
View Code? Open in Web Editor NEWReorients assembled microbial sequences
License: MIT License
Reorients assembled microbial sequences
License: MIT License
Think about how to consider error-prone assemblies with frameshifts, but with an option dnaA from BLAST. At the moment this will error out.
Hi @gbouras13!
Does the reoriented contig match the strandedness of the blast result? Or is it always positive?
This stems from rpetit3/dragonflye#22 where I plan to add dnaapler as a final step.
Thank you so much!
Robert
Also, I think I found the spot in the code where it happens, but wanted to make sure! Your code was so nice! And, the comments for very helpful, thank you for that!
I have read your readme, and I am aware that dnaapler is in a "Refactoring is in progress" state.
Even so, I would like to create a Docker container for this repository in https://github.com/StaPH-B/docker-builds. It would be very helpful to me if there was a version on github - even a pre-release one.
Hi,
attached are some suggestions for the documentation based on my review:
Line 20 in 7c67b3b
-c bioconda
, like in the command specified in Line 95 of the same file. This should prevent many first-time users from running into a "package not found" error, as it is unlikely they will have read the Beginner Conda Installation section of the docs. You may also link this section here.Line 106 in 7c67b3b
Line 46 in 7c67b3b
I will update this issue if I find anything else.
The installation instructions for dnaapler currently provide a choice between two channels:
The following things jump out to me:
A. I find it confusing that if one follows the 'beginner conda installation' instructions, then one ends up following a different codepath to (1) and installing with mamba instead.
B. Installing mamba via conda is no longer recommended.
C. I'm not sure if pip & conda can coexist peacefully in the same virtual environment. Does it make sense to suggest to your users allow users to install dnaapler via pip if BLAST is only available in bioconda?
To resolve (A) & (B) you could:
Hi There,
Thanks for developing this cool tool! It is indeed very useful for polishing assemblies.
I recently came across an issue whereby dnaapler was failing when a custom sequence specified for reorientation did not exist in one of the contigs in a multi-FASTA file. Ideally, an option to exclude/include length criteria could easily fix such an error by excluding such short/long contigs from the reorientation step and adding them onto the final results as it is.
Additionally, I noted that the bulk option can only be used for a multi-FASTA and fails when the input is a single sequence FASTA file. This causes a bit of an issue when such a tool is integrated into a workflow or a pipeline as typically the output or number of contigs generated using a de novo assembly step is unknown.
Thanks once again for developing dnaapler.
Sej
Hi @gbouras13, I have encountered a case where there are multiple blast hits with very low evalue (1e-180) that just happened to miss the start of the target by few residues, so they were discarded and a random gene ("nearest" option) was used instead.
I think that in those cases it would be best to use the ORF that overlaps the best blast hit, and even make this the default when --autocomplete
is not specified. What do you think?
Thanks for creating and maintaining dnaapler! It's running fantastically well. I had a feature request:
Is your feature request related to a problem? Please describe.
I am running dnaapler all ...
as the input contains both plasmids and genomes.
dnaapler
outputs fasta headers in the format {contig_name} rotated=True
. I want to know which gene (dnaA, repA, etc) was used to reorientate which sequence.
Describe the solution you'd like
The information is captured as it's present in *_summary.tsv
file. Fingers crossed this isn't a big ask
Describe alternatives you've considered
NA
Additional context
NA
Thanks!
Hi, dnaapler looks great and does nearly exactly what I need for my genome assemblies.
I had a few feature requests that would make it easier to run for an entire genome.
I often have assemblies with chromosomes and plasmids and would like to rotate all of them simultaneously, rather than running it for the chromosome and the plasmids separately. I could make a custom db with dnaA and repA and run with bulk, but it would be convenient to have an 'all' option that rotates all chromosomes and plasmids in a fasta file.
My second request is tied to this first one. I work with Agrobacterium, which has a circular chromosome and a second linear chromosome. I would like to rotate the circular chromosome but not the linear one. Similarly, sometimes I have nearly complete assemblies where the chromosome is almost but not quite complete, and the plasmids are complete. I would like to rotate only the contigs that are assembled as circular molecules.
The easiest way to get around this would be to have an 'ignore' option like circlator's that would let me provide a list of contigs to ignore.
Thanks!
-Alex
Add e-value threshold as option
Hi @gbouras13,
thank you very much for this super useful tool - something that I very often considered to start developing myself ;-) So, thanks a lot for this!
I've seen that the terL
DB of dnaapler is quite heavy compared to the dnaA
/repA
. Re-orientating (semi) finished bacterial genomes is a very common task. To save runtime in large-scale analyses, I guess it might be interesting/OK to skip the phage step for the majority of cases, since most phages would be inserted in a chrom/plasmid anyway and if not, it would be okish to have the phage not re-orientated. Trying dnaapler on a genome of a novel species took ~15 min (for which it was not able to find a decent dnaA
gene - most likely just lacking a closely related sequence to those within SwissProt ). Hence, I wonder if it would be possible and OK for you to provide an option either to explicitly skip the terL
step or to provide an option to choose 1 or 2 of all 3 steps, maybe something like --db [dnaa,repa,terl]
or --type [chrom,plasmid,phage]
.
Looking forward to your thoughts - and thanks again!
Hi @gbouras13,
There are a couple of places in the CONTRIBUTING.md file which reference one of your other projects (phrokka
) rather than dnaapler
. Could you please take a look and make sure that the contribution guidelines are appropriate for this repo? It might also be nice to link them in the main README.md.
Hi,
Thanks for this nice tool.
I tried to automate Dnaapler
in a pipeline to reorientate genomes assembled with flye
. In some instances, I got the genome assembly fragmented, and obviously not circular (also from flye info). Despite of this, Dnaapler
rotated these fragments and may have falsely merged the contig ends. Could it be possible that Dnaapler
takes the flye assembly_info.txt
and applies its logic only to the circularized contigs?
Thank you
Hi @gbouras13
I have a question about the list of authors in the JOSS paper. I can see that 3/5 of the authors have made contributions to the repo, but I am unclear about the contributions of the remaining authors. Could you please clarify what their contribution was?
I note that I've seen some JOSS papers provide a statement of author contributions according to the Credit guidelines, but this doesn't appear to be required.
The log message:
2023-07-13 01:42:57.043 | INFO | dnaapler.utils.validation:instantiate_dirs:20 - Checking the output directory FIX_START
2023-07-13 01:42:57.043 | INFO | dnaapler.utils.validation:instantiate_dirs:25 - --force was specified even though the output directory does not already exist. Continuing.
2023-07-13 01:42:57.055 | INFO | dnaapler.utils.util:begin_dnaapler:66 - You are using dnaapler version 0.1.0
2023-07-13 01:42:57.056 | INFO | dnaapler.utils.util:begin_dnaapler:67 - Repository homepage is https://github.com/gbouras13/dnaapler
2023-07-13 01:42:57.056 | INFO | dnaapler.utils.util:begin_dnaapler:68 - Written by George Bouras: [email protected]
2023-07-13 01:42:57.056 | INFO | dnaapler.utils.util:begin_dnaapler:69 - Your input FASTA is WH0762305A01__Edwardsiella_ictalur_chromosome.fa
2023-07-13 01:42:57.056 | INFO | dnaapler.utils.util:begin_dnaapler:70 - Your output directory is FIX_START
2023-07-13 01:42:57.057 | INFO | dnaapler.utils.util:begin_dnaapler:71 - You have specified 8 threads to use with blastx
2023-07-13 01:42:57.057 | INFO | dnaapler.utils.util:begin_dnaapler:72 - You have specified dnaA gene to reoirent your sequence
2023-07-13 01:42:57.947 | INFO | dnaapler.utils.util:begin_dnaapler:74 - blastx version 2.14.0+ found
2023-07-13 01:42:57.948 | INFO | dnaapler.utils.validation:validate_fasta:43 - Checking that the input file WH0762305A01__Edwardsiella_ictalur_chromosome.fa is in FASTA format and has only 1 entry.
2023-07-13 01:42:58.041 | INFO | dnaapler.utils.validation:validate_fasta:50 - WH0762305A01__Edwardsiella_ictalur_chromosome.fa file checked.
2023-07-13 01:42:58.082 | INFO | dnaapler.utils.validation:validate_fasta:59 - WH0762305A01__Edwardsiella_ictalur_chromosome.fa has only one entry.
2023-07-13 01:42:58.085 | INFO | dnaapler.utils.external_tools:run:45 - Started running blastx -db /usr/local/lib/python3.10/site-packages/dnaapler/db/dnaA_db -evalue 1e-10 -num_threads 8 -outfmt ' 6 qseqid qlen sseqid slen length qstart qend sstart send pident nident gaps mismatch evalue bitscore qseq sseq ' -out FIX_START/WH0762305A01__Edwardsiella_ictalur_fxstart_chromosome_blast_output.txt -query WH0762305A01__Edwardsiella_ictalur_chromosome.fa ...
2023-07-13 01:43:39.874 | INFO | dnaapler.utils.external_tools:run:47 - Done running blastx -db /usr/local/lib/python3.10/site-packages/dnaapler/db/dnaA_db -evalue 1e-10 -num_threads 8 -outfmt ' 6 qseqid qlen sseqid slen length qstart qend sstart send pident nident gaps mismatch evalue bitscore qseq sseq ' -out FIX_START/WH0762305A01__Edwardsiella_ictalur_fxstart_chromosome_blast_output.txt -query WH0762305A01__Edwardsiella_ictalur_chromosome.fa
2023-07-13 01:43:39.904 | INFO | dnaapler.utils.processing:reorient_sequence:93 - dnaA gene identified. It starts at coordinate 553459 on the reverse strand in your input file.
2023-07-13 01:43:39.904 | INFO | dnaapler.utils.processing:reorient_sequence:96 - The best hit with a valid start codon in the database is sp|P76594|LYSAC_ECOLI, which has length of 886 AAs.
2023-07-13 01:43:39.905 | INFO | dnaapler.utils.processing:reorient_sequence:99 - 887 AAs were covered by the best hit, with an overall coverage of 100.11%.
2023-07-13 01:43:39.905 | INFO | dnaapler.utils.processing:reorient_sequence:102 - 488 AAs were identical, with an overall identity of 55.02%.
2023-07-13 01:43:39.905 | INFO | dnaapler.utils.processing:reorient_sequence:103 - Re-orienting.
2023-07-13 01:43:40.036 | INFO | dnaapler.utils.util:end_dnaapler:102 - dnaapler has finished
2023-07-13 01:43:40.037 | INFO | dnaapler.utils.util:end_dnaapler:103 - Elapsed time: 42.99 seconds
It turn out it your database has some 'contaminated' sequence like in this image:
As a result, the sequence is not adjust correctly at dnaA!
Please respond as soon as possible, thank you
I have issues with circlator, and was hoping to use dnaapler instead.
For plasmids, circlator uses prodigal to detect a gene somewhere in the middle and rotates the sequence to that. Does dnaapler do anything similar or does it just look for dnaA?
The dnaapler workflow crashes right after the BLAST step on some PacBio HiFi assemblies.
I've been running dnaapler all
on bacterial assemblies of PacBio HiFi reads (made with Flye). The assemblies are all nearly identical and consist of a circular chromosome of ~4Mb and two plasmids of ~20kb. This works quite nicely, but fails in one case. The clue I'm getting in the log file is copied below.
If you have any suggestion how to fix this, I'd be happy to adjust some code and test it! Thanks in advance.
2023-08-29 13:28:48.143 | INFO | dnaapler.utils.external_tools:run:52 - Done running blastx -db .../lib/python3.11/site-packages/dnaapler/db/all_db -evalue 1e-10 -num_threads 4 -outfmt ' 6 qseqid qlen sseqid slen length qstart qend sstart send pident nident gaps mismatch evalue bitscore qseq sseq ' -out data/tmp/example_blast_output.txt -query data/tmp/example-assembly/assembly.fasta
Traceback (most recent call last):
File ".../bin/dnaapler", line 10, in <module>
sys.exit(main())
^^^^^^
File ".../lib/python3.11/site-packages/dnaapler/__init__.py", line 728, in main
main_cli()
File ".../lib/python3.11/site-packages/click/core.py", line 1128, in __call__
return self.main(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File ".../lib/python3.11/site-packages/click/core.py", line 1053, in main
rv = self.invoke(ctx)
^^^^^^^^^^^^^^^^
File ".../lib/python3.11/site-packages/click/core.py", line 1659, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File ".../lib/python3.11/site-packages/click/core.py", line 1395, in invoke
return ctx.invoke(self.callback, **ctx.params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File ".../lib/python3.11/site-packages/click/core.py", line 754, in invoke
return __callback(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File ".../lib/python3.11/site-packages/click/decorators.py", line 26, in new_func
return f(get_current_context(), *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File ".../lib/python3.11/site-packages/dnaapler/__init__.py", line 709, in all
all_process_blast_output_and_reorient(
File ".../lib/python3.11/site-packages/dnaapler/utils/all.py", line 269, in all_process_blast_output_and_reorient
genes.append(gene)
^^^^
UnboundLocalError: cannot access local variable 'gene' where it is not associated with a value
Re openjournals/joss-reviews#5968
Hello,
I have tried to common sense check the output of dnaapler
on the provided test data by running the following
dnaapler all -i tests/test_data/SAOMS1.fasta -o output.all -f
In case it is helpful I have attached a .zip folder of the output that I get from this command.
Two things are surprising to me:
output.all/logs/blastx_7d5612cc02ea76c324567a7b1bfeb1dfe9d64fbfc4b176959385cdc60514d626.err
)top blastx hit ... not begin with a valid start codon
and dnaapler falls back to reorientating the sequence using pyrodigalCould you explain why BLAST is producing errors in (1), and then the BLAST ouptut is being ignored in (2)? I worry that (1) implies that BLAST has hard-crashed, and then the error has been swallowed because dnaapler
has fallen back to using pyrodigal
.
Hi,
attached are some corrections and suggestions for the manuscript text based on my review.
Line 36 in 7c67b3b
Suggestions
Line 38 in 7c67b3b
Corrections:
Suggestions
Line 46 in 7c67b3b
Suggestions
Corrections
Overall I believe that this paragraph must be more descriptive in what exactly does Dnaapler do, as this is only implied through context.
Line 61 in 7c67b3b
Corrections
Line 67 in 7c67b3b
Suggestions
Edit: Apologies, I accidentally submitted this issue before finishing writing it, so I will be editing it before it's complete. This issue is now complete.
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.