dnaapler's People

Contributors

beardymcjohnface, gbouras13, samnooij, vini2

dnaapler's Issues

ONT assemblies (frameshift)

Think about how to consider error-prone assemblies with frameshifts, but with an option dnaA from BLAST. At the moment this will error out.

Add version

I have read your readme, and I am aware that dnaapler is in a "Refactoring is in progress" state.

Even so, I would like to create a Docker container for this repository in https://github.com/StaPH-B/docker-builds. It would be very helpful to me if there was a version on github - even a pre-release one.

[JOSS Review] - Documentation

Hi,

attached are some suggestions for the documentation based on my review:

conda create -n dnaapler_env dnaapler

  • Include -c bioconda, like in the command specified in Line 95 of the same file. This should prevent many first-time users from running into a "package not found" error, as it is unlikely they will have read the Beginner Conda Installation section of the docs. You may also link this section here.

You will need to install BLAST v2.10 or higher separately.

  • Specify that this is only required if installing with pip (would Prodigal be required too if the user wants to use the Pyrodigal commands?)

3. Install miniconda and follow the prompts.

  • This list numbering is rendering incorrectly in the readthedocs page.

I will update this issue if I find anything else.

cc openjournals/joss-reviews#5968

[JOSS Review] Installation instructions in the readthedocs

Re joss-reviews/issues/5968

The installation instructions for dnaapler currently provide a choice between two channels:

  1. Installing via conda
  2. Installing via pip

and then there is a 'Beginner Conda Installation' subsection.

The following things jump out to me:
A. I find it confusing that if one follows the 'beginner conda installation' instructions, then one ends up following a different codepath to (1) and installing with mamba instead.
B. Installing mamba via conda is no longer recommended.
C. I'm not sure if pip & conda can coexist peacefully in the same virtual environment. Does it make sense to allow users to install dnaapler via pip if BLAST is only available in bioconda?

To resolve (A) & (B) you could:

  • make installation instructions for all 3 channels (conda, mamba, pip) available at the same level.
  • refer users to the right section of the conda/mamba website to help them install these package managers rather than hardcoding the commands yourself.

min and max length criteria

Hi There,

Thanks for developing this cool tool! It is indeed very useful for polishing assemblies.

I recently came across an issue whereby dnaapler was failing when a custom sequence specified for reorientation did not exist in one of the contigs in a multi-FASTA file. Ideally, an option to exclude/include contigs by length criteria could easily fix such an error by excluding such short/long contigs from the reorientation step and adding them to the final results as they are.

Additionally, I noted that the bulk option can only be used for a multi-FASTA and fails when the input is a single sequence FASTA file. This causes a bit of an issue when such a tool is integrated into a workflow or a pipeline as typically the output or number of contigs generated using a de novo assembly step is unknown.

Thanks once again for developing dnaapler.
Sej
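
A minimal sketch of the length-criteria idea requested above, not dnaapler's implementation. The function name `split_by_length` and the `min_len`/`max_len` parameters are hypothetical, and records are assumed to be simple (header, sequence) pairs. Contigs outside the window are set aside so they can be appended to the output unchanged.

```python
def split_by_length(records, min_len=None, max_len=None):
    """Partition (header, sequence) pairs into contigs to reorient and
    contigs to pass through untouched because they fall outside the window."""
    to_reorient, passthrough = [], []
    for header, seq in records:
        too_short = min_len is not None and len(seq) < min_len
        too_long = max_len is not None and len(seq) > max_len
        if too_short or too_long:
            passthrough.append((header, seq))
        else:
            to_reorient.append((header, seq))
    return to_reorient, passthrough


# Toy input: three contigs of different lengths (illustrative only).
records = [("chrom", "A" * 5000), ("tiny", "A" * 50), ("plasmid", "A" * 800)]
keep, skip = split_by_length(records, min_len=100)
```

The same function works on a single-sequence FASTA, since a one-element list is handled exactly like a longer one.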

Improve autocomplete

Hi @gbouras13, I have encountered a case where there are multiple blast hits with a very low e-value (1e-180) that just happened to miss the start of the target by a few residues, so they were discarded and a random gene ("nearest" option) was used instead.
I think that in those cases it would be best to use the ORF that overlaps the best blast hit, and even make this the default when --autocomplete is not specified. What do you think?
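
The suggestion above can be sketched as follows. This is an illustration under stated assumptions, not how dnaapler selects genes: ORFs are assumed to be (start, end) pairs in 1-based inclusive coordinates, and the helper names `overlap` and `orf_overlapping_hit` are made up for this example.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Number of positions shared by two closed intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start) + 1)


def orf_overlapping_hit(orfs, hit_start, hit_end):
    """Return the ORF with the largest overlap with the BLAST hit,
    or None when no ORF overlaps it at all."""
    # blastx reports qstart > qend for reverse-strand hits, so normalise first.
    lo, hi = sorted((hit_start, hit_end))
    best = max(orfs, key=lambda o: overlap(o[0], o[1], lo, hi), default=None)
    if best is None or overlap(best[0], best[1], lo, hi) == 0:
        return None
    return best


# Toy coordinates: the hit misses the ORF start by a few residues.
orfs = [(1, 300), (1190, 2600), (2700, 3100)]
chosen = orf_overlapping_hit(orfs, hit_start=1200, hit_end=2550)
```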

Add option to annotate fasta headers with Gene_Reoriented

Thanks for creating and maintaining dnaapler! It's running fantastically well. I had a feature request:

Is your feature request related to a problem? Please describe.
I am running dnaapler all ... as the input contains both plasmids and genomes.

dnaapler outputs fasta headers in the format {contig_name} rotated=True. I want to know which gene (dnaA, repA, etc) was used to reorientate which sequence.

Describe the solution you'd like
The information is captured as it's present in *_summary.tsv file. Fingers crossed this isn't a big ask

Describe alternatives you've considered
NA

Additional context
NA

Thanks!
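
As a stopgap, the headers could be annotated after the run from the summary file. The sketch below is only an illustration: the column names `contig` and `gene` are assumptions (the real columns of `*_summary.tsv` are not shown in this issue), and the `gene=` tag is a hypothetical header format.

```python
import csv
import io


def load_genes(summary_handle, contig_col="contig", gene_col="gene"):
    """Map contig name to the gene used for reorientation.
    Column names are assumptions; adjust them to the actual summary file."""
    reader = csv.DictReader(summary_handle, delimiter="\t")
    return {row[contig_col]: row[gene_col] for row in reader}


def annotate_header(header, genes):
    """Append a hypothetical 'gene=' tag to a FASTA header line."""
    contig = header.lstrip(">").split()[0]
    gene = genes.get(contig)
    return f"{header} gene={gene}" if gene else header


# Toy summary table standing in for a real *_summary.tsv.
summary = io.StringIO("contig\tgene\nchrom1\tdnaA\nplasmid1\trepA\n")
genes = load_genes(summary)
new_header = annotate_header(">chrom1 rotated=True", genes)
```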

add ignore option as well as all databases option

Hi, dnaapler looks great and does nearly exactly what I need for my genome assemblies.

I had a few feature requests that would make it easier to run for an entire genome.

I often have assemblies with chromosomes and plasmids and would like to rotate all of them simultaneously, rather than running it for the chromosome and the plasmids separately. I could make a custom db with dnaA and repA and run with bulk, but it would be convenient to have an 'all' option that rotates all chromosomes and plasmids in a fasta file.

My second request is tied to this first one. I work with Agrobacterium, which has a circular chromosome and a second linear chromosome. I would like to rotate the circular chromosome but not the linear one. Similarly, sometimes I have nearly complete assemblies where the chromosome is almost but not quite complete, and the plasmids are complete. I would like to rotate only the contigs that are assembled as circular molecules.
The easiest way to get around this would be to have an 'ignore' option like circlator's that would let me provide a list of contigs to ignore.

Thanks!

-Alex

e-value

Add e-value threshold as option
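
One way to picture the request is a post-filter on the tabular BLAST output. The sketch assumes the 17-column `-outfmt 6` layout that appears in the blastx commands logged in other issues on this page, where `evalue` is the 14th field. The function name `filter_hits` and the toy rows are made up for illustration.

```python
# Index of 'evalue' in: qseqid qlen sseqid slen length qstart qend sstart send
# pident nident gaps mismatch evalue bitscore qseq sseq
EVALUE_COL = 13


def filter_hits(lines, max_evalue=1e-10):
    """Keep tab-separated BLAST rows whose e-value is at or below the threshold."""
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) > EVALUE_COL and float(fields[EVALUE_COL]) <= max_evalue:
            kept.append(fields)
    return kept


def make_row(evalue):
    """Build a toy 17-column row; every value except the e-value is a placeholder."""
    row = ["q1", "1000", "s1", "300", "300", "1", "900", "1", "300",
           "90.0", "270", "0", "30", evalue, "500", "M", "M"]
    return "\t".join(row)


rows = [make_row("1e-180"), make_row("1e-12"), make_row("0.001")]
default_hits = filter_hits(rows)
strict_hits = filter_hits(rows, max_evalue=1e-50)
```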

Option to specify start gene type in all mode

Hi @gbouras13,
thank you very much for this super useful tool - something that I very often considered to start developing myself ;-) So, thanks a lot for this!

I've seen that the terL DB of dnaapler is quite heavy compared to the dnaA/repA. Re-orientating (semi) finished bacterial genomes is a very common task. To save runtime in large-scale analyses, I guess it might be interesting/OK to skip the phage step for the majority of cases, since most phages would be inserted in a chrom/plasmid anyway and if not, it would be okish to have the phage not re-orientated. Trying dnaapler on a genome of a novel species took ~15 min (for which it was not able to find a decent dnaA gene - most likely just lacking a closely related sequence to those within SwissProt ). Hence, I wonder if it would be possible and OK for you to provide an option either to explicitly skip the terL step or to provide an option to choose 1 or 2 of all 3 steps, maybe something like --db [dnaa,repa,terl] or --type [chrom,plasmid,phage].

Looking forward to your thoughts - and thanks again!

[JOSS Review] Typos in CONTRIBUTING.md

Re joss-reviews/issues/5968

Hi @gbouras13,

There are a couple of places in the CONTRIBUTING.md file which reference one of your other projects (phrokka) rather than dnaapler. Could you please take a look and make sure that the contribution guidelines are appropriate for this repo? It might also be nice to link them in the main README.md.

Dnaapler rotates non-circularized contigs

Hi,

Thanks for this nice tool.
I tried to automate Dnaapler in a pipeline to reorientate genomes assembled with flye. In some instances, the genome assembly was fragmented and obviously not circular (also according to the flye info). Despite this, Dnaapler rotated these fragments and may have falsely merged the contig ends. Would it be possible for Dnaapler to take the flye assembly_info.txt and apply its logic only to the circularized contigs?

Thank you
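
Until something like this is supported, a pipeline could pre-select the circular contigs itself. The sketch below assumes that Flye's `assembly_info.txt` is tab-separated with a header row containing a `circ.` column whose value is `Y` for circular contigs; please check this against the file your Flye version writes. The function name and the example rows are illustrative only.

```python
import csv
import io


def circular_contigs(info_handle):
    """Return the names of contigs flagged as circular in a Flye-style info table."""
    reader = csv.reader(info_handle, delimiter="\t")
    header = next(reader)
    circ_idx = header.index("circ.")  # assumed column name
    # The contig name is assumed to be the first column.
    return {row[0] for row in reader if row and row[circ_idx] == "Y"}


# Toy table in the assumed layout (values are made up).
info = io.StringIO(
    "#seq_name\tlength\tcov.\tcirc.\trepeat\tmult.\talt_group\tgraph_path\n"
    "contig_1\t4100000\t85\tY\tN\t1\t*\t1\n"
    "contig_2\t350000\t80\tN\tN\t1\t*\t2,3\n"
)
circular = circular_contigs(info)
```

Only the contigs in `circular` would then be passed to the reorientation step, with the rest written out unchanged.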

[JOSS Review] Authorship list

Re joss-reviews/issues/5968

Hi @gbouras13

I have a question about the list of authors in the JOSS paper. I can see that 3/5 of the authors have made contributions to the repo, but I am unclear about the contributions of the remaining authors. Could you please clarify what their contribution was?

I note that I've seen some JOSS papers provide a statement of author contributions according to the Credit guidelines, but this doesn't appear to be required.

fix start at pat gene instead of dnaA

The log message:

2023-07-13 01:42:57.043 | INFO     | dnaapler.utils.validation:instantiate_dirs:20 - Checking the output directory FIX_START
2023-07-13 01:42:57.043 | INFO     | dnaapler.utils.validation:instantiate_dirs:25 - --force was specified even though the output directory does not already exist. Continuing.
2023-07-13 01:42:57.055 | INFO     | dnaapler.utils.util:begin_dnaapler:66 - You are using dnaapler version 0.1.0
2023-07-13 01:42:57.056 | INFO     | dnaapler.utils.util:begin_dnaapler:67 - Repository homepage is https://github.com/gbouras13/dnaapler
2023-07-13 01:42:57.056 | INFO     | dnaapler.utils.util:begin_dnaapler:68 - Written by George Bouras: [email protected]
2023-07-13 01:42:57.056 | INFO     | dnaapler.utils.util:begin_dnaapler:69 - Your input FASTA is WH0762305A01__Edwardsiella_ictalur_chromosome.fa
2023-07-13 01:42:57.056 | INFO     | dnaapler.utils.util:begin_dnaapler:70 - Your output directory  is FIX_START
2023-07-13 01:42:57.057 | INFO     | dnaapler.utils.util:begin_dnaapler:71 - You have specified 8 threads to use with blastx
2023-07-13 01:42:57.057 | INFO     | dnaapler.utils.util:begin_dnaapler:72 - You have specified dnaA gene to reoirent your sequence
2023-07-13 01:42:57.947 | INFO     | dnaapler.utils.util:begin_dnaapler:74 - blastx version 2.14.0+ found
2023-07-13 01:42:57.948 | INFO     | dnaapler.utils.validation:validate_fasta:43 - Checking that the input file WH0762305A01__Edwardsiella_ictalur_chromosome.fa is in FASTA format and has only 1 entry.
2023-07-13 01:42:58.041 | INFO     | dnaapler.utils.validation:validate_fasta:50 - WH0762305A01__Edwardsiella_ictalur_chromosome.fa file checked.
2023-07-13 01:42:58.082 | INFO     | dnaapler.utils.validation:validate_fasta:59 - WH0762305A01__Edwardsiella_ictalur_chromosome.fa has only one entry.
2023-07-13 01:42:58.085 | INFO     | dnaapler.utils.external_tools:run:45 - Started running blastx -db /usr/local/lib/python3.10/site-packages/dnaapler/db/dnaA_db -evalue 1e-10 -num_threads 8 -outfmt ' 6 qseqid qlen sseqid slen length qstart qend sstart send pident nident gaps mismatch evalue bitscore qseq sseq ' -out FIX_START/WH0762305A01__Edwardsiella_ictalur_fxstart_chromosome_blast_output.txt -query WH0762305A01__Edwardsiella_ictalur_chromosome.fa ...
2023-07-13 01:43:39.874 | INFO     | dnaapler.utils.external_tools:run:47 - Done running blastx -db /usr/local/lib/python3.10/site-packages/dnaapler/db/dnaA_db -evalue 1e-10 -num_threads 8 -outfmt ' 6 qseqid qlen sseqid slen length qstart qend sstart send pident nident gaps mismatch evalue bitscore qseq sseq ' -out FIX_START/WH0762305A01__Edwardsiella_ictalur_fxstart_chromosome_blast_output.txt -query WH0762305A01__Edwardsiella_ictalur_chromosome.fa
2023-07-13 01:43:39.904 | INFO     | dnaapler.utils.processing:reorient_sequence:93 - dnaA gene identified. It starts at coordinate 553459 on the reverse strand in your input file.
2023-07-13 01:43:39.904 | INFO     | dnaapler.utils.processing:reorient_sequence:96 - The best hit with a valid start codon in the database is sp|P76594|LYSAC_ECOLI, which has length of 886 AAs.
2023-07-13 01:43:39.905 | INFO     | dnaapler.utils.processing:reorient_sequence:99 - 887 AAs were covered by the best hit, with an overall coverage of 100.11%.
2023-07-13 01:43:39.905 | INFO     | dnaapler.utils.processing:reorient_sequence:102 - 488 AAs were identical, with an overall identity of 55.02%.
2023-07-13 01:43:39.905 | INFO     | dnaapler.utils.processing:reorient_sequence:103 - Re-orienting.
2023-07-13 01:43:40.036 | INFO     | dnaapler.utils.util:end_dnaapler:102 - dnaapler has finished
2023-07-13 01:43:40.037 | INFO     | dnaapler.utils.util:end_dnaapler:103 - Elapsed time: 42.99 seconds

It turns out that your database contains some 'contaminated' sequences, as in this image:
image

As a result, the sequence is not correctly reoriented to start at dnaA!
circos

Please respond as soon as possible, thank you

Plasmid rotation?

I have issues with circlator, and was hoping to use dnaapler instead.

For plasmids, circlator uses prodigal to detect a gene somewhere in the middle and rotates the sequence to that. Does dnaapler do anything similar or does it just look for dnaA?
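
For reference, rotating a circular sequence to a chosen gene start comes down to the following. This is a generic sketch and not the code of either tool; `start` is assumed to be the 0-based forward-strand index of the gene's first base, and `rotate_to` is a made-up name.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse-complement a DNA string (unambiguous bases only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def rotate_to(seq, start, strand=1):
    """Rotate a circular sequence so that it begins at the gene's first base.
    For a reverse-strand gene the sequence is reverse-complemented first,
    which moves forward index i to index len(seq) - 1 - i."""
    if strand == -1:
        seq = reverse_complement(seq)
        start = len(seq) - 1 - start
    return seq[start:] + seq[:start]


forward = rotate_to("GGATGCTT", 2)              # gene ATGC on the forward strand
reverse = rotate_to("AAGCATCC", 5, strand=-1)   # same gene on the reverse strand
```

Both calls yield the same reoriented sequence, starting with `ATG`.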

Error with dnaapler all on bacterial chromosome and plasmids

The dnaapler workflow crashes right after the BLAST step on some PacBio HiFi assemblies.
I've been running dnaapler all on bacterial assemblies of PacBio HiFi reads (made with Flye). The assemblies are all nearly identical and consist of a circular chromosome of ~4Mb and two plasmids of ~20kb. This works quite nicely, but fails in one case. The clue I'm getting in the log file is copied below.
If you have any suggestion how to fix this, I'd be happy to adjust some code and test it! Thanks in advance.

  • dnaapler version 0.3.0 (installed through conda)
  • OS: Rocky Linux 8.8
2023-08-29 13:28:48.143 | INFO     | dnaapler.utils.external_tools:run:52 - Done running blastx -db .../lib/python3.11/site-packages/dnaapler/db/all_db -evalue 1e-10 -num_threads 4 -outfmt ' 6 qseqid qlen sseqid slen length qstart qend sstart send pident nident gaps mismatch evalue bitscore qseq sseq ' -out data/tmp/example_blast_output.txt -query data/tmp/example-assembly/assembly.fasta

Traceback (most recent call last):
  File ".../bin/dnaapler", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File ".../lib/python3.11/site-packages/dnaapler/__init__.py", line 728, in main
    main_cli()
  File ".../lib/python3.11/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../lib/python3.11/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File ".../lib/python3.11/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../lib/python3.11/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../lib/python3.11/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../lib/python3.11/site-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../lib/python3.11/site-packages/dnaapler/__init__.py", line 709, in all
    all_process_blast_output_and_reorient(
  File ".../lib/python3.11/site-packages/dnaapler/utils/all.py", line 269, in all_process_blast_output_and_reorient
    genes.append(gene)
                 ^^^^
UnboundLocalError: cannot access local variable 'gene' where it is not associated with a value
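
This kind of error usually means a variable is assigned only inside some branches and then used unconditionally. The sketch below reproduces the pattern in isolation; it is a guess at the shape of the problem, not the actual code in `all.py`, and the function names and the `No_hit` label are hypothetical.

```python
def classify_buggy(hits_per_contig):
    genes = []
    for contig, hits in hits_per_contig.items():
        if "dnaA" in hits:
            gene = "dnaA"
        elif "repA" in hits:
            gene = "repA"
        # No else branch: 'gene' is unbound if the first contig matches nothing,
        # and silently keeps the previous contig's value on later iterations.
        genes.append(gene)
    return genes


def classify_fixed(hits_per_contig):
    genes = []
    for contig, hits in hits_per_contig.items():
        gene = "No_hit"  # hypothetical default, reset for every contig
        if "dnaA" in hits:
            gene = "dnaA"
        elif "repA" in hits:
            gene = "repA"
        genes.append(gene)
    return genes


example = {"plasmid_without_hits": [], "chromosome": ["dnaA"]}
try:
    classify_buggy(example)
    buggy_error = None
except UnboundLocalError as exc:
    buggy_error = type(exc).__name__
fixed = classify_fixed(example)
```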

[JOSS Review] Clarifying the fallback behaviour from BLAST to pyrodigal

Re openjournals/joss-reviews#5968

Hello,

I have tried to common sense check the output of dnaapler on the provided test data by running the following

dnaapler all -i tests/test_data/SAOMS1.fasta -o output.all -f

In case it is helpful I have attached a .zip folder of the output that I get from this command.

Two things are surprising to me:

  1. We get logs from running the BLAST external dependency, but only the log for the standard error is populated (output.all/logs/blastx_7d5612cc02ea76c324567a7b1bfeb1dfe9d64fbfc4b176959385cdc60514d626.err)
  2. The primary log states that top blastx hit ... not begin with a valid start codon and dnaapler falls back to reorientating the sequence using pyrodigal

Could you explain why BLAST is producing errors in (1), and then the BLAST output is being ignored in (2)? I worry that (1) implies that BLAST has hard-crashed, and then the error has been swallowed because dnaapler has fallen back to using pyrodigal.

output.all.zip
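
On point (1): a populated `.err` file does not by itself show a crash, because many command-line tools write warnings to standard error and still exit successfully. Whether that is what happens here is for the maintainer to confirm. The sketch below shows the general pattern of judging success by exit code; it uses the Python interpreter as a stand-in for an external tool and is not dnaapler's wrapper code.

```python
import subprocess
import sys


def run_tool(cmd):
    """Run a command, capture both streams, and fail only on a non-zero exit code."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(
            f"{cmd[0]} exited with code {result.returncode}: {result.stderr.strip()}"
        )
    return result


# Stand-in for a tool that warns on stderr but succeeds.
warn_cmd = [sys.executable, "-c", "import sys; print('Warning: example', file=sys.stderr)"]
ok = run_tool(warn_cmd)

# Stand-in for a tool that really fails.
fail_cmd = [sys.executable, "-c", "import sys; sys.exit(2)"]
try:
    run_tool(fail_cmd)
    crashed = False
except RuntimeError:
    crashed = True
```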

[JOSS Review] - Text corrections and suggestions

Hi,

attached are some corrections and suggestions for the manuscript text based on my review.

Microorganisms found in natural environments are fundamental components of ecosystems and play vital roles in various ecological processes. Studying their genomes can provide valuable insights into the diversity, functionality, and evolution of microbial life, as well as their impacts on human health. Once the genetic material is extracted from environmental samples, it undergoes sequencing using advanced technologies like whole genome sequencing (WGS). The raw sequence data is then analysed, and computational methods are applied to assemble the fragmented sequences and reconstruct the complete microbial genomes [@Wick:2021] [@Mallawaarachchi:2023] [@Bouras:2023].

Suggestions

  • Line 15: Remove "advanced" preceding "technologies", or replace with a more descriptive term such as "high-throughput".

Many microorganisms including archaea, bacteria, plasmids, viruses, and bacteriophages, can have circular genomes. However, a circular genome sequence once assembled is represented as a linear character string and labelled in some way to indicate that it should be circular. The point at which the linear sequence begins is random, due to the nature of the algorithms employed in assembling genomes from sequencing reads. Such arbitrary startpoints can affect downstream genome annotation and analysis; they may occur within coding sequences (CDS), can disrupt the prediction potential of mobile genetic elements like prophages, and make pangenome analyses based on gene order difficult. Therefore, microbial sequences are often required to be reoriented to begin by convention with certain genes: the dnaA chromosomal replication initiator gene for bacterial chromosomes, the repA plasmid replication initiation gene for plasmids and the terL large terminase subunit gene for bacteriophages as shown in \autoref{fig:workflow}. Here we present Dnaapler, a flexible microbial sequence reorientation tool that allows for rapid and consistent orientation of circular microbial genomes such as bacteria, plasmids and bacteriophages. Dnaapler is hosted on GitHub at [github.com/gbouras13/dnaapler](https://github.com/gbouras13/dnaapler).

Corrections:

  • Line 19: Plasmids should be referred to as something other than a "microorganism", such as a "biological entity" or some other term. As the first line reads, it sounds as though bacteriophages and viruses are different entities; it should be rewritten as something such as "bacteriophages and other viruses".

Suggestions

  • Line 19: Archaea and Bacteria should be capitalised.
  • Line 20, 26: The second and third sentences both begin in the same way, with "however" and "therefore". The second sentence can be rewritten as "Once assembled, a circular genome is represented [...]".

Circlator [@Hunt:2015] is the most commonly used dedicated tool for reorienting bacterial genomes. However, Circlator was designed for bacterial chromosomes and plasmids only, is no longer supported by its developers, has several burdensome external dependencies, and requires the corrected reads in FASTA or FASTQ format along with the FASTA genome assembly as input. Alternatively, genome reorientation is often performed manually or with custom scripts on a genome-by-genome and project-by-project basis, making integration into assembly workflows difficult and creating inconsistencies between different projects and researchers. We propose Dnaapler, a light-weight command-line tool written in Python 3 that can easily be integrated into assembly workflows. Dnaapler takes only a FASTA formatted genome file as input. It uses the Basic Local Alignment Search Tool (BLAST) [@Altschul:1990] [@Mount:2007] (its only external dependency) or Pyrodigal [@Larralde:2022] [@Hyatt:2010] depending on the chosen subcommand for reorientation. A list of the subcommands provided in Dnaapler are as follows:

Suggestions

  • Line 40: add a comma after the word "difficult".
  • Line 41: replace "Python 3" with "Python". Swap "can easily be" for "can be easily". Could be rewritten as: "command-line tool written in Python that performs the reorientation of microbial genomes, and can be easily integrated [...]".
  • Line 43: Rewrite "uses the Basic Local Alignment [...]" as "uses BLAST".
  • Line 44: replace "only external dependency" with "only mandatory external dependency". Edit: I initially assumed Pyrodigal was also an external dependency, but now I've realised it is a Python package. However, wouldn't Pyrodigal depend on the Prodigal tool, and thus Prodigal would be an optional external dependency? Please confirm this, thank you.

Corrections

  • Line 43-44: explicitly state the purpose of BLAST and Pyrodigal, i.e. "uses BLAST or Pyrodigal as input files/for alignment/some other purpose".

Overall I believe that this paragraph must be more descriptive about what exactly Dnaapler does, as this is only implied through context.

Dnaapler has already been integrated into the United States of America StaPH-B (State Public Health Lab Bioinformatics) consortium [Docker image collection](https://github.com/StaPH-B/docker-builds).

Corrections

  • Line 60-61: this should be moved to the "Availability" section.

and features Continuous Integration tests and test coverage, and Continuous Deployment using Github actions.

Suggestions

  • Line 65: replace "Github" with "GitHub" for consistency with Line 64.

Edit: Apologies, I accidentally submitted this issue before finishing writing it, so I will be editing it before it's complete. This issue is now complete.

cc openjournals/joss-reviews#5968
