qiime2 / qiime2

Official repository for the QIIME 2 framework.

Home Page: https://qiime2.org

License: BSD 3-Clause "New" or "Revised" License

Python 95.86% Makefile 0.04% Shell 0.01% TeX 4.07% HTML 0.02%

hacktoberfest

qiime2's Introduction

qiime2 (the QIIME 2 framework)

Source code repository for the QIIME 2 framework.

QIIME 2™ is a powerful, extensible, and decentralized microbiome bioinformatics platform that is free, open source, and community developed. With a focus on data and analysis transparency, QIIME 2 enables researchers to start an analysis with raw DNA sequence data and finish with publication-quality figures and statistical results.

Visit https://qiime2.org to learn more about the QIIME 2 project.

Installation

Detailed instructions are available in the documentation.

Users

Head to the user docs for help getting started, core concepts, tutorials, and other resources.

Just have a question? Please ask it in our forum.

Developers

Please visit the contributing page for more information on contributions, documentation links, and more.

Citing QIIME 2

If you use QIIME 2 for any published research, please include the following citation:

Bolyen E, Rideout JR, Dillon MR, Bokulich NA, Abnet CC, Al-Ghalith GA, Alexander H, Alm EJ, Arumugam M, Asnicar F, Bai Y, Bisanz JE, Bittinger K, Brejnrod A, Brislawn CJ, Brown CT, Callahan BJ, Caraballo-Rodríguez AM, Chase J, Cope EK, Da Silva R, Diener C, Dorrestein PC, Douglas GM, Durall DM, Duvallet C, Edwardson CF, Ernst M, Estaki M, Fouquier J, Gauglitz JM, Gibbons SM, Gibson DL, Gonzalez A, Gorlick K, Guo J, Hillmann B, Holmes S, Holste H, Huttenhower C, Huttley GA, Janssen S, Jarmusch AK, Jiang L, Kaehler BD, Kang KB, Keefe CR, Keim P, Kelley ST, Knights D, Koester I, Kosciolek T, Kreps J, Langille MGI, Lee J, Ley R, Liu YX, Loftfield E, Lozupone C, Maher M, Marotz C, Martin BD, McDonald D, McIver LJ, Melnik AV, Metcalf JL, Morgan SC, Morton JT, Naimey AT, Navas-Molina JA, Nothias LF, Orchanian SB, Pearson T, Peoples SL, Petras D, Preuss ML, Pruesse E, Rasmussen LB, Rivers A, Robeson MS, Rosenthal P, Segata N, Shaffer M, Shiffer A, Sinha R, Song SJ, Spear JR, Swafford AD, Thompson LR, Torres PJ, Trinh P, Tripathi A, Turnbaugh PJ, Ul-Hasan S, van der Hooft JJJ, Vargas F, Vázquez-Baeza Y, Vogtmann E, von Hippel M, Walters W, Wan Y, Wang M, Warren J, Weber KC, Williamson CHD, Willis AD, Xu ZZ, Zaneveld JR, Zhang Y, Zhu Q, Knight R, and Caporaso JG. 2019. Reproducible, interactive, scalable and extensible microbiome data science using QIIME 2. Nature Biotechnology 37:852–857. https://doi.org/10.1038/s41587-019-0209-9

qiime2's People

ebolyen loganathanncsu gregcaporaso jairideout yimsea muslih14 gblanchard4 jakereps robertoalvarezm kestrelgorlick wasade benkaehler dayedepps mortonjt eldeveloper pingpi357 alenzhao nightest andrewsanchez asutosh7hota maxvonhippel nervous-laughter tankmermaid xfunture msarahan fasnicar bigbrothermo spencerimp shicenncsu foxmicrobiologist hunter-cameron mebapa rpatil8 br4qua patthehat033 kojimaryuta caracolsol77 rajaldebnath massaraevi fadhlyemen dna-language shf43 zhangbw777 and-flores rnandety iyang5 nreeve17 zhuangwb tw7649116 turanoo dauss75 a7032018 cgi-nrm xpingli antgonza jennybethcornell jonahventures chriskeefe jiangbingxu raijinmaru17 junghyunjj adixit213 uvricorelabs blankenberg mbf001 cloacina gejun1995 3liv nau-oss-archive glomicon abesuden suttungr shihuang047 dkzhang11 irapraharaj wangdi2014 mibwurmoleco ropolomx edwjchen diegoibt cameronmartino mys721tx asoback jeremycl01 sgc92 oddant1 ayanlj holobionts matthieurouland weiweibian longhdo leemelisa decenwang 5c077 ahmedelhosseiny nbokulich zachary-wu sephoh kissthink archieniv

qiime2's Issues

artifact file extension?

What should the recommended file extension be for serialized artifacts?

@gregcaporaso @ebolyen and I were thinking of .qtf for QIIME Tar Format, since the format is a tar file. .qtf is only used by a couple of unrelated software packages and doesn't seem to be a popular extension.

The only thing I'm not sold on is having part of the underlying file format in the name (Tar) in case that changes in the future. It would also be nice to have "artifact" somewhere in the name since it is a format for serializing QIIME artifacts.

reorganize and clean up package structure

There's also stubbed/outdated code that can be deleted.

remove interface module

All interfaces, including the q2d3 prototype and cli will live in separate repositories. These will then serve as examples for other interface developers, and keep the distinction between functionality and interface clear.

define Visualization.get_indices

a public API to get a list of the available data/index.* files (probably as tuples of paths and extensions).

squash all the commits

Squash commits in this repo and the qiime2-plugins repos.

ArtifactDataReader should track open files

ArtifactDataReader should track files that are opened for reading with get_file and close all tracked filehandles when Artifact is done using the data reader. This is a safety net for plugin developers in case they forget to close the filehandles they request. It also makes plugin code simpler. This is a similar strategy to what ArtifactDataWriter does.
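A minimal sketch of the tracking idea, assuming a hypothetical `TrackingReader` class as a stand-in for ArtifactDataReader (the class name, `close_all`, and the constructor argument are illustrative, not the framework's API):

```python
import os
import tempfile


class TrackingReader:
    """Hypothetical stand-in for ArtifactDataReader."""

    def __init__(self, data_dir):
        self._data_dir = data_dir
        self._tracked = []  # every filehandle handed out by get_file

    def get_file(self, name, mode="r"):
        fh = open(os.path.join(self._data_dir, name), mode)
        self._tracked.append(fh)
        return fh

    def close_all(self):
        # Closing an already-closed file is a no-op, so this is safe even
        # when the plugin closed some handles itself.
        for fh in self._tracked:
            fh.close()
        self._tracked.clear()


with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "seqs.txt"), "w") as f:
        f.write("ACGT\n")
    reader = TrackingReader(tmp)
    handle = reader.get_file("seqs.txt")
    first_line = handle.readline()
    reader.close_all()  # the plugin never called handle.close()
```

In this sketch the owner of the reader calls `close_all()` once it is done, so a forgotten `close()` in plugin code does not leak a filehandle.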

define standard file extensions so they can be imported by interfaces

The extensions qza and qzv should be discoverable via the sdk.

warn if recommended artifact file extension isn't used

Artifact and Visualization should warn when load or save is called with a file path that doesn't have the recommended file extension. This issue depends on #6.
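One possible shape for such a check, sketched with the standard `warnings` module. The helper name `check_extension` and the `RECOMMENDED_EXTENSIONS` mapping are assumptions for illustration; the extensions themselves are the qza and qzv mentioned in the previous issue:

```python
import warnings

# Assumed mapping; only the two extensions come from the issue text.
RECOMMENDED_EXTENSIONS = {"Artifact": ".qza", "Visualization": ".qzv"}


def check_extension(filepath, kind):
    """Warn (but do not fail) when filepath lacks the recommended extension."""
    expected = RECOMMENDED_EXTENSIONS[kind]
    if not filepath.endswith(expected):
        warnings.warn(
            "%s file %r does not use the recommended extension %r"
            % (kind, filepath, expected),
            UserWarning,
        )
        return False
    return True
```

A warning rather than an exception keeps load and save usable with any path while still nudging users toward the recommended extension.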

should compressed artifacts be supported?

Should there be an API for creating a compressed tar file with Artifact.save?

See tarfile.open for supported compression schemes in Python. Artifact.load currently accepts compressed or uncompressed tar files, so supporting this should be trivial.
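As a rough illustration of why this is cheap to support, the sketch below writes a tar file with an optional compression scheme and reads it back with the transparent `r:*` mode of `tarfile.open`. The helper names `save_tar` and `load_tar` are hypothetical and are not part of Artifact:

```python
import io
import os
import tarfile
import tempfile


def save_tar(path, members, compression=""):
    # "w:" writes an uncompressed tar; "w:gz", "w:bz2" and "w:xz" compress.
    with tarfile.open(path, "w:" + compression) as tar:
        for name, payload in members.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))


def load_tar(path):
    # "r:*" detects the compression scheme (or its absence) automatically.
    with tarfile.open(path, "r:*") as tar:
        return {m.name: tar.extractfile(m).read()
                for m in tar.getmembers() if m.isfile()}


members = {"data/table.txt": b"feature\tcount\n"}
with tempfile.TemporaryDirectory() as tmp:
    plain_path = os.path.join(tmp, "plain.tar")
    gz_path = os.path.join(tmp, "compressed.tar.gz")
    save_tar(plain_path, members)
    save_tar(gz_path, members, compression="gz")
    plain_result = load_tar(plain_path)
    gz_result = load_tar(gz_path)
```

Because the reader does not need to know how the file was written, the choice of compression can be left entirely to whoever saves the artifact.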

@ebolyen pointed out that a tar file is likely to be understood by software into the foreseeable future because it is a simple format, but compression is more of an unknown. IMO future-proof data archiving is outside the responsibility of QIIME 2, so we may want to give users/devs control over whether to compress artifacts. I could see different systems built around QIIME 2 having different needs w.r.t. compression.

Artifact.data should be lazy

Currently when instantiating an Artifact from a tar file, the artifact's data is loaded into memory and stored at the data property. The data property should be lazy such that the data is only loaded when .data is first accessed.

Question: should this always load a new instance of data, or cache the result? This depends on how Artifact is expected to be interacted with, which is unclear to me right now.
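A sketch of the caching variant, assuming a hypothetical `LazyArtifact` class that receives a zero-argument loader callable (neither name is from the framework). The alternative raised in the question, returning a fresh instance on every access, would simply call the loader each time instead of storing the result:

```python
class LazyArtifact:
    """Hypothetical artifact whose data is loaded on first access."""

    def __init__(self, loader):
        self._loader = loader  # zero-argument callable that reads the data
        self._data = None
        self._loaded = False

    @property
    def data(self):
        if not self._loaded:
            self._data = self._loader()
            self._loaded = True
        return self._data


load_calls = []


def expensive_loader():
    load_calls.append(1)  # record each real load
    return {"feature-1": 42}


artifact = LazyArtifact(expensive_loader)
calls_before_access = len(load_calls)
first = artifact.data
second = artifact.data
calls_after_access = len(load_calls)
```

A separate `_loaded` flag is used instead of testing `self._data is None` so that data which legitimately loads as `None` is not reloaded.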

verify type meets its variants' interfaces

VariantType.validate is currently stubbed and always returns True. It should verify that a type meets each of its variants' interfaces.

add flake8 to builds

Adding here but this also should be a requirement for the qiime2 plugins

plugin version, workflow, and website should be added to job provenance

should also include citation(s), workflow code (in case the job code is not read-only), ...

TypeMeta should verify generic type overrides

If generic=True, TypeMeta should verify that __lt__ and __gt__ have been overridden on its type.
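A minimal sketch of such a check. The metaclass below only borrows the name `TypeMeta`; passing `generic` as a class keyword argument and raising `TypeError` are assumptions about how this could be wired up, not the framework's actual design:

```python
class TypeMeta(type):
    """Illustrative metaclass; not the framework's TypeMeta."""

    def __new__(mcls, name, bases, namespace, generic=False):
        cls = super().__new__(mcls, name, bases, namespace)
        if generic:
            for method in ("__lt__", "__gt__"):
                # A class that does not override the method resolves it
                # to the default implementation on object.
                if getattr(cls, method) is getattr(object, method):
                    raise TypeError(
                        "generic type %r must override %s" % (name, method))
        return cls

    def __init__(cls, name, bases, namespace, generic=False):
        super().__init__(name, bases, namespace)


class GoodGeneric(metaclass=TypeMeta, generic=True):
    def __lt__(self, other):
        return False

    def __gt__(self, other):
        return False


try:
    class BadGeneric(metaclass=TypeMeta, generic=True):
        pass
    bad_generic_rejected = False
except TypeError:
    bad_generic_rejected = True
```

Checking at class-creation time means a plugin developer sees the mistake on import rather than when a comparison first runs.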

q2d3 interface only works with "pip install -e"

@ebolyen, looks like some package data is not being declared in setup.py. We'll want to have this fixed before the call tomorrow, so others can install.

add gitter room

ArtifactDataWriter should only save tracked files

ArtifactDataWriter._save_ currently adds all files in its temporary directory to the tar file. It should only add files that were created with create_file in case other files are added to the temp dir through some other mechanism.
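The difference can be sketched with a hypothetical `TrackingWriter` (a stand-in for ArtifactDataWriter; its method names other than `create_file` are invented for this example) that records names passed to `create_file` and archives only those:

```python
import os
import tarfile
import tempfile


class TrackingWriter:
    """Hypothetical stand-in for ArtifactDataWriter."""

    def __init__(self):
        self._tmp = tempfile.TemporaryDirectory()
        self._tracked = []  # names created through create_file

    @property
    def temp_dir(self):
        return self._tmp.name

    def create_file(self, name):
        self._tracked.append(name)
        return open(os.path.join(self._tmp.name, name), "w")

    def save(self, tar_path):
        with tarfile.open(tar_path, "w") as tar:
            # Only tracked files are archived; anything else that ended up
            # in the temporary directory is ignored.
            for name in self._tracked:
                tar.add(os.path.join(self._tmp.name, name), arcname=name)

    def cleanup(self):
        self._tmp.cleanup()


writer = TrackingWriter()
with writer.create_file("table.txt") as fh:
    fh.write("tracked\n")
# A file that appears in the temp dir through some other mechanism.
with open(os.path.join(writer.temp_dir, "stray.log"), "w") as fh:
    fh.write("not tracked\n")

with tempfile.TemporaryDirectory() as out_dir:
    tar_path = os.path.join(out_dir, "artifact.tar")
    writer.save(tar_path)
    with tarfile.open(tar_path) as tar:
        saved_names = tar.getnames()
writer.cleanup()
```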

add framework unit tests

@gregcaporaso @ebolyen and I are extreme programming this today.

add first-class interactive visualization support

create a mechanism for importing data into artifacts

I have an old gist that illustrates how to do this - that code may or may not still work, but the idea is the same.

As of now, on import provenance should be None, but we likely want to provide some provenance information for these, such as when the file was uploaded, the source file path, etc.

stub execution context

This will first be necessary for the cli, and then for other interfaces (it wasn't necessary for q2d3 because users initiate execution themselves).

>>> e = LocalSubprocessExecution()
>>> e(workflow_instance, params)  # params are passed to workflow_instance.to_script; this would call subprocess

saving a file launches q2d3 server

This is really weird: if I have a q2d3 server running and save a file in my text editor (vim), a new index page is launched in my browser. I'm not editing or saving files in the server's working directory. Here's the jupyter notebook log entries that get added when I save a file:

[I 14:52:32.222 NotebookApp] The port 4445 is already in use, trying another random port.
[C 14:52:32.222 NotebookApp] ERROR: the notebook server could not be started because no available port could be found.

@ebolyen any ideas?

investigate snakemake as pipeline backend

@bhillmann suggested snakemake as a workflow/pipeline management engine. @ebolyen and I briefly looked into it and are amazed. A few of the highlights (thanks @ebolyen!):

It has a simple, declarative file format, supports cluster execution, and connects to a variety of resources out of the box (including Amazon S3, Google Storage, Dropbox, FTP, SFTP). The snakemake devs are working on adding cloud support. It's written in Python 3, and was initially developed for bioinformatics applications (so there are a lot of real-world examples and publications using it). The snakemake devs are bioconda devs, so it's available via bioconda. This looks like a very, very sane bioinformatics library that could be used outright, or extended, to provide true DAG workflow/pipeline specification and execution in QIIME 2. This is definitely worth investigating further!

Should `.async` type-check before spawning?

Comments
This would make life a little easier for users of the "Artifact" API in a Jupyter notebook, etc., as they wouldn't need to call future.result() over and over to see if it "worked", but it wouldn't prevent all types of errors, as the wrapped function may still raise. Additionally, this would make it harder for interface developers, who would need to catch errors and represent them from two different places.

conda package

@colinbrislawn has expressed interest in working on this here.

@colinbrislawn, let's primarily track this one here. We'll follow up with you via this issue soon.

write Workflow.to_script

This will be required for the cli.

Plugin entry-point key should be the machine id of a plugin

Improvement Description
while the name argument provided to the Plugin constructor should be the human-readable name.

Visualizers fail on OSX through async calls, due to matplotlib main thread limitations

Testing out the summarize visualizer from the feature-table plugin through qiime-studio revealed an issue where matplotlib does not work outside of the main thread on OSX. The Visualizer fails with the following traceback:

The process has forked and you cannot use this CoreFoundation functionality safely. You MUST exec().
Break on __THE_PROCESS_HAS_FORKED_AND_YOU_CANNOT_USE_THIS_COREFOUNDATION_FUNCTIONALITY___YOU_MUST_EXEC__() to debug.
The process has forked and you cannot use this CoreFoundation functionality safely. You MUST exec().
Break on __THE_PROCESS_HAS_FORKED_AND_YOU_CANNOT_USE_THIS_COREFOUNDATION_FUNCTIONALITY___YOU_MUST_EXEC__() to debug.
Traceback (most recent call last):
  File "/Users/Develop/Developer/work/qiime-studio/qiime_studio/api/jobs.py", line 111, in callback
    results = future.result()
  File "/Users/Develop/anaconda/envs/qiime_studio/lib/python3.5/concurrent/futures/_base.py", line 398, in result
    return self.__get_result()
  File "/Users/Develop/anaconda/envs/qiime_studio/lib/python3.5/concurrent/futures/_base.py", line 357, in __get_result
    raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.

Found it was due to matplotlib through this SO exchange: http://stackoverflow.com/a/16303620
Sounds like this might need another workaround.

add error messages to type system

Many error messages in the type system are empty or snarky. These need to become real error messages.

verify Type.ne behavior is correct

Should Type.__ne__ be implemented as (not self < other) and (not self > other)?
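One way to probe the question is to evaluate the proposed expression on values whose `<` and `>` form a partial order. The sketch below uses Python's built-in frozenset, where `<` and `>` mean proper subset and proper superset, purely as a stand-in; it says nothing about how Type actually defines its comparisons:

```python
def proposed_ne(a, b):
    # The expression from the issue, applied to any partially ordered values.
    return (not a < b) and (not a > b)


equal_case = proposed_ne(frozenset({1}), frozenset({1}))
nested_case = proposed_ne(frozenset({1}), frozenset({1, 2}))
incomparable_case = proposed_ne(frozenset({1}), frozenset({2}))

print(equal_case, nested_case, incomparable_case)  # True False True
```

Under this stand-in ordering the expression is true for equal values and for incomparable values, and false for strictly ordered ones. Whether that matches the intended meaning of "not equal" for types is exactly what needs verifying.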

make types hashable

This will be useful when we need to do type mapping, for example in the cli when we map QIIME types to click types.
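A sketch of what hashability buys, using a hypothetical `SemanticType` class (not the framework's type classes) and Python built-ins in place of click types so the example has no third-party dependency:

```python
class SemanticType:
    """Hypothetical type object; equality and hash use the same key."""

    def __init__(self, name, fields=()):
        self.name = name
        self.fields = tuple(fields)

    def _key(self):
        return (self.name, self.fields)

    def __eq__(self, other):
        if not isinstance(other, SemanticType):
            return NotImplemented
        return self._key() == other._key()

    def __hash__(self):
        # Defining __eq__ alone would set __hash__ to None, so it must be
        # provided explicitly and agree with __eq__.
        return hash(self._key())


# Stand-in for a mapping from QIIME types to click types.
CLI_TYPE_FOR = {
    SemanticType("Int"): int,
    SemanticType("Str"): str,
}

looked_up = CLI_TYPE_FOR[SemanticType("Int")]
```

The lookup succeeds with a freshly constructed, equal instance, which is the property a type-mapping table in an interface would rely on.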

create a Job class

Job objects should be returned from SystemContext.__call__, and should contain (at least) the complete job markdown and the workflow uuid that gets stored as part of that job's provenance. That workflow uuid can then be used, for example, to name the markdown files that q2d3 creates. This would link the Artifact objects created when the job is run to the actual code that was run, if the interface wants to track that (note that the created artifact's provenance already has a reference to the workflow template that generated it).

document plugin requirements and recommendations

This will include things like the structure of a plugin, requirements for official QIIME 2 plugins (CI testing and coverage, flake8, etc), and recommendations for unofficial plugins.

cookiecutter for plugin packages

cookiecutter may be useful for creating a standardized QIIME 2 plugin package structure (thanks @wasade for finding this!).

verify qiime.core.archiver.Archiver zipfile path separator logic is cross-platform compatible

qiime.core.archiver.Archiver currently uses os.path.join wherever a path separator is required. This is correct in many cases, but we should make sure this will work on Windows and Unix since the zip spec requires / internally. In other words, can os.path.join be used for constructing paths within a zip file, or is it always /?
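One platform-independent option, sketched below, is to build member names with `posixpath.join`, which always uses `/`, and to reserve `os.path.join` for paths on the local filesystem. The helper name `archive_path` is hypothetical:

```python
import io
import posixpath
import zipfile


def archive_path(*parts):
    # posixpath.join uses "/" on every platform, unlike os.path.join,
    # which uses "\\" on Windows.
    return posixpath.join(*parts)


buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr(archive_path("artifact", "data", "table.txt"), "feature\tcount\n")
    member_names = zf.namelist()

print(member_names)  # ['artifact/data/table.txt']
```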

create a mechanism for inspecting Artifacts

Likely as a method of Artifact, or some way to view properties of it from interfaces. We need to think about what this would be exactly, but we need some way to view provenance, view summaries of the underlying data, etc.

Property predicates should be "symbols" so that we can attach docs

Improvement Description
Currently, our design has them as strings. This is easy, but because semantic properties require community consensus it would be nice if they were "registered" in some way with some documentation. That way developers and users can better interpret what a given type with a semantic property actually means.

centralize test install instructions

Installation instructions are now duplicated across different repositories, including q2cli and q2d3. We should put these in one place - the QIIME 2 GitHub wiki is a likely spot for this.

Job object is not accessible through the use of SubprocessExecutor

.sdk.execution.SubprocessExecutor creates a .sdk.job.Job, and writes the code to a file to be run, but doesn't allow access to the object information itself. This was noticed while trying to use the Job.uuid as the identifier for tracking the current processes being run by qiime-studio, and realizing that only the Future object was being returned.

Found while working on PR#32 in qiime-studio

support user-configured temporary directory

Similar to QIIME 1's temp_dir config option, QIIME 2 should support a config file/directory where users can specify a temporary directory. ArtifactDataWriter should respect this temp dir in its call to tempfile.TemporaryDirectory.
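`tempfile.TemporaryDirectory` already accepts a `dir` argument, so respecting a configured location could be as small as the sketch below. How the configured value is read from a config file is left out, and the helper name `make_temp_dir` is hypothetical:

```python
import os
import tempfile


def make_temp_dir(configured_dir=None):
    # dir=None falls back to tempfile's default location (TMPDIR, etc.).
    return tempfile.TemporaryDirectory(dir=configured_dir)


# A throwaway directory plays the role of the user's configured temp dir.
with tempfile.TemporaryDirectory() as configured:
    with make_temp_dir(configured) as work_dir:
        created_inside_configured = os.path.dirname(work_dir) == configured

with make_temp_dir() as default_dir:
    default_exists = os.path.isdir(default_dir)
```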

better handling of unloadable artifacts

It is primarily up to interfaces to determine how they handle discovering artifacts that they cannot load (either because the plugin that defines their type is not installed, or the .qtf file is corrupt), but QIIME 2 should likely raise a custom error type to make this easier for interfaces to detect. The corresponding error should differentiate corrupt artifacts from unimportable artifacts. If Artifacts encode information about the plugin that defines them in their metadata, an unimportable Artifact could tell the user what plugin needs to be imported for it to be used.

Also, providing a non-qtf file should give a nice error message. The current message is something like: tarfile.ReadError: '/Users/jairideout/dev/qiime2/q2-ninja-ops/seqs.fna' is not a readable tar archive file.
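A sketch of how custom error types could separate the two cases. All class and function names here are hypothetical, and how an artifact records its defining plugin is deliberately left out:

```python
import os
import tarfile
import tempfile


class ArtifactLoadError(Exception):
    """Base class so interfaces can catch every load failure at once."""


class CorruptArtifactError(ArtifactLoadError):
    """The file is not a readable archive."""


class UnimportableArtifactError(ArtifactLoadError):
    """The archive is fine but the plugin defining its type is missing."""

    def __init__(self, plugin_name):
        super().__init__("plugin %r must be installed to load this artifact"
                         % plugin_name)
        self.plugin_name = plugin_name


def list_archive(path):
    try:
        with tarfile.open(path, "r:*") as tar:
            return tar.getnames()
    except tarfile.ReadError as exc:
        raise CorruptArtifactError(
            "%r is not a readable artifact file" % path) from exc


def require_plugin(plugin_name, installed_plugins):
    if plugin_name not in installed_plugins:
        raise UnimportableArtifactError(plugin_name)


with tempfile.TemporaryDirectory() as tmp:
    fasta = os.path.join(tmp, "seqs.fna")
    with open(fasta, "w") as fh:
        fh.write(">seq1\nACGT\n")
    try:
        list_archive(fasta)
        corrupt_detected = False
    except CorruptArtifactError:
        corrupt_detected = True
```

An interface can then catch `ArtifactLoadError` for a generic message, or the subclasses to tell the user either that the file is damaged or which plugin to install.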

define ResultBase.extract

public API which will extract to a default directory unless the user provides an alternative.

feature: interactive phylogenetic tree

I've started working on a d3.js based phylogenetic tree viewer which allows for dynamic interaction through an OTU map file.

Just started working on it so it's got a ways to go, but I thought it might be useful to others.

add mechanism to allow Workflow instances to be executed as Python functions

centralize functionality that is reused across many tests

Specifically DummyType, dummy workflows, etc. This is important for testing the basic functionality, as the actual types and workflows live in other repositories.

unit test expansion

As part of #42, basic unit tests were added for pieces of the core framework. Expand on this to cover boundary cases and errors.

various URLs should be importable from the framework

including:

citation (QIIME 1 for now: http://www.ncbi.nlm.nih.gov/pubmed/20383131)
help page: http://help.qiime.org
conda channel: https://anaconda.org/qiime2

these should probably be globals under SDK.

recommend a naming convention for plugins

@ebolyen and I think it'd make sense to recommend some sort of naming convention analogous to scikit-. We could do q2- for package names, and q2 for module names, so the diversity plugin would be called q2-diversity, and the module would be imported as q2diversity (like scikit-bio and import skbio).

ArtifactDataWriter should set appropriate file metadata

ArtifactDataWriter should set appropriate file metadata/attributes when saving files to the tar archive (e.g., owner, permissions, etc.). I'm not sure exactly what those should be; see TarInfo for the possible file attributes that can be set.
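The `filter` argument of `TarFile.add` is one place where such attributes can be set. The values chosen below (uid/gid 0, empty owner names, 0o644 for files and 0o755 for directories) are placeholders for illustration only, since the right values are the open question of this issue:

```python
import io
import os
import tarfile
import tempfile


def normalize_metadata(info):
    """Overwrite owner and permission fields on a TarInfo before it is written."""
    info.uid = 0
    info.gid = 0
    info.uname = ""
    info.gname = ""
    info.mode = 0o755 if info.isdir() else 0o644
    return info


buffer = io.BytesIO()
with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "table.txt")
    with open(source, "w") as fh:
        fh.write("feature\tcount\n")
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        tar.add(source, arcname="data/table.txt", filter=normalize_metadata)

# Read the archive back to confirm what was stored.
buffer.seek(0)
with tarfile.open(fileobj=buffer, mode="r") as tar:
    member = tar.getmember("data/table.txt")
    saved_mode, saved_uid, saved_uname = member.mode, member.uid, member.uname
```

Setting these fields explicitly also keeps the saving user's account name and ids out of the archive.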

arbitrary order of inputs/parameters

Noticed this while playing around with q2cli today. When calling --help on a method or visualizer, the order of inputs and parameters changes across runs, which is going to be confusing to users. This doesn't happen with methods defined in Markdown files because order is preserved in the Markdown file, and Method.from_markdown uses OrderedDict internally to preserve that order.

We have a couple of options:

Display all options in alphabetical order in q2cli and qiime-studio, regardless of underlying function signature. Technically no change is necessary to the framework if we go this route.
Display options in the order they are defined in the function signature. Note that this won't always group input artifacts followed by parameters, as plugin developers can define their function signature in whatever order they wish. The plugin developer understands the ordering of parameters best, so perhaps it's best if we respect their defined ordering in (all?) interfaces. If we go this route, we'll need to update the framework to preserve order, and q2cli probably won't need updating because it's just looping over the signature (unsure about qiime-studio).

@gregcaporaso @ebolyen @jakereps what do you think? I'm leaning towards option 2.
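Both options can be sketched with the standard library: `inspect.signature` exposes parameters in definition order, and sorting the same names gives the alphabetical variant. The `summarize` function and its parameters are made up for this example:

```python
import inspect


def summarize(table, metadata, depth=10, normalize=False):
    """Made-up plugin function; only its signature matters here."""


def names_in_signature_order(function):
    # Signature.parameters is an ordered mapping in definition order.
    return list(inspect.signature(function).parameters)


def names_in_alphabetical_order(function):
    return sorted(inspect.signature(function).parameters)


print(names_in_signature_order(summarize))
# ['table', 'metadata', 'depth', 'normalize']
print(names_in_alphabetical_order(summarize))
# ['depth', 'metadata', 'normalize', 'table']
```

Either way the order is stable across runs, which is the property users care about. Option 2 additionally keeps whatever grouping the plugin developer chose.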