bokulich-lab / rescript Goto Github PK

View Code? Open in Web Editor NEW

88.0 88.0 26.0 7.72 MB

REference Sequence annotation and CuRatIon Pipeline

License: BSD 3-Clause "New" or "Revised" License

Python 98.25% TeX 1.57% Makefile 0.05% HTML 0.12%

rescript's People

Contributors

Stargazers

Watchers

rescript's Issues

add example gallery

we should add a gallery showing example QZVs and also displaying example output plots inline (e.g., to give users a preview).

This could be added to this repo, or alternatively in the q2-forum tutorials?

Download RefSeq Genomes and associated Taxonomy

Provide the ability for users to download RefSeq Genome Assemblies, along with their associated taxonomy. See this forum thread for more details.

Some notes:

In order to obtain the taxonomy,
- we must pull the accession from the FASTA header of the given assembly.
- Or parse the Taxonomy ID from the HTML / XML of the returned Assembly Record.
Can not obtain the taxonomy based on assembly identifiers, i.e. GCF_000213215.
The goal would bbe to support the following example query against the "Assembly" db, to download all Archaeal and Bacterial Genomes: (txid2[orgn] OR txid2157[orgn]) AND "latest_refseq"[Properties]

Include RESCRIPt in qiime docker

Not really this package issue but it'll be nice if it will included in the docker, currently qiime rescript doesn't work in quay.io/qiime2/core:2021.4

QIIME2 error after installing RESCRIPt

Hello,

After installing RESCRIPt on Ubuntu 16.04 (QIIME2-2020.2) using

conda create -y -n rescript
conda activate rescript
conda install \
  -c conda-forge -c bioconda -c qiime2 -c defaults \
  qiime2 q2cli q2templates q2-types q2-longitudinal q2-feature-classifier "pandas>=0.25.3" xmltodict

pip install git+https://github.com/bokulich-lab/RESCRIPt.git

I get the following error below. I cannot execute any commands in QIIME2 anymore, as this error pops up. I haven't installed any other packages in my conda environment lately, such as Songbird or Tensorflow. This error only began after installing RESCRIPt. I've also tried running qiime dev refresh-cache, with the same error appearing.

QIIME is caching your current deployment for improved performance. This may take a few moments and should only happen once per deployment.
Traceback (most recent call last):
  File "/home/dunfield/anaconda3/envs/qiime2-2020.2/bin/qiime", line 11, in <module>
    sys.exit(qiime())
  File "/home/dunfield/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/home/dunfield/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/click/core.py", line 716, in main
    with self.make_context(prog_name, args, **extra) as ctx:
  File "/home/dunfield/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/click/core.py", line 641, in make_context
    self.parse_args(ctx, args)
  File "/home/dunfield/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/q2cli/click/command.py", line 43, in parse_args
    return super().parse_args(ctx, args)
  File "/home/dunfield/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/click/core.py", line 1086, in parse_args
    echo(ctx.get_help(), color=ctx.color)
  File "/home/dunfield/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/click/core.py", line 516, in get_help
    return self.command.get_help(self)
  File "/home/dunfield/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/click/core.py", line 879, in get_help
    self.format_help(ctx, formatter)
  File "/home/dunfield/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/click/core.py", line 898, in format_help
    self.format_options(ctx, formatter)
  File "/home/dunfield/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/q2cli/click/command.py", line 157, in format_options
    for subcommand in self.list_commands(ctx):
  File "/home/dunfield/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/q2cli/commands.py", line 92, in list_commands
    plugins = sorted(self._plugin_lookup)
  File "/home/dunfield/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/q2cli/commands.py", line 76, in _plugin_lookup
    import q2cli.core.cache
  File "/home/dunfield/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/q2cli/core/cache.py", line 404, in <module>
    CACHE = DeploymentCache()
  File "/home/dunfield/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/q2cli/core/cache.py", line 61, in __init__
    self._state = self._get_cached_state(refresh=refresh)
  File "/home/dunfield/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/q2cli/core/cache.py", line 107, in _get_cached_state
    self._cache_current_state(current_requirements)
  File "/home/dunfield/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/q2cli/core/cache.py", line 200, in _cache_current_state
    state = self._get_current_state()
  File "/home/dunfield/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/q2cli/core/cache.py", line 238, in _get_current_state
    plugin_manager = qiime2.sdk.PluginManager()
  File "/home/dunfield/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/qiime2/sdk/plugin_manager.py", line 54, in __new__
    self._init(add_plugins=add_plugins)
  File "/home/dunfield/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/qiime2/sdk/plugin_manager.py", line 81, in _init
    plugin = entry_point.load()
  File "/home/dunfield/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/pkg_resources/__init__.py", line 2445, in load
    return self.resolve()
  File "/home/dunfield/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/pkg_resources/__init__.py", line 2451, in resolve
    module = __import__(self.module_name, fromlist=['__name__'], level=0)
  File "/home/dunfield/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/rescript/plugin_setup.py", line 24, in <module>
    from .get_data import get_silva_data
  File "/home/dunfield/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/rescript/get_data.py", line 19, in <module>
    from q2_types.feature_data import RNAFASTAFormat
ImportError: cannot import name 'RNAFASTAFormat'

Does anyone know how I can stop this error and get qiime2 working again with RESCRIPt installed? Any help with this is greatly appreciated.

UPDATE: I removed RESCRIPt and then installed qiime2-2021.4. At that point, I was not getting the error after running commands. Next, I re-installed RESCRIPt using

conda activate qiime2-2021.4
conda install -c conda-forge -c bioconda -c qiime2 -c defaults xmltodict
pip install git+https://github.com/bokulich-lab/RESCRIPt.git

At this point, I am no longer getting the error. It seems I just needed to update to the latest version of QIIME2.

using of NCBI 16S taxonomy data to make sequence reference database

Hi Sirs,
I am Shakti Kumar from India, working on metagenomics data analysis. I want to use NCBI 16S (https://ftp.ncbi.nlm.nih.gov/blast/db/16S_ribosomal_RNA.tar.gz) database to make sequence reference databases. Please help me to use RESCRIPt.

add `get_gtdb_data` method

the parallel of get_silva_data but download, import, and format from GTDB:

FASTA sequences
taxonomy
reference phylogeny?

should add license information to main readme and method description, similar to what we've done with the SILVA license.

GTDB is pretty much already Q2-compatible but this pipeline would streamline things for users, allow us to easily pipe these artifacts into classifier training workflows, and bake GTDB into provenance to ensure that users cite appropriately...

qiime2-2021.2 installation error

Hello lovely RESCRIPt team!

Because I'm trying to be cutting edge (and I have a plugin I want to update that relies on RESCRIPt 😄), I was trying to update t qiime2-2021.2.

When I go to refresh the cache, I get the following error in setting up the plugin, I think related to a disjoint between RESCRIPt's feature classifier and the core q2-feature-classifier functionality.

Traceback (most recent call last):
  File "/Users/jusdeb/miniconda3/envs/test-env/bin/qiime", line 11, in <module>
    sys.exit(qiime())
  File "/Users/jusdeb/miniconda3/envs/test-env/lib/python3.6/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/Users/jusdeb/miniconda3/envs/test-env/lib/python3.6/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/Users/jusdeb/miniconda3/envs/test-env/lib/python3.6/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/jusdeb/miniconda3/envs/test-env/lib/python3.6/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/jusdeb/miniconda3/envs/test-env/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/jusdeb/miniconda3/envs/test-env/lib/python3.6/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/Users/jusdeb/miniconda3/envs/test-env/lib/python3.6/site-packages/q2cli/builtin/dev.py", line 31, in refresh_cache
    import q2cli.core.cache
  File "/Users/jusdeb/miniconda3/envs/test-env/lib/python3.6/site-packages/q2cli/core/cache.py", line 406, in <module>
    CACHE = DeploymentCache()
  File "/Users/jusdeb/miniconda3/envs/test-env/lib/python3.6/site-packages/q2cli/core/cache.py", line 61, in __init__
    self._state = self._get_cached_state(refresh=refresh)
  File "/Users/jusdeb/miniconda3/envs/test-env/lib/python3.6/site-packages/q2cli/core/cache.py", line 107, in _get_cached_state
    self._cache_current_state(current_requirements)
  File "/Users/jusdeb/miniconda3/envs/test-env/lib/python3.6/site-packages/q2cli/core/cache.py", line 200, in _cache_current_state
    state = self._get_current_state()
  File "/Users/jusdeb/miniconda3/envs/test-env/lib/python3.6/site-packages/q2cli/core/cache.py", line 238, in _get_current_state
    plugin_manager = qiime2.sdk.PluginManager()
  File "/Users/jusdeb/miniconda3/envs/test-env/lib/python3.6/site-packages/qiime2/sdk/plugin_manager.py", line 54, in __new__
    self._init(add_plugins=add_plugins)
  File "/Users/jusdeb/miniconda3/envs/test-env/lib/python3.6/site-packages/qiime2/sdk/plugin_manager.py", line 81, in _init
    plugin = entry_point.load()
  File "/Users/jusdeb/miniconda3/envs/test-env/lib/python3.6/site-packages/pkg_resources/__init__.py", line 2472, in load
    return self.resolve()
  File "/Users/jusdeb/miniconda3/envs/test-env/lib/python3.6/site-packages/pkg_resources/__init__.py", line 2478, in resolve
    module = __import__(self.module_name, fromlist=['__name__'], level=0)
  File "/Users/jusdeb/miniconda3/envs/test-env/lib/python3.6/site-packages/rescript/plugin_setup.py", line 144, in <module>
    citations=[citations['bokulich2018optimizing']]
  File "/Users/jusdeb/miniconda3/envs/test-env/lib/python3.6/site-packages/qiime2/plugin/plugin.py", line 301, in register_function
    deprecated, examples)
  File "/Users/jusdeb/miniconda3/envs/test-env/lib/python3.6/site-packages/qiime2/sdk/action.py", line 526, in _init
    output_descriptions)
  File "/Users/jusdeb/miniconda3/envs/test-env/lib/python3.6/site-packages/qiime2/core/type/signature.py", line 118, in __init__
    self._assert_valid_parameters(parameters)
  File "/Users/jusdeb/miniconda3/envs/test-env/lib/python3.6/site-packages/qiime2/core/type/signature.py", line 270, in _assert_valid_parameters
    % (param_name, spec.qiime_type))
TypeError: Default value for parameter 'reads_per_batch' is not of semantic QIIME type Int % Range(1, None) | Str % Choices('auto') or `None`.

add pipeline to generate, format, filter SILVA database

inputs:

start with output of #28 ? or start with files from SILVA?

steps:

#28 or parse_silva_taxonomy? any way to make auto-download optional?
screen_seqs
filter_seqs_length_by_taxon
dereplicate (question: use one more more modes for taxonomy derep?)
any other processing steps?
evaluate with evaluate_taxonomy and cross_validate (with kfold disabled)

outputs:

trained NB classifier
rep seqs
rep taxonomy
should we add steps to build an alignment/tree? or can we use the SILVA rep tree for this?

add option to keep original downloaded silva files and/or different taxonomy files?

Should we allow an option to keep the raw imported SILVA files we download?

My thinking is that if users would like to construct classifiers with and without species labels they will not have to go through the download option twice.

Or... given that this is a nice convenience method, we can simply return both taxonomy files, i.e., with and without species labels, and then there'd be no need for the option of writing out the raw files? This would will make constructing the silva dbs much easier for each qiime release.

I suppose we can do it all.. but that might be confusing.

parallelize *-seqs methods

these all use skbio to iterate over sequences; splitting into chunks should enable processing in parallel.

Migrate to python 3.8

As a RESCRIPt user,
I want the plugin to support python 3.8 (and so be compatible with the latest QIIME 2 release)
so that I can benefit from the newest QIIME 2 features.

Silva 138 Classifier generated using RESCRIPt is not classifying Salmonella at genus level

Hi
I have tried all possible versions of Silva 138 V3-V4 classifiers generated using three different dereplicating mode "uniq" "majority" and "super" in RESCRIPt to classify 16s sequences for mock bacterial communities from Zymoresearch https://files.zymoresearch.com/datasheets/ds1706_zymobiomics_microbial_community_standards_data_sheet.pdf
However, using these classifier are not assigning "Salmonella" at genera level which is expected to be in the community. I blasted Not_assigned ASVs to NCBI nucleotide database then I found majority of "Not_assigned" have hits with salmonella genus.

In contrast, when I classify the same sequences with Silva 132 V3-V4 classifier, it gives me "Salmonella" at genera level.
I also classify some other sequences from other studies too, however, in every case I am missing salmonella.,
The overall classification with Silva 138 classifier is better than Silva 132; but why I am missing salmonella, which is a bit weird for me.

Can you please give me a clue regarding this issue?

Thanks in advanced
Ashutosh

add option to NOT propagate taxonomy in `parse_silva_taxonomy`

If we are interested in down-stream analyses that require in-depth taxonomic curation at various ranks, then we should allow the option to not propagate taxonomy. This will match the functionality currently in get-ncbi-data.

DOC: Add better description of rank propagation

Need to make it clear that rank propagation will propagate taxonomy ranks for all available taxonomy, using forward fill prior to subsetting the ranks requested by the user. That is :

Accession	Taxonomy-Ranks
All Available Ranks	d__Eukaryota sk__Holozoa k__Animalia ks__ sp__Lophotrochozoa p__Rotifera ps__ pi__ sc__ c__Monogononta cs__ ci__ so__ o__Ploimida os__ sf__ f__ fs__ g__Brachionus
Propagated Ranks	d__Eukaryota sk__Holozoa k__Animalia ks__Animalia sp__Lophotrochozoa p__Rotifera ps__Rotifera pi__Rotifera sc__Rotifera c__Monogononta cs__Monogononta ci__Monogononta so__Monogononta o__Ploimida os__Ploimida sf__Ploimida f__Ploimida fs__Ploimida g__Brachionus

So if a user wanted infraclass w/o rank propagation they'd get: ci__With Rank propagation they'd get:ci__Monogononta`

There is probably a better example we can use...

update documentation for 2021.2.0

the installation docs etc are a bit out of date, e.g, regarding minimum versions etc.

For example, see this forum issue.

ENH: use DNAIterator instead of pandas for handling large fasta files?

dereplicate._dereplicate_taxa uses pd.Series as the view type for fasta files. This is convenient for small files but may pose problems with very large fasta files. DNAIterator or DNAFASTAFormat may be a better way to iterate across sequences.

Other functions might also use pd.Series for handling fasta files.

Need to test a bit further to see if this shaves off runtime/memory use.

Request for function to reverse complement

I would really like the ability to reverse complement sequences in QIIME 2 without re-orienting against a reference. Is this appropriate for RESCRIPt (if so, I'm happy to issue the PR), or is it better for another more central plugin?

add method to automatically download data files from SILVA

this would save users the trouble of navigating the massive number of SILVA files to find the correct inputs to parse_silva_taxonomy.

inputs:

version number
choose non-redundant???
anything else @mikerobeson ?

output:

matching SILVA taxid map, taxonomy, and taxa tree for that release, or maybe just parse the taxonomy and output the taxonomy file
matching SILVA ref seqs

add method for orienting mixed-orientation reads

Some ideas and features (from conversation with @mikerobeson ):

need util function to detect orientation — this could be used as a stand-alone validator?
Add reverse-complement option
Need to handle IUPAC codes appropriately in reverse complementation (skbio may have a built-in to do this?)

The workflow would be:

iterate over seqs in fasta
detect orientation (relative to reference sequences? kmer-based approach for quick estimation?)
fix orientation (as needed)
write out

Error from evaluate-fit-classifier: "This classifier does not support confidence values"

I got this error using rescript evaluate-fit-classifier. I am using qiime 2022.2 and scikit-learn 0.24.1 on a Linux server. I am attempting to create a reference database of the Sylvilagus bachmani genome so I can blast sequences to it to determine if they are from S. bachmani or another organism. I can recreate this error by doing this:

wget https://data.qiime2.org/distro/core/qiime2-2022.2-py38-linux-conda.yml
conda env create -n qiime2-2022.2 --file qiime2-2022.2-py38-linux-conda.yml
conda activate qiime2-2022.2

pip install git+https://github.com/bokulich-lab/RESCRIPt.git

qiime rescript get-ncbi-data --p-query '(512907[BioProject]) AND "Sylvilagus bachmani"[porgn:__txid365149] ' \
--o-sequences bachmani-refseqs-unfiltered.qza \
--o-taxonomy bachmani-refseqs-taxonomy-unfiltered.qza \
--p-n-jobs 5

qiime rescript evaluate-fit-classifier  --i-sequences bachmani-refseqs-unfiltered.qza \
 --i-taxonomy bachmani-refseqs-taxonomy-unfiltered.qza  \
 --o-classifier bachmani-classifier.qza \
--o-observed-taxonomy bachmani-classifier-predicted-taxonomy.qza \
--verbose

Error info:

Validation: 134.04s
/home/tangled/miniconda3/envs/qiime/lib/python3.8/site-packages/q2_feature_classifier/classifier.py:102: UserWarning: The TaxonomicClassifier artifact that results from this method was trained using scikit-learn version 0.24.1. It cannot be used with other versions of scikit-learn. (While the classifier may complete successfully, the results will be unreliable.)
warnings.warn(warning, UserWarning)
Training: 2663.73s
Traceback (most recent call last):
  File "/home/tangled/miniconda3/envs/qiime/lib/python3.8/site-packages/q2cli/commands.py", line 339, in __call__
    results = action(**arguments)
  File "<decorator-gen-457>", line 2, in evaluate_fit_classifier
  File "/home/tangled/miniconda3/envs/qiime/lib/python3.8/site-packages/qiime2/sdk/action.py", line 245, in bound_callable
    outputs = self._callable_executor_(scope, callable_args,
  File "/home/tangled/miniconda3/envs/qiime/lib/python3.8/site-packages/qiime2/sdk/action.py", line 485, in _callable_executor_
    outputs = self._callable(scope.ctx, **view_args)
  File "/home/tangled/miniconda3/envs/qiime/lib/python3.8/site-packages/rescript/cross_validate.py", line 48, in evaluate_fit_classifie
r
    observed_taxonomy, = classify(reads=sequences,
  File "<decorator-gen-513>", line 2, in classify_sklearn
  File "/home/tangled/miniconda3/envs/qiime/lib/python3.8/site-packages/qiime2/sdk/action.py", line 245, in bound_callable
    outputs = self._callable_executor_(scope, callable_args,
  File "/home/tangled/miniconda3/envs/qiime/lib/python3.8/site-packages/qiime2/sdk/action.py", line 391, in _callable_executor_
    output_views = self._callable(**view_args)
  File "/home/tangled/miniconda3/envs/qiime/lib/python3.8/site-packages/q2_feature_classifier/classifier.py", line 220, in classify_skl
earn
    seq_ids, taxonomy, confidence = list(zip(*predictions))
  File "/home/tangled/miniconda3/envs/qiime/lib/python3.8/site-packages/q2_feature_classifier/_skl.py", line 46, in predict
    for calculated in workers(jobs):
  File "/home/tangled/miniconda3/envs/qiime/lib/python3.8/site-packages/joblib/parallel.py", line 1043, in __call__
    if self.dispatch_one_batch(iterator):
  File "/home/tangled/miniconda3/envs/qiime/lib/python3.8/site-packages/joblib/parallel.py", line 861, in dispatch_one_batch
    self._dispatch(tasks)
  File "/home/tangled/miniconda3/envs/qiime/lib/python3.8/site-packages/joblib/parallel.py", line 779, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/home/tangled/miniconda3/envs/qiime/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 208, in apply_async
    result = ImmediateResult(func)
  File "/home/tangled/miniconda3/envs/qiime/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 572, in __init__
    self.results = batch()
  File "/home/tangled/miniconda3/envs/qiime/lib/python3.8/site-packages/joblib/parallel.py", line 262, in __call__
    return [func(*args, **kwargs)
  File "/home/tangled/miniconda3/envs/qiime/lib/python3.8/site-packages/joblib/parallel.py", line 262, in <listcomp>
    return [func(*args, **kwargs)
  File "/home/tangled/miniconda3/envs/qiime/lib/python3.8/site-packages/q2_feature_classifier/_skl.py", line 54, in _predict_chunk
    return _predict_chunk_with_conf(pipeline, separator, confidence, chunk)
  File "/home/tangled/miniconda3/envs/qiime/lib/python3.8/site-packages/q2_feature_classifier/_skl.py", line 70, in _predict_chunk_with
_conf
    raise ValueError('this classifier does not support confidence values')
ValueError: this classifier does not support confidence values

Plugin error from rescript:

  this classifier does not support confidence values

See above for debug info

Any idea why this error was created and what I'm doing wrong here? Is this a problem with the new 2022.2 qiime version?

git cloning problem

HI,
i cannot clone the RESCRIPt.git via pip. Every time i try i get the following error-** ERROR: Command errored out with exit status 128: git clone -q https://github.com/bokulich-lab/RESCRIPt.git /tmp/pip-req-build-ryakchyf Check the logs for full command output **. Tried to run pip directly into qiime2 env as well as by creating rescript env.

allow handling of compressed SILVA files.

Many of the files we parse from SILVA, to generate QIIME compatible, taxonomy classifiers, are provided as gzipped files here. It appears that RESCRIPt is unable to process these files unless they are decompressed prior to import into QIIME.

List requests in conda recipe requirements

To successfully build a conda package, requests dependency should to be added to the conda recipe requirements section: https://github.com/bokulich-lab/RESCRIPt/blob/master/ci/recipe/meta.yaml.

Fix broken versioneer tag (2020.11.0)

It looks like the latest release tag has a typo in it, the tag is 2020.11, but in order to ~~be PEP440/versioneer compliant~~ compatible with the QIIME 2 train release, the tag should actually be 2020.11.0. In theory we can just delete the broken tag and replace it with a new tag, but I am not sure what kind of URLs may or may not break because of that (and if it even matters).

what was killed?

Testing a large COI dataset with qiime rescript evaluate-fit-classifier using an AWS instance and the job keeps crashing. Not sure why the job is failing on this command:

qiime rescript evaluate-fit-classifier \
  --i-sequences "$HOME"/input/boldCOI.derep.seqs.qza \
  --i-taxonomy "$HOME"/input/boldCOI.derep.tax.qza \
  --p-reads-per-batch 5000 --p-n-jobs 4 \
  --o-classifier tidybug_nBayes_classifier_seqs.qza \
  --o-observed-taxonomy tidybug_nBayes_classifier_tax.qza \
  --o-evaluation tidybug_nBayes_classifier_viz.qzv --verbose

The information on the screen just says:

Killed

and the resulting .log file doesn't contain information about the stopped job:

Validation: 92.70s
/home/ubuntu/miniconda3/envs/rescript/lib/python3.6/site-packages/q2_feature_classifier/classifier.py:102: UserWarning: The TaxonomicClassifier artifact that results from this method was trained using scikit-learn version 0.21.2. It cannot be used with other versions of scikit-learn. (While the classifier may complete successfully, the results will be unreliable.)
  warnings.warn(warning, UserWarning)

I don't think it's a memory issue as this job is running on an AWS EC2 instance with over 250GB of RAM. It's possible though, as the job that's crashing was attempted while another job was running on the same machine. That other job was another RESCRIPt command:

qiime rescript evaluate-cross-validate \
     --i-sequences "$HOME"/input/boldCOI.derep.seqs.qza \
     --i-taxonomy "$HOME"/input/boldCOI.derep.tax.qza \
     --p-reads-per-batch 5000 --p-n-jobs 4 \
     --o-expected-taxonomy tidybug_crossval_expected_taxonomy.qza \
     --o-observed-taxonomy tidybug_crossval_observed_taxonomy.qza

And that job is doing fine. No crashing (though it has been running for more than 48 hours!).

Any ideas how to track where the issues might be originating?

ENH: add action for `get_unite_data`

As a plugin user, I would like to use RESCRIPt to download and format fungal ITS sequence and taxonomy data from the UNITE database.

The proposed get_unite_data would:

download sequence data from UNITE
download the corresponding taxonomy information
return FeatureData[Sequence] and FeatureData[Taxonomy]

The PlutoF API can be used to download data directly from UNITE: https://plutof.docs.apiary.io/#

Questions:

should we use this to download only UNITE versioned releases? Or allow more specific queries (e.g., taxon-specific)

ENH: add action for `get_gg_data`

I'm not sure how much this is needed, but is a compliment to #123 ENH: add action for get_unite_data

GreenGenes is distributed from ftp://greengenes.microbio.me/greengenes_release/ with a predictable naming scheme

Downsteam, this would replace https://github.com/caporaso-lab/pretrained-feature-classifiers/blob/master/get_gg.sh#L14-L15

dereplicate: add cluster % parameter

for optional OTU clustering

evaluate-cross-validate crashing

evaluate-cross-validate seems to have a persistent issue. I have tried to run it twice now with the same result. Here are the commands:

$ qiime rescript get-ncbi-data --p-query '(cytochrome c oxidase subunit 1[Title] OR cytochrome c oxidase subunit I[Title] OR cytochrome oxidase subunit 1[Title] OR cytochrome oxidase subunit I[Title] OR COX1[Title] OR CO1[Title] OR COI[Title]) AND "BARCODE"[KYWD]' --o-sequences seq.qza --o-taxonomy tax.qza --verbose --p-n-jobs 5
$ qiime rescript evaluate-cross-validate --i-sequences seq.qza --i-taxonomy tax.qza --o-expected-taxonomy expected.qza --o-observed-taxonomy observed.qza --o-evaluation evaluation.qzv --p-n-jobs 12

I'm running it under qiime2-2020.8.

I get

Plugin error from rescript:

  A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.

  The exit codes of the workers are {SIGKILL(-9)}

Debug info has been saved to /tmp/qiime2-q2cli-err-4z14w9x3.log

Please find the log file attached.

Migrate transformers, etc... to qiime2/q2-types

Given the recent updates to q2-types (qiime2/q2-types#256) to handle RNA data, and the current required functionality for RESCRIPt (e.g. DNA-to-RNA conversions), we should migrate some of these transformers and related content to q2-types. Slate this for the next QIIME 2 (2021.6??) and RESCRIPt releases.

should `get_silva_data` also (optionally) grab the associated phylogeny?

Add support for NCBI Datasets project

As a RESCRIPt user,
I want the NCBI's datasets/dataformat command-line tools support in RESCRIPt
so that I can easily download annotated genomes and their metadata by taxon/taxon ID.

See here: https://www.ncbi.nlm.nih.gov/datasets/docs/command-line-start/
And here some usage examples for viral genomes: https://www.ncbi.nlm.nih.gov/datasets/docs/command-line-virus/

BUG: merge_taxa LCA modes leave empty labels when there is no LCA

The LCA methods will leave blank taxonomies if there is no LCA (e.g., because one is "Unclassified").

Should instead fill in a placeholder label (e.g., 'Unassigned'), as blank labels can cause some downstream methods to break. This placeholder label string should be exposed as a parameter so that users can specify placeholder labels that match their pre-existing taxonomies.

Reported on the Q2 forum.

silence cross-validate warnings

cross-validate throws various warnings that can be ignored:

about sklearn version. This does not matter since the classifiers are not being output so cannot be used with other version.
and kfold cross-validation class count (if any classes have fewer members than K). This is handled by taxonomic relabeling so can just be ignored.

Should silence these warnings so they don't bother anybody.

dereplicate should have option to fill empty taxonomic ranks

@mikerobeson reported that this function is not filling empty taxonomic ranks following consensus taxonomic assignment. We should add an option to toggle rank filling on/off.

E.g., two identical sequences with the following taxonomy labels:
Bacteria;Genus;SpeciesA
Bacteria;Genus;SpeciesB

Would be collapsed to:
Bacteria;Genus

We should allow the option to collapse instead to:
Bacteria;Genus;__

Filling the ambiguous rank.

Question: How should we fill the ambiguous ranks? Should we allow input of a user-defined filler? What about defining ambiguous ranks for multiple levels? (e.g., the greengenes convention of g__, s__, etc) Or do we just default to "__"

CI/BUG: NCBI API rate limit exceeded

Logs:

https://github.com/bokulich-lab/RESCRIPt/runs/1127331123

Relevant subsection:

2020-09-17T07:55:26.0796330Z [command]/bin/bash /tmp/2020817-2640-717k0p.u5tdk.sh
2020-09-17T07:55:28.7487757Z ============================= test session starts ==============================
2020-09-17T07:55:28.7489124Z platform linux -- Python 3.6.11, pytest-6.0.2, py-1.9.0, pluggy-0.13.1
2020-09-17T07:55:28.7489901Z rootdir: /home/runner/work/RESCRIPt/RESCRIPt
2020-09-17T07:55:28.7490429Z collected 143 items
2020-09-17T07:55:28.7490917Z 
2020-09-17T07:55:30.7171971Z tests/test_cross_validate.py ..............                              [  9%]
2020-09-17T07:55:30.7323262Z tests/test_degap.py ..                                                   [ 11%]
2020-09-17T07:55:32.7420963Z tests/test_derep.py .............                                        [ 20%]
2020-09-17T07:55:34.1696564Z tests/test_evaluate.py .............                                     [ 29%]
2020-09-17T07:55:35.1489782Z tests/test_filter_length.py .............................                [ 49%]
2020-09-17T07:56:11.6230159Z tests/test_get_data.py ...                                               [ 51%]
2020-09-17T07:56:12.9456416Z tests/test_merge.py .............                                        [ 60%]
2020-09-17T07:56:49.8844414Z tests/test_ncbi.py .....F.                                               [ 65%]
2020-09-17T07:56:50.1811833Z tests/test_orient.py ...                                                 [ 67%]
2020-09-17T07:56:51.4930550Z tests/test_parse_silva_taxonomy.py ..............                        [ 77%]
2020-09-17T07:56:51.5290260Z tests/test_screenseq.py .....                                            [ 81%]
2020-09-17T07:56:51.5817501Z types/tests/test_methods.py .                                            [ 81%]
2020-09-17T07:56:51.6674875Z types/tests/test_types_formats_transformers.py ......................... [ 99%]
2020-09-17T07:56:51.6843012Z .                                                                        [100%]
2020-09-17T07:56:51.6843455Z 
2020-09-17T07:56:51.6843863Z =================================== FAILURES ===================================
2020-09-17T07:56:51.6844375Z ________________ TestNCBI.test_get_ncbi_data_query_mushroom_two ________________
2020-09-17T07:56:51.6844784Z 
2020-09-17T07:56:51.6845375Z self = <rescript.tests.test_ncbi.TestNCBI testMethod=test_get_ncbi_data_query_mushroom_two>
2020-09-17T07:56:51.6845931Z 
2020-09-17T07:56:51.6846357Z     def test_get_ncbi_data_query_mushroom_two(self):
2020-09-17T07:56:51.6846844Z         seq, tax = self.get_ncbi_data(
2020-09-17T07:56:51.6847986Z             query='MT345279.1',
2020-09-17T07:56:51.6848792Z             ranks=['domain', 'phylum', 'subphylum', 'superfamily', 'family',
2020-09-17T07:56:51.6852430Z >                  'subfamily', 'genus', 'species', 'subspecies']
2020-09-17T07:56:51.6852704Z         )
2020-09-17T07:56:51.6852817Z 
2020-09-17T07:56:51.6853338Z ../../../miniconda/envs/testing/lib/python3.6/site-packages/rescript/tests/test_ncbi.py:105: 
2020-09-17T07:56:51.6853727Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2020-09-17T07:56:51.6854120Z <decorator-gen-179>:2: in get_ncbi_data
2020-09-17T07:56:51.6854379Z     ???
2020-09-17T07:56:51.6854926Z ../../../miniconda/envs/testing/lib/python3.6/site-packages/qiime2/sdk/action.py:245: in bound_callable
2020-09-17T07:56:51.6855406Z     output_types, provenance)
2020-09-17T07:56:51.6856050Z ../../../miniconda/envs/testing/lib/python3.6/site-packages/qiime2/sdk/action.py:390: in _callable_executor_
2020-09-17T07:56:51.6856557Z     output_views = self._callable(**view_args)
2020-09-17T07:56:51.6857168Z ../../../miniconda/envs/testing/lib/python3.6/site-packages/rescript/ncbi.py:82: in get_ncbi_data
2020-09-17T07:56:51.6857665Z     taxids, ranks, rank_propagation, entrez_delay)
2020-09-17T07:56:51.6858318Z ../../../miniconda/envs/testing/lib/python3.6/site-packages/rescript/ncbi.py:198: in get_taxonomies
2020-09-17T07:56:51.6858795Z     records = _get(params, ids, entrez_delay)
2020-09-17T07:56:51.6859373Z ../../../miniconda/envs/testing/lib/python3.6/site-packages/rescript/ncbi.py:132: in _get
2020-09-17T07:56:51.6859775Z     content = parse(r.content)
2020-09-17T07:56:51.6860013Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2020-09-17T07:56:51.6860159Z 
2020-09-17T07:56:51.6860617Z xml_input = b'{"error":"API rate limit exceeded","api-key":"40.79.21.212","count":"4","limit":"3"}\n'
2020-09-17T07:56:51.6860959Z encoding = None
2020-09-17T07:56:51.6861596Z expat = <module 'xml.parsers.expat' from '/home/runner/miniconda/envs/testing/lib/python3.6/xml/parsers/expat.py'>
2020-09-17T07:56:51.6862233Z process_namespaces = False, namespace_separator = None, disable_entities = True
2020-09-17T07:56:51.6862777Z kwargs = {}, handler = <xmltodict._DictSAXHandler object at 0x7efeb3173f28>
2020-09-17T07:56:51.6863276Z parser = <pyexpat.xmlparser object at 0x7efeb3148b88>
2020-09-17T07:56:51.6863950Z feature = 'http://apache.org/xml/features/disallow-doctype-decl'
2020-09-17T07:56:51.6864296Z 
2020-09-17T07:56:51.6864631Z     def parse(xml_input, encoding=None, expat=expat, process_namespaces=False,
2020-09-17T07:56:51.6865240Z               namespace_separator=':', disable_entities=True, **kwargs):
2020-09-17T07:56:51.6865673Z         """Parse the given XML input and convert it into a dictionary.
2020-09-17T07:56:51.6866133Z     
2020-09-17T07:56:51.6866570Z         `xml_input` can either be a `string` or a file-like object.
2020-09-17T07:56:51.6866846Z     
2020-09-17T07:56:51.6867242Z         If `xml_attribs` is `True`, element attributes are put in the dictionary
2020-09-17T07:56:51.6867926Z         among regular child elements, using `@` as a prefix to avoid collisions. If
2020-09-17T07:56:51.6868352Z         set to `False`, they are just ignored.
2020-09-17T07:56:51.6868575Z     
2020-09-17T07:56:51.6868800Z         Simple example::
2020-09-17T07:56:51.6869006Z     
2020-09-17T07:56:51.6869218Z             >>> import xmltodict
2020-09-17T07:56:51.6869509Z             >>> doc = xmltodict.parse(\"\"\"
2020-09-17T07:56:51.6869771Z             ... <a prop="x">
2020-09-17T07:56:51.6869963Z             ...   <b>1</b>
2020-09-17T07:56:51.6870144Z             ...   <b>2</b>
2020-09-17T07:56:51.6870327Z             ... </a>
2020-09-17T07:56:51.6870500Z             ... \"\"\")
2020-09-17T07:56:51.6870996Z             >>> doc['a']['@prop']
2020-09-17T07:56:51.6871283Z             u'x'
2020-09-17T07:56:51.6871559Z             >>> doc['a']['b']
2020-09-17T07:56:51.6871818Z             [u'1', u'2']
2020-09-17T07:56:51.6871988Z     
2020-09-17T07:56:51.6872287Z         If `item_depth` is `0`, the function returns a dictionary for the root
2020-09-17T07:56:51.6872759Z         element (default behavior). Otherwise, it calls `item_callback` every time
2020-09-17T07:56:51.6873223Z         an item at the specified depth is found and returns `None` in the end
2020-09-17T07:56:51.6873557Z         (streaming mode).
2020-09-17T07:56:51.6873757Z     
2020-09-17T07:56:51.6874091Z         The callback function receives two parameters: the `path` from the document
2020-09-17T07:56:51.6874698Z         root to the item (name-attribs pairs), and the `item` (dict). If the
2020-09-17T07:56:51.6875447Z         callback's return value is false-ish, parsing will be stopped with the
2020-09-17T07:56:51.6875913Z         :class:`ParsingInterrupted` exception.
2020-09-17T07:56:51.6876212Z     
2020-09-17T07:56:51.6876427Z         Streaming example::
2020-09-17T07:56:51.6876636Z     
2020-09-17T07:56:51.6876842Z             >>> def handle(path, item):
2020-09-17T07:56:51.6877260Z             ...     print('path:%s item:%s' % (path, item))
2020-09-17T07:56:51.6877512Z             ...     return True
2020-09-17T07:56:51.6877699Z             ...
2020-09-17T07:56:51.6877926Z             >>> xmltodict.parse(\"\"\"
2020-09-17T07:56:51.6878167Z             ... <a prop="x">
2020-09-17T07:56:51.6878351Z             ...   <b>1</b>
2020-09-17T07:56:51.6878525Z             ...   <b>2</b>
2020-09-17T07:56:51.6878779Z             ... </a>\"\"\", item_depth=2, item_callback=handle)
2020-09-17T07:56:51.6879217Z             path:[(u'a', {u'prop': u'x'}), (u'b', None)] item:1
2020-09-17T07:56:51.6879618Z             path:[(u'a', {u'prop': u'x'}), (u'b', None)] item:2
2020-09-17T07:56:51.6879835Z     
2020-09-17T07:56:51.6880162Z         The optional argument `postprocessor` is a function that takes `path`,
2020-09-17T07:56:51.6880649Z         `key` and `value` as positional arguments and returns a new `(key, value)`
2020-09-17T07:56:51.6881108Z         pair where both `key` and `value` may have changed. Usage example::
2020-09-17T07:56:51.6881398Z     
2020-09-17T07:56:51.6881649Z             >>> def postprocessor(path, key, value):
2020-09-17T07:56:51.6881895Z             ...     try:
2020-09-17T07:56:51.6882250Z             ...         return key + ':int', int(value)
2020-09-17T07:56:51.6882558Z             ...     except (ValueError, TypeError):
2020-09-17T07:56:51.6882847Z             ...         return key, value
2020-09-17T07:56:51.6883455Z             >>> xmltodict.parse('<a><b>1</b><b>2</b><b>x</b></a>',
2020-09-17T07:56:51.6883836Z             ...                 postprocessor=postprocessor)
2020-09-17T07:56:51.6884387Z             OrderedDict([(u'a', OrderedDict([(u'b:int', [1, 2]), (u'b', u'x')]))])
2020-09-17T07:56:51.6884676Z     
2020-09-17T07:56:51.6885010Z         You can pass an alternate version of `expat` (such as `defusedexpat`) by
2020-09-17T07:56:51.6885537Z         using the `expat` parameter. E.g:
2020-09-17T07:56:51.6885779Z     
2020-09-17T07:56:51.6886059Z             >>> import defusedexpat
2020-09-17T07:56:51.6886676Z             >>> xmltodict.parse('<a>hello</a>', expat=defusedexpat.pyexpat)
2020-09-17T07:56:51.6887238Z             OrderedDict([(u'a', u'hello')])
2020-09-17T07:56:51.6887475Z     
2020-09-17T07:56:51.6887788Z         You can use the force_list argument to force lists to be created even
2020-09-17T07:56:51.6888238Z         when there is only a single child of a given level of hierarchy. The
2020-09-17T07:56:51.6888693Z         force_list argument is a tuple of keys. If the key for a given level
2020-09-17T07:56:51.6889158Z         of hierarchy is in the force_list argument, that level of hierarchy
2020-09-17T07:56:51.6889769Z         will have a list as a child (even if there is only one sub-element).
2020-09-17T07:56:51.6890252Z         The index_keys operation takes precendence over this. This is applied
2020-09-17T07:56:51.6890888Z         after any user-supplied postprocessor has already run.
2020-09-17T07:56:51.6891217Z     
2020-09-17T07:56:51.6891453Z             For example, given this input:
2020-09-17T07:56:51.6891712Z             <servers>
2020-09-17T07:56:51.6891921Z               <server>
2020-09-17T07:56:51.6892145Z                 <name>host1</name>
2020-09-17T07:56:51.6892375Z                 <os>Linux</os>
2020-09-17T07:56:51.6892607Z                 <interfaces>
2020-09-17T07:56:51.6892847Z                   <interface>
2020-09-17T07:56:51.6893079Z                     <name>em0</name>
2020-09-17T07:56:51.6893339Z                     <ip_address>10.0.0.1</ip_address>
2020-09-17T07:56:51.6893582Z                   </interface>
2020-09-17T07:56:51.6893822Z                 </interfaces>
2020-09-17T07:56:51.6894047Z               </server>
2020-09-17T07:56:51.6894253Z             </servers>
2020-09-17T07:56:51.6894444Z     
2020-09-17T07:56:51.6894882Z             If called with force_list=('interface',), it will produce
2020-09-17T07:56:51.6895222Z             this dictionary:
2020-09-17T07:56:51.6895569Z             {'servers':
2020-09-17T07:56:51.6895884Z               {'server':
2020-09-17T07:56:51.6896210Z                 {'name': 'host1',
2020-09-17T07:56:51.6896530Z                  'os': 'Linux'},
2020-09-17T07:56:51.6896867Z                  'interfaces':
2020-09-17T07:56:51.6897215Z                   {'interface':
2020-09-17T07:56:51.6897599Z                     [ {'name': 'em0', 'ip_address': '10.0.0.1' } ] } } }
2020-09-17T07:56:51.6897825Z     
2020-09-17T07:56:51.6898306Z             `force_list` can also be a callable that receives `path`, `key` and
2020-09-17T07:56:51.6898774Z             `value`. This is helpful in cases where the logic that decides whether
2020-09-17T07:56:51.6899195Z             a list should be forced is more complex.
2020-09-17T07:56:51.6899454Z         """
2020-09-17T07:56:51.6899846Z         handler = _DictSAXHandler(namespace_separator=namespace_separator,
2020-09-17T07:56:51.6900270Z                                   **kwargs)
2020-09-17T07:56:51.6900562Z         if isinstance(xml_input, _unicode):
2020-09-17T07:56:51.6900862Z             if not encoding:
2020-09-17T07:56:51.6901416Z                 encoding = 'utf-8'
2020-09-17T07:56:51.6901730Z             xml_input = xml_input.encode(encoding)
2020-09-17T07:56:51.6902067Z         if not process_namespaces:
2020-09-17T07:56:51.6902375Z             namespace_separator = None
2020-09-17T07:56:51.6902724Z         parser = expat.ParserCreate(
2020-09-17T07:56:51.6903031Z             encoding,
2020-09-17T07:56:51.6903269Z             namespace_separator
2020-09-17T07:56:51.6903494Z         )
2020-09-17T07:56:51.6903666Z         try:
2020-09-17T07:56:51.6903943Z             parser.ordered_attributes = True
2020-09-17T07:56:51.6904292Z         except AttributeError:
2020-09-17T07:56:51.6904811Z             # Jython's expat does not support ordered_attributes
2020-09-17T07:56:51.6905201Z             pass
2020-09-17T07:56:51.6905783Z         parser.StartNamespaceDeclHandler = handler.startNamespaceDecl
2020-09-17T07:56:51.6906678Z         parser.StartElementHandler = handler.startElement
2020-09-17T07:56:51.6907352Z         parser.EndElementHandler = handler.endElement
2020-09-17T07:56:51.6908016Z         parser.CharacterDataHandler = handler.characters
2020-09-17T07:56:51.6908516Z         parser.buffer_text = True
2020-09-17T07:56:51.6908805Z         if disable_entities:
2020-09-17T07:56:51.6909033Z             try:
2020-09-17T07:56:51.6909503Z                 # Attempt to disable DTD in Jython's expat parser (Xerces-J).
2020-09-17T07:56:51.6910204Z                 feature = "http://apache.org/xml/features/disallow-doctype-decl"
2020-09-17T07:56:51.6910761Z                 parser._reader.setFeature(feature, True)
2020-09-17T07:56:51.6911150Z             except AttributeError:
2020-09-17T07:56:51.6911467Z                 # For CPython / expat parser.
2020-09-17T07:56:51.6912006Z                 # Anything not handled ends up here and entities aren't expanded.
2020-09-17T07:56:51.6912484Z                 parser.DefaultHandler = lambda x: None
2020-09-17T07:56:51.6913134Z                 # Expects an integer return; zero means failure -> expat.ExpatError.
2020-09-17T07:56:51.6913755Z                 parser.ExternalEntityRefHandler = lambda *x: 1
2020-09-17T07:56:51.6914349Z         if hasattr(xml_input, 'read'):
2020-09-17T07:56:51.6914674Z             parser.ParseFile(xml_input)
2020-09-17T07:56:51.6914947Z         else:
2020-09-17T07:56:51.6920272Z >           parser.Parse(xml_input, True)
2020-09-17T07:56:51.6921390Z E           xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 0
2020-09-17T07:56:51.6921830Z 
2020-09-17T07:56:51.6922368Z ../../../miniconda/envs/testing/lib/python3.6/site-packages/xmltodict.py:327: ExpatError
2020-09-17T07:56:51.6922837Z =============================== warnings summary ===============================
2020-09-17T07:56:51.6923186Z tests/test_cross_validate.py: 210 warnings
2020-09-17T07:56:51.6923530Z tests/test_degap.py: 24 warnings
2020-09-17T07:56:51.6923841Z tests/test_derep.py: 45 warnings
2020-09-17T07:56:51.6924163Z tests/test_evaluate.py: 32 warnings
2020-09-17T07:56:51.6924509Z tests/test_filter_length.py: 64 warnings
2020-09-17T07:56:51.6924835Z tests/test_ncbi.py: 20 warnings
2020-09-17T07:56:51.6925146Z tests/test_orient.py: 76 warnings
2020-09-17T07:56:51.6925480Z tests/test_screenseq.py: 44 warnings
2020-09-17T07:56:51.6925837Z types/tests/test_methods.py: 64 warnings
2020-09-17T07:56:51.6926261Z types/tests/test_types_formats_transformers.py: 128 warnings
2020-09-17T07:56:51.6927297Z   /home/runner/miniconda/envs/testing/lib/python3.6/site-packages/skbio/sequence/_sequence.py:427: DeprecationWarning: tostring() is deprecated. Use tobytes() instead.
2020-09-17T07:56:51.6928213Z     return self._bytes.tostring()
2020-09-17T07:56:51.6928427Z 
2020-09-17T07:56:51.6928846Z tests/test_cross_validate.py::TestPipelines::test_evaluate_classifications
2020-09-17T07:56:51.6929578Z tests/test_cross_validate.py::TestPipelines::test_evaluate_classifications_expected_id_superset_valid
2020-09-17T07:56:51.6930391Z tests/test_cross_validate.py::TestPipelines::test_evaluate_classifications_observed_id_superset_invalid
2020-09-17T07:56:51.6931278Z tests/test_cross_validate.py::TestPipelines::test_evaluate_cross_validate_k3
2020-09-17T07:56:51.6931867Z tests/test_cross_validate.py::TestPipelines::test_evaluate_fit_classifier
2020-09-17T07:56:51.6932491Z tests/test_evaluate.py::TestEvaluateUtilities::test_process_labels_empty
2020-09-17T07:56:51.6934217Z   /home/runner/miniconda/envs/testing/lib/python3.6/site-packages/rescript/evaluate.py:76: UserWarning: The lists of input taxonomies and labels are different lengths. Additional taxonomies will be labeled numerically by their order in the inputs. Note that if these numbers match existing labels, those data will be grouped in the visualization.
2020-09-17T07:56:51.6935544Z     warnings.warn(msg, UserWarning)
2020-09-17T07:56:51.6935784Z 
2020-09-17T07:56:51.6936169Z tests/test_cross_validate.py::TestPipelines::test_evaluate_fit_classifier
2020-09-17T07:56:51.6938183Z   /home/runner/miniconda/envs/testing/lib/python3.6/site-packages/q2_feature_classifier/classifier.py:102: UserWarning: The TaxonomicClassifier artifact that results from this method was trained using scikit-learn version 0.23.1. It cannot be used with other versions of scikit-learn. (While the classifier may complete successfully, the results will be unreliable.)
2020-09-17T07:56:51.6939533Z     warnings.warn(warning, UserWarning)
2020-09-17T07:56:51.6939798Z 
2020-09-17T07:56:51.6940130Z tests/test_derep.py::TestDerep::test_dereplicate_numericIDs
2020-09-17T07:56:51.6940622Z tests/test_derep.py::TestDerep::test_dereplicate_uniq
2020-09-17T07:56:51.6941254Z tests/test_derep.py::TestDerep::test_dereplicate_uniq
2020-09-17T07:56:51.6941710Z tests/test_derep.py::TestDerep::test_dereplicate_uniq_99_perc
2020-09-17T07:56:51.6942183Z tests/test_derep.py::TestDerep::test_dereplicate_uniq_99_perc
2020-09-17T07:56:51.6943111Z   /home/runner/miniconda/envs/testing/lib/python3.6/site-packages/rescript/dereplicate.py:116: SettingWithCopyWarning: 
2020-09-17T07:56:51.6943817Z   A value is trying to be set on a copy of a slice from a DataFrame.
2020-09-17T07:56:51.6944252Z   Try using .loc[row_indexer,col_indexer] = value instead
2020-09-17T07:56:51.6944547Z   
2020-09-17T07:56:51.6945482Z   See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
2020-09-17T07:56:51.6947562Z     uc['Taxon'] = uc['seqID'].apply(lambda x: taxa.loc[x])
2020-09-17T07:56:51.6947872Z 
2020-09-17T07:56:51.6948315Z tests/test_derep.py::TestDerep::test_dereplicate_numericIDs
2020-09-17T07:56:51.6948936Z tests/test_derep.py::TestDerep::test_dereplicate_uniq
2020-09-17T07:56:51.6949821Z tests/test_derep.py::TestDerep::test_dereplicate_uniq
2020-09-17T07:56:51.6950254Z tests/test_derep.py::TestDerep::test_dereplicate_uniq_99_perc
2020-09-17T07:56:51.6950711Z tests/test_derep.py::TestDerep::test_dereplicate_uniq_99_perc
2020-09-17T07:56:51.6951574Z   /home/runner/miniconda/envs/testing/lib/python3.6/site-packages/rescript/dereplicate.py:117: SettingWithCopyWarning: 
2020-09-17T07:56:51.6952243Z   A value is trying to be set on a copy of a slice from a DataFrame.
2020-09-17T07:56:51.6952653Z   Try using .loc[row_indexer,col_indexer] = value instead
2020-09-17T07:56:51.6952930Z   
2020-09-17T07:56:51.6953796Z   See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
2020-09-17T07:56:51.6954767Z     uc['centroidtaxa'] = uc['centroidID'].apply(lambda x: taxa.loc[x])
2020-09-17T07:56:51.6955045Z 
2020-09-17T07:56:51.6955549Z tests/test_evaluate.py::TestEvaluateSeqs::test_evaluate_seqs_visualizer
2020-09-17T07:56:51.6957106Z   /home/runner/miniconda/envs/testing/lib/python3.6/site-packages/rescript/evaluate.py:76: UserWarning: The lists of input sequences and labels are different lengths. Additional sequences will be labeled numerically by their order in the inputs. Note that if these numbers match existing labels, those data will be grouped in the visualization.
2020-09-17T07:56:51.6958213Z     warnings.warn(msg, UserWarning)
2020-09-17T07:56:51.6958426Z 
2020-09-17T07:56:51.6958857Z tests/test_filter_length.py::TestFilterByTaxonomy::test_filter_seqs_length_by_taxon_no_failures
2020-09-17T07:56:51.6960446Z   /home/runner/miniconda/envs/testing/lib/python3.6/site-packages/skbio/io/registry.py:548: FormatIdentificationWarning: <_io.BufferedReader name='/tmp/qiime2-archive-q2bek65f/6a14d996-19a1-45fb-8b7a-737c73685ce6/data/dna-sequences.fasta'> does not look like a fasta file
2020-09-17T07:56:51.6961674Z     % (file, fmt), FormatIdentificationWarning)
2020-09-17T07:56:51.6961999Z 
2020-09-17T07:56:51.6962439Z tests/test_filter_length.py::TestFilterByTaxonomy::test_filter_seqs_length_by_taxon_no_seqs_pass_filter
2020-09-17T07:56:51.6964279Z   /home/runner/miniconda/envs/testing/lib/python3.6/site-packages/skbio/io/registry.py:548: FormatIdentificationWarning: <_io.BufferedReader name='/tmp/qiime2-archive-dw02fz5f/5fca65f3-674d-4a5f-b175-8d33a894a62c/data/dna-sequences.fasta'> does not look like a fasta file
2020-09-17T07:56:51.6965536Z     % (file, fmt), FormatIdentificationWarning)
2020-09-17T07:56:51.6965844Z 
2020-09-17T07:56:51.6966254Z tests/test_filter_length.py::TestFilterGlobally::test_filter_seqs_length_all_filtered_out
2020-09-17T07:56:51.6967862Z   /home/runner/miniconda/envs/testing/lib/python3.6/site-packages/skbio/io/registry.py:548: FormatIdentificationWarning: <_io.BufferedReader name='/tmp/qiime2-archive-5054xhrm/eca0544b-b263-4cce-804f-c98c10ea635b/data/dna-sequences.fasta'> does not look like a fasta file
2020-09-17T07:56:51.6969347Z     % (file, fmt), FormatIdentificationWarning)
2020-09-17T07:56:51.6969690Z 
2020-09-17T07:56:51.6970106Z tests/test_filter_length.py::TestFilterGlobally::test_filter_seqs_length_no_failures
2020-09-17T07:56:51.6971892Z   /home/runner/miniconda/envs/testing/lib/python3.6/site-packages/skbio/io/registry.py:548: FormatIdentificationWarning: <_io.BufferedReader name='/tmp/qiime2-archive-zh0holl6/4403e41a-2a07-4155-b62e-0b88be2bdaef/data/dna-sequences.fasta'> does not look like a fasta file
2020-09-17T07:56:51.6973101Z     % (file, fmt), FormatIdentificationWarning)
2020-09-17T07:56:51.6973410Z 
2020-09-17T07:56:51.6973706Z tests/test_get_data.py::TestGetSILVA::test_get_silva_data
2020-09-17T07:56:51.6974927Z   /home/runner/miniconda/envs/testing/lib/python3.6/site-packages/skbio/io/registry.py:548: FormatIdentificationWarning: <_io.BufferedReader name='/tmp/q2-RNAFASTAFormat-db5v51nj'> does not look like a fasta file
2020-09-17T07:56:51.6975979Z     % (file, fmt), FormatIdentificationWarning)
2020-09-17T07:56:51.6976304Z 
2020-09-17T07:56:51.6976596Z tests/test_get_data.py::TestGetSILVA::test_get_silva_data
2020-09-17T07:56:51.6978025Z   /home/runner/miniconda/envs/testing/lib/python3.6/site-packages/skbio/io/registry.py:548: FormatIdentificationWarning: <_io.BufferedReader name='/tmp/q2-RNASequencesDirectoryFormat-usr_jxwf/rna-sequences.fasta'> does not look like a fasta file
2020-09-17T07:56:51.6979471Z     % (file, fmt), FormatIdentificationWarning)
2020-09-17T07:56:51.6979794Z 
2020-09-17T07:56:51.6980104Z tests/test_get_data.py::TestGetSILVA::test_get_silva_data
2020-09-17T07:56:51.6981356Z   /home/runner/miniconda/envs/testing/lib/python3.6/site-packages/skbio/io/registry.py:548: FormatIdentificationWarning: <_io.BufferedReader name='/tmp/q2-RNAFASTAFormat-zj32_lqa'> does not look like a fasta file
2020-09-17T07:56:51.6982595Z     % (file, fmt), FormatIdentificationWarning)
2020-09-17T07:56:51.6982919Z 
2020-09-17T07:56:51.6983212Z tests/test_get_data.py::TestGetSILVA::test_get_silva_data
2020-09-17T07:56:51.6984704Z   /home/runner/miniconda/envs/testing/lib/python3.6/site-packages/skbio/io/registry.py:548: FormatIdentificationWarning: <_io.BufferedReader name='/tmp/q2-RNASequencesDirectoryFormat-4hihakeb/rna-sequences.fasta'> does not look like a fasta file
2020-09-17T07:56:51.6986017Z     % (file, fmt), FormatIdentificationWarning)
2020-09-17T07:56:51.6986330Z 
2020-09-17T07:56:51.6986814Z -- Docs: https://docs.pytest.org/en/stable/warnings.html
2020-09-17T07:56:51.6987239Z =========================== short test summary info ============================
2020-09-17T07:56:51.6987925Z FAILED tests/test_ncbi.py::TestNCBI::test_get_ncbi_data_query_mushroom_two - ...
2020-09-17T07:56:51.6988335Z ============ 1 failed, 142 passed, 733 warnings in 85.28s (0:01:25) ============
2020-09-17T07:56:51.9957401Z ##[error]The process '/bin/bash' failed with exit code 1

additional tests failed

build conda package

see here: https://dev.qiime2.org/latest/tutorials/conda-tutorial/

Request for update to Taxon Table

Request to update the get-ncbi-data to format the taxon table as it would appear in SILVA. Currently, the scripts populates all ranks regardless of presence in TaxID lineage.

Example of actual:

Feature ID Taxon
KR045484.1 k__Bacteria; p__Bacteria; c__Bacteria; o__Bacteria; f__Bacteria; g__uncultured; s__bacterium

Example of desired:

Feature ID Taxon
KR045484.1 k__Bacteria; p__; c__; o__; f__; g__; s__

Additionally, would it be possible to allow for specification of accession number target range? E.x. 'KR045484.1:3-200'. Or, to specify only to retrieve the taxonomy? This utility would be ideal for those of us with a curated list of sequences from a HMM. It is common to retrieve sequences from genomic assemblies. It is also possible to have multiple copies with genetic variation between the genomic regions.

add length mode to dereplicate?

choose longest taxonomy string when dereplicating.

This might then also include rank_handle and new_rank_handle params, though that gets complicated...

BUG: merge_taxa reorders columns during merge

It seems that when combining DataFrames containing more columns than just 'Taxa' , the columns sometimes get reordered and the 'Taxa' column ends up not being first.

See here for more details: QIIME2 forum post

dereplicate chokes on numeric seqIDs (pandas index)

Appears to be an issue with numeric indices in pandas. I will try to fix immediately.

Traceback (most recent call last):
  File "/scratch/nab336/envs/rescript/lib/python3.6/site-packages/q2cli/commands
.py", line 328, in __call__
    results = action(**arguments)
  File "<decorator-gen-151>", line 2, in dereplicate
  File "/scratch/nab336/envs/rescript/lib/python3.6/site-packages/qiime2/sdk/act
ion.py", line 245, in bound_callable
    output_types, provenance)
  File "/scratch/nab336/envs/rescript/lib/python3.6/site-packages/qiime2/sdk/act
ion.py", line 390, in _callable_executor_
    output_views = self._callable(**view_args)
  File "/scratch/nab336/envs/rescript/lib/python3.6/site-packages/rescript/derep
licate.py", line 61, in dereplicate
    taxa, sequences, derep_seqs, uc, mode=mode)
  File "/scratch/nab336/envs/rescript/lib/python3.6/site-packages/rescript/dereplicate.py", line 122, in _dereplicate_taxa
    uc['Taxon'] = uc['seqID'].apply(lambda x: taxa.loc[x])
  File "/scratch/nab336/envs/rescript/lib/python3.6/site-packages/pandas/core/series.py", line 4045, in apply
    mapped = lib.map_infer(values, f, convert=convert_dtype)
  File "pandas/_libs/lib.pyx", line 2228, in pandas._libs.lib.map_infer
  File "/scratch/nab336/envs/rescript/lib/python3.6/site-packages/rescript/dereplicate.py", line 122, in <lambda>
    uc['Taxon'] = uc['seqID'].apply(lambda x: taxa.loc[x])
  File "/scratch/nab336/envs/rescript/lib/python3.6/site-packages/pandas/core/indexing.py", line 1424, in __getitem__
    return self._getitem_axis(maybe_callable, axis=axis)
  File "/scratch/nab336/envs/rescript/lib/python3.6/site-packages/pandas/core/indexing.py", line 1849, in _getitem_axis
    self._validate_key(key, axis)
  File "/scratch/nab336/envs/rescript/lib/python3.6/site-packages/pandas/core/indexing.py", line 1725, in _validate_key
    self._convert_scalar_indexer(key, axis)
  File "/scratch/nab336/envs/rescript/lib/python3.6/site-packages/pandas/core/indexing.py", line 274, in _convert_scalar_indexer
    return ax._convert_scalar_indexer(key, kind=self.name)
  File "/scratch/nab336/envs/rescript/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 3138, in _convert_scalar_indexer
    return self._invalid_indexer("label", key)
  File "/scratch/nab336/envs/rescript/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 3340, in _invalid_indexer
    form=form, klass=type(self), key=key, kind=type(key)
TypeError: cannot do label indexing on <class 'pandas.core.indexes.base.Index'> with these indexers [1078587] of <class 'int'>

Plugin error from rescript:

  cannot do label indexing on <class 'pandas.core.indexes.base.Index'> with these indexers [1078587] of <class 'int'>

add `extract-alignment-region`

In light of #30 and #32 it would be prudent to add an alternative approach to the feature-classifier extract-reads approach. That is, extract the region of interest from an alignment rather than a primer search.

Specifically, the user would be able to specify the alignment positions from which to extract the region of interest. The intent is to refactor the extract_alignment_region.py code from this pipeline.

add "FeatureData[AlignedRNASequence]" type

We'll need this, if we want to allow future editing and parsing of curated alignments from SILVA, e.g. SILVA_138_SSURef_tax_silva_full_align_trunc.fasta.gz.

Importing as FeatureData[AlignedSequence] obviously fails.

I think we can currently import this as FeatureData[RNASequence], but this might cause confusion / issues downstream when users try to manipulate t=when imported this way.

IMP: update dereplicate backfilling functionality

Update the dereplicate action to handle any number or combination of taxonomic ranks the user would like to use when implementing any of the LCA approaches. For the moment, we can only handle dealing with a standard 7-rank taxonomy, i.e. dpcofgs.

An example of this issue is on the QIIME 2 Forum.

dereplicate: add derep_prefix option

c.f. https://github.com/qiime2/q2-vsearch/blob/144a2c60330308087cb89d87bafc53568af1dad4/q2_vsearch/_cluster_sequences.py#L130

feature request for `dereplicate`

Hi all,

There was a recent question posted to the QIIME forum that got me thinking about why the results might differ between running stand-alone QIIME clustering and vsearch-standalone clustering.

I'm curious if it would be a significant challenge to add an additional component to qiime rescript dereplicate that would follow allow for the user to supply a feature-table file. This information is only relevant when users want to invoke the --p-perc-identity argument. At the moment, this triggers vsearch to use --clust_size, but because the fasta header is formatted with only feature information (and not abundance information) the centroid selection defaults to the longest sequence in the cluster, rather than the most abundant sequence. If the feature-table information could be imported with qiime rescript dereplicate, the relevant abundance information might then be available for each sequence feature (summed across all samples with that sequence feature), and inserted into the clustering command.

I'm not sure whether or not you want the default to be the length of the sequence or the abundance of the sequence feature - my guess is the sequence abundances are probably more relevant when defining a cluster than the length, but perhaps this varies a lot across marker genes, and the sequence fragments generated therein.

Thanks for your consideration!

add `degap_seq` method?

convert FeatureData[Aligned*Sequence] --> FeatureData[*Sequence]

both DNA and RNA formats

c.f., #30

add license text

I think we should include / mention the silva license link within the help text of both retrieve-silva-data and parse-silva-taxonomy. Is there a way we can add this as a "citation"?

parallelize or replace pandas apply functions

Some formatting steps in various actions, e.g., cross-validate are single-core bottlenecks in pandas.

Should either:

Parallelize. Use swifter or dask or something to apply eligible functions in parallel.
Replace formats. Some of these steps (e.g., validation in cross-validate) needn't necessarily be done in pandas, but it is convenient. Consider replacing those steps with a more efficient format... e.g., validation could be done by looping over feature IDs with skbio???

Update `get-silva-data` for downloading SILVA 138.1 LSU data

The SILVA 138.1 LSU files are available here. This was brought to our attention in the QIIME 2 forum

bokulich-lab / rescript Goto Github PK

rescript's People

Contributors

Stargazers

Watchers

Forkers

rescript's Issues

Recommend Projects

Recommend Topics

Recommend Org