bokulich-lab / rescript
REference Sequence annotation and CuRatIon Pipeline
License: BSD 3-Clause "New" or "Revised" License
We should add a gallery showing example QZVs and displaying example output plots inline (e.g., to give users a preview).
This could be added to this repo, or alternatively to the QIIME 2 forum tutorials?
Provide the ability for users to download RefSeq Genome Assemblies, along with their associated taxonomy. See this forum thread for more details.
Some notes:
(txid2[orgn] OR txid2157[orgn]) AND "latest_refseq"[Properties]
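As a sketch of how a query like the one above could be tested programmatically: NCBI's E-utilities esearch endpoint accepts Entrez query strings. The helper name `build_esearch_url` and the choice of the `assembly` database are illustrative assumptions, not RESCRIPt code; the endpoint and parameter names are standard E-utilities usage.

```python
# Hedged sketch: build an E-utilities esearch URL for the Entrez query above.
# build_esearch_url and the "assembly" database choice are illustrative
# assumptions; the endpoint and parameters are standard E-utilities usage.
import urllib.parse

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_esearch_url(query, db="assembly", retmax=0):
    """Return an esearch URL that reports how many records match `query`."""
    params = {"db": db, "term": query, "retmax": retmax, "retmode": "json"}
    return ESEARCH + "?" + urllib.parse.urlencode(params)

url = build_esearch_url(
    '(txid2[orgn] OR txid2157[orgn]) AND "latest_refseq"[Properties]')
# Fetching this URL (e.g. with requests.get) returns JSON whose
# esearchresult.count field gives the number of matching records.
```

This only constructs the URL; the actual download and taxonomy retrieval would go through RESCRIPt's NCBI machinery.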
This isn't really an issue with this package, but it would be nice if RESCRIPt were included in the Docker image; currently qiime rescript doesn't work in quay.io/qiime2/core:2021.4.
Hello,
After installing RESCRIPt on Ubuntu 16.04 (QIIME2-2020.2) using
conda create -y -n rescript
conda activate rescript
conda install \
-c conda-forge -c bioconda -c qiime2 -c defaults \
qiime2 q2cli q2templates q2-types q2-longitudinal q2-feature-classifier "pandas>=0.25.3" xmltodict
pip install git+https://github.com/bokulich-lab/RESCRIPt.git
I get the error below. I cannot execute any commands in QIIME 2 anymore, as this error pops up. I haven't recently installed any other packages in my conda environment, such as Songbird or Tensorflow; the error only began after installing RESCRIPt. I've also tried running qiime dev refresh-cache, with the same error appearing.
QIIME is caching your current deployment for improved performance. This may take a few moments and should only happen once per deployment.
Traceback (most recent call last):
File "/home/dunfield/anaconda3/envs/qiime2-2020.2/bin/qiime", line 11, in <module>
sys.exit(qiime())
File "/home/dunfield/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/click/core.py", line 764, in __call__
return self.main(*args, **kwargs)
File "/home/dunfield/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/click/core.py", line 716, in main
with self.make_context(prog_name, args, **extra) as ctx:
File "/home/dunfield/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/click/core.py", line 641, in make_context
self.parse_args(ctx, args)
File "/home/dunfield/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/q2cli/click/command.py", line 43, in parse_args
return super().parse_args(ctx, args)
File "/home/dunfield/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/click/core.py", line 1086, in parse_args
echo(ctx.get_help(), color=ctx.color)
File "/home/dunfield/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/click/core.py", line 516, in get_help
return self.command.get_help(self)
File "/home/dunfield/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/click/core.py", line 879, in get_help
self.format_help(ctx, formatter)
File "/home/dunfield/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/click/core.py", line 898, in format_help
self.format_options(ctx, formatter)
File "/home/dunfield/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/q2cli/click/command.py", line 157, in format_options
for subcommand in self.list_commands(ctx):
File "/home/dunfield/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/q2cli/commands.py", line 92, in list_commands
plugins = sorted(self._plugin_lookup)
File "/home/dunfield/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/q2cli/commands.py", line 76, in _plugin_lookup
import q2cli.core.cache
File "/home/dunfield/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/q2cli/core/cache.py", line 404, in <module>
CACHE = DeploymentCache()
File "/home/dunfield/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/q2cli/core/cache.py", line 61, in __init__
self._state = self._get_cached_state(refresh=refresh)
File "/home/dunfield/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/q2cli/core/cache.py", line 107, in _get_cached_state
self._cache_current_state(current_requirements)
File "/home/dunfield/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/q2cli/core/cache.py", line 200, in _cache_current_state
state = self._get_current_state()
File "/home/dunfield/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/q2cli/core/cache.py", line 238, in _get_current_state
plugin_manager = qiime2.sdk.PluginManager()
File "/home/dunfield/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/qiime2/sdk/plugin_manager.py", line 54, in __new__
self._init(add_plugins=add_plugins)
File "/home/dunfield/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/qiime2/sdk/plugin_manager.py", line 81, in _init
plugin = entry_point.load()
File "/home/dunfield/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/pkg_resources/__init__.py", line 2445, in load
return self.resolve()
File "/home/dunfield/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/pkg_resources/__init__.py", line 2451, in resolve
module = __import__(self.module_name, fromlist=['__name__'], level=0)
File "/home/dunfield/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/rescript/plugin_setup.py", line 24, in <module>
from .get_data import get_silva_data
File "/home/dunfield/anaconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/rescript/get_data.py", line 19, in <module>
from q2_types.feature_data import RNAFASTAFormat
ImportError: cannot import name 'RNAFASTAFormat'
Does anyone know how I can stop this error and get qiime2 working again with RESCRIPt installed? Any help with this is greatly appreciated.
UPDATE: I removed RESCRIPt and then installed qiime2-2021.4. At that point, I was not getting the error after running commands. Next, I re-installed RESCRIPt using
conda activate qiime2-2021.4
conda install -c conda-forge -c bioconda -c qiime2 -c defaults xmltodict
pip install git+https://github.com/bokulich-lab/RESCRIPt.git
At this point, I am no longer getting the error. It seems I just needed to update to the latest version of QIIME2.
Hi,
I am Shakti Kumar from India, working on metagenomics data analysis. I would like to use the NCBI 16S database (https://ftp.ncbi.nlm.nih.gov/blast/db/16S_ribosomal_RNA.tar.gz) to build a sequence reference database. Please help me use RESCRIPt for this.
The parallel of get_silva_data, but downloading, importing, and formatting from GTDB.
We should add license information to the main README and method description, similar to what we've done with the SILVA license.
GTDB is pretty much already Q2-compatible, but this pipeline would streamline things for users, allow us to easily pipe these artifacts into classifier training workflows, and bake GTDB into provenance to ensure that users cite appropriately...
Hello lovely RESCRIPt team!
Because I'm trying to be cutting edge (and I have a plugin I want to update that relies on RESCRIPt), I was trying to update to qiime2-2021.2.
When I go to refresh the cache, I get the following error while setting up the plugin, I think related to a mismatch between RESCRIPt's feature classifier and the core q2-feature-classifier functionality.
Traceback (most recent call last):
File "/Users/jusdeb/miniconda3/envs/test-env/bin/qiime", line 11, in <module>
sys.exit(qiime())
File "/Users/jusdeb/miniconda3/envs/test-env/lib/python3.6/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/Users/jusdeb/miniconda3/envs/test-env/lib/python3.6/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/Users/jusdeb/miniconda3/envs/test-env/lib/python3.6/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/Users/jusdeb/miniconda3/envs/test-env/lib/python3.6/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/Users/jusdeb/miniconda3/envs/test-env/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/Users/jusdeb/miniconda3/envs/test-env/lib/python3.6/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/Users/jusdeb/miniconda3/envs/test-env/lib/python3.6/site-packages/q2cli/builtin/dev.py", line 31, in refresh_cache
import q2cli.core.cache
File "/Users/jusdeb/miniconda3/envs/test-env/lib/python3.6/site-packages/q2cli/core/cache.py", line 406, in <module>
CACHE = DeploymentCache()
File "/Users/jusdeb/miniconda3/envs/test-env/lib/python3.6/site-packages/q2cli/core/cache.py", line 61, in __init__
self._state = self._get_cached_state(refresh=refresh)
File "/Users/jusdeb/miniconda3/envs/test-env/lib/python3.6/site-packages/q2cli/core/cache.py", line 107, in _get_cached_state
self._cache_current_state(current_requirements)
File "/Users/jusdeb/miniconda3/envs/test-env/lib/python3.6/site-packages/q2cli/core/cache.py", line 200, in _cache_current_state
state = self._get_current_state()
File "/Users/jusdeb/miniconda3/envs/test-env/lib/python3.6/site-packages/q2cli/core/cache.py", line 238, in _get_current_state
plugin_manager = qiime2.sdk.PluginManager()
File "/Users/jusdeb/miniconda3/envs/test-env/lib/python3.6/site-packages/qiime2/sdk/plugin_manager.py", line 54, in __new__
self._init(add_plugins=add_plugins)
File "/Users/jusdeb/miniconda3/envs/test-env/lib/python3.6/site-packages/qiime2/sdk/plugin_manager.py", line 81, in _init
plugin = entry_point.load()
File "/Users/jusdeb/miniconda3/envs/test-env/lib/python3.6/site-packages/pkg_resources/__init__.py", line 2472, in load
return self.resolve()
File "/Users/jusdeb/miniconda3/envs/test-env/lib/python3.6/site-packages/pkg_resources/__init__.py", line 2478, in resolve
module = __import__(self.module_name, fromlist=['__name__'], level=0)
File "/Users/jusdeb/miniconda3/envs/test-env/lib/python3.6/site-packages/rescript/plugin_setup.py", line 144, in <module>
citations=[citations['bokulich2018optimizing']]
File "/Users/jusdeb/miniconda3/envs/test-env/lib/python3.6/site-packages/qiime2/plugin/plugin.py", line 301, in register_function
deprecated, examples)
File "/Users/jusdeb/miniconda3/envs/test-env/lib/python3.6/site-packages/qiime2/sdk/action.py", line 526, in _init
output_descriptions)
File "/Users/jusdeb/miniconda3/envs/test-env/lib/python3.6/site-packages/qiime2/core/type/signature.py", line 118, in __init__
self._assert_valid_parameters(parameters)
File "/Users/jusdeb/miniconda3/envs/test-env/lib/python3.6/site-packages/qiime2/core/type/signature.py", line 270, in _assert_valid_parameters
% (param_name, spec.qiime_type))
TypeError: Default value for parameter 'reads_per_batch' is not of semantic QIIME type Int % Range(1, None) | Str % Choices('auto') or `None`.
inputs:
steps:
- parse_silva_taxonomy (question: any way to make auto-download optional?)
- screen_seqs
- filter_seqs_length_by_taxon
- dereplicate (question: use one or more modes for taxonomy derep?)
- evaluate_taxonomy and cross_validate (with kfold disabled)
outputs:
Should we allow an option to keep the raw imported SILVA files we download?
My thinking is that if users would like to construct classifiers with and without species labels they will not have to go through the download option twice.
Or... given that this is a nice convenience method, we could simply return both taxonomy files, i.e., with and without species labels, and then there'd be no need for an option to write out the raw files. This would make constructing the SILVA dbs much easier for each QIIME release.
I suppose we can do it all.. but that might be confusing.
these all use skbio to iterate over sequences; splitting into chunks should enable processing in parallel.
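The chunking idea above could be sketched as follows. `degap_chunk` is a hypothetical stand-in for whatever per-sequence work an action does; in RESCRIPt the sequences would come from an skbio iterator rather than a plain list.

```python
# Hedged sketch of chunked, parallel sequence processing. `degap_chunk` is a
# hypothetical stand-in for the real per-sequence work; with skbio, `seqs`
# would be a lazy iterator of skbio.DNA records.
import itertools
import multiprocessing

def chunked(iterable, size):
    """Yield successive lists of up to `size` items from any iterator."""
    it = iter(iterable)
    while True:
        chunk = list(itertools.islice(it, size))
        if not chunk:
            return
        yield chunk

def degap_chunk(seqs):
    # placeholder work: strip alignment gaps from each sequence string
    return [s.replace("-", "") for s in seqs]

def process_parallel(seqs, chunk_size=1000, jobs=4):
    """Map `degap_chunk` over chunks in a worker pool, then flatten."""
    with multiprocessing.Pool(jobs) as pool:
        results = pool.map(degap_chunk, chunked(seqs, chunk_size))
    return [s for chunk in results for s in chunk]
```

Because `chunked` works on any iterator, the input file never needs to be fully materialized before dispatching to workers.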
As a RESCRIPt user,
I want the plugin to support python 3.8 (and so be compatible with the latest QIIME 2 release)
so that I can benefit from the newest QIIME 2 features.
Hi
I have tried all versions of the SILVA 138 V3-V4 classifiers, generated using the three different dereplication modes ("uniq", "majority", and "super") in RESCRIPt, to classify 16S sequences from the ZymoResearch mock bacterial communities: https://files.zymoresearch.com/datasheets/ds1706_zymobiomics_microbial_community_standards_data_sheet.pdf
However, these classifiers are not assigning "Salmonella" at the genus level, which is expected to be in the community. When I BLASTed the Not_assigned ASVs against the NCBI nucleotide database, I found that the majority of them have hits to the Salmonella genus.
In contrast, when I classify the same sequences with a SILVA 132 V3-V4 classifier, it gives me "Salmonella" at the genus level.
I have also classified sequences from other studies, and in every case I am missing Salmonella.
The overall classification with the SILVA 138 classifier is better than with SILVA 132, but missing Salmonella is a bit weird to me.
Can you please give me a clue regarding this issue?
Thanks in advance,
Ashutosh
If we are interested in downstream analyses that require in-depth taxonomic curation at various ranks, then we should allow the option to not propagate taxonomy. This would match the functionality currently in get-ncbi-data.
Need to make it clear that rank propagation will propagate taxonomy ranks for all available taxonomy, using forward fill prior to subsetting the ranks requested by the user. That is:

Accession | Taxonomy-Ranks
---|---
All Available Ranks | d__Eukaryota sk__Holozoa k__Animalia ks__ sp__Lophotrochozoa p__Rotifera ps__ pi__ sc__ c__Monogononta cs__ ci__ so__ o__Ploimida os__ sf__ f__ fs__ g__Brachionus
Propagated Ranks | d__Eukaryota sk__Holozoa k__Animalia ks__Animalia sp__Lophotrochozoa p__Rotifera ps__Rotifera pi__Rotifera sc__Rotifera c__Monogononta cs__Monogononta ci__Monogononta so__Monogononta o__Ploimida os__Ploimida sf__Ploimida f__Ploimida fs__Ploimida g__Brachionus
So if a user wanted infraclass (ci) without rank propagation, they'd get:
ci__
With rank propagation, they'd get:
ci__Monogononta
There is probably a better example we can use...
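The forward fill described above can be illustrated with a small helper (`propagate` is a hypothetical name, not RESCRIPt's internal function):

```python
# Hedged sketch of rank propagation via forward fill: each empty rank inherits
# the most recent non-empty label above it, before the user's requested ranks
# are subset. `propagate` is a hypothetical helper, not RESCRIPt's internals.
def propagate(ranks):
    """ranks: list of 'prefix__label' strings, ordered from highest rank."""
    filled, last = [], ""
    for rank in ranks:
        prefix, _, label = rank.partition("__")
        if label:
            last = label
        filled.append(prefix + "__" + last)
    return filled

propagate(["d__Eukaryota", "k__Animalia", "ks__", "p__Rotifera", "ps__"])
# -> ['d__Eukaryota', 'k__Animalia', 'ks__Animalia', 'p__Rotifera', 'ps__Rotifera']
```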
The installation docs etc. are a bit out of date, e.g., regarding minimum versions.
For example, see this forum issue.
dereplicate._dereplicate_taxa uses pd.Series as the view type for FASTA files. This is convenient for small files but may pose problems with very large FASTA files. DNAIterator or DNAFASTAFormat may be a better way to iterate across sequences.
Other functions might also use pd.Series for handling FASTA files.
Need to test a bit further to see if this shaves off runtime/memory use.
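For comparison, the streaming behavior that DNAIterator would provide can be sketched with a plain generator (stdlib only here; in the plugin this would go through skbio / q2-types rather than hand-rolled parsing):

```python
# Hedged stdlib sketch of the streaming idea: yield one record at a time so
# memory use stays bounded by the largest record, not the whole file. In the
# plugin this role would be played by DNAIterator / skbio.io.read.
def iter_fasta(path):
    """Yield (header, sequence) pairs lazily from a FASTA file."""
    header, seq = None, []
    with open(path) as fh:
        for line in fh:
            line = line.rstrip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(seq)
                header, seq = line[1:], []
            else:
                seq.append(line)
    if header is not None:
        yield header, "".join(seq)
```

A pd.Series view, by contrast, loads every sequence into memory up front, which is what hurts on very large reference databases.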
I would really like the ability to reverse complement sequences in QIIME 2 without re-orienting against a reference. Is this appropriate for RESCRIPt (if so, I'm happy to issue the PR), or is it better for another more central plugin?
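The core operation is simple; a minimal sketch follows (in a QIIME 2 action this would likely call skbio.DNA.reverse_complement() rather than hand-rolled translation):

```python
# Hedged sketch of reverse complementation without reference-based
# re-orientation; skbio.DNA.reverse_complement() would do this in a plugin.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Complement each base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]

reverse_complement("ACCG")  # -> 'CGGT'
```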
This would save users the trouble of navigating the massive number of SILVA files to find the correct inputs to parse_silva_taxonomy.
inputs:
output:
Some ideas and features (from conversation with @mikerobeson):
The workflow would be:
I got this error using rescript evaluate-fit-classifier. I am using qiime 2022.2 and scikit-learn 0.24.1 on a Linux server. I am attempting to create a reference database of the Sylvilagus bachmani genome so I can BLAST sequences against it to determine whether they are from S. bachmani or another organism. I can recreate this error by doing this:
wget https://data.qiime2.org/distro/core/qiime2-2022.2-py38-linux-conda.yml
conda env create -n qiime2-2022.2 --file qiime2-2022.2-py38-linux-conda.yml
conda activate qiime2-2022.2
pip install git+https://github.com/bokulich-lab/RESCRIPt.git
qiime rescript get-ncbi-data --p-query '(512907[BioProject]) AND "Sylvilagus bachmani"[porgn:__txid365149] ' \
--o-sequences bachmani-refseqs-unfiltered.qza \
--o-taxonomy bachmani-refseqs-taxonomy-unfiltered.qza \
--p-n-jobs 5
qiime rescript evaluate-fit-classifier --i-sequences bachmani-refseqs-unfiltered.qza \
--i-taxonomy bachmani-refseqs-taxonomy-unfiltered.qza \
--o-classifier bachmani-classifier.qza \
--o-observed-taxonomy bachmani-classifier-predicted-taxonomy.qza \
--verbose
Error info:
Validation: 134.04s
/home/tangled/miniconda3/envs/qiime/lib/python3.8/site-packages/q2_feature_classifier/classifier.py:102: UserWarning: The TaxonomicClassifier artifact that results from this method was trained using scikit-learn version 0.24.1. It cannot be used with other versions of scikit-learn. (While the classifier may complete successfully, the results will be unreliable.)
warnings.warn(warning, UserWarning)
Training: 2663.73s
Traceback (most recent call last):
File "/home/tangled/miniconda3/envs/qiime/lib/python3.8/site-packages/q2cli/commands.py", line 339, in __call__
results = action(**arguments)
File "<decorator-gen-457>", line 2, in evaluate_fit_classifier
File "/home/tangled/miniconda3/envs/qiime/lib/python3.8/site-packages/qiime2/sdk/action.py", line 245, in bound_callable
outputs = self._callable_executor_(scope, callable_args,
File "/home/tangled/miniconda3/envs/qiime/lib/python3.8/site-packages/qiime2/sdk/action.py", line 485, in _callable_executor_
outputs = self._callable(scope.ctx, **view_args)
File "/home/tangled/miniconda3/envs/qiime/lib/python3.8/site-packages/rescript/cross_validate.py", line 48, in evaluate_fit_classifier
observed_taxonomy, = classify(reads=sequences,
File "<decorator-gen-513>", line 2, in classify_sklearn
File "/home/tangled/miniconda3/envs/qiime/lib/python3.8/site-packages/qiime2/sdk/action.py", line 245, in bound_callable
outputs = self._callable_executor_(scope, callable_args,
File "/home/tangled/miniconda3/envs/qiime/lib/python3.8/site-packages/qiime2/sdk/action.py", line 391, in _callable_executor_
output_views = self._callable(**view_args)
File "/home/tangled/miniconda3/envs/qiime/lib/python3.8/site-packages/q2_feature_classifier/classifier.py", line 220, in classify_sklearn
seq_ids, taxonomy, confidence = list(zip(*predictions))
File "/home/tangled/miniconda3/envs/qiime/lib/python3.8/site-packages/q2_feature_classifier/_skl.py", line 46, in predict
for calculated in workers(jobs):
File "/home/tangled/miniconda3/envs/qiime/lib/python3.8/site-packages/joblib/parallel.py", line 1043, in __call__
if self.dispatch_one_batch(iterator):
File "/home/tangled/miniconda3/envs/qiime/lib/python3.8/site-packages/joblib/parallel.py", line 861, in dispatch_one_batch
self._dispatch(tasks)
File "/home/tangled/miniconda3/envs/qiime/lib/python3.8/site-packages/joblib/parallel.py", line 779, in _dispatch
job = self._backend.apply_async(batch, callback=cb)
File "/home/tangled/miniconda3/envs/qiime/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 208, in apply_async
result = ImmediateResult(func)
File "/home/tangled/miniconda3/envs/qiime/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 572, in __init__
self.results = batch()
File "/home/tangled/miniconda3/envs/qiime/lib/python3.8/site-packages/joblib/parallel.py", line 262, in __call__
return [func(*args, **kwargs)
File "/home/tangled/miniconda3/envs/qiime/lib/python3.8/site-packages/joblib/parallel.py", line 262, in <listcomp>
return [func(*args, **kwargs)
File "/home/tangled/miniconda3/envs/qiime/lib/python3.8/site-packages/q2_feature_classifier/_skl.py", line 54, in _predict_chunk
return _predict_chunk_with_conf(pipeline, separator, confidence, chunk)
File "/home/tangled/miniconda3/envs/qiime/lib/python3.8/site-packages/q2_feature_classifier/_skl.py", line 70, in _predict_chunk_with_conf
raise ValueError('this classifier does not support confidence values')
ValueError: this classifier does not support confidence values
Plugin error from rescript:
this classifier does not support confidence values
See above for debug info
Any idea why this error was created and what I'm doing wrong here? Is this a problem with the new 2022.2 qiime version?
Hi,
I cannot clone RESCRIPt.git via pip. Every time I try, I get the following error: ** ERROR: Command errored out with exit status 128: git clone -q https://github.com/bokulich-lab/RESCRIPt.git /tmp/pip-req-build-ryakchyf Check the logs for full command output **. I tried running pip directly in the qiime2 env as well as in a freshly created rescript env.
Many of the files we parse from SILVA to generate QIIME-compatible taxonomy classifiers are provided as gzipped files here. It appears that RESCRIPt is unable to process these files unless they are decompressed prior to import into QIIME.
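Until native support lands, one workaround is to sniff the gzip magic bytes and decompress transparently before import. `open_maybe_gzip` is a hypothetical helper, not part of RESCRIPt:

```python
# Hedged workaround sketch: open a file as text whether or not it is gzipped,
# by checking the two-byte gzip magic number. `open_maybe_gzip` is a
# hypothetical helper, not RESCRIPt functionality.
import gzip

def open_maybe_gzip(path):
    """Return a text-mode file handle, decompressing gzip transparently."""
    with open(path, "rb") as fh:
        magic = fh.read(2)
    if magic == b"\x1f\x8b":  # gzip magic number
        return gzip.open(path, "rt")
    return open(path, "rt")
```

Sniffing the magic bytes is more robust than checking for a .gz suffix, since mirrored files are sometimes renamed.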
To successfully build a conda package, the requests dependency should be added to the requirements section of the conda recipe: https://github.com/bokulich-lab/RESCRIPt/blob/master/ci/recipe/meta.yaml.
It looks like the latest release tag has a typo: the tag is 2020.11, but in order to be PEP440/versioneer-compliant and compatible with the QIIME 2 release train, it should actually be 2020.11.0. In theory we can just delete the broken tag and replace it with a new one, but I am not sure what kinds of URLs may or may not break because of that (and whether it even matters).
Testing a large COI dataset with qiime rescript evaluate-fit-classifier on an AWS instance, and the job keeps crashing. Not sure why it is failing on this command:
qiime rescript evaluate-fit-classifier \
--i-sequences "$HOME"/input/boldCOI.derep.seqs.qza \
--i-taxonomy "$HOME"/input/boldCOI.derep.tax.qza \
--p-reads-per-batch 5000 --p-n-jobs 4 \
--o-classifier tidybug_nBayes_classifier_seqs.qza \
--o-observed-taxonomy tidybug_nBayes_classifier_tax.qza \
--o-evaluation tidybug_nBayes_classifier_viz.qzv --verbose
The information on the screen just says:
Killed
and the resulting .log file doesn't contain information about the stopped job:
Validation: 92.70s
/home/ubuntu/miniconda3/envs/rescript/lib/python3.6/site-packages/q2_feature_classifier/classifier.py:102: UserWarning: The TaxonomicClassifier artifact that results from this method was trained using scikit-learn version 0.21.2. It cannot be used with other versions of scikit-learn. (While the classifier may complete successfully, the results will be unreliable.)
warnings.warn(warning, UserWarning)
I don't think it's a memory issue as this job is running on an AWS EC2 instance with over 250GB of RAM. It's possible though, as the job that's crashing was attempted while another job was running on the same machine. That other job was another RESCRIPt command:
qiime rescript evaluate-cross-validate \
--i-sequences "$HOME"/input/boldCOI.derep.seqs.qza \
--i-taxonomy "$HOME"/input/boldCOI.derep.tax.qza \
--p-reads-per-batch 5000 --p-n-jobs 4 \
--o-expected-taxonomy tidybug_crossval_expected_taxonomy.qza \
--o-observed-taxonomy tidybug_crossval_observed_taxonomy.qza
And that job is doing fine. No crashing (though it has been running for more than 48 hours!).
Any ideas how to track where the issues might be originating?
As a plugin user, I would like to use RESCRIPt to download and format fungal ITS sequence and taxonomy data from the UNITE database.
The proposed get_unite_data would output FeatureData[Sequence] and FeatureData[Taxonomy] artifacts.
The PlutoF API can be used to download data directly from UNITE: https://plutof.docs.apiary.io/#
Questions:
I'm not sure how much this is needed, but it is a complement to #123 (ENH: add action for get_unite_data).
Greengenes is distributed from ftp://greengenes.microbio.me/greengenes_release/ with a predictable naming scheme.
Downstream, this would replace https://github.com/caporaso-lab/pretrained-feature-classifiers/blob/master/get_gg.sh#L14-L15
for optional OTU clustering
evaluate-cross-validate seems to have a persistent issue. I have tried to run it twice now with the same result. Here are the commands:
$ qiime rescript get-ncbi-data --p-query '(cytochrome c oxidase subunit 1[Title] OR cytochrome c oxidase subunit I[Title] OR cytochrome oxidase subunit 1[Title] OR cytochrome oxidase subunit I[Title] OR COX1[Title] OR CO1[Title] OR COI[Title]) AND "BARCODE"[KYWD]' --o-sequences seq.qza --o-taxonomy tax.qza --verbose --p-n-jobs 5
$ qiime rescript evaluate-cross-validate --i-sequences seq.qza --i-taxonomy tax.qza --o-expected-taxonomy expected.qza --o-observed-taxonomy observed.qza --o-evaluation evaluation.qzv --p-n-jobs 12
I'm running it under qiime2-2020.8. I get:
Plugin error from rescript:
A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.
The exit codes of the workers are {SIGKILL(-9)}
Debug info has been saved to /tmp/qiime2-q2cli-err-4z14w9x3.log
Please find the log file attached.
Given the recent updates to q2-types (qiime2/q2-types#256) to handle RNA data, and the current required functionality for RESCRIPt (e.g. DNA-to-RNA conversions), we should migrate some of these transformers and related content to q2-types
. Slate this for the next QIIME 2 (2021.6??) and RESCRIPt releases.
As a RESCRIPt user,
I want RESCRIPt to support NCBI's datasets/dataformat command-line tools
so that I can easily download annotated genomes and their metadata by taxon/taxon ID.
See here: https://www.ncbi.nlm.nih.gov/datasets/docs/command-line-start/
And here some usage examples for viral genomes: https://www.ncbi.nlm.nih.gov/datasets/docs/command-line-virus/
The LCA methods will leave blank taxonomies if there is no LCA (e.g., because one is "Unclassified").
Should instead fill in a placeholder label (e.g., 'Unassigned'), as blank labels can cause some downstream methods to break. This placeholder label string should be exposed as a parameter so that users can specify placeholder labels that match their pre-existing taxonomies.
Reported on the Q2 forum.
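The proposed behavior can be sketched as follows. `lca` is a hypothetical helper illustrating the placeholder parameter, not RESCRIPt's actual LCA implementation:

```python
# Hedged sketch of the proposal: compute the LCA of ';'-delimited taxonomy
# strings and return a caller-supplied placeholder when no LCA exists.
# `lca` is a hypothetical helper, not RESCRIPt's implementation.
def lca(taxonomies, placeholder="Unassigned"):
    split = [t.split(";") for t in taxonomies]
    common = []
    for levels in zip(*split):
        if len(set(levels)) == 1:  # all labels agree at this rank
            common.append(levels[0])
        else:
            break
    return ";".join(common) if common else placeholder

lca(["Bacteria;Genus;SpeciesA", "Bacteria;Genus;SpeciesB"])  # -> 'Bacteria;Genus'
lca(["Unclassified", "Bacteria;Proteobacteria"])  # -> 'Unassigned'
```

Exposing `placeholder` as a parameter lets users pick a label consistent with their pre-existing taxonomies.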
cross-validate throws various warnings that can be ignored:
Should silence these warnings so they don't bother anybody.
@mikerobeson reported that this function is not filling empty taxonomic ranks following consensus taxonomic assignment. We should add an option to toggle rank filling on/off.
E.g., two identical sequences with the following taxonomy labels:
Bacteria;Genus;SpeciesA
Bacteria;Genus;SpeciesB
Would be collapsed to:
Bacteria;Genus
We should allow the option to collapse instead to:
Bacteria;Genus;__
Filling the ambiguous rank.
Question: how should we fill the ambiguous ranks? Should we allow input of a user-defined filler? What about defining ambiguous ranks at multiple levels (e.g., the greengenes convention of g__, s__, etc.)? Or do we just default to "__"?
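One way the fill option could look (`pad_ranks` is a hypothetical helper sketching the user-defined filler and per-rank prefixes discussed above):

```python
# Hedged sketch of the fill options discussed above: pad a collapsed taxonomy
# back to a fixed depth with a generic filler, or with per-rank prefixes in
# the greengenes style. `pad_ranks` is a hypothetical helper.
def pad_ranks(taxon, depth, filler="__", prefixes=None):
    """Extend a ';'-joined taxonomy to `depth` levels with filler labels."""
    levels = taxon.split(";")
    for i in range(len(levels), depth):
        levels.append(prefixes[i] if prefixes else filler)
    return ";".join(levels)

pad_ranks("Bacteria;Genus", 3)  # -> 'Bacteria;Genus;__'
pad_ranks("Bacteria;Genus", 3, prefixes=["k__", "g__", "s__"])  # -> 'Bacteria;Genus;s__'
```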
Logs:
https://github.com/bokulich-lab/RESCRIPt/runs/1127331123
Relevant subsection:
2020-09-17T07:55:26.0796330Z [command]/bin/bash /tmp/2020817-2640-717k0p.u5tdk.sh
2020-09-17T07:55:28.7487757Z ============================= test session starts ==============================
2020-09-17T07:55:28.7489124Z platform linux -- Python 3.6.11, pytest-6.0.2, py-1.9.0, pluggy-0.13.1
2020-09-17T07:55:28.7489901Z rootdir: /home/runner/work/RESCRIPt/RESCRIPt
2020-09-17T07:55:28.7490429Z collected 143 items
2020-09-17T07:55:28.7490917Z
2020-09-17T07:55:30.7171971Z tests/test_cross_validate.py .............. [ 9%]
2020-09-17T07:55:30.7323262Z tests/test_degap.py .. [ 11%]
2020-09-17T07:55:32.7420963Z tests/test_derep.py ............. [ 20%]
2020-09-17T07:55:34.1696564Z tests/test_evaluate.py ............. [ 29%]
2020-09-17T07:55:35.1489782Z tests/test_filter_length.py ............................. [ 49%]
2020-09-17T07:56:11.6230159Z tests/test_get_data.py ... [ 51%]
2020-09-17T07:56:12.9456416Z tests/test_merge.py ............. [ 60%]
2020-09-17T07:56:49.8844414Z tests/test_ncbi.py .....F. [ 65%]
2020-09-17T07:56:50.1811833Z tests/test_orient.py ... [ 67%]
2020-09-17T07:56:51.4930550Z tests/test_parse_silva_taxonomy.py .............. [ 77%]
2020-09-17T07:56:51.5290260Z tests/test_screenseq.py ..... [ 81%]
2020-09-17T07:56:51.5817501Z types/tests/test_methods.py . [ 81%]
2020-09-17T07:56:51.6674875Z types/tests/test_types_formats_transformers.py ......................... [ 99%]
2020-09-17T07:56:51.6843012Z . [100%]
2020-09-17T07:56:51.6843455Z
2020-09-17T07:56:51.6843863Z =================================== FAILURES ===================================
2020-09-17T07:56:51.6844375Z ________________ TestNCBI.test_get_ncbi_data_query_mushroom_two ________________
2020-09-17T07:56:51.6844784Z
2020-09-17T07:56:51.6845375Z self = <rescript.tests.test_ncbi.TestNCBI testMethod=test_get_ncbi_data_query_mushroom_two>
2020-09-17T07:56:51.6845931Z
2020-09-17T07:56:51.6846357Z def test_get_ncbi_data_query_mushroom_two(self):
2020-09-17T07:56:51.6846844Z seq, tax = self.get_ncbi_data(
2020-09-17T07:56:51.6847986Z query='MT345279.1',
2020-09-17T07:56:51.6848792Z ranks=['domain', 'phylum', 'subphylum', 'superfamily', 'family',
2020-09-17T07:56:51.6852430Z > 'subfamily', 'genus', 'species', 'subspecies']
2020-09-17T07:56:51.6852704Z )
2020-09-17T07:56:51.6852817Z
2020-09-17T07:56:51.6853338Z ../../../miniconda/envs/testing/lib/python3.6/site-packages/rescript/tests/test_ncbi.py:105:
2020-09-17T07:56:51.6853727Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2020-09-17T07:56:51.6854120Z <decorator-gen-179>:2: in get_ncbi_data
2020-09-17T07:56:51.6854379Z ???
2020-09-17T07:56:51.6854926Z ../../../miniconda/envs/testing/lib/python3.6/site-packages/qiime2/sdk/action.py:245: in bound_callable
2020-09-17T07:56:51.6855406Z output_types, provenance)
2020-09-17T07:56:51.6856050Z ../../../miniconda/envs/testing/lib/python3.6/site-packages/qiime2/sdk/action.py:390: in _callable_executor_
2020-09-17T07:56:51.6856557Z output_views = self._callable(**view_args)
2020-09-17T07:56:51.6857168Z ../../../miniconda/envs/testing/lib/python3.6/site-packages/rescript/ncbi.py:82: in get_ncbi_data
2020-09-17T07:56:51.6857665Z taxids, ranks, rank_propagation, entrez_delay)
2020-09-17T07:56:51.6858318Z ../../../miniconda/envs/testing/lib/python3.6/site-packages/rescript/ncbi.py:198: in get_taxonomies
2020-09-17T07:56:51.6858795Z records = _get(params, ids, entrez_delay)
2020-09-17T07:56:51.6859373Z ../../../miniconda/envs/testing/lib/python3.6/site-packages/rescript/ncbi.py:132: in _get
2020-09-17T07:56:51.6859775Z content = parse(r.content)
2020-09-17T07:56:51.6860013Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2020-09-17T07:56:51.6860159Z
2020-09-17T07:56:51.6860617Z xml_input = b'{"error":"API rate limit exceeded","api-key":"40.79.21.212","count":"4","limit":"3"}\n'
2020-09-17T07:56:51.6860959Z encoding = None
2020-09-17T07:56:51.6861596Z expat = <module 'xml.parsers.expat' from '/home/runner/miniconda/envs/testing/lib/python3.6/xml/parsers/expat.py'>
2020-09-17T07:56:51.6862233Z process_namespaces = False, namespace_separator = None, disable_entities = True
2020-09-17T07:56:51.6862777Z kwargs = {}, handler = <xmltodict._DictSAXHandler object at 0x7efeb3173f28>
2020-09-17T07:56:51.6863276Z parser = <pyexpat.xmlparser object at 0x7efeb3148b88>
2020-09-17T07:56:51.6863950Z feature = 'http://apache.org/xml/features/disallow-doctype-decl'
2020-09-17T07:56:51.6864296Z
2020-09-17T07:56:51.6864631Z def parse(xml_input, encoding=None, expat=expat, process_namespaces=False,
2020-09-17T07:56:51.6865240Z namespace_separator=':', disable_entities=True, **kwargs):
2020-09-17T07:56:51.6865673Z """Parse the given XML input and convert it into a dictionary.
2020-09-17T07:56:51.6866133Z
2020-09-17T07:56:51.6866570Z `xml_input` can either be a `string` or a file-like object.
2020-09-17T07:56:51.6866846Z
2020-09-17T07:56:51.6867242Z If `xml_attribs` is `True`, element attributes are put in the dictionary
2020-09-17T07:56:51.6867926Z among regular child elements, using `@` as a prefix to avoid collisions. If
2020-09-17T07:56:51.6868352Z set to `False`, they are just ignored.
2020-09-17T07:56:51.6868575Z
2020-09-17T07:56:51.6868800Z Simple example::
2020-09-17T07:56:51.6869006Z
2020-09-17T07:56:51.6869218Z >>> import xmltodict
2020-09-17T07:56:51.6869509Z >>> doc = xmltodict.parse(\"\"\"
2020-09-17T07:56:51.6869771Z ... <a prop="x">
2020-09-17T07:56:51.6869963Z ... <b>1</b>
2020-09-17T07:56:51.6870144Z ... <b>2</b>
2020-09-17T07:56:51.6870327Z ... </a>
2020-09-17T07:56:51.6870500Z ... \"\"\")
2020-09-17T07:56:51.6870996Z >>> doc['a']['@prop']
2020-09-17T07:56:51.6871283Z u'x'
2020-09-17T07:56:51.6871559Z >>> doc['a']['b']
2020-09-17T07:56:51.6871818Z [u'1', u'2']
2020-09-17T07:56:51.6871988Z
2020-09-17T07:56:51.6872287Z If `item_depth` is `0`, the function returns a dictionary for the root
2020-09-17T07:56:51.6872759Z element (default behavior). Otherwise, it calls `item_callback` every time
2020-09-17T07:56:51.6873223Z an item at the specified depth is found and returns `None` in the end
2020-09-17T07:56:51.6873557Z (streaming mode).
2020-09-17T07:56:51.6873757Z
2020-09-17T07:56:51.6874091Z The callback function receives two parameters: the `path` from the document
2020-09-17T07:56:51.6874698Z root to the item (name-attribs pairs), and the `item` (dict). If the
2020-09-17T07:56:51.6875447Z callback's return value is false-ish, parsing will be stopped with the
2020-09-17T07:56:51.6875913Z :class:`ParsingInterrupted` exception.
2020-09-17T07:56:51.6876212Z
2020-09-17T07:56:51.6876427Z Streaming example::
2020-09-17T07:56:51.6876636Z
2020-09-17T07:56:51.6876842Z >>> def handle(path, item):
2020-09-17T07:56:51.6877260Z ... print('path:%s item:%s' % (path, item))
2020-09-17T07:56:51.6877512Z ... return True
2020-09-17T07:56:51.6877699Z ...
2020-09-17T07:56:51.6877926Z >>> xmltodict.parse(\"\"\"
2020-09-17T07:56:51.6878167Z ... <a prop="x">
2020-09-17T07:56:51.6878351Z ... <b>1</b>
2020-09-17T07:56:51.6878525Z ... <b>2</b>
2020-09-17T07:56:51.6878779Z ... </a>\"\"\", item_depth=2, item_callback=handle)
2020-09-17T07:56:51.6879217Z path:[(u'a', {u'prop': u'x'}), (u'b', None)] item:1
2020-09-17T07:56:51.6879618Z path:[(u'a', {u'prop': u'x'}), (u'b', None)] item:2
2020-09-17T07:56:51.6879835Z
2020-09-17T07:56:51.6880162Z The optional argument `postprocessor` is a function that takes `path`,
2020-09-17T07:56:51.6880649Z `key` and `value` as positional arguments and returns a new `(key, value)`
2020-09-17T07:56:51.6881108Z pair where both `key` and `value` may have changed. Usage example::
2020-09-17T07:56:51.6881398Z
2020-09-17T07:56:51.6881649Z >>> def postprocessor(path, key, value):
2020-09-17T07:56:51.6881895Z ... try:
2020-09-17T07:56:51.6882250Z ... return key + ':int', int(value)
2020-09-17T07:56:51.6882558Z ... except (ValueError, TypeError):
2020-09-17T07:56:51.6882847Z ... return key, value
2020-09-17T07:56:51.6883455Z >>> xmltodict.parse('<a><b>1</b><b>2</b><b>x</b></a>',
2020-09-17T07:56:51.6883836Z ... postprocessor=postprocessor)
2020-09-17T07:56:51.6884387Z OrderedDict([(u'a', OrderedDict([(u'b:int', [1, 2]), (u'b', u'x')]))])
2020-09-17T07:56:51.6884676Z
2020-09-17T07:56:51.6885010Z You can pass an alternate version of `expat` (such as `defusedexpat`) by
2020-09-17T07:56:51.6885537Z using the `expat` parameter. E.g:
2020-09-17T07:56:51.6885779Z
2020-09-17T07:56:51.6886059Z >>> import defusedexpat
2020-09-17T07:56:51.6886676Z >>> xmltodict.parse('<a>hello</a>', expat=defusedexpat.pyexpat)
2020-09-17T07:56:51.6887238Z OrderedDict([(u'a', u'hello')])
2020-09-17T07:56:51.6887475Z
2020-09-17T07:56:51.6887788Z You can use the force_list argument to force lists to be created even
2020-09-17T07:56:51.6888238Z when there is only a single child of a given level of hierarchy. The
2020-09-17T07:56:51.6888693Z force_list argument is a tuple of keys. If the key for a given level
2020-09-17T07:56:51.6889158Z of hierarchy is in the force_list argument, that level of hierarchy
2020-09-17T07:56:51.6889769Z will have a list as a child (even if there is only one sub-element).
2020-09-17T07:56:51.6890252Z The index_keys operation takes precendence over this. This is applied
2020-09-17T07:56:51.6890888Z after any user-supplied postprocessor has already run.
2020-09-17T07:56:51.6891217Z
2020-09-17T07:56:51.6891453Z For example, given this input:
2020-09-17T07:56:51.6891712Z <servers>
2020-09-17T07:56:51.6891921Z <server>
2020-09-17T07:56:51.6892145Z <name>host1</name>
2020-09-17T07:56:51.6892375Z <os>Linux</os>
2020-09-17T07:56:51.6892607Z <interfaces>
2020-09-17T07:56:51.6892847Z <interface>
2020-09-17T07:56:51.6893079Z <name>em0</name>
2020-09-17T07:56:51.6893339Z <ip_address>10.0.0.1</ip_address>
2020-09-17T07:56:51.6893582Z </interface>
2020-09-17T07:56:51.6893822Z </interfaces>
2020-09-17T07:56:51.6894047Z </server>
2020-09-17T07:56:51.6894253Z </servers>
2020-09-17T07:56:51.6894444Z
2020-09-17T07:56:51.6894882Z If called with force_list=('interface',), it will produce
2020-09-17T07:56:51.6895222Z this dictionary:
2020-09-17T07:56:51.6895569Z {'servers':
2020-09-17T07:56:51.6895884Z {'server':
2020-09-17T07:56:51.6896210Z {'name': 'host1',
2020-09-17T07:56:51.6896530Z 'os': 'Linux'},
2020-09-17T07:56:51.6896867Z 'interfaces':
2020-09-17T07:56:51.6897215Z {'interface':
2020-09-17T07:56:51.6897599Z [ {'name': 'em0', 'ip_address': '10.0.0.1' } ] } } }
2020-09-17T07:56:51.6897825Z
2020-09-17T07:56:51.6898306Z `force_list` can also be a callable that receives `path`, `key` and
2020-09-17T07:56:51.6898774Z `value`. This is helpful in cases where the logic that decides whether
2020-09-17T07:56:51.6899195Z a list should be forced is more complex.
2020-09-17T07:56:51.6899454Z """
2020-09-17T07:56:51.6899846Z handler = _DictSAXHandler(namespace_separator=namespace_separator,
2020-09-17T07:56:51.6900270Z **kwargs)
2020-09-17T07:56:51.6900562Z if isinstance(xml_input, _unicode):
2020-09-17T07:56:51.6900862Z if not encoding:
2020-09-17T07:56:51.6901416Z encoding = 'utf-8'
2020-09-17T07:56:51.6901730Z xml_input = xml_input.encode(encoding)
2020-09-17T07:56:51.6902067Z if not process_namespaces:
2020-09-17T07:56:51.6902375Z namespace_separator = None
2020-09-17T07:56:51.6902724Z parser = expat.ParserCreate(
2020-09-17T07:56:51.6903031Z encoding,
2020-09-17T07:56:51.6903269Z namespace_separator
2020-09-17T07:56:51.6903494Z )
2020-09-17T07:56:51.6903666Z try:
2020-09-17T07:56:51.6903943Z parser.ordered_attributes = True
2020-09-17T07:56:51.6904292Z except AttributeError:
2020-09-17T07:56:51.6904811Z # Jython's expat does not support ordered_attributes
2020-09-17T07:56:51.6905201Z pass
2020-09-17T07:56:51.6905783Z parser.StartNamespaceDeclHandler = handler.startNamespaceDecl
2020-09-17T07:56:51.6906678Z parser.StartElementHandler = handler.startElement
2020-09-17T07:56:51.6907352Z parser.EndElementHandler = handler.endElement
2020-09-17T07:56:51.6908016Z parser.CharacterDataHandler = handler.characters
2020-09-17T07:56:51.6908516Z parser.buffer_text = True
2020-09-17T07:56:51.6908805Z if disable_entities:
2020-09-17T07:56:51.6909033Z try:
2020-09-17T07:56:51.6909503Z # Attempt to disable DTD in Jython's expat parser (Xerces-J).
2020-09-17T07:56:51.6910204Z feature = "http://apache.org/xml/features/disallow-doctype-decl"
2020-09-17T07:56:51.6910761Z parser._reader.setFeature(feature, True)
2020-09-17T07:56:51.6911150Z except AttributeError:
2020-09-17T07:56:51.6911467Z # For CPython / expat parser.
2020-09-17T07:56:51.6912006Z # Anything not handled ends up here and entities aren't expanded.
2020-09-17T07:56:51.6912484Z parser.DefaultHandler = lambda x: None
2020-09-17T07:56:51.6913134Z # Expects an integer return; zero means failure -> expat.ExpatError.
2020-09-17T07:56:51.6913755Z parser.ExternalEntityRefHandler = lambda *x: 1
2020-09-17T07:56:51.6914349Z if hasattr(xml_input, 'read'):
2020-09-17T07:56:51.6914674Z parser.ParseFile(xml_input)
2020-09-17T07:56:51.6914947Z else:
2020-09-17T07:56:51.6920272Z > parser.Parse(xml_input, True)
2020-09-17T07:56:51.6921390Z E xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 0
2020-09-17T07:56:51.6921830Z
2020-09-17T07:56:51.6922368Z ../../../miniconda/envs/testing/lib/python3.6/site-packages/xmltodict.py:327: ExpatError
2020-09-17T07:56:51.6922837Z =============================== warnings summary ===============================
2020-09-17T07:56:51.6923186Z tests/test_cross_validate.py: 210 warnings
2020-09-17T07:56:51.6923530Z tests/test_degap.py: 24 warnings
2020-09-17T07:56:51.6923841Z tests/test_derep.py: 45 warnings
2020-09-17T07:56:51.6924163Z tests/test_evaluate.py: 32 warnings
2020-09-17T07:56:51.6924509Z tests/test_filter_length.py: 64 warnings
2020-09-17T07:56:51.6924835Z tests/test_ncbi.py: 20 warnings
2020-09-17T07:56:51.6925146Z tests/test_orient.py: 76 warnings
2020-09-17T07:56:51.6925480Z tests/test_screenseq.py: 44 warnings
2020-09-17T07:56:51.6925837Z types/tests/test_methods.py: 64 warnings
2020-09-17T07:56:51.6926261Z types/tests/test_types_formats_transformers.py: 128 warnings
2020-09-17T07:56:51.6927297Z /home/runner/miniconda/envs/testing/lib/python3.6/site-packages/skbio/sequence/_sequence.py:427: DeprecationWarning: tostring() is deprecated. Use tobytes() instead.
2020-09-17T07:56:51.6928213Z return self._bytes.tostring()
2020-09-17T07:56:51.6928427Z
2020-09-17T07:56:51.6928846Z tests/test_cross_validate.py::TestPipelines::test_evaluate_classifications
2020-09-17T07:56:51.6929578Z tests/test_cross_validate.py::TestPipelines::test_evaluate_classifications_expected_id_superset_valid
2020-09-17T07:56:51.6930391Z tests/test_cross_validate.py::TestPipelines::test_evaluate_classifications_observed_id_superset_invalid
2020-09-17T07:56:51.6931278Z tests/test_cross_validate.py::TestPipelines::test_evaluate_cross_validate_k3
2020-09-17T07:56:51.6931867Z tests/test_cross_validate.py::TestPipelines::test_evaluate_fit_classifier
2020-09-17T07:56:51.6932491Z tests/test_evaluate.py::TestEvaluateUtilities::test_process_labels_empty
2020-09-17T07:56:51.6934217Z /home/runner/miniconda/envs/testing/lib/python3.6/site-packages/rescript/evaluate.py:76: UserWarning: The lists of input taxonomies and labels are different lengths. Additional taxonomies will be labeled numerically by their order in the inputs. Note that if these numbers match existing labels, those data will be grouped in the visualization.
2020-09-17T07:56:51.6935544Z warnings.warn(msg, UserWarning)
2020-09-17T07:56:51.6935784Z
2020-09-17T07:56:51.6936169Z tests/test_cross_validate.py::TestPipelines::test_evaluate_fit_classifier
2020-09-17T07:56:51.6938183Z /home/runner/miniconda/envs/testing/lib/python3.6/site-packages/q2_feature_classifier/classifier.py:102: UserWarning: The TaxonomicClassifier artifact that results from this method was trained using scikit-learn version 0.23.1. It cannot be used with other versions of scikit-learn. (While the classifier may complete successfully, the results will be unreliable.)
2020-09-17T07:56:51.6939533Z warnings.warn(warning, UserWarning)
2020-09-17T07:56:51.6939798Z
2020-09-17T07:56:51.6940130Z tests/test_derep.py::TestDerep::test_dereplicate_numericIDs
2020-09-17T07:56:51.6940622Z tests/test_derep.py::TestDerep::test_dereplicate_uniq
2020-09-17T07:56:51.6941254Z tests/test_derep.py::TestDerep::test_dereplicate_uniq
2020-09-17T07:56:51.6941710Z tests/test_derep.py::TestDerep::test_dereplicate_uniq_99_perc
2020-09-17T07:56:51.6942183Z tests/test_derep.py::TestDerep::test_dereplicate_uniq_99_perc
2020-09-17T07:56:51.6943111Z /home/runner/miniconda/envs/testing/lib/python3.6/site-packages/rescript/dereplicate.py:116: SettingWithCopyWarning:
2020-09-17T07:56:51.6943817Z A value is trying to be set on a copy of a slice from a DataFrame.
2020-09-17T07:56:51.6944252Z Try using .loc[row_indexer,col_indexer] = value instead
2020-09-17T07:56:51.6944547Z
2020-09-17T07:56:51.6945482Z See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
2020-09-17T07:56:51.6947562Z uc['Taxon'] = uc['seqID'].apply(lambda x: taxa.loc[x])
2020-09-17T07:56:51.6947872Z
2020-09-17T07:56:51.6948315Z tests/test_derep.py::TestDerep::test_dereplicate_numericIDs
2020-09-17T07:56:51.6948936Z tests/test_derep.py::TestDerep::test_dereplicate_uniq
2020-09-17T07:56:51.6949821Z tests/test_derep.py::TestDerep::test_dereplicate_uniq
2020-09-17T07:56:51.6950254Z tests/test_derep.py::TestDerep::test_dereplicate_uniq_99_perc
2020-09-17T07:56:51.6950711Z tests/test_derep.py::TestDerep::test_dereplicate_uniq_99_perc
2020-09-17T07:56:51.6951574Z /home/runner/miniconda/envs/testing/lib/python3.6/site-packages/rescript/dereplicate.py:117: SettingWithCopyWarning:
2020-09-17T07:56:51.6952243Z A value is trying to be set on a copy of a slice from a DataFrame.
2020-09-17T07:56:51.6952653Z Try using .loc[row_indexer,col_indexer] = value instead
2020-09-17T07:56:51.6952930Z
2020-09-17T07:56:51.6953796Z See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
2020-09-17T07:56:51.6954767Z uc['centroidtaxa'] = uc['centroidID'].apply(lambda x: taxa.loc[x])
2020-09-17T07:56:51.6955045Z
2020-09-17T07:56:51.6955549Z tests/test_evaluate.py::TestEvaluateSeqs::test_evaluate_seqs_visualizer
2020-09-17T07:56:51.6957106Z /home/runner/miniconda/envs/testing/lib/python3.6/site-packages/rescript/evaluate.py:76: UserWarning: The lists of input sequences and labels are different lengths. Additional sequences will be labeled numerically by their order in the inputs. Note that if these numbers match existing labels, those data will be grouped in the visualization.
2020-09-17T07:56:51.6958213Z warnings.warn(msg, UserWarning)
2020-09-17T07:56:51.6958426Z
2020-09-17T07:56:51.6958857Z tests/test_filter_length.py::TestFilterByTaxonomy::test_filter_seqs_length_by_taxon_no_failures
2020-09-17T07:56:51.6960446Z /home/runner/miniconda/envs/testing/lib/python3.6/site-packages/skbio/io/registry.py:548: FormatIdentificationWarning: <_io.BufferedReader name='/tmp/qiime2-archive-q2bek65f/6a14d996-19a1-45fb-8b7a-737c73685ce6/data/dna-sequences.fasta'> does not look like a fasta file
2020-09-17T07:56:51.6961674Z % (file, fmt), FormatIdentificationWarning)
2020-09-17T07:56:51.6961999Z
2020-09-17T07:56:51.6962439Z tests/test_filter_length.py::TestFilterByTaxonomy::test_filter_seqs_length_by_taxon_no_seqs_pass_filter
2020-09-17T07:56:51.6964279Z /home/runner/miniconda/envs/testing/lib/python3.6/site-packages/skbio/io/registry.py:548: FormatIdentificationWarning: <_io.BufferedReader name='/tmp/qiime2-archive-dw02fz5f/5fca65f3-674d-4a5f-b175-8d33a894a62c/data/dna-sequences.fasta'> does not look like a fasta file
2020-09-17T07:56:51.6965536Z % (file, fmt), FormatIdentificationWarning)
2020-09-17T07:56:51.6965844Z
2020-09-17T07:56:51.6966254Z tests/test_filter_length.py::TestFilterGlobally::test_filter_seqs_length_all_filtered_out
2020-09-17T07:56:51.6967862Z /home/runner/miniconda/envs/testing/lib/python3.6/site-packages/skbio/io/registry.py:548: FormatIdentificationWarning: <_io.BufferedReader name='/tmp/qiime2-archive-5054xhrm/eca0544b-b263-4cce-804f-c98c10ea635b/data/dna-sequences.fasta'> does not look like a fasta file
2020-09-17T07:56:51.6969347Z % (file, fmt), FormatIdentificationWarning)
2020-09-17T07:56:51.6969690Z
2020-09-17T07:56:51.6970106Z tests/test_filter_length.py::TestFilterGlobally::test_filter_seqs_length_no_failures
2020-09-17T07:56:51.6971892Z /home/runner/miniconda/envs/testing/lib/python3.6/site-packages/skbio/io/registry.py:548: FormatIdentificationWarning: <_io.BufferedReader name='/tmp/qiime2-archive-zh0holl6/4403e41a-2a07-4155-b62e-0b88be2bdaef/data/dna-sequences.fasta'> does not look like a fasta file
2020-09-17T07:56:51.6973101Z % (file, fmt), FormatIdentificationWarning)
2020-09-17T07:56:51.6973410Z
2020-09-17T07:56:51.6973706Z tests/test_get_data.py::TestGetSILVA::test_get_silva_data
2020-09-17T07:56:51.6974927Z /home/runner/miniconda/envs/testing/lib/python3.6/site-packages/skbio/io/registry.py:548: FormatIdentificationWarning: <_io.BufferedReader name='/tmp/q2-RNAFASTAFormat-db5v51nj'> does not look like a fasta file
2020-09-17T07:56:51.6975979Z % (file, fmt), FormatIdentificationWarning)
2020-09-17T07:56:51.6976304Z
2020-09-17T07:56:51.6976596Z tests/test_get_data.py::TestGetSILVA::test_get_silva_data
2020-09-17T07:56:51.6978025Z /home/runner/miniconda/envs/testing/lib/python3.6/site-packages/skbio/io/registry.py:548: FormatIdentificationWarning: <_io.BufferedReader name='/tmp/q2-RNASequencesDirectoryFormat-usr_jxwf/rna-sequences.fasta'> does not look like a fasta file
2020-09-17T07:56:51.6979471Z % (file, fmt), FormatIdentificationWarning)
2020-09-17T07:56:51.6979794Z
2020-09-17T07:56:51.6980104Z tests/test_get_data.py::TestGetSILVA::test_get_silva_data
2020-09-17T07:56:51.6981356Z /home/runner/miniconda/envs/testing/lib/python3.6/site-packages/skbio/io/registry.py:548: FormatIdentificationWarning: <_io.BufferedReader name='/tmp/q2-RNAFASTAFormat-zj32_lqa'> does not look like a fasta file
2020-09-17T07:56:51.6982595Z % (file, fmt), FormatIdentificationWarning)
2020-09-17T07:56:51.6982919Z
2020-09-17T07:56:51.6983212Z tests/test_get_data.py::TestGetSILVA::test_get_silva_data
2020-09-17T07:56:51.6984704Z /home/runner/miniconda/envs/testing/lib/python3.6/site-packages/skbio/io/registry.py:548: FormatIdentificationWarning: <_io.BufferedReader name='/tmp/q2-RNASequencesDirectoryFormat-4hihakeb/rna-sequences.fasta'> does not look like a fasta file
2020-09-17T07:56:51.6986017Z % (file, fmt), FormatIdentificationWarning)
2020-09-17T07:56:51.6986330Z
2020-09-17T07:56:51.6986814Z -- Docs: https://docs.pytest.org/en/stable/warnings.html
2020-09-17T07:56:51.6987239Z =========================== short test summary info ============================
2020-09-17T07:56:51.6987925Z FAILED tests/test_ncbi.py::TestNCBI::test_get_ncbi_data_query_mushroom_two - ...
2020-09-17T07:56:51.6988335Z ============ 1 failed, 142 passed, 733 warnings in 85.28s (0:01:25) ============
2020-09-17T07:56:51.9957401Z ##[error]The process '/bin/bash' failed with exit code 1
Additional tests failed.
Request to update `get-ncbi-data` to format the taxonomy table as it would appear in SILVA. Currently, the script populates all ranks regardless of their presence in the TaxID lineage.
Example of actual:
Feature ID Taxon
KR045484.1 k__Bacteria; p__Bacteria; c__Bacteria; o__Bacteria; f__Bacteria; g__uncultured; s__bacterium
Example of desired:
Feature ID Taxon
KR045484.1 k__Bacteria; p__; c__; o__; f__; g__; s__
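The desired behavior above could be sketched roughly as follows. This is a minimal illustration, not RESCRIPt code; the `pad_ranks` helper and its input format are hypothetical.

```python
# Hypothetical sketch of the desired behavior: emit a name only for ranks
# that are actually present in the TaxID lineage; leave the rest as bare
# prefixes, SILVA-style.
RANK_PREFIXES = ['k__', 'p__', 'c__', 'o__', 'f__', 'g__', 's__']

def pad_ranks(observed):
    """observed: dict mapping rank prefix -> name for ranks truly in the lineage."""
    return '; '.join(prefix + observed.get(prefix, '') for prefix in RANK_PREFIXES)

# Only the kingdom is known for this record, so lower ranks stay blank:
print(pad_ranks({'k__': 'Bacteria'}))
# k__Bacteria; p__; c__; o__; f__; g__; s__
```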
Additionally, would it be possible to allow specification of a target range within an accession, e.g. 'KR045484.1:3-200'? Or to specify retrieving only the taxonomy? This utility would be ideal for those of us with a curated list of sequences from an HMM. It is common to retrieve sequences from genomic assemblies, and it is also possible to have multiple copies with genetic variation between the genomic regions.
Choose the longest taxonomy string when dereplicating. This might then also include `rank_handle` and `new_rank_handle` params, though that gets complicated...
It seems that when combining DataFrames containing more columns than just 'Taxon', the columns sometimes get reordered and the 'Taxon' column ends up not being first.
See here for more details: QIIME2 forum post
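One defensive fix is to reorder columns explicitly after combining, rather than relying on pandas to preserve order. A minimal sketch, assuming pandas and a 'Taxon' column name (the `taxon_first` helper is hypothetical):

```python
import pandas as pd

# Hypothetical sketch: after combining DataFrames, explicitly move the
# 'Taxon' column to the front instead of trusting pandas' column order.
def taxon_first(df):
    cols = ['Taxon'] + [c for c in df.columns if c != 'Taxon']
    return df[cols]

df = pd.DataFrame({'Confidence': [0.9], 'Taxon': ['k__Bacteria']})
print(taxon_first(df).columns.tolist())  # ['Taxon', 'Confidence']
```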
Appears to be an issue with numeric indices in pandas. I will try to fix immediately.
Traceback (most recent call last):
File "/scratch/nab336/envs/rescript/lib/python3.6/site-packages/q2cli/commands.py", line 328, in __call__
results = action(**arguments)
File "<decorator-gen-151>", line 2, in dereplicate
File "/scratch/nab336/envs/rescript/lib/python3.6/site-packages/qiime2/sdk/action.py", line 245, in bound_callable
output_types, provenance)
File "/scratch/nab336/envs/rescript/lib/python3.6/site-packages/qiime2/sdk/action.py", line 390, in _callable_executor_
output_views = self._callable(**view_args)
File "/scratch/nab336/envs/rescript/lib/python3.6/site-packages/rescript/dereplicate.py", line 61, in dereplicate
taxa, sequences, derep_seqs, uc, mode=mode)
File "/scratch/nab336/envs/rescript/lib/python3.6/site-packages/rescript/dereplicate.py", line 122, in _dereplicate_taxa
uc['Taxon'] = uc['seqID'].apply(lambda x: taxa.loc[x])
File "/scratch/nab336/envs/rescript/lib/python3.6/site-packages/pandas/core/series.py", line 4045, in apply
mapped = lib.map_infer(values, f, convert=convert_dtype)
File "pandas/_libs/lib.pyx", line 2228, in pandas._libs.lib.map_infer
File "/scratch/nab336/envs/rescript/lib/python3.6/site-packages/rescript/dereplicate.py", line 122, in <lambda>
uc['Taxon'] = uc['seqID'].apply(lambda x: taxa.loc[x])
File "/scratch/nab336/envs/rescript/lib/python3.6/site-packages/pandas/core/indexing.py", line 1424, in __getitem__
return self._getitem_axis(maybe_callable, axis=axis)
File "/scratch/nab336/envs/rescript/lib/python3.6/site-packages/pandas/core/indexing.py", line 1849, in _getitem_axis
self._validate_key(key, axis)
File "/scratch/nab336/envs/rescript/lib/python3.6/site-packages/pandas/core/indexing.py", line 1725, in _validate_key
self._convert_scalar_indexer(key, axis)
File "/scratch/nab336/envs/rescript/lib/python3.6/site-packages/pandas/core/indexing.py", line 274, in _convert_scalar_indexer
return ax._convert_scalar_indexer(key, kind=self.name)
File "/scratch/nab336/envs/rescript/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 3138, in _convert_scalar_indexer
return self._invalid_indexer("label", key)
File "/scratch/nab336/envs/rescript/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 3340, in _invalid_indexer
form=form, klass=type(self), key=key, kind=type(key)
TypeError: cannot do label indexing on <class 'pandas.core.indexes.base.Index'> with these indexers [1078587] of <class 'int'>
Plugin error from rescript:
cannot do label indexing on <class 'pandas.core.indexes.base.Index'> with these indexers [1078587] of <class 'int'>
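The traceback shows `.loc` on a string-labelled Index being probed with a plain `int` (a sequence ID that happens to look numeric), which older pandas rejects with exactly this TypeError. One plausible fix, sketched below, is to coerce IDs to `str` on both sides before the lookup; the variable names here are illustrative, not RESCRIPt's actual code:

```python
import pandas as pd

# Taxonomy keyed by accession-like string IDs; some IDs are purely numeric.
taxa = pd.Series({'1078587': 'k__Bacteria; p__Firmicutes'})

# taxa.loc[1078587] fails in older pandas: an int key cannot label-index a
# string Index. Coercing IDs to str on both sides sidesteps the mismatch:
seq_ids = pd.Series([1078587])
taxon = seq_ids.astype(str).map(taxa)
print(taxon.iloc[0])  # k__Bacteria; p__Firmicutes
```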
In light of #30 and #32, it would be prudent to add an alternative to the `feature-classifier extract-reads` approach: that is, extract the region of interest from an alignment rather than via a primer search.
Specifically, the user would be able to specify the alignment positions from which to extract the region of interest. The intent is to refactor the extract_alignment_region.py code from this pipeline.
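The core of position-based extraction is just column slicing on the alignment. A minimal sketch under the assumption that aligned sequences all have equal length (the `extract_alignment_region` helper here is illustrative, not the script being refactored):

```python
# Hypothetical sketch: extract a region of interest from aligned sequences
# by alignment column positions (0-based start, end-exclusive), instead of
# locating it with a primer search.
def extract_alignment_region(aligned_seqs, start, end):
    """aligned_seqs: dict of {seq_id: aligned sequence string}."""
    return {seq_id: seq[start:end] for seq_id, seq in aligned_seqs.items()}

aln = {
    'seqA': 'ACGT--ACGTACGT',
    'seqB': 'ACGTTTAC--ACGT',
}
region = extract_alignment_region(aln, 4, 10)
print(region)  # {'seqA': '--ACGT', 'seqB': 'TTAC--'}
```

Because the slice is taken in alignment coordinates, gap characters are preserved, so homologous columns stay aligned across all sequences.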
We'll need this if we want to allow future editing and parsing of curated alignments from SILVA, e.g. SILVA_138_SSURef_tax_silva_full_align_trunc.fasta.gz.
Importing as FeatureData[AlignedSequence] obviously fails.
I think we can currently import this as FeatureData[RNASequence], but this might cause confusion / issues downstream when users try to manipulate these data when imported this way.
Update the `dereplicate` action to handle any number or combination of taxonomic ranks the user would like to use when implementing any of the LCA approaches. For the moment, we can only handle a standard 7-rank taxonomy, i.e. dpcofgs.
An example of this issue is on the QIIME 2 Forum.
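A rank-agnostic LCA mostly amounts to comparing taxonomies level by level and stopping at the first disagreement, with no assumption about how many levels exist. A minimal sketch (the `lca` helper is hypothetical, not RESCRIPt's implementation, which also handles rank prefixes and ties):

```python
# Hypothetical sketch: a prefix-based LCA that works for any number of
# ranks, not just the standard 7-rank dpcofgs taxonomy.
def lca(taxonomies, sep='; '):
    split = [t.split(sep) for t in taxonomies]
    consensus = []
    for levels in zip(*split):          # zip stops at the shortest taxonomy
        if len(set(levels)) == 1:       # all inputs agree at this level
            consensus.append(levels[0])
        else:
            break                       # first disagreement ends the consensus
    return sep.join(consensus)

print(lca(['d__Bacteria; p__Firmicutes; c__Bacilli',
           'd__Bacteria; p__Firmicutes; c__Clostridia']))
# d__Bacteria; p__Firmicutes
```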
Hi all,
There was a recent question posted to the QIIME forum that got me thinking about why results might differ between clustering through QIIME and stand-alone vsearch clustering.
I'm curious if it would be a significant challenge to add an additional component to `qiime rescript dereplicate` that would allow the user to supply a feature-table file. This information is only relevant when users want to invoke the `--p-perc-identity` argument. At the moment, this triggers vsearch to use `--clust_size`, but because the fasta header is formatted with only feature information (and not abundance information), the centroid selection defaults to the longest sequence in the cluster rather than the most abundant sequence. If the feature-table information could be imported with `qiime rescript dereplicate`, the relevant abundance information might then be available for each sequence feature (summed across all samples with that sequence feature) and inserted into the clustering command.
I'm not sure whether or not you want the default to be the length of the sequence or the abundance of the sequence feature - my guess is the sequence abundances are probably more relevant when defining a cluster than the length, but perhaps this varies a lot across marker genes, and the sequence fragments generated therein.
Thanks for your consideration!
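The proposal above boils down to summing per-feature counts across samples and attaching vsearch-style `;size=` annotations to the fasta headers, so that size-ordered clustering can pick the most abundant sequence as the centroid. A minimal sketch with plain dicts in place of the real feature-table artifact (the `annotate_sizes` helper is hypothetical):

```python
# Hypothetical sketch: sum per-feature abundances across samples and attach
# vsearch-style ';size=' annotations, so size-ordered clustering selects
# the most abundant sequence as the cluster centroid.
def annotate_sizes(seqs, feature_table):
    """seqs: {feature_id: sequence}; feature_table: {feature_id: [counts per sample]}."""
    annotated = {}
    for fid, seq in seqs.items():
        total = sum(feature_table.get(fid, [0]))
        annotated['%s;size=%d' % (fid, total)] = seq
    return annotated

seqs = {'feat1': 'ACGT', 'feat2': 'ACGA'}
table = {'feat1': [10, 5], 'feat2': [1, 2]}
print(annotate_sizes(seqs, table))
# {'feat1;size=15': 'ACGT', 'feat2;size=3': 'ACGA'}
```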
I think we should include / mention the SILVA license link within the help text of both `retrieve-silva-data` and `parse-silva-taxonomy`. Is there a way we can add this as a "citation"?
Some formatting steps in various actions, e.g., `cross-validate`, are single-core bottlenecks in pandas.
Should either:
The SILVA 138.1 LSU files are available here. This was brought to our attention on the QIIME 2 forum.