bokulich-lab / q2-fondue
Functions for reproducibly Obtaining and Normalizing Data re-Used from Elsewhere
License: BSD 3-Clause "New" or "Revised" License
As a plugin user,
I want to merge sequence artifacts of multiple `get-sequences` re-fetches of the same runID or projectID (see the equivalent for SRA metadata artifacts in #55). The goal is not to merge single with paired reads, but single with single and paired with paired reads - both belonging to the same sequencing runs.
As far as I can see, Q2 only offers an action to merge `FeatureData[Sequence]` artifacts (`qiime feature-table merge-seqs`) but not `SampleData[SequencesWithQuality]` or `SampleData[PairedEndSequencesWithQuality]` artifacts of the same sequencing run (see this post on the Q2 forum).
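As a sketch of what such a merge could do for single-with-single reads (hypothetical helper, not an existing q2-fondue or QIIME 2 action; it assumes per-sample fastq.gz files named by run ID, with re-fetches of the same run overwriting rather than duplicating):

```python
from pathlib import Path
import shutil


def merge_single_end_dirs(dir_a, dir_b, out_dir):
    """Merge two directories of per-sample fastq.gz files into one.

    Files are keyed by file name (i.e. by run/sample ID), so a re-fetch
    of the same run replaces the earlier copy instead of duplicating it.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    merged = {}
    for d in (Path(dir_a), Path(dir_b)):
        for f in sorted(d.glob('*.fastq.gz')):
            merged[f.name] = f  # the later directory wins on ID collision
    for name, src in merged.items():
        shutil.copy(src, out / name)
    return sorted(merged)
```

The same keying logic would apply to paired-end directories, just with two files per sample.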
When using an SRA sample ID (SRS...) to fetch sample metadata, the command hangs and never returns.
Steps to reproduce:
q2 fondue get-metadata --p-sample-ids SRS2162586 --p-email <your email> --o-metadata ~/metadata.qza --verbose
Expected behaviour:
The command returns and the metadata.qza
artifact is created.
Actual behaviour:
The command hangs, doesn't print anything and we never get an artifact.
Note: It would seem that we get to the point of entrezpy making a request (but probably never getting a response)...
Running `get-metadata` (as well as `get-all`) with project ID PRJEB14529, PRJEB23239 or PRJEB10914 prints an error and hangs indefinitely.
Steps to reproduce:
Put PRJEB14529, PRJEB23239 and PRJEB10914 into an IDs file, then run:
q2 fondue get-metadata --m-accession-ids-file <ids file> --p-email <your email> --o-metadata ~/metadata.qza --verbose
Expected behaviour:
Command returns and the metadata.qza artifact is created.
Actual behaviour:
Command prints below error and hangs indefinitely:
Exception in thread Thread-10:
Traceback (most recent call last):
File "/Users/anjaadamov/opt/anaconda3/envs/fondue-new/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
File "/Users/anjaadamov/opt/anaconda3/envs/fondue-new/lib/python3.8/site-packages/entrezpy/requester/threadedrequest.py", line 48, in run
self.run_one_request(request, analyzer)
File "/Users/anjaadamov/opt/anaconda3/envs/fondue-new/lib/python3.8/site-packages/entrezpy/requester/threadedrequest.py", line 71, in run_one_request
analyzer.parse(response, request)
File "/Users/anjaadamov/Documents/projects/05_q2fondue/q2-fondue/q2_fondue/entrezpy_clients/_efetch.py", line 513, in parse
self.analyze_result(response, request)
File "/Users/anjaadamov/Documents/projects/05_q2fondue/q2-fondue/q2_fondue/entrezpy_clients/_efetch.py", line 507, in analyze_result
self.result.add_metadata(response, request.uids)
File "/Users/anjaadamov/Documents/projects/05_q2fondue/q2-fondue/q2_fondue/entrezpy_clients/_efetch.py", line 476, in add_metadata
parsed_results[i], desired_id=uid)
IndexError: list index out of range
As a plugin user,
I want to be able to fetch studies' metadata and sequences not only with their ProjectID but also with a StudyID, so that studies from before May 2011 (when the ProjectID was introduced) can also be fetched.
As a plugin user,
I want a `get-sequences` action for fetching sequences from SRA,
so that I can easily generate sequence artifacts based on SRA accession numbers.
Acceptance criteria:
As a plugin user, I want to have a DOI to identify this plugin with and make it citable. (Helpful resource)
As a plugin user,
I want to get a list of IDs for which fetching sequences failed, plus artifacts for the sequences/metadata that were fetched correctly,
so that I can re-run the command only for the failed IDs.
As a plugin user,
I want the space limit within `get-sequences` to be adjusted
so that I can use as much of my free space as possible when downloading sequences.
Subtasks:
Note: The initial adjustment worked pretty well - since #80 some space is freed up at every fetch iteration, so we need to re-adjust.
Current behaviour:
`SRAMetadataFormat` skips all validation.
Expected behaviour:
`SRAMetadataFormat` should validate at least some of the metadata properties, e.g. the existence of a header and a couple of fields that are expected to be present in every study (ID, BioSample ID, Project ID, Platform, Instrument, Bases, Bytes, etc.).
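A minimal sketch of what such validation could look like, using only the stdlib (the required field names are the ones listed above and may need adjusting to the real SRAMetadataFormat header):

```python
import csv

# Fields expected in every study, per the issue; names are illustrative.
REQUIRED_FIELDS = {'ID', 'BioSample ID', 'Project ID', 'Platform',
                   'Instrument', 'Bases', 'Bytes'}


def validate_sra_metadata(path):
    """Check that a metadata TSV has a header containing required fields."""
    with open(path, newline='') as fh:
        reader = csv.reader(fh, delimiter='\t')
        try:
            header = next(reader)
        except StopIteration:
            raise ValueError('Metadata file is empty - header is missing.')
    missing = REQUIRED_FIELDS - set(header)
    if missing:
        raise ValueError(f'Missing required fields: {sorted(missing)}')
```

In the plugin this logic would live in the format's `_validate_` method so that `qiime tools validate` picks it up.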
For some run IDs, using sra-tools' `prefetch` command prior to `fasterq-dump` in `get-sequences` could be beneficial in that it eliminates transfer problems and validates the data. Additionally, we should implement a time buffer between retries.
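The prefetch-then-dump flow with a time buffer between retries could be sketched like this (hypothetical helper; the exact sra-tools invocations and the backoff policy are assumptions, and `runner` is injectable purely for testing):

```python
import subprocess
import time


def fetch_run(acc, retries=3, backoff=5, runner=subprocess.run):
    """Prefetch an accession, then dump it, retrying with a growing
    time buffer between attempts."""
    for attempt in range(1, retries + 1):
        try:
            runner(['prefetch', acc], check=True)       # validates transfer
            runner(['fasterq-dump', acc], check=True)   # dumps local copy
            return attempt
        except subprocess.CalledProcessError:
            if attempt == retries:
                raise
            time.sleep(backoff * attempt)  # time buffer between retries
```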
References supporting this:
If I run `get-sequences` multiple times with the same ProjectID, I get a differing number of sequence files with each run.
Steps to reproduce:
test_issue.tsv
#!/bin/bash
for i in {1..7};
do
qiime fondue get-sequences \
--m-accession-ids-file test_issue.tsv \
--p-email [email protected] \
--p-retries 10 \
--output-dir "test_proj$i"
file="test_proj$i/single_reads.qza"
if [[ -f "$file" ]] ;
then
qiime tools extract \
--input-path "$file" \
--output-path "test_proj$i/single_extract"
echo `find test_proj"$i"/single_extract/*/data -type f | wc -l`
else
echo "0"
fi
done
Expected behaviour:
Every time the command is run, we expect the same number of sequence (plus metadata.yml) files to be fetchable from SRA.
Actual behaviour:
Every time the command is run, the number of sequence (plus metadata.yml) files is different. In my case, the above script returned the following counts:
42, 47, 47, 0, 71, 89, 92
With the `--verbose` flag, a call to `get-all` generates the user warning "No paired-read sequences available for these accession IDs." As a user, I find the warning status confusing when nothing is wrong per se. Could we downgrade this status to INFO?
Sample command and output:
$ qiime fondue get-all --i-accession-ids toy-multi-ids.qza --p-email <> --p-n-jobs 4 --output-dir work-multi --verbose
QIIME is caching your current deployment for improved performance. This may take a few moments and should only happen once per deployment.
2022-02-11 11:17:26,567 [MainThread] [INFO] [entrezpy.esearch.esearcher.Esearcher]: {"query": "xtZfY5XUTDuY3d6sNk2pNw==", "status": "OK"}
2022-02-11 11:17:34,073 [MainThread] [INFO] [q2_fondue.sequences]: Downloading sequences for 4 accession IDs...
Downloading sequences for run SRR7871145 (attempt 1): 100%|████████████████████████████████████████████████████████████████| 4/4 [00:49<00:00, 12.43s/it, 0 failed]
2022-02-11 11:18:23,779 [MainThread] [INFO] [q2_fondue.sequences]: Download finished.
/Users/linkim/Documents/Work/Software/q2-fondue/q2_fondue/sequences.py:218: UserWarning: No paired-read sequences available for these accession IDs.
warn(warn_msg)
2022-02-11 11:18:25,344 [MainThread] [WARNING] [q2_fondue.sequences]: No paired-read sequences available for these accession IDs.
2022-02-11 11:18:25,349 [MainThread] [INFO] [q2_fondue.sequences]: Processing finished.
Saved SRAMetadata to: work-multi/metadata.qza
Saved SampleData[SequencesWithQuality] to: work-multi/single_reads.qza
Saved SampleData[PairedEndSequencesWithQuality] to: work-multi/paired_reads.qza
Saved SRAFailedIDs to: work-multi/failed_runs.qza
Steps to reproduce:
Run the `get-metadata` action to fetch metadata for project PRJNA13694.
Expected result:
Actual result:
Note:
We are not parsing the experiment metadata - check here:
q2-fondue/q2_fondue/entrezpy_clients/_efetch.py
Lines 246 to 270 in eb9ee93
versus run metadata:
As a plugin user,
I want to have an overview diagram showing me what is happening under the hood of the get-metadata action.
Note: initial draft for this diagram is in #26
If I run `get-all` (or `get-sequences`) with an invalid projectID without `--verbose`, q2-fondue does not report whether it succeeded or failed.
Steps to reproduce:
Put an invalid project ID into a file (PRJEB307x.tsv) and run:
qiime fondue get-all \
--m-accession-ids-file PRJEB307x.tsv \
--p-email [email protected] \
--output-dir outout
Expected behaviour:
If `get-all` fails, it should print an indication that it failed even in non-verbose mode.
Actual behaviour:
As a plugin user,
I want the metadata/sequence-fetching action to accept a file with a list of accession IDs as an input parameter,
so that I can easily fetch a large number of datasets with IDs stored in a file.
Acceptance criteria:
Note:
See here for an example on how to feed it from a file
When using a project ID with a lot of runs (>1000) to fetch metadata, the action fails as it cannot find some of the IDs that were requested. This happens both when using the project ID and when using the run IDs directly.
Steps to reproduce:
Put PRJEB14186 into an IDs file, then run:
q2 fondue get-metadata --m-accession-ids-file <ids file> --p-email <your email> --o-metadata ~/metadata.qza --verbose
Expected behaviour:
The command returns and the metadata.qza artifact is created.
Actual behaviour:
The command fails with the following error:
Exception in thread Thread-10:
Traceback (most recent call last):
File "/Users/mziemski/miniconda3/envs/fondue/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
File "/Users/mziemski/miniconda3/envs/fondue/lib/python3.8/site-packages/entrezpy/requester/threadedrequest.py", line 48, in run
self.run_one_request(request, analyzer)
File "/Users/mziemski/miniconda3/envs/fondue/lib/python3.8/site-packages/entrezpy/requester/threadedrequest.py", line 71, in run_one_request
analyzer.parse(response, request)
File "/Users/mziemski/miniconda3/envs/fondue/lib/python3.8/site-packages/q2_fondue/entrezpy_clients/_efetch.py", line 551, in parse
self.analyze_result(response, request)
File "/Users/mziemski/miniconda3/envs/fondue/lib/python3.8/site-packages/q2_fondue/entrezpy_clients/_efetch.py", line 545, in analyze_result
self.result.add_metadata(response, request.uids)
File "/Users/mziemski/miniconda3/envs/fondue/lib/python3.8/site-packages/q2_fondue/entrezpy_clients/_efetch.py", line 508, in add_metadata
current_run = run_ids[uid]
KeyError: 'ERR1428963'
Note: the key in the KeyError is not always the same; it seems like the command only fetches a subset of the metadata... And the subset is not always the same size (it gets a few hundred entries instead of 778).
As a plugin user,
I would like to instruct `get-sequences` to only get the run IDs from a `SRAMetadata` file that match certain search criteria (e.g. with a `where` parameter).
Note: The idea comes from q2-sra's `fetch_runs` action.
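Once the metadata is loaded as a DataFrame, such a `where` filter could be as simple as the following (the parameter name and query semantics are illustrative, not a committed design):

```python
import pandas as pd


def filter_runs(metadata: pd.DataFrame, where: str) -> list:
    """Return the run IDs (the DataFrame index) whose metadata rows
    match a pandas `query` expression, e.g.
    "Platform == 'ILLUMINA' and Bases > 1000"."""
    return metadata.query(where).index.tolist()
```

The resulting ID list could then be passed to the sequence-fetching machinery unchanged.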
As a plugin user,
I would like all q2-fondue methods to store the individual accession IDs that were used as an input to the method.
Background:
Currently, the methods keep track of the name of the metadata file (e.g. metadata.tsv) but not of its content (no content hashes or similar are generated).
Suggested solution:
Create an additional type for metadata that stores the tsv file with the used accession IDs.
When fetching tens of runs, one very often needs to wait a very long time for q2-fondue to process and save the sequences (even with amplicon data, not to mention (meta)genomes). Looking at the code makes me realize that there are two main issues with the approach we are taking within the get-sequences method (and here we were, blaming it on QIIME! 🙈): everything happens in one sequential chain (`download` -> `pre-process` (incl. renaming) -> `process` (write to final files)). Two main, relatively easy solutions (at least for now) addressing those points could be:
As a plugin user,
I would like the `get-metadata`/`get-all` methods to not fail when they encounter invalid IDs,
so that I can get the metadata for at least the remaining IDs.
Background:
It turns out that some IDs found in the literature may be out-of-date as far as SRA is concerned, e.g. when authors requested data removal from SRA. In such cases, those IDs appear as "suppressed" when searched in the SRA Browser.
It would be much more convenient if the `get-metadata` and `get-all` methods returned a list of failed IDs and continued downloading metadata for all the remaining IDs - the user can then deal with the leftover IDs.
As a plugin developer/user,
I want the logging employed by all the entrezpy objects to be exposed and configurable,
so that it is easier to debug and so that I can follow the progress of the requests at a specific log level.
Note: All of the entrezpy objects already use loggers from the logging module, but it would seem that they need some basic configuration (add stdout handlers with a nice formatter, expose the log level to the user). More info here
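A minimal configuration along the lines of the note could look like this (handler and formatter choices are illustrative; it relies on entrezpy's child loggers propagating to the `entrezpy` logger):

```python
import logging
import sys


def configure_entrezpy_logging(level=logging.INFO):
    """Attach a stdout handler with a readable formatter to the
    `entrezpy` logger hierarchy and expose the log level to the user."""
    logger = logging.getLogger('entrezpy')
    logger.setLevel(level)
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(logging.Formatter(
        '%(asctime)s [%(threadName)s] [%(levelname)s] [%(name)s]: %(message)s'))
    logger.addHandler(handler)
    return logger
```

The level could then be surfaced as a plugin parameter (e.g. a `--p-log-level` choice mapped to `logging` constants).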
As a plugin user,
I want one action which chains sequence and metadata fetching,
so that I can obtain both in a single step.
Acceptance criteria:
This enhancement request requires some more evaluation.
In the case of large datasets it takes very long for `fasterq-dump` (blue line in the screenshot below) to process the sequences fetched with `prefetch` (simply due to the size; red line). As these two actions (prefetch -> fasterq-dump) are currently chained within one function, nothing else can be processed or downloaded before every single sample is done.
This could be significantly improved by letting `prefetch` grab everything first; while some data is already available, `fasterq-dump` could start processing it (similarly to how we do it later for the post-processing steps).
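A rough sketch of that overlap, using a queue between a downloader and a processor thread (function names and the threading approach are illustrative, not the actual implementation):

```python
import queue
import threading


def pipeline(accessions, prefetch, dump):
    """Let `prefetch` keep grabbing accessions while `dump` already
    processes the ones that finished, instead of chaining both per
    accession."""
    q = queue.Queue()
    done = []

    def producer():
        for acc in accessions:
            prefetch(acc)
            q.put(acc)   # hand over as soon as the download is done
        q.put(None)      # sentinel: nothing more to dump

    def consumer():
        while (acc := q.get()) is not None:
            dump(acc)
            done.append(acc)

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done
```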
As a plugin user,
I want `get-sequences` to ignore the indicated `retries` in case it runs into a space exhaustion issue, since in case of a space issue there is no need to retry the sequence fetching multiple times.
Running `get-sequences` with the ProjectID PRJEB14529 raises a fasterq-dump error for its runID ERR139189 (below, Experiment A). But running `get-sequences` just for this runID ERR139189 succeeds (below, Experiment B).
Experiment A:
Put PRJEB14529 into a "metadata-file" and run:
qiime fondue get-sequences --m-accession-ids-file <metadata-file> --p-email <your-email> --output-dir <output-loc>
Experiment B:
Put ERR139189 into a "metadata-file" and run:
qiime fondue get-sequences --m-accession-ids-file <metadata-file> --p-email <your-email> --output-dir <output-loc>
`get-sequences` for a runID and the projectID it belongs to should both either succeed or fail.
Command A fails with the following error:
Traceback (most recent call last):
File "/Users/anjaadamov/opt/anaconda3/envs/fondue/lib/python3.8/site-packages/q2cli/commands.py", line 329, in __call__
results = action(**arguments)
File "<decorator-gen-312>", line 2, in get_all
File "/Users/anjaadamov/opt/anaconda3/envs/fondue/lib/python3.8/site-packages/qiime2/sdk/action.py", line 245, in bound_callable
outputs = self._callable_executor_(scope, callable_args,
File "/Users/anjaadamov/opt/anaconda3/envs/fondue/lib/python3.8/site-packages/qiime2/sdk/action.py", line 485, in _callable_executor_
outputs = self._callable(scope.ctx, **view_args)
File "/Users/anjaadamov/opt/anaconda3/envs/fondue/lib/python3.8/site-packages/q2_fondue/get_all.py", line 24, in get_all
seq_single, seq_paired, = get_sequences(
File "<decorator-gen-549>", line 2, in get_sequences
File "/Users/anjaadamov/opt/anaconda3/envs/fondue/lib/python3.8/site-packages/qiime2/sdk/action.py", line 245, in bound_callable
outputs = self._callable_executor_(scope, callable_args,
File "/Users/anjaadamov/opt/anaconda3/envs/fondue/lib/python3.8/site-packages/qiime2/sdk/action.py", line 391, in _callable_executor_
output_views = self._callable(**view_args)
File "/Users/anjaadamov/opt/anaconda3/envs/fondue/lib/python3.8/site-packages/q2_fondue/sequences.py", line 244, in get_sequences
_run_fasterq_dump_for_all(
File "/Users/anjaadamov/opt/anaconda3/envs/fondue/lib/python3.8/site-packages/q2_fondue/sequences.py", line 75, in _run_fasterq_dump_for_all
raise ValueError('{} could not be downloaded with the '
ValueError: ERR139189 could not be downloaded with the following fasterq-dump error returned: 2021-11-11T09:47:39 fasterq-dump.2.9.6 sys: connection not found while validating within network system module - Failed to Make Connection in KClientHttpOpen to 'www.ncbi.nlm.nih.gov:443'
2021-11-11T09:47:39 fasterq-dump.2.9.6 err: invalid accession 'ERR139189'
Command B succeeds with:
Saved SampleData[SequencesWithQuality] to: ERR139189/single_reads.qza
Saved SampleData[PairedEndSequencesWithQuality] to: ERR139189/paired_reads.qza
This could be caused by using an older version of sra-tools (2.9.6 instead of 2.10); see comments here and here.
When fetching metadata containing some duplicated keys, the processing fails.
Steps to reproduce:
qiime fondue get-metadata ...
Expected behaviour:
Actual behaviour:
Traceback (most recent call last):
File "/Users/mziemski/miniconda3/envs/fondue/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
File "/Users/mziemski/miniconda3/envs/fondue/lib/python3.8/site-packages/entrezpy/requester/threadedrequest.py", line 48, in run
self.run_one_request(request, analyzer)
File "/Users/mziemski/miniconda3/envs/fondue/lib/python3.8/site-packages/entrezpy/requester/threadedrequest.py", line 71, in run_one_request
analyzer.parse(response, request)
File "/Users/mziemski/miniconda3/envs/fondue/lib/python3.8/site-packages/q2_fondue/entrezpy_clients/_efetch.py", line 295, in parse
self.analyze_result(response, request)
File "/Users/mziemski/miniconda3/envs/fondue/lib/python3.8/site-packages/q2_fondue/entrezpy_clients/_efetch.py", line 289, in analyze_result
self.result.add_metadata(response, request.uids)
File "/Users/mziemski/miniconda3/envs/fondue/lib/python3.8/site-packages/q2_fondue/entrezpy_clients/_efetch.py", line 265, in add_metadata
self.metadata[uid] = self._process_single_run(
File "/Users/mziemski/miniconda3/envs/fondue/lib/python3.8/site-packages/q2_fondue/entrezpy_clients/_efetch.py", line 71, in _process_single_run
processed_meta = self._extract_custom_attributes(
File "/Users/mziemski/miniconda3/envs/fondue/lib/python3.8/site-packages/q2_fondue/entrezpy_clients/_efetch.py", line 107, in _extract_custom_attributes
raise DuplicateKeyError(
q2_fondue.entrezpy_clients._efetch.DuplicateKeyError: One of the metadata keys (BioSampleModel) is duplicated.
As a plugin user,
I want q2-fondue `get-all` and `get-sequences` to retry my command when the connection to NCBI via entrezpy fails (as indicated by the error below), or to capture this error and let me know it was a connection problem.
What q2-fondue `get-all` returns when the connection to NCBI fails:
2021-11-18 10:38:58,153 [MainThread] [INFO] [q2_fondue.metadata]: 2160 missing IDs were found - we will retry fetching those (20 retries left).
Exception in thread Thread-37:
Traceback (most recent call last):
File "/Users/anjaadamov/opt/anaconda3/envs/fondue/lib/python3.8/urllib/request.py", line 1354, in do_open
h.request(req.get_method(), req.selector, req.data, headers,
File "/Users/anjaadamov/opt/anaconda3/envs/fondue/lib/python3.8/http/client.py", line 1256, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/Users/anjaadamov/opt/anaconda3/envs/fondue/lib/python3.8/http/client.py", line 1302, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/Users/anjaadamov/opt/anaconda3/envs/fondue/lib/python3.8/http/client.py", line 1251, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/Users/anjaadamov/opt/anaconda3/envs/fondue/lib/python3.8/http/client.py", line 1011, in _send_output
self.send(msg)
File "/Users/anjaadamov/opt/anaconda3/envs/fondue/lib/python3.8/http/client.py", line 951, in send
self.connect()
File "/Users/anjaadamov/opt/anaconda3/envs/fondue/lib/python3.8/http/client.py", line 1418, in connect
super().connect()
File "/Users/anjaadamov/opt/anaconda3/envs/fondue/lib/python3.8/http/client.py", line 922, in connect
self.sock = self._create_connection(
File "/Users/anjaadamov/opt/anaconda3/envs/fondue/lib/python3.8/socket.py", line 787, in create_connection
for res in getaddrinfo(host, port, 0, SOCK_STREAM):
File "/Users/anjaadamov/opt/anaconda3/envs/fondue/lib/python3.8/socket.py", line 918, in getaddrinfo
for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno 8] nodename nor servname provided, or not known
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/anjaadamov/opt/anaconda3/envs/fondue/lib/python3.8/site-packages/entrezpy/requester/requester.py", line 90, in request
response = urllib.request.urlopen(urllib.request.Request(req.url,data=data),
File "/Users/anjaadamov/opt/anaconda3/envs/fondue/lib/python3.8/urllib/request.py", line 222, in urlopen
return opener.open(url, data, timeout)
File "/Users/anjaadamov/opt/anaconda3/envs/fondue/lib/python3.8/urllib/request.py", line 525, in open
response = self._open(req, data)
File "/Users/anjaadamov/opt/anaconda3/envs/fondue/lib/python3.8/urllib/request.py", line 542, in _open
result = self._call_chain(self.handle_open, protocol, protocol +
File "/Users/anjaadamov/opt/anaconda3/envs/fondue/lib/python3.8/urllib/request.py", line 502, in _call_chain
result = func(*args)
File "/Users/anjaadamov/opt/anaconda3/envs/fondue/lib/python3.8/urllib/request.py", line 1397, in https_open
return self.do_open(http.client.HTTPSConnection, req,
File "/Users/anjaadamov/opt/anaconda3/envs/fondue/lib/python3.8/urllib/request.py", line 1357, in do_open
raise URLError(err)
urllib.error.URLError: <urlopen error [Errno 8] nodename nor servname provided, or not known>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/anjaadamov/opt/anaconda3/envs/fondue/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
File "/Users/anjaadamov/opt/anaconda3/envs/fondue/lib/python3.8/site-packages/entrezpy/requester/threadedrequest.py", line 48, in run
self.run_one_request(request, analyzer)
File "/Users/anjaadamov/opt/anaconda3/envs/fondue/lib/python3.8/site-packages/entrezpy/requester/threadedrequest.py", line 66, in run_one_request
response = self.requester.request(request)
File "/Users/anjaadamov/opt/anaconda3/envs/fondue/lib/python3.8/site-packages/entrezpy/requester/requester.py", line 104, in request
self.logger.error(json.dumps({'URL-error':url_err.reason, 'action':'retry'}))
File "/Users/anjaadamov/opt/anaconda3/envs/fondue/lib/python3.8/json/__init__.py", line 231, in dumps
return _default_encoder.encode(obj)
File "/Users/anjaadamov/opt/anaconda3/envs/fondue/lib/python3.8/json/encoder.py", line 199, in encode
chunks = self.iterencode(o, _one_shot=True)
File "/Users/anjaadamov/opt/anaconda3/envs/fondue/lib/python3.8/json/encoder.py", line 257, in iterencode
return _iterencode(o, 0)
File "/Users/anjaadamov/opt/anaconda3/envs/fondue/lib/python3.8/json/encoder.py", line 179, in default
raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type gaierror is not JSON serializable
Exception in thread Thread-36:
Traceback (most recent call last):
File "/Users/anjaadamov/opt/anaconda3/envs/fondue/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
File "/Users/anjaadamov/opt/anaconda3/envs/fondue/lib/python3.8/site-packages/entrezpy/requester/monitor.py", line 83, in run
i.report_status(self.processed_requests, self.expected_requests)
File "/Users/anjaadamov/opt/anaconda3/envs/fondue/lib/python3.8/site-packages/entrezpy/base/request.py", line 129, in report_status
self.logger.debug(json.dumps({'status': self.dump_internals()}))
File "/Users/anjaadamov/opt/anaconda3/envs/fondue/lib/python3.8/json/__init__.py", line 231, in dumps
return _default_encoder.encode(obj)
File "/Users/anjaadamov/opt/anaconda3/envs/fondue/lib/python3.8/json/encoder.py", line 199, in encode
chunks = self.iterencode(o, _one_shot=True)
File "/Users/anjaadamov/opt/anaconda3/envs/fondue/lib/python3.8/json/encoder.py", line 257, in iterencode
return _iterencode(o, 0)
File "/Users/anjaadamov/opt/anaconda3/envs/fondue/lib/python3.8/json/encoder.py", line 179, in default
raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type gaierror is not JSON serializable
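A sketch of the requested retry wrapper, catching the connection-level errors visible in the traceback above (the wrapper, its backoff policy, and the final `ConnectionError` are illustrative, not existing q2-fondue behaviour):

```python
import socket
import time
import urllib.error


def with_connection_retries(fn, retries=3, backoff=2):
    """Run `fn`, retrying when the NCBI connection fails, and raise a
    clear error once the retries are exhausted."""
    for attempt in range(1, retries + 1):
        try:
            return fn()
        except (urllib.error.URLError, socket.gaierror) as err:
            if attempt == retries:
                raise ConnectionError(
                    f'Could not reach NCBI after {retries} attempts: {err}')
            time.sleep(backoff * attempt)  # wait before the next attempt
```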
When running `make dev` with an existing sra-tools installation, the current install-sra-tools.sh script asks for every package in the installation whether it should be overwritten.
Steps to reproduce:
Install sra-tools as described in the README and run `make dev` afterwards in the same environment.
Expected behaviour:
The script should ask once whether it should overwrite all required packages [y/n], or only run `pip install -e` without reinstalling sra-tools from scratch.
Actual behaviour:
The script asks for every package whether it should overwrite it [y/n].
As a plugin user,
I want the `get-sequences` action to be able to fetch by project ID,
so that I can use it to retrieve only sequences from a specific project.
It appears that when neither paired- nor single-read sequences were saved as artifacts, the action (`get-sequences` or `get-all`) still returns successfully (as format validation passes even on empty fastq files). It should, however, fail when there are no files in either of the two.
Note: I'm not sure what caused this behaviour - still investigating. Will add steps to reproduce when found. Regardless of that, I think this behaviour should not be expected.
As a plugin user, I would like to also fetch publication metadata when I use `get-metadata`, so that I know about any publications linked to the studies and samples, i.e. citation information is also fetched and preserved alongside the other (meta)data. This would significantly improve traceability.
I would like to grab any Pubmed ID(s) linked to the BioProject ID. If possible, this could be linked additionally to a DOI (as a separate metadata column).
I imagine that we could either (a) embed the citation directly in provenance or (b) just retrieve the pubmed ID/doi and place it in the metadata file for easy parsing later.
PubMed ID appears to be one of the optional metadata categories, and even searching BioProjects by PubMed ID is possible in SRA.
This could also be part of a separate action if PubMed ID needs to be fetched via a separate query.
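If this goes through NCBI's E-utilities, the `elink` endpoint can map a BioProject UID to linked PubMed IDs. A sketch of building such a request (the helper is hypothetical and the exact parameter layout should be verified against the E-utilities documentation):

```python
from urllib.parse import urlencode

EUTILS_ELINK = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi'


def build_pubmed_link_query(bioproject_uid, email):
    """Build an elink request URL linking a BioProject UID to PubMed.

    The returned URL would be fetched and its XML parsed for PubMed
    IDs, which could then be added as a metadata column.
    """
    params = {'dbfrom': 'bioproject', 'db': 'pubmed',
              'id': bioproject_uid, 'email': email}
    return f'{EUTILS_ELINK}?{urlencode(params)}'
```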
Running `get-metadata` with project ID PRJEB5482 prints a KeyError and hangs indefinitely.
Steps to reproduce:
q2 fondue get-metadata --m-accession-ids-file <ids file> --p-email <your email> --o-metadata ~/metadata.qza --verbose
Expected behaviour:
Command returns and metadata.qza artifact is created.
Actual behaviour:
Command prints below error and hangs indefinitely:
Exception in thread Thread-10:
Traceback (most recent call last):
File "/Users/anjaadamov/opt/anaconda3/envs/fondue-new/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
File "/Users/anjaadamov/opt/anaconda3/envs/fondue-new/lib/python3.8/site-packages/entrezpy/requester/threadedrequest.py", line 48, in run
self.run_one_request(request, analyzer)
File "/Users/anjaadamov/opt/anaconda3/envs/fondue-new/lib/python3.8/site-packages/entrezpy/requester/threadedrequest.py", line 71, in run_one_request
analyzer.parse(response, request)
File "/Users/anjaadamov/Documents/projects/05_q2fondue/q2-fondue/q2_fondue/entrezpy_clients/_efetch.py", line 513, in parse
self.analyze_result(response, request)
File "/Users/anjaadamov/Documents/projects/05_q2fondue/q2-fondue/q2_fondue/entrezpy_clients/_efetch.py", line 507, in analyze_result
self.result.add_metadata(response, request.uids)
File "/Users/anjaadamov/Documents/projects/05_q2fondue/q2-fondue/q2_fondue/entrezpy_clients/_efetch.py", line 475, in add_metadata
self.metadata[i] = self._process_single_id(
File "/Users/anjaadamov/Documents/projects/05_q2fondue/q2-fondue/q2_fondue/entrezpy_clients/_efetch.py", line 375, in _process_single_id
sample_ids = self._create_samples(attributes, study_id)
File "/Users/anjaadamov/Documents/projects/05_q2fondue/q2-fondue/q2_fondue/entrezpy_clients/_efetch.py", line 187, in _create_samples
pool_meta = attributes['Pool'].get('Member')
KeyError: 'Pool'
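A defensive rewrite of the failing lookup could look like this (a hypothetical stand-in for `_create_samples`; the fallback behaviour and attribute layout are assumptions, not the plugin's actual logic):

```python
def create_samples(attributes, study_id):
    """Tolerate runs without a 'Pool' entry instead of raising KeyError."""
    pool_meta = attributes.get('Pool', {}).get('Member')
    if pool_meta is None:
        # no pool members listed - fall back to the study ID
        return [study_id]
    if isinstance(pool_meta, dict):
        # a single member is parsed as a dict, not a list of dicts
        pool_meta = [pool_meta]
    return [m.get('@accession', study_id) for m in pool_meta]
```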
All of the functions take the `sample_id` parameter as input. What we have effectively used so far are run IDs rather than sample IDs (see #24), meaning that we are actually fetching runs and not samples.
We should rename that param to make it less confusing. I'd propose to name it something non-specific, e.g. `accession_id` - in another PR (#26) I'll be introducing a step that should recognize which kind of ID was given.
For some time now, when installing sra-tools through conda (as described in the README), none of the CLI tools actually work - they all throw some mysterious certificate error.
Steps to reproduce:
Run prefetch -v ERR1428207 and observe the error.
Expected behaviour:
Data for the provided ID is downloaded.
Actual behaviour:
An error is thrown: The certificate is not correctly signed by the trusted CA.
Notes:
As a q2-fondue developer,
I want all the standard plugin skeleton elements,
so that I can start developing the actual actions/functionality without worrying about the boilerplate code.
Acceptance criteria:
As a plugin user,
I want the metadata/data fetching action to accept ProjectIDs (PRJEB / PRJNA) as an input parameter,
so I can obtain the sequencing (meta)data of the entire SRA/ENA BioProject.
Goal is to update the GHA integration with a stable action-library-packaging version.
Acceptance criteria:
As a plugin user,
I would like q2-fondue to inform me roughly how much space is needed to download the sequences with `get-sequences` for the accession IDs in my `--m-accession-ids-file`, so that I can compare it with my available Q2 `TMPDIR` space.
Suggested implementation approach:
- Add a new action `estimate-space-req` that makes use of `vdb-dump --info` from sra-tools and sums the space requirements over all runIDs, multiplying by a factor of 8 or 10 (since, according to the sra-tools wiki, "As a rule of thumb you should have about 8x … 10x the size of the accession available on your filesystem.").
- `estimate-space-req` could be integrated into the `get-all` pipeline before `get-sequences` is run.
- Point users to changing their Q2 `TMPDIR` location to a location with more space if its space is currently exceeded.
Description with more details will follow shortly.
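A sketch of the summing step, assuming the `size` field of `vdb-dump --info` reports bytes with thousands separators (verify the exact output layout against your sra-tools version; the function operates on already-captured stdout):

```python
import re


def estimate_space_req(info_outputs, factor=10):
    """Sum the `size` fields from captured `vdb-dump --info` outputs
    and apply the 8x-10x rule of thumb from the sra-tools wiki."""
    total = 0
    for text in info_outputs:
        match = re.search(r'^size\s*:\s*([\d,]+)', text, re.MULTILINE)
        if match:
            total += int(match.group(1).replace(',', ''))
    return total * factor
```

The result could then be compared against the free space in the Q2 `TMPDIR` before any download starts.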
As a plugin user,
I want to have an action that allows me to scrape a collection of scientific papers for accession IDs which can be passed to the other q2fondue actions.
As a plugin user,
I would like `get-sequences` and `get-all` to exit with the previously fetched sequences shortly before they fail with a space issue and return nothing (OSError: [Errno 28] No space left on device). This could be implemented by closing the q2-fondue execution when nearly all available storage (e.g. 95%) is used up.
Note: This can be tested on projectID PRJEB3079, which requires 230-270 GiB to be downloaded.
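A sketch of the proposed early exit, using a simple used/total check via `shutil.disk_usage` (the helper names and the 95% default are illustrative):

```python
import shutil


def storage_nearly_full(path, threshold=0.95):
    """True when the filesystem holding `path` is almost full."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total >= threshold


def fetch_until_full(accessions, fetch, tmpdir='.', threshold=0.95):
    """Fetch accessions one by one, stopping early instead of dying
    with 'No space left on device'; returns what was fetched and what
    had to be skipped."""
    fetched, skipped = [], []
    for acc in accessions:
        if storage_nearly_full(tmpdir, threshold):
            skipped = accessions[accessions.index(acc):]
            break
        fetch(acc)
        fetched.append(acc)
    return fetched, skipped
```

The skipped IDs could then be returned as an `SRAFailedIDs` artifact so the user can retry them later with more space.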
We should clean up the repo to make it ready for the first release (mostly: remove meeting notes and clean up the issues).
Just opening this issue to have this documented in one place as it's not necessarily something we can "fix".
As @adamovanja pointed out elsewhere, sometimes fetching sequences fails with an invalid accession ... error from `fasterq-dump`, and it seems to be happening only with the latest version of q2-fondue. After some investigation, my impression is that this is caused by the configuration of sra-tools that is performed using the `vdb-config` tool. So there are two scenarios:
- `prefetch`'s download location is set to "current directory" (this is the default option): `prefetch` manages to download everything as expected but `fasterq-dump` fails (who knows why, I couldn't really find that out).
- `prefetch`'s download location is set to "user-repository" and the repository value is set (tab "Cache"): sequences are fetched correctly (note that if the repository is not set, the download will likely fail).
This behaviour is observed on sra-tools version 2.11.0 (currently available via conda). When using the latest version of the toolkit (2.13.0) I have not observed the same issue: the downloads seemed to succeed regardless of the repository settings.
To set the repo location, one can use vdb-config in interactive mode or just execute these two commands:
vdb-config -s "/repository/user/main/public/root=<your cache location>"
vdb-config --prefetch-to-user-repo
Proposed solution:
If someone else can reproduce this, I would say we should just add a section at the beginning of the README/tutorial saying that the users should run the configuration tool after installing q2-fondue and what they need to set where. Whenever the newest toolkit version becomes available we can just upgrade and that should solve the issue.
As a plugin user,
I want a get-metadata
action for fetching SRA metadata,
so that I can easily obtain a table with metadata for all my sequences.
Acceptance criteria:
Use `entrezpy` to interact with Entrez.
When downloading sequences for some samples (e.g., metagenomes), the download appears very slow compared to smaller-sized datasets.
Steps to reproduce:
Try to fetch sequences for ID ERR1700893 and observe the time it takes.
Expected behaviour:
The size of this dataset is approx. 28 GB - it should be a matter of half an hour to an hour to fetch (depends on connection speed).
Actual behaviour:
It takes hours (don't know exactly, didn't wait for it to finish).
The problem is that in the case of large datasets `prefetch` silently fails, as the default allowed max. size is 20 GB; `fasterq-dump` then takes over but is just much slower. This can easily be fixed by adjusting the `max-size` param of `prefetch` to unlimited to allow downloads of any size. See here for some more info.
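A sketch of lifting the cap (whether `--max-size u` means unlimited or a plain KB value is required depends on the installed sra-tools version - treat the value as an assumption to check):

```python
def prefetch_command(accession, max_size='u'):
    """Build a prefetch invocation with the default 20 GB size cap
    lifted, so large accessions are not silently skipped."""
    return ['prefetch', '--max-size', str(max_size), accession]
```

The resulting list would then be passed to `subprocess.run` in place of the current bare `prefetch` call.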
As a plugin developer,
I want the `retmax` param on `efetch` queries to be set to a fixed value,
so that one can reliably fetch metadata without failing in case too many entries were requested.
Note: that should probably be set somewhere around here:
q2-fondue/q2_fondue/metadata.py
Lines 42 to 49 in e39db49
and this maybe needs adjusting to loop over id batches:
q2-fondue/q2_fondue/metadata.py
Lines 57 to 64 in e39db49
This is inspired by the approach used here: https://github.com/ebolyen/q2-sra/blob/eea38c3750aa051ed1cecb1f2b7220d42ef1f2d5/q2_sra/lib/efetch.py#L57
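The batching loop mentioned above could be as simple as the following (helper name illustrative; each batch would go into one efetch request whose `retmax` matches the batch size):

```python
def batch_ids(ids, size):
    """Split accession IDs into fixed-size batches so that each efetch
    request stays within a fixed `retmax`."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]
```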
The current Python version in the GitHub action is 3.6 (see q2-fondue/.github/workflows/ci.yml) but should be changed to 3.8 (as that is the version used by Q2 2021.8.dev).