
bokulich-lab / q2-fondue


Functions for reproducibly Obtaining and Normalizing Data re-Used from Elsewhere

License: BSD 3-Clause "New" or "Revised" License

Makefile 0.10% Python 98.88% TeX 0.45% Shell 0.48% Dockerfile 0.10%

q2-fondue's People

Contributors

adamovanja, lenafloerl, lina-kim, misialq, nbokulich, q2d2


q2-fondue's Issues

Action to merge multiple sequence artifacts of the same sequencing runs

As a plugin user,

I want to merge sequence artifacts of multiple get-sequences re-fetches of the same runID or projectID (see equivalent for SRA metadata artifacts in #55). The goal is not to merge single with paired reads, but single with single reads and paired with paired reads - both belonging to the same sequencing runs.

As far as I can see, Q2 only offers an action to merge FeatureData[Sequence] artifacts (qiime feature-table merge-seqs) but not SampleData[SequencesWithQuality] or SampleData[PairedEndSequencesWithQuality] artifacts of the same sequencing run (see this post on Q2 forum).
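Until such an action exists, the merge could be sketched as combining the per-run file manifests of repeated re-fetches, keeping the first copy of any run present in both. This is a hypothetical helper, not q2-fondue or QIIME 2 API:

```python
# Hypothetical sketch: merge per-run fastq manifests (run ID -> file path)
# from repeated get-sequences re-fetches of the same runs. Duplicate run IDs
# are assumed to refer to equivalent downloads, so the first occurrence wins.
def merge_manifests(*manifests: dict) -> dict:
    merged: dict = {}
    for manifest in manifests:
        for run_id, filepath in manifest.items():
            merged.setdefault(run_id, filepath)
    return merged
```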

Fetching metadata by sample ID never returns

When using an SRA sample ID (SRS...) to fetch sample metadata, the command hangs and never returns.

Steps to reproduce:

  1. Run the following command: qiime fondue get-metadata --p-sample-ids SRS2162586 --p-email <your email> --o-metadata ~/metadata.qza --verbose.
  2. Observe the behaviour.

Expected behaviour:
The command returns and the metadata.qza artifact is created.

Actual behaviour:
The command hangs, doesn't print anything and we never get an artifact.

Note: It would seem that we get to the point of entrezpy making a request (but probably never getting a response)...

get-metadata IndexError & hangs

Running get-metadata (as well as get-all) with project ID PRJEB14529, PRJEB23239 or PRJEB10914 prints an error and hangs indefinitely.

Steps to reproduce:

  • Create TSV file containing either of these project IDs: PRJEB14529, PRJEB23239, PRJEB10914
  • Run the following command: qiime fondue get-metadata --m-accession-ids-file <ids file> --p-email <your email> --o-metadata ~/metadata.qza --verbose
  • Observe the behaviour.

Expected behaviour:
Command returns and metadata.qza artifact is created.

Actual behaviour:
Command prints below error and hangs indefinitely:

Exception in thread Thread-10:
Traceback (most recent call last):
  File "/Users/anjaadamov/opt/anaconda3/envs/fondue-new/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/Users/anjaadamov/opt/anaconda3/envs/fondue-new/lib/python3.8/site-packages/entrezpy/requester/threadedrequest.py", line 48, in run
    self.run_one_request(request, analyzer)
  File "/Users/anjaadamov/opt/anaconda3/envs/fondue-new/lib/python3.8/site-packages/entrezpy/requester/threadedrequest.py", line 71, in run_one_request
    analyzer.parse(response, request)
  File "/Users/anjaadamov/Documents/projects/05_q2fondue/q2-fondue/q2_fondue/entrezpy_clients/_efetch.py", line 513, in parse
    self.analyze_result(response, request)
  File "/Users/anjaadamov/Documents/projects/05_q2fondue/q2-fondue/q2_fondue/entrezpy_clients/_efetch.py", line 507, in analyze_result
    self.result.add_metadata(response, request.uids)
  File "/Users/anjaadamov/Documents/projects/05_q2fondue/q2-fondue/q2_fondue/entrezpy_clients/_efetch.py", line 476, in add_metadata
    parsed_results[i], desired_id=uid)
IndexError: list index out of range

Fetch data by StudyID

As a plugin user,
I want to be able to fetch studies' metadata and sequences not only with their ProjectID but also with a StudyID, so that studies published before May 2011 (when the ProjectID was introduced) can also be fetched.

Implement sequence fetching action

As a plugin user,
I want a get-sequences action for fetching sequences from SRA,
so that I can easily generate sequence artifacts based on SRA accession numbers.

Acceptance criteria:

  • given a list of SRA accession numbers, fetches all required sequences and outputs a corresponding number of q2 artifacts
  • handles single- and paired-end data correctly

Output a list of failed IDs from `get-sequences`

As a plugin user,
I want to get a list of IDs for which fetching sequences failed + artifacts for the sequences/metadata that were fetched correctly,
so that I can re-run the command only for the failed IDs.
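The desired behaviour could be sketched as a fetch loop that collects failures instead of aborting; `fetch_collecting_failures` and `fetch_one` are illustrative names, not the plugin's internals:

```python
# Sketch: fetch every ID, collecting failures instead of aborting the run,
# so the failed IDs can be returned as a separate output alongside the
# successfully fetched artifacts. `fetch_one` stands in for the real
# per-ID download.
def fetch_collecting_failures(ids, fetch_one):
    fetched, failed = {}, {}
    for accession in ids:
        try:
            fetched[accession] = fetch_one(accession)
        except Exception as exc:
            failed[accession] = str(exc)
    return fetched, failed
```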

Adjust get-sequences' space limit

As a plugin user,
I want the space limit within get-sequences to be adjusted
so that I can use as much of my free space as possible when downloading sequences.

Subtasks:

  • Gather a list of run/project IDs to run the tests with
  • Perform and document space requirement tests
  • Adjust the space limit in the code

Note: The initial adjustment worked pretty well - after #80 some space is freed up at every fetch iteration, so we need to re-adjust.
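The free space itself can be probed with the standard library; a sketch of the check the limit could be based on (the `safety` head-room factor is an assumption, not the plugin's actual margin):

```python
import shutil

def free_space_bytes(path: str = '.') -> int:
    # Free bytes on the filesystem holding `path`.
    return shutil.disk_usage(path).free

def enough_space(required_bytes: int, path: str = '.',
                 safety: float = 1.1) -> bool:
    # `safety` leaves head-room for temporary files created while fetching;
    # the 1.1 factor is illustrative only.
    return free_space_bytes(path) >= required_bytes * safety
```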

Add proper metadata format validation

Current behaviour:
SRAMetadataFormat skips all validation.

Expected behaviour:
SRAMetadataFormat should validate at least some of the metadata properties like existence of a header and a couple of fields that are expected to be present in every study (ID, BioSample ID, Project ID, Platform, Instrument, Bases, Bytes etc.)
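A minimal header check along these lines could look as follows; the column names are taken from the list above, but the exact required set is an assumption:

```python
import csv
import io

# Columns expected in every study, per the issue above; adjust as needed.
REQUIRED_FIELDS = {'ID', 'BioSample ID', 'Project ID', 'Platform',
                   'Instrument', 'Bases', 'Bytes'}

def missing_required_fields(tsv_text: str) -> list:
    """Return required columns absent from the TSV header (empty if valid)."""
    header = next(csv.reader(io.StringIO(tsv_text), delimiter='\t'), [])
    return sorted(REQUIRED_FIELDS - set(header))
```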

`get-sequences` produces variable output

If I run get-sequences multiple times with the same ProjectID, I get a differing number of sequence files with each run.

Steps to reproduce:

  • Create TSV file containing the ProjectID: PRJEB30327 in test_issue.tsv
  • Run the following bash script (for 7 iterations takes approx. 20 min):
#!/bin/bash

for i in {1..7};
do
  qiime fondue get-sequences \
        --m-accession-ids-file test_issue.tsv \
        --p-email [email protected] \
        --p-retries 10 \
        --output-dir "test_proj$i"

  file="test_proj$i/single_reads.qza"
  if [[ -f "$file" ]]; then
    qiime tools extract \
      --input-path "$file" \
      --output-path "test_proj$i/single_extract"

    find "test_proj$i/single_extract"/*/data -type f | wc -l
  else
    echo "0"
  fi
done
  • Observe the output.

Expected behaviour:
Every time the command is run, we expect the same number of sequence (plus metadata.yml) files to be fetchable from SRA.

Actual behaviour:
Every time the command is run, the number of sequence (plus metadata.yml) files differs. In my case, the above script returned the following counts:
42, 47, 47, 0, 71, 89, 92

Downgrade missing sequence types warning to info

With the --verbose flag, a call to get-all generates the user warning "No paired-read sequences available for these accession IDs." As a user, I find the WARNING status confusing when nothing is wrong per se. Could we downgrade this to INFO?

Sample command and output:

$ qiime fondue get-all --i-accession-ids toy-multi-ids.qza --p-email <> --p-n-jobs 4 --output-dir work-multi --verbose
QIIME is caching your current deployment for improved performance. This may take a few moments and should only happen once per deployment.
2022-02-11 11:17:26,567 [MainThread] [INFO] [entrezpy.esearch.esearcher.Esearcher]: {"query": "xtZfY5XUTDuY3d6sNk2pNw==", "status": "OK"}
2022-02-11 11:17:34,073 [MainThread] [INFO] [q2_fondue.sequences]: Downloading sequences for 4 accession IDs...
Downloading sequences for run SRR7871145 (attempt 1): 100%|████████████████████████████████████████████████████████████████| 4/4 [00:49<00:00, 12.43s/it, 0 failed]
2022-02-11 11:18:23,779 [MainThread] [INFO] [q2_fondue.sequences]: Download finished.
/Users/linkim/Documents/Work/Software/q2-fondue/q2_fondue/sequences.py:218: UserWarning: No paired-read sequences available for these accession IDs.
  warn(warn_msg)
2022-02-11 11:18:25,344 [MainThread] [WARNING] [q2_fondue.sequences]: No paired-read sequences available for these accession IDs.
2022-02-11 11:18:25,349 [MainThread] [INFO] [q2_fondue.sequences]: Processing finished.
Saved SRAMetadata to: work-multi/metadata.qza
Saved SampleData[SequencesWithQuality] to: work-multi/single_reads.qza
Saved SampleData[PairedEndSequencesWithQuality] to: work-multi/paired_reads.qza
Saved SRAFailedIDs to: work-multi/failed_runs.qza

Experiment metadata is missing from the final result

Steps to reproduce:

Expected result:

  • all metadata fields available in the Run Browser should be present in the table

Actual result:

  • some metadata are missing (e.g.: Sample Location or Temperature)

Note:
We are not parsing the experiment metadata - check here:

def _create_experiment(self, attributes: dict, sample_id: str) -> str:
    """Creates an SRAExperiment object.

    Information like Experiment ID, platform, instrument and library
    metadata as well as other custom metadata are added here.

    Args:
        attributes (dict): Dictionary with all the metadata from
            the XML response.
        sample_id (str): ID of the sample which the experiment belongs to.

    Returns:
        exp_id (str): ID of the processed study.
    """
    exp_meta = attributes['EXPERIMENT']
    exp_id = exp_meta['IDENTIFIERS'].get('PRIMARY_ID')
    if exp_id not in self.experiments.keys():
        platform = list(exp_meta['PLATFORM'].keys())[0]
        instrument = exp_meta['PLATFORM'][platform].get('INSTRUMENT_MODEL')
        self.experiments[exp_id] = SRAExperiment(
            id=exp_id,
            instrument=instrument,
            platform=platform,
            sample_id=sample_id,
            library=self._extract_library_info(attributes),
            custom_meta=None,
        )

versus run metadata:

custom_meta = self._extract_custom_attributes(run, 'run')

Update diagram on fetching metadata

As a plugin user,
I want to have an overview diagram showing me what is happening under the hood of the get-metadata action.

Note: initial draft for this diagram is in #26

Silent exits in non-verbose mode

If I run get-all (or get-sequences) with an invalid projectID without --verbose, q2-fondue does not indicate whether it succeeded or failed.

Steps to reproduce:

  • Create TSV file with an incorrect ProjectID: PRJEB307x (below PRJEB307x.tsv)
  • Run the following command:
qiime fondue get-all \
        --m-accession-ids-file PRJEB307x.tsv \
        --p-email [email protected] \
        --output-dir output
  • Observe the output.

Expected behaviour:

  • If get-all fails it should print an indication that it failed even in non-verbose mode.

Actual behaviour:

  • Command returns nothing.

Add support for feeding a list of accession ids from a file

As a plugin user,
I want the metadata/sequence-fetching action to accept a file with a list of accession ids as an input parameter,
so that I can easily fetch a large number of datasets with IDs stored in a file.

Acceptance criteria:

  • the action uses either a list of ids or a file containing a list of ids

Note:
See here for an example on how to feed it from a file

Fetching large amounts of runs fails

When using a project ID with many runs (>1000) to fetch metadata, the action fails because it cannot find some of the requested IDs. This happens both when using the project ID and when using the run IDs directly.

Steps to reproduce:

  1. Create a TSV file containing one project ID: PRJEB14186
  2. Run the following command: qiime fondue get-metadata --m-accession-ids-file <ids file> --p-email <your email> --o-metadata ~/metadata.qza --verbose.
  3. Observe the behaviour.

Expected behaviour:
The command returns and the metadata.qza artifact is created.

Actual behaviour:
The command fails with the following error:

Exception in thread Thread-10:
Traceback (most recent call last):
  File "/Users/mziemski/miniconda3/envs/fondue/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/Users/mziemski/miniconda3/envs/fondue/lib/python3.8/site-packages/entrezpy/requester/threadedrequest.py", line 48, in run
    self.run_one_request(request, analyzer)
  File "/Users/mziemski/miniconda3/envs/fondue/lib/python3.8/site-packages/entrezpy/requester/threadedrequest.py", line 71, in run_one_request
    analyzer.parse(response, request)
  File "/Users/mziemski/miniconda3/envs/fondue/lib/python3.8/site-packages/q2_fondue/entrezpy_clients/_efetch.py", line 551, in parse
    self.analyze_result(response, request)
  File "/Users/mziemski/miniconda3/envs/fondue/lib/python3.8/site-packages/q2_fondue/entrezpy_clients/_efetch.py", line 545, in analyze_result
    self.result.add_metadata(response, request.uids)
  File "/Users/mziemski/miniconda3/envs/fondue/lib/python3.8/site-packages/q2_fondue/entrezpy_clients/_efetch.py", line 508, in add_metadata
    current_run = run_ids[uid]
KeyError: 'ERR1428963'

Note: the key in the KeyError is not always the same, it seems like the command only fetches a subset of the metadata... And the subset is not always of the same size (it gets a few hundred entries instead of 778).
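One way to make this robust is to diff the requested IDs against what actually arrived and re-request only the gap. A sketch, where `fetch_batch` stands in for the real efetch call:

```python
# Sketch: re-request only the IDs missing from the fetched metadata until
# everything has arrived or retries are exhausted. `fetch_batch` stands in
# for the real efetch request and returns {run_id: metadata}.
def fetch_until_complete(ids, fetch_batch, max_retries=3):
    metadata = {}
    for _ in range(max_retries):
        missing = [i for i in ids if i not in metadata]
        if not missing:
            break
        metadata.update(fetch_batch(missing))
    still_missing = [i for i in ids if i not in metadata]
    return metadata, still_missing
```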

Filter get-sequences by criteria

As a plugin user,
I would like to instruct get-sequences to fetch only the run IDs from an SRAMetadata file that match certain search criteria (e.g. via a where parameter).

Note: Idea comes from q2-sra's fetch_runs action.
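Modelling the metadata rows as dicts, the where filter could be sketched as a predicate applied per row; the names below are illustrative, not the real API:

```python
# Sketch of a hypothetical `where` filter: select run IDs from metadata
# rows (modelled as dicts) that satisfy a predicate. The real action could
# instead expose a pandas-style query string.
def select_run_ids(rows, where):
    return [row['ID'] for row in rows if where(row)]
```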

Add provenance for used accession IDs

As a plugin user,
I would like all q2-fondue methods to store the individual accession IDs that were used as an input to the method.

Background:
Currently, the methods keep track of the name of the metadata file (e.g. metadata.tsv) but not its content (no content hashes are generated or similar).

Suggested solution:
Create an additional type for metadata that stores the tsv file with the used accession IDs.

Fetching even medium-large datasets takes ages

When fetching tens of runs, one very often needs to wait a long time for q2-fondue to process and save the sequences (even with amplicon data, not to mention (meta)genomes). Looking at the code, I realize there are two main issues with the approach we take within the get-sequences method (and here we were, blaming it on QIIME! 🙈):

  1. All the steps are executed sequentially (download -> pre-process (incl. renaming) -> process (write to final files)).
  2. Within the (pre)-process steps, files are processed one-by-one.

Two main, relatively easy solutions (at least for now) addressing those points could be:

  1. Pre-processing and writing can be executed as soon as the download of a given ID is finished. Since download is not CPU intensive, we can make use of the idling CPUs to start processing the data.
  2. Independent runs do not need to be processed one-by-one - that step can easily be parallelized using a pool of workers.
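Both points can be sketched with a thread pool that runs the full download-then-process pipeline per run, so one run's processing overlaps with the others' downloads; `download` and `process` stand in for the real steps:

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch of points 1 and 2: each run's processing starts as soon as its own
# download finishes, and independent runs proceed in parallel workers.
# `download` and `process` stand in for the real prefetch/rename/write steps.
def fetch_and_process(run_ids, download, process, workers=4):
    def pipeline(run_id):
        return process(download(run_id))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(run_ids, pool.map(pipeline, run_ids)))
```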

Do not fail on invalid IDs

As a plugin user,
I would like get-metadata/get-all methods to not fail when they encounter invalid IDs
so that I can get the metadata for at least the remaining IDs.

Background:
It turns out that some IDs that can be found in literature may be out-of-date as far as SRA is concerned, i.e.: if authors requested data removal from SRA. In such cases, those IDs appear as "suppressed" when searching using the SRA Browser.

It would be much more convenient if the get-metadata and get-all methods returned a list of failed IDs and continued downloading metadata for all the remaining IDs - the user can then deal with the leftover IDs.
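Syntactically malformed IDs could additionally be caught up front with a pattern check before any request is made. The regex below is an assumption covering common SRA/ENA/DDBJ prefixes, not an official specification; suppressed-but-valid IDs would still need the failed-IDs handling described above:

```python
import re

# Assumed pattern for run/sample/experiment/study and BioProject accessions
# from SRA, ENA and DDBJ; illustrative, not exhaustive.
_SRA_ID = re.compile(r'^(?:[SED]R[RSXP]\d+|PRJ(?:NA|EB|DB)\d+)$')

def split_ids(ids):
    """Split input accessions into plausibly valid and clearly invalid ones."""
    valid = [i for i in ids if _SRA_ID.match(i)]
    invalid = [i for i in ids if not _SRA_ID.match(i)]
    return valid, invalid
```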

Make Entrez logging configurable

As a plugin developer/user,
I want the logging employed by all the entrezpy objects to be exposed and configurable,
so that it is easier to debug/so that I can follow progress of the requests using a specific log level.

Note: All of the Entrezpy objects already use loggers from the logging module but it would seem that they need some basic configuration (add stdout handlers with a nice formatter, expose log level to the user). More info here
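Since the log lines elsewhere in this document show loggers named under the entrezpy namespace (e.g. entrezpy.esearch.esearcher.Esearcher), configuring them could be as simple as attaching a handler and level to the parent logger. A sketch, not the plugin's actual setup:

```python
import logging
import sys

# Sketch: entrezpy's loggers are named 'entrezpy.*', so a stdout handler and
# level attached to the parent 'entrezpy' logger affects all of them. The
# formatter mirrors the format seen in the pasted log output.
def configure_entrez_logging(level=logging.INFO):
    logger = logging.getLogger('entrezpy')
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(logging.Formatter(
        '%(asctime)s [%(threadName)s] [%(levelname)s] [%(name)s]: %(message)s'))
    logger.addHandler(handler)
    logger.setLevel(level)
    return logger
```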

Chain sequence and metadata fetching into a pipeline

As a plugin user,
I want one action which chains sequence and metadata fetching,
so that I can obtain both in a single step.

Acceptance criteria:

  • given a list of accession ids, fetches sequences and corresponding metadata using a pipeline
  • outputs sequence artifacts and metadata table

Decouple fasterq-dump from prefetch to make processing of large datasets faster

This enhancement request requires some more evaluation.

In case of large datasets it takes very long for fasterq-dump (blue line in the screenshot below) to process the sequences fetched with prefetch (simply due to the size; red line). As these two actions (prefetch -> fasterq-dump) are currently chained within one function, nothing else can be processed or downloaded before every single sample is done.

Screen Shot 2022-02-25 at 08 43 11

This could be significantly improved by letting prefetch grab everything first and while some data is already available fasterq-dump could start processing that (similarly to how we do it later for the post-processing steps).

No retries in case of space issue

As a plugin user,
I want get-sequences to ignore the indicated retries when it runs into a space exhaustion issue, since in that case there is no point in retrying the sequence fetching multiple times.
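A sketch of a retry loop that treats a full disk as fatal; the names are illustrative, not the plugin's internals:

```python
import errno

# Sketch: retry transient download failures, but give up immediately when
# the disk is full - retrying cannot help there. `fetch` stands in for one
# sequence-fetch attempt.
def fetch_with_retries(fetch, retries=3):
    for attempt in range(retries):
        try:
            return fetch()
        except OSError as exc:
            if exc.errno == errno.ENOSPC or attempt == retries - 1:
                raise
```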

`fasterq-dump` differing output for ProjectID member and runID

Running get-sequences with the ProjectID PRJEB14529 raises a fasterq-dump error for its runID ERR139189 (Experiment A below). But running get-sequences for just this runID ERR139189 succeeds (Experiment B below).

Steps to reproduce:

Experiment A:

  • Create TSV file containing the ProjectID: PRJEB14529 into a "metadata-file"
  • Run the command: qiime fondue get-sequences --m-accession-ids-file <metadata-file> --p-email <your-email> --output-dir <output-loc>
  • Observe the behaviour.

Experiment B:

  • Create TSV file containing the runID: ERR139189 into a "metadata-file"
  • Run the command: qiime fondue get-sequences --m-accession-ids-file <metadata-file> --p-email <your-email> --output-dir <output-loc>
  • Observe the behaviour.

Expected behaviour:

  • Running get-sequences for a runID and the projectID it belongs to should both either succeed or fail.

Actual behaviour:

Command A fails with the following error:

Traceback (most recent call last):
  File "/Users/anjaadamov/opt/anaconda3/envs/fondue/lib/python3.8/site-packages/q2cli/commands.py", line 329, in __call__
    results = action(**arguments)
  File "<decorator-gen-312>", line 2, in get_all
  File "/Users/anjaadamov/opt/anaconda3/envs/fondue/lib/python3.8/site-packages/qiime2/sdk/action.py", line 245, in bound_callable
    outputs = self._callable_executor_(scope, callable_args,
  File "/Users/anjaadamov/opt/anaconda3/envs/fondue/lib/python3.8/site-packages/qiime2/sdk/action.py", line 485, in _callable_executor_
    outputs = self._callable(scope.ctx, **view_args)
  File "/Users/anjaadamov/opt/anaconda3/envs/fondue/lib/python3.8/site-packages/q2_fondue/get_all.py", line 24, in get_all
    seq_single, seq_paired, = get_sequences(
  File "<decorator-gen-549>", line 2, in get_sequences
  File "/Users/anjaadamov/opt/anaconda3/envs/fondue/lib/python3.8/site-packages/qiime2/sdk/action.py", line 245, in bound_callable
    outputs = self._callable_executor_(scope, callable_args,
  File "/Users/anjaadamov/opt/anaconda3/envs/fondue/lib/python3.8/site-packages/qiime2/sdk/action.py", line 391, in _callable_executor_
    output_views = self._callable(**view_args)
  File "/Users/anjaadamov/opt/anaconda3/envs/fondue/lib/python3.8/site-packages/q2_fondue/sequences.py", line 244, in get_sequences
    _run_fasterq_dump_for_all(
  File "/Users/anjaadamov/opt/anaconda3/envs/fondue/lib/python3.8/site-packages/q2_fondue/sequences.py", line 75, in _run_fasterq_dump_for_all
    raise ValueError('{} could not be downloaded with the '
ValueError: ERR139189 could not be downloaded with the following fasterq-dump error returned: 2021-11-11T09:47:39 fasterq-dump.2.9.6 sys: connection not found while validating within network system module - Failed to Make Connection in KClientHttpOpen to 'www.ncbi.nlm.nih.gov:443'
2021-11-11T09:47:39 fasterq-dump.2.9.6 err: invalid accession 'ERR139189'

Command B succeeds with:

Saved SampleData[SequencesWithQuality] to: ERR139189/single_reads.qza
Saved SampleData[PairedEndSequencesWithQuality] to: ERR139189/paired_reads.qza

Note

Could be caused by using an older version of sra-tools==2.9.6 instead of sra-tools==2.10 - see comments here and here

Duplicated metadata keys are not handled correctly

When fetching metadata containing some duplicated keys, the processing fails.

Steps to reproduce:

  • fetch metadata for sample SRR5498984 with qiime fondue get-metadata ...

Expected behaviour:

  • metadata gets fetched without an issue
  • the non-required metadata fields should all be included, with a suffix appended in case of duplicates

Actual behaviour:

  • fetching fails with the following error ("BioSampleModel" is duplicated):
Traceback (most recent call last):
  File "/Users/mziemski/miniconda3/envs/fondue/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/Users/mziemski/miniconda3/envs/fondue/lib/python3.8/site-packages/entrezpy/requester/threadedrequest.py", line 48, in run
    self.run_one_request(request, analyzer)
  File "/Users/mziemski/miniconda3/envs/fondue/lib/python3.8/site-packages/entrezpy/requester/threadedrequest.py", line 71, in run_one_request
    analyzer.parse(response, request)
  File "/Users/mziemski/miniconda3/envs/fondue/lib/python3.8/site-packages/q2_fondue/entrezpy_clients/_efetch.py", line 295, in parse
    self.analyze_result(response, request)
  File "/Users/mziemski/miniconda3/envs/fondue/lib/python3.8/site-packages/q2_fondue/entrezpy_clients/_efetch.py", line 289, in analyze_result
    self.result.add_metadata(response, request.uids)
  File "/Users/mziemski/miniconda3/envs/fondue/lib/python3.8/site-packages/q2_fondue/entrezpy_clients/_efetch.py", line 265, in add_metadata
    self.metadata[uid] = self._process_single_run(
  File "/Users/mziemski/miniconda3/envs/fondue/lib/python3.8/site-packages/q2_fondue/entrezpy_clients/_efetch.py", line 71, in _process_single_run
    processed_meta = self._extract_custom_attributes(
  File "/Users/mziemski/miniconda3/envs/fondue/lib/python3.8/site-packages/q2_fondue/entrezpy_clients/_efetch.py", line 107, in _extract_custom_attributes
    raise DuplicateKeyError(
q2_fondue.entrezpy_clients._efetch.DuplicateKeyError: One of the metadata keys (BioSampleModel) is duplicated.
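The suffix-based fix suggested above could be sketched like this; it is a hypothetical replacement for the raising code in _extract_custom_attributes, not the actual implementation:

```python
# Sketch of the suggested fix: instead of raising DuplicateKeyError, keep
# every occurrence of a duplicated metadata key by appending a numeric
# suffix to the repeats.
def deduplicate_keys(pairs):
    deduped, counts = {}, {}
    for key, value in pairs:
        n = counts.get(key, 0)
        deduped[key if n == 0 else f'{key}_{n}'] = value
        counts[key] = n + 1
    return deduped
```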

Capturing gaierror of entrezpy

As a plugin user,
I want q2-fondue get-all and get-sequences to retry my command when the connection to NCBI via Entrezpy fails (as indicated below with the error) or to capture this error and let me know it was a connection problem.

What q2-fondue get-all returns when connection to NCBI fails:

2021-11-18 10:38:58,153 [MainThread] [INFO] [q2_fondue.metadata]: 2160 missing IDs were found - we will retry fetching those (20 retries left).
Exception in thread Thread-37:
Traceback (most recent call last):
  File "/Users/anjaadamov/opt/anaconda3/envs/fondue/lib/python3.8/urllib/request.py", line 1354, in do_open
    h.request(req.get_method(), req.selector, req.data, headers,
  File "/Users/anjaadamov/opt/anaconda3/envs/fondue/lib/python3.8/http/client.py", line 1256, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/Users/anjaadamov/opt/anaconda3/envs/fondue/lib/python3.8/http/client.py", line 1302, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/Users/anjaadamov/opt/anaconda3/envs/fondue/lib/python3.8/http/client.py", line 1251, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/Users/anjaadamov/opt/anaconda3/envs/fondue/lib/python3.8/http/client.py", line 1011, in _send_output
    self.send(msg)
  File "/Users/anjaadamov/opt/anaconda3/envs/fondue/lib/python3.8/http/client.py", line 951, in send
    self.connect()
  File "/Users/anjaadamov/opt/anaconda3/envs/fondue/lib/python3.8/http/client.py", line 1418, in connect
    super().connect()
  File "/Users/anjaadamov/opt/anaconda3/envs/fondue/lib/python3.8/http/client.py", line 922, in connect
    self.sock = self._create_connection(
  File "/Users/anjaadamov/opt/anaconda3/envs/fondue/lib/python3.8/socket.py", line 787, in create_connection
    for res in getaddrinfo(host, port, 0, SOCK_STREAM):
  File "/Users/anjaadamov/opt/anaconda3/envs/fondue/lib/python3.8/socket.py", line 918, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno 8] nodename nor servname provided, or not known

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/anjaadamov/opt/anaconda3/envs/fondue/lib/python3.8/site-packages/entrezpy/requester/requester.py", line 90, in request
    response = urllib.request.urlopen(urllib.request.Request(req.url,data=data),
  File "/Users/anjaadamov/opt/anaconda3/envs/fondue/lib/python3.8/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/Users/anjaadamov/opt/anaconda3/envs/fondue/lib/python3.8/urllib/request.py", line 525, in open
    response = self._open(req, data)
  File "/Users/anjaadamov/opt/anaconda3/envs/fondue/lib/python3.8/urllib/request.py", line 542, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
  File "/Users/anjaadamov/opt/anaconda3/envs/fondue/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)
  File "/Users/anjaadamov/opt/anaconda3/envs/fondue/lib/python3.8/urllib/request.py", line 1397, in https_open
    return self.do_open(http.client.HTTPSConnection, req,
  File "/Users/anjaadamov/opt/anaconda3/envs/fondue/lib/python3.8/urllib/request.py", line 1357, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [Errno 8] nodename nor servname provided, or not known>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/anjaadamov/opt/anaconda3/envs/fondue/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/Users/anjaadamov/opt/anaconda3/envs/fondue/lib/python3.8/site-packages/entrezpy/requester/threadedrequest.py", line 48, in run
    self.run_one_request(request, analyzer)
  File "/Users/anjaadamov/opt/anaconda3/envs/fondue/lib/python3.8/site-packages/entrezpy/requester/threadedrequest.py", line 66, in run_one_request
    response = self.requester.request(request)
  File "/Users/anjaadamov/opt/anaconda3/envs/fondue/lib/python3.8/site-packages/entrezpy/requester/requester.py", line 104, in request
    self.logger.error(json.dumps({'URL-error':url_err.reason, 'action':'retry'}))
  File "/Users/anjaadamov/opt/anaconda3/envs/fondue/lib/python3.8/json/__init__.py", line 231, in dumps
    return _default_encoder.encode(obj)
  File "/Users/anjaadamov/opt/anaconda3/envs/fondue/lib/python3.8/json/encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/Users/anjaadamov/opt/anaconda3/envs/fondue/lib/python3.8/json/encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
  File "/Users/anjaadamov/opt/anaconda3/envs/fondue/lib/python3.8/json/encoder.py", line 179, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type gaierror is not JSON serializable
Exception in thread Thread-36:
Traceback (most recent call last):
  File "/Users/anjaadamov/opt/anaconda3/envs/fondue/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/Users/anjaadamov/opt/anaconda3/envs/fondue/lib/python3.8/site-packages/entrezpy/requester/monitor.py", line 83, in run
    i.report_status(self.processed_requests, self.expected_requests)
  File "/Users/anjaadamov/opt/anaconda3/envs/fondue/lib/python3.8/site-packages/entrezpy/base/request.py", line 129, in report_status
    self.logger.debug(json.dumps({'status': self.dump_internals()}))
  File "/Users/anjaadamov/opt/anaconda3/envs/fondue/lib/python3.8/json/__init__.py", line 231, in dumps
    return _default_encoder.encode(obj)
  File "/Users/anjaadamov/opt/anaconda3/envs/fondue/lib/python3.8/json/encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/Users/anjaadamov/opt/anaconda3/envs/fondue/lib/python3.8/json/encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
  File "/Users/anjaadamov/opt/anaconda3/envs/fondue/lib/python3.8/json/encoder.py", line 179, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type gaierror is not JSON serializable
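The final TypeError arises because entrezpy hands url_err.reason (here a socket.gaierror) straight to json.dumps. Stringifying non-primitive reasons before serializing would sidestep it; a sketch, not entrezpy's actual code:

```python
import json
import socket

# Sketch: `url_err.reason` can be a socket.gaierror, which json.dumps cannot
# serialize. Converting non-primitive reasons to strings avoids the
# TypeError shown above.
def serializable(reason):
    if isinstance(reason, (str, int, float, bool, type(None))):
        return reason
    return str(reason)

err = socket.gaierror(8, 'nodename nor servname provided, or not known')
payload = json.dumps({'URL-error': serializable(err), 'action': 'retry'})
```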

FIX: `make dev` requires manual approval to reinstall all packages

When running make dev with an existing sra-tools installation, the current install-sra-tools.sh script asks, for every package in the installation, whether it should overwrite it.

Steps to reproduce:
Install sra-tools as described in the ReadMe and run make dev afterwards in the same environment.

Expected behaviour:
The script should ask once whether it should overwrite all required packages [y/n], or only run pip install -e . without reinstalling sra-tools from scratch.

Actual behaviour:
The script asks for every package whether it should overwrite it [y/n].

Raise error when no sequences present

It appears that when neither paired- nor single-read sequences were saved as artifacts, the action (get-sequences or get-all) still returns successfully (since format validation passes even on empty fastq files). It should, however, fail when there are no files in either of the two.

Note: I'm not sure what caused this behaviour - still investigating. Will add steps to reproduce when found. Regardless of that, I think this behaviour should not be expected.
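The missing check could be sketched over a mapping of fetched fastq paths to their sizes; the names are illustrative, not the plugin's internal representation:

```python
# Sketch: decide whether any non-empty fastq file was actually fetched.
# `files` maps a fastq path to its size in bytes (illustrative only).
def has_sequences(files: dict) -> bool:
    return any(size > 0 for path, size in files.items()
               if path.endswith(('.fastq', '.fastq.gz')))

def ensure_sequences_present(single: dict, paired: dict):
    if not (has_sequences(single) or has_sequences(paired)):
        raise ValueError('Neither single- nor paired-end sequences were fetched.')
```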

ENH: fetch publication metadata with `get-metadata`

As a plugin user, I would like to also fetch publication metadata when I use get-metadata, so that I know about any publications linked to the studies and samples, i.e., citation information would also be fetched and preserved alongside the other (meta)data. This would significantly improve traceability.

I would like to grab any Pubmed ID(s) linked to the BioProject ID. If possible, this could be linked additionally to a DOI (as a separate metadata column).

I imagine that we could either (a) embed the citation directly in provenance or (b) just retrieve the pubmed ID/doi and place it in the metadata file for easy parsing later.

PubMed ID appears to be one of the optional metadata categories, and even searching BioProjects by PubMed ID is possible in SRA.

This could also be part of a separate action if PubMed ID needs to be fetched via a separate query.

get-metadata KeyError & hangs

Running get-metadata with project ID PRJEB5482 prints a KeyError and hangs indefinitely.

Steps to reproduce:

  • Create TSV file containing the project ID: PRJEB5482
  • Run the following command: qiime fondue get-metadata --m-accession-ids-file <ids file> --p-email <your email> --o-metadata ~/metadata.qza --verbose
  • Observe the behaviour.

Expected behaviour:
Command returns and metadata.qza artifact is created.

Actual behaviour:
Command prints below error and hangs indefinitely:

Exception in thread Thread-10:
Traceback (most recent call last):
  File "/Users/anjaadamov/opt/anaconda3/envs/fondue-new/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/Users/anjaadamov/opt/anaconda3/envs/fondue-new/lib/python3.8/site-packages/entrezpy/requester/threadedrequest.py", line 48, in run
    self.run_one_request(request, analyzer)
  File "/Users/anjaadamov/opt/anaconda3/envs/fondue-new/lib/python3.8/site-packages/entrezpy/requester/threadedrequest.py", line 71, in run_one_request
    analyzer.parse(response, request)
  File "/Users/anjaadamov/Documents/projects/05_q2fondue/q2-fondue/q2_fondue/entrezpy_clients/_efetch.py", line 513, in parse
    self.analyze_result(response, request)
  File "/Users/anjaadamov/Documents/projects/05_q2fondue/q2-fondue/q2_fondue/entrezpy_clients/_efetch.py", line 507, in analyze_result
    self.result.add_metadata(response, request.uids)
  File "/Users/anjaadamov/Documents/projects/05_q2fondue/q2-fondue/q2_fondue/entrezpy_clients/_efetch.py", line 475, in add_metadata
    self.metadata[i] = self._process_single_id(
  File "/Users/anjaadamov/Documents/projects/05_q2fondue/q2-fondue/q2_fondue/entrezpy_clients/_efetch.py", line 375, in _process_single_id
    sample_ids = self._create_samples(attributes, study_id)
  File "/Users/anjaadamov/Documents/projects/05_q2fondue/q2-fondue/q2_fondue/entrezpy_clients/_efetch.py", line 187, in _create_samples
    pool_meta = attributes['Pool'].get('Member')
KeyError: 'Pool'

sample_id param name may be confusing

All of the functions take the sample_id parameter as input. What we have effectively been using so far are run IDs rather than sample IDs (see #24), meaning that we are actually fetching runs, not samples.

We should rename that param to make it less confusing. I'd propose naming it something non-specific, e.g. accession_id - in another PR (#26) I'll be introducing a step that should recognize which kind of ID was given.

Sra-tools do not work when installed via conda

For some time now, when installing sra-tools through conda (as described in the readme), none of the CLI tools actually work - all throw some mysterious certificate error.

Steps to reproduce:

  1. Follow q2-fondue’s installation instructions to create a fresh env
  2. After activation, just execute prefetch -v ERR1428207 and observe the error

Expected behaviour:
Data for the provided ID is downloaded.

Actual behaviour:
An error is thrown: The certificate is not correctly signed by the trusted CA.

Notes:

  • this does not happen on Ubuntu or CentOS (only Mac: Catalina, Big Sur and Monterey)
  • when using the binaries provided directly by NCBI everything seems ok
  • SRA Tools' support says that it's best to use their latest versions (not available via conda) and that new versions are only "sometimes" updated on conda
  • all that means we need to (at least for now) drop the conda-installed version and install sra-tools manually; otherwise macOS users run into issues

Prepare plugin skeleton

As a q2-fondue developer,
I want all the standard plugin skeleton elements,
so that I can start developing the actual actions/functionality without worrying about the boilerplate code.

Acceptance criteria:

  • plugin is installable within existing QIIME 2 installation
  • CI is set up

Action to estimate `get-sequences` space requirement

As a plugin user,
I would like q2-fondue to inform me roughly how much space is needed to download the sequences with get-sequences for the accession IDs in my --m-accession-ids-file, so that I can compare it with the space available in my Q2 TMPDIR.

Suggested implementation approach:

  • Can be a separate action estimate-space-req that makes use of vdb-dump --info from sra-tools, sums the space requirements over all run IDs and multiplies by a factor of 8 or 10 (since, according to sra-tools' wiki, "As a rule of thumb you should have about 8x … 10x the size of the accession available on your filesystem.")
  • estimate-space-req could be integrated into the get-all pipeline, to run before get-sequences.
  • The readme and/or tutorial could include a note suggesting that users change the TMPDIR location to one with more space if the currently available space would be exceeded.

Add citations to all methods

As a plugin user,
I want all the methods to have their relevant citations included,
so that I can give credit to whomever it should be given.

Note: This should be added here and properly referenced here.

Action to scrape papers for IDs

As a plugin user,
I want to have an action that allows me to scrape a collection of scientific papers for accession IDs which can be passed to the other q2fondue actions.

Save fetched sequences before space issue

As a plugin user,
I would like get-sequences and get-all to exit and return the previously fetched sequences shortly before they run out of space and fail without returning anything (OSError: [Errno 28] No space left on device). This could be implemented by terminating the q2-fondue execution when nearly all available storage (e.g. 95%) is used up.

Note: This can be tested with project ID PRJEB3079, which requires 230-270 GiB to be downloaded.
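A minimal sketch of such a guard using only the standard library (the function name and the 95% threshold are illustrative, not part of the plugin):

```python
import shutil

def storage_nearly_full(path: str, threshold: float = 0.95) -> bool:
    """Return True when more than `threshold` of the filesystem
    holding `path` is in use - a signal to stop fetching and
    return the sequences downloaded so far."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total > threshold

# inside the fetch loop (illustrative):
# if storage_nearly_full(tmp_dir):
#     break  # return already-fetched sequences instead of crashing
```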

Prepare the repo for release

We should clean up the repo to make it ready for the first release (mostly: remove meeting notes and clean up the issues).

Fetching sequences fails sometimes when sra-tools are not configured properly

Just opening this issue to have this documented in one place as it's not necessarily something we can "fix".

As @adamovanja pointed out elsewhere, sometimes fetching sequences fails with an invalid accession ... error from fasterq-dump and it seems to be happening only with the latest version of q2-fondue. After some investigation, my impression is that this is caused by the configuration of sra-tools that is performed using the vdb-config tool. So there are two scenarios:

  1. prefetch's download location is set to "current directory" (this is the default option): prefetch manages to download everything as expected but fasterq-dump fails (who knows why, I couldn't really find that out)
  2. prefetch's download location is set to "user-repository" and the repository value is set (tab "Cache"): sequences are fetched correctly (note that if the repository is not set, the download will likely fail)

This behaviour is observed on sra-tools version 2.11.0 (currently available via conda). When using the latest version of the toolkit (2.13.0) I have not observed the same issue: the downloads seemed to succeed regardless of the repository settings.

To set the repo location one can use vdb-config in interactive mode or just execute these two commands:

vdb-config -s "/repository/user/main/public/root=<your cache location>"
vdb-config --prefetch-to-user-repo

Proposed solution:
If someone else can reproduce this, I would say we should just add a section at the beginning of the README/tutorial saying that users should run the configuration tool after installing q2-fondue, and what they need to set where. Once a newer toolkit version becomes available on conda we can upgrade, and that should solve the issue.

Implement metadata fetching action

As a plugin user,
I want a get-metadata action for fetching SRA metadata,
so that I can easily obtain a table with metadata for all my sequences.

Acceptance criteria:

  • given a list of SRA accession numbers, fetches all metadata
  • outputs a single table with metadata from all sequences concatenated together
  • uses entrezpy to interact with Entrez

Download of large samples is very slow

When downloading sequences for some samples (e.g., metagenomes), the download is very slow compared to smaller datasets.

Steps to reproduce:
Try to fetch sequences for ID ERR1700893 and observe the time it takes.

Expected behaviour:
The size of this dataset is approx. 28 GB - fetching it should be a matter of half an hour to an hour (depending on connection speed).

Actual behaviour:
It takes hours (don't know exactly, didn't wait for it to finish).

The problem is that for large datasets prefetch silently fails, as the default maximum allowed size is 20 GB. fasterq-dump then takes over, but it is much slower. This can easily be fixed by setting prefetch's max-size param to unlimited, allowing downloads of any size. See here for some more info.

Use retmax param to better control fetching data

As a plugin developer,
I want the retmax param on efetch queries to be set to a fixed value
so that metadata can be fetched reliably without failing when too many records are requested.

Note: that should probably be set somewhere around here:

metadata_response = efetcher.inquire(
    {
        'db': 'sra',
        'id': run_ids,
        'rettype': 'xml',
        'retmode': 'text'
    }, analyzer=EFetchAnalyzer(log_level)
)

and this may need adjusting to loop over batches of IDs:

def _execute_efetcher(email, n_jobs, run_ids, log_level):
    efetcher = ef.Efetcher(
        'efetcher', email, apikey=None,
        apikey_var=None, threads=n_jobs, qid=None
    )
    set_up_entrezpy_logging(efetcher, log_level)
    meta_df, missing_ids = _efetcher_inquire(efetcher, run_ids, log_level)
    return meta_df, missing_ids

This is inspired by the approach used here: https://github.com/ebolyen/q2-sra/blob/eea38c3750aa051ed1cecb1f2b7220d42ef1f2d5/q2_sra/lib/efetch.py#L57

Edit GHA Python Version

The current Python version in the GitHub Actions workflow is 3.6 (see q2-fondue/.github/workflows/ci.yml); it should be changed to 3.8, as that is the version used by Q2 2021.8dev.
