ebi-ait / geo_to_hca

Tools to assist in the automatic conversion of GEO metadata into HCA metadata standard

Python 99.64% Makefile 0.36%

geo_to_hca's Issues

Bug when showing paper titles

Description of the bug

When using the tool to generate a spreadsheet, if the paper is not found immediately, the tool searches EuropePMC and asks for manual confirmation. Instead of showing the paper title, it shows only the abbreviated journal name, which is not enough to determine whether the publication is the right one.

Steps to replicate the bug

1. Go to the repo in your local branch
2. Run the script with the following command: python3 apps/geo_to_hca.py --accession GSE158702
3. You'll get the following output:

.
.
.
Getting project metadata
No publication for project PRJNA666217 was found: searching project title in EuropePMC
project title is: Spatiotemporal Analysis of Human Intestinal Development at Single Cell Resolution - scRNA-Seq
A publication title has been found: Nucleic Acids Res.
Is this the publication title associated with the GEO accession? [y/n]: n
An alternative publication title has been found: Brief Funct Genomics.
Is this the publication title associated with the GEO accession? [y/n]: n
An alternative publication title has been found: Front Mol Biosci.
Is this the publication title associated with the GEO accession? [y/n]: n
no publication results for project title in ENA
project name is Spatiotemporal Analysis of Human Intestinal Development at Single Cell Resolution - scRNA-Seq:
A publication title has been found: Nucleic Acids Res.
Is this the publication title associated with the GEO accession? [y/n]: n
An alternative publication title has been found: Brief Funct Genomics.
Is this the publication title associated with the GEO accession? [y/n]: n
An alternative publication title has been found: Front Mol Biosci.
Is this the publication title associated with the GEO accession? [y/n]: n
no publication results for project name in ENA
.
.
.

Expected behaviour
The tool should output the title of the paper (along with the short name of the journal if needed). The journal name alone is not enough.
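
A minimal sketch of the expected prompt, assuming each EuropePMC search result is available as a dictionary with `title` and `journalTitle` keys. The function name `format_publication_prompt` is hypothetical and does not come from the repository; the key names should be checked against the response the tool actually receives.

```python
def format_publication_prompt(result: dict) -> str:
    """Build the confirmation message from one search result.

    Assumes `result` carries the paper title under "title" and the
    abbreviated journal name under "journalTitle" (assumed key names).
    """
    title = (result.get("title") or "").strip()
    journal = (result.get("journalTitle") or "").strip()
    if title and journal:
        # Show the title first, with the journal only as supporting detail.
        return f"A publication title has been found: {title} ({journal})"
    if title:
        return f"A publication title has been found: {title}"
    # Fall back to the journal name only when no title is available.
    return f"A publication was found without a title (journal: {journal or 'unknown'})"
```

With this shape, the wrangler sees the paper title in every prompt and the journal name only as context.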

[FEATURE] Add INSDC project and study accessions to Project tab

As a data wrangler I'd like to see the project and study accessions added to the project tab automatically. I believe these should be fairly easy to retrieve programmatically.

Bear in mind that project and study are currently switched in our metadata until the latest dev branch is merged to master.

Add 'GSM' sample ids to 'specimen from organism' and 'cell suspension' tabs

The GSM sample ids (GEO accessions) are required for the conversion of HCA metadata to SCEA format. Therefore, adding the GSM sample ids to the HCA spreadsheet would facilitate the automatic generation of SCEA metadata.

[FEATURE] Add functionality to create one combined spreadsheet rather than separate ones

Currently, when you specify a list of accessions, you get one spreadsheet per accession rather than one combined spreadsheet. It would be nice if the tool produced a single combined spreadsheet, to avoid copying and pasting information into one by hand.
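
One way to picture the combined output, as a sketch only: assume each per-accession spreadsheet is held in memory as a dictionary mapping tab name to a list of row dictionaries. The function `combine_workbooks` and this in-memory layout are hypothetical, not the repository's own data model.

```python
from collections import defaultdict


def combine_workbooks(workbooks):
    """Merge per-accession workbooks into one, tab by tab.

    Each workbook is assumed to be {tab_name: [row_dict, ...]}.
    Rows keep their original order, accession after accession.
    """
    combined = defaultdict(list)
    for workbook in workbooks:
        for tab_name, rows in workbook.items():
            combined[tab_name].extend(rows)
    return dict(combined)
```

Tabs that appear in only some of the workbooks are kept, so no metadata is dropped when accessions differ in content.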

issue installing sra requirements with pip and pip3

Hi :-)
I'm having some issues with the following command:

pip3 install -r requirements.txt --upgrade
I'm getting this error:

ERROR: Could not find a version that satisfies the requirement sra-tools (from -r requirements.txt (line 6)) (from versions: none)
ERROR: No matching distribution found for sra-tools (from -r requirements.txt (line 6))

I get this error message when using both pip and pip3.
Any idea what's causing this? I've already cloned the git repository to my home computer. Are there any other dependencies I should be aware of?

Add option to retrieve metadata in the absence of an SRA study accession

Currently the geo_to_hca script does not work when a GEO accession is provided but an SRA study accession does not exist for the project. This is also true if there are no experiment accessions, which is the case when the sequencing data is not openly available. This is a ticket for the addition of an option to handle projects with no SRA study accession or experiment accessions i.e. to pre-fill the project and sample metadata using a project accession and biosamples accessions only.

Address missing fastq file names in apps/geo_to_hca.py script

Description of the issue

Currently, fastq file names cannot be found for certain SRA study IDs. This is either because (1) the files are available only in bam format or, more frequently, (2) the files are stored as an SRA object, which would require downloading and converting to fastq format using an NCBI SRAdb script. Although this does not cause an error, it means the dataset is not (immediately) suitable. To solve this issue, I would like to:

Meet with the developers (and other wranglers) to discuss getting fastq file names from an SRA object using an SRA study id. This is so we can integrate the approach with how they would prefer to fetch the raw data files for upload which at some point they will be required to do.
Incorporate a function to get the fastq file names using the SRAdb fastq-dump script and/or feedback from the developers.
Investigate whether fastq file names can instead be found in the ENA in cases where file names are missing.
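
For the ENA investigation in the last item, a rough sketch follows. It assumes the ENA Portal API file report endpoint and its `run_accession` and `fastq_ftp` fields behave as described in the ENA documentation; the helper names are hypothetical, and the request itself is left to the caller so the parsing can be tried offline.

```python
import csv
import io
from urllib.parse import urlencode

# Assumed endpoint; verify against the current ENA Portal API documentation.
ENA_FILEREPORT = "https://www.ebi.ac.uk/ena/portal/api/filereport"


def build_filereport_url(study_accession):
    """Build a file report URL asking for fastq locations per run."""
    query = urlencode({
        "accession": study_accession,
        "result": "read_run",
        "fields": "run_accession,fastq_ftp",
        "format": "tsv",
    })
    return f"{ENA_FILEREPORT}?{query}"


def parse_fastq_names(tsv_text):
    """Map each run accession to its fastq file names.

    Assumes the fastq_ftp column holds semicolon-separated paths and is
    empty when ENA has no fastq files for the run.
    """
    names = {}
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        urls = [u for u in (row.get("fastq_ftp") or "").split(";") if u]
        names[row["run_accession"]] = [u.rsplit("/", 1)[-1] for u in urls]
    return names
```

Runs that come back with an empty list are the ones that would still need the SRA-object route discussed above.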

Add parser to apps/geo_to_hca.py script

Description of the issue:

Currently, the script apps/geo_to_hca.py takes predefined parameters as inputs and outputs. I would like to add a parser with default options to give the user the ability to:

Choose a template [Default:docs/hca_template.xlsx]
- (Optional): Choose a header row (e.g. in HCA metadata 4 is the header row with the programmatic names) [Default: 4]
- (Optional): Choose an initial metadata input row (e.g. in HCA metadata, 6 is the row number where you begin inputting the metadata on each sheet) [Default: 6]
Choose Output directory: Choose the output folder for the spreadsheets [Default: spreadsheets/]
Choose file with accessions [No Default]
- (Optional): Have another option to add space-separated accessions manually and merge them with the accessions obtained from
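
The options listed above could be expressed with `argparse` roughly as follows. Only `--accession` appears in the command shown elsewhere on this page; every other flag name here is a hypothetical choice, and the defaults are the ones stated in the list.

```python
import argparse


def build_parser():
    """Command-line parser with the defaults proposed in this ticket."""
    parser = argparse.ArgumentParser(
        description="Convert GEO metadata into an HCA metadata spreadsheet."
    )
    parser.add_argument("--template", default="docs/hca_template.xlsx",
                        help="Template spreadsheet to fill in.")
    parser.add_argument("--header_row", type=int, default=4,
                        help="Row holding the programmatic names.")
    parser.add_argument("--input_row", type=int, default=6,
                        help="First row where metadata is written.")
    parser.add_argument("--output_dir", default="spreadsheets/",
                        help="Folder for the generated spreadsheets.")
    parser.add_argument("--accession_list", default=None,
                        help="File with one accession per line (no default).")
    parser.add_argument("--accession", nargs="+", default=[],
                        help="Space-separated accessions given manually.")
    return parser


def merge_accessions(from_file, manual):
    """Merge both accession sources, dropping blanks and duplicates."""
    merged = []
    for accession in list(from_file) + list(manual):
        accession = accession.strip()
        if accession and accession not in merged:
            merged.append(accession)
    return merged
```

For example, `build_parser().parse_args(["--accession", "GSE158702"])` keeps every default and yields a one-element accession list.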

Add functionality for pre-filling specific fields according to Drop-seq and Smart-seq2 technologies

Add functionality for pre-filling specific fields according to Drop-seq and Smart-seq2 technologies. Currently, this is done for 10X datasets only.

Sort spreadsheet tabs by the same accession ids

Currently, the tabs generated in the HCA spreadsheet are each sorted differently, which can result in confusion and errors during manual curation. The script should be updated so that all the dataframes/tabs are sorted by the same list of accession ids.
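
As an illustration of a shared ordering, the sketch below sorts the rows of any tab against one reference list of accession ids. The function name, the `key` parameter and the row-as-dictionary layout are assumptions made for the example.

```python
def sort_rows_by_accession(rows, accession_order, key="accession"):
    """Sort rows to follow one shared list of accession ids.

    Rows whose accession is not in the list go to the end, keeping
    their original relative order (sorted() is stable).
    """
    rank = {accession: i for i, accession in enumerate(accession_order)}
    return sorted(rows, key=lambda row: rank.get(row[key], len(rank)))
```

Applying the same `accession_order` to every tab keeps the rows aligned across tabs.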

Fix issue with spreadsheet names not showing correctly

Description of the issue:
When dealing with a Superseries, or more than one accession in a line, the Superseries ID is returned without the GSE prefix. As a result, the spreadsheets are named with only the ID, without the prefix.

Impact/Effort estimation
Low/Low
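
A small sketch of one possible fix: normalise the identifier before it is used in the spreadsheet name. The helper `with_gse_prefix` is hypothetical.

```python
def with_gse_prefix(accession: str) -> str:
    """Return the accession with exactly one upper-case GSE prefix."""
    accession = accession.strip()
    if accession.upper().startswith("GSE"):
        # Already prefixed: normalise the prefix case and keep the rest.
        return "GSE" + accession[3:]
    return f"GSE{accession}"
```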

Link protocol ids to cell suspension ids in geo_to_hca.py script

Description of issue

Currently, the library_preparation_protocol_id and sequencing_protocol_id associated with each cell_suspension_id are not automatically added to the Sequencing file tab.

I would like to write a function which links the protocol ids to the cell_suspension ids and inputs the relevant values into the Sequencing file tab.
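
The linking function might look like the sketch below. It assumes the Sequencing file rows are dictionaries with a `cell_suspension_id` key and that a lookup from cell suspension id to its two protocol ids has already been built; `link_protocols` and the lookup layout are hypothetical.

```python
def link_protocols(sequencing_file_rows, protocols_by_suspension):
    """Add protocol ids to each Sequencing file row.

    protocols_by_suspension maps a cell_suspension_id to a tuple of
    (library_preparation_protocol_id, sequencing_protocol_id).
    Unknown cell suspensions get empty values rather than a guess.
    """
    linked = []
    for row in sequencing_file_rows:
        library_id, sequencing_id = protocols_by_suspension.get(
            row["cell_suspension_id"], ("", "")
        )
        linked.append({
            **row,
            "library_preparation_protocol_id": library_id,
            "sequencing_protocol_id": sequencing_id,
        })
    return linked
```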

Dissociation protocol tab is missing from output spreadsheet

Currently, when a GEO accession is given as input and a metadata spreadsheet is generated, the dissociation protocol tab is missing from the spreadsheet. This might be because it is missing from the default input template spreadsheet, or it may be a bug in the code. It is not a high-priority problem, as this tab can be added manually, and automating conversion of dissociation protocol information is not really possible, at least currently, as it almost always requires manual extraction from the publication text.

Update repository documentation

Description of the issue:

I would like to update the documentation in this repository:

Create a Readme.md under the Apps folder
Modify existing Readme.md at root:
- Move known issues to tickets of their own
- More detailed description of how to run the application (Prerequisites, installation, running)

[FEATURE] format and input authors for the publication tab

As a data wrangler I would like to have the converter put the authors into the publication and contributors tab as it is fairly time consuming and error prone to do it manually.

The author information may be harvested programmatically from an API if not totally within GEO, happy to help figure out how.

Add function to detect correct accession to use in a Superseries

Currently, some GEO accessions that are given as input to the geo_to_hca script are not single accessions but are part of a superseries or a superseries accession itself.

We need a function which will assess if an accession is part of a superseries. This would need to be interactive for the user, because some accessions within a superseries are not strictly single cell RNA sequencing experiments, whereas others are. In some cases all sub-series are, but they are not necessarily all aggregated together. The user needs to indicate which accessions should be used.

Currently there is a function to deal with comma-separated sub-series within a superseries, but this needs to be checked for accuracy and it needs to be extended.
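
A possible shape for the detection and the interactive choice is sketched below. It assumes the GEO series record exposes relation lines of the form `!Series_relation = SuperSeries of: GSE...`, which should be verified against real GEO output; both function names are hypothetical.

```python
import re

# Assumed format of a superseries relation line in the GEO series record.
SUBSERIES_PATTERN = re.compile(r"SuperSeries of:\s*(GSE\d+)")


def find_subseries(relation_lines):
    """Return the sub-series accessions listed for a superseries."""
    found = []
    for line in relation_lines:
        match = SUBSERIES_PATTERN.search(line)
        if match and match.group(1) not in found:
            found.append(match.group(1))
    return found


def choose_subseries(subseries, ask=input):
    """Ask the user which sub-series to convert; keep the 'y' answers."""
    chosen = []
    for accession in subseries:
        answer = ask(f"Use sub-series {accession}? [y/n]: ").strip().lower()
        if answer == "y":
            chosen.append(accession)
    return chosen
```

An empty result from `find_subseries` would indicate that the accession is not a superseries, in which case the prompt can be skipped.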

Ensure all modules are in the template spreadsheet

When converting a recent dataset I realised that 'plate-based sequencing' module is not in the template, this ticket is to systematically check which modules are missing to ensure we don't miss any metadata.

[BUG] Write the correct lane_index and bundling

Description of the issue:
The only use the HCA has for lane_index is to distinguish files within a bundle that should be analysed together (e.g. R1 and R2 of a run). Up until now, the lane index has been filled based on the filename, but this might not give a valid value in some scenarios (e.g. GSE135893).

While in the process of investigating this, another issue arose: For the same dataset, the converter created 228 different bundles, but the experiment consists of only 34 experiment accessions.

Expected behaviour:

Creates a single bundle per experiment accession
Lane indexes are assigned incrementally based on the number of runs in an experiment.
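
The two expectations above can be sketched as follows, assuming each file row carries `experiment_accession` and `run_accession` keys. The function name and the `bundle_id` key are hypothetical, and this is not how the current script works.

```python
def assign_lane_indexes(file_rows):
    """Give every file a bundle and a lane index.

    One bundle per experiment accession; within an experiment, each run
    gets the next lane index (1, 2, ...), so R1 and R2 of the same run
    share a lane index.
    """
    last_index = {}   # experiment accession -> highest lane index used
    lane_by_run = {}  # (experiment, run) -> lane index
    assigned = []
    for row in file_rows:
        experiment = row["experiment_accession"]
        run = row["run_accession"]
        if (experiment, run) not in lane_by_run:
            last_index[experiment] = last_index.get(experiment, 0) + 1
            lane_by_run[(experiment, run)] = last_index[experiment]
        assigned.append({
            **row,
            "bundle_id": experiment,
            "lane_index": lane_by_run[(experiment, run)],
        })
    return assigned
```

With this grouping, a dataset with 34 experiment accessions yields 34 bundles regardless of how many runs or files each experiment has.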

Current behaviour:

Not really sure how the current script creates bundles
Lane indexes are filled using the filename

Effort/Value:
Low/Medium - This issue is not a blocker but takes time from the wrangling process. Correcting it should only take a couple lines of code.

Lint repository code to abide by PEP 8

Description of the issue:

Because the code has been written over time by different people, the code under the apps folder doesn't follow the PEP 8 style guide. I would like to take some time to lint the code and make it consistent.

Library preparation incorrect when converting GSE130708

I recently converted the accession GSE130708. In the 'Library preparation protocol' tab it created a library prep with the method 'Drop-seq' and pre-filled barcode lengths, even though it is a 10X 3' v2 library prep.

I am guessing this is some kind of default but I think if the library preparation method can't be determined from the GEO metadata then these fields should be left blank as it is very easy to accidentally keep a value that is not correct for the actual library prep method that was used.
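
A sketch of the suggested behaviour, assuming the method is detected by keyword matching on free text taken from the GEO record. The keyword table and the function name are illustrative assumptions; the point is only that an undetermined method yields a blank value rather than a default.

```python
# Illustrative keyword table (assumption): lower-case keyword -> method label.
METHOD_KEYWORDS = (
    ("10x", "10X"),
    ("smart-seq2", "Smart-seq2"),
    ("drop-seq", "Drop-seq"),
)


def detect_library_prep_method(geo_text: str) -> str:
    """Return a method label, or an empty string if none can be determined."""
    text = geo_text.lower()
    for keyword, method in METHOD_KEYWORDS:
        if keyword in text:
            return method
    # No default: leave the field blank so nothing incorrect is kept by accident.
    return ""
```

Method-specific fields such as barcode lengths would then be pre-filled only when the returned label is not empty.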
