ebi-ait / geo_to_hca
Tools to assist in the automatic conversion of GEO metadata into HCA metadata standard
Description of the bug
When using the tool to generate a spreadsheet, if the paper is not found immediately, the tool searches EuropePMC and asks for manual input. Instead of showing the paper title, it shows only the short name of the journal, which is not enough to determine whether the publication is the right one.
Steps to replicate the bug
python3 apps/geo_to_hca.py --accession GSE158702
.
.
.
Getting project metadata
No publication for project PRJNA666217 was found: searching project title in EuropePMC
project title is: Spatiotemporal Analysis of Human Intestinal Development at Single Cell Resolution - scRNA-Seq
A publication title has been found: Nucleic Acids Res.
Is this the publication title associated with the GEO accession? [y/n]: n
An alternative publication title has been found: Brief Funct Genomics.
Is this the publication title associated with the GEO accession? [y/n]: n
An alternative publication title has been found: Front Mol Biosci.
Is this the publication title associated with the GEO accession? [y/n]: n
no publication results for project title in ENA
project name is Spatiotemporal Analysis of Human Intestinal Development at Single Cell Resolution - scRNA-Seq:
A publication title has been found: Nucleic Acids Res.
Is this the publication title associated with the GEO accession? [y/n]: n
An alternative publication title has been found: Brief Funct Genomics.
Is this the publication title associated with the GEO accession? [y/n]: n
An alternative publication title has been found: Front Mol Biosci.
Is this the publication title associated with the GEO accession? [y/n]: n
no publication results for project name in ENA
.
.
.
Expected behaviour
The tool should output the title of the paper (along with the short name of the journal if needed). The journal name alone is not enough.
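A minimal sketch of the expected prompt, assuming each EuropePMC search hit is available as a dictionary with `title` and `journalTitle` keys. Those key names and the helper `format_publication_prompt` are assumptions for illustration, not the tool's actual code.

```python
def format_publication_prompt(hit: dict) -> str:
    """Build the confirmation prompt from one search hit (hypothetical helper)."""
    # Assumed keys: "title" for the paper title, "journalTitle" for the journal.
    title = (hit.get("title") or "").strip() or "<title not available>"
    journal = (hit.get("journalTitle") or "").strip()
    # Show the paper title first; append the journal only when it is known.
    label = f"{title} ({journal})" if journal else title
    return (
        f"A publication title has been found: {label}\n"
        "Is this the publication title associated with the GEO accession? [y/n]: "
    )
```

With this shape, the wrangler sees the paper title and can answer the y/n question without looking the journal up separately.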
As a data wrangler I'd like to see the project and study accessions added to the project tab automatically. I believe these should be fairly easy to retrieve programmatically.
Bear in mind that the project and study accessions are currently switched in our metadata until the latest dev branch is merged to master.
The GSM sample ids (GEO accessions) are required for the conversion of HCA metadata to SCEA format. Therefore, adding the GSM sample ids to the HCA spreadsheet would facilitate the automatic generation of SCEA metadata.
Currently, when you specify a list of accessions, you get one spreadsheet per accession rather than one combined spreadsheet. It would be nice if the tool produced a single spreadsheet, to avoid copying and pasting information into one.
Hi :-)
I'm having some issues with the following command:
pip3 install -r requirements.txt --upgrade
I'm getting this error:
ERROR: Could not find a version that satisfies the requirement sra-tools (from -r requirements.txt (line 6)) (from versions: none)
ERROR: No matching distribution found for sra-tools (from -r requirements.txt (line 6))
I get this error message when using both pip and pip3.
Any idea what's causing this? I've already cloned the git repository to my home computer. Are there any other dependencies I should be aware of?
Currently the geo_to_hca script does not work when a GEO accession is provided but an SRA study accession does not exist for the project. This is also true if there are no experiment accessions, which is the case when the sequencing data is not openly available. This is a ticket for the addition of an option to handle projects with no SRA study accession or experiment accessions i.e. to pre-fill the project and sample metadata using a project accession and biosamples accessions only.
Description of the issue
Currently, fastq file names can not be found for certain SRA study IDs. This is either because (1) the files are available only in bam format or more frequently (2) the files are stored as an SRA object which would require downloading and converting to fastq format using an NCBI SRAdb script. Although this does not cause an error, it means the dataset is not (immediately) suitable. To solve this issue, I would like to:
Description of the issue:
Currently, the script apps/geo_to_hca.py takes predefined parameters as inputs and outputs. I would like to add a parser with default options to give the user the ability to:
... [Default: docs/hca_template.xlsx]
header row (e.g. in HCA metadata, 4 is the header row with the programmatic names) [Default: 4]
initial metadata input row (e.g. in HCA metadata, 6 is the row number where you begin inputting the metadata on each sheet) [Default: 6]
... [Default: spreadsheets/]
Add functionality for pre-filling specific fields according to Drop-seq and Smart-seq2 technologies. Currently, this is done for 10X datasets only.
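For the command-line parser ticket above, a minimal sketch using `argparse`. Only `--accession` and the four default values come from the tickets. The other flag names are hypothetical, and it is an assumption that the two defaults whose descriptions are missing refer to the template path and the output directory.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Sketch of a parser with default options (flag names are hypothetical)."""
    parser = argparse.ArgumentParser(
        description="Convert GEO metadata into an HCA metadata spreadsheet."
    )
    parser.add_argument("--accession", required=True,
                        help="GEO accession, e.g. GSE158702")
    # Assumed meaning: path to the input template spreadsheet.
    parser.add_argument("--template", default="docs/hca_template.xlsx",
                        help="template spreadsheet path")
    parser.add_argument("--header_row", type=int, default=4,
                        help="row holding the programmatic names")
    parser.add_argument("--input_row", type=int, default=6,
                        help="first row where metadata is entered")
    # Assumed meaning: directory where generated spreadsheets are written.
    parser.add_argument("--output_dir", default="spreadsheets/",
                        help="output directory")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

Keeping the defaults in the parser means the existing invocation with only `--accession` keeps working unchanged.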
Currently the tabs which are generated in the hca spreadsheet are sorted differently which can result in confusion and errors during manual curation. The script should be updated so that all the dataframes/tabs are sorted by the same list of accession ids.
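One way to sort every tab by the same list of accession ids is sketched below. Plain lists of row dictionaries stand in for the dataframes, and the function name, the `accession` key and the tab names are all assumptions for illustration.

```python
from typing import Dict, List


def sort_tabs(tabs: Dict[str, List[dict]],
              accession_order: List[str],
              key: str = "accession") -> Dict[str, List[dict]]:
    """Return a copy of every tab with rows ordered by one shared accession list."""
    rank = {accession: i for i, accession in enumerate(accession_order)}
    unknown = len(rank)  # rows with an unlisted accession go last
    return {
        name: sorted(rows, key=lambda row: rank.get(row.get(key), unknown))
        for name, rows in tabs.items()
    }


tabs = {
    "Cell suspension": [{"accession": "B"}, {"accession": "A"}],
    "Sequence file": [{"accession": "A"}, {"accession": "C"}, {"accession": "B"}],
}
sorted_tabs = sort_tabs(tabs, ["A", "B"])
```

Because `sorted` is stable, rows that share an accession keep their original relative order within each tab.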
Description of the issue:
When dealing with a Superseries, or more than one accession on a line, there is unexpected behaviour: the Superseries ID is returned without the GSE prefix, so the spreadsheets are named with the ID only, without the prefix.
Impact/Effort estimation
Low/Low
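A small sketch of how the missing prefix could be restored before the spreadsheets are named. The helper names are hypothetical, and it is assumed that accessions arrive as a comma-separated string.

```python
import re
from typing import List


def normalise_gse(accession: str) -> str:
    """Return the accession with its GSE prefix, adding it when only digits are given."""
    value = accession.strip().upper()
    if re.fullmatch(r"GSE\d+", value):
        return value
    if re.fullmatch(r"\d+", value):
        return f"GSE{value}"
    raise ValueError(f"Not a GEO series accession: {accession!r}")


def split_accessions(line: str) -> List[str]:
    """Split a comma-separated line and normalise every accession in it."""
    return [normalise_gse(part) for part in line.split(",") if part.strip()]
```

For example, `split_accessions("GSE158702, 135893")` yields both accessions with the `GSE` prefix, so output file names stay consistent.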
Description of issue
Currently, the library_preparation_protocol_id and sequencing_protocol_id associated with each cell_suspension_id are not automatically added to the Sequencing file tab.
I would like to write a function which links the protocol ids to the cell_suspension ids and inputs the relevant values into the Sequencing file tab.
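A minimal sketch of such a linking function, assuming the Sequencing file tab is held as a list of row dictionaries and the protocol ids are available in a mapping keyed by `cell_suspension_id`. The function name and data shapes are assumptions, not the converter's actual structures.

```python
from typing import Dict, List, Tuple


def link_protocols(sequence_files: List[dict],
                   suspension_protocols: Dict[str, Tuple[str, str]]) -> List[dict]:
    """Add both protocol ids to every sequencing file row.

    suspension_protocols maps cell_suspension_id to
    (library_preparation_protocol_id, sequencing_protocol_id).
    """
    linked = []
    for row in sequence_files:
        # Leave the fields blank when the cell suspension has no known protocols.
        library_id, sequencing_id = suspension_protocols.get(
            row["cell_suspension_id"], ("", "")
        )
        linked.append({
            **row,
            "library_preparation_protocol_id": library_id,
            "sequencing_protocol_id": sequencing_id,
        })
    return linked
```

The input rows are not modified; a new list is returned so the original tab can still be compared against the result.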
Currently, when a GEO accession is given as input and a metadata spreadsheet is generated, the dissociation protocol tab is missing from the spreadsheet. This might be because it is missing from the default input template spreadsheet, or it may be a bug in the code. It is not a high priority problem, as this tab can be added manually. Automating the conversion of dissociation protocol information is not really possible, at least currently, as it almost always requires manual extraction from the publication text.
Description of the issue:
I would like to update the documentation in this repository:
Readme.md under the Apps folder
Readme.md at root:
As a data wrangler I would like to have the converter put the authors into the publication and contributors tab as it is fairly time consuming and error prone to do it manually.
The author information may be harvested programmatically from an API if not totally within GEO, happy to help figure out how.
Currently, some GEO accessions that are given as input to the geo_to_hca script are not single accessions but are part of a superseries or a superseries accession itself.
We need a function which will assess whether an accession is part of a superseries. This would need to be interactive, because some accessions within a superseries are not strictly single cell RNA sequencing experiments, whereas others are. In some cases all sub-series are, but they are not necessarily all aggregated together. The user needs to indicate which accessions should be used.
Currently there is a function to deal with comma-separated sub-series within a superseries, but this needs to be checked for accuracy and it needs to be extended.
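A sketch of the superseries check and the interactive selection. It assumes the GEO SOFT text for a series has already been downloaded and that relations appear as `!Series_relation = SuperSeries of: GSE...` or `!Series_relation = SubSeries of: GSE...` lines. The function names are hypothetical and the line format should be verified against real records.

```python
import re
from typing import Callable, Dict, List

# Assumed line format of series relations in a GEO SOFT record.
RELATION = re.compile(
    r"^!Series_relation\s*=\s*(SuperSeries|SubSeries) of:\s*(GSE\d+)",
    re.MULTILINE,
)


def series_relations(soft_text: str) -> Dict[str, List[str]]:
    """Collect the sub-series and super-series listed in one SOFT record."""
    relations: Dict[str, List[str]] = {"sub_series": [], "super_series": []}
    for kind, accession in RELATION.findall(soft_text):
        # "SuperSeries of: X" means X is a sub-series of this record.
        key = "sub_series" if kind == "SuperSeries" else "super_series"
        relations[key].append(accession)
    return relations


def choose_sub_series(sub_series: List[str],
                      ask: Callable[[str], str] = input) -> List[str]:
    """Ask the user, one accession at a time, which sub-series to convert."""
    return [
        accession for accession in sub_series
        if ask(f"Include {accession}? [y/n]: ").strip().lower() == "y"
    ]
```

Passing `ask` as a parameter keeps the prompt interactive by default while allowing the selection logic to be exercised without a terminal.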
When converting a recent dataset, I realised that the 'plate-based sequencing' module is not in the template. This ticket is to systematically check which modules are missing, to ensure we don't miss any metadata.
Description of the issue:
The only use the HCA makes of lane_index is to distinguish files within a bundle that should be analysed together (e.g. R1 and R2 of a run). Up until now, the lane index has been filled based on the filename, but this might not give a valid value in some scenarios (e.g. GSE135893).
While in the process of investigating this, another issue arose: For the same dataset, the converter created 228 different bundles, but the experiment consists of only 34 experiment accessions.
Expected behaviour:
Current behaviour:
Effort/Value:
Low/Medium - This issue is not a blocker but takes time away from the wrangling process. Correcting it should only take a couple of lines of code.
Description of the issue:
Because the code has been written over time by different people, the code under the apps folder doesn't follow the PEP 8 style guide. I would like to take some time to lint the code and make it consistent.
I recently converted the accession GSE130708. In the 'Library preparation protocol' tab, it created a library prep with the method 'Drop-seq' and pre-filled barcode lengths, even though it is a 10X 3' v2 library prep.
I am guessing this is some kind of default. If the library preparation method can't be determined from the GEO metadata, these fields should be left blank, as it is very easy to accidentally keep a value that is not correct for the library prep method that was actually used.
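The "leave it blank when unsure" behaviour could be sketched as follows. The keyword table and function name are hypothetical, and the real converter may detect the method differently. The point is only that an empty value is returned unless exactly one method matches.

```python
from typing import Iterable

# Hypothetical keyword table for illustration only.
METHOD_KEYWORDS = {
    "10X": ("10x", "chromium"),
    "Drop-seq": ("drop-seq", "dropseq"),
    "Smart-seq2": ("smart-seq2", "smartseq2"),
}


def detect_library_method(descriptions: Iterable[str]) -> str:
    """Return the library prep method, or "" when it cannot be determined."""
    text = " ".join(descriptions).lower()
    matches = [
        method for method, keywords in METHOD_KEYWORDS.items()
        if any(keyword in text for keyword in keywords)
    ]
    # No default: zero matches or conflicting matches both yield a blank value.
    return matches[0] if len(matches) == 1 else ""
```

Fields that depend on the method, such as barcode lengths, would then be pre-filled only when this function returns a non-empty value.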