rnacentral / rnacentral-import-pipeline

RNAcentral data import pipeline

License: Apache License 2.0

Shell 0.76% Python 54.16% Dockerfile 0.14% Nextflow 3.98% Groovy 0.21% Makefile 0.04% Batchfile 0.60% PostScript 31.37% HTML 0.34% PLpgSQL 0.52% Rust 6.45% Visual Basic 6.0 1.43%
nextflow postgresql python

rnacentral-import-pipeline's People

Contributors

afg1, antonpetrov, blakesweeney, carlosribas, danstaines, dependabot[bot]


rnacentral-import-pipeline's Issues

Gene types

Hello,

I came across the gene type 'sRNA' (e.g. in gene URS0001BF7BA4_9606) and was wondering what biotype it corresponds to. At first I assumed it referred to short ncRNA; however, the genes are generally longer than 200 nt.

I couldn't find any more information on this. Is there something I've been missing?

Best,
Michaela

P.S. I wasn't quite sure where to post this, so please let me know if I should move this conversation elsewhere.

Gene definition

It seems there is no real gene definition in the GTF (at least for human). The attributes column contains a Name that encompasses multiple exons and transcripts. However, in the ID mapping file there are multiple cases where a single external gene ID maps to multiple URS. Examples: MEF2C-AS1, LINC00929, SNHG7.

So do URS map to transcripts? If so, why is there the distinction between URS (Name in attributes) and transcript IDs (ID in attributes)?
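
For reference, a minimal sketch of how to list the ambiguous cases, assuming the same tab-separated id mapping layout used in the pandas example further down this page (RNAcentral, database, transcript_id, NCBI_taxid, type, gene_name); the file name is a placeholder:

import pandas as pd

# Placeholder path; the column layout mirrors the id mapping example below.
id_map = pd.read_table(
    "id_mapping.tsv.gz",
    names=["RNAcentral", "database", "transcript_id", "NCBI_taxid", "type", "gene_name"],
    dtype=str,
)
human = id_map.query('NCBI_taxid == "9606"')

# External gene names that map to more than one URS.
urs_per_gene = human.groupby("gene_name")["RNAcentral"].nunique()
print(urs_per_gene[urs_per_gene > 1].sort_values(ascending=False).head(20))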

Mapping of MOD Identifiers to RNA Central IDs

Hi,
A couple of years ago, when RNA Central was just a gleam in Alex's eye, I spoke with him about this issue (on behalf of the GO, because we will use the RNA Central IDs for annotation rather than the MOD IDs). I requested that all of the MODs send Alex/RNAC their current RNA IDs (plus identifying information), and that RNAC would subsequently return the corresponding IDs (with the conversation on-going to maintain the ID correspondence into the future). In other words, something analogous to the mechanism we have in place for reference proteomes, but for transcriptome ID mapping.

Question: Did all of the MODs supply their IDs (or make them readily available)?
Question: Has RNAC provided a mapping of the RNAC IDs back to the MODs?
Question: Is there anything we can do to help make this happen?

Suzi (for the GOC)

Setup required paths

Talk to ITS to set up all the required paths to start working on the RNAcentral pipeline.

Meaning of gene sources 'alignment' and 'expert-database'

The human GTF contains an attribute called 'source', which for each gene has the value 'alignment' or 'expert-database'. I haven't found any clear explanation of what each of these values means, i.e. are 'alignment' genes less curated than 'expert-database' ones?

For example, the piRNA class has 142620 transcripts, but none of them are from an 'expert-database', although they all come from piRBase. Is piRBase not considered an expert database in this case? If so, what determines whether a database/transcript is 'expert' or not?
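
One way to check the breakdown yourself; a sketch assuming the GFF3 layout referenced elsewhere on this page, with transcript features in column 3 and key=value attributes that include type= and source= (the file name is a placeholder):

import gzip
from collections import Counter

counts = Counter()
with gzip.open("homo_sapiens.GRCh38.gff3.gz", "rt") as handle:
    for line in handle:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "transcript":
            continue
        # Attributes assumed to look like ID=...;type=piRNA;source=alignment
        attrs = dict(part.split("=", 1) for part in cols[8].split(";") if "=" in part)
        counts[(attrs.get("type"), attrs.get("source"))] += 1

for (rna_type, source), n in counts.most_common(20):
    print(rna_type, source, n)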

Migrate analyze.nf

This runs some analysis on sequences. It might be useful to split up the main image to have smaller images for each part.

Recognize more assembly names

Sometimes we are given assembly names as UCSC or Ensembl names; right now we only work with the Ensembl ones, but we can probably convert the UCSC names to Ensembl. There should be a mapping of names somewhere we can use.
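
A minimal sketch of the kind of lookup we could maintain; the entries below are common UCSC/Ensembl pairs, and a real implementation should load the full mapping from an authoritative source (e.g. assembly reports) rather than hard-coding it:

# Illustrative UCSC -> Ensembl assembly name mapping; load the full table
# from an authoritative source in the real pipeline.
UCSC_TO_ENSEMBL = {
    "hg19": "GRCh37",
    "hg38": "GRCh38",
    "mm10": "GRCm38",
    "mm39": "GRCm39",
    "dm6": "BDGP6",
    "ce11": "WBcel235",
}

def normalise_assembly(name: str) -> str:
    """Return the Ensembl-style assembly name, passing unknown names through."""
    return UCSC_TO_ENSEMBL.get(name, name)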

Migrate export.nf

This will need to be modified to deal with the datamovers and the new internal structure.

Produce an Rfam annotation file

We should produce an Rfam annotation file as part of the release procedure. This will be useful for people like FlyBase to associate a gene with Rfam families.

Export high quality set of Rfam annotations

As part of our export process we produce a series of Rfam matches to RNAcentral sequences. This file can be used by other resources to connect their sequences to Rfam and thus to GO terms, but there is a problem with the file: we include all Rfam matches, not only the ones that are complete and have a good e-value. We should modify the pipeline to add another file which contains only the 'good' matches that other resources, like FlyBase, can use.
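
As a starting point, a hedged sketch of the filtering step; the column names (e_value, model_completeness), the thresholds and the file names are assumptions rather than the pipeline's actual export schema:

import csv

E_VALUE_CUTOFF = 1e-5    # placeholder threshold
MIN_COMPLETENESS = 0.9   # placeholder: fraction of the Rfam model covered

with open("rfam_matches.tsv") as src, open("rfam_matches.good.tsv", "w", newline="") as dst:
    reader = csv.DictReader(src, delimiter="\t")
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames, delimiter="\t")
    writer.writeheader()
    for row in reader:
        # Keep only complete matches with a significant e-value.
        if float(row["e_value"]) <= E_VALUE_CUTOFF and float(row["model_completeness"]) >= MIN_COMPLETENESS:
            writer.writerow(row)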

Update Rfam GO terms each release

There is code to import new Rfam GO (and other) terms as part of the release process. This should be run every release, but currently isn't. Just add the call to the import steps.

Gene type 'ncRNA'

There are quite a few transcripts annotated as 'ncRNA'. However, according to the ID mapping, it seems they could be annotated more specifically as tRNA, rRNA, etc.

$ zcat homo_sapiens.GRCh38.gff3.gz | grep -P '\ttranscript\t' | grep 'type=ncRNA' | wc -l

5067
>>> import pandas as pd
>>> id_map = pd.read_table(
    id_map_file,
    names=['RNAcentral', 'database', 'transcript_id', 'NCBI_taxid', 'type', 'gene_name'],
    dtype=str
)
>>> id_map = id_map.query('NCBI_taxid == "9606"')
...

>>> id_map.query('type == "ncRNA"')[['database']].value_counts()

ENA          54239
GENECARDS     4945
MALACARDS     3505
GTRNADB        989
RFAM           568
HGNC           516
LNCBOOK        163
PDB             81
REFSEQ          29
ENSEMBL          5
GENCODE          5
MODOMICS         5
SILVA            3
INTACT           1

>>> id_map.query('type == "ncRNA"')[['transcript_id', 'database']].value_counts()

transcript_id                                   database
RF00005                                         RFAM        348
RF02543                                         RFAM         57
RF00619                                         RFAM         45
RF02541                                         RFAM         35
RF01853                                         RFAM         29
                                                           ... 
JQ705238.1:1672..3229:rRNA                      ENA           1
JQ705239.1:1673..3230:rRNA                      ENA           1
JQ705240.1:1672..3229:rRNA                      ENA           1
JQ705241.1:1670..3227:rRNA                      ENA           1
tRX-Val-NNN-5-1:CM000663.2:145161108-145161178  GTRNADB       1

For instance, I'd expect all transcripts from GTRNADB to be marked as tRNA rather than ncRNA.
Is this a bug or a feature?

Import ZWD

Importing ZWD will allow us to improve the corresponding Rfam families.

Import linkage to expression atlas

From preliminary work there seem to be several thousand genes we have in common with them. We can import this connection for building links and the like. It would be good to have search terms we can use, as well as enough information to create links to Expression Atlas.

Analyse sequences to see if they are protein coding

This was suggested by the SAB as a useful feature, particularly for lncRNAs. I think we can analyse all sequences and add a QA check, possibly_protein_coding, as well as show the scores from the program used on the site. There are a few possible programs suggested by the SAB:

and some others that may be useful:

I've checked them a bit and it seems the CPC website is down, and PhyloCSF requires an alignment. The rest look pretty straightforward though. PhyloCSF is used by LNCipedia, so we can talk to them about how they do it as well.

Import GO terms

We should import GO annotations from QuickGO into RNAcentral. This will require several new tables as well as import procedures for fetching and storing the annotations. They are producing a file at /ebi/ftp/pub/contrib/goa/goa_rna_all.gpa.gz which should contain the GO annotations on RNAcentral UPIs.
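
A minimal sketch of reading that file, assuming a GPAD-style tab-separated layout where the early columns identify the annotated object and the fourth column holds the GO term; the exact column order should be confirmed against the file's header lines before relying on this:

import gzip

def read_go_annotations(path="goa_rna_all.gpa.gz"):
    """Yield (object id, qualifier, GO id) triples from a gzipped GPA file (assumed layout)."""
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("!"):
                continue  # header / comment lines
            cols = line.rstrip("\n").split("\t")
            if len(cols) < 4:
                continue
            yield cols[1], cols[2], cols[3]  # DB object id, qualifier, GO id (assumed positions)

for object_id, qualifier, go_id in read_go_annotations():
    print(object_id, qualifier, go_id)
    break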

Figure out why sequence export and search export disagree

Right now the search and sequence exports have different numbers of entries. This is probably caused by a mix of things. First, some of the search export entries are duplicated, probably due to overlapping id ranges and some bad/missing data in precompute/xref. I think there are 'fake' entries in precompute; these are likely urs_taxid pairs that do not have any xref entries. There may also be xref entries that have no corresponding entry in precompute. This issue is to track finding and fixing these problems so the search exports can be correct.

One possible fix is to just truncate precompute and redo everything.
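
One way to start quantifying the mismatch; a sketch in which the table and column names (precompute, xref, urs_taxid) are placeholders taken from the wording above, not the real schema, and the connection string is hypothetical:

import psycopg2

# Placeholder table/column names; substitute the real precompute and xref tables.
QUERY = """
SELECT pre.urs_taxid
FROM precompute pre
LEFT JOIN xref ON xref.urs_taxid = pre.urs_taxid
WHERE xref.urs_taxid IS NULL
LIMIT 100
"""

with psycopg2.connect("dbname=rnacentral") as conn, conn.cursor() as cur:
    cur.execute(QUERY)
    for (urs_taxid,) in cur.fetchall():
        print(urs_taxid)  # precompute entries with no xref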

Use results of CPAT analysis

From #77 we will do some analysis with CPAT, and so we want to use the results. We should:

  • Import the best ORFs for display on the website.
  • Import the scores from CPAT
  • Add a protein_coding QA step (see the sketch after this list)
  • Add a protein coding facet to text search
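
A hedged sketch of what the QA step could look like, assuming CPAT produces a tab-separated table with a sequence-id column and a coding-probability column; the column names and the 0.5 cutoff are placeholders that depend on the CPAT version and species model:

import csv

CODING_PROB_CUTOFF = 0.5  # placeholder; CPAT's recommended cutoff is species-specific

def possibly_protein_coding(cpat_output="cpat_results.tsv",
                            id_column="ID", prob_column="coding_prob"):
    """Yield (sequence id, flag) pairs from a CPAT result table (assumed layout)."""
    with open(cpat_output) as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            yield row[id_column], float(row[prob_column]) >= CODING_PROB_CUTOFF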

Difference between 'databases' and 'providing_databases'

What is the difference between 'databases' and 'providing_databases'? I assume 'databases' lists all databases where the transcript is documented, whereas 'providing_databases' only lists the databases that provide the genomic coordinates. Is that correct?

Import linkage to lncAtlas

lncAtlas has localisation information for various lncRNAs in various cell types. We should be able to create links to lncAtlas as well as show what cell types and where sequences are found. It may be nice to be able to do searches like:

  • expressed_in:"cell_type"
  • localised_in:"cytoplasm"
  • localization_source:"lncatlas"
  • expression_source:"lncatlas"

As well as have enough information to show some useful links in the website.

Fix spelling mistake in some sequences

Some snoRNAs are labelled 'Homo sapians' when they should be 'Homo sapiens'. This is an issue from the providing database, so we need to fix it when importing. We can also fix it in our database if reimporting the database in question would be too much work.
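
The import-time fix could be as small as a substitution applied to incoming descriptions; a sketch in which the function name and where it gets called from are illustrative:

# Known misspellings in descriptions from providing databases -> corrected form.
SPELLING_FIXES = {
    "Homo sapians": "Homo sapiens",
}

def clean_description(description: str) -> str:
    """Apply known spelling fixes to a description during import."""
    for wrong, right in SPELLING_FIXES.items():
        description = description.replace(wrong, right)
    return description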

Parse ENA Non-coding product in Python

The ENA data is provided to RNAcentral as EMBL-formatted files. They are processed into csv files that are subsequently loaded into a database. The code is written in Perl and relies on BioPerl and Ensembl Hive.

To streamline the production process I suggest moving from Perl to Python.

  • parse EMBL files into csv files using BioPython (see the sketch after this list)
  • create a luigi pipeline
  • review how TPA entries are stored (for example, miRBase entries are identified by DR lines in TPAs)
  • combine EMBL files before processing to avoid launching tens of thousands of LSF jobs
  • once done, delete old Perl code from the repository
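
A minimal sketch of the first bullet, using Bio.SeqIO's built-in EMBL parser to flatten records into a csv; the output columns are illustrative, not the pipeline's real csv schema:

import csv
from Bio import SeqIO  # BioPython

def embl_to_csv(embl_path, csv_path):
    """Write one row per EMBL record: accession, description, length, sequence."""
    with open(csv_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["accession", "description", "length", "sequence"])
        for record in SeqIO.parse(embl_path, "embl"):
            writer.writerow([record.id, record.description, len(record.seq), str(record.seq)])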

Extract database version when parsing files

We need to keep track of which version of each database we are importing. Right now this is done manually, which isn't ideal. Instead, we should write the version out when parsing the files from a database. We could then place it in the db, or produce a report, which would make updating the website and the blog post easier.
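
For the report route, a sketch of what each parser could emit; the report file name and row layout are hypothetical:

import csv
from datetime import date

def record_database_version(database: str, version: str,
                            report="database_versions.csv"):
    """Append the version seen while parsing a database's files to a report (hypothetical layout)."""
    with open(report, "a", newline="") as out:
        csv.writer(out).writerow([database, version, date.today().isoformat()])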

tRNAs from LncBook

There are some very long transcripts from LncBook that are annotated as tRNAs. However, the RNA structure image is of a mature tRNA that I cannot find any link or ID to. Additionally, when I go to LncBook, I don't see any information about the transcripts being tRNAs. I was wondering where the gene type information is coming from and whether we can be sure that they really are tRNAs.

For example, URS0001BE6631 or HSALNT0163847 could be the same region as RUFY2-207

Query:

so_rna_type_name:"TRNA" AND TAXONOMY:"9606" AND expert_db:"LncBook"

Migrate import-data.nf

This fetches and parses data. We need to get it running on codon. It will need to deal with the datamovers and the like, as it is a pretty complex pipeline that fetches from a lot of places.
