rnacentral / rnacentral-import-pipeline

RNAcentral data import pipeline

License: Apache License 2.0

Shell 0.76% Python 54.16% Dockerfile 0.14% Nextflow 3.98% Groovy 0.21% Makefile 0.04% Batchfile 0.60% PostScript 31.37% HTML 0.34% PLpgSQL 0.52% Rust 6.45% Visual Basic 6.0 1.43%
nextflow postgresql python

rnacentral-import-pipeline's People

Contributors

afg1, antonpetrov, blakesweeney, carlosribas, danstaines, dependabot[bot]


rnacentral-import-pipeline's Issues

Gene types

Hello,

I came across the gene type 'sRNA' (e.g. in gene URS0001BF7BA4_9606) and was wondering what biotype it corresponds to. At first I assumed it referred to short ncRNA; however, the genes are generally longer than 200 nt.

I couldn't find any more information on this. Is there something I've been missing?

Best,
Michaela

P.S. I wasn't quite sure where to post this, so please let me know if I should move this conversation elsewhere.

Gene definition

It seems there is no real gene definition in the GTF (at least for human). The attributes column contains a Name that encompasses multiple exons and transcripts. However, in the ID mapping file there are multiple cases where a single external gene ID maps to multiple URS. Examples: MEF2C-AS1, LINC00929, SNHG7.

So do URS map to transcripts? If so, why is there the distinction between URS (Name in attributes) and transcript IDs (ID in attributes)?
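
For reference, a minimal sketch of how to list the ambiguous cases, assuming the same tab-separated id mapping layout used in the pandas example further down this page (RNAcentral, database, transcript_id, NCBI_taxid, type, gene_name); the file name is a placeholder:

import pandas as pd

# Placeholder path; the column layout mirrors the id mapping example below.
id_map = pd.read_table(
    "id_mapping.tsv.gz",
    names=["RNAcentral", "database", "transcript_id", "NCBI_taxid", "type", "gene_name"],
    dtype=str,
)
human = id_map.query('NCBI_taxid == "9606"')

# External gene names that map to more than one URS.
urs_per_gene = human.groupby("gene_name")["RNAcentral"].nunique()
print(urs_per_gene[urs_per_gene > 1].sort_values(ascending=False).head(20))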

Mapping of MOD Identifiers to RNA Central IDs

Hi,
A couple of years ago, when RNA Central was just a gleam in Alex's eye, I spoke with him about this issue (on behalf of the GO, because we will use the RNA Central IDs for annotation rather than the MOD IDs). I requested that all of the MODs send Alex/RNAC their current RNA IDs (plus identifying information), and that RNAC would subsequently return the corresponding IDs (with the conversation on-going to maintain the ID correspondence into the future). In other words, something analogous to the mechanism we have in place for reference proteomes, but for transcriptome ID mapping.

Question: Did all of the MODs supply their IDs (or make them readily available)?
Question: Has RNAC provided a mapping of the RNAC IDs back to the MODs?
Question: Is there anything we can do to help make this happen?

Suzi (for the GOC)

Setup required paths

Talk to ITS to set up all the required paths to start working on the RNAcentral pipeline.

Meaning of gene sources 'alignment' and 'expert-database'

The human GTF contains an attribute called 'source', which for each gene has the value 'alignment' or 'expert-database'. I haven't found any clear explanation of what each of these values means, i.e. are 'alignment' genes less curated than 'expert-database' ones?

For example, the piRNA class has 142620 transcripts, but none of them are from an 'expert-database', although they all come from piRBase. Is piRBase not considered an expert database in this case? If so, what determines whether a database/transcript is 'expert' or not?
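
One way to check the breakdown yourself; a sketch assuming the GFF3 layout referenced elsewhere on this page, with transcript features in column 3 and key=value attributes that include type= and source= (the file name is a placeholder):

import gzip
from collections import Counter

counts = Counter()
with gzip.open("homo_sapiens.GRCh38.gff3.gz", "rt") as handle:
    for line in handle:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "transcript":
            continue
        # Attributes assumed to look like ID=...;type=piRNA;source=alignment
        attrs = dict(part.split("=", 1) for part in cols[8].split(";") if "=" in part)
        counts[(attrs.get("type"), attrs.get("source"))] += 1

for (rna_type, source), n in counts.most_common(20):
    print(rna_type, source, n)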

Migrate analyze.nf

This runs some analysis on sequences. It might be useful to split up the main image to have smaller images for each part.

Recognize more assembly names

Sometimes we are given assembly names as UCSC or Ensembl names; right now we only work with the Ensembl ones, but we can probably convert the UCSC names to Ensembl. There should be a mapping of names somewhere we can use.
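
A minimal sketch of the kind of lookup we could maintain; the entries below are common UCSC/Ensembl pairs, and a real implementation should load the full mapping from an authoritative source (e.g. assembly reports) rather than hard-coding it:

# Illustrative UCSC -> Ensembl assembly name mapping; load the full table
# from an authoritative source in the real pipeline.
UCSC_TO_ENSEMBL = {
    "hg19": "GRCh37",
    "hg38": "GRCh38",
    "mm10": "GRCm38",
    "mm39": "GRCm39",
    "dm6": "BDGP6",
    "ce11": "WBcel235",
}

def normalise_assembly(name: str) -> str:
    """Return the Ensembl-style assembly name, passing unknown names through."""
    return UCSC_TO_ENSEMBL.get(name, name)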

Migrate export.nf

This will need to be modified to deal with the datamovers and the new internal structure.

Produce an Rfam annotation file

We should produce an Rfam annotation file as part of the release procedure. This will be useful for people like FlyBase to associate a gene with Rfam families.

Export high quality set of Rfam annotations

As part of our export process we produce a series of Rfam matches to RNAcentral sequences. This file can be used by other resources to connect their sequences to Rfam and thus to GO terms, but there is a problem with the file: we include all Rfam matches, not only the ones that are complete and have a good e-value. We should modify the pipeline to add another file which contains only the 'good' matches that other resources, like FlyBase, can use.
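
As a starting point, a hedged sketch of the filtering step; the column names (e_value, model_completeness), the thresholds and the file names are assumptions rather than the pipeline's actual export schema:

import csv

E_VALUE_CUTOFF = 1e-5    # placeholder threshold
MIN_COMPLETENESS = 0.9   # placeholder: fraction of the Rfam model covered

with open("rfam_matches.tsv") as src, open("rfam_matches.good.tsv", "w", newline="") as dst:
    reader = csv.DictReader(src, delimiter="\t")
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames, delimiter="\t")
    writer.writeheader()
    for row in reader:
        # Keep only complete matches with a significant e-value.
        if float(row["e_value"]) <= E_VALUE_CUTOFF and float(row["model_completeness"]) >= MIN_COMPLETENESS:
            writer.writerow(row)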

Update Rfam GO terms each release

There is code to import new Rfam GO (and other) terms as part of the release process. This should be run every release, but currently isn't. Just add the call to the import steps.

Gene type 'ncRNA'

There are quite a few transcripts annotated as 'ncRNA'. However, according to the ID mapping, it seems they could be annotated more specifically as tRNA, rRNA, etc.

$ zcat homo_sapiens.GRCh38.gff3.gz | grep -P '\ttranscript\t' | grep 'type=ncRNA' | wc -l

5067
>>> import pandas as pd
>>> id_map = pd.read_table(
    id_map_file,
    names=['RNAcentral', 'database', 'transcript_id', 'NCBI_taxid', 'type', 'gene_name'],
    dtype=str
)
>>> id_map = id_map.query('NCBI_taxid == "9606"')
...

>>> id_map.query('type == "ncRNA"')[['database']].value_counts()

ENA          54239
GENECARDS     4945
MALACARDS     3505
GTRNADB        989
RFAM           568
HGNC           516
LNCBOOK        163
PDB             81
REFSEQ          29
ENSEMBL          5
GENCODE          5
MODOMICS         5
SILVA            3
INTACT           1

>>> id_map.query('type == "ncRNA"')[['transcript_id', 'database']].value_counts()

transcript_id                                   database
RF00005                                         RFAM        348
RF02543                                         RFAM         57
RF00619                                         RFAM         45
RF02541                                         RFAM         35
RF01853                                         RFAM         29
                                                           ... 
JQ705238.1:1672..3229:rRNA                      ENA           1
JQ705239.1:1673..3230:rRNA                      ENA           1
JQ705240.1:1672..3229:rRNA                      ENA           1
JQ705241.1:1670..3227:rRNA                      ENA           1
tRX-Val-NNN-5-1:CM000663.2:145161108-145161178  GTRNADB       1

For instance, I'd expect all transcripts from GTRNADB to be marked as tRNA rather than ncRNA.
Is this a bug or a feature?

Import ZWD

Importing ZWD will allow us to improve the corresponding Rfam families.

Import linkage to expression atlas

From preliminary work there seem to be several thousand genes we have in common with them. We can import this connection for building links and the like. It would be good to have search terms we can use, as well as enough information to create links to Expression Atlas.

Analyse sequences to see if they are protein coding

This was suggested by the SAB as a useful feature, particularly for lncRNAs. I think we can analyse all sequences and add a QA check, possibly_protein_coding, as well as show the scores from the program used on the site. There are a few possible programs suggested by the SAB:

and some others that may be useful:

I've checked them a bit and it seems the CPC website is down, and PhyloCSF requires an alignment. The rest look pretty straightforward though. PhyloCSF is used by LNCipedia, so we can talk to them about how they do it as well.

Import GO terms

We should import GO annotations from QuickGO into RNAcentral. This will require several new tables as well as import procedures for fetching and storing the annotations. They are producing a file at /ebi/ftp/pub/contrib/goa/goa_rna_all.gpa.gz which should contain the GO annotations on RNAcentral UPIs.
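
A minimal sketch of reading that file, assuming a GPAD-style tab-separated layout where the early columns identify the annotated object and the fourth column holds the GO term; the exact column order should be confirmed against the file's header lines before relying on this:

import gzip

def read_go_annotations(path="goa_rna_all.gpa.gz"):
    """Yield (object id, qualifier, GO id) triples from a gzipped GPA file (assumed layout)."""
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("!"):
                continue  # header / comment lines
            cols = line.rstrip("\n").split("\t")
            if len(cols) < 4:
                continue
            yield cols[1], cols[2], cols[3]  # DB object id, qualifier, GO id (assumed positions)

for object_id, qualifier, go_id in read_go_annotations():
    print(object_id, qualifier, go_id)
    break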

Figure out why sequence export and search export disagree

Right now the search and sequence exports have different numbers of entries. This is probably caused by a mix of things. First, some of the search export entries are duplicated, probably due to overlapping id ranges and some bad/missing data in precompute/xref. I think there are 'fake' entries in precompute; these are likely urs_taxid pairs that do not have any xref entries. There may also be xref entries that have no corresponding entry in precompute. This issue is to track finding and fixing these problems so the search exports can be correct.

One possible fix is to just truncate precompute and redo everything.
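
One way to start quantifying the mismatch; a sketch in which the table and column names (precompute, xref, urs_taxid) are placeholders taken from the wording above, not the real schema, and the connection string is hypothetical:

import psycopg2

# Placeholder table/column names; substitute the real precompute and xref tables.
QUERY = """
SELECT pre.urs_taxid
FROM precompute pre
LEFT JOIN xref ON xref.urs_taxid = pre.urs_taxid
WHERE xref.urs_taxid IS NULL
LIMIT 100
"""

with psycopg2.connect("dbname=rnacentral") as conn, conn.cursor() as cur:
    cur.execute(QUERY)
    for (urs_taxid,) in cur.fetchall():
        print(urs_taxid)  # precompute entries with no xref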

Use results of CPAT analysis

From #77 we will do some analysis with CPAT, and so we want to use the results. We should:

  • Import the best ORFs for display on the website.
  • Import the scores from CPAT
  • Add a protein_coding QA step (see the sketch after this list)
  • Add a protein coding facet to text search
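
A hedged sketch of what the QA step could look like, assuming CPAT produces a tab-separated table with a sequence-id column and a coding-probability column; the column names and the 0.5 cutoff are placeholders that depend on the CPAT version and species model:

import csv

CODING_PROB_CUTOFF = 0.5  # placeholder; CPAT's recommended cutoff is species-specific

def possibly_protein_coding(cpat_output="cpat_results.tsv",
                            id_column="ID", prob_column="coding_prob"):
    """Yield (sequence id, flag) pairs from a CPAT result table (assumed layout)."""
    with open(cpat_output) as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            yield row[id_column], float(row[prob_column]) >= CODING_PROB_CUTOFF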

Difference between 'databases' and 'providing_databases'

What is the difference between 'databases' and 'providing_databases'? I assume 'databases' lists all databases where the transcript is documented, whereas 'providing_databases' only lists the databases that provide the genomic coordinates. Is that correct?

Import linkage to lncAtlas

lncAtlas has localisation information for various lncRNAs in various cell types. We should be able to create links to lncAtlas as well as show what cell types and where sequences are found. It may be nice to be able to do searches like:

  • expressed_in:"cell_type"
  • localised_in:"cytoplasm"
  • localization_source:"lncatlas"
  • expression_source:"lncatlas"

As well as have enough information to show some useful links in the website.

Fix spelling mistake in some sequences

Some snoRNAs are labelled 'Homo sapians' when they should be 'Homo sapiens'. This is an issue from the providing database, so we need to fix it when importing. We can also fix it in our database if reimporting the database in question would be too much work.
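
The import-time fix could be as small as a substitution applied to incoming descriptions; a sketch in which the function name and where it gets called from are illustrative:

# Known misspellings in descriptions from providing databases -> corrected form.
SPELLING_FIXES = {
    "Homo sapians": "Homo sapiens",
}

def clean_description(description: str) -> str:
    """Apply known spelling fixes to a description during import."""
    for wrong, right in SPELLING_FIXES.items():
        description = description.replace(wrong, right)
    return description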

Parse ENA Non-coding product in Python

The ENA data is provided to RNAcentral as EMBL-formatted files. They are processed into csv files that are subsequently loaded into a database. The code is written in Perl and relies on BioPerl and Ensembl Hive.

To streamline the production process I suggest moving from Perl to Python.

  • parse EMBL files into csv files using BioPython (see the sketch after this list)
  • create a luigi pipeline
  • review how TPA entries are stored (for example, miRBase entries are identified by DR lines in TPAs)
  • combine EMBL files before processing to avoid launching tens of thousands of LSF jobs
  • once done, delete old Perl code from the repository
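
A minimal sketch of the first bullet, using Bio.SeqIO's built-in EMBL parser to flatten records into a csv; the output columns are illustrative, not the pipeline's real csv schema:

import csv
from Bio import SeqIO  # BioPython

def embl_to_csv(embl_path, csv_path):
    """Write one row per EMBL record: accession, description, length, sequence."""
    with open(csv_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["accession", "description", "length", "sequence"])
        for record in SeqIO.parse(embl_path, "embl"):
            writer.writerow([record.id, record.description, len(record.seq), str(record.seq)])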

Extract database version when parsing files

We need to keep track of which version of each database we are importing. Right now this is done manually, which isn't ideal. Instead, we should write the version out when parsing the files from a database. We could then place it in the db, or produce a report, which would make updating the website and the blog post easier.
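
For the report route, a sketch of what each parser could emit; the report file name and row layout are hypothetical:

import csv
from datetime import date

def record_database_version(database: str, version: str,
                            report="database_versions.csv"):
    """Append the version seen while parsing a database's files to a report (hypothetical layout)."""
    with open(report, "a", newline="") as out:
        csv.writer(out).writerow([database, version, date.today().isoformat()])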

tRNAs from LncBook

There are some very long transcripts from LncBook that are annotated as tRNAs. However, the RNA structure image is of a mature tRNA that I cannot find any link or ID to. Additionally, when I go to LncBook, I don't see any information about the transcripts being tRNAs. I was wondering where the gene type information is coming from and whether we can be sure that they really are tRNAs.

For example, URS0001BE6631 or HSALNT0163847 could be the same region as RUFY2-207

Query:

so_rna_type_name:"TRNA" AND TAXONOMY:"9606" AND expert_db:"LncBook"

Migrate import-data.nf

This fetches and parses data. We need to get it running on codon. It will need to deal with the datamovers and the like, as it is a pretty complex pipeline that fetches from a lot of places.
