
contig-alias's People

Contributors

andresfsilva, apriltuesday, jmmut, nitin-ebi, singaltanmay, sundarvenkata-ebi

contig-alias's Issues

Add HATEOAS to paginated endpoints

The problem with this pagination approach is that when users run a query like "/assemblies/taxid/9606" and receive a list with 10 results, they might not know about the pagination and think that's the complete result.

We have to provide something in the result to signal that it's only a page. I'm not a huge fan of HATEOAS but it seems to be the closest to a standard that we currently have. Take a look at one of our endpoints:

The good thing about it is that it tells you how many pages there are, and also gives you the links to get the next/previous/first/last pages (when it makes sense).

Originally posted by @jmmut in #29
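
To make the idea above concrete, here is a minimal plain-Java sketch of the navigation information a paginated response could carry. The class name PageLinks and the "page"/"size" query parameters are assumptions for illustration only; the real service would more likely rely on Spring HATEOAS than on hand-built links.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: builds HATEOAS-style navigation links for one page of results. */
final class PageLinks {

    /** Number of pages needed to hold totalElements items at the given page size. */
    static int totalPages(long totalElements, int size) {
        if (size <= 0) {
            throw new IllegalArgumentException("size must be positive");
        }
        return (int) ((totalElements + size - 1) / size);
    }

    /** Returns rel -> URL; "prev" and "next" are present only when such a page exists. */
    static Map<String, String> build(String baseUrl, int page, int size, long totalElements) {
        int last = Math.max(totalPages(totalElements, size) - 1, 0);
        Map<String, String> links = new LinkedHashMap<>();
        links.put("first", url(baseUrl, 0, size));
        if (page > 0) {
            links.put("prev", url(baseUrl, page - 1, size));
        }
        links.put("self", url(baseUrl, page, size));
        if (page < last) {
            links.put("next", url(baseUrl, page + 1, size));
        }
        links.put("last", url(baseUrl, last, size));
        return links;
    }

    // Assumed query parameter names; zero-based page index.
    private static String url(String baseUrl, int page, int size) {
        return baseUrl + "?page=" + page + "&size=" + size;
    }
}
```

With 35 results and a page size of 10, the first page would carry "first", "self", "next" and "last" links but no "prev", which is enough for a client to notice that the 10 items it received are not the whole result.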

Perform queries involving partial input data.

Queries that provide a part (sub-string) of a field. Specifically:

  • Given an accession "298.5" return all assemblies whose GCA or GCF contains "298.5".
  • Given an accession "298.5" return all chromosomes whose GCA or GCF contains "298.5".
  • Given a chromosome/contig "Chr1" (with or without prefix), return the set of aliases for every species that has a "Chr1".
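
The first bullet can be sketched in plain Java as a substring filter over assembly accessions. The Assembly class and its genbank/refseq field names below are stand-ins, not the project's actual entities; against a database, the same idea would presumably be expressed as a LIKE '%298.5%' query (or a Spring Data "Containing" derived query) rather than an in-memory loop.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical in-memory illustration of a partial-accession search. */
final class PartialAccessionSearch {

    /** Minimal stand-in for an assembly row; field names are assumptions. */
    static final class Assembly {
        final String genbank;
        final String refseq;

        Assembly(String genbank, String refseq) {
            this.genbank = genbank;
            this.refseq = refseq;
        }
    }

    /** Returns every assembly whose GCA or GCF accession contains the given fragment. */
    static List<Assembly> findByAccessionContaining(List<Assembly> all, String fragment) {
        List<Assembly> matches = new ArrayList<>();
        for (Assembly assembly : all) {
            if (contains(assembly.genbank, fragment) || contains(assembly.refseq, fragment)) {
                matches.add(assembly);
            }
        }
        return matches;
    }

    // A missing accession (null) never matches.
    private static boolean contains(String value, String fragment) {
        return value != null && value.contains(fragment);
    }
}
```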

Put all constant variables in a separate class

I have been reviewing our codebase and noticed that the constant variables are defined in individual classes. After further analysis, I believe that consolidating these constants in a separate class could make our code easier to read and maintain.

Rename genbank to INSDC

We are indeed ingesting genbank accessions, but genbank is just one of the INSDC sources, along with ENA and DDBJ. These accessions are in the same namespace, so they are compatible. Right now, someone searching for an ENA accession would need to put it under "genbank", and ENA is not genbank, but both are INSDC.

There's no need to rename all the internal variable names, only the ones that show in the swagger and in the json responses. Also, while documenting INSDC parameters, make a quick mention of GenBank, ENA and DDBJ.

split admin and user swagger pages

It's been ok until now to have both APIs together, but if we plan to start asking users to test the contig alias, it would be clearer for them if we hide the admin endpoints.

fix swagger serialization exceptions

This error shows up in the logs when running the service.

This probably comes from some endpoint parameter being declared as an int with a null default value, where an Integer should be used instead.

2020-08-11 10:34:37.371  WARN 21295 --- [nio-8080-exec-2] i.s.m.p.AbstractSerializableParameter    : Illegal DefaultValue null for parameter type integer

java.lang.NumberFormatException: For input string: ""
	at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) ~[na:1.8.0_211]
        ...
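
The exception in the log is the one Java raises when an empty string is parsed as a number, which fits an integer parameter ending up with an empty default value. The sketch below only illustrates that behaviour in isolation; parseDefault is a hypothetical helper and not part of Swagger or of this project.

```java
/** Illustration of why an empty default value breaks integer parsing. */
final class DefaultValueParsing {

    /** Returns true when parsing the raw value directly fails with a NumberFormatException. */
    static boolean failsToParse(String raw) {
        try {
            Integer.parseInt(raw);
            return false;
        } catch (NumberFormatException e) {
            return true;
        }
    }

    /** Hypothetical helper: treats a missing or empty default as "no default" instead of parsing it. */
    static Integer parseDefault(String raw) {
        if (raw == null || raw.isEmpty()) {
            return null;
        }
        return Integer.valueOf(raw);
    }
}
```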

Paginated endpoints

Add pagination support to all endpoints for

  • dealing with large number of items in response
  • uniformity in operation of endpoints

IOException in FTP Client when adding assemblies in bulk

Adding assemblies ["GCA_000001405.10","GCA_000001405.11","GCA_000001405.12"] using /v1/admin/assemblies results in an IOException.

Error:

2020-08-16 21:48:53.439 ERROR 5055 --- [pool-1-thread-1] u.a.e.e.c.dus.PassiveAnonymousFTPClient  : Could not connect to FTP server 'ftp.ncbi.nlm.nih.gov'. FTP status was: null. Reply code: 221. Reply string: 221 Goodbye.
.
2020-08-16 21:48:53.439 ERROR 5055 --- [pool-1-thread-1] u.a.e.e.c.service.AssemblyService        : IOException while fetching and inserting GCA_000001405.10

org.apache.commons.net.ftp.FTPConnectionClosedException: Connection closed without indication.
	at org.apache.commons.net.ftp.FTP.__getReply(FTP.java:324) ~[commons-net-3.6.jar:3.6]
	at org.apache.commons.net.ftp.FTP.__getReply(FTP.java:300) ~[commons-net-3.6.jar:3.6]
	at org.apache.commons.net.ftp.FTP.sendCommand(FTP.java:523) ~[commons-net-3.6.jar:3.6]
	at org.apache.commons.net.ftp.FTP.sendCommand(FTP.java:648) ~[commons-net-3.6.jar:3.6]
	at org.apache.commons.net.ftp.FTP.sendCommand(FTP.java:622) ~[commons-net-3.6.jar:3.6]
	at org.apache.commons.net.ftp.FTP.quit(FTP.java:904) ~[commons-net-3.6.jar:3.6]
	at org.apache.commons.net.ftp.FTPClient.logout(FTPClient.java:1148) ~[commons-net-3.6.jar:3.6]
	at uk.ac.ebi.eva.contigalias.dus.PassiveAnonymousFTPClient.disconnect(PassiveAnonymousFTPClient.java:80) ~[classes/:na]
	at uk.ac.ebi.eva.contigalias.dus.PassiveAnonymousFTPClient.connect(PassiveAnonymousFTPClient.java:63) ~[classes/:na]
	at uk.ac.ebi.eva.contigalias.dus.PassiveAnonymousFTPClient.connect(PassiveAnonymousFTPClient.java:33) ~[classes/:na]
	at uk.ac.ebi.eva.contigalias.dus.NCBIBrowser.connect(NCBIBrowser.java:49) ~[classes/:na]
	at uk.ac.ebi.eva.contigalias.datasource.NCBIAssemblyDataSource.getAssemblyByAccession(NCBIAssemblyDataSource.java:43) ~[classes/:na]
	at uk.ac.ebi.eva.contigalias.datasource.NCBIAssemblyDataSource$$FastClassBySpringCGLIB$$534e408e.invoke(<generated>) ~[classes/:na]
	at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:218) ~[spring-core-5.2.6.RELEASE.jar:5.2.6.RELEASE]
	at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.invokeJoinpoint(CglibAopProxy.java:771) ~[spring-aop-5.2.6.RELEASE.jar:5.2.6.RELEASE]
	at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163) ~[spring-aop-5.2.6.RELEASE.jar:5.2.6.RELEASE]
	at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:749) ~[spring-aop-5.2.6.RELEASE.jar:5.2.6.RELEASE]
	at org.springframework.dao.support.PersistenceExceptionTranslationInterceptor.invoke(PersistenceExceptionTranslationInterceptor.java:139) ~[spring-tx-5.2.6.RELEASE.jar:5.2.6.RELEASE]
	at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186) ~[spring-aop-5.2.6.RELEASE.jar:5.2.6.RELEASE]
	at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:749) ~[spring-aop-5.2.6.RELEASE.jar:5.2.6.RELEASE]
	at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:691) ~[spring-aop-5.2.6.RELEASE.jar:5.2.6.RELEASE]
	at uk.ac.ebi.eva.contigalias.datasource.NCBIAssemblyDataSource$$EnhancerBySpringCGLIB$$4235151a.getAssemblyByAccession(<generated>) ~[classes/:na]
	at uk.ac.ebi.eva.contigalias.service.AssemblyService.fetchAndInsertAssembly(AssemblyService.java:97) ~[classes/:na]
	at uk.ac.ebi.eva.contigalias.service.AssemblyService.lambda$fetchAndInsertAssembly$2(AssemblyService.java:169) ~[classes/:na]
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[na:na]
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[na:na]
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) ~[na:na]
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) ~[na:na]
	at java.base/java.lang.Thread.run(Thread.java:832) ~[na:na]

Perform all alias resolution queries (single table only)

Perform all alias resolution queries that only require access to a single table. This includes:

  • Given an assembly’s genbank or refseq accession, return its refseq accession or genbank alias along with the assembly name, organism name, taxonomic id.
  • Given a chromosome’s genbank or refseq accession, return its refseq accession or genbank alias.

Add persistence to cache results

When a search is performed using the API, the service should first check if the result is present in the database. If not, then the "dus" is used to fetch and parse the assembly report from NCBI's server. If a result is found online, it is returned to the user and also added to the database as a caching mechanism.
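
The lookup order described above can be sketched as follows. This is only an illustration under assumptions: a HashMap stands in for the database, a Function stands in for the "dus" that fetches from NCBI, and the assembly is reduced to a single String; none of these names come from the project.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

/** Hypothetical sketch: check the local store first, fall back to the remote source, then cache. */
final class CachingAssemblyLookup {
    // Stand-in for the database table, keyed by accession.
    private final Map<String, String> database = new HashMap<>();
    // Stand-in for the "dus" that downloads and parses the assembly report.
    private final Function<String, Optional<String>> remoteFetcher;
    // Counts remote fetches so the caching effect is observable.
    int remoteCalls = 0;

    CachingAssemblyLookup(Function<String, Optional<String>> remoteFetcher) {
        this.remoteFetcher = remoteFetcher;
    }

    Optional<String> findAssembly(String accession) {
        String cached = database.get(accession);
        if (cached != null) {
            return Optional.of(cached);
        }
        remoteCalls++;
        Optional<String> fetched = remoteFetcher.apply(accession);
        // Only successful remote results are stored; misses are not cached.
        fetched.ifPresent(assembly -> database.put(accession, assembly));
        return fetched;
    }
}
```

Whether unsuccessful lookups should also be remembered is a design choice the description leaves open; this sketch does not cache them.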

Allow using a proxy for FTP connections

In case connecting directly to FTPs is not allowed, but there's an available proxy, we could change our FTP browser to use the proxy.

from https://commons.apache.org/proper/commons-net/apidocs/org/apache/commons/net/SocketClient.html :

setSocketFactory method, which allows you to control the type of Socket the SocketClient creates for initiating network connections. This is especially useful for adding SSL or proxy support

Note that the class we use from apache commons FTPClient inherits from SocketClient.

Tasks:

  • add a property (in application.properties) to set the proxy host+port and use that for our FTP connections.
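
One possible shape for this task, using only the JDK: a SocketFactory that creates its sockets through a java.net.Proxy, plus a parser for a "host:port" property value. The class name, the property format and the choice of a SOCKS proxy are all assumptions; handing the factory to the FTP client via the setSocketFactory method quoted above is not shown and would need to be checked against the commons-net version in use.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.Socket;
import javax.net.SocketFactory;

/** Hypothetical SocketFactory whose sockets connect through a configured proxy. */
final class ProxySocketFactory extends SocketFactory {
    private final Proxy proxy;

    ProxySocketFactory(Proxy proxy) {
        this.proxy = proxy;
    }

    /** Parses "host:port" (assumed property format); an empty value means no proxy. */
    static Proxy parseProxy(String hostAndPort) {
        if (hostAndPort == null || hostAndPort.trim().isEmpty()) {
            return Proxy.NO_PROXY;
        }
        int colon = hostAndPort.lastIndexOf(':');
        if (colon <= 0) {
            throw new IllegalArgumentException("expected host:port but got " + hostAndPort);
        }
        String host = hostAndPort.substring(0, colon).trim();
        int port = Integer.parseInt(hostAndPort.substring(colon + 1).trim());
        // SOCKS is an assumption; the proxy host is left unresolved until it is used.
        return new Proxy(Proxy.Type.SOCKS, InetSocketAddress.createUnresolved(host, port));
    }

    @Override
    public Socket createSocket() {
        return new Socket(proxy); // unconnected socket bound to the proxy
    }

    @Override
    public Socket createSocket(String host, int port) throws IOException {
        Socket socket = createSocket();
        socket.connect(new InetSocketAddress(host, port));
        return socket;
    }

    @Override
    public Socket createSocket(InetAddress host, int port) throws IOException {
        Socket socket = createSocket();
        socket.connect(new InetSocketAddress(host, port));
        return socket;
    }

    @Override
    public Socket createSocket(String host, int port, InetAddress localHost, int localPort)
            throws IOException {
        Socket socket = createSocket();
        socket.bind(new InetSocketAddress(localHost, localPort));
        socket.connect(new InetSocketAddress(host, port));
        return socket;
    }

    @Override
    public Socket createSocket(InetAddress address, int port, InetAddress localAddress, int localPort)
            throws IOException {
        Socket socket = createSocket();
        socket.bind(new InetSocketAddress(localAddress, localPort));
        socket.connect(new InetSocketAddress(address, port));
        return socket;
    }
}
```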

Service to fetch Assemblies by accession

When given an accession there currently exists functionality in the project to:

  • Open an FTP connection (using FTPClient)
  • Construct the path of the required directory (using NCBIBrowser)
  • Navigate to the path of the *assembly_report.txt of given accession (using NCBIBrowser)
  • Download the report into an InputStream (using NCBIBrowser)
  • Read the InputStream using BufferedReader (using AssemblyReportReader)
  • Parse the report into AssemblyEntity and ChromosomeEntity Java objects (using AssemblyReportReader)

This functionality is presently being harnessed in various test cases. The goal is to expose this functionality to the user using an API. Doing this will require building these components:

  • AssemblyDao interface to define all functionality to be implemented.
  • FTPClientAssemblyDaoImplement that implements AssemblyDao.
  • AssemblyService is a Spring service that makes use of AssemblyDao.
  • ClientAPI is a Controller that lets user access AssemblyService through HTTP requests.

Use a real DB

Using H2 has allowed us to quickstart the project but we are starting to face issues by not having a real DB.

I think the easiest way is to put in the application properties some fields with maven properties that will be replaced by the selected maven profile. In other projects we define it like:

spring.datasource.url=@contig-alias-dbUrl@
spring.datasource.username=@contig-alias-dbUsername@
spring.datasource.password=@contig-alias-dbPassword@
spring.jpa.hibernate.ddl-auto=@contig-alias-ddlBehaviour@
spring.jpa.database-platform=org.hibernate.dialect.PostgreSQLDialect
spring.jpa.generate-ddl=true

where the url is a jdbc url like jdbc:postgresql://localhost:5432/postgres and ddl-auto is usually create or validate (https://docs.spring.io/autorepo/docs/spring-boot/1.1.0.M1/reference/html/howto-database-initialization.html, https://stackoverflow.com/questions/438146/what-are-the-possible-values-of-the-hibernate-hbm2ddl-auto-configuration-and-wha)

Having those properties and a running postgres should be the only changes needed.

Review classification of chromosome/scaffold

We allow enabling or disabling scaffold ingestion at compile time, but the classification of a sequence as "chromosome" or "scaffold" may not be completely right:

            if (!line.startsWith("#")) {
                parseChromosomeLine(line);
                String[] columns = line.split("\t", -1);
                if (columns.length >= 6) {
                    if (columns[3].equals("Chromosome")) {
                        parseChromosomeLine(columns);
                    } else if (isScaffoldsEnabled && columns[1].equals("unplaced-scaffold")) {
                        parseScaffoldLine(columns);
                    }
                }

For instance, MT (in ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.14_GRCh37.p13/) won't be considered either a chromosome or a scaffold, because its columns[3] is Mitochondrion, not Chromosome, and its columns[1] is assembled-molecule, not unplaced-scaffold.

Likewise, there are other contig roles besides unplaced-scaffold. GCA_000001405.14 has some of them, like alt-scaffold, fix-patch and others, and we should decide whether to include them and, if so, in which category. (Chromosome is meant for a small set of widely used sequences, scaffold for anything else.)
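
To show the gap described above, the sketch below restates the two conditions from the snippet as a standalone function that returns a category instead of calling the parsers. The enum and method names are made up for illustration, and the sample row used afterwards only fills in the two columns mentioned in this issue, with placeholders elsewhere.

```java
/** Hypothetical restatement of the current chromosome/scaffold decision. */
final class SequenceClassification {

    enum Category { CHROMOSOME, SCAFFOLD, IGNORED }

    static Category classify(String line, boolean scaffoldsEnabled) {
        if (line.startsWith("#")) {
            return Category.IGNORED; // header or comment line
        }
        String[] columns = line.split("\t", -1);
        if (columns.length < 6) {
            return Category.IGNORED; // not a full report row
        }
        if (columns[3].equals("Chromosome")) {
            return Category.CHROMOSOME;
        }
        if (scaffoldsEnabled && columns[1].equals("unplaced-scaffold")) {
            return Category.SCAFFOLD;
        }
        // Anything else, such as a Mitochondrion assembled-molecule, falls through.
        return Category.IGNORED;
    }
}
```

A row such as "MT<TAB>assembled-molecule<TAB>MT<TAB>Mitochondrion<TAB>na<TAB>na" is classified as IGNORED by this logic even when scaffolds are enabled, which is the behaviour this issue asks to review.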

support ga4gh identifiers

from the refget paper:

Refget defines three supported identifier algorithms; MD5, TRUNC512 and GA4GH Identifier. All three algorithms normalise sequence input by stripping all whitespace characters and restricting to characters in the range A-Z. We chose this character range as a compromise between the methods and requirements employed by CRAM, ENA and the Variation Representation Specification (VRS) [10]. MD5 is the default checksum algorithm used by the CRAM format’s M5 tag and hence the CRR. It is provided for backwards compatibility with existing CRAM files. However, there are limitations to md5’s algorithm; the occurrence of a checksum collision between non-identical sequences would be catastrophic. To mitigate this concern, we co-developed two schemes with the Genomic Knowledge Standards’ Variation Representation Specification (VRS) based on the SHA-512 checksum algorithm called TRUNC512 and GA4GH identifier. Both schemes use the first 24 bytes of a SHA-512 digest. TRUNC512 chooses to represent this as a hex encoded string. GA4GH identifier converts these bytes into a base64 URL encoded string formatted as “ga4gh:SQ.XXXX”. Both algorithms are interchangeable since both represent the same underlying SHA-512 digest, however the GA4GH identifier is preferred to maintain VRS compatibility.

I thought that refget only used trunc512 and md5, but it seems we should support the GA4GH identifiers. Luckily, I think we can store just trunc512 and md5 as we are doing at the moment and allow searches by ga4gh id by transforming it on the fly to the trunc512 id.
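
Going by the quoted description (both schemes encode the same first 24 bytes of a SHA-512 digest, one as hex and one as URL-safe base64 with a "ga4gh:SQ." prefix), the on-the-fly transformation could look roughly like this. The class and method names are hypothetical, and trunc512Of assumes the sequence has already been normalised to uppercase A-Z.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.Base64;

/** Hypothetical converter between GA4GH sequence identifiers and TRUNC512 hex digests. */
final class Ga4ghIdentifiers {
    private static final String PREFIX = "ga4gh:SQ.";
    private static final int DIGEST_BYTES = 24;

    /** "ga4gh:SQ.<base64url>" -> 48-character lowercase hex string. */
    static String toTrunc512(String ga4ghId) {
        if (ga4ghId == null || !ga4ghId.startsWith(PREFIX)) {
            throw new IllegalArgumentException("not a ga4gh:SQ. identifier: " + ga4ghId);
        }
        byte[] digest = Base64.getUrlDecoder().decode(ga4ghId.substring(PREFIX.length()));
        if (digest.length != DIGEST_BYTES) {
            throw new IllegalArgumentException("expected " + DIGEST_BYTES + " digest bytes");
        }
        return toHex(digest);
    }

    /** 48-character hex string -> "ga4gh:SQ.<base64url>" (no padding). */
    static String toGa4gh(String trunc512) {
        if (trunc512 == null || trunc512.length() != DIGEST_BYTES * 2) {
            throw new IllegalArgumentException("expected " + (DIGEST_BYTES * 2) + " hex characters");
        }
        byte[] digest = new byte[DIGEST_BYTES];
        for (int i = 0; i < DIGEST_BYTES; i++) {
            digest[i] = (byte) Integer.parseInt(trunc512.substring(2 * i, 2 * i + 2), 16);
        }
        return PREFIX + Base64.getUrlEncoder().withoutPadding().encodeToString(digest);
    }

    /** First 24 bytes of SHA-512 over an already normalised sequence, hex encoded. */
    static String trunc512Of(String normalisedSequence) {
        try {
            byte[] full = MessageDigest.getInstance("SHA-512")
                    .digest(normalisedSequence.getBytes(StandardCharsets.US_ASCII));
            return toHex(Arrays.copyOf(full, DIGEST_BYTES));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-512 is not available", e);
        }
    }

    private static String toHex(byte[] bytes) {
        StringBuilder hex = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }
}
```

Since 24 bytes encode to exactly 32 base64 characters, the identifier needs no padding, and a search by GA4GH id can be answered by converting it with toTrunc512 and querying the stored trunc512 column.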

Sequence collection data models without using contig alias

What data/Java model would be appropriate for storing sequence collections?
We need to be able to represent all 3 levels of sequence collections:

The top level digest S3LCyI788LE6vq89Tc_LojEcsMZRixzP
The compact level

{
    "sequences": "EiYgJtUfGyad7wf5atL5OG4Fkzohp2qe",
    "lengths": "5K4odB173rjao1Cnbk5BnvLt9V7aPAa2",
    "names": "g04lKdxiYtG3dOGeUC5AdKEifw65G0Wp"
}

The canonical level

{
  "lengths": [
    "1216",
    "970",
    "1788"
  ],
  "names": [
    "A",
    "B",
    "C"
  ],
  "sequences": [
    "76f9f3315fa4b831e93c36cd88196480",
    "d5171e863a3d8f832f0559235987b1e5",
    "b9b1baaa7abf206f6b70cf31654172db"
  ]
}

Additional question: Can we add another property to a sequence collection in this data model?
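
As one possible starting point for discussion, and not a settled design, the three levels could be held in a single immutable Java object where the compact and canonical levels are maps keyed by attribute name. All names below are hypothetical, and the digests are stored as given rather than computed. Keying by attribute name is what would let another property be added without changing the class.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model holding the three representations of one sequence collection. */
final class SequenceCollection {
    private final String topLevelDigest;                 // top level: a single digest
    private final Map<String, String> compact;           // compact level: attribute -> digest
    private final Map<String, List<String>> canonical;   // canonical level: attribute -> values

    SequenceCollection(String topLevelDigest,
                       Map<String, String> compact,
                       Map<String, List<String>> canonical) {
        // Both levels are expected to describe the same set of attributes.
        if (!compact.keySet().equals(canonical.keySet())) {
            throw new IllegalArgumentException(
                    "compact and canonical levels must use the same attribute names");
        }
        this.topLevelDigest = topLevelDigest;
        this.compact = Collections.unmodifiableMap(new LinkedHashMap<>(compact));
        this.canonical = Collections.unmodifiableMap(new LinkedHashMap<>(canonical));
    }

    String getTopLevelDigest() {
        return topLevelDigest;
    }

    Map<String, String> getCompact() {
        return compact;
    }

    Map<String, List<String>> getCanonical() {
        return canonical;
    }
}
```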

Create handler classes to convert List/Optional from Service into PagedModel for Controller

Wait, I missed this. It looks like we have a different understanding of what our paging strategy should be. My point is to provide a uniform interface, so that you can use every endpoint in the same way. However, if you query by an assembly accession and we know it can only possibly return 1 element, we can just ignore the pagination parameters (or actually remove them, if Spring doesn't complain when they are provided). I don't think we should change the service interface from Optional<> to Page<> either. Only return createAppropriateResponseEntity(optionalAssembly, assemblyAssembler); should stay from the current form, and you can add a method overload that takes an optional and creates a page so that the assembler can build the response.

Originally posted by @jmmut in #46

Sequence collection data models using contig alias

The current java model in contig alias has two main entities:

  • Chromosome: representing a single sequence provided by Genbank and ENA
  • Assembly: representing a group of sequences provided by Genbank and ENA

This issue will investigate how this model can be modified to support storing and providing sequence collections for the assemblies already represented.
We need to be able to represent all 3 levels of sequence collections:

  • The top level digest S3LCyI788LE6vq89Tc_LojEcsMZRixzP
  • The compact level
{
    "sequences": "EiYgJtUfGyad7wf5atL5OG4Fkzohp2qe",
    "lengths": "5K4odB173rjao1Cnbk5BnvLt9V7aPAa2",
    "names": "g04lKdxiYtG3dOGeUC5AdKEifw65G0Wp"
}
  • The canonical level
{
  "lengths": [
    "1216",
    "970",
    "1788"
  ],
  "names": [
    "A",
    "B",
    "C"
  ],
  "sequences": [
    "76f9f3315fa4b831e93c36cd88196480",
    "d5171e863a3d8f832f0559235987b1e5",
    "b9b1baaa7abf206f6b70cf31654172db"
  ]
}

Additional question: Can we add another property to a sequence collection in this data model?

fix redirection to swagger page

If this is the root path, https://wwwdev.ebi.ac.uk/eva/webservices/contig-alias/, it should redirect to the swagger page, without having to manually type https://wwwdev.ebi.ac.uk/eva/webservices/contig-alias/swagger-ui.html.

Instead, it shows a json that allows making a full paginated scan of the DB (http://wwwdev.ebi.ac.uk/eva/webservices/contig-alias/assemblyEntities). I didn't know about this feature but I guess it can be useful, so ideally this would still be available at another path. If we can't have both, having the swagger in the root path is more important.

API docs

Write API docs for all endpoints using Swagger or any other similar tool.
