cedar-submission-server's Introduction

CEDAR Submission Server

The server will listen on port 9010.

To validate an example CEDAR BioSample instance:

curl -X POST \
  -H "Accept: application/json" \
  -H "Content-type: application/json" \
  -H "Authorization: apiKey <CedarUserApiKey>" \
  -d @${CEDAR_HOME}/cedar-docs/repositories/BioSample/AMIA2016DemoBioSampleInstance-Example.json \
  "http://localhost:9010/command/validate-biosample"

To validate an example CEDAR AIRR instance:

curl -X POST \
  -H "Accept: application/json" \
  -H "Content-type: application/json" \
  -H "Authorization: apiKey <CedarUserApiKey>" \
  -d @${CEDAR_HOME}/cedar-docs/repositories/AIRR/EAB2017DemoAIRRSampleInstance-Example.json \
  "http://localhost:9010/command/validate-airr"

To submit an example CEDAR AIRR instance (with no user-supplied files):

curl -X POST \
  -H "Accept: application/json" \
  -H "Content-Type: multipart/form-data" \
  -H "Authorization: apiKey <CedarUserApiKey>" \
  -F "instance=@${CEDAR_HOME}/cedar-docs/repositories/AIRR/EAB2017DemoAIRRSampleInstance-Example.json" \
  "http://localhost:9010/command/submit-airr"

Here is a success response from the server:

{ 
  "isValid": true,
  "messages": []
}

Here is an error response from the server:

{ 
  "isValid": false,
  "messages": [ 
                "Empty Sample Identifier.", 
                "Empty attribute value for attribute 'biomaterial provider'." 
              ]
}
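
A client can branch on the two fields in these responses. The following is a minimal sketch, assuming the response body has exactly the shape shown above; `summarize_validation` is a hypothetical helper name, not part of the server.

```python
import json


def summarize_validation(body: str) -> str:
    """Turn a validation response body into a one-line, human-readable summary."""
    response = json.loads(body)
    if response.get("isValid"):
        return "Instance is valid."
    # Fall back to an empty list if the server omits "messages".
    messages = response.get("messages", [])
    return "Instance is invalid: " + " ".join(messages)


error_body = """
{
  "isValid": false,
  "messages": [
    "Empty Sample Identifier.",
    "Empty attribute value for attribute 'biomaterial provider'."
  ]
}
"""
print(summarize_validation(error_body))
```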

To validate that an example XML submission works against the BioSample REST service use the included example as follows:

curl -X POST \
  -d @${CEDAR_HOME}/cedar-docs/repositories/BioSample/Human.1.0-Example.xml \
  "https://www.ncbi.nlm.nih.gov/projects/biosample/validate/"

cedar-submission-server's People

Contributors

bukharilab, egyedia, johardi, marcosmro, martinjoconnor, willrett

Forkers

bukharilab

cedar-submission-server's Issues

Update ImmPort credentials

ImmPort submission token password seems to have changed. Email sent on September 19th.

curl -X POST https://auth.dev.immport.org/auth/token -d username=cedaruser -d password="<password>" 2>&1

{
"error" : "Bad credentials",
"status" : 401
}

Fails to compile with Java 16.0.1

Works with Java version 15. Following error with Java 16.0.1:

[ERROR] Failed to execute goal org.jvnet.jaxb2.maven2:maven-jaxb2-plugin:0.14.0:generate (default) on project cedar-submission-server-core: Execution default of goal org.jvnet.jaxb2.maven2:maven-jaxb2-plugin:0.14.0:generate failed: Cannot invoke "java.lang.reflect.Method.invoke(Object, Object[])" because "com.sun.xml.bind.v2.runtime.reflect.opt.Injector.defineClass" is null -> [Help 1]

Develop generic Human.1.0 BioSample submission mechanism

Develop standalone BioSample Human.1.0 package element and a pluggable code module (see #41) that can be loaded by the submission server to transform it into NCBI-compliant XML.

(Ultimate goal would be a mechanism to generically cover all BioSample packages but first step is to simply handle specific packages.)

Requires #41

Verify AIRR file names in instance against submission data files

We currently do not verify that the files listed by a user in their AIRR submission actually correspond to the files they upload. We need to check this before submission and return an appropriate validation error.

See if we can also verify the file type against the data files - though might be hard.
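
One possible shape for the name check, as a sketch only: it assumes the listed and uploaded file names are available as plain lists of strings, and `check_file_names` is a hypothetical function, not existing server code.

```python
import os


def check_file_names(listed, uploaded):
    """Return validation messages for mismatches between listed and uploaded files.

    listed:   file names the user entered in the AIRR instance (assumed input)
    uploaded: paths or names of the files actually uploaded (assumed input)
    """
    listed_names = {os.path.basename(name) for name in listed}
    uploaded_names = {os.path.basename(name) for name in uploaded}
    messages = []
    for name in sorted(listed_names - uploaded_names):
        messages.append(f"File '{name}' is listed in the instance but was not uploaded.")
    for name in sorted(uploaded_names - listed_names):
        messages.append(f"File '{name}' was uploaded but is not listed in the instance.")
    return messages


print(check_file_names(["a.fastq", "b.fastq"], ["/tmp/a.fastq", "c.fastq"]))
```

An empty list means the two sets of names agree. The messages could be returned in the same "messages" array used by the validation responses.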

Submission/AIRR Enhancements

In addition to checking file name/data file correspondence (#36) and reference integrity (#37), investigate whether there are other pre-submission integrity checks we can do on AIRR instances.

Parent of #36
Parent of #19
Parent of #37
Parent of #46
Parent of metadatacenter/cedar-template-editor#779
Parent of metadatacenter/cedar-project#633
Parent of metadatacenter/cedar-template-editor#780
Parent of metadatacenter/cedar-project#894
Parent of metadatacenter/cedar-project#918
Parent of metadatacenter/cedar-project#917
Parent of metadatacenter/cedar-project#923
Parent of metadatacenter/cedar-project#936
Parent of metadatacenter/cedar-project#948
Parent of metadatacenter/cedar-project#949
Parent of metadatacenter/cedar-project#950
Parent of #39
Parent of #40
Parent of #41
Parent of #42
Parent of #43
Parent of #44

AIRR filename/type checking

Currently users can enter a filename in the AIRR template and leave the file type blank, which causes NCBI validation errors because the partial data makes it into the submission XML.

Make sure partially completed file name/type pairs do not generate an XML entry.
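
A sketch of the filtering step, assuming the name/type pairs arrive as tuples. The `<File>` element and its attributes are placeholders for illustration, not the NCBI submission schema, and `file_entries` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def file_entries(pairs):
    """Build XML entries only for complete (file name, file type) pairs."""
    entries = []
    for name, file_type in pairs:
        name = (name or "").strip()
        file_type = (file_type or "").strip()
        if not name or not file_type:
            continue  # skip partially completed pairs
        # Placeholder element; the real submission XML uses its own schema.
        entries.append(ET.Element("File", {"name": name, "type": file_type}))
    return entries


pairs = [("reads_1.fastq", "fastq"), ("reads_2.fastq", ""), (None, "bam")]
for entry in file_entries(pairs):
    print(ET.tostring(entry, encoding="unicode"))
```

Only the first pair produces an entry; the two partial pairs are dropped. It may be better to report dropped pairs as validation warnings rather than discard them silently.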

AIRR/SRA file name validation/detection

Validate data files submitted via AIRR or SRA template. Ensure that files entered in template match files selected in dialog.

Another option is for the data files that users upload to automatically fill in the template fields.

ImmPort data upload

Once the upload registration ticket is created (#6, #7), the following steps are performed in the submission server:

a. CEDAR submission server exports form content as ImmPort json templates.
b. CEDAR submission server creates a zip file or a directory containing the exported ImmPort json templates and possibly other files with the possibly modified package name returned by the previous API call.
c. CEDAR submission server executes the ImmPort script that transfers a file or directory to the Aspera server and provides the ImmPort credentials, and the upload registration ticket and the possibly modified package name returned by the previous API call.
d. ImmPort script validates the input parameters by calling the ImmPort Private Data API and if valid, starts the Aspera transfer and returns the appropriate status when complete.
e. If the transfer is successful, CEDAR submission server calls the ImmPort Private Data API to change the upload registration ticket status to 'Pending' and provides the ImmPort credentials and upload registration ticket.
f. ImmPort Private Data API validates the upload registration ticket, moves the transferred zip file or directory from the Aspera server to the ImmPort Data Upload Server's web upload drop zone, and changes the status of the upload registration ticket to 'Pending'.
g. ImmPort Data Upload Server processes the pending upload registration ticket and sends an e-mail to the CEDAR / ImmPort user with the data upload result.

Described in step 10 here:

https://docs.google.com/document/d/1J0j3scmOK8yZQDMH1Te1EiU_epmGCts01dsEHjj--e4/edit
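
Step b (packaging the exported templates) could look roughly like the sketch below. It covers only the zip case and assumes the exported templates are already available as a mapping from file name to JSON content; `build_package` and the example file name are hypothetical.

```python
import json
import os
import tempfile
import zipfile


def build_package(templates, package_name, out_dir):
    """Write each exported template as a JSON file inside <package_name>.zip.

    templates:    mapping of file name -> JSON-serializable content (assumed shape)
    package_name: the possibly modified name returned when the ticket was created
    """
    zip_path = os.path.join(out_dir, package_name + ".zip")
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for file_name, content in templates.items():
            archive.writestr(file_name, json.dumps(content, indent=2))
    return zip_path


with tempfile.TemporaryDirectory() as out_dir:
    path = build_package({"subjects.json": {"subjects": []}}, "my_package", out_dir)
    with zipfile.ZipFile(path) as archive:
        print(archive.namelist())
```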

FTP submission queue

Currently, all submission examples on the submission server are synchronous pass-throughs: the initial call blocks until the submission is complete.

Need a queue-based system to handle large submissions (e.g., AIRR, where there may be several multi-GB files).

Related to NCBI submission task #5.
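
To illustrate the idea, here is a minimal in-process sketch using a worker thread and a queue. It is not a proposal for the actual design: `fake_upload` stands in for the real FTP transfer, the status strings are made up, and a production queue would need to persist across server restarts.

```python
import queue
import threading

submissions = queue.Queue()
results = {}


def submit(submission_id, payload):
    """Enqueue a submission and return immediately instead of blocking."""
    results[submission_id] = "QUEUED"
    submissions.put((submission_id, payload))


def worker(upload):
    """Process queued submissions one at a time using the given upload function."""
    while True:
        item = submissions.get()
        if item is None:  # sentinel: stop the worker
            submissions.task_done()
            break
        submission_id, payload = item
        try:
            upload(payload)
            results[submission_id] = "SUCCEEDED"
        except Exception as error:
            results[submission_id] = f"FAILED: {error}"
        finally:
            submissions.task_done()


def fake_upload(payload):
    """Stand-in for the real FTP transfer (hypothetical)."""
    if not payload:
        raise ValueError("empty payload")


thread = threading.Thread(target=worker, args=(fake_upload,), daemon=True)
thread.start()
submit("s1", b"data")
submit("s2", b"")
submissions.put(None)
submissions.join()  # wait until every queued item has been processed
print(results)
```

The caller gets control back as soon as `submit` returns and can look up the status later, which is the behavior the initial call needs for multi-GB uploads.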

Create ImmPort upload registration ticket

Using the previous authentication credentials and workspace selection (#6), the submission server calls the ImmPort Private Data API to create an upload registration ticket and provides the ImmPort credentials, the workspace ID, and the package name (either a zip file name or a directory name).

ImmPort Private Data API performs authentication and authorization based on ImmPort credentials and workspace ID and if valid, possibly updates the package name (replaces spaces with underscores, etc), creates an upload registration ticket with a status of 'Created', and returns the HTTP status, the upload registration ticket and its status, and the (possibly modified) package name. If authentication or authorization fails, or an error occurs during this process, the appropriate error is returned.

Required for steps 6-8 in ImmPort Submission Steps here:

https://docs.google.com/document/d/1J0j3scmOK8yZQDMH1Te1EiU_epmGCts01dsEHjj--e4/edit

NCBI submission credentials

Need to validate with a real submission but it appears that we do not need to supply NCBI credentials on submission to SRA/BioSample/BioProject.

The credentials model NCBI adopts is that the user-supplied email is used to contact the submitter after a successful submission. For failed submissions CEDAR is responsible for informing the user. Our current approach of monitoring the submission FTP site and reading the generated report files already fulfills that goal. (FYI, NCBI refers to this process as ‘brokered submission’, with CEDAR being the broker.) After a successful submission, CEDAR is effectively out of the loop. The email sent by NCBI to the user invites them to go to the NCBI web site, where they can create an account and monitor all submissions they have made.

Mapping of CAIRR BioProject fields to NCBI BioProject XML

Determine mappings from CAIRR BioProject fields to NCBI BioProject XML.

The following are the BioProject fields in the CAIRR template:

[Screenshot of the CAIRR template BioProject fields, 2018-02-14]

These need to be mapped to NCBI BioProject XML, e.g.,

<Project>

  <ProjectID/>                          

  <Descriptor>

    <Title>Candida albicans A123</Title>
    <Description> <p> Genome Sequencing of C. albicans </p> </Description>
    <ExternalLink category="Related Resources" label="Genomics Institute">
      <URL>www.organization.org</URL>
    </ExternalLink>

    <Relevance> <Medical>yes</Medical> </Relevance>

  </Descriptor>

  <ProjectType>
    <!-- sample-scope = eMonoisolate eMultiisolate eMultispecies eEnvironment eSynthetic eSingleCell eOther -->

    <ProjectTypeSubmission sample_scope="eMonoisolate">

      <Organism>
        <OrganismName>Candida albicans A123</OrganismName>
        <Strain>A123</Strain>
      </Organism>

      <BioSampleSet>
        <BioSample>
          <PrimaryId db="BioSample">SAMN000123</PrimaryId>
        </BioSample>
      </BioSampleSet>
      
      <IntendedDataTypeSet>
        <DataType>
          genome sequencing
          <!--
              genome sequencing raw sequence reads genome sequencing and assembly
              metagenome metagenomic assembly assembly transcriptome proteomic map
              clone ends targeted loci targeted loci cultured targeted loci
              environmental random survey exome variation epigenomics phenotype or genotype other
          -->
        </DataType>
      </IntendedDataTypeSet>
      
    </ProjectTypeSubmission>
  </ProjectType>

</Project>
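
Once the mappings are decided, the translation could be sketched as below. The element names come from the example XML above; the dictionary keys (`title`, `description`, `sample_scope`, `data_type`) are hypothetical stand-ins, since the actual CAIRR field names are not listed in the text.

```python
import xml.etree.ElementTree as ET


def build_project(fields):
    """Map a flat dict of BioProject field values onto the <Project> skeleton."""
    project = ET.Element("Project")
    ET.SubElement(project, "ProjectID")
    descriptor = ET.SubElement(project, "Descriptor")
    ET.SubElement(descriptor, "Title").text = fields["title"]
    description = ET.SubElement(descriptor, "Description")
    ET.SubElement(description, "p").text = fields["description"]
    project_type = ET.SubElement(project, "ProjectType")
    submission = ET.SubElement(
        project_type, "ProjectTypeSubmission", {"sample_scope": fields["sample_scope"]}
    )
    data_types = ET.SubElement(submission, "IntendedDataTypeSet")
    ET.SubElement(data_types, "DataType").text = fields["data_type"]
    return project


example = {
    "title": "Candida albicans A123",
    "description": "Genome Sequencing of C. albicans",
    "sample_scope": "eMonoisolate",
    "data_type": "genome sequencing",
}
print(ET.tostring(build_project(example), encoding="unicode"))
```

The sketch omits the ExternalLink, Relevance, Organism, and BioSampleSet elements; they would follow the same pattern once their source fields are known.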

Some fields already go to the Description element in a submission:

    <Description>
        <Comment>AIRR (myasthenia gravis) data to the NCBI using the CAIRR</Comment>
        <Submitter user_name="[email protected]"/>
        <Organization type="lab" role="owner">
            <Name>Yale University</Name>
            <Contact email="[email protected]">
                <Name>
                    <First>Kevin</First>
                    <Last>O'Connor</Last>
                </Name>
            </Contact>
        </Organization>
    </Description>

AIRR sampleID to filename association

Need a general approach for associating BioSample IDs with uploaded files.

The current SRA template element does not have an explicit way to do this. Instead, there is an implicit assumption that the file name is the same as the sample ID.
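
The implicit assumption can be written down as a sketch. Here the part of the file name before the first dot is treated as the sample ID; `associate_files` is a hypothetical helper, and the approach breaks for sample IDs that themselves contain dots, which is one reason an explicit association may be preferable.

```python
from pathlib import PurePath


def associate_files(sample_ids, file_names):
    """Group uploaded files by sample ID, assuming the file name stem equals the ID."""
    by_sample = {sample_id: [] for sample_id in sample_ids}
    unmatched = []
    for file_name in file_names:
        # Strip all extensions, e.g. "S1.fastq.gz" -> "S1".
        stem = PurePath(file_name).name.split(".", 1)[0]
        if stem in by_sample:
            by_sample[stem].append(file_name)
        else:
            unmatched.append(file_name)
    return by_sample, unmatched


print(associate_files(["S1", "S2"], ["S1.fastq.gz", "S3.fastq"]))
```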

Use individual fields for SRA file names

Currently we have file names listed as attribute value fields. This means that they are not available in spreadsheet mode and users may miss them.

However, more than one file may be associated with an SRA entry, so we cannot just have one file name. Perhaps have around five? Need to determine a sensible limit.

Update release date handling for AIRR template

Problem described in following email to Yuriy at NCBI (with suggested solution confirmed as appropriate by him):

Our understanding is that the public release of, say, an SRA submission entry will force the release of referenced BioSample submission entries, which will in turn force the public release of the referenced BioProject. So, even if, for example, a BioProject submission has a future release date, the public release of a BioSample that references it will effectively make the BioProject public irrespective of its release date.

Is this understanding correct?

The reason this is an issue for us is that we are generating a BioProject/BioSample/SRA submission from the AIRR specification and it includes only a BioSample release date - and has no overall submission release date or SRA entry release date. Since the SRA parts of the submission have no release date we are assuming that they are released immediately - which forces release of the referenced BioSamples and in turn the BioProject, effectively making the entire submission public immediately.

We are assuming that we should include release dates for each BioProject, BioSample, and SRA entry to fully control the public release dates?
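
The cascade described in the email can be worked through with a small sketch. It encodes only the understanding stated above (a missing date means immediate release, and a released entry forces release of whatever it references); `effective_release` and the dates are illustrative, not NCBI logic.

```python
from datetime import date


def effective_release(own, dependents, today):
    """Earliest date an entry becomes public.

    own:        the entry's own release date, or None for immediate release
    dependents: effective release dates of entries that reference this one
    """
    own_date = own if own is not None else today
    return min([own_date, *dependents])


today = date(2018, 1, 1)
sra = effective_release(None, [], today)                       # no date: released immediately
biosample = effective_release(date(2019, 1, 1), [sra], today)  # forced by the SRA entry
bioproject = effective_release(date(2020, 1, 1), [biosample], today)
print(sra, biosample, bioproject)
```

With no SRA release date, all three entries become public on the submission day, regardless of the later BioSample and BioProject dates. Giving the SRA entry its own date removes the forced early release.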

Verify AIRR instance internal identifier consistency

We do not currently validate internal references in AIRR instances (e.g., that a sample ID in an SRA entry has a corresponding BioSample entry).

See if this is possible. Have to be careful because samples may have been previously submitted - so this may not be possible statically.
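
A static check could be sketched as follows, assuming the sample IDs have already been extracted from the instance into lists. `find_dangling_sample_ids` is hypothetical, and the optional `previously_submitted` argument is a placeholder for however previously submitted samples might be looked up, which remains an open question.

```python
def find_dangling_sample_ids(biosample_ids, sra_sample_ids, previously_submitted=()):
    """Return SRA sample IDs that have no matching BioSample entry.

    previously_submitted: IDs known to exist from earlier submissions (assumed
    to be supplied by the caller; there may be no reliable way to obtain them).
    """
    known = set(biosample_ids) | set(previously_submitted)
    return sorted(set(sra_sample_ids) - known)


print(find_dangling_sample_ids(["S1", "S2"], ["S1", "S3", "S4"], previously_submitted=["S4"]))
```

Because earlier submissions cannot always be known, it may be safer to report dangling IDs as warnings rather than hard errors.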

Monitor NCBI submissions

NCBI sends an email ~5-30 minutes after a submission. Does not make for great demos.

However, it does more frequently generate incremental report files in the submission FTP directory that can be downloaded and used to inform users of processing. We can use the continuous monitoring code (#8) to periodically poll the FTP directory.
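
The polling loop might look like the sketch below. It is independent of the existing monitoring code: `list_reports` is an injected callable (it could wrap something like `ftplib.FTP.nlst`, which is an assumption here), and the interval and poll count are arbitrary.

```python
import time


def poll_reports(list_reports, handle_report, interval_seconds=30, max_polls=10, sleep=time.sleep):
    """Poll for new report files and pass each new one to handle_report once."""
    seen = set()
    for poll in range(max_polls):
        for name in sorted(list_reports()):
            if name not in seen:
                seen.add(name)
                handle_report(name)
        if poll < max_polls - 1:
            sleep(interval_seconds)  # wait before the next poll
    return seen


# Simulated directory listings; a real lister would query the FTP directory.
listings = iter([["report.1.xml"], ["report.1.xml", "report.2.xml"]])
handled = []
poll_reports(lambda: next(listings), handled.append, max_polls=2, sleep=lambda seconds: None)
print(handled)
```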

Incorrect source for Cell Subset field in AIRR template

The MiAIRR template restricts the value to the Cell Line (CLO) ontology.

This ontology does not have the 'naive B cell' class for example. The Cell Ontology (CL) has this value.

Which ontology should the field use?
Answer: CL

Need to update template and possibly instances.

Switching to NCBI production server

Determine timing of switch from using NCBI test server to production server.

Need to determine if we need a per-user login for the use of the production server.

AIRR NCBI FTP submission

Add NCBI user email to XML for submitting AIRR instance to SRA.

Previously we were going to provide the ability to supply a user name and password. This is no longer needed because we now use the CEDAR account to submit to the NCBI FTP server and embed the user email in the submission XML.

Current FTP submission code here:

https://github.com/metadatacenter/cedar-submission-server/blob/develop/cedar-submission-server-application/src/main/java/org/metadatacenter/cedar/submission/resources/AIRRSubmissionServerResource.java

The AIRR template is here:

https://cedar.staging.metadatacenter.net/templates/edit/https://repo.staging.metadatacenter.net/templates/d7c9d050-a4aa-4448-b7ed-cf58b980baf2?folderId=https:%2F%2Frepo.staging.metadatacenter.net%2Ffolders%2Ff77e5f5e-0bce-4e52-8415-a684614b9461

IMPORTANT: the JSON-to-XML translation code in the submission server expects the template to be exactly as specified here. Any changes to the template must be accompanied by changes to the translation code.

Dynamic detection of duplicate NCBI sample

A frequent cause of failure on BioSample or SRA submission is the reuse of a user-defined accession number for a sample.

(Note these user-supplied sample numbers are mapped to BioSample accession numbers on submission.)

The accession numbers for samples must be unique so reuse for another submission causes NCBI submission errors quite late in the submission process.

Not clear if there is a way to ask NCBI if a number has been previously used by a particular user. The per user aspect is a major complicating factor since we do not know a user's NCBI credentials at any point in the submission process.

Builds on basic metadata validation (metadatacenter/cedar-template-editor#779).

AIRR validation call

Add a validation call to test for some high-level errors in the AIRR template, e.g., BioSample IDs not matching or file names not matching.
