metadatacenter / cedar-submission-server

CEDAR server to handle submissions to metadata repositories

License: Other

Java 99.45% XSLT 0.55%

cedar-submission-server's Introduction

CEDAR Submission Server

The server will listen on port 9010.

To validate an example CEDAR BioSample instance:

curl -X POST \
  -H "Accept: application/json" \
  -H "Content-type: application/json" \
  -H "Authorization: apiKey <CedarUserApiKey>" \
  -d @${CEDAR_HOME}/cedar-docs/repositories/BioSample/AMIA2016DemoBioSampleInstance-Example.json \
  "http://localhost:9010/command/validate-biosample"

To validate an example CEDAR AIRR instance:

curl -X POST \
  -H "Accept: application/json" \
  -H "Content-type: application/json" \
  -H "Authorization: apiKey <CedarUserApiKey>" \
  -d @${CEDAR_HOME}/cedar-docs/repositories/AIRR/EAB2017DemoAIRRSampleInstance-Example.json \
  "http://localhost:9010/command/validate-airr"

To submit an example CEDAR AIRR instance (with no user-supplied files):

curl -X POST \
  -H "Accept: application/json" \
  -H "Content-Type: multipart/form-data" \
  -H "Authorization: apiKey <CedarUserApiKey>" \
  -F "instance=@${CEDAR_HOME}/cedar-docs/repositories/AIRR/EAB2017DemoAIRRSampleInstance-Example.json" \
  "http://localhost:9010/command/submit-airr"

Here is a success response from the server:

{ 
  "isValid": true,
  "messages": []
}

Here is an error response from the server:

{ 
  "isValid": false,
  "messages": [ 
                "Empty Sample Identifier.", 
                "Empty attribute value for attribute 'biomaterial provider'." 
              ]
}
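
A client can branch on the two response shapes shown above. The following is a minimal Python sketch, assuming only the `isValid` and `messages` fields seen in these examples; the helper name `summarize_validation` is hypothetical and not part of the server.

```python
import json


def summarize_validation(response_text):
    """Return (is_valid, messages) from a validation response body.

    Assumes the body has the shape shown in the README examples:
    a boolean "isValid" and a list of strings under "messages".
    """
    body = json.loads(response_text)
    is_valid = bool(body.get("isValid", False))
    messages = list(body.get("messages", []))
    return is_valid, messages


# The error response from the README, used here as sample input.
error_body = """{
  "isValid": false,
  "messages": [
    "Empty Sample Identifier.",
    "Empty attribute value for attribute 'biomaterial provider'."
  ]
}"""

ok, msgs = summarize_validation(error_body)
if not ok:
    for message in msgs:
        print("validation error:", message)
```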

To validate that an example XML submission works against the BioSample REST service use the included example as follows:

curl -X POST \
  -d @${CEDAR_HOME}/cedar-docs/repositories/BioSample/Human.1.0-Example.xml \
  "https://www.ncbi.nlm.nih.gov/projects/biosample/validate/"

cedar-submission-server's People

Forkers

bukharilab

cedar-submission-server's Issues

Update ImmPort credentials

ImmPort submission token password seems to have changed. Email sent on September 19th.

curl -X POST https://auth.dev.immport.org/auth/token -d username=cedaruser -d password="<password>" 2>&1

{
"error" : "Bad credentials",
"status" : 401
}

Fails to compile with Java 16.0.1

Works with Java version 15. Following error with Java 16.0.1:

[ERROR] Failed to execute goal org.jvnet.jaxb2.maven2:maven-jaxb2-plugin:0.14.0:generate (default) on project cedar-submission-server-core: Execution default of goal org.jvnet.jaxb2.maven2:maven-jaxb2-plugin:0.14.0:generate failed: Cannot invoke "java.lang.reflect.Method.invoke(Object, Object[])" because "com.sun.xml.bind.v2.runtime.reflect.opt.Injector.defineClass" is null -> [Help 1]

Regenerate XML for missing file names

Complete CAIRR XML translation

Finish XML translation based on initial wiring here: #16

Follow-on task: #17

Generic composable BioProject/BioSample/SRA submission mechanism

Based on individual generic submission mechanisms for BioProject, BioSample, and SRA (#40, #42, #43) develop a generic way to compose an overall submission covering all three.

Requires #40
Requires #42
Requires #43

Update AIRR convertor

New AIRR template will require XML generation code update.

Develop generic Human.1.0 BioSample submission mechanism

Develop standalone BioSample Human.1.0 package element and a pluggable code module (see #41) that can be loaded by the submission server to transform it into NCBI-compliant XML.

(Ultimate goal would be a mechanism to generically cover all BioSample packages but first step is to simply handle specific packages.)

Requires #41

Template-to-repository module plugin system for submission server

Need means to specify and load code modules to translate specific template instances into a repository-compliant form.

Requires #39

Verify AIRR file names in instance against submission data files

We currently do not verify that the files listed by a user in their AIRR submission actually correspond to the files they upload. Need to check this before submission and give an appropriate validation error.

See if we can also verify the file type against the data files - though might be hard.
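
The name check described here amounts to comparing two sets. A minimal Python sketch follows, assuming the listed and uploaded file names are already available as plain strings; the function name and the error wording are illustrative, not the server's.

```python
def check_file_names(listed_names, uploaded_names):
    """Compare file names listed in an instance with the uploaded files.

    Returns a list of human-readable validation errors (empty if they match).
    """
    listed = set(listed_names)
    uploaded = set(uploaded_names)
    errors = []
    # Files named in the instance that the user never uploaded.
    for name in sorted(listed - uploaded):
        errors.append(f"File '{name}' is listed in the instance but was not uploaded.")
    # Files uploaded that the instance does not mention.
    for name in sorted(uploaded - listed):
        errors.append(f"File '{name}' was uploaded but is not listed in the instance.")
    return errors


errors = check_file_names(["a.fastq", "b.fastq"], ["b.fastq", "c.fastq"])
for error in errors:
    print(error)
```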

Get workspaces from ImmPort API

Submission server calls the ImmPort Private Data API with the user's ImmPort credentials to obtain the list of workspaces and displays the list to the user.

Required for steps 3 and 4 in ImmPort Submission Steps here:

https://docs.google.com/document/d/1J0j3scmOK8yZQDMH1Te1EiU_epmGCts01dsEHjj--e4/edit

Corresponding front end task: metadatacenter/cedar-template-editor#610

Submission wiring for new CAIRR template

Following on #16 and #18, wire up submission service to SRA for new AIRR template.

Develop generic BioProject submission mechanism

Develop standalone BioProject element and a pluggable code module (see #41) that can be loaded by the submission server to transform it into NCBI-compliant XML.

Requires #41

Submission/AIRR Enhancements

In addition to checking file name/data file correspondence (#36) and reference integrity (#37), investigate if there are other pre-submission integrity checks we can do on AIRR instances.

AIRR filename/type checking

Currently users can enter a filename in the AIRR template and leave the file type blank, which causes NCBI validation errors because the partial data makes it into the submission XML.

Make sure partially completed file name/type pairs do not generate an XML entry.
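
One way to express this rule is a filter applied before XML generation. The Python sketch below assumes file entries arrive as (name, type) pairs in which either half may be blank or missing; that representation is an assumption for illustration, not the template's actual structure.

```python
def complete_file_entries(entries):
    """Keep only (file_name, file_type) pairs where both halves are non-blank."""
    kept = []
    for name, file_type in entries:
        name = (name or "").strip()
        file_type = (file_type or "").strip()
        # Skip partially completed pairs so they never reach the submission XML.
        if name and file_type:
            kept.append((name, file_type))
    return kept


entries = [("reads_1.fastq", "fastq"), ("reads_2.fastq", ""), (None, "bam")]
print(complete_file_entries(entries))
```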

AIRR/SRA file name validation/detection

Validate data files submitted via AIRR or SRA template. Ensure that files entered in template match files selected in dialog.

Another option is that the data files users upload automatically fill in the template fields.

Patch invalid artifacts on production

As of August 30th we have the following:

(#invalid)/(#total resources)
Template: 72/925
Element: 28/1,443

ImmPort data upload

Once the upload registration ticket is created (#6, #7), the following steps are performed in the submission server:

a. CEDAR submission server exports form content as ImmPort json templates.
b. CEDAR submission server creates a zip file or a directory containing the exported ImmPort json templates and possibly other files with the possibly modified package name returned by the previous API call.
c. CEDAR submission server executes the ImmPort script that transfers a file or directory to the Aspera server and provides the ImmPort credentials, and the upload registration ticket and the possibly modified package name returned by the previous API call.
d. ImmPort script validates the input parameters by calling the ImmPort Private Data API and if valid, starts the Aspera transfer and returns the appropriate status when complete.
e. If the transfer is successful, CEDAR submission server calls the ImmPort Private Data API to change the upload registration ticket status to 'Pending' and provides the ImmPort credentials and upload registration ticket.
f. ImmPort Private Data API validates the upload registration ticket, moves the transferred zip file or directory from the Aspera server to the ImmPort Data Upload Server's web upload drop zone, and changes the status of the upload registration ticket to 'Pending'.
g. ImmPort Data Upload Server processes the pending upload registration ticket and sends an e-mail to the CEDAR / ImmPort user with the data upload result.

Described in step 10 here:

https://docs.google.com/document/d/1J0j3scmOK8yZQDMH1Te1EiU_epmGCts01dsEHjj--e4/edit

FTP submission queue

Currently, all submission examples on the submission server are synchronous pass-throughs: the initial call waits until the submission is complete.

Need a queue-based system to handle large submissions (e.g., AIRR, where there may be several multi-GB files).

Related to NCBI submission task #5.
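
The queue-based idea can be sketched with a single background worker. This is a minimal in-process Python illustration, assuming a hypothetical `submit` callable that performs the slow transfer; a real deployment would more likely need a persistent queue, which this sketch does not provide.

```python
import queue
import threading


def start_submission_worker(submit, results):
    """Start a background worker that processes queued submissions.

    `submit` is a hypothetical callable doing the slow upload;
    `results` is a dict updated with (status, detail) per job id.
    """
    jobs = queue.Queue()

    def worker():
        while True:
            job_id, payload = jobs.get()
            if job_id is None:  # sentinel: stop the worker
                jobs.task_done()
                break
            try:
                results[job_id] = ("done", submit(payload))
            except Exception as exc:
                results[job_id] = ("failed", str(exc))
            finally:
                jobs.task_done()

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return jobs, thread


results = {}
jobs, thread = start_submission_worker(lambda p: f"uploaded {p}", results)
jobs.put(("job-1", "sample.fastq"))  # the caller returns immediately
jobs.put((None, None))               # ask the worker to stop
jobs.join()
thread.join()
print(results)
```

With this shape, the initial REST call only enqueues the job and returns a job id; the client can look up the status later.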

Create ImmPort upload registration ticket

Using previous authentication credentials and workspace selection (#6), the submission server calls the ImmPort Private Data API to create an upload registration ticket and provides the ImmPort credentials, the workspace ID, and the package name (either a zip file name or a directory name).

ImmPort Private Data API performs authentication and authorization based on ImmPort credentials and workspace ID and if valid, possibly updates the package name (replaces spaces with underscores, etc), creates an upload registration ticket with a status of 'Created', and returns the HTTP status, the upload registration ticket and its status, and the (possibly modified) package name. If authentication or authorization fails, or an error occurs during this process, the appropriate error is returned.

Required for steps 6-8 in ImmPort Submission Steps here:

https://docs.google.com/document/d/1J0j3scmOK8yZQDMH1Te1EiU_epmGCts01dsEHjj--e4/edit

Extending Submission server to manage submissions

https://docs.google.com/document/d/1Q4quatLQqV9lANzB6Mex8ZCwna9JeM6Q24OSU0JPe4Y/edit

Requires metadatacenter/cedar-resource-server#61

NCBI submission credentials

Need to validate with a real submission but it appears that we do not need to supply NCBI credentials on submission to SRA/BioSample/BioProject.

The credentials model NCBI adopts is that the user-supplied email is used to contact the submitter after a successful submission. For failed submissions CEDAR is responsible for informing the user. Our current approach of monitoring the submission FTP site and reading the generated report files already fulfills that goal. (FYI, NCBI refers to this process as ‘brokered submission’, with CEDAR being the broker.) After a successful submission, CEDAR is effectively out of the loop. The email sent by NCBI to the user invites them to go to the NCBI web site, where they can create an account and monitor all submissions they have made.

Add CEDAR attribute to BioSample submission

Add an optional attribute to the BioSample submission to indicate that CEDAR was the submission tool used.

Environment variables for external repos

Need exhaustive variables for NCBI and ImmPort submission (base URL, user name, password).
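
As an illustration of what such a variable set could look like, the Python sketch below reads a base URL, user name, and password for one repository and fails early if any is missing. The variable names (for example `CEDAR_NCBI_BASE_URL`) are hypothetical and are not the names the server uses.

```python
import os


def load_repo_config(prefix, env=None):
    """Read <prefix>_BASE_URL, <prefix>_USER_NAME and <prefix>_PASSWORD.

    The variable names are hypothetical. Raises KeyError listing every
    variable that is missing or empty.
    """
    env = os.environ if env is None else env
    keys = ("BASE_URL", "USER_NAME", "PASSWORD")
    missing = [f"{prefix}_{k}" for k in keys if not env.get(f"{prefix}_{k}")]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {k.lower(): env[f"{prefix}_{k}"] for k in keys}


# Example with an explicit mapping instead of the real environment.
sample_env = {
    "CEDAR_NCBI_BASE_URL": "ftp://ftp.example.org",
    "CEDAR_NCBI_USER_NAME": "cedar",
    "CEDAR_NCBI_PASSWORD": "secret",
}
config = load_repo_config("CEDAR_NCBI", sample_env)
print(config["base_url"])
```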

Mapping of CAIRR BioProject fields to NCBI BioProject XML

Determine mappings from CAIRR BioProject fields to NCBI BioProject XML.

The following are the BioProject fields in the CAIRR template:

These need to be mapped to NCBI BioProject XML, e.g.,

<Project>

  <ProjectID/>                          

  <Descriptor>

    <Title>Candida albicans A123</Title>
    <Description> <p> Genome Sequencing of C. albicans </p> </Description>
    <ExternalLink category="Related Resources" label="Genomics Institute">
      <URL>www.organization.org</URL>
    </ExternalLink>

    <Relevance> <Medical>yes</Medical> </Relevance>

  </Descriptor>

  <ProjectType>
    <!-- sample-scope = eMonoisolate eMultiisolate eMultispecies eEnvironment eSynthetic eSingleCell eOther -->

    <ProjectTypeSubmission sample_scope="eMonoisolate">

      <Organism>
        <OrganismName>Candida albicans A123</OrganismName>
        <Strain>A123</Strain>
      </Organism>

      <BioSampleSet>
        <BioSample>
          <PrimaryId db="BioSample">SAMN000123</PrimaryId>
        </BioSample>
      </BioSampleSet>
      
      <IntendedDataTypeSet>
        <DataType>
          genome sequencing
          <!--
              genome sequencing raw sequence reads genome sequencing and assembly
              metagenome metagenomic assembly assembly transcriptome proteomic map
              clone ends targeted loci targeted loci cultured targeted loci
              environmental random survey exome variation epigenomics phenotype or genotype other
          -->
        </DataType>
      </IntendedDataTypeSet>
      
    </ProjectTypeSubmission>
  </ProjectType>

</Project>
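
Once the mappings are determined, generating this structure is mechanical. The Python sketch below builds a small part of the example above with the standard library; the parameter names (`title`, `description`, `organism_name`) are placeholders, since the actual CAIRR field names are not listed here.

```python
import xml.etree.ElementTree as ET


def build_project(title, description, organism_name, sample_scope="eMonoisolate"):
    """Build a partial BioProject <Project> element from placeholder fields."""
    project = ET.Element("Project")
    ET.SubElement(project, "ProjectID")

    descriptor = ET.SubElement(project, "Descriptor")
    ET.SubElement(descriptor, "Title").text = title
    desc = ET.SubElement(descriptor, "Description")
    ET.SubElement(desc, "p").text = description

    project_type = ET.SubElement(project, "ProjectType")
    submission = ET.SubElement(
        project_type, "ProjectTypeSubmission", {"sample_scope": sample_scope}
    )
    organism = ET.SubElement(submission, "Organism")
    ET.SubElement(organism, "OrganismName").text = organism_name
    return project


xml_text = ET.tostring(
    build_project(
        "Candida albicans A123",
        "Genome Sequencing of C. albicans",
        "Candida albicans A123",
    ),
    encoding="unicode",
)
print(xml_text)
```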

Some fields already go to the Description element in a submission:

    <Description>
        <Comment>AIRR (myasthenia gravis) data to the NCBI using the CAIRR</Comment>
        <Submitter user_name="[email protected]"/>
        <Organization type="lab" role="owner">
            <Name>Yale University</Name>
            <Contact email="[email protected]">
                <Name>
                    <First>Kevin</First>
                    <Last>O'Connor</Last>
                </Name>
            </Contact>
        </Organization>
    </Description>

User studies with AIRR volunteers

Incrementally evaluate the AIRR submission process with the 17 or so volunteers from the AIRR meeting.

Add AIRR SRA FTP upload functionality

Add REST endpoint to accept CEDAR AIRR instance plus a raw data file and submit it to NCBI's FTP server.

Associated front end task is metadatacenter/cedar-template-editor#558

Update AIRR template on submission server

AIRR template on submission server needs to be updated to reflect new model updates

AIRR sampleID to filename association

Need a general approach for associating BioSample IDs with uploaded files.

The current SRA template element does not have an explicit way to do this. Instead, there is an implicit assumption that the file name is the same as the sample ID.

Use individual fields for SRA file names

Currently we have file names listed as attribute value fields. This means that they are not available in spreadsheet mode and users may miss them.

However, more than one file may be associated with an SRA entry so we cannot just have one file name. Perhaps have 5ish? Need to determine sensible limit.

Initial Yale AIRR submission testing

Yale will recruit some local submitters, submit some AIRR metadata and data, and report feedback.

Update release date handling for AIRR template

Problem described in following email to Yuriy at NCBI (with suggested solution confirmed as appropriate by him):

Our understanding is that the public release of, say, an SRA submission entry will force the release of referenced BioSample submission entries, which will in turn force the public release of the referenced BioProject. So, even if, for example, a BioProject submission has a future release date, the public release of a BioSample that references it will effectively make the BioProject public irrespective of its release date.

Is this understanding correct?

The reason this is an issue for us is that we are generating a BioProject/BioSample/SRA submission from the AIRR specification and it includes only a BioSample release date - and has no overall submission release date or SRA entry release date. Since the SRA parts of the submission have no release date we are assuming that they are released immediately - which forces release of the referenced BioSamples and in turn the BioProject, effectively making the entire submission public immediately.

We are assuming that we should include release dates for each BioProject, BioSample, and SRA entry to fully control the public release dates?
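
The cascade described in this email can be made concrete with a small calculation. The Python sketch below is an illustration of the understanding stated above, not of NCBI's actual rules: an entry becomes public at the earliest of its own release date and the effective dates of the entries that reference it, and a missing date is treated as immediate release.

```python
from datetime import date


def effective_release(own_date, referencing_dates, today):
    """Earliest date at which an entry becomes public.

    `own_date` may be None (no release date, treated as immediate).
    `referencing_dates` are the effective dates of entries that reference it.
    """
    candidates = [own_date if own_date is not None else today]
    candidates += [d if d is not None else today for d in referencing_dates]
    return min(candidates)


today = date(2018, 1, 1)
sra_release = None  # the SRA entry carries no release date
biosample_effective = effective_release(date(2019, 6, 1), [sra_release], today)
bioproject_effective = effective_release(date(2020, 1, 1), [biosample_effective], today)
print(biosample_effective, bioproject_effective)
```

Under this model, a single SRA entry without a release date pulls both the BioSample and the BioProject forward to the submission date.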

Define REST tests for submission server

Need to test validation (and perhaps submission) endpoints.

Verify AIRR instance internal identifier consistency

We do not currently validate internal references in AIRR instances (e.g., that a sample ID in an SRA entry has a corresponding BioSample entry).

See if this is possible. Have to be careful because samples may have been previously submitted - so this may not be possible statically.
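
A static version of this check could look like the Python sketch below. It assumes the BioSample IDs and the sample IDs referenced from SRA entries have already been extracted as strings. The optional `known_submitted` argument is a hypothetical allowance for samples submitted earlier, which, as noted above, may not be knowable statically.

```python
def unresolved_sample_refs(biosample_ids, sra_sample_refs, known_submitted=()):
    """Return SRA sample references with no matching BioSample entry.

    `known_submitted` is a hypothetical list of IDs accepted as previously
    submitted, so they are not reported as unresolved.
    """
    known = set(biosample_ids) | set(known_submitted)
    return sorted({ref for ref in sra_sample_refs if ref not in known})


missing = unresolved_sample_refs(
    biosample_ids=["S1", "S2"],
    sra_sample_refs=["S1", "S3", "S4", "S3"],
    known_submitted=["S4"],
)
print(missing)
```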

Monitor NCBI submissions

NCBI sends an email ~5-30 minutes after a submission. Does not make for great demos.

However, it does more frequently generate incremental report files in the submission FTP directory that can be downloaded and used to inform users of processing. We can use the continuous monitoring code (#8) to periodically poll the FTP directory.
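
The polling loop can be sketched independently of the FTP details. In the Python sketch below, `list_reports` is a hypothetical callable that returns the report file names currently in the submission directory; the interval, poll limit, and report names are illustrative.

```python
import time


def poll_for_reports(list_reports, on_new_report,
                     interval_seconds=30.0, max_polls=10, sleep=time.sleep):
    """Call `on_new_report` once for each report file name not seen before.

    `list_reports` is a hypothetical callable returning current file names.
    """
    seen = set()
    for attempt in range(max_polls):
        for name in list_reports():
            if name not in seen:
                seen.add(name)
                on_new_report(name)
        if attempt < max_polls - 1:
            sleep(interval_seconds)
    return seen


# Simulated directory listings for three successive polls.
listings = iter([[], ["report.1.xml"], ["report.1.xml", "report.2.xml"]])
notified = []
poll_for_reports(lambda: next(listings), notified.append,
                 max_polls=3, sleep=lambda seconds: None)
print(notified)
```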

Incorrect source for Cell Subset field in AIRR template

The MiAIRR template restricts the value to the Cell Line (CLO) ontology.

This ontology does not have the 'naive B cell' class for example. The Cell Ontology (CL) has this value.

Which ontology should the field use?
Answer: CL

Need to update template and possibly instances.

Switching to NCBI production server

Determine timing of switch from using NCBI test server to production server.

Need to determine if we need a per-user login for the use of the production server.

AIRR NCBI FTP submission

Add NCBI user email to XML for submitting AIRR instance to SRA.

Previously, we were going to provide the ability to supply a user name and password. This is no longer needed because we now use the CEDAR account to submit to the NCBI FTP server and embed the user's email in the submission XML.

Current FTP submission code here:

https://github.com/metadatacenter/cedar-submission-server/blob/develop/cedar-submission-server-application/src/main/java/org/metadatacenter/cedar/submission/resources/AIRRSubmissionServerResource.java

The AIRR template is here:

https://cedar.staging.metadatacenter.net/templates/edit/https://repo.staging.metadatacenter.net/templates/d7c9d050-a4aa-4448-b7ed-cf58b980baf2?folderId=https:%2F%2Frepo.staging.metadatacenter.net%2Ffolders%2Ff77e5f5e-0bce-4e52-8415-a684614b9461

IMPORTANT: the JSON-to-XML translation code in the submission server expects the template to be exactly as specified here. Any changes to the template must be accompanied by changes to the translation code.

Dynamic detection of duplicate NCBI sample

A frequent cause of failure on BioSample or SRA submission is the reuse of a user-defined accession number for a sample.

(Note these user-supplied sample numbers are mapped to BioSample accession numbers on submission.)

The accession numbers for samples must be unique so reuse for another submission causes NCBI submission errors quite late in the submission process.

Not clear if there is a way to ask NCBI if a number has been previously used by a particular user. The per user aspect is a major complicating factor since we do not know a user's NCBI credentials at any point in the submission process.

Builds on basic metadata validation (metadatacenter/cedar-template-editor#779).

Create standard BioSample templates

Create templates for the 9 standard BioSample submission packages. Conversion code in submission service must handle these.

Microbe; version 1.0
Model organism or animal; version 1.0
Metagenome or environmental; version 1.0
Invertebrate; version 1.0
Human; version 1.0
Plant; version 1.0
Virus; version 1.0
Beta-lactamase; version 1.0

https://www.ncbi.nlm.nih.gov/biosample/docs/packages/

Requirements analysis for generic NCBI BioSample submission

Figure out what work is required to support generic BioSample submission from CEDAR.

BioSample package submission is the key.

https://www.ncbi.nlm.nih.gov/biosample/docs/packages/

Investigate GenBank submission for AIRR

Investigate how GenBank submission can be made from AIRR template (in addition to current BioProject/BioSample/SRA submission).

Add authentication to submission server

Currently does not have authentication

Continuous monitoring of ImmPort submission

Optionally, submission server continuously calls the ImmPort Private Data API to get the status of the upload registration ticket, which will be updated by the ImmPort Data Upload Server after the job is processed. It may take several minutes for the data upload to complete.

Described by step 9 here:

https://docs.google.com/document/d/1J0j3scmOK8yZQDMH1Te1EiU_epmGCts01dsEHjj--e4/edit
