nih-sparc / sparc-api
SPARC Portal API
License: Apache License 2.0
I followed the instructions in README.md to run the app, but on macOS with Xcode 12 they didn't work. The following does work, however:
python3 -m venv ./venv
. ./venv/bin/activate
pip install --upgrade pip
pip install wheel                # added to prevent several warnings when installing the requirements
export ARCHFLAGS="-arch x86_64"  # needed on macOS when using Xcode 12
pip install -r requirements.txt
pip install -r requirements-dev.txt
gunicorn main:app
Following the release of the new Web API, our /check_simulation
endpoint is not always able to determine the end of a simulation. For datasets 135, 157, and 308 everything is fine, but for datasets 318 and 320 we get an error message saying that the simulation failed, even though it actually completed successfully.
The Download file button on the dataset details page is not working. It seems to throw a CORS error when calling https://api.pennsieve.io/zipit/discover
Try this by hitting download file here:
https://sparc.science/datasets/137?type=dataset
If you edit the request in Firefox and resend it, it will work.
My proposed solution would be to add https://sparc.science and https://staging.sparc.science to the Access-Control-Allow-Origin header
on api.pennsieve
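For illustration, the proposed fix amounts to echoing back an allowed origin in the response header. A minimal sketch of that header logic (the function name and allow-list handling are my own, not Pennsieve's actual implementation):

```python
# Origins we would allow, per the proposal above.
ALLOWED_ORIGINS = {
    "https://sparc.science",
    "https://staging.sparc.science",
}

def cors_headers(request_origin):
    """Return the CORS headers to attach to a response, if the origin is allowed."""
    if request_origin in ALLOWED_ORIGINS:
        return {
            "Access-Control-Allow-Origin": request_origin,
            # Responses now differ per origin, so caches must key on it.
            "Vary": "Origin",
        }
    return {}
```

A server-side middleware would call this with the incoming `Origin` header and merge the result into the response.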
Presigned URLs to resources on Pennsieve are not cross-origin enabled.
As far as I understand, this means the /download/
endpoint will not be usable?
Line 139 in 59239d0
sparc.science needs to be whitelisted by the resources at:
"pennsieve-prod-discover-publish-use1"
This can be confirmed by creating a presigned url with the following steps:
fetch('<presigned_url>')
Using cors anywhere to confirm resource can be reached:
fetch('https://cors-anywhere.herokuapp.com/' + '<presigned_url>')
The exists
API method returns a structure like {'exists': 'true'};
it should be {'exists': True}.
# s3 and Config are module-level objects in the app
def url_exists(path):
    try:
        head_response = s3.head_object(
            Bucket=Config.S3_BUCKET_NAME,
            Key=path,
            RequestPayer="requester"
        )
    except ClientError:
        return {'exists': 'false'}  # string instead of boolean False
    content_length = head_response.get('ContentLength', 0)
    if content_length > 0:
        return {'exists': 'true'}  # string instead of boolean True
    return {'exists': 'false'}
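A minimal sketch of the fix, returning real booleans. The s3 client is passed in as a parameter here purely for illustration (in the app it is a module-level boto3 client, and the caught exception would be botocore's ClientError):

```python
def url_exists(s3_client, bucket, path):
    """Return {'exists': True/False} (real booleans) for an S3 key."""
    try:
        head = s3_client.head_object(
            Bucket=bucket, Key=path, RequestPayer="requester"
        )
    except Exception:  # botocore.exceptions.ClientError in the real app
        return {'exists': False}
    # A zero-length object is treated as not existing, matching the original logic.
    return {'exists': head.get('ContentLength', 0) > 0}

# Fake clients for illustration; the app would pass its real boto3 s3 client.
class FakeS3Found:
    def head_object(self, **kwargs):
        return {'ContentLength': 42}

class FakeS3Missing:
    def head_object(self, **kwargs):
        raise RuntimeError("404 Not Found")
```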
To reproduce:
git clone https://github.com/nih-sparc/sparc-api.git
pip install -r requirements.txt
This fails with the error:
ERROR: Could not find a version that satisfies the requirement post==2019.4.13
After a little investigation I found a comment from a PyPI admin saying that the following packages have been removed from PyPI:
get==2019.4.13
post==2019.4.13
request==2019.4.13
Seems to run fine without them. I've put the following pull request in to fix this:
#24
To keep the map page running during the transition from scicrunch to algolia facets, I have included both of them in the facet prop mapping.
This ticket is a reminder to remove the scicrunch facets once the transition is completed.
This issue is where the documentation for the data staging environment will be stored until the data staging environment has its own repo.
An example of this running can be found at:
https://context-cards-demo-stage.herokuapp.com/maps
(note that only the maps page is implemented currently)
Because of this, we will first focus on the changes needed to display an unpublished or updated dataset on the /maps page. Much of this can be applied to the /data page, but a decent amount of work needs to be done to modify the data pulled from scicrunch. This is needed because of a few limitations on the amount of data we get from the unpublished vs published datasets.
Part A: Running the current implementation
A1. Setting up environment variable
A2. Dataset processing from scicrunch
Part B: Understanding and modifying the current implementation
B1. Modifying sparc-api requests to use pennsieveId as opposed to DOI
B2. Downloading files from the pennsieve python client as opposed to from s3
B3. Downloading dataset thumbnails from pennsieve (as opposed to discover)
B4. Front end changes
The following will be needed to stage and retrieve the datasets:
There are two categories: variables that can be kept the same as in normal development, and those that need to be changed.
Same as normal:
ALGOLIA_API_KEY=XXXXXXX
ALGOLIA_APP_ID=XXXX
AWS_USER_POOL_WEB_CLIENT_ID=XXXXX
KNOWLEDGEBASE_KEY=XXXXXXXXXXXXX
The pennsieve keys must have access to the desired datasets for staging to see them:
PENNSIEVE_API_TOKEN=XXXXXXX
PENNSIEVE_API_SECRET=XXXXXXX
And these are set to the curation index:
SCICRUNCH_HOST=https://scicrunch.org/api/1/elastic/SPARC_PortalDatasets_stage
ALGOLIA_INDEX=k-core_curation
Note that ALGOLIA_INDEX is a front-end variable. It is set in sparc-app.
Feel free to slack or email me if you are working on this and need any of these keys
Datasets can be put through the scicrunch elastic search processing via a url.
https://sparc.scicrunch.io/sparc/stage?api_key=<KNOWLEDGEBASE_KEY>
where <KNOWLEDGEBASE_KEY> is your KNOWLEDGEBASE_KEY.
There is no queue for processing, and datasets can only be processed one at a time. The status endpoint is used to check whether the server is available and ready.
Use this url:
https://sparc.scicrunch.io/sparc/stage?api_key=<KNOWLEDGEBASE_KEY>&datasetID=<pennsieve-id>
where <pennsieve-id> is the Pennsieve identifier, e.g. 5c0a31f6-4926-4091-8876-3b11af7846ed
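The staging call above can be scripted. A minimal sketch of building the URL (the helper name and the suggestion to use requests are my own; only the URL shape comes from this doc):

```python
from urllib.parse import urlencode

STAGE_BASE = "https://sparc.scicrunch.io/sparc/stage"

def build_stage_url(knowledgebase_key, dataset_id=None):
    """Build the staging URL; omit dataset_id to just check server status."""
    params = {"api_key": knowledgebase_key}
    if dataset_id is not None:
        params["datasetID"] = dataset_id
    return f"{STAGE_BASE}?{urlencode(params)}"

# e.g. requests.get(build_stage_url(KNOWLEDGEBASE_KEY,
#                                   "5c0a31f6-4926-4091-8876-3b11af7846ed"))
```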
Use the staging branch of sparc-api:
#157
And this branch of sparc-app:
https://github.com/Tehsurfer/sparc-app/tree/new-staging
Set the sparc-api endpoint on sparc-app to where you are running it. (Often http://localhost:5000/)
That should be everything needed to view staged datasets on the /maps page
Next we will go into how it works and how to develop it further.
I currently don't know of any tickets to develop this further, but I imagine there will be a ticket soon to be able to stage datasets across all of sparc.science with a logged in user's pennsieve keys.
This is as simple as modifying the elastic search query to use the pennsieveId as opposed to the DOI:
def create_pennsieve_id_query(pennsieveId):
    query = {
        "size": 50,
        "from": 0,
        "query": {
            "term": {
                "item.identifier.aggregate": {
                    "value": f"N:dataset:{pennsieveId}"
                }
            }
        }
    }
    return query
The results from here can be processed with app/scicrunch_process_results.py. The function used is _prepare_results(results):
We first check which method of downloading to use based on the length of the id (Pennsieve ids are longer than discover ids). This is necessary because scicrunch does not tell us which type of id is returned for unpublished datasets.
# This version of s3-resource is used for accessing files on staging that have never been published
@app.route("/s3-resource/<path:path>")
def direct_download_url2(path):
    filePath = path.split('files/')[-1]
    pennsieveId = path.split('/')[0]
    # If the id is short, it is a Pennsieve discover id; process it with the normal s3-resource route
    if len(pennsieveId) <= 4:
        return direct_download_url(path)
    if 'N:package:' not in pennsieveId:
        pennsieveId = 'N:dataset:' + pennsieveId
    url = bfWorker.getURLFromDatasetIdAndFilePath(pennsieveId, filePath)
    if url is not None:
        resp = requests.get(url)
        return resp.content
    return jsonify({'error': 'error with the provided ID'}), 502
Note that url = bfWorker.getURLFromDatasetIdAndFilePath(pennsieveId, filePath)
retrieves a temporary download url from the pennsieve python client for a pennsieve id and file path.
If the dataset does have a discover id, we need to retrieve the Pennsieve id to use with the Pennsieve python client.
You could attempt to avoid the call to scicrunch that translates between the two ids, but I believe I did it this way to keep the downloads consistent.
The banners unfortunately cannot be returned from the pennsieve python client or discover, so we must use the pennsieve REST api. The endpoint used for this is /get_banner.
Note that in order to use the pennsieve REST api you must log in via the s3 auth system:
@app.route("/get_banner/<datasetId>")
def get_banner_pen(datasetId):
    p_temp_key = pennsieve_login()
    ban = get_banner(p_temp_key, datasetId)
    return ban
Since unpublished datasets return less content, checks likely need to be added to keep the site from accessing properties of undefined and crashing the site.
I did this by just adding more logic to check that fields exist, but it may be a bit more complicated to implement on the /datasets page, where a lot of processing is done in one big async data block.
The code to get this running on the /maps page is available here:
https://github.com/Tehsurfer/map-sidebar-curation
If you have any questions about the implementation or have ideas on how to do this better feel free to chat here or dm me.
No other endpoints are accessible if the server api is not authenticated with the Blackfynn host.
Line 80 in 7fa8d1b
A lot of other endpoints could still run if Blackfynn is down or unauthorized, as they rely on Blackfynn discover or scicrunch.