nih-sparc / sparc-api
SPARC Portal API
License: Apache License 2.0
I followed the instructions in README.md to run the app, but on macOS with Xcode 12 they didn't work. The following does work, however:
python3 -m venv ./venv
. ./venv/bin/activate
pip install --upgrade pip
pip install wheel                # added to prevent several warnings when installing the requirements
export ARCHFLAGS="-arch x86_64"  # needed on macOS when using Xcode 12
pip install -r requirements.txt
pip install -r requirements-dev.txt
gunicorn main:app
Following the release of the new Web API, our /check_simulation
endpoint is not always able to determine the end of a simulation. For datasets 135, 157, and 308 everything is fine, but for datasets 318 and 320 we get an error message saying that the simulation failed, even though it actually completed successfully.
The Download file button on the dataset details page is not working. It seems to throw a CORS error when calling https://api.pennsieve.io/zipit/discover
Try this by hitting download file here:
https://sparc.science/datasets/137?type=dataset
If you edit the request in Firefox and resend it, it will work.
My proposed solution would be to add https://sparc.science and https://staging.sparc.science to the Access-Control-Allow-Origin header
on api.pennsieve
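For illustration, the proposed fix amounts to echoing back an allowed origin in the response header. A minimal sketch of that header logic (the function name and allow-list handling are my own, not Pennsieve's actual implementation):

```python
# Origins we would allow, per the proposal above.
ALLOWED_ORIGINS = {
    "https://sparc.science",
    "https://staging.sparc.science",
}

def cors_headers(request_origin):
    """Return the CORS headers to attach to a response, if the origin is allowed."""
    if request_origin in ALLOWED_ORIGINS:
        return {
            "Access-Control-Allow-Origin": request_origin,
            # Responses now differ per origin, so caches must key on it.
            "Vary": "Origin",
        }
    return {}
```

A server-side middleware would call this with the incoming `Origin` header and merge the result into the response.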
Presigned URLs to resources on Pennsieve are not cross-origin enabled.
As far as I understand, this means the /download/
endpoint will not be usable?
Line 139 in 59239d0
sparc.science needs to be whitelisted by the resources at:
"pennsieve-prod-discover-publish-use1"
This can be confirmed by creating a presigned url with the following steps:
fetch('<presigned_url>')
Using cors anywhere to confirm resource can be reached:
fetch('https://cors-anywhere.herokuapp.com/' + '<presigned_url>')
The exists
API method returns a structure like {'exists': 'true'};
it should be {'exists': True}.
# s3 and Config are module-level objects in the app
def url_exists(path):
    try:
        head_response = s3.head_object(
            Bucket=Config.S3_BUCKET_NAME,
            Key=path,
            RequestPayer="requester"
        )
    except ClientError:
        return {'exists': 'false'}  # string instead of boolean False
    content_length = head_response.get('ContentLength', 0)
    if content_length > 0:
        return {'exists': 'true'}  # string instead of boolean True
    return {'exists': 'false'}
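A minimal sketch of the fix, returning real booleans. The s3 client is passed in as a parameter here purely for illustration (in the app it is a module-level boto3 client, and the caught exception would be botocore's ClientError):

```python
def url_exists(s3_client, bucket, path):
    """Return {'exists': True/False} (real booleans) for an S3 key."""
    try:
        head = s3_client.head_object(
            Bucket=bucket, Key=path, RequestPayer="requester"
        )
    except Exception:  # botocore.exceptions.ClientError in the real app
        return {'exists': False}
    # A zero-length object is treated as not existing, matching the original logic.
    return {'exists': head.get('ContentLength', 0) > 0}

# Fake clients for illustration; the app would pass its real boto3 s3 client.
class FakeS3Found:
    def head_object(self, **kwargs):
        return {'ContentLength': 42}

class FakeS3Missing:
    def head_object(self, **kwargs):
        raise RuntimeError("404 Not Found")
```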
To reproduce:
git clone https://github.com/nih-sparc/sparc-api.git
pip install -r requirements.txt
This fails with the error:
ERROR: Could not find a version that satisfies the requirement post==2019.4.13
After a little investigation I found a comment from a PyPI admin saying that the following packages have been removed from PyPI:
get==2019.4.13
post==2019.4.13
request==2019.4.13
Seems to run fine without them. I've put the following pull request in to fix this:
#24
To keep the map page running during the transition from scicrunch to algolia facets, I have included both of them in the facet prop mapping.
This ticket is a reminder to remove the scicrunch facets once the transition is completed.
This issue is where the documentation for the data staging environment will be stored until the data staging environment has its own repo.
An example of this running can be found at:
https://context-cards-demo-stage.herokuapp.com/maps
(note that only the maps page is implemented currently)
Because of this, we will first focus on the changes needed to display an unpublished or updated dataset on the /maps page. Much of this can be applied to the /data page, but a decent amount of work needs to be done to modify the data pulled from scicrunch. This is needed because of a few limitations on the amount of data we get from the unpublished vs published datasets.
Part A: Running the current implementation
A1. Setting up environment variable
A2. Dataset processing from scicrunch
Part B: Understanding and modifying the current implementation
B1. Modifying sparc-api requests to use pennsieveId as opposed to DOI
B2. Downloading files from the pennsieve python client as opposed to from s3
B3. Downloading dataset thumbnails from pennsieve (as opposed to discover)
B4. Front end changes
The following will be needed to stage and retrieve the datasets:
There are two categories: variables that can be kept the same as in normal development, and those that need to be changed.
Same as normal:
ALGOLIA_API_KEY=XXXXXXX
ALGOLIA_APP_ID=XXXX
AWS_USER_POOL_WEB_CLIENT_ID=XXXXX
KNOWLEDGEBASE_KEY=XXXXXXXXXXXXX
The pennsieve keys must have access to the desired datasets for staging to see them:
PENNSIEVE_API_TOKEN=XXXXXXX
PENNSIEVE_API_SECRET=XXXXXXX
And these are set to the curation index:
SCICRUNCH_HOST=https://scicrunch.org/api/1/elastic/SPARC_PortalDatasets_stage
ALGOLIA_INDEX=k-core_curation
Note that ALGOLIA_INDEX is a front-end variable. It is set in sparc-app.
Feel free to slack or email me if you are working on this and need any of these keys
Datasets can be put through the scicrunch elastic search processing via a url.
https://sparc.scicrunch.io/sparc/stage?api_key=<KNOWLEDGEBASE_KEY>
where <KNOWLEDGEBASE_KEY> is your KNOWLEDGEBASE_KEY.
There is no queue for processing, and datasets can only be processed one at a time. The status endpoint is used to check whether the server is available and ready.
Use this url:
https://sparc.scicrunch.io/sparc/stage?api_key=<KNOWLEDGEBASE_KEY>&datasetID=<pennsieve-id>
where <pennsieve-id> is the Pennsieve identifier, e.g. 5c0a31f6-4926-4091-8876-3b11af7846ed
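The staging call above can be scripted. A minimal sketch of building the URL (the helper name and the suggestion to use requests are my own; only the URL shape comes from this doc):

```python
from urllib.parse import urlencode

STAGE_BASE = "https://sparc.scicrunch.io/sparc/stage"

def build_stage_url(knowledgebase_key, dataset_id=None):
    """Build the staging URL; omit dataset_id to just check server status."""
    params = {"api_key": knowledgebase_key}
    if dataset_id is not None:
        params["datasetID"] = dataset_id
    return f"{STAGE_BASE}?{urlencode(params)}"

# e.g. requests.get(build_stage_url(KNOWLEDGEBASE_KEY,
#                                   "5c0a31f6-4926-4091-8876-3b11af7846ed"))
```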
Use the staging branch of sparc-api:
#157
And this branch of sparc-app:
https://github.com/Tehsurfer/sparc-app/tree/new-staging
Set the sparc-api endpoint on sparc-app to where you are running it. (Often http://localhost:5000/)
That should be everything needed to view staged datasets on the /maps page
Next we will go into how it works and how to develop it further.
I currently don't know of any tickets to develop this further, but I imagine there will be a ticket soon to be able to stage datasets across all of sparc.science with a logged in user's pennsieve keys.
This is as simple as modifying the elastic search query to use the pennsieveId as opposed to the DOI:
def create_pennsieve_id_query(pennsieveId):
    query = {
        "size": 50,
        "from": 0,
        "query": {
            "term": {
                "item.identifier.aggregate": {
                    "value": f"N:dataset:{pennsieveId}"
                }
            }
        }
    }
    return query
The results from here can be processed with app/scicrunch_process_results.py. The function used is _prepare_results(results):
We first check which method of downloading to use based on the length of the id (Pennsieve ids are longer than discover ids). This is necessary because scicrunch does not tell us which type of id is returned for unpublished datasets.
# This version of s3-resource is used for accessing files on staging that have never been published
@app.route("/s3-resource/<path:path>")
def direct_download_url2(path):
    filePath = path.split('files/')[-1]
    pennsieveId = path.split('/')[0]
    # If the id is short, it is a Pennsieve discover id; process it with the normal s3-resource route
    if len(pennsieveId) <= 4:
        return direct_download_url(path)
    if 'N:package:' not in pennsieveId:
        pennsieveId = 'N:dataset:' + pennsieveId
    url = bfWorker.getURLFromDatasetIdAndFilePath(pennsieveId, filePath)
    if url is not None:
        resp = requests.get(url)
        return resp.content
    return jsonify({'error': 'error with the provided ID'}), 502
Note that url = bfWorker.getURLFromDatasetIdAndFilePath(pennsieveId, filePath)
retrieves a temporary download url from the pennsieve python client for a pennsieve id and file path.
If the dataset does have a discover id, we need to retrieve the Pennsieve id to use with the Pennsieve python client.
You could attempt to avoid the call to scicrunch that translates between the two ids, but I believe I did it this way to keep the downloads consistent.
The banners unfortunately cannot be returned from the pennsieve python client or discover, so we must use the pennsieve REST api. The endpoint used for this is /get_banner.
Note that in order to use the pennsieve REST api you must log in via the s3 auth system:
@app.route("/get_banner/<datasetId>")
def get_banner_pen(datasetId):
    p_temp_key = pennsieve_login()
    ban = get_banner(p_temp_key, datasetId)
    return ban
Since unpublished datasets return less content, checks likely need to be added to keep the site from accessing properties of undefined and crashing the site.
I did this by just adding more logic to check that fields exist, but it may be a bit more complicated to implement on the /datasets page, where a lot of processing is done in one big async data block.
The code to get this running on the /maps page is available here:
https://github.com/Tehsurfer/map-sidebar-curation
If you have any questions about the implementation or have ideas on how to do this better feel free to chat here or dm me.
No other endpoints are accessible if the server api is not authenticated with the Blackfynn host.
Line 80 in 7fa8d1b
A lot of other endpoints could still run if Blackfynn is down or unauthorized, as they rely on Blackfynn discover or scicrunch.