uc-cdis / indexd


Index service server

License: Apache License 2.0

Python 99.38% Dockerfile 0.39% Shell 0.12% Mako 0.11%

gen3

indexd's Introduction

Indexd

Indexd is a hash-based data indexing and tracking service providing 128-bit globally unique identifiers. It is designed to be accessed via a REST-like API or via a client, such as the reference client implementation. It supports distributed resolution with a central resolver talking to other Indexd servers.

For more about GUIDs and an example of a central resolver, see dataguids.org.

Indexd is a vital microservice used in the Gen3 Open Source Software Platform. Gen3 is used for developing data commons that colocate compute and storage. Data commons accelerate and democratize the process of scientific discovery, especially over large or complex datasets.

The Problem That Indexd Solves

Data inevitably moves and changes, which leads to unreproducible research. It's not uncommon for physical data to be moved from one storage location to another, for domain names to change, and/or for data to exist in multiple locations.

If you run an analysis over a set of data and later it gets moved, your analysis is no longer repeatable. The same data still exists, it just isn't where you thought.

This presents a huge problem for repeatable research. There needs to be a unique identifier for a given piece of data that can be used in analyses without "hard-coding" the physical location of the data.

The Solution: Indexd's Globally Unique Identifiers (GUIDs)

Indexd serves as an abstraction over the physical data locations, providing a Globally Unique Identifier (GUID) per datum. These identifiers will always be resolvable with Indexd and will always provide the locations of the physical data, even if the data moves.

Data GUIDs were created by the Data Biosphere, a vibrant ecosystem for biomedical research containing community-driven, standards-based, open-source, modular, and interoperable components that can be assembled into diverse data environments.

GUIDs provide a domain-neutral, persistent and scalable way to track data across platforms. Indexd is a proven solution to provide GUIDs for data.

Technical Details

Quick Links:

View API Documentation
Installation Instructions
Use Cases for Indexing Data

Indexd is a two-layer system. On the bottom layer, each data object has a GUID and hashes that map to known physical locations of the data.

The second layer is aliases. Aliases are user-defined, human-readable identifiers that map to GUIDs. This adds the flexibility of supporting human-readable identifiers and allows referencing existing identifiers (such as DOIs and ARKs) that are created in other systems.

GUIDs are primarily used to track the current location of data as it is moved or copied from one location to another. The GUID itself is at minimum 128 bits, as a UUID is used as the base. Additionally, a prefix can be prepended to this UUID, which lengthens the identifier even further (this is used primarily to assist in distributed resolution). If you want a shorter identifier, you can use the aliases defined above to create a different, unique mapping to GUIDs.

Data GUIDs with a prefix are structured as follows:

dg.[resourceId]/[128-bit UUID]

All data GUIDs with optional prefixes begin with the characters: dg
The second component in a data GUID is a unique string that identifies a resource that can resolve the data GUID. Prefixes are assigned by the Open Commons Consortium. There is no charge for being assigned a data GUID prefix, but the organization that is assigned the prefix must maintain a service that dereferences data GUIDs associated with that prefix.
The third component in a data GUID is a 128-bit UUID following IETF RFC 4122.
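The three components above can be separated mechanically. The following is a minimal sketch, not Indexd's own parsing code: the names `ParsedGuid` and `parse_data_guid` are hypothetical, and it assumes the prefix is everything before the last `/`.

```python
import uuid
from typing import NamedTuple, Optional


class ParsedGuid(NamedTuple):
    prefix: Optional[str]  # e.g. "dg.4242", or None when the GUID has no prefix
    uuid_part: uuid.UUID   # the 128-bit UUID component


def parse_data_guid(guid: str) -> ParsedGuid:
    """Split 'dg.[resourceId]/[UUID]' (or a bare UUID) into its components."""
    prefix, sep, tail = guid.rpartition("/")
    if sep and (not prefix.startswith("dg.") or len(prefix) <= len("dg.")):
        raise ValueError(f"unexpected data GUID prefix: {prefix!r}")
    # uuid.UUID raises ValueError if the tail is not a well-formed UUID.
    return ParsedGuid(prefix if sep else None, uuid.UUID(tail))


parsed = parse_data_guid("dg.4242/000003f6-a029-421a-bc84-a9777b7c34a5")
print(parsed.prefix, parsed.uuid_part)
```

A bare UUID parses with `prefix` set to `None`, which mirrors the fact that the prefix is optional.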

GUIDs can be assigned to entities in object storage, as well as XML and JSON documents. The current location(s) of a particular datum is reflected in the URL list contained within Indexd.

As the same datum may exist in multiple locations, there may be more than one URL associated with each GUID. Actually accessing a URL provided by Indexd is handled on the client side.

Clients must provide capabilities to access URLs specified in Indexd. Gen3 Auth (specifically the Fence service) is capable of creating signed URLs for accessing data.

The client has to be able to interpret the protocol encoded in the URL. This is similar to a browser accessing HTTP and FTP transparently by having it encoded in the URL. If a client comes across a URL that it doesn’t know how to access, it can report an error and the user may have to use a different client to access that URL.

All the information about a specific datum mentioned above (the GUID, URLs, hashes, file size, access control, etc.) is bundled together and referred to internally as an Indexd record.

Indexd Records

Records are collections of information necessary to identify a piece of information as uniquely as possible. This is done through the use of hashes and metadata. Records are assigned a UUIDv4 at the time of creation and may additionally include a prefix to aid in resolution (combined, these become the GUID). This allows records to be uniquely referenced amongst multiple records.

Hashes used by the index are deployment-specific but are intended to be the results of widely known and commonly available hashing algorithms, such as MD5 or SHA1. This is similar to the way that torrents are tracked and provides a mechanism by which data can be safely retrieved from potentially untrusted sources in a secure manner.

Additional metadata that is stored in index records includes the size of the data as well as the type.

Example record with relevant fields:

{
    did: "dg.4242/000003f6-a029-421a-bc84-a9777b7c34a5",
    urls: [
        "gs://some-google-storage-bucket/some-file.txt",
        "s3://some-s3-bucket/some-file.txt"
    ],
    hashes: {
        md5: "f7b38502322197f60a5af8e530fa376e"
    },
    size: 42,
    acl: [
        "example",
        "test"
    ],
    authz: [
        "M7cIajvg"
    ],
    rev: "2013243e",
    baseid: "84c0843e-c685-4dg8-953a-8b8f354fc198",
    file_name: null,
    uploader: null,
}

did: The GUID, AKA Digital Identifier (did)
urls: storage locations for the actual data
hashes: a dictionary of hash algorithms to hashes
size: file size in bytes
acl: access control list with strings identifying required authorizations
authz: preferred over acl, anything here will have priority
- a list of strings representing resources (or resource tags) in Gen3's Authorization Service, Arborist, that the user must have access to in order to access the data
rev: the current revision (for avoiding conflicts)
- See next section for more details
baseid: the base identifier linking logically similar GUIDs
- See Data Version Control for more details
file_name: an optional name for the file that will be searchable through Indexd's API
uploader: who uploaded the file (when using the flow described later on about blank records)
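The `hashes` and `size` fields are what let a client check data fetched from any of the listed URLs. A minimal sketch of that check, assuming the record is available as a Python dict with the fields shown above (the helper name `matches_record` is hypothetical and not part of Indexd or its client):

```python
import hashlib


def matches_record(data: bytes, record: dict) -> bool:
    """Return True if `data` has the size and every hash listed in `record`."""
    if len(data) != record["size"]:
        return False
    for algorithm, expected in record["hashes"].items():
        # hashlib.new accepts algorithm names such as "md5" or "sha1".
        if hashlib.new(algorithm, data).hexdigest() != expected.lower():
            return False
    return True


payload = b"example payload"
record = {
    "size": len(payload),
    "hashes": {"md5": hashlib.md5(payload).hexdigest()},
}
print(matches_record(payload, record))         # True
print(matches_record(payload + b"!", record))  # False
```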

Avoiding Conflicts on Updates

In order to avoid update conflicts for frequently updated GUIDs, Indexd uses a revisioning system similar to that utilized in distributed version control systems. Within a particular GUID, this mechanism is referred to as the revision or rev.

For an update to take place, both the GUID and the revision must match that of the current Indexd record. When any update succeeds, a new revision is generated for the Indexd record. This prevents multiple, conflicting updates from occurring. The revision is an opaque string and is not used for anything other than avoiding update conflicts.
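The revision check can be illustrated with a small in-memory model. This is a sketch of the idea only, not Indexd's implementation: `RecordStore` and `RevisionMismatch` are hypothetical names, and the 8-character revision format is an assumption based on the example record.

```python
import uuid


class RevisionMismatch(Exception):
    """Raised when an update carries a stale revision."""


def _new_rev(previous=None):
    # Opaque 8-character token; regenerate in the unlikely case of a repeat.
    rev = uuid.uuid4().hex[:8]
    while rev == previous:
        rev = uuid.uuid4().hex[:8]
    return rev


class RecordStore:
    def __init__(self):
        self._records = {}

    def create(self, did, fields):
        rev = _new_rev()
        self._records[did] = {**fields, "did": did, "rev": rev}
        return rev

    def update(self, did, rev, changes):
        record = self._records[did]
        if record["rev"] != rev:
            raise RevisionMismatch(f"{did}: revision {rev} is not current")
        record.update(changes)
        record["rev"] = _new_rev(previous=rev)
        return record["rev"]


store = RecordStore()
rev1 = store.create("guid-1", {"urls": []})
rev2 = store.update("guid-1", rev1, {"urls": ["s3://some-s3-bucket/some-file.txt"]})
```

A second writer still holding `rev1` would now get `RevisionMismatch` and has to re-read the record before retrying.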

Data Version Control

It is possible that specific data needs to be updated, but should still be logically related to previous versions of that data. It may also be the case that there were errors in previous data that are corrected in future versions.

It is still true, however, that GUIDs should be persistent and the data they point to should be immutable, meaning that a GUID will always refer to the same data. A new version therefore requires a new GUID for that data (if the hash and file size have changed).

The question is: how do you maintain a logical linking between different versions or updates for the same data?

To handle this versioning in Indexd, the concept of a baseid is introduced. The baseid is a UUID that all versions of the data (in other words, all GUIDs) point to. The baseid logically groups the "same" data.

It is then possible (via the API) to retrieve all versions for a given GUID. In addition, it is possible to ask for the latest version of a GUID. See the API documentation for more details.

To reiterate, a given GUID will always point to the same data, even if there are later versions. The later versions will have different GUIDs, though they will be connected through a common baseid. The Indexd API makes it possible to programmatically determine if newer versions of a given datum exist.
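The grouping described above can be modelled as follows. This is an illustrative sketch rather than Indexd's data model: `VersionIndex` is a hypothetical class, the GUID and baseid strings are placeholders, and it assumes versions are added oldest first, so the last one added is the latest.

```python
from collections import defaultdict


class VersionIndex:
    def __init__(self):
        self._by_baseid = defaultdict(list)  # baseid -> GUIDs, oldest first
        self._baseid_of = {}                 # GUID -> baseid

    def add(self, did, baseid):
        self._by_baseid[baseid].append(did)
        self._baseid_of[did] = baseid

    def versions(self, did):
        """All GUIDs that share a baseid with `did`, including `did` itself."""
        return list(self._by_baseid[self._baseid_of[did]])

    def latest(self, did):
        return self.versions(did)[-1]


index = VersionIndex()
index.add("guid-v1", baseid="base-a")
index.add("guid-v2", baseid="base-a")
index.add("guid-other", baseid="base-b")
print(index.versions("guid-v1"))  # ['guid-v1', 'guid-v2']
print(index.latest("guid-v1"))    # guid-v2
```

Note that `guid-v1` keeps resolving to its original data; only the lookup through the shared baseid reveals that `guid-v2` exists.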

Access Control

Indexd records (identified by GUIDs) are intended to be publicly readable documents, and therefore contain no information other than resource locators. However, in order to prevent unauthorized creation/updating/deleting of records, each record keeps a list of authorization rules (in an authz property).

The authz property contains a list of abstract "resources" a user must have access to in order to have permission to update/delete the associated GUID. For backward compatibility, the ACL list that was used for access control is still available (the acl field).

If Indexd is used with other Gen3 software, specifically the services related to Gen3 Auth (Fence and Arborist), it enables a more useful and robust access control system that exposes various data access methods securely by utilizing the authz field in Indexd.

The additional usage of the Gen3 Auth services will enable data access through signed URLs, with authorization checks based on the authz field in Indexd.

Distributed Resolution: Utilizing Prefixes in GUIDs

If you know the URL of a particular Data GUID resolution service (like Indexd), which is associated with a particular prefix, you can directly access that service to get the relevant record.

Otherwise, you can access a centralized resolver like dataguids.org, which will resolve you to the data GUID service associated with the GUID's prefix.

Indexd's distributed resolution logic for a given GUID/alias is roughly as follows:

Attempt to get a local record with given input (as GUID)
Attempt to get a local record with given input (as alias)
Attempt distributed resolution using connected services configured in Indexd's DIST config

It is possible to resolve to a service that is not another Indexd, provided that a sufficient client is written to convert from the existing format to the format Indexd expects
- Currently we have a DOI Client and GA4GH's DOS Client
  - More info on DOIs
  - More info on DOS
    - NOTE: Was renamed to DRS
- Resolving to servers with other identifiers, like ARK IDs, could be supported if a client was created (otherwise, you can use the aliases in Indexd to simply map from an existing identifier to a GUID)
- We have a GA4GH DRS Implementation which includes bundles.
The distributed resolution can be "smart", in that you can configure hints that tell a central resolver Indexd that a given input should be resolved with a specific distributed service
- The hints are a list of regexes that will attempt to match against given input
- For example: hints: ["10\..*"] for DOIs since they'll begin with 10.

An example configuration (see configuration section for more info) for an external service to resolve to:

CONFIG["DIST"] = [
    {"name": "DX DOI", "host": "https://doi.org/", "hints": ["10\..*"], "type": "doi"},
]

The type tells Indexd which client to use for that external service. In this case, doi maps to the DOI Client.
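The resolution order and hint matching described above can be sketched roughly as follows. This is not Indexd's code: the `resolve` function, the `clients` mapping of type to callable, and the rule that a service without hints is always tried are assumptions made for illustration.

```python
import re


def resolve(identifier, local_records, local_aliases, dist_config, clients):
    # 1. Local record, treating the input as a GUID.
    if identifier in local_records:
        return local_records[identifier]
    # 2. Local record, treating the input as an alias.
    if identifier in local_aliases:
        return local_records[local_aliases[identifier]]
    # 3. Distributed resolution through the configured services.
    for service in dist_config:
        hints = service.get("hints", [])
        if hints and not any(re.match(hint, identifier) for hint in hints):
            continue  # hints are configured and none of them match this input
        fetch = clients[service["type"]]  # hypothetical: type -> client callable
        record = fetch(service["host"], identifier)
        if record is not None:
            return record
    raise KeyError(identifier)


dist = [{"name": "DX DOI", "host": "https://doi.org/", "hints": [r"10\..*"], "type": "doi"}]
fake_clients = {"doi": lambda host, ident: {"did": ident, "urls": [host + ident]}}
print(resolve("10.1234/abc", {}, {}, dist, fake_clients))
```

The stand-in client only echoes the identifier back; a real client would query the remote service and convert its response into an Indexd-style record.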

Indexd itself can be configured to append a prefix to the typical UUID in order to aid in the distributed resolution capabilities mentioned above. Specifically, we can add a prefix such as dg.4GH5/ which may represent one instance of Indexd. For distributed resolution purposes, we can then create hints that let the central resolver know where to go when it receives a GUID with a prefix of dg.4GH5/.

The prefix that a given Indexd instance uses is specified in the DEFAULT_PREFIX configuration in the settings file. In order to ensure that this gets used, set PREPEND_PREFIX to True. Note that the prefix will only be prepended to GUIDs generated for new records that are indexed without providing a GUID.

The ADD_PREFIX_ALIAS configuration represents a different way of using the prefix: if set to True, instead of prepending the prefix to the GUID, indexd will create an alias of the form <prefix><GUID> for this record. Note that you should NOT set both ADD_PREFIX_ALIAS and PREPEND_PREFIX to True, or aliases will be created as <prefix><prefix><GUID>.

If a DEFAULT_PREFIX is configured, certain endpoints may take extra steps to resolve a local GUID based on this. The GET /{GUID}, /index/{GUID}, and DRS endpoints will all accept either the prefixed or unprefixed version of the GUID, regardless of whether the PREPEND_PREFIX or ADD_PREFIX_ALIAS configuration is being used. However, any other endpoint that takes a GUID will only accept the exact did as stored in the database, so it is best to use that field from the record for subsequent requests.
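The interaction between the two prefix settings can be shown with a short sketch. It models the behavior as described in this section rather than reproducing Indexd's code; the function `mint_identifiers` is hypothetical.

```python
import uuid


def mint_identifiers(default_prefix, prepend_prefix, add_prefix_alias):
    """Return (did, alias) for a new record indexed without a caller-provided GUID."""
    did = str(uuid.uuid4())
    if prepend_prefix:
        did = default_prefix + did
    # The alias has the form <prefix><GUID>, built from the GUID as stored.
    alias = default_prefix + did if add_prefix_alias else None
    return did, alias


did, alias = mint_identifiers("dg.4GH5/", prepend_prefix=True, add_prefix_alias=False)
print(did)    # dg.4GH5/<uuid>
did, alias = mint_identifiers("dg.4GH5/", prepend_prefix=False, add_prefix_alias=True)
print(alias)  # dg.4GH5/<uuid>, while did stays a bare UUID
did, alias = mint_identifiers("dg.4GH5/", prepend_prefix=True, add_prefix_alias=True)
print(alias)  # dg.4GH5/dg.4GH5/<uuid>: the doubled prefix warned about above
```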

Use Cases For Indexing Data

Data may be loaded into Indexd through a few different means:

I want to upload data to storage location(s) and index at the same time

Using the gen3-client you can upload objects to storage locations and mint GUIDs at the same time.

Blank Record Creation in Indexd

Indexd supports void or blank records that allow users to pre-register data files through Fence before actually registering them. This enables the Data Upload flow that allows users to use a client to create Indexd records before the physical file exists in storage buckets. The complete flow contains three main steps:

pre-register
hash/size/URL populating
data node registration

General flow:

Fence requests a blank object from Indexd. Indexd creates an object with no hash, size or URLs, only the uploader and optionally file_name fields.
The Indexd listener (indexs3client jobs dispatched by the ssjdispatcher) monitors bucket updates and updates Indexd with the URL, hash and size of the objects.
The client application (windmill or gen3-data-client) lists records for data files which the user needs to submit to the graph. The user fills all empty fields and submits the request to Indexd to update the authz or acl.

NOTE: Step 2 above fulfills the use case of dynamically indexing data added to storage buckets discussed later on.

I want to associate Indexd data to structured data in a Gen3 Data Commons

NOTE: This assumes that the data already exists in storage location(s)

Indexd Record Creation Through Gen3's Data Submission Service: Sheepdog

When data files are submitted to a Gen3 Data Commons using Sheepdog, the files are automatically indexed into Indexd. Submissions to Sheepdog can include object_id's that map to existing Indexd GUIDs. Or, if there are no existing records, Sheepdog can create them on the fly.

To create Indexd records on the fly, Sheepdog will check if the file being submitted has a hash & file size matching anything currently in Indexd and if so uses the returned document GUID as the object ID reference. If no match is found in Indexd then a new record is created and stored in Indexd.
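The match-or-create step can be sketched as below. This is a simplified in-memory model, not Sheepdog's actual logic: `find_or_create` is a hypothetical helper, and it assumes a match means equal size plus agreement on every hash supplied by the submitter.

```python
import uuid


def find_or_create(records, hashes, size):
    """Return the GUID of a record matching hash and size, creating one if needed."""
    for did, record in records.items():
        same_hashes = all(record["hashes"].get(alg) == value for alg, value in hashes.items())
        if record["size"] == size and same_hashes:
            return did
    did = str(uuid.uuid4())
    records[did] = {"did": did, "hashes": dict(hashes), "size": size, "urls": []}
    return did


records = {}
md5 = {"md5": "f7b38502322197f60a5af8e530fa376e"}
first = find_or_create(records, md5, 42)
second = find_or_create(records, md5, 42)   # same hash and size: reuses the GUID
third = find_or_create(records, md5, 43)    # different size: new record
```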

I want to index data that is dynamically added to storage location(s)

Automatically Creating Indexd Records when Objects are Added to Object Storage

Using AWS SNS or Google PubSub it is possible to have streaming notifications when files are created, modified or deleted in the respective cloud object storage services (S3, GCS). It is then possible to use an AWS Lambda or GCP Cloud Function to automatically index the new object into Indexd.

NOTE: This may require using the batch processing services if the file is large (to compute the necessary minimal set of hashes to support indexing). There are known limitations with AWS Lambda and GCP Cloud Functions related to how long a process can run before AWS/Google cuts it off. Some hash calculations may exceed that time limit.
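Hashing a large object does not require loading it into memory, although it still takes time proportional to the object's size. A minimal sketch of chunked hashing that produces the `size` and `hashes` fields in one pass (the function name `hash_stream` and the default algorithms are assumptions, not part of any Gen3 tool):

```python
import hashlib
import io


def hash_stream(stream, algorithms=("md5", "sha1"), chunk_size=1024 * 1024):
    """Read a binary stream in chunks and return its size and hex digests."""
    hashers = {name: hashlib.new(name) for name in algorithms}
    size = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        size += len(chunk)
        for hasher in hashers.values():
            hasher.update(chunk)
    return {
        "size": size,
        "hashes": {name: hasher.hexdigest() for name, hasher in hashers.items()},
    }


result = hash_stream(io.BytesIO(b"some object contents"), chunk_size=4)
print(result["size"], result["hashes"]["md5"])
```

Any object with a `read(n)` method works here, so the same function can be fed a local file opened in binary mode or a streaming response body from an object store client.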

This feature can be set up on a per Data Commons basis for any buckets of interest. The buckets do not have to be owned by the commons, but permissions to read the bucket objects and permissions for SNS or PubSub are necessary.

For existing data in buckets, the SNS or PubSub notifications may be simulated such that the indexing functions are started for each object in the bucket. This is useful because only a single code path is necessary for indexing the contents of an object.

We have a solution for AWS discussed in the "Blank Record Creation in Indexd" Section.

Indexd REST API for Record Creation

It is also possible to interact directly with the Indexd API in order to create index records. There are two options for authorization for these sorts of updates.

Use Basic Auth (username/password) to provide administrative control over Indexd

You can use the /bin/indexd_admin.py to add a new username and password combination to Indexd.

and/or

Use the Gen3 Auth services (Fence and Arborist) to control access based on access tokens provided in requests

Similar to other Gen3 services, users must pass along their Access Token in the form of a JWT in the Authorization header of their request to the Indexd API. Indexd will check that the user is authorized for the items in the authz field by passing along your token and the action you're trying to do to the Arborist service.
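The two authorization options differ only in the Authorization header sent with the request. The sketch below builds both headers with the standard library; the helper names are hypothetical, and the use of the `Bearer` scheme for the JWT is an assumption, since this section only says the token goes in the Authorization header.

```python
import base64


def basic_auth_header(username, password):
    """Header for the username/password created with indexd_admin.py."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return {"Authorization": f"Basic {token}"}


def bearer_auth_header(access_token):
    """Header carrying a Gen3 access token (a JWT); Bearer scheme assumed."""
    return {"Authorization": f"Bearer {access_token}"}


print(basic_auth_header("user", "pass"))
print(bearer_auth_header("<access token>"))
```

Either dict can be passed as the `headers` argument of whichever HTTP client is used to call the Indexd API.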

Standards and Governance

CTDS (maintainers of Indexd) are working with the not-for-profit Open Commons Consortium to assign Data GUID Prefixes to organizations that would like to run a Data GUID service.

In addition, one of our goals is to work with GA4GH to ensure Data GUIDs and Indexd comply with the GA4GH standard. We are also working in parallel to establish Data GUIDs as an Open Commons Consortium (OCC) standard.

Installation

Please see how you can set up a local development environment, which includes setting up a [virtual environment](docs/local_dev_environment.md#Set-up-a-Virtual-Environment) and setting up a local postgresql db for testing.

Configuration

As part of setting up your local development environment, you will also need to configure settings.

Testing

Follow the installation instructions to set up a local development environment.
Check the testing notes and run:

python3 -m pytest -vv --cov=indexd --cov-report xml --junitxml="test-results.xml" tests

Quickstart with Helm

You can now deploy individual services via Helm! Please refer to the Helm quickstart guide HERE (https://github.com/uc-cdis/indexd/blob/master/docs/quickstart_helm.md)


indexd's Issues

add a 'metadata' property to IndexRecordURL

the object's state has always been tracked in the metadata service. But it doesn't actually make sense to track whether an object is uploaded/validated in the metadata service because each location might have a different state.
Since indexd is the abstraction of underlying locations, it should be the only one that stores location-specific information.
states can be registered -> validated -> redacted

PXD-1089 ⁃ PUT new url appends to list

I spoke with Phillis about this a while back, so I may be misremembering. But at the moment whenever you make a PUT request to update a record's urls attribute, the url is appended to a list. It seemed to me that we wanted to just update the previous url instead.

@philloooo please correct me if I am wrong/further elaborate if necessary.

PXD-602 ⁃ support making document update on all versions with one call

use case:
people want to update the acl on all versions of the same file and they do that a lot and they don't want to handle partial failure

PXD-819 ⁃ redirect response with incorrect domain in the message

flask handles redirect from /index -> /index/, when deployed in production behind reverse proxy, the Location header has the correct url under external domain, but the response message uses the internal domain

curl https://data.kidsfirstdrc.org/index/index -v
*   Trying 52.44.40.52...
* TCP_NODELAY set
* Connected to data.kidsfirstdrc.org (52.44.40.52) port 443 (#0)
* TLS 1.2 connection using TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
* Server certificate: *.kidsfirstdrc.org
* Server certificate: Amazon
* Server certificate: Amazon Root CA 1
* Server certificate: Starfield Services Root Certificate Authority - G2
> GET /index/index HTTP/1.1
> Host: data.kidsfirstdrc.org
> User-Agent: curl/7.54.0
> Accept: */*
> 
< HTTP/1.1 301 MOVED PERMANENTLY
< Access-Control-Allow-Origin: *
< Content-Type: text/html; charset=utf-8
< Date: Sun, 17 Jun 2018 18:41:01 GMT
< Location: http://data.kidsfirstdrc.org/index/index/
< Server: nginx
< Set-Cookie: csrftoken=d93b12807e844387feda2700bb17eec12000.0032018-06-17T18:41:01+00:00;Path=/
< Content-Length: 263
< Connection: keep-alive
< 
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<title>Redirecting...</title>
<h1>Redirecting...</h1>
* Connection #0 to host data.kidsfirstdrc.org left intact
<p>You should be redirected automatically to target URL: <a href="http://indexd-service/index/">http://indexd-service/index/</a>.  If not click the link.

Allow ETag and CRC hashes

S3 uses an md5 of individual part md5s followed by the number of parts as an ETag for a multipart upload. This should be supported as it is easily verified against S3.
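The ETag scheme described above can be reproduced locally, which is what makes it verifiable. A minimal sketch, assuming the part size used for the original upload is known and that the part count is joined with a hyphen (the function name `multipart_etag` is hypothetical):

```python
import hashlib


def multipart_etag(data: bytes, part_size: int) -> str:
    """MD5 of the concatenated per-part MD5 digests, followed by the part count."""
    parts = [data[i:i + part_size] for i in range(0, len(data), part_size)]
    combined = b"".join(hashlib.md5(part).digest() for part in parts)
    return f"{hashlib.md5(combined).hexdigest()}-{len(parts)}"


etag = multipart_etag(b"a" * 10, part_size=4)  # parts of 4, 4 and 2 bytes
print(etag)
```

Because the result depends on the part size, the same object uploaded with a different part size yields a different ETag, so the part size would need to be recorded alongside the hash.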

Validate input for updates gracefully

Updating a document via POST /index/{UUID} requires form, size, urls, and hashes at the minimum. However, this is not documented and attempting to post without one of them in the body will throw a key not found resulting in an unhelpful 500 response.
See here: https://github.com/uc-cdis/indexd/blob/master/indexd/index/blueprint.py#L271

The request body should be validated and return with an error such as "field {property} is required", or something similar that informs the user of the issue.
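A sketch of the kind of check being requested, using the four fields named above. The helper `validation_errors` is hypothetical and is not the fix that was applied in Indexd; a handler could return a 400 response with these messages instead of letting a KeyError become a 500.

```python
REQUIRED_FIELDS = ("form", "size", "urls", "hashes")


def validation_errors(body):
    """Return one message per required field missing from the request body."""
    return [f"field {name} is required" for name in REQUIRED_FIELDS if name not in body]


print(validation_errors({"size": 42, "urls": [], "hashes": {}}))
# ['field form is required']
```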

Indexd crash bug

Here is what happens:

Indexclient uses the wrong user/password to access the Indexd server.
The client can still query the data.
However, if the client updates or creates new entries in a fast loop (say 100 or 1000 times, a couple of milliseconds per request),
Indexd sometimes returns request errors
and then crashes.

request for non-existent alias does not return 404

This is a bug that was introduced by the alias --> urls shortcut that was introduced last night. When the calls within the alias-index cross-reference blueprint raise NoRecordFound (and other such errors), they need to be caught within that blueprint and converted to the standard error json that is returned to the user.

fix index creation with provided uuid

https://github.com/uc-cdis/indexd/pull/40/files
this pr is somehow partially reverted by later commits.
Need to fix it and add unit test

PXD-553 ⁃ support using arborist to evaluate protected requests

if it's provided that there is an arborist server, use that to check auth for create/update/delete operations
if not, use current basic auth

missing requirement

Hello,
I'm trying to build INDEXD according to the readme but I get an error when installing the dependencies. I'm on a fresh virtualenv.

Looks like it would be possible to specify cdislogging as a dependency.

| => pip install .
Processing /Users/jlindsay/Documents/Code/cidc/gen3/indexd
Collecting flask==0.10.1 (from indexd==0.1)
Collecting jsonschema==2.5.1 (from indexd==0.1)
  Using cached jsonschema-2.5.1-py2.py3-none-any.whl
Collecting sqlalchemy==1.0.8 (from indexd==0.1)
  Downloading SQLAlchemy-1.0.8.tar.gz (4.6MB)
    100% |████████████████████████████████| 4.6MB 220kB/s
Collecting sqlalchemy-utils>=0.32.21 (from indexd==0.1)
  Downloading SQLAlchemy-Utils-0.33.2.tar.gz (124kB)
    100% |████████████████████████████████| 133kB 2.3MB/s
Collecting psycopg2>=2.7 (from indexd==0.1)
  Downloading psycopg2-2.7.4-cp27-cp27m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl (1.7MB)
    100% |████████████████████████████████| 1.7MB 644kB/s
Collecting cdislogging (from indexd==0.1)
  Could not find a version that satisfies the requirement cdislogging (from indexd==0.1) (from versions: )
No matching distribution found for cdislogging (from indexd==0.1)

indexd index_record 'size' field need to be BIGINT

If we want to continue storing file size in bytes, then we need to update our schema:

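from sqlalchemy import Column, Integer, String
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()


class IndexRecord(Base):
    '''
    Base index record representation.
    '''
    __tablename__ = 'index_record'

    did = Column(String, primary_key=True)
    rev = Column(String)
    form = Column(String)
    size = Column(Integer)  # 32-bit INTEGER in PostgreSQL; BigInteger would map to BIGINT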

https://www.postgresql.org/docs/9.1/static/datatype-numeric.html

add a `type` property to IndexRecordURL and another model to store controlled enum in db

we need a separate field to describe the purpose of the data that's from a list of types that make sense to the system.
This field is needed to make services able to infer the purpose of that storage location. eg:

on_premises_cold_storage
on_premises_primary_storage
amazon_backup
amazon_primary_storage

PXD-554 ⁃ Add bulk insertion into indexd

support creating index with existing uuid

need to support creating index with existing uuid instead of always creating a new one
need to also update the indexclient for it

500 Errors

[Thu Mar 15 16:07:37.785181 2018] [wsgi:error] [pid 18:tid 140064201840384] Traceback (most recent call last):
[Thu Mar 15 16:07:37.785184 2018] [wsgi:error] [pid 18:tid 140064201840384] File "/usr/local/lib/python2.7/dist-packages/Flask-0.10.1-py2.7.egg/flask/app.py", line 1817, in wsgi_app
[Thu Mar 15 16:07:37.785185 2018] [wsgi:error] [pid 18:tid 140064201840384] response = self.full_dispatch_request()
[Thu Mar 15 16:07:37.785187 2018] [wsgi:error] [pid 18:tid 140064201840384] File "/usr/local/lib/python2.7/dist-packages/Flask-0.10.1-py2.7.egg/flask/app.py", line 1477, in full_dispatch_request
[Thu Mar 15 16:07:37.785189 2018] [wsgi:error] [pid 18:tid 140064201840384] rv = self.handle_user_exception(e)
[Thu Mar 15 16:07:37.785191 2018] [wsgi:error] [pid 18:tid 140064201840384] File "/usr/local/lib/python2.7/dist-packages/Flask-0.10.1-py2.7.egg/flask/app.py", line 1381, in handle_user_exception
[Thu Mar 15 16:07:37.785193 2018] [wsgi:error] [pid 18:tid 140064201840384] reraise(exc_type, exc_value, tb)
[Thu Mar 15 16:07:37.785194 2018] [wsgi:error] [pid 18:tid 140064201840384] File "/usr/local/lib/python2.7/dist-packages/Flask-0.10.1-py2.7.egg/flask/app.py", line 1475, in full_dispatch_request
[Thu Mar 15 16:07:37.785195 2018] [wsgi:error] [pid 18:tid 140064201840384] rv = self.dispatch_request()
[Thu Mar 15 16:07:37.785197 2018] [wsgi:error] [pid 18:tid 140064201840384] File "/usr/local/lib/python2.7/dist-packages/Flask-0.10.1-py2.7.egg/flask/app.py", line 1461, in dispatch_request
[Thu Mar 15 16:07:37.785198 2018] [wsgi:error] [pid 18:tid 140064201840384] return self.view_functions[rule.endpoint](**req.view_args)
[Thu Mar 15 16:07:37.785200 2018] [wsgi:error] [pid 18:tid 140064201840384] File "/usr/local/lib/python2.7/dist-packages/indexd-0.1-py2.7.egg/indexd/auth/__init__.py", line 24, in check_auth
[Thu Mar 15 16:07:37.785201 2018] [wsgi:error] [pid 18:tid 140064201840384] return f(*args, **kwargs)
[Thu Mar 15 16:07:37.785202 2018] [wsgi:error] [pid 18:tid 140064201840384] File "/usr/local/lib/python2.7/dist-packages/indexd-0.1-py2.7.egg/indexd/index/blueprint.py", line 264, in add_index_record_version
[Thu Mar 15 16:07:37.785204 2018] [wsgi:error] [pid 18:tid 140064201840384] form = flask.request.json['form']
[Thu Mar 15 16:07:37.785205 2018] [wsgi:error] [pid 18:tid 140064201840384] KeyError: 'form'

Include baseid in /latest endpoint

The /{UUID} and /{UUID}/latest endpoints both return indexd documents, however, the /latest endpoint does not include the baseid of the document. This field should be added to the response to stay consistent with the document format.

Document undocumented urls

It looks like there are a handful of urls that exist but are undocumented that may be useful to know about. The ones I see are:

/_stats
/_version
/_status
/urls

document current indexd in swagger

support versioning

something like http://help.zenodo.org/#versioning seems to make sense.
Each index document will have a version number (or just a created timestamp) and a foreign key to a permanent/concept id document. The only thing that I think can be different from zenodo is that querying the concept id should be under another endpoint, e.g. concept/<id>?target=<latest/all>
use cases:

all my data doesn't need versioning, the concept_id field will just be null
I want to upload a new version of the data object, I give indexd the id or hashes of the old version, indexd creates a concept id and adds it to both the old version index document and the new version index document.
When I create an object, I tell indexd it will be versioned, so this first index document is created with a concept id.

Index Current ISB Production Buckets

Add basic auth.

All operations are currently fully public. Need to restrict operations to those with basic authorization. For now, restrict all operations, then follow up with white-listing certain public operations.

PXD-1138 ⁃ Make endpoint to truncate tables & reset schema version correctly

PXD-822 ⁃ Add possibility to query indexd with a negated param

When getting indexes, introduce a negate operation. With indexclient, add a new parameter in list_with_params() called negate_params; all the values passed in negate_params will be negated (not equal or not exist).

Below is an example:

docs = self.index_client.list_with_params(
    params={
        'acl': ['open'],
        'urls_metadata': {
            's3://amazonaws.com/': {'state': 'validated'},
        },
    },
    negate_params={
        'acl': ['example'],
        'urls_metadata': {
            's3://example.com/': None,
        },
    },
)

This will return all records that:

have acl open
urls_metadata have [<key like '%s3://amazonaws.com%'>]['state'] == validated,
don't have acl example
urls_metadata don't have key like '%s3://example.com%'

Different level of negation
For negate params, urls_metadata and metadata can have different levels of negation. If the value passed for a key is falsy (for example None), the filter excludes records in which that key exists.

Below will exclude all records that have urls_metadata with url_key LIKE '%s3://example.com%'

{
  'urls_metadata': {
    's3://example.com/': None
  }
}

Below will exclude all records that have key "state" in urls_metadata[<key like'%s3://example.com%'>]

{
  'urls_metadata': {
    's3://example.com/': {'state': None}
  }
}

Below will exclude all records with state "validated" in urls_metadata[<key like'%s3://example.com%'>]['state']

{
  'urls_metadata': {
    's3://example.com/': {'state': 'validated'}
  }
}
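The three levels can be expressed as a plain Python predicate over a record's urls_metadata. This is only a model of the proposed semantics, not the query Indexd would run: `is_excluded` is a hypothetical name, and substring matching stands in for the SQL LIKE '%...%' match on the URL key.

```python
def is_excluded(urls_metadata, negate):
    """Return True if a record's urls_metadata matches any negated condition."""
    for url_pattern, condition in negate.items():
        for url, metadata in urls_metadata.items():
            if url_pattern not in url:  # stand-in for LIKE '%pattern%'
                continue
            if not condition:
                return True  # level 1: a matching URL key exists at all
            for key, value in condition.items():
                if value is None and key in metadata:
                    return True  # level 2: the metadata key exists
                if value is not None and metadata.get(key) == value:
                    return True  # level 3: the metadata key has this value
    return False


record = {"s3://example.com/bucket/key": {"state": "validated"}}
print(is_excluded(record, {"s3://example.com/": None}))
print(is_excluded(record, {"s3://example.com/": {"state": None}}))
print(is_excluded(record, {"s3://example.com/": {"state": "registered"}}))
```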

PXD-2216 ⁃ repurpose EDC lambda for generic bucket listener

fill in hash & size for created records

add an optional version string

allow providing a version string for IndexRecord
allow updating version string

move configuration to external yaml

Currently, configuration is hard-coded. It's desirable to move this into external configuration files.

using multiple hashes does not pass schema validation

PXD-1168 ⁃ Arborist Integration with IndexD

Scope includes:

Define a consistent mapping from Arborist methods to the service's actual actions
Allow the user to pass a token to IndexD, then evaluate the user's role to decide if the user can update the data
It should be backward compatible
- that is, it should keep working if IndexD is not configured to connect to Arborist
Create unit tests
Update README to describe that IndexD now supports role-based access control if it is deployed together with Arborist

Add Configuration to Prepend Prefix for ID

Deploy to DCP/Data Stage with prefix that is TBD.

We also need an option, when adding or creating an endpoint, to find out whether there is already a prefix.
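
One way to read this request is as an idempotent helper that prepends a configured prefix only when it is not already present. The sketch below assumes exactly that; the function name prepend_prefix and the sample prefix "testprefix:" are hypothetical and do not come from indexd's configuration.

```python
def prepend_prefix(guid, prefix):
    """Return guid with prefix prepended, unless it already starts with it.

    Hypothetical helper: an empty or missing prefix leaves the ID unchanged.
    """
    if not prefix or guid.startswith(prefix):
        return guid
    return prefix + guid


guid = "40a395b0-51d1-426e-9fbd-13a91ac7547e"
prefixed = prepend_prefix(guid, "testprefix:")
print(prefixed)
# Applying the helper a second time does not add the prefix again.
print(prepend_prefix(prefixed, "testprefix:") == prefixed)
```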

Allow lookups by baseid

It seems that it is currently not possible to resolve a document given only a baseid. This seems like it should be possible.

A suggested implementation may either be to:

Use the existing /index/{UUID} endpoints to also return BaseVersion documents
Implement a new resource to resolve base uuids like /base/{UUID}

PXD-1134 ⁃ Implement updated DOS list request

DOS moved from POST to GET on its list endpoint. This issue is meant to track the necessary changes to update indexd to support the new version.

Server changes here https://github.com/uc-cdis/indexd/blob/master/indexd/dos/blueprint.py

Add client tests here https://github.com/uc-cdis/indexd/blob/master/tests/test_client.py#L624

Pull request at DOS with schema change https://github.com/ga4gh/data-object-service-schemas/pull/87

PXD-854 ⁃ update() does not behave correctly

The problem occurs when we update a record's current urls and urls_metadata.

client = IndexClient()
docs = client.get("40a395b0-51d1-426e-9fbd-13a91ac7547e")
new_url = "s3://host_name/bucket/key_name"
docs.urls.append(new_url)
docs.urls_metadata[new_url] = {"state": "validated"}

docs.patch()

The error I get is:
requests.exceptions.HTTPError: 500 Server Error: INTERNAL SERVER ERROR

The indexd error log shows:

[Thu Jun 21 20:25:00.331227 2018] [:error] [pid 8685:tid 140451114325760] [2018-06-21 20:25:00,327][    indexd][  ERROR] Exception on /index/40a395b0-51d1-426e-9fbd-13a91ac7547e [PUT]
[Thu Jun 21 20:25:00.331269 2018] [:error] [pid 8685:tid 140451114325760] Traceback (most recent call last):
[Thu Jun 21 20:25:00.331277 2018] [:error] [pid 8685:tid 140451114325760]   File "/var/tungsten/services/indexd/deploy/current/venv/lib/python2.7/site-packages/Flask-0.10.1-py2.7.egg/flask/app.py", line 1817, in wsgi_app
[Thu Jun 21 20:25:00.331285 2018] [:error] [pid 8685:tid 140451114325760]     response = self.full_dispatch_request()
[Thu Jun 21 20:25:00.331293 2018] [:error] [pid 8685:tid 140451114325760]   File "/var/tungsten/services/indexd/deploy/current/venv/lib/python2.7/site-packages/Flask-0.10.1-py2.7.egg/flask/app.py", line 1477, in full_dispatch_request
[Thu Jun 21 20:25:00.331301 2018] [:error] [pid 8685:tid 140451114325760]     rv = self.handle_user_exception(e)
[Thu Jun 21 20:25:00.331308 2018] [:error] [pid 8685:tid 140451114325760]   File "/var/tungsten/services/indexd/deploy/current/venv/lib/python2.7/site-packages/Flask-0.10.1-py2.7.egg/flask/app.py", line 1381, in handle_user_exception
[Thu Jun 21 20:25:00.331317 2018] [:error] [pid 8685:tid 140451114325760]     reraise(exc_type, exc_value, tb)
[Thu Jun 21 20:25:00.331324 2018] [:error] [pid 8685:tid 140451114325760]   File "/var/tungsten/services/indexd/deploy/current/venv/lib/python2.7/site-packages/Flask-0.10.1-py2.7.egg/flask/app.py", line 1475, in full_dispatch_request
[Thu Jun 21 20:25:00.331333 2018] [:error] [pid 8685:tid 140451114325760]     rv = self.dispatch_request()
[Thu Jun 21 20:25:00.331340 2018] [:error] [pid 8685:tid 140451114325760]   File "/var/tungsten/services/indexd/deploy/current/venv/lib/python2.7/site-packages/Flask-0.10.1-py2.7.egg/flask/app.py", line 1461, in dispatch_request
[Thu Jun 21 20:25:00.331348 2018] [:error] [pid 8685:tid 140451114325760]     return self.view_functions[rule.endpoint](**req.view_args)
[Thu Jun 21 20:25:00.331356 2018] [:error] [pid 8685:tid 140451114325760]   File "/var/tungsten/services/indexd/deploy/current/venv/lib/python2.7/site-packages/indexd-0.1-py2.7.egg/indexd/auth/__init__.py", line 24, in check_auth
[Thu Jun 21 20:25:00.331364 2018] [:error] [pid 8685:tid 140451114325760]     return f(*args, **kwargs)
[Thu Jun 21 20:25:00.331371 2018] [:error] [pid 8685:tid 140451114325760]   File "/var/tungsten/services/indexd/deploy/current/venv/lib/python2.7/site-packages/indexd-0.1-py2.7.egg/indexd/index/blueprint.py", line 280, in put_index_record
[Thu Jun 21 20:25:00.331380 2018] [:error] [pid 8685:tid 140451114325760]     urls_metadata=urls_metadata,
[Thu Jun 21 20:25:00.331387 2018] [:error] [pid 8685:tid 140451114325760]   File "/var/tungsten/services/indexd/deploy/current/venv/lib/python2.7/site-packages/indexd-0.1-py2.7.egg/indexd/index/drivers/alchemy.py", line 599, in update
[Thu Jun 21 20:25:00.331395 2018] [:error] [pid 8685:tid 140451114325760]     session.add(record)
[Thu Jun 21 20:25:00.331402 2018] [:error] [pid 8685:tid 140451114325760]   File "/var/tungsten/services/indexd/deploy/current/venv/lib/python2.7/site-packages/SQLAlchemy-1.0.8-py2.7-linux-x86_64.egg/sqlalchemy/orm/session.py", line 1577, in add
[Thu Jun 21 20:25:00.331421 2018] [:error] [pid 8685:tid 140451114325760]     self._save_or_update_state(state)
[Thu Jun 21 20:25:00.331429 2018] [:error] [pid 8685:tid 140451114325760]   File "/var/tungsten/services/indexd/deploy/current/venv/lib/python2.7/site-packages/SQLAlchemy-1.0.8-py2.7-linux-x86_64.egg/sqlalchemy/orm/session.py", line 1596, in _save_or_update_state
[Thu Jun 21 20:25:00.331437 2018] [:error] [pid 8685:tid 140451114325760]     self._save_or_update_impl(st_)
[Thu Jun 21 20:25:00.331443 2018] [:error] [pid 8685:tid 140451114325760]   File "/var/tungsten/services/indexd/deploy/current/venv/lib/python2.7/site-packages/SQLAlchemy-1.0.8-py2.7-linux-x86_64.egg/sqlalchemy/orm/session.py", line 1846, in _save_or_update_impl
[Thu Jun 21 20:25:00.331451 2018] [:error] [pid 8685:tid 140451114325760]     self._update_impl(state)
[Thu Jun 21 20:25:00.331458 2018] [:error] [pid 8685:tid 140451114325760]   File "/var/tungsten/services/indexd/deploy/current/venv/lib/python2.7/site-packages/SQLAlchemy-1.0.8-py2.7-linux-x86_64.egg/sqlalchemy/orm/session.py", line 1832, in _update_impl
[Thu Jun 21 20:25:00.331466 2018] [:error] [pid 8685:tid 140451114325760]     state_str(state)
[Thu Jun 21 20:25:00.331472 2018] [:error] [pid 8685:tid 140451114325760] InvalidRequestError: Instance '<IndexRecordUrl at 0x7fbd432f53d0>' has been deleted.  Use the make_transient() function to send this object back to the transient state.

move acl out of metadata table

ACLs should be a supported feature rather than being kept in a catch-all metadata property.

PXD-988 ⁃ New endpoint for bulk queries

Currently the GET /index/ endpoint accepts a few URL parameters, one of them being ids [1]. The more UUIDs are passed, the longer the indexd query URL becomes. Most browsers enforce a limit of ~2000 characters per URL, and most server software enforces a larger limit. Extremely large queries can produce a URL that exceeds the maximum length, causing the request to fail. My suggestion is to add a new endpoint that accepts POST data in the body of an HTTP request.

(NOTE: I am not attached to the endpoint name I describe below.)

Current implementation

GET /index?start=1&limit=100&ids=id1,id2...idn

Proposal

POST /bulk-index/list

body = {
    "start": 1,
    "limit": 100,
    "ids": ["id1", "id2", ..."idn"]
}

[1] https://github.com/uc-cdis/indexd/blob/master/indexd/index/blueprint.py#L48-L61
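
To illustrate the size problem, the sketch below builds both request forms for 100 UUIDs. The helper names are hypothetical, and /bulk-index/list is only the endpoint name proposed in this issue, not an existing route. No request is sent; the code only constructs the URL and the JSON body.

```python
import json
import uuid
from urllib.parse import urlencode


def build_get_url(ids, start=1, limit=100):
    """Build the current-style GET URL with ids as a comma-separated parameter."""
    query = urlencode({"start": start, "limit": limit, "ids": ",".join(ids)})
    return "/index?" + query


def build_bulk_body(ids, start=1, limit=100):
    """Build the JSON body for the proposed POST /bulk-index/list endpoint."""
    return json.dumps({"start": start, "limit": limit, "ids": list(ids)})


ids = [str(uuid.uuid4()) for _ in range(100)]
url = build_get_url(ids)
body = build_bulk_body(ids)

# 100 UUIDs already push the URL past the ~2000 character browser limit.
print(len(url) > 2000)
# The POST body carries the same ids without any URL length constraint.
print(len(json.loads(body)["ids"]))
```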

support listing index documents given a list of ids

A user wants to be able to get multiple documents with one call by providing a list of ids.

PXD-552 ⁃ DCF -> ISB -> Index TCGA staging bucket

We need to index the following 2 TCGA buckets on DCF staging:
https://nci-crdc-staging.datacommons.io

gdc-tcga-phs000178-open-staging
Created under Google Project: dcf-staging-202214
allAuthenticatedUsers have Storage Object Viewer role
gdc-tcga-phs000178-controlled-staging
Created under Google Project: dcf-staging-202214
[email protected] has Storage Object Viewer role
[email protected] has Storage Object Viewer role

PXD-1727 ⁃ List with params returns half the limit when using multiple params

Reported by GDC.

timestamps should be ISO-8601 format

should be returned in isoformat
should be called created_time and updated_time because they are not dates
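
As a minimal sketch of what this could look like, Python's datetime.isoformat() already produces ISO-8601 strings. The function name serialize_timestamps is hypothetical; the field names created_time and updated_time are the ones suggested above.

```python
from datetime import datetime, timezone


def serialize_timestamps(created, updated):
    """Return record timestamps as ISO-8601 strings under the suggested names."""
    return {
        "created_time": created.isoformat(),
        "updated_time": updated.isoformat(),
    }


created = datetime(2018, 6, 21, 20, 25, 0, tzinfo=timezone.utc)
updated = datetime(2018, 6, 21, 20, 25, 0, 327000, tzinfo=timezone.utc)
print(serialize_timestamps(created, updated))
# {'created_time': '2018-06-21T20:25:00+00:00', 'updated_time': '2018-06-21T20:25:00.327000+00:00'}
```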

support getting latest version document that has version populated

PXD-555 ⁃ support bulk creation from bdbag format

https://github.com/ini-bdds/bdbag

write wsgi files for use in apache / nginx

The index is intended to be run behind a proper web server, such as apache or nginx.

PXD-2213 ⁃ support empty records

new endpoint for blank records containing just the uploader field
update documentation to explain the purpose for this feature: https://paper.dropbox.com/doc/Data-upload-design--AQK_Pt1oxinOWtTI9Cm9lDLaAg-61kmmvSMey7zkedXngt03
allow creating empty records
query by empty acl
support filling (still not modifying) hash and size in empty records

New feature request

I think it would be very nice if indexd supported bulk input. Most of the regular use cases are fine, since each insertion takes roughly 25 milliseconds. I think that if we insert 1000 entries at a time, it will take much less than 25ms x 1000 = 25s.

Two use cases will benefit significantly:

backup-indexd-populator now takes 24-26 hours to populate 1.8-1.9 million entries. It would take significantly less time (hopefully 30-60 minutes)
signpost-2-indexd migration takes about 60-70 hours (depending on environment) to migrate about 3.5 million entries. It would take much less time (probably 60-90 minutes)
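
The client side of such a bulk insert could split records into batches of 1000, as sketched below. Both helper names are hypothetical and no indexd bulk endpoint is assumed; the code only shows the batching and reproduces the 25ms x 1000 = 25s estimate for one-at-a-time insertion.

```python
def batched(records, size):
    """Yield successive lists of at most `size` records."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


def sequential_seconds(n_records, ms_per_insert=25):
    """Estimate total seconds when records are inserted one request at a time."""
    return n_records * ms_per_insert / 1000


records = [{"did": str(i)} for i in range(2500)]
batches = list(batched(records, 1000))

# 2500 records become three requests instead of 2500.
print([len(batch) for batch in batches])  # [1000, 1000, 500]
print(sequential_seconds(1000))           # 25.0
```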

Records should include an uploader field which indicates the user who uploaded/is uploading this file
support update (delete), query for uploader field in records (using existing endpoints)
also support deleting the field, since sheepdog will need to set the field to empty after data upload is finalized
update documentation to explain the purpose and usage