mountetna / metis


Metis is a file service for Etna applications.

License: GNU General Public License v2.0

Ruby 81.85% JavaScript 10.36% HTML 0.11% Shell 2.73% Makefile 0.03% Dockerfile 0.05% SCSS 2.61% TypeScript 2.25%

metis's Introduction

Metis

Metis is a file service for Etna applications. It provides the ability to store binary files in folder hierarchies and access them via an HTTP API. The underlying object storage uses an ordinary filesystem (i.e., files in Metis are stored on disk as files).

Organization

Projects

As with all Etna applications, the basic organizational unit of Metis is the project - the user only has rights to data from a project according to their project role.

Buckets

Buckets are the root-level containers for each project. A bucket is access restricted, either by role or by access list. Bucket names are restricted to symbol_names, i.e. [A-Za-z0-9_]+.

Folders

Folders are collections of files and folders. Folders may be "protected", preventing the creation of new files or folders.

Files

Files may have any content. Metis will compute MD5 sums for each file and (if available) back them up using cloud storage. Files may also be protected from modification by admins.

Names

File (and folder) names must match [^<>:;,?"*\|\/\x00-\x1f]+, i.e. excluding common wildcard, separator and control characters.
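As an illustration of the two naming rules above, here is a minimal Python sketch that checks names against the patterns quoted in the text. The function names are hypothetical, and the use of whole-string matching is an assumption; this is not Metis's own validation code.

```python
import re

# Patterns copied from the text above. fullmatch() requires the whole
# name to match (an assumption about how Metis anchors them).
BUCKET_NAME = re.compile(r'[A-Za-z0-9_]+')
FILE_NAME = re.compile(r'[^<>:;,?"*\|\/\x00-\x1f]+')


def valid_bucket_name(name: str) -> bool:
    """True if the name uses only letters, digits and underscores."""
    return BUCKET_NAME.fullmatch(name) is not None


def valid_file_name(name: str) -> bool:
    """True if the name has no wildcard, separator or control characters."""
    return FILE_NAME.fullmatch(name) is not None
```

For example, `valid_file_name("sample 01.fastq.gz")` passes because spaces and dots are allowed, while `valid_file_name("a/b.txt")` fails on the path separator.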

Client

Metis provides a browser client that allows file viewing, download, and upload.

API

Listing

Listing a folder path will return a JSON list of files and folders at that path, including an HMAC-signed download_url:

GET /:project_name/list/:bucket_name/*folder_path

Buckets

You may get a list of visible buckets for your project:

GET /:project_name/list/

Admins may create, update or delete a bucket:

POST /:project_name/bucket/create/:bucket_name { owner, access, description }
POST /:project_name/bucket/update/:bucket_name { access, description, new_bucket_name }
DELETE /:project_name/bucket/remove/:bucket_name

Bucket access is either a role (administrator, editor or viewer) or a comma-separated list of Etna user ids (emails).
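A rough Python sketch of how such an access value might be interpreted. The helper names are hypothetical, and the assumption that a higher role also satisfies a lower role requirement is ours, not something the text states.

```python
# Roles ordered from least to most privileged (assumed ordering).
ROLES = ("viewer", "editor", "administrator")


def parse_access(access: str):
    """Return ("role", name) or ("list", [emails]) for an access string."""
    access = access.strip()
    if access in ROLES:
        return ("role", access)
    emails = [e.strip() for e in access.split(",") if e.strip()]
    return ("list", emails)


def can_access(access: str, user_email: str, user_role: str) -> bool:
    """Check a user against a bucket's access value."""
    kind, value = parse_access(access)
    if kind == "role":
        # Assumption: an editor may use a bucket restricted to viewers.
        return ROLES.index(user_role) >= ROLES.index(value)
    return user_email in value
```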

Folders

Editors may create, remove and rename folders.

POST /:project_name/folder/create/:bucket_name/*folder_path
DELETE /:project_name/folder/remove/:bucket_name/*folder_path
POST /:project_name/folder/rename/:bucket_name/*folder_path { new_folder_path }

Admins may protect or unprotect folders:

POST /:project_name/folder/protect/:bucket_name/*folder_path
POST /:project_name/folder/unprotect/:bucket_name/*folder_path

Files

Editors may remove or rename a file:

DELETE /:project_name/file/remove/:bucket_name/*file_path
POST /:project_name/file/rename/:bucket_name/*file_path { new_file_path }

Admins may protect or unprotect a file:

POST /:project_name/file/protect/:bucket_name/*file_path
POST /:project_name/file/unprotect/:bucket_name/*file_path

Uploads

The basic upload cycle first requires the editor to authorize a new file upload into a project, bucket and path with their Etna auth token:

POST /authorize/upload { project_name, bucket_name, file_path }

If the intended file path is invalid, or the destination is locked, the upload authorization will fail.

This returns an HMAC-signed URL to the upload endpoint, which may be used to perform the upload:

POST /:project_name/upload/:bucket_name/*file_path { action, ... }

This endpoint can perform three actions:

We initiate the upload by communicating what we intend to send:

{ action: 'start', file_size, next_blob_size, next_blob_hash }

We repeatedly send blobs of binary data (using multipart post):

{ action: 'blob', blob_data, next_blob_size, next_blob_hash }

If the data does not hash correctly upon receipt, Metis will reject the blob.

When we have sent the final blob, the upload completes and Metis returns the newly-minted file JSON.

If we are dissatisfied with our progress we may cancel the upload:

{ action: 'cancel' }
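The start/blob sequence described above can be sketched in Python as a generator of request bodies. This is only an illustration: the hash is assumed to be MD5 (hex-encoded), the final blob is assumed to announce an empty next blob, and `upload_messages` and `accept_blob` are hypothetical names.

```python
import hashlib


def announce(blob: bytes) -> dict:
    """Describe the blob that will be sent next (MD5 is an assumption)."""
    return {
        "next_blob_size": len(blob),
        "next_blob_hash": hashlib.md5(blob).hexdigest(),
    }


def upload_messages(data: bytes, blob_size: int):
    """Yield the request bodies for one upload, in order."""
    blobs = [data[i:i + blob_size] for i in range(0, len(data), blob_size)]
    first = blobs[0] if blobs else b""
    yield {"action": "start", "file_size": len(data), **announce(first)}
    for i, blob in enumerate(blobs):
        # Assumption: after the last blob we announce an empty blob.
        following = blobs[i + 1] if i + 1 < len(blobs) else b""
        yield {"action": "blob", "blob_data": blob, **announce(following)}


def accept_blob(previous: dict, blob: bytes) -> bool:
    """Server-side check: does the blob match what was announced?"""
    return (len(blob) == previous["next_blob_size"]
            and hashlib.md5(blob).hexdigest() == previous["next_blob_hash"])
```

Each message announces the size and hash of the blob that follows it, which is what lets the receiver reject a blob that does not hash correctly.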

Documentation

For the most current documentation about Metis, please refer to the Mount Etna documentation blog.

metis's People

Contributors

coleshaw, graft, corps, jasondcater, dtm2451, gvaihir

metis's Issues

UI, User logging, Invalidate User Cookie.

When a user cookie is invalid we need to remove it. Right now when the cookie is expired the UI will say that the user cannot be logged in and just sits there. We need to clear that cookie and restart the log in cycle.

BAM streaming

You submit a request in the form of genomic regions as JSON.

In return you get either a count of reads in those regions or the reads themselves, either in JSON or BAM/SAM format.

Configure the base system of Metis

We need to install the base configuration of Metis, which includes things such as:
Apache Config
Package Install
Firewall Config

See the Chef configuration or Wiki page for more detail.

Hash-based object store

Currently each file on Metis is stored on disk in an actual directory structure. This is cumbersome to manage and requires several filesystem operations in order to move files to a different path. Ostensibly the reason for this was some sort of inspectability of the file store on-disk. In practice this isn't really the case (the hex-encoded file paths are hard to read), and they cause some serious issues (there is a Linux file system name-length limit that leaks into Metis, as it takes two hex characters to encode each file name character).

A better object store could use MD5s to organize files. Each file's content is stored in a directory structure according to its hash; the object store maintains a table mapping a file key (i.e., a full path including :project_name/:bucket_name) to an md5. Newly-uploaded files are stored at a temporary location until the object store can hash them, after which they are moved to their md5-location.

This has several advantages: duplicate files don't take up extra space, and moving files from one path to another merely involves changing a database entry in the object store. This also abstracts the "object store" away from Metis' folder/file structure, paving the way for future use of other object stores (e.g., a cloud-based store or a Ceph store).
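A toy in-memory Python sketch of the proposed design, to make the two advantages concrete. The class and method names are hypothetical, and dictionaries stand in for both the on-disk md5 locations and the database table.

```python
import hashlib


class ObjectStore:
    """Content-addressed store: file keys map to md5s, md5s map to content."""

    def __init__(self):
        self.blobs = {}  # md5 hex digest -> content (stand-in for disk)
        self.keys = {}   # "project/bucket/path" -> md5 hex digest

    def put(self, key: str, content: bytes) -> str:
        md5 = hashlib.md5(content).hexdigest()
        # Identical content is stored only once.
        self.blobs.setdefault(md5, content)
        self.keys[key] = md5
        return md5

    def get(self, key: str) -> bytes:
        return self.blobs[self.keys[key]]

    def move(self, old_key: str, new_key: str) -> None:
        # Moving a file only rewrites the table entry; no data is copied.
        self.keys[new_key] = self.keys.pop(old_key)
```

A real implementation would also need reference counting (or a scan of the key table) before deleting a blob, since several keys may share one md5.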

bulk upload of files to a metis 'bucket' via the web

The HTTP bucket-upload API should allow you to put a file (or series of files) to a particular metis project bucket.

The cycle:

  1. Get a Janus token.

  2. Authorize a bucket upload with Metis to generate an HMAC URL for your PUT.

  3. PUT the file onto Metis in your project bucket with, say, wget.

  4. Repeat (2,3) with all files.

The same could be done with a JS client, which might favor the more complex chunked upload capabilities.

Search files

Users should be able to type a search in a search box and see a list of file results.

Add quotas to buckets

Metis has 'buckets' for each project. We should add a size quota to each bucket. This means:

  1. The total usage is updated on the bucket when a file is uploaded and when a file is removed.
  2. Uploads into the bucket fail if the upload size is greater than available free space.
  3. The current quota usage for all buckets is shown somewhere in the UI.
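The first two points might be sketched in Python as follows. All names are hypothetical, and the choice to credit back the old size when a file is overwritten is an assumption.

```python
class QuotaExceeded(Exception):
    """Raised when an upload would not fit in the bucket's free space."""


class Bucket:
    def __init__(self, quota_bytes: int):
        self.quota_bytes = quota_bytes
        self.used_bytes = 0
        self.files = {}  # path -> size in bytes

    @property
    def free_bytes(self) -> int:
        return self.quota_bytes - self.used_bytes

    def add_file(self, path: str, size: int) -> None:
        # Assumption: overwriting a file frees its previous size first.
        previous = self.files.get(path, 0)
        if size > self.free_bytes + previous:
            raise QuotaExceeded(f"{path} needs {size} bytes")
        self.files[path] = size
        self.used_bytes += size - previous

    def remove_file(self, path: str) -> None:
        self.used_bytes -= self.files.pop(path)
```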

Refactor Metis

We need to refactor the initial code of Metis to include models. On Feb 9th Saurabh did a code review and made suggestions about the Metis code. Jason will adopt those suggestions and integrate them into Metis.

Create buckets with access control

The 'bucket' is a unit of organization for files/folders on Metis. Rather than allowing individual folders in a single file hierarchy to become the method for organizing project life, we might think of a bucket as a blob of data, useful for a particular task.

The default bucket is called 'files' and is readable by all; this is the method that the Metis browser currently defaults to.

We should allow users to create new buckets that they can add files to using the existing file and folder methods. The bucket facility should work as advertised already; those without access to a bucket are already forbidden from accessing a file's contents from within that bucket (see Metis::Controller#require_bucket). All that need be added is:

  1. A 'create bucket' method, which creates a new named bucket. Bucket names are in snake_case and are the root object in the project.

  2. A 'bucket access' method, which sets users on an access control list. If there is no access list (the default), the bucket is public to the project (like 'files'). If a list of project users is specified, only the listed users may have access to the bucket.

Potential uses:

  1. One-off tasks that require temporary designated storage (e.g. bulk-process a bunch of files).

  2. Receive large deliveries of files, e.g. fastqs off an SFTP server. These can be copied by a remote process into the designated bucket (using the normal Metis file methods).

Metis basics

After exploring Riak CS, Open Swift, etc., they seem rather too large and complex. Let us instead attempt to make our own, very simple fileserver. It should basically support two operations:

get - retrieve a file resource given a bucket_name and a token_name

put - upload a file resource given a bucket_name and a token_name

These requests should both authenticate using HMAC, which lets us sign a request using a secret key known to both ends.

So, for example, a get URL might look like this:

https://metis.ucsf.edu/get?bucket=magma-dev-ucsf-immunoprofiler&token=patient-flojo_file-IPICRC006.wsp&HMAC=hmacmessagestring
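A minimal Python sketch of signing and verifying such a get URL with a shared secret. The issue does not specify what is signed or which digest is used; here the message is assumed to be the bucket and token joined by a newline, the digest is assumed to be SHA-256, and the secret, host name and function names are placeholders.

```python
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode, urlparse

SECRET = b"shared-secret"  # placeholder; known to both ends


def _signature(bucket: str, token: str, secret: bytes) -> str:
    message = f"{bucket}\n{token}".encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def sign_get_url(base: str, bucket: str, token: str, secret: bytes = SECRET) -> str:
    """Build a get URL carrying an HMAC over the bucket and token."""
    query = urlencode({"bucket": bucket, "token": token,
                       "HMAC": _signature(bucket, token, secret)})
    return f"{base}/get?{query}"


def verify_get_url(url: str, secret: bytes = SECRET) -> bool:
    """Recompute the HMAC from the URL's parameters and compare."""
    params = parse_qs(urlparse(url).query)
    try:
        bucket, token, sent = (params[k][0] for k in ("bucket", "token", "HMAC"))
    except KeyError:
        return False
    expected = _signature(bucket, token, secret)
    # compare_digest avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode(), sent.encode())
```

A production scheme would normally also sign an expiry time so that a leaked URL stops working eventually.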

Support file name validation on folders

Each folder on metis should have a basic filename validation regexp. Any regexp can be set here - except if the regexp does not match existing files in the folder, Metis complains. Once a regexp is in place, if a user attempts to authorize a file in this folder, Metis will complain if it doesn't match the appropriate regexp. Otherwise files are created as normal.
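The behaviour described here could look roughly like this in Python. The `Folder` class is hypothetical, and whole-name matching (`fullmatch`) is an assumption; the issue does not say whether a partial match would be enough.

```python
import re


class Folder:
    def __init__(self, files=()):
        self.files = list(files)
        self.pattern = None  # no validation by default

    def set_validation(self, regexp: str) -> None:
        """Install a regexp, refusing it if existing files do not match."""
        pattern = re.compile(regexp)
        bad = [name for name in self.files if not pattern.fullmatch(name)]
        if bad:
            raise ValueError(f"existing files do not match: {bad}")
        self.pattern = pattern

    def authorize(self, file_name: str) -> bool:
        """True if a new file with this name may be created here."""
        return self.pattern is None or self.pattern.fullmatch(file_name) is not None
```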

UI, metis-upload-controller.jsx, destroy initialization web worker after usage.

We use two web workers for our uploads. One web worker (the upload worker) should be a singleton and, once instantiated, be left alone. The other is an initialization web worker (it handles the initial handshakes with the server by sending initial file data). This 'initialization web worker' needs to be torn down after use (using 'terminate') since it just clogs up memory.

Files uploading to non-project directories

Found when trying to fix #61 . Files are uploaded to non-project directories (which have to exist first??). For example, I'm seeing:

data/
  uploads/
  data_blocks/
     <my file>

Whereas I think it should be:

data/
  ipi/
    uploads/
    data_blocks/
      <my file>

upload#authorize returns a 500 if it does not receive metis_uid

metis_uid is expected to be set in a cookie (as an aside, it ought to be accepted in a header as well, or exclusively). If the cookie is not set, a 500 results when the upload is created, because there is a not_null constraint on the metis_uid column (rightly so).

Set better return codes from the server.

We need to set better return codes from the server. Presently, if we try to log in through Metis and we are authenticating against the wrong server (janus-dev vs janus-stage or whatever), the system can get stuck in a loop. We need to return better status codes from the server so the UI can respond appropriately.

batch ssh/rsync uploads onto metis

Sometimes a user might want to upload a large amount of sequencing data into metis and link it to records in magma. How can they accomplish this?

  1. Allow users to add an SSH public key into Janus.

  2. When a user wants to dump a bunch of files in place, they request a project bucket.

  3. Metis exposes an isolated (chrooted, etc.) bucket that they can connect to via SSH, e.g. ssh://metis.ucsf.edu/12345 with whatever filesize quotas we like.

  4. Any files can be moved into this bucket using your favorite SCP/SFTP client (e.g. rsync --partial --progress if you like), authenticated with your SSH key. Metis assimilates them into its file database; they are available in that bucket (unorganized) for that project.

Copy files

There is no way to copy a file from one location to another on Metis. Following #53 this is now cheap and easy, and only involves creating a new file link. We also need this to provide mountetna/magma#51.

The basic requirement is simple, and would work like

  • POST /copy_file/:bucket_name/*file_path { new_folder_path }.
  • Metis checks that you have the ability to read the original file path and write to the folder path.
  • If the new_folder_path is in a bucket with an owner other than metis, Metis checks that there is an hmac signed by the owning application along with the POST. (Can the HMAC include new_folder_path if it is in the POST?)

There is maybe some desire to allow several files to be copied in a single operation, to prevent lots of hmac-checking and the like to process perhaps hundreds of copy operations, although this seems unnecessary to start.

File upload missing current_byte_position parameter

Running Metis on latest master (986c76e), I get an error trying to upload a file via the browser:

POST to https://metis-dev.etna-development.org/ipi/upload/TestBucket/<file name + headers>

Leads to a 422:

{"error":"Missing param current_byte_position"}

Sent params were:

-----------------------------58833755134078827354081713951
Content-Disposition: form-data; name="action"

blob
-----------------------------58833755134078827354081713951
Content-Disposition: form-data; name="blob_data"; filename="blob"
Content-Type: application/octet-stream

[binary PNG blob data omitted]
-----------------------------58833755134078827354081713951
Content-Disposition: form-data; name="next_blob_size"

1024
-----------------------------58833755134078827354081713951
Content-Disposition: form-data; name="next_blob_hash"

171a5b3ced40ce3d59050ff2bd70be04
-----------------------------58833755134078827354081713951--

Wonder if this is an issue on just the first chunk? Will investigate. Upload worked fine for me on the code included with the VM, before pulling down the latest code. Am running on migration version 18.

Backups to Glacier

When a file is uploaded, it is automatically backed up on Glacier, and the reference is saved to the Metis database.

Compute md5 sums for files

When a file is uploaded, Metis should automatically schedule a job to compute an md5 sum. A Metis worker process (probably using Polyphemus's "brother" interface) computes the md5 and sets the appropriate entry in the database.
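For reference, the hashing step of such a worker can be done in fixed-size chunks so that large files never need to fit in memory. This Python sketch shows only that step; the job scheduling and database update are out of scope, and the function name is hypothetical.

```python
import hashlib
import io


def md5_of_stream(stream, chunk_size: int = 1024 * 1024) -> str:
    """Return the hex MD5 of a binary stream, reading it chunk by chunk."""
    digest = hashlib.md5()
    # iter() with a sentinel keeps calling read() until it returns b"".
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


# Usage with an in-memory stream; a worker would open the file in "rb" mode.
print(md5_of_stream(io.BytesIO(b"hello world"), chunk_size=4))
```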

Support file types

One of the main reasons for building our own file service is to be able to extend it with type-specific functions (e.g. bam slicing). The first step must be that Metis is aware of the file's type.

The simplest implementation would rely on the 'file' command, which guesses file types based on examining the first few bytes. Probably Metis should run this on upload completion (along with md5 hashing) and save the result to a column.

Basic file types can be represented with an icon in the browser. Subsequently we can build endpoints that operate on certain file types to yield transformations and custom viewers for those types.

FCS streaming

You should be able to submit a list of gates as polygons via a JSON request.

In return you may get a matrix of rows (cells) from the FCS file, either in JSON or FCS (binary) format.

Upload a file

Metis should accept a file for upload. There are already controllers for this written long ago by Jason; however, since the authentication logic has changed these need to be updated to work with the Etna::Auth and Etna::Hmac layers.

I am also reconnecting the Javascript client to the server, also written long ago by Jason. My chief aim here is some refactoring of the reducers, and replacing the Janus login cycle pieces with new JWT-based user info.

Organize files on disk

Files should be sorted by project. Currently Metis defines a 'project_path' for each project in config.yml.

File data is stored in this path - how is it organized?

Each Metis::File expects to have data at a certain location. Currently this is stored by file_name, which creates some issues:

  • The file_name cannot be an arbitrary string. In general it seems dangerous to create files based on arbitrary user-strings. Restricting the file-name format to something safe (no spaces, apostrophes, etc.) will probably irritate users.

  • A file_name in the Metis::File record and a file_name on disk may get out of sync. If the client is also identifying files based on their file name, it might also get out of sync.

Alternatives:

  1. Use an md5sum to store the file. This is bad because if the data changes, the location changes. Also the server does not know the location until the upload is complete.

  2. Use a unique id for each file. ???

Support remote mount of buckets

A feature devoutly to be wished is remote mounting a Metis bucket as a local folder. Why download a BAM when a slice will do? Some criteria:

  1. Should be encrypted/authenticated. This eliminates NFS and makes SMB/CIFS difficult.

  2. Should mount in user space. A user on e.g. a cluster might not have root; a global mount would risk exposing protected data.

The main candidates are CIFS, sshfs, and webdav. CIFS might be hard to encrypt and authenticate. Sshfs might be slow and have trouble reconciling auth methodologies (must auth against etna credentials).

Webdav seems the most attractive; authentication can happen via http headers as usual. Encryption comes from TLS/HTTPS. Files can be served through the same Etna service via a dav endpoint and maybe rack_dav.

Connect QB3 cluster to Metis

We need to move data from QB3 to Metis.
Presently we cannot reach Metis from QB3. Jason Jed and Jason Cater believe it to be a firewall issue.
Jason Cater is working with Dave Oh in CIP to diagnose the issue.

Production launch requirements

Here are some issues we might require for actually using Metis in production:

  1. A working web browser.
     • Users can see and browse a directory listing.
     • Files can be uploaded into a directory
     • Files can be downloaded via HMAC link
     • Directories (folders) can be created
     • File properties like size, HMAC, update time, author can be read
  2. Secure file storage
     • Files are organized on disk under safe filenames
     • Folders correspond to a folder on disk.
     • File md5 sums are recorded in the file record
     • Files can be removed from the disk
     • Files can be renamed
     • Read-only files cannot be removed by any user
     • A read-only flag can only be added and removed by admins
     • Files and directories are marked as read-only on disk and in the database
     • Files are backed up.
  3. Bulk access
     • The /list/ api allows Etna token-users to generate a JSON listing of a folder, complete with download URLs, creation/update times, etc., suitable for a bash or wget client
     • The /list/ api accepts globs
     • Sufficiently small folders can be downloaded as an archive

Most of the outstanding work seems to be on:

  1. Read-only file operations
  2. Backups

UI, Entry layout, Get smarter about the layout display.

Right now the layout displays "fileFails", "fileUploads", and "fileList" in that order. When an entry moves from one to the next (as in completes or fails), the entry can seem to "jump" around the display. This is annoying and confusing. We need to develop a better display scheme so that this does not happen and the file groupings are clear.

UI, Upload Worker, Upload Speed Window. Need to reset value on file upload switch.

When we upload files we calculate the upload speed on the client. This calculation helps us determine the size of the next blob to upload (so we are not choking out slow connections with massive blob uploads). When we swap to another upload or resume an upload we need to reset the upload speed. There is an array of upload speeds that we average out. This array should be reset on file upload resets/resumes.
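The averaging window and its reset can be sketched as follows. The real client is JavaScript; this Python version only illustrates the logic, and the window length, blob-size bounds, one-second target and all names are assumptions.

```python
from collections import deque


class SpeedWindow:
    """Rolling average of recent upload speeds, used to size the next blob."""

    def __init__(self, size=5, min_blob=1024, max_blob=1024 * 1024,
                 target_seconds=1.0):
        self.samples = deque(maxlen=size)  # bytes per second, newest last
        self.min_blob = min_blob
        self.max_blob = max_blob
        self.target_seconds = target_seconds

    def record(self, bytes_sent: int, seconds: float) -> None:
        if seconds > 0:
            self.samples.append(bytes_sent / seconds)

    def average(self) -> float:
        return sum(self.samples) / len(self.samples) if self.samples else 0.0

    def next_blob_size(self) -> int:
        # With no samples (fresh or just reset) start from the smallest blob.
        if not self.samples:
            return self.min_blob
        wanted = int(self.average() * self.target_seconds)
        return max(self.min_blob, min(self.max_blob, wanted))

    def reset(self) -> None:
        """Call when switching to another upload or resuming one."""
        self.samples.clear()
```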

UI, Lock out interface commands between upload cycles.

On a very slow connection it may take a minute for commands to APPEAR like they execute. We usually wait for the server to respond before showing verification. On a slow connection this makes the UI look/feel sluggish. We should lock out the controls so the user doesn't run the command multiple times. On a fast connection the user would not even notice the lock out.

UI, web workers, split duties.

Right now there is one web worker that has two duties: initialization of a file upload, and the file upload itself. Since initialization happens separately from upload, we should split the web workers. That way we can have smaller code for the initializer and save some memory.

rsync out of metis

Right now metis is accessible through a web interface. However, people seem unhappy with the idea of bulk downloading via such an interface, and the usual method - making a zip archive - is impractical for large files.

Instead we should support the canonical solution, secure FTP.

A simple implementation might be that each user can upload some SSH keys, which allow them to ssh into a chrooted endpoint that gives them read-only access to the files from each of the projects they belong to - they can then download these via rsync.

UI, Disappearing List Item on Upload Complete.

When there is a single item that has been uploaded and displayed AND we are uploading a second item AND that second item completes, the second item does not get added to the File List object on the UI.

The data is on the server but the UI misses the display of it. When we refresh the page we can see the items.
If we add break points to this process it works fine.

Steps to reproduce.

  1. Upload a small 1MB file.
  2. Upload a larger 100 MB file.

Versioning

When a new version of an existing file is uploaded, instead of deleting the old file it should be kept as a previous version.

  1. The old binary is moved to .
  2. The new binary is created in place at
  3. A 'Version' record is created for the file, tracking the old binary's md5sum and the backup_id for the old version, if any.
  4. The file's json record reports a list of previous versions, if any, as a list of md5sum.
  5. The file download API responds to a version parameter which downloads the given md5sum revision of the indicated file.
  6. The UI allows the user to examine the list of versions for a file and download one of them.
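Points 3 to 5 can be illustrated with a small in-memory Python model. The on-disk locations in points 1 and 2 are not specified in the issue, so this sketch simply keeps old binaries in a dictionary keyed by md5sum; the class and its methods are hypothetical, and backup ids are left out.

```python
import hashlib


def md5sum(content: bytes) -> str:
    return hashlib.md5(content).hexdigest()


class VersionedFile:
    def __init__(self, content: bytes):
        self.content = content
        self.versions = []   # md5sums of previous versions, oldest first
        self._previous = {}  # md5sum -> old binary (stand-in for disk)

    def upload(self, new_content: bytes) -> None:
        """Replace the content, keeping the old binary as a version."""
        old_md5 = md5sum(self.content)
        self._previous[old_md5] = self.content
        self.versions.append(old_md5)
        self.content = new_content

    def to_json(self) -> dict:
        return {"md5": md5sum(self.content), "versions": list(self.versions)}

    def download(self, version=None) -> bytes:
        """Return the current content, or the revision with the given md5sum."""
        if version is None:
            return self.content
        return self._previous[version]
```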

Download a file

  1. You should be able to download a file from Metis.

  2. The download URL should be HMAC-signed so you don't need a token to get the file.

support thumbnailing

Many files on metis will be images; metis should support thumbnailing of these images in some way, probably via a worker process. Thumbnails might work like this:

  1. When an upload is made/a file exists, a "thumbnail" may be requested for that file at a given location (the name of the thumbnail is up to the requester).

  2. Optionally Metis puts a placeholder in place for the thumbnail.

  3. Metis builds a thumbnail and puts it in place at the requested location.

Questions:

Where is the work queue maintained? How does Metis check for new work?

Assimilate files from the filesystem into Metis

Files in metis need to be tracked: named appropriately, placed in the correct directory, and given an entry in the File table in the Metis database, which then generates an md5_hash and an aws_glacier_backup_id via remote workers.

Currently there are a number of files on the disk that are NOT tracked in this way (there is no database entry). These need to be assimilated into Metis; there should be a metis command to do this. It should:

  1. Copy the file to the metis storage directory for that project (e.g. /data1/metis/files/my_project/)
  2. Create a database entry for the file, including name, project_name, size
  3. Schedule jobs to compute md5 and backup to glacier.

File immutability

Some files (e.g. fastq files) should be write-once and thereafter should not be over-writeable or removable (without effort) to prevent data loss. Perhaps the effort involves administrator override?
