mlmi2-cssi / foundry Goto Github PK

View Code? Open in Web Editor NEW

75.0 75.0 16.0 45.89 MB

Simplifying the discovery and usage of machine-learning ready datasets in materials science and chemistry

License: MIT License

Python 100.00%

chemistry data-science datasets machine-learning materials-science

foundry's People

Contributors

Stargazers

Watchers

Forkers

ethantruelove flawnson ianfoster batmanabcdefg braedencu wardlt cyschneck sgbaird mibzzy rjacobs914 wdwzyyg muchanem marshallmcdonnell sailfish009 kurtmckee michaelcshn

foundry's Issues

Collect Small Molecule Datasets

Consider how to shape and prepare datasets like GDB9, and the GDB-G4MP2 datasets for usage in Foundry

Add ability to publish models through Foundry

Via DLHub, enable model publication as a pass-through from this client

2 Issues On Examples Page of Foundry Docs

In the docs, the following code is provided:

res = f.load_data()
imgs = res['train']['input']['imgs']
coords = res['train']['input']['coords']


# Show some images with coordinate overlays
import matplotlib.pyplot as plt

n_images = 3
offset = 150
key_list = list(res['train']['input']['imgs'].keys())[0+offset:n_images+offset]

fig, axs = plt.subplots(1, n_images, figsize=(20,20))
for i in range(n_images):
    axs[i].imshow(imgs[key_list[i]])
    axs[i].scatter(coords[key_list[i]][:,0], 
                   coords[key_list[i]][:,1], s = 20, c = 'r', alpha=0.5)

Issue 1:
On line 3, coords is defined as coords = res['train']['input']['coords']. However, it should be defined as coords = res['train']['target']['coords']. The former defenition throws errors, while the latter works just fine.

Suggested Change
From: coords = res['train']['input']['coords']
To: coords = res['train']['target']['coords']

Issue 2:
On line 2, imgs is defined. However, that defenition is not used when defining key_list

Suggested Change
From: key_list = list(res['train']['input']['imgs'].keys())[0+offset:n_images+offset]
To: key_list = list(imgs.keys())[0+offset:n_images+offset]

Add Testing

Set up Sphinx Autodocs

Set up Sphinx with our GitBook docs. Publish from the docs GitBook branch.

assess Foundry dataset documentation

Read through the Foundry Dataset documentation for understanding in GitBook. Add comments for areas that could use clarification or improvement.

Add Stan Segmentation Example

Dataset: _test_foundry_stan_dendrite_segmentation_v1.1

@ethantruelove

Add Google Colab to Docs

Now that we support cloud server usage, we should call this out in the README

Dataset description helpers

Look into merging functionality from dlhub_sdk for model description into Foundry
https://github.com/DLHub-Argonne/dlhub_sdk/blob/master/dlhub_sdk/models/datasets.py
Remove in dlhub_sdk https://github.com/DLHub-Argonne/dlhub_sdk

Documentation for running a model

Update 6/13/22: Blocked by #23

Fix cloud CI tests

Currently the cloud CI fails with authentication issues. As of 9-25-2020, everything works locally after login, but the Github Actions builds fail even after the secrets have been added https://github.com/MLMI2-CSSI/foundry/settings/secrets.

Add setup-python dependency caching in GHA

Sometimes tests can take a while to run due to package installs; we can cut down on run time by caching the dependencies in GitHub Actions (GHA)

https://github.blog/changelog/2021-11-23-github-actions-setup-python-now-supports-dependency-caching/

NOTE: should probably do this for dlhub-sdk as well

Set up a monitor to check CI status of core services

For services like DLHub, funcx, Xtract, etc we want to make sure we have testing (ideally daily) to know when something is down. Otherwise we go on a goose hunt for Foundry.

From Ben G.: This seems like a powerful component to tie everything together: https://reportportal.io/features

Benchmark Planning

Planning thread for defining benchmark challenges.

i.e.
With datasets a, b, c with inputs X predict outputs Y with a given metric M

Need

Defined datasets, inputs, outputs, splits, validation sets, and metrics
Infrastructure to automate running tests

publish the QOVae dataset

Publish the QOVae dataset to Foundry

A list of all potential data targets, and more info on this dataset, can be found here:
https://docs.google.com/spreadsheets/d/1SUrYEBoO1-L-ShIuMkCqd3v1Plhy-kMPT04XBtOmn2E/edit?pli=1#gid=0

This task will entail locating the data, understanding what the metadata values should be (e.g. authors, units, keys, etc), and shaping it into a Foundry dataset for publication.

Info on how to do this can be found in the Foundry documentation and in the dataset publishing notebook in the foundry repo

add checks or appropriate handling for dataset publishing

Currently Ryan J and Lane are getting the following error (at bottom) when publishing datasets. After some initial investigation with Ben, it appears that the metadata are not being set correctly, however Ryan's publishing code does indeed define metadata, like so:

metadata = {}
metadata['inputs'] = inputs
metadata['package_type'] = 'tabular'
metadata['dataset_doi'] = 'https://doi.org/10.1016/j.commatsci.2019.06.010'
metadata['publisher'] = publisher
metadata['affiliations'] = affiliations
metadata['authors'] = authors
metadata['description'] = 'Dataset containing DFT-calculated dilute alloy impurity diffusion barriers for 408 host-impurity pairs'

f = Foundry(index='mdf-test')
res = f.publish(metadata, data_source, title, authors, short_name=short_name, update=True) #update=True if modifying an existing dataset
print(res)

This is consistent with our model publishing notebook guide, but varies slightly from what's currently in our documentation.

We probably just need to add some slight handling to f.publish() to make sure the metadata blob is wrapped in the proper keys.

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~/anaconda3_python37/lib/python3.7/site-packages/foundry/foundry.py in load(self, name, download, globus, verbose, metadata, authorizers, **kwargs)
    189         try:
--> 190             res["dataset"] = res["projects"][self.config.metadata_key]
    191         except KeyError as e:

KeyError: 'projects'

The above exception was the direct cause of the following exception:

Exception                                 Traceback (most recent call last)
<ipython-input-6-eb1fba88b824> in <module>
      1 from foundry import Foundry
      2 f = Foundry(no_browser=True, no_local_server=True, index='mdf-test')
----> 3 df = f.load('_test_diffusion_v1.2')

~/anaconda3_python37/lib/python3.7/site-packages/foundry/foundry.py in load(self, name, download, globus, verbose, metadata, authorizers, **kwargs)
    190             res["dataset"] = res["projects"][self.config.metadata_key]
    191         except KeyError as e:
--> 192             raise Exception(f"load: not able to index with metadata key {self.config.metadata_key}") from e
    193 
    194         del res["projects"][self.config.metadata_key]

Exception: load: not able to index with metadata key foundry

Iris Dataset: Models on Foundry

It is important to demonstrate that Foundry is capable of connecting data to ML models and vice versa. To this effect, we would like to eventually have the Iris example (examples/iris) running models in Foundry. This will require three steps:

Build models locally that take Foundry-pulled Iris data and analyze. This could be SVMs, KRR, or other supervised or unsupervised techniques. For eg: https://scikit-learn.org/stable/auto_examples/svm/plot_iris_svc.html
These trained models need to be described and uploaded to DLHub through Foundry. (@meschw04 and @blaiszik make sure that models can be uploaded to DLHub through Foundry).
Code in the notebook needs to be rewritten such that models are run through Foundry/DLHub instead of trained/run locally.

The final product should demonstrate the process of training a model locally on Foundry-pulled data, uploading this model through Foundry, then running this model through Foundry. Foundry end-to-end. A scientist should be able to see how to upload their own novel models and compare to existing models.

Automate publishing matminer and deepchem datasets

write a script you can run on your local machine (like a .py script you can run in PyCharm) that automates publication of datasets.

The following are some general things to think about, but I suggest breaking them out into your own tasks in this story

Some steps to investigate:

whether or not you can pull the download links to the datasets programmatically, by webscraping with BeautifulSoup or a similar Python package
how to write the metadata to .json files (use the json package)
how to read in the metadata files and data files from your local machine programmatically (I suggest keeping them all in a folder that you walk through using os or something similar)
how to map the metadata files to the data files
how to pass arguments to the script so you can run it easily from the commandline (I suggest argparse) -- an example of an argument you might want to pass in would be the path to the directory containing the data
The ultimate goal is to be able to just run the script and publish these datasets with as little human labor as possible. So if there's something you can code to make less human work in the future, do that something! :)

if there does not appear to be a way to read in the title and related information from the metadata, then create a new issue to add that to foundry and amend publish(). But first, see if you can include it in the metadata

Addressable Data Environments

Add capabilities to address specific data environments into client

Could specify the environment in the Foundry instantiation
Explore concepts around global data stores rather than just local

Cosmetic changes

Add simple logo to Readme
Add better overview description of project to readme

Documentation for publishing a model

Once Foundry has model publishing capabilities from issue #23 , we'll need documentation on how to use it. This should be written in the docs on gitbook. This is closely related to issue #179 about creating/updating the model publishing guide notebook.

Split Pydantic definitions from Foundry class file

Add a separate file models.py to hold the Pydantic class definitions that are currently in foundry.py

Be sure to re-add imports from models.py in foundry.py

Publish more datasets on Foundry

We'll add a list of datasets here that should be published to Foundry. You can use the dataset publishing guide and the documentation to do this.

update tests for Foundry `publish()` to auto-clean

When testing the Foundry publish() method, test-packages get published to the destination endpoint (currently on UIUC). Right now we are manually clearing them, but we should programmatically clean them up instead, at the end of our test suite.

Add the necessary clean up methods (that require using the Globus SDK) to the end of the test suite

checkin with Ben about the Globus SDK methods to use

Also want to clean up model publication detritus, if possible (check in with Logan)

(NOTE: also don't have a method for this in MDF CC, should probably add one)

Input types for metadata

For data upload, data types can be defined as either an input or a target. However, there are occasions where one may switch an input to be a target. For example, I may try to predict glass forming ability with glass transition, liquidus, and crystallization temperatures (horrible model). Later, I decide to predict the liquidus temperature as a function of other composition properties. The liquidus temperature is an input for one task but the target for another.

Add documentation to support community involvement

Add a code of conduct
Add a contributor guide

enable dataset publishers to publish using HTTPS

Publishing a dataset using HTTPS is now possible that we're using Eagle, which has Globus Connect Server 5. We should expose this functionality to our users such that they can publish a dataset using HTTPS and don't need to upload the data to a Globus Endpoint, share the source_url, etc.

Update `publish()` to take links

Need to take "links" to pass onto the MDF connect_client. To do this one, need to add helper function to MDF connect client first (something like add_links()

add helper functions to Connect Client
add logic in Foundry publish() to pass on links metadata

before this story also included passing in all datacite data. preserving that text here in case it's needed in the future:

need to take all necessary keys to add to the datacite block (called dc) (e.g. description, resource_type, etc)
For "description", in data cite it's just Description, and then you can give it a DescriptionType:“Abstract”. Link to dc metadata schema: https://github.com/materials-data-facility/data-schemas/blob/12bb3c7da2a5d8667e5cdb8fe90f18b77094af4d/schemas/dc.json#L411

So it'd be something like:

{
    "dc": {
        "descriptions": [{
            "description": "Blah blah blah",
            "descriptionType": "Abstract"
        }]
    }
}

Things I need in the dc block:

def create_dc_block(self, title, authors,
                        affiliations=None, publisher=None, publication_year=None,
                        resource_type=None, description=None, dataset_doi=None,
                        related_dois=None, subjects=None,
                        **kwargs):

Create (or update existing) model publishing guide notebook

We have a model publishing guide that needs to either be updated or redone once the foundry model publishing capability is complete (#23 )

Document how to add a dataset

Add documentation on how to create and publish a Foundry dataset

add tests for model running

We need to add tests to test_foundry_gha.py for running DLHub models -- especially those with different funcx endpoints IDs (meaning the containers run on a server other than the default, which has been UC River) and older models published with varying versions of DLHub SDK and Globus SDK (this is particularly important given the code-breaking changes from Globus SDK 2 to 3).

Here are the two stubbed out tests in question (with models provided to test servables published with different versions of DLHub and Globus SDK 3, to ensure that they all still work when running using current DLHub, Globus SDK, and funcX versions). test_model_run_endpoint() will test running a model when a different funcx endpoint is specified.

Please add the assertions needed to make these tests function properly. Additionally, we should specify exactly which DLHub SDK versions were used for each model (which can be found at dlhub.org by searching by model name).

def test_model_run():
    f = Foundry()
    # test old model published with old DLHub, Globus SDK 2
    res = f.run('zhuozhao_uchicago/Noop', [1, 2, 3, 2, 2, 3])

    # test model published with newer DLHub, Globus SDK 2
    res = f.run("aristana_uchicago/noop_v14", {"data": True}, debug=True)

    # test model published with newest DLHub, Globus SDK 3
    res = f.run("aristana_uchicago/noop_dlhub_v4", {"data": True}, debug=True)

    res = f.run("aristana_uchicago/noop_dlhub_v4", True, debug=True)

    res = f.run("aristana_uchicago/noop_dlhub_v4", {"data": True})


def test_model_run_endpoint():
    f = Foundry()

    # test data
    X_test = np.zeros((11, 11))
    x_cen, y_cen = 6.0, 5.0   
    sig_x, sig_y = 0.6, 1.5
    for x in range(11):
        for y in range(11):
            X_test[y][x] = 1000*(np.exp(-(x-x_cen)*(x-x_cen)/2*sig_x -(y-y_cen)*(y-y_cen)/2*sig_y))
        
    # the input needs to be normalize to 0-1, e.g., rescale using min-max norm
    X_test = (X_test - X_test.min()) / (X_test.max() - X_test.min())

   # inputs
    inputs = X_test[np.newaxis, np.newaxis].astype('float32')

    # run the BraggNN model trained using plain PyTorch 
    res = f.run("kj.schmidt913_gmail/BraggNN_PT", inputs, funcx_endpoint='7d7d5826-6167-4d6e-a591-b57628d60588')

HTTPS load dataset contents

Querying the dataset by source_id should give you the required information to create the file list (walk the Globus directory), then create the HTTPS requests for each file. We might need to also add headers for Auth

fix Apple M1 bug with keras

Currently blocking any TF/Keras related development or use on M1 machines.

Info here: https://stackoverflow.com/questions/65383338/zsh-illegal-hardware-instruction-python-when-installing-tensorflow-on-macbook

https://stackoverflow.com/questions/65242614/why-does-loading-tensorflow-on-mac-lead-to-process-finished-with-exit-code-132

Seems TF 2.6 is now supported via its PluggableDevice interface?
https://makeoptim.com/en/deep-learning/tensorflow-metal

alternatively, here was the custom macOS solution (now archived/obsolete): https://github.com/apple/tensorflow_macos

https://developer.apple.com/metal/tensorflow-plugin/

Some thoughts on using Foundry in geosciences (xarray, netcdf, zarr, etc.)

This issue is a follow up to the great discussion we had at the Pangeo + Globus Labs meeting earlier this week.

Foundry seems like a great project that would fill and important niche in the geosciences: allowing scientists to publish simulation data from their globus-connected HPC systems to share with the broader world. As I understand it, Foundry is currently focused exclusively on supervised machine-learning datasets that fit into the intput / target paradigm. This certainly captures some of the needs in geosciences, but not all of them. Sometimes people just want to publish a dataset for general consumption, not specifically for ML. So my first question is whether there is scope in the project for more generic dataset publishing via globus? If not, can you point me towards any other projects in that space?

Even if the answer is no, I think we will still want to use Foundry to publish ML-focused data in the geosciences.

Leaving that question aside for now, here are some random thoughts on what might make Foundry useful / appealing for geoscience / ocean / weather / climate / etc. users.

NetCDF is our data model

Foundry currently supports two data types: tablular data and hierarchical data. In geosciences, we tend to use a similar schema to distinguish between data types. However, we tend to say tabular data vs. array data. And in the geosciences, 99% of array data is encoded in the NetCDF data model. And probably 80% uses CF Conventions.

When data follow these conventions, the dataset metadata already contain fields like standard_name, description, units, etc. etc. within the data files themselves (i.e. self-describing). Consequentially, some of the metadata that Foundry requires for describing datasets may be redundant. So a major design question for incorporating NetCDF-type data into Foundry would be how to harmonize Foundry's metadata requirements with the metadata standards commonly used in geosciences via NetCDF / CF conventions. Perhaps some of the Foundry metadata could be automatically discovered by examining the data.

Xarray is our python API

Just like Pandas provides data structures and a computational API that matches the tabular data model, xarray provides data structures and a computational API that matches the NetCDF data model. So just like Foundry maps tabular data to be opened in Pandas, we would want to map NetCDF-style data to be opened in Xarray.

Xarray also serves as swiss-army knife of file formats. Similarly to Pandas, It can read dozens of different data formats and load them all into the same data model. This is a huge cognitive boost for scientists, who can then easily write analysis code that interoperates with any of these formats.

Our files are NetCDF, [Cloud Optimized] GeoTIFF or Zarr

For the most part, I imagine geoscientists wanting to share array data via Foundry will be using one of three formats.

NetCDF, often with hundreds / thousands of files in folder, arranged sequentially along some dimension such as time. (Note that Xarray can open these collections as single Dataset object, leveraging Dask. I see this feature has come up elsewhere in Foundry: #52. Under the hood, NetCDF4 files are HDF5, but as a rule we almost never open them directly with h5py, preferring to use the NetCDF model instead.
Cloud Optimized GeoTIFF (or COG) is the predominant format for geospatial imagery data. Each file holds a single multiscale image. The files are often catalogued using Spatio-temporal asset catalog. The Radiant Earth MLHub holds lots of remote sensing ML training datasets and shares similar aims to Foundry, so that might be a worthwhile project to investigate.
Zarr - Zarr is a hierarchical format similar to HDF5. However, rather than storing everything in a single file, Zarr explodes the data into many individual files / objects, including separate json metadata. This has advantages and disadvantages depending on your context. (HPC filesystems tend not to love lots of small files; cloud object stores work great with it). One clear advantage is that it is trivial to append to these datasets along any dimensions. So where we might have 1000 individual netCDF files, we would only have one single Zarr store. When sharing Zarr data via Foundry, we would probably just share a single Zarr group (potentially comprised of thousands of individual files, which have no meaning outside the context of the Zarr group / array). Like with NetCDF, there is a question of redundancy between the metadata stored in Zarr itself vs. input to Foundry.

If you're interested in moving this forward, we would be happy to serve as guinea pigs 🐹 in exploring Foundry for geosciences. We have many datasets sitting on globus-connected systems that we would like to publish.

cc @mgrover1, @scollis, @cisaacstern, @jbusecke,

Use Dask for Return Values

Dask could solve a lot of the problems in Foundry around unified function return value types and working with datasets that are larger than can fit into memory

https://dask.org

Dask natively has APIs that work with pandas, and clever ways to work with JSON files and directories of input files (e.g., directories of images or csvs)

Paralellize HTTP file GETs

Add parallelization to the https get functionality. Below is an example, but there might be better ways to do it.

https://github.com/MLMI2-CSSI/foundry-cli/blob/master/foundry_cli/main.py

Close old branches

Check if there is anything important in the other branches and close them if not.

MAST-ML Support

Add MAST-ML Support byline to the README

Update CLI to pin datasets in JSON specification to a version

Functionality should allow the key in the specification to be tied to the source_name instead of source_id and the value to be the version rather than unused.

Pytest and Travis deploy

The current state of tests/test_foundry.py is just to pass without flagging. There are several areas that need to be fixed here.
First, we need to make sure that tests actually pass when running f=Foundry(), mostly making sure that this part passes Globus Auth (recall problems around .gloubs-native-apps.cfg and travis.tar.enc). Use the tests from MDF forge as a template for this.
Second, place a test.json (or similar) on the endpoint used for foundry examples. Use asserts to make sure foundry can get to the endpoint. Once again, refer to MDF forge tests.
Third, make sure that Foundry can run a simple model (say, for eg., Ryan Chard's StringLength function) and return a correct result for some set input.

Add Deployment to PyPI

Add JSON Lines Support

Add support to read datasets shaped as JSON-line delimited files.

Requirements

Read entire file
Read a slice of a file based on line number

Use cases

HydroNet
Bandgap data

Add analytics

From proposal, we should track:

Users (authors of datasets. Can we track data user number by the globus accounts used to download the data?)
Citations to hosted datasets

from Ben: Track number of installs of foundry_ml
from Ari: looks like tracking installs may be unreliable? tbd https://pypistats.org/packages/foundry-ml
Ben and I discussed tracking users via Globus credentials, since we need to abstract away Globus anyways

Create HDF5 dataset publication notebook

Demonstrate to users how to publish an HDF5 dataset.

Create a new notebook called "hierarchical_dataset_publishing.ipynb" and add it to examples/publication_guides . Look at the existing data publication notebook for an idea of how to get started -- copy the overall format, but feel free to leave out the information about setting up Globus accounts (just add some markdown text pointing to the primary publishing notebook). That notebook is for publishing tabular data, but is a good place to start. Also look at the Foundry documentation for more info.