materialscloud-org / optimade-maker


Tools for making OPTIMADE APIs from various formats of structural data (e.g. an archive of CIF files).

License: MIT License

Python 99.15% Dockerfile 0.36% Shell 0.49%

optimade-maker's Issues

Enhancement on optimade-launch

Okay, I got this working, but it would be good to have two changes:

the option to specify the provider details (the config.yml file would probably make sense);
the possibility to skip the "Do you want to edit it now" prompt after creating a profile.

Originally posted by @eimrek in #12 (comment)

Preparing public release

Stuff to-do before public release:

Fix naming #14
Package layout -- move the mcloud master scripts out to a separate repository and potentially generalize the archive module to work with other providers (Zenodo etc.)
Address performance considerations for very large databases and enforce corresponding limits on uploads
Update config file/ID format #15 & #11

Currently, when generating the OPTIMADE jsonl file, custom/provider properties automatically get the mcloudarchive prefix by default, or a different prefix if changed via the MC_OPTIMADE_PROVIDER_PREFIX env variable.

I am still wondering whether it might make sense to keep the jsonl file provider-agnostic (e.g. using a placeholder prefix like _custom or _provider). Possible reasons:

You don't need to know the provider when generating the jsonl file.
It is cleaner to serve the same jsonl from different providers.
It is easier to distinguish between provider prefixes and future namespace prefixes (e.g. is _stability a provider?).

That said, I understand that there is value in having the jsonl file exactly mirror the underlying data (MongoDB in our case) and the API responses, so maybe it's good to keep it as it is now.

Thoughts, @ml-evs?
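A minimal sketch of the provider-agnostic option, assuming the jsonl file stores custom properties under a placeholder prefix and the real prefix is applied only when the data is served. The function name `apply_provider_prefix` and the `_custom_` placeholder are hypothetical and not part of optimade-maker.

```python
def apply_provider_prefix(attributes, provider_prefix, placeholder="_custom_"):
    """Rename placeholder-prefixed keys so they carry the real provider prefix."""
    renamed = {}
    for name, value in attributes.items():
        if name.startswith(placeholder):
            # OPTIMADE provider-specific properties are named _<prefix>_<name>.
            name = f"_{provider_prefix}_{name[len(placeholder):]}"
        renamed[name] = value
    return renamed


# Hypothetical entry attributes as they might appear in a provider-agnostic jsonl file.
attributes = {"nelements": 2, "_custom_energy": -1.5}
print(apply_provider_prefix(attributes, "mcloudarchive"))
```

The trade-off mentioned above still applies: with this approach the jsonl file no longer mirrors the database contents and API responses exactly.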

support for "direct" jsonl files

I am wondering if we really need to support "direct" jsonl files in the optimade.yaml format.

Conceptually, it seems to me that the current purpose of optimake is to generate a jsonl file from other structural data formats, and optimade.yaml is something that helps to achieve this.

If we already have a jsonl file, then the only purpose I see is validation, and the optimade.yaml file does not seem strictly necessary.

But perhaps generating jsonl files and validating them is different enough to separate them?

E.g. we could have a different optimake subcommand for validation (optimake validate <jsonl-file>?)

Regarding the Materials Cloud Archive service, this change would affect it, as then we should add support for a "direct" jsonl file without any optimade.yaml file.

Custom properties should be type cast to the configured type

Currently, we do not make use of the type information provided in the config. For example, a user could have a CSV file with a column of values [1, 2, 3] that should be converted to strings. We should build a type map from the config and apply it when attaching custom properties to entries.
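A rough sketch of such a type map, under the assumption that the config declares one type name per custom property. The type names and the helper `cast_properties` are illustrative only and are not the actual optimade.yaml schema or optimade-maker API.

```python
# Hypothetical mapping from config type names to Python casts.
TYPE_MAP = {
    "string": str,
    "integer": int,
    "float": float,
    "boolean": lambda v: str(v).strip().lower() in {"true", "1", "yes"},
}


def cast_properties(row, property_types):
    """Cast each value in `row` to the type configured for its property name."""
    cast_row = {}
    for name, value in row.items():
        caster = TYPE_MAP.get(property_types.get(name))
        # Leave the value untouched when no known type is configured or it is missing.
        cast_row[name] = value if caster is None or value is None else caster(value)
    return cast_row


# A reader inferred an integer, although the config declares the column as a string.
row = {"label": 1, "energy": "-1.5"}
print(cast_properties(row, {"label": "string", "energy": "float"}))
```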

Default OPTIMADE properties are missing from info endpoints

Currently we only populate the info endpoint with the custom properties provided by the user. We should also fill in the standard OPTIMADE properties.

How to handle multiple versions of an archive entry?

Currently, if a user makes a new version of their archive entry, it is considered fully separate and independent of the other versions. This means that each version will have its own OPTIMADE API running, even if the data is exactly the same.

To demonstrate, I made two versions of my test entry:

https://staging-archive.materialscloud.org/record/2023.7 (version 2)
https://staging-archive.materialscloud.org/record/2023.8 (version 3)

where the difference in version 3 is that I just added an extra file that is irrelevant to the optimade data. This means these two APIs are exactly the same. This is a scenario that we probably want to avoid.

The two entries are currently online on https://dev-optimade.materialscloud.org/

Possible solutions for this:

Calculate a hash for each OPTIMADE dataset. If the new version's hash matches the old version's, just add a redirect/alias from the new to the old data. I don't think there's any metadata in the OPTIMADE API that should change with the version change. The index meta-database links to the archive page, but this can have a duplicate entry.
Only keep the newest version and redirect the old URL to the new URL. Probably not ideal, as the new version might change a lot (e.g. new ID scheme), and this might break applications relying on the old version.
Just keep the duplicate data, as is done now. This is the simplest option, but it is not efficient in terms of compute resources.

Thoughts on this?
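The hash-based option could be sketched roughly as follows, assuming each dataset is available as a JSONL string. The function `dataset_hash` is hypothetical; it canonicalizes each line so that key order and whitespace do not change the digest.

```python
import hashlib
import json


def dataset_hash(jsonl_text):
    """Return a SHA-256 digest of the entries in a JSONL string.

    Each line is re-serialized with sorted keys so that key order and
    whitespace do not affect the result; blank lines are skipped.
    """
    digest = hashlib.sha256()
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        canonical = json.dumps(json.loads(line), sort_keys=True, separators=(",", ":"))
        digest.update(canonical.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


version_2 = '{"id": "a", "type": "structures"}\n{"id": "b", "type": "structures"}\n'
version_3 = '{"type": "structures", "id": "a"}\n\n{"type": "structures",   "id": "b"}\n'

# Identical data in a new archive version gives the same hash, so it could be aliased.
print(dataset_hash(version_2) == dataset_hash(version_3))
```

Any field that changes on every extraction (a timestamp, for instance) would have to be excluded before hashing, otherwise two versions with identical structures would never match.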

Possibly affected by pymatgen CVE

Details: GHSA-vgv8-5cpj-qj2f

I will release and backport the update to optimade-python-tools, at which point this package should also be upgraded.

Support for pymatgen `Structure` ingestion

Many archives provide data as pymatgen Structure or ComputedStructureEntry objects, which are loadable directly with monty.json.MontyDecoder. We should probably support these.

Support for files endpoint

We could consider adding support for the OPTIMADE v1.2 files endpoint, which would allow each structure to be linked to additional files that may not follow the OPTIMADE format. This hinges on whether the eventually released v1.2 format will be flexible enough to reference files within archives.

`links:next` broken

Initially I thought it broke when data_returned was skipped, but now I realize that it was also broken before.

E.g. currently the alexandria-test is running on optimade-python-tools v0.25.1, and the problem is here:

https://dev-optimade.materialscloud.org/archive/alexandria-test/v1/structures

@ml-evs any ideas how to debug?

Direct support for JSONL files in config

I am preparing a big JSONL-only upload but realised we don't have a good way of specifying that in the config. I'll try to throw something together.

MongoDB slow for large databases

I spent some time investigating the slowness of the li-ion-conductors optimade database.

Here's the info endpoint: https://dev-optimade.materialscloud.org/archive/li-ion-conductors/v1/info

Accessing the /structures endpoint takes over 2 minutes.

I turned on performance profiling (db.setProfilingLevel(1)) and here's the part of the log for accessing the /structures endpoint (only the slow commands should show up here, meaning everything else was fast, I think):

/structures profiling output:

> db.system.profile.find().pretty()
{
	"op" : "command",
	"ns" : "li-ion-conductors.structures",
	"command" : {
		"aggregate" : "structures",
		"pipeline" : [
			{
				"$match" : {
					
				}
			},
			{
				"$group" : {
					"_id" : 1,
					"n" : {
						"$sum" : 1
					}
				}
			}
		],
		"cursor" : {
			
		},
		"lsid" : {
			"id" : UUID("f822cd93-f176-4405-8c3a-062a5c7e79d8")
		},
		"$db" : "li-ion-conductors"
	},
	"keysExamined" : 0,
	"docsExamined" : 4396695,
	"cursorExhausted" : true,
	"numYield" : 7910,
	"nreturned" : 1,
	"locks" : {
		"FeatureCompatibilityVersion" : {
			"acquireCount" : {
				"r" : NumberLong(7913)
			}
		},
		"ReplicationStateTransition" : {
			"acquireCount" : {
				"w" : NumberLong(7913)
			}
		},
		"Global" : {
			"acquireCount" : {
				"r" : NumberLong(7913)
			}
		},
		"Database" : {
			"acquireCount" : {
				"r" : NumberLong(7912)
			}
		},
		"Collection" : {
			"acquireCount" : {
				"r" : NumberLong(7912)
			}
		},
		"Mutex" : {
			"acquireCount" : {
				"r" : NumberLong(2)
			}
		}
	},
	"flowControl" : {
		
	},
	"storage" : {
		"data" : {
			"bytesRead" : NumberLong("17760670501"),
			"timeReadingMicros" : NumberLong(135074290)
		}
	},
	"responseLength" : 141,
	"protocol" : "op_msg",
	"millis" : 139328,
	"planSummary" : "COLLSCAN",
	"ts" : ISODate("2023-08-22T14:51:08.005Z"),
	"client" : "172.18.0.1",
	"allUsers" : [ ],
	"user" : ""
}

Some key points:

The slow part seems to be just counting the total number of structures: the pipeline uses $match on everything in the collection, groups all documents under a single constant _id, and sums 1 for each of them.
It takes 139328 milliseconds to run.
Almost all of that time (135074 milliseconds) is spent reading about 17 GB of data from disk.
The command has to do a full scan of the documents (COLLSCAN). I don't think this command, in its current state, could be sped up by using indexes.

I am wondering if this functionality could be implemented in a more efficient way. For example, db.structures.count() runs instantly.
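One possible shape for a cheaper count, as a sketch only: use the metadata-based `estimated_document_count()` when there is no filter and `count_documents()` otherwise. Both are existing pymongo `Collection` methods, but whether optimade-python-tools can adopt this is an open question here. The `FakeCollection` class is a stand-in so the example runs without a database.

```python
def count_structures(collection, query_filter=None):
    """Count documents, avoiding a full collection scan when there is no filter.

    `collection` is expected to behave like a pymongo `Collection`.
    """
    if not query_filter:
        # Reads the count from collection metadata instead of scanning documents.
        return collection.estimated_document_count()
    # A filtered count still has to evaluate the filter (ideally using an index).
    return collection.count_documents(query_filter)


class FakeCollection:
    """Stand-in for a pymongo collection so the sketch runs without MongoDB."""

    def __init__(self, docs):
        self.docs = docs

    def estimated_document_count(self):
        return len(self.docs)

    def count_documents(self, query_filter):
        return sum(all(d.get(k) == v for k, v in query_filter.items()) for d in self.docs)


structures = FakeCollection([{"id": "1", "nelements": 2}, {"id": "2", "nelements": 3}])
print(count_structures(structures), count_structures(structures, {"nelements": 3}))
```

The estimated count can be slightly off in some situations (for example after an unclean shutdown), which may or may not be acceptable for the data_available and data_returned fields.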

Just for additional information, initially also accessing a single structure was as slow (2+ min), e.g. via

https://dev-optimade.materialscloud.org/archive/li-ion-conductors/v1/structures/5b5b4b01-5b7e-48ad-8e17-8077f9b0b5d2

But after I added the id index with db.structures.createIndex({ id: 1 }), it's fast now.

Pinging @ml-evs @unkcpz @superstar54 for comments/ideas regarding the "counting" speedup. I suspect this is probably something that should be adapted in https://github.com/Materials-Consortia/optimade-python-tools?

Update optimade-python-tools to v1 (and thus pydantic to v2)

As discussed, @eimrek thinks it will be "fun" so I will leave it to him... https://github.com/pydantic/bump-pydantic might be helpful

Suggestion: `optimake serve` option

This should make the JSONL file and then launch the server locally (using mongomock or otherwise), optionally running the optimade-validator on it.

Possibility of "unlisted" mode

One use case of this project is the archival of "old" OPTIMADE APIs that can no longer be supported by their original authors. In this case, it might be beneficial for the old provider to simply link to the MCloud version in its index meta-database, and retain its provider prefix. I am thinking specifically of the alexandria database, which already has a registered provider. I imagine this could be added in an "unlisted" mode, where the database is not publicly listed in MCloud's index meta-database (but would still appear on the website etc. as before). This would prevent two providers from pointing to the same database.

The other option would be to simply add an aggregate=false value for that database, which will indicate the same thing at the client level.

Add ability to ingest alexandria database

from https://archive.materialscloud.org/record/2023.71

Add example archive entries and sketch out optimade.yaml config

As discussed today, I will make some example entries and map them to OPTIMADE using some yet-to-be-defined optimade.yaml config.

Things to settle for optimade-launch

I just realized there are some tasks we discussed in the last meeting that I didn't manage to finish; I will try to do them this week.

Fix the optimade-launch tests.
See how the prefix can be overridden and add provider prefix fields.

Update the archive API for the new InvenioRDM version

From Valeria,

In the new InvenioRDM, files are outside the metadata in the new API.
See the difference here:
https://staging-archive.materialscloud.org/api/records/1414 (current version)
https://dev-inveniordm.materialscloud.org/api/records/tqea8-ag515 (new version with InvenioRDM)
Also, the record_id will be different.

So, we need to update the script before we move to the new version of the Archive.

optimade-maker/src/optimake/archive/archive_record.py

Line 113 in 94073ac

def get_files(self):

Will come to this later.

Naming the public package layer

We should probably choose a better name than mc_optimade for the user-facing layer that could in the future operate on non-MC archives.

Suggestions:

necroptimade (the original, but can understand not choosing it if this will still live under the materialscloud github org)
optimancer (flip side of necroptimade...)
...

Add JSON property parser

Currently we only support properties being provided in CSV files. We should extend this to include JSON files. Both formats, [{"id": "1", "property": "xyz"}] and {"1": {"property": "xyz"}}, should probably be allowed.
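A small sketch of how both layouts could be normalized to one internal form, here a dict mapping entry ID to its properties. The function name `parse_json_properties` and the `id_key` parameter are hypothetical.

```python
import json


def parse_json_properties(text, id_key="id"):
    """Normalize both JSON layouts to a dict of {entry id: {property: value}}."""
    data = json.loads(text)
    if isinstance(data, list):
        # Layout 1: a list of records, each carrying its own ID.
        return {
            str(record[id_key]): {k: v for k, v in record.items() if k != id_key}
            for record in data
        }
    if isinstance(data, dict):
        # Layout 2: a mapping from ID to that entry's properties.
        return {str(entry_id): dict(props) for entry_id, props in data.items()}
    raise ValueError("Expected a JSON list of records or a JSON object keyed by ID")


as_list = '[{"id": "1", "property": "xyz"}]'
as_mapping = '{"1": {"property": "xyz"}}'
print(parse_json_properties(as_list) == parse_json_properties(as_mapping))
```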

Add validator for user-provided JSONL files

This validator could also extract useful things like custom property names.
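As an illustration of the "extract custom property names" part, the sketch below checks that every line is a JSON object and collects attribute names that start with an underscore. It assumes entries keep their fields under an "attributes" key, and it does not validate entries against the OPTIMADE schema. The function name is hypothetical.

```python
import json


def extract_custom_properties(jsonl_text):
    """Check that each line is a JSON object and collect underscore-prefixed attribute names."""
    custom_names = set()
    for line_number, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"Line {line_number} is not valid JSON: {exc}") from exc
        if not isinstance(entry, dict):
            raise ValueError(f"Line {line_number} is not a JSON object")
        # Header or info lines without an "attributes" object contribute nothing.
        attributes = entry.get("attributes")
        if isinstance(attributes, dict):
            custom_names.update(name for name in attributes if name.startswith("_"))
    return sorted(custom_names)


# Made-up sample data for illustration.
sample = "\n".join([
    '{"x-optimade": {"meta": {"api_version": "1.1.0"}}}',
    '{"id": "1", "type": "structures", "attributes": {"nelements": 2, "_mcloudarchive_energy": -1.5}}',
    '{"id": "2", "type": "structures", "attributes": {"_mcloudarchive_band_gap": 0.7}}',
])
print(extract_custom_properties(sample))
```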

Add electrides dataset as a test/example

Choosing the ID format & resulting config changes

As discussed in the final chat, there are outstanding issues to resolve around IDs.

We discussed several options for how to generate an ID for each entry. For example, if my structures are named "relaxed_1-100" in cif files in a structures.zip archive, in the subfolder structures/relaxed_structures:

1. Use the full file path, including archive files, e.g., id: 'structures.zip/structures/relaxed_structures/relaxed-1.cif'.
2. Extract only the name from the file, e.g., id: 'relaxed-1'.
3. Same as 2, but allow a pattern to be provided, e.g., id_pattern: relaxed-{id}.cif -> id: '1'.
4. Provide an additional mapping file id.json that maps filenames to IDs, pushing the burden on the user instead.

I think I am in favour of 3, provided we can add another field that is used to precisely map to the source file. I think OPTIMADE's immutable_id would be useful here (as we are not using it elsewhere, and it does not need to be URL safe like id does).
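A sketch of how the pattern-based option could work, with the full source path kept in immutable_id. The helper `id_from_pattern` is hypothetical, and the `{id}` placeholder syntax follows the example above rather than any implemented config format.

```python
import re
from pathlib import PurePosixPath


def id_from_pattern(path, id_pattern):
    """Extract the `{id}` part of a filename using a pattern such as 'relaxed-{id}.cif'."""
    prefix, placeholder, suffix = id_pattern.partition("{id}")
    if not placeholder:
        raise ValueError("id_pattern must contain '{id}'")
    # Escape the literal parts so that '.' and similar characters match only themselves.
    regex = re.compile(re.escape(prefix) + r"(?P<id>.+)" + re.escape(suffix))
    filename = PurePosixPath(path).name
    match = regex.fullmatch(filename)
    if match is None:
        raise ValueError(f"{filename!r} does not match pattern {id_pattern!r}")
    return match.group("id")


source = "structures.zip/structures/relaxed_structures/relaxed-1.cif"
entry = {"id": id_from_pattern(source, "relaxed-{id}.cif"), "immutable_id": source}
print(entry)
```

Files that do not match the pattern raise an error here; a real implementation would have to decide whether to skip them or fall back to another rule.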

I began implementing some of the related changes around specifying which archive a file comes from inside PR #11 and will continue to work on it next week.

Suggestions to improve created APIs

I've just been trying to make use of my own dataset through the API here, and have collected a few comments:

Improve the default rule for generating the id field. At the moment we use precisely the relative filepath that the structure came from. I wonder if a better default would be attempting filepath.split("/")[-1].split(".")[0] (i.e., take just the filename). This is not guaranteed to be unique; rather than having some awkward config for specifying this, we could attempt to generate the unique set of IDs first from the filenames themselves, then the first folder up, then the second folder up etc., and only fall back to the full file path when required. The immutable_id field (which we don't use at the moment) can still be set to the relative filepath. I will try experimenting with this...
We should set last_modified to be the datetime of extraction.
The database metadata should include the optimade-maker version used.
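The ID fallback rule from the first point could look roughly like this. It is a sketch with a hypothetical function name, `generate_ids`, and it applies the same depth to every entry rather than only to the colliding ones.

```python
from pathlib import PurePosixPath


def generate_ids(paths):
    """Map each relative file path to a unique ID built from its trailing path parts.

    Start from the bare filename (text before the first '.'), then prepend
    parent folders one level at a time, and fall back to the full path.
    """
    parts = {}
    for path in paths:
        pure = PurePosixPath(path)
        parts[path] = list(pure.parent.parts) + [pure.name.split(".")[0]]
    max_depth = max((len(p) for p in parts.values()), default=0)
    for depth in range(1, max_depth + 1):
        candidates = {path: "/".join(p[-depth:]) for path, p in parts.items()}
        if len(set(candidates.values())) == len(candidates):
            return candidates
    # Even the extension-less full paths collide, so use the paths unchanged.
    return {path: path for path in paths}


print(generate_ids(["relaxed/1.cif", "relaxed/2.cif"]))
print(generate_ids(["relaxed/1.cif", "unrelaxed/1.cif"]))
```

IDs that include a folder still contain "/", so they may need a different separator or escaping if the id has to be URL-safe.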
