materialscloud-org / optimade-maker
Tools for making OPTIMADE APIs from various formats of structural data (e.g. an archive of CIF files).
License: MIT License
Okay, I got this working, but it would be good to have two changes:
Originally posted by @eimrek in #12 (comment)
Stuff to-do before public release:
Currently, when generating the OPTIMADE jsonl file, custom/provider properties automatically get the mcloudarchive prefix by default, or a different prefix if changed via the MC_OPTIMADE_PROVIDER_PREFIX env variable.
I am still wondering whether it might make sense to keep the jsonl file provider-agnostic (e.g. using a placeholder prefix like _custom or _provider). Possible reasons:
_stability
a provider?)
Although I understand that there is value in having the jsonl file exactly mirror the underlying data (MongoDB in our case) and the API responses. So actually, maybe it's good to keep it as it is now.
Thoughts, @ml-evs?
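To make the placeholder idea concrete, here is a minimal sketch of how a provider-agnostic file could be mapped to a real prefix at load time. Only the MC_OPTIMADE_PROVIDER_PREFIX variable and the mcloudarchive default come from the discussion above; the `_custom_` placeholder spelling and both function names are assumptions for illustration, not the package's actual code.

```python
import os

# Assumption: a neutral placeholder that would be written into the jsonl file.
PLACEHOLDER_PREFIX = "_custom_"


def provider_prefix(default: str = "mcloudarchive") -> str:
    """Read the provider prefix from the environment, falling back to the default."""
    return os.environ.get("MC_OPTIMADE_PROVIDER_PREFIX", default)


def apply_prefix(attributes: dict, prefix: str) -> dict:
    """Rename placeholder-prefixed keys to `_<prefix>_<name>`; leave other keys untouched."""
    renamed = {}
    for key, value in attributes.items():
        if key.startswith(PLACEHOLDER_PREFIX):
            key = f"_{prefix}_{key[len(PLACEHOLDER_PREFIX):]}"
        renamed[key] = value
    return renamed
```

With something like this, the same file could be served under any prefix, at the cost of the file no longer mirroring the database exactly.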
I am wondering if we really need to support "direct" jsonl files in the optimade.yaml format.
Conceptually, it seems to me that the current purpose of optimake is to generate a jsonl file from other structural data formats, and optimade.yaml is something that helps to achieve this.
If we already have a jsonl file, then the only purpose I see is validation, and the optimade.yaml file does not seem strictly necessary.
But perhaps generating jsonl files and validating them is different enough to separate them? E.g. we could have a different optimake subcommand for validation (optimade validate <jsonl-file>?)
Regarding the Materials Cloud Archive service, this change would affect it, as then we should add support for a "direct" jsonl file without any optimade.yaml file.
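As a rough idea of what a standalone validation step could start from, the sketch below only checks that every non-empty line of a jsonl file parses as a JSON object. The function name `check_jsonl` is hypothetical, and a real subcommand would presumably go further and validate the entries against the OPTIMADE models.

```python
import json
from pathlib import Path


def check_jsonl(path: str) -> list[str]:
    """Return a list of problems found in a jsonl file (an empty list means none found)."""
    problems = []
    with Path(path).open(encoding="utf-8") as handle:
        for lineno, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append(f"line {lineno}: invalid JSON ({exc.msg})")
                continue
            if not isinstance(record, dict):
                problems.append(f"line {lineno}: expected a JSON object")
    return problems
```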
Currently, we do not make use of the type information provided in the config. This means that a user could have a csv file with a column of [1, 2, 3] that should be converted to strings, so we should try to make a type map and apply it when attaching custom properties to entries.
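A minimal sketch of such a type map, assuming the config declares each property's type with names like "string", "integer" and "float" (the exact names and the `cast_properties` helper are assumptions, not the current implementation):

```python
from typing import Any, Callable

# Assumption: type names as they might appear in the config; this mapping is illustrative.
TYPE_MAP: dict[str, Callable[[Any], Any]] = {
    "string": str,
    "integer": int,
    "float": float,
}


def cast_properties(row: dict[str, Any], declared_types: dict[str, str]) -> dict[str, Any]:
    """Cast each value in a row to its declared type; undeclared columns pass through."""
    cast = {}
    for name, value in row.items():
        converter = TYPE_MAP.get(declared_types.get(name, ""))
        # Missing values (None) are left alone rather than turned into "None" or an error.
        cast[name] = converter(value) if converter and value is not None else value
    return cast
```

For the CSV example above, a column parsed as the integers 1, 2, 3 but declared as "string" would come out as "1", "2", "3".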
Currently we only populate the info endpoint with the custom properties provided by the user. We should also fill in the normal OPTIMADE properties.
Currently, if a user makes a new version of their archive entry, it is considered fully separate and independent of the other versions. This means that each version will have its own OPTIMADE API running, even if the data is exactly the same.
To demonstrate, I made two versions of my test entry:
where the difference in version 3 is that I just added an extra file that is irrelevant to the optimade data. This means these two APIs are exactly the same. This is a scenario that we probably want to avoid.
The two entries are currently online on https://dev-optimade.materialscloud.org/
Possible solutions for this:
Calculate a hash for each OPTIMADE dataset. If the new version's hash matches that of the old version, just add a redirect/alias from the new to the old data. I don't think there's any metadata in the OPTIMADE API that should change with the version change. The index meta-database links to the archive page, but this can have a duplicate entry.
Only keep the newest version and redirect the old URL to the new URL. Probably not ideal, as the new version might change a lot (e.g. new id scheme), and this might break the applications relying on the old version.
Just keep the duplicate data, as is done now. Simplest, but not efficient in terms of compute resources.
Thoughts on this?
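For the hash option, a minimal sketch could look like the following. It assumes the dataset is available as jsonl lines and that two versions count as identical when every line holds the same JSON content in the same order; `dataset_hash` is a hypothetical helper, not existing code.

```python
import hashlib
import json


def dataset_hash(jsonl_lines) -> str:
    """Hash a dataset so that key order and whitespace within a line do not matter."""
    digest = hashlib.sha256()
    for line in jsonl_lines:
        if not line.strip():
            continue  # ignore blank lines
        # Re-serialise each record canonically before hashing it.
        canonical = json.dumps(json.loads(line), sort_keys=True, separators=(",", ":"))
        digest.update(canonical.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()
```

Note that the order of lines still matters here, and any field that changes on every extraction (a timestamp, for instance) would have to be excluded before hashing for two versions to match.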
Details: GHSA-vgv8-5cpj-qj2f
I will release and backport the update to optimade-python-tools, at which point this package should also be upgraded.
Many archives provide data as pymatgen Structure or ComputedStructureEntry entries, which are loadable directly with monty.json.MontyDecoder. We should probably support these.
We could consider adding support for the OPTIMADE v1.2 files endpoint, which would allow each structure to be linked to additional files that may not follow the OPTIMADE format. This hinges on whether the eventually released v1.2 format will be flexible enough to reference files within archives.
Initially I thought it broke when data_returned was skipped, but now I realize that it was also broken before.
E.g. currently the alexandria-test is running on optimade-python-tools v0.25.1, and the problem is here:
https://dev-optimade.materialscloud.org/archive/alexandria-test/v1/structures
@ml-evs any ideas how to debug?
I am preparing a big JSONL-only upload but realised we don't have a good way of specifying that in the config. I'll try to throw something together.
I spent some time investigating the slowness of the li-ion-conductors optimade database.
Here's the info endpoint: https://dev-optimade.materialscloud.org/archive/li-ion-conductors/v1/info
Accessing the /structures endpoint takes over 2 minutes.
I turned on performance profiling (db.setProfilingLevel(1)) and here's the part of the log for accessing the /structures endpoint (only the slow commands should show up here, meaning everything else was fast, I think):
> db.system.profile.find().pretty()
{
    "op" : "command",
    "ns" : "li-ion-conductors.structures",
    "command" : {
        "aggregate" : "structures",
        "pipeline" : [
            {
                "$match" : {
                }
            },
            {
                "$group" : {
                    "_id" : 1,
                    "n" : {
                        "$sum" : 1
                    }
                }
            }
        ],
        "cursor" : {
        },
        "lsid" : {
            "id" : UUID("f822cd93-f176-4405-8c3a-062a5c7e79d8")
        },
        "$db" : "li-ion-conductors"
    },
    "keysExamined" : 0,
    "docsExamined" : 4396695,
    "cursorExhausted" : true,
    "numYield" : 7910,
    "nreturned" : 1,
    "locks" : {
        "FeatureCompatibilityVersion" : {
            "acquireCount" : {
                "r" : NumberLong(7913)
            }
        },
        "ReplicationStateTransition" : {
            "acquireCount" : {
                "w" : NumberLong(7913)
            }
        },
        "Global" : {
            "acquireCount" : {
                "r" : NumberLong(7913)
            }
        },
        "Database" : {
            "acquireCount" : {
                "r" : NumberLong(7912)
            }
        },
        "Collection" : {
            "acquireCount" : {
                "r" : NumberLong(7912)
            }
        },
        "Mutex" : {
            "acquireCount" : {
                "r" : NumberLong(2)
            }
        }
    },
    "flowControl" : {
    },
    "storage" : {
        "data" : {
            "bytesRead" : NumberLong("17760670501"),
            "timeReadingMicros" : NumberLong(135074290)
        }
    },
    "responseLength" : 141,
    "protocol" : "op_msg",
    "millis" : 139328,
    "planSummary" : "COLLSCAN",
    "ts" : ISODate("2023-08-22T14:51:08.005Z"),
    "client" : "172.18.0.1",
    "allUsers" : [ ],
    "user" : ""
}
Some key points:
$match everything in the collection; group by $_id and then just sum the number).
COLLSCAN). I don't think this command, in its current state, could be sped up by using indexes.
I am wondering if this functionality could be implemented in a more efficient way. For example, db.structures.count() runs instantly.
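One possible direction, sketched here under the assumption that the count is issued through a PyMongo collection: when no filter is given, ask for the count from collection metadata instead of scanning every document. The helper name `count_entries` is hypothetical, and this is not how optimade-python-tools currently does it; `estimated_document_count` and `count_documents` are the PyMongo `Collection` methods.

```python
def count_entries(collection, query_filter=None):
    """Count entries, avoiding a full collection scan when no filter is applied.

    `collection` is expected to behave like a pymongo Collection.
    """
    if not query_filter:
        # Uses collection metadata, so it does not examine individual documents.
        return collection.estimated_document_count()
    # With a filter, an exact count is needed; suitable indexes can help here.
    return collection.count_documents(query_filter)
```

The trade-off is that the metadata-based count is an estimate, which may be acceptable for an unfiltered total on a static archive dataset.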
Just for additional information: initially, accessing a single structure was just as slow (2+ min), e.g. via
But after I added the id index with db.structures.createIndex({ id: 1 }), it's fast now.
Pinging @ml-evs @unkcpz @superstar54 for comments/ideas regarding the "counting" speedup. I suspect this is probably something that should be adapted in https://github.com/Materials-Consortia/optimade-python-tools?
As discussed, @eimrek thinks it will be "fun" so I will leave it to him... https://github.com/pydantic/bump-pydantic might be helpful
This should make the JSONL file, then launch the server locally (using mongomock or otherwise), optionally running the optimade-validator on it.
One use case of this project is the archival of "old" OPTIMADE APIs that can no longer be supported by their original authors. In this case, it might be beneficial for the old provider to simply link to the MCloud version in its index meta-database, and retain its provider prefix. Thinking specifically here of the alexandria database, which already has a registered provider. I imagine this could be added in an "unlisted" mode, where the database is not publicly listed in MCloud's index meta-database (but would still appear on the website etc. as before). This would prevent two providers from pointing to the same database.
The other option would be to simply add an aggregate=false value for that database, which will indicate the same thing at the client level.
As discussed today, I will make some example entries and map them to OPTIMADE using some yet-to-be-defined optimade.yaml config.
Just realized there are some tasks we discussed in the last meeting that I didn't manage to finish. I will try to do them this week.
From Valeria:
In the new InvenioRDM, files are outside the metadata in the new API.
See the difference here:
https://staging-archive.materialscloud.org/api/records/1414 current version
https://dev-inveniordm.materialscloud.org/api/records/tqea8-ag515 new version with InvenioRDM
Also, the record_id will be different.
So, we need to update the script before we move to the new version of the Archive.
Will come back to this later.
We should probably choose a better name than mc_optimade for the user-facing layer that could in the future operate on non-MC archives.
Suggestions:
Currently we only support properties being provided in CSV files. We should extend this to include JSON files; both the formats [{"id": "1", "property": "xyz"}] and {"1": {"property": "xyz"}} should probably be allowed.
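A minimal sketch of how both JSON layouts could be normalised into one mapping from entry ID to properties (the function name `load_properties` is an assumption for illustration):

```python
import json


def load_properties(text: str) -> dict[str, dict]:
    """Normalise either supported JSON layout into {entry_id: {property: value}}."""
    data = json.loads(text)
    if isinstance(data, dict):
        # Layout 2: {"1": {"property": "xyz"}}
        return {str(key): dict(props) for key, props in data.items()}
    if isinstance(data, list):
        # Layout 1: [{"id": "1", "property": "xyz"}]
        result = {}
        for item in data:
            props = dict(item)
            entry_id = str(props.pop("id"))  # raises KeyError if an item has no "id"
            result[entry_id] = props
        return result
    raise ValueError("expected a JSON list or object at the top level")
```

Both example inputs above would then yield the same mapping, so the rest of the pipeline would not need to know which layout the user chose.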
This validator could also extract useful things like custom property names.
As discussed in the final chat, there are outstanding issues to resolve around IDs.
We discussed several options for how to generate an ID for each entry. For example, if my structures are named "relaxed_1-100" in cif files in a structures.zip archive, subfolder structures/relaxed structures:
1. id: 'structures.zip/structures/relaxed_structures/relaxed-1.cif'.
2. id: 'relaxed-1'
3. id_pattern: relaxed-{id}.cif -> id: '1'.
4. id.json that maps filenames to IDs, pushing the burden on the user instead.
I think I am in favour of 3, provided we can add another field that is used to precisely map to the source file. I think OPTIMADE's immutable_id would be useful here (as we are not using it elsewhere, and it does not need to be URL safe like id does).
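To illustrate the id_pattern option, here is a minimal sketch that turns a pattern such as relaxed-{id}.cif into a regular expression and keeps the full path as immutable_id. The helper names and the choice to match against the filename only are assumptions, not an agreed design.

```python
import re
from pathlib import PurePosixPath
from typing import Optional


def extract_id(filepath: str, id_pattern: str) -> Optional[str]:
    """Extract the `{id}` part of a filename according to a pattern like 'relaxed-{id}.cif'."""
    if "{id}" not in id_pattern:
        raise ValueError("id_pattern must contain the '{id}' placeholder")
    before, _, after = id_pattern.partition("{id}")
    # Escape the literal parts so characters such as '.' are matched verbatim.
    regex = re.escape(before) + r"(?P<id>.+)" + re.escape(after)
    match = re.fullmatch(regex, PurePosixPath(filepath).name)
    return match.group("id") if match else None


def entry_ids(filepath: str, id_pattern: str) -> dict:
    """Pair the short id with an immutable_id that still points at the source file."""
    return {"id": extract_id(filepath, id_pattern), "immutable_id": filepath}
```

For the path in option 1 and the pattern in option 3, this gives id '1' while immutable_id retains the full path inside the archive.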
I began implementing some of the related changes around specifying which archive a file comes from inside PR #11 and will continue to work on it next week.
I've just been trying to make use of my own dataset through the API here, and have collected a few comments:
id field. At the moment we use precisely the relative filepath that the structure came from. I wonder if a better default would be attempting filepath.split("/")[-1].split(".")[0] (i.e., take just the filename). This is not guaranteed to be unique; rather than having some awkward config for specifying this, we could attempt to generate the unique set of IDs first from the filenames themselves, then the first folder up, then the second folder up, etc., and only fall back to the full file path when required. The immutable_id field (which we don't use at the moment) can still be set to the relative filepath. I will try experimenting with this...
last_modified to be the datetime of extraction
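A minimal sketch of the "filename first, then folders up" idea for generating unique IDs. It assumes the same number of parent folders is added for every entry and that folder names are joined with "/"; both choices, and the function names, are illustrative rather than decided.

```python
from pathlib import PurePosixPath


def candidate_id(path: str, depth: int) -> str:
    """Join the last `depth` parent folders with the file stem (name up to the first dot)."""
    parts = PurePosixPath(path).parts
    stem = parts[-1].split(".")[0]
    parents = parts[:-1][-depth:] if depth else ()
    return "/".join([*parents, stem])


def unique_ids(paths: list[str]) -> dict[str, str]:
    """Map each path to the shortest candidate ID scheme that is unique across all paths."""
    max_depth = max((len(PurePosixPath(p).parts) - 1 for p in paths), default=0)
    for depth in range(max_depth + 1):
        ids = {path: candidate_id(path, depth) for path in paths}
        if len(set(ids.values())) == len(paths):
            return ids
    # Fall back to the full file path when no shorter scheme is unique.
    return {path: path for path in paths}
```

The immutable_id could then simply be set to the original path for every entry, whichever depth ends up being used for id.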