bluesky / databroker-pack

Pack and unpack Bluesky Runs to/from a portable storage format.

Home Page: http://blueskyproject.io/databroker-pack

License: BSD 3-Clause "New" or "Revised" License


databroker-pack's Introduction


Bluesky: An Experiment Specification & Orchestration Engine

Source: https://github.com/bluesky/bluesky
PyPI: pip install bluesky
Documentation: https://bluesky.github.io/bluesky
Releases: https://github.com/bluesky/bluesky/releases

Bluesky is a library for experiment control and collection of scientific data and metadata. It emphasizes the following virtues:

  • Live, Streaming Data: Available for inline visualization and processing.
  • Rich Metadata: Captured and organized to facilitate reproducibility and searchability.
  • Experiment Generality: Seamlessly reuse a procedure on completely different hardware.
  • Interruption Recovery: Experiments are "rewindable," recovering cleanly from interruptions.
  • Automated Suspend/Resume: Experiments can be run unattended, automatically suspending and resuming if needed.
  • Pluggable I/O: Export data (live) into any desired format or database.
  • Customizability: Integrate custom experimental procedures and commands, and get the I/O and interruption features for free.
  • Integration with Scientific Python: Interface naturally with numpy and the Python scientific stack.


The Bluesky Project enables experimental science at the lab-bench or facility scale. It is a collection of Python libraries that are co-developed but independently useful and may be adopted a la carte.

See https://bluesky.github.io/bluesky for more detailed documentation.

databroker-pack's People

Contributors

danielballan, gwbischof, mrakitin, st3107, stuartcampbell, tacaswell


databroker-pack's Issues

line-painter bug in writing document manifest

I have not checked whether this reproduces on current main, but I am documenting it so I don't forget.

Looking at a pretty big export (16k runs), I noticed the document_manifest.txt file was ~5.8 GB, which seemed wrong. The first few lines of the file are:

147b431a-a326-4892-a4ae-a3127bc08f6c.msgpack
147b431a-a326-4892-a4ae-a3127bc08f6c.msgpack
922ed815-a5f8-4fe1-a5ac-a4c3893bf553.msgpack
147b431a-a326-4892-a4ae-a3127bc08f6c.msgpack
922ed815-a5f8-4fe1-a5ac-a4c3893bf553.msgpack
e0a8934b-c0cd-4b77-ad4b-5058cb21211b.msgpack
147b431a-a326-4892-a4ae-a3127bc08f6c.msgpack
922ed815-a5f8-4fe1-a5ac-a4c3893bf553.msgpack
e0a8934b-c0cd-4b77-ad4b-5058cb21211b.msgpack
9b4b9a84-151b-4c7c-854d-ebe33124a1de.msgpack

which shows we are re-writing the whole list of files every time we add one, turning what should be a 16,498-line file into a 136,100,251-line file ;)
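
A minimal sketch of the suspected pattern and the append-only fix; the function names and structure are illustrative, not the actual databroker-pack internals:

manifest = []  # accumulated document file names

def buggy_record(path, manifest_file):
    manifest.append(path)
    # Bug: re-appends the *entire* accumulated list on every call,
    # so entry N is written N times and the file grows quadratically.
    with open(manifest_file, "a") as f:
        f.write("\n".join(manifest) + "\n")

def fixed_record(path, manifest_file):
    manifest.append(path)
    # Fix: append only the newly added entry.
    with open(manifest_file, "a") as f:
        f.write(path + "\n")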

Add support for --filter via Python lambda

It would be useful to be able to filter Runs at the Python layer. The --query argument already allows them to be filtered at the MongoDB layer, but Python gives us access to more things and is of course more expressive (though more expensive). I have in mind things like

databroker-pack ... --filter "lambda run: run.metadata['stop'] == 'success'"
databroker-pack ... --filter "lambda run: run.primary.read()['motor'].max() < 5"

Multiple --filter parameters should be allowed, using argparse's action='append', and they should be logically AND-ed just as --query is. (Why not OR? Because OR will be less common, and if needed it can be implemented inside one lambda.)
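
A sketch of how this could be wired up; the eval-based parsing is a proposal, not existing databroker-pack code, and eval should only ever see a trusted, user-supplied command line:

import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--filter",
    action="append",
    default=[],
    help="Python lambda taking a Run; may be given multiple times",
)
args = parser.parse_args(
    ["--filter", "lambda run: run.metadata['stop'] == 'success'"]
)

# Evaluate each string into a callable.
filters = [eval(text) for text in args.filter]

def accept(run):
    # All filters must pass (logical AND), matching --query semantics.
    return all(f(run) for f in filters)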

Avoid writing inside $HOME during tests.

Something that bothers me about our tests at present is that they place files in ~/.local/share/intake and thus depend on and modify the contents of the current user's $HOME. I would like to be able to temporarily set that to a location in /tmp. I think this can be done without changes to intake or databroker, just by setting $XDG_DATA_HOME or something like that, since intake uses appdirs, but after spending an hour or so on it (months ago) I gave up and moved on. Worth revisiting!
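
One possible approach, sketched as a pytest fixture (untested against intake itself; it assumes appdirs honors XDG_DATA_HOME on Linux):

import pytest

@pytest.fixture
def isolated_data_home(tmp_path, monkeypatch):
    # Point XDG_DATA_HOME at a per-test temporary directory so that
    # intake (via appdirs) writes under tmp_path instead of
    # ~/.local/share/intake.
    data_home = tmp_path / "data"
    data_home.mkdir()
    monkeypatch.setenv("XDG_DATA_HOME", str(data_home))
    return data_home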

Import error when using the databroker 2.0.0b

databroker-pack raises an ImportError when used with databroker 2.0.0b. This is because there is no core module in databroker 2.0.0b; there is only _core.

Traceback (most recent call last):
  File "/Users/sst/anaconda3/envs/latest_databroker/bin/databroker-pack", line 6, in <module>
    from databroker_pack.commandline.pack import main
  File "/Users/sst/anaconda3/envs/latest_databroker/lib/python3.10/site-packages/databroker_pack/__init__.py", line 6, in <module>
    from ._pack import *  # noqa
  File "/Users/sst/anaconda3/envs/latest_databroker/lib/python3.10/site-packages/databroker_pack/_pack.py", line 10, in <module>
    import databroker.core
ModuleNotFoundError: No module named 'databroker.core'

I wonder if this package will be updated to be compatible with databroker 2.0.0.
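
Until then, a possible compatibility shim (untested) would fall back to the private module when the public one is missing:

try:
    import databroker.core as core
except ModuleNotFoundError:
    # databroker 2.0.0b ships only the private _core module.
    import databroker._core as core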

Add support for chunking

For large batches, it would be convenient to be able to compress and transfer the packed data in chunks. To facilitate this, we need to encode sets of Document files (msgpack or jsonl files) that go with sets of external files, so that if I have up to chunk N of the Documents and up to chunk N of the external files they reference, I have exactly the external files I need: no more, no less.

This has not been implemented, but it has been designed in detail in a conversation between myself and @tacaswell, documented here.


We can keep the directory structure as it is now: just one directory, plus a sub-directory structure under external_files if the --copy-external flag is set. Instead of writing one documents_manifest.txt and one external_files_manifest_<root_hash>.txt per root, we can write N manifests for each: a documents_manifest_i.txt per chunk, and external_files_manifest_<root_hash>_i.txt files per chunk and root. A given external file should only be listed once globally across the whole set of external_files_manifest_<root_hash>_i.txt files: if a file is referenced by a Run in chunk x and a Run in chunk y > x, it should only be listed in the manifest for chunk x.

The user can specify the chunk size in terms of number of Runs (given as an integer) or max byte size (given as a string like 10000000B or 10MB).

The chunking and compression can be done separately, downstream. Only the first chunk should contain the catalog.yml. The chunks can be un-tarred into the same directory, as they will have no conflicting files. We could also incorporate optional tarring and compression into databroker-pack itself, but it needs to remain possible to do it outside the tool, for use cases where the large file transfer is handled separately.
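
A hypothetical helper for parsing the proposed chunk-size option; the accepted spellings (a plain integer for a number of Runs, a string like 10MB for a byte limit) follow the suggestion above and are not an existing CLI contract:

import re

_UNITS = {"B": 1, "KB": 10**3, "MB": 10**6, "GB": 10**9}

def parse_chunk_size(text):
    """Return ("runs", n) for a plain integer, ("bytes", n) otherwise."""
    if text.isdigit():
        return ("runs", int(text))
    match = re.fullmatch(r"(\d+)\s*([KMG]?B)", text.strip().upper())
    if match is None:
        raise ValueError(f"Unrecognized chunk size: {text!r}")
    number, unit = match.groups()
    return ("bytes", int(number) * _UNITS[unit])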

create compressed archive file

It would be truly helpful if databroker-pack could create a compressed archive file (zip and/or tar+gzip) and if databroker-unpack could ingest the same.
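
For now this can be done outside the tool with the standard library; a sketch, where the directory and file names are placeholders:

import tarfile

def archive(packed_dir, archive_path="packed.tar.gz"):
    # Compress the whole packed directory into one tar+gzip file.
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(packed_dir, arcname=".")

def extract(archive_path, target_dir):
    # Unpack the archive; databroker-unpack can then be pointed at
    # target_dir as usual.
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(path=target_dir)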

Extend databroker-unpack to load documents into Mongo

For many use cases, running intake on top of a directory of msgpack or jsonl files is not suitable. The databroker-unpack utility should provide an option to copy Documents from the files into MongoDB and auto-generate a suitable catalog file.

I didn't do this in the first pass because it was not obvious to me how to spell it, or how we might want to extend it. Initial thought:

databroker-unpack inplace DIRECTORY NEW_CATALOG_NAME  # current functionality

and

databroker-unpack mongo_normalized DIRECTORY NEW_CATALOG_NAME --metadatastore=MONGO_URI --filestore=MONGO_URI

Now, databroker-pack is intentionally "opinionated" and only supports a hard-coded subset of suitcases (msgpack, jsonl). Should we do the same for -unpack? I think so, because it enables us to auto-generate a working catalog file.
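
For the mongo_normalized case, a sketch of the catalog auto-generation step. The driver name and argument keys below follow databroker v1's mongo-normalized catalog convention; treat them as an assumption about what -unpack would emit, not a promise:

import yaml

def write_catalog(path, name, metadatastore_uri, filestore_uri):
    # Assumed catalog shape for a databroker v1 mongo-normalized source.
    catalog = {
        "sources": {
            name: {
                "driver": "bluesky-mongo-normalized-catalog",
                "args": {
                    "metadatastore_db": metadatastore_uri,
                    "asset_registry_db": filestore_uri,
                },
            }
        }
    }
    with open(path, "w") as f:
        yaml.safe_dump(catalog, f)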
