The machinable from machinable-org

Allow settings to be expressed in settings.toml

Once Python 3.11 becomes available, use the stdlib toml parser to make settings more readable.

Re-run relationships

machinable should keep track of the relationships between Experiments, e.g. re-runs of the same experiment etc.

Improve Storage indexing

Various improvements to indexing

improve performance by pruning the tree
remove ComponentStorage if not found to enable reindexing after moving files

Documentation improvements

Add a 'How-to' section with in-depth tutorials like 'how to continue execution', 'how to reproduce results' etc.
Resource inheritance documentation
Slurm engine documentation
Schema-validation

Add from_code option for Execution from code backups

It is often useful to run from a code backup rather than the current code state to exactly reproduce an execution. The Execution or Project interface should provide a way to run directly from a code backup.

The Slurm engine may use this by default to ensure that the code does not change between queue and execution time.

Registration API

Projects should be able to expose custom elements like Drivers, host_info etc. from a registration class to make it easy to share them.

Rework signature of Task.component()

The signature is sometimes cumbersome to use, as a lot of tuples have to be wrapped in each other, which hurts readability.

Handle non-jsonable types in records

Currently, we just convert anything to a string, which is not ideal as the user might not be aware of it. To address this:

Record should allow for registration of a custom JSON serialiser to convert complex types to strings
if no serialiser is registered, an Exception should be raised if the type is not JSONable
the exception should explain alternatives like storage.save_data()
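The behaviour listed above could be sketched with the `default` hook of `json.dumps`. The `Record` class below is a simplified, hypothetical stand-in and not machinable's actual implementation; the constructor argument `serializer` is an assumed name.

```python
import json
from typing import Any, Callable, Optional


class Record:
    """Hypothetical, simplified stand-in for machinable's Record."""

    def __init__(self, serializer: Optional[Callable[[Any], Any]] = None):
        self._serializer = serializer
        self._rows = []

    def _default(self, value: Any) -> Any:
        # json.dumps calls this only for values it cannot encode natively.
        if self._serializer is not None:
            return self._serializer(value)
        raise TypeError(
            f"{type(value).__name__} is not JSON-serializable. Register a custom "
            "serializer or store the value separately, e.g. via storage.save_data()."
        )

    def write(self, key: str, value: Any) -> str:
        # Encode one row; unsupported types are routed through _default.
        line = json.dumps({key: value}, default=self._default)
        self._rows.append(line)
        return line


plain = Record().write("accuracy", 0.5)
converted = Record(serializer=sorted).write("labels", {2, 1})
```

Without a registered serializer, `Record().write("labels", {1, 2})` raises a `TypeError` that points the user to the alternatives instead of silently storing a string.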

Replace callable component methods by Component classes

The current functional API has a number of limitations and inconsistencies with the recommended Component API. This leads to poor user experience and unnecessary code duplication. To resolve this, the functional API should use standard Component classes.

Enable doc string linter

Example setup

Skip code backup if project too large

With options to disable this behaviour in settings.yaml and via the storage argument.
There should also be a progress bar if the processing takes longer.

Support output redirection in multiprocessing execution

This feature was previously removed due to conflicts with the pytest output capturing mechanism.

Commandline interface

The task API could be easily exposed to the command line, for example:

machinable --component test --version learning_rate=1

There could also be an interactive execution feature

$ machinable execute
What component do you want to execute? test
Do you want to specify a version [N/y]? N
...

There could also be a way to inspect the config, get dry runs and execution plans etc.

I don't think this should evolve too far because most of these use cases will be covered by the app but it could be a useful tool in server environments.

Use setproctitle to set process information

It can be hard to relate the running processes to machinable Experiments. We should use setproctitle to name every process after its UID/experiment_id.

Use of record writer fails before create events

Since on_execute_start is written in execute after create events are triggered.

Extended GraphQL server filesystem APIs

The GraphQL server currently only provides a basic GET endpoint to read plain text files. To serve advanced use cases, the API should provide

A subscription-based streaming interface that allows reading files partially
A file watcher subscription based on watchgod that allows reacting to updates (e.g. changes in the machinable.yaml)

Automatic code backup is not disabled in no-project environments

machinable should detect if the project is not a repository, as in a Jupyter environment, and disable the code backup.

Revisit events API

The events API could be improved and documented. Right now it does not have clear use cases and it may not work well in distributed settings.

ConfigMap should track the full key path

This would enable better KeyError messages as well as smart 'reflective' transformations, for example

test.example.toDict(patch=true) => {"test": {"example": test.example.toDict()}}
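A rough sketch of how path tracking could work. `PathTrackingMap` is a hypothetical stand-in for ConfigMap, and `to_dict(patch=...)` is an assumed spelling of the `toDict(patch=true)` idea above.

```python
class PathTrackingMap:
    """Hypothetical stand-in for ConfigMap that remembers its full key path."""

    def __init__(self, data: dict, path: tuple = ()):
        self._data = data
        self._path = path

    def __getattr__(self, key):
        # Only invoked when normal attribute lookup fails.
        if key.startswith("_"):
            raise AttributeError(key)
        try:
            value = self._data[key]
        except KeyError:
            full_path = ".".join(self._path + (key,))
            raise KeyError(f"'{full_path}' not found in configuration") from None
        if isinstance(value, dict):
            # Nested maps inherit the path of their parent.
            return PathTrackingMap(value, self._path + (key,))
        return value

    def to_dict(self, patch: bool = False) -> dict:
        result = dict(self._data)
        if patch:
            # Re-wrap the data in its parent keys, innermost key first.
            for key in reversed(self._path):
                result = {key: result}
        return result


config = PathTrackingMap({"test": {"example": {"alpha": 1}}})
patched = config.test.example.to_dict(patch=True)
```

Accessing `config.test.missing` raises a `KeyError` that names the full path `test.missing` rather than just `missing`.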

Drop GitPython dependency in favour of sh

GitPython is currently only used to infer commit information which could be done with sh

Allow dependency specifications between executions

This would be useful to define component executions that need others to finish first etc. The implementation would be left to the respective engines (Slurm already supports dependency pipelines via --dependency).

API proposal (WIP)

Execution('example').dependency(
    [Execution('next')], condition="any_finished"
).dependency(
    [Execution('failure')], condition="failure"
).submit()

def dependency(self, executions: List[Union[Execution, Experiment]], condition=None):
    # record the dependency and return self so that calls can be chained
    self._dependencies.append({"executions": executions, "condition": condition})
    return self
# self-dependency to 're-run' if failed?

Engines:

an engine set on the top-level execution propagates to all non-executed dependencies; an exception is raised if non-executed dependencies declare their own engines
if a dependency has already been executed, check what type of engine was used and, if it is the same type, go ahead to allow for dependencies on already running jobs
engines should expose a 'supports_dependencies' flag similar to 'supports_resources'

At execution,

check if engine supports_dependencies, if not, raise error if execution has dependency
in future, engines that do not support dependencies could fall back on synchronous logic to handle them
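The execution-time check could look roughly like the following sketch. The `Engine` classes and the `submit` signature are hypothetical simplifications; only the `supports_dependencies` flag comes from the proposal above.

```python
class Engine:
    """Hypothetical base engine without dependency support."""

    supports_dependencies = False

    def submit(self, execution_name: str, dependencies=()) -> str:
        # Refuse to run if dependencies were declared but cannot be honoured.
        if dependencies and not self.supports_dependencies:
            raise NotImplementedError(
                f"{type(self).__name__} does not support dependencies, "
                f"but '{execution_name}' declares {len(dependencies)}."
            )
        return f"submitted {execution_name}"


class DependencyAwareEngine(Engine):
    """Hypothetical engine (e.g. a scheduler-backed one) that can chain jobs."""

    supports_dependencies = True


accepted = DependencyAwareEngine().submit("example", dependencies=["next"])
no_dependencies = Engine().submit("example")
```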

Useful example: Slurm dependency reference

Simplify checkpoint functionality

Currently, users have to manage the filenames manually. There should be an interface to select checkpoints more easily, directly from the Execution interface, e.g. 'from_last_checkpoint' etc.

Provide options to compute length of execution etc.

total training time
number of resumes
show total training time at the end

Handle structured version updates correctly

Currently, version updates fail if they don't match the structure, for instance

components:
 - demo:
     test: 1

Experiment().component('demo', {'test': 2 }) # fine
Experiment().component('demo', {'test': {'nested': 'structure'}}) # fails

I'm not sure whether that's a bug or a feature. In any case, the exception should be handled more gracefully.

Native engine should allow multiprocessing with 1 process

Currently, we do not use multiprocessing if processes is set to 1. Since there are good use cases for process isolation (e.g. catching SEGV stack traces etc.), we should allow the use of 1 process and use it to capture process-level failures gracefully.

Replace Store() interface with StorageFileSystem model

The Store() interface duplicates a lot of the APIs of the more general StorageFileSystemModel and should thus be replaced.

Support nested component scopes

components:nested.module:
  - example

currently fails in validation

Automatic dependency management

Currently, the user is responsible for downloading dependency repos etc. It might be nice if machinable would take care of this automatically. This could also include some conflict management. However, I guess we could just rely on some other library that already solved this type of problem rather than reinventing some custom solution.

Relative `since` specification in Index.find_latest()

Currently, Index.find_latest(since) requires a DateTime to be passed that has to be constructed by the user. It would be more convenient if users could also specify relative time as a string argument, for example since="1d" etc.
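One possible way to resolve such strings, as a sketch only. The helper name `resolve_since` and the supported unit letters are assumptions, not an existing machinable API.

```python
import re
from datetime import datetime, timedelta
from typing import Optional, Union

# Assumed unit suffixes, mapped to timedelta keyword arguments.
_UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days", "w": "weeks"}


def resolve_since(since: Union[str, datetime], now: Optional[datetime] = None) -> datetime:
    """Turn a relative specification like '1d' into an absolute datetime."""
    if isinstance(since, datetime):
        return since  # absolute values pass through unchanged
    match = re.fullmatch(r"(\d+)([smhdw])", since.strip())
    if match is None:
        raise ValueError(f"Invalid relative time {since!r}; expected e.g. '1d' or '30m'")
    amount, unit = match.groups()
    reference = now if now is not None else datetime.now()
    return reference - timedelta(**{_UNITS[unit]: int(amount)})


reference_time = datetime(2021, 1, 2, 12, 0)
one_day_ago = resolve_since("1d", now=reference_time)
ninety_minutes_ago = resolve_since("90m", now=reference_time)
```

Accepting both `datetime` and `str` keeps existing callers of `find_latest(since)` working unchanged.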

machinable init command

To scaffold new projects from a starter-template. Probably as simple as cloning a Git repo

Provide options for 'unit' testing

The structure of a machinable project allows for automated execution of all available components in a project. It would thus be possible to introduce an execution mode in which every component is executed like a unit test. The component class could expose basic 'assert' APIs that only apply when executed in this new 'test mode'.

Ideally, this could be implemented by using an existing testing framework like pytest.

disable output capturing so that it does not interfere with pytest's output capturing

Improve Ray Tune integration

Ray 0.8.7 improved the Trainable interface that is used in the machinable tune integration. This will allow us to improve the integration and get rid of some of the current limitations like missing checkpoint support.

Ignore schema-validation for mixins

Using Experiment.component('example', ('_test_')) will fail unless _test_ is specified in the example. To enable dynamic mixins, the schema-validation should be relaxed when it comes to mixin versions.

Local mode execution should report exceptions right away

Currently, Exceptions are being caught and reported at the very end of the jobs as parallel execution is assumed. However, that's not useful behaviour in local mode because you will only learn about an exception after all iterative executions have been finished.

Add Task() reflection/serialization

It should be possible to recover a Task object using some representation of the Task specification. That would enable easy reconstruction from the observations etc.

Protect GraphQL server with Token

Should be managed in a similar way to Jupyter notebooks.

Mixins should be able to expose config methods

Currently, users have to manually 'forward' config methods into mixins, e.g.

def config_the_method(self):
    self._the_mixin_.config_the_method()

It would be useful to invoke such forward calls automatically.
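A sketch of how automatic forwarding might be implemented with `__getattr__`. The `Component` and `ExampleMixin` classes here are hypothetical simplifications, and restricting the forwarding to names starting with `config_` is an assumption.

```python
class ExampleMixin:
    """Hypothetical mixin that provides a config method."""

    def config_the_method(self):
        return "from mixin"


class Component:
    """Hypothetical, simplified component that forwards config methods."""

    def __init__(self, mixins):
        self._mixins = list(mixins)

    def __getattr__(self, name):
        # Only reached when the component itself does not define `name`.
        if name.startswith("config_"):
            for mixin in self.__dict__.get("_mixins", []):
                method = getattr(mixin, name, None)
                if callable(method):
                    return method
        raise AttributeError(
            f"{type(self).__name__!r} object has no attribute {name!r}"
        )


component = Component([ExampleMixin()])
forwarded = component.config_the_method()
```

Because `__getattr__` is only consulted after normal lookup fails, a config method defined on the component itself would still take precedence over the mixin's version.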

Improve mixin overrides

It is currently not possible to override the _mixin_ list of parent components in a fine-grained way (e.g. remove a particular mixin etc) because they are expanded in the parent before the inheritance occurs. To allow for more flexible use, the _mixin_ elements should not be resolved until after inheritance is applied.

Infrastructure backlog

Make engine and index imports lazy

The available engines and indexes are either imported eagerly if the overhead is small or not imported at all if they require heavy dependencies. Ideally, the imports in these models should be handled lazily using module-level __getattr__ (PEP 562, available in Python >=3.7).
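The module-level `__getattr__` hook could be used roughly as follows. In a real package the function would simply be defined in the package's `__init__.py`; to keep the sketch self-contained it builds a module object at runtime, and a standard-library class stands in for a heavy engine dependency. The helper name `make_lazy_module` is hypothetical.

```python
import importlib
import types


def make_lazy_module(name: str, lazy_attributes: dict) -> types.ModuleType:
    """Create a module whose listed attributes are imported on first access.

    `lazy_attributes` maps an attribute name to the module that provides it.
    """
    module = types.ModuleType(name)

    def __getattr__(attribute):
        # Called by Python only if the attribute is not in the module dict.
        try:
            target = lazy_attributes[attribute]
        except KeyError:
            raise AttributeError(
                f"module {name!r} has no attribute {attribute!r}"
            ) from None
        value = getattr(importlib.import_module(target), attribute)
        setattr(module, attribute, value)  # cache so the hook runs only once
        return value

    module.__getattr__ = __getattr__
    return module


# `Fraction` stands in for an engine class with heavy dependencies.
engines = make_lazy_module("engines", {"Fraction": "fractions"})
loaded_before_access = "Fraction" in vars(engines)
half = engines.Fraction(1, 2)
loaded_after_access = "Fraction" in vars(engines)
```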

Allow engines to write meta data

Engines often have useful meta data. For instance, the Slurm engine could infer the Slurm submission ID and save it to the storage. Engine() should provide an interface to easily store meta data.

Smart reloading of observations

It should be possible to implement some smart automatic reload of observations, for example:

destroy observations cache every 5 minutes unless the job is finished/died
option to automatically re-index the current storages at some interval

Cache unchanged parsed configuration for quick reload

Maintaining a hash of the machinable.yaml files for caching would probably provide some performance benefits.
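A minimal sketch of hash-based caching, assuming the parsed result can be kept in memory. `load_cached` and the counting stand-in parser are hypothetical; note that the file is still read on every call and only the parsing step is skipped.

```python
import hashlib
import tempfile
from pathlib import Path

_CACHE = {}  # path -> (content hash, parsed configuration)


def load_cached(path, parse):
    """Return the parsed file, re-parsing only if its content hash changed."""
    raw = Path(path).read_bytes()
    digest = hashlib.sha256(raw).hexdigest()
    entry = _CACHE.get(str(path))
    if entry is not None and entry[0] == digest:
        return entry[1]  # unchanged content: reuse the parsed result
    parsed = parse(raw.decode("utf-8"))
    _CACHE[str(path)] = (digest, parsed)
    return parsed


parse_calls = []


def counting_parse(text):
    # Stand-in for the real (and more expensive) configuration parser.
    parse_calls.append(text)
    return {"length": len(text)}


with tempfile.TemporaryDirectory() as directory:
    config_file = Path(directory) / "machinable.yaml"
    config_file.write_text("components: []\n")
    first = load_cached(config_file, counting_parse)
    second = load_cached(config_file, counting_parse)  # served from the cache
    config_file.write_text("components: [demo]\n")
    third = load_cached(config_file, counting_parse)  # content changed, re-parsed
```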

get_component improvements

introduce get_component(or_fail=False) option to raise an exception if the component cannot be found
rename index argument to avoid confusion with Indexes
the method should act as identity when being passed a Component, i.e. use StorageComponent.create() rather than constructor

CLI autocomplete

Support auto-completion by statically analyzing the project

Make store operations resumable

Currently, the Store() interface makes the assumption that execution is continuous from start to finish. However, in many cases workers are actually interrupted and resumed from checkpoints and using self.store may lead to inconsistent results. machinable should automatically 'resume' from existing store/ data to enable seamless spot execution.

Introduce sync for non-pyfilesystem locations

Observer should have a sync(directory, target, frequency) method that enables automatic syncing of a local directory to the non-local pyfilesystem under 'storage/$target' of the observer storage. This is useful if you want to sync custom things like tf checkpoints etc. to the storage. When the method is called multiple times, multiple syncs are set up. If the filesystem is already local, nothing should be done. We can automatically sync at a regular frequency using the heartbeat event, self.events.on('heartbeat', syncer.sync_if_needed)

Ray has sync features that we could potentially reuse.

Replace sh with commandlib

Follow-up to #18. We currently rely on sh in an inconsistent way throughout the code base. Moving to commandlib should simplify the setup and may also allow us to drop dependencies like GitPython etc.

Allow to register global config methods

Some config methods are fairly generic and not bound to a particular component (e.g. parsing dtypes etc.). It should thus be possible to register them globally, with support for imports etc., so that it is easy to provide sensible defaults.
