aai-institute / nnbench
A small framework for benchmarking machine learning models.
Home Page: https://aai-institute.github.io/nnbench/
License: Apache License 2.0
With the quickstart (introductory example based on scikit-learn) merged (see #25), we can follow it up with more thorough guides on different important cornerstones of the package:
Similar to lakefs-spec (aai-institute/lakefs-spec#273 (review)), we would like to make improvements to the API reference documentation generation for the nnbench library. The proposed changes include:
- Updating the gen_api_ref_pages.py script to extract the module docstrings for root-level modules in the package.
- Using the docstring_parser library to parse the docstrings.
- Adding the docstring-parser library to the documentation dependencies in pyproject.toml.
These enhancements aim to provide a more comprehensive and user-friendly API reference documentation for the nnbench library. The inclusion of module docstrings and links to root-level modules will make it easier for users to navigate and understand the library's structure and functionality.
The AbstractBenchmarkRunner.run() interface can also take strings, which prompt a lookup of the corresponding reporter class in the reporter registry.
Similarly to fsspec, we should use an importlib.metadata entry point so that users of the library can register benchmark reporters via pyproject.toml.
Source: https://github.com/fsspec/filesystem_spec/blob/master/fsspec/__init__.py
(This can be simplified for Python>=3.9.)
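A minimal sketch of the consumer side, assuming a hypothetical entry point group name "nnbench.reporters" (the group name and helper function are illustrative, not existing nnbench API):

# In a user's pyproject.toml (hypothetical):
#
# [project.entry-points."nnbench.reporters"]
# duckdb = "my_package.reporters:DuckDBReporter"

from importlib.metadata import entry_points


def load_registered_reporters() -> dict:
    """Collect all reporter classes registered under the 'nnbench.reporters' group."""
    # Uses the Python 3.10+ selection API; older versions need dict-style access.
    return {ep.name: ep.load() for ep in entry_points(group="nnbench.reporters")}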
Add user guide to the documentation:
A few observations on the use of docs versions:
- latest (a pre-release version) is set as the default docs version. This causes the notification banner to be shown; however, the link in the banner points to the default version (i.e., by clicking on it, you are still looking at the pre-release docs).
- There is no stable version. Going forward, this might confuse users, who cannot access the docs for previous stable releases.
I propose adopting the same versioning scheme as for lakeFS-spec, with latest as an alias for the latest stable version and set as the default. Pre-releases are labeled as development in that scheme.
The current decision is that setUp and tearDown tasks/callbacks will not be allowed to modify their arguments in any way, to preserve type safety and integrity. This is achieved by passing the unpacked params dictionaries to the callbacks (requiring that they take the parameter values in their interface), and by making the tasks return None (or at least ignoring their return values in the benchmark runner).
However, we can immediately see that this abstraction is too inflexible for real cases, the most prominent of which are models and datasets (or, more directly speaking, the parameters relevant to >99% of nnbench use cases).
Therefore, we have to come up with an abstraction that allows specifying models as parametrized artifacts (by name, checksum, etc.), retrieving their contents from storage / loading them just in time in a setUp task. In fact, this routine is probably so important that it deserves a one-liner API, so that users who override setUp and tearDown tasks can incorporate artifact sourcing themselves.
Concretely, I am thinking of something like this:
import os
from typing import Any, Generic, TypeVar

import jax.numpy as jnp

from nnbench import benchmark

T = TypeVar("T")


class MyModel:
    pass


class Artifact(Generic[T]):
    value: Any | None = None

    def __init__(self, path: str | os.PathLike[str]) -> None:
        # Save the path for later just-in-time materialization.
        self.path = path

    def materialize(self) -> "Artifact":
        """Load the artifact from storage."""
        ...


def setUp(**kwargs) -> None:
    for k, v in kwargs.items():
        if isinstance(v, Artifact):
            v.materialize()


@benchmark
def accuracy(model: Artifact[MyModel], test_data: Artifact[jnp.ndarray]) -> float:
    pymodel = model.value  # the model is embedded in the artifact as a value
    xdata, ydata = test_data.value
    return jnp.sum(jnp.abs(ydata - pymodel(xdata)))
Bonus points for being able to use artifacts without having to access their value first (e.g. with wrapt?).
We decided to forego test coverage in the past, but we should add it back to establish confidence in our first shipped release.
Concrete steps:
This is currently the missing piece to the full end-to-end experience.
What is left to decide is:
a) Do we use raw functions, or a reporter interface?
b) How do we connect these interfaces to the base benchmark runner's report() API? (-> fsspec-like registry?)
c) If we do a registry for b), can we use pyproject hooks to register user extensions in pyproject.toml?
Raised internally. Repro:
import nnbench
from nnbench import types, context
import pandas as pd
from pathlib import Path


class MyDataSet(types.Artifact):
    def deserialize(self) -> None:
        path = Path(self.path).resolve()
        self._value = pd.read_csv(path, sep="\t")


@nnbench.benchmark()
def my_benchmark(dataset: MyDataSet) -> float:
    return 1.0


if __name__ == "__main__":
    dataset = MyDataSet(types.LocalArtifactLoader(path="ames_housing.csv"))
    runner = nnbench.BenchmarkRunner()
    reporter = nnbench.BenchmarkReporter()
    results = runner.run("example.py", params={
        "dataset": dataset,
    })
    reporter.display(results)
Error:
(.venv) janwillem@macbook-janwillem [nnbench-trial2] % python example.py
Traceback (most recent call last):
  File "/Users/janwillem/Developer/hack/nnbench-trial2/example.py", line 27, in <module>
    results = runner.run("example.py", params={
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/janwillem/Developer/hack/nnbench-trial2/.venv/lib/python3.11/site-packages/nnbench/runner.py", line 239, in run
    self._check(dparams)
  File "/Users/janwillem/Developer/hack/nnbench-trial2/.venv/lib/python3.11/site-packages/nnbench/runner.py", line 130, in _check
    raise TypeError(
TypeError: expected type <class 'example.py.MyDataSet'> for parameter 'dataset', got <class '__main__.MyDataSet'>
This means that the module loading will screw up some of the types. (If I understand correctly, it expects a MyDataSet from the target file, but gets one from the currently executed module, i.e. __main__.)
The module import name is obviously a bug, but the general behavior might be expected, although surprising. Possible actions here include adding a warning/note on using __main__ as the path_or_module qualifier when running in-file benchmarks.
We want to experiment with a global memoization cache, which we want to explicitly clear in a teardown task after a family of benchmarks for a model (say NER on distilbert) has run.
This means that the setup and teardown tasks need to know which benchmark they are currently applied in.
For a single benchmark, say
@nnbench.benchmark
def echo(s: str) -> str:
    print(s)
    return s

the corresponding State should look something like this:
def tearDown(state, **params):  # <- or maybe a mappingproxy of the params?
    print(state)
    # name: "echo"
    # family: "echo" <- this is for parametrized benchmarks, where the family name equals the function name.
    # family_size: 1 <- how many members are in the current benchmark family?
    # family_index: 0 <- which member of the family is it?
More metadata suggestions welcome. After this, we can try evicting a memo from the cache on the condition family_index == family_size - 1.
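A rough sketch of that eviction, assuming the state fields proposed above and a hypothetical module-level memo cache (neither exists in nnbench yet, and whether state is a mapping or a dataclass is still open):

_memo_cache: dict = {}  # hypothetical global cache, keyed by (family, parameter name)


def tearDown(state, **params) -> None:
    # Evict all memos belonging to this family once its last member has run.
    if state["family_index"] == state["family_size"] - 1:
        for key in [k for k in _memo_cache if k[0] == state["family"]]:
            del _memo_cache[key]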
Currently we are overwriting the parameters for the benchmark passed in BenchmarkRunner.run() with the params slot saved on the benchmark. This is not thought through entirely, and needs changing.
Points to consider:
- Whether to keep params on the Benchmark class as a separate slot. All default values can be given either as variable defaults in the benchmark signature, or via params in BenchmarkRunner.run(). (#15)
- Raising an error in _check if a member of any benchmark interface is untyped (i.e. param.annotation == inspect.Parameter.empty). (#18)
- Factoring the interface handling out of the Benchmark class. This should be an nnbench class called Interface etc., which is immutable/frozen, and provides easy access to function interface properties such as varnames (-> inspect.signature.parameters.keys()), vartypes (-> inspect.signature.parameters.values()), tuples of the two previous (-> inspect.signature.parameters.items()), and default values (-> p.default for p in inspect.signature.parameters). (#20)

Similarly to defining a custom name for a single benchmark, we should give an option to give a custom name to each benchmark of a family defined by the @nnbench.parametrize and @nnbench.product decorators.
To prevent collisions, this should not be a static string, but a callable taking the input parameters as a dictionary and returning a string that will subsequently be set as the name of the benchmark for the respective parameter instance.
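For illustration, such a callable could look like the following; the name keyword on the parametrize decorator is hypothetical and not part of the current API:

def per_class_name(params: dict) -> str:
    # e.g. "accuracy[class_label=B-PER]"
    return "accuracy[" + ",".join(f"{k}={v}" for k, v in params.items()) + "]"


@nnbench.parametrize(
    ({"class_label": "B-PER"}, {"class_label": "I-PER"}),
    name=per_class_name,  # hypothetical keyword argument
)
def accuracy(class_label: str) -> float:
    ...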
The contextual struct is so far empty - everything is opt-in. That's cool, but some things are just so common that they might be useful to include by default.
Ideas:
- Timing information (e.g. via time.perf_counter_ns()).
And as for the benchmark JSON:
- error_occurred and error_message should also be included on success (but empty) to stabilize the schema (though we should take care to not print them to the console when everything went fine).
- A description key containing Benchmark.fn.__doc__ (might even be worth saving as a toplevel member like Benchmark.description()?).
- An fn key with the fully qualified function name (i.e. f"{fn.__module__}.{fn.__qualname__}").

As per this TODO:
nnbench/src/nnbench/types/types.py, lines 104 to 119 in 65fc45b
Steps are:
1 - bring back the global cache + lock,
2 - insert values based on the id of the inserting memo,
3 - add a clear_memo_cache API, maybe factoring out everything related into a nnbench.types.memo submodule,
4 - (optional) check if evicting the memoized value in Memo.__del__() does anything (debug logs!),
5 - drop functools.cache decorators everywhere (examples!).
NB: Check that the Memo.__call__() type annotation is not affected by the solution, since that breaks typechecking in benchmarks. (With a decorator, this should not be a problem.)
What would be super nice: showcase a parametrized benchmark with a collection of large numpy array memos, checking if evicting an array after the last run of the family and triggering GC (gc.collect()) does anything. memray is (hopefully) your friend: https://bloomberg.github.io/memray/index.html
-> This needs #125.
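A minimal sketch of steps 1-3, assuming the proposed nnbench.types.memo submodule layout (cache structure, key choice, and locking granularity are all open questions):

# nnbench/types/memo.py (proposed)
import threading
from typing import Any

_memo_cache: dict[int, Any] = {}  # keyed by id() of the inserting memo
_cache_lock = threading.Lock()


def set_memo_value(memo_id: int, value: Any) -> None:
    with _cache_lock:
        _memo_cache[memo_id] = value


def get_memo_value(memo_id: int) -> Any:
    with _cache_lock:
        return _memo_cache[memo_id]


def clear_memo_cache() -> None:
    """Evict all memoized values, e.g. in a teardown task after a benchmark family."""
    with _cache_lock:
        _memo_cache.clear()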
NOTE: this ticket is a subtask of Complete documentation task.
We need a simple and illustrative explanation of the product purpose and usage for the people visiting the product repository for the first time.
We will achieve this by adding a usage example (code snippets / notebook / equivalent) to the product documentation.
This example should focus on illustrating the nnbench mechanism, but at the same time, it should be embedded in a real ML use case.
Ideally, this usage example should make it easy for users to adapt it to their own code and ML use case (after the required adjustments).
Merged here #25
Add user guide to the documentation:
Add integration example as noted here #28
For the time being, we can roll just fine with Benchmark being the only admissible benchmark type.
However, if users want to roll their own runners, they might also want to run their own benchmarks. In that case, all of the decorator facilities in nnbench.core will not do the correct thing anymore, since they unconditionally yield Benchmarks, which would not be collected by the base BenchmarkRunner.collect() method.
A solution is to define a base Benchmarkable class (name TBD, could also be BenchmarkBase or similar), and create conversions to the respective benchmark type via a builder pattern / Python classmethod.
Case in point:
from nnbench import AbstractBenchmarkRunner, Benchmark


class MyBenchmark(Benchmark):
    @classmethod
    def create(cls, b: "Benchmarkable") -> "MyBenchmark":
        ...


class MyBenchmarkRunner(AbstractBenchmarkRunner):
    benchmark_type = MyBenchmark
And then have the nnbench.benchmark and nnbench.parametrize decorators yield Benchmarkables, respectively.
@nnbench.parametrize calculates a multi-way "dot product" of all arguments. However, a cartesian product (i.e. a benchmark grid over the parameters) is also useful.
Demo:
@nnbench.product(
    a=[1, 2, 3, 4, 5],
    b=[True, False],
)
def double_if_b(a: int, b: bool) -> int:
    return a * 2 if b else a
This decorator should produce a list of len(a) * len(b) = 10 benchmarks, and obviously saves space over a 10-way parametrization.
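For clarity, the expansion performed by such a product decorator is the standard cartesian product of the keyword arguments, which can be sketched with itertools:

import itertools

names = ["a", "b"]
values = [[1, 2, 3, 4, 5], [True, False]]
# 10 parameter dicts: {"a": 1, "b": True}, {"a": 1, "b": False}, ..., {"a": 5, "b": False}
grid = [dict(zip(names, combo)) for combo in itertools.product(*values)]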
We've so far rolled pretty well with a bare-dict context, but it might be time to change that.
Recently, we encountered the problem of merging two benchmark records. An easy first solution is inlining the context into the accompanying benchmark list, but that has the problem of not being idempotent.
To fix this, we could merge two context dictionaries by taking the union of the keys and padding missing values with a unique placeholder.
Together with the previously introduced filtering, flattening, and nested getitem methods, this points towards a custom class having value over a bare dictionary.
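A small sketch of the padding idea (the placeholder value and function name are illustrative only):

_MISSING = "<missing>"  # unique placeholder for keys absent from one of the contexts


def pad_to_union(contexts: list[dict]) -> list[dict]:
    # Extend every context to the union of all keys, padding absent values.
    all_keys = set().union(*(c.keys() for c in contexts))
    return [{k: c.get(k, _MISSING) for k in all_keys} for c in contexts]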
TL,DR: I would like to replace this pattern:
record = runner.run(__name__, params={...})
transform = MyTransform()
trecord = transform.apply(record)
FileIO.write(trecord)
with this:
record = runner.run(__name__, params={...})
FileIO.write(record, transform=MyTransform())
Meaning that a single transform can be given and applied to a record in write() / read(), and similarly in the other IO types we have.
These two subtasks are separated out from #66:
Add example or tutorial to the documentation on streaming results to a database with a custom data sink.
Please note: https://duckdb.org/2023/03/03/json.html for getting started with DuckDB.
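A minimal sketch of the DuckDB side, assuming benchmark records have already been written to a JSON file named records.json (file name and table layout are illustrative):

import duckdb

con = duckdb.connect("benchmarks.duckdb")
# read_json_auto infers the schema, including nested context/parameter structs.
con.execute("CREATE TABLE records AS SELECT * FROM read_json_auto('records.json')")
print(con.execute("SELECT * FROM records LIMIT 5").fetchall())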
Part of #24 (comment).
Implementation needs to decide whether or not to give a variadic __init__(*args, **kwargs) or leave it empty; if no default state needs to be present on all reporters, the __init__() can be omitted altogether.
Supersedes BenchmarkRunner.report(...) in favor of reporter = MyReporter(); reporter.report(...).
We have two different test directories with benchmarks in them, because we are testing two different aspects of benchmark collection.
The number of tests for collection will probably increase, so we should come up with a way to improve the test directory setup. An obvious way would be to consolidate all benchmark files into one directory, tag these benchmarks with different test workloads, and then filter by these tags in their respective unit tests.
This has the obvious advantage that we get used to the concept of filtering benchmarks by tags, and have a more intuitive understanding of benchmark collection by our package.
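A sketch of what tagging the test benchmarks could look like, assuming the benchmark decorator and the collection step accept a tags tuple (the exact collection signature is an assumption here):

# tests/benchmarks/standard.py
import nnbench


@nnbench.benchmark(tags=("collection",))
def double(x: int) -> int:
    return 2 * x


# In the unit test, only benchmarks carrying the "collection" tag would be collected, e.g.:
# runner.collect("tests/benchmarks", tags=("collection",))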
During ideation for reporting data in duckDB, I read one of their blog posts (https://duckdb.org/2023/03/03/json.html) that gives an example of loading a large (>10GB) compressed JSON archive into memory.
It would be really nice if we could support this with our file IO. In theory, the following steps would need to happen:
- A compression: str | None = None argument on (read|write) that gives the option of using a compression when writing a record (also prompting a lookup in a dictionary of compression algorithms, like https://github.com/fsspec/filesystem_spec/blob/master/fsspec/utils.py#L138).
- A (read|write)_batched variant that takes all records, reads/writes them to a directory in the given driver mode, and then compresses said directory.

Currently, in our official example, we promote artifacts as a type-safe way to specify not-yet-downloaded remote models and data. However, the DX is not where it needs to be, and I'd argue what we currently have is not worth it over just instantiating the models by hand (which is not preferred, since the user would then need to ensure reproducibility!).
Example: If the library needs me to subclass

class MyModelArtifact(Artifact[MyModel]):  # <- this needs the generic specialization for typing, no?
    ...

and then construct it with a loader such as FSSpecLoader, e.g.

m = MyModelArtifact(FSSpecLoader("s3://my-bucket/my-model.npz"), checksum="12345")  # my model is in compressed numpy in some S3 bucket.
r = nnbench.BenchmarkRunner()
r.run(__file__, params={"model": m.value}, ...)

it's actually more (or at least equal) work than just instantiating it outside.
For the Artifact to have any use over raw models, it should be able to be given to Params in place of a model, as a kind of future, and then a Loader can be given to the BenchmarkRunner.run() method (or its class state, if we do not wish to pollute the run() signature even further).
In this model, we can also let the loader handle caching. Pseudocode:
import functools


class FSSpecLoader:
    @functools.lru_cache  # make sure to alias Artifact.__hash__ to its checksum to get easy cache hits.
    def load(self, a: Artifact[T], verify: bool = True) -> T:
        ...


m = MyModelArtifact("s3://my-bucket/my-model.npz", checksum="12345")
r = nnbench.BenchmarkRunner()
# if raw models are given instead, that will also work and a loader is not required.
record = r.run(__file__, params={"model": m}, artifact_loader=FSSpecLoader)  # <- or an instance.
That makes a lot more sense usage-wise to me. It would mean that Artifacts are privileged types by default in nnbench, and that they bypass certain typechecks on the params by default (to be implemented).
Opinions welcome.
Each of these should inherit directly from BenchmarkReporter and define a nice interface for the respective IO, to be reusable or even usable out of the box for most users.
Pseudocode example:
import json


class FileIOReporter:
    def write_record(self, r):
        ...

    def write_record_batched(self, rb):
        ...

    def read_record(self):
        ...

    def open(self, fp):
        ...

    def close(self, fp):
        ...


class JSONFileReporter(FileIOReporter):
    def open(self, fp):
        with open(fp, "r") as f:
            return json.load(f)

    def close(self, fp):
        fp.close()


class YAMLFileReporter(FileIOReporter):
    ...  # same thing as for JSON, but use YAML read/write APIs.
Objective: Implement the base BenchmarkReporter's interface for a few specific families of reporters, not every reporter by itself.
Bonus: Maybe the IO (open, close) can even become a mixin that makes specific file or database reporters trivial by just acquiring and releasing a handle to write to.
Names should be clear and unique, communicating what they do and where/how to use them.
Candidates for refactoring:
You may encounter a situation where you want to parametrize many benchmarks in the same way.
An example could be label-specific metrics like in the artifact-benchmarking example #92.
@nnbench.parametrize(
    (
        {"class_label": "O"},
        {"class_label": "B-PER"},
        {"class_label": "I-PER"},
        {"class_label": "B-ORG"},
        {"class_label": "I-ORG"},
        {"class_label": "B-LOC"},
        {"class_label": "I-LOC"},
        {"class_label": "B-MISC"},
        {"class_label": "I-MISC"},
    ),
    tags=("metric", "per-class"),
)
def precision_per_class(
    class_label: str,
    model: Module,
    test_dataloader: DataLoader,
    index_label_mapping: dict[int, str],
    padding_id: int = -100,
) -> float:
    ...
To remove code redundancy when this exact pattern applies to multiple benchmarks, it would be desirable to use e.g. functools.partial on the nnbench.parametrize decorator, like so:
per_class_parametrize = functools.partial(
    nnbench.parametrize,
    (
        {"class_label": "O"},
        {"class_label": "B-PER"},
        {"class_label": "I-PER"},
        {"class_label": "B-ORG"},
        {"class_label": "I-ORG"},
        {"class_label": "B-LOC"},
        {"class_label": "I-LOC"},
        {"class_label": "B-MISC"},
        {"class_label": "I-MISC"},
    ),
    tags=("metric", "per-class"),
)


@per_class_parametrize
def precision_per_class(
    class_label: str,
    model: Module,
    test_dataloader: DataLoader,
    index_label_mapping: dict[int, str],
    padding_id: int = -100,
) -> float:
    ...
While the above approach does not throw an error, it also does not work: the benchmarks that are parametrized via functools.partial are not executed.
We should investigate this to allow the reuse of code without repetition.
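Until that is investigated, one pattern worth checking as a workaround is a small helper that creates a fresh decorator per benchmark instead of partially applying it (whether a single decorator instance can be reused across functions is exactly the open question above):

PER_CLASS_LABELS = (
    {"class_label": "O"},
    {"class_label": "B-PER"},
    {"class_label": "I-PER"},
)


def per_class_parametrize():
    # Returns a new nnbench.parametrize decorator on every call.
    return nnbench.parametrize(PER_CLASS_LABELS, tags=("metric", "per-class"))


@per_class_parametrize()
def precision_per_class(class_label: str) -> float:
    ...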
Reported by @samuelburbulla (thanks!).
Simply put: In the lightning demo,
import nnbench


@nnbench.benchmark
def product(a: int, b: int) -> int:
    return a * b


@nnbench.benchmark
def power(a: int, b: int) -> int:
    return a ** b


runner = nnbench.BenchmarkRunner()
# run the above benchmarks with the parameters `a=2, b=10`...
record = runner.run("__main__", params={"a": 2, "b": 10})

rep = nnbench.BenchmarkReporter()
rep.display(record)  # ...and print the results to the terminal.

# results in a table like the following:
# name     function  date                 value  time_ns
# -------  --------  -------------------  -----  -------
# product  product   2024-03-07T10:14:21     20     1000
# power    power     2024-03-07T10:14:21   1024      750
we would expect to find the actual parameter values (a=2, b=10) in the table, since they are important to contextualize results, and provide filtering opportunities (for example, "show me each product result where a=2").
As before, we should think about how to deal with these parameters. Inserting them raw into the result record is the simplest solution, but that again balloons the schema of multiple concurrent experiments taking different sets of parameters.
In contrast, inserting the parameters as a struct/dictionary makes access a little more difficult, but keeps the schema easier (and some modern SQL engines support dotted access for struct columns, see e.g. the duckDB example).
Checklist:
- Record the effective parameters in runner.run() (take the union of params and the defaults from the benchmark interface).

Follow-up from #92:
- Context.add and Context.update
- the runner.run method

Integration with workflow engines, structuring into a benchmarking task (e.g. Flyte, Prefect, Dagster, Beam, ...).
Add user guide to the documentation:
Currently, we have no serious stress test for our record IO in place - the examples all write structs of standard data types, which are really simple to deal with.
This changes potentially with the merge of #103, which adds a parameter struct to the records. That is absolutely necessary for keeping track of what actually went into the benchmarks, but it implicitly sets us up for a serious problem: How do we deal with results for benchmarks that take non-standard data types like models, functions, algorithms etc.?
Consider the following example:
from typing import Any

import nnbench


class MyModel:
    ...


@nnbench.benchmark
def accuracy(m: MyModel, data: Any) -> float:
    ...
Serializing a record coming out of a benchmark run including accuracy can potentially be really challenging, since it is unclear how to represent MyModel in a written record.
There are multiple ways around this. First, there is the option of requiring the user to make their records conform, but this means extra work for them, and can break reproducibility if the chosen representation is itself not reproducible. Adding serializer hooks for custom data types is an option, but it's convoluted and a lot of work.
Then there is the option of making the benchmarks take a unique identifier for the model instead, which is a standard data type (e.g. a hash, a remote URI, etc.), and loading the artifact just in time for the user to access. This should mean easier reading/writing of records, but requires more code for the benchmark setup. It also requires us to change our story for the model artifact benchmarks, where we would need to come up with a way to efficiently instantiate models based on some information.
We should be able to use a setup task for the benchmarks to accomplish this. A cache lookup for the proper artifact, which is loaded before the start of all benchmarks, seems like a good approach.
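A rough sketch of that idea, with a hypothetical identifier-to-model registry and cache (none of these helper names exist in nnbench; the setUp keyword on the decorator reflects the setup-task mechanism discussed above):

from typing import Any

import nnbench

_model_cache: dict[str, Any] = {}


def load_model(model_id: str) -> Any:
    # Placeholder: resolve the identifier (hash, remote URI, ...) to a model object.
    ...


def setUp(**params) -> None:
    # Materialize the model once, before the benchmark runs.
    model_id = params["model_id"]
    if model_id not in _model_cache:
        _model_cache[model_id] = load_model(model_id)


@nnbench.benchmark(setUp=setUp)
def accuracy(model_id: str, data: Any) -> float:
    model = _model_cache[model_id]
    ...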
Add user guide to the documentation: the decorators (@benchmark, @parametrize, @product) - motivation, behavior, and one usage example for each.

Follow-up of #113.
- Make Transform.apply() an abstract method, forcing it to be overridden.
- Make Transform.iapply() raise NotImplementedError if Transform.invertible is false.

Add user guide to the documentation: nnbench.Parameters for type safety.

Currently, with the following example directory structure:
src/benchmarks.py
src/main.py
executing runner.run("benchmarks.py", ...) from inside the src directory results in an importlib error.
Apparently, the ismodule check falls through to the very last import attempt, which fails since the found module does not have a __path__ set.
This is an indication that the discovery branch is backwards - to fix it, we can try calling path.is_{file,dir}() on the input first, and only then try to understand it as a module name.
This should be safe since the Python file has a .py suffix, which eliminates collisions between e.g. __main__ (the module name) and __main__.py.
Also, once it's fixed, docs should be updated.
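A sketch of the proposed ordering, simplified to the file-vs-module decision (the real discovery code also needs to handle directories of benchmark files and already-imported modules):

import importlib
import os
from pathlib import Path


def discover(path_or_module: str | os.PathLike[str]):
    p = Path(path_or_module)
    # Check for a file or directory first; a ".py" suffix cannot collide with a
    # module name such as "__main__".
    if p.is_file() or p.is_dir():
        return p
    # Only then fall back to interpreting the input as a module name.
    return importlib.import_module(str(path_or_module))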
Currently open: MNIST example for basic application in an ML pipeline.
Afterwards: We should add non-trivial examples on things we can accomplish using our tools.
Subtasks are addressed in separate tickets - ordered below by priority:
Right now, our final step (reporting obtained results) is a bit unfinished. We have an AbstractBenchmarkRunner.report() method, which takes all reporters by name, and does not even use the runner's state, so it is (essentially) static.
At the same time, our Reporter class has no state or worthwhile interface - it too gets a report method, which is an awkward copy of the runner's report method.
Maybe we should untangle that and come up with a better abstraction. How about this:
- The runner loses its report() method.
- Instead of BaseReporter, i.e. the argument given in AbstractBenchmarkRunner.report(to=...), we introduce a new class (Sink? DataSource?) that maybe also reads benchmark results back, but at minimum writes to a given sink.
- We no longer take sequences of str | BaseReporters, and require the user to loop over the sinks they want to use.
Comments / suggestions welcome.
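To make the proposal concrete, a possible shape for the new class; the name and method set are entirely up for discussion:

class Sink:
    """Writes (and possibly reads back) benchmark records for a single destination."""

    def write(self, record) -> None:
        raise NotImplementedError

    def read(self):
        # Optional: not every sink needs to support reading results back.
        raise NotImplementedError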
TL,DR: Retaining references to parameters in benchmark records prevents garbage collection and wastes memory - how can we do better?
#103 introduced saving the parameters to the records. This is fine for standard Python types, but wasteful for models and datasets which have a large memory footprint. In the worst (and unfortunately common) case, garbage collection is inhibited since the reference counts of models and data that are not needed anymore never drop to zero.
There are a few ideas here:
I'm leaning towards 2), but if the serialization is too difficult, I prefer dropping the parameters again.
As we added in-place deserialization in the Artifact class, the ArtifactCollection is a wrapper around a list with no added functionality. We should therefore remove it.