aai-institute / nnbench
A small framework for benchmarking machine learning models.
Home Page: https://aai-institute.github.io/nnbench/
License: Apache License 2.0
With the quickstart (introductory example based on scikit-learn) merged (see #25), we can follow it up with more thorough guides on different important cornerstones of the package:
Similar to lakefs-spec (aai-institute/lakefs-spec#273 (review)), we would like to make improvements to the API reference documentation generation for the nnbench library. The proposed changes include:
- Updating the gen_api_ref_pages.py script to extract the module docstrings for root-level modules in the package.
- Using the docstring_parser library to parse the docstrings.
- Adding the docstring-parser library to the documentation dependencies in pyproject.toml.
These enhancements aim to provide a more comprehensive and user-friendly API reference documentation for the nnbench library. The inclusion of module docstrings and links to root-level modules will make it easier for users to navigate and understand the library's structure and functionality.
The AbstractBenchmarkRunner.run() interface can also take strings, which prompt a lookup of the corresponding reporter class in the reporter registry.
Similarly to fsspec, we should use an importlib.metadata entry point so that users of the library can register benchmark reporters via pyproject.toml.
Source: https://github.com/fsspec/filesystem_spec/blob/master/fsspec/__init__.py
(This can be simplified for Python>=3.9.)
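A minimal sketch of the consumer side, assuming a hypothetical entry point group name "nnbench.reporters" (the group name and helper function are illustrative, not existing nnbench API):

# In a user's pyproject.toml (hypothetical):
#
# [project.entry-points."nnbench.reporters"]
# duckdb = "my_package.reporters:DuckDBReporter"

from importlib.metadata import entry_points


def load_registered_reporters() -> dict:
    """Collect all reporter classes registered under the 'nnbench.reporters' group."""
    # Uses the Python 3.10+ selection API; older versions need dict-style access.
    return {ep.name: ep.load() for ep in entry_points(group="nnbench.reporters")}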
Add user guide to the documentation:
A few observations on the use of docs versions:
- latest (a pre-release version) is set as the default docs version. This causes the notification banner to be shown; however, the link in the banner points to the default version (i.e., by clicking on it, you are still looking at the pre-release docs).
- There is no stable version. Going forward, this might confuse users, who cannot access the docs for previous stable releases.
I propose adopting the same versioning scheme as for lakeFS-spec, with latest as an alias for the latest stable version and set as the default. Pre-releases are labeled as development in that scheme.
The current decision is that setUp and tearDown tasks/callbacks will not be allowed to modify their arguments in any way, to preserve type safety and integrity. This is achieved by passing the unpacked params dictionaries to the callbacks (requiring that they take the parameter values in their interface), and by making the tasks return None (or at least ignoring their return values in the benchmark runner).
However, we can immediately see that this abstraction is too inflexible for real cases, the most prominent of which are models and datasets (or, more directly speaking, the parameters relevant to >99% of nnbench use cases).
Therefore, we have to come up with an abstraction that allows specifying models as parametrized artifacts (by name, checksum, etc.), retrieving their contents from storage / loading them just in time in a setUp task. In fact, this routine is probably so important that it deserves a one-liner API, so that users who override setUp and tearDown tasks can incorporate artifact sourcing themselves.
Concretely, I am thinking of something like this:
import os
from typing import Any, Generic, TypeVar

import jax.numpy as jnp

from nnbench import benchmark

T = TypeVar("T")


class MyModel:
    pass


class Artifact(Generic[T]):
    value: Any | None = None

    def __init__(self, path: str | os.PathLike[str]) -> None:
        # Save the path for later just-in-time materialization.
        self.path = path

    def materialize(self) -> "Artifact":
        """Load the artifact from storage."""
        ...


def setUp(**kwargs) -> None:
    for k, v in kwargs.items():
        if isinstance(v, Artifact):
            v.materialize()


@benchmark
def accuracy(model: Artifact[MyModel], test_data: Artifact[jnp.ndarray]) -> float:
    pymodel = model.value  # the model is embedded in the artifact as a value
    xdata, ydata = test_data.value
    return jnp.sum(jnp.abs(ydata - pymodel(xdata)))
Bonus points for being able to use artifacts without having to access their value first (e.g. with wrapt?).
We decided to forego test coverage in the past, but we should add it back to establish confidence in our first shipped release.
Concrete steps:
This is currently the missing piece to the full end-to-end experience.
What is left to decide is:
a) Do we use raw functions, or a reporter interface?
b) How do we connect these interfaces to the base benchmark runner's report() API? (-> fsspec-like registry?)
c) If we do a registry for b), can we use pyproject hooks to register user extensions in pyproject.toml?
Raised internally. Repro:
import nnbench
from nnbench import types, context
import pandas as pd
from pathlib import Path


class MyDataSet(types.Artifact):
    def deserialize(self) -> None:
        path = Path(self.path).resolve()
        self._value = pd.read_csv(path, sep="\t")


@nnbench.benchmark()
def my_benchmark(dataset: MyDataSet) -> float:
    return 1.0


if __name__ == "__main__":
    dataset = MyDataSet(types.LocalArtifactLoader(path="ames_housing.csv"))
    runner = nnbench.BenchmarkRunner()
    reporter = nnbench.BenchmarkReporter()
    results = runner.run("example.py", params={
        "dataset": dataset,
    })
    reporter.display(results)
Error:
(.venv) janwillem@macbook-janwillem [nnbench-trial2] % python example.py
Traceback (most recent call last):
  File "/Users/janwillem/Developer/hack/nnbench-trial2/example.py", line 27, in <module>
    results = runner.run("example.py", params={
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/janwillem/Developer/hack/nnbench-trial2/.venv/lib/python3.11/site-packages/nnbench/runner.py", line 239, in run
    self._check(dparams)
  File "/Users/janwillem/Developer/hack/nnbench-trial2/.venv/lib/python3.11/site-packages/nnbench/runner.py", line 130, in _check
    raise TypeError(
TypeError: expected type <class 'example.py.MyDataSet'> for parameter 'dataset', got <class '__main__.MyDataSet'>
This means that the module loading will screw up some of the types. (If I understand correctly, it expects a MyDataSet from the target file, but gets one from the currently executed module, i.e. __main__.)
The module import name is obviously a bug, but the general behavior might be expected, although surprising. Possible actions here include adding a warning/note on using __main__ as the path_or_module qualifier when running in-file benchmarks.
We want to experiment with a global memoization cache, which we want to explicitly clear in a teardown task after a family of benchmarks for a model (say NER on distilbert) has run.
This means that the setup and teardown tasks need to know which benchmark they are currently applied in.
For a single benchmark, say
@nnbench.benchmark
def echo(s: str) -> str:
    print(s)
    return s

the corresponding State should look something like this:
def tearDown(state, **params):  # <- or maybe a mappingproxy of the params?
    print(state)
    # name: "echo"
    # family: "echo" <- this is for parametrized benchmarks, where the family name equals the function name.
    # family_size: 1 <- how many members are in the current benchmark family?
    # family_index: 0 <- which member of the family is it?
More metadata suggestions welcome. After this, we can try evicting a memo from the cache on the condition family_index == family_size - 1.
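A rough sketch of that eviction, assuming the state fields proposed above and a hypothetical module-level memo cache (neither exists in nnbench yet, and whether state is a mapping or a dataclass is still open):

_memo_cache: dict = {}  # hypothetical global cache, keyed by (family, parameter name)


def tearDown(state, **params) -> None:
    # Evict all memos belonging to this family once its last member has run.
    if state["family_index"] == state["family_size"] - 1:
        for key in [k for k in _memo_cache if k[0] == state["family"]]:
            del _memo_cache[key]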
Currently we are overwriting the parameters for the benchmark passed in BenchmarkRunner.run() with the params slot saved on the benchmark. This is not thought through entirely, and needs changing.
Points to consider:
- Whether to keep params on the Benchmark class as a separate slot. All default values can be given either as variable defaults in the benchmark signature, or via params in BenchmarkRunner.run(). (#15)
- Raising an error in _check if a member of any benchmark interface is untyped (i.e. param.annotation == inspect.Parameter.empty). (#18)
- Factoring the interface handling out of the Benchmark class. This should be an nnbench class called Interface etc., which is immutable/frozen, and provides easy access to function interface properties such as varnames (-> inspect.signature.parameters.keys()), vartypes (-> inspect.signature.parameters.values()), tuples of the two previous (-> inspect.signature.parameters.items()), and default values (-> p.default for p in inspect.signature.parameters). (#20)

Similarly to defining a custom name for a single benchmark, we should give an option to give a custom name to each benchmark of a family defined by the @nnbench.parametrize and @nnbench.product decorators.
To prevent collisions, this should not be a static string, but a callable taking the input parameters as a dictionary and returning a string that will subsequently be set as the name of the benchmark for the respective parameter instance.
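For illustration, such a callable could look like the following; the name keyword on the parametrize decorator is hypothetical and not part of the current API:

def per_class_name(params: dict) -> str:
    # e.g. "accuracy[class_label=B-PER]"
    return "accuracy[" + ",".join(f"{k}={v}" for k, v in params.items()) + "]"


@nnbench.parametrize(
    ({"class_label": "B-PER"}, {"class_label": "I-PER"}),
    name=per_class_name,  # hypothetical keyword argument
)
def accuracy(class_label: str) -> float:
    ...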
The contextual struct is so far empty - everything is opt-in. That's cool, but some things are just so common that they might be useful to include by default.
Ideas:
- Timing information (e.g. via time.perf_counter_ns()).
And as for the benchmark JSON:
- error_occurred and error_message should also be included on success (but empty) to stabilize the schema (though we should take care to not print them to the console when everything went fine).
- A description key containing Benchmark.fn.__doc__ (might even be worth saving as a toplevel member like Benchmark.description()?).
- An fn key with the fully qualified function name (i.e. f"{fn.__module__}.{fn.__qualname__}").

As per this TODO:
nnbench/src/nnbench/types/types.py, lines 104 to 119 in 65fc45b
Steps are:
1 - bring back the global cache + lock,
2 - insert values based on the id of the inserting memo,
3 - add a clear_memo_cache API, maybe factoring out everything related into a nnbench.types.memo submodule,
4 - (optional) check if evicting the memoized value in Memo.__del__() does anything (debug logs!),
5 - drop functools.cache decorators everywhere (examples!).
NB: Check that the Memo.__call__() type annotation is not affected by the solution, since that breaks typechecking in benchmarks. (With a decorator, this should not be a problem.)
What would be super nice: showcase a parametrized benchmark with a collection of large numpy array memos, checking if evicting an array after the last run of the family and triggering GC (gc.collect()) does anything. memray is (hopefully) your friend: https://bloomberg.github.io/memray/index.html
-> This needs #125.
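A minimal sketch of steps 1-3, assuming the proposed nnbench.types.memo submodule layout (cache structure, key choice, and locking granularity are all open questions):

# nnbench/types/memo.py (proposed)
import threading
from typing import Any

_memo_cache: dict[int, Any] = {}  # keyed by id() of the inserting memo
_cache_lock = threading.Lock()


def set_memo_value(memo_id: int, value: Any) -> None:
    with _cache_lock:
        _memo_cache[memo_id] = value


def get_memo_value(memo_id: int) -> Any:
    with _cache_lock:
        return _memo_cache[memo_id]


def clear_memo_cache() -> None:
    """Evict all memoized values, e.g. in a teardown task after a benchmark family."""
    with _cache_lock:
        _memo_cache.clear()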
NOTE: this ticket is a subtask of Complete documentation task.
We need a simple and illustrative explanation of the product purpose and usage for the people visiting the product repository for the first time.
We will achieve this by adding a usage example (code snippets / notebook / equivalent) to the product documentation.
This example should focus on illustrating the nnbench mechanism, but at the same time, it should be embedded in a real ML use case.
Ideally, this usage example should make it easy for users to adapt it to their own code and ML use case (after the required adjustments).
Merged here #25
Add user guide to the documentation:
Add integration example as noted here #28
For the time being, we can roll just fine with Benchmark being the only admissible benchmark type.
However, if users want to roll their own runners, they might also want to run their own benchmarks. In that case, all of the decorator facilities in nnbench.core will not do the correct thing anymore, since they unconditionally yield Benchmarks, which would not be collected by the base BenchmarkRunner.collect() method.
A solution is to define a base Benchmarkable class (name TBD, could also be BenchmarkBase or similar), and create conversions to the respective benchmark type via a builder pattern / Python classmethod.
Case in point:
from nnbench import AbstractBenchmarkRunner, Benchmark


class MyBenchmark(Benchmark):
    @classmethod
    def create(cls, b: "Benchmarkable") -> "MyBenchmark":
        ...


class MyBenchmarkRunner(AbstractBenchmarkRunner):
    benchmark_type = MyBenchmark
And then have the nnbench.benchmark and nnbench.parametrize decorators yield Benchmarkables, respectively.
@nnbench.parametrize calculates a multi-way "dot product" of all arguments. However, a cartesian product (i.e. a benchmark grid over the parameters) is also useful.
Demo:
@nnbench.product(
    a=[1, 2, 3, 4, 5],
    b=[True, False],
)
def double_if_b(a: int, b: bool) -> int:
    return a * 2 if b else a
This decorator should produce a list of len(a) * len(b) = 10 benchmarks, and obviously saves space over a 10-way parametrization.
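For clarity, the expansion performed by such a product decorator is the standard cartesian product of the keyword arguments, which can be sketched with itertools:

import itertools

names = ["a", "b"]
values = [[1, 2, 3, 4, 5], [True, False]]
# 10 parameter dicts: {"a": 1, "b": True}, {"a": 1, "b": False}, ..., {"a": 5, "b": False}
grid = [dict(zip(names, combo)) for combo in itertools.product(*values)]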
We've so far rolled pretty well with a bare-dict context, but it might be time to change that.
Recently, we encountered the problem of merging two benchmark records. An easy first solution is inlining the context into the accompanying benchmark list, but that has the problem of not being idempotent.
To fix this, we could merge two context dictionaries by taking the union of the keys and padding missing values with a unique placeholder.
Together with the previously introduced filtering, flattening, and nested getitem methods, this points towards a custom class having value over a bare dictionary.
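A small sketch of the padding idea (the placeholder value and function name are illustrative only):

_MISSING = "<missing>"  # unique placeholder for keys absent from one of the contexts


def pad_to_union(contexts: list[dict]) -> list[dict]:
    # Extend every context to the union of all keys, padding absent values.
    all_keys = set().union(*(c.keys() for c in contexts))
    return [{k: c.get(k, _MISSING) for k in all_keys} for c in contexts]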
TL,DR: I would like to replace this pattern:
record = runner.run(__name__, params={...})
transform = MyTransform()
trecord = transform.apply(record)
FileIO.write(trecord)
with this:
record = runner.run(__name__, params={...})
FileIO.write(record, transform=MyTransform())
Meaning that a single transform can be given and applied to a record in write() / read(), and similarly in the other IO types we have.
These two subtasks are separated out from #66:
Add example or tutorial to the documentation on streaming results to a database with a custom data sink.
Please note: https://duckdb.org/2023/03/03/json.html for getting started with DuckDB.
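A minimal sketch of the DuckDB side, assuming benchmark records have already been written to a JSON file named records.json (file name and table layout are illustrative):

import duckdb

con = duckdb.connect("benchmarks.duckdb")
# read_json_auto infers the schema, including nested context/parameter structs.
con.execute("CREATE TABLE records AS SELECT * FROM read_json_auto('records.json')")
print(con.execute("SELECT * FROM records LIMIT 5").fetchall())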
Part of #24 (comment).
Implementation needs to decide whether or not to give a variadic __init__(*args, **kwargs) or leave it empty; if no default state needs to be present on all reporters, the __init__() can be omitted altogether.
Supersedes BenchmarkRunner.report(...) in favor of reporter = MyReporter(); reporter.report(...).
We have two different test directories with benchmarks in them, because we are testing two different aspects of benchmark collection.
The number of tests for collection will probably increase, so we should come up with a way to improve the test directory setup. An obvious way would be to consolidate all benchmark files into one directory, tag these benchmarks with different test workloads, and then filter by these tags in their respective unit tests.
This has the obvious advantage that we get used to the concept of filtering benchmarks by tags, and have a more intuitive understanding of benchmark collection by our package.
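A sketch of what tagging the test benchmarks could look like, assuming the benchmark decorator and the collection step accept a tags tuple (the exact collection signature is an assumption here):

# tests/benchmarks/standard.py
import nnbench


@nnbench.benchmark(tags=("collection",))
def double(x: int) -> int:
    return 2 * x


# In the unit test, only benchmarks carrying the "collection" tag would be collected, e.g.:
# runner.collect("tests/benchmarks", tags=("collection",))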
During ideation for reporting data in duckDB, I read one of their blog posts (https://duckdb.org/2023/03/03/json.html) that gives an example of loading a large (>10GB) compressed JSON archive into memory.
It would be really nice if we could support this with our file IO. In theory, the following steps would need to happen:
- A compression: str | None = None argument on (read|write) that gives the option of using a compression when writing a record (also prompting a lookup in a dictionary of compression algorithms, like https://github.com/fsspec/filesystem_spec/blob/master/fsspec/utils.py#L138).
- A (read|write)_batched variant that takes all records, reads/writes them to a directory in the given driver mode, and then compresses said directory.

Currently, in our official example, we promote artifacts as a type-safe way to specify not-yet-downloaded remote models and data. However, the DX is not where it needs to be, and I'd argue what we currently have is not worth it over just instantiating the models by hand (which is not preferred, since the user would then need to ensure reproducibility!).
Example: If the library needs me to subclass

class MyModelArtifact(Artifact[MyModel]):  # <- this needs the generic specialization for typing, no?
    ...

and then construct it with a loader such as FSSpecLoader, e.g.

m = MyModelArtifact(FSSpecLoader("s3://my-bucket/my-model.npz"), checksum="12345")  # my model is in compressed numpy in some S3 bucket.
r = nnbench.BenchmarkRunner()
r.run(__file__, params={"model": m.value}, ...)

it's actually more (or at least equal) work than just instantiating it outside.
For the Artifact to have any use over raw models, it should be able to be given to Params in place of a model, as a kind of future, and then a Loader can be given to the BenchmarkRunner.run() method (or its class state, if we do not wish to pollute the run() signature even further).
In this model, we can also let the loader handle caching. Pseudocode:
import functools


class FSSpecLoader:
    @functools.lru_cache  # make sure to alias Artifact.__hash__ to its checksum to get easy cache hits.
    def load(self, a: Artifact[T], verify: bool = True) -> T:
        ...


m = MyModelArtifact("s3://my-bucket/my-model.npz", checksum="12345")
r = nnbench.BenchmarkRunner()
# if raw models are given instead, that will also work and a loader is not required.
record = r.run(__file__, params={"model": m}, artifact_loader=FSSpecLoader)  # <- or an instance.
That makes a lot more sense usage-wise to me. It would mean that Artifacts are privileged types by default in nnbench, and that they bypass certain typechecks on the params by default (to be implemented).
Opinions welcome.
Each of these should inherit directly from BenchmarkReporter and define a nice interface for the respective IO, to be reusable or even usable out of the box for most users.
Pseudocode example:
import json


class FileIOReporter:
    def write_record(self, r):
        ...

    def write_record_batched(self, rb):
        ...

    def read_record(self):
        ...

    def open(self, fp):
        ...

    def close(self, fp):
        ...


class JSONFileReporter(FileIOReporter):
    def open(self, fp):
        with open(fp, "r") as f:
            return json.load(f)

    def close(self, fp):
        fp.close()


class YAMLFileReporter(FileIOReporter):
    ...  # same thing as for JSON, but use YAML read/write APIs.
Objective: Implement the base BenchmarkReporter's interface for a few specific families of reporters, not every reporter by itself.
Bonus: Maybe the IO (open, close) can even become a mixin that makes specific file or database reporters trivial by just acquiring and releasing a handle to write to.
Names should be clear and unique, communicating what they do and where/how to use them.
Candidates for refactoring:
You may encounter a situation where you want to parametrize many benchmarks in the same way.
An example could be label-specific metrics like in the artifact-benchmarking example #92.
@nnbench.parametrize(
    (
        {"class_label": "O"},
        {"class_label": "B-PER"},
        {"class_label": "I-PER"},
        {"class_label": "B-ORG"},
        {"class_label": "I-ORG"},
        {"class_label": "B-LOC"},
        {"class_label": "I-LOC"},
        {"class_label": "B-MISC"},
        {"class_label": "I-MISC"},
    ),
    tags=("metric", "per-class"),
)
def precision_per_class(
    class_label: str,
    model: Module,
    test_dataloader: DataLoader,
    index_label_mapping: dict[int, str],
    padding_id: int = -100,
) -> float:
    ...
To remove code redundancy when this exact pattern applies to multiple benchmarks, it would be desirable to use e.g. functools.partial on the nnbench.parametrize decorator, like so:
per_class_parametrize = functools.partial(
    nnbench.parametrize,
    (
        {"class_label": "O"},
        {"class_label": "B-PER"},
        {"class_label": "I-PER"},
        {"class_label": "B-ORG"},
        {"class_label": "I-ORG"},
        {"class_label": "B-LOC"},
        {"class_label": "I-LOC"},
        {"class_label": "B-MISC"},
        {"class_label": "I-MISC"},
    ),
    tags=("metric", "per-class"),
)


@per_class_parametrize
def precision_per_class(
    class_label: str,
    model: Module,
    test_dataloader: DataLoader,
    index_label_mapping: dict[int, str],
    padding_id: int = -100,
) -> float:
    ...
While the above approach does not throw an error, it also does not work: the benchmarks that are parametrized via functools.partial are not executed.
We should investigate this to allow the reuse of code without repetition.
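Until that is investigated, one pattern worth checking as a workaround is a small helper that creates a fresh decorator per benchmark instead of partially applying it (whether a single decorator instance can be reused across functions is exactly the open question above):

PER_CLASS_LABELS = (
    {"class_label": "O"},
    {"class_label": "B-PER"},
    {"class_label": "I-PER"},
)


def per_class_parametrize():
    # Returns a new nnbench.parametrize decorator on every call.
    return nnbench.parametrize(PER_CLASS_LABELS, tags=("metric", "per-class"))


@per_class_parametrize()
def precision_per_class(class_label: str) -> float:
    ...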
Reported by @samuelburbulla (thanks!).
Simply put: In the lightning demo,
import nnbench


@nnbench.benchmark
def product(a: int, b: int) -> int:
    return a * b


@nnbench.benchmark
def power(a: int, b: int) -> int:
    return a ** b


runner = nnbench.BenchmarkRunner()
# run the above benchmarks with the parameters `a=2, b=10`...
record = runner.run("__main__", params={"a": 2, "b": 10})

rep = nnbench.BenchmarkReporter()
rep.display(record)  # ...and print the results to the terminal.

# results in a table like the following:
# name     function  date                 value  time_ns
# -------  --------  -------------------  -----  -------
# product  product   2024-03-07T10:14:21     20     1000
# power    power     2024-03-07T10:14:21   1024      750
we would expect to find the actual parameter values (a=2, b=10) in the table, since they are important to contextualize results, and provide filtering opportunities (for example, "show me each product result where a=2").
As before, we should think about how to deal with these parameters. Inserting them raw into the result record is the simplest solution, but that again balloons the schema of multiple concurrent experiments taking different sets of parameters.
In contrast, inserting the parameters as a struct/dictionary makes access a little more difficult, but keeps the schema easier (and some modern SQL engines support dotted access for struct columns, see e.g. the duckDB example).
Checklist:
- Record the effective parameters in runner.run() (take the union of params and the defaults from the benchmark interface).

Follow-up from #92:
- Context.add and Context.update
- the runner.run method

Integration with workflow engines, structuring into a benchmarking task (e.g. Flyte, Prefect, Dagster, Beam, ...).
Add user guide to the documentation:
Currently, we have no serious stress test for our record IO in place - the examples all write structs of standard data types, which are really simple to deal with.
This changes potentially with the merge of #103, which adds a parameter struct to the records. That is absolutely necessary for keeping track of what actually went into the benchmarks, but it implicitly sets us up for a serious problem: How do we deal with results for benchmarks that take non-standard data types like models, functions, algorithms etc.?
Consider the following example:
from typing import Any

import nnbench


class MyModel:
    ...


@nnbench.benchmark
def accuracy(m: MyModel, data: Any) -> float:
    ...
Serializing a record coming out of a benchmark run including accuracy can potentially be really challenging, since it is unclear how to represent MyModel in a written record.
There are multiple ways around this. First, there is the option of requiring the user to make their records conform, but this means extra work for them, and can break reproducibility if the chosen representation is itself not reproducible. Adding serializer hooks for custom data types is an option, but it's convoluted and a lot of work.
Then there is the option of making the benchmarks take a unique identifier for the model instead, which is a standard data type (e.g. a hash, a remote URI, etc.), and loading the artifact just in time for the user to access. This should mean easier reading/writing of records, but requires more code for the benchmark setup. It also requires us to change our story for the model artifact benchmarks, where we would need to come up with a way to efficiently instantiate models based on some information.
We should be able to use a setup task for the benchmarks to accomplish this. A cache lookup for the proper artifact, which is loaded before the start of all benchmarks, seems like a good approach.
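A rough sketch of that idea, with a hypothetical identifier-to-model registry and cache (none of these helper names exist in nnbench; the setUp keyword on the decorator reflects the setup-task mechanism discussed above):

from typing import Any

import nnbench

_model_cache: dict[str, Any] = {}


def load_model(model_id: str) -> Any:
    # Placeholder: resolve the identifier (hash, remote URI, ...) to a model object.
    ...


def setUp(**params) -> None:
    # Materialize the model once, before the benchmark runs.
    model_id = params["model_id"]
    if model_id not in _model_cache:
        _model_cache[model_id] = load_model(model_id)


@nnbench.benchmark(setUp=setUp)
def accuracy(model_id: str, data: Any) -> float:
    model = _model_cache[model_id]
    ...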
Add user guide to the documentation: the decorators (@benchmark, @parametrize, @product) - motivation, behavior, and one usage example for each.

Follow-up of #113.
- Make Transform.apply() an abstract method, forcing it to be overridden.
- Make Transform.iapply() raise NotImplementedError if Transform.invertible is false.

Add user guide to the documentation: nnbench.Parameters for type safety.

Currently, with the following example directory structure:
src/benchmarks.py
src/main.py
executing runner.run("benchmarks.py", ...) from inside the src directory results in an importlib error.
Apparently, the ismodule check falls through to the very last import attempt, which fails since the found module does not have a __path__ set.
This is an indication that the discovery branch is backwards - to fix it, we can try calling path.is_{file,dir}() on the input first, and only then try to understand it as a module name.
This should be safe since the Python file has a .py suffix, which eliminates collisions between e.g. __main__ (the module name) and __main__.py.
Also, once it's fixed, docs should be updated.
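A sketch of the proposed ordering, simplified to the file-vs-module decision (the real discovery code also needs to handle directories of benchmark files and already-imported modules):

import importlib
import os
from pathlib import Path


def discover(path_or_module: str | os.PathLike[str]):
    p = Path(path_or_module)
    # Check for a file or directory first; a ".py" suffix cannot collide with a
    # module name such as "__main__".
    if p.is_file() or p.is_dir():
        return p
    # Only then fall back to interpreting the input as a module name.
    return importlib.import_module(str(path_or_module))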
Currently open: MNIST example for basic application in an ML pipeline.
Afterwards: We should add non-trivial examples on things we can accomplish using our tools.
Subtasks are addressed in separate tickets - ordered below by priority:
Right now, our final step (reporting obtained results) is a bit unfinished. We have an AbstractBenchmarkRunner.report() method, which takes all reporters by name, and does not even use the runner's state, so it is (essentially) static.
At the same time, our Reporter class has no state or worthwhile interface - it too gets a report method, which is an awkward copy of the runner's report method.
Maybe we should untangle that and come up with a better abstraction. How about this:
- The runner loses its report() method.
- Instead of BaseReporter, i.e. the argument given in AbstractBenchmarkRunner.report(to=...), we introduce a new class (Sink? DataSource?) that maybe also reads benchmark results back, but at minimum writes to a given sink.
- We no longer take sequences of str | BaseReporters, and require the user to loop over the sinks they want to use.
Comments / suggestions welcome.
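To make the proposal concrete, a possible shape for the new class; the name and method set are entirely up for discussion:

class Sink:
    """Writes (and possibly reads back) benchmark records for a single destination."""

    def write(self, record) -> None:
        raise NotImplementedError

    def read(self):
        # Optional: not every sink needs to support reading results back.
        raise NotImplementedError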
TL,DR: Retaining references to parameters in benchmark records prevents garbage collection and wastes memory - how can we do better?
#103 introduced saving the parameters to the records. This is fine for standard Python types, but wasteful for models and datasets which have a large memory footprint. In the worst (and unfortunately common) case, garbage collection is inhibited since the reference counts of models and data that are not needed anymore never drop to zero.
There are a few ideas here:
I'm leaning towards 2), but if the serialization is too difficult, I prefer dropping the parameters again.
As we added in-place deserialization in the Artifact class, the ArtifactCollection is a wrapper around a list with no added functionality. We should therefore remove it.