zeno-ml / zeno-build
Build, evaluate, understand, and fix LLM-based apps
License: MIT License
We have caching utils: https://github.com/zeno-ml/llm-compare/blob/main/llm_compare/cache_utils.py
But using them can be a bit verbose and opaque.
It would be nice if we could add a decorator like this:
@cache_utils.cache_function("text_classification", load_model)
where the first argument is the task name, and the second argument is the function used to load from the cached files.
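A rough sketch of what such a decorator could look like; the helper name, hashing scheme, and on-disk layout here are hypothetical, not the current cache_utils API:

import functools
import hashlib
import json
import os
import pickle


def cache_function(task_name, load_fn=None, cache_dir="cache"):
    """Hypothetical decorator: cache a function's return value on disk, keyed by its arguments."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Hash the call arguments to build a stable cache key.
            key = hashlib.sha1(
                json.dumps([args, sorted(kwargs.items())], default=str).encode()
            ).hexdigest()
            path = os.path.join(cache_dir, task_name, key + ".pkl")
            if os.path.exists(path):
                # Reload using the provided loader, or unpickle directly.
                if load_fn is not None:
                    return load_fn(path)
                with open(path, "rb") as f:
                    return pickle.load(f)
            result = func(*args, **kwargs)
            os.makedirs(os.path.dirname(path), exist_ok=True)
            with open(path, "wb") as f:
                pickle.dump(result, f)
            return result

        return wrapper

    return decorator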
Currently model names are displayed as model0, model1, but it'd be nice to have a better way of displaying them.
One suggestion is to display all of the non-constant parameters.
For example in the chatbots example, the prompt, model, and temperature are variable, so we could display those three parameters.
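As an illustration, here is a small helper (purely illustrative, not existing Zeno Build code) that keeps only the parameters whose values differ across runs and joins them into a display name:

def display_names(configs: list[dict]) -> list[str]:
    # Keep only the parameters whose values differ across runs.
    varying = [
        key
        for key in configs[0]
        if len({str(cfg[key]) for cfg in configs}) > 1
    ]
    return [
        " ".join(f"{key}={cfg[key]}" for key in varying) or "default"
        for cfg in configs
    ]


configs = [
    {"prompt": "standard", "model": "gpt-3.5-turbo", "temperature": 0.3},
    {"prompt": "standard", "model": "vicuna-7b", "temperature": 0.7},
]
print(display_names(configs))
# ['model=gpt-3.5-turbo temperature=0.3', 'model=vicuna-7b temperature=0.7']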
For locally hosted models from Hugging Face, it would be good to support multi-GPU inference.
Currently inference is handled using the Hugging Face provider:
Any code to support multi-GPU inference would have to be added there. Contributions are welcome!
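One possible approach, sketched below assuming a Hugging Face causal LM: let accelerate shard the model across all visible GPUs via device_map="auto" (requires pip install accelerate). The model name is just an example.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "lmsys/vicuna-7b-v1.3"  # example model, not prescribed by the issue
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",          # spread layers across available GPUs
    torch_dtype=torch.float16,  # half precision to fit larger models
)

inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))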
We should change our report doc to a public-facing page.
Currently experiments are run serially, but being able to parallelize them would make life easier. We should consider how to do this.
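One possible sketch, using a thread pool over independent runs; run_experiment and the config format here are illustrative, not the actual Zeno Build API:

from concurrent.futures import ThreadPoolExecutor


def run_experiment(config):
    # Placeholder for training/evaluating one configuration.
    return {"config": config, "accuracy": 0.0}


configs = [{"temperature": t} for t in (0.3, 0.5, 0.7)]
with ThreadPoolExecutor(max_workers=4) as pool:
    # Each configuration runs in its own worker thread.
    results = list(pool.map(run_experiment, configs))
print(results)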
In the text classification demo, right now it seems that the model names are just numbers, so it's hard to tell which model is which.
We should think about model naming. Here are some ideas:
Currently, all of the examples run Zeno Build from scratch, but it'd also be nice to demonstrate how Zeno Build can be used to analyze existing results.
One candidate would be to analyze the GPT-MT results from Microsoft: https://github.com/microsoft/gpt-MT
Implement an end-to-end example of a parameter sweep with text classification.
We have a script to aggregate results together, but it's not well documented. It should be.
https://github.com/zeno-ml/zeno-build/blob/main/zeno_build/reporting/aggregate_results.py
It'd be nice to get feedback from a few people, e.g. at CMU, on what they think.
We can find interesting outputs that are evaluated in different ways by the different evaluation metrics.
These should be recorded somewhere, such as a Google Sheet or Notion.
Vizier optimization isn't extensively tested yet because its core code requires JAX, which was not easily installable on my MacBook. It should be tested and tweaked if necessary.
Within the prompt gym, we have an example of summarization with API-based models, with evaluation using Critique. I will implement an example of this next.
Right now Zeno Build supports CombinatorialSearchSpace, which takes the cross product between all parameter configurations.
However, in many cases it's common to run multiple experiments, where you explore some part of the experiment space in the first experiment, and another part of the search space in the second experiment.
A current workaround is to create two different configuration files and decide which one to use, or to run both sequentially. Another option is to specify this directly in the search space with something like:
space = CompositeSearchSpace([
    CombinatorialSearchSpace({...}),  # experiment 1
    CombinatorialSearchSpace({...}),  # experiment 2
])
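A minimal sketch of what such a CompositeSearchSpace might look like; the interface is assumed, and DummySpace merely stands in for CombinatorialSearchSpace. It simply delegates sampling to one of its sub-spaces.

import random


class CompositeSearchSpace:
    """Sketch: a search space formed as the union of several sub-spaces."""

    def __init__(self, spaces, weights=None):
        self.spaces = spaces
        # Optional weights controlling how often each sub-space is sampled.
        self.weights = weights or [1.0] * len(spaces)

    def sample(self):
        space = random.choices(self.spaces, weights=self.weights, k=1)[0]
        return space.sample()


class DummySpace:
    """Stand-in for CombinatorialSearchSpace, just for this example."""

    def __init__(self, params):
        self.params = params

    def sample(self):
        return {name: random.choice(values) for name, values in self.params.items()}


space = CompositeSearchSpace([
    DummySpace({"model": ["gpt-3.5-turbo"], "temperature": [0.3, 0.7]}),   # experiment 1
    DummySpace({"model": ["vicuna-7b"], "prompt": ["standard", "witty"]}),  # experiment 2
])
print(space.sample())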
We'll probably want hosted Zeno instances to demonstrate any interesting results that we get out of our experiments. Where should we host these? Huggingface spaces?
Find at least 3 (target 5) interesting trends that can be demonstrated by browsing the results.
Write these up in a doc, e.g. a Google Doc or Notion page, so they can be posted as tweets.
Currently outputs only use a short conversational context; we should re-generate them with the full context.
Here is the FastChat dataset used in assessing Vicuna:
https://github.com/lm-sys/FastChat/tree/main/fastchat/eval/table
Vicuna uses a prompt-based metric for evaluation; maybe it should be implemented as well?
Currently the parameter sweep code is a bit opaque and it's hard to tell what goes on inside. It'd probably be a good idea to make the programming interface similar to Vizier:
Most generative models can provide an uncertainty level, and it would be interesting to be able to explore things, such as the correlation between model certainty and accuracy.
In order to do this, we would first need to modify the generate functions, such as in chat_generate.py, to output model uncertainties as well as strings.
This is certainly possible for Hugging Face models, and may be possible for API-based models.
Once these confidences are returned by the generate function, they would need to be passed on to the dataframe that is fed into Zeno for each individual example, such as in the chatbot or summarization examples. For reference, the dataframe for the chatbot example is constructed here.
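As a sketch of the Hugging Face side (gpt2 is only a stand-in model), one could return a sequence-level log-probability alongside the generated text using output_scores and compute_transition_scores; API-based models would need a different mechanism, e.g. logprobs in the API response where available.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The weather today is", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=20,
    do_sample=True,
    return_dict_in_generate=True,  # keep scores alongside the token ids
    output_scores=True,
)

# Per-token log-probabilities of the generated continuation.
transition_scores = model.compute_transition_scores(
    outputs.sequences, outputs.scores, normalize_logits=True
)
text = tokenizer.decode(outputs.sequences[0], skip_special_tokens=True)
avg_logprob = transition_scores[0].mean().item()
print(text, avg_logprob)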
We should add LangChain as a provider for the chatbot task. It could be done by adding it to the chat_generate.py file:
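A hedged sketch of what LangChain-backed generation might look like, using the ChatOpenAI interface from the langchain package as of mid-2023 (how it would be wired into chat_generate.py is not shown here):

from langchain.chat_models import ChatOpenAI
from langchain.schema import HumanMessage, SystemMessage

chat = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.7)
messages = [
    SystemMessage(content="You are a helpful assistant."),
    HumanMessage(content="Hello, how are you?"),
]
response = chat(messages)  # returns an AIMessage
print(response.content)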
We should decide what demo we will do for text classification.
We can create an example task for "chat your data" like is implemented in LangChain.
For the chatbots demo, we could support other evaluation metrics such as uni_eval for dialog evaluation, which may give us better insights.
Currently metrics are implemented in multiple places. This is a potential source of bugs and inconsistencies, so we should deduplicate them and rely on the Zeno implementations.
Zeno Build should be pushed to PyPI.
We need to write up the main README. I can take a first stab at it.
We now have three examples of tasks, so we can start consolidating the code to reduce copy-pasting across the different tasks.
Currently CI is not running unit tests. We should fix this.
OpenAI supports asynchronous requests: https://github.com/openai/openai-python/blob/75c90a71e88e4194ce22c71edeb3d2dee7f6ac93/openai/api_resources/chat_completion.py#L33
Using this has the potential to greatly increase the efficiency of making calls to OpenAI, so we could try implementing some infrastructure to make that work.
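A sketch of how concurrent requests could look with the pre-1.0 openai-python async API (ChatCompletion.acreate), leaving out the rate limiting and retries the real infrastructure would need:

import asyncio
import openai


async def generate(prompt: str) -> str:
    response = await openai.ChatCompletion.acreate(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response["choices"][0]["message"]["content"]


async def main(prompts: list[str]) -> list[str]:
    # Fire all requests at once and wait for them together.
    return await asyncio.gather(*(generate(p) for p in prompts))


prompts = ["Hello!", "What is Zeno Build?"]
print(asyncio.run(main(prompts)))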
Many Zeno functions will be reusable across tasks, and can be passed into the visualize function.
For example, things like text length, unique words, etc. could be useful distill functions we want to use.
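Illustrative examples of such functions, written as plain pandas helpers rather than in the exact Zeno distill signature (the column name "output" is just a placeholder):

import pandas as pd


def text_length(df: pd.DataFrame, column: str = "output") -> pd.Series:
    # Number of characters in each output.
    return df[column].str.len()


def unique_words(df: pd.DataFrame, column: str = "output") -> pd.Series:
    # Number of distinct whitespace-separated tokens in each output.
    return df[column].str.split().map(lambda words: len(set(words)))


df = pd.DataFrame({"output": ["hello world", "the cat sat on the mat"]})
print(text_length(df).tolist(), unique_words(df).tolist())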
We might want to categorize this folder by task or data type.
Text classification outputs are not yet visualized using Zeno.
Currently libraries such as OpenAI, Cohere, and Hugging Face are used in the core library code indiscriminately. However, it would be better for at least OpenAI and Cohere to be required only if they are actually being used. To do this, we can use dynamic imports and warn users that they need to install the library.
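A possible sketch of the dynamic-import pattern (the helper name is hypothetical): the optional dependency is imported lazily, and a clear message is raised only when it is actually needed.

def _require_openai():
    try:
        import openai  # imported lazily so it is only needed when used
    except ImportError as err:
        raise ImportError(
            "The `openai` package is required for OpenAI-based models. "
            "Install it with `pip install openai`."
        ) from err
    return openai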
A few people have asked about adding models to the chatbot report, so we should do that!
In some experiments we just want to do a complete search over the entire search space and enumerate all of the possibilities.
Currently we only have the RandomOptimizer, but we should also have an ExhaustiveOptimizer that does an exhaustive search.
This optimizer will be incompatible with Float search spaces, as they are not able to be enumerated.
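A rough sketch of the enumeration an ExhaustiveOptimizer could perform over a combinatorial space, using itertools.product; the dict-of-lists representation here is an assumption, not the real search-space classes:

import itertools


def enumerate_configs(space: dict) -> list[dict]:
    # Enumerate every combination of the discrete values in `space`.
    keys = list(space)
    return [
        dict(zip(keys, values))
        for values in itertools.product(*(space[k] for k in keys))
    ]


space = {"model": ["gpt-3.5-turbo", "vicuna-7b"], "temperature": [0.3, 0.7]}
for config in enumerate_configs(space):
    print(config)  # 4 configurations in total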
Currently the protobuf library is pinned at a specific version:
This is due to model loading errors if we use a more modern version.
But it'd be nice to relax this requirement, or at least move the pin to a newer version.
When running Zeno and trying to visualize results, I get the following difficult-to-understand error.
@cabreraalex any ideas about how to fix this?
ERROR: Exception in ASGI application
Traceback (most recent call last):
File "/Users/gneubig/opt/anaconda3/envs/llm_compare/lib/python3.10/site-packages/uvicorn/protocols/http/h11_impl.py", line 429, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/Users/gneubig/opt/anaconda3/envs/llm_compare/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 78, in __call__
return await self.app(scope, receive, send)
File "/Users/gneubig/opt/anaconda3/envs/llm_compare/lib/python3.10/site-packages/fastapi/applications.py", line 276, in __call__
await super().__call__(scope, receive, send)
File "/Users/gneubig/opt/anaconda3/envs/llm_compare/lib/python3.10/site-packages/starlette/applications.py", line 122, in __call__
await self.middleware_stack(scope, receive, send)
File "/Users/gneubig/opt/anaconda3/envs/llm_compare/lib/python3.10/site-packages/starlette/middleware/errors.py", line 184, in __call__
raise exc
File "/Users/gneubig/opt/anaconda3/envs/llm_compare/lib/python3.10/site-packages/starlette/middleware/errors.py", line 162, in __call__
await self.app(scope, receive, _send)
File "/Users/gneubig/opt/anaconda3/envs/llm_compare/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
raise exc
File "/Users/gneubig/opt/anaconda3/envs/llm_compare/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
await self.app(scope, receive, sender)
File "/Users/gneubig/opt/anaconda3/envs/llm_compare/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 21, in __call__
raise e
File "/Users/gneubig/opt/anaconda3/envs/llm_compare/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 18, in __call__
await self.app(scope, receive, send)
File "/Users/gneubig/opt/anaconda3/envs/llm_compare/lib/python3.10/site-packages/starlette/routing.py", line 718, in __call__
await route.handle(scope, receive, send)
File "/Users/gneubig/opt/anaconda3/envs/llm_compare/lib/python3.10/site-packages/starlette/routing.py", line 443, in handle
await self.app(scope, receive, send)
File "/Users/gneubig/opt/anaconda3/envs/llm_compare/lib/python3.10/site-packages/fastapi/applications.py", line 276, in __call__
await super().__call__(scope, receive, send)
File "/Users/gneubig/opt/anaconda3/envs/llm_compare/lib/python3.10/site-packages/starlette/applications.py", line 122, in __call__
await self.middleware_stack(scope, receive, send)
File "/Users/gneubig/opt/anaconda3/envs/llm_compare/lib/python3.10/site-packages/starlette/middleware/errors.py", line 184, in __call__
raise exc
File "/Users/gneubig/opt/anaconda3/envs/llm_compare/lib/python3.10/site-packages/starlette/middleware/errors.py", line 162, in __call__
await self.app(scope, receive, _send)
File "/Users/gneubig/opt/anaconda3/envs/llm_compare/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
raise exc
File "/Users/gneubig/opt/anaconda3/envs/llm_compare/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
await self.app(scope, receive, sender)
File "/Users/gneubig/opt/anaconda3/envs/llm_compare/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 21, in __call__
raise e
File "/Users/gneubig/opt/anaconda3/envs/llm_compare/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 18, in __call__
await self.app(scope, receive, send)
File "/Users/gneubig/opt/anaconda3/envs/llm_compare/lib/python3.10/site-packages/starlette/routing.py", line 718, in __call__
await route.handle(scope, receive, send)
File "/Users/gneubig/opt/anaconda3/envs/llm_compare/lib/python3.10/site-packages/starlette/routing.py", line 276, in handle
await self.app(scope, receive, send)
File "/Users/gneubig/opt/anaconda3/envs/llm_compare/lib/python3.10/site-packages/starlette/routing.py", line 66, in app
response = await func(request)
File "/Users/gneubig/opt/anaconda3/envs/llm_compare/lib/python3.10/site-packages/fastapi/routing.py", line 237, in app
raw_response = await run_endpoint_function(
File "/Users/gneubig/opt/anaconda3/envs/llm_compare/lib/python3.10/site-packages/fastapi/routing.py", line 165, in run_endpoint_function
return await run_in_threadpool(dependant.call, **values)
File "/Users/gneubig/opt/anaconda3/envs/llm_compare/lib/python3.10/site-packages/starlette/concurrency.py", line 41, in run_in_threadpool
return await anyio.to_thread.run_sync(func, *args)
File "/Users/gneubig/opt/anaconda3/envs/llm_compare/lib/python3.10/site-packages/anyio/to_thread.py", line 31, in run_sync
return await get_asynclib().run_sync_in_worker_thread(
File "/Users/gneubig/opt/anaconda3/envs/llm_compare/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
return await future
File "/Users/gneubig/opt/anaconda3/envs/llm_compare/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 867, in run
result = context.run(func, *args)
File "/Users/gneubig/opt/anaconda3/envs/llm_compare/lib/python3.10/site-packages/zeno/server.py", line 116, in get_filtered_table
return zeno.get_filtered_table(req)
File "/Users/gneubig/opt/anaconda3/envs/llm_compare/lib/python3.10/site-packages/zeno/backend.py", line 587, in get_filtered_table
return filt_df[[str(col) for col in req.columns]].to_json(orient="records")
File "/Users/gneubig/opt/anaconda3/envs/llm_compare/lib/python3.10/site-packages/pandas/core/generic.py", line 2532, in to_json
return json.to_json(
File "/Users/gneubig/opt/anaconda3/envs/llm_compare/lib/python3.10/site-packages/pandas/io/json/_json.py", line 181, in to_json
s = writer(
File "/Users/gneubig/opt/anaconda3/envs/llm_compare/lib/python3.10/site-packages/pandas/io/json/_json.py", line 237, in __init__
self._format_axes()
File "/Users/gneubig/opt/anaconda3/envs/llm_compare/lib/python3.10/site-packages/pandas/io/json/_json.py", line 301, in _format_axes
raise ValueError(
ValueError: DataFrame columns must be unique for orient='records'.
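Not a confirmed fix, but the error suggests the dataframe handed to Zeno contains duplicate column names. A quick way to check for and drop them before visualizing (toy dataframe shown for illustration):

import pandas as pd

# Toy dataframe with a duplicated column name, mimicking the failure mode.
df = pd.DataFrame([[1, 2, 3]], columns=["id", "label", "label"])
print("Duplicate columns:", list(df.columns[df.columns.duplicated()]))
df = df.loc[:, ~df.columns.duplicated()]  # keep the first occurrence of each
print(df.to_json(orient="records"))       # now succeeds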
Some people may feel more comfortable getting in touch through private channels such as email. Perhaps we should create an email address for Zeno Build, or just Zeno in general.
Set up CI
We now have chatbots implemented as of #30
We should:
This will require finishing several sub-issues, which we can add and link to this issue.
Currently we do not have any automatically generated documentation, including API docs. It would be nice to have this.
The main Zeno page has this, so maybe we could use the same method?
I'm not sure what to do with it. I made the following changes to the original file and ran the code in VS Code: do_prediction works, but do_visualization doesn't.
I modified these hyperparameters.
I took the 'l' out of it and turned it into
I changed the address.
I added this code at the very beginning of the file.
The problem that appeared before no longer appears, but now there is a new problem I don't know how to solve.
For some reason, there are duplicate models in the results. This should be investigated.
For instance, there are 10 models in the "results" file, but only 4 after deduplication.
Visualizing 10 models
-1960601404797368550 {'training_dataset': 'sst2', 'base_model': 'bert-base-uncased', 'learning_rate': 7.032465170166586e-05, 'num_train_epochs': 3, 'weight_decay': 0.0013326587635397158, 'bias': 0.9619960134236489}
8087707616790777082 {'training_dataset': 'imdb', 'base_model': 'distilbert-base-uncased', 'learning_rate': 0.0007441349947622346, 'num_train_epochs': 2, 'weight_decay': 0.0022321073814882274, 'bias': 0.4729424283280248}
974984753089123939 {'training_dataset': 'imdb', 'base_model': 'bert-base-uncased', 'learning_rate': 4.1464852686965755e-05, 'num_train_epochs': 1, 'weight_decay': 0.0021863797480360337, 'bias': 0.010710576206724776}
-5709593449973764166 {'training_dataset': 'imdb', 'base_model': 'distilbert-base-uncased', 'learning_rate': 0.0007188594167931795, 'num_train_epochs': 4, 'weight_decay': 0.002204406220406967, 'bias': 0.17853136775181744}
We should decide what demo we will do for text summarization.
I was using openai_utils.py for ChatGPT inference. It always worked fine for the first couple hundred samples, and then it always crashed with the following error. I tried lowering request_per_minute, but the problem persists.
Unclosed client session
client_session: <aiohttp.client.ClientSession object at 0x000001FC0B9CFD10>
92%|████████████████████████████████████████████████████████████████████████ | 1360/1473 [10:27<00:52, 2.17it/s]
Unclosed connector
connections: ['[(<aiohttp.client_proto.ResponseHandler object at 0x000001FC0C137AF0>, 482899.734), (<aiohttp.client_proto.ResponseHandler object at 0x000001FC0C2A5550>, 482900.89)]']
connector: <aiohttp.connector.TCPConnector object at 0x000001FC0B9CFD50>
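This isn't a confirmed diagnosis, but the "Unclosed client session" warning generally means an aiohttp ClientSession was created and never closed. The usual remedy is to manage the session with an async context manager, sketched below (hypothetical code, not the current openai_utils.py implementation):

import aiohttp
import asyncio


async def fetch(url: str) -> int:
    # `async with` guarantees the session and its connector are closed,
    # even if a request raises.
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            return response.status


print(asyncio.run(fetch("https://example.com")))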