microsoft / adatest

Find and fix bugs in natural language machine learning models using adaptive testing.

License: MIT License


adatest's Introduction

adaptive-testing

adaptive-testing uses language models against themselves to build suites of unit tests. It is an interactive (and fun!) process between a user and a language model that results in a tree of unit tests specifically adapted to the model you are testing. Fixing any failed tests with fine-tuning then leads to an iterative debugging process similar to traditional software development. See the paper for details.

[figure: the adaptive-testing loop]
Note: adaptive-testing is currently a beta release, so please share any issues you encounter.

Install

pip install adatest

Sentiment analysis example

adaptive-testing can test any NLP model you can call through a Python function; here we will test a basic open source sentiment analysis model. Since adaptive-testing relies on a generative language model to help you create tests, you need to specify which generative model it will use: here we use GPT-3 from OpenAI, or GPT-Neo locally. Tests are organized into a test tree that follows the DataFrame API and is laid out like a file system. Here we create a new empty tree, but you can also start from a previous test tree that targets a similar task. The core adaptive-testing loop starts when you call the .adapt() method on a test tree, passing the model(s) you want to test and the backend generator you want to use. The code for all of this is below:

import transformers
import adatest

# create a HuggingFace sentiment analysis model
classifier = transformers.pipeline("sentiment-analysis", return_all_scores=True)

# specify the backend generator used to help you write tests
# (OPENAI_API_KEY is your OpenAI API key string)
generator = adatest.generators.OpenAI('curie', api_key=OPENAI_API_KEY)

# ...or you can use an open source generator
#neo = transformers.pipeline('text-generation', model="EleutherAI/gpt-neo-125M")
#generator = adatest.generators.Transformers(neo.model, neo.tokenizer)

# create a new test tree
tests = adatest.TestTree("hotel_reviews.csv")

# adapt the tests to our model to launch a notebook-based testing interface
# (wrap with adatest.serve to launch a standalone server)
tests.adapt(classifier, generator, auto_save=True)
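
If you are running outside a notebook, the comment above suggests wrapping the call with adatest.serve to launch a standalone server. A minimal sketch of what that might look like (the host and port values are just examples, and the exact serve signature may differ between adatest versions):

# launch a standalone server instead of the notebook widget
adatest.serve(tests.adapt(classifier, generator, auto_save=True), host="127.0.0.1", port=8080)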

[screenshot]

Once we have launched a test tree browser, we can use the interface to create new topics and tests. Here we create the topic "/Clear positives/Location" to test how well this model classifies clearly positive statements about a hotel's location. We then add a few starting examples of what we want to see in this topic (clearly positive statements about hotel location):

[screenshot]

Each test consists of a model input, a model output, a pass/fail label, and a score for the current target model. The input text should fall within the scope of the current topic, which here means it is a clearly positive statement about hotel locations. The output text is what the target model we are testing generated (or it can be specified manually, in which case it turns light grey to show it does not reflect the current model behavior). The label is a pass/fail indicator that denotes whether the model output is correct with respect to the aspect being tested in the current topic; in our case the model was correct for all the inputs we entered. The model score captures whether the target model passes or fails the test and how confident it is in the current output.
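
Because the test tree follows the DataFrame API and (with auto_save=True) is written back to hotel_reviews.csv, these same fields can also be inspected outside the browser. A rough sketch with pandas; the column names are assumptions, so check a saved tree for the exact schema:

import pandas as pd

# peek at the saved test tree as a plain DataFrame
tree_df = pd.read_csv("hotel_reviews.csv")
print(tree_df.head())

# e.g. list the tests added under the location topic
# (the "topic" column name is an assumption about the saved schema)
print(tree_df[tree_df["topic"] == "/Clear positives/Location"])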

Note that in the above figure all the label indicators are hollow; this means that we have not yet labeled these examples, and adaptive-testing is just guessing that they are correct. They are all correct, so we can click the checkmarks to confirm and label them all. By confirming we teach adaptive-testing more about what we want this topic to test, so it becomes better at predicting future labels and hence at automating the testing process. Once we have labeled these examples we can click "Suggestions" and adaptive-testing will attempt to write new in-topic examples for us, labeling them and sorting them by score so that the most likely failures appear at the top of the list.

[screenshot]

Starting at the top of the list we can confirm or change the label for each suggestion and so add it to the current topic (like marking "very convenient for walking" -> "POSITIVE" as correct model behavior), while rejecting (or just ignoring) examples that don't belong in the current topic (like "Second visit", which is not about a hotel's location). After we have added some new suggestions to the current topic (we normally only bother to look at the top few) we can repeat the process by clicking "Suggestions" again. Repeating this a few times lets adaptive-testing learn from our feedback and hill-climb towards better and better suggestions (ones that are more likely to be on-topic and to reveal model failures). A few rounds of this reveals lots of bugs in the model related to positive hotel location statements.

[screenshot]

Once we have tested the location aspect enough we can repeat the process for a new aspect of model behavior, for example comments about hotel swimming pools or gyms. The space of possible concepts for hotel reviews is large, so to help explore it adaptive-testing can suggest new topics once we have a few examples:

[screenshot]

After we accept some of these new topic suggestions we can open them and fill them out without even writing seed examples; adaptive-testing can suggest new tests inside an empty topic just from the examples in other topics and the current topic's name.

[screenshot]

This is just a short example of how to find bugs in a sentiment analysis model, but the same process can be applied to any NLP model (even ones that generate free-form text). Test trees can be adapted to new models and shared with others collaboratively (they are just CSV files). Once you have found enough bugs, you can fine-tune your model on a mixture of your test tree and the original training data to fix all the bugs in the test tree while retaining performance on your original training data (we will share a full demo notebook of this soon).
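
We have not reproduced that fine-tuning step here, but as a rough sketch of the data-mixing idea (the file names and (text, label) schema below are assumptions, and the actual fine-tuning is left to whatever training pipeline you already use):

import pandas as pd

# assumed inputs, purely for illustration:
#   test_tree_examples.csv     -- (text, label) pairs derived from your labeled tests
#   original_training_data.csv -- your original (text, label) training data
tests_df = pd.read_csv("test_tree_examples.csv")
train_df = pd.read_csv("original_training_data.csv")

# mix the two sources and shuffle, then fine-tune with your usual training pipeline
mixed = pd.concat([train_df, tests_df], ignore_index=True).sample(frac=1.0, random_state=0)
mixed.to_csv("mixed_training_data.csv", index=False)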

Citation

If you find adaptive-testing or test trees useful in your work, feel free to cite our ACL paper: Adaptive Testing and Debugging of NLP Models (Ribeiro & Lundberg, ACL 2022).

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.

adatest's People

Contributors

daybarr, dependabot[bot], harsha-nori, marcotcr, microsoftopensource, nking92, riedgar-ms, slundberg


adatest's Issues

Test tree not firing up in browser

Hi Team! I was trying to get this up and running. I followed the instructions and noticed that a tree does not open up in the browser and the process ends without any error. Is there any step that I am missing? I assume the process should ideally stay open since this is the server? I am using an anaconda environment. I would be happy to provide more details.

I have tried both the package and installing directly from source; the result is the same in both cases.

TypeError: 'TestTree' object is not callable

Running the sample notebook "Testing Models -- Sentence Pair Classification".
Following code:

# Launch AdaTest!

tests.adapt(
    model,
    generator=gen_model,
    auto_save=True, # Set to "True" to automatically save tests as they are made.
)

# Optionally:
#adatest.serve(tests(model, generator=gen_model, auto_save=True), host='127.0.0.1', port=8080)

This doesn't generate an interactive widget through the VSCode or AzureML Notebook experience, which is probably not surprising. However, when uncommenting the .serve function to generate a standalone client, I get the error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[22], line 10
      3 tests.adapt(
      4     model,
      5     generator=gen_model,
      6     auto_save=True, # Set to "True" to automatically save tests as they are made.
      7 )
      9 # Optionally:
---> 10 adatest.serve(tests(model, generator=gen_model, auto_save=True), host='127.0.0.1', port=8080)

TypeError: 'TestTree' object is not callable

Running in Jupyter, the widget loads, but then just spins when clicking 'Suggest tests'.
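
For the TypeError, a hedged guess at a fix based on the adatest.serve comment in the README above (not verified against this adatest version): pass the object returned by tests.adapt(...) to adatest.serve rather than calling the test tree itself:

# likely intended usage: serve the result of adapt() instead of calling the TestTree
adatest.serve(tests.adapt(model, generator=gen_model, auto_save=True), host='127.0.0.1', port=8080)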

Cannot import adatest in python 3.10

I got the following error with "import adatest":

partially initialized module 'adatest' has no attribute 'generators' (most likely due to a circular import)

OPT models currently unsupported as generator

Using any OPT model from the HuggingFace Transformers library (ex: https://huggingface.co/facebook/opt-350m) as a generator currently raises an exception when attempting to generate suggestions:

Error:

File ".../python3.9/site-packages/transformers/models/opt/modeling_opt.py", line 235, in forward
    raise ValueError(
        ValueError: Attention mask should be of size (1, 1, 0, 52), but is torch.Size([1, 1, 1, 1]
)

Workaround: We're actively looking into this, and recommend using GPT Neo (https://huggingface.co/docs/transformers/model_doc/gpt_neo) as an alternative:

import adatest
import transformers

# gen_model = "facebook/opt-125m"  # Currently unsupported
gen_model = "EleutherAI/gpt-neo-125M"
opt_gen = transformers.pipeline('text-generation', model=gen_model)
generator = adatest.generators.Transformers(opt_gen.model, opt_gen.tokenizer)

adatest.serve(tests(model, generator, auto_save=True))

Create a flexible template-based test structure

Right now all tests are constrained to have the form Prefix "value1" comparator "value2". While this supports the current set of tests, it is a bit obtuse (odd that changing the comparator changes the meaning of the values) and does not support any test types with more than two values. We don't want to change this after release since it would cause compatibility issues. Here is a proposal for a new system:

Allow for a free form test format with placeholders for values and outputs. Such as

  1. {} should not output {}
  2. {} should output {}
  3. {} should have the same output as {}
  4. {} should be more {} than {}
  5. {} should match {} more than {}
  6. etc.

These can all be called "test_format" instead of "comparator". Scorers will need to support each test format specifically, but need not support all of them if they are not used. The columns value1, value2, etc. match the {} placeholders in order.
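
To make the proposal concrete, here is a purely illustrative sketch of how a couple of tests might be stored under the proposed schema (this is not an implemented format; the column names are just the ones suggested above):

# illustrative only: rows under the proposed free-form test_format schema
proposed_rows = [
    {"test_format": "{} should output {}",
     "value1": "The room was spotless.", "value2": "POSITIVE", "value3": None},
    {"test_format": "{} should be more {} than {}",
     "value1": "Great location right on the beach!", "value2": "positive",
     "value3": "The location was okay."},
]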

Data files are not available for notebooks to run

Hi,

Thanks for the great system you designed! It is a great improvement over your previous CheckList system. I cannot wait to try out your system in my project!

However, running the sample notebooks requires some external .csv files that do not seem to be provided in the notebooks/ directory or anywhere else in your repository. Specifically,

  • Testing Models -- Sentence Pair Classification.ipynb requires sequence_classification_tests.csv.
  • Two way sentiment analysis.ipynb requires hotel_reviews.csv.

Could these two files be provided? If that is not possible, could you explain how to create a .csv with the same schema from an existing sentiment classification or NLI dataset?

Best,
Guanqun
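
A hedged suggestion in the meantime (not from the maintainers): judging from the README above, passing a file path to adatest.TestTree appears to create a new empty tree backed by that file, so the notebooks may run without the provided CSVs if you seed the topics yourself in the browser:

import adatest

# create a new, empty test tree backed by the (otherwise missing) CSV file,
# then add seed examples interactively instead of relying on the provided data
tests = adatest.TestTree("sequence_classification_tests.csv")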

pip install not working

Just tried a pip install adatest in a fresh conda environment with Python 3.9:

(adt-39) riedgar@jormungand:~$ pip install adatest
Collecting adatest
  Using cached adatest-0.3.0.tar.gz (67 kB)
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error

  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [6 lines of output]
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/tmp/pip-install-8a8ikdp8/adatest_00e6da821d3a46719714104947654cc0/setup.py", line 18, in <module>
          with open('requirements.txt') as f:
      FileNotFoundError: [Errno 2] No such file or directory: 'requirements.txt'
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

Looks like requirements.txt didn't make it into the source distribution (adatest-0.3.0.tar.gz)?
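
One possible workaround (untested here) until a fixed package is published is to install directly from the GitHub repository, where setup.py can find requirements.txt:

pip install git+https://github.com/microsoft/adatest.git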

The UI fails to load

Adatest looks like a promising library and has piqued my interest since I'm researching model debugging methods for deep learning systems. I ran the notebooks provided in the Notebooks folder; however, the UI fails to load and throws the following error:

FileNotFoundError: [Errno 2] No such file or directory: '/home/paperspace/.local/lib/python3.8/site-packages/adatest/../client/dist/main.js'

Any pointers on resolving the error? Thanks
