
johnsnowlabs / langtest


Deliver safe & effective language models

Home Page: http://langtest.org/

License: Apache License 2.0

Python 99.98% Makefile 0.01% CSS 0.01% Batchfile 0.01% Shell 0.01%
benchmarks ethics-in-ai large-language-models ml-safety ml-testing mlops model-assessment nlp responsible-ai llm-test

langtest's Introduction


LangTest: Deliver Safe & Effective Language Models


Langtest Workflow

Project's Website • Key Features • How To Use • Benchmark Datasets • Community Support • Contributing • Mission • License

Project's Website

Take a look at our official page for user documentation and examples: langtest.org

Key Features

  • Generate and execute more than 60 distinct types of tests with just one line of code
  • Test all aspects of model quality: robustness, bias, representation, fairness, and accuracy
  • Automatically augment training data based on test results (for select models)
  • Support for popular NLP frameworks for NER, Translation, and Text-Classification: Spark NLP, Hugging Face & Transformers
  • Support for testing LLMs (OpenAI, Cohere, AI21, Hugging Face Inference API, and Azure OpenAI) for question answering, toxicity, clinical tests, legal support, factuality, sycophancy, summarization, and other popular tests

Benchmark Datasets

LangTest comes with different datasets to test your models, covering a wide range of use cases and evaluation scenarios. You can explore all the benchmark datasets available here, each meticulously curated to challenge and enhance your language models. Whether you're focused on question answering, text summarization, or other tasks, LangTest ensures you have the right data to push your models to their limits and achieve peak performance in diverse linguistic tasks.

How To Use

# Install langtest
!pip install langtest[transformers]

# Import and create a Harness object
from langtest import Harness
h = Harness(task='ner', model={"model":'dslim/bert-base-NER', "hub":'huggingface'})

# Generate test cases, run them and view a report
h.generate().run().report()
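
The same chained workflow applies to other supported tasks and hubs. Below is a minimal sketch, assuming the Harness constructor accepts a text-classification task in the same form; the task name and model identifier are illustrative assumptions, not values taken from the documentation above:

# Sketch: testing a text-classification model with the same Harness workflow.
# The task name and model identifier are illustrative assumptions; check the
# langtest documentation for the exact values your version supports.
from langtest import Harness

h_clf = Harness(task='text-classification',
                model={"model": 'distilbert-base-uncased-finetuned-sst-2-english', "hub": 'huggingface'})

# Same chained workflow: generate test cases, run them, and view a report
h_clf.generate().run().report()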

Note: For more extensive usage examples and documentation, head over to langtest.org

Responsible AI Blogs

You can check out the following LangTest articles:

  • Automatically Testing for Demographic Bias in Clinical Treatment Plans Generated by Large Language Models

    Helps in understanding and testing demographic bias in clinical treatment plans generated by LLMs.

  • LangTest: Unveiling & Fixing Biases with End-to-End NLP Pipelines

    The end-to-end language pipeline in LangTest empowers NLP practitioners to tackle biases in language models with a comprehensive, data-driven, and iterative approach.

  • Beyond Accuracy: Robustness Testing of Named Entity Recognition Models with LangTest

    While accuracy is undoubtedly crucial, robustness testing takes the evaluation of natural language processing (NLP) models to the next level by ensuring that models can perform reliably and consistently across a wide array of real-world conditions.

  • Elevate Your NLP Models with Automated Data Augmentation for Enhanced Performance

    In this article, we discuss how automated data augmentation can supercharge your NLP models and improve their performance, and how we do that using LangTest.

  • Mitigating Gender-Occupational Stereotypes in AI: Evaluating Models with the Wino Bias Test through the LangTest Library

    In this article, we discuss how we can test for "Wino Bias" using LangTest, which specifically refers to testing biases arising from gender-occupational stereotypes.

  • Automating Responsible AI: Integrating Hugging Face and LangTest for More Robust Models

    In this article, we explore the integration between Hugging Face, your go-to source for state-of-the-art NLP models and datasets, and LangTest, your NLP pipeline's secret weapon for testing and optimization.

  • Detecting and Evaluating Sycophancy Bias: An Analysis of LLM and AI Solutions

    In this blog post, we discuss the pervasive issue of sycophantic AI behavior and the challenges it presents: language models sometimes prioritize agreement over authenticity, hindering meaningful and unbiased conversations. We also unveil a potential solution, synthetic data, which promises to make AI companions more reliable and accurate across various real-world conditions.

  • Unmasking Language Model Sensitivity in Negation and Toxicity Evaluations

    In this blog post, we delve into language model sensitivity, examining how models handle negations and toxicity in language. Through these tests, we gain insights into the models' adaptability and responsiveness, emphasizing the continuous need for improvement in NLP models.

  • Unveiling Bias in Language Models: Gender, Race, Disability, and Socioeconomic Perspectives

    In this blog post, we explore bias in language models, focusing on gender, race, disability, and socioeconomic factors. We assess this bias using the CrowS-Pairs dataset, designed to measure stereotypical biases, and discuss the importance of tools like LangTest in promoting fairness in NLP systems.

  • Unmasking the Biases Within AI: How Gender, Ethnicity, Religion, and Economics Shape NLP and Beyond

    In this blog post, we tackle AI bias, looking at how gender, ethnicity, religion, and economics shape NLP systems, and discuss strategies for reducing bias and promoting fairness in AI systems.

  • Evaluating Large Language Models on Gender-Occupational Stereotypes Using the Wino Bias Test

    In this blog post, we dive into testing LLMs on the WinoBias dataset, examining how language models handle gender and occupational roles, the evaluation metrics used, and the wider implications of addressing bias in AI.

  • Streamlining ML Workflows: Integrating MLFlow Tracking with LangTest for Enhanced Model Evaluations

    In this blog post, we dive into the growing need for transparent, systematic, and comprehensive tracking of models, and how MLFlow and LangTest can be combined into a single evaluation workflow.

  • Testing the Question Answering Capabilities of Large Language Models

    In this blog post, we dive into enhancing QA evaluation with the LangTest library and explore the different evaluation methods LangTest offers to address the complexities of evaluating question-answering (QA) tasks.

  • Evaluating Stereotype Bias with LangTest

    In this blog post, we focus on using the StereoSet dataset to assess bias related to gender, profession, and race.

  • Testing the Robustness of LSTM-Based Sentiment Analysis Models

    Explores the robustness of custom models with LangTest Insights.

  • LangTest Insights: A Deep Dive into LLM Robustness on OpenBookQA

    Explores the robustness of large language models (LLMs) on the OpenBookQA dataset with LangTest Insights.

  • LangTest: A Secret Weapon for Improving the Robustness of Your Transformers Language Models

    Explores the robustness of Transformers language models with LangTest Insights.

Note: To see all blogs, head over to the Blogs page

Community Support

  • Slack: for live discussion with the LangTest community, join the #langtest channel
  • GitHub: for bug reports, feature requests, and contributions
  • Discussions: to engage with other community members, share ideas, and show off how you use LangTest!

Mission

While there is a lot of talk about the need to train AI models that are safe, robust, and fair - few tools have been made available to data scientists to meet these goals. As a result, the front line of NLP models in production systems reflects a sorry state of affairs.

We propose here an early stage open-source community project that aims to fill this gap, and would love for you to join us on this mission. We aim to build on the foundation laid by previous research such as Ribeiro et al. (2020), Song et al. (2020), Parrish et al. (2021), van Aken et al. (2021) and many others.

John Snow Labs has a full development team allocated to the project and is committed to improving the library for years, as we do with other open-source libraries. Expect frequent releases with new test types, tasks, languages, and platforms to be added regularly. We look forward to working together to make safe, reliable, and responsible NLP an everyday reality.

Note: For usage and documentation, head over to langtest.org

Contributing to LangTest

We welcome all sorts of contributions. A detailed overview of how to contribute can be found in the contributing guide.

If you are looking to start working with the LangTest codebase, navigate to the GitHub "Issues" tab and start looking through interesting issues; there are a number of issues listed there where you could start out. Or maybe, through using LangTest, you have an idea of your own, or you see something in the documentation and think "this can be improved": you can do something about it!

Feel free to ask questions on the Q&A discussions.

As contributors and maintainers to this project, you are expected to abide by LangTest's code of conduct. More information can be found at: Contributor Code of Conduct

Citation

We have published a paper that you can cite for the LangTest library:

@article{nazir2024langtest,
  title={LangTest: A comprehensive evaluation library for custom LLM and NLP models},
  author={Arshaan Nazir and Thadaka Kalyan Chakravarthy and David Amore Cecchini and Rakshit Khajuria and Prikshit Sharma and Ali Tarik Mirik and Veysel Kocaman and David Talby},
  journal={Software Impacts},
  pages={100619},
  year={2024},
  publisher={Elsevier}
}

Contributors

We would like to acknowledge all contributors of this open-source community project.

License

LangTest is released under the Apache License 2.0, which permits commercial use, modification, distribution, patent use, and private use, and sets limitations on trademark use, liability, and warranty.

langtest's People

Contributors

agsfer, alierenak, alytarik, arkajyotichakraborty, arshaannazir, chakravarthik27, gadde5300, julesbelveze, luca-martial, mauro-nievoff, prikshit7766, rakshitkhajuria, sugatoray, vkocaman


langtest's Issues

Fill out design sheet

The sheet can be found in the Development channel: nlptest features.xlsx

Please fill out the Design tab

Reformat report method output

.report() should print this:

test factory   test type   pass count   fail count   pass rate   minimum pass rate   pass
Perturbation   uppercase   34           16           68%         75%                 False
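
As a minimal sketch of that target layout (representing the report as a pandas DataFrame is an assumption here, not a statement about the current implementation):

import pandas as pd

# Illustrative sketch only: column names are taken from the table above;
# backing the report with a pandas DataFrame is an assumption.
report = pd.DataFrame([{
    "test factory": "Perturbation",
    "test type": "uppercase",
    "pass count": 34,
    "fail count": 16,
    "pass rate": "68%",
    "minimum pass rate": "75%",
    "pass": False,
}])
print(report.to_string(index=False))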

Features Backlog

Parked Ideas 🚗

  • Testing social stereotypes for MLMs (paper)
  • Adding support for cloud provider models to be tested
  • Support installation in air-gapped environments
  • Generate a spec of the data each model expects and can safely run on (for example, if a model has only been validated on females aged 18 and up, then the model should not be used on people outside that demographic group); a sketch of this idea follows this list - see https://ianwhitestone.work/hello-great-expectations/
  • Toxicity tests (swear words, offensive answers)
  • Data leakage tests (PHI)
  • Adversarial attacks tests
  • Freshness tests (replace _2023_name)
  • Runtime tests
  • Question answering
  • Text generation
  • Summarization
  • Paraphrasing
  • Translation
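
A minimal sketch of the data-spec idea from the list above (illustrative only; the spec format and helper below are hypothetical and not part of langtest):

# Hypothetical "model data spec": records the population the model was
# validated on and is checked before running inference on a record.
MODEL_SPEC = {"sex": {"F"}, "min_age": 18}   # e.g. validated only on females aged 18+

def within_spec(record, spec=MODEL_SPEC):
    """Return True if a record falls inside the validated cohort."""
    return record["sex"] in spec["sex"] and record["age"] >= spec["min_age"]

assert within_spec({"sex": "F", "age": 34}) is True
assert within_spec({"sex": "M", "age": 52}) is False   # outside the validated group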

Documentation Roadmap

Description

This is the roadmap for all things related to knowledge translation and documentation. This includes:

  • tutorial notebooks
  • readme instructions
  • blogposts
  • docs

Tasks

Robustness fixing should accept a simple dictionary where keys are perturbation names and values are proportions to apply to all entities for that perturbation

Currently, we pass each perturbation as a separate parameter:

augment_robustness(conll_path = 'data.conll',
                   uppercase = {'PROBLEM':0.05, 'TEST':0.05, 'TREATMENT':0.05},
                   lowercase = {'PROBLEM':0.05, 'TEST':0.05, 'TREATMENT':0.05})

We should change this to a new parameter that accepts a perturbation map that looks like this:

detailed_proportions = {
   "uppercase": {'PROBLEM':0.05, 'TEST':0.05, 'TREATMENT':0.05},
   "lowercase": {'PROBLEM':0.05, 'TEST':0.05, 'TREATMENT':0.05},
   "title": {'PROBLEM':0.05, 'TEST':0.05, 'TREATMENT':0.05},
   "add_punctuation": {'PROBLEM':0.05, 'TEST':0.05, 'TREATMENT':0.05},
}

augment_robustness(conll_path = 'data.conll',
                   entity_perturbation_map = detailed_proportions)

We should also accept a simpler version of this in another parameter:

proportions = {
   "uppercase": 0.05,
   "lowercase": 0.05}

augment_robustness(conll_path = 'data.conll',
                   perturbation_map = proportions)
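
One way the simple map could be normalized into the detailed per-entity map internally is sketched below; the helper name and entity labels are illustrative, not part of the current API:

# Hypothetical helper: expand {"uppercase": 0.05} into
# {"uppercase": {"PROBLEM": 0.05, "TEST": 0.05, "TREATMENT": 0.05}}.
def expand_perturbation_map(proportions, entity_labels):
    return {perturbation: {label: proportion for label in entity_labels}
            for perturbation, proportion in proportions.items()}

detailed = expand_perturbation_map({"uppercase": 0.05, "lowercase": 0.05},
                                   entity_labels=["PROBLEM", "TEST", "TREATMENT"])
# `detailed` now matches the entity_perturbation_map format shown above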

Robustness Fixing Roadmap

Adaptations

  • #10
  • #12
  • Adapt NERDataHandler class for robustness fixing in transformers
  • Adapt NERDataHandler class for robustness fixing in spaCy

Improvements

  • #19
  • #29
  • Add an optimization algorithm for all augmentations so that we get as close as possible to 100% augmentation coverage for all tests (use a method similar to the new noise_proportion method in robustness testing -> sample after augmenting all)
  • Track whether a change occurred in every sentence for every augmentation
  • Apply entity swapping to all entities in a sentence (instead of just one)
  • Add type checking to main function
  • Supporting classification, assertion, relation extraction tasks
  • Become compatible with many types of DOCSTART

Bug Fixes

🎉

Bias Testing Roadmap

Adaptations

  • Pick and implement final gender classification method
  • Replace NerDLMetrics with sklearn classification_report
  • Adapt handler classes for bias testing in Spark NLP
  • Adapt handler classes for bias testing in transformers
  • Adapt handler classes for bias testing in spaCy

Improvements

  • #30
  • Add type checking to main function
  • Supporting classification, assertion, relation extraction tasks

Bug Fixes

🎉

Noisy Labels Fixing Roadmap

Adaptations

  • Adapt handler classes for noisy labels testing in Spark NLP
  • Adapt handler classes for noisy labels testing in transformers
  • Adapt handler classes for noisy labels testing in spaCy

Improvements

  • #32
  • Prettify UI dropdown, groupby sentences like in ALAB UI
  • Add type checking to main function
  • Supporting classification, assertion, relation extraction tasks

Bug Fixes

  • Fix UI jupyter lab compatibility

Fixes Backlog

Parked Ideas 🚗

  • Removing transformers dependency if possible

Improve sampling method for `noise_prob` param by replacing with new `noise_proportion` param in robustness testing

noise_proportion = 0.5

# step 1: apply the perturbation to all samples
1000 sentences -> apply contraction

# step 2: sample as many successfully augmented sentences as possible to reach noise_proportion
#         (we don't mind if some of the sampled sentences are already augmented)
50 samples successfully contracted (augmented) + 100 already contracted
final set: 50 augmented + 50 sampled at random from the remaining 950

Expected behaviour with noise_proportion = 0.5:

contraction -> 5 samples to augment + 5 original samples -> F1 score: 0.60
uppercase   -> 500 samples to augment + 500 original samples -> F1 score: 0.75

samples to augment == 0 -> "No samples to apply {test_name}, skipping this test."
samples to augment < 50 -> "Low number of samples ({n_samples}) to apply {test_name} to."
                           "F1-Score may not be representative of true perturbation effect."

Why the current noise_prob sampling inflates scores (example: 1000 sentences, noise_prob = 0.5):

  • Some perturbations have low augmentation coverage: add/strip punctuation, accent conversion, entity swapping, and adding contractions can only be applied to a subset of sentences. For instance, add_punctuation is skipped when a sentence already ends with punctuation, and only sentences containing a contractible phrase ("is not" -> "isn't") can be contracted.
  • With noise_prob = 0.5, augmentation is attempted on only around 500 of the 1000 sentences, and among those 500 perhaps only 25 can actually be contracted.
  • The resulting perturbation set then contains 500 original sentences plus 25 augmented ones, but only the 25 augmented samples actually exercise the perturbation. Since the 500 unchanged sentences are unaffected by the perturbation, they are scored as correct, so the F1 score comes out high and the model appears to have no problem with this perturbation even when it does.
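
A minimal sketch of the proposed two-step sampling (illustrative only; apply_perturbation stands in for the real perturbation code, and the exact bookkeeping in langtest may differ):

import random

def build_perturbation_set(sentences, apply_perturbation, noise_proportion=0.5, seed=42):
    random.seed(seed)

    # Step 1: try the perturbation on every sentence; keep the ones that changed.
    augmented, unchanged = [], []
    for sentence in sentences:
        perturbed = apply_perturbation(sentence)
        if perturbed != sentence:
            augmented.append(perturbed)
        else:
            unchanged.append(sentence)

    if not augmented:
        print("No samples to apply this test, skipping this test.")
        return []
    if len(augmented) < 50:
        print(f"Low number of samples ({len(augmented)}) to apply this test to. "
              "F1-Score may not be representative of the true perturbation effect.")

    # Step 2: cap augmented samples at noise_proportion of the corpus, then pad with
    # unmodified sentences so augmented samples make up noise_proportion of the final
    # set (e.g. 50 augmented + 50 originals at noise_proportion = 0.5).
    n_aug = min(len(augmented), int(len(sentences) * noise_proportion))
    n_orig = int(n_aug * (1 - noise_proportion) / noise_proportion)
    return random.sample(augmented, n_aug) + random.sample(unchanged, min(n_orig, len(unchanged)))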

NERDataHandler Roadmap

Description

Create a NERDataHandler class that establishes a common CoNLL data structure for all libraries to process labeled NER data. This includes:

  • Write and read methods
  • Storing document indexes
  • Easy filtering
  • Converting inputs to match external library requirements (including direct dataset download from HF datasets)

This issue will be used to track the sub-tasks required to launch and maintain this class.
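
A hypothetical skeleton for such a class is sketched below; the field and method names are illustrative, not the final design:

from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class NERSample:
    tokens: List[str]
    labels: List[str]
    doc_index: int = 0

@dataclass
class NERDataHandler:
    samples: List[NERSample] = field(default_factory=list)

    @classmethod
    def read_conll(cls, path: str) -> "NERDataHandler":
        """Parse a CoNLL file (token ... label per line) into the common structure."""
        samples, tokens, labels, doc_idx = [], [], [], 0
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line.startswith("-DOCSTART-"):
                    doc_idx += 1
                elif not line:
                    if tokens:
                        samples.append(NERSample(tokens, labels, doc_idx))
                        tokens, labels = [], []
                else:
                    parts = line.split()
                    tokens.append(parts[0])
                    labels.append(parts[-1])
        if tokens:
            samples.append(NERSample(tokens, labels, doc_idx))
        return cls(samples)

    def filter(self, predicate: Callable[[NERSample], bool]) -> "NERDataHandler":
        """Return a new handler containing only the samples matching `predicate`."""
        return NERDataHandler([s for s in self.samples if predicate(s)])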

Tasks

  • Ensure class supports robustness testing/fixing with Spark NLP
  • Ensure class supports bias testing with Spark NLP
  • Ensure class supports noisy label testing/fixing with Spark NLP
  • Ensure class supports robustness testing/fixing with transformers
  • Ensure class supports bias testing with transformers
  • Ensure class supports noisy label testing/fixing with transformers
  • Ensure class supports robustness testing/fixing with spaCy
  • Ensure class supports bias testing with spaCy
  • Ensure class supports noisy label testing/fixing with spaCy

Noisy Labels Testing Roadmap

Adaptations

  • Adapt handler classes for noisy labels testing in Spark NLP
  • Adapt handler classes for noisy labels testing in transformers - notebook
  • Adapt handler classes for noisy labels testing in spaCy

Improvements

  • #20
  • #31
  • Improve scoring using sentence label quality score
  • Add type checking to main function
  • Supporting classification, assertion, relation extraction tasks

Bug Fixes

🎉

Refactor token filtering in robustness_testing

Token filtering was created to delete extra added tokens so that token lengths match when comparing predictions from NER models. There are other ways we could do this, such as making the metrics themselves ignore token length differences.

See slides for more details on possible approaches.
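
One possible metric-level alternative is to align the two token sequences and score only matched positions, so extra inserted tokens are ignored at evaluation time. A minimal sketch (illustrative only; not necessarily the approach described in the slides):

from difflib import SequenceMatcher

def aligned_label_pairs(orig_tokens, orig_labels, pert_tokens, pert_labels):
    """Pair up labels only where the original and perturbed token sequences match."""
    matcher = SequenceMatcher(None, orig_tokens, pert_tokens, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":   # ignore inserted/deleted tokens entirely
            pairs.extend(zip(orig_labels[i1:i2], pert_labels[j1:j2]))
    return pairs

pairs = aligned_label_pairs(
    ["John", "lives", "in", "Paris"], ["B-PER", "O", "O", "B-LOC"],
    ["John", "lives", "in", "Paris", "!"], ["B-PER", "O", "O", "B-LOC", "O"],
)
# pairs -> [("B-PER", "B-PER"), ("O", "O"), ("O", "O"), ("B-LOC", "B-LOC")]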

NERModelHandler Roadmap

Description

Create a NERModelHandler class that establishes a common way for inference and training on NER models from different libraries. This includes:

  • Wrapping NER inference pipelines for Spark NLP, transformers and spaCy
  • Standardizing output formats for all pipeline predictions
  • Wrapping training process for Spark NLP, transformers and spaCy models

This issue will be used to track the sub-tasks required to launch and maintain this class.
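
A hypothetical interface sketch for the wrapper described above; the method names are illustrative, not the final design:

from typing import List, Protocol

class NERModelHandlerProtocol(Protocol):
    def predict(self, tokens: List[str]) -> List[str]:
        """Return one label per input token in a library-agnostic format."""
        ...

    def train(self, train_data_path: str, output_dir: str) -> None:
        """Wrap the underlying library's training loop (Spark NLP, transformers, or spaCy)."""
        ...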

Tasks

  • Ensure class supports robustness testing with Spark NLP
  • Ensure class supports bias testing with Spark NLP
  • Ensure class supports noisy label testing/fixing with Spark NLP
  • Ensure class supports robustness testing with transformers
  • Ensure class supports bias testing with transformers
  • Ensure class supports noisy label testing/fixing with transformers
  • Ensure class supports robustness testing with spaCy
  • Ensure class supports bias testing with spaCy
  • Ensure class supports noisy label testing/fixing with spaCy

Privacy Attack Testing Roadmap

Description

We need to build mechanisms to test data for different categories of privacy attacks:

  • Membership inference attack: An adversary predicts whether a known subject was present in the training data used to train the synthetic data model.

  • Re-identification attack: The adversary estimates the probability of some features being re-identified by matching synthetic data to the training data.

  • Attribute inference attack: The adversary predicts the value of sensitive features using synthetic data.

Main article discussing mechanisms.

Tasks

🕐
