
Modular Natural Language Processing workflows with Keras

License: Apache License 2.0

deep-learning keras machine-learning nlp tensorflow

keras-nlp's Introduction

KerasNLP: Modular NLP Workflows for Keras


KerasNLP is a natural language processing library that works natively with TensorFlow, JAX, or PyTorch. Built on Keras 3, these models, layers, metrics, and tokenizers can be trained and serialized in any framework and re-used in another without costly migrations.

KerasNLP supports users through their entire development cycle. Our workflows are built from modular components that have state-of-the-art preset weights when used out-of-the-box and are easily customizable when more control is needed.

This library is an extension of the core Keras API; all high-level modules are Layers or Models that receive the same level of polish as core Keras. If you are familiar with Keras, congratulations! You already understand most of KerasNLP.

See our Getting Started guide to start learning our API. We welcome contributions.

Quick Links

For everyone

For contributors

Installation

KerasNLP supports both Keras 2 and Keras 3. We recommend Keras 3 for all new users, as it enables using KerasNLP models and layers with JAX, TensorFlow and PyTorch.

Keras 2 Installation

To install the latest KerasNLP release with Keras 2, simply run:

pip install --upgrade keras-nlp

Keras 3 Installation

There are currently two ways to install Keras 3 with KerasNLP. To install the stable versions of KerasNLP and Keras 3, you should install Keras 3 after installing KerasNLP. This is a temporary step while TensorFlow is pinned to Keras 2, and will no longer be necessary after TensorFlow 2.16.

pip install --upgrade keras-nlp
pip install --upgrade "keras>=3"

To install the latest nightly changes for both KerasNLP and Keras, you can use our nightly package.

pip install --upgrade keras-nlp-nightly

Important

Keras 3 will not function with TensorFlow 2.14 or earlier.

Read Getting started with Keras for more information on installing Keras 3 and compatibility with different frameworks.

Quickstart

Fine-tune BERT on a small sentiment analysis task using the keras_nlp.models API:

import os
os.environ["KERAS_BACKEND"] = "tensorflow"  # Or "jax" or "torch"!

import keras_nlp
import tensorflow_datasets as tfds

imdb_train, imdb_test = tfds.load(
    "imdb_reviews",
    split=["train", "test"],
    as_supervised=True,
    batch_size=16,
)
# Load a BERT model.
classifier = keras_nlp.models.BertClassifier.from_preset(
    "bert_base_en_uncased", 
    num_classes=2,
    activation="softmax",
)
# Fine-tune on IMDb movie reviews.
classifier.fit(imdb_train, validation_data=imdb_test)
# Predict two new examples.
classifier.predict(["What an amazing movie!", "A total waste of my time."])

For more in-depth guides and examples, visit https://keras.io/keras_nlp/.

Configuring your backend

If you have Keras 3 installed in your environment (see installation above), you can use KerasNLP with any of JAX, TensorFlow and PyTorch. To do so, set the KERAS_BACKEND environment variable. For example:

export KERAS_BACKEND=jax

Or in Colab, with:

import os
os.environ["KERAS_BACKEND"] = "jax"

import keras_nlp

Important

Make sure to set the KERAS_BACKEND environment variable before importing any Keras libraries; it will be used to set up Keras when it is first imported.

Compatibility

We follow Semantic Versioning and plan to provide backwards compatibility guarantees both for code and saved models built with our components. While we continue with pre-release 0.y.z development, we may break compatibility at any time, and APIs should not be considered stable.

Disclaimer

KerasNLP provides access to pre-trained models via the keras_nlp.models API. These pre-trained models are provided on an "as is" basis, without warranties or conditions of any kind. The following underlying models are provided by third parties, and subject to separate licenses: BART, DeBERTa, DistilBERT, GPT-2, OPT, RoBERTa, Whisper, and XLM-RoBERTa.

Citing KerasNLP

If KerasNLP helps your research, we appreciate your citations. Here is the BibTeX entry:

@misc{kerasnlp2022,
  title={KerasNLP},
  author={Watson, Matthew and Qian, Chen and Bischof, Jonathan and Chollet,
  Fran\c{c}ois and others},
  year={2022},
  howpublished={\url{https://github.com/keras-team/keras-nlp}},
}

Acknowledgements

Thank you to all of our wonderful contributors!

keras-nlp's People

Contributors

abheesht17, abosamoor, abuelnasr0, adhadse, adityadas1999, aflah02, chenmoneygithub, cyber-machine, dependabot[bot], fchollet, grasskin, jbischof, jessechancy, mattdangerw, nkovela1, pnacht, pranavvp16, reedwm, ryanmullins, saberkun, sachinprasadhs, samanehsaadat, sampathweb, shivance, soma2000-lang, susnato, tanzhenyu, tensorflower-gardener, theathleticcoder, tirthasheshpatel


keras-nlp's Issues

Add a BLEU metric

Splitting this issue out from #38.

We should add a BLEU metric as keras_nlp.metrics.Bleu.
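A hypothetical usage sketch for the proposed metric (the class and its arguments do not exist yet at this point, so treat every name below as an assumption):

import keras_nlp

# Hypothetical usage of the proposed keras_nlp.metrics.Bleu; the final
# signature may differ.
bleu = keras_nlp.metrics.Bleu(max_order=4)
references = ["the cat sat on the mat"]  # ground-truth reference per sample
candidates = ["the cat sat on mat"]      # generated text per sample
print(bleu(references, candidates))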

Add a ByteTokenizer

This would be a simple tokenizer which has no vocabulary, and simply converts text to raw bytes.

Potential docstring and usage

"""
Raw byte tokenizer.

This tokenizer is a vocabulary free tokenizer which will tokenize text as
raw bytes from [0, 256).
 
Args:
   lowercase: if true, lowercase text before tokenizing.
   sequence_length: If set, the output will be converted to a dense
     tensor and padded/trimmed so all outputs are of sequence_length.
   normalization_form: One of the following string values (None, 'NFC',
     'NFKC', 'NFD', 'NFKD'). If set will normalize unicode to the given form
     before tokenizing.
   errors: One of ('replace', 'remove', 'strict'). Specifies the `detokenize()`
     behavior when an invalid byte sequence is encountered. (same behavior as
     https://www.tensorflow.org/api_docs/python/tf/strings/unicode_transcode)
"""

tokenizer = keras_nlp.tokenizers.ByteTokenizer()

inputs = ["▀▁▂▃▄", "hello"]
tokens = tokenizer.tokenize(inputs)
>>> <tf.RaggedTensor [
      [226, 150, 128, 226, 150, 129, 226, 150, 130, 226, 150, 131, 226, 150, 132],
      [104, 101, 108, 108, 111]]>


outputs = tokenizer.detokenize(tokens)
outputs = [x.decode("utf-8") for x in outputs.numpy().tolist()]
>>> ['▀▁▂▃▄', 'hello']

Colab demonstrating basic functionality:

https://colab.sandbox.google.com/gist/mattdangerw/99e8f3795e37fe539731dcbfc1c09d47/bytetokenizer.ipynb

Add Remaining Tokenizers

Below is a list of commonly used tokenizers which can be implemented. @mattdangerw, @chenmoneygithub, let me know what your thoughts on these tokenizers are (whether we should implement all of these, or skip some). Thanks!

  • Space + Punctuation Tokenizer: I know that subword tokenizers are in vogue nowadays. However, a model as recent as Transformer XL uses a generic Space + Punctuation Tokenizer. Hence, this can be added to the library for the sake of completeness.

  • Byte Pair Encoding (BPE): GPT-2 and RoBERTa use this.

  • WordPiece: Has already been implemented. BERT uses this.

  • SentencePiece: Mentioned in #27. XLNet uses this.

A note on the differences between the subword tokenizers mentioned above (source: https://blog.floydhub.com/tokenization-nlp/):

BPE: Just uses the frequency of occurrences to identify the best match at every iteration until it reaches the predefined vocabulary size.

WordPiece: Similar to BPE and uses frequency occurrences to identify potential merges but makes the final decision based on the likelihood of the merged token

Unigram: A fully probabilistic model which does not use frequency occurrences. Instead, it trains a LM using a probabilistic model, removing the token which improves the overall likelihood the least and then starting over until it reaches the final token limit.

If you think it's fine, I'll get started with Space + Punctuation Tokenizer and BPE Tokenizer.

Note on Unigram Tokenizer: Not sure if any model uses this. Can be skipped.

Add the gMLP Encoder Block

The gMLP model is from the paper "Pay Attention to MLPs". It has a decent number of citations - around 40. Every Encoder Block merely consists of linear layers, a "spatial gating unit", etc. Will be a good addition to the library, considering the research world is trying to find alternatives for self-attention, and because despite the simplicity of this model, it does achieve comparable performance with Transformers.
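As a rough illustration only (not a proposed API), a Keras 3 sketch of one gMLP block with its spatial gating unit might look like the following; all class and argument names here are assumptions:

import keras

class SpatialGatingUnit(keras.layers.Layer):
    """Gates half the channels with a learned projection over the sequence axis."""

    def __init__(self, sequence_length, **kwargs):
        super().__init__(**kwargs)
        self.norm = keras.layers.LayerNormalization()
        # Near-identity initialization (zero weights, ones bias) as recommended
        # in the paper, so the block starts out close to a plain MLP.
        self.spatial_projection = keras.layers.Dense(
            sequence_length, kernel_initializer="zeros", bias_initializer="ones"
        )

    def call(self, x):
        # Split channels into a "content" half and a "gate" half.
        u, v = keras.ops.split(x, 2, axis=-1)
        v = self.norm(v)
        v = keras.ops.transpose(v, (0, 2, 1))  # (batch, channels, sequence)
        v = self.spatial_projection(v)         # mix information across tokens
        v = keras.ops.transpose(v, (0, 2, 1))
        return u * v

class GMLPEncoderBlock(keras.layers.Layer):
    """Norm -> channel expansion -> spatial gating -> projection, with a residual."""

    def __init__(self, sequence_length, hidden_dim, intermediate_dim, **kwargs):
        super().__init__(**kwargs)
        self.norm = keras.layers.LayerNormalization()
        self.expand = keras.layers.Dense(intermediate_dim, activation="gelu")
        self.sgu = SpatialGatingUnit(sequence_length)
        self.project = keras.layers.Dense(hidden_dim)

    def call(self, x):
        shortcut = x
        x = self.expand(self.norm(x))
        x = self.sgu(x)  # intermediate_dim must be even; output is half of it
        return shortcut + self.project(x)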

Add a BERT input packing layer

BERT models need additional input preprocessing after tokenization. Inputs need to be "packed" into the context length, with [CLS] and [SEP] tokens, and a separate tensor of segment ids needs to be prepared to make the segment embedding. We should make a layer or set of layers that helps with this.
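A minimal pure-Python sketch of the packing behavior described above (the helper name and the default special token ids are illustrative, not a proposed API):

def pack_bert_inputs(
    segment_a, segment_b, sequence_length, cls_id=101, sep_id=102, pad_id=0
):
    # Layout is [CLS] A [SEP] B [SEP], truncated/padded to sequence_length.
    token_ids = [cls_id] + segment_a + [sep_id] + segment_b + [sep_id]
    segment_ids = [0] * (len(segment_a) + 2) + [1] * (len(segment_b) + 1)
    token_ids = token_ids[:sequence_length]
    segment_ids = segment_ids[:sequence_length]
    padding = sequence_length - len(token_ids)
    return token_ids + [pad_id] * padding, segment_ids + [0] * padding

tokens, segments = pack_bert_inputs([7, 8, 9], [11, 12], sequence_length=10)
# tokens   -> [101, 7, 8, 9, 102, 11, 12, 102, 0, 0]
# segments -> [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]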

Add ROUGE metrics

Splitting this issue out from #38.

We should add a ROUGE metric as keras_nlp.metrics.RougeL, etc.

Add a token and position embedding layer

For simplicity of our examples, we should add a layer that combines keras.layers.Embedding and keras_nlp.layers.PositionEmbedding into a single offering. Initial take for an API signature...

keras_nlp.layers.TokenAndPositionEmbedding(
    vocabulary_size,
    max_length,
    embedding_dim,
    embeddings_initializer="glorot_uniform",
    embeddings_regularizer=None,
    mask_zero=False,
    **kwargs
)

Internally, this will combine the layers as follows...

# in __init__
self.token_embeddings = keras.layers.Embedding(
    input_dim=vocabulary_size, output_dim=embedding_dim
)
self.position_embeddings = keras_nlp.layers.PositionEmbedding(
    max_length=max_length
)

# in call
embedded_tokens = self.token_embeddings(inputs)
embedded_positions = self.position_embeddings(embedded_tokens)
outputs = embedded_tokens + embedded_positions

Issues with Running lint.sh and format.sh

I noticed that some tests for code formatting were failing in #50
Hence I tried to run format.sh however encountered this issue -

(kerasnlp) aflah@LAPTOP:/mnt/c/Users/ASUS/Desktop/keras-nlp$ bash shell/format.sh
shell/format.sh: line 2: $'\r': command not found
shell/format.sh: line 4: $'\r': command not found
Broken 1 paths
Usage: black [OPTIONS] SRC ...
Try 'black -h' for help.

Error: Invalid value for 'SRC ...': Path '.\r\r' does not exist.
shell/format.sh: line 7: $'\r': command not found
shell/format.sh: line 8: syntax error near unexpected token `$'do\r''
'hell/format.sh: line 8: `for i in $(find ${base_dir} -name '*.py'); do

same when I run lint.sh

(kerasnlp) aflah@LAPTOP:/mnt/c/Users/ASUS/Desktop/keras-nlp$ bash shell/lint.sh
shell/lint.sh: line 2: $'\r': command not found
shell/lint.sh: line 4: $'\r': command not found
Broken 1 paths
shell/lint.sh: line 20: syntax error near unexpected token `$'do\r''
'hell/lint.sh: line 20: `for i in $(find "${base_dir}" -name '*.py'); do

I'm running these in a conda env in WSL on Windows 11

Adding Installation Process in the README

I think it would be a good idea to add instructions on how someone can use the repository currently (pre-release), either on Google Colab or locally by cloning and pip installing it, as part of the README or an Installation.md-type file.

The instructions would be pretty much:

  1. Clone the repository using git clone https://github.com/keras-team/keras-nlp, or download it directly
  2. Move into the directory with cd keras-nlp
  3. Run pip install . from inside the folder

This could be helpful for people who just wish to quickly set up this library to try stuff out.

Integrated gradients for nlp models. (Explainable AI)

Describe feature
Integrated Gradients (IG) can be a great tool for understanding a neural network.

How API will change?
The API would be extended as follows:

from keras_nlp.utils.visualization import IntegratedGradients as IG

ig = IG(model, layer=layer, ...)

Candidate Solution

  • ALIBI-Explain: It's a toolbox that provides many solutions, including IG in TensorFlow 2 (Keras).

Demo output:

ig.explain(sample)

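A rough TensorFlow sketch of the integrated gradients computation itself (not the proposed keras_nlp API); it assumes model maps a batch of interpolated embeddings to class scores, and the function name is illustrative:

import tensorflow as tf

def integrated_gradients(model, embeddings, baseline, target_index, steps=50):
    # Interpolate between a baseline (e.g. all zeros) and the real embeddings.
    alphas = tf.linspace(0.0, 1.0, steps + 1)[:, tf.newaxis, tf.newaxis]
    interpolated = baseline + alphas * (embeddings - baseline)
    with tf.GradientTape() as tape:
        tape.watch(interpolated)
        scores = model(interpolated)[:, target_index]
    grads = tape.gradient(scores, interpolated)
    # Trapezoidal approximation of the path integral, scaled by the input delta.
    avg_grads = tf.reduce_mean((grads[:-1] + grads[1:]) / 2.0, axis=0)
    return (embeddings - baseline) * avg_grads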

Synonym Replacement Layer - Data Augmentation

Since issue #39 is very broad, I've created this issue to specifically discuss the Synonym Replacement Layer.

  • Parse WordNet
  • Work On Integrating With KerasNLP

So far I've made a rough parse here; I'll hopefully finalize it in a day or two, after which we can have a look at the performance and implementation specifics.

I'll post all relevant updates here
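A rough sketch of the WordNet-based replacement idea, using NLTK purely for illustration (the eventual layer API and dependency choices are open):

import random
from nltk.corpus import wordnet  # requires a one-time nltk.download("wordnet")

def replace_synonyms(words, rate=0.1):
    augmented = []
    for word in words:
        # Collect candidate synonyms for this word from WordNet.
        synonyms = {
            lemma.name().replace("_", " ")
            for synset in wordnet.synsets(word)
            for lemma in synset.lemmas()
            if lemma.name().lower() != word.lower()
        }
        if synonyms and random.random() < rate:
            augmented.append(random.choice(sorted(synonyms)))
        else:
            augmented.append(word)
    return augmented

print(replace_synonyms("the quick brown fox jumps over the lazy dog".split(), rate=0.3))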

Problems of running BERT example: BookCorpus cannot be downloaded

Describe the bug
Downloading BookCorpus via the [repo mentioned in the BERT instructions](https://github.com/soskek/bookcorpus/blob/master/README.md) hit an error: HTTPError: HTTP Error 503: Service Temporarily Unavailable. Failed to open https://www.smashwords.com/books/download/459173/6/latest/0/0/imperfect-chemistry.txt

This might be transient since the error code is 503, but we need to further check it.

To Reproduce

git clone https://github.com/soskek/bookcorpus.git
cd bookcorpus
python download_files.py --list url_list.jsonl --out out_txts --trash-bad-count

Expected behavior
Should be able to download the BookCorpus dataset.

Add a layer to dynamically generate mask

Is your feature request related to a problem? Please describe.
Masking is an essential part of LM training, so we need a layer to do the work.

Describe the solution you'd like
Create a new layer which takes in sequence data (either a RaggedTensor or a dense Tensor) and returns the masked sequence, along with the masked positions and the original ids at those positions.

Describe alternatives you've considered
We can directly use tf_text, but the API is not very user-friendly.

Additional context
N/A
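A rough NumPy sketch of that behavior for dense inputs only (a real layer would also handle RaggedTensors and avoid masking special tokens; the parameter names here are assumptions):

import numpy as np

def generate_mlm_masks(token_ids, mask_rate=0.15, mask_token_id=103, pad_id=0):
    token_ids = np.asarray(token_ids)
    # Only non-padding positions are candidates for masking.
    candidates = token_ids != pad_id
    mask = (np.random.rand(*token_ids.shape) < mask_rate) & candidates
    masked_sequence = np.where(mask, mask_token_id, token_ids)
    masked_positions = [np.flatnonzero(row) for row in mask]
    masked_ids = [ids[row] for ids, row in zip(token_ids, mask)]
    return masked_sequence, masked_positions, masked_ids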

Add Function(s)/Class(es) for Decoding Strategies [Text Generation]

Not sure if this is worth implementing, but basically, we can have functions/classes which users can call in order to get decoded text. As far as I know, we have these decoding strategies:

  • Greedy Sampling
  • Random Sampling
  • Top-k Sampling
  • Top-p (Nucleus Sampling)
  • Beam Search

Users can feed a tensor of size (bsz, seq_len, vocab_size) (output of a generation model), and choose a decoding strategy from above. The output will be the decoded tokens (of size (bsz, seq_len)).

I realised this while implementing text generation metrics like ROUGE (#38). In metrics like ROUGE, we can provide the user an option, decoding_strategy, which can take the values None, "greedy", "top_k", "nucleus", etc., and decode the output accordingly. In case of None, the user will be asked to provide a decoded output.

Something like this:

import tensorflow as tf

from keras_nlp.utils.text_decoder import TextDecoder

class Rouge:
  def __init__(self, decoding_strategy, ...):
    self.decoding_strategy = decoding_strategy
    if decoding_strategy:
      self.text_decoder = TextDecoder(decoding_strategy)

  def update_state(self, y_true, y_pred):
    if self.decoding_strategy:
      y_pred_tokens = self.text_decoder(y_pred)
    else:
      assert tf.rank(y_pred) == 2, "Please provide the decoded output, or set the decoding_strategy to a non-None value."
    ...
 
  def result(...):
    ...

  def reset_state(...):
    ...
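For reference, a rough TensorFlow sketch of just one of the strategies above, top-k sampling, applied position-wise to a (bsz, seq_len, vocab_size) tensor of logits; the function name and signature are illustrative only:

import tensorflow as tf

def top_k_decode(logits, k=10):
    batch_size, seq_len, vocab_size = tf.unstack(tf.shape(logits))
    flat_logits = tf.reshape(logits, (-1, vocab_size))
    top_values, top_indices = tf.math.top_k(flat_logits, k=k)
    # Sample among the k most likely tokens at every position.
    sampled = tf.random.categorical(top_values, num_samples=1)
    token_ids = tf.gather(top_indices, sampled, batch_dims=1)
    return tf.reshape(token_ids, (batch_size, seq_len))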

Rename the Classes in `keras_nlp/layers` to "EncoderBlock"?

In keras_nlp/layers, we have classes like TransformerEncoder, TransformerDecoder, FNetEncoder. Should we rename these to TransformerEncoderBlock, TransformerDecoderBlock and FNetEncoderBlock, respectively?

TensorFlow Reference: https://github.com/tensorflow/models/tree/master/official/projects/longformer
Here, the LongformerEncoder is a stack of LongformerEncoderBlocks. This is the case for other models as well.

Hugging Face: BertEncoder is a stack of BertLayers.

Candidate metric: MAUVE

Is your feature request related to a problem? Please describe.
This idea is from a best paper at NeurIPS 2021: MAUVE: Measuring the Gap Between Neural Text and Human Text using Divergence Frontiers. Basically, it proposes a new metric that measures how close machine-generated text is to human text. This metric is specifically useful for open-ended text generation.

Describe the solution you'd like
Write a Keras metric class that implements MAUVE.

Describe alternatives you've considered
Probably... we should actually delete this section?

Additional context
MAUVE is not very easy to implement, because it does sampling, and requires an external model, so by nature it is not a light-weight metric. On the other hand, MAUVE is very successful at evaluating machine generated text in a human-like way. We will need to discuss a bit more about it.

Add docstring testing for API code examples

We need to add docstring testing for our API code blocks.

A couple decisions we should make here:

  • Do we want to do fenced examples (triple backticks), >>> style, or both?
  • Do we want to share common code for this with core keras? Or keep standalone code based on doctest?
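For the standalone-doctest option, a minimal sketch of a test entry point (assuming we collect >>> style examples from a module's docstrings; the module path here is illustrative):

import doctest

import keras_nlp.layers

# Run the `>>>` examples embedded in the docstrings of keras_nlp.layers.
results = doctest.testmod(keras_nlp.layers, verbose=False)
print(f"attempted={results.attempted}, failed={results.failed}")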

Add Pre-commit Hooks

Random thought:
Currently, we run two bash scripts, format.sh and lint.sh, for formatting and linting the code. While this works for Linux/macOS users, Windows users can't run them and have to copy-paste every command separately.

An alternative is pre-commit hooks. Here, we can add hooks for black, isort and flake8 to a .pre-commit-config.yaml config file. This will not only save developers the trouble of running the two bash scripts, but also make this step easier for Windows users.

After making your changes, this is all you have to do:

git add .
git commit -m "Add feature"
# If an error is raised, your code is not compliant with flake8, isort or black
# Fix any flake8 errors by following their suggestions.
# isort and black will automatically format the files so they might look different, but you'll need to stage the files.

# again for committing
# After fixing any flake8 errors
git add .
git commit -m "Add feature"

Note that this will be reflected as one commit only, and not two, even though we commit twice :), because pre-commit fails the first commit.

Screenshots for Sample Usage

First Run:

(keras-nlp) C:\Users\abheesht\Desktop\keras-nlp>git add .

(keras-nlp) C:\Users\abheesht\Desktop\keras-nlp>git commit -m "Add feature" 
Fix End of Files.........................................................Failed
- hook id: end-of-file-fixer
- exit code: 1
- files were modified by this hook

Fixing sample.py

Trim Trailing Whitespace.................................................Failed
- hook id: trailing-whitespace
- exit code: 1
- files were modified by this hook

Fixing sample.py

Check Yaml...........................................(no files to check)Skipped
black....................................................................Failed
- hook id: black
- files were modified by this hook

reformatted sample.py
All done! ✨ 🍰 ✨
1 file reformatted.

seed isort known_third_party.............................................Passed
isort....................................................................Failed
- hook id: isort
- files were modified by this hook

Fixing C:\Users\abheesht\Desktop\keras-nlp\sample.py

flake8...................................................................Passed

Second Run

(keras-nlp) C:\Users\abheesht\Desktop\keras-nlp>git add .

(keras-nlp) C:\Users\abheesht\Desktop\keras-nlp>git commit -m "Add feature"
Fix End of Files.........................................................Passed
Trim Trailing Whitespace.................................................Passed
Check Yaml...........................................(no files to check)Skipped
black....................................................................Passed
seed isort known_third_party.............................................Passed
isort....................................................................Passed
flake8...................................................................Passed
[sample-branch b046ce9] Add feature
 1 file changed, 41 insertions(+)  
 create mode 100644 sample.py      

(keras-nlp) C:\Users\abheesht\Desktop\keras-nlp>git push

Note that we will have to install it first using the following commands:

pip install pre-commit
pre-commit install

Add a WordPiece tokenizer layer

TensorFlow Text provides a set of efficient, in-graph WordPiece tokenization ops.

We would like to expose these through a Keras layer in a way that is easily configurable, supports both tokenization and detokenization, and integrates properly with Keras functional models.

This can serve as an example for future subword tokenization work in this library.

Add Notebooks for Examples

We can add notebooks (or share Colab notebooks) for existing examples in the library (with instructive text and explanation).

Add a SentencePiece tokenizer layer

After landing the WordPiece tokenizer (#22) and hammering out a layerized tokenizer design, we should also add support for SentencePiece tokenization. SentencePiece is both language agnostic and reversible, and will be an important addition to our tokenization offering. SentencePiece graph ops are already supported by TensorFlow Text.

Add a learned positional embedding layer

We should expose a Keras layer for a learned positional embedding through keras_nlp.

This will take as input a maximum sequence length and embedding dimension size, and learn a positional embedding that can be combined with a token embedding.
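A rough Keras 3 sketch of such a layer (the eventual keras_nlp layer may expose a different signature; the class name is an assumption):

import keras

class LearnedPositionEmbedding(keras.layers.Layer):
    def __init__(self, max_length, embedding_dim, **kwargs):
        super().__init__(**kwargs)
        # One learned vector per position, up to max_length.
        self.position_embeddings = self.add_weight(
            name="position_embeddings",
            shape=(max_length, embedding_dim),
            initializer="glorot_uniform",
        )

    def call(self, inputs):
        # Slice to the actual sequence length; the result broadcasts over the
        # batch when added to a (batch, sequence, embedding_dim) token embedding.
        positions = keras.ops.arange(keras.ops.shape(inputs)[1])
        return keras.ops.take(self.position_embeddings, positions, axis=0)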

Adding debug instructions in case lint.sh/format.sh fail

I noticed that when I run lint.sh and format.sh, they fail if the file is in CRLF mode, while in LF mode they run fine. I feel a note on this could be a helpful addition here and might save people some time.
If this seems fine, I can raise a PR adding a couple of sentences in this regard.

[Query] Implementing Models

Is the KerasNLP team also looking to implement entire models or are we restricting ourselves to building blocks?

For example, if we take FNet, we currently have its encoder block in KerasNLP. So, will we stack these encoder blocks and implement the entire FNet model in keras_nlp.models, and at the same time, will we upload pretrained model weights to TFHub? Or will we leave that to the main TensorFlow repository (since many models are already present in the TF Model Garden)?

Thanks!

Add a UnicodeCharacterTokenizer

This would be a simple tokenizer which has no vocabulary, and simply converts text to unicode codepoints.

Potential docstring and usage

"""
Unicode character codepoint tokenizer.
 
This tokenizer is a vocabulary free tokenizer which will tokenize text as
unicode character codepoints.
 
Args:
   lowercase: if true, lowercase text before tokenizing.
   sequence_length: If set, the output will be converted to a dense
     tensor and padded/trimmed so all outputs are of sequence_length.
   normalization_form: One of the following string values (None, 'NFC',
     'NFKC', 'NFD', 'NFKD'). If set will normalize unicode to the given form
     before tokenizing.
   errors: One of ('replace', 'remove', 'strict'). Specifies the `detokenize()`
     behavior when an invalid codepoint is encountered. (same behavior as
     https://www.tensorflow.org/api_docs/python/tf/strings/unicode_transcode)
"""

tokenizer = keras_nlp.tokenizers.UnicodeCharacterTokenizer()

inputs = ["▀▁▂▃▄", "hello"]
tokens = tokenizer.tokenize(inputs)
>>> <tf.RaggedTensor [
      [9600, 9601, 9602, 9603, 9604],
      [104, 101, 108, 108, 111]]>

outputs = tokenizer.detokenize(tokens)
outputs = [x.decode("utf-8") for x in outputs.numpy().tolist()]
>>> ['▀▁▂▃▄', 'hello']

Colab demonstrating basic functionality:

https://colab.sandbox.google.com/gist/mattdangerw/9d9dcb8b640d8183a614ddd6f92fe368/unicodecharactertokenizer.ipynb

Add a byte pair encoding (BPE) tokenizer layer

We would like to add a BPE tokenizer (used by GPT-2, RoBERTa and others). Ideally this should be configurable to be compatible with the actual tokenization used by GPT-2 and RoBERTa, and run inside a TensorFlow graph.

Adding Encoder from TransEvolve

A recent work presented at NeurIPS, which can be viewed here, uses a temporal evolution scheme named TransEvolve to bypass costly dot-product attention over multiple stacked layers. Their results after extensive testing show that it outperforms the traditional Transformer on encoder-only tasks and is comparable to the original Transformer on other tasks. An implementation already exists here, and I feel this would be a worthy addition to the current set of encoders!
If this seems interesting enough, I can create a demo Colab and also implement this matching the format of the existing encoders.
Would love to hear your thoughts.

Purpose, scope and road-map of this repository

I just found this repository (and its CV counterpart, keras-cv). Unfortunately, its one-line description, i.e. "Industry-strength Natural Language Processing workflows with Keras", is a bit vague and does not help to understand its scope and purpose. It would be great if you could kindly explain what this repository intends to cover exactly, and if there is any road-map for development? Further, would you consider outside contributions as well?

Thanks.

Adding a Random Encoder for Baseline Runs

NLP papers often compare against baselines, and having a prebuilt random encoder could help with that. A random encoder is similar to a simple encoder, with a slight difference: each self-attention sublayer is replaced with two constant random matrices, one applied to the hidden dimension and one applied to the sequence dimension. It is described in the FNet paper, which was implemented by @abheesht17.
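A rough Keras 3 sketch of that mixing sublayer, assuming fixed, non-trainable Gaussian matrices; class and argument names are illustrative:

import keras

class RandomMixing(keras.layers.Layer):
    """Replaces self-attention with two constant random mixing matrices."""

    def __init__(self, sequence_length, hidden_dim, **kwargs):
        super().__init__(**kwargs)
        init = keras.initializers.RandomNormal(stddev=0.02)
        self.sequence_matrix = self.add_weight(
            name="sequence_matrix",
            shape=(sequence_length, sequence_length),
            initializer=init,
            trainable=False,
        )
        self.hidden_matrix = self.add_weight(
            name="hidden_matrix",
            shape=(hidden_dim, hidden_dim),
            initializer=init,
            trainable=False,
        )

    def call(self, x):
        # Mix across the sequence axis, then across the hidden axis.
        x = keras.ops.einsum("ns,bsh->bnh", self.sequence_matrix, x)
        return keras.ops.einsum("bnh,hd->bnd", x, self.hidden_matrix)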

Add Token Classification, Text Summarisation, QA Examples

We can add a few examples:

  • Token Classification with BERT
    Dataset: CoNLL 2003
    What's different? Here, we have to classify every word into its NER type. However, since BERT tokenises text into subwords, we will have more tokens than labels (the number of labels will be equal to the number of words). So, we have to assign to all subwords the label of the word which spawned it. I think this will give users a good overview of how tokenisation in BERT is done (see the sketch after this list).
  • Question Answering with BERT
    Dataset: SQuAD or CoQA
    What's different? Here, we have to assign to every token, the probability of it being a starting token and an ending token.
  • Text Summarisation (Abstractive)
    Dataset: CNN/Daily Mail Dataset
    What's different? Not much; pretty similar to the NMT example which is already present.
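A rough sketch of the subword label alignment described in the token classification item above (tokenize_into_subwords is a hypothetical stand-in for a real WordPiece tokenizer):

def align_labels_to_subwords(words, word_labels, tokenize_into_subwords):
    subword_tokens, subword_labels = [], []
    for word, label in zip(words, word_labels):
        pieces = tokenize_into_subwords(word)
        subword_tokens.extend(pieces)
        # Every subword inherits the label of the word that spawned it.
        subword_labels.extend([label] * len(pieces))
    return subword_tokens, subword_labels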

Let me know which tasks here are worth adding, and I will add them. I understand this is a low priority task, but it's taking me a while to understand the tokenizer code in the SentencePiece library. So, I can take this up in the meantime.

Adversarial Attacks and Testing the Robustness of Models

Branching off from the issue which @aflah02 opened a few weeks ago, #39:

Is the KerasNLP team interested in implementing adversarial attacks? We could start off with simple attacks on classification models.

I understand if this is a bit broad, and the team may want to integrate it later to the repository, especially because we may need some augmentation APIs. For example, some adversarial attacks may want to perturb only those words which are assigned a higher importance score by the model. For perturbation, we can leverage the augmentation APIs.

A good resource is https://github.com/QData/TextAttack.

Add FNet Encoder Layer

@mattdangerw , @chenmoneygithub and the rest of the keras-nlp team:

The FNet paper, which came out last year, uses FFTs instead of attention and achieves 95% of BERT's performance on various downstream tasks. It would be great if we could add this to the library. Simultaneously, a text classification example can be added (using a few FNet Encoder layers). Would love to take this up, if the team is interested :).
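For reference, the paper's token-mixing sublayer (the piece that replaces self-attention in each encoder block) can be sketched in TensorFlow as:

import tensorflow as tf

def fourier_mixing(x):
    # 2D FFT over the (sequence, hidden) axes; FNet keeps only the real part.
    return tf.math.real(tf.signal.fft2d(tf.cast(x, tf.complex64)))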

Add NLP-specific metrics

@mattdangerw and the keras-nlp team:

For standard classification metrics (AUC, F1, Precision, Recall, Accuracy, etc.), keras.metrics can be used. But there are several NLP-specific metrics which can be implemented here, i.e., we can expose native APIs for these metrics.

I would like to take this up. I can start with the popular ones first and open PRs. Let me know if this is something the team is looking to add!

I've listed a few metrics (this list is, by no means, comprehensive):

  • Perplexity (see the sketch after this list)

  • ROUGE
    paper
    Pretty standard metric for text generation. We can implement all variations: ROUGE-N, ROUGE-L, ROUGE-W, etc.

  • BLEU
    paper
    Another standard text generation metric.
    Note: We can also implement SacreBleu.

  • BertScore
    paper, code

  • Bleurt
    paper, code

  • (character n-gram F-score) chrF and chrF++
    paper, code

  • COMET
    paper, code

  • Character Error Rate, Word Error Rate, etc.
    paper

  • Pearson Coefficient and Spearman Coefficient
    Looks like keras.metrics does not have these two metrics. They are not NLP-specific metrics...so, maybe, implementing them in Keras is better than implementing them here.
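As an example of how lightweight some of these can be, a rough sketch of perplexity as the exponential of the mean token-level cross-entropy (ignoring padding masks for brevity; y_pred is assumed to hold per-token probabilities over the vocabulary):

import keras

def perplexity(y_true, y_pred):
    # y_true: integer token ids, y_pred: probabilities over the vocabulary.
    cross_entropy = keras.losses.sparse_categorical_crossentropy(y_true, y_pred)
    return keras.ops.exp(keras.ops.mean(cross_entropy))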

Thank you!

Adding Data Augmentation Techniques Natively

I'm interested in contributing scripts which allow users to incorporate data augmentation techniques directly without using external libraries.
I can start with stuff like synonym replacement, random insertion, random swap, and random deletion from the paper EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks
Over time this can be extended to incorporate more techniques as well such as the additional techniques mentioned here
Any hints and tips on how I can get started?

Edit: I also found this survey paper which seems pretty useful: A Survey of Data Augmentation Approaches for NLP
