webis-de / small-text

Active Learning for Text Classification in Python

Home Page: https://small-text.readthedocs.io/

License: MIT License

active-learning text-classification pytorch transformers natural-language-processing python machine-learning deep-learning nlp looking-for-contributors

small-text's Introduction



Active Learning for Text Classification in Python.


Installation | Quick Start | Contribution | Changelog | Docs

Small-Text provides state-of-the-art Active Learning for Text Classification. Several pre-implemented Query Strategies, Initialization Strategies, and Stopping Criteria are provided, which can be easily mixed and matched to build active learning experiments or applications.

What is Active Learning?
Active Learning allows you to efficiently label training data in a small data scenario.

Features

  • Provides unified interfaces for Active Learning so that you can easily mix and match query strategies with classifiers provided by sklearn, PyTorch, or transformers.
  • Supports GPU-based PyTorch models and integrates transformers so that you can use state-of-the-art Text Classification models for Active Learning.
  • GPU is supported but not required. For CPU-only use cases, a lightweight installation requires only a minimal set of dependencies.
  • Multiple scientifically evaluated components are pre-implemented and ready to use (Query Strategies, Initialization Strategies, and Stopping Criteria).

News

  • Version 1.3.3 (v1.3.3) - December 29th, 2023
    • Bugfix release.
  • Version 1.3.2 (v1.3.2) - August 19th, 2023
    • Bugfix release.
  • Paper accepted at EACL 2023 🎉
  • Version 1.3.0 (v1.3.0): Highlights - February 20th, 2023
  • Version 1.2.0 (v1.2.0): Highlights - February 4th, 2023
    • Make huggingface/setfit (SetFit) usable as a small-text classifier.
    • New query strategy: BALD.
    • Added two new SetFit notebooks, and also updated existing notebooks.

For a complete list of changes, see the change log.

Installation

Small-Text can be easily installed via pip:

pip install small-text

For a full installation, include the transformers extra:

pip install small-text[transformers]

Small-Text requires Python 3.7 or newer. GPU usage requires CUDA 10.1 or newer. More information on the installation can be found in the documentation.

Quick Start

For a quick start, see the provided examples for binary classification, pytorch multi-class classification, and transformer-based multi-class classification, or check out the notebooks.
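As a minimal, hedged sketch of the core loop (assuming small-text 1.x top-level exports such as SklearnDataset, SklearnClassifierFactory, and PredictionEntropy; the linked examples and notebooks are authoritative):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

from small_text import (
    PoolBasedActiveLearner,
    PredictionEntropy,
    SklearnClassifierFactory,
    SklearnDataset,
    random_initialization_balanced,
)

# Toy data: 100 tiny documents with binary sentiment labels.
texts = ['good', 'bad', 'great', 'awful'] * 25
labels = np.array([1, 0, 1, 0] * 25)
train = SklearnDataset(TfidfVectorizer().fit_transform(texts), labels)

clf_factory = SklearnClassifierFactory(LogisticRegression(), 2)  # base estimator, num_classes
active_learner = PoolBasedActiveLearner(clf_factory, PredictionEntropy(), train)

# Label a small balanced seed set, then run a few query/update iterations.
indices_initial = random_initialization_balanced(train.y, n_samples=10)
active_learner.initialize_data(indices_initial, train.y[indices_initial])

for _ in range(3):
    indices_queried = active_learner.query(num_samples=10)
    y_new = train.y[indices_queried]  # simulated oracle; replace with real annotations
    active_learner.update(y_new)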

Notebooks

#  Notebook
1  Intro: Active Learning for Text Classification with Small-Text (Open in Colab)
2  Using Stopping Criteria for Active Learning (Open in Colab)
3  Active Learning using SetFit (Open in Colab)
4  Using SetFit's Zero Shot Capabilities for Cold Start Initialization (Open in Colab)

Showcase

A full list of showcases can be found in the docs.

🎀 Would you like to share your use case? Whether it is a paper, an experiment, a practical application, a thesis, a dataset, or something else, let us know and we will add you to the showcase section or even here.

Documentation

Read the latest documentation here.

Alternatives

modAL, ALiPy, libact

Contribution

Contributions are welcome. Details can be found in CONTRIBUTING.md.

Acknowledgments

This software was created by Christopher Schröder (@chschroeder) at Leipzig University's NLP group, which is part of the Webis research network. The encompassing project was funded by the Development Bank of Saxony (SAB) under project number 100335729.

Citation

Small-Text was introduced in detail in the EACL 2023 System Demonstration paper "Small-Text: Active Learning for Text Classification in Python", which can be cited as follows:

@inproceedings{schroeder2023small-text,
    title = "Small-Text: Active Learning for Text Classification in Python",
    author = {Schr{\"o}der, Christopher  and  M{\"u}ller, Lydia  and  Niekler, Andreas  and  Potthast, Martin},
    booktitle = "Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations",
    month = may,
    year = "2023",
    address = "Dubrovnik, Croatia",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.eacl-demo.11",
    pages = "84--95"
}

License

MIT License

small-text's People

Contributors

chschroeder, jp-systemsx, rmitsch, vmanc, zakih2


small-text's Issues

More sophisticated early stopping mechanism

Feature description / Motivation

The current implementation is based on a paper in which training accuracy was monitored. I want to monitor validation loss/accuracy. Moreover, the implementation is difficult to extend as it is right now.

LightweightCoreset should be batched

Feature description

The lightweight_coreset function should compute the distances in batches similar to greedy_coreset. Therefore a batch_size kwarg needs to be added and integrated into the function in the same manner. This keyword must also be added to LightweightCoreset (query strategy) and passed in the function call (similar to GreedyCoreset).
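A minimal sketch of the batched distance computation this asks for (hypothetical helper name; the actual change belongs inside lightweight_coreset itself):

import numpy as np
from sklearn.metrics import pairwise_distances

def batched_min_distances(x, centers, batch_size=100):
    # Distance from each row of x to its nearest row in centers,
    # computed batch by batch to bound peak memory.
    mins = np.empty(x.shape[0])
    for start in range(0, x.shape[0], batch_size):
        batch = x[start:start + batch_size]
        mins[start:start + batch_size] = pairwise_distances(batch, centers).min(axis=1)
    return mins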

Motivation

This will reduce max memory used and, moreover, will align the lightweight and greedy coreset implementations.

Additional comments

Everything that needs to be adapted is currently located under small_text.query_strategies.coresets.

Quickstart Colab notebooks not working


AttributeError                            Traceback (most recent call last)
<ipython-input> in <module>
      2
      3
----> 4 train = TransformersDataset.from_arrays(raw_dataset['train']['text'],
      5                                         raw_dataset['train']['label'],
      6                                         tokenizer,

AttributeError: type object 'TransformersDataset' has no attribute 'from_arrays'

Getting error 'RuntimeError: expected scalar type Long but found Int' while running the starting code

Bug description

I am getting the following error

RuntimeError: expected scalar type Long but found Int

related to the line

indices_labeled = initialize_active_learner(active_learner, train.y)

in the code provided here

https://github.com/webis-de/small-text/blob/v1.1.1/examples/notebooks/02-active-learning-with-stopping-criteria.ipynb

I am using the latest version.

Python version: 3.8.8
small-text version: 1.1.1
torch version (if applicable): 1.13.0+cpu

Full error:

RuntimeError Traceback (most recent call last)
in
28
29 active_learner = PoolBasedActiveLearner(clf_factory, query_strategy, train)
---> 30 indices_labeled = initialize_active_learner(active_learner, train.y)
31

in initialize_active_learner(active_learner, y_train)
12
13 indices_initial = random_initialization_balanced(y_train, n_samples=20)
---> 14 active_learner.initialize_data(indices_initial, y_train[indices_initial])
15
16 return indices_initial

~\Anaconda3\lib\site-packages\small_text\active_learner.py in initialize_data(self, indices_initial, y_initial, indices_ignored, indices_validation, retrain)
149
150 if retrain:
--> 151 self._retrain(indices_validation=indices_validation)
152
153 def query(self, num_samples=10, representation=None, query_strategy_kwargs=dict()):

~\Anaconda3\lib\site-packages\small_text\active_learner.py in _retrain(self, indices_validation)
388
389 if indices_validation is None:
--> 390 self._clf.fit(dataset)
391 else:
392 indices = np.arange(self.indices_labeled.shape[0])

~\Anaconda3\lib\site-packages\small_text\integrations\transformers\classifiers\classification.py in fit(self, train_set, validation_set, weights, early_stopping, model_selection, optimizer, scheduler)
366 use_sample_weights=weights is not None)
367
--> 368 return self._fit_main(sub_train, sub_valid, sub_train_weights, early_stopping,
369 model_selection, fit_optimizer, fit_scheduler)
370

~\Anaconda3\lib\site-packages\small_text\integrations\transformers\classifiers\classification.py in _fit_main(self, sub_train, sub_valid, weights, early_stopping, model_selection, optimizer, scheduler)
389
390 with tempfile.TemporaryDirectory(dir=get_tmp_dir_base()) as tmp_dir:
--> 391 self._train(sub_train, sub_valid, weights, early_stopping, model_selection,
392 optimizer, scheduler, tmp_dir)
393 self._perform_model_selection(optimizer, model_selection)

~\Anaconda3\lib\site-packages\small_text\integrations\transformers\classifiers\classification.py in _train(self, sub_train, sub_valid, weights, early_stopping, model_selection, optimizer, scheduler, tmp_dir)
435 start_time = datetime.datetime.now()
436
--> 437 train_acc, train_loss, valid_acc, valid_loss, stop = self._train_loop_epoch(epoch,
438 sub_train,
439 sub_valid,

~\Anaconda3\lib\site-packages\small_text\integrations\transformers\classifiers\classification.py in _train_loop_epoch(self, num_epoch, sub_train, sub_valid, weights, early_stopping, model_selection, optimizer, scheduler, tmp_dir)
471 validate_every = None
472
--> 473 train_loss, train_acc, valid_loss, valid_acc, stop = self._train_loop_process_batches(
474 num_epoch,
475 sub_train,

~\Anaconda3\lib\site-packages\small_text\integrations\transformers\classifiers\classification.py in _train_loop_process_batches(self, num_epoch, sub_train, sub_valid_, weights, early_stopping, model_selection, optimizer, scheduler, tmp_dir, validate_every)
505 for i, (x, masks, cls, weight, *_) in enumerate(train_iter):
506 if not stop:
--> 507 loss, acc = self._train_single_batch(x, masks, cls, weight, optimizer)
508 scheduler.step()
509

~\Anaconda3\lib\site-packages\small_text\integrations\transformers\classifiers\classification.py in _train_single_batch(self, x, masks, cls, weight, optimizer)
561 outputs = self.model(x, attention_mask=masks)
562
--> 563 logits, loss = self._compute_loss(cls, outputs)
564 loss = loss * weight
565 loss = loss.mean()

~\Anaconda3\lib\site-packages\small_text\integrations\transformers\classifiers\classification.py in _compute_loss(self, cls, outputs)
585 logits = outputs.logits.view(-1, self.num_classes)
586 target = cls
--> 587 loss = self.criterion(logits, target)
588
589 return logits, loss

~\Anaconda3\lib\site-packages\torch\nn\modules\module.py in _call_impl(self, *input, **kwargs)
1188 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1189 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1190 return forward_call(*input, **kwargs)
1191 # Do not call functions when jit is used
1192 full_backward_hooks, non_full_backward_hooks = [], []

~\Anaconda3\lib\site-packages\torch\nn\modules\loss.py in forward(self, input, target)
1172
1173 def forward(self, input: Tensor, target: Tensor) -> Tensor:
-> 1174 return F.cross_entropy(input, target, weight=self.weight,
1175 ignore_index=self.ignore_index, reduction=self.reduction,
1176 label_smoothing=self.label_smoothing)

~\Anaconda3\lib\site-packages\torch\nn\functional.py in cross_entropy(input, target, weight, size_average, ignore_index, reduce, reduction, label_smoothing)
3024 if size_average is not None or reduce is not None:
3025 reduction = _Reduction.legacy_get_string(size_average, reduce)
-> 3026 return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
3027
3028

RuntimeError: expected scalar type Long but found Int

Unify the classifiers' initialize() methods

Feature description

KimCNNClassifier, TransformerBasedClassification, and SetFitClassification each have an initialize() method, but these methods are nothing alike.

Motivation

--

Additional comments

--

Provide animated GIF showing an active learning experiment

Feature description

For the README.md, I would like to have an animated GIF which shows the progress of an active learning experiment. The script might also be put in the repository (likely under /docs?) so that we can recreate it in the future.

What should this look like?
I am imagining at least a learning curve (possibly with some additional information on the side).

Expectations
This issue is solved by providing a small script and a gif, and by including the gif in the README.md.

Motivation

This would greatly benefit the readme and help people to grasp what this repo is about.

Additional comments

fit() got an unexpected keyword argument 'validation_set'

Hi,

I'm initializing an active learner for an Sklearn model with specific validation indices. Minimal code example is:

import numpy as np

def initialize_learner(learner, train, test_sets, init_n):
    print('\n----Initializing----\n')
    iter_results_dict = {}
    iter_preds_dict = {}
    # Initialize the model - this is required for model-based query strategies.
    indices_neg_label = np.where(train.y == 0)[0]
    indices_pos_label = np.where(train.y == 1)[0]
    if init_n == 4:
        x_indices_initial = np.concatenate([
            np.random.choice(indices_pos_label, int(init_n / 2), replace=False),
            np.random.choice(indices_neg_label, int(init_n / 2), replace=False)])
        x_indices_initial = x_indices_initial.astype(int)
        y_initial = np.array([train.y[i] for i in x_indices_initial])
        val_indices = x_indices_initial[1:3]
        # use half of the initial indices for validation
        learner.initialize_data(x_indices_initial, y_initial, x_indices_validation=val_indices)
    iter_results_dict[0], iter_preds_dict[0] = evaluate(learner, train[x_indices_initial], test_sets, x_indices_initial)
    return learner, x_indices_initial, iter_results_dict, iter_preds_dict

The error I am getting is: fit() got an unexpected keyword argument 'validation_set'. Digging into the code, it seems like this shouldn't happen if you pass x_indices_validation as not None.

Do you have any suggestions?

Specifying multiple query strategies

When initialising a PoolBasedActiveLearner as active_learner and then using active_learner.query(num_samples=20), is it possible to specify more than one query strategy, e.g. select 5 examples by PredictionEntropy(), 5 by EmbeddingKMeans(), 5 by RandomSampling(), etc.?

I can initialise a new active learner object with a different query strategy for each sub-query, but it would be great if you could specify multiple query strategies for the active learner.
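This is not built into small-text, but a rough workaround sketch could call the strategies' query() directly (v1.x signature, as seen in the tracebacks elsewhere on this page; EmbeddingKMeans additionally requires a classifier that can embed):

import numpy as np
from small_text import EmbeddingKMeans, PredictionEntropy, RandomSampling

def query_mixed(clf, dataset, indices_unlabeled, indices_labeled, y, n_per_strategy=5):
    # Round-robin over several strategies, removing each sub-query's picks
    # from the remaining pool so no example is selected twice.
    queried = []
    remaining = np.copy(indices_unlabeled)
    for strategy in [PredictionEntropy(), EmbeddingKMeans(), RandomSampling()]:
        indices = strategy.query(clf, dataset, remaining, indices_labeled, y, n=n_per_strategy)
        queried.extend(indices)
        remaining = np.setdiff1d(remaining, indices)
    return np.array(queried)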

When using EmbeddingBasedQueryStrategy with some transformer models, the model has an unsupported input `token_type_ids` when creating embeddings.

Bug description

Requires query_strategy to be a subclass of EmbeddingBasedQueryStrategy, such as EmbeddingKMeans;
Requires transformer_model to be a model that does not expect token_type_ids in its forward function, such as distilbert-base-uncased

Steps to reproduce

When performing active learning, the model has an unsupported input token_type_ids when creating embeddings.

Expected behavior

The keys of model input are adjusted according to the specific models.

Cause:

In file small_text/integrations/transformers/classifiers/classification.py, function _create_embeddings:
the following code:

outputs = self.model(text,
                     token_type_ids=None,
                     attention_mask=masks,
                     output_hidden_states=True)

need to be changed to

outputs = self.model(text,
                     attention_mask=masks,
                     output_hidden_states=True)

removing the token_type_ids field if the underlying model does not expect token_type_ids in its forward function.

Environment:

Python version: 3.11.7
small-text version: 1.3.3
small-text integrations (e.g., transformers): transformers 4.36.2
PyTorch version: 2.1.2
PyTorch-cuda: 11.8

Adding special tokens to tokenizer (transformers-integration)

I need to add some special tokens to the BERT tokenizer. However, I am not sure how to resize the model tokenizer to incorporate the added special tokens with the small-text transformers integration.

With transformers, you can add special tokens using:

tokenizer.add_tokens(['newWord', 'newWord2'])
model.resize_token_embeddings(len(tokenizer))

How does this change with a clf_factory and when initialising the transformers model as a pool-based active learner? E.g., with the code from the 01-active-learning-for-text-classification-with-small-text-intro.ipynb notebook:

from small_text.integrations.transformers.datasets import TransformersDataset


def get_transformers_dataset(tokenizer, data, labels, max_length=60):

    data_out = []

    for i, doc in enumerate(data):
        encoded_dict = tokenizer.encode_plus(
            doc,
            add_special_tokens=True,
            padding='max_length',
            max_length=max_length,
            return_attention_mask=True,
            return_tensors='pt',
            truncation='longest_first'
        )

        data_out.append((encoded_dict['input_ids'], encoded_dict['attention_mask'], labels[i]))

    return TransformersDataset(data_out)


train = get_transformers_dataset(tokenizer, raw_dataset['train']['text'], raw_dataset['train']['label'])
test = get_transformers_dataset(tokenizer, raw_dataset['test']['text'], raw_dataset['test']['label'])

transformer_model = TransformerModelArguments(transformer_model_name)
clf_factory = TransformerBasedClassificationFactory(transformer_model, 
                                                    num_classes, 
                                                    kwargs=dict({'device': 'cuda', 
                                                                 'mini_batch_size': 32,
                                                                 'early_stopping_no_improvement': -1
                                                                }))
active_learner = PoolBasedActiveLearner(clf_factory, query_strategy, train)
    

torch.randperm: RuntimeError: Expected a 'cuda' device type for generator but found 'cpu'

I just noticed that torch>=1.9.0 seems to have brought a change that leads to an exception where none was thrown before.

Setup
Small-text: 1.0.0a4
Torch: 1.9.1

Description
The following error occurs when executing examples/pytorch_multiclass_classification.py:

  File "/my/path/.pyenv/versions/3.8.2/lib/python3.8/runpy.py", line 193, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/my/path/.pyenv/versions/3.8.2/lib/python3.8/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/my/path/small-text/examples/pytorch_multiclass_classification.py", line 102, in <module>
    main()
  File "/my/path/small-text/examples/pytorch_multiclass_classification.py", line 52, in main
    active_learner.initialize_data(labeled_indices, y_initial)
  File "/my/path/small-text/small_text/active_learner.py", line 141, in initialize_data
    self._retrain(x_indices_validation=x_indices_validation)
  File "/my/path/small-text/small_text/active_learner.py", line 384, in _retrain
    self._clf.fit(x)
  File "/my/path/small-text/small_text/integrations/pytorch/classifiers/kimcnn.py", line 176, in fit
    return self._fit_main(sub_train, sub_valid)
  File "/my/path/small-text/small_text/integrations/pytorch/classifiers/kimcnn.py", line 198, in _fit_main
    res = self._train(sub_train, sub_valid, tmp_dir)
  File "/my/path/small-text/small_text/integrations/pytorch/classifiers/kimcnn.py", line 218, in _train
    train_loss, train_acc = self._train_func(sub_train)
  File "/my/path/small-text/small_text/integrations/pytorch/classifiers/kimcnn.py", line 263, in _train_func
    for i, (text, cls) in enumerate(train_iter):
  File "/my/path/.local/share/virtualenvs/myvenv-123/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/my/path/.local/share/virtualenvs/myvenv-123/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 560, in _next_data
    index = self._next_index()  # may raise StopIteration
  File "/my/path/.local/share/virtualenvs/myvenv-123/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 512, in _next_index
    return next(self._sampler_iter)  # may raise StopIteration
  File "/my/path/.local/share/virtualenvs/myvenv-123/lib/python3.8/site-packages/torch/utils/data/sampler.py", line 226, in __iter__
    for idx in self.sampler:
  File "/my/path/.local/share/virtualenvs/myvenv-123/lib/python3.8/site-packages/torch/utils/data/sampler.py", line 124, in __iter__
    yield from torch.randperm(n, generator=generator).tolist()
RuntimeError: Expected a 'cuda' device type for generator but found 'cpu'

Solution

I have yet to investigate what would be the optimal solution for this.

A workaround is to downgrade pytorch (and torchtext):
pip install torch==1.8.1 torchtext==0.9.1

More information
pytorch/pytorch/issues/44714

Incremental Training Documentation

In active_learner.py, the incremental training parameter is described as:

incremental_training : bool
        If False, creates and trains a new classifier only before the first query,
        otherwise re-trains the existing classifier. Incremental training must be supported
        by the classifier provided by `clf_factory`."

Is there a way to retrain the model from scratch after each queried batch? The documentation suggests that the existing classifier is updated in both cases, since even when False, it "creates and trains a new classifier only before the first query."

Thank you!

Batch size in greedy coreset batching is different than expected

Bug description

To reduce memory usage, the greedy coreset is computed in batches. The number and size of the batches are currently based on the wrong set.

Nevertheless, the method has so far failed silently but gracefully, resulting in batches of a different size than expected, except when the number of unlabeled indices is less than the batch size, in which case it results in an error similar to the following:

<...>
  File "/path/to/site-packages/small_text/query_strategies/coresets.py", line 131, in sample
    return greedy_coreset(embeddings, indices_unlabeled, indices_labeled, n,
  File "/path/to/site-packages/small_text/query_strategies/coresets.py", line 79, in greedy_coreset
    dist = dist_func(batch, x[indices_s], normalized=normalized)
  File "/path/to/site-packages/small_text/query_strategies/coresets.py", line 25, in _euclidean_distance
    return pairwise_distances(a, b, metric='euclidean')
  File "/path/to/site-packages/sklearn/metrics/pairwise.py", line 2195, in pairwise_distances
    return _parallel_pairwise(X, Y, func, n_jobs, **kwds)
  File "/path/to/site-packages/sklearn/metrics/pairwise.py", line 1765, in _parallel_pairwise
    return func(X, Y, **kwds)
  File "/path/to/site-packages/sklearn/metrics/pairwise.py", line 310, in euclidean_distances
    X, Y = check_pairwise_arrays(X, Y)
  File "/path/to/site-packages/sklearn/metrics/pairwise.py", line 165, in check_pairwise_arrays
    X = check_array(
  File "/path/to/site-packages/sklearn/utils/validation.py", line 969, in check_array
    raise ValueError(
ValueError: Found array with 0 sample(s) (shape=(0, 768)) while a minimum of 1 is required by check_pairwise_arrays.

Steps to reproduce

--

Environment:

small-text version: 1.3.x, 2.0.0-dev

Additional information

--

Additional options for Discriminative Active Learning

Hi!

First of all, thanks so much for a fantastic library - this was amazing to discover and has saved me weeks of time.

I'm using your implementation of Discriminative Active Learning (DAL) and I believe it is cold-starting the classifier every iteration.

The DAL paper recommends further finetuning the model that was trained for classification - and in their accompanying blog post they show that this significantly outperforms cold starting the model. (Blog post: https://dsgissin.github.io/DiscriminativeActiveLearning/2018/07/05/DAL.html)

[Figure from the DAL blog post: learning curves showing that fine-tuning the trained classifier outperforms cold-starting it]

I was wondering if this could be an option in your implementation of DAL.

Thanks!
Emily

Query strategy that includes selecting high/medium certainty examples

Feature description

The existing query strategies mostly seem to select data the model is particularly uncertain about (high entropy, ties, least confident ...).
Are there other query strategies that also mix some data points into the training pool where the model is more certain?

Motivation

Many use-cases I work on deal with noisy data. So after a model has obtained a certain quality, query strategies that only select uncertain examples can actually select data that is of low quality. Instead, it would be good to have a way of also adding some high or medium certainty examples to the training pool. The idea is that this helps the model get some good, not-so-difficult examples to help it learn the task - instead of always feeding it very difficult and potentially noisy/wrong data points that can hurt performance.

This is also an important use-case for zero-shot or few-shot models (like the Hugging Face zero-pipeline), which are getting more and more popular. They already have decent accuracy for the task and selecting highly uncertain examples can actually hurt the training process by selecting noise / examples that are inherently uncertain.
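To illustrate the idea, a custom strategy along these lines could subclass QueryStrategy (hypothetical, untested sketch; the v1.x query() signature is assumed from the tracebacks on this page):

import numpy as np
from small_text.query_strategies import QueryStrategy

class MixedCertaintySampling(QueryStrategy):
    # Half of the budget goes to the least confident examples,
    # half to medium-confidence ones.
    def query(self, clf, dataset, indices_unlabeled, indices_labeled, y, n=10):
        proba = clf.predict_proba(dataset[indices_unlabeled])
        confidence = np.max(proba, axis=1)
        order = np.argsort(confidence)              # least confident first
        uncertain = order[:n // 2]
        mid_start = len(order) // 2
        medium = order[mid_start:mid_start + (n - n // 2)]
        return indices_unlabeled[np.concatenate([uncertain, medium])]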

Additional comments

I really like your library and planning on using it for my research in the coming months :)

EmbeddingKMeans sample() got an unexpected keyword argument 'embeddings_proba'

Hi,

I'm trying to run the 01-active-learning-for-text-classification-with-small-text-intro.ipynb notebook with EmbeddingKMeans.
I set the query strategy to EmbeddingKMeans and initialised the active learner:

transformer_model = TransformerModelArguments(transformer_model_name)
clf_factory = TransformerBasedClassificationFactory(transformer_model, 
                                                    num_classes, 
                                                    kwargs=dict({'device': 'cuda', 
                                                                 'mini_batch_size': 32,
                                                                 'early_stopping_no_improvement': -1
                                                                }))
query_strategy = EmbeddingKMeans()

active_learner = PoolBasedActiveLearner(clf_factory, query_strategy, train)
labeled_indices = initialize_active_learner(active_learner, train.y)

According to strategies.py, embeddings can be passed as None into the class EmbeddingBasedQueryStrategy(QueryStrategy).

However I am getting the error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-8-9f2deb8154a9> in <module>()
     23 for i in range(num_queries):
     24     # ...where each iteration consists of labelling 20 samples
---> 25     q_indices = active_learner.query(num_samples=20)
     26 
     27     # Simulate user interaction here. Replace this for real-world usage.

2 frames
/usr/local/lib/python3.7/dist-packages/small_text/active_learner.py in query(self, num_samples, x, query_strategy_kwargs)
    186                                                          self.y,
    187                                                          n=num_samples,
--> 188                                                          **query_strategy_kwargs)
    189         return self.queried_indices
    190 

/usr/local/lib/python3.7/dist-packages/small_text/query_strategies/strategies.py in query(self, clf, x, x_indices_unlabeled, x_indices_labeled, y, n, pbar, embeddings, embed_kwargs)
    249                                                   n, embeddings)
    250                 else:
--> 251                     raise e
    252 
    253         return x_indices_unlabeled[sampled_indices]

/usr/local/lib/python3.7/dist-packages/small_text/query_strategies/strategies.py in query(self, clf, x, x_indices_unlabeled, x_indices_labeled, y, n, pbar, embeddings, embed_kwargs)
    241                     if embeddings is None else embeddings
    242                 sampled_indices = self.sample(clf, x, x_indices_unlabeled, x_indices_labeled,
--> 243                                               y, n, embeddings, embeddings_proba=proba)
    244             except TypeError as e:
    245                 if 'got an unexpected keyword argument \'return_proba\'' in e.args[0]:

TypeError: sample() got an unexpected keyword argument 'embeddings_proba'

Could you advise a solution?

Add ActivePETs query strategy

Feature description

ActivePETs is a recent query strategy that has been shown to be very effective for few-shot tasks. It uses pattern-exploiting training (PET) over an ensemble of classifiers.

Additional comments

  • This requires an EnsembleClassifier, which small-text does not have yet.
  • @XiaZeng0223 might be able to provide feedback on the implementation :)

Device-side assertion fails when training on a cuda device and tokens have been added to the tokenizer

Bug description

If using cuda, the transformer model will fail a device-side assertion if there are additional special tokens in the tokenizer. This does not happen when device='cpu' or device='mps' is specified, suggesting that this might be a PyTorch issue. However, the workaround cannot be implemented via the small-text API and requires modification of its source code.

Steps to reproduce

Using small-text/tree/main/examples/examplecode/transformers_multiclass_classification.py as an example:

Change
tokenizer = AutoTokenizer.from_pretrained(TRANSFORMER_MODEL.model, cache_dir='.cache/')
to

tokenizer = AutoTokenizer.from_pretrained(TRANSFORMER_MODEL.model, cache_dir='.cache/')
tokenizer.add_special_tokens({'additional_special_tokens': ['[SPECIAL1]', '[SPECIAL2]']})

This will cause the device-side assertion to fail when using cuda:

clf_factory = TransformerBasedClassificationFactory(TRANSFORMER_MODEL,
                                                        num_classes,
                                                        kwargs=dict({
                                                            'device': 'cuda'
                                                        }))

due to embedding size mismatch.

Expected behavior

The model automatically adjusts to the new vocabulary size.

Workaround:

In the file small_text/integrations/transformers/utils/classification.py, in the function _initialize_transformer_components, change the following

model = AutoModelForSequenceClassification.from_pretrained(
        transformer_model.model,
        from_tf=False,
        config=config,
        cache_dir=cache_dir,
        force_download=from_pretrained_options.force_download,
        local_files_only=from_pretrained_options.local_files_only
    )

to

model = AutoModelForSequenceClassification.from_pretrained(
        transformer_model.model,
        from_tf=False,
        config=config,
        cache_dir=cache_dir,
        force_download=from_pretrained_options.force_download,
        local_files_only=from_pretrained_options.local_files_only
    )
    model.resize_token_embeddings(new_num_tokens=NEW_VOCAB_SIZE)

adding the final line. This requires the new vocab size to be hard-coded, because the customised tokenizer is inaccessible in this function. If the tokenizer were accessible, the final line could simply be changed to model.resize_token_embeddings(new_num_tokens=len(tokenizer))

Environment:

Python version: 3.11.7
small-text version: 1.3.3
small-text integrations (e.g., transformers): transformers 4.36.2
PyTorch version: 2.1.2
PyTorch-cuda: 11.8

How does an active learning loop work in real life?

I have gone through the notebooks and am very keen to get started, but I would like to know how an active learning loop works in the real world. Every iteration requires a few samples to be labelled, after which a new model has to be trained. How do we manage the versioning, etc.? How can the process be streamlined?

Setting up a PoolBasedActiveLearner without initialization.

Hi,
I am training a transformers model in a separate script over a pre-defined training set. I want to then use this classifier to query examples from the unlabelled pool. I can load the trained model from pre-trained pytorch model files or from PoolBasedActiveLearner.load('test-model/active_leaner.pkl').

However, I then don't want to initialise this model, as it has already been trained on a portion of the labelled data. Is it possible to still query data, i.e. learner.query(), without running the initialization step learner.initialize_data(x_indices_train, y_train, x_indices_validation=val_indices)?

Alternatively, is it possible to still run this initialisation step without running any training, i.e. ignoring all indices for initialisation or setting the number of initialisation examples to zero via x_indices_initial = random_initialization(y_train, n_samples=0)?

Really appreciate your help on this one!

Thanks :)

Data set path

If I have a dataset where part of the data is labeled and the rest is not, how can I pass this dataset to small-text in order to label the rest?

Update setfit version (>1.0.0)

Feature description

Update setfit to a more recent version (>1.0.0). This will require changes in SetFitClassification.

Motivation

Additional comments

In particular, the show_progress_bar argument should be used.

Make more SetFitClassification parameters configurable

Feature description

There are still important parameters which cannot be easily configured. The following list of parameters should be configurable via SetFitClassification / SetFitModelArguments:

  • SetFitTrainer
    • learning_rate
    • num_iterations
    • warmup_proportion

This list is intentionally not yet exhaustive, since configuration could get messy, especially with different loss functions, and I want to avoid that. The parameter amp would also belong here; however, it is already covered by #45.

Motivation

Additional comments

FileNotFoundError: [Errno 2] No such file or directory: '0-b0'

Bug description

Training a model with model selection enabled may raise an error:

FileNotFoundError: [Errno 2] No such file or directory: '0-b0'

Full stacktrace below.

Steps to reproduce

Model selection needs to exit here for this error to occur.

Environment:

Python version: 3.8
small-text version: 1.1.0
small-text integrations (e.g., transformers): transformers
PyTorch version (if applicable): does not matter here
CUDA version (if applicable): does not matter here

Additional information

  <...>
  File "<venv_path>/lib/python3.8/site-packages/small_text/integrations/transformers/classifiers/classification.py", line 368, in fit
    return self._fit_main(sub_train, sub_valid, sub_train_weights, early_stopping,
  File "<venv_path>/lib/python3.8/site-packages/small_text/integrations/transformers/classifiers/classification.py", line 393, in _fit_main
    self._perform_model_selection(optimizer, model_selection)
  File "<venv_path>/lib/python3.8/site-packages/small_text/integrations/pytorch/classifiers/base.py", line 59, in _perform_model_selection
    self.model.load_state_dict(torch.load(model_selection_result.model_path))
  File "<venv_path>/lib/python3.8/site-packages/torch/serialization.py", line 579, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "<venv_path>/lib/python3.8/site-packages/torch/serialization.py", line 230, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "<venv_path>/lib/python3.8/site-packages/torch/serialization.py", line 211, in __init__
    super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: '0-b0'

Process finished with exit code 1

initialize_active_learner error

I am trying to initialize an active learner for text classification using a transformer. I have 11014 classes which need to be trained by the classification model. My dataset is highly imbalanced. For initialize_active_learner(active_learner, y_train) I have used:

def initialize_active_learner(active_learner, y_train):

    x_indices_initial = random_initialization(y_train)
    #random_initialization_stratified(y_train, n_samples=11015)
    #random_initialization_balanced(y_train)
    
    y_initial = np.array([y_train[i] for i in x_indices_initial])

    active_learner.initialize_data(x_indices_initial, y_initial)

    return x_indices_initial

But I get this error always:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-23-d0348c5b7547> in <module>
      1 # Active learner
      2 active_learner = PoolBasedActiveLearner(clf_factory, query_strategy, x_train)
----> 3 labeled_indices = initialize_active_learner(active_learner, y_train)
      4 #

<ipython-input-22-ed58e0714c48> in initialize_active_learner(active_learner, y_train)
     17     y_initial = np.array([y_train[i] for i in x_indices_initial])
     18 
---> 19     active_learner.initialize_data(x_indices_initial, y_initial)
     20 
     21     return x_indices_initial

~/.local/lib/python3.7/site-packages/small_text/active_learner.py in initialize_data(self, x_indices_initial, y_initial, x_indices_ignored, x_indices_validation, retrain)
    139 
    140         if retrain:
--> 141             self._retrain(x_indices_validation=x_indices_validation)
    142 
    143     def query(self, num_samples=10, x=None, query_strategy_kwargs=None):

~/.local/lib/python3.7/site-packages/small_text/active_learner.py in _retrain(self, x_indices_validation)
    380 
    381         if x_indices_validation is None:
--> 382             self._clf.fit(x)
    383         else:
    384             indices = np.arange(self.x_indices_labeled.shape[0])

~/.local/lib/python3.7/site-packages/small_text/integrations/transformers/classifiers/classification.py in fit(self, train_set, validation_set, optimizer, scheduler)
    332         self.class_weights_ = self.initialize_class_weights(sub_train)
    333 
--> 334         return self._fit_main(sub_train, sub_valid, fit_optimizer, fit_scheduler)
    335 
    336     def initialize_class_weights(self, sub_train):

~/.local/lib/python3.7/site-packages/small_text/integrations/transformers/classifiers/classification.py in _fit_main(self, sub_train, sub_valid, optimizer, scheduler)
    351                 raise ValueError('Conflicting information about the number of classes: '
    352                                  'expected: {}, encountered: {}'.format(self.num_classes,
--> 353                                                                         np.max(y) + 1))
    354 
    355             self.initialize_transformer(self.cache_dir)

ValueError: Conflicting information about the number of classes: expected: 11014, encountered: 8530

Please help here.

Thanks in advance

Separate query functions from query strategy classes

Motivation

Sometimes a query strategy is based on a simpler function (e.g., breaking ties / margin), which could be used independently of the query strategy.

Feature description

These functions should be extracted and exposed as part of the API.

  • BreakingTies
  • LeastConfidence
  • PredictionEntropy
  • SubsamplingQueryStrategy
  • EmbeddingKMeans
  • CategoryVectorInconsistencyAndRanking

(List is not exhaustive.)
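For illustration, the breaking-ties/margin score could be exposed roughly like this (hypothetical standalone function, not the actual extracted API):

import numpy as np

def breaking_ties_scores(proba):
    # Margin between the two most likely classes;
    # smaller = closer tie = more informative.
    sorted_proba = np.sort(proba, axis=1)
    return sorted_proba[:, -1] - sorted_proba[:, -2]

proba = np.array([[0.50, 0.40, 0.10],
                  [0.90, 0.05, 0.05]])
print(np.argsort(breaking_ties_scores(proba)))  # closest tie (row 0) ranked first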

Additional comments

See #30.

Throwing away stale data

@chschroeder - AL deals primarily with selecting data points which need to be added to the training set. Can it also be used to select data points which need to be purged from the training set because they are no longer useful?

Embeddings in EmbeddingKMeans and ContrastiveActiveLearning

Hi! Do they support embeddings from a language-agnostic model like LaBSE or XLM-RoBERTa (as this is not the case in their papers)? Would it be possible to use any embeddings that we previously extracted with those models? If so, how can we do that? I believe this could be crucial for this library, so that its use is not limited to English or to any specific encoder.
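Based on the EmbeddingBasedQueryStrategy.query() signature visible in the tracebacks on this page, which accepts an embeddings argument that query_strategy_kwargs forwards, a rough, untested sketch might look as follows (sentence-transformers and the expected embeddings shape are assumptions to verify against the source):

from sentence_transformers import SentenceTransformer

# texts: the raw documents behind the dataset; active_learner: set up as in
# the intro notebook. Both are assumed to exist here.
embeddings = SentenceTransformer('sentence-transformers/LaBSE').encode(texts)

indices_queried = active_learner.query(
    num_samples=20,
    query_strategy_kwargs={'embeddings': embeddings})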

Question: Regarding potential bottleneck during training of Classifiers

I am currently trying to optimize an AL pipeline using small-text, so I am looking into the source code, and I came across something slightly bizarre-looking: in the \integrations\transformers\classifiers\classification.py file, in the _get_layer_params() function, the following code is executed to get the parameters to train:

if hasattr(base_model, 'encoder'):
    layers = base_model.encoder.layer
else:
    layers = base_model.transformer.layer

If I understand correctly, that implies that instead of training the head of the (in my case) RobertaForSequenceClassification (RSC) model, the whole encoder is trained and the head neglected, which would be far more computationally expensive and quite a unique approach to the problem.
Therefore, the following questions arise for me:

  1. Is this wanted?
  2. If so why?
  3. Is the head (called classifier in RSC) trained somewhere else? I wasn't able to find where.
  4. If so, can the training of the encoder be disabled to speed up the training process?

Add auto mixed precision flag to Pytorch-based Classifiers

Feature description

Expected behavior:
There should be a boolean flag (amp=True/False) in TransformerModelArguments and SetFitModelArguments. Moreover, the flag should also be added to KimCNNClassifier, whose model arguments will hopefully be adjusted to match the former two classes; otherwise this flag should go in the constructor as a temporary solution.
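For reference, a minimal self-contained version of the standard PyTorch AMP pattern such a flag would enable internally (generic PyTorch, not small-text code):

import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = torch.nn.Linear(16, 2).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = torch.nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler(enabled=(device == 'cuda'))

x = torch.randn(8, 16, device=device)
y = torch.randint(0, 2, (8,), device=device)

optimizer.zero_grad()
with torch.autocast(device_type=device, enabled=(device == 'cuda')):
    loss = criterion(model(x), y)  # forward pass runs in mixed precision
scaler.scale(loss).backward()      # scale the loss to avoid gradient underflow
scaler.step(optimizer)
scaler.update()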

Motivation

This could provide a nice speedup for use cases where mixed precision is good enough.

Additional comments

Note: I tried this once in a hacky way and stopped because I got nan values in the loss (and I did not really need this feature at that time). This was likely a PyTorch bug and should be kept in mind here:
https://github.com/pytorch/pytorch/releases/tag/v2.0.1

TransformerBasedClassification: validations_per_epoch > 2 leaves the model in eval mode

Bug description

When TransformerBasedClassification is initialized with validations_per_epoch greater than 1, the model is incorrectly in eval mode in _train_loop_process_batches() (starting with the second iteration).

Steps to reproduce

Set validations_per_epoch to 2 and observe the training loop using the debugger.

Expected behavior

--

Environment:

small-text version: 1.3.1
small-text integrations (e.g., transformers): transformers

Additional information

Fixing the functionality is a one-liner; the test is most of the effort. Luckily, this setting is 1 by default, so unless you set it on purpose, you are likely not affected by this bug.

LightweightCoreset should allow for other distance metrics

Feature description

To make lightweight_coreset() more similar to greedy_coreset(), it should be possible to change the distance metric in lightweight_coreset() via an additional keyword argument distance_metric.

Motivation

Those functions solve a similar problem, so they should be kept as similar as possible. Also, this improves the functionality of lightweight_coreset().

Additional comments

  • A PR should start on the dev branch (as 2.0.0 will be next).
  • Code that is shared between those two functions should be extracted wherever possible.
  • lightweight_coreset() needs an additional unittest to verify the result.

Provide an example of how to construct datasets without labels

Feature description

In the documentation there should be an explanation with code examples of how Dataset objects are created when labels have not yet been assigned. This requires an explanation for both the single and multi-label scenario.

Motivation

This question has been asked several times, indicating that the current documentation on this part is inadequate.

Additional comments

See for example #30.
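Until such documentation exists, a hedged sketch of what this might look like (LABEL_UNLABELED from small_text.base and multi-label detection via a csr_matrix of labels are assumptions to verify):

import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from small_text import SklearnDataset
from small_text.base import LABEL_UNLABELED

texts = ['no label yet', 'also unlabeled']
x = TfidfVectorizer().fit_transform(texts)

# Single-label: one placeholder label per instance.
train_single = SklearnDataset(x, np.array([LABEL_UNLABELED] * len(texts)))

# Multi-label: an empty sparse label matrix (num_instances x num_classes).
train_multi = SklearnDataset(x, csr_matrix((len(texts), 3), dtype=np.int64))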

Query by committee-of-committees

For example, you could compute the confidence score, the margin, and the entropy for all of the records in the remaining pool, and then pick the records that are selected most often across those querying strategies.

Has anyone tried this and could you share your experience please?
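Not a built-in feature, but a rough sketch of the voting idea, operating on predict_proba() output (hypothetical helper):

import numpy as np

def committee_vote(proba, n=20):
    # Score the pool with three uncertainty measures, let each vote for
    # its top-n, and return the n examples with the most votes.
    sorted_proba = np.sort(proba, axis=1)
    least_confidence = 1 - sorted_proba[:, -1]
    neg_margin = -(sorted_proba[:, -1] - sorted_proba[:, -2])
    entropy = -np.sum(proba * np.log(proba + 1e-12), axis=1)

    votes = np.zeros(proba.shape[0], dtype=int)
    for scores in (least_confidence, neg_margin, entropy):
        votes[np.argsort(-scores)[:n]] += 1
    return np.argsort(-votes)[:n]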

EGL throws error

Bug description

I'm trying to run some experiments with Expected Gradient Length. I've used the code in your sample notebook exactly, except that I've swapped

query_strategy = PredictionEntropy() for query_strategy = ExpectedGradientLength(num_classes)

When calling active_learner.query this gives the error AttributeError: 'SequenceClassifierOutput' object has no attribute 'softmax'

The full traceback is below.

Environment:

Python version: I'm running on colab so currently 3.10, but I've replicated with 3.7 as well
small-text version: 1.3.3
small-text integrations (e.g., transformers): Transformers
PyTorch version (if applicable): 2.2.1+cu121

Installation (pip, conda, or from source): pip install small-text[transformers]==1.3.3

Additional information

The full traceback is:

AttributeError                            Traceback (most recent call last)
[<ipython-input-18-680c16aeec82>](https://localhost:8080/#) in <cell line: 23>()
     23 for i in range(num_queries):
     24     # ...where each iteration consists of labelling 20 samples
---> 25     indices_queried = active_learner.query(num_samples=20)
     26 
     27     # Simulate user interaction here. Replace this for real-world usage.

5 frames
[/usr/local/lib/python3.10/dist-packages/small_text/active_learner.py](https://localhost:8080/#) in query(self, num_samples, representation, query_strategy_kwargs)
    193 
    194         representation = self.dataset if representation is None else representation
--> 195         self.indices_queried = self.query_strategy.query(self._clf,
    196                                                          representation,
    197                                                          indices[self.mask],

[/usr/local/lib/python3.10/dist-packages/small_text/query_strategies/base.py](https://localhost:8080/#) in query(self, clf, datasets, indices_unlabeled, indices_labeled, y, n, *args, **kwargs)
     48                                        f'but single-label data was encountered')
     49 
---> 50             return super().query(clf, datasets, indices_unlabeled, indices_labeled, y,
     51                                  *args, n=n, **kwargs)
     52 

[/usr/local/lib/python3.10/dist-packages/small_text/integrations/pytorch/query_strategies/strategies.py](https://localhost:8080/#) in query(self, clf, dataset, indices_unlabeled, indices_labeled, y, n, pbar)
     64         with pbar_context as pbar:
     65             for i, (dataset, *_) in enumerate(dataset_iter):
---> 66                 self.compute_gradient_lengths(clf, criterion, gradient_lengths, offset, dataset)
     67 
     68                 batch_len = dataset.size(0)

[/usr/local/lib/python3.10/dist-packages/small_text/integrations/pytorch/query_strategies/strategies.py](https://localhost:8080/#) in compute_gradient_lengths(self, clf, criterion, gradient_lengths, offset, x)
     95         clf.model.zero_grad()
     96 
---> 97         self.compute_gradient_lengths_batch(clf, criterion, x, gradients, all_classes)
     98         self.aggregate_gradient_lengths_batch(batch_len, gradient_lengths, gradients, offset)
     99 

[/usr/local/lib/python3.10/dist-packages/small_text/integrations/pytorch/query_strategies/strategies.py](https://localhost:8080/#) in compute_gradient_lengths_batch(self, clf, criterion, x, gradients, all_classes)
    107         output = clf.model(x)
    108         with torch.no_grad():
--> 109             sm = F.softmax(output, dim=1)
    110 
    111         for j in range(self.num_classes):

[/usr/local/lib/python3.10/dist-packages/torch/nn/functional.py](https://localhost:8080/#) in softmax(input, dim, _stacklevel, dtype)
   1856         dim = _get_softmax_dim("softmax", input.dim(), _stacklevel)
   1857     if dtype is None:
-> 1858         ret = input.softmax(dim)
   1859     else:
   1860         ret = input.softmax(dim, dtype=dtype)

AttributeError: 'SequenceClassifierOutput' object has no attribute 'softmax'

Passing a scheduler through fit results in AttributeError

Bug description

Passing a scheduler to TransformerBasedClassification.fit() results in AttributeError: 'str' object has no attribute 'step'.

Steps to reproduce

clf.fit(dataset, optimizer=optimizer, scheduler=scheduler)

where clf is an instance of TransformerBasedClassification and optimizer and scheduler are not None.

Expected behavior

The given optimizer and scheduler should be used.

Environment:

small-text version: 1.0.0
small-text integrations (e.g., transformers): transformers

Additional information

The error is here

Dataset cloning wraps the label

Bug description

Selecting a sub(data)set and then cloning a dataset wraps the labels in a superfluous "ndarray()". This affects PytorchTextClassificationDataset and TransformersDataset.

Edit:
I noticed this because clf.predict() on the cloned dataset raised TypeError: len() of unsized object.

Steps to reproduce

Example for TransformersDataset:

import unittest

from small_text.integrations.transformers.datasets import TransformersDataset
from tests.utils.datasets import random_transformer_dataset


class CloneBugTest(unittest.TestCase):

    def test_clone_wraps_labels(self):
        dataset = random_transformer_dataset(num_samples=20,
                                             multi_label=False,
                                             num_classes=3)
        indices = [0, 1]

        dataset_cloned = dataset[indices].clone()

        first_label = dataset.data[0][TransformersDataset.INDEX_LABEL]
        first_label_cloned = dataset_cloned.data[0][TransformersDataset.INDEX_LABEL]

        print(first_label, str(first_label), repr(first_label))
        print(first_label_cloned, str(first_label_cloned), repr(first_label_cloned))

Output:

0 0 0
0 0 array(0)

Expected behavior

Expected Output:

0 0 0
0 0 0

Environment:

Python version: 3.8
small-text version: 1.3.0
small-text integrations (e.g., transformers): transformers
PyTorch version (if applicable): -

Installation (pip, conda, or from source): pip
CUDA version (if applicable): -

Additional information

--

Improve README.md

This issue may be amended. It may also be split into separate issues if necessary.

Feature description

The README.md is already not bad, but could be greatly improved in some places:

  • Show a (short and concise) code example
  • Provide different entry points for different types of users. In particular, there needs to be one entry point for users who are neither developers nor NLP researchers, but only want to set up an annotation workflow. (This should be as easy as possible; I have received feedback that this process is too difficult, and we want to encourage new users to try small-text.)
  • Provide an overview of the current features. I am thinking of a similar table as in the small-text paper (cf. Table 1 here).

Motivation

--

Additional comments

--

active_learner.save('active_leaner.pkl'), can't pickle _abc_data objects

Hi,

I've trained an active_learner object, now trying to save it to file.

According to the doc: https://small-text.readthedocs.io/en/latest/patterns/serialization.html
active_learner.save('active_leaner.pkl') should work but I get the following error:

TypeError                                 Traceback (most recent call last)
<ipython-input-79-3c088eb07e76> in <module>()
      1 
----> 2 active_learner.save(f"{DIR}/results/active_leaner.pkl")

22 frames
/usr/lib/python3.7/pickle.py in save(self, obj, save_persistent_id)
    522             reduce = getattr(obj, "__reduce_ex__", None)
    523             if reduce is not None:
--> 524                 rv = reduce(self.proto)
    525             else:
    526                 reduce = getattr(obj, "__reduce__", None)

TypeError: can't pickle _abc_data objects

I can extract the transformer model and save that instead using active_learner.classifier.model.save_pretrained(f"{directory}") but not using active_learner.save()

Using active learning on already trained model

Hello :)

I am trying to use the library with a transformer model that is already trained. For that matter, I don't need to use the initialize_data method, since the model is already trained; however, it seems to be required before using the query method (otherwise it throws an error).

To be more specific, let's say I have an object model (a multi-label model from Hugging Face) trained on data text_train and labels labels_train. Then I have text_test data for which no labels are available. I would like to use active learning to select the best samples (based on a given query strategy) in text_test to be labelled by my users. How could I use the library to do so?

Thank you in advance for your help !

Arrays don't match

I tried multi-class classification, but the following error occurs during training. Any solution?

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-97-34924934fd19> in <module>
      1 logging.getLogger('small_text').setLevel(logging.INFO)
----> 2 main()

<ipython-input-96-e3cc4fd7354b> in main()
     30         for i in range(20):
     31             # ...where each iteration consists of labelling 20 samples
---> 32             q_indices = active_learner.query(num_samples=20, x=train)
     33 
     34             # Simulate user interaction here. Replace this for real-world usage.

/opt/anaconda3/envs/small_text/lib/python3.7/site-packages/small_text-1.0.0a4-py3.7.egg/small_text/active_learner.py in query(self, num_samples, x, query_strategy_kwargs)
    175 
    176         self.mask = np.ones(size, bool)
--> 177         self.mask[np.concatenate([self.x_indices_labeled, self.x_indices_ignored])] = False
    178         indices = np.arange(size)
    179 

<__array_function__ internals> in concatenate(*args, **kwargs)

ValueError: all the input arrays must have same number of dimensions, but the array at index 0 has 1 dimension(s) and the array at index 1 has 0 dimension(s)

Cloning a nested DatasetsViews raises an AttributeError

Bug description

Calling clone() on a nested dataset view raises the following error:

[...]
  File "/path/to/site-packages/small_text/active_learner.py", line 389, in _retrain
    dataset = self.dataset[self.indices_labeled].clone()
  File "/path/tob-v2-ifn6Asey/lib/python3.8/site-packages/small_text/integrations/transformers/datasets.py", line 32, in clone
    target_labels = None if self.dataset.track_target_labels else np.copy(self.target_labels)
AttributeError: 'TransformersDatasetView' object has no attribute 'track_target_labels'

The fix is easy, but this is also a sign that the "target label tracking" is not properly mapped to the dataset views.

Maybe the target tracking functionality was also never needed in the first place.

Steps to reproduce

Create a DatasetView of a DatasetView of a Dataset, then call .clone() on the outermost view.

Expected behavior

--

Environment:

small-text integrations (e.g., transformers): pytorch, transformers

Additional information


Class weighting causes nan values in loss

Bug description

When using TransformerBasedClassification with class_weight='balanced', the class weights can become nan. This does not always happen, but only when the label distribution in the current labeled set is so skewed that all labels are from the same class.

For a multi-label problem, the encountered error is the following:

<...>
  File "/path/to/site-packages/small_text/integrations/transformers/classifiers/classification.py", line 591, in _train_single_batch
    loss.backward()
  File "/path/to/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/path/to/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Function 'BinaryCrossEntropyWithLogitsBackward0' returned nan values in its 0th output.

Steps to reproduce

  • TransformerBasedClassification with class_weight='balanced'.
  • All labels of the current labeled set need to come from the same class. (Initialize active learning with such a set to quickly encounter the error.)
  • The error occurs during the first backpropagation.
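A toy numpy illustration of why 'balanced' weighting degenerates when a class is absent (not the actual small-text code):

import numpy as np

y = np.array([0, 0, 0, 0])            # every label comes from class 0
counts = np.bincount(y, minlength=2)  # -> [4, 0]
with np.errstate(divide='ignore'):
    weights = len(y) / (2 * counts)   # balanced: n_samples / (n_classes * count)
print(weights)                        # [0.5 inf] -> propagates to nan in the loss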

Expected behavior

No weights are nan.

Environment:

  • small-text v1.3.0

Additional information

The problem is here and is caused by the scaling operation.
