Comments (11)
Hi,
`PoolBasedActiveLearner.save()`/`load()` just writes/reads the object as it is. So after loading a previously saved active learner, you can continue with your previously initialized model as if nothing happened in between.
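A minimal sketch of that round trip, assuming an already initialized `active_learner` and an arbitrary file path (the exact import path may differ slightly between small-text versions):

```python
from small_text import PoolBasedActiveLearner

# Persist the learner, including its trained classifier.
active_learner.save('active_learner.pkl')

# Later (e.g. in a new process): restore it and continue as before.
active_learner = PoolBasedActiveLearner.load('active_learner.pkl')
```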
The other option you mentioned is also possible, for example if your training data changes. In this case you can call `learner.initialize_data(..., retrain=False)`, where `retrain=False` omits the training step. In your case, however, it sounds as if you should not need this additional step.
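For illustration, a hedged one-liner of that call (the index and label variables here are placeholders):

```python
# Register the labeled indices without triggering a retraining step.
learner.initialize_data(indices_initial, y_initial, retrain=False)
```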
Thank you! `retrain=False` was what I was looking for. As I'm also changing the data that the active learner has access to, I will have to initialise a new learner object.
However, I am still encountering issues. I am loading a pre-trained model from file, then initialising an active learner (with no new initialisation data, as this model has already been trained on some training data). Then, as you suggested, I'm setting `retrain=False`:
```python
transformer_model = TransformerModelArguments('path/to/pretrained-model',
                                              tokenizer='path/to/pretrained/tokenizer')
clf_factory = TransformerBasedClassificationFactory(transformer_model,
                                                    num_classes,
                                                    kwargs=dict({'device': 'cuda',
                                                                 'mini_batch_size': 32,
                                                                 'early_stopping_no_improvement': -1}))

def init_pretrained_learner(clf_factory, query_strat, pool):
    active_learner_pool = PoolBasedActiveLearner(clf_factory, query_strat, pool)
    # select no examples for initialisation as learner is already trained
    x_indices_initial = random_initialization(pool, n_samples=0)
    y_initial = pool.y[x_indices_initial]
    active_learner_pool.initialize_data(x_indices_initial, y_initial, retrain=False)
    return active_learner_pool

active_learner_pool = init_pretrained_learner(clf_factory, RandomSampling(), pool)
```
However, when I then try to access the classifier, e.g. by running:
```python
embeddings, proba = active_learner_pool.classifier.embed(pool, return_proba=True)
```
I get the following error:
```
AttributeError: 'NoneType' object has no attribute 'embed'
```
so it seems the classifier hasn't been initialised.
Could you suggest how to proceed?
Hi @chschroeder, do you have any update on this error?
Thanks! :)
Hi @HannahKirk, sorry I completely missed your last edit. But now I think I have understood what your intention is:
- You have a pretrained huggingface model, not a serialized previously trained active learner.
- You want to use this model in combination with the active learner but without any further training.
This is probably not possible right now without some hassle, but it might be a valid use case to support in the future.
For now, you can try the following:
```python
transformer_model = TransformerModelArguments('path/to/pretrained-model',
                                              tokenizer='path/to/pretrained/tokenizer')
clf_factory = TransformerBasedClassificationFactory(transformer_model,
                                                    num_classes,
                                                    kwargs=dict({'device': 'cuda',
                                                                 'mini_batch_size': 32,
                                                                 'early_stopping_no_improvement': -1}))
active_learner_pool = PoolBasedActiveLearner(clf_factory, RandomSampling(), pool)

# initialize the classifier
active_learner_pool._clf = clf_factory.new()
# initialize the underlying model
active_learner_pool._clf.initialize_transformer(active_learner_pool._clf.cache_dir)
active_learner_pool._clf.num_classes = 123  # TODO: set number of classes
```
This manually initializes some objects which would otherwise be set up in `active_learner._retrain()` and `classifier.fit()`. This might work already, but I would not rule out that I have missed something on the classifier side.
Thanks @chschroeder. I think even with the suggested changes, there are still some problems. When trying to run
```python
embeddings, proba = active_learner_pool.classifier.embed(pool, return_proba=True)
```
I now get the error:
```
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument index in method wrapper__index_select)
```
Perhaps the pool has not been moved with `.to_device()`?
Thanks :)
Yes, this error makes sense; we need to prepare the classifier as if `fit()` had been called. The error is caused by the model still being on the CPU. Append the following line to the code above:
```python
active_learner_pool._clf.model = active_learner_pool._clf.model.to(active_learner_pool._clf.device)
```
Edit: Fixed device reference.
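As an optional sanity check (standard PyTorch, not small-text-specific), you can confirm where the model's parameters now live before calling `embed()`:

```python
# Should print something like device(type='cuda', index=0) after the move.
print(next(active_learner_pool._clf.model.parameters()).device)
```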
Thanks @chschroeder, one remaining point: what does the `self` refer to here?
That was a copy/paste remnant, sorry :). Fixed it above. What I am doing here is just copying the code from the classifier.
Hi @chschroeder, I have followed these instructions in the code:
```python
def initialise_learner(pt_model_path, n_classes, pool):
    tf_model = TransformerModelArguments(pt_model_path)
    n_epochs = 0
    clf_factory = TransformerBasedClassificationFactory(tf_model,
                                                        n_classes,
                                                        n_epochs,
                                                        kwargs=dict({'device': 'cuda',
                                                                     'early_stopping_no_improvement': -1}))
    active_learner_pool = PoolBasedActiveLearner(clf_factory, RandomSampling(), pool)
    # initialise classifier
    active_learner_pool._clf = clf_factory.new()
    # initialize the underlying model
    active_learner_pool._clf.initialize_transformer(active_learner_pool._clf.cache_dir)
    active_learner_pool._clf.num_classes = n_classes
    active_learner_pool._clf.model.to(active_learner_pool._clf.device)
    return active_learner_pool
```
However, when I now try to use this model for querying with:
```python
selected_indices = trained_learner.query(num_samples=10)
```
I still get an error: `LearnerNotInitializedException`.
How should I proceed? I thought the pre-trained model would now have been initialised with the learner, so that the learner could be used to query the pool. I can initialise the data in a hacky way by selecting 0 samples with:
```python
# initialise data
x_indices_initial = random_initialization(pool, n_samples=0)
y_initial = pool.y[x_indices_initial]
active_learner_pool.initialize_data(x_indices_initial, y_initial, retrain=False)
```
But there may be a better solution.
Thanks :)
P.S. The full source of the exception is:
```
/usr/local/lib/python3.7/dist-packages/small_text/active_learner.py in query(self, num_samples, x, query_strategy_kwargs)
    167         """
    168         if self._label_to_position is None:
--> 169             raise LearnerNotInitializedException()
    170
    171         size = list_length(self.x_train)
```
You were almost there :). Sorry, I had no time to try this myself last week, and this distant trial and error takes a bit longer. But now I have tried it myself.
I used your function and changed this:
```python
# initialise classifier
active_learner_pool._clf = clf_factory.new()
```
to this:
```python
# initialise classifier
active_learner_pool._clf = clf_factory.new()
active_learner_pool._x_index_to_position = dict()
```
This got me far enough that I could use `active_learner.query(...)` and `active_learner.classifier.predict(...)`.
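For reference, a hedged usage sketch after applying the fix above (`num_samples=10` and the `pool` variable follow the earlier snippets):

```python
# Query the strategy for new examples to label ...
indices_queried = active_learner_pool.query(num_samples=10)

# ... and predict with the (pre-trained) classifier on the pool.
predictions = active_learner_pool.classifier.predict(pool)
```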
On second thought, I am now questioning what we even gain from the `PoolBasedActiveLearner` at this point. Just using the classifier and query strategy directly would likely result in better code, as sketched below.
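A rough sketch of that direct approach, assuming the factory from above; the index and label variables are placeholders, and the exact `query()` signature (parameter names and order) differs between small-text versions, so treat this as an illustration rather than a drop-in:

```python
# Build and prepare the classifier without a PoolBasedActiveLearner,
# mirroring the manual initialization steps from earlier in the thread.
clf = clf_factory.new()
clf.initialize_transformer(clf.cache_dir)
clf.num_classes = n_classes
clf.model = clf.model.to(clf.device)

# Apply the query strategy directly. Here `indices_labeled`/`y_labeled`
# may be empty, since the model is already trained.
query_strategy = RandomSampling()
indices_queried = query_strategy.query(clf,
                                       pool,
                                       indices_unlabeled,
                                       indices_labeled,
                                       y_labeled,
                                       n=10)
```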
Hi, just to give an update: this issue has not been forgotten. I would call your use case here "pre-initialized" or "externally initialized". Similar problems regarding the API exist for cold start active learning, for which we now have a notebook.
Both of these use cases are difficult to realize without breaking the current API. The solution I would prefer will likely generalize the initialization mechanism, but this will have to wait until the next major version, 2.0.0.