Comments (10)
Hi there!
You are right, multiple instances can be selected using n_instances in the built-in query strategies, but so far the selection is very simple and only returns the instances with the largest utility values. These can end up very close to each other in the feature space. I think it is a good idea to implement more sophisticated strategies, as you suggested.
Currently, this is what a query strategy looks like in general:
def custom_query_strategy(classifier, X, a_keyword_argument=42):
    # measure the utility of each instance in the pool
    utility = utility_measure(classifier, X)
    # select the indices of the instances to be queried
    query_idx = select_instances(utility)
    # return the indices and the instances
    return query_idx, X[query_idx]
In the built-in queries, the select_instances() function is multi_argmax(values, n_instances=1) from modAL.utils.selection, which does what I have described earlier. Replacing it with the function described in the Ranked batch-mode paper would make that approach easy to include in the current query strategies.
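For illustration, the top-n selection described above can be sketched with plain NumPy. The helper name top_n_indices is hypothetical and is not modAL's multi_argmax; it only shows the behaviour in question, which is picking the instances with the largest utility without looking at how close they are to each other.

```python
import numpy as np

def top_n_indices(values, n_instances=1):
    """Return the indices of the n_instances largest values, largest first.

    Hypothetical helper for illustration only, not modAL's multi_argmax.
    """
    values = np.asarray(values)
    # argsort is ascending, so reverse it to get descending utility
    order = np.argsort(values)[::-1]
    return order[:n_instances]

# toy utility scores for a pool of five instances
utility = np.array([0.1, 0.9, 0.3, 0.8, 0.05])
query_idx = top_n_indices(utility, n_instances=2)
```

Nothing in this selection penalises picking two near-duplicate instances, which is the gap a ranked batch-mode selection is meant to close.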
What I would suggest is to add these functions to modAL.utils.selection. If you open a pull request as you suggested, we can work together on integrating these features!
from modal.
Sounds good! I'll try and get a PR in by tomorrow night (my time, so Wednesday morning your time?). Thanks!
from modal.
Hey Tivadar, a quick question: as it turns out, I'm having a bit of difficulty integrating my implementation of the ranked batch mode learner with modAL's architecture. Not because I don't understand the individual pieces (I think I understand those...), but because it's a model and not just a query strategy. Does that make sense? E.g. the paper describes being able to access the "model" training data, which doesn't seem accessible from the scope of any particular sampling function within modAL.utils.selection, unless I'm mistaken.
For what it's worth, feel free to check out my implementation in the gist here (the actual implementation is below the notebook). It's neither properly commented nor finished, but maybe you'll get an idea of what I mean? The RankedBatchLearner's query method (again, I know it's not aligned with the modAL API/architecture) relies on being able to build off of the initial training set for building the ranked batch.
from modal.
Thanks, great work! I'll review your implementation in detail soon.
The problem you mention, that the training data is not accessible in the scope of functions within modAL.utils.selection, is not necessarily a problem. The first argument of every query strategy is the active learner itself, which has access to the training data. Since the selection functions are usually called within the query strategy, you can pass the training data as an argument to the function selecting the instances. I'll try to outline this in detail in code today after I have finished reading the paper.
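A minimal sketch of that idea, under stated assumptions: the learner is stood in for by a SimpleNamespace that exposes an X_training attribute, and both min_distance_to_labeled and far_from_training_strategy are hypothetical names invented for this example. The utility used here (distance to the nearest labeled instance) is only a placeholder and is not the ranked batch-mode score from the paper.

```python
import numpy as np
from types import SimpleNamespace

def min_distance_to_labeled(X_pool, X_training):
    """Distance from each pool instance to its nearest labeled instance."""
    # pairwise differences, shape (n_pool, n_training, n_features)
    diffs = X_pool[:, None, :] - X_training[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    return dists.min(axis=1)

def far_from_training_strategy(learner, X, n_instances=1):
    # the learner is the first argument, so its training data is in scope
    # and can be handed on to the selection helper
    utility = min_distance_to_labeled(X, learner.X_training)
    query_idx = np.argsort(utility)[::-1][:n_instances]
    return query_idx, X[query_idx]

# stand-in for an active learner holding two labeled points
learner = SimpleNamespace(X_training=np.array([[0.0, 0.0], [1.0, 0.0]]))
X_pool = np.array([[0.1, 0.0], [5.0, 5.0], [1.0, 0.2]])
query_idx, query_instances = far_from_training_strategy(learner, X_pool, n_instances=1)
```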
Also, a quick note. In modAL.density, you can find the similarize_distance decorator, which can be used for a purpose similar to your euclidean_sim. Feel free to use it if you find it suitable!
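As a rough picture of what such a decorator does, here is a hand-rolled sketch. The name to_similarity and the mapping 1 / (1 + d) are assumptions made for this example, so it should not be read as the source of modAL's similarize_distance.

```python
import functools
import numpy as np

def to_similarity(distance_fn):
    """Wrap a distance function so that it returns a similarity in (0, 1].

    Hypothetical decorator; the mapping 1 / (1 + d) is an assumption.
    """
    @functools.wraps(distance_fn)
    def similarity(*args, **kwargs):
        return 1.0 / (1.0 + distance_fn(*args, **kwargs))
    return similarity

@to_similarity
def euclidean(a, b):
    # plain Euclidean distance between two points
    return float(np.linalg.norm(np.asarray(a) - np.asarray(b)))

same = euclidean([0, 0], [0, 0])  # identical points, distance 0
far = euclidean([0, 0], [3, 4])   # distance 5
```

Because the wrapper only calls the function it is given, the same decorator works for any distance measure with a non-negative return value.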
from modal.
Ah! I see now. If that's the case, it shouldn't be too bad going forward. I had assumed the classifier was the core estimator.
> Also, a quick note. In modAL.density, you can find the similarize_distance decorator, which can be used for similar purposes you used the euclidean_sim. Feel free to use it if you find it suitable!
Hahaha yes, I hand-rolled my own during the implementation but realized afterwards that it's already there and modular enough for any distance function. Thanks!!
from modal.
OK, now the feature is implemented and merged into the dev branch! Thanks @dataframing!
from modal.
Thanks for the feature! I went through the tutorial and it works fine, but I see here https://github.com/cosmic-cortex/modAL/blob/308af9b0ffff30597431ffac5ca44e3ad518c607/examples/ranked_batch_mode.py#L36 a possibility of getting X_raw.shape[0] as a training index, which would result in an IndexError...
Could you check my suspicion, please?
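A minimal sketch of the concern, assuming the linked line draws training indices with np.random.randint and an exclusive upper bound of X_raw.shape[0] + 1. That is an assumption about the line, not a quote from it, and X_raw below is stand-in data.

```python
import numpy as np

X_raw = np.arange(20).reshape(10, 2)  # stand-in data with 10 rows
n_rows = X_raw.shape[0]

# randint's `high` is exclusive, so high = n_rows + 1 could return n_rows,
# which is one past the last valid row index
out_of_range = n_rows
try:
    X_raw[np.array([out_of_range])]
    raised = False
except IndexError:
    raised = True

# drawing with high = n_rows keeps every index in range
rng = np.random.default_rng(0)
safe_idx = rng.integers(low=0, high=n_rows, size=3)
X_train = X_raw[safe_idx]
```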
Also, in order to have consistent documentation across the website, Jupyter notebooks and .py files, it is convenient to use Sphinx with a couple of extensions... PyTorch tutorials, for example, are built using these things.
from modal.
I plan to switch to Sphinx; several people have suggested that the API reference and the website should be merged. It might take a while, however, since I am not familiar with it.
from modal.
I was recently responsible for refactoring the documentation here: http://docs.deeppavlov.ai/en/master/intro/hello_bot.html
I can contribute a similar refactoring to modAL if you are OK with it, but first we need to discuss several things.
I see some Sphinx/readthedocs artifacts in the repo, but in the end GitHub Pages is used for the site, am I right?
I strongly suggest choosing NumPy or Google style for docstrings (http://www.sphinx-doc.org/en/master/usage/extensions/napoleon.html). The conversion could be done with https://github.com/dadadel/pyment. For example, sklearn uses NumPy style, while PyTorch uses Google style. I prefer the latter when type annotations are used in function signatures. What do you think?
Regarding type annotations: as I understand it, Python 2 support is not planned (and that's great!), so it would be useful to enforce type annotations, which simplify docstrings (type annotations are more about syntax, while docstrings are more about semantics).
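To make the suggestion concrete, here is a small sketch of a Google-style docstring combined with type annotations. The function select_top is hypothetical and not part of modAL; the point is that the types live in the signature, so the docstring only has to describe meaning.

```python
import numpy as np

def select_top(utility: np.ndarray, n_instances: int = 1) -> np.ndarray:
    """Select the instances with the largest utility.

    Args:
        utility: Utility score of each instance in the pool.
        n_instances: Number of instances to select.

    Returns:
        Indices of the selected instances, highest utility first.
    """
    return np.argsort(utility)[::-1][:n_instances]

selected = select_top(np.array([0.2, 0.7, 0.5]), n_instances=2)
```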
Also, I have not found licensing information. I suggest using Apache 2.0 as a friendly one for both academia and industry.
Waiting for your reply...
from modal.
Sorry for not answering sooner, I was on vacation for the last two weeks.
It sounds great! I would really appreciate your help! I have opened a new issue (#22) for discussing this. GitHub Pages is used for the site itself and readthedocs for the autogenerated documentation. I would like to merge these two, just like the PyTorch docs, for instance, as you have mentioned.
I am not really familiar with the NumPy or Google styles themselves, but I'll take a look ASAP.
Regarding licensing, I use the MIT license (https://github.com/cosmic-cortex/modAL/blob/master/LICENSE).
from modal.