Comments (6)
Thanks! Adding pandas support is a great idea! I'll take a look at it soon. Hopefully it can be included in the next release.
from modal.
I have started to implement the pandas.DataFrame support and I came across a major issue. The crux of the problem is that numpy arrays use row indexing, while pandas DataFrames use column indexing by default. That is, if you have a dataset X, then for instance X[0] gives the first row if it is a numpy array, but gives the column with index 0 for pandas DataFrames.
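The difference described above can be reproduced with a tiny dataset. This is only an illustrative sketch: the array values and variable names are made up, and the DataFrame uses pandas' default integer column labels so that X[0] is valid for both types.

```python
import numpy as np
import pandas as pd

# Three instances with two features each (illustrative values).
X_np = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
X_df = pd.DataFrame(X_np)  # default integer column labels 0 and 1

# The same expression means different things for the two types.
row_from_numpy = X_np[0]        # first row of the array
column_from_pandas = X_df[0]    # column labelled 0 of the DataFrame

# Positional row access on a DataFrame has to go through .iloc.
row_from_pandas = X_df.iloc[0]
```

So any code that writes X[idx] to fetch an instance silently fetches a column (or raises a KeyError if no column has that label) when X is a DataFrame.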
This causes a major incompatibility problem in the query strategy functions. When query strategies select the instance to query, they return the index and the instance as well. Currently, I have found no way to access the given instance in a type-agnostic way. One possible way to circumvent the problem is to remove the query instance from the return values of a query strategy. Since this would be a huge change, I am hesitant to do this.
from modal.
The sklearn package is a good example of how to accept pandas DataFrames as input: it converts the DataFrame to a numpy array.
https://medium.com/dunder-data/from-pandas-to-scikit-learn-a-new-exciting-workflow-e88e2271ef62
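For reference, the conversion mentioned above can be sketched with scikit-learn's public input-validation helper, sklearn.utils.check_array, alongside the plain pandas DataFrame.to_numpy() call. The column names and values are made up for illustration, and this is not a claim about how any particular estimator handles its input internally.

```python
import numpy as np
import pandas as pd
from sklearn.utils import check_array

# A purely numeric DataFrame (illustrative column names and values).
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})

# check_array validates the input and returns a numpy array.
X_checked = check_array(df)

# The same conversion without scikit-learn.
X_plain = df.to_numpy()
```

After the conversion, X_checked[0] is a row again, but the column names and index of the DataFrame are gone.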
Hope this will be helpful.
Thanks!
from modal.
It would be a very useful feature to improve the workflow and integration with other packages. Is there a branch where we can help with this feature?
from modal.
Currently, there are no feature branches specifically for this, but feel free to create one in a fork from the dev branch! I am happy to help, since I also think it is an important problem; I just haven't solved it yet. As I outlined in my previous comment, the main issue for me is that pandas DataFrames are indexed by column first, while numpy arrays are indexed by row first. One possible solution is to convert to a numpy array immediately, but for me this largely defeats the purpose of supporting DataFrames.
from modal.
One could handle pandas data frames separately, e.g.
    if isinstance(X, pd.DataFrame):
        return X.iloc[query_indices]
    return X[query_indices]
To avoid repeating this check in every query strategy, the strategies could return indices only, as you suggested. The instance lookup can then be added only in the query() method implementations.
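A minimal sketch of that split could look as follows. Everything here is hypothetical and not part of modAL: retrieve_rows, least_confident_indices and the standalone query function are illustrative names, and the strategy takes precomputed class probabilities to keep the example self-contained.

```python
import numpy as np
import pandas as pd


def retrieve_rows(X, indices):
    """Hypothetical helper: select rows by position for either data type."""
    if isinstance(X, pd.DataFrame):
        return X.iloc[indices]
    return X[indices]


def least_confident_indices(class_probabilities, n_instances=1):
    """Hypothetical index-only strategy: never touches the data container."""
    uncertainty = 1.0 - np.max(class_probabilities, axis=1)
    # Sort by descending uncertainty and keep the top n_instances positions.
    return np.argsort(-uncertainty, kind="stable")[:n_instances]


def query(X_pool, class_probabilities, n_instances=1):
    """Stand-in for a query() method: the only place that looks up instances."""
    query_indices = least_confident_indices(class_probabilities, n_instances)
    return query_indices, retrieve_rows(X_pool, query_indices)


# Usage with made-up probabilities for a pool of three instances.
probabilities = np.array([[0.9, 0.1], [0.55, 0.45], [0.7, 0.3]])
pool_np = np.arange(6).reshape(3, 2)
pool_df = pd.DataFrame(pool_np, columns=["a", "b"])

idx_np, rows_np = query(pool_np, probabilities, n_instances=2)
idx_df, rows_df = query(pool_df, probabilities, n_instances=2)
```

With this layout the type check exists in one place, and the strategies themselves only need the model output, not the data container.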
Once this is done, the only remaining changes needed to support pandas data frames concern code that works with instance representations, e.g. calculating similarities between them. If #104 is resolved (I am working on it), one could replace the estimator with an sklearn transformation + estimator pipeline, where the transformation converts the data frame to a matrix. Something similar to:
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.pipeline import Pipeline
    from sklearn.ensemble import RandomForestClassifier
    from modAL.models import ActiveLearner
    from modAL.batch import uncertainty_batch_sampling

    ActiveLearner(
        # it's important to clone the model to have separate models for
        # predictions and active learning loop; ActiveLearner fits the
        # provided estimator to the provided training data
        estimator=Pipeline(steps=[
            ('transform', OneHotEncoder()),  # results in a matrix
            ('classify', RandomForestClassifier())
        ]),
        query_strategy=uncertainty_batch_sampling,
        X_training=X_training,
        y_training=y_training,
        on_transformed=True  # not implemented, should force query strategies to work on transformed data (one hot encoded)
    )
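To make the idea of working on transformed data concrete, here is a sketch that uses only scikit-learn, without modAL. It assumes a small made-up categorical DataFrame, and it only shows what the transformation step of such a pipeline produces. How an on_transformed option would feed this matrix to the query strategies is left open, as in the proposal above.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up categorical training data.
X_training = pd.DataFrame({
    "colour": ["red", "blue", "red", "green"],
    "shape": ["round", "square", "square", "round"],
})
y_training = [0, 1, 1, 0]

pipeline = Pipeline(steps=[
    ("transform", OneHotEncoder(handle_unknown="ignore")),
    ("classify", RandomForestClassifier(n_estimators=10, random_state=0)),
])

# The pipeline accepts the DataFrame directly.
pipeline.fit(X_training, y_training)
predictions = pipeline.predict(X_training)

# The fitted transformation step yields the matrix representation that
# similarity-based query strategies could operate on.
X_transformed = pipeline.named_steps["transform"].transform(X_training)
```

Here X_transformed has one column per category value (three colours plus two shapes), so distance or similarity computations no longer depend on the DataFrame layout.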
from modal.