Comments (6)

cosmic-cortex commented on May 27, 2024

Thanks! Adding Pandas support is a great idea! I'll take a look at it soon, hopefully this can be included in the next release.

from modal.

cosmic-cortex commented on May 27, 2024

I have started to implement pandas.DataFrame support and came across a major issue. The crux of the problem is that plain [] indexing selects rows on numpy arrays but columns on pandas DataFrames. That is, for a dataset X, X[0] gives the first row if X is a numpy array, but the column labelled 0 if X is a pandas DataFrame.
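
A minimal sketch of the mismatch, using a small made-up dataset (the variable names are illustrative only). Note that if the DataFrame had string column names, X_df[0] would raise a KeyError instead of returning a column.

```python
import numpy as np
import pandas as pd

X_np = np.array([[1, 2], [3, 4], [5, 6]])
X_df = pd.DataFrame(X_np)  # default integer column labels 0 and 1

first_row_np = X_np[0]       # numpy: row 0 of the array
col_zero_df = X_df[0]        # pandas: the column labelled 0
first_row_df = X_df.iloc[0]  # pandas: positional row access via .iloc
```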

This causes a major incompatibility problem in the query strategy functions. When query strategies select the instance to query, they return the index and the instance as well. Currently, I have found no way to access the given instance in a type-agnostic way. One possible way to circumvent the problem is to remove the query instance from the return values of a query strategy. Since this would be a huge change, I am hesitant to do this.

fighting41love commented on May 27, 2024

scikit-learn is a good example of how to accept pandas DataFrames: it converts the DataFrame to a numpy array.
https://medium.com/dunder-data/from-pandas-to-scikit-learn-a-new-exciting-workflow-e88e2271ef62
I hope this is helpful.
Thanks!
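
For illustration, a rough sketch of what such a conversion looks like using pandas and numpy directly. This is not scikit-learn's internal code, and the column names and values are invented for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, 32, 47], "income": [40.0, 52.5, 61.0]})

X = df.to_numpy()      # plain 2-D float array; the column labels are dropped
same = np.asarray(df)  # equivalent conversion through the array protocol

row = X[0]             # row indexing now behaves as for any numpy array
```

The trade-off is that column names and per-column dtypes are lost once the data is an array.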

jpzhangvincent commented on May 27, 2024

This would be a very useful feature for improving the workflow and integration with other packages. Is there a branch where we can help with it?

cosmic-cortex commented on May 27, 2024

Currently, there are no feature branches specifically for this, but feel free to create one in a fork from the dev branch! I am happy to help, since I also think it is an important problem; I just haven't solved it yet. As I outlined in my previous comment, the main issue for me is that pandas DataFrames are indexed by column first, while numpy arrays are indexed by row first. One possible solution is to convert to a numpy array immediately, but this kind of defeats the purpose for me.

BoyanH commented on May 27, 2024

> I have started to implement the pandas.DataFrame support and I came across a major issue. The crux of the problem is that numpy arrays use row indexing, while pandas DataFrames use column indexing by default. That is, if you have a dataset X, then for instance X[0] gives the first row if it is a numpy array, but gives the column with index 0 for pandas DataFrames.
>
> This causes a major incompatibility problem in the query strategy functions. When query strategies select the instance to query, they return the index and the instance as well. Currently, I have found no way to access the given instance in a type-agnostic way. One possible way to circumvent the problem is to remove the query instance from the return values of a query strategy. Since this would be a huge change, I am hesitant to do this.

One could handle pandas data frames separately, e.g.

    if isinstance(X, pd.DataFrame):
        return X.iloc[query_indices]

    return X[query_indices]

To avoid repeating this check in every query strategy, the strategies could return indices only, as you suggested. The instance lookup can then live solely in the query() method implementations.
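
A minimal sketch of that split, under stated assumptions: select_indices, retrieve_rows and query are hypothetical names invented for this example and are not part of modAL's API, and the scores are made up rather than produced by a real estimator.

```python
import numpy as np
import pandas as pd

def select_indices(scores, n_instances=1):
    """Hypothetical query strategy: return only the positions of the top scores."""
    return np.argsort(scores)[::-1][:n_instances]

def retrieve_rows(X, indices):
    """Hypothetical helper: fetch rows by position for DataFrames and arrays."""
    if isinstance(X, pd.DataFrame):
        return X.iloc[indices]
    return X[indices]

def query(X, scores, n_instances=1):
    """Hypothetical stand-in for a learner's query(): indices plus instances."""
    idx = select_indices(scores, n_instances)
    return idx, retrieve_rows(X, idx)

X_np = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])
X_df = pd.DataFrame(X_np, columns=["a", "b"])
scores = np.array([0.2, 0.9, 0.5])  # made-up utility scores

idx_np, rows_np = query(X_np, scores, n_instances=2)
idx_df, rows_df = query(X_df, scores, n_instances=2)
```

With this layout the type check appears in one place, and the strategy itself never touches the container type of X.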

Once this is done, the only remaining changes needed to support pandas data frames concern code that works with instance representations, e.g. calculating similarities between instances. If #104 is resolved (I am working on it), one could replace the estimator with an sklearn pipeline of a transformation followed by an estimator, where the transformation converts the data frame to a matrix. Something similar to:

    from sklearn.preprocessing import OneHotEncoder
    from sklearn.pipeline import Pipeline
    from sklearn.ensemble import RandomForestClassifier
    from modAL.models import ActiveLearner
    from modAL.batch import uncertainty_batch_sampling

    # X_training and y_training are assumed to be defined elsewhere
    ActiveLearner(
        # ActiveLearner fits the provided estimator to the provided training data
        estimator=Pipeline(steps=[
            ('transform', OneHotEncoder()),  # results in a matrix
            ('classify', RandomForestClassifier())
        ]),
        query_strategy=uncertainty_batch_sampling,
        X_training=X_training,
        y_training=y_training,
        # not implemented: should force query strategies to work on
        # transformed (one-hot encoded) data
        on_transformed=True
    )
