Thanks for sharing the great code! Lightgbm is a popular package, which supports n

support Pandas dataframe as training data about modal HOT 6 CLOSED

modal-python commented on May 27, 2024

support Pandas dataframe as training data

from modal.

Comments (6)

cosmic-cortex commented on May 27, 2024

Thanks! Adding Pandas support is a great idea! I'll take a look at it soon, hopefully this can be included in the next release.

cosmic-cortex commented on May 27, 2024

I have started to implement the pandas.DataFrame support and I came across a major issue. The crux of the problem is that numpy arrays use row indexing, while pandas DataFrames use column indexing by default. That is, if you have a dataset X, then for instance X[0] gives the first row if it is a numpy array, but gives the column with index 0 for pandas DataFrames.
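To make the indexing mismatch concrete, here is a minimal sketch. The data is made up for illustration, and the DataFrame is built without column names so that it gets the default integer labels 0 and 1.

```python
import numpy as np
import pandas as pd

X_np = np.array([[1, 2], [3, 4], [5, 6]])
X_df = pd.DataFrame(X_np)  # default integer column labels: 0 and 1

first_row_np = X_np[0]       # numpy: [] selects the first row
first_col_df = X_df[0]       # pandas: [] selects the column labelled 0
first_row_df = X_df.iloc[0]  # pandas: positional row access needs .iloc

print(first_row_np.tolist())  # [1, 2]
print(first_col_df.tolist())  # [1, 3, 5]
print(first_row_df.tolist())  # [1, 2]
```

With string column labels, `X_df[0]` would raise a `KeyError` instead of returning a column, so the same expression can fail outright depending on the frame.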

This causes a major incompatibility problem in the query strategy functions. When query strategies select the instance to query, they return the index and the instance as well. Currently, I have found no way to access the given instance in a type-agnostic way. One possible way to circumvent the problem is to remove the query instance from the return values of a query strategy. Since this would be a huge change, I am hesitant to do this.

fighting41love commented on May 27, 2024

The sklearn package is a good example of how to accept a pandas DataFrame: it converts the DataFrame to a numpy array.
https://medium.com/dunder-data/from-pandas-to-scikit-learn-a-new-exciting-workflow-e88e2271ef62
Hope this will be helpful.
Thanks!

jpzhangvincent commented on May 27, 2024

It would be a very useful feature to improve the workflow and integration with other packages. Is there a branch where we can help with this feature?

from modal.

cosmic-cortex commented on May 27, 2024

Currently, there are no feature branches specifically for this, but feel free to create one in a fork from the dev branch! I am happy to help, since I also think it is an important problem; I just haven't solved it yet. As I outlined in my previous comment, the main issue for me is that pandas DataFrames are indexed by column first, while numpy arrays are indexed by row first. One possible way to solve this is to convert to a numpy array immediately, but this kind of defeats the purpose for me.
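For illustration, a minimal sketch of what immediate conversion gives up. The frame and its column names are made up for this example.

```python
import pandas as pd

# Hypothetical mixed-type frame, as often used with tree-based models.
X_df = pd.DataFrame({"age": [25, 40], "city": ["Oslo", "Lima"]})

X_np = X_df.to_numpy()

# Row indexing now works as with any array ...
first_row = X_np[0].tolist()

# ... but the column labels are gone and the mixed columns
# collapse into a single generic object dtype.
print(first_row)
print(X_np.dtype)
print(hasattr(X_np, "columns"))
```

The array is row-indexable, but per-column dtypes and names are no longer available to downstream estimators or transformers that rely on them.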

BoyanH commented on May 27, 2024

> I have started to implement the pandas.DataFrame support and I came across a major issue. The crux of the problem is that numpy arrays use row indexing, while pandas DataFrames use column indexing by default. That is, if you have a dataset X, then for instance X[0] gives the first row if it is a numpy array, but gives the column with index 0 for pandas DataFrames.
>
> This causes a major incompatibility problem in the query strategy functions. When query strategies select the instance to query, they return the index and the instance as well. Currently, I have found no way to access the given instance in a type-agnostic way. One possible way to circumvent the problem is to remove the query instance from the return values of a query strategy. Since this would be a huge change, I am hesitant to do this.

One could handle pandas data frames separately, e.g.

    if isinstance(X, pd.DataFrame):
        return X.iloc[query_indices]

    return X[query_indices]

In order not to include this in all query strategies, these could return indices only as you suggested. This functionality can then be added only in the query() method implementations.
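The split described above could look roughly like the sketch below. The names `select_rows`, `toy_strategy` and `query` are hypothetical and only illustrate the idea; they are not modAL's actual functions, and the "strategy" here simply picks the highest scores.

```python
import numpy as np
import pandas as pd


def select_rows(X, indices):
    """Return the rows of X at the given positions (array or DataFrame)."""
    if isinstance(X, pd.DataFrame):
        return X.iloc[indices]
    return X[indices]


def toy_strategy(scores, n_instances=1):
    """Hypothetical strategy: return only positions, highest score first."""
    return np.argsort(scores)[-n_instances:][::-1]


def query(X, scores, n_instances=1):
    """Hypothetical query(): the only place where instances are looked up."""
    idx = toy_strategy(scores, n_instances)
    return idx, select_rows(X, idx)


X_np = np.arange(6).reshape(3, 2)
X_df = pd.DataFrame(X_np, columns=["a", "b"])
scores = np.array([0.1, 0.9, 0.5])

idx_np, inst_np = query(X_np, scores, n_instances=2)
idx_df, inst_df = query(X_df, scores, n_instances=2)
```

Because the strategy never touches `X`, the type check lives in a single helper instead of being repeated in every query strategy.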

Once this is done, the only changes remaining to support pandas data frames are when working with instance representations, e.g. calculating similarities between them. If #104 is resolved (I am working on it), one could replace the used estimator with an sklearn transformation + estimator pipeline, where the transformation converts the data frame to a matrix. Something similar to:

    from sklearn.preprocessing import OneHotEncoder
    from sklearn.pipeline import Pipeline
    from sklearn.ensemble import RandomForestClassifier
    from modAL.models import ActiveLearner
    from modAL.batch import uncertainty_batch_sampling

    # X_training (a pandas DataFrame) and y_training are assumed to be defined already
    ActiveLearner(
        # it's important to clone the model to have separate models for
        # predictions and active learning loop; ActiveLearner fits the
        # provided estimator to the provided training data
        estimator=Pipeline(steps=[
            ('transform', OneHotEncoder()),  # results in a matrix
            ('classify', RandomForestClassifier())
        ]),
        query_strategy=uncertainty_batch_sampling,
        X_training=X_training,
        y_training=y_training,
        # not implemented; should force query strategies to work on
        # transformed data (one hot encoded)
        on_transformed=True
    )
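As a rough sketch of the transformation step alone, the following uses only scikit-learn and pandas. The toy frame, the labels and the variable names are made up, and it does not implement the proposed `on_transformed` behaviour; it only shows where a matrix representation of the DataFrame could come from.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical training data.
X_training = pd.DataFrame({
    "color": ["red", "blue", "red", "green"],
    "size": ["S", "M", "L", "S"],
})
y_training = [0, 1, 0, 1]

pipeline = Pipeline(steps=[
    ("transform", OneHotEncoder(handle_unknown="ignore")),
    ("classify", RandomForestClassifier(n_estimators=10, random_state=0)),
])
pipeline.fit(X_training, y_training)

# The fitted transformer turns the DataFrame into a (sparse) matrix:
# 3 color categories + 3 size categories = 6 columns.
X_transformed = pipeline.named_steps["transform"].transform(X_training)
predictions = pipeline.predict(X_training)

print(X_transformed.shape)
```

A query strategy that needs instance representations, for example to compute similarities, could work on `X_transformed` rather than on the raw DataFrame.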
