
skrub-data / skrub


Prepping tables for machine learning

Home Page: https://skrub-data.org/

License: BSD 3-Clause "New" or "Revised" License

Python 98.59% Shell 1.41%
machine-learning data-science data-cleaning data data-preparation data-preprocessing data-analysis dirty-data data-wrangling

skrub's Introduction

skrub


skrub (formerly dirty_cat) is a Python library that facilitates prepping your tables for machine learning.

If you like the package, spread the word and ⭐ this repository!

What can skrub do?

skrub provides data assembling tools (TableVectorizer, fuzzy_join, ...) and encoders (GapEncoder, MinHashEncoder, ...) for morphological similarities, of which we usually identify three common cases: similarities, typos, and variations.

See our examples.
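A minimal usage sketch (the DataFrame contents below are illustrative toy data, not from the examples):

import pandas as pd
from skrub import TableVectorizer

# A small table mixing a dirty categorical column with a numeric one
df = pd.DataFrame({
    "employee_position_title": ["Office Aide", "Office aide II", "Master Police Officer"],
    "year_first_hired": [2010, 2012, 1998],
})

# TableVectorizer chooses a suitable encoder per column and returns
# numeric features ready for a scikit-learn estimator.
vectorizer = TableVectorizer()
X = vectorizer.fit_transform(df)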

What skrub cannot do

Semantic similarities are currently not supported. For example, the similarity between car and automobile is outside the reach of the methods implemented here.

This kind of problem is tackled by Natural Language Processing methods.

skrub can still help with handling typos and variations in this kind of setting.

For a detailed description of the problem of encoding dirty categorical data, see Similarity encoding for learning with dirty categorical variables [1] and Encoding high-cardinality string categorical variables [2].

Installation

The easiest way to install skrub is via `pip`:

pip install skrub -U

or `conda`:

conda install -c conda-forge skrub

The documentation includes more detailed installation instructions.

Dependencies

Dependencies and minimal versions are listed in the setup file and on skrub's website.

Contributing

The best way to support the development of skrub is to spread the word!

Also, if you already are a skrub user, we would love to hear about your use cases and challenges in the Discussions section.

To report a bug or suggest enhancements, please open an issue and/or submit a pull request.

Additional resources

References


  1. Patricio Cerda, Gaël Varoquaux, Balázs Kégl. Similarity encoding for learning with dirty categorical variables. 2018. Machine Learning journal, Springer.

  2. Patricio Cerda, Gaël Varoquaux. Encoding high-cardinality string categorical variables. 2020. IEEE Transactions on Knowledge & Data Engineering.

skrub's People

Contributors

ahoyosid, alexdt1982, alexis-cvetkov, amy12xx, ariera, badr-moufad, dcor01, dsleo, flefebv, gaelvaroquaux, glemaitre, hamedonline, jeremiedbb, jeromedockes, jjerphan, jovan-stojanovic, juanitorduz, leogrin, lilianboulard, mcuny, myungkim930, nicolasgensollen, pcerda, pierreglaser, samronsin, theooj, tialo, twsthomas, vhoulbreque, vincent-maladiere


skrub's Issues

Location of data download

Currently the data are downloaded into the source directory of dirty_cat. This is not a good solution, because dirty_cat might be installed as root, in which case this directory is not writable. We should use a location either in the home directory or in the local directory.

Automatic column vectorizer depending on the column structure

Automatically select the type of encoder for a given column depending on its structure:

  • If the column type is 'float' or 'int', pass through.

  • If the column type is 'categorical', select the encoder depending on the cardinality k of the variable. For example: if k <= 20, select one-hot encoding; otherwise, select similarity encoding.
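A rough sketch of this heuristic (the helper name and threshold are illustrative, not part of the library):

from sklearn.preprocessing import OneHotEncoder
from dirty_cat import SimilarityEncoder

def select_encoder(column, cardinality_threshold=20):
    """Pick an encoder for a single pandas Series based on dtype and cardinality."""
    if column.dtype.kind in ("i", "f"):
        return "passthrough"          # numeric columns pass through unchanged
    if column.nunique() <= cardinality_threshold:
        return OneHotEncoder()        # low cardinality: one-hot encoding
    return SimilarityEncoder()        # high cardinality: similarity encoding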

dirty-cat cannot be used with Python 2.7

The super().__init__() call located there cannot be used in Python 2.7.
In Python 2.7, the equivalent syntax is super(ChildClass, self).__init__(arg1, arg2).

This is one example of a line that should be changed; there may be others.

Implement the n-gram similarity with a HashingVectorizer

To avoid having to retrain a CountVectorizer when new data comes in, we should add the option to use a HashingVectorizer. This could be exposed via an option such as "hashing_dim=2**16", with "hashing_dim=None" falling back to the CountVectorizer.
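A sketch of how the fallback could look (the helper below is illustrative; only the "hashing_dim" name comes from the suggestion above):

from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer

def make_ngram_vectorizer(hashing_dim=2**16, ngram_range=(3, 3)):
    """Return a stateless HashingVectorizer, or a CountVectorizer when hashing_dim is None."""
    if hashing_dim is None:
        # needs refitting when new categories arrive
        return CountVectorizer(analyzer="char", ngram_range=ngram_range)
    # stateless: no refit needed for unseen categories
    return HashingVectorizer(analyzer="char", ngram_range=ngram_range, n_features=hashing_dim)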

datasets import error

I installed dirty_cat with pip but there don't appear to be any datasets. Am I doing something wrong or is the notebook at the bottom of this page out of date?

ImportError                               Traceback (most recent call last)
<ipython-input-6-fe859d2d7c31> in <module>()
      1 import pandas as pd
----> 2 from dirty_cat import datasets
      3 
      4 employee_salaries = datasets.fetch_employee_salaries()
      5 print(employee_salaries['description'])

ImportError: cannot import name 'datasets' from 'dirty_cat' (/Users/danielm/anaconda3/lib/python3.7/site-packages/dirty_cat/__init__.py)

Expose the seed of the kmeans

We need to expose the seed of the kmeans in the similarity encoder, to make the pipelines fully reproducible. For this, we need to add a "random_state" option to the similarity encoder, as in many scikit-learn models, and pass it to kmeans.
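A sketch of the intended usage once the parameter is passed through to the k-means (argument values are illustrative):

from dirty_cat import SimilarityEncoder

# Fixing random_state makes the k-means prototype selection reproducible.
enc = SimilarityEncoder(categories="k-means", n_prototypes=10, random_state=42)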

BUG: specifying a list of categories to use fails at init time

Specifying a list of categories to use when creating a SimilarityEncoder fails:

from dirty_cat import SimilarityEncoder
SimilarityEncoder(categories=['foo', 'bar'])
AssertionError                            Traceback (most recent call last)
<ipython-input-2-713e2faae33f> in <module>
----> 1 SimilarityEncoder(categories=['foo', 'bar'])

~/repos/dirty_cat/dirty_cat/similarity_encoder.py in __init__(self, similarity, ngram_range, categories, dtype, handle_unknown, hashing_dim, n_prototypes, random_state)
    193         self.random_state = random_state
    194 
--> 195         assert categories in [None, 'auto', 'k-means', 'most_frequent']
    196         if categories in ['k-means', 'most_frequent'] and (n_prototypes is None or n_prototypes == 0):
    197             raise ValueError('n_prototypes expected None or a positive non null integer')

AssertionError: 

Apart from the obvious fix, I will look for any further errors before submitting a PR.

Change default of "handle_unknown" in SimilarityEncoder

The default should be "ignore" rather than "error", as the SimilarityEncoder encodes unknown categories very well.

We should change the default, adapt the examples (i.e. remove the option from the examples that set it and always use the default), and add a line to CHANGES.rst.
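For reference, a usage sketch of the proposed default behaviour:

from dirty_cat import SimilarityEncoder

# Unknown categories seen at transform time are encoded instead of raising an error.
enc = SimilarityEncoder(handle_unknown="ignore")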

can't read the data?

data = pd.read_csv(employee_salaries['./home/azureuser/anaconda/envs/azureml_py36/lib/python3.6/site-packages/dirty_cat/data'])

I passed the path in a Jupyter Notebook, run on Azure.

Error

The downloaded data contains the employee_salaries dataset.
It can originally be found at: https://catalog.data.gov/dataset/ employee-salaries-2016

KeyError Traceback (most recent call last)
in
1 employee_salaries = datasets.fetch_employee_salaries()
2 print(employee_salaries['description'])
----> 3 data = pd.read_csv(employee_salaries['./home/azureuser/anaconda/envs/azureml_py36/lib/python3.6/site-packages/dirty_cat/data'])
4 print(data.head(n=5))

KeyError: './home/azureuser/anaconda/envs/azureml_py36/lib/python3.6/site-packages/dirty_cat/data'
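For reference, a sketch of the likely intended usage, assuming the returned dictionary exposes a "path" key (as the medical_charge example further down does):

import pandas as pd
from dirty_cat import datasets

employee_salaries = datasets.fetch_employee_salaries()
print(employee_salaries['description'])

# Read the CSV from the path reported by the fetcher, not from a hard-coded string.
data = pd.read_csv(employee_salaries['path'])
print(data.head(n=5))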

Document random projections

In practice it is very useful to add a random projection step after the similarity encoder when the dimension (i.e. the cardinality of the categorical variable) is high. This is easy, but we should document it.
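A minimal sketch of the pattern to document, assuming a scikit-learn random projection appended after the encoder (n_components is illustrative):

from sklearn.pipeline import make_pipeline
from sklearn.random_projection import GaussianRandomProjection
from dirty_cat import SimilarityEncoder

# Reduce the encoded dimension when the categorical variable has high cardinality.
encoder = make_pipeline(
    SimilarityEncoder(),
    GaussianRandomProjection(n_components=100),
)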

Example 04 uses race and gender attributes

I know the purpose of the example is to illustrate performance, but we are currently illustrating it on the traffic violations dataset and using the race and gender attributes while building the model. It may be better to remove these two attributes (race and gender) from the modeling.

Comparison to factorization methods (e.g. SVD)

I used the notebook 02_fit_predict_plot_employee_salaries.html to compare the performance using SVD (with the same number of components as MinHashEncoder). This could be added to the current or a different notebook, if you agree.

one-hot encoding       r2 score: mean: 0.856; std: 0.034
target encoding        r2 score: mean: 0.774; std: 0.033
similarity encoding    r2 score: mean: 0.915; std: 0.012
minhash encoding       r2 score: mean: 0.753; std: 0.025
svd encoding           r2 score: mean: 0.856; std: 0.014

Better documentation on compiling doc

It took me a couple of attempts to compile the documentation, and I got plenty of sklearn FutureWarnings.

Maybe you should add a small contributing guide with instructions to quickly build the documentation.

Add a "kmeans" strategy to select the prototypes

Using a kmeans to define prototypes is useful to have a reduced dimensionality. The steps are the following:

  1. hash all the strings of the categories
  2. randomly project them into 256 dimensions
  3. run a k-means on the resulting data, with the number of clusters equal to the desired dimensionality (and n_init=1)
  4. use a nearest-neighbors search from scikit-learn to assign each cluster center to an original category.
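A rough sketch of these steps (names, n-gram range, and projection details are illustrative, not the actual implementation):

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.neighbors import NearestNeighbors
from sklearn.random_projection import SparseRandomProjection

def kmeans_prototypes(categories, n_prototypes):
    # 1. hash all the category strings into character n-gram features
    X = HashingVectorizer(analyzer="char", ngram_range=(3, 3)).transform(categories)
    # 2. randomly project them into 256 dimensions
    X = SparseRandomProjection(n_components=256).fit_transform(X)
    # 3. k-means with as many clusters as the desired dimensionality
    km = KMeans(n_clusters=n_prototypes, n_init=1).fit(X)
    # 4. assign each cluster center to the nearest original category
    _, idx = NearestNeighbors(n_neighbors=1).fit(X).kneighbors(km.cluster_centers_)
    return [categories[i] for i in idx.ravel()]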

04_dimension_reduction_and_performance throws a multiprocessing exception on windows

This is due to the resource_used function used in this line:
results = resource_used(model_selection.cross_validate)(model, df, y, )

This can be fixed by putting this call inside an `if __name__ == '__main__':` block. I tested this on Windows and can confirm that doing so fixes the problem.

See here: https://stackoverflow.com/questions/18204782/runtimeerror-on-windows-trying-python-multiprocessing
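A sketch of the suggested change, reusing the names from the line quoted above:

# Guard the call so child processes can safely re-import the example module on Windows.
if __name__ == '__main__':
    results = resource_used(model_selection.cross_validate)(model, df, y)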

Details of exception raised:
File "C:\Users\Amanda\Miniconda3\envs\dirtycat\lib\multiprocessing\spawn.py", line 134, in _check_not_importing_main
raise RuntimeError('''
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.

    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:

        if __name__ == '__main__':
            freeze_support()
            ...

    The "freeze_support()" line can be omitted if the program
    is not going to be frozen to produce an executable.

Apply minhash_encoder to more than 1024 categories returns -1

Hi all,
I am trying to apply the MinHashEncoder to a somewhat large dataset of strings (~200k distinct).
I was testing my code with 10 strings, and it was running fine.
But when I tested with the full dataset, most of the strings were represented as all-'-1' vectors.
I took a look at the source code and found this line inside 'minhash_encoder.py', which may be causing the problem:
self.hash_dict = LRUDict(capacity=2**10)
Not sure why this is used, but I checked with 1025 strings, and only the first one returns -1.
This encoder should work with many more categories, right?

Code to replicate:

from dirty_cat import MinHashEncoder
import random
import string

def get_random_string(length):
    letters = string.ascii_lowercase
    result_str = ''.join(random.choice(letters) for i in range(length))
    return result_str

# 1024 categories -> all ok
raw_data = [get_random_string(10) for x in range(1024)]
hash_encoder = MinHashEncoder(n_components=10)
transformed_values = hash_encoder.fit_transform(raw_data)
print(transformed_values)

# 1025 categories -> first represented as -1's
raw_data = [get_random_string(10) for x in range(1025)]
hash_encoder = MinHashEncoder(n_components=10)
transformed_values = hash_encoder.fit_transform(raw_data)
print(transformed_values)

requests version required

When trying to fetch data I get the following error:

In [14]: employee_salaries = datasets.fetch_employee_salaries()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-14-dd00eef23e9e> in <module>()
----> 1 employee_salaries = datasets.fetch_employee_salaries()

/home/pcerda/parietal/dirty-cat/dirty_cat/dirty_cat/datasets/fetching.py in fetch_employee_salaries()
    289 
    290 def fetch_employee_salaries():
--> 291     return fetch_dataset(EMPLOYEE_SALARIES_CONFIG, show_progress=False)
    292 
    293 

/home/pcerda/parietal/dirty-cat/dirty_cat/dirty_cat/datasets/fetching.py in fetch_dataset(configfile, show_progress)
    184     for urlinfo in configfile.urlinfos:
    185         _fetch_file(urlinfo.url, data_dir, filenames=urlinfo.filenames,
--> 186                     uncompress=urlinfo.uncompress, show_progress=show_progress)
    187     # returns the absolute path of the csv file where the data is
    188     result_dict = {

/home/pcerda/parietal/dirty-cat/dirty_cat/dirty_cat/datasets/fetching.py in _fetch_file(url, data_dir, filenames, overwrite, md5sum, uncompress, show_progress)
    271 
    272     if download:
--> 273         _download_and_write(url, temp_full_name, show_progress=show_progress)
    274 
    275     # chunk writing is not implemented, see if necessary

/home/pcerda/parietal/dirty-cat/dirty_cat/dirty_cat/datasets/fetching.py in _download_and_write(url, file, show_progress)
    153         # accessing the content attribute
    154         from ..datasets.utils import request_get
--> 155         with request_get(url, stream=True) as r:
    156             total_length = r.headers.get('Content-Length')
    157             if total_length is not None:

AttributeError: __enter__

My version of requests was '2.12.14'. Updating to '2.18.4' solved the problem.

Comparison to embeddings?

Have you run any comparison to embeddings since you published the paper? From what I could tell, the paper did not include such a comparison yet.

the last version (0.0.7) isn't available on pip

Hello,

According to the recent changes list (https://dirty-cat.github.io/stable/CHANGES.html), the last version is 0.0.7, but the version available on PyPI is 0.0.5 (https://pypi.org/project/dirty_cat/).

I noticed the issue when I tried to import 'MinHashEncoder' (as described in the documentation and the examples), as it threw the error

ImportError: cannot import name 'MinHashEncoder' from 'dirty_cat'

I had to install dirty_cat directly from the repository (pip install https://github.com/dirty-cat/dirty_cat/archive/master.zip), and now it works like a charm, but I guess that other people may stumble upon this problem!

Thanks, and congrats by the way, that's a really great library!

Add support for missing values in the encoders

Encoding a missing value as a vector of zeros is a reasonable choice. Our theoretical study (https://arxiv.org/abs/1902.06931) shows that the most important thing is to encode missing values as a special value that can later be picked up by the supervised step.

Our encoders should have an option that controls whether missing values are encoded as zeros or an error is raised (following scikit-learn encoders).
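A hypothetical API sketch; "handle_missing" and its values are illustrative names, not an existing option:

from dirty_cat import SimilarityEncoder

# Hypothetical option: encode missing values as all-zero vectors...
enc = SimilarityEncoder(handle_missing="zero")
# ...or raise an error, following scikit-learn encoders.
enc = SimilarityEncoder(handle_missing="error")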

Codecov for ``count_3_grams.py`` at 0%

Hi,

On the codecov report, I can see that it considers the file count_3_grams.py as not tested (coverage at 0%).

However, when we take a peek into the file, we can see that there are actually unit tests inside.

It is also said in the docstring:

Run unit tests with "pytest count_3_grams.py"

To improve the code coverage, I would advise splitting the file in two so that the unit tests move into dirty_cat/tests/.

Please tell me if it's fine to do that, and I'll submit a PR.

Thanks
Lilian

dirty_cat jaro distance is different from python-Levenshtein package

The function test_compare_implementations of test_string_distances.py compares the dirty_cat implementation of the jaro and jaro_winkler metrics with the python-Levenshtein package.
In practice these two implementations are different, but I believe this has never been flagged by Travis, since the comparison is done on a small number (N=10) of random strings for which both methods return 0 with very high probability.

Here is a simple code to expose the difference:

from dirty_cat import string_distances
import Levenshtein

for s1, s2 in[('acc', 'babbb'), ('acc', 'bbabb'), ('acc', 'bbbab')]:
    print(f'dirty_cat {s1, s2} = {string_distances._jaro_winkler(s1, s2, winkler=False)}')
    print(f'Levenshtein {s1, s2} = {Levenshtein.jaro(s1, s2)}')
    print('-'*50)

>>> dirty_cat ('acc', 'babbb') = 0.5111111111111111
>>> Levenshtein ('acc', 'babbb') = 0.5111111111111111
>>> --------------------------------------------------
>>> dirty_cat ('acc', 'bbabb') = 0.0
>>> Levenshtein ('acc', 'bbabb') = 0.5111111111111111
>>> --------------------------------------------------
>>> dirty_cat ('acc', 'bbbab') = 0.0
>>> Levenshtein ('acc', 'bbbab') = 0.0
>>> --------------------------------------------------

From these examples and other tests I have made, I believe the difference lies in the definition of matching characters. The dirty_cat implementation uses the Wikipedia definition: two characters from s1 and s2 are considered matching only if they are the same and not farther apart than
int(max(|s1|, |s2|) / 2) - 1, which equals 1 in my example.

Instead, the python-Levenshtein implementation seems to use int((min(|s1|, |s2|) + 1) / 2) = 2.

Add a CHANGES.rst

We should add a CHANGES.rst file in which we list the major changes between releases, and link to it from the documentation, probably in the sidebar:

  • Add the file at the project root and populate it with the major changes since the 0.0.2 release
  • Add a symlink in the doc folder
  • Add a link in the Sphinx sidebar

BUG: uncompressed csv filepath changed for medical_charge

from dirty_cat.datasets import fetch_medical_charge
import os


medical_charge_info = fetch_medical_charge()

file_exists = os.path.exists(medical_charge_info['path'])
print('medical_charge path exists: {}'.format(file_exists))

folder_contents = os.listdir(os.path.dirname(medical_charge_info['path']))
print('content of medial_charge data folder {}'.format(folder_contents))

Outputs:

Extracting data from /home/pierreglaser/.virtualenvs/dc_filepath_issue_py37/lib/python3.7/site-packages/dirty_cat/data/medical_charge/Inpatient_Data_2011_CSV.zip..... done.
medical_charge path exists: False
content of medial_charge data folder ['Medicare_Hospital_Inpatient_PUF_Methodology_2014-05-30.pdf', 'Medicare_Provider_Charge_Inpatient_DRG100_FY2011.csv']

The filename of the uncompressed CSV has changed.
It may be worth having a routine that periodically downloads and loads all datasets (outside our test suite) to catch such errors.

Link to kernel methods

I feel like there is a very strong link to kernel methods here. You are basically using a string kernel on the dirty categories. If you did a PCA of the similarity matrix, this would be nearly the same as a Nyström embedding of this kernel. Have you looked at the related literature from that angle?
