
skrub-data / skrub


Prepping tables for machine learning

Home Page: https://skrub-data.org/

License: BSD 3-Clause "New" or "Revised" License

Python 98.59% Shell 1.41%
machine-learning data-science data-cleaning data data-preparation data-preprocessing data-analysis dirty-data data-wrangling

skrub's Introduction

skrub


skrub (formerly dirty_cat) is a Python library that facilitates prepping your tables for machine learning.

If you like the package, spread the word and ⭐ this repository!

What can skrub do?

skrub provides data assembling tools (TableVectorizer, fuzzy_join, ...) and encoders (GapEncoder, MinHashEncoder, ...) for morphological similarities, of which we usually identify three common cases: similarities, typos, and variations.

See our examples.
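A minimal usage sketch (the DataFrame contents below are illustrative toy data, not from the examples):

import pandas as pd
from skrub import TableVectorizer

# A small table mixing a dirty categorical column with a numeric one
df = pd.DataFrame({
    "employee_position_title": ["Office Aide", "Office aide II", "Master Police Officer"],
    "year_first_hired": [2010, 2012, 1998],
})

# TableVectorizer chooses a suitable encoder per column and returns
# numeric features ready for a scikit-learn estimator.
vectorizer = TableVectorizer()
X = vectorizer.fit_transform(df)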

What skrub cannot do

Semantic similarities are currently not supported. For example, the similarity between car and automobile is outside the reach of the methods implemented here.

This kind of problem is tackled by Natural Language Processing methods.

skrub can still help with handling typos and variations in this kind of setting.

For a detailed description of the problem of encoding dirty categorical data, see Similarity encoding for learning with dirty categorical variables [1] and Encoding high-cardinality string categorical variables [2].

Installation

The easiest way to install skrub is via `pip`:

pip install skrub -U

or `conda`:

conda install -c conda-forge skrub

The documentation includes more detailed installation instructions.

Dependencies

Dependencies and minimal versions are listed in the setup file and on skrub's website.

Contributing

The best way to support the development of skrub is to spread the word!

Also, if you already are a skrub user, we would love to hear about your use cases and challenges in the Discussions section.

To report a bug or suggest enhancements, please open an issue and/or submit a pull request.

Additional resources

References


  1. Patricio Cerda, Gaël Varoquaux, Balázs Kégl. Similarity encoding for learning with dirty categorical variables. 2018. Machine Learning journal, Springer.

  2. Patricio Cerda, Gaël Varoquaux. Encoding high-cardinality string categorical variables. 2020. IEEE Transactions on Knowledge & Data Engineering.

skrub's People

Contributors

ahoyosid, alexdt1982, alexis-cvetkov, amy12xx, ariera, badr-moufad, dcor01, dsleo, flefebv, gaelvaroquaux, glemaitre, hamedonline, jeremiedbb, jeromedockes, jjerphan, jovan-stojanovic, juanitorduz, leogrin, lilianboulard, mcuny, myungkim930, nicolasgensollen, pcerda, pierreglaser, samronsin, theooj, tialo, twsthomas, vhoulbreque, vincent-maladiere


skrub's Issues

Location of data download

Currently the data are downloaded into the source directory of dirty_cat. This is not a good solution, because dirty_cat might be installed as root, in which case this directory is not writable. We should use a location either in the home directory or in the local directory.

Automatic column vectorizer depending on the column structure

Automatically select the type of encoder for a given column depending on its structure:

  • If the column type is 'float' or 'int', pass through.

  • If the column type is 'categorical', select the encoder depending on the cardinality k of the variable. For example: if k <= 20, select one-hot encoding; otherwise, select similarity encoding.
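A rough sketch of this heuristic (the helper name and threshold are illustrative, not part of the library):

from sklearn.preprocessing import OneHotEncoder
from dirty_cat import SimilarityEncoder

def select_encoder(column, cardinality_threshold=20):
    """Pick an encoder for a single pandas Series based on dtype and cardinality."""
    if column.dtype.kind in ("i", "f"):
        return "passthrough"          # numeric columns pass through unchanged
    if column.nunique() <= cardinality_threshold:
        return OneHotEncoder()        # low cardinality: one-hot encoding
    return SimilarityEncoder()        # high cardinality: similarity encoding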

dirty-cat cannot be used with Python 2.7

The super().__init__() call located there cannot be used in Python 2.7.
In Python 2.7, the equivalent syntax is super(ChildClass, self).__init__(arg1, arg2).

This is one example of a line that should be changed; there may be others.

Implement the n-gram similarity with a HashingVectorizer

To avoid having to retrain a CountVectorizer when new data comes in, we should add the option to use a HashingVectorizer. This could be exposed via an option such as "hashing_dim=2**16", with "hashing_dim=None" falling back to the CountVectorizer.
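A sketch of how the fallback could look (the helper below is illustrative; only the "hashing_dim" name comes from the suggestion above):

from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer

def make_ngram_vectorizer(hashing_dim=2**16, ngram_range=(3, 3)):
    """Return a stateless HashingVectorizer, or a CountVectorizer when hashing_dim is None."""
    if hashing_dim is None:
        # needs refitting when new categories arrive
        return CountVectorizer(analyzer="char", ngram_range=ngram_range)
    # stateless: no refit needed for unseen categories
    return HashingVectorizer(analyzer="char", ngram_range=ngram_range, n_features=hashing_dim)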

datasets import error

I installed dirty_cat with pip but there don't appear to be any datasets. Am I doing something wrong or is the notebook at the bottom of this page out of date?

ImportError                               Traceback (most recent call last)
<ipython-input-6-fe859d2d7c31> in <module>()
      1 import pandas as pd
----> 2 from dirty_cat import datasets
      3 
      4 employee_salaries = datasets.fetch_employee_salaries()
      5 print(employee_salaries['description'])

ImportError: cannot import name 'datasets' from 'dirty_cat' (/Users/danielm/anaconda3/lib/python3.7/site-packages/dirty_cat/__init__.py)

Expose the seed of the kmeans

We need to expose the seed of the kmeans in the similarity encoder, to make the pipelines fully reproducible. For this, we need to add a "random_state" option to the similarity encoder, as in many scikit-learn models, and pass it to kmeans.
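A sketch of the intended usage once the parameter is passed through to the k-means (argument values are illustrative):

from dirty_cat import SimilarityEncoder

# Fixing random_state makes the k-means prototype selection reproducible.
enc = SimilarityEncoder(categories="k-means", n_prototypes=10, random_state=42)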

BUG: specifying a list of categories to use fails at init time

Specifying a list of categories to use when creating a SimilarityEncoder fails:

from dirty_cat import SimilarityEncoder
SimilarityEncoder(categories=['foo', 'bar'])
AssertionError                            Traceback (most recent call last)
<ipython-input-2-713e2faae33f> in <module>
----> 1 SimilarityEncoder(categories=['foo', 'bar'])

~/repos/dirty_cat/dirty_cat/similarity_encoder.py in __init__(self, similarity, ngram_range, categories, dtype, handle_unknown, hashing_dim, n_prototypes, random_state)
    193         self.random_state = random_state
    194 
--> 195         assert categories in [None, 'auto', 'k-means', 'most_frequent']
    196         if categories in ['k-means', 'most_frequent'] and (n_prototypes is None or n_prototypes == 0):
    197             raise ValueError('n_prototypes expected None or a positive non null integer')

AssertionError: 

Apart from the obvious fix, I will look for any further errors before submitting a PR.

Change default of "handle_unknown" in SimilarityEncoder

The default should be "ignore" rather than "error", as the SimilarityEncoder encodes unknown categories very well.

We should change the default, adapt the examples (i.e. remove the option from the examples that set it and always use the default), and add a line to CHANGES.rst.
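For reference, a usage sketch of the proposed default behaviour:

from dirty_cat import SimilarityEncoder

# Unknown categories seen at transform time are encoded instead of raising an error.
enc = SimilarityEncoder(handle_unknown="ignore")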

can't read the data?

data = pd.read_csv(employee_salaries['./home/azureuser/anaconda/envs/azureml_py36/lib/python3.6/site-packages/dirty_cat/data'])

I passed the path in a Jupyter Notebook, run on Azure.

Error

The downloaded data contains the employee_salaries dataset.
It can originally be found at: https://catalog.data.gov/dataset/ employee-salaries-2016

KeyError Traceback (most recent call last)
in
1 employee_salaries = datasets.fetch_employee_salaries()
2 print(employee_salaries['description'])
----> 3 data = pd.read_csv(employee_salaries['./home/azureuser/anaconda/envs/azureml_py36/lib/python3.6/site-packages/dirty_cat/data'])
4 print(data.head(n=5))

KeyError: './home/azureuser/anaconda/envs/azureml_py36/lib/python3.6/site-packages/dirty_cat/data'
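For reference, a sketch of the likely intended usage, assuming the returned dictionary exposes a "path" key (as the medical_charge example further down does):

import pandas as pd
from dirty_cat import datasets

employee_salaries = datasets.fetch_employee_salaries()
print(employee_salaries['description'])

# Read the CSV from the path reported by the fetcher, not from a hard-coded string.
data = pd.read_csv(employee_salaries['path'])
print(data.head(n=5))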

Document random projections

In practice it is very useful to add a random projection step after the similarity encoder when the dimension (i.e. the cardinality of the categorical variable) is high. This is easy, but we should document it.
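A minimal sketch of the pattern to document, assuming a scikit-learn random projection appended after the encoder (n_components is illustrative):

from sklearn.pipeline import make_pipeline
from sklearn.random_projection import GaussianRandomProjection
from dirty_cat import SimilarityEncoder

# Reduce the encoded dimension when the categorical variable has high cardinality.
encoder = make_pipeline(
    SimilarityEncoder(),
    GaussianRandomProjection(n_components=100),
)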

Example 04 uses race and gender attributes

I know the purpose of the example is to illustrate performance, but we are currently illustrating it on the traffic violations dataset and using the race and gender attributes while building the model. It may be better to remove these two attributes (race and gender) from the modeling.

Comparison to factorization methods (e.g. SVD)

I used the notebook 02_fit_predict_plot_employee_salaries.html to compare the performance using SVD (with the same number of components as MinHashEncoder). This could be added to the current or a different notebook, if you agree.

one-hot encoding       r2 score: mean: 0.856; std: 0.034
target encoding        r2 score: mean: 0.774; std: 0.033
similarity encoding    r2 score: mean: 0.915; std: 0.012
minhash encoding       r2 score: mean: 0.753; std: 0.025
svd encoding           r2 score: mean: 0.856; std: 0.014

Better documentation on compiling doc

It took me a couple of attempts to compile the documentation, and I got plenty of sklearn FutureWarnings.

Maybe you should add a small contributing guide with instructions to quickly build the documentation.

Add a "kmeans" strategy to select the prototypes

Using a kmeans to define prototypes is useful to have a reduced dimensionality. The steps are the following:

  1. hash all the strings of the categories
  2. randomly project them into 256 dimensions
  3. run a k-means on the resulting data, with the number of clusters equal to the desired dimensionality (and n_init=1)
  4. use a nearest-neighbors search from scikit-learn to assign each cluster center to an original category.
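A rough sketch of these steps (names, n-gram range, and projection details are illustrative, not the actual implementation):

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.neighbors import NearestNeighbors
from sklearn.random_projection import SparseRandomProjection

def kmeans_prototypes(categories, n_prototypes):
    # 1. hash all the category strings into character n-gram features
    X = HashingVectorizer(analyzer="char", ngram_range=(3, 3)).transform(categories)
    # 2. randomly project them into 256 dimensions
    X = SparseRandomProjection(n_components=256).fit_transform(X)
    # 3. k-means with as many clusters as the desired dimensionality
    km = KMeans(n_clusters=n_prototypes, n_init=1).fit(X)
    # 4. assign each cluster center to the nearest original category
    _, idx = NearestNeighbors(n_neighbors=1).fit(X).kneighbors(km.cluster_centers_)
    return [categories[i] for i in idx.ravel()]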

04_dimension_reduction_and_performance throws a multiprocessing exception on windows

This is due to the resource_used function used in this line:
results = resource_used(model_selection.cross_validate)(model, df, y, )

This can be fixed by putting this call inside an `if __name__ == '__main__':` block. I tested this on Windows and can confirm that doing so fixes the problem.

See here: https://stackoverflow.com/questions/18204782/runtimeerror-on-windows-trying-python-multiprocessing
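A sketch of the suggested change, reusing the names from the line quoted above:

# Guard the call so child processes can safely re-import the example module on Windows.
if __name__ == '__main__':
    results = resource_used(model_selection.cross_validate)(model, df, y)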

Details of exception raised:
File "C:\Users\Amanda\Miniconda3\envs\dirtycat\lib\multiprocessing\spawn.py", line 134, in _check_not_importing_main
raise RuntimeError('''
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.

    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:

        if __name__ == '__main__':
            freeze_support()
            ...

    The "freeze_support()" line can be omitted if the program
    is not going to be frozen to produce an executable.

Apply minhash_encoder to more than 1024 categories returns -1

Hi all,
I am trying to apply the MinHashEncoder to a somewhat large dataset of strings (~200k distinct).
I was testing my code with 10 strings, and it was running fine.
But when I tested with the full dataset, most of the strings were represented as all-'-1' vectors.
I took a look at the source code and found this line inside 'minhash_encoder.py', which may be causing the problem:
self.hash_dict = LRUDict(capacity=2**10)
Not sure why this is used, but I checked with 1025 strings, and only the first one returns -1.
This encoder should work with many more categories, right?

Code to replicate:

from dirty_cat import MinHashEncoder
import random
import string

def get_random_string(length):
    letters = string.ascii_lowercase
    result_str = ''.join(random.choice(letters) for i in range(length))
    return result_str

# 1024 categories -> all ok
raw_data = [get_random_string(10) for x in range(1024)]
hash_encoder = MinHashEncoder(n_components=10)
transformed_values = hash_encoder.fit_transform(raw_data)
print(transformed_values)

# 1025 categories -> first represented as -1's
raw_data = [get_random_string(10) for x in range(1025)]
hash_encoder = MinHashEncoder(n_components=10)
transformed_values = hash_encoder.fit_transform(raw_data)
print(transformed_values)

requests version required

When trying to fetch data I get the following error:

In [14]: employee_salaries = datasets.fetch_employee_salaries()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-14-dd00eef23e9e> in <module>()
----> 1 employee_salaries = datasets.fetch_employee_salaries()

/home/pcerda/parietal/dirty-cat/dirty_cat/dirty_cat/datasets/fetching.py in fetch_employee_salaries()
    289 
    290 def fetch_employee_salaries():
--> 291     return fetch_dataset(EMPLOYEE_SALARIES_CONFIG, show_progress=False)
    292 
    293 

/home/pcerda/parietal/dirty-cat/dirty_cat/dirty_cat/datasets/fetching.py in fetch_dataset(configfile, show_progress)
    184     for urlinfo in configfile.urlinfos:
    185         _fetch_file(urlinfo.url, data_dir, filenames=urlinfo.filenames,
--> 186                     uncompress=urlinfo.uncompress, show_progress=show_progress)
    187     # returns the absolute path of the csv file where the data is
    188     result_dict = {

/home/pcerda/parietal/dirty-cat/dirty_cat/dirty_cat/datasets/fetching.py in _fetch_file(url, data_dir, filenames, overwrite, md5sum, uncompress, show_progress)
    271 
    272     if download:
--> 273         _download_and_write(url, temp_full_name, show_progress=show_progress)
    274 
    275     # chunk writing is not implemented, see if necessary

/home/pcerda/parietal/dirty-cat/dirty_cat/dirty_cat/datasets/fetching.py in _download_and_write(url, file, show_progress)
    153         # accessing the content attribute
    154         from ..datasets.utils import request_get
--> 155         with request_get(url, stream=True) as r:
    156             total_length = r.headers.get('Content-Length')
    157             if total_length is not None:

AttributeError: __enter__

My version of requests was '2.12.14'. Updating to '2.18.4' solved the problem.

Comparison to embeddings?

Have you run any comparison to embeddings since you published the paper? From what I could tell, the paper did not include such a comparison yet.

the last version (0.0.7) isn't available on pip

Hello,

According to the recent changes list (https://dirty-cat.github.io/stable/CHANGES.html), the last version is 0.0.7, but the version available on PyPI is 0.0.5 (https://pypi.org/project/dirty_cat/).

I noticed the issue when I tried to import 'MinHashEncoder' (as described in the documentation and the examples), as it threw the error

ImportError: cannot import name 'MinHashEncoder' from 'dirty_cat'

I had to install dirty_cat directly from the repository (pip install https://github.com/dirty-cat/dirty_cat/archive/master.zip), and now it works like a charm, but I guess that other people may stumble upon this problem!

Thanks, and congrats by the way, that's a really great library!

Add support for missing values in the encoders

Encoding a missing value as a vector of zeros is a reasonable choice. Our theoretical study (https://arxiv.org/abs/1902.06931) shows that the most important thing is to encode missing values as a special value that can later be picked up by the supervised step.

Our encoders should have an option that controls whether missing values are encoded as zeros or an error is raised (following scikit-learn encoders).
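A hypothetical API sketch; "handle_missing" and its values are illustrative names, not an existing option:

from dirty_cat import SimilarityEncoder

# Hypothetical option: encode missing values as all-zero vectors...
enc = SimilarityEncoder(handle_missing="zero")
# ...or raise an error, following scikit-learn encoders.
enc = SimilarityEncoder(handle_missing="error")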

Codecov for ``count_3_grams.py`` at 0%

Hi,

On the codecov report, I can see that it considers the file count_3_grams.py as not tested (coverage at 0%).

However, when we take a peek into the file, we can see that there are actually unit tests inside.

It is also said in the docstring:

Run unit tests with "pytest count_3_grams.py"

To improve the code coverage, I would advise splitting the file in two so that the unit tests move into dirty_cat/tests/.

Please tell me if it's fine to do that, and I'll submit a PR.

Thanks
Lilian

dirty_cat jaro distance is different from python-Levenshtein package

The function test_compare_implementations of test_string_distances.py compares the dirty_cat implementation of the jaro and jaro_winkler metrics with the python-Levenshtein package.
In practice these two implementations are different, but I believe this has never been flagged by Travis, since the comparison is done on a small number (N=10) of random strings for which both methods return 0 with very high probability.

Here is a simple code to expose the difference:

from dirty_cat import string_distances
import Levenshtein

for s1, s2 in[('acc', 'babbb'), ('acc', 'bbabb'), ('acc', 'bbbab')]:
    print(f'dirty_cat {s1, s2} = {string_distances._jaro_winkler(s1, s2, winkler=False)}')
    print(f'Levenshtein {s1, s2} = {Levenshtein.jaro(s1, s2)}')
    print('-'*50)

>>> dirty_cat ('acc', 'babbb') = 0.5111111111111111
>>> Levenshtein ('acc', 'babbb') = 0.5111111111111111
>>> --------------------------------------------------
>>> dirty_cat ('acc', 'bbabb') = 0.0
>>> Levenshtein ('acc', 'bbabb') = 0.5111111111111111
>>> --------------------------------------------------
>>> dirty_cat ('acc', 'bbbab') = 0.0
>>> Levenshtein ('acc', 'bbbab') = 0.0
>>> --------------------------------------------------

From these examples and other tests I have made, I believe the difference lies in the definition of matching characters. The dirty_cat implementation uses the Wikipedia definition: two characters from s1 and s2 are considered matching only if they are the same and not farther apart than
int(max(|s1|, |s2|) / 2) - 1, which equals 1 in my example.

Instead, the python-Levenshtein implementation seems to use int((min(|s1|, |s2|) + 1) / 2) = 2.

Add a CHANGES.rst

We should add a CHANGES.rst file in which we list the major changes between releases, and link to it from the documentation, probably in the sidebar:

  • Add the file at the project root and populate it with the major changes since the 0.0.2 release
  • Add a symlink in the doc folder
  • Add a link in the Sphinx sidebar

BUG: uncompressed csv filepath changed for medical_charge

from dirty_cat.datasets import fetch_medical_charge
import os


medical_charge_info = fetch_medical_charge()

file_exists = os.path.exists(medical_charge_info['path'])
print('medical_charge path exists: {}'.format(file_exists))

folder_contents = os.listdir(os.path.dirname(medical_charge_info['path']))
print('content of medial_charge data folder {}'.format(folder_contents))

Outputs:

Extracting data from /home/pierreglaser/.virtualenvs/dc_filepath_issue_py37/lib/python3.7/site-packages/dirty_cat/data/medical_charge/Inpatient_Data_2011_CSV.zip..... done.
medical_charge path exists: False
content of medial_charge data folder ['Medicare_Hospital_Inpatient_PUF_Methodology_2014-05-30.pdf', 'Medicare_Provider_Charge_Inpatient_DRG100_FY2011.csv']

The filename of the uncompressed CSV has changed.
It may be worth having a routine that periodically downloads and loads all datasets (outside our test suite) to catch such errors.

Link to kernel methods

I feel like there is a very strong link to kernel methods here. You are basically using a string kernel on the dirty categories. If you did a PCA of the similarity matrix, this would be nearly the same as a Nyström embedding of this kernel. Have you looked at the related literature from that angle?
