

Pandas and Spark DataFrame comparison for humans and more!

Home Page: https://capitalone.github.io/datacompy/

License: Apache License 2.0

Topics: python, pandas, spark, data, data-science, compare, dataframes, numpy, pyspark, fugue


DataComPy


DataComPy is a package to compare two Pandas DataFrames. It originally started as something of a replacement for SAS's PROC COMPARE for Pandas DataFrames, with more functionality than just Pandas.DataFrame.equals(Pandas.DataFrame) (it prints out comparison statistics and lets you tweak how close matches have to be). It was then extended to carry that functionality over to Spark DataFrames.

Quick Installation

pip install datacompy

or

conda install datacompy

Installing extras

If you would like to use Spark or any other backends, please make sure you install via extras:

pip install datacompy[spark]
pip install datacompy[dask]
pip install datacompy[duckdb]
pip install datacompy[polars]
pip install datacompy[ray]

Legacy Spark Deprecation

Starting with version 0.12.0

The original SparkCompare implementation differs from all the other native implementations. To align the API and keep behaviour consistent, we are deprecating SparkCompare and moving it into a new module, LegacySparkCompare.

If you wish to keep using the old SparkCompare moving forward, you can:

from datacompy.legacy import LegacySparkCompare

Supported versions and dependencies

Different versions of Spark, Pandas, and Python interact differently. Below is a matrix of what we test with. With the move to the Pandas on Spark API and compatibility issues with Pandas 2+, we will for the time being not support Pandas 2 with the Pandas on Spark implementation. Spark plans to support Pandas 2 in Spark 4.

With version 0.12.0:

  • Pandas 2.0.0 is not supported for the native Spark implementation
  • Spark 3.1 support is dropped
  • Python 3.8 support is dropped
[Compatibility matrix: Spark 3.2.4 / 3.3.4 / 3.4.2 / 3.5.1 against Python 3.9 through 3.12, plus Pandas < 1.5.3 / >= 2.0.0 against Native Pandas, Native Spark, and Fugue. The per-cell support markers did not survive extraction; see https://capitalone.github.io/datacompy/ for the rendered matrix.]

Note

At the current time Python 3.12 is not supported by Spark, nor by Ray within Fugue.

Supported backends

  • Pandas: (See documentation)
  • Spark (Pandas on Spark API): (See documentation)
  • Polars (Experimental): (See documentation)
  • Fugue is a Python library that provides a unified interface for data processing on Pandas, DuckDB, Polars, Arrow, Spark, Dask, Ray, and many other backends. DataComPy integrates with Fugue to provide a simple way to compare data across these backends. Please note that Fugue will use the Pandas (Native) logic at its lowest level (See documentation)

Contributors

We welcome and appreciate your contributions! Before we can accept any contributions, we ask that you please be sure to sign the Contributor License Agreement (CLA).

This project adheres to the Open Source Code of Conduct. By participating, you are expected to honor this code.

Roadmap

Roadmap details can be found here

datacompy's People

Contributors

aguiddir, azadekhalaj, brainlogic, dan-coates, fdosani, gandhis1, gforsyth, github-actions[bot], gliptak, goodwanghan, hugovk, jborchma, jdawang, krishanbhasin, mark-thm, mend-bolt-for-github[bot], mend-for-github-com[bot], mnizol, nikhiljarora, pronoym99, ryanshiroma, santhisridharan, theianrobertson, tmbjmu, vaporjawn


datacompy's Issues

Issue with nulls in join columns when de-duping

There's an issue with the logic that tries to handle duplicate data. This is a minimal example:

import numpy as np
import pandas as pd
import datacompy

df1 = pd.DataFrame({
    'fld_1': [1,2,2,3,3,4,5,5],
    'fld_2': ['A',np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan]})

df2 = pd.DataFrame({
    'fld_1': [1,2,3,4,5],
    'fld_2': ['A',np.nan,np.nan,np.nan,np.nan]})

comp = datacompy.Compare(df1, df2, join_columns=['fld_1', 'fld_2'])
assert comp.subset()

df2 is a subset of df1, but it's not matching. The issue is in the de-dupe logic: the groupby(...).cumcount() just lumps any rows with nulls in them into one big category. That needs to be made more resilient to nulls; a sketch of a possible workaround follows below.
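
A minimal sketch of a caller-side workaround until the de-dupe logic is null-safe: fill the nullable join column with a sentinel value before comparing, so the cumcount ranks rows per real key. The sentinel value is an assumption and must not collide with real data.

import numpy as np
import pandas as pd
import datacompy

df1 = pd.DataFrame({
    'fld_1': [1, 2, 2, 3, 3, 4, 5, 5],
    'fld_2': ['A', np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan]})
df2 = pd.DataFrame({
    'fld_1': [1, 2, 3, 4, 5],
    'fld_2': ['A', np.nan, np.nan, np.nan, np.nan]})

SENTINEL = '__NULL__'  # assumed not to appear in the real data
comp = datacompy.Compare(
    df1.fillna({'fld_2': SENTINEL}),
    df2.fillna({'fld_2': SENTINEL}),
    join_columns=['fld_1', 'fld_2'])
print(comp.subset())  # should now report df2 as a subset of df1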

Rebuild docs automagically with Travis

Need to dig into this a bit more, but this would probably mean moving docs to the gh-pages branch, and rebuilding every time there's a commit to master.

`Compare._validate_dataframe` assumes columns have string labels

It appears that _validate_dataframe assumes that columns have string labels (it unconditionally calls .lower() on them), which raises AttributeError if columns instead have other types of labels (e.g. numerical):

In [18]: df = pd.DataFrame(np.random.randn(2), index=np.arange(0.0, 1.0, 0.5),
    ...:                             columns=[0.01])

In [19]: df
Out[19]:
         0.01
0.0 -0.283165
0.5 -0.388745

In [20]: compare = datacompy.Compare(df, df, on_index=True)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-20-d3440cd3839e> in <module>
----> 1 compare = datacompy.Compare(df, df, on_index=True)

lib/python3.6/site-packages/datacompy/core.py in __init__(self, df1, df2, join_columns, on_index, abs_tol, rel_tol, df1_name, df2_name, ignore_spaces, ignore_case)
    103
    104         self._any_dupes = False
--> 105         self.df1 = df1
    106         self.df2 = df2
    107         self.df1_name = df1_name

lib/python3.6/site-packages/datacompy/core.py in df1(self, df1)
    121         """Check that it is a dataframe and has the join columns"""
    122         self._df1 = df1
--> 123         self._validate_dataframe("df1")
    124
    125     @property

lib/python3.6/site-packages/datacompy/core.py in _validate_dataframe(self, index)
    145             raise TypeError("{} must be a pandas DataFrame".format(index))
    146
--> 147         dataframe.columns = [col.lower() for col in dataframe.columns]
    148         # Check if join_columns are present in the dataframe
    149         if not set(self.join_columns).issubset(set(dataframe.columns)):

lib/python3.6/site-packages/datacompy/core.py in <listcomp>(.0)
    145             raise TypeError("{} must be a pandas DataFrame".format(index))
    146
--> 147         dataframe.columns = [col.lower() for col in dataframe.columns]
    148         # Check if join_columns are present in the dataframe
    149         if not set(self.join_columns).issubset(set(dataframe.columns)):

AttributeError: 'float' object has no attribute 'lower'
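
A minimal sketch of how the lowercasing could be made robust to non-string labels; this is hypothetical code, not the library's current implementation.

import numpy as np
import pandas as pd

def lower_string_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Lowercase only string column labels, leaving numeric labels untouched."""
    df = df.copy()
    df.columns = [c.lower() if isinstance(c, str) else c for c in df.columns]
    return df

df = pd.DataFrame(np.random.randn(2, 1), columns=[0.01])
print(lower_string_columns(df).columns)  # no AttributeError for the float label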

Understanding Intersect_rows when df1 and df2 have one-to-many relationships

If I am trying to compare the intersect rows for the following two dataframes:

df1 = pd.DataFrame({'key': ['foo', 'bar', 'baz', 'foo'],
                   'value': [1, 2, 3, 5]              
                   })
df2 = pd.DataFrame({'key': ['foo', 'bar', 'baz'],
                    'value': [5, 6, 7]})

compare = datacompy.Compare(
            df1,
            df2,
            join_columns=['key'],  #You can also specify a list of columns
            abs_tol=0,
            rel_tol=0,
            df1_name='df1',
            df2_name='df2')

compare.intersect_rows only returns the following rows:

key value_df1 value_df2 _merge value_match
foo 1 5 both False
bar 2 6 both False
baz 3 7 both False

While I am expecting an additional row:

key value_df1 value_df2 _merge value_match
foo 5 5 both True

In fact, when you use pandas merge on df1 and df2, you do see the last row in df1:

df1.merge(df2, how="outer", suffixes=("_df1", "_df2"), indicator=True, on='key')
key value_df1 value_df2 _merge
foo 1 5 both
foo 5 5 both
bar 2 6 both
baz 3 7 both

Why is the last row from df1 missing from the intersect_rows dataframe?

Regression in Pandas is causing tests to fail?

Seems like the unit tests are failing due to a potential regression in Pandas?
pandas-dev/pandas#21119

def test_decimal_with_joins():
    df1 = pd.DataFrame([{'a': Decimal('1'), 'b': 2}, {'a': Decimal('2'), 'b': 2}])
    df2 = pd.DataFrame([{'a': 1, 'b': 2}, {'a': 2, 'b': 2}])
    compare = datacompy.Compare(df1, df2, 'a')
    assert compare.matches()
    assert compare.all_columns_match()
    assert compare.all_rows_overlap()
    assert compare.intersect_rows_match()

The following test raises: ValueError: You are trying to merge on object and int64 columns. If you wish to proceed you should use pd.concat

Would it make sense to pin the Pandas version?
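
Until the versions are pinned, a sketch of a caller-side workaround, assuming it is acceptable to cast the Decimal column to float so both join keys share a numeric dtype:

from decimal import Decimal
import pandas as pd
import datacompy

df1 = pd.DataFrame([{'a': Decimal('1'), 'b': 2}, {'a': Decimal('2'), 'b': 2}])
df2 = pd.DataFrame([{'a': 1, 'b': 2}, {'a': 2, 'b': 2}])

# Cast the object-dtype Decimal column to float before joining on it
df1['a'] = df1['a'].astype(float)

compare = datacompy.Compare(df1, df2, 'a')
print(compare.matches())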

Align the APIs between Compare and SparkCompare

Not sure if it makes sense to go all the way to subclassing or ABCs, but the API calls between Compare and SparkCompare are quite different. I think they could be aligned somewhat before adding any new functionality.

NameError: name 'F' is not defined - getting this issue in sparkCompare

~\Anaconda3\lib\site-packages\datacompy\sparkcompare.py in (.0)
521 def helper(c):
522 # Create a predicate for each match type, comparing column values to the match type value
--> 523 predicates = [F.col(c) == k.value for k in MatchType]
524 # Create a tuple(number of match types found for each match type in this column)
525 return F.struct([F.lit(F.sum(pred.cast("integer"))) for pred in predicates]).alias(c)

NameError: name 'F' is not defined
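
For context, F is the conventional alias for pyspark's column functions, and sparkcompare.py imports it at module load (the import is visible in the import-time traceback further down this page). The one-liner below is that expected alias; a reasonable first check is whether this import succeeds in your own environment, since an unimportable pyspark is one plausible (assumed, not confirmed) cause of F ending up undefined.

from pyspark.sql import functions as F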

No. of rows shown in comparison is limited to 10.

'Sample Rows with Unequal Values' shows just 10 rows that differ.
I want to see what changed in all the rows across the 2 dataframes, not just 10 rows.

Is there a way I can see all the unequal rows?

Please advise.
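
A sketch of one way to pull every unequal row yourself, using intersect_rows and the per-column _match flags shown in the intersect_rows example earlier on this page; the key/value column names below are placeholders from that example. (Recent datacompy releases also expose sample-size parameters on report() and sample_mismatch(); check your installed version's signatures before relying on them.)

# Assuming `compare` is an existing datacompy.Compare instance.
# intersect_rows carries a <column>_match boolean for each compared column.
unequal = compare.intersect_rows[~compare.intersect_rows['value_match']]
print(unequal[['key', 'value_df1', 'value_df2']])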

CVE-2020-13091 (High) detected in pandas-0.24.2-cp27-cp27mu-manylinux1_x86_64.whl

CVE-2020-13091 - High Severity Vulnerability

Vulnerable Library - pandas-0.24.2-cp27-cp27mu-manylinux1_x86_64.whl

Powerful data structures for data analysis, time series, and statistics

Library home page: https://files.pythonhosted.org/packages/db/83/7d4008ffc2988066ff37f6a0bb6d7b60822367dcb36ba5e39aa7801fda54/pandas-0.24.2-cp27-cp27mu-manylinux1_x86_64.whl

Path to dependency file: datacompy/requirements.txt

Path to vulnerable library: datacompy/requirements.txt

Dependency Hierarchy:

  • pandas-0.24.2-cp27-cp27mu-manylinux1_x86_64.whl (Vulnerable Library)

Found in HEAD commit: d321c1b49544384bde8893327ea768552b3f175d

Vulnerability Details

** DISPUTED ** pandas through 1.0.3 can unserialize and execute commands from an untrusted file that is passed to the read_pickle() function, if __reduce__ makes an os.system call. NOTE: third parties dispute this issue because the read_pickle() function is documented as unsafe and it is the user's responsibility to use the function in a secure manner.

Publish Date: 2020-05-15

URL: CVE-2020-13091

CVSS 3 Score Details (9.8)

Base Score Metrics:

  • Exploitability Metrics:
    • Attack Vector: Network
    • Attack Complexity: Low
    • Privileges Required: None
    • User Interaction: None
    • Scope: Unchanged
  • Impact Metrics:
    • Confidentiality Impact: High
    • Integrity Impact: High
    • Availability Impact: High

For more information on CVSS3 Scores, click here.



`Compare._validate_dataframe` lowercases column names prior to checking for uniqueness

Prior to the check for uniqueness, _validate_dataframe does dataframe.columns = [col.lower() for col in dataframe.columns]. If columns have unique mixed-case labels, these could collide when lower-cased as in this example:


In [30]: df
Out[30]:
          OSs       Oss
0.0 -0.438774 -0.853128
0.5  0.919049 -0.795371

In [31]: x = datacompy.Compare(df, df, on_index=True)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-31-69a8e3399fdd> in <module>
----> 1 x = datacompy.Compare(df, df, on_index=True)

lib/python3.6/site-packages/datacompy/core.py in __init__(self, df1, df2, join_columns, on_index, abs_tol, rel_tol, df1_name, df2_name, ignore_spaces, ignore_case)
    103
    104         self._any_dupes = False
--> 105         self.df1 = df1
    106         self.df2 = df2
    107         self.df1_name = df1_name

lib/python3.6/site-packages/datacompy/core.py in df1(self, df1)
    121         """Check that it is a dataframe and has the join columns"""
    122         self._df1 = df1
--> 123         self._validate_dataframe("df1")
    124
    125     @property

lib/python3.6/site-packages/datacompy/core.py in _validate_dataframe(self, index)
    151
    152         if len(set(dataframe.columns)) < len(dataframe.columns):
--> 153             raise ValueError("{} must have unique column names".format(index))
    154
    155         if self.on_index:

ValueError: df1 must have unique column names
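
A minimal sketch of one way to make the failure mode clearer: check uniqueness on the original labels first, then check again after lowercasing so a case-only collision produces its own message. Hypothetical code, not the library's current implementation.

import pandas as pd

def validate_unique_columns(df: pd.DataFrame, name: str = "df1") -> None:
    """Raise a specific error for case-only column-name collisions."""
    if len(set(df.columns)) < len(df.columns):
        raise ValueError(f"{name} must have unique column names")
    lowered = [c.lower() if isinstance(c, str) else c for c in df.columns]
    if len(set(lowered)) < len(lowered):
        raise ValueError(f"{name} has column names that collide when lower-cased")

df = pd.DataFrame([[1, 2]], columns=["OSs", "Oss"])
validate_unique_columns(df)  # raises the case-collision error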

Return inner-joined data with comparisons

It would be useful to be able to analyze a dataset based on the comparison, for example:

Compare dataset A, B: approximately 2% do not match on my relative tolerance threshold for variable X. I would like to get a dataset that has the inner join of A, B and an indicator variable for (difference in X > tol) that I could use for analysis.

As it stands I have to cook this myself, and I wasn't able to get sample_mismatch() to return what I needed easily. A sketch of one approach is below.
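
A sketch of building such an indicator today from intersect_rows, which already holds the inner-joined values with _df1/_df2 suffixes (the column name x and the tolerance value are placeholders, not part of the library's API):

# Assuming `compare` is an existing datacompy.Compare instance and `x` is the
# column of interest.
tol = 0.01  # placeholder relative tolerance
joined = compare.intersect_rows.copy()
joined['x_out_of_tol'] = (
    (joined['x_df1'] - joined['x_df2']).abs() > tol * joined['x_df2'].abs()
)
print(joined['x_out_of_tol'].mean())  # share of rows outside tolerance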

Error in compare.report : Sample Values with Unequal Values

The output of the report, under the section 'Sample Values with Unequal Values', shows the same values for df1 and df2:

no site_id net_revenue (df1) net_revenue (df2)
46 TRVED01 63.234000 63.234000
49 TRVEGH01 236.445673 236.445673

while the actual values are:

no site_id net_revenue (Source) net_revenue (Target)
46 TRVED01 63.234 63.234000
49 TRVEGH01 236.4456732 236.445673

CVE-2020-11022 (Medium) detected in jquery-3.4.1.js

CVE-2020-11022 - Medium Severity Vulnerability

Vulnerable Library - jquery-3.4.1.js

JavaScript library for DOM operations

Library home page: https://cdnjs.cloudflare.com/ajax/libs/jquery/3.4.1/jquery.js

Path to vulnerable library: /datacompy/_static/jquery-3.4.1.js

Dependency Hierarchy:

  • jquery-3.4.1.js (Vulnerable Library)

Found in HEAD commit: dd34b569a2407940480bfd5f3df171741533b660

Vulnerability Details

In jQuery versions greater than or equal to 1.2 and before 3.5.0, passing HTML from untrusted sources - even after sanitizing it - to one of jQuery's DOM manipulation methods (i.e. .html(), .append(), and others) may execute untrusted code. This problem is patched in jQuery 3.5.0.

Publish Date: 2020-04-29

URL: CVE-2020-11022

CVSS 3 Score Details (6.1)

Base Score Metrics:

  • Exploitability Metrics:
    • Attack Vector: Network
    • Attack Complexity: Low
    • Privileges Required: None
    • User Interaction: Required
    • Scope: Changed
  • Impact Metrics:
    • Confidentiality Impact: Low
    • Integrity Impact: Low
    • Availability Impact: None

For more information on CVSS3 Scores, click here.

Suggested Fix

Type: Upgrade version

Origin: https://blog.jquery.com/2020/04/10/jquery-3-5-0-released/

Release Date: 2020-04-29

Fix Resolution: jQuery - 3.5.0



SparkCompare Throwing NameError.

Hi author, I am trying to use the SparkCompare functionality with the sample code that you have provided but it is throwing me this error:

Traceback (most recent call last):
  File ".\spark.py", line 69, in <module>
    comparison.report()
  File "C:\Users\\AppData\Roaming\Python\Python37\site-packages\datacompy\sparkcompare.py", line 862, in report
    self._print_row_matches_by_column(file)
  File "C:\Users\\AppData\Roaming\Python\Python37\site-packages\datacompy\sparkcompare.py", line 689, in _print_row_matches_by_column
    self._populate_columns_match_dict()
  File "C:\Users\\AppData\Roaming\Python\Python37\site-packages\datacompy\sparkcompare.py", line 529, in _populate_columns_match_dict
    match_data = match_dataframe.agg(*[helper(col) for col in self.columns_compared]).collect()[
  File "C:\Users\\AppData\Roaming\Python\Python37\site-packages\datacompy\sparkcompare.py", line 529, in <listcomp>
    match_data = match_dataframe.agg(*[helper(col) for col in self.columns_compared]).collect()[
  File "C:\Users\\AppData\Roaming\Python\Python37\site-packages\datacompy\sparkcompare.py", line 523, in helper
    predicates = [F.col(c) == k.value for k in MatchType]
  File "C:\Users\\AppData\Roaming\Python\Python37\site-packages\datacompy\sparkcompare.py", line 523, in <listcomp>
    predicates = [F.col(c) == k.value for k in MatchType]
NameError: name 'F' is not defined

Any idea why is this happening? Thanks! :)

Drop Matching Rows and Columns in Mismatch Output for Readability.

Currently, the csv that's being written out on a failed comparison includes all the columns, which makes it harder to read with a larger data-set.
This is more of a feature request, but the ability to only have columns that didn't match written to the mismatch csv would really help and speed up the QA process on our end; a workaround sketch follows below.
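
A sketch of trimming the output yourself in the meantime. It assumes the entries of compare.column_stats expose the column name and unequal count under 'column' and 'unequal_cnt' keys (those key names are an assumption here) and that the join columns should always be kept.

# Columns that had at least one unequal value (key names assumed)
bad_cols = [s['column'] for s in compare.column_stats if s['unequal_cnt'] > 0]

keep = compare.join_columns + [
    c for col in bad_cols for c in (col + '_df1', col + '_df2')
]
compare.intersect_rows[keep].to_csv('mismatches.csv', index=False)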

CVE-2020-14343 (High) detected in PyYAML-5.3.1.tar.gz

CVE-2020-14343 - High Severity Vulnerability

Vulnerable Library - PyYAML-5.3.1.tar.gz

YAML parser and emitter for Python

Library home page: https://files.pythonhosted.org/packages/64/c2/b80047c7ac2478f9501676c988a5411ed5572f35d1beff9cae07d321512c/PyYAML-5.3.1.tar.gz

Path to dependency file: datacompy/test-requirements.txt

Path to vulnerable library: datacompy/test-requirements.txt

Dependency Hierarchy:

  • pre_commit-1.21.0-py2.py3-none-any.whl (Root Library)
    • aspy.yaml-1.3.0-py2.py3-none-any.whl
      • PyYAML-5.3.1.tar.gz (Vulnerable Library)

Found in HEAD commit: 4f516c4de1453b3db34b7f14ea561dd05f7110c1

Found in base branch: develop

Vulnerability Details

A vulnerability was discovered in the PyYAML library in all versions, where it is susceptible to arbitrary code execution when it processes untrusted YAML files through the full_load method or with the FullLoader loader. .load() defaults to using FullLoader and FullLoader is still vulnerable to RCE when run on untrusted input. Applications that use the library to process untrusted input may be vulnerable to this flaw. An attacker could use this flaw to execute arbitrary code on the system by abusing the python/object/new constructor.
The fix for CVE-2020-1747 was not enough to fix this issue.

Publish Date: 2020-07-21

URL: CVE-2020-14343

CVSS 3 Score Details (9.8)

Base Score Metrics:

  • Exploitability Metrics:
    • Attack Vector: Network
    • Attack Complexity: Low
    • Privileges Required: None
    • User Interaction: None
    • Scope: Unchanged
  • Impact Metrics:
    • Confidentiality Impact: High
    • Integrity Impact: High
    • Availability Impact: High

For more information on CVSS3 Scores, click here.



Make SparkCompare.report return string

Rather than make it print to stdout (or a file)?

The Compare.report returns a string by default. I think it's better to make those methods behave consistently.

Make lowercasing of column names optional

If you are using camel case naming conventions and have several multi-word column names, then automatic lowercasing of names will make the final result less readable.

There is only a need to lowercase when you have case-insensitive duplicates. If you do not, there is no need to disambiguate. So why not add an option to turn this off? If there are duplicates, then simply throw - it already does this, anyhow.

Allow customization of comparison report

At present it does not seem the built-in comparison report accepts any parameters. For one, I would like to be able to set the row limit myself as far as how many differences are printed per field. By default this is 10 - I'd like to optionally change this.

Appreciate the use of smart defaults, but any hard-coded parameter - 10 or otherwise - I feel ought to be something that can be configured externally.
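
For what it's worth, newer datacompy releases appear to accept a sample_count argument on report(); a hedged sketch, to be checked against your installed version's signature:

# sample_count controls how many differing rows are shown per column
# (available in recent datacompy versions; verify before relying on it).
print(compare.report(sample_count=50))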

Use deterministic output order for column-based reporting

One section of the output report is called "Sample Rows with Unequal Values". It prints each column with differences and then 10 rows of differences. However, I've noticed that this output changes with multiple runs - so multiple calls of .report() will give you output in a different order. As an example, if you have two dataframes with the columns A, B, and C, all of which have differences, then one run may give you results in the order of B, C, A, the second in C, A, B, etc.

This isn't a huge issue - it's just cosmetic. But I do think the output would be cleaner if the ordering were deterministic. Moreover, it should probably match the ordering of the columns of the first dataframe, which is referred to as df1 in the code.

This section of the report is generated by looping through self.column_stats and generating sample rows of differences. self.column_stats is generated from the method _intersect_compare(), which loops through self.intersect_columns(). And finally, the issue is in self.intersect_columns(), which uses set - an unordered data structure - and does an intersection.

datacompy/datacompy/core.py, lines 212 to 214 (at commit 8d96429):

def intersect_columns(self):
    """Get columns that are shared between the two dataframes"""
    return set(self.df1.columns) & set(self.df2.columns)

I'm thinking the solution to this problem is to use an ordered set. There are packages available on PyPI for this. A sketch of the df1-ordered alternative is below.
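
A minimal sketch of the df1-ordered alternative mentioned above, using a plain list comprehension instead of an ordered-set package; illustrative, not the library's code:

def intersect_columns_ordered(df1, df2):
    """Shared columns, in the order they appear in df1 (deterministic)."""
    df2_cols = set(df2.columns)
    return [col for col in df1.columns if col in df2_cols]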

Inequality when comparing two empty numpy arrays

Comparison of two empty numpy arrays currently returns False, which results in showing diffs where there shouldn't be any.

This is due to the way numpy compares empty arrays.
Running bool(np.array([]) == np.array([])) returns False and throws this warning:

The truth value of an empty array is ambiguous. Returning False, but in future this will result in an error. Use `array.size > 0` to check that an array is not empty.

Reproduce this bug with:

df1 = pd.DataFrame({"some_col": [np.array([]) for _ in range(10)], "id": [i for i in range(10)]})
df2 = pd.DataFrame({"some_col": [np.array([]) for _ in range(10)], "id": [i for i in range(10)]})

pdcompare = datacompy.Compare(df1, df2, join_columns="id")
print(pdcompare.report())

output:

DataComPy Comparison
--------------------

DataFrame Summary
-----------------

  DataFrame  Columns  Rows
0       df1        2    10
1       df2        2    10

Column Summary
--------------

Number of columns in common: 2
Number of columns in df1 but not in df2: 0
Number of columns in df2 but not in df1: 0

Row Summary
-----------

Matched on: id
Any duplicates on match values: No
Absolute Tolerance: 0
Relative Tolerance: 0
Number of rows in common: 10
Number of rows in df1 but not in df2: 0
Number of rows in df2 but not in df1: 0

Number of rows with some compared columns unequal: 10
Number of rows with all compared columns equal: 0

Column Comparison
-----------------

Number of columns compared with some values unequal: 1
Number of columns compared with all values equal: 1
Total number of values which compare unequal: 10

Columns with Unequal Values or Types
------------------------------------

     Column df1 dtype df2 dtype  # Unequal  Max Diff  # Null Diff
0  some_col    object    object         10         0            0

Sample Rows with Unequal Values
-------------------------------

   id some_col (df1) some_col (df2)
9   9             []             []
0   0             []             []
3   3             []             []
7   7             []             []
5   5             []             []
1   1             []             []
4   4             []             []
2   2             []             []
8   8             []             []
6   6             []             []
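
A small sketch of the underlying numpy behaviour and a comparison that treats two empty arrays as equal; a possible approach, not the library's current logic:

import numpy as np

a = np.array([])
b = np.array([])

# bool(a == b) is what trips the comparison: on the numpy versions discussed
# above it warns about the ambiguous truth value of an empty array and
# evaluates to False. A shape-and-elements comparison behaves as expected:
print(np.array_equal(a, b))  # True - empty equals empty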

Use type hinting & support Python 3.5+

Since we target Python 3.5+ in our CI testing, we can make this explicit and commit to only supporting that version and higher. This will enable us to use type hinting in our code, which would have a few benefits:

  • IDE support for types while developing datacompy
  • IDE support for types for users using datacompy
  • Ability to do static type checking in our CI pipeline
  • Automatic type inference during documentation generation

This should probably best be done in concert with something like #13 as long as we're messing with the code, but since this will involve committing to supporting certain versions of Python, it should be a conscious choice. Python 3.4 reached EOL in March 2019 so I don't see a problem with explicitly targeting 3.5+.

Issues with Travis CI and pytest-spark

Seems like the latest pytest-spark has a bug or something. I've been tinkering around with it to get Spark up and running. Have a MVP repo here: https://github.com/fdosani/travis-pytest-spark

We should probably pin pytest-spark until this sorts itself out. Also in the repo is a .travis.yml showing how to set up the environment so a standalone spark instance can be used for unit testing. Hope this helps with the upcoming changes 👍

idea/question: Writing compare/output results in a structured format(sqlite, csv, json)

Having a structured format output can help in the cases below:

  • There is a large number of differences between the two dataframes
  • There is a need to filter, perform operations, generate dashboards on output/compare report
   row_num  key[symbol:quotedate]  close (Original)  close (New)  numdiff  text (Original)  text (New)  textdiff
0        4       [AAN:01-05-2016]              20.0        23.12     3.12
1        4       [AAN:01-05-2016]         1030099.0    1030097.0       -2
2        7       [AAP:01-05-2016]                                                        b           a      True
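
In the meantime, the comparison DataFrames are plain pandas objects, so they can already be persisted to CSV, JSON, or SQLite with pandas itself; a sketch (file and table names are placeholders):

import sqlite3

# Assuming `compare` is an existing datacompy.Compare instance.
mismatches = compare.intersect_rows  # joined rows with *_df1 / *_df2 / *_match columns

mismatches.to_csv('compare_output.csv', index=False)
mismatches.to_json('compare_output.json', orient='records')

with sqlite3.connect('compare_output.db') as conn:
    mismatches.to_sql('intersect_rows', conn, if_exists='replace', index=False)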

Comparison result is completely wrong

I use this tool to compare two parquets and surprisingly found the comparison result is completely wrong.
The two parquet files both have 40 fields with exactly the same names; the difference is that one parquet is 27MB and the other is 250MB.
Here is the summary:

Column Summary

Number of columns in common: 7
Number of columns in first_parquet but not in second_parquet: 33
Number of columns in second_parquet but not in first_parquet: 25

I understand it would help your troubleshooting if you have the actual parquet files, but unfortunately I can't give the files, a screenshot of the schemas side-by-side is here:
https://snag.gy/rnpmBX.jpg

Issue with setup.py and locale

Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/something/datacompy/setup.py", line 12, in <module>
        LONG_DESCRIPTION = file_open.read()
      File "/something/envs/py36/lib/python3.6/encodings/ascii.py", line 26, in decode
        return codecs.ascii_decode(input, self.errors)[0]
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 10137: ordinal not in range(128)

The issue is that file_open should be explicitly opened as utf-8. If someone's default environment isn't utf-8, this read fails.
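
A minimal sketch of that fix in setup.py, assuming the long description is read from a README file (the exact filename is not shown in the traceback):

# setup.py - read the long description with an explicit encoding so the
# build no longer depends on the user's locale (filename assumed here).
with open('README.md', encoding='utf-8') as file_open:
    LONG_DESCRIPTION = file_open.read()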

Make duplicate handling better

You can get at duplicate rows like:

comp.df1_unq_rows[comp.df1_unq_rows['acct_id'].isin(comp.intersect_rows['acct_id'])]

The compare report just says Any duplicates on match values: Yes. It could have some things like:

  • Count of duplicates (i.e. that weren't matched)
  • More information in docs about how duplicates are picked (explain the algorithm)
  • Shortcut on the class to get at the duplicates (like just shortcut the above?) Or maybe some way to point a discarded duplicate at the corresponding record it could have matched with?

Warning on some comparisons

Not sure why this is, but I'm getting a warning message from the heart of Pandas here:

import logging
import sys
import pandas as pd
import numpy as np
import datacompy

logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)

expected = pd.DataFrame([{'acct': 1, 'c': 2}, {'acct': 3, 'c': 4}, {'acct': np.nan}])
actual = expected.copy()

compare = datacompy.Compare(actual, expected, 'acct')
DEBUG:datacompy.core:Checking equality
INFO:datacompy.core:df1 Pandas.DataFrame.equals df2
INFO:datacompy.core:Number of columns in common: 2
DEBUG:datacompy.core:Checking column overlap
INFO:datacompy.core:Number of columns in df1 and not in df2: 0
INFO:datacompy.core:Number of columns in df2 and not in df1: 0
DEBUG:datacompy.core:Merging dataframes
DEBUG:datacompy.core:Outer joining
DEBUG:datacompy.core:Selecting df1 unique rows
DEBUG:datacompy.core:Selecting df2 unique rows
INFO:datacompy.core:Number of rows in df1 and not in df2: 0
INFO:datacompy.core:Number of rows in df2 and not in df1: 0
DEBUG:datacompy.core:Selecting intersecting rows
INFO:datacompy.core:Number of rows in df1 and df2 (not necessarily equal): 3
DEBUG:datacompy.core:Comparing intersection
INFO:datacompy.core:acct: 3 / 3 (100.00%) match
/Users/myname/miniconda3/envs/t2/lib/python3.6/site-packages/pandas/core/generic.py:6515: RuntimeWarning: invalid value encountered in absolute
  return np.abs(self)
INFO:datacompy.core:c: 3 / 3 (100.00%) match
INFO:datacompy.core:df1 matches df2

It doesn't seem to happen if a) I only have two rows (take out the second row), or b) if I step through the code myself (?), or c) if I have four rows UGH

Having trouble isolating the issue, so 🤷‍♂️

Would be useful to have a parameter to strip spaces for comparison

As probably expected, the following code will return a mismatch since 'B'<>'B ':

import pandas as pd
import datacompy

df1 = pd.DataFrame([
        {'id': 1234, 'column_value': 'A'},
        {'id': 2345, 'column_value': 'B'}])

df2 = pd.DataFrame([
        {'id': 1234, 'column_value': 'A'},
        {'id': 2345, 'column_value': 'B '}])

compare = datacompy.Compare(
    df1,
    df2,
    join_columns='id',
    abs_tol=0,
    rel_tol=0,
    )
compare.matches(ignore_extra_columns=False)
# False

# This method prints out a human-readable report summarizing and sampling differences
print(compare.report())

What I propose is an optional parameter to ignore differences where the only difference is leading or trailing spaces. In this example it is obvious that there is a trailing space. However, when we are dealing with extracts from different databases/source files, without real control over the ETL of these, sometimes we can't prevent these discrepancies. We may wish to ignore these types of mismatches to identify 'worse' mismatches more effectively.

Another candidate could be ignoring case sensitivity differences.
Of course these could both be easily handled by preprocessing the dataframes, but they would still be convenient enhancements!
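
Worth noting: the Compare constructor signature quoted in the tracebacks elsewhere on this page already lists ignore_spaces and ignore_case parameters, so on versions where those flags behave as their names suggest the request reduces to passing them. A sketch, reusing df1 and df2 from the snippet above:

compare = datacompy.Compare(
    df1,
    df2,
    join_columns='id',
    abs_tol=0,
    rel_tol=0,
    ignore_spaces=True,   # strip leading/trailing spaces in string columns
    ignore_case=False,    # set True to also ignore case differences
)
print(compare.matches(ignore_extra_columns=False))  # expected True once spaces are ignored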

tests are failing due to test_decimal_with_joins

Seems like the TravisCI tests are failing. I think this is due to the version of various packages not being pinned. As the packages change and evolve so will testing behaviour.

The following test is failing:

def test_decimal_with_joins():
    df1 = pd.DataFrame([{"a": Decimal("1"), "b": 2}, {"a": Decimal("2"), "b": 2}])
    df2 = pd.DataFrame([{"a": 1, "b": 2}, {"a": 2, "b": 2}])
    with raises(ValueError):
        compare = datacompy.Compare(df1, df2, "a")

I think in the past Pandas might have had different behaviour. Testing with the 0.19.0 version of Pandas and Python 3.5, it still seems to NOT throw any error.

Doesn't take join columns into account when doing strip() of whitespace

Join columns can have whitespace as well, so ideally the section below should be enhanced to check ignore_spaces and strip() string columns when it is true.
This is causing a complete mismatch of the 2 dataframes here in the case where one of the join columns has a space:

if on_index and join_columns is not None:
    raise Exception("Only provide on_index or join_columns")
elif on_index:
    self.on_index = True
    self.join_columns = []
elif isinstance(join_columns, str):
    self.join_columns = [join_columns.lower()]
    self.on_index = False
else:
    self.join_columns = [col.lower() for col in join_columns]
    self.on_index = False
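
Until that's handled inside the library, a sketch of a preprocessing workaround that strips string-typed join columns on both frames before constructing the Compare (the join column names are placeholders):

def strip_join_columns(df, join_columns):
    """Strip leading/trailing whitespace from string-typed join columns."""
    df = df.copy()
    for col in join_columns:
        if df[col].dtype == object:  # assumes object columns hold strings
            df[col] = df[col].str.strip()
    return df

join_cols = ['id']  # placeholder join column names
compare = datacompy.Compare(
    strip_join_columns(df1, join_cols),
    strip_join_columns(df2, join_cols),
    join_columns=join_cols,
)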

TypeError: an integer is required - during import

there is an error during import when I run
import datacompy

this error arises
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\user\AppData\Local\Programs\Python\Python38\lib\site-packages\datacompy\__init__.py", line 20, in <module>
    from datacompy.sparkcompare import SparkCompare, NUMERIC_SPARK_TYPES
  File "C:\Users\user\AppData\Local\Programs\Python\Python38\lib\site-packages\datacompy\sparkcompare.py", line 26, in <module>
    from pyspark.sql import functions as F
  File "C:\Users\user\AppData\Local\Programs\Python\Python38\lib\site-packages\pyspark\__init__.py", line 51, in <module>
    from pyspark.context import SparkContext
  File "C:\Users\user\AppData\Local\Programs\Python\Python38\lib\site-packages\pyspark\context.py", line 31, in <module>
    from pyspark import accumulators
  File "C:\Users\user\AppData\Local\Programs\Python\Python38\lib\site-packages\pyspark\accumulators.py", line 97, in <module>
    from pyspark.serializers import read_int, PickleSerializer
  File "C:\Users\user\AppData\Local\Programs\Python\Python38\lib\site-packages\pyspark\serializers.py", line 72, in <module>
    from pyspark import cloudpickle
  File "C:\Users\user\AppData\Local\Programs\Python\Python38\lib\site-packages\pyspark\cloudpickle.py", line 145, in <module>
    _cell_set_template_code = _make_cell_set_template_code()
  File "C:\Users\user\AppData\Local\Programs\Python\Python38\lib\site-packages\pyspark\cloudpickle.py", line 126, in _make_cell_set_template_code
    return types.CodeType(
TypeError: an integer is required (got type bytes)

I have another version of this library running on two different pc's but this error must be new

Idea: Output join_cols in sample rows output for the report

For the individual column comparisons, the join_cols are always outputted followed by the values in df1, df2.

However for the Sample Rows Only in df1/df2, the first 10 columns are outputted.
Suggestion would be to force the output to show the join columns in this as well, as sometimes (for whatever reason) the join columns are not in the first 10.

Having it present would make downstream investigations of mismatches easier, instead of having to use the pandas df index to locate the join columns (assuming your join columns are also typically the lookup columns for investigations).

Get docs hosted on capitalone.github.io

Hey @SanthiSridharan are you able to give me a hand with this? I've got the built docs in the /docs subfolder, but I think I need someone with full repo rights to tell GH where to serve pages from, and maybe some help getting the capitalone.github.io connection set up?
