

Pandas and Spark DataFrame comparison for humans and more!

Home Page: https://capitalone.github.io/datacompy/

License: Apache License 2.0

Topics: python, pandas, spark, data, data-science, compare, dataframes, numpy, pyspark, fugue


DataComPy


DataComPy is a package to compare two Pandas DataFrames. It originally started as something of a replacement for SAS's PROC COMPARE for Pandas DataFrames, with more functionality than just Pandas.DataFrame.equals(Pandas.DataFrame) (it prints out comparison statistics and lets you tweak how close matches have to be). It was then extended to carry that functionality over to Spark DataFrames.

Quick Installation

pip install datacompy

or

conda install datacompy

Installing extras

If you would like to use Spark or any other backends, please make sure you install via extras:

pip install datacompy[spark]
pip install datacompy[dask]
pip install datacompy[duckdb]
pip install datacompy[polars]
pip install datacompy[ray]

Legacy Spark Deprecation

Starting with version 0.12.0

The original SparkCompare implementation differs from all the other native implementations. To align the API and keep behaviour consistent, we are deprecating SparkCompare and moving it into a new module, LegacySparkCompare.

If you wish to keep using the old SparkCompare moving forward, you can:

from datacompy.legacy import LegacySparkCompare

Supported versions and dependencies

Different versions of Spark, Pandas, and Python interact differently. Below is a matrix of what we test with. With the move to the Pandas on Spark API and compatibility issues with Pandas 2+, we will for the time being not support Pandas 2 with the Pandas on Spark implementation. Spark plans to support Pandas 2 in Spark 4.

With version 0.12.0:

  • Pandas 2.0.0 is not supported for the native Spark implementation
  • Spark 3.1 support is dropped
  • Python 3.8 support is dropped
[Compatibility matrix: Spark 3.2.4 / 3.3.4 / 3.4.2 / 3.5.1 against Python 3.9 through 3.12, plus Pandas < 1.5.3 / >= 2.0.0 against Native Pandas, Native Spark, and Fugue. The per-cell support markers did not survive extraction; see https://capitalone.github.io/datacompy/ for the rendered matrix.]

Note

At the current time Python 3.12 is not supported by Spark, nor by Ray within Fugue.

Supported backends

  • Pandas: (See documentation)
  • Spark (Pandas on Spark API): (See documentation)
  • Polars (Experimental): (See documentation)
  • Fugue is a Python library that provides a unified interface for data processing on Pandas, DuckDB, Polars, Arrow, Spark, Dask, Ray, and many other backends. DataComPy integrates with Fugue to provide a simple way to compare data across these backends. Please note that Fugue will use the Pandas (Native) logic at its lowest level (See documentation)

Contributors

We welcome and appreciate your contributions! Before we can accept any contributions, we ask that you please be sure to sign the Contributor License Agreement (CLA).

This project adheres to the Open Source Code of Conduct. By participating, you are expected to honor this code.

Roadmap

Roadmap details can be found here

datacompy's People

Contributors

aguiddir, azadekhalaj, brainlogic, dan-coates, fdosani, gandhis1, gforsyth, github-actions[bot], gliptak, goodwanghan, hugovk, jborchma, jdawang, krishanbhasin, mark-thm, mend-bolt-for-github[bot], mend-for-github-com[bot], mnizol, nikhiljarora, pronoym99, ryanshiroma, santhisridharan, theianrobertson, tmbjmu, vaporjawn


datacompy's Issues

Issue with nulls in join columns when de-duping

There's an issue with the logic that tries to handle duplicate data. This is a minimal example:

import numpy as np
import pandas as pd
import datacompy

df1 = pd.DataFrame({
    'fld_1': [1,2,2,3,3,4,5,5],
    'fld_2': ['A',np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan]})

df2 = pd.DataFrame({
    'fld_1': [1,2,3,4,5],
    'fld_2': ['A',np.nan,np.nan,np.nan,np.nan]})

comp = datacompy.Compare(df1, df2, join_columns=['fld_1', 'fld_2'])
assert comp.subset()

df2 is a subset of df1, but it's not matching. The issue is in the de-dupe logic: the groupby(...).cumcount() just lumps any rows with nulls in them into one big category. That needs to be made more resilient to nulls; a sketch of a possible workaround follows below.
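
A minimal sketch of a caller-side workaround until the de-dupe logic is null-safe: fill the nullable join column with a sentinel value before comparing, so the cumcount ranks rows per real key. The sentinel value is an assumption and must not collide with real data.

import numpy as np
import pandas as pd
import datacompy

df1 = pd.DataFrame({
    'fld_1': [1, 2, 2, 3, 3, 4, 5, 5],
    'fld_2': ['A', np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan]})
df2 = pd.DataFrame({
    'fld_1': [1, 2, 3, 4, 5],
    'fld_2': ['A', np.nan, np.nan, np.nan, np.nan]})

SENTINEL = '__NULL__'  # assumed not to appear in the real data
comp = datacompy.Compare(
    df1.fillna({'fld_2': SENTINEL}),
    df2.fillna({'fld_2': SENTINEL}),
    join_columns=['fld_1', 'fld_2'])
print(comp.subset())  # should now report df2 as a subset of df1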

Rebuild docs automagically with Travis

Need to dig into this a bit more, but this would probably mean moving docs to the gh-pages branch, and rebuilding every time there's a commit to master.

`Compare._validate_dataframe` assumes columns have string labels

It appears that _validate_dataframe assumes that columns have string labels (it unconditionally calls .lower() on them), which raises AttributeError if columns instead have other types of labels (e.g. numerical):

In [18]: df = pd.DataFrame(np.random.randn(2), index=np.arange(0.0, 1.0, 0.5),
    ...:                             columns=[0.01])

In [19]: df
Out[19]:
         0.01
0.0 -0.283165
0.5 -0.388745

In [20]: compare = datacompy.Compare(df, df, on_index=True)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-20-d3440cd3839e> in <module>
----> 1 compare = datacompy.Compare(df, df, on_index=True)

lib/python3.6/site-packages/datacompy/core.py in __init__(self, df1, df2, join_columns, on_index, abs_tol, rel_tol, df1_name, df2_name, ignore_spaces, ignore_case)
    103
    104         self._any_dupes = False
--> 105         self.df1 = df1
    106         self.df2 = df2
    107         self.df1_name = df1_name

lib/python3.6/site-packages/datacompy/core.py in df1(self, df1)
    121         """Check that it is a dataframe and has the join columns"""
    122         self._df1 = df1
--> 123         self._validate_dataframe("df1")
    124
    125     @property

lib/python3.6/site-packages/datacompy/core.py in _validate_dataframe(self, index)
    145             raise TypeError("{} must be a pandas DataFrame".format(index))
    146
--> 147         dataframe.columns = [col.lower() for col in dataframe.columns]
    148         # Check if join_columns are present in the dataframe
    149         if not set(self.join_columns).issubset(set(dataframe.columns)):

lib/python3.6/site-packages/datacompy/core.py in <listcomp>(.0)
    145             raise TypeError("{} must be a pandas DataFrame".format(index))
    146
--> 147         dataframe.columns = [col.lower() for col in dataframe.columns]
    148         # Check if join_columns are present in the dataframe
    149         if not set(self.join_columns).issubset(set(dataframe.columns)):

AttributeError: 'float' object has no attribute 'lower'
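
A minimal sketch of how the lowercasing could be made robust to non-string labels; this is hypothetical code, not the library's current implementation.

import numpy as np
import pandas as pd

def lower_string_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Lowercase only string column labels, leaving numeric labels untouched."""
    df = df.copy()
    df.columns = [c.lower() if isinstance(c, str) else c for c in df.columns]
    return df

df = pd.DataFrame(np.random.randn(2, 1), columns=[0.01])
print(lower_string_columns(df).columns)  # no AttributeError for the float label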

Understanding Intersect_rows when df1 and df2 have one-to-many relationships

If I am trying to compare the intersect rows for the following two dataframes:

df1 = pd.DataFrame({'key': ['foo', 'bar', 'baz', 'foo'],
                   'value': [1, 2, 3, 5]              
                   })
df2 = pd.DataFrame({'key': ['foo', 'bar', 'baz'],
                    'value': [5, 6, 7]})

compare = datacompy.Compare(
            df1,
            df2,
            join_columns=['key'],  #You can also specify a list of columns
            abs_tol=0,
            rel_tol=0,
            df1_name='df1',
            df2_name='df2')

compare.intersect_rows only returns the following rows:

key value_df1 value_df2 _merge value_match
foo 1 5 both False
bar 2 6 both False
baz 3 7 both False

While I am expecting an additional row:

key value_df1 value_df2 _merge value_match
foo 5 5 both True

In fact, when you use pandas merge on df1 and df2, you do see the last row in df1:

df1.merge(df2, how="outer", suffixes=("_df1", "_df2"), indicator=True, on='key')
key value_df1 value_df2 _merge
foo 1 5 both
foo 5 5 both
bar 2 6 both
baz 3 7 both

Why is the last row from df1 missing from the intersect_rows dataframe?

Regression in Pandas is causing tests to fail?

Seems like the unit tests are failing due to a potential regression in Pandas?
pandas-dev/pandas#21119

def test_decimal_with_joins():
    df1 = pd.DataFrame([{'a': Decimal('1'), 'b': 2}, {'a': Decimal('2'), 'b': 2}])
    df2 = pd.DataFrame([{'a': 1, 'b': 2}, {'a': 2, 'b': 2}])
    compare = datacompy.Compare(df1, df2, 'a')
    assert compare.matches()
    assert compare.all_columns_match()
    assert compare.all_rows_overlap()
    assert compare.intersect_rows_match()

The following test raises: ValueError: You are trying to merge on object and int64 columns. If you wish to proceed you should use pd.concat

Would it make sense to pin the Pandas version?
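
Until the versions are pinned, a sketch of a caller-side workaround, assuming it is acceptable to cast the Decimal column to float so both join keys share a numeric dtype:

from decimal import Decimal
import pandas as pd
import datacompy

df1 = pd.DataFrame([{'a': Decimal('1'), 'b': 2}, {'a': Decimal('2'), 'b': 2}])
df2 = pd.DataFrame([{'a': 1, 'b': 2}, {'a': 2, 'b': 2}])

# Cast the object-dtype Decimal column to float before joining on it
df1['a'] = df1['a'].astype(float)

compare = datacompy.Compare(df1, df2, 'a')
print(compare.matches())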

Align the APIs between Compare and SparkCompare

Not sure if it makes sense to go all the way to subclassing or ABCs, but the API calls between Compare and SparkCompare are quite different. I think they could be aligned somewhat before adding any new functionality.

NameError: name 'F' is not defined - getting this issue in sparkCompare

~\Anaconda3\lib\site-packages\datacompy\sparkcompare.py in (.0)
521 def helper(c):
522 # Create a predicate for each match type, comparing column values to the match type value
--> 523 predicates = [F.col(c) == k.value for k in MatchType]
524 # Create a tuple(number of match types found for each match type in this column)
525 return F.struct([F.lit(F.sum(pred.cast("integer"))) for pred in predicates]).alias(c)

NameError: name 'F' is not defined
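
For context, F is the conventional alias for pyspark's column functions, and sparkcompare.py imports it at module load (the import is visible in the import-time traceback further down this page). The one-liner below is that expected alias; a reasonable first check is whether this import succeeds in your own environment, since an unimportable pyspark is one plausible (assumed, not confirmed) cause of F ending up undefined.

from pyspark.sql import functions as F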

No. of rows shown in comparison is limited to 10.

'Sample Rows with Unequal Values' shows just 10 rows that differ.
I want to see what changed in all the rows across the 2 dataframes, not just 10 rows.

Is there a way I can see all the unequal rows?

Please advise.
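
A sketch of one way to pull every unequal row yourself, using intersect_rows and the per-column _match flags shown in the intersect_rows example earlier on this page; the key/value column names below are placeholders from that example. (Recent datacompy releases also expose sample-size parameters on report() and sample_mismatch(); check your installed version's signatures before relying on them.)

# Assuming `compare` is an existing datacompy.Compare instance.
# intersect_rows carries a <column>_match boolean for each compared column.
unequal = compare.intersect_rows[~compare.intersect_rows['value_match']]
print(unequal[['key', 'value_df1', 'value_df2']])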

CVE-2020-13091 (High) detected in pandas-0.24.2-cp27-cp27mu-manylinux1_x86_64.whl

CVE-2020-13091 - High Severity Vulnerability

Vulnerable Library - pandas-0.24.2-cp27-cp27mu-manylinux1_x86_64.whl

Powerful data structures for data analysis, time series, and statistics

Library home page: https://files.pythonhosted.org/packages/db/83/7d4008ffc2988066ff37f6a0bb6d7b60822367dcb36ba5e39aa7801fda54/pandas-0.24.2-cp27-cp27mu-manylinux1_x86_64.whl

Path to dependency file: datacompy/requirements.txt

Path to vulnerable library: datacompy/requirements.txt

Dependency Hierarchy:

  • pandas-0.24.2-cp27-cp27mu-manylinux1_x86_64.whl (Vulnerable Library)

Found in HEAD commit: d321c1b49544384bde8893327ea768552b3f175d

Vulnerability Details

** DISPUTED ** pandas through 1.0.3 can unserialize and execute commands from an untrusted file that is passed to the read_pickle() function, if __reduce__ makes an os.system call. NOTE: third parties dispute this issue because the read_pickle() function is documented as unsafe and it is the user's responsibility to use the function in a secure manner.

Publish Date: 2020-05-15

URL: CVE-2020-13091

CVSS 3 Score Details (9.8)

Base Score Metrics:

  • Exploitability Metrics:
    • Attack Vector: Network
    • Attack Complexity: Low
    • Privileges Required: None
    • User Interaction: None
    • Scope: Unchanged
  • Impact Metrics:
    • Confidentiality Impact: High
    • Integrity Impact: High
    • Availability Impact: High

For more information on CVSS3 Scores, click here.



`Compare._validate_dataframe` lowercases column names prior to checking for uniqueness

Prior to the check for uniqueness, _validate_dataframe does dataframe.columns = [col.lower() for col in dataframe.columns]. If columns have unique mixed-case labels, these could collide when lower-cased as in this example:


In [30]: df
Out[30]:
          OSs       Oss
0.0 -0.438774 -0.853128
0.5  0.919049 -0.795371

In [31]: x = datacompy.Compare(df, df, on_index=True)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-31-69a8e3399fdd> in <module>
----> 1 x = datacompy.Compare(df, df, on_index=True)

lib/python3.6/site-packages/datacompy/core.py in __init__(self, df1, df2, join_columns, on_index, abs_tol, rel_tol, df1_name, df2_name, ignore_spaces, ignore_case)
    103
    104         self._any_dupes = False
--> 105         self.df1 = df1
    106         self.df2 = df2
    107         self.df1_name = df1_name

lib/python3.6/site-packages/datacompy/core.py in df1(self, df1)
    121         """Check that it is a dataframe and has the join columns"""
    122         self._df1 = df1
--> 123         self._validate_dataframe("df1")
    124
    125     @property

lib/python3.6/site-packages/datacompy/core.py in _validate_dataframe(self, index)
    151
    152         if len(set(dataframe.columns)) < len(dataframe.columns):
--> 153             raise ValueError("{} must have unique column names".format(index))
    154
    155         if self.on_index:

ValueError: df1 must have unique column names
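
A minimal sketch of one way to make the failure mode clearer: check uniqueness on the original labels first, then check again after lowercasing so a case-only collision produces its own message. Hypothetical code, not the library's current implementation.

import pandas as pd

def validate_unique_columns(df: pd.DataFrame, name: str = "df1") -> None:
    """Raise a specific error for case-only column-name collisions."""
    if len(set(df.columns)) < len(df.columns):
        raise ValueError(f"{name} must have unique column names")
    lowered = [c.lower() if isinstance(c, str) else c for c in df.columns]
    if len(set(lowered)) < len(lowered):
        raise ValueError(f"{name} has column names that collide when lower-cased")

df = pd.DataFrame([[1, 2]], columns=["OSs", "Oss"])
validate_unique_columns(df)  # raises the case-collision error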

Return inner-joined data with comparisons

It would be useful to be able to analyze a dataset based on the comparison, for example:

Compare dataset A, B: approximately 2% do not match on my relative tolerance threshold for variable X. I would like to get a dataset that has the inner join of A, B and an indicator variable for (difference in X > tol) that I could use for analysis.

As it stands I have to cook this myself, and I wasn't able to get sample_mismatch() to return what I needed easily. A sketch of one approach is below.
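
A sketch of building such an indicator today from intersect_rows, which already holds the inner-joined values with _df1/_df2 suffixes (the column name x and the tolerance value are placeholders, not part of the library's API):

# Assuming `compare` is an existing datacompy.Compare instance and `x` is the
# column of interest.
tol = 0.01  # placeholder relative tolerance
joined = compare.intersect_rows.copy()
joined['x_out_of_tol'] = (
    (joined['x_df1'] - joined['x_df2']).abs() > tol * joined['x_df2'].abs()
)
print(joined['x_out_of_tol'].mean())  # share of rows outside tolerance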

Error in compare.report : Sample Values with Unequal Values

The output of the report, under the section 'Sample Values with Unequal Values', shows the same values for df1 and df2:

no site_id net_revenue (df1) net_revenue (df2)
46 TRVED01 63.234000 63.234000
49 TRVEGH01 236.445673 236.445673

while the actual values are:

no site_id net_revenue (Source) net_revenue (Target)
46 TRVED01 63.234 63.234000
49 TRVEGH01 236.4456732 236.445673

CVE-2020-11022 (Medium) detected in jquery-3.4.1.js

CVE-2020-11022 - Medium Severity Vulnerability

Vulnerable Library - jquery-3.4.1.js

JavaScript library for DOM operations

Library home page: https://cdnjs.cloudflare.com/ajax/libs/jquery/3.4.1/jquery.js

Path to vulnerable library: /datacompy/_static/jquery-3.4.1.js

Dependency Hierarchy:

  • jquery-3.4.1.js (Vulnerable Library)

Found in HEAD commit: dd34b569a2407940480bfd5f3df171741533b660

Vulnerability Details

In jQuery versions greater than or equal to 1.2 and before 3.5.0, passing HTML from untrusted sources - even after sanitizing it - to one of jQuery's DOM manipulation methods (i.e. .html(), .append(), and others) may execute untrusted code. This problem is patched in jQuery 3.5.0.

Publish Date: 2020-04-29

URL: CVE-2020-11022

CVSS 3 Score Details (6.1)

Base Score Metrics:

  • Exploitability Metrics:
    • Attack Vector: Network
    • Attack Complexity: Low
    • Privileges Required: None
    • User Interaction: Required
    • Scope: Changed
  • Impact Metrics:
    • Confidentiality Impact: Low
    • Integrity Impact: Low
    • Availability Impact: None

For more information on CVSS3 Scores, click here.

Suggested Fix

Type: Upgrade version

Origin: https://blog.jquery.com/2020/04/10/jquery-3-5-0-released/

Release Date: 2020-04-29

Fix Resolution: jQuery - 3.5.0



SparkCompare Throwing NameError.

Hi author, I am trying to use the SparkCompare functionality with the sample code that you have provided but it is throwing me this error:

Traceback (most recent call last):
  File ".\spark.py", line 69, in <module>
    comparison.report()
  File "C:\Users\\AppData\Roaming\Python\Python37\site-packages\datacompy\sparkcompare.py", line 862, in report
    self._print_row_matches_by_column(file)
  File "C:\Users\\AppData\Roaming\Python\Python37\site-packages\datacompy\sparkcompare.py", line 689, in _print_row_matches_by_column
    self._populate_columns_match_dict()
  File "C:\Users\\AppData\Roaming\Python\Python37\site-packages\datacompy\sparkcompare.py", line 529, in _populate_columns_match_dict
    match_data = match_dataframe.agg(*[helper(col) for col in self.columns_compared]).collect()[
  File "C:\Users\\AppData\Roaming\Python\Python37\site-packages\datacompy\sparkcompare.py", line 529, in <listcomp>
    match_data = match_dataframe.agg(*[helper(col) for col in self.columns_compared]).collect()[
  File "C:\Users\\AppData\Roaming\Python\Python37\site-packages\datacompy\sparkcompare.py", line 523, in helper
    predicates = [F.col(c) == k.value for k in MatchType]
  File "C:\Users\\AppData\Roaming\Python\Python37\site-packages\datacompy\sparkcompare.py", line 523, in <listcomp>
    predicates = [F.col(c) == k.value for k in MatchType]
NameError: name 'F' is not defined

Any idea why is this happening? Thanks! :)

Drop Matching Rows and Columns in Mismatch Output for Readability.

Currently, the csv that's being written out on a failed comparison includes all the columns, which makes it harder to read with a larger data-set.
This is more of a feature request, but the ability to only have columns that didn't match written to the mismatch csv would really help and speed up the QA process on our end; a workaround sketch follows below.
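
A sketch of trimming the output yourself in the meantime. It assumes the entries of compare.column_stats expose the column name and unequal count under 'column' and 'unequal_cnt' keys (those key names are an assumption here) and that the join columns should always be kept.

# Columns that had at least one unequal value (key names assumed)
bad_cols = [s['column'] for s in compare.column_stats if s['unequal_cnt'] > 0]

keep = compare.join_columns + [
    c for col in bad_cols for c in (col + '_df1', col + '_df2')
]
compare.intersect_rows[keep].to_csv('mismatches.csv', index=False)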

CVE-2020-14343 (High) detected in PyYAML-5.3.1.tar.gz

CVE-2020-14343 - High Severity Vulnerability

Vulnerable Library - PyYAML-5.3.1.tar.gz

YAML parser and emitter for Python

Library home page: https://files.pythonhosted.org/packages/64/c2/b80047c7ac2478f9501676c988a5411ed5572f35d1beff9cae07d321512c/PyYAML-5.3.1.tar.gz

Path to dependency file: datacompy/test-requirements.txt

Path to vulnerable library: datacompy/test-requirements.txt

Dependency Hierarchy:

  • pre_commit-1.21.0-py2.py3-none-any.whl (Root Library)
    • aspy.yaml-1.3.0-py2.py3-none-any.whl
      • PyYAML-5.3.1.tar.gz (Vulnerable Library)

Found in HEAD commit: 4f516c4de1453b3db34b7f14ea561dd05f7110c1

Found in base branch: develop

Vulnerability Details

A vulnerability was discovered in the PyYAML library in all versions, where it is susceptible to arbitrary code execution when it processes untrusted YAML files through the full_load method or with the FullLoader loader. .load() defaults to using FullLoader and FullLoader is still vulnerable to RCE when run on untrusted input. Applications that use the library to process untrusted input may be vulnerable to this flaw. An attacker could use this flaw to execute arbitrary code on the system by abusing the python/object/new constructor.
The fix for CVE-2020-1747 was not enough to fix this issue.

Publish Date: 2020-07-21

URL: CVE-2020-14343

CVSS 3 Score Details (9.8)

Base Score Metrics:

  • Exploitability Metrics:
    • Attack Vector: Network
    • Attack Complexity: Low
    • Privileges Required: None
    • User Interaction: None
    • Scope: Unchanged
  • Impact Metrics:
    • Confidentiality Impact: High
    • Integrity Impact: High
    • Availability Impact: High

For more information on CVSS3 Scores, click here.



Make SparkCompare.report return string

Rather than make it print to stdout (or a file)?

The Compare.report returns a string by default. I think it's better to make those methods behave consistently.

Make lowercasing of column names optional

If you are using camel case naming conventions and have several multi-word column names, then automatic lowercasing of names will make the final result less readable.

There is only a need to lowercase when you have case-insensitive duplicates. If you do not, there is no need to disambiguate. So why not add an option to turn this off? If there are duplicates, then simply throw - it already does this, anyhow.

Allow customization of comparison report

At present it does not seem the built-in comparison report accepts any parameters. For one, I would like to be able to set the row limit myself as far as how many differences are printed per field. By default this is 10 - I'd like to optionally change this.

Appreciate the use of smart defaults, but any hard-coded parameter - 10 or otherwise - I feel ought to be something that can be configured externally.
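
For what it's worth, newer datacompy releases appear to accept a sample_count argument on report(); a hedged sketch, to be checked against your installed version's signature:

# sample_count controls how many differing rows are shown per column
# (available in recent datacompy versions; verify before relying on it).
print(compare.report(sample_count=50))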

Use deterministic output order for column-based reporting

One section of the output report is called "Sample Rows with Unequal Values". It prints each column with differences and then 10 rows of differences. However, I've noticed that this output changes with multiple runs - so multiple calls of .report() will give you output in a different order. As an example, if you have two dataframes with the columns A, B, and C, all of which have differences, then one run may give you results in the order of B, C, A, the second in C, A, B, etc.

This isn't a huge issue - it's just cosmetic. But I do think the output would be cleaner if the ordering were deterministic. Moreover, it should probably match the ordering of the columns of the first dataframe, which is referred to as df1 in the code.

This section of the report is generated by looping through self.column_stats and generating sample rows of differences. self.column_stats is generated from the method _intersect_compare(), which loops through self.intersect_columns(). And finally, the issue is in self.intersect_columns(), which uses set - an unordered data structure - and does an intersection.

datacompy/datacompy/core.py, lines 212 to 214 (at commit 8d96429):

def intersect_columns(self):
    """Get columns that are shared between the two dataframes"""
    return set(self.df1.columns) & set(self.df2.columns)

I'm thinking the solution to this problem is to use an ordered set. There are packages available on PyPI for this. A sketch of the df1-ordered alternative is below.
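
A minimal sketch of the df1-ordered alternative mentioned above, using a plain list comprehension instead of an ordered-set package; illustrative, not the library's code:

def intersect_columns_ordered(df1, df2):
    """Shared columns, in the order they appear in df1 (deterministic)."""
    df2_cols = set(df2.columns)
    return [col for col in df1.columns if col in df2_cols]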

Inequality when comparing two empty numpy arrays

Comparison of two empty numpy arrays currently returns False, which results in showing diffs where there shouldn't be any.

This is due to the way numpy compares empty arrays.
Running bool(np.array([]) == np.array([])) returns False and throws this warning:

The truth value of an empty array is ambiguous. Returning False, but in future this will result in an error. Use `array.size > 0` to check that an array is not empty.

Reproduce this bug with:

df1 = pd.DataFrame({"some_col": [np.array([]) for _ in range(10)], "id": [i for i in range(10)]})
df2 = pd.DataFrame({"some_col": [np.array([]) for _ in range(10)], "id": [i for i in range(10)]})

pdcompare = datacompy.Compare(df1, df2, join_columns="id")
print(pdcompare.report())

output:

DataComPy Comparison
--------------------

DataFrame Summary
-----------------

  DataFrame  Columns  Rows
0       df1        2    10
1       df2        2    10

Column Summary
--------------

Number of columns in common: 2
Number of columns in df1 but not in df2: 0
Number of columns in df2 but not in df1: 0

Row Summary
-----------

Matched on: id
Any duplicates on match values: No
Absolute Tolerance: 0
Relative Tolerance: 0
Number of rows in common: 10
Number of rows in df1 but not in df2: 0
Number of rows in df2 but not in df1: 0

Number of rows with some compared columns unequal: 10
Number of rows with all compared columns equal: 0

Column Comparison
-----------------

Number of columns compared with some values unequal: 1
Number of columns compared with all values equal: 1
Total number of values which compare unequal: 10

Columns with Unequal Values or Types
------------------------------------

     Column df1 dtype df2 dtype  # Unequal  Max Diff  # Null Diff
0  some_col    object    object         10         0            0

Sample Rows with Unequal Values
-------------------------------

   id some_col (df1) some_col (df2)
9   9             []             []
0   0             []             []
3   3             []             []
7   7             []             []
5   5             []             []
1   1             []             []
4   4             []             []
2   2             []             []
8   8             []             []
6   6             []             []
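
A small sketch of the underlying numpy behaviour and a comparison that treats two empty arrays as equal; a possible approach, not the library's current logic:

import numpy as np

a = np.array([])
b = np.array([])

# bool(a == b) is what trips the comparison: on the numpy versions discussed
# above it warns about the ambiguous truth value of an empty array and
# evaluates to False. A shape-and-elements comparison behaves as expected:
print(np.array_equal(a, b))  # True - empty equals empty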

Use type hinting & support Python 3.5+

Since we target Python 3.5+ in our CI testing, we can make this explicit and commit to only supporting that version and higher. This will enable us to use type hinting in our code, which would have a few benefits:

  • IDE support for types while developing datacompy
  • IDE support for types for users using datacompy
  • Ability to do static type checking in our CI pipeline
  • Automatic type inference during documentation generation

This should probably best be done in concert with something like #13 as long as we're messing with the code, but since this will involve committing to supporting certain versions of Python, it should be a conscious choice. Python 3.4 reached EOL in March 2019 so I don't see a problem with explicitly targeting 3.5+.

Issues with Travis CI and pytest-spark

Seems like the latest pytest-spark has a bug or something. I've been tinkering around with it to get Spark up and running. Have a MVP repo here: https://github.com/fdosani/travis-pytest-spark

We should probably pin pytest-spark until this sorts itself out. Also in the repo is a .travis.yml showing how to set up the environment so a standalone spark instance can be used for unit testing. Hope this helps with the upcoming changes 👍

idea/question: Writing compare/output results in a structured format(sqlite, csv, json)

Having a structured format output can help in the cases below:

  • There is a large number of differences between the two dataframes
  • There is a need to filter, perform operations, generate dashboards on output/compare report
   row_num  key[symbol:quotedate]  close (Original)  close (New)  numdiff  text (Original)  text (New)  textdiff
0        4       [AAN:01-05-2016]              20.0        23.12     3.12
1        4       [AAN:01-05-2016]         1030099.0    1030097.0       -2
2        7       [AAP:01-05-2016]                                                        b           a      True
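
In the meantime, the comparison DataFrames are plain pandas objects, so they can already be persisted to CSV, JSON, or SQLite with pandas itself; a sketch (file and table names are placeholders):

import sqlite3

# Assuming `compare` is an existing datacompy.Compare instance.
mismatches = compare.intersect_rows  # joined rows with *_df1 / *_df2 / *_match columns

mismatches.to_csv('compare_output.csv', index=False)
mismatches.to_json('compare_output.json', orient='records')

with sqlite3.connect('compare_output.db') as conn:
    mismatches.to_sql('intersect_rows', conn, if_exists='replace', index=False)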

Comparison result is completely wrong

I use this tool to compare two parquets and surprisingly found the comparison result is completely wrong.
The two parquet files both have 40 fields with exactly the same names; the difference is that one parquet is 27MB and the other is 250MB.
Here is the summary:

Column Summary

Number of columns in common: 7
Number of columns in first_parquet but not in second_parquet: 33
Number of columns in second_parquet but not in first_parquet: 25

I understand it would help your troubleshooting if you have the actual parquet files, but unfortunately I can't give the files, a screenshot of the schemas side-by-side is here:
https://snag.gy/rnpmBX.jpg

Issue with setup.py and locale

Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/something/datacompy/setup.py", line 12, in <module>
        LONG_DESCRIPTION = file_open.read()
      File "/something/envs/py36/lib/python3.6/encodings/ascii.py", line 26, in decode
        return codecs.ascii_decode(input, self.errors)[0]
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 10137: ordinal not in range(128)

The issue is that file_open should be explicitly opened as utf-8. If someone's default environment isn't utf-8, this read fails.
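
A minimal sketch of that fix in setup.py, assuming the long description is read from a README file (the exact filename is not shown in the traceback):

# setup.py - read the long description with an explicit encoding so the
# build no longer depends on the user's locale (filename assumed here).
with open('README.md', encoding='utf-8') as file_open:
    LONG_DESCRIPTION = file_open.read()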

Make duplicate handling better

You can get at duplicate rows like:

comp.df1_unq_rows[comp.df1_unq_rows['acct_id'].isin(comp.intersect_rows['acct_id'])]

The compare report just says Any duplicates on match values: Yes. It could have some things like:

  • Count of duplicates (i.e. that weren't matched)
  • More information in docs about how duplicates are picked (explain the algorithm)
  • Shortcut on the class to get at the duplicates (like just shortcut the above?) Or maybe some way to point a discarded duplicate at the corresponding record it could have matched with?

Warning on some comparisons

Not sure why this is, but I'm getting a warning message from the heart of Pandas here:

import logging
import sys
import pandas as pd
import numpy as np
import datacompy

logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)

expected = pd.DataFrame([{'acct': 1, 'c': 2}, {'acct': 3, 'c': 4}, {'acct': np.nan}])
actual = expected.copy()

compare = datacompy.Compare(actual, expected, 'acct')
DEBUG:datacompy.core:Checking equality
INFO:datacompy.core:df1 Pandas.DataFrame.equals df2
INFO:datacompy.core:Number of columns in common: 2
DEBUG:datacompy.core:Checking column overlap
INFO:datacompy.core:Number of columns in df1 and not in df2: 0
INFO:datacompy.core:Number of columns in df2 and not in df1: 0
DEBUG:datacompy.core:Merging dataframes
DEBUG:datacompy.core:Outer joining
DEBUG:datacompy.core:Selecting df1 unique rows
DEBUG:datacompy.core:Selecting df2 unique rows
INFO:datacompy.core:Number of rows in df1 and not in df2: 0
INFO:datacompy.core:Number of rows in df2 and not in df1: 0
DEBUG:datacompy.core:Selecting intersecting rows
INFO:datacompy.core:Number of rows in df1 and df2 (not necessarily equal): 3
DEBUG:datacompy.core:Comparing intersection
INFO:datacompy.core:acct: 3 / 3 (100.00%) match
/Users/myname/miniconda3/envs/t2/lib/python3.6/site-packages/pandas/core/generic.py:6515: RuntimeWarning: invalid value encountered in absolute
  return np.abs(self)
INFO:datacompy.core:c: 3 / 3 (100.00%) match
INFO:datacompy.core:df1 matches df2

It doesn't seem to happen if a) I only have two rows (take out the second row), or b) if I step through the code myself (?), or c) if I have four rows UGH

Having trouble isolating the issue, so 🤷‍♂️

Would be useful to have a parameter to strip spaces for comparison

As probably expected, the following code will return a mismatch since 'B'<>'B ':

import pandas as pd
import datacompy

df1 = pd.DataFrame([
        {'id': 1234, 'column_value': 'A'},
        {'id': 2345, 'column_value': 'B'}])

df2 = pd.DataFrame([
        {'id': 1234, 'column_value': 'A'},
        {'id': 2345, 'column_value': 'B '}])

compare = datacompy.Compare(
    df1,
    df2,
    join_columns='id',
    abs_tol=0,
    rel_tol=0,
    )
compare.matches(ignore_extra_columns=False)
# False

# This method prints out a human-readable report summarizing and sampling differences
print(compare.report())

What I propose is an optional parameter to ignore differences where the only difference is leading or trailing spaces. In this example it is obvious that there is a trailing space. However, when we are dealing with extracts from different databases/source files, without real control over the ETL of these, sometimes we can't prevent these discrepancies. We may wish to ignore these types of mismatches to identify 'worse' mismatches more effectively.

Another candidate could be ignoring case sensitivity differences.
Of course these could both be easily handled by preprocessing the dataframes, but they would still be convenient enhancements!
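
Worth noting: the Compare constructor signature quoted in the tracebacks elsewhere on this page already lists ignore_spaces and ignore_case parameters, so on versions where those flags behave as their names suggest the request reduces to passing them. A sketch, reusing df1 and df2 from the snippet above:

compare = datacompy.Compare(
    df1,
    df2,
    join_columns='id',
    abs_tol=0,
    rel_tol=0,
    ignore_spaces=True,   # strip leading/trailing spaces in string columns
    ignore_case=False,    # set True to also ignore case differences
)
print(compare.matches(ignore_extra_columns=False))  # expected True once spaces are ignored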

tests are failing due to test_decimal_with_joins

Seems like the TravisCI tests are failing. I think this is due to the version of various packages not being pinned. As the packages change and evolve so will testing behaviour.

The following test is failing:

def test_decimal_with_joins():
    df1 = pd.DataFrame([{"a": Decimal("1"), "b": 2}, {"a": Decimal("2"), "b": 2}])
    df2 = pd.DataFrame([{"a": 1, "b": 2}, {"a": 2, "b": 2}])
    with raises(ValueError):
        compare = datacompy.Compare(df1, df2, "a")

I think in the past Pandas might have had different behaviour. Testing with the 0.19.0 version of Pandas and Python 3.5, it still seems to NOT throw any error.

Doesn't take join columns into account when doing strip() of whitespace

Join columns can have whitespace as well, so ideally the section below should be enhanced to check ignore_spaces and strip() string columns when it is true.
This is causing a complete mismatch of the 2 dataframes here in the case where one of the join columns has a space:

if on_index and join_columns is not None:
    raise Exception("Only provide on_index or join_columns")
elif on_index:
    self.on_index = True
    self.join_columns = []
elif isinstance(join_columns, str):
    self.join_columns = [join_columns.lower()]
    self.on_index = False
else:
    self.join_columns = [col.lower() for col in join_columns]
    self.on_index = False
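
Until that's handled inside the library, a sketch of a preprocessing workaround that strips string-typed join columns on both frames before constructing the Compare (the join column names are placeholders):

def strip_join_columns(df, join_columns):
    """Strip leading/trailing whitespace from string-typed join columns."""
    df = df.copy()
    for col in join_columns:
        if df[col].dtype == object:  # assumes object columns hold strings
            df[col] = df[col].str.strip()
    return df

join_cols = ['id']  # placeholder join column names
compare = datacompy.Compare(
    strip_join_columns(df1, join_cols),
    strip_join_columns(df2, join_cols),
    join_columns=join_cols,
)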

TypeError: an integer is required - during import

there is an error during import when I run
import datacompy

this error arises
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\user\AppData\Local\Programs\Python\Python38\lib\site-packages\datacompy\__init__.py", line 20, in <module>
    from datacompy.sparkcompare import SparkCompare, NUMERIC_SPARK_TYPES
  File "C:\Users\user\AppData\Local\Programs\Python\Python38\lib\site-packages\datacompy\sparkcompare.py", line 26, in <module>
    from pyspark.sql import functions as F
  File "C:\Users\user\AppData\Local\Programs\Python\Python38\lib\site-packages\pyspark\__init__.py", line 51, in <module>
    from pyspark.context import SparkContext
  File "C:\Users\user\AppData\Local\Programs\Python\Python38\lib\site-packages\pyspark\context.py", line 31, in <module>
    from pyspark import accumulators
  File "C:\Users\user\AppData\Local\Programs\Python\Python38\lib\site-packages\pyspark\accumulators.py", line 97, in <module>
    from pyspark.serializers import read_int, PickleSerializer
  File "C:\Users\user\AppData\Local\Programs\Python\Python38\lib\site-packages\pyspark\serializers.py", line 72, in <module>
    from pyspark import cloudpickle
  File "C:\Users\user\AppData\Local\Programs\Python\Python38\lib\site-packages\pyspark\cloudpickle.py", line 145, in <module>
    _cell_set_template_code = _make_cell_set_template_code()
  File "C:\Users\user\AppData\Local\Programs\Python\Python38\lib\site-packages\pyspark\cloudpickle.py", line 126, in _make_cell_set_template_code
    return types.CodeType(
TypeError: an integer is required (got type bytes)

I have another version of this library running on two different pc's but this error must be new

Idea: Output join_cols in sample rows output for the report

For the individual column comparisons, the join_cols are always outputted followed by the values in df1, df2.

However for the Sample Rows Only in df1/df2, the first 10 columns are outputted.
Suggestion would be to force the output to show the join columns in this as well, as sometimes (for whatever reason) the join columns are not in the first 10.

Having it present would make downstream investigations of mismatches easier, instead of having to use the pandas df index to locate the join columns (assuming your join columns are also typically the lookup columns for investigations).

Get docs hosted on capitalone.github.io

Hey @SanthiSridharan are you able to give me a hand with this? I've got the built docs in the /docs subfolder, but I think I need someone with full repo rights to tell GH where to serve pages from, and maybe some help getting the capitalone.github.io connection set up?
