Giter Site home page Giter Site logo

d6tjoin's Introduction

Databolt Smart Join

Easily join different datasets without writing custom code. Does best match joins on strings, dates and numbers. For example you can quickly join similar but not identical stock tickers, addresses, names and dates without manual processing.

Installation

0.2.x is currently in beta. The github master is the latest dev version. The docs refer to <0.2.0

We recommend using the latest version from github pip install git+https://github.com/d6t/d6tjoin.git

If you cannot install from github, use the latest published version pip install d6tjoin. To update, run pip install d6tflow -U --no-deps

We recommend using AffineGap which is not an official requirement, you can install using pip install affinegap.

For the jellyfish library, make sure the C implementation is working else d6tjoin will be very slow. You can test by running import jellyfish.cjellyfish if the C version is installed. If you don't have a C compiler, you can conda install -c conda-forge jellyfish.

Sample Use


import d6tjoin.top1
import d6tjoin.utils
import d6tjoin

#************************
# pre join diagnostics
#************************

# check join quality => none of the ids match

d6tjoin.Prejoin([df1,df2],['id','date']).match_quality()

  key left key right  all matched  inner  left  right  outer  unmatched total  unmatched left  unmatched right
0       id        id        False      0    10     10     20               20              10               10
1     date      date         True    366   366    366    366                0               0                0
2  __all__   __all__        False      0  3660   3660   7320             7320            3660             3660

#************************
# best match join on id
#************************

result = d6tjoin.top1.MergeTop1(df1,df2,fuzzy_left_on=['id'],fuzzy_right_on=['id'],
    exact_left_on=['date'],exact_right_on=['date']).merge()

result['merged'].head(2)

        date        id   val1 id_right  val1_right   val2
0 2010-01-01  e3e70682  0.020   3e7068       0.020  0.034
1 2010-01-01  f728b4fa  0.806   728b4f       0.806  0.849

#************************
# debug best matches
#************************

result['top1']['id'].head(2)

         date __top1left__ __top1right__  __top1diff__ __matchtype__
10 2010-01-01     e3e70682        3e7068             2     top1 left
34 2010-01-01     e443df78        443df7             2     top1 left

#************************
# customize similarity fct
#************************
import affinegap

result = d6tjoin.top1.MergeTop1(df1,df2,fuzzy_left_on=['id'],fuzzy_right_on=['id'], 
    fun_diff=[affinegap.affineGapDistance]).merge()

#************************
# token-based substring clusters and joins
#************************
dftoken=d6tjoin.utils.splitcharTokenCount(df2['id'])

      word  count
0   Equity      7
1       US      5
2       NA      2
3  PRIVATE      2


d6tjoin.utils.unique_contains(df2['id'], dftoken['word'].values)
>>> [('Equity', ['AAPL_US_Equity', 'AAP_US_Equity', 'AD_NA_Equity', 'AMZN_US_Equity', 'APRN_US_Equity', 'BBY_US_Equity', 'BMW_NA_Equity']), ('US', ['AAPL_US_Equity', 'AAP_US_Equity', 'AMZN_US_Equity', 'APRN_US_Equity', 'BBY_US_Equity']), ('NA', ['AD_NA_Equity', 'BMW_NA_Equity']), ('PRIVATE', ['PRIVATE_JLP', 'PRIVATE_NMG'])]

import re
splitchars="[^a-zA-Z0-9]+"
def tokenmatch(s1,s2):
    return 3-len(set(re.split(splitchars,s1)) & set(re.split(splitchars,s2)))

d6tjoin.top1.MergeTop1(df1,df2,fuzzy_left_on=['id'],fuzzy_right_on=['id'], fun_diff=[tokenmatch], top_limit=[2]).merge()['top1']['id']

  __top1left__   __top1right__ __matchtype__  __top1diff__
0          AAP   AAP_US_Equity     top1 left             2
1         AAPL  AAPL_US_Equity     top1 left             2
2    AMZN-AMZN  AMZN_US_Equity     top1 left             2
3         APRN  APRN_US_Equity     top1 left             2
4          JLP     PRIVATE_JLP     top1 left             2
5          NMG     PRIVATE_NMG     top1 left             2

Features include

Enhances pd.merge() function with:

  • Pre join diagnostics to identify mismatched join keys
  • Best match joins that finds the top1 most similar value
    • Quickly join stock identifiers, addresses, names without manual processing
    • Ability to customize similarity functions, set max difference and other advanced features

Documentation

Pro version

Additional features:

  • Join >2 dataframes
  • Automatic Content-based similarity joins
  • Advanced join quality checks
  • Fast approximations for big data

Request demo

Faster Data Engineering

Check out other d6t libraries to solve common data engineering problems, including

  • data ingest, quickly ingest raw data
  • fuzzy joins, quickly join data
  • data pipes, quickly share and distribute data

https://github.com/d6t/d6t-python

And we encourage you to join the Databolt blog to get updates and tips+tricks http://blog.databolt.tech

d6tjoin's People

Contributors

d6tdev avatar haijingli94 avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar

d6tjoin's Issues

MergeTop1 doesn't work with timezone-aware dates

The following code runs just fine:

import d6tjoin.top1
import d6tjoin.utils
import pandas as pd
import datetime as dt
import pytz

df1 = pd.DataFrame({'date': pd.date_range(dt.datetime.today()
                                          #.astimezone(pytz.timezone('US/Pacific'))
                                          , periods=5)})
df2 = pd.DataFrame({'date': pd.date_range(dt.datetime.today()
                                          #.astimezone(pytz.timezone('US/Pacific'))
                                          , periods=6)})

result = d6tjoin.top1.MergeTop1(df1, df2, fuzzy_left_on=['date'], fuzzy_right_on=['date']).merge()['merged']

print(result)

yet, if I uncomment the two commented-out lines I get

ValueError: Unrecognized data type for top match, need to pass fun_diff in arguments

Would it be of interest to make this work automatically for timezone-aware objects? If so, I'd be happy to work on it

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.