Giter Site home page Giter Site logo

usc-isi-i2 / ppjoin Goto Github PK

View Code? Open in Web Editor NEW
6.0 9.0 0.0 176 KB

PPJoin and P4Join Python 3 implementation

License: MIT License

Python 100.00%
ppjoin p4join string-similarity join pper privacy-preserving-record-linkage jaccard jaccard-similarity recordlinkage deduplication

ppjoin's Introduction

PPJoin doi

PPJoin and P4Join Python 3 implementation.

PPJoin

PPJoin stands for Position Prefix Join which is an efficient set similarity join algorithm using the Jaccard similarity with several filtering techniques.

PPJoin is introduced in

Xiao, Chuan, et al. "Efficient similarity joins for near-duplicate detection." ACM Transactions on Database Systems (TODS) 36.3 (2011): 1-41.

This implementation is based on https://github.com/teh/ppjoin.

join function takes a list of datasets from different parties and a threshold t as input. Each dataset is a list of records and each record is formed by list of tokens.

ppjoin.join(datasets: List[List[List[str]]], t: float) -> Set[Tuple[Tuple]]

The return will be a set of tuples and each tuple contains two inner tuples:

((dataset1 index, record index), (dataset2 index, record index))

Example:

from ppjoin import ppjoin

def tokenizer(record):
    return set(ppjoin.whitespace_tokenizer(record.lower()))


ds0 = ['a b d', 'a b c', 'h k']
ds1 = ['a b k', 'a b', 'h k', 'a c h']
ds2 = ['a c h']
ds = [
    [tokenizer(w) for w in ds0],
    [tokenizer(w) for w in ds1],
    [tokenizer(w) for w in ds2]
]


result = ppjoin.join(ds, t=0.5)

for r in result:
    ds1_id, r1id = r[0]
    ds2_id, r2id = r[1]
    print('Found pair: {} from dataset {}, {} from dataset {}'.format(
        ds[ds1_id][r1id], ds1_id, ds[ds2_id][r2id], ds2_id
    ))

Output:

Found pair: ['a', 'b', 'c'] from dataset 0, ['a', 'b', 'k'] from dataset 1
Found pair: ['h', 'k'] from dataset 0, ['h', 'k'] from dataset 1
Found pair: ['a', 'b', 'c'] from dataset 0, ['a', 'c', 'h'] from dataset 2
Found pair: ['a', 'b', 'd'] from dataset 0, ['a', 'b', 'k'] from dataset 1
Found pair: ['a', 'b', 'd'] from dataset 0, ['a', 'b'] from dataset 1
Found pair: ['a', 'b', 'c'] from dataset 0, ['a', 'c', 'h'] from dataset 1
Found pair: ['a', 'c', 'h'] from dataset 1, ['a', 'c', 'h'] from dataset 2
Found pair: ['a', 'b', 'c'] from dataset 0, ['a', 'b'] from dataset 1

P4Join

P4Join (Privacy-Preserving Prefix Position Join) adapts PPJoin with bit operations to solve privacy-preserving record linkage problem. It supports length, prefix and an optimized position filter.

This is introduced in

Sehili, Ziad, et al. "Privacy preserving record linkage with PPJoin." Datenbanksysteme für Business, Technologie und Web (BTW 2015) (2015).

First step of P4Join is to encode original record into bit vector. encode_record takes a list of records each contains a list of tokens, a HMAC key, the length of the bit vector and k which indicates applying how many rounds of combined hash functions. The return of it is a list of encoded record vectors.

p4join.encode_record(record: List[List[str]], hmac_key: str, vec_len: int, k: int = 2) -> List[int]

P4Join's join function is similar to PPJoin's but takes encoded datasets as input. The return format is also identical to PPJoin.

p4join.join(datasets: List[List[int]], t: float = 0, vec_len: int = 0) -> Set[Tuple[Tuple]]

Example:

from ppjoin import ppjoin, p4join


def tokenizer(record):
    return set(ppjoin.whitespace_tokenizer(record.lower()))


hash_key = 'key'
vec_len = 40
k = 2

ds0 = ['a b d', 'a b c', 'h k']
ds1 = ['a b k', 'a b', 'h k', 'a c h']
ds2 = ['a c h']
ds = [
    [tokenizer(w) for w in ds0],
    [tokenizer(w) for w in ds1],
    [tokenizer(w) for w in ds2]
]

ds_encoded = [
    [p4join.encode_record(w, hash_key, vec_len, k) for w in d] for d in ds
]


result = p4join.join(ds_encoded, t=0.5, vec_len=vec_len)

for r in result:
    ds1_id, r1id = r[0]
    ds2_id, r2id = r[1]
    print('Found pair: {} from dataset {}, {} from dataset {}'.format(
        ds[ds1_id][r1id], ds1_id, ds[ds2_id][r2id], ds2_id
    ))

Installation

pip install -e .

Test

To run all unit tests:

python -m unittest discover ppjoin/tests

Tests on real world dataset Abt-Buy is from DBGroup of Leipzig.

ppjoin's People

Contributors

greatyyx avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.