data61 / blocklib

Python implementations of record linkage blocking techniques.

License: Apache License 2.0

Python 100.00%
privacy-preserving-record-linkage privacy-enhancing-technologies record-linkage

blocklib's Introduction

Blocklib

Python implementations of record linkage blocking techniques. Blocking is a technique that makes record linkage scalable. It partitions datasets into groups, called blocks, and only compares records in corresponding blocks. This can reduce the number of comparisons needed to find which pairs of records should be linked.

blocklib is part of the Anonlink project for privacy preserving record linkage.
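
The blocking idea described above can be illustrated with a minimal sketch. This is not blocklib's API: the function names, the sample records and the choice of blocking key (first letter of the surname) are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import product


def block_by_key(records, key_func):
    """Group record indices by a blocking key (hypothetical helper)."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[key_func(rec)].append(idx)
    return blocks


def candidate_pairs(blocks_a, blocks_b):
    """Yield index pairs only for records that share a block key."""
    for key in blocks_a.keys() & blocks_b.keys():
        yield from product(blocks_a[key], blocks_b[key])


def surname_initial(rec):
    """Illustrative blocking key: first letter of the second field."""
    return rec[1][0]


# Made-up sample data for two parties.
alice = [("Joyce", "Wang"), ("Brian", "Smith"), ("Wilko", "Jones")]
bob = [("Joyce", "Wong"), ("Gus", "Smith"), ("Ann", "Lee")]

pairs = sorted(candidate_pairs(block_by_key(alice, surname_initial),
                               block_by_key(bob, surname_initial)))
all_pairs = len(alice) * len(bob)
print(f"{len(pairs)} candidate pairs instead of {all_pairs}")
```

Only the blocks keyed "W" and "S" exist on both sides, so two pairs are compared rather than all nine.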

Installation

Install with pip:

pip install blocklib

Documents

You can find comprehensive documentation and tutorials on Read the Docs.

Tests

Run unit tests with pytest:

$ pytest

Discussion

If you run into bugs, you can file them in our issue tracker on GitHub.

There is also an anonlink mailing list for development discussion and release announcements.

Wherever we interact, we strive to follow the Python Community Code of Conduct.

License and Copyright

blocklib is copyright (c) Commonwealth Scientific and Industrial Research Organisation (CSIRO).

Licensed under the Apache License, Version 2.0 (the "License"). You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

blocklib's People

Contributors

dependabot-preview[bot], dependabot[bot], gusmith, hardbyte, joyceyuu, wilko77

blocklib's Issues

Ideas for extra signature strategies

Possible extra strategies and set of names for signature generation:

Existing strategies

  • ExactCharMatchSig: The letter at given index. (implemented as generate_by_char_at)
  • ExactMatchSig: the value of the whole field (implemented as generate_by_feature_value)
  • WordSoundSimilarSig: comparison of the sound of the word using metaphone. (implemented as generate_by_metaphone)

New strategies

  • FirstWordSig: the first word of the field
  • LastWordSig: the last word of the field
  • InitialLastWordSig: the first letter and last word of the field (e.g. for formatted names)
  • AnyWordSig: any of the words in field
  • WordNGramsSig: n-grams of words extracted from a text field
  • LetterNGramsSig: n-grams of letters extracted from a text field
  • LastNWordsSig: the last n words of the field
  • FirstNWordsSig: the first n words of the field
  • ArrayCombinationSig: all n-grams of an array field
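
A few of the proposed strategies could be sketched as plain functions. The function names and signatures below are hypothetical and are not part of blocklib; they only show what each proposed signature would extract from a text field.

```python
def first_word_sig(value):
    """FirstWordSig: the first word of the field."""
    words = value.split()
    return words[0] if words else ""


def last_word_sig(value):
    """LastWordSig: the last word of the field."""
    words = value.split()
    return words[-1] if words else ""


def initial_last_word_sig(value):
    """InitialLastWordSig: first letter of the field plus its last word."""
    words = value.split()
    return words[0][0] + words[-1] if words else ""


def letter_ngrams_sig(value, n=2):
    """LetterNGramsSig: all n-grams of letters in the field."""
    return [value[i:i + n] for i in range(len(value) - n + 1)]


def word_ngrams_sig(value, n=2):
    """WordNGramsSig: all n-grams of words in the field."""
    words = value.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


print(initial_last_word_sig("Joyce Mary Wang"))
print(word_ngrams_sig("10 Main Street"))
```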

Support column names in blocking schema

Currently the blocking schema definition only supports column indices. This is inconvenient if there are many columns. It would be better to also support column names.
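
One way to support both forms is to resolve names to indices up front. The sketch below is an assumption about how this could work, not existing blocklib code; resolve_feature and the header list are hypothetical.

```python
def resolve_feature(feature, header):
    """Return a column index for either an int index or a column name."""
    if isinstance(feature, int):
        return feature
    try:
        return header.index(feature)
    except ValueError:
        raise ValueError(f"Unknown column name: {feature!r}") from None


# Hypothetical header row of the input CSV.
header = ["firstname", "surname", "postcode"]
print(resolve_feature("surname", header))
print(resolve_feature(2, header))
```

Resolving once when the schema is loaded would let the rest of the code keep working with indices.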

Establish if Fuzzy is cross-platform

I'm not sure whether the tests are currently broken in general or only on my operating system.

self = <test_signature_generator.TestPSig testMethod=test_generate_signatures>

    def test_generate_signatures(self):
        """Test a multi-stragegy signatures."""
        dtuple = ('Joyce', 'Wang', 2134)
        signatures = [
            [
                {'type': 'feature-value', 'feature_idx': 0},
                {'type': 'feature-value', 'feature_idx': 1},
            ],
            [
                {'type': 'soundex', 'feature_idx': 0},
                {'type': 'soundex', 'feature_idx': 1},
            ]
        ]
        signatures = generate_signatures(signatures, dtuple)
>       assert signatures == {"J2W52", "JoyceWang"}
E       AssertionError: assert {'', 'JoyceWang'} == {'J2W52', 'JoyceWang'}
E         Extra items in the left set:
E         ''
E         Extra items in the right set:
E         'J2W52'
E         Use -v to get the full diff

If the tests are working for you, we may need to check Fuzzy's support for Windows/macOS/Linux.

Blocking Schema consistency

The next version of the blocking schema should be consistent in its use of under_scores, camelCase and hyphenated-keys.

Convert block key into string

Currently P-Sig generates block keys of type set, while Lambda-fold generates block keys of type string. For entity service deployment, we prefer string block keys. This issue is to convert the P-Sig block keys into strings.

Typecheck fails using mypy

On the feature-azure-pipelines branch, running mypy blocklib --ignore-missing-imports --no-implicit-optional --disallow-untyped-calls with Python 3.7 fails. The logs are:

blocklib/signature_generator.py:28: error: Incompatible types in assignment (expression has type "object", variable has type "str")
blocklib/signature_generator.py:33: error: Incompatible types in assignment (expression has type "int", variable has type "str")
blocklib/signature_generator.py:34: error: Incompatible types in assignment (expression has type "int", variable has type "str")
blocklib/signature_generator.py:36: error: Incompatible types in assignment (expression has type "object", variable has type "str")
blocklib/signature_generator.py:37: error: Incompatible types in assignment (expression has type "object", variable has type "str")
blocklib/signature_generator.py:38: error: Slice index must be an integer or None
blocklib/signature_generator.py:40: error: Incompatible types in assignment (expression has type "int", variable has type "str")
blocklib/signature_generator.py:41: error: Incompatible types in assignment (expression has type "object", variable has type "str")
blocklib/signature_generator.py:42: error: Slice index must be an integer or None
blocklib/signature_generator.py:44: error: Incompatible types in assignment (expression has type "int", variable has type "str")
blocklib/signature_generator.py:45: error: Incompatible types in assignment (expression has type "object", variable has type "str")
blocklib/signature_generator.py:46: error: Slice index must be an integer or None
blocklib/signature_generator.py:60: error: "int" has no attribute "__iter__"; maybe "__str__", "__int__", or "__invert__"? (not iterable)
blocklib/signature_generator.py:141: error: Cannot call function of unknown type
blocklib/encoding.py:46: error: Need type annotation for 'candidate_bloom_filter' (hint: "candidate_bloom_filter: Set[<type>] = ...")
blocklib/simmeasure.py:103: error: Need type annotation for 'q_gram_cache' (hint: "q_gram_cache: Dict[<type>, <type>] = ...")
blocklib/simmeasure.py:106: error: Need type annotation for 'sim_cache' (hint: "sim_cache: Dict[<type>, <type>] = ...")
blocklib/pprlindex.py:42: error: Need type annotation for 'rec_to_block' (hint: "rec_to_block: Dict[<type>, <type>] = ...")
blocklib/pprlindex.py:73: error: Need type annotation for 'ref_val_list' (hint: "ref_val_list: Set[<type>] = ...")
blocklib/pprlpsig.py:27: error: Call to untyped function "__init__" in typed context
blocklib/pprlpsig.py:35: error: Need type annotation for 'reversed_index' (hint: "reversed_index: Dict[<type>, <type>] = ...")
blocklib/pprllambdafold.py:25: error: Call to untyped function "__init__" in typed context
blocklib/pprllambdafold.py:63: error: Need type annotation for 'invert_index' (hint: "invert_index: Dict[<type>, <type>] = ...")
blocklib/pprllambdafold.py:65: error: Need type annotation for 'lambda_table'
blocklib/candidate_blocks_generator.py:49: error: Call to untyped function (unknown) in typed context
blocklib/blocks_generator.py:53: error: Need type annotation for 'map_rec_block'
blocklib/blocks_generator.py:72: error: Need type annotation for 'cbf' (hint: "cbf: Set[<type>] = ...")
blocklib/blocks_generator.py:76: error: "PPRLIndex" has no attribute "blocking_config"
blocklib/blocks_generator.py:83: error: Unsupported operand types for >= ("Set[Any]" and "int")
Found 29 errors in 8 files (checked 12 source files)

float division by zero issue

When I try to assess the blocking result using the metrics rr & pc, I get the error message "ZeroDivisionError: float division by zero". How can I fix it?
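
A zero denominator is the only way to hit this error, for instance when one dataset is empty or when there are no true matches to find. The sketch below is not blocklib's implementation: it uses the usual definitions of reduction ratio and pair completeness with hypothetical function names, and returns None instead of dividing by zero.

```python
def reduction_ratio(num_candidate_pairs, size_a, size_b):
    """1 - candidate pairs / all possible pairs, or None if undefined."""
    total_pairs = size_a * size_b
    if total_pairs == 0:
        return None  # at least one dataset is empty
    return 1.0 - num_candidate_pairs / total_pairs


def pair_completeness(candidate_pairs, true_matches):
    """Fraction of true matches kept by blocking, or None if undefined."""
    if not true_matches:
        return None  # no ground truth matches to measure against
    return len(candidate_pairs & true_matches) / len(true_matches)


print(reduction_ratio(2, 3, 3))
print(pair_completeness({(0, 0), (1, 1)}, {(0, 0), (2, 2)}))
```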

Add number of encodings in blocking metadata

Before similarity score computation and matching, we need to check that the number of encodings in the blocking data is consistent with the number of encodings in the CLK data.

Currently we either load the whole JSON just to get the number of encodings or use ijson to iteratively count the number of encodings.

It would be better to store the count in the metadata and just read this metadata when needing the count.
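
A minimal sketch of the idea, assuming the blocking output maps each block key to a list of record ids. The function names and the "num_encodings" key are hypothetical, not the actual blocklib output format.

```python
import json


def split_blocking_output(blocks):
    """Return (metadata JSON, blocks JSON), with the count in the metadata."""
    record_ids = {rec_id for ids in blocks.values() for rec_id in ids}
    meta = json.dumps({"num_encodings": len(record_ids)})
    body = json.dumps(blocks)
    return meta, body


def counts_match(meta, num_clks):
    """Check consistency by reading only the small metadata document."""
    return json.loads(meta)["num_encodings"] == num_clks


meta, body = split_blocking_output({"b1": [0, 1], "b2": [1, 2]})
print(counts_match(meta, 3))
```

Keeping the count in a small separate document means the check never has to parse the potentially large block data.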

unnecessary Bloom filter generation in lambda fold

Here:

for i in range(self.mylambda):
    lambda_table = defaultdict(list)  # type: Dict[Any, Any]
    # sample K indices from [0, bf-len]
    indices = rnd.choice(range(self.bf_len), self.K, replace=False)
    for rec_id, rec in zip(record_ids, data):
        bloom_filter = self.__record_to_bf__(rec)

you create the same Bloom filter for each record self.mylambda times, whereas once would be sufficient.

=> fixing the filter generation will make the code self.mylambda times faster. Yay!
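
The proposed fix amounts to encoding each record once before the loop over tables. The sketch below is a standalone illustration, not the blocklib code: record_to_bf is a toy stand-in for the real Bloom filter encoding, and build_lambda_tables and its parameters are hypothetical names.

```python
import random
import zlib
from collections import defaultdict


def record_to_bf(rec, bf_len):
    """Toy stand-in for the real encoding: one bit per field value."""
    bits = [0] * bf_len
    for value in rec:
        bits[zlib.crc32(str(value).encode()) % bf_len] = 1
    return bits


def build_lambda_tables(record_ids, data, bf_len, num_tables, k,
                        encode=record_to_bf, seed=0):
    rnd = random.Random(seed)
    # Encode every record exactly once, outside the per-table loop.
    bloom_filters = [encode(rec, bf_len) for rec in data]
    tables = []
    for _ in range(num_tables):
        table = defaultdict(list)
        # Sample k distinct bit positions for this table.
        indices = rnd.sample(range(bf_len), k)
        for rec_id, bf in zip(record_ids, bloom_filters):
            key = "".join(str(bf[i]) for i in indices)
            table[key].append(rec_id)
        tables.append(table)
    return tables


tables = build_lambda_tables([0, 1, 2],
                             [("a", "b"), ("c", "d"), ("a", "b")],
                             bf_len=16, num_tables=4, k=3)
print(len(tables))
```

With this structure the number of encoding calls equals the number of records, regardless of how many tables are built.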

'CandidateBlockingResult' object has no attribute 'print_summary_statistics'

In the following code:

from blocklib import generate_candidate_blocks
block_obj_alice = generate_candidate_blocks(alice['data'], blocking_config, header=alice['columns'])
block_obj_alice.print_summary_statistics()

By executing the following line:
block_obj_alice.print_summary_statistics()

I got the error

AttributeError: 'CandidateBlockingResult' object has no attribute 'print_summary_statistics'

Tutorial Issue: Missing Function "generate_blocks_2party"

It appears that a function used by the tutorial notebook in the /docs/tutorial folder is missing from the blocklib library.

When running the notebook, the program stalls on the from blocklib import generate_blocks_2party line. generate_blocks_2party is not present in the "__init__.py" file of blocklib.

The issue is replicated both when building the package from its git repo and when installing through pip.

An example of the error is shown in the image attached.

[Screenshot: screenshot_blocklib_error]

Throw exception when clks are fed to p-sig blocking

Currently blocklib throws the exception "All records are filtered out" when CLKs are fed to P-Sig blocking, which is misleading. We should instead throw an exception like "CLKs are fed to p-sig blocking, but it only accepts raw CSV".

Automate release with CI

Currently blocklib is only released to PyPI manually. It would be nice to enable automatic releases with CI so that we can track the release history on GitHub.

Python API for signature generation

Consider creating a Python API as an alternative to creating signatures with JSON. At the moment we conflate signature specifications and data a bit, e.g. generate_by_char_at(attr_ind: int, dtuple: Sequence, pos: List[Any]) requires the specification for the signature in attr_ind.

Possible example using names from #48:

signatures = [
  FirstWordSig('street') & InitialLastWordSig('name') | 128,
  ExactMatchSig("postcode") & WordSoundSimilarSig('name') | 128,
  ExactCharMatchSig("firstname", 0) & ExactCharMatchSig("surname", 0)
  ...
]

Key points for usability would be using the field name instead of the index, and being able to easily compose a signature spec. Here the binary operators __and__ and __or__ could be overloaded to chain signature specs together. Of course, this could be done in many different ways.

A builder pattern might be something like:

(
    Signature()
    .add(ExactMatchSig("postcode"))
    .add(WordSoundSimilarSig('name'))
    .filter(128)
)

(I don't like this way but want to put a few ideas up)
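
To make the operator idea concrete, here is a rough sketch of how __and__ and __or__ could be overloaded. None of these classes exist in blocklib; Sig, SigSpec and filter_value are hypothetical names, and the meaning of the 128 value is left exactly as unspecified as it is in the proposal.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Sig:
    """A single signature strategy applied to a named field."""
    kind: str
    field_name: str

    def __and__(self, other):
        # Sig & Sig starts a composite spec.
        return SigSpec((self,)) & other


@dataclass(frozen=True)
class SigSpec:
    """An ordered combination of strategies plus an optional filter value."""
    parts: Tuple[Sig, ...]
    filter_value: Optional[int] = None

    def __and__(self, other):
        extra = other.parts if isinstance(other, SigSpec) else (other,)
        return SigSpec(self.parts + extra, self.filter_value)

    def __or__(self, value):
        # "spec | 128" attaches the filter value to the composed spec.
        return SigSpec(self.parts, value)


def ExactMatchSig(field_name):
    return Sig("exact-match", field_name)


def WordSoundSimilarSig(field_name):
    return Sig("metaphone", field_name)


# & binds tighter than |, so this parses as (a & b) | 128.
spec = ExactMatchSig("postcode") & WordSoundSimilarSig("name") | 128
print(spec)
```

Because Python gives & higher precedence than |, the filter value applies to the whole composed signature without extra parentheses.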

Add tests

Parts of the code are still not covered by tests.

Two tests failing to be corrected

On the master and feature-azure-pipelines branches, the tests blocklib.signature_generator.generate_by_soundex and test_generate_signatures consistently fail when running pytest.

Remove min threshold in P-Sig

I don't think that filtering rare signatures is a good idea. I thought we get the privacy protection by inserting the signatures into a Bloom filter. By carefully restricting the size of the filter, you get a certain number of collisions, which reduces the effectiveness of frequency attacks.

feedback on filtering for P-Sig blocking

P-Sig filters out blocks that are too large. Thus, a signature definition which mainly creates large blocks will not contribute much to the usable blocks.
Currently there is no way of telling whether a signature is good or not.
The ideal signature covers the whole dataset, i.e. it puts every entity in a block which is not filtered out.

Thus, I propose to compute and output the coverage information of each signature (the percentage of entities that ended up in a block after filtering), so that an analyst gets feedback on the signatures.
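
The proposed coverage figure could be computed roughly as follows. This is a sketch under assumptions: the input shape (signature name mapped to block key mapped to record ids), the function name and the max_block_size parameter are all hypothetical, and only the upper size filter is modelled.

```python
def signature_coverage(blocks_by_signature, num_records, max_block_size):
    """Fraction of records left in at least one unfiltered block, per signature."""
    coverage = {}
    for sig_name, blocks in blocks_by_signature.items():
        covered = set()
        for rec_ids in blocks.values():
            # Blocks above the size limit are filtered out and cover nothing.
            if len(rec_ids) <= max_block_size:
                covered.update(rec_ids)
        coverage[sig_name] = len(covered) / num_records if num_records else 0.0
    return coverage


# Made-up example: five records, two signatures.
example = {
    "surname": {"Wang": [0, 1], "Smith": [2, 3, 4]},
    "postcode": {"2000": [0, 1, 2, 3, 4]},
}
print(signature_coverage(example, num_records=5, max_block_size=3))
```

Here the postcode signature produces a single oversized block, so its coverage is zero and an analyst would know to refine or drop it.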

Sentinel check for input type

For P-Sig, we only allow CSV input, but for Lambda-fold, we allow both CSV and CLK uploads. It would be better to check the input type before proceeding, to avoid confusing error messages.

Make Multi-party P-Sig

This issue is to support multi-party blocking for P-Sig. Note that we will allow subset matching, i.e. a record is included in the final blocking if it appears at least K times within M parties, where K <= M. To implement this, we will use a counting Bloom filter.
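
The K-out-of-M rule can be sketched with an exact counter standing in for the counting Bloom filter. This is a simplification made for illustration: a real counting Bloom filter is approximate and hashes the keys, and the function name and the assumption that parties contribute sets of block keys are hypothetical.

```python
from collections import Counter


def keys_in_at_least_k_parties(party_block_keys, k):
    """Return the block keys contributed by at least k of the parties."""
    counts = Counter()
    for keys in party_block_keys:
        # Count each key at most once per party.
        counts.update(set(keys))
    return {key for key, n in counts.items() if n >= k}


# Made-up block keys for M = 3 parties.
parties = [{"a", "b"}, {"b", "c"}, {"a", "b"}]
print(sorted(keys_in_at_least_k_parties(parties, k=2)))
```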
