Comments (2)
There are two different counts:
- The count of encodings as produced by clkhash.
- The count of encodings as referenced in the blocking data.
Anonlink-client writes the first count into the metadata. I think Joyce was talking about the second one here.
Those two counts can differ, depending on the choice of blocking algorithm. Whereas some algorithms map every entity to at least one block, P-Sig does not guarantee that, because of its filtering.
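To illustrate how the two counts can diverge, here is a toy sketch in plain Python. It does not use the blocklib API: `filter_large_blocks` and the `max_block_size` threshold are hypothetical stand-ins for P-Sig's filtering step, and the blocks are made-up data.

```python
def filter_large_blocks(blocks, max_block_size):
    """Drop every block with more than max_block_size members.

    Hypothetical stand-in for a P-Sig style filter, not blocklib code.
    """
    return {key: members for key, members in blocks.items()
            if len(members) <= max_block_size}


def referenced_entities(blocks):
    """Return the set of entity indices that appear in at least one block."""
    return {entity for members in blocks.values() for entity in members}


# Made-up example: 5 encodings (indices 0..4) and three candidate blocks.
num_encodings = 5
blocks = {
    "sig_a": [0, 1, 2, 3],  # oversized block, removed by the filter
    "sig_b": [0, 1],
    "sig_c": [2, 4],
}

kept = filter_large_blocks(blocks, max_block_size=2)
num_referenced = len(referenced_entities(kept))

# The first count is num_encodings, the second is num_referenced.
print(num_encodings, num_referenced)
```

In this example entity 3 only occurred in the oversized block, so after filtering it is no longer referenced anywhere and the second count is smaller than the first.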
For these probabilistic schemes we found a sanity check useful: count the entities that are part of at least one block. This gives us an understanding of how aggressively the algorithm filters big blocks.
There is code somewhere in blocklib that does just that. This count lets you compute the coverage of the blocking scheme, i.e. the percentage of entities referenced in at least one block. High coverage is a necessary condition for good linkage results: an entity that is not referenced in any block will never be matched.
As coverage is crucial for linkage success, it makes sense to expose this measure downstream.
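A minimal sketch of such a coverage measure, assuming the blocking data is available as a mapping from block key to a list of entity indices. The function name `blocking_coverage` and this input shape are assumptions for illustration, not the blocklib or anonlink-client API.

```python
def blocking_coverage(blocks, num_entities):
    """Fraction of entities that are referenced in at least one block.

    blocks: dict mapping a block key to a list of entity indices (assumed shape).
    num_entities: total number of encodings, e.g. as produced by clkhash.
    """
    if num_entities <= 0:
        raise ValueError("num_entities must be positive")
    covered = set()
    for members in blocks.values():
        covered.update(members)
    return len(covered) / num_entities


# Made-up data: 4 entities, entity 3 is not in any block.
example_blocks = {"k1": [0, 1], "k2": [1, 2]}
coverage = blocking_coverage(example_blocks, num_entities=4)
print(f"coverage: {coverage:.0%}")
```

Counting via a set means an entity that appears in several blocks is only counted once, which is what distinguishes this number from the sum of block sizes.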
from blocklib.
Anonlink-client already does this, so I'm not sure any functionality needs to be added to blocklib.
cc @wilko77