data61 / anonlink-entity-service
Privacy Preserving Record Linkage Service
License: Apache License 2.0
I'm concerned about the following line in compute_filter_similarity() in async_worker.py:

chunk_results = anonlink.entitymatch.calculate_filter_similarity(
    chunk_dp1, chunk_dp2, threshold=threshold, k=5, use_python=False)

Why is k arbitrarily set to 5? Is there a better value for k, and why?
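One option is to make k deployment-configurable rather than hard-coded. A minimal sketch reusing the call above; the SIMILARITY_K environment variable is a hypothetical name:

import os

import anonlink.entitymatch

def compute_chunk(chunk_dp1, chunk_dp2, threshold):
    # Let deployers tune k; default to the current value of 5.
    k = int(os.environ.get("SIMILARITY_K", "5"))
    return anonlink.entitymatch.calculate_filter_similarity(
        chunk_dp1, chunk_dp2, threshold=threshold, k=k, use_python=False)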
From engineering created by hardbyte : n1analytics/engineering#215
Issue by tho802
Thursday Jul 28, 2016 at 21:40 GMT
Originally opened as https://github.csiro.au/magic/n1-compute/issues/215
Running the 1M x 1M job on the entity service, I got a failure on the db:
FATAL: the database system is in recovery mode
FATAL: the database system is in recovery mode
LOG: database system was not properly shut down; automatic recovery in progress
LOG: invalid record length at 0/727F6AC0
Which causes connection errors in the worker process:
08:11:03 INFO Received task: async_worker.compute_filter_similarity[77e6485d-4673-474b-8be2-f08125fa4568]
08:11:03 WARNING warning connecting to default postgres db
08:11:03 WARNING warning connecting to default postgres db
08:11:03 WARNING Can't connect to database
08:11:03 ERROR Task async_worker.compute_filter_similarity[8f104a8c-4798-4e7b-8abd-6a74c0eccc07] raised unexpected: ConnectionError('Issue connecting to database',)
Traceback (most recent call last):
File "/usr/local/lib/python3.5/site-packages/celery/app/trace.py", line 240, in trace_task
R = retval = fun(*args, **kwargs)
File "/usr/local/lib/python3.5/site-packages/celery/app/trace.py", line 438, in __protected_call__
return self.run(*args, **kwargs)
File "/var/www/async_worker.py", line 361, in compute_filter_similarity
db = connect_db()
File "/var/www/database.py", line 39, in connect_db
raise ConnectionError("Issue connecting to database")
ConnectionError: Issue connecting to database
08:11:03 WARNING Can't connect to database
Need to deal with that somehow: by changing the task rate, re-running the failed tasks, or marking the mapping as "failed".
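For the re-running option, Celery's own retry machinery could absorb the transient ConnectionError. A sketch only, assuming Celery 4's autoretry_for; the broker URL and task body are placeholders:

from celery import Celery

app = Celery("tasks", broker="redis://redis:6379/0")  # placeholder broker URL

def connect_db():
    # Stand-in for database.connect_db(), which raises ConnectionError
    # while postgres is in recovery.
    raise ConnectionError("Issue connecting to database")

@app.task(bind=True, autoretry_for=(ConnectionError,),
          retry_backoff=True, max_retries=5)
def compute_filter_similarity(self, chunk_info):
    # Retried with exponential backoff instead of failing the whole mapping.
    db = connect_db()
    return db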
Ingress resources can specify URL rewriting and paths as well as domains/virtual host names.
It would be nice to have multiple versions of the entity service available behind the same domain so end users can test or pin to particular versions - we have even allowed for this in the path: https://es.data61.xyz/api/version
es.data61.xyz/api/v1 -> point to latest stable 1.x service
es.data61.xyz/api/v1.1 -> point to a specific version service
es.data61.xyz/api/v1.2 -> point to a specific version service
Lots of options: Elasticsearch, PostgreSQL, AWS CloudWatch...
Would be nice to support the same as N1-Engine, which suggests logging to PostgreSQL.
The image representing the deployment is saved in a png file which cannot be easily updated in case of typo (for example, the container is named traefik, not traefic).
And having red underlines for words which are not recognised is not really nice...
More generally, it would be nice to keep the source files which created our images for easy updates if necessary.
For long term maintainability it would be good to refactor the database code to use SQLAlchemy.
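For illustration, a minimal SQLAlchemy sketch (assuming SQLAlchemy 1.4+); the model and connection string are hypothetical, not the service's actual schema:

from sqlalchemy import Column, DateTime, Float, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class Mapping(Base):
    # Hypothetical model standing in for the raw-SQL mappings table.
    __tablename__ = "mappings"
    resource_id = Column(String, primary_key=True)
    threshold = Column(Float)
    time_added = Column(DateTime)

engine = create_engine("postgresql://user:pass@db/postgres")  # placeholder DSN
Session = sessionmaker(bind=engine)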
Thinking something similar to how the redis helm chart allows loading a "sidecar" prometheus exporter of metrics. This optional component would export regular updates into our monitoring solution of choice, e.g. PostgreSQL or Elasticsearch.
https://github.com/kubernetes/charts/tree/master/stable/redis#configuration
Create sphinx docs for the entity service.
Need to look at best practice.
Probably implement via a shared BaseTask.
Check that any hashing done as part of testing uses clkhash instead of anonlink.
Note some testing code uses the fake PII generators that were in anonlink.
Seeing these logs from redis on a system with memory constraints:
redis_1 | 1:M 08 Nov 06:22:22.105 # Client id=1048 addr=172.19.0.5:51698 fd=31 name= age=9 idle=0 flags=N db=0 sub=616549 psub=0 multi=-1 qbuf=16305 qbuf-free=16569 obl=0 oll=3307 omem=81319152 events=rw cmd=subscribe scheduled to be closed ASAP for overcoming of output buffer limits.
We should determine and document minimum reasonable requirements.
Test that we can deploy the service to a kubernetes cluster.
Aha! Link: https://csiro.aha.io/features/ANONLINK-14
I see in nginx conf that:
# Disable buffering of client data so we can handle larger uploads
proxy_request_buffering off;
This might make sense when receiving large uploads, so the backend can choose what to do, but I'm not sure it is working/handled well in general.
In fact, I cannot POST a JSON body to the mapping endpoint when it is sent as a single chunk.
To reproduce the issue, create a file containing the following request:
{
  "paillier_context": {
    "public_key": {
      "n": "AKkOPnV97gEWxlWxE2VzSolyEI-5x0TFf_kQaBa7ykuFo6gy8Mi6VVbEHPmNCYcCXBWhMPiGrkCID2lOYr_PKbx8npyblbRRXyPFlx9h1XbUugTUIoHE_jJiz2mVd7tJwoX8odCGPnEioxb0fZpNI8yNvfAjMTx7MnLw6uGvhkI_U-JbYKg-QJV-SGjeWz5nz6dHz7G1d9yKLAcwMFrW-3-ZkkwNb8SbYE7dJCElEiddAPUoBOyoFB-hy4JMYO3Avj3XD6kOkIBlyge8TpvkMHPjCoFRd7Qszi70xSebgtMrEWdYdd-4Ama306q4NG6y2KLsBH4f_mdIJRKzqhNext8",
      "key_ops": ["encrypt"],
      "kty": "DAJ",
      "kid": "Paillier public key for entity matching service",
      "alg": "PAI-GN1"
    },
    "s": true,
    "p": 2048,
    "base": 2,
    "encoded": true
  },
  "public_key": {
    "n": "AKkOPnV97gEWxlWxE2VzSolyEI-5x0TFf_kQaBa7ykuFo6gy8Mi6VVbEHPmNCYcCXBWhMPiGrkCID2lOYr_PKbx8npyblbRRXyPFlx9h1XbUugTUIoHE_jJiz2mVd7tJwoX8odCGPnEioxb0fZpNI8yNvfAjMTx7MnLw6uGvhkI_U-JbYKg-QJV-SGjeWz5nz6dHz7G1d9yKLAcwMFrW-3-ZkkwNb8SbYE7dJCElEiddAPUoBOyoFB-hy4JMYO3Avj3XD6kOkIBlyge8TpvkMHPjCoFRd7Qszi70xSebgtMrEWdYdd-4Ama306q4NG6y2KLsBH4f_mdIJRKzqhNext8",
    "key_ops": ["encrypt"],
    "kty": "DAJ",
    "kid": "Paillier public key for entity matching service",
    "alg": "PAI-GN1"
  },
  "schema": [
    {"identifier": "INDEX", "weight": 0, "notes": "", "unigram": false, "toRemove": ""},
    {"identifier": "NAME first last", "weight": 1, "notes": "", "unigram": false, "toRemove": ""},
    {"identifier": "DOB YYYY/MM/DD", "weight": 1, "notes": "", "unigram": false, "toRemove": "/"},
    {"identifier": "GENDER M or F", "weight": 1, "notes": "", "unigram": true, "toRemove": ""}
  ],
  "result_type": "permutation_unencrypted_mask"
}
Start the entity service.
The following command works (returns a 200 response):
curl -v -X POST --header "Content-Type: application/json" -d @erquestFile http://0.0.0.0:8851/api/v1/mappings
However, the following returns a 400 status:
curl -v -X POST --header "Transfer-Encoding: chunked" --header "Content-Type: application/json" -d @erquestFile http://0.0.0.0:8851/api/v1/mappings
The received message is
{
"message": "Failed to decode JSON object: Expecting value: line 1 column 1 (char 0)"
}
Either within the deployed app, or at least with the standalone docs #3, we should serve the OpenAPI spec that was written.
A Flask plugin looks like one way. I've tried sphinx-swaggerdoc but found too many issues... I opened a PR but I've already decided it is beyond help.
Currently an unauthed user can see a mapping's status.
E.g. GET https://es.data61.xyz/api/v1/mappings/77bc11914e957d00c82d32cae965a040e3514a2fd66ef0c8/status
{
"ready": true,
"time_completed": "2017-08-02T09:44:18.211053",
"time_started": "2017-08-02T09:43:05.998527",
"time_added": "2017-08-02T09:42:59.726863",
"threshold": 0.95
}
Should we also expose the current progress and the size of the matching job?
From engineering created by hardbyte : n1analytics/engineering#398
Issue by smi9c4
Monday Jan 30, 2017 at 04:35 GMT
Originally opened as https://github.csiro.au/magic/n1-compute/issues/398
For the branch feature-es-database-refactor and the PR #391:
When a new permutation is posted (with encrypted mask), the Paillier public key and the context are saved in the database as-is (the public key is checked but not the context).
Then, when encrypting a number, only the public key and the base are used, not the remaining information from the context, such as the precision and the signedness settings.
We should first check the received context, use it when encrypting the mask, and send it along with the encrypted values.
In some use cases the actual decision of what to do with a possible link could be made outside this server if the similarity scores were exposed.
This proposal is to add a new view type where all the links above a certain threshold are returned. Note this would be a many-to-many linkage where some rows may be referenced multiple times.
Put together a load-testing suite.
The tool https://locust.io/ has proven good for this kind of thing.
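For example, a minimal locustfile sketch; this assumes a recent Locust API and a GET-able status endpoint:

from locust import HttpUser, between, task

class EntityServiceUser(HttpUser):
    wait_time = between(1, 5)  # simulated users pause 1-5s between requests

    @task
    def check_status(self):
        self.client.get("/api/v1/status")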
I have a few jupyter notebooks which have been used to demonstrate this system. It would be good to tidy them up and include them in this repo.
Ideally include them in the production deployment so we can demo this thing.
To make it easier for external people developing a client-side tool, we need to define the translation of PII into CLKs more strictly.
I see this as the resource that configures the hashing tools; both participants can download the hashing schema from the server to check they agree on how they are creating bloom filters.
There are two components:
An example schema:
{
  "version": "1.0",
  "hash": {
    "type": "double hash"
  },
  "features": [
    {"identifier": "firstname", "type": "freetext", "ngram": 2, "weight": 5, "notes": ""},
    {"identifier": "gender", "type": "enum", "values": ["M", "F"], "ngram": 1, "weight": 1},
    {"identifier": "phone", "type": "freetext", "ngram": 1, "weight": 1, "transforms": [{"type": "strip", "values": "()-"}]},
    {"identifier": "postcode", "type": "freetext", "ngram": 1, "positional": true, "weight": 2}
  ]
}
Eventually I'd like to document the schema using http://json-schema.org which is very similar to OpenAPI spec/swagger but for JSON instead of REST.
Started in branch jenkins-pipeline
Writing the swagger docs for #4 I noticed that we don't allow cross origin requests for the entity-service.
We just need to add the CORS header as documented here.
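For reference, a minimal sketch with the flask-cors extension; restricting it to the API routes is one option:

from flask import Flask
from flask_cors import CORS

app = Flask(__name__)
# Send the Access-Control-Allow-Origin header on API responses only.
CORS(app, resources={r"/api/*": {"origins": "*"}})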
From engineering created by hardbyte : n1analytics/engineering#409
Issue by tho802
Thursday Feb 09, 2017 at 00:49 GMT
Originally opened as https://github.csiro.au/magic/n1-compute/issues/409
As @smi9c4 points out in a PR comment, we just chunk up the hashes from one data provider without considering whether the other is perhaps larger. Even better would be to chunk both, as sketched below.
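A minimal sketch of chunking both providers; the names are illustrative, not the service's actual helpers:

def chunks(filters, size):
    # Yield successive slices of at most `size` filters.
    for i in range(0, len(filters), size):
        yield filters[i:i + size]

def chunk_both(dp1_filters, dp2_filters, size):
    # Pair every chunk of DP1 with every chunk of DP2, so each task performs
    # at most size * size comparisons regardless of which provider is larger.
    for a in chunks(dp1_filters, size):
        for b in chunks(dp2_filters, size):
            yield a, b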
We should have a high level architecture diagram showing the communication between the various containers.
Some bad news while testing scalability of the entity service for ICML... Something went wrong. This issue is to record and investigate what happened, other issues will track the required fixes.
The only external indication is that progress continued past 1.
The test in question was a 1M x 1M match with 20 worker pods across 8 spot instances.
At least one worker has had an exception:
23:02:29 INFO Received task: async_worker.compute_filter_similarity[484710f7-a9d5-4a9b-b12c-ee96e4e853e0]
23:02:29 INFO Received task: async_worker.compute_filter_similarity[f51346d6-ba08-46de-8f4e-c76947a729a9]
23:07:40 INFO Timings: Prep: 17.1777 + 32.3618, Solve: 37268.0178, Total: 37317.5574 Comparisons: 499857456
23:07:50 ERROR Chord callback '9f9a597c-b781-486d-98bd-2780d94aa02a' raised: ValueError('9f9a597c-b781-486d-98bd-2780d94aa02a',)
Traceback (most recent call last):
File "/usr/local/lib/python3.5/site-packages/celery/backends/base.py", line 549, in on_chord_part_return
raise ValueError(gid)
ValueError: 9f9a597c-b781-486d-98bd-2780d94aa02a
23:07:55 INFO Task async_worker.compute_filter_similarity[1c2625ab-4454-48ca-a9f1-d55e812429e4] succeeded in 37332.14917168999s: [(429312, 0.9969604863221885, 923257), (429313, 0.995417048579285, 479794), (429314, 0.9966254218222722, 454009), (429315,...
23:07:55 INFO Received task: async_worker.compute_filter_similarity[7f406fe6-ecf8-441f-bf4c-f1302794a287]
00:09:24 INFO Timings: Prep: 19.9007 + 32.9159, Solve: 37639.4037, Total: 37692.2203 Comparisons: 499857456
00:09:34 ERROR Chord callback '9f9a597c-b781-486d-98bd-2780d94aa02a' raised: ValueError('9f9a597c-b781-486d-98bd-2780d94aa02a',)
Traceback (most recent call last):
File "/usr/local/lib/python3.5/site-packages/celery/backends/base.py", line 549, in on_chord_part_return
raise ValueError(gid)
ValueError: 9f9a597c-b781-486d-98bd-2780d94aa02a
Need to look into how we are using chords.
Centralised (searchable) logging from all workers would be nice for this.
Companies may want to create multiple CLKs per row so they can link with multiple other organisations.
To support this:
Following on from the discussion in #71, this issue is to edit the tools/build.sh and tools/upload.sh scripts and to allow the tagging logic to be set in the Jenkinsfile.
I'd like to keep an easy way for a developer to build the docker images locally. In that case I think it is acceptable to tag with latest, but we should avoid such tagging in Jenkins.
Think about using Alembic to migrate between database schemas.
This is especially great to have once we are deployed for real in more than one place.
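For illustration, a minimal Alembic migration sketch; the revision ids, table and column are hypothetical:

from alembic import op
import sqlalchemy as sa

revision = "a1b2c3d4e5f6"   # placeholder revision id
down_revision = None

def upgrade():
    op.add_column("mappings",
                  sa.Column("time_completed", sa.DateTime(), nullable=True))

def downgrade():
    op.drop_column("mappings", "time_completed")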
Aha! Link: https://csiro.aha.io/features/ANONLINK-20
From engineering created by hardbyte : n1analytics/engineering#394
Thursday Jan 19, 2017 at 00:39 GMT
Originally opened as https://github.csiro.au/magic/n1-compute/issues/394
A deployment-time configurable for the entity service is the maximum number of comparisons each task should perform. The trade-off is that if it is too large, most jobs won't be executed in parallel, but if it is too small, the network and serialisation overhead dominates the actual work.
I've been running tests with it set to 10M, but as seen in the logs below for a job comparing 1M x 1M, the solving is only taking ~5% of the actual time per task. I think 100M will improve things, but it would be good to consider different chunk sizes for each task depending on the size of the overall job; one possible heuristic is sketched after the logs.
2017-01-19T00:34:09.471822433Z 00:34:09 INFO Received task: async_worker.compute_filter_similarity[9315ed1e-5d5a-4c39-a2d3-a464111883cf]
2017-01-19T00:34:09.955952543Z 00:34:09 INFO Timings: Prep: 94.2852 + 0.0058, Solve: 5.7127, Total: 100.0037
2017-01-19T00:34:09.957798556Z 00:34:09 INFO Progress. Compared 10000000 CLKS
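A sketch of one possible heuristic, with entirely hypothetical bounds and parameter names:

def choose_chunk_size(total_comparisons, num_workers,
                      min_chunk=10_000_000, max_chunk=100_000_000,
                      chunks_per_worker=4):
    # Create enough chunks to keep every worker busy a few times over,
    # clamped between fixed lower and upper bounds.
    target = total_comparisons // (num_workers * chunks_per_worker)
    return max(min_chunk, min(max_chunk, target))

# e.g. a 1M x 1M job (10**12 comparisons) with 20 workers
print(choose_chunk_size(10**12, 20))  # -> 100000000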
There is a bottleneck: the main entity service app container must be able to fit any uploaded hashes entirely in memory before processing them. It shouldn't be too hard to instead allow uploading binary hashes directly to the object store.
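A sketch of streaming an upload straight into MinIO with minio-py, assuming its streaming put_object (length=-1 with a part_size); the endpoint, credentials, bucket and object names are hypothetical, and the bucket is assumed to exist:

import io

from minio import Minio

client = Minio("minio:9000", access_key="minio", secret_key="minio123",
               secure=False)

def store_uploaded_clks(stream, object_name):
    # Stream the raw body into the object store instead of buffering the
    # whole upload in the app container's memory.
    client.put_object("clk-uploads", object_name, stream,
                      length=-1, part_size=10 * 1024 * 1024)

store_uploaded_clks(io.BytesIO(b"\x00" * 1024), "dp1-hashes.bin")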
Something that is untested is broken.
At the moment there are fairly decent unit tests of the anonlink library and a script which does some end-to-end testing of a deployed entity matching service. However, there are no tests of the end-to-end service that actually check the results of the matching!
Testing the flask endpoints isn't that straightforward as we are coupled with celery, redis, postgresql and minio.
http://flask.pocoo.org/docs/0.12/testing/
Celery tasks can also be tested with a bit of mocking - http://docs.celeryproject.org/en/latest/userguide/testing.html
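For example, a pytest sketch against the Flask test client; the create_app factory and the expected status codes are assumptions, not the service's actual API:

import pytest

from entityservice import create_app  # hypothetical app factory

@pytest.fixture
def client():
    app = create_app(testing=True)
    with app.test_client() as client:
        yield client

def test_unknown_mapping_is_not_a_500(client):
    resp = client.get("/api/v1/mappings/not-a-real-mapping/status")
    assert resp.status_code in (403, 404)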
From engineering created by hardbyte : n1analytics/engineering#401
Issue by tho802
Monday Feb 06, 2017 at 05:40 GMT
Originally opened as https://github.csiro.au/magic/n1-compute/issues/401
Found an issue when the entity service was passed a request with an incorrect JSON structure: instead of failing gracefully, the server threw a 500 error. In this case an object was found where a string was expected, and the server said the dict was unhashable.
This ticket is to explore libraries that offer type checking of JSON structures and to implement one.
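For instance, a sketch using the jsonschema library; the (partial) request schema below is hypothetical:

from jsonschema import ValidationError, validate

MAPPING_REQUEST_SCHEMA = {
    "type": "object",
    "required": ["schema", "result_type"],
    "properties": {
        "result_type": {"type": "string"},
        "schema": {"type": "array", "items": {"type": "object"}},
    },
}

def validate_mapping_request(payload):
    # Turn a malformed body into a helpful 400 instead of a 500.
    try:
        validate(payload, MAPPING_REQUEST_SCHEMA)
    except ValidationError as e:
        return False, e.message
    return True, None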
From engineering created by hardbyte : n1analytics/engineering#365
Issue by smi9c4
Wednesday Dec 14, 2016 at 07:37 GMT
Originally opened as https://github.csiro.au/magic/n1-compute/issues/365
There are quite a few tests in the entity service, finishing by receiving the mapping/permutation. However, we are not checking that the result is correct.
This would be very easy on kubernetes, so this depends on automated k8s tests.
It has gotten a bit unwieldy dealing with each view type - it should be refactored into more of a dispatcher.
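A sketch of what that dispatcher could look like; the handler names are stand-ins:

def handle_mapping(run):
    raise NotImplementedError  # per-view-type logic lives here

def handle_permutation(run):
    raise NotImplementedError

def handle_similarity_scores(run):
    raise NotImplementedError

RESULT_TYPE_HANDLERS = {
    "mapping": handle_mapping,
    "permutation": handle_permutation,
    "permutation_unencrypted_mask": handle_permutation,
    "similarity_scores": handle_similarity_scores,
}

def dispatch_result(result_type, run):
    try:
        handler = RESULT_TYPE_HANDLERS[result_type]
    except KeyError:
        raise ValueError("Unknown result type: {}".format(result_type))
    return handler(run)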
Need a script to remove all git history, IDE configurations, build the docs and zip everything up.
Attached a candidate release:
n1-es-v1.4.10.zip
The outputs of the many compute_filter_similarity celery tasks are "reduced" to become the arguments to another task. In a very large matching this will be a fairly large amount of data; it is unclear if there is a maximum other than what can fit in the underlying queue (redis).
I suspect we should save this match data elsewhere and simply return a database id or filename.
Aha! Link: https://csiro.aha.io/features/ANONLINK-9
This would allow us to use a standard PostgreSQL docker container or helm deployment, and hosted database solutions like RDS etc.
I get the following error:
Step 8 : RUN cd AnonymousLinking && pip install -U -r requirements.txt && pip install -e . && cd ..
---> Running in 42dc08862697
Could not open requirements file: [Errno 2] No such file or directory: 'requirements.txt'
The command '/bin/sh -c cd AnonymousLinking && pip install -U -r requirements.txt && pip install -e . && cd ..' returned a non-zero code: 1
Currently the similarity scores are only stored if the result type is set to "similarity_scores". In the future we may want to store the similarity scores regardless of the result type. However, beware that it may need a lot of storage!
Aha! Link: https://csiro.aha.io/features/ANONLINK-10
Thresholds are configured for the server; they should be part of each match.
Friday Jan 06, 2017 at 06:32 GMT
Originally opened as https://github.csiro.au/magic/n1-compute/issues/383
The entity service should record and report each mapping's state.
For example it currently might return:
{'current': '375000750000',
'elapsed': 3677.268006,
'message': "Mapping isn't ready.",
'progress': 1.0,
'total': '375000750000'}
Looking at the server logs is required to see whether it was busy creating a permutation or encrypting data.
In Feb 2017, Dongxi Liu from Data61 Marsfield proposed a method for doing secure division using Paillier to calculate the Dice coefficient, meaning the entity matching could be worked out without a semi-trusted third party.
Update: calculate E(A)/E(B) by sending E(r·A + e0) and E(r·B + e) for random r, e0, and e, and then computing (r·A + e0) / (r·B + e), which will be an estimate of the Dice coefficient with a bit more noise.
The attached Excel spreadsheet illustrates this updated calculation of dice coefficient.
approximate-dice-coefficient.xlsx
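As a toy illustration of the arithmetic (with no encryption involved), the randomised quotient closely approximates the Dice coefficient A/B, where A = 2 × common 1-bits and B = total 1-bits:

import random

common_ones, total_ones = 60, 200
A, B = 2 * common_ones, total_ones          # Dice coefficient = A / B = 0.6
r = random.randint(10**6, 10**7)            # random blinding factor
e0, e = random.randint(1, 100), random.randint(1, 100)  # small noise terms
estimate = (r * A + e0) / (r * B + e)
print(estimate, A / B)                      # estimate is ~0.6 plus a little noise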
I think the correctness proof is not hard, and once we have an implementation, the correctness and accuracy can also be verified. For security, our goal is to protect each party's bloom filter: that is, neither party can learn more about the number of common 1-bits or the number of 1-bits in the other party's bloom filter than in the ideal model, in which both bloom filters are sent to a trusted third party. Informally, if we assume the homomorphic encryption is secure, then party A (owning the private key) has those values encrypted, so B cannot learn any information about A's bloom filter. For B's bloom filter, the information about 1-bits is randomised with three random numbers, and the way of randomisation means that A cannot recover the number of 1-bits, based on the hardness of the approximate GCD problem. In addition, the bloom filter cannot be too short; otherwise a party could do a dictionary attack to recover the other party's bloom filter based on the similarity ratio.
Consider allowing users to upload "blocks" along with the CLKs.
This is a repeatable issue. I'm trying to compute a permutation with unencrypted mask where DP1 has fewer CLKs than DP2 (but I cannot ensure that the data arrive in any particular order).
From logs:
es_backend_1 | [2017-02-06 23:38:49 +0000] [12] [ERROR] Error handling request
es_backend_1 | Traceback (most recent call last):
es_backend_1 | File "/usr/local/lib/python3.5/site-packages/gunicorn/workers/sync.py", line 130, in handle
es_backend_1 | self.handle_request(listener, req, client, addr)
es_backend_1 | File "/usr/local/lib/python3.5/site-packages/gunicorn/workers/sync.py", line 171, in handle_request
es_backend_1 | respiter = self.wsgi(environ, resp.start_response)
es_backend_1 | File "/usr/local/lib/python3.5/site-packages/flask/app.py", line 1836, in __call__
es_backend_1 | return self.wsgi_app(environ, start_response)
es_backend_1 | File "/usr/local/lib/python3.5/site-packages/flask/app.py", line 1820, in wsgi_app
es_backend_1 | response = self.make_response(self.handle_exception(e))
es_backend_1 | File "/usr/local/lib/python3.5/site-packages/flask_restful/__init__.py", line 271, in error_router
es_backend_1 | return original_handler(e)
es_backend_1 | File "/usr/local/lib/python3.5/site-packages/flask/app.py", line 1403, in handle_exception
es_backend_1 | reraise(exc_type, exc_value, tb)
es_backend_1 | File "/usr/local/lib/python3.5/site-packages/flask/_compat.py", line 32, in reraise
es_backend_1 | raise value.with_traceback(tb)
es_backend_1 | File "/usr/local/lib/python3.5/site-packages/flask/app.py", line 1817, in wsgi_app
es_backend_1 | response = self.full_dispatch_request()
es_backend_1 | File "/usr/local/lib/python3.5/site-packages/flask/app.py", line 1477, in full_dispatch_request
es_backend_1 | rv = self.handle_user_exception(e)
es_backend_1 | File "/usr/local/lib/python3.5/site-packages/flask_restful/__init__.py", line 271, in error_router
es_backend_1 | return original_handler(e)
es_backend_1 | File "/usr/local/lib/python3.5/site-packages/flask/app.py", line 1381, in handle_user_exception
es_backend_1 | reraise(exc_type, exc_value, tb)
es_backend_1 | File "/usr/local/lib/python3.5/site-packages/flask/_compat.py", line 32, in reraise
es_backend_1 | raise value.with_traceback(tb)
es_backend_1 | File "/usr/local/lib/python3.5/site-packages/flask/app.py", line 1475, in full_dispatch_request
es_backend_1 | rv = self.dispatch_request()
es_backend_1 | File "/usr/local/lib/python3.5/site-packages/flask/app.py", line 1461, in dispatch_request
es_backend_1 | return self.view_functions[rule.endpoint](**req.view_args)
es_backend_1 | File "/usr/local/lib/python3.5/site-packages/flask_restful/__init__.py", line 477, in wrapper
es_backend_1 | resp = resource(*args, **kwargs)
es_backend_1 | File "/usr/local/lib/python3.5/site-packages/flask/views.py", line 84, in view
es_backend_1 | return self.dispatch_request(*args, **kwargs)
es_backend_1 | File "/usr/local/lib/python3.5/site-packages/flask_restful/__init__.py", line 587, in dispatch_request
es_backend_1 | resp = meth(*args, **kwargs)
es_backend_1 | File "/var/www/entityservice.py", line 293, in get
es_backend_1 | "progress": (comparisons/total_comparisons) if total_comparisons is not 'NA' else 0.0
es_backend_1 | ZeroDivisionError: division by zero
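A sketch of a guard for the progress calculation, reusing the names from the traceback. Note that the original `is not 'NA'` is an identity comparison against a string literal, which is itself a bug; an equality check plus a zero guard avoids both problems:

def mapping_progress(comparisons, total_comparisons):
    # Avoid dividing before any comparisons have been recorded, and use !=
    # rather than `is not` to compare against the 'NA' sentinel.
    if total_comparisons == 'NA' or not total_comparisons:
        return 0.0
    return comparisons / total_comparisons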
Need to check if the entity service can actually carry out record linkage with large test sets before deploying to production.
Aha! Link: https://csiro.aha.io/features/ANONLINK-15