blindstore / blindstore Goto Github PK

Official Blindstore repository

License: MIT License

C++ 64.03% C 35.97%

blindstore's Introduction

Introduction to Blindstore

Blindstore is a private information retrieval data store. It provides an online lookup service which returns the correct answer without ever knowing what the question was.

Currently this repo is a stub.

For the core cryptography implementation, see libshe.
For a working proof of concept of the Blindstore concept, please see the blindstore-old-scarab repository. The implementation there is based on another encryption scheme.

This repo will contain an implementation of Blindstore in C++ based on the V-DGHV scheme of somewhat homomorphic encryption (libshe) presented in [1].

Disclaimer

The library presented here is beta software and should not be used for any mission critical applications. No warranty expressed or implied is given.

References

[1] Yi, Xun; Kaosar, Mohammed Golam; Paulet, Russell; Bertino, Elisa, "Single-Database Private Information Retrieval from Fully Homomorphic Encryption", Knowledge and Data Engineering, IEEE Transactions on , vol.25, no.5, pp.1125,1134, May 2013

blindstore's People

Contributors

Stargazers

Watchers

Forkers

omegak eipifi andronat davidek yigitdemirag dmuktro tristartom

blindstore's Issues

Random performance

Request execution can take from 0.22s to 12s. We must find out what causes this nondeterministic behaviour. Possible suspects:

serialization (unlikely)
python-C communication layer (unlikely)
error in Flask (very unlikely)
inherently unstable performance of the homomorphic operations or encryption/decryption (need to benchmark directly in C)

Request times (100 iterations):

9.07
0.7799999999999994
12.35
7.059999999999999
4.390000000000001
12.230000000000004
2.1299999999999955
0.21999999999999886
0.22000000000000597
1.0399999999999991
3.289999999999999
1.0300000000000011
2.0700000000000003
4.289999999999999
5.590000000000003
0.8599999999999994
0.6700000000000017
2.559999999999988
0.7400000000000091
7.670000000000002
5.359999999999999
4.239999999999995
2.3700000000000045
3.019999999999996
8.019999999999996
9.429999999999993
6.8500000000000085
2.8799999999999955
1.4500000000000028
1.5600000000000023
9.649999999999991
7.740000000000009
0.37999999999999545
0.27000000000001023
0.4099999999999966
1.2299999999999898
3.539999999999992
0.6500000000000057
0.27000000000001023
1.490000000000009
0.23999999999998067
0.28000000000000114
1.9399999999999977
5.670000000000016
7.6200000000000045
11.849999999999994
3.799999999999983
1.670000000000016
1.210000000000008
0.6599999999999966
6.189999999999998
1.3499999999999943
1.490000000000009
1.6999999999999886
3.0500000000000114
4.149999999999977
4.25
4.550000000000011
1.2299999999999898
1.990000000000009
2.490000000000009
3.7299999999999898
6.460000000000008
1.4099999999999966
8.25
2.75
9.960000000000008
1.8499999999999943
3.2199999999999704
2.2200000000000273
0.660000000000025
3.9799999999999613
0.9700000000000273
0.9599999999999795
4.1200000000000045
1.3600000000000136
0.9300000000000068
1.079999999999984
3.2799999999999727
8.260000000000048
0.9599999999999795
3.069999999999993
3.829999999999984
1.8199999999999932
2.1200000000000045
0.5299999999999727
1.2200000000000273
0.2999999999999545
3.3600000000000136
1.25
7.470000000000027
2.8700000000000045
0.5
2.3799999999999955
0.5299999999999727
2.519999999999982
1.8500000000000227
10.949999999999989
2.0300000000000296

Random number in encryption not chosen exactly like in the paper

In the encryption function of libshe, a random integer in [1, 2^s) is chosen, but following the paper, this should be (-2^s, 2^s). We made this change because decryption failed when using the interval from the paper. We have to investigate why.

 // Chooses random noise r from [1, 2^s)
mpz_class bound;
mpz_ui_pow_ui(bound.get_mpz_t(), 2, s);
auto r = _random_mpz(1, bound-1);

In fact, this issue belongs into the repository of libshe, but as it will be merged into blindstore soon, I decided to document this here.

Code parallelization

Today me, @bogdan-kulynych and @northdpole we had a long conversation about parallelization and optimizations. We broke the code into steps and we concluded on how we should proceed with the subject.

What we found:

The XOR operation between the clients' encrypted key and the encrypted vector of 1's, can be moved on the client side
We can optimize a lot the needed space for the calculation matrices
All intermediate data seems to be independent and thus we can parallelize in a high degree.

How we proceed:

Create flexible benchmarking scripts that will create test input date
Test each part of the algorithm with different sizes of input to find bottlenecks

`libscarab` entropy source to be reviewed

Since the entropy source for libscarab used for encryption is current time, malicious PIR server can use the obtained Public Key and time information to guess the encrypted bits:

encrypted_bit = query.cipher_index[i]
pk = query.pk
for time in possible_milliseconds:
    guess0 = pk.quasi_encrypt(0, time)
    if guess0 == encrypted_bit:
        print ('Encrypted bit i was 0')
    guess1 == pk.quasi_encrypt(1, time)
    if guess1 = encrypted_bit:
        print ('Encrypted bit i was 1')

Merge libshe into blindstore repository

The library libshe will not be useful for any other application than the Private Information Retrieval for which the algorithm was made. Thus, we decided to merge it into the blindstore repository. Please find a way to preserve the history when merging.

Investigate HElib feasibility as FHE backend

HElib is a much more complicated to use in our protocol context, but it looks more promising than libscarab.

Benchmarks for server

Benchmarks for Blindstore server only for further analysis of performance improvement when we implement parallelization.

Decide on the granularity of the choice of privacy

If we want to expose the possibility to choose the level of privacy, we have to decide on the details.

The simplest approach I see at the moment is to just say which part of the database I want to query, given that the database will be split into n parts of equal size. Thus, the server would expect a query to consist of: [encrypted_index, number of parts to split the database into, part to query] and as an example [encrypted_index, 10, 2]. The server then would calculate the according index range for the third of ten parts and run the cryptographic algorithm just on these indices.

A more sophisticated approach would be to let the client specifiy an index range: [encrypted_index, start_index, end_index]. This provides the possibility to choose the database size more freely.

The most general approach would be to completely define the database subset that should be queried. The client could then send a list of index ranges.

Implement V-DGHV somewhat homomorphic scheme

It appears that the V-DGHV scheme proposed in Single-Database PIR from FHE allows for significant performance boost compared to using FHE as a black box. Using this scheme, we don't need to encrypt the indexes.

To achieve this, we need to marry PIR protocol to this particular HE scheme, so no abstraction on HE is needed anymore.

Decide on licensing

We need to decide on a license for Blindstore. This would need to be compatible with the licenses of the libraries which we use, and the aims and objectives of the project.

Investigate possible bug in set call

During the demos, a set call for row 1 appeared to change two rows (1 and 2) in the database, as displayed on the Web display. I attempted (in a very short and hurried period of time) to reproduce the error on master, but could not. We need to discover whether this was an issue with the demo Web display, changes on the demo branch, or an issue with master.

Investigate possible adaptation of the schema to integer indices

Following up from #3, check whether it is feasible to port the protocol to support integer indices, instead of being forced to convert to binary and encrypt bit by bit.

Collect and provide information about the impact of choosing a lower privacy level

By querying only a certain part of the database, the user leaks information to the database. Thus, the feature we provide by exposing the choice of privacy to the user, has to be used wisely. We should give some guidelines to the user of the library: How can the user decrease his privacy but not expose too much information? How can the user choose the part of the database such that the server learns as less as possible? What should be the minimum size of the part of the database? Should the index range be chosen randomly around the index the user queries?

Investigate the benefit of running Blindstore on GPUs

Looking for „gmplib GPU“ and „gmplib graphics card“ leads to some interesting results. Some of them:

GPUMP: a Multiple-Precision Integer Library for GPUs, on CUDA
The CUDA Multiple Precision Arithmetic Library: but this is for floating point numbers, and currently only provides addition, substraction and multiplication

Abstraction on FHE

If we want to benchmark different FHE libraries, first we would need to abstract the interface with them.

Don't use C-style new and free

Hey guys,
pretty awesome work you're doing. Just a quick remark.

I would strongly advise against C-style new and free as in for example https://github.com/blindstore/blindstore/blob/master/server/array.cpp where you do

blindstore_array_t* blindstore_array_make() {
    return new blindstore_array_t();
}

void blindstore_array_free(blindstore_array_t* blindstore_array) {
    delete blindstore_array;
}

Use C++ classes instead. Destructors are one of the most awesome things C++ has to offer that C doesn't. A possible case where this is important is sketched out below:

auto a = blindstore_array_make();
throw SomeException;
blindstore_array_free(a);

And a is leaked...

The only possible way this could work well that I can think of is if you always use std::unique_ptr (or std::shared_ptr respectively) with the appropriate deleter set as in e.g. (not tested)

using unique_blindstore_array_ptr = std::unique_ptr<blindstore_array_t, [&](blindstore_array_t *arr) { blindstore_array_free(arr); }>;

(Actually the default deleter should work fine but since you define a function wrapper around it that might be used to do more than just delete it might be better to use it as the custom deleter...)

Easier installation process

We should work on this before going public

Python wrapper for HElib

Language choice and parallelization tool

Because the main issue with the current prototype is performance, we should re-consider using only python for implementig the protocol functions. Some brainstorming lead to several proposals:

Language choice

(to be matched with the parallelization choice)

Priority is on ease to write and maintain

Stick with Python and C
Rust (by Mozilla)
Go (by Google)
C++ (much less likely)
Scala/Java (there's fhe-core library for FHE)
Other solutions?

Parallelizing tool

(to be matched with the language choice)

Hadoop
Apache Storm – any language
Apache Spark – Python, Scala or Java
Other solutions?

XPIR and Blindstore

Hello all,
I am one of the developers of XPIR and I recently stumbled across your project, which seems to have a ton of good ideas !

As you have probably though a lot about making PIR usable I wondered if you would have some suggestions of ideas we could integrate to XPIR from this project or things that we got completely wrong and that we should rethink. Of course if any of you is interested in joining our project it is feasible too ! On the opposite side, if there are things you would like to get from XPIR we can of course help :)

Best,

Carlos