Data privacy is fundamentally important in big data analysis, machine learning and distributed systems. To guarantee practical privacy, the notion of differential privacy is adopted to maximize the accuracy of queries in statistical database while minimizing the possibility of identifying records. In the matrix completion problem, recovery accuracy and data privacy are two contradicting aspects, since revealing more entries will increase the estimation accuracy but sacrifices privacy. Thus, privacy-preserving matrix completion is a challenging issue. In this paper, we propose a novel scheme for differentially private matrix completion with provably better accuracy bound. First, we present our differential private version of the practically efficient alternating minimization algorithm. Secondly, we present theoretical results for ensuring required privacy. Thirdly, on real world data sets, we compare the performances of our scheme with that of state-of-the-art schemes.
- We use Matlab to achieve a practical approach to complete matrix.
- For synthetic data, we compare our Differentially Private Matrix Completion via Alternating Minimization with Private Frank-Wolf algorithm, since both are designed for low-rank matrix completion, to show the advantage of our algorithm. We conduct experiment to recover a low-rank matrix with rank-
, from observed elements in the subset
.
- For recovery error, we adopt the relative square metric (RSE) and root mean square error (RMSE).
- For running time, varying the matrix size and fixing other parameters, we measure CPU time in seconds.
- For convergence speed, we measure the decreasing rate of the RES across the iterations by linearly fitting the measured RSEs (in log scale). We include those plots due to two reasons: 1) both algorithms are iterative; 2) the decreasing speed of the RSE provides explanations for the observed performance of the recovery error and the running time.
Our new algorithm are shown as follows:
As we want to preserve privacy of every user, and the output for each user is -dimensional, we expect the private recommendations to be accurate only when
. Due to this constraint, we conduct experiments on the following datasets: 1) Synthetic: We generate a random rank-one matrix
with unit
-norm,
, and
, 2) MovieLens10M (Top 400): We pick the
most rated movies from the Movielens10M dataset, resulting in
users, 3) Netflix (Top 400): We pick the
most rated movies from the Netflix prize dataset, resulting in
users.