Enabling privacy-preserving Machine Learning
As part of Outlier Ventures’ Research Programme with the Imperial Centre for Cryptocurrency Research and Engineering, Florian Apfelbeck has researched the main approaches to privacy-preserving techniques as applicable to machine learning in Fetch. He implemented two machine learning algorithms in a secure multiparty computation framework and presents his work in two blog posts. We have been very impressed with his results, which demonstrate a neural network running under secure multiparty computation with more parties and better trust assumptions than previous implementations.
Today, a world without machine learning is hard to imagine. From voice assistants to the analysis of medical data to self-driving cars, these algorithms power the latest and most advanced technologies. All of this is enabled by large amounts of data provided by individuals and businesses. Companies backed by Outlier Ventures, such as Fetch.AI, Haja Networks, and Ocean Protocol, aim to unlock further potential of data and to ease access to it and its monetisation in decentralised environments. There are many situations in which data must be kept confidential, not only for businesses, health care, or financial institutions, but also for private individuals. Privacy-preserving techniques have been proposed for this purpose, and the available schemes cover a broad spectrum in terms of security, performance, and applicability. Combined with machine learning algorithms, they are a powerful tool for handling privacy-critical operations. However, these schemes are not always practical, and so far their use has been limited to specific cases.
A project at Imperial College, together with Outlier Ventures and Fetch.AI, investigated this challenge. First, the potential of several approaches to data privacy in the machine learning context was evaluated (see also this post). Based on this, we decided to implement two learning algorithms from scratch – a simple linear regression and a single-hidden-layer neural network – using Secure Multiparty Computation (MPC) to assess their performance and practicability in comparison with a non-MPC implementation in Python. For the implementation, we relied on the MPC software package SCALE-MAMBA, which enables trustless computation on encrypted inputs from multiple parties.
Secure Multiparty Computation, SPDZ and SCALE-MAMBA
This section provides a basic understanding of the concepts used in the rest of this post. If you are familiar with MPC and the SPDZ/Overdrive protocol we used, or are interested only in the results, feel free to skip it.
MPC enables computation on data from different providers/parties, such that the participating parties gain no additional information about each other’s inputs, except what can be learned from the public output of the algorithm. In other words, when we have the parties Alice, Bob and Casper, all three have access to the output. However, it is not possible for, e.g., Alice to learn the plain data Bob and Casper provided.
The underlying mechanics depend on the protocol used. In our case, we decided to employ Overdrive, a successor of the SPDZ protocol. These schemes are based on so-called secret-sharing. Put simply, the idea of secret-sharing is to split every input value using a defined operation; addition in the case of SPDZ. The parts of the value are distributed among all parties, which then execute the operations given by the algorithm on their shares of the different data inputs. Combining the results of the individual computations eventually yields the output value.
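Put concretely, additive secret-sharing can be sketched in a few lines of Python. The modulus below is illustrative; SPDZ actually works over a much larger prime field and adds information-theoretic MACs on top, which we omit here:

```python
import random

P = 2**61 - 1  # illustrative prime modulus; SPDZ uses a much larger field

def share(x, n=3):
    """Split x into n additive shares: random values that sum to x mod P."""
    shares = [random.randrange(P) for _ in range(n - 1)]
    shares.append((x - sum(shares)) % P)
    return shares

def reconstruct(shares):
    """Only the sum of ALL shares reveals the secret."""
    return sum(shares) % P

# Addition of two secrets is purely local: each party adds its own shares.
sa, sb = share(12), share(30)
sc = [(a + b) % P for a, b in zip(sa, sb)]
# reconstruct(sc) yields 42 = 12 + 30
```

Note that any subset of fewer than all n shares is uniformly random, so no coalition short of all parties learns anything about the input.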
To increase the speed of the computation, the SPDZ protocol introduced a preprocessing phase. In this phase, all computationally heavy operations are performed and correlated random items are created. These items form the basis for multiplications, squarings and bit operations during the execution of the algorithm, the so-called online phase. The preprocessing phase is independent of both the online phase and the data inputs; this separation allows a fairly fast online phase.
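The items used for multiplication are so-called Beaver triples: random shared values x, y, z with z = x·y. The self-contained sketch below (illustrative field, SPDZ’s authentication MACs omitted) shows how one preprocessed triple reduces a secure multiplication in the online phase to two public openings plus local linear operations:

```python
import random

P = 2**61 - 1  # illustrative prime modulus

def share(v, n=3):
    s = [random.randrange(P) for _ in range(n - 1)]
    s.append((v - sum(s)) % P)
    return s

def open_shares(shares):
    return sum(shares) % P

def beaver_mul(sa, sb, sx, sy, sz):
    """Multiply shared a and b using a preprocessed triple (x, y, z = x*y)."""
    # d and e can be opened safely: x and y are uniformly random masks.
    d = open_shares([(ai - xi) % P for ai, xi in zip(sa, sx)])
    e = open_shares([(bi - yi) % P for bi, yi in zip(sb, sy)])
    # Locally compute shares of a*b = z + d*y + e*x + d*e;
    # the public constant d*e is added by one designated party.
    out = [(zi + d * yi + e * xi) % P for zi, yi, xi in zip(sz, sy, sx)]
    out[0] = (out[0] + d * e) % P
    return out

# Preprocessing: one random triple (independent of the actual inputs).
x, y = random.randrange(P), random.randrange(P)
sx, sy, sz = share(x), share(y), share(x * y % P)

# Online phase: multiply the secrets 7 and 6.
sab = beaver_mul(share(7), share(6), sx, sy, sz)
# open_shares(sab) yields 42
```

Each multiplication in the algorithm consumes one such triple, which is why the preprocessing cost scales with the size of the computation.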
Some researchers from the original SPDZ team, e.g., Nigel Smart, developed a software package for research purposes, now called SCALE-MAMBA, which we used for this project. SCALE-MAMBA consists of the SCALE virtual machine, which implements the SPDZ protocol, and MAMBA, a domain-specific language similar to Python.
Implementation and Performance Evaluation of Simple Linear Regression and Neural Network
We now detail our implementations of the simple linear regression algorithm and the neural network. For the first algorithm, we were inspired by the Python implementation found here, and we also compare our performance to that version. The neural network follows the structure of this source: a single hidden layer, sigmoid activation function, mean squared error cost function, backpropagation, and gradient descent for optimisation (Figure 1).
Figure 1: Architecture of the neural network. It consists of an input layer with two, a hidden layer with nine and an output layer with one node.
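For reference, the non-MPC baseline this architecture corresponds to can be sketched in plain NumPy, here on the XOR toy problem used for the error curves later in this post. Initialisation, learning rate, and epoch count are illustrative, not those of the benchmarked implementation:

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# XOR toy problem: 2 inputs, 1 output.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0.0], [1.0], [1.0], [0.0]])

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 9)), np.zeros(9)   # input -> hidden (9 nodes)
W2, b2 = rng.normal(size=(9, 1)), np.zeros(1)   # hidden -> output (1 node)

lr = 1.0
for _ in range(10000):
    h = sigmoid(X @ W1 + b1)                    # forward pass
    out = sigmoid(h @ W2 + b2)
    d_out = (out - y) * out * (1 - out)         # backprop: MSE + sigmoid'
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= lr * h.T @ d_out                      # gradient descent updates
    b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h
    b1 -= lr * d_h.sum(axis=0)
```

Every matrix product and every sigmoid evaluation in this loop becomes a sequence of secure multiplications in the MPC version, which explains the large online-phase overhead reported below.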
Simple Linear Regression
To evaluate the performance of the linear regression, we measured timing and precision in two-, three- and five-party settings in a LAN and a simulated WAN environment. Each party had at least 32 GB of RAM available. The preprocessing takes several milliseconds per item, which can add up to one hour in our setup, and it consumes a large amount of RAM. The online phase requires about 3, 4 and 5.6 seconds for two-, three- and five-party computation, respectively (Figure 2). This corresponds to a slowdown of three to four orders of magnitude compared to the Python version (Figure 3). In the WAN setting, the preprocessing becomes one order of magnitude slower, while the time of the online phase does not change. On an example with 64 data points, the deviation in test accuracy is 6 × 10^(-5).
Figure 2: Timing results of the MPC and Python linear regression implementations. Besides the Full Threshold SPDZ/Overdrive protocol (FT), we also compare it to a Shamir secret-sharing protocol available in SCALE-MAMBA.
Figure 3: Timing comparison of MPC protocols in multiples of the Python implementation runtime.
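For orientation, a simple (single-feature) linear regression amounts to the classic closed-form least-squares fit; the NumPy sketch below illustrates the computation (it is not the exact benchmark code):

```python
import numpy as np

def simple_linreg(x, y):
    """Fit y ≈ a*x + b by ordinary least squares in closed form."""
    xm, ym = x.mean(), y.mean()
    a = np.sum((x - xm) * (y - ym)) / np.sum((x - xm) ** 2)  # slope
    b = ym - a * xm                                          # intercept
    return a, b

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
a, b = simple_linreg(x, 2.0 * x + 1.0)  # noiseless data: recovers a=2, b=1
```

Under MPC, the sums become local operations on shares, while each multiplication and the single division for the slope consume preprocessed material.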
We were also able to partly test the neural network’s precision and the timing of its individual parts. Whereas the precision of the MPC implementation shows low deviation in the resulting losses (Figure 4), the time consumed by the online phase is significantly higher than that of a Python implementation (10^4 times for 10 epochs). During the creation of the neural network, we faced some challenges. First, our implementation of the sigmoid can exceed the fixed-point bit size available to our algorithm, which causes overflow errors and stops training prematurely. Second, we have indications that the division operation of SCALE-MAMBA is not always numerically stable. Both cases require further investigation. A possible solution could be to switch to a polynomial approximation of the sigmoid and to double-check the correctness of every division.
Figure 4: Top: Error development over 20 epochs with MPC network. Bottom: Error development over 28 epochs with Python network. The black line represents the mean of the four training inputs of the XOR operation. The coloured graphs show the error of each input sample individually.
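The polynomial approximation of the sigmoid mentioned above could, for example, be obtained with a least-squares Chebyshev fit on a bounded input range; a polynomial needs only additions and multiplications, which map directly onto MPC operations and avoid the exponential that overflows the fixed-point range. The range and degree below are illustrative:

```python
import numpy as np

# Fit a degree-9 Chebyshev polynomial to the sigmoid on [-5, 5];
# inputs outside this range would have to be clamped before evaluation.
xs = np.linspace(-5.0, 5.0, 1001)
sig = 1.0 / (1.0 + np.exp(-xs))
coeffs = np.polynomial.chebyshev.chebfit(xs, sig, deg=9)

def sigmoid_poly(x):
    """Polynomial stand-in for the sigmoid, safe for fixed-point arithmetic."""
    return np.polynomial.chebyshev.chebval(x, coeffs)

max_err = float(np.max(np.abs(sigmoid_poly(xs) - sig)))  # worst case on the range
```

Whether such an approximation is accurate enough for training would of course need the same kind of precision evaluation as above.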
Due to the nature of MPC in general and MAMBA in particular, some functionality differs considerably from Python. Functions present in many modern programming languages are not available out of the box, for example else-if statements, reading floating-point numbers from a file, random number generation within an interval, or a convenient way to handle arrays/vectors. Therefore, in addition to the measurements, a few utilities, such as a routine to import floating-point numbers in MAMBA and a custom vector-type class that eases implementation work, are part of the code base. The code can be found in the GitHub repository, and we hope it helps future projects with SCALE-MAMBA.
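As an illustration of why importing floating-point numbers needs custom work: MPC engines compute over integers, so real numbers are handled as scaled integers (fixed point, as in MAMBA’s secure fixed-point type). A rough Python model of such an encoding, with an illustrative precision parameter:

```python
F = 20  # number of fractional bits (illustrative; configurable in SCALE-MAMBA)

def encode(x: float) -> int:
    """Map a real number to its fixed-point integer mantissa."""
    return int(round(x * (1 << F)))

def decode(m: int) -> float:
    return m / (1 << F)

# Addition works directly on mantissas; multiplication doubles the scale,
# so the product must be truncated by 2**F afterwards.
a, b = encode(1.5), encode(-0.25)
s = a + b          # encodes 1.25
p = (a * b) >> F   # encodes -0.375 after rescaling
```

Roughly speaking, performing this truncation on secret-shared values is itself one of the operations that relies on preprocessed material, and the bounded number of bits is exactly what our sigmoid implementation can exceed.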
Summary of Findings and Outlook
The results presented in this work show that multiparty computation based on SPDZ is not yet ready for production. In particular, improvements to the speed of the preprocessing phase, but also of the online phase, are necessary for widespread use. We implemented two algorithms and tested their performance with regard to speed and precision in comparison with a non-MPC version in Python. The results show an overall slowdown of the online phases of three to four orders of magnitude in the case of the simple linear regression (depending on the number of participants) and four orders of magnitude for the more complex neural network, compared to the Python version. In some cases, this could be tolerable. However, the preprocessing phase can become very time-consuming. Moreover, for IoT devices and other computationally weak devices, the computationally heavy preprocessing can pose a significant challenge.
Nevertheless, there are other promising approaches. For example, regarding usability, improvements can be achieved with the approach of Morten Dahl (described here and here), in which a trusted third party creates the preprocessed items. This way, Python, including TensorFlow and other frameworks, can be used to implement the algorithm. However, this setup is no longer trustless, which makes it suitable only for some use cases. To the best of our knowledge, our neural network is the first implementation of a more-than-two-party privacy-preserving neural network based on the SPDZ protocol in which no trusted third party is required as a crypto provider for the preprocessing phase.
We conclude that MPC has only restricted practicability in its current state, even with a sophisticated library such as SCALE-MAMBA. Whether the use of SPDZ and MPC currently makes sense depends mainly on the application and the environment. Given the increased attention from the public and the research community, and the dramatic advances in privacy-preserving techniques in general and MPC in particular over the past few years, real-world applications in this context, not only those based on SPDZ and the techniques investigated here, are possible in the future.