We at Meta care deeply about protecting the privacy of our users’ data and that of advertisers and their customers. We’re researching and developing several privacy-enhancing technologies (PETs) to continue to improve privacy and data security for individuals and businesses while still building innovative products powered by AI. PETs are a set of foundational technologies and techniques that foster increased data protection and minimize the possession of personal data. Integrating PETs into wider policy and business frameworks can enable necessary data privacy and security at scale.

One of the privacy-enhancing technologies Meta is investing in is secure multi-party computation–based machine learning (MPC-ML). It leverages secure multi-party computation (MPC) to allow two or more computing parties to jointly train a machine learning (ML) model without any user data leaving their respective servers. Any information exchanged between computing parties via MPC-ML protocols is encrypted and unintelligible, ensuring input privacy: neither party can infer the other’s user data. MPC-ML protocols also compose well with differential privacy mechanisms to guarantee a specific level (ε) of differential privacy, ensuring output privacy as well.

In this post, we share an end-to-end system for privately training ML models using secure multi-party computation. Implementing privacy-preserving machine learning in practical settings, with end-to-end privacy guarantees, involves two stages. First, private data preprocessing is necessary to establish reliable training data. For example, the information needed to compute features may be owned by one party while the information needed to compute labels is owned by another, necessitating a join on common identifiers to generate training examples, followed by privately computing features and labels. Second, the aligned features and labels are used to privately train differentially private ML models using MPC-ML. Below, we elaborate on the privacy mechanisms and protocols that help us perform each of these steps.

In practical settings, the data used to train an ML model may be split across multiple entities, as no single party owns all of it. In such situations, the first step is to privately join the data records and compute features or labels on the joined data. This is a fundamental private analytics problem, often referred to as “private join and compute,” and it has been studied extensively in academia and industry. Examples of cryptographic protocols for privately matching and aligning data records split across multiple entities include PS3I and Private-ID, developed at Meta, and a private set intersection protocol from Google. Specifically, we use the Private-ID protocol, which privately computes a full outer join and limits the leakage to only the total number of matched records while keeping the individual datasets private. Private-ID is very scalable, capable of matching over 100 million records in an hour.
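The cryptography behind these protocols is beyond the scope of this post, but the following toy Python sketch illustrates the classic Diffie-Hellman-style commutative encryption idea that many private matching protocols build on. The prime, identifiers, and helper names below are illustrative only, not Meta’s production code:

```python
import hashlib
import secrets

# Toy Diffie-Hellman-style private matching (textbook DDH-based private
# set intersection). Production protocols like PS3I and Private-ID add
# blinding, encrypted payloads, and a full outer join on top of this idea.

P = 2**521 - 1  # a known Mersenne prime; real systems use standardized
                # safe-prime groups or elliptic curves instead

def hash_to_group(identifier: str) -> int:
    """Map an identifier to an element of Z_P*."""
    digest = hashlib.sha256(identifier.encode()).digest()
    return int.from_bytes(digest, "big") % P

alice_ids = ["ann@example.com", "bob@example.com", "carol@example.com"]
bob_ids   = ["bob@example.com", "dave@example.com", "carol@example.com"]

a = secrets.randbelow(P - 2) + 1  # Alice's secret exponent
b = secrets.randbelow(P - 2) + 1  # Bob's secret exponent

# Each party "encrypts" its hashed identifiers with its own exponent.
alice_sends = [pow(hash_to_group(i), a, P) for i in alice_ids]  # H(id)^a
bob_sends   = [pow(hash_to_group(i), b, P) for i in bob_ids]    # H(id)^b

# Exponentiation commutes: (H^a)^b == (H^b)^a == H^(ab) mod P, so each
# party can apply its exponent to the other's already-encrypted values.
doubly_enc_alice = {pow(x, b, P) for x in alice_sends}  # Bob applies b
doubly_enc_bob   = {pow(x, a, P) for x in bob_sends}    # Alice applies a

# Matching doubly encrypted values reveals only which records overlap,
# never the raw identifiers.
print(len(doubly_enc_alice & doubly_enc_bob))  # -> 2
```

Because exponentiation commutes, both parties arrive at the same doubly encrypted value for a matching identifier, so they can compare those values without ever exchanging raw identifiers.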
Once data records are aligned, we can perform private computations to establish features and labels. For example, labels may depend on data from two or more parties, requiring private computations, such as comparisons, to generate them. Only obfuscated components (also known as secret shares, described below) of the labels resulting from such private computations are stored with the different parties and passed on to the private machine learning phase.

We use MPC-ML to privately train a model when the training data is split across multiple entities. An MPC protocol uses cryptography to mathematically ensure that user data is used only in a specific algorithm that has been agreed upon, without any sharing of raw data. MPC is a cryptographic technology with a long history of academic research, and recent advancements have made it feasible to deploy in big-data applications. It works by transforming data into random secret shares, which are split among the computing participants. It then calculates the desired result using these random-looking, incomplete shares. Only at the very end of the calculation are the final shares combined, revealing just the result. The raw inputs and intermediate calculation steps are never revealed and remain indistinguishable from random noise.

Figure: Alice (knows X) and Bob (knows Y) participate in an MPC protocol to privately compute X + Y.
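To make the figure concrete, here is a minimal Python sketch of two-party additive secret sharing. The modulus and function names are illustrative, not taken from any particular MPC library:

```python
import secrets

# Additive secret sharing over the integers mod a large prime Q.
Q = 2**61 - 1  # a Mersenne prime; any sufficiently large modulus works

def share(value):
    """Split `value` into two additive shares that look random on their own."""
    r = secrets.randbelow(Q)
    return r, (value - r) % Q

def reconstruct(share_a, share_b):
    """Combine both shares to recover the underlying value."""
    return (share_a + share_b) % Q

# Alice holds X, Bob holds Y.
X, Y = 42, 58

x_a, x_b = share(X)  # Alice keeps x_a, sends x_b to Bob
y_a, y_b = share(Y)  # Bob keeps y_b, sends y_a to Alice

# Each party adds the shares it holds locally; neither sees the other's input.
sum_a = (x_a + y_a) % Q  # computed by Alice
sum_b = (x_b + y_b) % Q  # computed by Bob

# Combining only the final shares reveals X + Y and nothing else.
assert reconstruct(sum_a, sum_b) == X + Y  # 100
```

Each share on its own is uniformly random and reveals nothing about the inputs; only combining the final shares exposes the sum.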
Training an MPC-ML model involves using MPC protocols to privatize each step of training an ML model, such as:

- computing the forward pass (predictions) over secret-shared features,
- computing the loss against secret-shared labels,
- computing gradients via a private backward pass, and
- updating the model parameters.

For the initial research, Meta leveraged CrypTen, an open source software framework built on PyTorch that aims to make modern MPC primitives accessible. We can implement all four steps described above in CrypTen, allowing rapid prototyping (a toy training step is sketched at the end of this section). Alternatively, long-term development will likely leverage Meta’s Private Computation Framework (PCF; see the white paper), an open source application development framework. PCF lets users specify their MPC application at a high level and automatically generates optimized, scalable MPC protocols.

Encrypting features, labels, and model updates (creating secret shares) renders them unintelligible, and all of the secret shares are needed to recover the private data. Having access to all but one secret share is not sufficient to recover private data, which highlights the privacy protection provided by MPC-ML. MPC-ML protocols also allow differentially private noise to be added to the training process (via algorithms like DP-SGD) to guarantee a specific level (ε) of differential privacy. This further bolsters the privacy afforded by MPC-ML, ensuring output privacy.
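Here is a rough, single-process sketch of such a training step using CrypTen’s Python API. The model, data, and learning rate are placeholders; in a real deployment, each party’s shares would live on its own servers:

```python
import torch
import crypten
import crypten.nn as cnn

crypten.init()

# Convert a plaintext PyTorch model to a CrypTen model and encrypt it,
# so its parameters become secret shares.
model = cnn.from_pytorch(torch.nn.Linear(10, 1), torch.empty(1, 10))
model.encrypt()
model.train()

criterion = cnn.MSELoss()

# Features and labels are encrypted into CrypTensors (secret shares).
x_enc = crypten.cryptensor(torch.randn(32, 10))
y_enc = crypten.cryptensor(torch.randn(32, 1))

# One training step: every intermediate value stays secret-shared.
output = model(x_enc)             # private forward pass
loss = criterion(output, y_enc)   # private loss computation
model.zero_grad()
loss.backward()                   # private backward pass
model.update_parameters(learning_rate=0.05)  # private weight update
```

Because the features, labels, parameters, and gradients are all CrypTensors, every intermediate value in the step above remains secret-shared.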
MPC-ML can be applied across many different industries and use cases. Below are a few examples.

Personalized ads help people discover products and services they want or will love, creating tremendous value for people and businesses alike. Secure MPC-ML can enable ad platforms to privately train ML models for personalized ad delivery while giving strong privacy protections to user data. Existing industry proposals, such as Google’s Aggregate Conversion API and Microsoft’s Masked Learning, Aggregation, and Reporting workflow (MaskedLARk), use MPC for private analytics and model training.

Several important research areas in health care require end-to-end private machine learning across institutionally siloed datasets. Recent work shows applications of private MPC-ML protocols to medical imaging, while MPC-based record linkage is used for risk stratification, among other applications. These applications involve processing medical records that are often regulated by strict privacy laws (e.g., HIPAA in the U.S. and GDPR in the E.U.), which necessitates privacy-enhancing technologies that can analyze and learn from the data while providing strong privacy guarantees. MPC is especially well suited here because of its strong privacy protection: data stays with its owners, is encrypted (secret-shared) before being exchanged with others, and computations are performed over the encrypted shares. MPC protocols can also operate in semi-trusted to low-trust settings.

The Alexandra Institute developed a private marketplace using MPC and tested it by conducting Danish sugar beet auctions. These auctions determine the market clearing price and contracts based on bids (price and quantity) provided by farmers. The bids are private data, as they may reveal a farmer’s economic position and productivity. MPC is a natural fit for building a secure auction system, as it protects the private information (bids) from all other participants while preserving the fidelity (correctness, since no random noise addition is involved) of the auction system.

We believe MPC-ML is a promising privacy-enhancing technology for privately training ML models. It provides:

- Strong privacy guarantees and composability: MPC-ML ensures that any information shared between parties is encrypted and unintelligible, while composing well with other privacy-enhancing technologies, such as randomized algorithms that provide differential privacy.
- Reduced barriers to adoption: Continuous improvements in new protocols, computing paradigms (such as delegation), and software automation will lower the computational and communication overheads, reducing the barrier to adoption.
- Low cost overhead: MPC-ML is a server-side solution, involving only enterprise entities for compute-intensive operations and incurring almost negligible overhead for users.

Aligning with our vision to provide strong privacy guarantees while developing novel and impactful AI products, we’re investing in several privacy-enhancing technologies. In this post, we presented our efforts around MPC-ML, but we’re simultaneously investing in other technologies, such as on-device (federated) learning, differential privacy, and trusted execution environments (TEEs), to improve user data privacy.